hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-03 12:23:08 +00:00


Teknium 404640a2b7 feat(goals): /goal checklist + /subgoal user controls (#23456)	2026-05-10 16:56:51 -07:00

* feat(goals): /goal checklist + /subgoal user controls

Two-phase judge for /goal: Phase A decomposes the goal into a detailed checklist on first turn; Phase B evaluates each pending item harshly against the agent's most recent response. The goal completes only when every item is in a terminal status (completed or impossible). Adds /subgoal so the user can append, complete, mark impossible, undo, remove, or clear items the judge missed or got wrong.

Mechanics:
- GoalState gains `checklist` and `decomposed` fields, both backwards compatible (old state_meta rows load unchanged).
- Phase A: aux call writes a harsh, exhaustive checklist; biased toward more items not fewer. Falls through to legacy freeform judge when decompose fails.
- Phase B: judge gets the checklist + last-response snippet + path to a per-session conversation dump at <HERMES_HOME>/goals/<sid>.json. A bounded read_file tool (max 5 calls per turn, restricted to that one file) lets the judge inspect history when the snippet is ambiguous. Stickiness in code: terminal items are frozen, only the user can revert via /subgoal undo.
- Continuation prompt shows checklist progress when non-empty; reverts to old prompt when empty.
- Status line shows M/N done counts.

CLI + gateway + TUI gateway all pass the agent reference into evaluate_after_turn so the dump can be written. Gateway-side /subgoal is allowed mid-run since it only modifies the checklist the judge consults at turn boundaries.

Tests: 24 new cases covering backcompat round-trip, Phase A decompose, Phase B updates + new_items + stickiness, user override flows, conversation dump (incl. unsafe-sid sanitization), and judge read_file restriction. Existing freeform-mode tests updated to patch the renamed `judge_goal_freeform` and skip Phase A explicitly.

* fix(goals): off-by-one in judge index, message-list plumbing, prompt tuning

Three live-test findings from running /goal end-to-end against gemini-3-flash-preview as the judge:

1. Off-by-one bug: the judge sees the checklist rendered with 1-based indices ('1. [ ] foo, 2. [ ] bar') but the apply layer indexed state.checklist as 0-based. Result: every judge update landed on the wrong item, evidence got attached to neighbouring rows, and the genuine 'first pending' item (usually #1) never got marked. Fix: convert 1 → 0 in _parse_evaluate_response. Also tightened the user prompt to call out the 1-based scheme explicitly. New tests cover the parser conversion + an end-to-end fake-judge round-trip.
2. Conversation dump never happened: _extract_agent_messages tried common AIAgent attribute names (.messages, .conversation_history, etc.) but AIAgent doesn't expose the message list as an instance attribute; it lives inside run_conversation()'s scope. Result: the judge's read_file tool always saw history_path=unavailable. Fix: added an explicit messages= kwarg to evaluate_after_turn that all three call sites (CLI, gateway, TUI gateway) now pass directly. Agent-attribute extraction kept as back-compat fallback.
3. Prompt was too harsh on simple goals. The original 'be HARSH, default to leaving items pending' wording made the judge refuse to mark 'file exists' completed even after the agent ran ls, test -f, os.path.isfile, and find, burning the entire 8-turn budget on a fizzbuzz task. Softened to 'strict but not absurd' with explicit guidance on what counts as evidence and a directive not to require re-proving items already established earlier.

Re-tested live with the same fizzbuzz goal: now terminates in 2 turns with all 8 checklist items correctly attributed to their own evidence. /subgoal user-action flow (add / complete / undo / impossible) verified live as well.
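The index conversion and stickiness rules described in the commit message above can be illustrated with a minimal sketch. This is not the repository's code: `ChecklistItem`, `apply_judge_updates`, `goal_complete`, and the update dictionary shape are hypothetical names chosen for illustration, and treating an empty checklist as incomplete is an assumption. Only the behaviour (1-based judge indices mapped to 0-based list positions, terminal items frozen, completion when every item is terminal) comes from the commit text.

```python
from dataclasses import dataclass

# Terminal statuses named in the commit message.
TERMINAL_STATUSES = {"completed", "impossible"}


@dataclass
class ChecklistItem:
    """Hypothetical checklist row; the real GoalState layout may differ."""
    text: str
    status: str = "pending"
    evidence: str = ""


def apply_judge_updates(checklist, updates):
    """Apply judge updates whose 'index' is 1-based, as rendered to the judge."""
    for update in updates:
        position = update["index"] - 1  # 1-based judge index -> 0-based list index
        if not 0 <= position < len(checklist):
            continue  # ignore indices that do not name an existing item
        item = checklist[position]
        if item.status in TERMINAL_STATUSES:
            continue  # sticky: terminal items are frozen for the judge
        item.status = update["status"]
        item.evidence = update.get("evidence", "")
    return checklist


def goal_complete(checklist):
    """True only when every item is terminal (empty checklist: assumed incomplete)."""
    return bool(checklist) and all(
        item.status in TERMINAL_STATUSES for item in checklist
    )
```

With two pending items, an update for index 1 marks the first row rather than its neighbour, and a later attempt by the judge to move that row back to pending is ignored; in the commit's design only a user action such as /subgoal undo would revert it.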
..
__init__.py	refactor(tests): re-architect tests + fix CI failures (#5946)	2026-04-07 17:19:07 -07:00
test_branch_command.py	feat(memory): notify providers on mid-process session_id rotation (#17409)	2026-04-29 04:57:22 -07:00
test_busy_input_mode_command.py	feat(busy): add 'steer' as a third display.busy_input_mode option (#16279)	2026-04-26 18:21:29 -07:00
test_cli_approval_ui.py	feat(openrouter): wire Pareto Code router with min_coding_score knob (#22838)	2026-05-09 14:47:00 -07:00
test_cli_background_tui_refresh.py	refactor(tests): re-architect tests + fix CI failures (#5946)	2026-04-07 17:19:07 -07:00
test_cli_bracketed_paste_sanitizer.py	fix(cli): strip leaked bracketed-paste wrappers	2026-04-26 21:47:40 -07:00
test_cli_browser_connect.py	fix(browser): address Copilot review on /browser connect	2026-04-28 22:11:10 -07:00
test_cli_context_warning.py	refactor(tests): re-architect tests + fix CI failures (#5946)	2026-04-07 17:19:07 -07:00
test_cli_copy_command.py	feat: add /copy and /agents	2026-04-09 17:19:36 -05:00
test_cli_extension_hooks.py	refactor(tests): re-architect tests + fix CI failures (#5946)	2026-04-07 17:19:07 -07:00
test_cli_external_editor.py	feat(cli): add editor workflow for drafts	2026-04-20 02:53:40 -07:00
test_cli_file_drop.py	fix(cli): catch OSError in _resolve_attachment_path to prevent ENAMETOOLONG dropping long slash commands	2026-05-06 06:34:48 -07:00
test_cli_force_redraw.py	fix(cli): recover classic CLI output after resize	2026-05-06 04:20:54 -07:00
test_cli_goal_interrupt.py	feat(goals): /goal checklist + /subgoal user controls (#23456)	2026-05-10 16:56:51 -07:00
test_cli_image_command.py	fix(termux): harden execute_code and mobile browser/audio UX	2026-04-09 16:24:53 -07:00
test_cli_init.py	fix(cli): make Ctrl+Enter insert newline on WSL/SSH/Windows Terminal (#22777)	2026-05-09 12:48:14 -07:00
test_cli_interrupt_subagent.py	fix: resolve CI test failures — add missing functions, fix stale tests (#9483)	2026-04-14 01:43:45 -07:00
test_cli_loading_indicator.py	fix: clean up defensive shims and finish CI stabilization from #17660 (#17801)	2026-04-29 23:53:17 -07:00
test_cli_markdown_rendering.py	fix(cli): preserve Windows hidden-dir paths in markdown	2026-05-04 05:04:36 -07:00
test_cli_mcp_config_watch.py	refactor(tests): re-architect tests + fix CI failures (#5946)	2026-04-07 17:19:07 -07:00
test_cli_new_session.py	feat: confirm prompt for destructive slash commands (#4069) (#22687)	2026-05-09 11:04:46 -07:00
test_cli_prefix_matching.py	refactor(tests): re-architect tests + fix CI failures (#5946)	2026-04-07 17:19:07 -07:00
test_cli_preloaded_skills.py	refactor(tests): re-architect tests + fix CI failures (#5946)	2026-04-07 17:19:07 -07:00
test_cli_provider_resolution.py	refactor: remove smart_model_routing feature (#12732)	2026-04-19 18:12:55 -07:00
test_cli_reload_skills.py	refactor(reload-skills): queue note for next turn, drop cache invalidation + agent tool	2026-04-29 21:07:47 -07:00
test_cli_retry.py	refactor(tests): re-architect tests + fix CI failures (#5946)	2026-04-07 17:19:07 -07:00
test_cli_save_config_value.py	fix(cli): preserve config comments on setting writes	2026-05-09 17:55:12 -07:00
test_cli_secret_capture.py	refactor(tests): re-architect tests + fix CI failures (#5946)	2026-04-07 17:19:07 -07:00
test_cli_shift_enter_newline.py	feat(cli): recognise Shift+Enter as a newline key	2026-05-08 16:26:51 -07:00
test_cli_shutdown_memory_messages.py	fix(cli): pass session messages to shutdown_memory_provider (#15165 sibling)	2026-04-27 06:41:16 -07:00
test_cli_skin_integration.py	fix(tui): restore macOS copy behavior and theme polish (#17131)	2026-04-28 18:47:14 -05:00
test_cli_status_bar.py	refactor: replace 'cmp' text with 🗜️ emoji in status bar	2026-05-07 05:27:45 -07:00
test_cli_status_command.py	fix(profile): use existing get_active_profile_name() for /profile command	2026-04-15 17:52:03 -07:00
test_cli_steer_busy_path.py	fix(cli): dispatch /steer inline while agent is running (#13354)	2026-04-20 23:05:38 -07:00
test_cli_terminal_response_sanitizer.py	fix(cli): tighten mouse leak sanitizer	2026-04-29 22:10:18 -05:00
test_cli_tools_command.py	refactor(tests): re-architect tests + fix CI failures (#5946)	2026-04-07 17:19:07 -07:00
test_cli_user_message_preview.py	feat(cli): improve multiline previews	2026-04-20 02:53:40 -07:00
test_compress_focus.py	feat: /compress <focus> — guided compression with focus topic (#8017)	2026-04-11 19:23:29 -07:00
test_cprint_bg_thread.py	fix(cli): recover classic CLI output after resize	2026-05-06 04:20:54 -07:00
test_ctrl_enter_newline.py	fix(cli): make Ctrl+Enter insert newline on WSL/SSH/Windows Terminal (#22777)	2026-05-09 12:48:14 -07:00
test_cwd_env_respect.py	fix(cli): local backend CLI always uses launch directory, stops .env sync of TERMINAL_CWD (#19334)	2026-05-04 11:36:19 +05:30
test_destructive_slash_confirm.py	feat: confirm prompt for destructive slash commands (#4069) (#22687)	2026-05-09 11:04:46 -07:00
test_fast_command.py	fix(anthropic): restrict fast mode to Opus 4.6 (Anthropic API contract)	2026-05-04 06:23:52 -07:00
test_gquota_command.py	fix(cli): sanitize interactive command output	2026-04-19 01:16:34 -07:00
test_manual_compress.py	fix(cli): persist manual compress handoff	2026-05-05 04:42:48 -07:00
test_personality_none.py	fix(gateway): use profile-aware Hermes paths in runtime hints	2026-04-15 17:52:03 -07:00
test_prompt_text_input_thread_safety.py	fix(cli): drive _prompt_text_input directly when off main thread (#23454 )	2026-05-10 16:16:10 -07:00
test_quick_commands.py	fix(tests): resolve 17 persistent CI test failures (#15084)	2026-04-24 03:46:46 -07:00
test_reasoning_command.py	test: pin per-turn reasoning extraction semantics	2026-05-05 05:00:05 -07:00
test_resume_display.py	fix(cli): recover classic CLI output after resize	2026-05-06 04:20:54 -07:00
test_save_conversation_location.py	fix(sessions): /save lands under $HERMES_HOME, widen browse+TUI picker, force-refresh ollama-cloud on setup (#16296)	2026-04-26 18:49:48 -07:00
test_session_boundary_hooks.py	fix: add gateway coverage for session boundary hooks, move test to tests/cli/	2026-04-08 04:27:34 -07:00
test_stream_delta_think_tag.py	fix(streaming): prevent <think> in prose from suppressing response output	2026-04-09 22:16:36 -07:00
test_surrogate_sanitization.py	fix(surrogates): sanitize reasoning/reasoning_content/reasoning_details fields (#11628)	2026-04-17 13:30:47 -07:00
test_tool_progress_scrollback.py	fix(cli): restore stacked tool progress scrollback in TUI (#8201)	2026-04-11 23:22:34 -07:00
test_worktree.py	fix: aggressive worktree and branch cleanup to prevent accumulation (#6134)	2026-04-08 04:44:49 -07:00
test_worktree_security.py	refactor(tests): re-architect tests + fix CI failures (#5946)	2026-04-07 17:19:07 -07:00