hermes-agent/hermes_cli
Teknium 404640a2b7
feat(goals): /goal checklist + /subgoal user controls (#23456)
* feat(goals): /goal checklist + /subgoal user controls

Two-phase judge for /goal — Phase A decomposes the goal into a detailed
checklist on first turn; Phase B evaluates each pending item harshly
against the agent's most recent response. The goal completes only when
every item is in a terminal status (completed or impossible). Adds
/subgoal so the user can append, complete, mark impossible, undo,
remove, or clear items the judge missed or got wrong.
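
The completion rule above can be sketched as follows. This is a minimal illustration, not the code in goals.py: the dict-based item shape, the `is_goal_complete` name, and the choice to treat an empty checklist as incomplete are assumptions made for the example.

```python
# Terminal statuses named in the commit message.
TERMINAL_STATUSES = {"completed", "impossible"}


def is_goal_complete(checklist):
    """Return True only when every checklist item is in a terminal status.

    `checklist` is assumed to be a list of dicts with a "status" key
    (hypothetical shape). An empty checklist is treated as not complete,
    which is an assumption of this sketch.
    """
    return bool(checklist) and all(
        item["status"] in TERMINAL_STATUSES for item in checklist
    )


checklist = [
    {"text": "write fizzbuzz.py", "status": "completed"},
    {"text": "run on Python 2", "status": "impossible"},
    {"text": "add tests", "status": "pending"},
]
print(is_goal_complete(checklist))  # one item is still pending
```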

Mechanics:
- GoalState gains `checklist` and `decomposed` fields, both backwards
  compatible (old state_meta rows load unchanged).
- Phase A: aux call writes a harsh, exhaustive checklist, biased toward
  more items rather than fewer. Falls back to the legacy freeform judge
  when decomposition fails.
- Phase B: judge gets the checklist + last-response snippet + path to
  a per-session conversation dump at <HERMES_HOME>/goals/<sid>.json.
  A bounded read_file tool (max 5 calls per turn, restricted to that
  one file) lets the judge inspect history when the snippet is
  ambiguous. Stickiness is enforced in code: terminal items are
  frozen, and only the user can revert them via /subgoal undo.
- Continuation prompt shows checklist progress when non-empty;
  reverts to old prompt when empty.
- Status line shows M/N done counts.
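
One way to bound the judge's read_file tool as described in the Phase B bullet is sketched below. The `BoundedFileReader` class, its error strings, and the decision to count rejected calls against the budget are assumptions for illustration; only the limit of 5 calls per turn and the single-file restriction come from the commit message.

```python
from pathlib import Path

MAX_READS_PER_TURN = 5  # limit stated in the commit message


class BoundedFileReader:
    """Hypothetical helper: reads exactly one file, a limited number of times."""

    def __init__(self, allowed_path, max_calls=MAX_READS_PER_TURN):
        # Resolve once so "../" tricks and symlinks compare against a canonical path.
        self._allowed = Path(allowed_path).resolve()
        self._max_calls = max_calls
        self._calls = 0

    def read_file(self, path):
        if self._calls >= self._max_calls:
            return "error: read_file call budget exhausted for this turn"
        # Assumption: rejected calls also consume budget.
        self._calls += 1
        if Path(path).resolve() != self._allowed:
            return "error: read_file is restricted to the conversation dump"
        try:
            return self._allowed.read_text(encoding="utf-8")
        except OSError as exc:
            return f"error: {exc}"
```

A fresh reader would be created per judge turn so the call counter resets at each turn boundary.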

CLI + gateway + TUI gateway all pass the agent reference into
evaluate_after_turn so the dump can be written. Gateway-side
/subgoal is allowed mid-run since it only modifies the checklist
the judge consults at turn boundaries.

Tests: 24 new cases — backcompat round-trip, Phase A decompose,
Phase B updates + new_items + stickiness, user override flows,
conversation dump (incl. unsafe-sid sanitization), judge read_file
restriction. Existing freeform-mode tests updated to patch the
renamed `judge_goal_freeform` and skip Phase A explicitly.
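
For the unsafe-sid case, a sanitizer along these lines would keep the dump inside <HERMES_HOME>/goals/. The function names, the allowed character set, and the "unknown" fallback are assumptions of this sketch; the commit only states that sanitization exists and is tested.

```python
import json
import re
from pathlib import Path

# Assumption: anything outside this set is replaced, including "." and "/".
_UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9_-]")


def sanitize_sid(sid):
    """Map a session id to a string that is safe as a single file name."""
    cleaned = _UNSAFE_CHARS.sub("_", sid)
    return cleaned or "unknown"  # hypothetical fallback for an empty id


def write_dump(hermes_home, sid, messages):
    """Write the conversation to <hermes_home>/goals/<sanitized sid>.json."""
    path = Path(hermes_home) / "goals" / f"{sanitize_sid(sid)}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(messages, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return path
```

Because path separators and dots are both replaced, a session id such as "../../etc/passwd" cannot escape the goals directory.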

* fix(goals): off-by-one in judge index, message-list plumbing, prompt tuning

Three live-test findings from running /goal end-to-end against
gemini-3-flash-preview as the judge:

1. Off-by-one bug — the judge sees the checklist rendered with 1-based
   indices ('1. [ ] foo, 2. [ ] bar') but the apply layer indexed
   state.checklist as 0-based. Result: every judge update landed on
   the wrong item, evidence got attached to neighbouring rows, and
   the genuine 'first pending' item (usually #1) never got marked.
   Fix: convert the judge's 1-based indices to 0-based in
   _parse_evaluate_response. Also tightened the
   user prompt to call out the 1-based scheme explicitly. New tests
   cover the parser conversion + an end-to-end fake-judge round-trip.

2. Conversation dump never happened — _extract_agent_messages tried
   common AIAgent attribute names (.messages, .conversation_history,
   etc.) but AIAgent doesn't expose the message list as an instance
   attribute; it lives inside run_conversation()'s scope. Result: the
   judge's read_file tool always saw history_path=unavailable. Fix:
   added an explicit messages= kwarg to evaluate_after_turn that all
   three call sites (CLI, gateway, TUI gateway) now pass directly.
   Agent-attribute extraction kept as back-compat fallback.

3. Prompt was too harsh on simple goals. The original 'be HARSH,
   default to leaving items pending' wording made the judge refuse
   to mark 'file exists' completed even after the agent ran ls,
   test -f, os.path.isfile, and find — burning the entire 8-turn
   budget on a fizzbuzz task. Softened to 'strict but not absurd'
   with explicit guidance on what counts as evidence and a directive
   not to require re-proving items already established earlier.
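
The index mismatch in finding 1 can be illustrated with a small sketch. The JSON response shape ("updates" entries with "index", "status", "evidence") and both function names are assumptions made for the example; the real _parse_evaluate_response may differ.

```python
import json


def render_checklist(items):
    """Render items the way the judge sees them: 1-based, one per line."""
    return "\n".join(f"{n}. [ ] {text}" for n, text in enumerate(items, start=1))


def parse_judge_updates(raw, checklist_len):
    """Hypothetical parser returning (zero_based_index, status, evidence) tuples.

    The judge reports the 1-based numbers it was shown, so each index is
    shifted down by one and anything out of range is dropped.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    entries = payload.get("updates") if isinstance(payload, dict) else None
    if not isinstance(entries, list):
        return []
    updates = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        shown = entry.get("index")
        if isinstance(shown, bool) or not isinstance(shown, int):
            continue
        idx = shown - 1  # 1-based (judge view) -> 0-based (list index)
        if 0 <= idx < checklist_len:
            updates.append((idx, entry.get("status", ""), entry.get("evidence", "")))
    return updates
```

Rejecting index 0 and anything beyond the checklist length means a judge that miscounts cannot silently update a neighbouring row.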

Re-tested live with the same fizzbuzz goal: now terminates in 2
turns with all 8 checklist items correctly attributed to their
own evidence. /subgoal user-action flow (add / complete / undo /
impossible) verified live as well.
2026-05-10 16:56:51 -07:00
__init__.py chore: release v0.13.0 (2026.5.7) (#21406) 2026-05-07 09:22:48 -07:00
_parser.py fix: add dashboard to CLI help epilogue and Docker CI smoke test 2026-05-07 06:16:23 -07:00
_subprocess_compat.py feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags 2026-05-08 14:27:40 -07:00
auth.py feat(cross-platform): psutil for PID/process management + Windows footgun checker 2026-05-08 14:27:40 -07:00
auth_commands.py auth: use get_default_hermes_root() for shared nous_auth.json path 2026-05-08 14:27:40 -07:00
azure_detect.py chore: remove unused imports and dead locals (ruff F401, F841) (#17010) 2026-04-28 06:46:45 -07:00
backup.py codebase: add encoding='utf-8' to all bare open() calls (PLW1514) 2026-05-08 14:27:40 -07:00
banner.py fix(banner): resolve update-check repo from running code, not profile-scoped path 2026-05-09 04:10:35 -07:00
browser_connect.py fix(browser): address Copilot review on /browser connect 2026-04-28 22:11:10 -07:00
callbacks.py fix: ESC cancels secret/sudo prompts, clearer skip messaging (#9902) 2026-04-14 16:11:37 -07:00
checkpoints.py feat(checkpoints): v2 single-store rewrite with real pruning + disk guardrails (#20709) 2026-05-06 05:44:35 -07:00
claw.py Merge origin/main and resolve conflict in nix/tui.nix 2026-05-07 22:56:19 +00:00
cli_output.py refactor: remove dead code — 1,784 lines across 77 files (#9180) 2026-04-13 16:32:04 -07:00
clipboard.py feat: fix img pasting in new ink plus newline after tools 2026-04-11 13:14:32 -05:00
codex_models.py docs(codex-spark): document ChatGPT Pro entitlement gating 2026-05-09 23:17:25 -07:00
colors.py
commands.py feat(goals): /goal checklist + /subgoal user controls (#23456) 2026-05-10 16:56:51 -07:00
completion.py fix(completion): use valid zsh _arguments exclusion-group syntax 2026-05-09 13:36:44 -07:00
config.py fix(terminal): bridge docker_env config to TERMINAL_DOCKER_ENV 2026-05-09 17:53:35 -07:00
copilot_auth.py fix(oauth,gateway): monotonic deadlines for polling/timeout loops 2026-05-07 05:09:39 -07:00
cron.py feat(cron): add no_agent mode for script-only cron jobs (watchdog pattern) (#19709) 2026-05-04 12:31:01 -07:00
curator.py feat(curator): show rename map in user-visible summary (#22910) 2026-05-09 18:43:40 -07:00
curses_ui.py fix: treat ctrl-c as curses cancel 2026-05-04 01:36:44 -07:00
debug.py fix(debug): redact log content at upload time in hermes debug share 2026-05-03 11:42:20 -07:00
default_soul.py
dingtalk_auth.py chore: remove unused imports and dead locals (ruff F401, F841) (#17010) 2026-04-28 06:46:45 -07:00
doctor.py fix(doctor): normalize provider name and aliases before dedicated-skip check 2026-05-09 13:36:33 -07:00
dump.py refactor(env): use shared Hermes dotenv loader 2026-05-05 10:13:13 -07:00
env_loader.py feat(cross-platform): psutil for PID/process management + Windows footgun checker 2026-05-08 14:27:40 -07:00
fallback_cmd.py feat(cli): add 'hermes fallback' command to manage fallback providers (#16052) 2026-04-26 06:19:04 -07:00
gateway.py fix(gateway): detect gateway process via /proc in Docker without procps 2026-05-09 17:54:17 -07:00
gateway_windows.py fix(gateway): preserve Ctrl+C for Windows foreground runs 2026-05-09 14:34:18 -07:00
goals.py feat(goals): /goal checklist + /subgoal user controls (#23456) 2026-05-10 16:56:51 -07:00
hooks.py codebase: add encoding='utf-8' to all bare open() calls (PLW1514) 2026-05-08 14:27:40 -07:00
kanban.py fix(kanban): /kanban slash command emits argparse garbage instead of help 2026-05-09 22:49:29 -07:00
kanban_db.py fix(kanban): extend stale claim instead of killing live worker 2026-05-10 15:23:04 -07:00
kanban_diagnostics.py fix(kanban): unify failure counter across spawn/timeout/crash outcomes (#20410) 2026-05-05 13:55:37 -07:00
kanban_specify.py feat(kanban): add specify — auxiliary LLM fleshes out triage tasks (#21435) 2026-05-07 13:04:41 -07:00
logs.py feat: component-separated logging with session context and filtering (#7991) 2026-04-11 17:23:36 -07:00
main.py fix(windows): unbreak install + update on Windows (#23394) 2026-05-10 13:07:08 -07:00
mcp_config.py feat(mcp): add codex preset for built-in MCP server discovery 2026-05-09 11:11:28 -07:00
memory_setup.py codebase: add encoding='utf-8' to all bare open() calls (PLW1514) 2026-05-08 14:27:40 -07:00
model_catalog.py codebase: add encoding='utf-8' to all bare open() calls (PLW1514) 2026-05-08 14:27:40 -07:00
model_normalize.py fix(opencode-go): keep users on opencode-go instead of hijacking to native providers (#20802) 2026-05-06 09:08:33 -07:00
model_switch.py docs(codex-spark): document ChatGPT Pro entitlement gating 2026-05-09 23:17:25 -07:00
models.py fix(xai): drop models being retired May 15, 2026 from pickers (#23291) 2026-05-10 12:12:55 -07:00
nous_subscription.py feat(web): add SearXNG as a native search-only backend 2026-05-06 10:05:29 -07:00
oneshot.py fix: make session search initialize session db 2026-05-09 14:36:58 -07:00
pairing.py fix(pairing): enforce lockout on approve_code, not just generate_code (#10195) (#21325) 2026-05-07 07:18:21 -07:00
platforms.py feat: complete plugin platform parity — all 12 integration points 2026-04-29 21:56:51 -07:00
plugins.py feat(plugins): run any LLM call from inside a plugin via ctx.llm (#23194) 2026-05-10 07:09:28 -07:00
plugins_cmd.py fix(plugins): resolve Git binary for installs under minimal PATH 2026-05-09 11:10:04 -07:00
profile_distribution.py feat(profile): shareable profile distributions via git (#20831) 2026-05-08 10:04:32 -07:00
profiles.py fix(profiles): exclude infrastructure artifacts when cloning with --clone-all 2026-05-09 04:10:35 -07:00
providers.py fix: prevent bare 'custom' slug in model.provider (#17478) 2026-04-30 04:32:11 -07:00
pt_input_extras.py fix(cli): make Ctrl+Enter insert newline on WSL/SSH/Windows Terminal (#22777) 2026-05-09 12:48:14 -07:00
pty_bridge.py feat(cross-platform): psutil for PID/process management + Windows footgun checker 2026-05-08 14:27:40 -07:00
relaunch.py fix(windows): prefer npm.cmd over npm.ps1, skip .py argv0 in relaunch 2026-05-08 14:27:40 -07:00
runtime_provider.py fix: use credential_pool for custom endpoint model listing probes 2026-05-09 17:54:58 -07:00
setup.py fix(xai): drop models being retired May 15, 2026 from pickers (#23291) 2026-05-10 12:12:55 -07:00
skills_config.py refactor(config): migrate remaining 33 cfg_get call sites (#17311) 2026-04-29 04:03:03 -07:00
skills_hub.py codebase: add encoding='utf-8' to all bare open() calls (PLW1514) 2026-05-08 14:27:40 -07:00
skin_engine.py fix(tui): honor skin highlight colors (#20895) 2026-05-06 14:01:56 -07:00
slack_cli.py fix(slack): enable writable app home DMs in manifest 2026-05-08 17:01:12 -07:00
status.py fix(status): add missing popular provider API keys to hermes status display 2026-05-04 05:14:13 -07:00
stdio.py fix(windows): quote cache paths in bash + augment PATH so rg/bash resolve on first launch 2026-05-08 14:27:40 -07:00
timeouts.py refactor(timeouts): drop redundant ImportError in except clause 2026-04-26 20:48:20 -07:00
tips.py feat: Ctrl+Enter inserts newline on Windows Terminal 2026-05-08 14:27:40 -07:00
tools_config.py fix(windows): unbreak install + update on Windows (#23394) 2026-05-10 13:07:08 -07:00
uninstall.py feat(windows uninstall): clean up User env, PATH, Scheduled Task, and portable tooling 2026-05-08 14:27:40 -07:00
vercel_auth.py feat: add Vercel Sandbox backend 2026-04-29 07:22:33 -07:00
voice.py fix(tui): restore voice push-to-talk parity (#20897) 2026-05-06 15:49:59 -07:00
web_server.py fix(security): require dashboard auth for plugin API routes 2026-05-10 07:04:18 -07:00
webhook.py refactor(config): migrate remaining 33 cfg_get call sites (#17311) 2026-04-29 04:03:03 -07:00