hermes-agent

mirrors/hermes-agent

Fork 0

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-18 04:41:56 +00:00

Commit graph

Author SHA1 Message Date

Author	SHA1	Message	Date
Teknium	3e7145e0bb	revert: roll back /goal checklist + /subgoal feature stack (#23813 ) * Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (#23547)" This reverts commit `a63a2b7c78`. * Revert "fix(goals): forward standing /goal state on auto-compression session rotation (#23530)" This reverts commit `4a080b1d5a`. * Revert "feat(goals): /goal checklist + /subgoal user controls (#23456)" This reverts commit `404640a2b7`.	2026-05-11 07:06:27 -07:00
Teknium	404640a2b7	feat(goals): /goal checklist + /subgoal user controls (#23456 ) * feat(goals): /goal checklist + /subgoal user controls Two-phase judge for /goal — Phase A decomposes the goal into a detailed checklist on first turn; Phase B evaluates each pending item harshly against the agent's most recent response. The goal completes only when every item is in a terminal status (completed or impossible). Adds /subgoal so the user can append, complete, mark impossible, undo, remove, or clear items the judge missed or got wrong. Mechanics: - GoalState gains `checklist` and `decomposed` fields, both backwards compatible (old state_meta rows load unchanged). - Phase A: aux call writes a harsh, exhaustive checklist; biased toward more items not fewer. Falls through to legacy freeform judge when decompose fails. - Phase B: judge gets the checklist + last-response snippet + path to a per-session conversation dump at <HERMES_HOME>/goals/<sid>.json. A bounded read_file tool (max 5 calls per turn, restricted to that one file) lets the judge inspect history when the snippet is ambiguous. Stickiness in code: terminal items are frozen, only the user can revert via /subgoal undo. - Continuation prompt shows checklist progress when non-empty; reverts to old prompt when empty. - Status line shows M/N done counts. CLI + gateway + TUI gateway all pass the agent reference into evaluate_after_turn so the dump can be written. Gateway-side /subgoal is allowed mid-run since it only modifies the checklist the judge consults at turn boundaries. Tests: 24 new cases — backcompat round-trip, Phase A decompose, Phase B updates + new_items + stickiness, user override flows, conversation dump (incl. unsafe-sid sanitization), judge read_file restriction. Existing freeform-mode tests updated to patch the renamed `judge_goal_freeform` and skip Phase A explicitly. * fix(goals): off-by-one in judge index, message-list plumbing, prompt tuning Three live-test findings from running /goal end-to-end against gemini-3-flash-preview as the judge: 1. Off-by-one bug — the judge sees the checklist rendered with 1-based indices ('1. [ ] foo, 2. [ ] bar') but the apply layer indexed state.checklist as 0-based. Result: every judge update landed on the wrong item, evidence got attached to neighbouring rows, and the genuine 'first pending' item (usually #1) never got marked. Fix: convert 1 → 0 in _parse_evaluate_response. Also tightened the user prompt to call out the 1-based scheme explicitly. New tests cover the parser conversion + an end-to-end fake-judge round-trip. 2. Conversation dump never happened — _extract_agent_messages tried common AIAgent attribute names (.messages, .conversation_history, etc.) but AIAgent doesn't expose the message list as an instance attribute; it lives inside run_conversation()'s scope. Result: the judge's read_file tool always saw history_path=unavailable. Fix: added an explicit messages= kwarg to evaluate_after_turn that all three call sites (CLI, gateway, TUI gateway) now pass directly. Agent-attribute extraction kept as back-compat fallback. 3. Prompt was too harsh on simple goals. The original 'be HARSH, default to leaving items pending' wording made the judge refuse to mark 'file exists' completed even after the agent ran ls, test -f, os.path.isfile, and find — burning the entire 8-turn budget on a fizzbuzz task. Softened to 'strict but not absurd' with explicit guidance on what counts as evidence and a directive not to require re-proving items already established earlier. Re-tested live with the same fizzbuzz goal: now terminates in 2 turns with all 8 checklist items correctly attributed to their own evidence. /subgoal user-action flow (add / complete / undo / impossible) verified live as well.	2026-05-10 16:56:51 -07:00
Teknium	674fad1483	fix(goals): Ctrl+C during /goal loop auto-pauses the goal (#21888 ) Reported: Ctrl+C during an active /goal loop felt like it did nothing — the agent would interrupt the current turn, then immediately queue another continuation and keep going until the session ended or the 20-turn budget ran out. Root cause: cli.py's _maybe_continue_goal_after_turn() ran in the finally: block around self.chat(...) unconditionally. Whether the turn completed normally, got interrupted, or returned an empty string, the judge ran on whatever was in conversation_history and — because the judge is fail-open — a "continue" verdict pushed another CONTINUATION_PROMPT onto _pending_input. Ctrl+C was invisible to the hook. Fix: - chat() now captures result['interrupted'] onto self._last_turn_interrupted (resets to False at entry so early-returns don't leak prior state). - _maybe_continue_goal_after_turn() checks the flag first: on interrupt, auto-pause via mgr.pause(reason='user-interrupted (Ctrl+C)') and print a one-liner pointing the user at /goal resume or /goal clear. No judge call, no continuation enqueued. - Also added an empty-response guard that mirrors gateway/run.py's _handle_message logic (empty reply → transient failure → skip judging so we don't trip the consecutive-parse-failures backstop unnecessarily). The goal stays in the DB as paused, so /goal resume recovers it after the user has sorted out whatever made them cancel. /goal clear still works as before for a full stop. Tests: tests/cli/test_cli_goal_interrupt.py covers: - interrupted turn pauses + doesn't queue + judge is NOT called - paused goal is resumable - empty / whitespace / missing assistant reply skips judging - healthy turn still enqueues continuation / marks done - chat() resets _last_turn_interrupted at entry (anti-leak guard) All 55 existing goal tests still pass.	2026-05-08 06:53:13 -07:00

Teknium

3e7145e0bb

revert: roll back /goal checklist + /subgoal feature stack (#23813 )

* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (#23547)"

This reverts commit a63a2b7c78.

* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (#23530)"

This reverts commit 4a080b1d5a.

* Revert "feat(goals): /goal checklist + /subgoal user controls (#23456)"

This reverts commit 404640a2b7.

2026-05-11 07:06:27 -07:00

Teknium

404640a2b7

feat(goals): /goal checklist + /subgoal user controls (#23456 )

* feat(goals): /goal checklist + /subgoal user controls

Two-phase judge for /goal — Phase A decomposes the goal into a detailed
checklist on first turn; Phase B evaluates each pending item harshly
against the agent's most recent response. The goal completes only when
every item is in a terminal status (completed or impossible). Adds
/subgoal so the user can append, complete, mark impossible, undo,
remove, or clear items the judge missed or got wrong.

Mechanics:
- GoalState gains `checklist` and `decomposed` fields, both backwards
  compatible (old state_meta rows load unchanged).
- Phase A: aux call writes a harsh, exhaustive checklist; biased toward
  more items not fewer. Falls through to legacy freeform judge when
  decompose fails.
- Phase B: judge gets the checklist + last-response snippet + path to
  a per-session conversation dump at <HERMES_HOME>/goals/<sid>.json.
  A bounded read_file tool (max 5 calls per turn, restricted to that
  one file) lets the judge inspect history when the snippet is
  ambiguous. Stickiness in code: terminal items are frozen, only the
  user can revert via /subgoal undo.
- Continuation prompt shows checklist progress when non-empty;
  reverts to old prompt when empty.
- Status line shows M/N done counts.

CLI + gateway + TUI gateway all pass the agent reference into
evaluate_after_turn so the dump can be written. Gateway-side
/subgoal is allowed mid-run since it only modifies the checklist
the judge consults at turn boundaries.

Tests: 24 new cases — backcompat round-trip, Phase A decompose,
Phase B updates + new_items + stickiness, user override flows,
conversation dump (incl. unsafe-sid sanitization), judge read_file
restriction. Existing freeform-mode tests updated to patch the
renamed `judge_goal_freeform` and skip Phase A explicitly.

* fix(goals): off-by-one in judge index, message-list plumbing, prompt tuning

Three live-test findings from running /goal end-to-end against
gemini-3-flash-preview as the judge:

1. Off-by-one bug — the judge sees the checklist rendered with 1-based
   indices ('1. [ ] foo, 2. [ ] bar') but the apply layer indexed
   state.checklist as 0-based. Result: every judge update landed on
   the wrong item, evidence got attached to neighbouring rows, and
   the genuine 'first pending' item (usually #1) never got marked.
   Fix: convert 1 → 0 in _parse_evaluate_response. Also tightened the
   user prompt to call out the 1-based scheme explicitly. New tests
   cover the parser conversion + an end-to-end fake-judge round-trip.

2. Conversation dump never happened — _extract_agent_messages tried
   common AIAgent attribute names (.messages, .conversation_history,
   etc.) but AIAgent doesn't expose the message list as an instance
   attribute; it lives inside run_conversation()'s scope. Result: the
   judge's read_file tool always saw history_path=unavailable. Fix:
   added an explicit messages= kwarg to evaluate_after_turn that all
   three call sites (CLI, gateway, TUI gateway) now pass directly.
   Agent-attribute extraction kept as back-compat fallback.

3. Prompt was too harsh on simple goals. The original 'be HARSH,
   default to leaving items pending' wording made the judge refuse
   to mark 'file exists' completed even after the agent ran ls,
   test -f, os.path.isfile, and find — burning the entire 8-turn
   budget on a fizzbuzz task. Softened to 'strict but not absurd'
   with explicit guidance on what counts as evidence and a directive
   not to require re-proving items already established earlier.

Re-tested live with the same fizzbuzz goal: now terminates in 2
turns with all 8 checklist items correctly attributed to their
own evidence. /subgoal user-action flow (add / complete / undo /
impossible) verified live as well.

2026-05-10 16:56:51 -07:00

Teknium

674fad1483

fix(goals): Ctrl+C during /goal loop auto-pauses the goal (#21888 )

Reported: Ctrl+C during an active /goal loop felt like it did nothing —
the agent would interrupt the current turn, then immediately queue another
continuation and keep going until the session ended or the 20-turn budget
ran out.

Root cause: cli.py's _maybe_continue_goal_after_turn() ran in the finally:
block around self.chat(...) unconditionally. Whether the turn completed
normally, got interrupted, or returned an empty string, the judge ran on
whatever was in conversation_history and — because the judge is fail-open
— a "continue" verdict pushed another CONTINUATION_PROMPT onto
_pending_input. Ctrl+C was invisible to the hook.

Fix:
- chat() now captures result['interrupted'] onto self._last_turn_interrupted
  (resets to False at entry so early-returns don't leak prior state).
- _maybe_continue_goal_after_turn() checks the flag first: on interrupt,
  auto-pause via mgr.pause(reason='user-interrupted (Ctrl+C)') and print
  a one-liner pointing the user at /goal resume or /goal clear. No judge
  call, no continuation enqueued.
- Also added an empty-response guard that mirrors gateway/run.py's
  _handle_message logic (empty reply → transient failure → skip judging
  so we don't trip the consecutive-parse-failures backstop unnecessarily).

The goal stays in the DB as paused, so /goal resume recovers it after
the user has sorted out whatever made them cancel. /goal clear still
works as before for a full stop.

Tests: tests/cli/test_cli_goal_interrupt.py covers:
  - interrupted turn pauses + doesn't queue + judge is NOT called
  - paused goal is resumable
  - empty / whitespace / missing assistant reply skips judging
  - healthy turn still enqueues continuation / marks done
  - chat() resets _last_turn_interrupted at entry (anti-leak guard)

All 55 existing goal tests still pass.

2026-05-08 06:53:13 -07:00

3 commits