hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-08 13:12:08 +00:00

Teknium 404640a2b7 feat(goals): /goal checklist + /subgoal user controls (#23456)

* feat(goals): /goal checklist + /subgoal user controls

Two-phase judge for /goal: Phase A decomposes the goal into a detailed checklist on first turn; Phase B evaluates each pending item harshly against the agent's most recent response. The goal completes only when every item is in a terminal status (completed or impossible).

Adds /subgoal so the user can append, complete, mark impossible, undo, remove, or clear items the judge missed or got wrong.

Mechanics:
- GoalState gains `checklist` and `decomposed` fields, both backwards compatible (old state_meta rows load unchanged).
- Phase A: aux call writes a harsh, exhaustive checklist; biased toward more items not fewer. Falls through to legacy freeform judge when decompose fails.
- Phase B: judge gets the checklist + last-response snippet + path to a per-session conversation dump at <HERMES_HOME>/goals/<sid>.json. A bounded read_file tool (max 5 calls per turn, restricted to that one file) lets the judge inspect history when the snippet is ambiguous. Stickiness in code: terminal items are frozen, only the user can revert via /subgoal undo.
- Continuation prompt shows checklist progress when non-empty; reverts to old prompt when empty.
- Status line shows M/N done counts.

CLI + gateway + TUI gateway all pass the agent reference into evaluate_after_turn so the dump can be written.

Gateway-side /subgoal is allowed mid-run since it only modifies the checklist the judge consults at turn boundaries.

Tests: 24 new cases covering backcompat round-trip, Phase A decompose, Phase B updates + new_items + stickiness, user override flows, conversation dump (incl. unsafe-sid sanitization), judge read_file restriction. Existing freeform-mode tests updated to patch the renamed `judge_goal_freeform` and skip Phase A explicitly.
* fix(goals): off-by-one in judge index, message-list plumbing, prompt tuning

Three live-test findings from running /goal end-to-end against gemini-3-flash-preview as the judge:

1. Off-by-one bug: the judge sees the checklist rendered with 1-based indices ('1. [ ] foo, 2. [ ] bar') but the apply layer indexed state.checklist as 0-based. Result: every judge update landed on the wrong item, evidence got attached to neighbouring rows, and the genuine 'first pending' item (usually #1) never got marked. Fix: convert 1 → 0 in _parse_evaluate_response. Also tightened the user prompt to call out the 1-based scheme explicitly. New tests cover the parser conversion + an end-to-end fake-judge round-trip.

2. Conversation dump never happened: _extract_agent_messages tried common AIAgent attribute names (.messages, .conversation_history, etc.) but AIAgent doesn't expose the message list as an instance attribute; it lives inside run_conversation()'s scope. Result: the judge's read_file tool always saw history_path=unavailable. Fix: added an explicit messages= kwarg to evaluate_after_turn that all three call sites (CLI, gateway, TUI gateway) now pass directly. Agent-attribute extraction kept as back-compat fallback.

3. Prompt was too harsh on simple goals. The original 'be HARSH, default to leaving items pending' wording made the judge refuse to mark 'file exists' completed even after the agent ran ls, test -f, os.path.isfile, and find, burning the entire 8-turn budget on a fizzbuzz task. Softened to 'strict but not absurd' with explicit guidance on what counts as evidence and a directive not to require re-proving items already established earlier.

Re-tested live with the same fizzbuzz goal: now terminates in 2 turns with all 8 checklist items correctly attributed to their own evidence. /subgoal user-action flow (add / complete / undo / impossible) verified live as well.

2026-05-10 16:56:51 -07:00
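The index handling and completion rule described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in hermes-agent: `ChecklistItem`, `render_checklist`, `apply_updates`, `goal_complete`, and the update-dict shape are all hypothetical names. Only the behaviour comes from the commit message: the judge sees 1-based numbers, the apply layer converts them to 0-based, terminal items stay frozen, and the goal completes when every item is terminal.

```python
from dataclasses import dataclass

# Terminal statuses named in the commit message.
TERMINAL = {"completed", "impossible"}


@dataclass
class ChecklistItem:  # hypothetical stand-in for one checklist entry
    text: str
    status: str = "pending"
    evidence: str = ""


def render_checklist(items):
    """Render the checklist with 1-based numbers, as the judge sees it."""
    marks = {"completed": "x", "impossible": "!"}  # "!" is an assumed marker
    return "\n".join(
        f"{number}. [{marks.get(item.status, ' ')}] {item.text}"
        for number, item in enumerate(items, start=1)
    )


def apply_updates(items, updates):
    """Apply judge updates whose 'index' is 1-based; return how many applied."""
    applied = 0
    for update in updates:
        position = update["index"] - 1  # the 1 -> 0 conversion
        if not 0 <= position < len(items):
            continue  # ignore out-of-range indices, including 0
        item = items[position]
        if item.status in TERMINAL:
            continue  # stickiness: terminal items are frozen for the judge
        item.status = update["status"]
        item.evidence = update.get("evidence", "")
        applied += 1
    return applied


def goal_complete(items):
    """True only when every item is terminal (empty list: assumed not complete)."""
    return bool(items) and all(item.status in TERMINAL for item in items)


items = [ChecklistItem("foo"), ChecklistItem("bar")]
apply_updates(items, [{"index": 1, "status": "completed", "evidence": "ls output"}])
print(render_checklist(items))
# 1. [x] foo
# 2. [ ] bar
```

Without the `- 1`, the update for item 1 would land on "bar", which is the misattribution the commit describes.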
assets	fix: improve telegram topic mode setup	2026-05-04 12:07:17 -07:00
builtin_hooks	remove: BOOT.md built-in hook (#17093)	2026-04-28 09:50:27 -07:00
platforms	fix: use UTF-16 length for Telegram stream consumer message splitting	2026-05-10 16:21:07 -07:00
__init__.py	Enhance CLI with multi-platform messaging integration and configuration management	2026-02-02 19:01:51 -08:00
channel_directory.py	feat: complete plugin platform parity — all 12 integration points	2026-04-29 21:56:51 -07:00
config.py	feat(gateway): per-platform admin/user split for slash commands (salvage of #4443) (#23373)	2026-05-10 12:33:54 -07:00
delivery.py	fix(gateway): preserve case-sensitive chat IDs in DeliveryTarget.parse	2026-05-01 14:01:26 -07:00
display_config.py	feat(gateway): opt-in cleanup of temporary progress bubbles (#21186)	2026-05-07 05:04:37 -07:00
hooks.py	fix(plugins): register dynamically-loaded modules in sys.modules before exec	2026-04-29 23:34:35 -07:00
mirror.py	fix(gateway): avoid cross-user mirror writes in per-user group sessions	2026-04-26 18:31:24 -07:00
pairing.py	fix(pairing): enforce lockout on approve_code, not just generate_code (#10195) (#21325)	2026-05-07 07:18:21 -07:00
platform_registry.py	feat(plugins): add standalone_sender_fn for out-of-process cron delivery	2026-05-09 02:56:29 -07:00
restart.py	fix(gateway): address restart review feedback	2026-04-10 21:18:34 -07:00
run.py	feat(goals): /goal checklist + /subgoal user controls (#23456)	2026-05-10 16:56:51 -07:00
runtime_footer.py	feat(gateway): opt-in runtime-metadata footer on final replies (#17026)	2026-04-28 06:50:04 -07:00
session.py	refactor(gateway): simplify auto-resume + extend to crash recovery	2026-05-07 05:05:34 -07:00
session_context.py	fix(cron): run due jobs in parallel to prevent serial tick starvation (#13021)	2026-04-20 11:53:07 -07:00
shutdown_forensics.py	feat(gateway): shutdown forensics: non-blocking diag, per-phase timing, stale-unit warning (#23285)	2026-05-10 09:01:51 -07:00
slash_access.py	feat(gateway): per-platform admin/user split for slash commands (salvage of #4443) (#23373)	2026-05-10 12:33:54 -07:00
status.py	fix(gateway): refresh runtime argv metadata	2026-05-09 11:08:23 -07:00
sticker_cache.py	chore: remove ~100 unused imports across 55 files (#3016)	2026-03-25 15:02:03 -07:00
stream_consumer.py	fix: use UTF-16 length for Telegram stream consumer message splitting	2026-05-10 16:21:07 -07:00
whatsapp_identity.py	fix(whatsapp_identity): pin identifier regex to ASCII, clarify it's defense-in-depth	2026-04-26 20:48:31 -07:00
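
The `platforms` and `stream_consumer.py` entries above mention splitting Telegram messages by UTF-16 length. As a rough illustration of why that differs from Python's `len()`, the sketch below counts UTF-16 code units and splits text so that no chunk exceeds a given budget. The helper names `utf16_len` and `split_by_utf16` and the `limit` parameter are hypothetical and not taken from stream_consumer.py; the sketch assumes `limit` is at least 2 so that any single character fits.

```python
def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


def split_by_utf16(text: str, limit: int) -> list[str]:
    """Split text into chunks of at most `limit` UTF-16 code units.

    Splits only between code points, so a surrogate pair is never cut.
    Assumes limit >= 2 (hypothetical helper, not the project's code).
    """
    chunks, current, used = [], [], 0
    for ch in text:
        # Code points above the BMP take two UTF-16 code units.
        units = 2 if ord(ch) > 0xFFFF else 1
        if current and used + units > limit:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += units
    if current:
        chunks.append("".join(current))
    return chunks


print(len("a😀"), utf16_len("a😀"))  # 2 3
```

An emoji such as "😀" is one Python character but two UTF-16 code units, so a splitter that budgets with `len()` can produce chunks that are longer than intended when measured in UTF-16.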