hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-11 03:31:55 +00:00

Author	SHA1	Message	Date
Teknium	38b1c7dce5	refactor(gateway): simplify auto-resume + extend to crash recovery Follow-up on top of @kyan12's PR #20888 — same feature, cleaner shape, wider coverage. Changes: - Drop the synthetic '[System note: ...]' in the internal MessageEvent. The existing _is_resume_pending branch in _handle_message_with_agent (run.py ~L13738) already injects a reason-aware recovery system note on the next turn. With kyan's text in place the model saw two stacked system notes. Now the event text is empty and the existing injection path owns the wording. - Drop SessionStore.list_resume_pending() as a new public method. The filter is 8 lines inline in _schedule_resume_pending_sessions() — one caller, no other pluggability need. - Add 'restart_interrupted' to the auto-resume reason set. That's the reason SessionStore.suspend_recently_active() stamps on sessions recovered from a crash/OOM/SIGKILL (no .clean_shutdown marker). Previously those sessions had to wait for a real user message to auto-resume; now they continue automatically at startup like drain-timeout interruptions do. - Reasons live in a _AUTO_RESUME_REASONS frozenset at class scope so future reasons (e.g. 'manual_resume_request') can be opted in with one line. Test coverage added: - drain-timeout + crash-recovery both scheduled - stale entries skipped (outside freshness window) - suspended entries skipped (suspended > resume_pending) - originless entries skipped (no routing target) - disallowed reasons skipped (graceful forward-compat) E2E verified end-to-end with a real on-disk SessionStore: 2 eligible sessions scheduled, 2 ineligible skipped, empty-text internal events delivered to the adapter. Co-authored-by: Kevin Yan <kevyan1998@gmail.com>	2026-05-07 05:05:34 -07:00
Kevin Yan	961a3535fa	fix(gateway): preserve resume marker on interrupted restart	2026-05-07 05:05:34 -07:00
Kevin Yan	fad684b1f3	feat(gateway): auto-resume interrupted sessions after restart	2026-05-07 05:05:34 -07:00
leprincep35700	b59bb4e351	fix(gateway): preserve home-channel thread targets across restart notifications	2026-05-03 08:47:49 -07:00
kshitijk4poor	6f2dab248a	fix: update tests for resume_pending semantics + add AUTHOR_MAP entries Tests updated to reflect suspend_recently_active now setting resume_pending=True (preserves session) instead of suspended=True (wipes session history). AUTHOR_MAP entries: millerc79 (#19033), shellybotmoyer (#18915)	2026-05-03 03:54:03 -07:00
johnncenae	1ef9e88549	fix(gateway): write restart markers atomically and fix Windows lock collisions	2026-04-30 19:58:16 -07:00
teknium1	7444e49d4e	fix(gateway): use transcript timestamp for auto-continue freshness Follow-up to PR #16802 (BeliefanX). The original fix read `agent_history[-1].get("timestamp")` for the tool-tail freshness gate, but `gateway/run.py` strips the `timestamp` field off all tool/tool_call rows when building `agent_history` from the raw transcript (see `clean_msg = {k: v for k, v in msg.items() if k != "timestamp"}`). At runtime the tool-tail branch always saw `None` and silently took the legacy-fresh path — the stale-guard never fired for the tool-tail case it was supposed to cover. Changes: - Read the freshness signal from the RAW `history` list (via new `_last_transcript_timestamp()` helper) BEFORE the strip. Both the resume_pending branch and the tool-tail branch use this single signal, replacing the two divergent ones. - Default window bumped 15 min → 1 hour via new `_AUTO_CONTINUE_FRESHNESS_SECS_DEFAULT`. The 15-minute default was shorter than the default `gateway_timeout` of 30 min, so a legitimate long-running turn interrupted near its timeout boundary and resumed shortly after would have been misclassified as stale. - Configurable via `config.yaml` `agent.gateway_auto_continue_freshness` (bridged to `HERMES_AUTO_CONTINUE_FRESHNESS` at gateway startup — same pattern as `gateway_timeout`). Set to 0 to disable the gate. - `_coerce_gateway_timestamp` now explicitly rejects bool (which is a subclass of int and would otherwise coerce to 0.0/1.0). - Tests rewritten to exercise the real production data shape: raw `history` → `_build_agent_history` strip → freshness decision. A regression guard (`test_stale_tool_tail_with_production_data_shape`) asserts `agent_history` tool rows carry NO timestamp, protecting against someone "fixing" the original bug by re-adding the stripped field (which would break the OpenAI tool-result message contract). Add BeliefanX to scripts/release.py AUTHOR_MAP. E2E verified: config.yaml → env var bridge → helper returns configured value; default 1h window; malformed/empty env var falls back to default; ISO-Z timestamps parse; ms-epoch coerced; bool rejected.	2026-04-28 05:20:35 -07:00
beliefanx	93feffbcfa	fix(gateway): avoid stale interrupted turn auto-continue	2026-04-28 05:20:35 -07:00
Teknium	c49a58a6d0	fix(gateway): mark only still-running sessions resume_pending on drain timeout (#12332 ) Follow-up to #12301. The drain-timeout branch of _stop_impl() was iterating the drain-start snapshot (active_agents) when marking sessions resume_pending. That snapshot can include sessions that finished gracefully during the drain window — marking them would give their next turn a stray 'your previous turn was interrupted by a gateway restart' system note even though the prior turn actually completed cleanly. Iterate self._running_agents at timeout time instead, mirroring _interrupt_running_agents() exactly: - only sessions still blocking the shutdown get marked - pending sentinels (AIAgent construction not yet complete) are skipped Changes: - gateway/run.py: swap active_agents.keys() for filtered self._running_agents.items() iteration in the drain-timeout mark loop. - tests/gateway/test_restart_resume_pending.py: two regression tests — finisher-during-drain not marked, pending sentinel not marked.	2026-04-18 17:40:34 -07:00
Teknium	cb4addacab	fix(gateway): auto-resume sessions after drain-timeout restart (#11852 ) (#12301 ) The shutdown banner promised "send any message after restart to resume where you left off" but the code did the opposite: a drain-timeout restart skipped the .clean_shutdown marker, which made the next startup call suspend_recently_active(), which marked the session suspended, which made get_or_create_session() spawn a fresh session_id with a 'Session automatically reset. Use /resume...' notice — contradicting the banner. Introduce a resume_pending state on SessionEntry that is distinct from suspended. Drain-timeout shutdown flags active sessions resume_pending instead of letting startup-wide suspension destroy them. The next message on the same session_key preserves the session_id, reloads the transcript, and the agent receives a reason-aware restart-resume system note that subsumes the existing tool-tail auto-continue note (PR #9934). Terminal escalation still flows through the existing .restart_failure_counts stuck-loop counter (PR #7536, threshold 3) — no parallel counter on SessionEntry. suspended still wins over resume_pending in get_or_create_session() so genuinely stuck sessions converge to a clean slate. Spec: PR #11852 (BrennerSpear). Implementation follows the spec with the approved correction (reuse .restart_failure_counts rather than adding a resume_attempts field). Changes: - gateway/session.py: SessionEntry.resume_pending/resume_reason/ last_resume_marked_at + to_dict/from_dict; SessionStore .mark_resume_pending()/clear_resume_pending(); get_or_create_session() returns existing entry when resume_pending (suspended still wins); suspend_recently_active() skips resume_pending entries. - gateway/run.py: _stop_impl() drain-timeout branch marks active sessions resume_pending before _interrupt_running_agents(); _run_agent() injects reason-aware restart-resume system note that subsumes the tool-tail case; successful-turn cleanup also clears resume_pending next to _clear_restart_failure_count(); _notify_active_sessions_of_shutdown() softens the restart banner to 'I'll try to resume where you left off' (honest about stuck-loop escalation). - tests/gateway/test_restart_resume_pending.py: 29 new tests covering SessionEntry roundtrip, mark/clear helpers, get_or_create_session precedence (suspended > resume_pending), suspend_recently_active skip, drain-timeout mark reason (restart vs shutdown), system-note injection decision tree (including tool-tail subsumption), banner wording, and stuck-loop escalation override.	2026-04-18 17:32:17 -07:00

10 commits