hermes-agent/gateway
Teknium cb4addacab
fix(gateway): auto-resume sessions after drain-timeout restart (#11852) (#12301)
The shutdown banner promised "send any message after restart to resume
where you left off" but the code did the opposite: a drain-timeout
restart skipped the .clean_shutdown marker, which made the next startup
call suspend_recently_active(), which marked the session suspended,
which made get_or_create_session() spawn a fresh session_id with a
'Session automatically reset. Use /resume...' notice — contradicting
the banner.

Introduce a resume_pending state on SessionEntry that is distinct from
suspended. Drain-timeout shutdown flags active sessions resume_pending
instead of letting startup-wide suspension destroy them. The next
message on the same session_key preserves the session_id, reloads the
transcript, and the agent receives a reason-aware restart-resume
system note that subsumes the existing tool-tail auto-continue note
(PR #9934).

Terminal escalation still flows through the existing
.restart_failure_counts stuck-loop counter (PR #7536, threshold 3) —
no parallel counter on SessionEntry. suspended still wins over
resume_pending in get_or_create_session() so genuinely stuck sessions
converge to a clean slate.

Spec: PR #11852 (BrennerSpear). Implementation follows the spec with
the approved correction (reuse .restart_failure_counts rather than
adding a resume_attempts field).

Changes:
- gateway/session.py: SessionEntry.resume_pending/resume_reason/
  last_resume_marked_at + to_dict/from_dict; SessionStore
  .mark_resume_pending()/clear_resume_pending(); get_or_create_session()
  returns existing entry when resume_pending (suspended still wins);
  suspend_recently_active() skips resume_pending entries.
- gateway/run.py: _stop_impl() drain-timeout branch marks active
  sessions resume_pending before _interrupt_running_agents();
  _run_agent() injects reason-aware restart-resume system note that
  subsumes the tool-tail case; successful-turn cleanup also clears
  resume_pending next to _clear_restart_failure_count();
  _notify_active_sessions_of_shutdown() softens the restart banner to
  'I'll try to resume where you left off' (honest about stuck-loop
  escalation).
- tests/gateway/test_restart_resume_pending.py: 29 new tests covering
  SessionEntry roundtrip, mark/clear helpers, get_or_create_session
  precedence (suspended > resume_pending), suspend_recently_active
  skip, drain-timeout mark reason (restart vs shutdown), system-note
  injection decision tree (including tool-tail subsumption), banner
  wording, and stuck-loop escalation override.
2026-04-18 17:32:17 -07:00
..
builtin_hooks refactor: remove dead code — 1,784 lines across 77 files (#9180) 2026-04-13 16:32:04 -07:00
platforms feat(steer): /steer <prompt> injects a mid-run note after the next tool call (#12116) 2026-04-18 04:17:18 -07:00
__init__.py Enhance CLI with multi-platform messaging integration and configuration management 2026-02-02 19:01:51 -08:00
channel_directory.py feat(discord): support forum channels 2026-04-17 20:25:48 -07:00
config.py fix(qqbot): add back-compat for env var rename; drop qrcode core dep 2026-04-17 15:31:14 -07:00
delivery.py refactor: remove dead code — 1,784 lines across 77 files (#9180) 2026-04-13 16:32:04 -07:00
display_config.py fix(gateway): fix regression causing display.streaming to override root streaming key 2026-04-14 10:52:23 -07:00
hooks.py feat: built-in boot-md hook — run BOOT.md on gateway startup (#3733) 2026-03-29 10:19:54 -07:00
mirror.py chore: remove ~100 unused imports across 55 files (#3016) 2026-03-25 15:02:03 -07:00
pairing.py fix: multiple platform adaptors concurrency 2026-04-06 16:49:54 -07:00
restart.py fix(gateway): address restart review feedback 2026-04-10 21:18:34 -07:00
run.py fix(gateway): auto-resume sessions after drain-timeout restart (#11852) (#12301) 2026-04-18 17:32:17 -07:00
session.py fix(gateway): auto-resume sessions after drain-timeout restart (#11852) (#12301) 2026-04-18 17:32:17 -07:00
session_context.py fix: prevent stale os.environ leak after clear_session_vars (#10304) (#10527) 2026-04-15 14:27:17 -07:00
status.py fix(gateway): detect legacy hermes.service + mark --replace SIGTERM as planned (#11909) 2026-04-17 19:27:58 -07:00
sticker_cache.py chore: remove ~100 unused imports across 55 files (#3016) 2026-03-25 15:02:03 -07:00
stream_consumer.py feat(dingtalk): AI Cards streaming, emoji reactions, and media handling 2026-04-17 19:26:53 -07:00