hermes-agent/gateway
Teknium 6c89306437 fix: break stuck session resume loops after repeated restarts (#7536)
When a session gets stuck (hung terminal, runaway tool loop) and the
user restarts the gateway, the same session history loads and puts the
agent right back in the stuck state. The user is trapped in a loop:
restart → stuck → restart → stuck.

Fix: track restart-failure counts per session using a simple JSON file
(.restart_failure_counts). On each shutdown with active agents, the
counter increments for those sessions. On startup, if any session has
been active across 3+ consecutive restarts, it's auto-suspended —
giving the user a clean slate on their next message.

The counter resets to 0 when a session completes a turn successfully
(response delivered), so normal sessions that happen to be active
during planned restarts (/restart, hermes update) won't accumulate
false counts.

Implementation:
- _increment_restart_failure_counts(): called during stop() when
  agents are active. Writes {session_key: count} to JSON file.
  Sessions NOT active are dropped (loop broken).
- _suspend_stuck_loop_sessions(): called on startup. Reads the file,
  suspends sessions at threshold (3), clears the file.
- _clear_restart_failure_count(): called after successful response
  delivery. Removes the session from the counter file.

No SessionEntry schema changes. No database migration. Pure file-based
tracking that naturally cleans up.

Test plan:
- 9 new stuck-loop tests (increment, accumulate, threshold, clear,
  suspend, file cleanup, edge cases)
- All 28 gateway lifecycle tests pass (restart drain + auto-continue
  + stuck loop)
2026-04-14 17:08:35 -07:00
..
builtin_hooks refactor: remove dead code — 1,784 lines across 77 files (#9180) 2026-04-13 16:32:04 -07:00
platforms fix(discord): register native /restart slash command 2026-04-14 16:55:48 -07:00
__init__.py Enhance CLI with multi-platform messaging integration and configuration management 2026-02-02 19:01:51 -08:00
channel_directory.py fix(gateway): derive channel directory platforms from enum instead of hardcoded list (#7450) 2026-04-10 17:27:32 -07:00
config.py feat(gateway): add ignored_threads config for Telegram 2026-04-14 01:40:32 -07:00
delivery.py refactor: remove dead code — 1,784 lines across 77 files (#9180) 2026-04-13 16:32:04 -07:00
display_config.py fix(gateway): fix regression causing display.streaming to override root streaming key 2026-04-14 10:52:23 -07:00
hooks.py feat: built-in boot-md hook — run BOOT.md on gateway startup (#3733) 2026-03-29 10:19:54 -07:00
mirror.py chore: remove ~100 unused imports across 55 files (#3016) 2026-03-25 15:02:03 -07:00
pairing.py fix: multiple platform adaptors concurrency 2026-04-06 16:49:54 -07:00
restart.py fix(gateway): address restart review feedback 2026-04-10 21:18:34 -07:00
run.py fix: break stuck session resume loops after repeated restarts (#7536) 2026-04-14 17:08:35 -07:00
session.py refactor: remove dead code — 1,784 lines across 77 files (#9180) 2026-04-13 16:32:04 -07:00
session_context.py fix(gateway): add HERMES_SESSION_KEY to session_context contextvars 2026-04-11 15:35:04 -07:00
status.py fix: gateway auto-recovers from unexpected SIGTERM via systemd (#5646) 2026-04-14 15:35:58 -07:00
sticker_cache.py chore: remove ~100 unused imports across 55 files (#3016) 2026-03-25 15:02:03 -07:00
stream_consumer.py fix: prevent streaming cursor from appearing as standalone messages (#9538) 2026-04-14 01:52:42 -07:00