hermes-agent/gateway
Teknium d848ea7109
fix: circuit breaker stops CPU-burning restart loops on persistent errors
When a gateway session hits a non-retryable error (e.g. invalid model
ID → HTTP 400), the agent fails and returns. But if the session keeps
receiving messages (or something periodically recreates agents), each
attempt spawns a new AIAgent — reinitializing MCP server connections,
burning CPU — only to hit the same 400 error again. On a 4-core server,
this pegs an entire core per stuck session and accumulates 300+ minutes
of CPU time over hours.

Fix: add a per-session consecutive failure counter in the gateway runner.

- Track consecutive non-retryable failures per session key
- After 3 consecutive failures (_MAX_CONSECUTIVE_FAILURES), block
  further agent creation for that session and notify the user:
  '⚠️ This session has failed N times in a row with a non-retryable
  error. Use /reset to start a new session.'
- Evict the cached agent when the circuit breaker engages to prevent
  stale state from accumulating
- Reset the counter on successful agent runs
- Clear the counter on /reset and /new so users can recover
- Uses getattr() pattern so bare GatewayRunner instances (common in
  tests using object.__new__) don't crash

Tests:
- 8 new tests in test_circuit_breaker.py covering counter behavior,
  threshold, reset, session isolation, and bare-runner safety

Addresses #7130.
2026-04-10 21:07:10 -07:00
..
builtin_hooks refactor: replace inline HERMES_HOME re-implementations with get_hermes_home() 2026-04-07 10:40:34 -07:00
platforms fix(api): send tool progress as custom SSE event to prevent model corruption (#6972) 2026-04-10 18:55:26 -07:00
__init__.py Enhance CLI with multi-platform messaging integration and configuration management 2026-02-02 19:01:51 -08:00
channel_directory.py fix(gateway): derive channel directory platforms from enum instead of hardcoded list (#7450) 2026-04-10 17:27:32 -07:00
config.py feat(matrix): add MATRIX_DM_MENTION_THREADS env var 2026-04-10 15:46:20 -07:00
delivery.py fix: remove 115 verified dead code symbols across 46 production files 2026-04-10 03:44:43 -07:00
hooks.py feat: built-in boot-md hook — run BOOT.md on gateway startup (#3733) 2026-03-29 10:19:54 -07:00
mirror.py chore: remove ~100 unused imports across 55 files (#3016) 2026-03-25 15:02:03 -07:00
pairing.py fix: multiple platform adaptors concurrency 2026-04-06 16:49:54 -07:00
run.py fix: circuit breaker stops CPU-burning restart loops on persistent errors 2026-04-10 21:07:10 -07:00
session.py fix: remove 115 verified dead code symbols across 46 production files 2026-04-10 03:44:43 -07:00
session_context.py fix(gateway): replace os.environ session state with contextvars for concurrency safety 2026-04-10 17:04:38 -07:00
status.py fix(gateway): implement platform-aware PID termination 2026-04-10 03:52:00 -07:00
sticker_cache.py chore: remove ~100 unused imports across 55 files (#3016) 2026-03-25 15:02:03 -07:00
stream_consumer.py fix(gateway): prevent duplicate messages on no-message-id platforms 2026-04-10 03:52:00 -07:00