Mirror of https://github.com/NousResearch/hermes-agent.git
Synced 2026-06-19 10:02:16 +00:00
* fix: circuit breaker stops CPU-burning restart loops on persistent errors
When a gateway session hits a non-retryable error (e.g. invalid model
ID → HTTP 400), the agent fails and returns. But if the session keeps
receiving messages (or something periodically recreates agents), each
attempt spawns a new AIAgent — reinitializing MCP server connections,
burning CPU — only to hit the same 400 error again. On a 4-core server,
this pegs an entire core per stuck session and accumulates 300+ minutes
of CPU time over hours.
Fix: add a per-session consecutive failure counter in the gateway runner.
- Track consecutive non-retryable failures per session key
- After 3 consecutive failures (_MAX_CONSECUTIVE_FAILURES), block
  further agent creation for that session and notify the user:
  '⚠️ This session has failed N times in a row with a non-retryable
  error. Use /reset to start a new session.'
- Evict the cached agent when the circuit breaker engages to prevent
  stale state from accumulating
- Reset the counter on successful agent runs
- Clear the counter on /reset and /new so users can recover
- Uses getattr() pattern so bare GatewayRunner instances (common in
  tests using object.__new__) don't crash
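
The counter behavior listed above can be illustrated with a minimal sketch. This is not the code from the commit: the class body, the attribute names `_consecutive_failures` and `_agent_cache`, and the method names `is_blocked`, `record_failure`, `record_success`, and `reset_session` are all hypothetical. Only `_MAX_CONSECUTIVE_FAILURES`, the threshold of 3, and the `getattr()` / `object.__new__` detail come from the commit message.

```python
_MAX_CONSECUTIVE_FAILURES = 3  # threshold named in the commit message


class GatewayRunner:
    """Hypothetical stand-in for the real gateway runner."""

    def __init__(self):
        self._consecutive_failures = {}  # session key -> failure count (assumed name)
        self._agent_cache = {}           # session key -> cached agent (assumed name)

    def _failures(self):
        # getattr() with a default keeps bare instances created through
        # object.__new__ (which skips __init__) from raising AttributeError.
        failures = getattr(self, "_consecutive_failures", None)
        if failures is None:
            failures = {}
            self._consecutive_failures = failures
        return failures

    def is_blocked(self, session_key):
        """True once the session has reached the failure threshold."""
        return self._failures().get(session_key, 0) >= _MAX_CONSECUTIVE_FAILURES

    def record_failure(self, session_key):
        """Count one non-retryable failure; return True if the breaker is engaged."""
        failures = self._failures()
        failures[session_key] = failures.get(session_key, 0) + 1
        if failures[session_key] >= _MAX_CONSECUTIVE_FAILURES:
            # Evict the cached agent so stale state does not accumulate.
            cache = getattr(self, "_agent_cache", None)
            if cache is not None:
                cache.pop(session_key, None)
            return True
        return False

    def record_success(self, session_key):
        """A successful run resets the consecutive-failure count."""
        self._failures().pop(session_key, None)

    def reset_session(self, session_key):
        """Models /reset and /new: clear the counter so the user can recover."""
        self._failures().pop(session_key, None)
```

In this sketch a caller would check `is_blocked(session_key)` before creating an agent and, when it returns True, send the user the notification instead of spawning a new agent. Counters are keyed per session, so one stuck session does not block the others.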
Tests:
- 8 new tests in test_circuit_breaker.py covering counter behavior,
  threshold, reset, session isolation, and bare-runner safety
Addresses #7130.
* Revert "fix: circuit breaker stops CPU-burning restart loops on persistent errors"
This reverts commit
| Name |
|---|
| builtin_hooks |
| platforms |
| __init__.py |
| channel_directory.py |
| config.py |
| delivery.py |
| hooks.py |
| mirror.py |
| pairing.py |
| run.py |
| session.py |
| session_context.py |
| status.py |
| sticker_cache.py |
| stream_consumer.py |