fix(gateway): persist memory flush state to prevent redundant re-flushes on restart (#4481)

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-25 00:51:20 +00:00

* fix: force-close TCP sockets on client cleanup, detect and recover dead connections

When a provider drops connections mid-stream (e.g. OpenRouter outage),
httpx's graceful close leaves sockets in CLOSE-WAIT indefinitely. These
zombie connections accumulate and can prevent recovery without restarting.

Changes:
- _force_close_tcp_sockets: walks the httpx connection pool and calls
  socket.shutdown(SHUT_RDWR) + close() on every socket when a client is
  closed, tearing connections down immediately and preventing CLOSE-WAIT
  accumulation
- _cleanup_dead_connections: probes the primary client's pool for dead
  sockets (recv MSG_PEEK), rebuilds the client if any are found
- Pre-turn health check at the start of each run_conversation call that
  auto-recovers with a user-facing status message
- Primary client rebuild after stale stream detection to purge pool
- User-facing messages on streaming connection failures:
  "Connection to provider dropped — Reconnecting (attempt 2/3)"
  "Connection failed after 3 attempts — try again in a moment"

Made-with: Cursor
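
For illustration only, the two socket-level operations described in the change list above (shutdown + close, and a recv MSG_PEEK liveness probe) might look roughly like this on a plain socket. This is a minimal sketch, not the code from this commit: the names force_close and is_socket_dead are hypothetical, and walking the httpx connection pool is deliberately left out because it depends on httpx internals.

```python
import socket


def force_close(sock: socket.socket) -> None:
    """Shut down both directions, then release the descriptor (hypothetical helper)."""
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        # Already disconnected or never connected; still close below.
        pass
    sock.close()


def is_socket_dead(sock: socket.socket) -> bool:
    """Peek one byte without consuming it to see whether the peer has closed."""
    if sock.fileno() == -1:
        return True  # descriptor already closed locally
    previous_timeout = sock.gettimeout()
    try:
        sock.setblocking(False)
        data = sock.recv(1, socket.MSG_PEEK)
    except (BlockingIOError, InterruptedError):
        return False  # nothing to read yet, connection still open
    except OSError:
        return True  # reset or other socket-level failure
    finally:
        sock.settimeout(previous_timeout)  # restore the caller's blocking mode
    # An empty read means the peer closed its side of the connection.
    return data == b""
```

Because the probe uses MSG_PEEK, any bytes already waiting on a healthy connection stay in the receive buffer for the next real read.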

* fix: pool entry missing base_url for openrouter, clean error messages

- _resolve_runtime_from_pool_entry: add OPENROUTER_BASE_URL fallback
  when pool entry has no runtime_base_url (pool entries from auth.json
  credential_pool often omit base_url)
- Replace Rich console.print for auth errors with plain print() to
  prevent ANSI escape code mangling through prompt_toolkit's stdout patch
- Force-close TCP sockets on client cleanup to prevent CLOSE-WAIT
  accumulation after provider outages
- Pre-turn dead connection detection with auto-recovery and user message
- Primary client rebuild after stale stream detection
- User-facing status messages on streaming connection failures/retries
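
As a rough sketch of the base_url fallback described in the first item above: resolve_base_url is a hypothetical stand-in for the relevant part of _resolve_runtime_from_pool_entry, the entry is modelled as a plain dict, and the value given to OPENROUTER_BASE_URL here is an assumption made for illustration.

```python
# Assumed value for illustration; the real constant lives in the project.
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_base_url(provider: str, entry: dict) -> str:
    """Return the entry's base URL, falling back for openrouter when it is missing."""
    # Pool entries may omit runtime_base_url entirely or store an empty value.
    base_url = str(entry.get("runtime_base_url") or "").strip().rstrip("/")
    if provider == "openrouter":
        return base_url or OPENROUTER_BASE_URL
    return base_url
```

An entry that carries its own runtime_base_url keeps it; only a missing or empty value triggers the fallback.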

Made-with: Cursor

* fix(gateway): persist memory flush state to prevent redundant re-flushes on restart

The _session_expiry_watcher tracked flushed sessions in an in-memory set
(_pre_flushed_sessions) that was lost on gateway restart. Expired sessions
remained in sessions.json and were re-discovered every restart, causing
redundant AIAgent runs that burned API credits and blocked the event loop.

Fix: Add a memory_flushed boolean field to SessionEntry, persisted in
sessions.json. The watcher sets it after a successful flush. On restart,
the flag survives and the watcher skips already-flushed sessions.

- Add memory_flushed field to SessionEntry with to_dict/from_dict support
- Old sessions.json entries without the field default to False (backward compat)
- Remove the ephemeral _pre_flushed_sessions set from SessionStore
- Update tests: save/load roundtrip, legacy entry compat, auto-reset behavior
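
A minimal sketch of the persisted flag, assuming a dataclass-style SessionEntry: the session_id field and the sessions_needing_flush helper are hypothetical, and the real SessionEntry carries more fields than shown here.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class SessionEntry:
    session_id: str  # hypothetical field, for illustration only
    memory_flushed: bool = False  # persisted so it survives a restart

    def to_dict(self) -> dict[str, Any]:
        return {"session_id": self.session_id, "memory_flushed": self.memory_flushed}

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "SessionEntry":
        # Legacy entries written before the field existed default to False.
        return cls(
            session_id=data["session_id"],
            memory_flushed=bool(data.get("memory_flushed", False)),
        )


def sessions_needing_flush(entries: list[SessionEntry]) -> list[SessionEntry]:
    """Hypothetical watcher filter: skip sessions that were already flushed."""
    return [entry for entry in entries if not entry.memory_flushed]


# Round trip through JSON, as a sessions.json save/load would do.
flushed = SessionEntry(session_id="a", memory_flushed=True)
restored = SessionEntry.from_dict(json.loads(json.dumps(flushed.to_dict())))
legacy = SessionEntry.from_dict({"session_id": "b"})  # no memory_flushed key
pending = sessions_needing_flush([restored, legacy])
```

After a simulated restart, only the legacy entry is still pending, so the already-flushed session would not trigger another agent run.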

Author: Teknium (committed by GitHub)
Date: 2026-04-01 12:05:02 -07:00
Parent: 1515e8c8f2
Commit: 16d9f58445
GPG key ID: B5690EEEBB952194 (no known key found for this signature in database)

6 changed files with 290 additions and 35 deletions

									
hermes_cli/runtime_provider.py (2 lines changed)

@@ -133,6 +133,8 @@ def _resolve_runtime_from_pool_entry(
         if cfg_provider == "anthropic":
             cfg_base_url = str(model_cfg.get("base_url") or "").strip().rstrip("/")
         base_url = cfg_base_url or base_url or "https://api.anthropic.com"
+    elif provider == "openrouter":
+        base_url = base_url or OPENROUTER_BASE_URL
     elif provider == "nous":
         api_mode = "chat_completions"
     elif provider == "copilot":
