mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-07-05 12:42:30 +00:00
Three targeted fixes for Desktop GUI WebSocket stability when agent
turns starve the uvicorn event loop of CPU (GIL contention):
1. Loosen ws_ping_timeout for loopback binds (QW-1)
   - Loopback (Desktop): 30s ping interval / 60s ping timeout
   - Non-loopback (Cloudflare Tunnel): unchanged at 20s / 20s
   - A GIL-heavy agent turn can stall the event loop past 20s;
     uvicorn's keepalive ping runs on that same starved loop, so a
     20s timeout kills an otherwise-healthy local connection over a
     recoverable stall. 60s rides out the stall without affecting
     half-open detection on public binds.
2. Coalesce streaming token frames in WSTransport (CF-2)
   - Buffer high-frequency delta frames (message.delta, reasoning.delta,
     thinking.delta) and flush as a batch every ~33ms (~30fps)
   - Non-streaming frames (RPC responses, control/tool/completion events)
     flush pending tokens first, so wire ordering is preserved
   - Thread-safe via threading.Lock; worker threads return immediately
     instead of blocking on per-token loop wakeups
   - Reduces event-loop wakeup churn by orders of magnitude during model
     streaming, directly cutting GIL pressure
3. Loop heartbeat watchdog (CF-1)
   - Self-rearming call_later tick (2s) measures drift between expected
     and actual fire time using loop.time() (monotonic)
   - Logs 'event loop stalled Ns (GIL pressure suspected)' when drift >5s
   - Turns mysterious WS drops into diagnosable log entries
   - Uses a call_later chain (not a task), so it dies with the loop and
     there is nothing to cancel on shutdown
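The heartbeat watchdog in item 3 could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from this commit: the function name `start_loop_heartbeat`, the `on_stall` hook, and the logger name are hypothetical, while the 2s tick, 5s threshold, and log wording follow the description above.

```python
import asyncio
import logging
import time

log = logging.getLogger("loop_watchdog")  # hypothetical logger name

TICK_S = 2.0             # heartbeat period, per the description above
STALL_THRESHOLD_S = 5.0  # drift beyond this is reported as a stall


def start_loop_heartbeat(loop, tick=TICK_S, threshold=STALL_THRESHOLD_S,
                         on_stall=None):
    """Arm a self-rearming call_later chain that reports loop stalls.

    `on_stall` is a hypothetical hook added here so the sketch is testable.
    """
    def _tick(expected):
        # loop.time() is monotonic, so drift is immune to wall-clock jumps.
        drift = loop.time() - expected
        if drift > threshold:
            log.warning("event loop stalled %.1fs (GIL pressure suspected)",
                        drift)
            if on_stall is not None:
                on_stall(drift)
        # Re-arm. No task is created, so nothing needs cancelling on shutdown.
        loop.call_later(tick, _tick, loop.time() + tick)

    loop.call_later(tick, _tick, loop.time() + tick)


async def demo_stall():
    """Block the loop on purpose and return the drifts that were reported."""
    stalls = []
    start_loop_heartbeat(asyncio.get_running_loop(), tick=0.05,
                         threshold=0.1, on_stall=stalls.append)
    time.sleep(0.3)           # synchronous sleep starves the loop
    await asyncio.sleep(0.1)  # yield so the overdue tick can fire
    return stalls
```

Calling `asyncio.run(demo_stall())` with the shortened tick and threshold should report at least one stall, because the blocking `time.sleep` delays the first tick well past its expected fire time.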
Root cause: uvicorn's ws keepalive ping (20/20s) runs on the same
starved event loop as agent turns. Under GIL pressure from heavy agent
turns or delegation, the loop can't service the ping within 20s, so
the websockets protocol declares the connection dead. Reconnects fail
with ready_send_failed because the old process's loop is still wedged.
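As a rough illustration of the config tweak in item 1, the sketch below picks keepalive settings from the bind host. The helper names `is_loopback` and `ws_ping_settings` are hypothetical, and treating the literal hostname "localhost" as loopback is an assumption; `ws_ping_interval` and `ws_ping_timeout` are uvicorn's own setting names, and the values are the ones listed above.

```python
import ipaddress


def is_loopback(host: str) -> bool:
    """Return True when the bind host is a loopback address."""
    if host == "localhost":  # assumption: the hostname resolves to loopback
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (e.g. a public hostname): treat as non-loopback.
        return False


def ws_ping_settings(host: str) -> dict:
    """Keepalive settings: relaxed on loopback, strict on public binds."""
    if is_loopback(host):
        # Local Desktop connection: ride out recoverable event-loop stalls.
        return {"ws_ping_interval": 30.0, "ws_ping_timeout": 60.0}
    # Public bind (e.g. behind a tunnel): keep fast half-open detection.
    return {"ws_ping_interval": 20.0, "ws_ping_timeout": 20.0}
```

The returned dict could then be passed as keyword arguments when starting the server, for example `uvicorn.run(app, host=host, **ws_ping_settings(host))`.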
None of these fixes touch the model-facing message array, prompt
caching, message role alternation, or the wire protocol — they are
strictly display-transport improvements plus a config tweak and a
diagnostic log.
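The frame coalescing in item 2 can be sketched as follows. This is a minimal illustration, not the WSTransport code itself: the class name `FrameCoalescer`, the `"event"` key on frames, and the `send` / `schedule` callables are all hypothetical. It assumes `send` is cheap (for example, a hand-off to the event loop) and that `schedule` arranges for a callback to run about 33ms later.

```python
import threading
from typing import Callable, List

# Event names taken from the description above; the "event" key is assumed.
STREAMING_EVENTS = {"message.delta", "reasoning.delta", "thinking.delta"}
FLUSH_INTERVAL_S = 0.033  # ~30 flushes per second


class FrameCoalescer:
    def __init__(self,
                 send: Callable[[List[dict]], None],
                 schedule: Callable[[Callable[[], None]], None]) -> None:
        self._send = send          # receives a batch of frames in wire order
        self._schedule = schedule  # runs a callback after ~FLUSH_INTERVAL_S
        self._lock = threading.Lock()
        self._pending: List[dict] = []
        self._flush_scheduled = False

    def write(self, frame: dict) -> None:
        """Called from worker threads; never waits on a per-token wakeup."""
        with self._lock:
            if frame.get("event") in STREAMING_EVENTS:
                self._pending.append(frame)
                if not self._flush_scheduled:
                    # Only the first delta in a window arms the flush timer.
                    self._flush_scheduled = True
                    self._schedule(self.flush)
                return
            # Non-streaming frame: buffered deltas go out first, then the
            # frame itself, so ordering on the wire is preserved.
            batch, self._pending = self._pending + [frame], []
            self._send(batch)

    def flush(self) -> None:
        """Timer callback: send whatever deltas accumulated in the window."""
        with self._lock:
            batch, self._pending = self._pending, []
            self._flush_scheduled = False
            if batch:
                self._send(batch)
```

With asyncio, one possible `schedule` is `lambda cb: loop.call_soon_threadsafe(loop.call_later, FLUSH_INTERVAL_S, cb)`, which lets worker threads arm the timer without touching the loop directly. Sending while holding the lock keeps ordering simple at the cost of briefly serializing writers, which is acceptable only if `send` does not block.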
Tests: 762 passed, 17 skipped (0 failures) across test_tui_gateway_ws,
test_tui_gateway_server, test_web_server, and tui_gateway/ suites.
| File |
|---|
| __init__.py |
| entry.py |
| event_publisher.py |
| git_probe.py |
| loop_noise.py |
| project_tree.py |
| render.py |
| server.py |
| slash_worker.py |
| transport.py |
| ws.py |