hermes-agent/tui_gateway
Peetwan ebb81f10cb fix(tui_gateway): prevent WS disconnect under GIL pressure
Three targeted fixes for Desktop GUI WebSocket stability when agent
turns starve the uvicorn event loop of CPU (GIL contention):

1. Loosen ws_ping_timeout for loopback binds (QW-1)
   - Loopback (Desktop): ping 30s interval / 60s timeout
   - Non-loopback (Cloudflare Tunnel): unchanged at 20s interval / 20s timeout
   - A GIL-heavy agent turn can stall the event loop past 20s;
     uvicorn's keepalive ping runs on that same starved loop, so a
     20s timeout kills an otherwise-healthy local connection over a
     recoverable stall. 60s rides out the stall without affecting
     half-open detection on public binds.

2. Coalesce streaming token frames in WSTransport (CF-2)
   - Buffer high-frequency delta frames (message.delta, reasoning.delta,
     thinking.delta) and flush as a batch every ~33ms (~30fps)
   - Non-streaming frames (RPC responses, control/tool/completion events)
     flush pending tokens first — wire ordering preserved
   - Thread-safe via threading.Lock; worker threads return immediately
     instead of blocking on per-token loop wakeups
   - Reduces event-loop wakeup churn by orders of magnitude during model
     streaming, directly cutting GIL pressure

3. Loop heartbeat watchdog (CF-1)
   - Self-rearming call_later tick (2s) measures drift between expected
     and actual fire time using loop.time() (monotonic)
   - Logs 'event loop stalled Ns (GIL pressure suspected)' when drift >5s
   - Turns mysterious WS drops into diagnosable log entries
   - Uses call_later chain (not a task) — dies with the loop, nothing
     to cancel on shutdown
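
A minimal sketch of the heartbeat watchdog described in item 3, not the code in ws.py. The function name start_loop_heartbeat, the logger name, and the on_stall hook are hypothetical and exist only for illustration. The 2s tick, the 5s threshold, and the log text come from the description above.

```python
import asyncio
import logging

log = logging.getLogger("tui_gateway.ws")  # hypothetical logger name

TICK_S = 2.0        # heartbeat interval, per the commit message
STALL_WARN_S = 5.0  # drift threshold, per the commit message


def start_loop_heartbeat(
    loop: asyncio.AbstractEventLoop,
    tick: float = TICK_S,
    warn_after: float = STALL_WARN_S,
    on_stall=None,  # hypothetical hook, used here only to make the sketch testable
) -> None:
    """Arm a self-rearming call_later chain that reports event-loop stalls."""

    def _tick(expected: float) -> None:
        # loop.time() is monotonic, so wall-clock adjustments cannot fake a stall.
        drift = loop.time() - expected
        if drift > warn_after:
            log.warning("event loop stalled %.1fs (GIL pressure suspected)", drift)
            if on_stall is not None:
                on_stall(drift)
        # Re-arm relative to "now" so a single long stall is reported once.
        loop.call_later(tick, _tick, loop.time() + tick)

    # No task is created: the pending handle is simply dropped when the loop stops.
    loop.call_later(tick, _tick, loop.time() + tick)
```

Because the chain is made of plain call_later handles, there is no task object to cancel at shutdown, which matches the design note in the last bullet of item 3.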

Root cause: uvicorn's WebSocket keepalive ping (20s interval / 20s
timeout) runs on the same starved event loop as agent turns. Under GIL
pressure from heavy agent turns or delegation, the loop can't service
the ping within 20s, so the websockets protocol declares the connection
dead. Reconnects fail with ready_send_failed because the old process's
loop is still wedged.
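
The keepalive choice from item 1 could be expressed roughly as below. This is a sketch under assumptions: the helper names ws_ping_settings and _is_loopback are hypothetical, and the real change may detect loopback binds differently. The returned keys match uvicorn's ws_ping_interval and ws_ping_timeout settings.

```python
import ipaddress


def _is_loopback(host: str) -> bool:
    """Best-effort check for a loopback bind address (hypothetical helper)."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname is treated as non-loopback to stay conservative.
        return False


def ws_ping_settings(host: str) -> dict:
    """Return keepalive settings (in seconds) for the given bind host."""
    if _is_loopback(host):
        # Desktop: tolerate a long but recoverable event-loop stall.
        return {"ws_ping_interval": 30.0, "ws_ping_timeout": 60.0}
    # Public bind (e.g. behind a tunnel): keep the 20/20 values so
    # half-open connections are still detected quickly.
    return {"ws_ping_interval": 20.0, "ws_ping_timeout": 20.0}
```

The resulting dict can be passed as keyword arguments when starting the server, for example uvicorn.run(app, host=host, **ws_ping_settings(host)).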

None of these fixes touch the model-facing message array, prompt
caching, message role alternation, or the wire protocol — they are
strictly display-transport improvements plus a config tweak and a
diagnostic log.
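
For illustration, the display-transport coalescing from item 2 might look like the sketch below. It is not the WSTransport implementation: the class name FrameCoalescer, the send_batch callable, and the "type" key on frames are all assumptions. The periodic ~33ms flush is left to the caller (for example a timer on the event loop) so the sketch stays small.

```python
import threading

# Streaming frame types named in the commit message.
STREAMING_TYPES = {"message.delta", "reasoning.delta", "thinking.delta"}


class FrameCoalescer:
    """Buffer streaming delta frames and emit them in ordered batches."""

    def __init__(self, send_batch):
        # send_batch is a hypothetical callable taking a list of frames.
        self._send_batch = send_batch
        self._lock = threading.Lock()
        self._pending = []

    def write(self, frame: dict) -> None:
        """Called from worker threads; returns without waiting on the loop."""
        with self._lock:
            if frame.get("type") in STREAMING_TYPES:
                # High-frequency token frame: buffer it until the next flush.
                self._pending.append(frame)
                return
            # Non-streaming frame: drain buffered tokens first so the
            # wire order matches the order in which frames were written.
            batch = self._pending + [frame]
            self._pending = []
            self._send_batch(batch)

    def flush(self) -> None:
        """Intended to be called by a periodic timer (about every 33ms)."""
        with self._lock:
            if not self._pending:
                return
            batch, self._pending = self._pending, []
            self._send_batch(batch)
```

Sending while the lock is held is a deliberate simplification: it keeps batches from concurrent writers in order, at the cost of assuming send_batch itself is cheap and non-blocking.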

Tests: 762 passed, 17 skipped (0 failures) across test_tui_gateway_ws,
test_tui_gateway_server, test_web_server, and tui_gateway/ suites.
2026-06-30 03:11:13 -07:00
__init__.py feat: new tui based on ink 2026-04-02 19:07:53 -05:00
entry.py fix(mcp): late-refresh must see desktop/dashboard discovery thread owner (#55514) 2026-06-30 02:08:37 -07:00
event_publisher.py chore: address copilot comments 2026-04-24 12:51:04 -04:00
git_probe.py fix(windows): hide console-window flash on backend git/gh/wmic/bash subprocess spawns 2026-06-28 05:28:45 -07:00
loop_noise.py fix(tui_gateway): suppress WS peer-hangup teardown error flood (#50005) (#54126) 2026-06-28 02:35:01 -07:00
project_tree.py feat(gateway): build authoritative project tree 2026-06-25 16:40:27 -05:00
render.py tui: inherit Python-side rendering via gateway bridge 2026-04-05 18:50:41 -05:00
server.py feat(display): friendly human-phrased tool labels for built-in tools (#55166) 2026-06-29 20:31:17 -07:00
slash_worker.py revert(windows): roll back terminal-popup PRs #53791 #53810 #53829 (#53853) 2026-06-27 15:59:00 -07:00
transport.py fix(tui-gateway): harden stdio transport against half-closed pipes + SIGTERM races (#17118) 2026-04-28 17:54:06 -05:00
ws.py fix(tui_gateway): prevent WS disconnect under GIL pressure 2026-06-30 03:11:13 -07:00