mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-01 01:51:44 +00:00
* fix(tui-gateway): harden stdio transport against half-closed pipes + SIGTERM races
`tui_gateway` reports `tui_gateway_crash.log` traces where the main
thread sits in `sys.stdin` while a worker holds `_stdout_lock` mid-
flush, and SIGTERM then calls `sys.exit(0)` while the lock is still
held — the interpreter shutdown stalls behind the wedged write.
Two narrowly scoped hardenings:
**`tui_gateway/transport.py`**
* Move JSON serialisation outside the lock — long messages no longer
block sibling writers while we serialise.
* Treat `BrokenPipeError`, `ValueError` ("I/O on closed file") and
generic `OSError` from both `write` and `flush` as "peer is gone":
return `False` instead of bubbling, matching what `write_json`'s
callers in `entry.py` already expect.
* Split `flush` into its own try block so a stuck flush never strands
a partial write or holds the lock indefinitely on its way out.
* Optional `HERMES_TUI_GATEWAY_NO_FLUSH=1` env knob to skip explicit
`flush()` entirely on environments where a half-closed read pipe
produces an indefinite kernel-level block. Default unchanged.
**`tui_gateway/entry.py`**
* `_log_signal` now spawns a 1-second daemon timer that calls
`os._exit(0)` if the orderly `sys.exit(0)` path is itself stuck
behind a wedged worker. Atexit handlers run inside the grace
window when they can; the timer is the safety net so a deadlocked
flush no longer strands the gateway process.
Tests:
* `test_write_json_closed_stream_returns_false` — ValueError path.
* `test_write_json_oserror_on_flush_returns_false` — OSError on flush
must not strand the lock; the write portion still landed before the
flush failure.
* `test_write_json_no_flush_env_skips_flush` — env knob bypass.
Validation: `scripts/run_tests.sh tests/tui_gateway/test_protocol.py`
(42/42 pass; one pre-existing failure on
`test_session_resume_returns_hydrated_messages` is unrelated to this
change — same `include_ancestors` mock kwarg issue tracked elsewhere).
`scripts/run_tests.sh tests/test_tui_gateway_server.py` 90/90 pass.
* review(copilot): tighten transport hardening comments + test cleanup
* review(copilot): narrow exception capture, configurable grace, simpler no-flush test
* fix(tui-gateway): narrow ValueError to closed-stream; surface UnicodeEncodeError
Copilot review on PR #17118: `UnicodeEncodeError` is a ValueError
subclass, so a non-UTF-8 stdout (mismatched PYTHONIOENCODING / locale)
would have been silently swallowed as 'peer gone' under
`except ValueError`. That hides a real environment bug.
Now:
- UnicodeEncodeError → log with exc_info (warning) and drop the frame
- ValueError where str(e) contains 'closed file' → peer gone, return False
- Any other ValueError → log loudly, drop frame (defensive, but visible)
Same shape applied to flush. Adds two regression tests.
* fix(tui-gateway): reserve write() False for peer-gone; re-raise programming errors
Round 2 Copilot review on PR #17118: `Transport.write()` returning
`False` is documented as 'peer is gone', and `entry.py` reacts by
calling `sys.exit(0)`. But the implementation also returned False
for non-IO conditions (non-JSON-safe payloads, UnicodeEncodeError,
unrelated ValueErrors), so a programming error or local env bug would
present as a clean disconnect — exactly the diagnosis pain we wanted
to eliminate.
Now:
- `json.dumps` failure → re-raises (TypeError/ValueError surfaces in crash log)
- `BrokenPipeError` → False (peer gone)
- `ValueError('...closed file...')` → False (peer gone)
- `UnicodeEncodeError` and any other ValueError → re-raise
- `OSError` → False (existing IO-failure semantics, debug-logged)
Tests updated to assert the re-raise behaviour and added a
non-serializable-payload regression test.
* fix(tui-gateway): narrow OSError to peer-gone errnos; honest test naming
Round 3 Copilot review on PR #17118:
- Docstring claimed False = peer gone, but generic OSError on write/flush
also returned False — meaning ENOSPC/EACCES/EIO would silently exit.
Added `_PEER_GONE_ERRNOS = {EPIPE, ECONNRESET, EBADF, ESHUTDOWN, +WSA}`
and narrowed the OSError handlers; non-peer-gone errnos re-raise.
Docstring now lists OSError as peer-gone branch with the errno set.
- The `_DISABLE_FLUSH` test was named after the env var but actually
patched the module constant. Renamed it to reflect the contract being
tested (skips flush when constant is true) AND added a real
end-to-end test that sets the env var, reloads transport.py, and
asserts the constant flips. Cleanup reload restores defaults so
parallel tests stay isolated.
Self-review (avoid round 4):
- Verified TeeTransport's secondary-swallow stays intentional.
- _log_signal grace path already covered by separate tests.
206 lines
7.8 KiB
Python
206 lines
7.8 KiB
Python
import json
|
|
import os
|
|
import signal
|
|
import sys
|
|
import time
|
|
import traceback
|
|
|
|
from tui_gateway import server
|
|
from tui_gateway.server import _CRASH_LOG, dispatch, resolve_skin, write_json
|
|
from tui_gateway.transport import TeeTransport
|
|
|
|
|
|
def _install_sidecar_publisher() -> None:
|
|
"""Mirror every dispatcher emit to the dashboard sidebar via WS.
|
|
|
|
Activated by `HERMES_TUI_SIDECAR_URL`, set by the dashboard's
|
|
``/api/pty`` endpoint when a chat tab passes a ``channel`` query param.
|
|
Best-effort: connect failure or runtime drop falls back to stdio-only.
|
|
"""
|
|
url = os.environ.get("HERMES_TUI_SIDECAR_URL")
|
|
|
|
if not url:
|
|
return
|
|
|
|
from tui_gateway.event_publisher import WsPublisherTransport
|
|
|
|
server._stdio_transport = TeeTransport(
|
|
server._stdio_transport, WsPublisherTransport(url)
|
|
)
|
|
|
|
|
|
# How long to wait for orderly shutdown (atexit + finalisers) before
|
|
# falling back to ``os._exit(0)`` so a wedged worker mid-flush can't
|
|
# strand the process. 1s covers the gateway's own shutdown work
|
|
# (thread-pool drain + session finalize) on every machine we've
|
|
# tested; override via ``HERMES_TUI_GATEWAY_SHUTDOWN_GRACE_S`` if a
|
|
# slower environment needs more headroom (e.g. encrypted disks
|
|
# flushing checkpoints) and accept that a longer grace also means a
|
|
# longer wait when shutdown actually deadlocks.
|
|
_DEFAULT_SHUTDOWN_GRACE_S = 1.0
|
|
|
|
|
|
def _shutdown_grace_seconds() -> float:
|
|
raw = (os.environ.get("HERMES_TUI_GATEWAY_SHUTDOWN_GRACE_S") or "").strip()
|
|
if not raw:
|
|
return _DEFAULT_SHUTDOWN_GRACE_S
|
|
try:
|
|
value = float(raw)
|
|
except ValueError:
|
|
return _DEFAULT_SHUTDOWN_GRACE_S
|
|
return value if value > 0 else _DEFAULT_SHUTDOWN_GRACE_S
|
|
|
|
|
|
def _log_signal(signum: int, frame) -> None:
|
|
"""Capture WHICH thread and WHERE a termination signal hit us.
|
|
|
|
SIG_DFL for SIGPIPE kills the process silently the instant any
|
|
background thread (TTS playback, beep, voice status emitter, etc.)
|
|
writes to a stdout the TUI has stopped reading. Without this
|
|
handler the gateway-exited banner in the TUI has no trace — the
|
|
crash log never sees a Python exception because the kernel reaps
|
|
the process before the interpreter runs anything.
|
|
|
|
Termination semantics: ``sys.exit(0)`` here used to race the worker
|
|
pool — a thread holding ``_stdout_lock`` mid-flush would block the
|
|
interpreter shutdown indefinitely. We now log the stack, give the
|
|
process the configured shutdown grace
|
|
(``HERMES_TUI_GATEWAY_SHUTDOWN_GRACE_S``, default
|
|
``_DEFAULT_SHUTDOWN_GRACE_S``) to drain naturally on a background
|
|
thread, and fall back to ``os._exit(0)`` so a wedged write/flush
|
|
can never strand the process.
|
|
"""
|
|
name = {
|
|
signal.SIGPIPE: "SIGPIPE",
|
|
signal.SIGTERM: "SIGTERM",
|
|
signal.SIGHUP: "SIGHUP",
|
|
}.get(signum, f"signal {signum}")
|
|
try:
|
|
os.makedirs(os.path.dirname(_CRASH_LOG), exist_ok=True)
|
|
with open(_CRASH_LOG, "a", encoding="utf-8") as f:
|
|
f.write(
|
|
f"\n=== {name} received · {time.strftime('%Y-%m-%d %H:%M:%S')} ===\n"
|
|
)
|
|
if frame is not None:
|
|
f.write("main-thread stack at signal delivery:\n")
|
|
traceback.print_stack(frame, file=f)
|
|
# All live threads — signal may have been triggered by a
|
|
# background thread (write to broken stdout from TTS, etc.).
|
|
import threading as _threading
|
|
for tid, th in _threading._active.items():
|
|
f.write(f"\n--- thread {th.name} (id={tid}) ---\n")
|
|
f.write("".join(traceback.format_stack(sys._current_frames().get(tid))))
|
|
except Exception:
|
|
pass
|
|
print(f"[gateway-signal] {name}", file=sys.stderr, flush=True)
|
|
|
|
import threading as _threading
|
|
|
|
def _hard_exit() -> None:
|
|
# If a worker thread is still mid-flush on a half-closed pipe,
|
|
# ``sys.exit(0)`` would wait forever for it to drop the GIL on
|
|
# interpreter shutdown. ``os._exit`` skips atexit handlers but
|
|
# breaks the deadlock. The crash log + stderr line above are
|
|
# the forensic trail.
|
|
os._exit(0)
|
|
|
|
timer = _threading.Timer(_shutdown_grace_seconds(), _hard_exit)
|
|
timer.daemon = True
|
|
timer.start()
|
|
|
|
try:
|
|
sys.exit(0)
|
|
except SystemExit:
|
|
# Re-raise so the main-thread interpreter unwinds and runs
|
|
# atexit + finalisers inside the grace window. Python signal
|
|
# handlers always run on the main thread, but a worker thread
|
|
# holding ``_stdout_lock`` mid-flush can keep that unwind
|
|
# waiting indefinitely; the daemon timer above is the safety
|
|
# net for that exact case.
|
|
raise
|
|
|
|
|
|
# SIGPIPE: ignore, don't exit. The old SIG_DFL killed the process
|
|
# silently whenever a *background* thread (TTS playback chain, voice
|
|
# debug stderr emitter, beep thread) wrote to a pipe the TUI had gone
|
|
# quiet on — even though the main thread was perfectly fine waiting on
|
|
# stdin. Ignoring the signal lets Python raise BrokenPipeError on the
|
|
# offending write (write_json already handles that with a clean
|
|
# sys.exit(0) + _log_exit), which keeps the gateway alive as long as
|
|
# the main command pipe is still readable. Terminal signals still
|
|
# route through _log_signal so kills and hangups are diagnosable.
|
|
signal.signal(signal.SIGPIPE, signal.SIG_IGN)
|
|
signal.signal(signal.SIGTERM, _log_signal)
|
|
signal.signal(signal.SIGHUP, _log_signal)
|
|
signal.signal(signal.SIGINT, signal.SIG_IGN)
|
|
|
|
|
|
def _log_exit(reason: str) -> None:
|
|
"""Record why the gateway subprocess is shutting down.
|
|
|
|
Three exit paths (startup write fail, parse-error-response write fail,
|
|
dispatch-response write fail, stdin EOF) all collapse into a silent
|
|
sys.exit(0) here. Without this trail the TUI shows "gateway exited"
|
|
with no actionable clue about WHICH broken pipe or WHICH message
|
|
triggered it — the main reason voice-mode turns look like phantom
|
|
crashes when the real story is "TUI read pipe closed on this event".
|
|
"""
|
|
try:
|
|
os.makedirs(os.path.dirname(_CRASH_LOG), exist_ok=True)
|
|
with open(_CRASH_LOG, "a", encoding="utf-8") as f:
|
|
f.write(
|
|
f"\n=== gateway exit · {time.strftime('%Y-%m-%d %H:%M:%S')} "
|
|
f"· reason={reason} ===\n"
|
|
)
|
|
except Exception:
|
|
pass
|
|
print(f"[gateway-exit] {reason}", file=sys.stderr, flush=True)
|
|
|
|
|
|
def main():
|
|
_install_sidecar_publisher()
|
|
|
|
# MCP tool discovery — inline is safe here: TUI entry is a plain
|
|
# sync loop with no asyncio event loop to block. Previously ran as
|
|
# a model_tools.py module-level side effect; moved to explicit
|
|
# startup calls to avoid freezing the gateway's loop on lazy import
|
|
# (#16856).
|
|
try:
|
|
from tools.mcp_tool import discover_mcp_tools
|
|
discover_mcp_tools()
|
|
except Exception:
|
|
pass
|
|
|
|
if not write_json({
|
|
"jsonrpc": "2.0",
|
|
"method": "event",
|
|
"params": {"type": "gateway.ready", "payload": {"skin": resolve_skin()}},
|
|
}):
|
|
_log_exit("startup write failed (broken stdout pipe before first event)")
|
|
sys.exit(0)
|
|
|
|
for raw in sys.stdin:
|
|
line = raw.strip()
|
|
if not line:
|
|
continue
|
|
|
|
try:
|
|
req = json.loads(line)
|
|
except json.JSONDecodeError:
|
|
if not write_json({"jsonrpc": "2.0", "error": {"code": -32700, "message": "parse error"}, "id": None}):
|
|
_log_exit("parse-error-response write failed (broken stdout pipe)")
|
|
sys.exit(0)
|
|
continue
|
|
|
|
method = req.get("method") if isinstance(req, dict) else None
|
|
resp = dispatch(req)
|
|
if resp is not None:
|
|
if not write_json(resp):
|
|
_log_exit(f"response write failed for method={method!r} (broken stdout pipe)")
|
|
sys.exit(0)
|
|
|
|
_log_exit("stdin EOF (TUI closed the command pipe)")
|
|
|
|
|
|
if __name__ == "__main__":
|
|
main()
|