fix(tui-gateway): harden stdio transport against half-closed pipes + SIGTERM races (#17118)

* fix(tui-gateway): harden stdio transport against half-closed pipes + SIGTERM races

`tui_gateway` reports `tui_gateway_crash.log` traces where the main
thread sits in `sys.stdin` while a worker holds `_stdout_lock` mid-
flush, and SIGTERM then calls `sys.exit(0)` while the lock is still
held — the interpreter shutdown stalls behind the wedged write.

Two narrowly scoped hardenings:

**`tui_gateway/transport.py`**

* Move JSON serialisation outside the lock — long messages no longer
  block sibling writers while we serialise.
* Treat `BrokenPipeError`, `ValueError` ("I/O on closed file") and
  generic `OSError` from both `write` and `flush` as "peer is gone":
  return `False` instead of bubbling, matching what `write_json`'s
  callers in `entry.py` already expect.
* Split `flush` into its own try block so a stuck flush never strands
  a partial write or holds the lock indefinitely on its way out.
* Optional `HERMES_TUI_GATEWAY_NO_FLUSH=1` env knob to skip explicit
  `flush()` entirely on environments where a half-closed read pipe
  produces an indefinite kernel-level block.  Default unchanged.

**`tui_gateway/entry.py`**

* `_log_signal` now spawns a 1-second daemon timer that calls
  `os._exit(0)` if the orderly `sys.exit(0)` path is itself stuck
  behind a wedged worker.  Atexit handlers run inside the grace
  window when they can; the timer is the safety net so a deadlocked
  flush no longer strands the gateway process.

Tests:

* `test_write_json_closed_stream_returns_false` — ValueError path.
* `test_write_json_oserror_on_flush_returns_false` — OSError on flush
  must not strand the lock; the write portion still landed before the
  flush failure.
* `test_write_json_no_flush_env_skips_flush` — env knob bypass.

Validation: `scripts/run_tests.sh tests/tui_gateway/test_protocol.py`
(42/42 pass; one pre-existing failure on
`test_session_resume_returns_hydrated_messages` is unrelated to this
change — same `include_ancestors` mock kwarg issue tracked elsewhere).
`scripts/run_tests.sh tests/test_tui_gateway_server.py` 90/90 pass.

* review(copilot): tighten transport hardening comments + test cleanup

* review(copilot): narrow exception capture, configurable grace, simpler no-flush test

* fix(tui-gateway): narrow ValueError to closed-stream; surface UnicodeEncodeError

Copilot review on PR #17118: `UnicodeEncodeError` is a ValueError
subclass, so a non-UTF-8 stdout (mismatched PYTHONIOENCODING / locale)
would have been silently swallowed as 'peer gone' under
`except ValueError`.  That hides a real environment bug.

Now:
- UnicodeEncodeError → log with exc_info (warning) and drop the frame
- ValueError where str(e) contains 'closed file' → peer gone, return False
- Any other ValueError → log loudly, drop frame (defensive, but visible)

Same shape applied to flush.  Adds two regression tests.

* fix(tui-gateway): reserve write() False for peer-gone; re-raise programming errors

Round 2 Copilot review on PR #17118: `Transport.write()` returning
`False` is documented as 'peer is gone', and `entry.py` reacts by
calling `sys.exit(0)`.  But the implementation also returned False
for non-IO conditions (non-JSON-safe payloads, UnicodeEncodeError,
unrelated ValueErrors), so a programming error or local env bug would
present as a clean disconnect — exactly the diagnosis pain we wanted
to eliminate.

Now:
- `json.dumps` failure → re-raises (TypeError/ValueError surfaces in crash log)
- `BrokenPipeError` → False (peer gone)
- `ValueError('...closed file...')` → False (peer gone)
- `UnicodeEncodeError` and any other ValueError → re-raise
- `OSError` → False (existing IO-failure semantics, debug-logged)

Tests updated to assert the re-raise behaviour and added a
non-serializable-payload regression test.

* fix(tui-gateway): narrow OSError to peer-gone errnos; honest test naming

Round 3 Copilot review on PR #17118:

- Docstring claimed False = peer gone, but generic OSError on write/flush
  also returned False — meaning ENOSPC/EACCES/EIO would silently exit.
  Added `_PEER_GONE_ERRNOS = {EPIPE, ECONNRESET, EBADF, ESHUTDOWN, +WSA}`
  and narrowed the OSError handlers; non-peer-gone errnos re-raise.
  Docstring now lists OSError as peer-gone branch with the errno set.
- The `_DISABLE_FLUSH` test was named after the env var but actually
  patched the module constant. Renamed it to reflect the contract being
  tested (skips flush when constant is true) AND added a real
  end-to-end test that sets the env var, reloads transport.py, and
  asserts the constant flips. Cleanup reload restores defaults so
  parallel tests stay isolated.

Self-review (avoid round 4):
- Verified TeeTransport's secondary-swallow stays intentional.
- _log_signal grace path already covered by separate tests.
This commit is contained in:
brooklyn! 2026-04-28 15:54:06 -07:00 committed by GitHub
parent af6b1a3343
commit 1e326c686d
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 283 additions and 8 deletions

View file

@ -83,6 +83,134 @@ def test_write_json_broken_pipe(server):
assert server.write_json({"x": 1}) is False
def test_write_json_closed_stream_returns_false(server):
"""ValueError ('I/O on closed file') used to bubble up; treat as gone."""
class _Closed:
def write(self, _): raise ValueError("I/O operation on closed file")
def flush(self): raise ValueError("I/O operation on closed file")
server._real_stdout = _Closed()
assert server.write_json({"x": 1}) is False
def test_write_json_unicode_encode_error_re_raises(server):
"""A non-UTF-8 stdout encoding raises UnicodeEncodeError (a ValueError
subclass). It must NOT be swallowed as 'peer gone' that would let
`entry.py` exit cleanly via the False path and hide the real config
bug. We re-raise so the existing crash-log infrastructure records it."""
class _AsciiOnly:
def write(self, line):
line.encode("ascii") # raises UnicodeEncodeError on non-ascii
def flush(self): pass
server._real_stdout = _AsciiOnly()
with pytest.raises(UnicodeEncodeError):
server.write_json({"msg": "héllo"})
def test_write_json_unrelated_value_error_re_raises(server):
"""Only ValueError('...closed file...') means peer gone. Other
ValueErrors are programming errors and must surface."""
class _BadValue:
def write(self, _): raise ValueError("something else entirely")
def flush(self): pass
server._real_stdout = _BadValue()
with pytest.raises(ValueError, match="something else entirely"):
server.write_json({"x": 1})
def test_write_json_non_serializable_payload_re_raises(server):
"""Non-JSON-safe payloads are programming errors — they must NOT be
silently dropped via the False path (which would trigger a clean exit
in entry.py and mask the real bug)."""
import io
server._real_stdout = io.StringIO()
with pytest.raises(TypeError):
server.write_json({"obj": object()})
def test_write_json_peer_gone_oserror_on_flush_returns_false(server):
"""A flush that raises a peer-gone OSError (EPIPE) must not strand
the lock or crash; it returns False so the dispatcher exits cleanly."""
import errno
written = []
class _FlushPeerGone:
def write(self, line): written.append(line)
def flush(self): raise OSError(errno.EPIPE, "broken pipe")
server._real_stdout = _FlushPeerGone()
assert server.write_json({"x": 1}) is False
assert written and json.loads(written[0]) == {"x": 1}
def test_write_json_non_peer_gone_oserror_re_raises(server):
"""Host I/O failures (ENOSPC, EACCES, EIO …) are NOT peer-gone — they
must re-raise so the crash log records them instead of looking like
a clean disconnect via the False path."""
import errno
class _DiskFull:
def write(self, _): raise OSError(errno.ENOSPC, "no space left")
def flush(self): pass
server._real_stdout = _DiskFull()
with pytest.raises(OSError, match="no space"):
server.write_json({"x": 1})
def test_write_json_skips_flush_when_disable_flush_true(monkeypatch):
"""`StdioTransport` skips flush when `_DISABLE_FLUSH` is true.
Tests the runtime *behaviour* via direct module-attr patch. The env
var module constant wiring is covered by the dedicated env test
below; reloading server.py here would re-register atexit hooks and
recreate the worker pool.
"""
import importlib
transport_mod = importlib.import_module("tui_gateway.transport")
monkeypatch.setattr(transport_mod, "_DISABLE_FLUSH", True)
flushed = {"count": 0}
written = []
class _Stream:
def write(self, line): written.append(line)
def flush(self): flushed["count"] += 1
stream = _Stream()
transport = transport_mod.StdioTransport(lambda: stream, threading.Lock())
assert transport.write({"x": 1}) is True
assert flushed["count"] == 0
def test_disable_flush_env_var_actually_wires_to_module_constant(monkeypatch):
"""End-to-end: setting `HERMES_TUI_GATEWAY_NO_FLUSH=1` and importing
`tui_gateway.transport` fresh actually flips `_DISABLE_FLUSH` true.
Reloads only the transport module server.py is untouched so its
atexit hooks/worker pool stay intact."""
import importlib
monkeypatch.setenv("HERMES_TUI_GATEWAY_NO_FLUSH", "1")
transport_mod = importlib.reload(importlib.import_module("tui_gateway.transport"))
try:
assert transport_mod._DISABLE_FLUSH is True
finally:
# Restore the env-disabled state so other tests see the default.
monkeypatch.delenv("HERMES_TUI_GATEWAY_NO_FLUSH", raising=False)
importlib.reload(transport_mod)
# ── _emit ────────────────────────────────────────────────────────────