hermes-agent/tui_gateway/entry.py
Teknium e93bfc6c93 feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.

Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.

## New module

- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
  windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
  All no-ops on non-Windows.

## CRITICAL fixes (would crash or silently break on Windows)

- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
  AttributeError on import on Windows, breaking `hermes --tui` entirely (it
  spawns this module as a subprocess).  Guard each signal.signal() call with
  hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.

- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
  unguarded.  os.WNOHANG doesn't exist on Windows.  Gate the whole reap loop
  behind `os.name != "nt"` — Windows has no zombies anyway.

- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
  most Windows builds.  Fall back to loopback TCP (AF_INET on 127.0.0.1:0
  ephemeral port) when _IS_WINDOWS.  HERMES_RPC_SOCKET env var now accepts
  either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
  Generated sandbox client parses both.

- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded.  Use
  shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
  readable error when bash is genuinely absent.

- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
  (npm install + node version probe), browser_tool.py x2.  On Windows npm
  is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
  fails with WinError 193.  shutil.which(...) returns the absolute .cmd
  path which CreateProcessW accepts because the extension routes through
  cmd.exe /c.  POSIX behaviour unchanged (shutil.which still returns the
  same path subprocess would resolve itself).

## HIGH fixes (silent misbehaviour on Windows)

- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
  Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
  via MSYS2's virtual /tmp but native Python couldn't open.  Result: cwd
  tracking silently broken — `cd` in terminal tool did nothing.  Windows
  branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
  (works in both bash and Python, guaranteed no spaces).

- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
  in split(":")` heuristic mangles Windows PATH (";" separator).  Gate
  the injection behind `not _IS_WINDOWS`.

- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
  Popen + watcher-script Popen both used start_new_session=True, which
  Windows silently ignores.  Watcher stayed attached to CLI's console,
  died when user closed terminal after `hermes update`, left gateway
  stale.  Now branches through windows_detach_popen_kwargs() helper
  (CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
  Windows, start_new_session=True on POSIX — identical to main).

## MEDIUM fixes

- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
  chain crashes on Windows when user triggers /update in-gateway.  Now
  has sys.platform=="win32" branch using sys.executable + a tiny
  Python watcher with proper detach flags.  POSIX path is unchanged.

- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
  style paths that break subprocess.Popen(cwd=...) and Path().resolve().
  Added _normalize_git_bash_path() helper that translates /c/Users,
  /cygdrive/c, /mnt/c variants to native C:\Users form.  POSIX no-op.
  _git_repo_root() now routes every result through it.

- cli.py worktree .worktreeinclude: os.symlink on directories failed
  hard on Windows (requires admin or Developer Mode).  Falls back to
  shutil.copytree with a warning log.

## Tests

- 29 new tests in tests/tools/test_windows_native_support.py covering:
  subprocess_compat helpers, TUI entry signal guards, kanban waitpid
  guard, code_execution TCP fallback source-level invariants, cron bash
  resolution, npm/npx bare-spawn lint per-file, local env Windows temp
  dir, PATH injection gating, git bash path normalization, symlink
  fallback, gateway detached watcher flags.

- One existing test assertion adjusted in test_browser_homebrew_paths:
  it compared captured Popen argv to the BARE `"npx"` literal; after the
  shutil.which() change argv[0] is the absolute path.  New assertion
  checks the shape (two items, second is `agent-browser`) rather than
  the exact first-item string.  Behaviour unchanged; test was too strict.

All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.

## What's still deferred (LOW priority)

- Visible cmd-window flashes on short-lived console apps (~14 sites) —
  cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
  hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
  reachable only when all env-var candidates fail.
2026-05-08 14:27:40 -07:00

251 lines
10 KiB
Python

import os
import sys
# Guard against a local utils/ (or other package) in CWD shadowing installed
# hermes modules. hermes_cli sets HERMES_PYTHON_SRC_ROOT before spawning this
# subprocess; inserting it first ensures the installed packages win.
_src_root = os.environ.get("HERMES_PYTHON_SRC_ROOT", "")
if _src_root and _src_root not in sys.path:
sys.path.insert(0, _src_root)
# Strip '' and '.' — both resolve to CWD at import time and can let a local
# directory shadow installed packages.
sys.path = [p for p in sys.path if p not in ("", ".")]
import json
import signal
import time
import traceback
from tui_gateway import server
from tui_gateway.server import _CRASH_LOG, dispatch, resolve_skin, write_json
from tui_gateway.transport import TeeTransport
def _install_sidecar_publisher() -> None:
"""Mirror every dispatcher emit to the dashboard sidebar via WS.
Activated by `HERMES_TUI_SIDECAR_URL`, set by the dashboard's
``/api/pty`` endpoint when a chat tab passes a ``channel`` query param.
Best-effort: connect failure or runtime drop falls back to stdio-only.
"""
url = os.environ.get("HERMES_TUI_SIDECAR_URL")
if not url:
return
from tui_gateway.event_publisher import WsPublisherTransport
server._stdio_transport = TeeTransport(
server._stdio_transport, WsPublisherTransport(url)
)
# How long to wait for orderly shutdown (atexit + finalisers) before
# falling back to ``os._exit(0)`` so a wedged worker mid-flush can't
# strand the process. 1s covers the gateway's own shutdown work
# (thread-pool drain + session finalize) on every machine we've
# tested; override via ``HERMES_TUI_GATEWAY_SHUTDOWN_GRACE_S`` if a
# slower environment needs more headroom (e.g. encrypted disks
# flushing checkpoints) and accept that a longer grace also means a
# longer wait when shutdown actually deadlocks.
_DEFAULT_SHUTDOWN_GRACE_S = 1.0
def _shutdown_grace_seconds() -> float:
raw = (os.environ.get("HERMES_TUI_GATEWAY_SHUTDOWN_GRACE_S") or "").strip()
if not raw:
return _DEFAULT_SHUTDOWN_GRACE_S
try:
value = float(raw)
except ValueError:
return _DEFAULT_SHUTDOWN_GRACE_S
return value if value > 0 else _DEFAULT_SHUTDOWN_GRACE_S
def _log_signal(signum: int, frame) -> None:
"""Capture WHICH thread and WHERE a termination signal hit us.
SIG_DFL for SIGPIPE kills the process silently the instant any
background thread (TTS playback, beep, voice status emitter, etc.)
writes to a stdout the TUI has stopped reading. Without this
handler the gateway-exited banner in the TUI has no trace — the
crash log never sees a Python exception because the kernel reaps
the process before the interpreter runs anything.
Termination semantics: ``sys.exit(0)`` here used to race the worker
pool — a thread holding ``_stdout_lock`` mid-flush would block the
interpreter shutdown indefinitely. We now log the stack, give the
process the configured shutdown grace
(``HERMES_TUI_GATEWAY_SHUTDOWN_GRACE_S``, default
``_DEFAULT_SHUTDOWN_GRACE_S``) to drain naturally on a background
thread, and fall back to ``os._exit(0)`` so a wedged write/flush
can never strand the process.
"""
# SIGPIPE and SIGHUP don't exist on Windows — build the lookup
# dict from attributes that actually exist on the current platform.
_signal_names: dict[int, str] = {}
for _attr in ("SIGPIPE", "SIGTERM", "SIGHUP", "SIGINT", "SIGBREAK"):
_sig = getattr(signal, _attr, None)
if _sig is not None:
_signal_names[int(_sig)] = _attr
name = _signal_names.get(signum, f"signal {signum}")
try:
os.makedirs(os.path.dirname(_CRASH_LOG), exist_ok=True)
with open(_CRASH_LOG, "a", encoding="utf-8") as f:
f.write(
f"\n=== {name} received · {time.strftime('%Y-%m-%d %H:%M:%S')} ===\n"
)
if frame is not None:
f.write("main-thread stack at signal delivery:\n")
traceback.print_stack(frame, file=f)
# All live threads — signal may have been triggered by a
# background thread (write to broken stdout from TTS, etc.).
import threading as _threading
for tid, th in _threading._active.items():
f.write(f"\n--- thread {th.name} (id={tid}) ---\n")
f.write("".join(traceback.format_stack(sys._current_frames().get(tid))))
except Exception:
pass
print(f"[gateway-signal] {name}", file=sys.stderr, flush=True)
import threading as _threading
def _hard_exit() -> None:
# If a worker thread is still mid-flush on a half-closed pipe,
# ``sys.exit(0)`` would wait forever for it to drop the GIL on
# interpreter shutdown. ``os._exit`` skips atexit handlers but
# breaks the deadlock. The crash log + stderr line above are
# the forensic trail.
os._exit(0)
timer = _threading.Timer(_shutdown_grace_seconds(), _hard_exit)
timer.daemon = True
timer.start()
try:
sys.exit(0)
except SystemExit:
# Re-raise so the main-thread interpreter unwinds and runs
# atexit + finalisers inside the grace window. Python signal
# handlers always run on the main thread, but a worker thread
# holding ``_stdout_lock`` mid-flush can keep that unwind
# waiting indefinitely; the daemon timer above is the safety
# net for that exact case.
raise
# SIGPIPE: ignore, don't exit. The old SIG_DFL killed the process
# silently whenever a *background* thread (TTS playback chain, voice
# debug stderr emitter, beep thread) wrote to a pipe the TUI had gone
# quiet on — even though the main thread was perfectly fine waiting on
# stdin. Ignoring the signal lets Python raise BrokenPipeError on the
# offending write (write_json already handles that with a clean
# sys.exit(0) + _log_exit), which keeps the gateway alive as long as
# the main command pipe is still readable. Terminal signals still
# route through _log_signal so kills and hangups are diagnosable.
#
# SIGPIPE and SIGHUP don't exist on Windows; guard each installation
# with hasattr so ``python -m tui_gateway.entry`` (spawned by
# ``hermes --tui``) imports cleanly there. SIGBREAK (Windows' Ctrl+Break)
# is installed when available as a weaker equivalent of SIGHUP.
if hasattr(signal, "SIGPIPE"):
signal.signal(signal.SIGPIPE, signal.SIG_IGN)
if hasattr(signal, "SIGTERM"):
signal.signal(signal.SIGTERM, _log_signal)
if hasattr(signal, "SIGHUP"):
signal.signal(signal.SIGHUP, _log_signal)
elif hasattr(signal, "SIGBREAK"):
# Windows-only: Ctrl+Break in a console window delivers SIGBREAK.
# Route it through the same handler so kills are diagnosable.
signal.signal(signal.SIGBREAK, _log_signal)
if hasattr(signal, "SIGINT"):
signal.signal(signal.SIGINT, signal.SIG_IGN)
def _log_exit(reason: str) -> None:
"""Record why the gateway subprocess is shutting down.
Three exit paths (startup write fail, parse-error-response write fail,
dispatch-response write fail, stdin EOF) all collapse into a silent
sys.exit(0) here. Without this trail the TUI shows "gateway exited"
with no actionable clue about WHICH broken pipe or WHICH message
triggered it — the main reason voice-mode turns look like phantom
crashes when the real story is "TUI read pipe closed on this event".
"""
try:
os.makedirs(os.path.dirname(_CRASH_LOG), exist_ok=True)
with open(_CRASH_LOG, "a", encoding="utf-8") as f:
f.write(
f"\n=== gateway exit · {time.strftime('%Y-%m-%d %H:%M:%S')} "
f"· reason={reason} ===\n"
)
except Exception:
pass
print(f"[gateway-exit] {reason}", file=sys.stderr, flush=True)
def main():
_install_sidecar_publisher()
# MCP tool discovery — inline is safe here: TUI entry is a plain
# sync loop with no asyncio event loop to block. Previously ran as
# a model_tools.py module-level side effect; moved to explicit
# startup calls to avoid freezing the gateway's loop on lazy import
# (#16856).
#
# Cold-start guard: importing ``tools.mcp_tool`` transitively pulls the
# full MCP SDK (mcp, pydantic, httpx, jsonschema, starlette parsers —
# ~200ms on macOS), which runs on the TUI's critical path before
# ``gateway.ready`` can be emitted. The overwhelming majority of users
# have no ``mcp_servers`` configured, in which case every byte of that
# import is wasted. Check the config first (cheap — it's already been
# loaded once by ``_config_mtime`` elsewhere) and only pay the import
# cost when there's actually MCP work to do.
try:
from hermes_cli.config import read_raw_config
_mcp_servers = (read_raw_config() or {}).get("mcp_servers")
_has_mcp_servers = isinstance(_mcp_servers, dict) and len(_mcp_servers) > 0
except Exception:
# Be conservative: if we can't decide, fall back to the old
# behaviour and let the discovery path handle its own errors.
_has_mcp_servers = True
if _has_mcp_servers:
try:
from tools.mcp_tool import discover_mcp_tools
discover_mcp_tools()
except Exception:
pass
if not write_json({
"jsonrpc": "2.0",
"method": "event",
"params": {"type": "gateway.ready", "payload": {"skin": resolve_skin()}},
}):
_log_exit("startup write failed (broken stdout pipe before first event)")
sys.exit(0)
for raw in sys.stdin:
line = raw.strip()
if not line:
continue
try:
req = json.loads(line)
except json.JSONDecodeError:
if not write_json({"jsonrpc": "2.0", "error": {"code": -32700, "message": "parse error"}, "id": None}):
_log_exit("parse-error-response write failed (broken stdout pipe)")
sys.exit(0)
continue
method = req.get("method") if isinstance(req, dict) else None
resp = dispatch(req)
if resp is not None:
if not write_json(resp):
_log_exit(f"response write failed for method={method!r} (broken stdout pipe)")
sys.exit(0)
_log_exit("stdin EOF (TUI closed the command pipe)")
if __name__ == "__main__":
main()