* fix(windows): stop terminal-window popups from background spawns
Native-Windows desktop/gateway users saw cmd/conhost windows flash on
gateway restart, image paste, the dashboard Projects tree, voice notes,
and ~5 min after closing the app (detached cron). Two root causes:
- Console-subsystem exes (taskkill, schtasks, wmic, netstat, tasklist,
agent-browser, git, ffmpeg, powershell, git-bash) spawned via raw
subprocess allocate a fresh console when the launching process has
none (pythonw desktop backend / detached gateway) - even with output
captured.
- uv venv pythonw shims re-exec console python.exe, so Python children
get a console regardless of how they're launched.
Fixes:
- Single hidden-spawn primitive (_subprocess_compat.run/.popen) that ORs
CREATE_NO_WINDOW on Windows, no-op on POSIX. Route every Hermes-owned
console-exe spawn through it.
- FreeConsole() catch-all in hermes_bootstrap: any Python child that
exclusively owns an auto-allocated console detaches it at startup
(GetConsoleProcessList()==1 gate leaves shared interactive consoles
untouched).
- Replace PowerShell/wmic gateway PID scans with in-process psutil.
- Skip schtasks queries on non-interactive desktop restarts.
- Prefer native agent-browser .exe over .cmd shims.
- Guard test bans raw subprocess spawns of the Windows-only console
tools repo-wide so the popup class can't regress.
* fix(windows): scope FreeConsole to background entry points; fix merge fallout
Console detach review (per #53810 feedback): GetConsoleProcessList()==1 can't
tell a uv pythonw->python phantom console apart from a user opening the
interactive CLI/TUI in its own fresh console (double-click, shortcut, ConPTY) —
both report a single attached process with a tty. Running FreeConsole() in the
import-time bootstrap therefore risked detaching a legitimately-interactive
terminal.
- Extract FreeConsole into explicit hermes_bootstrap.detach_orphan_console();
remove it from apply_windows_utf8_bootstrap() (import side effect).
- Call it only from known background mains: gateway run, dashboard backend
(start_server, what the desktop spawns), cron standalone, tui_gateway entry,
slash worker. Interactive CLI/TUI never calls it.
- Behavior-contract tests: frees only when solo owner, leaves shared console,
no-op without console / on POSIX, and asserts it's not an import side effect.
Merge fallout from origin/main (#53791):
- local.py: 3-way merge left a dangling **_popen_kwargs (NameError crashing
every terminal init). _subprocess_compat.popen already hides the window, so
drop it.
- discord adapter: merge stacked an undefined windows_hide_flags() onto the
primitive call; drop the redundant arg.
- test_gateway: scan now goes psutil-first (zero spawn); rewrite the
case-variant test to drive that production path.
* test(claw): mock _subprocess_compat.run seam for Windows process scan
claw.py's Windows tasklist/powershell scan routes through the hidden-spawn
primitive; the tests still patched claw_mod.subprocess, so on win32 the mock
was never hit and real spawns returned nothing. Patch the actual seam.
When an MCP server requires OAuth, the interactive `hermes` TUI froze on
startup: background MCP discovery hit the OAuth flow, which on an interactive
TTY spawns a daemon thread doing a blocking `sys.stdin.readline()` (the
"paste the redirect URL" fallback in mcp_oauth._wait_for_callback). That
thread competes with the TUI's own stdin reader for the same terminal, so
keystrokes get swallowed and the TUI appears frozen (up to the 300s OAuth
timeout). Reported symptom: "MCP OAuth: authorization required / Open this URL
... the tui is freezing, not respond to typing."
Add a thread-local `suppress_interactive_oauth()` context manager in
tools/mcp_oauth.py; `_is_interactive()` returns False while it's active, so the
stdin paste-thread and prompt are never created. Background discovery
(hermes_cli/mcp_startup.py, tui_gateway/entry.py) now runs discovery inside
that context, so OAuth-requiring servers soft-skip (raise
OAuthNonInteractiveError, already handled) instead of stealing the TUI's stdin.
A real `hermes mcp login` on the main thread is unaffected (thread-local).
Salvaged from #35945 by @zapabob (authorship preserved via cherry-pick;
resolved a conflict against main's new mcp_discovery_timeout / wait_for_mcp_
discovery refactor, keeping both). Verified E2E: with suppression the paste
prompt is NOT printed and no stdin thread spawns (raises OAuthNonInteractive
soft-skip); without it the prompt shows (the freeze). Mutation-verified
(removing the suppress check in _is_interactive fails the regression test).
76 tests pass, ruff clean.
Closes#35927.
SELF-REVIEW FIX: the original #35945 used threading.local(), which does NOT
propagate to the dedicated mcp-event-loop thread where OAuth actually runs
(discover_mcp_tools dispatches the connect via run_coroutine_threadsafe), so
the suppression was a NO-OP in production (the tests passed only by stubbing
out the cross-thread dispatch). Converted to a contextvars.ContextVar, which
asyncio copies onto the scheduled coroutine — empirically verified suppression
now holds on the mcp-event-loop thread through the real _run_on_mcp_loop path.
Added a cross-thread regression test (fails on threading.local, passes on the
ContextVar) so the no-op can't regress.
Launching Hermes from a directory that ships its own top-level package with a
Hermes-internal name (utils/, proxy/, ui/) crashed the gateway/TUI child with
an ImportError (exit 1, crash loop): from utils import atomic_replace resolved
to the user's package.
tui_gateway/entry.py already stripped the relative cwd forms ('' / '.'), but
the launch dir also reaches sys.path as its own ABSOLUTE path (venv activation
or a project that adds itself to PYTHONPATH), which the strip missed and which
sat ahead of the Hermes root.
Centralize a hardened guard in hermes_bootstrap.harden_import_path(): drop the
relative forms AND force the Hermes source root to the front even when an
absolute cwd entry is present. Wire it into tui_gateway/entry.py and
acp_adapter/entry.py (both spawn into arbitrary cwds); hermes_cli/main.py and
gateway/run.py already insert the root at front. gatewayClient.ts now also
exports HERMES_PYTHON_SRC_ROOT for defense in depth.
Mirror the CLI's exit-path behaviour in the TUI gateway so that
unpersisted conversation messages are flushed to state.db and the
on_session_end plugin hook fires before the session is closed.
Root cause: _finalize_session() only called db.end_session() to
mark the session row as ended, but did NOT flush in-memory messages
via _persist_session() or fire the on_session_end hook. When the
user force-quit (double Ctrl-C, terminal-close, SIGHUP) while the
agent was mid-turn, messages accumulated since the last persist
point were silently lost.
Changes
-------
tui_gateway/server.py - _finalize_session():
- Persist unflushed messages via agent._persist_session() before
db.end_session(). Prefers agent._session_messages (set by the
last _persist_session call inside run_conversation) over
session['history'] (stale when agent is mid-turn).
- Fire on_session_end(interrupted=True) plugin hook so crash-
recovery plugins can flush buffers, matching cli.py behaviour.
tui_gateway/entry.py - _log_signal():
- Explicitly call _shutdown_sessions() before sys.exit(0) in the
SIGHUP/SIGTERM handler as belt-and-suspenders over atexit.
tests/tui_gateway/test_finalize_session_persist.py (new):
- 11 tests covering: history persistence, _session_messages
priority, empty-history skip, missing-agent, double-finalize,
persist-exception resilience, hook firing, hook-exception
resilience, and db.end_session preservation.
Related
-------
Closes the TUI half of #5021 (CLI already handles this via its
atexit handler). Also addresses the session-persistence gap
discussed in #18465 and #18269.
MCP servers that connect after the agent's one-time tool snapshot were
invisible for the whole session. Two root causes, fixed together:
1. The startup discovery wait was a flat 0.75s. HTTP/OAuth servers
commonly take 2-6s on a cold connect, so they missed the window and
their tools never entered the agent's snapshot. `thread.join(timeout)`
already returns the instant discovery completes, so raising the bound
costs ~0s for the common case (no MCP / fast servers) and only ever
blocks for a genuinely-pending server, capped so a dead server can't
freeze startup. The bound is now configurable via
`mcp_discovery_timeout` (config.yaml, default 5.0s).
2. Three call sites duplicated the agent tool-snapshot rebuild (the TUI
`reload.mcp` RPC, the gateway reload, and the TUI late-binding refresh
thread), and the late-refresh detected changes by tool COUNT — missing
an equal-size add/remove swap. Consolidated into one shared
`tools.mcp_tool.refresh_agent_mcp_tools(agent)` helper that diffs by
tool NAME, mutates the agent under a lock (thread-safe), and respects
the agent's own enabled/disabled toolsets.
The late-binding refresh keeps its pre-first-turn cache-safety guard:
it never rebuilds the tool list once a turn has started, so the cached
prompt prefix is never invalidated mid-conversation.
Tests: new tests/tools/test_refresh_agent_mcp_tools.py covers the
name-based diff, in-place mutation, agent-scoped filtering, thread
safety, and the config-driven discovery bound (incl. instant-return
when nothing is pending). 75 passed across the touched areas.
The TUI banner reported fewer tools than the classic CLI for the same
config (e.g. 32 vs 38) when an MCP server connected slowly. Root cause:
the agent snapshots `agent.tools` once at build time and never re-reads
the registry. `_make_agent` briefly joins the background MCP discovery
thread (`wait_for_mcp_discovery`, ~0.75s) so fast servers land in that
snapshot, but a server slower than the bound — common for an HTTP MCP
server on first connect — lands *after* the agent is built. Its tools are
then absent from both the agent (uncallable until `/reload-mcp`) and the
banner for the whole session.
The classic CLI doesn't hit this because it re-derives
`get_tool_definitions()` at banner render time (which re-waits for
discovery), so it picks the late tools up.
Fix: after a fresh agent is built and its first `session.info` emitted,
if discovery is still in flight, schedule an off-critical-path daemon that
waits for it to finish, then rebuilds the tool snapshot and re-emits
`session.info` — the same rebuild `/reload-mcp` performs, but automatic.
Both the agent's callable tools and the banner count catch up.
Cache safety: the rebuild runs only while the session is still
pre-first-turn (`_user_turn_count`/`_api_call_count` both 0 → nothing
cached to invalidate). Once the user has sent a message we leave the
snapshot frozen rather than break the cached prompt prefix mid-conversation;
late tools then require an explicit `/reload-mcp` (user-consented), exactly
as today. No-op when discovery finished before the agent build, when the
join times out, when the registry was unchanged, or when the session was
swapped/closed while waiting.
Adds entry.mcp_discovery_in_flight() / join_mcp_discovery() accessors and
covers the matrix (added/none/post-turn/timeout/unchanged/replaced) with
unit tests.
The 'summoning hermes…' phase blocked on gateway.ready, which ran MCP
tool discovery inline. Any configured-but-unreachable MCP server burned
its full connect-retry backoff (1+2+4s ≈ 7s) before the composer
appeared — startup went from instant to ~7.5s of dead air for anyone
with a down stdio/http server in mcp_servers.
Move discovery into a background daemon thread so gateway.ready fires
immediately; tools register into the shared registry as servers connect,
and the agent isn't built until the first prompt. Measured spawn→ready:
~7500ms → ~115ms (dead twozero_td server in config).
Also drop rich.console + prompt_toolkit off banner.py's import path
(lazy-imported inside cprint/build_welcome_banner). tui_gateway.server
imports banner only to reach the lightweight prefetch_update_check
helper; the eager rich/pt imports added ~45ms before gateway.ready for
no benefit. tui_gateway.server import: ~115ms → ~69ms.
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining.
133 files, +626/-626 (net zero).
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.
Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.
## New module
- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
All no-ops on non-Windows.
## CRITICAL fixes (would crash or silently break on Windows)
- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
AttributeError on import on Windows, breaking `hermes --tui` entirely (it
spawns this module as a subprocess). Guard each signal.signal() call with
hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.
- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop
behind `os.name != "nt"` — Windows has no zombies anyway.
- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0
ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts
either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
Generated sandbox client parses both.
- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use
shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
readable error when bash is genuinely absent.
- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
(npm install + node version probe), browser_tool.py x2. On Windows npm
is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
fails with WinError 193. shutil.which(...) returns the absolute .cmd
path which CreateProcessW accepts because the extension routes through
cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the
same path subprocess would resolve itself).
## HIGH fixes (silent misbehaviour on Windows)
- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd
tracking silently broken — `cd` in terminal tool did nothing. Windows
branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
(works in both bash and Python, guaranteed no spaces).
- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
in split(":")` heuristic mangles Windows PATH (";" separator). Gate
the injection behind `not _IS_WINDOWS`.
- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
Popen + watcher-script Popen both used start_new_session=True, which
Windows silently ignores. Watcher stayed attached to CLI's console,
died when user closed terminal after `hermes update`, left gateway
stale. Now branches through windows_detach_popen_kwargs() helper
(CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
Windows, start_new_session=True on POSIX — identical to main).
## MEDIUM fixes
- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
chain crashes on Windows when user triggers /update in-gateway. Now
has sys.platform=="win32" branch using sys.executable + a tiny
Python watcher with proper detach flags. POSIX path is unchanged.
- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
style paths that break subprocess.Popen(cwd=...) and Path().resolve().
Added _normalize_git_bash_path() helper that translates /c/Users,
/cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op.
_git_repo_root() now routes every result through it.
- cli.py worktree .worktreeinclude: os.symlink on directories failed
hard on Windows (requires admin or Developer Mode). Falls back to
shutil.copytree with a warning log.
## Tests
- 29 new tests in tests/tools/test_windows_native_support.py covering:
subprocess_compat helpers, TUI entry signal guards, kanban waitpid
guard, code_execution TCP fallback source-level invariants, cron bash
resolution, npm/npx bare-spawn lint per-file, local env Windows temp
dir, PATH injection gating, git bash path normalization, symlink
fallback, gateway detached watcher flags.
- One existing test assertion adjusted in test_browser_homebrew_paths:
it compared captured Popen argv to the BARE `"npx"` literal; after the
shutil.which() change argv[0] is the absolute path. New assertion
checks the shape (two items, second is `agent-browser`) rather than
the exact first-item string. Behaviour unchanged; test was too strict.
All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.
## What's still deferred (LOW priority)
- Visible cmd-window flashes on short-lived console apps (~14 sites) —
cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
reachable only when all env-var candidates fail.
When the TUI backend (tui_gateway/entry.py) is spawned by Node.js with the
user's CWD containing a local utils/ directory, that directory shadows the
installed utils module, causing ImportError in run_agent and hermes_cli.
Strip '' and '.' from sys.path and prepend HERMES_PYTHON_SRC_ROOT (already
set by hermes_cli before spawning the subprocess) so installed packages
always win over CWD artifacts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Clean up the remaining review nits:
- let the deferred @hermes/ink import retry after a transient failure instead
of memoizing a rejected promise forever
- keep memory-monitor in-flight state inside a finally so future exceptions
cannot suppress that memory level indefinitely
- use read_raw_config for the TUI MCP cold-start probe instead of full
load_config()
- keep input.detect_drop for explicit relative path prefixes (./ and ../)
while preserving the no-RPC fast path for ordinary plain prompts
Tests:
- python -m py_compile tui_gateway/server.py tui_gateway/entry.py
- cd ui-tui && npm run type-check && npm run build
- scripts/run_tests.sh tests/tui_gateway/test_protocol.py::test_sess_found tests/tools/test_code_execution_modes.py tests/tools/test_code_execution.py
- cd ui-tui && npm test -- --run src/__tests__/useSessionLifecycle.test.ts src/__tests__/useConfigSync.test.ts
Two targeted fixes on the critical path from `hermes --tui` launch to
`gateway.ready`:
1. **Defer `@hermes/ink` import in memoryMonitor.ts.** The static top-level
import dragged the full ~414KB Ink bundle (React + renderer + all
components/hooks) onto the critical path *before* `gw.start()` could
spawn the Python gateway — serialising ~155ms of Node work in front of
it on every launch. `evictInkCaches` only runs inside the 10-second
tick under heap pressure, so it moves to a lazy dynamic import. First
tick hits the ESM cache because the app entry has long since imported
`@hermes/ink`.
2. **Gate `tools.mcp_tool` import on config in tui_gateway/entry.py.**
Importing the module transitively pulls the MCP SDK + pydantic + httpx
+ jsonschema + starlette formparsers (~200ms). The overwhelming
majority of users have no `mcp_servers` configured, so this runs for
nothing. A cheap `load_config()` check (~25ms) skips the 200ms import
when no servers are declared, with a conservative fallback to the old
behaviour if the config probe itself fails.
## Measurements (macOS Terminal.app, Apple Silicon, n=12)
| Metric | Before (p50) | After (p50) | Δ |
|----------------------------|--------------|-------------|----------|
| Python gateway boot alone | 252–365ms | 105–151ms | −180ms |
| `hermes --tui` banner paint | 686ms | 665ms | −21ms |
| `hermes --tui` → ready | **1843ms** | **1655ms** | **−188ms (−10.2%)** |
| `hermes --tui` → ready p90 | 1932ms | 1778ms | −154ms |
| stdev (ready) | 126ms | 83ms | also more consistent |
## Tests
- `scripts/run_tests.sh tests/tui_gateway/ tests/tools/test_mcp_tool.py`:
195 passed. (The one pre-existing failure in
`test_session_resume_returns_hydrated_messages` reproduces on main —
unrelated, it's a mock-DB kwarg mismatch.)
- `ui-tui` vitest: 430 tests, all pass.
- `npm run type-check` in ui-tui: clean.
## Notes
- Node-side first paint ("banner") didn't move meaningfully because that
latency is dominated by Ink's render pipeline + React mount, not by
which imports load first.
- The win shows up entirely in the time from banner to `gateway.ready`
— exactly where we expected it, since both fixes shorten the Python
gateway's boot path or let it overlap more with Node startup.
- No user-visible behaviour change. Memory monitoring still fires every
10s; MCP still works when `mcp_servers` is configured.
* fix(tui-gateway): harden stdio transport against half-closed pipes + SIGTERM races
`tui_gateway` reports `tui_gateway_crash.log` traces where the main
thread sits in `sys.stdin` while a worker holds `_stdout_lock` mid-
flush, and SIGTERM then calls `sys.exit(0)` while the lock is still
held — the interpreter shutdown stalls behind the wedged write.
Two narrowly scoped hardenings:
**`tui_gateway/transport.py`**
* Move JSON serialisation outside the lock — long messages no longer
block sibling writers while we serialise.
* Treat `BrokenPipeError`, `ValueError` ("I/O on closed file") and
generic `OSError` from both `write` and `flush` as "peer is gone":
return `False` instead of bubbling, matching what `write_json`'s
callers in `entry.py` already expect.
* Split `flush` into its own try block so a stuck flush never strands
a partial write or holds the lock indefinitely on its way out.
* Optional `HERMES_TUI_GATEWAY_NO_FLUSH=1` env knob to skip explicit
`flush()` entirely on environments where a half-closed read pipe
produces an indefinite kernel-level block. Default unchanged.
**`tui_gateway/entry.py`**
* `_log_signal` now spawns a 1-second daemon timer that calls
`os._exit(0)` if the orderly `sys.exit(0)` path is itself stuck
behind a wedged worker. Atexit handlers run inside the grace
window when they can; the timer is the safety net so a deadlocked
flush no longer strands the gateway process.
Tests:
* `test_write_json_closed_stream_returns_false` — ValueError path.
* `test_write_json_oserror_on_flush_returns_false` — OSError on flush
must not strand the lock; the write portion still landed before the
flush failure.
* `test_write_json_no_flush_env_skips_flush` — env knob bypass.
Validation: `scripts/run_tests.sh tests/tui_gateway/test_protocol.py`
(42/42 pass; one pre-existing failure on
`test_session_resume_returns_hydrated_messages` is unrelated to this
change — same `include_ancestors` mock kwarg issue tracked elsewhere).
`scripts/run_tests.sh tests/test_tui_gateway_server.py` 90/90 pass.
* review(copilot): tighten transport hardening comments + test cleanup
* review(copilot): narrow exception capture, configurable grace, simpler no-flush test
* fix(tui-gateway): narrow ValueError to closed-stream; surface UnicodeEncodeError
Copilot review on PR #17118: `UnicodeEncodeError` is a ValueError
subclass, so a non-UTF-8 stdout (mismatched PYTHONIOENCODING / locale)
would have been silently swallowed as 'peer gone' under
`except ValueError`. That hides a real environment bug.
Now:
- UnicodeEncodeError → log with exc_info (warning) and drop the frame
- ValueError where str(e) contains 'closed file' → peer gone, return False
- Any other ValueError → log loudly, drop frame (defensive, but visible)
Same shape applied to flush. Adds two regression tests.
* fix(tui-gateway): reserve write() False for peer-gone; re-raise programming errors
Round 2 Copilot review on PR #17118: `Transport.write()` returning
`False` is documented as 'peer is gone', and `entry.py` reacts by
calling `sys.exit(0)`. But the implementation also returned False
for non-IO conditions (non-JSON-safe payloads, UnicodeEncodeError,
unrelated ValueErrors), so a programming error or local env bug would
present as a clean disconnect — exactly the diagnosis pain we wanted
to eliminate.
Now:
- `json.dumps` failure → re-raises (TypeError/ValueError surfaces in crash log)
- `BrokenPipeError` → False (peer gone)
- `ValueError('...closed file...')` → False (peer gone)
- `UnicodeEncodeError` and any other ValueError → re-raise
- `OSError` → False (existing IO-failure semantics, debug-logged)
Tests updated to assert the re-raise behaviour and added a
non-serializable-payload regression test.
* fix(tui-gateway): narrow OSError to peer-gone errnos; honest test naming
Round 3 Copilot review on PR #17118:
- Docstring claimed False = peer gone, but generic OSError on write/flush
also returned False — meaning ENOSPC/EACCES/EIO would silently exit.
Added `_PEER_GONE_ERRNOS = {EPIPE, ECONNRESET, EBADF, ESHUTDOWN, +WSA}`
and narrowed the OSError handlers; non-peer-gone errnos re-raise.
Docstring now lists OSError as peer-gone branch with the errno set.
- The `_DISABLE_FLUSH` test was named after the env var but actually
patched the module constant. Renamed it to reflect the contract being
tested (skips flush when constant is true) AND added a real
end-to-end test that sets the env var, reloads transport.py, and
asserts the constant flips. Cleanup reload restores defaults so
parallel tests stay isolated.
Self-review (avoid round 4):
- Verified TeeTransport's secondary-swallow stays intentional.
- _log_signal grace path already covered by separate tests.
model_tools.py ran discover_mcp_tools() as a module-level side effect.
discover_mcp_tools() uses a blocking 120s wait internally (via
_run_on_mcp_loop -> future.result(timeout=120)).
The gateway lazy-imports run_agent -> model_tools on the first user
message, which happens inside the asyncio event loop thread. A slow or
unreachable MCP server therefore froze Discord shard heartbeats and
Telegram polling for up to 120s on the first message after gateway
start.
Fix: remove the module-level call. Every entry point now runs
discovery explicitly at its own startup, using the context-appropriate
blocking/non-blocking pattern:
- gateway/run.py: loop.run_in_executor(None, discover_mcp_tools)
before platforms start accepting traffic
- hermes_cli/main.py: inline (no event loop at CLI startup)
- tui_gateway/entry.py: inline (sync stdin loop, no event loop)
- acp_adapter/entry.py: inline before asyncio.run()
Closes#16856.
Exposes hermes --tui over a PTY-backed WebSocket so the dashboard can
embed the real TUI rather than reimplement its surface. The browser
attaches xterm.js to the socket; keystrokes flow in, PTY output bytes
flow out.
Architecture:
browser <Terminal> (xterm.js)
│ onData ───► ws.send(keystrokes)
│ onResize ► ws.send('\x1b[RESIZE:cols;rows]')
│ write ◄── ws.onmessage (PTY bytes)
▼
FastAPI /api/pty (token-gated, loopback-only)
▼
PtyBridge (ptyprocess) ── spawns node ui-tui/dist/entry.js ──► tui_gateway + AIAgent
Components
----------
hermes_cli/pty_bridge.py
Thin wrapper around ptyprocess.PtyProcess: byte-safe read/write on the
master fd via os.read/os.write (not PtyProcessUnicode — ANSI is
inherently byte-oriented and UTF-8 boundaries may land mid-read),
non-blocking select-based reads, TIOCSWINSZ resize, idempotent
SIGHUP→SIGTERM→SIGKILL teardown, platform guard (POSIX-only; Windows
is WSL-supported only).
hermes_cli/web_server.py
@app.websocket("/api/pty") endpoint gated by the existing
_SESSION_TOKEN (via ?token= query param since browsers can't set
Authorization on WS upgrades). Loopback-only enforcement. Reader task
uses run_in_executor to pump PTY bytes without blocking the event
loop. Writer loop intercepts a custom \x1b[RESIZE:cols;rows] escape
before forwarding to the PTY. The endpoint resolves the TUI argv
through a _resolve_chat_argv hook so tests can inject fake commands
without building the real TUI.
Tests
-----
tests/hermes_cli/test_pty_bridge.py — 12 unit tests: spawn, stdout,
stdin round-trip, EOF, resize (via TIOCSWINSZ + tput readback), close
idempotency, cwd, env forwarding, unavailable-platform error.
tests/hermes_cli/test_web_server.py — TestPtyWebSocket adds 7 tests:
missing/bad token rejection (close code 4401), stdout streaming,
stdin round-trip, resize escape forwarding, unavailable-platform ANSI
error frame + 1011 close, resume parameter forwarding to argv.
96 tests pass under scripts/run_tests.sh.
(cherry picked from commit 29b337bca7)
feat(web): add Chat tab with xterm.js terminal + Sessions resume button
(cherry picked from commit 3d21aee8 by emozilla, conflicts resolved
against current main: BUILTIN_ROUTES table + plugin slot layout)
fix(tui): replace OSC 52 jargon in /copy confirmation
When the user ran /copy successfully, Ink confirmed with:
sent OSC52 copy sequence (terminal support required)
That reads like a protocol spec to everyone who isn't a terminal
implementer. The caveat was a historical artifact — OSC 52 wasn't
universally supported when this message was written, so the TUI
honestly couldn't guarantee the copy had landed anywhere.
Today every modern terminal (including the dashboard's embedded
xterm.js) handles OSC 52 reliably. Say what the user actually wants
to know — that it copied, and how much — matching the message the
TUI already uses for selection copy:
copied 1482 chars
(cherry picked from commit a0701b1d5a)
docs: document the dashboard Chat tab
AGENTS.md — new subsection under TUI Architecture explaining that the
dashboard embeds the real hermes --tui rather than rewriting it,
with pointers to the pty_bridge + WebSocket endpoint and the rule
'never add a parallel chat surface in React.'
website/docs/user-guide/features/web-dashboard.md — user-facing Chat
section inside the existing Web Dashboard page, covering how it works
(WebSocket + PTY + xterm.js), the Sessions-page resume flow, and
prerequisites (Node.js, ptyprocess, POSIX kernel / WSL on Windows).
(cherry picked from commit 2c2e32cc45)
feat(tui-gateway): transport-aware dispatch + WebSocket sidecar
Decouples the JSON-RPC dispatcher from its I/O sink so the same handler
surface can drive multiple transports concurrently. The PTY chat tab
already speaks to the TUI binary as bytes — this adds a structured
event channel alongside it for dashboard-side React widgets that need
typed events (tool.start/complete, model picker state, slash catalog)
that PTY can't surface.
- `tui_gateway/transport.py` — `Transport` protocol + `contextvars` binding
+ module-level `StdioTransport` fallback. The stdio stream resolves
through a lambda so existing tests that monkey-patch `_real_stdout`
keep passing without modification.
- `tui_gateway/ws.py` — WebSocket transport implementation; FastAPI
endpoint mounting lives in hermes_cli/web_server.py.
- `tui_gateway/server.py`:
- `write_json` routes via session transport (for async events) →
contextvar transport (for in-request writes) → stdio fallback.
- `dispatch(req, transport=None)` binds the transport for the request
lifetime and propagates it to pool workers via `contextvars.copy_context`
so async handlers don't lose their sink.
- `_init_session` and the manual-session create path stash the
request's transport so out-of-band events (subagent.complete, etc.)
fan out to the right peer.
`tui_gateway.entry` (Ink's stdio handshake) is unchanged externally —
it falls through every precedence step into the stdio fallback, byte-
identical to the previous behaviour.
feat(web): ChatSidebar — JSON-RPC sidecar next to xterm.js terminal
Composes the two transports into a single Chat tab:
┌─────────────────────────────────────────┬──────────────┐
│ xterm.js / PTY (emozilla #13379) │ ChatSidebar │
│ the literal hermes --tui process │ /api/ws │
└─────────────────────────────────────────┴──────────────┘
terminal bytes structured events
The terminal pane stays the canonical chat surface — full TUI fidelity,
slash commands, model picker, mouse, skin engine, wide chars all paint
inside the terminal. The sidebar opens a parallel JSON-RPC WebSocket
to the same gateway and renders metadata that PTY can't surface to
React chrome:
• model + provider badge with connection state (click → switch)
• running tool-call list (driven by tool.start / tool.progress /
tool.complete events)
• model picker dialog (gateway-driven, reuses ModelPickerDialog)
The sidecar is best-effort. If the WS can't connect (older gateway,
network hiccup, missing token) the terminal pane keeps working
unimpaired — sidebar just shows the connection-state badge in the
appropriate tone.
- `web/src/components/ChatSidebar.tsx` — new component (~270 lines).
Owns its GatewayClient, drives the model picker through
`slash.exec`, fans tool events into a capped tool list.
- `web/src/pages/ChatPage.tsx` — split layout: terminal pane
(`flex-1`) + sidebar (`w-80`, `lg+` only).
- `hermes_cli/web_server.py` — mount `/api/ws` (token + loopback
guards mirror /api/pty), delegate to `tui_gateway.ws.handle_ws`.
Co-authored-by: emozilla <emozilla@nousresearch.com>
refactor(web): /clean pass on ChatSidebar + ChatPage lint debt
- ChatSidebar: lift gw out of useRef into a useMemo derived from a
reconnect counter. React 19's react-hooks/refs and react-hooks/
set-state-in-effect rules both fire when you touch a ref during
render or call setState from inside a useEffect body. The
counter-derived gw is the canonical pattern for "external resource
that needs to be replaceable on user action" — re-creating the
client comes from bumping `version`, the effect just wires + tears
down. Drops the imperative `gwRef.current = …` reassign in
reconnect, drops the truthy ref guard in JSX. modelLabel +
banner inlined as derived locals (one-off useMemo was overkill).
- ChatPage: lazy-init the banner state from the missing-token check
so the effect body doesn't have to setState on first run. Drops
the unused react-hooks/exhaustive-deps eslint-disable. Adds a
scoped no-control-regex disable on the SGR mouse parser regex
(the \\x1b is intentional for xterm escape sequences).
All my-touched files now lint clean. Remaining warnings on web/
belong to pre-existing files this PR doesn't touch.
Verified: vitest 249/249, ui-tui eslint clean, web tsc clean,
python imports clean.
chore: uptick
fix(web): drop ChatSidebar tool list — events can't cross PTY/WS boundary
The /api/pty endpoint spawns `hermes --tui` as a child process with its
own tui_gateway and _sessions dict; /api/ws runs handle_ws in-process in
the dashboard server with a separate _sessions dict. Tool events fire on
the child's gateway and never reach the WS sidecar, so the sidebar's
tool.start/progress/complete listeners always observed an empty list.
Drop the misleading list (and the now-orphaned ToolCall primitive),
keep model badge + connection state + model picker + error banner —
those work because they're sidecar-local concerns. Surfacing tool calls
in the sidebar requires cross-process forwarding (PTY child opens a
back-WS to the dashboard, gateway tees emits onto stdio + sidecar
transport) — proper feature for a follow-up.
feat(web): wire ChatSidebar tool list to PTY child via /api/pub broadcast
The dashboard's /api/pty spawns hermes --tui as a child process; tool
events fire in the python tui_gateway grandchild and never crossed the
process boundary into the in-process WS sidecar — so the sidebar tool
list was always empty.
Cross-process forwarding:
- tui_gateway: TeeTransport (transport.py) + WsPublisherTransport
(event_publisher.py, sync websockets client). entry.py installs the
tee on _stdio_transport when HERMES_TUI_SIDECAR_URL is set, mirroring
every dispatcher emit to a back-WS without disturbing Ink's stdio
handshake.
- hermes_cli/web_server.py: new /api/pub (publisher) + /api/events
(subscriber) endpoints with a per-channel registry. /api/pty now
accepts ?channel= and propagates the sidecar URL via env. start_server
also stashes app.state.bound_port so the URL is constructable.
- web/src/pages/ChatPage.tsx: generates a channel UUID per mount,
passes it to /api/pty and as a prop to ChatSidebar.
- web/src/components/ChatSidebar.tsx: opens /api/events?channel=, fans
tool.start/progress/complete back into the ToolCall list. Restores
the ToolCall primitive.
Tests: 4 new TestPtyWebSocket cases cover channel propagation,
broadcast fan-out, and missing-channel rejection (10 PTY tests pass,
120 web_server tests overall).
fix(web): address Copilot review on #14890
Five threads, all real:
- gatewayClient.ts: register `message`/`close` listeners BEFORE awaiting
the open handshake. Server emits `gateway.ready` immediately after
accept, so a listener attached after the open promise could race past
the initial skin payload and lose it.
- ChatSidebar.tsx: wire `error`/`close` on the /api/events subscriber
WS into the existing error banner. 4401/4403 (auth/loopback reject)
surface as a "reload the page" message; mid-stream drops surface as
"events feed disconnected" with the existing reconnect button. Clean
unmount closes (1000/1001) stay silent.
- web-dashboard.md: install hint was `pip install hermes-agent[web]` but
ptyprocess lives in the `pty` extra, not `web`. Switch to
`hermes-agent[web,pty]` in both prerequisite blocks.
- AGENTS.md: previous "never add a parallel React chat surface" guidance
was overbroad and contradicted this PR's sidebar. Tightened to forbid
re-implementing the transcript/composer/PTY terminal while explicitly
allowing structured supporting widgets (sidebar / model picker /
inspectors), matching the actual architecture.
- web/package-lock.json: regenerated cleanly so the wterm sibling
workspace paths (extraneous machine-local entries) stop polluting CI.
Tests: 249/249 vitest, 10/10 PTY/events, web tsc clean.
refactor(web): /clean pass on ChatSidebar events handler
Spotted in the round-2 review:
- Banner flashed on clean unmount: `ws.close()` from the effect cleanup
fires `close` with code 1005, opened=true, neither 1000 nor 1001 —
hit the "unexpected drop" branch. Track `unmounting` in the effect
scope and gate the banner through a `surface()` helper so cleanup
closes stay silent.
- DRY the duplicated "events feed disconnected" string into a local
const used by both the error and close handlers.
- Drop the `opened` flag (no longer needed once the unmount guard is
the source of truth for "is this an expected close?").
Crash-log stack trace (tui_gateway_crash.log) from the user's session
pinned the regression: SIGPIPE arrived while main thread was blocked on
for-raw-in-sys.stdin — i.e., a background thread (debug print to stderr,
most likely from HERMES_VOICE_DEBUG=1) wrote to a pipe whose buffer the
TUI hadn't drained yet, and SIG_DFL promptly killed the process.
Two fixes that together restore CLI parity:
- entry.py: SIGPIPE → SIG_IGN instead of the _log_signal handler that
then exited. With SIG_IGN, Python raises BrokenPipeError on the
offending write, which write_json already handles with a clean exit
via _log_exit. SIGTERM / SIGHUP still route through _log_signal so
real termination signals remain diagnosable.
- hermes_cli/voice.py:_debug: wrap the stderr print in a BrokenPipeError
/ OSError try/except. This runs from daemon threads (silence callback,
TTS playback, beep), so a broken stderr must not escape and ride up
into the main event loop.
Verified by spawning the gateway subprocess locally:
voice.toggle status → 200 OK, process stays alive, clean exit on
stdin close logs "reason=stdin EOF" instead of a silent reap.
SIG_DFL for SIGPIPE means the kernel reaps the gateway subprocess the
instant a background thread (TTS playback, silence callback, voice
status emitter) writes to a stdout the TUI stopped reading — before
the Python interpreter can run excepthook, threading.excepthook,
atexit, or the entry.py post-loop _log_exit.
Replace the three SIG_DFL / SIG_IGN bindings with a _log_signal
handler that:
- records which signal (SIGPIPE / SIGTERM / SIGHUP) fired and when;
- dumps the main-thread stack at signal delivery AND every live
thread's stack via sys._current_frames — the background-thread
write that provoked SIGPIPE is almost always visible here;
- writes everything to ~/.hermes/logs/tui_gateway_crash.log and prints
a [gateway-signal] breadcrumb to stderr so the TUI Activity surfaces
it as well.
SIGINT stays ignored (TUI handles Ctrl+C for the user).
Gateway exits weren't reaching the panic hook because entry.py calls
sys.exit(0) on broken stdout — clean termination, no exception. That
left "gateway exited" in the TUI with zero forensic trail when pipe
breaks happened mid-turn.
Entry.py now tags each exit path — startup-write failure, parse-error-
response write failure, per-method response write failure, stdin EOF —
with a one-line entry in ~/.hermes/logs/tui_gateway_crash.log and a
gateway.stderr breadcrumb. Includes the JSON-RPC method name on the
dispatch path, which is the only way to tell "died right after handling
voice.toggle on" from "died emitting the second message.complete".
The stdin-read loop in entry.py calls handle_request() inline, so the
five handlers that can block for seconds to minutes
(slash.exec, cli.exec, shell.exec, session.resume, session.branch)
freeze the dispatcher. While one is running, any inbound RPC —
notably approval.respond and session.interrupt — sits unread in the
pipe buffer and lands only after the slow handler returns.
Route only those five onto a small ThreadPoolExecutor; every other
handler stays on the main thread so the fast-path ordering is
unchanged and the audit surface stays small. write_json is already
_stdout_lock-guarded, so concurrent response writes are safe. Pool
size defaults to 4 (overridable via HERMES_TUI_RPC_POOL_WORKERS).
- add _LONG_HANDLERS set + ThreadPoolExecutor + atexit shutdown
- new dispatch(req) function: pool for long handlers, inline for rest
- _run_and_emit wraps pool work in a try/except so a misbehaving
handler still surfaces as a JSON-RPC error instead of silently
dying in a worker
- entry.py swaps handle_request → dispatch
- 5 new tests: sync path still inline, long handlers emit via stdout,
fast handler not blocked behind slow one, handler exceptions map to
error responses, non-long methods always take the sync path
Manual repro confirms the fix: shell.exec(sleep 3) + terminal.resize
sent back-to-back now returns the resize response at t=0s while the
sleep finishes independently at t=3s. Before, both landed together
at t=3s.
Fixes#12546.