mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-14 09:11:54 +00:00
* feat(desktop): session-scoped status stack + kill new-window theme flash
  Stack subagents, background tasks, and the queue into one collapsible "sink" above the composer, reusing the queue's chrome so every status reads as one piece. Extracts shared StatusSection / StatusRow / TerminalOutput primitives and a unified $statusItemsBySession store (subagents mirrored, background owned here, merged + grouped for render). Renames BrailleSpinner → GlyphSpinner now that it drives more than braille.
  Separately, fix the white flash on every new/cmd-clicked window: macOS `vibrancy` paints an NSVisualEffectView that follows the OS appearance and ignores `backgroundColor`, so a dark app on a light-mode Mac flashed white until the renderer painted over it. Pin `nativeTheme.themeSource` to the app theme (persisted to userData so cold launches paint right before the renderer loads), hold windows with `show:false` until `ready-to-show`, and pre-paint the themed background via an inline script before the bundle runs.
* feat(desktop): dock the slash popover to the composer via one shared fill var
  The slash·@ popover (and ? help) now docks onto the composer's edge with the same chrome as the queue/status stack — rounded outer corners, fused borderless edge, no shadow — but keeps its own narrow width. Surface + drawer paint a single --composer-fill var; the state ladder (rest / scrolled / focused / drawer-open) lives once in styles.css on [data-slot='composer-root']. The :has() drawer-open rule is last and forces an opaque fill, since translucent glass sampling different backdrops (thread vs fade gradient) can never match. This replaces the focus-within !important override that repainted the surface behind every previous matching attempt.
  Also drop the chevron column from the project file tree — the folder open/closed icon already carries the expand state.
* feat(desktop): base inset for file tree rows (post-chevron alignment)
* feat(desktop): wire the status stack's background tasks to the real process registry
  The background group was UI-only (dev-mock seeded). Now it's live e2e:
  - tui_gateway: new session-scoped `process.list` (registry snapshot filtered by the session's session_key, plus a 4KB output tail for the inline terminal viewer) and `process.kill` (single process, ownership-checked — unlike process.stop's kill_all).
  - Renderer: `reconcileBackgroundProcesses` syncs snapshots into the store layout-stably — rows keep their position when state flips (never re-sort), new processes append, unchanged rows keep object identity so memoised rows skip re-rendering, and a dismissed-set stops the registry's retained finished procs from resurrecting X-ed rows.
  - Refresh triggers: session open, terminal/process tool.complete, status.update(kind=process) from the gateway's notification poller, and a 5s poll armed only while a running row is visible (catches silent exits).
  - Stop = real `process.kill` + optimistic dismiss; Dismiss = client-side with resurrection guard.
  - Re-keyed the stack to the RUNTIME session id: it was keyed by the stored session id, where neither subagent events nor process.list would ever land.
  - Deleted dev-status-mocks.ts (__hermesStatusMocks) — no more seed shit.
  Reconcile invariants covered in store/composer-status.test.ts.
* feat(desktop): todos + openable subagents in the status stack, self-healing file tree
  - todo lists move out of the inline chat panel into the composer status stack (checklist icon, dashed ring = pending, spinner = in progress, check = done), fed live from todo tool events and seeded from history on session open
  - subagent rows carry the child's real session id end-to-end (delegate_tool → gateway → renderer) so clicking one opens ITS session window
  - status stack publishes its measured height so the thread's bottom clearance grows with it; card paints the shared --composer-fill so focused/scrolled states match the composer exactly
  - file tree self-heals: ENOENT roots retry on a 3s cadence + Try again button, and the main process expands ~ in IPC paths (gateway cwds arrive as ~/...)
  - composer drag-drop of tree entries inserts inline refs instead of attachments
* fix(desktop): file tree falls back to the workspace dir when a session's cwd is gone
  Sessions record their launch cwd; deleted worktrees leave that path dead, so opening such a session swapped the tree from the default workspace to a directory that ENOENTs forever — the 3s retry just spun on it. On a root read error the tree now asks main to sanitize the cwd (prefers the configured default project dir), displays that fallback, and quietly re-probes the original path so it switches back if the dir reappears.
* feat(desktop): working restore-checkpoint button on past user prompts
  The discard icon on hover of a past user bubble was decorative — clicking did nothing. It's now a real control: a confirmation dialog explains that everything after the prompt is removed, then the session rewinds to that turn and reruns the same prompt (prompt.submit with truncate_before_user_ordinal, the same mechanism the edit composer uses). Failures rethrow into the dialog's inline error instead of toasting.
* fix(desktop): show the restore-checkpoint button on the latest user prompt too
  Restoring the most recent prompt is just 'retry this turn' — no reason to exclude it. Stop still takes the slot while the turn is running.
* fix(desktop): finished todo lists clear themselves out of the status stack
  A list whose every item is completed/cancelled lingers ~4s so the final checkmark is visible, then the todo group drops out of the stack. A fresh active list arriving within the linger cancels the scheduled clear.
* chore(desktop): drop dead editableCheckpoint copy, terser restore confirm
* fix(desktop): rewind clears the abandoned timeline's todos + background
  Restoring to (or editing) an earlier prompt rewinds the conversation, but the todos and background processes spawned by the now-discarded turns kept showing in the status stack — and the real background processes kept running. Both rewind paths now clear the session's todo rows and kill + drop its background processes before the fresh run repopulates them.
  Also drops the click-to-edit clamp transition, which flashed a half-expanded bubble on the way into the edit composer.
* feat(desktop): user messages are always editable; edit/restore revert mid-stream
  The bubble is now always click-to-edit — even while a turn streams — instead of going inert during a run. Sending an edit acts like restore: it rewinds to that prompt and re-runs with the new text. Both edit and restore can fire mid-stream now; the gateway refuses prompt.submit while a turn runs (4009 "session busy"), so they interrupt the live turn first and retry the submit until the cooperative interrupt winds it down. Restore (re-run as-is) shows on every prompt except the latest running one, which keeps the Stop button.
* fix(desktop): label preview-pane ⌘L selections with the filename, not "zsh"
  The terminal owns a global ⌘/Ctrl+L "send selection to composer" shortcut, so selecting text in the file preview pane and hitting it fell through to the terminal handler — which imported the right text but labelled the composer ref "zsh:N lines" off the shell name. When the selection isn't an xterm selection, label it with the previewed file instead.
* fix(desktop): ⌘L on a preview line selection inserts the @line ref, like dragging
  The source preview lets you select lines in the gutter and drag them into the composer as an @line:path:start-end ref. ⌘/Ctrl+L now does the same when a line selection is active — it drops the identical ref instead of falling through to the terminal's global handler (which grabbed the native text selection and sent a bogus terminal block). Capture-phase + stopPropagation so it wins; with a line selection there's no native selection, so the terminal handler stays out of it.
* chore: gitignore apps/desktop/demo/ scratch output
  The desktop demo prompt writes demo/*.txt during recorded walkthroughs; it's throwaway, never part of the app. Ignore it so it stops cluttering git status.
* feat(desktop): subagent watch windows, hard stop, sidebar hygiene
  Child-session mirror for live subagent windows, delegate sessions tagged and excluded from the sidebar, composer focus/stop polish, and WS stall resilience on the gateway transport.
* refactor: DRY delegate SQL + trim status-stack noise
  Extract shared listable-child and delegate-delete helpers in hermes_state, collapse cancelRun busy release, and cut comment bloat in resume/status paths.
* fix(desktop): hide orphaned subagent sessions in sidebar
  Cascade-delete all ephemeral children on parent delete (not just tagged rows), run v16 backfill to tag legacy orphans, and record new delegates as source=subagent.
* fix: restore orphan contract for untagged children + lazy session eviction
  Cascade-delete only _delegate_from-tagged rows (v16 backfill covers legacy), walk marker chains recursively with FK-safe orphaning, gate lazy watch sessions out of the still-starting eviction exemption via an explicit flag, pass session_id to _make_agent only when resuming, and hide source=subagent from session search.
* fix(gateway): gate child mirror off upgraded sessions + age out stale run entries
  Review findings: the mirror could interleave synthetic events with a real native stream once a watch window upgrades (prompt.submit builds an agent), and a lost subagent.complete left _active_child_runs pinning running=true forever. Mirror now stops when the live session owns an agent; liveness reads ignore entries older than an hour.
* fix(gateway): reject prompt.submit into a watch session while its child runs
  A lazy watch session's running flag is False (the run lives in the parent turn), so typing mid-run sailed past the busy guard and built a second agent racing the in-flight child on the same stored session. Busy error until the run completes; afterwards the submit upgrades into a normal conversation.
* refactor(gateway): DRY watch-resume payload + compose listable-child SQL
  Fold the duplicated child-run busy overlay into one _reuse_live_payload helper across both resume reuse paths, collapse the twin mirror early-returns, and build _LISTABLE_CHILD_SQL from _BRANCH_CHILD_SQL instead of restating it.
* fix(desktop): clip horizontal overflow on sidebar scroll areas
  Add overflow-x-hidden alongside overflow-y-auto on session list scrollers and the shared SidebarContent primitive — vertical scroll unchanged.
340 lines
13 KiB
Python
"""WebSocket transport for the tui_gateway JSON-RPC server.

Reuses :func:`tui_gateway.server.dispatch` verbatim so every RPC method, every
slash command, every approval/clarify/sudo flow, and every agent event flows
through the same handlers whether the client is Ink over stdio or an iOS /
web client over WebSocket.

Wire protocol
-------------
Identical to stdio: newline-delimited JSON-RPC in both directions. The server
emits a ``gateway.ready`` event immediately after connection accept, then
echoes responses/events for inbound requests. No framing differences.

Mounting
--------
    from fastapi import WebSocket
    from tui_gateway.ws import handle_ws

    @app.websocket("/api/ws")
    async def ws(ws: WebSocket):
        await handle_ws(ws)
"""

from __future__ import annotations

import asyncio
import concurrent.futures
import json
import logging
import socket
from typing import Any

from tui_gateway import server

_log = logging.getLogger(__name__)

# Max seconds a pool-dispatched handler will block waiting for the event loop
# to flush a WS frame before the worker stops waiting. A timeout does not mark
# the transport dead (see WSTransport.write); it only keeps handler threads
# from blocking forever on a stalled loop or a wedged socket.
_WS_WRITE_TIMEOUT_S = 10.0
_WS_LOG_PAYLOAD_PREVIEW = 240

# Keep starlette optional at import time; handle_ws uses the real class when
# it's available and falls back to a generic Exception sentinel otherwise.
try:
    from starlette.websockets import WebSocketDisconnect as _WebSocketDisconnect
except ImportError:  # pragma: no cover - starlette is a required install path
    _WebSocketDisconnect = Exception  # type: ignore[assignment]


class WSTransport:
    """Per-connection WS transport.

    ``write`` is safe to call from any thread *other than* the event loop
    thread that owns the socket. Pool workers (the only real caller) run in
    their own threads, so marshalling onto the loop via
    :func:`asyncio.run_coroutine_threadsafe` + ``future.result()`` is correct
    and deadlock-free there.

    When called from the loop thread itself (e.g. by ``handle_ws`` for an
    inline response) the same call would deadlock: we'd schedule work onto
    the loop we're currently blocking. We detect that case and fire-and-
    forget instead. Callers that need to know when the bytes are on the wire
    should use :meth:`write_async` from the loop thread.
    """

    def __init__(
        self,
        ws: Any,
        loop: asyncio.AbstractEventLoop,
        *,
        peer: str = "unknown",
    ) -> None:
        self._ws = ws
        self._loop = loop
        self._peer = peer
        self._closed = False

    def write(self, obj: dict) -> bool:
        if self._closed:
            return False

        line = json.dumps(obj, ensure_ascii=False)

        try:
            on_loop = asyncio.get_running_loop() is self._loop
        except RuntimeError:
            on_loop = False

        if on_loop:
            # Fire-and-forget — don't block the loop waiting on itself.
            self._loop.create_task(self._safe_send(line))
            return True

        try:
            from agent.async_utils import safe_schedule_threadsafe
            fut = safe_schedule_threadsafe(self._safe_send(line), self._loop)
            if fut is None:
                self._closed = True
                return False
            fut.result(timeout=_WS_WRITE_TIMEOUT_S)
            return not self._closed
        except concurrent.futures.TimeoutError:  # builtin TimeoutError on 3.11+
            # The event loop is stalled (GIL-heavy agent turn, delegation
            # running N children), NOT the socket dead. The send coroutine is
            # already scheduled and will flush once the loop breathes — latching
            # _closed here permanently silenced live windows after one slow
            # write (the "subagent window shows zero streaming" bug). Unblock
            # the worker thread and keep the transport alive; _safe_send latches
            # on a real socket error when the frame actually fails.
            _log.warning(
                "ws write slow (loop stalled >%ss) peer=%s — frame left in flight",
                _WS_WRITE_TIMEOUT_S, self._peer,
            )
            return not self._closed
        except Exception as exc:
            self._closed = True
            _log.warning(
                "ws write failed peer=%s error_type=%s error=%s",
                self._peer, type(exc).__name__, exc,
            )
            return False

    async def write_async(self, obj: dict) -> bool:
        """Send from the owning event loop. Awaits until the frame is on the wire."""
        if self._closed:
            return False
        await self._safe_send(json.dumps(obj, ensure_ascii=False))
        return not self._closed

    async def _safe_send(self, line: str) -> None:
        try:
            await self._ws.send_text(line)
        except Exception as exc:
            self._closed = True
            _log.warning(
                "ws send failed peer=%s error_type=%s error=%s",
                self._peer, type(exc).__name__, exc,
            )

    def close(self) -> None:
        self._closed = True


def _ws_peer_label(ws: Any) -> str:
    """Return ``host:port`` when available, else a stable placeholder."""
    client = getattr(ws, "client", None)
    if client is None:
        return "unknown"
    host = getattr(client, "host", None) or "unknown"
    port = getattr(client, "port", None)
    return f"{host}:{port}" if port is not None else host


def _disable_nagle(ws: Any) -> None:
    """Disable Nagle so streamed JSON-RPC frames go out individually.

    Without it the kernel coalesces the small per-token frames, so a burst after
    the model's think-pause lands on the client in one tick and no client-side
    smoothing can recover the cadence. GUI/WS only; chat platforms don't hit
    this path. Best-effort — skip silently if the socket isn't reachable.
    """
    try:
        scope = getattr(ws, "scope", None) or {}
        transport = (scope.get("extensions") or {}).get("transport") or getattr(ws, "transport", None)
        sock = transport.get_extra_info("socket") if transport is not None else None
        if sock is not None:
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    except Exception as exc:  # pragma: no cover - best-effort tuning
        _log.debug("ws TCP_NODELAY skip: %s", exc)


async def handle_ws(ws: Any) -> None:
    """Run one WebSocket session. Wire-compatible with ``tui_gateway.entry``."""
    peer = _ws_peer_label(ws)
    transport: WSTransport | None = None
    messages = 0
    parse_errors = 0
    dispatch_crashes = 0
    send_failures = 0
    disconnect_reason = "not_connected"

    try:
        await ws.accept()
        disconnect_reason = "connected"
        # Push small streamed frames out immediately instead of letting Nagle
        # batch them — keeps the live token cadence intact for GUI clients.
        _disable_nagle(ws)
        _log.info("ws accepted peer=%s", peer)

        transport = WSTransport(ws, asyncio.get_running_loop(), peer=peer)

        ready_ok = await transport.write_async(
            {
                "jsonrpc": "2.0",
                "method": "event",
                "params": {
                    "type": "gateway.ready",
                    "payload": {"skin": server.resolve_skin()},
                },
            }
        )
        if not ready_ok:
            disconnect_reason = "ready_send_failed"
            send_failures += 1
            _log.error("ws ready frame send failed peer=%s", peer)
            return

        while True:
            try:
                raw = await ws.receive_text()
            except _WebSocketDisconnect as exc:
                disconnect_reason = (
                    "client_disconnect("
                    f"code={getattr(exc, 'code', None)},"
                    f"reason={getattr(exc, 'reason', None)})"
                )
                break
            except Exception:
                disconnect_reason = "receive_failed"
                _log.exception("ws receive failed peer=%s", peer)
                break

            line = raw.strip()
            if not line:
                continue
            messages += 1

            try:
                req = json.loads(line)
            except json.JSONDecodeError as exc:
                parse_errors += 1
                _log.warning(
                    "ws parse error peer=%s index=%d error=%s payload=%r",
                    peer,
                    messages,
                    exc,
                    line[:_WS_LOG_PAYLOAD_PREVIEW],
                )
                ok = await transport.write_async(
                    {
                        "jsonrpc": "2.0",
                        "error": {"code": -32700, "message": "parse error"},
                        "id": None,
                    }
                )
                if not ok:
                    disconnect_reason = "send_failed_after_parse_error"
                    send_failures += 1
                    _log.warning("ws parse-error reply send failed peer=%s", peer)
                    break
                continue

            # dispatch() may schedule long handlers on the pool; it returns
            # None in that case and the worker writes the response itself via
            # the transport we pass in (a separate thread, so transport.write
            # is the safe path there). For inline handlers it returns the
            # response dict, which we write here from the loop.
            req_id = req.get("id") if isinstance(req, dict) else None
            req_method = req.get("method") if isinstance(req, dict) else None
            try:
                resp = await asyncio.to_thread(server.dispatch, req, transport)
            except Exception:
                dispatch_crashes += 1
                _log.exception(
                    "ws dispatch crash peer=%s id=%s method=%s",
                    peer,
                    req_id,
                    req_method,
                )
                ok = await transport.write_async(
                    {
                        "jsonrpc": "2.0",
                        "error": {"code": -32603, "message": "internal error"},
                        "id": req_id,
                    }
                )
                if not ok:
                    disconnect_reason = "send_failed_after_dispatch_crash"
                    send_failures += 1
                    _log.warning(
                        "ws dispatch-crash reply send failed peer=%s id=%s method=%s",
                        peer,
                        req_id,
                        req_method,
                    )
                    break
                continue
            if resp is not None and not await transport.write_async(resp):
                disconnect_reason = "send_failed_after_response"
                send_failures += 1
                _log.warning(
                    "ws response send failed peer=%s id=%s method=%s",
                    peer,
                    req_id,
                    req_method,
                )
                break
    finally:
        reaped_sessions = 0
        detached_sessions = 0
        if transport is not None:
            transport.close()

            # Reap sessions this transport owned (close_on_disconnect sidecar
            # sessions) or detach the rest to the drop sentinel so later emits
            # don't crash into a closed socket or fall through to desktop stdout
            # logs. Detached sessions are handed to the grace-windowed WS-orphan
            # reaper inside _close_sessions_for_transport (a quick reconnect /
            # session.resume cancels it). This is the single WS-disconnect
            # teardown path.
            #
            # Offloaded: _close_session_by_id does a blocking worker.close()
            # (terminate + waits) plus a synchronous DB write — inline that
            # would freeze the uvicorn event loop for every other live
            # connection.
            try:
                reaped_sessions, detached_sessions = await asyncio.to_thread(
                    server._close_sessions_for_transport,
                    transport,
                    end_reason="ws_disconnect",
                )
            except Exception:
                _log.exception("ws transport teardown failed peer=%s", peer)
        try:
            await ws.close()
        except Exception as exc:
            _log.debug("ws close failed peer=%s error=%s", peer, exc)
        _log.info(
            "ws closed peer=%s reason=%s messages=%d parse_errors=%d "
            "dispatch_crashes=%d send_failures=%d reaped_sessions=%d detached_sessions=%d",
            peer,
            disconnect_reason,
            messages,
            parse_errors,
            dispatch_crashes,
            send_failures,
            reaped_sessions,
            detached_sessions,
        )
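The on-loop / off-loop split described in the `WSTransport` docstring can be tried in isolation. What follows is a minimal sketch, not the gateway's code: `FakeSocket`, `MiniTransport`, and `demo` are hypothetical names introduced only for illustration, and the stdlib `asyncio.run_coroutine_threadsafe` stands in for the project's `safe_schedule_threadsafe` helper, whose exact behavior is not shown in this file. The sketch also omits the closed-flag and timeout handling of the real class.

```python
import asyncio
import json


class FakeSocket:
    """Hypothetical stand-in for a WebSocket: records frames instead of sending."""

    def __init__(self) -> None:
        self.sent: list[str] = []

    async def send_text(self, line: str) -> None:
        self.sent.append(line)


class MiniTransport:
    """Illustrative only: same thread check as the docstring describes."""

    def __init__(self, ws: FakeSocket, loop: asyncio.AbstractEventLoop) -> None:
        self._ws = ws
        self._loop = loop

    def write(self, obj: dict, timeout: float = 5.0) -> bool:
        line = json.dumps(obj, ensure_ascii=False)
        try:
            on_loop = asyncio.get_running_loop() is self._loop
        except RuntimeError:  # no event loop is running in this thread
            on_loop = False
        if on_loop:
            # Waiting here would block the loop on itself, so schedule and return.
            self._loop.create_task(self._ws.send_text(line))
            return True
        # Worker thread: hand the coroutine to the loop and wait for it to finish.
        fut = asyncio.run_coroutine_threadsafe(self._ws.send_text(line), self._loop)
        fut.result(timeout=timeout)
        return True


async def demo() -> list[str]:
    ws = FakeSocket()
    transport = MiniTransport(ws, asyncio.get_running_loop())
    # Off-loop write: runs in a worker thread and blocks only that thread.
    await asyncio.to_thread(transport.write, {"from": "worker"})
    # On-loop write: fire-and-forget, so yield once to let the send task run.
    transport.write({"from": "loop"})
    await asyncio.sleep(0)
    return ws.sent


frames = asyncio.run(demo())
```

The worker-thread call only succeeds because the loop is free while `demo` awaits `asyncio.to_thread`; calling the blocking branch from the loop thread itself is the deadlock the docstring warns about.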