hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-17 09:41:58 +00:00

History

brooklyn! a726e8a811 fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893 ) * fix(tui): persist gateway lifecycle breadcrumbs to crash log A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an explicit /quit) or `start()` replacing a live child. #31051 added parent-side lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies with the process, so SIGTERM crash reports arrive with no parent context and no way to tell a signal-driven kill from a memory-critical `process.exit(137)` (which closes the child's stdin → clean EOF, not SIGTERM). Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) plus the graceful-exit signal name and the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic entry — making these recurring reports diagnosable. Gated off under VITEST so unit tests stay hermetic. * feat(tui): auto-recover the session when the gateway dies unexpectedly When a still-owned gateway child dies while the TUI is alive (a crash, OOM process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the session and drops to an inert "gateway exited" state — the user loses a long session and has to restart + re-run everything. That single behavior is most of the "TUI doesn't survive heavy work" complaint, independent of what does the killing. The 'exit' event only reaches this handler on an unexpected death: a user /quit calls process.exit before it fires, and a replaced child is identity- skipped in GatewayClient. So on exit we now respawn the gateway and resume the session that was live (history is persisted in SQLite) via a one-shot recoverSidRef the next gateway.ready consults before forging a new session. The in-flight reply is lost (it died with the process) but the session survives. Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS (60s) so a gateway that crash-loops on startup can't spawn-storm; past the budget we fall back to the inert state. * fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog Address PR review: - recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an error message) stays a single breadcrumb and can't masquerade as a separate entry or as the child's panic output sharing the crash log. - Reword the header: a backend SIGTERM is usually a parent action but can come straight from an external supervisor (s6, cgroup OOM, stray kill); the presence/absence of a [tui-parent] line before the child's panic is precisely what disambiguates the two. * fix(tui): clear sid during recovery + extract/test the recovery budget Address PR review: - Null `sid` immediately in the gateway exit handler. While the gateway is down (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway. recoverSidRef carries the session forward; resumeById restores sid on ready. - Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks past it, and prunes attempts older than the window so recovery re-arms. * fix(tui): cap parent-log breadcrumb length (PR review) Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's in-memory log-line cap) so a pathological value — e.g. a giant error string — can't bloat the shared crash log or add noticeable blocking on the synchronous append during a failure path. Covered by a test. * fix(tui): keep "recovering session…" status visible during resume (PR review) resumeById() synchronously sets status to 'resuming…' on entry, so the recovery branch now applies its 'recovering session…' label after calling resumeById — the distinct label sticks for the duration of the resume RPC (which later flips to 'ready') instead of being immediately clobbered. Test updated to assert the ordering. * fix(tui): keep recovery budget alive across a startup crash-loop (PR review) deadSid was read from getUiState().sid, which the first exit nulls — so if the respawned gateway crash-looped before gateway.ready (resumeById never restored sid), later exits saw null and abandoned the session after a single attempt, defeating the bounded retry budget. Lift the whole decision into a pure planGatewayRecovery() that falls back to the pending recoverSidRef target when the live sid is already cleared, and unit-test the crash-loop sequence (keeps retrying the same session up to the limit, then falls back to inert). Supersedes evalRecovery. * chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review) - Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs no `!` assertion (avoids a future unsafe refactor). - Reword the parentLog cap comment: it slices the value to 4096 chars and appends a short truncation marker (so the written line is slightly longer), rather than implying a strict 4096-byte limit. * chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review) - parentLog header: a missing [tui-parent] line only suggests an external signal (the logger is best-effort: VITEST-disabled, failed append swallowed), not a definitive conclusion. - Recovery notice says "any in-flight reply was lost" since the gateway can also exit while idle.		2026-05-31 10:36:57 -05:00
..
__tests__	fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893 )	2026-05-31 10:36:57 -05:00
app	fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893 )	2026-05-31 10:36:57 -05:00
components	fix: surface /agents nudge while delegate_task is in-flight (TUI + CLI)	2026-05-30 13:22:45 +05:30
config	feat: configurable paste collapse thresholds (TUI + CLI)	2026-05-25 06:23:18 -07:00
content	feat: add TUI session orchestrator	2026-05-26 20:51:59 -07:00
domain	fix(tui): stabilize sticky prompt tracking	2026-04-28 22:10:40 -05:00
hooks	fix(tui): recompute virtual tail after width resize	2026-05-23 14:49:26 -05:00
lib	fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893 )	2026-05-31 10:36:57 -05:00
protocol	refactor(tui): /clean pass across ui-tui — 49 files, −217 LOC	2026-04-16 22:32:53 -05:00
types	fix(tui): keep Ink displayCursor in sync with fast-echo writes so cursor stops drifting (#26717 )	2026-05-16 00:28:12 -05:00
app.tsx	fix(tui): apply ui-tui fix pass and restore type-check	2026-04-25 14:08:54 -05:00
banner.ts	feat(tui): responsive banner tiers	2026-05-23 17:37:51 -05:00
entry.tsx	fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893 )	2026-05-31 10:36:57 -05:00
gatewayClient.ts	fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893 )	2026-05-31 10:36:57 -05:00
gatewayTypes.ts	feat(tui): nudge toward /agents dashboard when delegation starts	2026-05-30 12:26:36 +05:30
theme.ts	fix(tui): honor skin highlight colors (#20895 )	2026-05-06 14:01:56 -07:00
types.ts	fix(tui): surface verbose tool details (#30225 )	2026-05-22 00:16:52 -05:00