hermes-agent/ui-tui/src/lib/parentLog.ts
brooklyn! a726e8a811
fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893)
* fix(tui): persist gateway lifecycle breadcrumbs to crash log

A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is
always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an
explicit /quit) or `start()` replacing a live child. #31051 added parent-side
lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies
with the process, so SIGTERM crash reports arrive with no parent context and no
way to tell a signal-driven kill from a memory-critical `process.exit(137)`
(which closes the child's stdin → clean EOF, not SIGTERM).

Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit /
replace-live-child / kill-reason / startup-timeout) plus the graceful-exit
signal name and the memory-critical exit into the same crash log the Python
side writes, so they interleave by timestamp next to the child's panic entry —
making these recurring reports diagnosable.

Gated off under VITEST so unit tests stay hermetic.

* feat(tui): auto-recover the session when the gateway dies unexpectedly

When a still-owned gateway child dies while the TUI is alive (a crash, OOM
process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the
session and drops to an inert "gateway exited" state — the user loses a long
session and has to restart + re-run everything. That single behavior is most of
the "TUI doesn't survive heavy work" complaint, independent of what does the
killing.

The 'exit' event only reaches this handler on an *unexpected* death: a user
/quit calls process.exit before it fires, and a replaced child is identity-
skipped in GatewayClient. So on exit we now respawn the gateway and resume the
session that was live (history is persisted in SQLite) via a one-shot
recoverSidRef the next gateway.ready consults before forging a new session. The
in-flight reply is lost (it died with the process) but the session survives.

Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS
(60s) so a gateway that crash-loops on startup can't spawn-storm; past the
budget we fall back to the inert state.

* fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog

Address PR review:
- recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an
  error message) stays a single breadcrumb and can't masquerade as a separate
  entry or as the child's panic output sharing the crash log.
- Reword the header: a backend SIGTERM is *usually* a parent action but can come
  straight from an external supervisor (s6, cgroup OOM, stray kill); the
  presence/absence of a [tui-parent] line before the child's panic is precisely
  what disambiguates the two.

* fix(tui): clear sid during recovery + extract/test the recovery budget

Address PR review:
- Null `sid` immediately in the gateway exit handler. While the gateway is down
  (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s
  session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway.
  recoverSidRef carries the session forward; resumeById restores sid on ready.
- Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and
  unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks
  past it, and prunes attempts older than the window so recovery re-arms.

* fix(tui): cap parent-log breadcrumb length (PR review)

Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's
in-memory log-line cap) so a pathological value — e.g. a giant error string —
can't bloat the shared crash log or add noticeable blocking on the synchronous
append during a failure path. Covered by a test.

* fix(tui): keep "recovering session…" status visible during resume (PR review)

resumeById() synchronously sets status to 'resuming…' on entry, so the
recovery branch now applies its 'recovering session…' label *after* calling
resumeById — the distinct label sticks for the duration of the resume RPC
(which later flips to 'ready') instead of being immediately clobbered. Test
updated to assert the ordering.

* fix(tui): keep recovery budget alive across a startup crash-loop (PR review)

deadSid was read from getUiState().sid, which the first exit nulls — so if the
respawned gateway crash-looped before gateway.ready (resumeById never restored
sid), later exits saw null and abandoned the session after a single attempt,
defeating the bounded retry budget.

Lift the whole decision into a pure planGatewayRecovery() that falls back to the
pending recoverSidRef target when the live sid is already cleared, and unit-test
the crash-loop sequence (keeps retrying the same session up to the limit, then
falls back to inert). Supersedes evalRecovery.

* chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review)

- Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs
  no `!` assertion (avoids a future unsafe refactor).
- Reword the parentLog cap comment: it slices the value to 4096 chars and
  appends a short truncation marker (so the written line is slightly longer),
  rather than implying a strict 4096-byte limit.

* chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review)

- parentLog header: a missing [tui-parent] line only *suggests* an external
  signal (the logger is best-effort: VITEST-disabled, failed append swallowed),
  not a definitive conclusion.
- Recovery notice says "any in-flight reply was lost" since the gateway can also
  exit while idle.
2026-05-31 10:36:57 -05:00

57 lines
2.8 KiB
TypeScript

import { appendFileSync, mkdirSync } from 'node:fs'
import { homedir } from 'node:os'
import { join } from 'node:path'
// Mirror the Python gateway's panic log (tui_gateway/server.py::_CRASH_LOG) from
// the Node parent so lifecycle breadcrumbs interleave, by timestamp, with the
// child's `=== SIGTERM received ===` / `=== gateway exit ===` entries.
//
// A backend SIGTERM is *usually* a parent action — `gw.kill()` (graceful-exit on
// a signal to Node, or an explicit /quit) or `start()` replacing a live child —
// but it can also come straight from an external supervisor (s6, a cgroup OOM
// reaper, a stray `kill`) signalling the child directly. Telling those apart is
// exactly the point: #31051 left these breadcrumbs in an in-memory CircularBuffer
// that dies with the process, so SIGTERM crash reports arrived with no parent
// context. A `[tui-parent]` line immediately before the child's panic means a
// parent kill; its absence *suggests* an external signal — not definitive,
// since this logger is best-effort (disabled under VITEST, and a failed append
// is swallowed). Persisting the death-explaining events here is what makes that
// distinction (and a memory-critical `process.exit(137)`, which closes stdin →
// clean EOF, not SIGTERM) diagnosable after the fact.
const logDir = join(process.env.HERMES_HOME?.trim() || join(homedir(), '.hermes'), 'logs')
const CRASH_LOG = join(logDir, 'tui_gateway_crash.log')
// Skipped under vitest so unit tests exercising start()/kill() can't write into
// a real ~/.hermes (tests must stay hermetic — see AGENTS.md).
const enabled = !process.env.VITEST
// Slice a single breadcrumb's value to MAX_BREADCRUMB chars (a short
// "[truncated …]" marker is appended, so the written line is slightly longer)
// so a pathological value (e.g. a giant error) can't bloat the shared crash log
// or add noticeable blocking on the synchronous append. Mirrors the spirit of
// GatewayClient's in-memory log-line cap.
const MAX_BREADCRUMB = 4096
let warned = false
export function recordParentLifecycle(line: string): void {
if (!enabled) {
return
}
try {
// Collapse embedded newlines so a multi-line value (e.g. an error message)
// stays one breadcrumb and can't masquerade as a separate log entry or as
// the child's panic output sharing this file.
const oneLine = line.replace(/[\r\n]+/g, ' ↵ ')
const capped =
oneLine.length > MAX_BREADCRUMB ? `${oneLine.slice(0, MAX_BREADCRUMB)}… [truncated ${oneLine.length} chars]` : oneLine
mkdirSync(logDir, { recursive: true })
appendFileSync(CRASH_LOG, `[tui-parent] ${new Date().toISOString()} ${capped}\n`)
} catch {
if (!warned) {
warned = true
process.stderr.write('hermes-tui: parent lifecycle log unavailable\n')
}
}
}