hermes-agent/ui-tui/src/__tests__/gatewayRecovery.test.ts
brooklyn! a726e8a811
fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893)
* fix(tui): persist gateway lifecycle breadcrumbs to crash log

A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is
always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an
explicit /quit) or `start()` replacing a live child. #31051 added parent-side
lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies
with the process, so SIGTERM crash reports arrive with no parent context and no
way to tell a signal-driven kill from a memory-critical `process.exit(137)`
(which closes the child's stdin → clean EOF, not SIGTERM).

Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit /
replace-live-child / kill-reason / startup-timeout) plus the graceful-exit
signal name and the memory-critical exit into the same crash log the Python
side writes, so they interleave by timestamp next to the child's panic entry —
making these recurring reports diagnosable.

Gated off under VITEST so unit tests stay hermetic.

* feat(tui): auto-recover the session when the gateway dies unexpectedly

When a still-owned gateway child dies while the TUI is alive (a crash, OOM
process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the
session and drops to an inert "gateway exited" state — the user loses a long
session and has to restart + re-run everything. That single behavior is most of
the "TUI doesn't survive heavy work" complaint, independent of what does the
killing.

The 'exit' event only reaches this handler on an *unexpected* death: a user
/quit calls process.exit before it fires, and a replaced child is identity-
skipped in GatewayClient. So on exit we now respawn the gateway and resume the
session that was live (history is persisted in SQLite) via a one-shot
recoverSidRef the next gateway.ready consults before forging a new session. The
in-flight reply is lost (it died with the process) but the session survives.

Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS
(60s) so a gateway that crash-loops on startup can't spawn-storm; past the
budget we fall back to the inert state.

* fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog

Address PR review:
- recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an
  error message) stays a single breadcrumb and can't masquerade as a separate
  entry or as the child's panic output sharing the crash log.
- Reword the header: a backend SIGTERM is *usually* a parent action but can come
  straight from an external supervisor (s6, cgroup OOM, stray kill); the
  presence/absence of a [tui-parent] line before the child's panic is precisely
  what disambiguates the two.

* fix(tui): clear sid during recovery + extract/test the recovery budget

Address PR review:
- Null `sid` immediately in the gateway exit handler. While the gateway is down
  (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s
  session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway.
  recoverSidRef carries the session forward; resumeById restores sid on ready.
- Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and
  unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks
  past it, and prunes attempts older than the window so recovery re-arms.

* fix(tui): cap parent-log breadcrumb length (PR review)

Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's
in-memory log-line cap) so a pathological value — e.g. a giant error string —
can't bloat the shared crash log or add noticeable blocking on the synchronous
append during a failure path. Covered by a test.

* fix(tui): keep "recovering session…" status visible during resume (PR review)

resumeById() synchronously sets status to 'resuming…' on entry, so the
recovery branch now applies its 'recovering session…' label *after* calling
resumeById — the distinct label sticks for the duration of the resume RPC
(which later flips to 'ready') instead of being immediately clobbered. Test
updated to assert the ordering.

* fix(tui): keep recovery budget alive across a startup crash-loop (PR review)

deadSid was read from getUiState().sid, which the first exit nulls — so if the
respawned gateway crash-looped before gateway.ready (resumeById never restored
sid), later exits saw null and abandoned the session after a single attempt,
defeating the bounded retry budget.

Lift the whole decision into a pure planGatewayRecovery() that falls back to the
pending recoverSidRef target when the live sid is already cleared, and unit-test
the crash-loop sequence (keeps retrying the same session up to the limit, then
falls back to inert). Supersedes evalRecovery.

* chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review)

- Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs
  no `!` assertion (avoids a future unsafe refactor).
- Reword the parentLog cap comment: it slices the value to 4096 chars and
  appends a short truncation marker (so the written line is slightly longer),
  rather than implying a strict 4096-byte limit.

* chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review)

- parentLog header: a missing [tui-parent] line only *suggests* an external
  signal (the logger is best-effort: VITEST-disabled, failed append swallowed),
  not a definitive conclusion.
- Recovery notice says "any in-flight reply was lost" since the gateway can also
  exit while idle.
2026-05-31 10:36:57 -05:00

47 lines
1.9 KiB
TypeScript

import { describe, expect, it } from 'vitest'
import { GATEWAY_RECOVERY_LIMIT, GATEWAY_RECOVERY_WINDOW_MS, planGatewayRecovery } from '../app/gatewayRecovery.js'
describe('planGatewayRecovery', () => {
it('recovers the live session and records the attempt', () => {
const plan = planGatewayRecovery('sess-1', null, [], 1000)
expect(plan).toEqual({ attempts: [1000], recover: true, sid: 'sess-1' })
})
it('does not recover when there is no session to resume', () => {
expect(planGatewayRecovery(null, null, [], 1000)).toEqual({ attempts: [], recover: false, sid: null })
})
it('keeps retrying the recovery target through a startup crash-loop, bounded by the budget', () => {
// First exit: live sid present.
let attempts: number[] = []
let plan = planGatewayRecovery('sess-1', null, attempts, 0)
expect(plan.recover).toBe(true)
expect(plan.sid).toBe('sess-1')
attempts = plan.attempts
// Respawn crash-loops before gateway.ready: live sid is now null, but the
// recovery target carries it forward so we keep trying up to the budget.
for (let i = 1; i < GATEWAY_RECOVERY_LIMIT; i++) {
plan = planGatewayRecovery(null, 'sess-1', attempts, i)
expect(plan.recover).toBe(true)
expect(plan.sid).toBe('sess-1')
attempts = plan.attempts
}
// Budget exhausted: fall back to the inert state instead of spawn-storming.
plan = planGatewayRecovery(null, 'sess-1', attempts, GATEWAY_RECOVERY_LIMIT)
expect(plan.recover).toBe(false)
expect(plan.sid).toBe('sess-1')
})
it('prunes attempts older than the window so recovery re-arms', () => {
const old = Array.from({ length: GATEWAY_RECOVERY_LIMIT }, (_, i) => i)
const plan = planGatewayRecovery('sess-1', null, old, GATEWAY_RECOVERY_WINDOW_MS + 100)
expect(plan.attempts).toEqual([GATEWAY_RECOVERY_WINDOW_MS + 100])
expect(plan.recover).toBe(true)
})
})