hermes-agent/ui-tui/src
brooklyn! a726e8a811
fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893)
* fix(tui): persist gateway lifecycle breadcrumbs to crash log

A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is
always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an
explicit /quit) or `start()` replacing a live child. #31051 added parent-side
lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies
with the process, so SIGTERM crash reports arrive with no parent context and no
way to tell a signal-driven kill from a memory-critical `process.exit(137)`
(which closes the child's stdin → clean EOF, not SIGTERM).

Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit /
replace-live-child / kill-reason / startup-timeout) plus the graceful-exit
signal name and the memory-critical exit into the same crash log the Python
side writes, so they interleave by timestamp next to the child's panic entry —
making these recurring reports diagnosable.

Gated off under VITEST so unit tests stay hermetic.

* feat(tui): auto-recover the session when the gateway dies unexpectedly

When a still-owned gateway child dies while the TUI is alive (a crash, OOM
process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the
session and drops to an inert "gateway exited" state — the user loses a long
session and has to restart + re-run everything. That single behavior is most of
the "TUI doesn't survive heavy work" complaint, independent of what does the
killing.

The 'exit' event only reaches this handler on an *unexpected* death: a user
/quit calls process.exit before it fires, and a replaced child is identity-
skipped in GatewayClient. So on exit we now respawn the gateway and resume the
session that was live (history is persisted in SQLite) via a one-shot
recoverSidRef the next gateway.ready consults before forging a new session. The
in-flight reply is lost (it died with the process) but the session survives.

Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS
(60s) so a gateway that crash-loops on startup can't spawn-storm; past the
budget we fall back to the inert state.

* fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog

Address PR review:
- recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an
  error message) stays a single breadcrumb and can't masquerade as a separate
  entry or as the child's panic output sharing the crash log.
- Reword the header: a backend SIGTERM is *usually* a parent action but can come
  straight from an external supervisor (s6, cgroup OOM, stray kill); the
  presence/absence of a [tui-parent] line before the child's panic is precisely
  what disambiguates the two.

* fix(tui): clear sid during recovery + extract/test the recovery budget

Address PR review:
- Null `sid` immediately in the gateway exit handler. While the gateway is down
  (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s
  session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway.
  recoverSidRef carries the session forward; resumeById restores sid on ready.
- Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and
  unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks
  past it, and prunes attempts older than the window so recovery re-arms.

* fix(tui): cap parent-log breadcrumb length (PR review)

Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's
in-memory log-line cap) so a pathological value — e.g. a giant error string —
can't bloat the shared crash log or add noticeable blocking on the synchronous
append during a failure path. Covered by a test.

* fix(tui): keep "recovering session…" status visible during resume (PR review)

resumeById() synchronously sets status to 'resuming…' on entry, so the
recovery branch now applies its 'recovering session…' label *after* calling
resumeById — the distinct label sticks for the duration of the resume RPC
(which later flips to 'ready') instead of being immediately clobbered. Test
updated to assert the ordering.

* fix(tui): keep recovery budget alive across a startup crash-loop (PR review)

deadSid was read from getUiState().sid, which the first exit nulls — so if the
respawned gateway crash-looped before gateway.ready (resumeById never restored
sid), later exits saw null and abandoned the session after a single attempt,
defeating the bounded retry budget.

Lift the whole decision into a pure planGatewayRecovery() that falls back to the
pending recoverSidRef target when the live sid is already cleared, and unit-test
the crash-loop sequence (keeps retrying the same session up to the limit, then
falls back to inert). Supersedes evalRecovery.

* chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review)

- Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs
  no `!` assertion (avoids a future unsafe refactor).
- Reword the parentLog cap comment: it slices the value to 4096 chars and
  appends a short truncation marker (so the written line is slightly longer),
  rather than implying a strict 4096-byte limit.

* chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review)

- parentLog header: a missing [tui-parent] line only *suggests* an external
  signal (the logger is best-effort: VITEST-disabled, failed append swallowed),
  not a definitive conclusion.
- Recovery notice says "any in-flight reply was lost" since the gateway can also
  exit while idle.
2026-05-31 10:36:57 -05:00
..
__tests__ fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893) 2026-05-31 10:36:57 -05:00
app fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893) 2026-05-31 10:36:57 -05:00
components fix: surface /agents nudge while delegate_task is in-flight (TUI + CLI) 2026-05-30 13:22:45 +05:30
config feat: configurable paste collapse thresholds (TUI + CLI) 2026-05-25 06:23:18 -07:00
content feat: add TUI session orchestrator 2026-05-26 20:51:59 -07:00
domain fix(tui): stabilize sticky prompt tracking 2026-04-28 22:10:40 -05:00
hooks fix(tui): recompute virtual tail after width resize 2026-05-23 14:49:26 -05:00
lib fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893) 2026-05-31 10:36:57 -05:00
protocol refactor(tui): /clean pass across ui-tui — 49 files, −217 LOC 2026-04-16 22:32:53 -05:00
types fix(tui): keep Ink displayCursor in sync with fast-echo writes so cursor stops drifting (#26717) 2026-05-16 00:28:12 -05:00
app.tsx fix(tui): apply ui-tui fix pass and restore type-check 2026-04-25 14:08:54 -05:00
banner.ts feat(tui): responsive banner tiers 2026-05-23 17:37:51 -05:00
entry.tsx fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893) 2026-05-31 10:36:57 -05:00
gatewayClient.ts fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893) 2026-05-31 10:36:57 -05:00
gatewayTypes.ts feat(tui): nudge toward /agents dashboard when delegation starts 2026-05-30 12:26:36 +05:30
theme.ts fix(tui): honor skin highlight colors (#20895) 2026-05-06 14:01:56 -07:00
types.ts fix(tui): surface verbose tool details (#30225) 2026-05-22 00:16:52 -05:00