mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-20 10:11:58 +00:00
* fix(tui): persist gateway lifecycle breadcrumbs to crash log A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an explicit /quit) or `start()` replacing a live child. #31051 added parent-side lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies with the process, so SIGTERM crash reports arrive with no parent context and no way to tell a signal-driven kill from a memory-critical `process.exit(137)` (which closes the child's stdin → clean EOF, not SIGTERM). Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) plus the graceful-exit signal name and the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic entry — making these recurring reports diagnosable. Gated off under VITEST so unit tests stay hermetic. * feat(tui): auto-recover the session when the gateway dies unexpectedly When a still-owned gateway child dies while the TUI is alive (a crash, OOM process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the session and drops to an inert "gateway exited" state — the user loses a long session and has to restart + re-run everything. That single behavior is most of the "TUI doesn't survive heavy work" complaint, independent of what does the killing. The 'exit' event only reaches this handler on an *unexpected* death: a user /quit calls process.exit before it fires, and a replaced child is identity- skipped in GatewayClient. So on exit we now respawn the gateway and resume the session that was live (history is persisted in SQLite) via a one-shot recoverSidRef the next gateway.ready consults before forging a new session. The in-flight reply is lost (it died with the process) but the session survives. Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS (60s) so a gateway that crash-loops on startup can't spawn-storm; past the budget we fall back to the inert state. * fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog Address PR review: - recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an error message) stays a single breadcrumb and can't masquerade as a separate entry or as the child's panic output sharing the crash log. - Reword the header: a backend SIGTERM is *usually* a parent action but can come straight from an external supervisor (s6, cgroup OOM, stray kill); the presence/absence of a [tui-parent] line before the child's panic is precisely what disambiguates the two. * fix(tui): clear sid during recovery + extract/test the recovery budget Address PR review: - Null `sid` immediately in the gateway exit handler. While the gateway is down (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway. recoverSidRef carries the session forward; resumeById restores sid on ready. - Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks past it, and prunes attempts older than the window so recovery re-arms. * fix(tui): cap parent-log breadcrumb length (PR review) Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's in-memory log-line cap) so a pathological value — e.g. a giant error string — can't bloat the shared crash log or add noticeable blocking on the synchronous append during a failure path. Covered by a test. * fix(tui): keep "recovering session…" status visible during resume (PR review) resumeById() synchronously sets status to 'resuming…' on entry, so the recovery branch now applies its 'recovering session…' label *after* calling resumeById — the distinct label sticks for the duration of the resume RPC (which later flips to 'ready') instead of being immediately clobbered. Test updated to assert the ordering. * fix(tui): keep recovery budget alive across a startup crash-loop (PR review) deadSid was read from getUiState().sid, which the first exit nulls — so if the respawned gateway crash-looped before gateway.ready (resumeById never restored sid), later exits saw null and abandoned the session after a single attempt, defeating the bounded retry budget. Lift the whole decision into a pure planGatewayRecovery() that falls back to the pending recoverSidRef target when the live sid is already cleared, and unit-test the crash-loop sequence (keeps retrying the same session up to the limit, then falls back to inert). Supersedes evalRecovery. * chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review) - Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs no `!` assertion (avoids a future unsafe refactor). - Reword the parentLog cap comment: it slices the value to 4096 chars and appends a short truncation marker (so the written line is slightly longer), rather than implying a strict 4096-byte limit. * chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review) - parentLog header: a missing [tui-parent] line only *suggests* an external signal (the logger is best-effort: VITEST-disabled, failed append swallowed), not a definitive conclusion. - Recovery notice says "any in-flight reply was lost" since the gateway can also exit while idle.
47 lines
1.9 KiB
TypeScript
47 lines
1.9 KiB
TypeScript
import { describe, expect, it } from 'vitest'
|
|
|
|
import { GATEWAY_RECOVERY_LIMIT, GATEWAY_RECOVERY_WINDOW_MS, planGatewayRecovery } from '../app/gatewayRecovery.js'
|
|
|
|
describe('planGatewayRecovery', () => {
|
|
it('recovers the live session and records the attempt', () => {
|
|
const plan = planGatewayRecovery('sess-1', null, [], 1000)
|
|
|
|
expect(plan).toEqual({ attempts: [1000], recover: true, sid: 'sess-1' })
|
|
})
|
|
|
|
it('does not recover when there is no session to resume', () => {
|
|
expect(planGatewayRecovery(null, null, [], 1000)).toEqual({ attempts: [], recover: false, sid: null })
|
|
})
|
|
|
|
it('keeps retrying the recovery target through a startup crash-loop, bounded by the budget', () => {
|
|
// First exit: live sid present.
|
|
let attempts: number[] = []
|
|
let plan = planGatewayRecovery('sess-1', null, attempts, 0)
|
|
|
|
expect(plan.recover).toBe(true)
|
|
expect(plan.sid).toBe('sess-1')
|
|
attempts = plan.attempts
|
|
|
|
// Respawn crash-loops before gateway.ready: live sid is now null, but the
|
|
// recovery target carries it forward so we keep trying up to the budget.
|
|
for (let i = 1; i < GATEWAY_RECOVERY_LIMIT; i++) {
|
|
plan = planGatewayRecovery(null, 'sess-1', attempts, i)
|
|
expect(plan.recover).toBe(true)
|
|
expect(plan.sid).toBe('sess-1')
|
|
attempts = plan.attempts
|
|
}
|
|
|
|
// Budget exhausted: fall back to the inert state instead of spawn-storming.
|
|
plan = planGatewayRecovery(null, 'sess-1', attempts, GATEWAY_RECOVERY_LIMIT)
|
|
expect(plan.recover).toBe(false)
|
|
expect(plan.sid).toBe('sess-1')
|
|
})
|
|
|
|
it('prunes attempts older than the window so recovery re-arms', () => {
|
|
const old = Array.from({ length: GATEWAY_RECOVERY_LIMIT }, (_, i) => i)
|
|
const plan = planGatewayRecovery('sess-1', null, old, GATEWAY_RECOVERY_WINDOW_MS + 100)
|
|
|
|
expect(plan.attempts).toEqual([GATEWAY_RECOVERY_WINDOW_MS + 100])
|
|
expect(plan.recover).toBe(true)
|
|
})
|
|
})
|