mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-12 08:51:53 +00:00
* fix(tui): persist gateway lifecycle breadcrumbs to crash log A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an explicit /quit) or `start()` replacing a live child. #31051 added parent-side lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies with the process, so SIGTERM crash reports arrive with no parent context and no way to tell a signal-driven kill from a memory-critical `process.exit(137)` (which closes the child's stdin → clean EOF, not SIGTERM). Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) plus the graceful-exit signal name and the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic entry — making these recurring reports diagnosable. Gated off under VITEST so unit tests stay hermetic. * feat(tui): auto-recover the session when the gateway dies unexpectedly When a still-owned gateway child dies while the TUI is alive (a crash, OOM process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the session and drops to an inert "gateway exited" state — the user loses a long session and has to restart + re-run everything. That single behavior is most of the "TUI doesn't survive heavy work" complaint, independent of what does the killing. The 'exit' event only reaches this handler on an *unexpected* death: a user /quit calls process.exit before it fires, and a replaced child is identity- skipped in GatewayClient. So on exit we now respawn the gateway and resume the session that was live (history is persisted in SQLite) via a one-shot recoverSidRef the next gateway.ready consults before forging a new session. The in-flight reply is lost (it died with the process) but the session survives. Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS (60s) so a gateway that crash-loops on startup can't spawn-storm; past the budget we fall back to the inert state. * fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog Address PR review: - recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an error message) stays a single breadcrumb and can't masquerade as a separate entry or as the child's panic output sharing the crash log. - Reword the header: a backend SIGTERM is *usually* a parent action but can come straight from an external supervisor (s6, cgroup OOM, stray kill); the presence/absence of a [tui-parent] line before the child's panic is precisely what disambiguates the two. * fix(tui): clear sid during recovery + extract/test the recovery budget Address PR review: - Null `sid` immediately in the gateway exit handler. While the gateway is down (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway. recoverSidRef carries the session forward; resumeById restores sid on ready. - Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks past it, and prunes attempts older than the window so recovery re-arms. * fix(tui): cap parent-log breadcrumb length (PR review) Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's in-memory log-line cap) so a pathological value — e.g. a giant error string — can't bloat the shared crash log or add noticeable blocking on the synchronous append during a failure path. Covered by a test. * fix(tui): keep "recovering session…" status visible during resume (PR review) resumeById() synchronously sets status to 'resuming…' on entry, so the recovery branch now applies its 'recovering session…' label *after* calling resumeById — the distinct label sticks for the duration of the resume RPC (which later flips to 'ready') instead of being immediately clobbered. Test updated to assert the ordering. * fix(tui): keep recovery budget alive across a startup crash-loop (PR review) deadSid was read from getUiState().sid, which the first exit nulls — so if the respawned gateway crash-looped before gateway.ready (resumeById never restored sid), later exits saw null and abandoned the session after a single attempt, defeating the bounded retry budget. Lift the whole decision into a pure planGatewayRecovery() that falls back to the pending recoverSidRef target when the live sid is already cleared, and unit-test the crash-loop sequence (keeps retrying the same session up to the limit, then falls back to inert). Supersedes evalRecovery. * chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review) - Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs no `!` assertion (avoids a future unsafe refactor). - Reword the parentLog cap comment: it slices the value to 4096 chars and appends a short truncation marker (so the written line is slightly longer), rather than implying a strict 4096-byte limit. * chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review) - parentLog header: a missing [tui-parent] line only *suggests* an external signal (the logger is best-effort: VITEST-disabled, failed append swallowed), not a definitive conclusion. - Recovery notice says "any in-flight reply was lost" since the gateway can also exit while idle.
57 lines
2.8 KiB
TypeScript
57 lines
2.8 KiB
TypeScript
import { appendFileSync, mkdirSync } from 'node:fs'
|
|
import { homedir } from 'node:os'
|
|
import { join } from 'node:path'
|
|
|
|
// Mirror the Python gateway's panic log (tui_gateway/server.py::_CRASH_LOG) from
|
|
// the Node parent so lifecycle breadcrumbs interleave, by timestamp, with the
|
|
// child's `=== SIGTERM received ===` / `=== gateway exit ===` entries.
|
|
//
|
|
// A backend SIGTERM is *usually* a parent action — `gw.kill()` (graceful-exit on
|
|
// a signal to Node, or an explicit /quit) or `start()` replacing a live child —
|
|
// but it can also come straight from an external supervisor (s6, a cgroup OOM
|
|
// reaper, a stray `kill`) signalling the child directly. Telling those apart is
|
|
// exactly the point: #31051 left these breadcrumbs in an in-memory CircularBuffer
|
|
// that dies with the process, so SIGTERM crash reports arrived with no parent
|
|
// context. A `[tui-parent]` line immediately before the child's panic means a
|
|
// parent kill; its absence *suggests* an external signal — not definitive,
|
|
// since this logger is best-effort (disabled under VITEST, and a failed append
|
|
// is swallowed). Persisting the death-explaining events here is what makes that
|
|
// distinction (and a memory-critical `process.exit(137)`, which closes stdin →
|
|
// clean EOF, not SIGTERM) diagnosable after the fact.
|
|
const logDir = join(process.env.HERMES_HOME?.trim() || join(homedir(), '.hermes'), 'logs')
|
|
const CRASH_LOG = join(logDir, 'tui_gateway_crash.log')
|
|
|
|
// Skipped under vitest so unit tests exercising start()/kill() can't write into
|
|
// a real ~/.hermes (tests must stay hermetic — see AGENTS.md).
|
|
const enabled = !process.env.VITEST
|
|
// Slice a single breadcrumb's value to MAX_BREADCRUMB chars (a short
|
|
// "[truncated …]" marker is appended, so the written line is slightly longer)
|
|
// so a pathological value (e.g. a giant error) can't bloat the shared crash log
|
|
// or add noticeable blocking on the synchronous append. Mirrors the spirit of
|
|
// GatewayClient's in-memory log-line cap.
|
|
const MAX_BREADCRUMB = 4096
|
|
let warned = false
|
|
|
|
export function recordParentLifecycle(line: string): void {
|
|
if (!enabled) {
|
|
return
|
|
}
|
|
|
|
try {
|
|
// Collapse embedded newlines so a multi-line value (e.g. an error message)
|
|
// stays one breadcrumb and can't masquerade as a separate log entry or as
|
|
// the child's panic output sharing this file.
|
|
const oneLine = line.replace(/[\r\n]+/g, ' ↵ ')
|
|
|
|
const capped =
|
|
oneLine.length > MAX_BREADCRUMB ? `${oneLine.slice(0, MAX_BREADCRUMB)}… [truncated ${oneLine.length} chars]` : oneLine
|
|
|
|
mkdirSync(logDir, { recursive: true })
|
|
appendFileSync(CRASH_LOG, `[tui-parent] ${new Date().toISOString()} ${capped}\n`)
|
|
} catch {
|
|
if (!warned) {
|
|
warned = true
|
|
process.stderr.write('hermes-tui: parent lifecycle log unavailable\n')
|
|
}
|
|
}
|
|
}
|