mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-18 04:41:56 +00:00
When the gateway received SIGTERM, the shutdown_signal_handler ran a
synchronous 'ps aux' (3s timeout) inside the asyncio event loop, then
asyncio.create_task(runner.stop()). On a busy host that ate 1-3s of
the teardown budget before draining could even start, and the resulting
log line was a multi-line ps dump that didn't tell us who sent the
signal. The shutdown path itself logged 'Stopping gateway...' and then
nothing until 'Gateway stopped' — when systemd SIGKILLed mid-drain,
there was no way to see which phase wedged.
Changes:
- New gateway/shutdown_forensics.py:
* snapshot_shutdown_context(sig) — sub-millisecond /proc-only capture
of signal name, parent pid+name+cmdline, INVOCATION_ID (systemd
marker), loadavg_1m, TracerPid, takeover/planned-stop marker
presence + whether-it-names-self. Pure stdlib, never raises.
* spawn_async_diagnostic(log_path, sig) — detached subprocess with
its own 'timeout 5s', start_new_session=True, writes ps auxf +
pstree + dmesg to ~/.hermes/logs/gateway-shutdown-diag.log.
Returns immediately, can't block the event loop or the cgroup
teardown.
* check_systemd_timing_alignment(drain_timeout) — reads
/proc/self/cgroup for our unit, asks systemctl show for
TimeoutStopUSec, returns mismatch info when the unit's stop
timeout is smaller than restart_drain_timeout + 30s headroom
(the case where systemd SIGKILLs mid-drain).
* _parse_systemd_duration_to_us — covers '90s', '1min 30s',
'500ms', '1h' style values from systemctl show.
* format_context_for_log — single scannable key=value line, parent
cmdline last.
- gateway/run.py shutdown_signal_handler:
* Replaces synchronous ps aux + ad-hoc 'hermes-related lines' filter
with snapshot + detached spawn.
* Always logs 'Shutdown context: signal=... parent_pid=...
parent_cmdline=...' regardless of planned/unexpected so we can
correlate signal source even on planned restarts.
- gateway/run.py _stop_impl:
* Per-phase '+X.XXs' timing for notify_active_sessions, drain
(with drain_seconds, active_at_start, active_now, timed_out),
post-interrupt tool kill, each adapter disconnect (Xs),
all adapters disconnected, final-cleanup tool kill, SessionDB
close, total teardown.
- gateway/run.py start():
* Stale-unit warning at startup when the running systemd unit's
TimeoutStopSec is smaller than the configured drain timeout.
Points the user at 'hermes gateway service install --replace'
to regenerate, or at shortening agent.restart_drain_timeout.
Tests: 30 new in tests/gateway/test_shutdown_forensics.py — snapshot
speed bound, signal name resolution, marker detection self-vs-other,
async diag spawn doesn't block caller, systemd duration parser, and
alignment check returns None outside systemd. Wider tests/gateway/
suite: 5258 passing, 3 pre-existing TTS-routing failures unchanged
on main.
|
||
|---|---|---|
| .. | ||
| assets | ||
| builtin_hooks | ||
| platforms | ||
| __init__.py | ||
| channel_directory.py | ||
| config.py | ||
| delivery.py | ||
| display_config.py | ||
| hooks.py | ||
| mirror.py | ||
| pairing.py | ||
| platform_registry.py | ||
| restart.py | ||
| run.py | ||
| runtime_footer.py | ||
| session.py | ||
| session_context.py | ||
| shutdown_forensics.py | ||
| status.py | ||
| sticker_cache.py | ||
| stream_consumer.py | ||
| whatsapp_identity.py | ||