hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-26 17:38:36 +00:00

Ben Barclay 5cf6e28a2f fix(gateway): auto-start after container restart via planned-stop marker (#42675) (#43236)

* fix(gateway): auto-start after container restart via planned-stop marker

On Docker (s6-overlay), the gateway runs as a dynamically-registered s6 service. When the container stops/restarts/upgrades, s6 sends the gateway a plain SIGTERM. The shutdown path (_stop_impl) ended with an unconditional _update_runtime_status("stopped"), persisting gateway_state=stopped to the volume. container_boot.py reads that on the next boot and only auto-starts gateways whose last state was "running" (_AUTOSTART_STATES), so after a routine `docker compose up --force-recreate` the gateway stays down and messaging channels silently go dark, with no error surfaced (issue #42675).

The codebase already distinguishes intentional stops from unexpected signals via the planned-stop marker (write_planned_stop_marker / consume_planned_stop_marker_for_self): `hermes gateway stop`, systemd/launchd ExecStop, and Ctrl+C write a marker before signalling, so the handler classifies them as planned. An unmarked SIGTERM (container/s6 restart, OOM, bare kill) is signal-initiated.

This wires that existing classification through to the state persist, rather than adding unreliable signal-source inference:

- run.py: GatewayRunner._signal_initiated_shutdown, set in shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a signal-initiated (non-restart) teardown now persists "running" instead of "stopped", preserving the operator's run-intent and overwriting the mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot. Operator stops and restarts persist "stopped" as before.
- service_manager.py: S6ServiceManager.stop() now writes the planned-stop marker for the supervised PID (read from s6-svstat) before `s6-svc -d`, so an in-container `hermes gateway stop` is correctly classified as intentional (parity with the systemd/launchd/host stop paths, which already mark). Best-effort: a marker-write failure falls back to the safe signal-initiated path.

Tests: shutdown persist-decision table (signal→running, operator→stopped, restart→stopped), s6 stop marker write + svstat PID parse + failure tolerance. The signal→running and s6-marker tests fail without the respective source change.

Verified end-to-end against a container built from this branch: an unmarked SIGTERM to the live gateway leaves gateway_state=running (shutdown-context log confirms signal path); existing real container-restart suite still green.

* docs(docker): clarify gateway autostart distinguishes operator-stop from container-kill

The per-profile-supervision section described the autostart-across-restart contract as "running gateways come back, stopped stay stopped" without spelling out what records 'stopped'. That contract was the source of #42675 confusion: users expected a restart to bring the gateway back and it didn't. With the write-side fix, only an explicit `hermes gateway stop` records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and unexpected exits) leave the state 'running' so the gateway auto-starts. Make that distinction explicit in both the multi-profile and per-profile-supervision sections.

* test(docker): real-restart autostart E2E for #42675

Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp: a live s6-supervised gateway is killed by an actual `docker restart` SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must auto-start on the next boot. Exercises the WRITE side of the fix that the existing stamp-based tests bypass. Verified to FAIL against an origin/main image (reconciler logs prior_state=stopped action=registered, the #42675 bug) and PASS against the fixed image (prior_state=running action=started).

2026-06-10 14:01:34 +10:00
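
The persist decision described in the commit message above can be illustrated with a minimal sketch. This is not the code from run.py or container_boot.py: the function names `state_to_persist` and `should_autostart` and the boolean parameters are hypothetical. The contents of the autostart set are an assumption, based only on the commit message's statement that gateways whose last state was "running" are auto-started.

```python
# Hypothetical sketch of the shutdown persist-decision table from the commit
# message (signal -> running, operator -> stopped, restart -> stopped).
# Names and signatures are illustrative, not the project's actual API.

# Assumption: only "running" qualifies for auto-start on the next boot.
AUTOSTART_STATES = frozenset({"running"})


def state_to_persist(signal_initiated: bool, restart: bool) -> str:
    """Return the gateway_state value to write at the end of shutdown.

    signal_initiated: True when the SIGTERM arrived with no planned-stop
        marker (container/s6 restart, OOM, bare kill).
    restart: True when the teardown is part of a gateway restart.
    """
    if signal_initiated and not restart:
        # Preserve the operator's run-intent so the next boot auto-starts.
        return "running"
    # Operator stops and restarts persist "stopped", as before the fix.
    return "stopped"


def should_autostart(prior_state: str) -> bool:
    """Boot-time check: auto-start only if the persisted state qualifies."""
    return prior_state in AUTOSTART_STATES


if __name__ == "__main__":
    cases = {
        "unmarked SIGTERM": (True, False),
        "operator stop": (False, False),
        "restart": (False, True),
    }
    for label, (signal_initiated, restart) in cases.items():
        state = state_to_persist(signal_initiated, restart)
        print(f"{label}: persist={state} autostart={should_autostart(state)}")
```

Under these assumptions, only the unmarked-SIGTERM case leaves a state that the boot-time check accepts, which matches the behaviour the commit message describes for a container restart.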
developer-guide	feat(compression): raise compaction trigger to 85% for gpt-5.5 on Codex OAuth (#40957)	2026-06-07 01:40:50 -07:00
getting-started	docs: remove --include-desktop install instructions (#39762)	2026-06-05 06:53:58 -07:00
guides	fix(plugins): thread-safe lazy-singleton helpers; fix honcho TOCTOU (#24759) (#42150)	2026-06-08 09:35:22 -07:00
integrations	docs: deep audit — registry drift, stale claims, 2-week PR coverage, dashboard screenshot (#40952)	2026-06-07 01:39:06 -07:00
reference	feat(skills): add simplify-code skill — parallel 3-agent code review and cleanup (#41691)	2026-06-07 22:02:41 -07:00
user-guide	fix(gateway): auto-start after container restart via planned-stop marker (#42675) (#43236)	2026-06-10 14:01:34 +10:00
index.mdx	docs: remove --include-desktop install instructions (#39762)	2026-06-05 06:53:58 -07:00
user-stories.mdx	docs(website): add User Stories and Use Cases collage page (#18282)	2026-04-30 23:56:59 -07:00