mirror of https://github.com/NousResearch/hermes-agent.git (synced 2026-06-23 10:42:00 +00:00)
* fix(gateway): auto-start after container restart via planned-stop marker

On Docker (s6-overlay), the gateway runs as a dynamically-registered s6
service. When the container stops, restarts, or upgrades, s6 sends the
gateway a plain SIGTERM. The shutdown path (_stop_impl) ended with an
unconditional _update_runtime_status("stopped"), persisting
gateway_state=stopped to the volume. container_boot.py reads that state on
the next boot and only auto-starts gateways whose last state was "running"
(_AUTOSTART_STATES), so after a routine `docker compose up
--force-recreate` the gateway stays down and messaging channels silently
go dark, with no error surfaced (issue #42675).
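
The boot-time read described above can be sketched roughly as follows. This is a minimal illustration, not the code in container_boot.py: `AUTOSTART_STATES`, `should_autostart`, and the "missing or corrupt file means no auto-start" policy are assumptions made for the example; only the `gateway_state` key and the "running" value come from the commit message.

```python
import json
from pathlib import Path

# Hypothetical stand-in for _AUTOSTART_STATES; its real contents are not
# shown in this commit.
AUTOSTART_STATES = frozenset({"running"})


def should_autostart(state_file: Path) -> bool:
    """Return True if the persisted state says the gateway should come back."""
    try:
        data = json.loads(state_file.read_text())
    except (OSError, ValueError):
        # Missing or corrupt state file: assumed policy is "do not auto-start".
        return False
    return isinstance(data, dict) and data.get("gateway_state") in AUTOSTART_STATES
```

Under this sketch, a shutdown that persists "stopped" on every SIGTERM makes the function return False after any container restart, which is the symptom reported in #42675.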

The codebase already distinguishes intentional stops from unexpected
signals via the planned-stop marker (write_planned_stop_marker /
consume_planned_stop_marker_for_self): `hermes gateway stop`,
systemd/launchd ExecStop, and Ctrl+C write a marker before signalling,
so the handler classifies them as planned. An unmarked SIGTERM
(container/s6 restart, OOM, bare kill) is signal-initiated.
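
A rough sketch of the marker handshake, assuming the marker is a small file holding the target PID as text. The on-disk format and the function names below (`write_marker`, `consume_marker_for_self`, `classify_shutdown`) are hypothetical; the real helpers are write_planned_stop_marker and consume_planned_stop_marker_for_self, whose signatures are not shown here.

```python
import os
from pathlib import Path
from typing import Optional


def write_marker(marker: Path, pid: int) -> None:
    """Record that `pid` is about to be stopped on purpose (assumed format)."""
    marker.write_text(str(pid))


def consume_marker_for_self(marker: Path, my_pid: Optional[int] = None) -> bool:
    """Return True and delete the marker if it names this process."""
    my_pid = os.getpid() if my_pid is None else my_pid
    try:
        target = int(marker.read_text().strip())
    except (OSError, ValueError):
        return False  # no marker, or unreadable: treat as unmarked
    if target != my_pid:
        return False  # marker was meant for some other process
    marker.unlink(missing_ok=True)
    return True


def classify_shutdown(marker: Path, my_pid: Optional[int] = None) -> str:
    """Classify a received SIGTERM as 'planned' or 'signal-initiated'."""
    return "planned" if consume_marker_for_self(marker, my_pid) else "signal-initiated"
```

The point of keying the marker to a PID is that a stale marker left for a previous process cannot make a later, unrelated SIGTERM look planned.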

This wires that existing classification through to the state persist,
rather than adding unreliable signal-source inference:

- run.py: GatewayRunner._signal_initiated_shutdown, set in
  shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a
  signal-initiated (non-restart) teardown now persists "running" instead
  of "stopped", preserving the operator's run-intent and overwriting the
  mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot.
  Operator stops and restarts persist "stopped" as before.
- service_manager.py: S6ServiceManager.stop() now writes the planned-stop
  marker for the supervised PID (read from s6-svstat) before `s6-svc -d`,
  so an in-container `hermes gateway stop` is correctly classified as
  intentional (parity with the systemd/launchd/host stop paths, which
  already mark). Best-effort: a marker-write failure falls back to the
  safe signal-initiated path.
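
The persist decision described in the run.py bullet reduces to a small truth table. The function below is an illustrative reduction only; `state_to_persist` and its parameter names are made up for the example and do not reflect how _stop_impl is actually structured.

```python
def state_to_persist(*, signal_initiated: bool, restarting: bool) -> str:
    """Pick the gateway_state value to write at the end of shutdown.

    Only an unmarked signal that is not part of a restart keeps the
    operator's run-intent; every other teardown records "stopped".
    """
    if signal_initiated and not restarting:
        return "running"
    return "stopped"
```

For example, `state_to_persist(signal_initiated=True, restarting=False)` yields "running", while an operator stop (`signal_initiated=False`) yields "stopped".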

Tests: shutdown persist-decision table (signal→running, operator→stopped,
restart→stopped), s6 stop marker write + svstat PID parse + failure
tolerance. The signal→running and s6-marker tests fail without the
respective source change. Verified end-to-end against a container built
from this branch: an unmarked SIGTERM to the live gateway leaves
gateway_state=running (the shutdown-context log confirms the signal path);
the existing real container-restart suite is still green.
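
The svstat PID parse and the best-effort marker write might look something like this. It is a sketch under assumptions: it parses the default human-readable `s6-svstat` line (for example `up (pid 1234) 56 seconds`), and `parse_svstat_pid` and `mark_then_stop` are hypothetical names, with the marker writer and the stop action passed in as callables rather than taken from S6ServiceManager.

```python
import re
from typing import Callable, Optional

# Matches the start of a default s6-svstat line for a running service.
# The closing parenthesis is deliberately not required, since some s6
# versions print extra fields after the pid.
_UP_PID_RE = re.compile(r"^up \(pid (\d+)\b")


def parse_svstat_pid(output: str) -> Optional[int]:
    """Extract the supervised PID, or None if the service is not up."""
    match = _UP_PID_RE.match(output.strip())
    return int(match.group(1)) if match else None


def mark_then_stop(
    svstat_output: str,
    write_marker: Callable[[int], None],
    stop: Callable[[], None],
) -> bool:
    """Write the planned-stop marker if possible, then always stop.

    Returns True if the marker was written. Any failure is swallowed so
    the stop still happens; the shutdown is then treated as
    signal-initiated, which is the safe fallback.
    """
    marked = False
    try:
        pid = parse_svstat_pid(svstat_output)
        if pid is not None:
            write_marker(pid)
            marked = True
    except Exception:
        marked = False
    stop()
    return marked
```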

* docs(docker): clarify gateway autostart distinguishes operator-stop from container-kill

The per-profile-supervision section described the autostart-across-restart
contract as "running gateways come back, stopped stay stopped" without
spelling out what records 'stopped'. That contract was the source of the
confusion in #42675: users expected a restart to bring the gateway back and
it didn't. With the write-side fix, only an explicit `hermes gateway stop`
records 'stopped'; container/s6 restart SIGTERMs (including image upgrades
and unexpected exits) leave the state 'running' so the gateway auto-starts.
Make that distinction explicit in both the multi-profile and
per-profile-supervision sections.

* test(docker): real-restart autostart E2E for #42675

Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp:
a live s6-supervised gateway is killed by an actual `docker restart`
SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must
auto-start on the next boot. Exercises the WRITE side of the fix that the
existing stamp-based tests bypass.

Verified to FAIL against an origin/main image (reconciler logs
prior_state=stopped action=registered, the #42675 bug) and PASS against
the fixed image (prior_state=running action=started).
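
The two reconciler outcomes quoted above differ only in their key=value tokens, so a check like the following could tell them apart. This is a sketch: the full layout of the reconciler log line is not shown in the commit, so the sample line in the usage note is hypothetical, and `parse_kv_tokens` and `autostarted` are illustrative names.

```python
from typing import Dict


def parse_kv_tokens(line: str) -> Dict[str, str]:
    """Collect whitespace-separated key=value tokens from a log line."""
    fields: Dict[str, str] = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


def autostarted(line: str, profile: str) -> bool:
    """True if the line reports that `profile` was auto-started."""
    fields = parse_kv_tokens(line)
    return fields.get("profile") == profile and fields.get("action") == "started"
```

With a hypothetical line such as `reconcile profile=live prior_state=running action=started`, `autostarted(line, "live")` is true; with `prior_state=stopped action=registered` it is false.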
parent b4170f3ac2
commit 5cf6e28a2f
7 changed files with 363 additions and 3 deletions

@@ -250,3 +250,79 @@ def test_stale_gateway_pid_cleaned_up_on_restart(restart_container: str) -> None
    assert r.returncode != 0, "stale gateway.pid survived restart"
    r = _sh(container, "test -f /opt/data/profiles/ghost/processes.json")
    assert r.returncode != 0, "stale processes.json survived restart"


def test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp(
    restart_container: str,
) -> None:
    """End-to-end guard for issue #42675.

    The other tests in this module stamp gateway_state.json directly to
    exercise the reconciler's READ side. This one exercises the WRITE
    side: a real, live gateway is killed by the container/s6 SIGTERM that
    `docker restart` sends (no manual state stamp) and must come back up
    on the next boot.

    Before the fix, the shutdown handler unconditionally persisted
    gateway_state=stopped on that SIGTERM, so the reconciler saw 'stopped'
    and registered the slot DOWN, and the gateway silently stayed dark after
    every container restart. The fix classifies an unmarked SIGTERM as
    signal-initiated and persists 'running' instead, so auto-start works.
    """
    container = restart_container

    _exec(container, "hermes", "profile", "create", "live").check_returncode()
    r = _exec(container, "hermes", "-p", "live", "gateway", "start", timeout=60)
    assert r.returncode == 0, f"gateway start failed: {r.stderr}"

    # Wait for the gateway to actually come up under supervision AND write
    # its own gateway_state=running (we do NOT stamp it ourselves).
    deadline = time.monotonic() + 20.0
    while time.monotonic() < deadline:
        r = _sh(container, "/command/s6-svstat /run/service/gateway-live")
        if r.returncode == 0 and "up " in r.stdout:
            break
        time.sleep(0.5)
    assert "up " in r.stdout, f"gateway never came up pre-restart: {r.stdout!r}"

    # Confirm the gateway persisted its own 'running' state (sanity: we're
    # testing the real write path, not a stamped fixture).
    deadline = time.monotonic() + 15.0
    state = ""
    while time.monotonic() < deadline:
        r = _sh(
            container,
            "cat /opt/data/profiles/live/gateway_state.json 2>/dev/null",
        )
        if r.returncode == 0 and '"gateway_state"' in r.stdout:
            state = r.stdout
            break
        time.sleep(0.5)
    assert '"running"' in state, (
        f"gateway never persisted running state pre-restart: {state!r}"
    )

    # Real restart: Docker sends SIGTERM to PID 1; s6 propagates it to the
    # supervised gateway. No planned-stop marker is written (this is not an
    # operator `hermes gateway stop`), so the shutdown is signal-initiated.
    _docker("restart", container, timeout=60).check_returncode()

    log = _wait_for_reconcile_log_mention(container, "live", deadline_s=30.0)
    assert "profile=live" in log, (
        f"reconciler never logged live after restart: {log!r}"
    )
    # The crux: the reconciler must AUTO-START it, not register it down.
    assert "action=started" in log, (
        f"gateway did NOT auto-start after a real restart (issue #42675 "
        f"regression): {log!r}"
    )

    # Slot recreated, and NO down marker (we expect auto-start).
    assert _wait_for_path(
        container, "/run/service/gateway-live", kind="d", deadline_s=10.0,
    ), "slot not recreated after restart"
    r = _sh(container, "test -f /run/service/gateway-live/down")
    assert r.returncode != 0, (
        "down marker present despite a live gateway being restarted: "
        "the signal-initiated shutdown wrongly persisted 'stopped' (#42675)"
    )
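
The test above repeats the same poll-until-deadline loop twice. A generic helper in that style could be written as below; this is a hypothetical sketch (`wait_until` is not one of the module's helpers such as _wait_for_path, whose implementations are not shown), with the clock and sleep injectable so the helper can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    deadline_s: float,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `predicate` until it returns True or `deadline_s` elapses.

    The predicate is always checked at least once, and is checked before
    the deadline test on every pass, so a condition that becomes true
    right at the deadline is still observed.
    """
    deadline = clock() + deadline_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Using a monotonic clock, as the test does, keeps the timeout immune to wall-clock adjustments inside the container host.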