fix(gateway): auto-start after container restart via planned-stop marker (#42675) (#43236)

* fix(gateway): auto-start after container restart via planned-stop marker

On Docker (s6-overlay), the gateway runs as a dynamically-registered s6
service. When the container stops/restarts/upgrades, s6 sends the gateway
a plain SIGTERM. The shutdown path (_stop_impl) ended with an
unconditional _update_runtime_status("stopped"), persisting
gateway_state=stopped to the volume. container_boot.py reads that on the
next boot and only auto-starts gateways whose last state was "running"
(_AUTOSTART_STATES) — so after a routine `docker compose up
--force-recreate` the gateway stays down and messaging channels silently
go dark, with no error surfaced (issue #42675).
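
The failure mode can be sketched in a few lines. This is an illustration only, not the code in container_boot.py: the function name `should_autostart` is hypothetical, and the contents of `_AUTOSTART_STATES` are assumed from the description above (only "running" qualifies).

```python
from typing import Optional

# Assumed contents: the commit message says only "running" triggers auto-start.
_AUTOSTART_STATES = frozenset({"running"})


def should_autostart(persisted_state: Optional[str]) -> bool:
    """Return True when the last persisted gateway state asks for a start on boot.

    Hypothetical helper; the real reconciler in container_boot.py may differ.
    """
    return persisted_state in _AUTOSTART_STATES


# Before the fix, a container restart's SIGTERM persisted "stopped",
# so the next boot left the gateway down.
before_fix = should_autostart("stopped")
# After the fix, the same SIGTERM leaves "running" on the volume.
after_fix = should_autostart("running")
```

Because the boot-time reader only sees the persisted string, the bug had to be fixed on the write side: the reader cannot tell an operator stop from a container kill after the fact.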

The codebase already distinguishes intentional stops from unexpected
signals via the planned-stop marker (write_planned_stop_marker /
consume_planned_stop_marker_for_self): `hermes gateway stop`,
systemd/launchd ExecStop, and Ctrl+C write a marker before signalling,
so the handler classifies them as planned. An unmarked SIGTERM
(container/s6 restart, OOM, bare kill) is signal-initiated.
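
A rough sketch of how such a marker handshake might work, assuming the marker is a small JSON file naming the target PID. The file location, the JSON layout, and the function names `write_marker`, `consume_marker_for_self`, and `classify_shutdown` are all hypothetical; the real write_planned_stop_marker / consume_planned_stop_marker_for_self may be implemented differently.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical marker location; the commit does not say where the real one lives.
MARKER = Path(tempfile.gettempdir()) / "planned_stop_marker_sketch.json"


def write_marker(target_pid: int, marker: Path = MARKER) -> None:
    """Record that an upcoming stop of target_pid is intentional."""
    marker.write_text(json.dumps({"pid": target_pid}))


def consume_marker_for_self(marker: Path = MARKER) -> bool:
    """Return True (and delete the marker) only if it names this process."""
    try:
        data = json.loads(marker.read_text())
    except (OSError, ValueError):
        return False  # missing or unreadable marker: treat as unplanned
    if not isinstance(data, dict) or data.get("pid") != os.getpid():
        return False  # marker is for another process; leave it in place
    marker.unlink(missing_ok=True)
    return True


def classify_shutdown(marker: Path = MARKER) -> str:
    """Classify a received termination signal as planned or signal-initiated."""
    return "planned" if consume_marker_for_self(marker) else "signal-initiated"
```

In this sketch a stop command would call `write_marker(gateway_pid)` before sending SIGTERM, and the gateway's signal handler would call `classify_shutdown()`. A signal that arrives with no matching marker falls through to "signal-initiated", which is the safe default the commit relies on.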

This wires that existing classification through to the state persist,
rather than adding unreliable signal-source inference:

- run.py: GatewayRunner._signal_initiated_shutdown, set in
  shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a
  signal-initiated (non-restart) teardown now persists "running" instead
  of "stopped" — preserving the operator's run-intent and overwriting the
  mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot.
  Operator stops and restarts persist "stopped" as before.

- service_manager.py: S6ServiceManager.stop() now writes the planned-stop
  marker for the supervised PID (read from s6-svstat) before `s6-svc -d`,
  so an in-container `hermes gateway stop` is correctly classified as
  intentional (parity with the systemd/launchd/host stop paths, which
  already mark). Best-effort: a marker-write failure falls back to the
  safe signal-initiated path.
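
The best-effort ordering in the s6 stop path (read PID, try to mark, always stop) might look roughly like the following. This is a sketch under assumptions: it assumes s6-svstat prints something like `up (pid 1234) 56 seconds`, and `parse_svstat_pid` and `mark_then_stop` are hypothetical names, not the methods on S6ServiceManager. The marker writer and the `s6-svc -d` call are passed in as callables so the sketch stays self-contained.

```python
import re
from typing import Callable, Optional

# Assumed s6-svstat shape: "up (pid 1234) 56 seconds". Matching on "pid <n>"
# also tolerates extra fields inside the parentheses.
_PID_RE = re.compile(r"\bpid (\d+)")


def parse_svstat_pid(svstat_output: str) -> Optional[int]:
    """Extract the supervised PID from s6-svstat text, or None if not up."""
    if not svstat_output.startswith("up "):
        return None
    match = _PID_RE.search(svstat_output)
    return int(match.group(1)) if match else None


def mark_then_stop(
    svstat_output: str,
    write_marker: Callable[[int], None],
    send_down: Callable[[], None],
) -> bool:
    """Try to write the planned-stop marker, then always request the stop.

    Returns True if the marker was written. On failure the stop still
    happens and the gateway simply sees an unmarked (signal-initiated) SIGTERM.
    """
    marked = False
    pid = parse_svstat_pid(svstat_output)
    if pid is not None:
        try:
            write_marker(pid)
            marked = True
        except OSError:
            pass  # best effort: fall back to the signal-initiated path
    send_down()
    return marked
```

The design point is that a marker failure can only make the stop look unplanned, which leaves the state "running" and brings the gateway back on the next boot, rather than leaving it silently down.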

Tests: shutdown persist-decision table (signal→running, operator→stopped,
restart→stopped), s6 stop marker write + svstat PID parse + failure
tolerance. The signal→running and s6-marker tests fail without the
respective source change. Verified end-to-end against a container built
from this branch: an unmarked SIGTERM to the live gateway leaves
gateway_state=running (shutdown-context log confirms signal path);
existing real container-restart suite still green.
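
The persist-decision table described above can be written as a small pure function plus its cases. `state_to_persist` and its parameter names are hypothetical; in the commit the decision lives inside _stop_impl and reads GatewayRunner._signal_initiated_shutdown.

```python
def state_to_persist(signal_initiated: bool, restarting: bool) -> str:
    """Pick the gateway_state to write at the end of shutdown.

    Hypothetical stand-in for the branch added to _stop_impl.
    """
    if signal_initiated and not restarting:
        # Unmarked SIGTERM (container/s6 restart, OOM, bare kill):
        # keep the operator's run-intent so the next boot auto-starts.
        return "running"
    # Operator stops and restarts persist "stopped" as before.
    return "stopped"


# (signal_initiated, restarting) -> expected persisted state
DECISION_TABLE = {
    (True, False): "running",   # signal
    (False, False): "stopped",  # operator stop
    (False, True): "stopped",   # restart
}

results = {case: state_to_persist(*case) for case in DECISION_TABLE}
```

Comparing `results` with `DECISION_TABLE` reproduces the three rows named in the test description.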

* docs(docker): clarify gateway autostart distinguishes operator-stop from container-kill

The per-profile-supervision section described the autostart-across-restart
contract as "running gateways come back, stopped stay stopped" without
spelling out what records 'stopped'. That contract was the source of
#42675 confusion: users expected a restart to bring the gateway back and
it didn't. With the write-side fix, only an explicit `hermes gateway stop`
records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and
unexpected exits) leave the state 'running' so the gateway auto-starts.
Make that distinction explicit in both the multi-profile and
per-profile-supervision sections.

* test(docker): real-restart autostart E2E for #42675

Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp:
a live s6-supervised gateway is killed by an actual `docker restart`
SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must
auto-start on the next boot. Exercises the WRITE side of the fix that the
existing stamp-based tests bypass.

Verified to FAIL against an origin/main image (reconciler logs
prior_state=stopped action=registered — the #42675 bug) and PASS against
the fixed image (prior_state=running action=started).
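
For reading the two reconciler outcomes programmatically, a small key=value parser is enough. This is a sketch: the full log line format is not shown in the commit, so the sample lines below are assembled from the fragments quoted above, and `parse_fields` is a hypothetical helper.

```python
from typing import Dict


def parse_fields(line: str) -> Dict[str, str]:
    """Collect key=value tokens from a whitespace-separated log line."""
    fields: Dict[str, str] = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


# Illustrative lines only; the real log lines may carry more fields.
buggy = parse_fields("profile=live prior_state=stopped action=registered")
fixed = parse_fields("profile=live prior_state=running action=started")
```

With these, `buggy["action"]` is `"registered"` (the #42675 behaviour) and `fixed["action"]` is `"started"`.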
Ben Barclay 2026-06-10 14:01:34 +10:00 committed by GitHub
parent b4170f3ac2
commit 5cf6e28a2f
GPG key ID: B5690EEEBB952194
7 changed files with 363 additions and 3 deletions


@@ -250,3 +250,79 @@ def test_stale_gateway_pid_cleaned_up_on_restart(restart_container: str) -> None
    assert r.returncode != 0, "stale gateway.pid survived restart"
    r = _sh(container, "test -f /opt/data/profiles/ghost/processes.json")
    assert r.returncode != 0, "stale processes.json survived restart"


def test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp(
    restart_container: str,
) -> None:
    """End-to-end guard for issue #42675.

    The other tests in this module stamp gateway_state.json directly to
    exercise the reconciler's READ side. This one exercises the WRITE
    side: a real, live gateway is killed by the container/s6 SIGTERM that
    `docker restart` sends (no manual state stamp) and must come back up
    on the next boot.

    Before the fix, the shutdown handler unconditionally persisted
    gateway_state=stopped on that SIGTERM, so the reconciler saw 'stopped'
    and registered the slot DOWN; the gateway silently stayed dark after
    every container restart. The fix classifies an unmarked SIGTERM as
    signal-initiated and persists 'running' instead, so auto-start works.
    """
    container = restart_container
    _exec(container, "hermes", "profile", "create", "live").check_returncode()
    r = _exec(container, "hermes", "-p", "live", "gateway", "start", timeout=60)
    assert r.returncode == 0, f"gateway start failed: {r.stderr}"

    # Wait for the gateway to actually come up under supervision AND write
    # its own gateway_state=running (we do NOT stamp it ourselves).
    deadline = time.monotonic() + 20.0
    while time.monotonic() < deadline:
        r = _sh(container, "/command/s6-svstat /run/service/gateway-live")
        if r.returncode == 0 and "up " in r.stdout:
            break
        time.sleep(0.5)
    assert "up " in r.stdout, f"gateway never came up pre-restart: {r.stdout!r}"

    # Confirm the gateway persisted its own 'running' state (sanity: we're
    # testing the real write path, not a stamped fixture).
    deadline = time.monotonic() + 15.0
    state = ""
    while time.monotonic() < deadline:
        r = _sh(
            container,
            "cat /opt/data/profiles/live/gateway_state.json 2>/dev/null",
        )
        if r.returncode == 0 and '"gateway_state"' in r.stdout:
            state = r.stdout
            break
        time.sleep(0.5)
    assert '"running"' in state, (
        f"gateway never persisted running state pre-restart: {state!r}"
    )

    # Real restart — Docker sends SIGTERM to PID 1; s6 propagates it to the
    # supervised gateway. No planned-stop marker is written (this is not an
    # operator `hermes gateway stop`), so the shutdown is signal-initiated.
    _docker("restart", container, timeout=60).check_returncode()

    log = _wait_for_reconcile_log_mention(container, "live", deadline_s=30.0)
    assert "profile=live" in log, (
        f"reconciler never logged live after restart: {log!r}"
    )

    # The crux: the reconciler must AUTO-START it, not register it down.
    assert "action=started" in log, (
        f"gateway did NOT auto-start after a real restart (issue #42675 "
        f"regression): {log!r}"
    )

    # Slot recreated, and NO down marker (we expect auto-start).
    assert _wait_for_path(
        container, "/run/service/gateway-live", kind="d", deadline_s=10.0,
    ), "slot not recreated after restart"
    r = _sh(container, "test -f /run/service/gateway-live/down")
    assert r.returncode != 0, (
        "down marker present despite a live gateway being restarted — "
        "the signal-initiated shutdown wrongly persisted 'stopped' (#42675)"
    )
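
The test above repeats the same deadline-polling loop twice. One possible way to factor it out is sketched below. `wait_until` is a hypothetical helper, not something this commit adds, and the module's existing `_wait_for_path` helper may already cover similar ground.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 0.5,
) -> bool:
    """Poll predicate until it returns True or timeout_s elapses.

    Uses time.monotonic() so wall-clock adjustments cannot shorten or
    extend the wait. The predicate is always checked at least once.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A call such as `wait_until(lambda: "up " in _sh(container, "/command/s6-svstat /run/service/gateway-live").stdout, timeout_s=20.0)` would then stand in for the first loop, at the cost of losing the last command result for the assertion message.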