hermes-agent/tests/docker
Ben Barclay 5cf6e28a2f
fix(gateway): auto-start after container restart via planned-stop marker (#42675) (#43236)
* fix(gateway): auto-start after container restart via planned-stop marker

On Docker (s6-overlay), the gateway runs as a dynamically-registered s6
service. When the container stops/restarts/upgrades, s6 sends the gateway
a plain SIGTERM. The shutdown path (_stop_impl) ended with an
unconditional _update_runtime_status("stopped"), persisting
gateway_state=stopped to the volume. container_boot.py reads that on the
next boot and only auto-starts gateways whose last state was "running"
(_AUTOSTART_STATES) — so after a routine `docker compose up
--force-recreate` the gateway stays down and messaging channels silently
go dark, with no error surfaced (issue #42675).

The codebase already distinguishes intentional stops from unexpected
signals via the planned-stop marker (write_planned_stop_marker /
consume_planned_stop_marker_for_self): `hermes gateway stop`,
systemd/launchd ExecStop, and Ctrl+C write a marker before signalling,
so the handler classifies them as planned. An unmarked SIGTERM
(container/s6 restart, OOM, bare kill) is signal-initiated.

This wires that existing classification through to the state persist,
rather than adding unreliable signal-source inference:

- run.py: GatewayRunner._signal_initiated_shutdown, set in
  shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a
  signal-initiated (non-restart) teardown now persists "running" instead
  of "stopped" — preserving the operator's run-intent and overwriting the
  mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot.
  Operator stops and restarts persist "stopped" as before.

- service_manager.py: S6ServiceManager.stop() now writes the planned-stop
  marker for the supervised PID (read from s6-svstat) before `s6-svc -d`,
  so an in-container `hermes gateway stop` is correctly classified as
  intentional (parity with the systemd/launchd/host stop paths, which
  already mark). Best-effort: a marker-write failure falls back to the
  safe signal-initiated path.

Tests: shutdown persist-decision table (signal→running, operator→stopped,
restart→stopped), s6 stop marker write + svstat PID parse + failure
tolerance. The signal→running and s6-marker tests fail without the
respective source change. Verified end-to-end against a container built
from this branch: an unmarked SIGTERM to the live gateway leaves
gateway_state=running (shutdown-context log confirms signal path);
existing real container-restart suite still green.

* docs(docker): clarify gateway autostart distinguishes operator-stop from container-kill

The per-profile-supervision section described the autostart-across-restart
contract as "running gateways come back, stopped stay stopped" without
spelling out what records 'stopped'. That contract was the source of
#42675 confusion: users expected a restart to bring the gateway back and
it didn't. With the write-side fix, only an explicit `hermes gateway stop`
records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and
unexpected exits) leave the state 'running' so the gateway auto-starts.
Make that distinction explicit in both the multi-profile and
per-profile-supervision sections.

* test(docker): real-restart autostart E2E for #42675

Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp:
a live s6-supervised gateway is killed by an actual `docker restart`
SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must
auto-start on the next boot. Exercises the WRITE side of the fix that the
existing stamp-based tests bypass.

Verified to FAIL against an origin/main image (reconciler logs
prior_state=stopped action=registered — the #42675 bug) and PASS against
the fixed image (prior_state=running action=started).
2026-06-10 14:01:34 +10:00
..
__init__.py test(docker): add conftest fixtures for docker harness 2026-05-24 18:05:14 -07:00
conftest.py fix(service_manager): s6 detection works for unprivileged hermes user 2026-05-24 18:05:33 -07:00
test_container_restart.py fix(gateway): auto-start after container restart via planned-stop marker (#42675) (#43236) 2026-06-10 14:01:34 +10:00
test_dashboard.py feat(dashboard): always enable embedded chat; remove dashboard --tui flag 2026-06-04 03:03:35 -07:00
test_docker_exec_privilege_drop.py fix(docker): drop docker exec to hermes uid before invoking the CLI 2026-05-28 13:30:36 +10:00
test_dump_build_sha.py fix(docker): bake build-time git SHA into the image 2026-05-28 15:14:05 +10:00
test_gateway_run_supervised.py fix(docker): tee supervised gateway stdout to docker logs 2026-05-28 13:18:41 +10:00
test_main_invocation.py test(docker): lock baseline behavior for Phase 0 harness 2026-05-24 18:05:14 -07:00
test_profile_gateway.py test(docker): fix svstat 'want up' assertion in profile-gateway lifecycle test 2026-05-25 12:25:06 +10:00
test_s6_profile_gateway_integration.py fix(service_manager): rip out dead port parameter 2026-05-24 18:05:33 -07:00
test_tui_passthrough.py test(docker): make tty-passthrough probe robust to container boot-log noise (#38665) 2026-06-04 13:19:13 +10:00
test_tui_prebuilt_bundle.py fix(docker): point TUI launcher at prebuilt bundle via HERMES_TUI_DIR (#37923) 2026-06-03 15:30:45 +10:00
test_zombie_reaping.py fix(service_manager): s6 detection works for unprivileged hermes user 2026-05-24 18:05:33 -07:00