PR #30136 review item O6: test_container_restart.py used fixed
`time.sleep(8)` calls after `docker restart` to wait for the
cont-init reconciler to finish. Fixed sleeps are slow when the
event happens fast and false-fail when the event happens slow.
Replace with two polling helpers:
* `_wait_for_path(container, path, kind='f' | 'd', deadline_s=...)`
— generic `test -f/-d` poller. Returns True on success, False on
timeout; callers assert with a clear message.
* `_wait_for_reconcile_log_mention(container, profile, ...)` — the
reconciler's per-profile log line is the canonical signal that
the cont-init reconcile has finished for that profile. Poll on
it instead of a sleep that hopes 8 seconds is enough.
The fixture-level setup wait is similarly migrated: it now polls
for `profile=default` in the boot log (every container always
gets a default-slot entry per item I1) and raises a clear timeout
error from the fixture if the container never finishes cont-init —
much better diagnostics than a mid-test KeyError.
The remaining `time.sleep()` calls are all internal interval_s
between probe attempts; no fixed wait points left.
PR #30136 review surfaced two issues, both rooted in the same audit gap:
docker integration tests were running as root, not the unprivileged
`hermes` user (UID 10000) that the runtime actually uses via
`s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to
the s6 control surface worked as root in the tests but was inert in
production.
Fixes:
1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`,
which is root-only readable. For UID 10000 the symlink yields
PermissionError, `resolve()` silently returns the unresolved path,
and `exe.name == "exe"` — so detection always returned False, the
service-manager runtime-registration path was inert, and every
`hermes profile create` / `hermes -p X gateway start` silently
skipped the s6 hook. Replace with `/proc/1/comm` (world-readable)
+ `/run/s6/basedir` (s6-overlay-specific) — both required, fail
closed.
2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/`
{control,lock} to hermes so `s6-svscanctl -a/-an` works without
root. Previously the directory chown stopped at `/run/service`
and the FIFO inside stayed root-owned, so `register_profile_gateway`
from hermes failed at the rescan-trigger step with EACCES — the
wrapper in profiles.py caught the exception and printed a swallowed
warning, so profile creation appeared to succeed while the slot
was rolled back.
Audit changes to flush this class of bug next time:
- Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py`
that default to `-u hermes`. The module docstring explains why and
flags `user="root"` as opt-in only for tests that explicitly need
root (none currently do).
- Refactor every `docker exec` call in tests/docker/ through the new
helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py,
test_container_restart.py, test_s6_profile_gateway_integration.py).
- Add 5 unit tests covering `_s6_running` under various probe states
(both signals present; comm wrong; basedir missing; PermissionError
on /proc/1/comm; missing /proc — non-Linux). The PermissionError
test is the explicit regression guard for the original bug.
Known follow-up: the per-service `supervise/control` FIFO inside each
`/run/service/gateway-<profile>/supervise/` is created root-owned by
s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc
-u/-d/-t` from the hermes user will get EACCES on those. The audit
under `-u hermes` will reveal this in lifecycle tests — surfacing the
issue cleanly so it can be fixed in a focused follow-up (likely via a
small SUID helper or a polling chown loop in cont-init.d). The
detection + svscanctl fixes here are independent and complete on
their own.
Phase 4 of the s6-overlay supervision plan. Activates the Phase 3
S6ServiceManager by hooking it into the profile lifecycle and the
`hermes gateway start/stop/restart` dispatcher, and adds a cont-
init.d-time reconciliation pass that survives `docker restart`.
Task 4.0 — container-boot reconciliation:
/run/service/ is tmpfs, so every `docker restart` wipes every
per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles
invokes hermes_cli.container_boot.reconcile_profile_gateways() on
every boot, which walks $HERMES_HOME/profiles/<name>/, reads each
gateway_state.json, recreates the s6 service slot, and auto-starts
only those whose last state was 'running'. Other states
(stopped, starting, startup_failed, missing) register the slot
in the down state — avoiding crash-loops across restarts for a
gateway that was broken last boot. Per-profile outcome is recorded
to $HERMES_HOME/logs/container-boot.log.
Implementation: hermes_cli/container_boot.py + 12 unit tests.
Profile-marker is SOUL.md, not config.yaml, because `hermes profile
create` only seeds SOUL.md by default (config.yaml comes from
`hermes setup`).
Task 4.1 / 4.2 — profile create/delete hooks:
hermes_cli/profiles.py::create_profile now calls
_maybe_register_gateway_service(<canon>) at the end, which routes
through ServiceManager.register_profile_gateway when running on s6
and no-ops on host backends. delete_profile mirrors with
_maybe_unregister_gateway_service. _allocate_gateway_port produces
a deterministic SHA-256-derived port in [9200, 9800).
Task 4.3 — gateway dispatch + remove rejection arms:
_dispatch_via_service_manager_if_s6(action) intercepts
start/stop/restart at the top of each subcommand and routes them
through S6ServiceManager.{start,stop,restart}. The pre-Phase-4
`elif is_container():` rejection arms are kept as fallback for
pre-s6 containers / unsupported runtimes, but only ever fire when
detect_service_manager() != 's6'. install/uninstall under s6
print informational guidance pointing users at profile create/delete.
Removed the two xfail(strict=True) markers from
tests/docker/test_profile_gateway.py — both tests now pass strictly.
Task 4.4 — status reporting:
get_gateway_runtime_snapshot() reports
Manager: 's6 (container supervisor)' inside an s6 container instead
of 'docker (foreground)'.
Plan-vs-reality drift fixed in this commit:
- Plan's S6ServiceManager._render_run_script used
`gateway start --foreground --port {port}` — invented args; the
real CLI is `gateway run`. Switched accordingly. port arg
retained for API parity but now documented as 'currently ignored'.
- Plan's reconciler keyed on config.yaml; switched to SOUL.md
(config.yaml is created by hermes setup, not by hermes profile
create, so the original gate caught nothing).
- The plan's _dispatch helper used _profile_arg() which returns
'--profile <name>' (i.e. with the flag prefix). Switched to
_profile_suffix() which returns the bare name.
- Architecture B's docker exec doesn't get /command on PATH or
the venv on PATH; Dockerfile's runtime PATH now includes
/opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works
without sourcing the venv.
- stage2-hook now chowns $HERMES_HOME/profiles to hermes on every
boot, not just on the UID-remap path. Without this, files created
by docker-exec-as-root accumulate and the next reconciler run
fails with PermissionError reading SOUL.md.
Test harness:
19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to
passing). 78 unit tests across service_manager + container_boot +
profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck
pass cleanly.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md