feat(docker): per-profile s6 supervision + container-restart reconciliation

Phase 4 of the s6-overlay supervision plan. Activates the Phase 3
S6ServiceManager by hooking it into the profile lifecycle and the
`hermes gateway start/stop/restart` dispatcher, and adds a cont-
init.d-time reconciliation pass that survives `docker restart`.

Task 4.0 — container-boot reconciliation:
  /run/service/ is tmpfs, so every `docker restart` wipes every
  per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles
  invokes hermes_cli.container_boot.reconcile_profile_gateways() on
  every boot, which walks $HERMES_HOME/profiles/<name>/, reads each
  gateway_state.json, recreates the s6 service slot, and auto-starts
  only those whose last state was 'running'. Other states
  (stopped, starting, startup_failed, missing) register the slot
  in the down state — avoiding crash-loops across restarts for a
  gateway that was broken last boot. Per-profile outcome is recorded
  to $HERMES_HOME/logs/container-boot.log.

  Implementation: hermes_cli/container_boot.py + 12 unit tests.
  Profile-marker is SOUL.md, not config.yaml, because `hermes profile
  create` only seeds SOUL.md by default (config.yaml comes from
  `hermes setup`).

Task 4.1 / 4.2 — profile create/delete hooks:
  hermes_cli/profiles.py::create_profile now calls
  _maybe_register_gateway_service(<canon>) at the end, which routes
  through ServiceManager.register_profile_gateway when running on s6
  and no-ops on host backends. delete_profile mirrors with
  _maybe_unregister_gateway_service. _allocate_gateway_port produces
  a deterministic SHA-256-derived port in [9200, 9800).

Task 4.3 — gateway dispatch + remove rejection arms:
  _dispatch_via_service_manager_if_s6(action) intercepts
  start/stop/restart at the top of each subcommand and routes them
  through S6ServiceManager.{start,stop,restart}. The pre-Phase-4
  `elif is_container():` rejection arms are kept as fallback for
  pre-s6 containers / unsupported runtimes, but only ever fire when
  detect_service_manager() != 's6'. install/uninstall under s6
  print informational guidance pointing users at profile create/delete.

  Removed the two xfail(strict=True) markers from
  tests/docker/test_profile_gateway.py — both tests now pass strictly.

Task 4.4 — status reporting:
  get_gateway_runtime_snapshot() reports
  Manager: 's6 (container supervisor)' inside an s6 container instead
  of 'docker (foreground)'.

Plan-vs-reality drift fixed in this commit:
  - Plan's S6ServiceManager._render_run_script used
    `gateway start --foreground --port {port}` — invented args; the
    real CLI is `gateway run`. Switched accordingly. port arg
    retained for API parity but now documented as 'currently ignored'.
  - Plan's reconciler keyed on config.yaml; switched to SOUL.md
    (config.yaml is created by hermes setup, not by hermes profile
    create, so the original gate caught nothing).
  - The plan's _dispatch helper used _profile_arg() which returns
    '--profile <name>' (i.e. with the flag prefix). Switched to
    _profile_suffix() which returns the bare name.
  - Architecture B's docker exec doesn't get /command on PATH or
    the venv on PATH; Dockerfile's runtime PATH now includes
    /opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works
    without sourcing the venv.
  - stage2-hook now chowns $HERMES_HOME/profiles to hermes on every
    boot, not just on the UID-remap path. Without this, files created
    by docker-exec-as-root accumulate and the next reconciler run
    fails with PermissionError reading SOUL.md.

Test harness:
  19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to
  passing). 78 unit tests across service_manager + container_boot +
  profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck
  pass cleanly.

Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
This commit is contained in:
Ben 2026-05-21 16:56:51 +10:00 committed by teknium1
parent 0abf661f71
commit 2afefc501c
No known key found for this signature in database
13 changed files with 1217 additions and 39 deletions

View file

@ -0,0 +1,30 @@
#!/command/with-contenv sh
# shellcheck shell=sh
# Container-boot reconciliation of per-profile gateway s6 services.
#
# Runs as root after 01-hermes-setup (the stage2 hook) has chowned
# the volume and seeded $HERMES_HOME, but before s6-rc starts user
# services. /etc/cont-init.d/* scripts run in lexicographic order,
# so the `02-` prefix guarantees ordering.
#
# Service directories under /run/service/ live on tmpfs and are
# wiped on every container restart. Profile directories under
# $HERMES_HOME/profiles/ live on the persistent VOLUME. This script
# walks the persistent profiles, recreates the s6 service slots,
# and auto-starts only those whose last recorded state was
# `running` — see hermes_cli/container_boot.py.
#
# Phase 4 also needs hermes-user writes to /run/service/ (so the
# profile create/delete hooks can register/unregister at runtime),
# so we chown the scandir before invoking the reconciler. The
# .s6-svscan/ subdir stays root-owned; only sibling directories
# (gateway-<profile>/) need to be hermes-writable.
set -e
# Make the dynamic scandir hermes-writable. The directory itself
# starts root-owned by s6-overlay; we leave .s6-svscan/ alone since
# only s6 itself writes there.
chown hermes:hermes /run/service 2>/dev/null || true
exec s6-setuidgid hermes /opt/hermes/.venv/bin/python -m hermes_cli.container_boot

View file

@ -53,6 +53,17 @@ if [ "$needs_chown" = true ]; then
echo "[stage2] Warning: chown .venv failed (rootless container?) — continuing"
fi
# Always reset ownership of $HERMES_HOME/profiles to hermes on every
# boot. Profile dirs and files can land owned by root when commands
# are invoked via `docker exec <container> hermes …` (which defaults
# to root unless `-u` is passed), and that breaks the cont-init
# reconciler (02-reconcile-profiles) which runs as hermes and walks
# the profiles dir. Idempotent; skipped on rootless containers where
# chown would fail.
if [ -d "$HERMES_HOME/profiles" ]; then
chown -R hermes:hermes "$HERMES_HOME/profiles" 2>/dev/null || true
fi
# --- config.yaml permissions ---
# Ensure config.yaml is readable by the hermes runtime user even if it
# was edited on the host after initial ownership setup.