docs(s6): document container supervision; doctor + skill + user-guide updates

Phase 5 of the s6-overlay supervision plan. Documentation + small diagnostic cleanups; no behavior changes. website/docs/user-guide/docker.md: - Replace the old 'entrypoint script does the bootstrap' section with the s6-overlay boot flow (cont-init.d/01-hermes-setup, cont-init.d/02-reconcile-profiles, static main-hermes + dashboard services, ENTRYPOINT-as-main-program pattern). - Add a 'Per-profile gateway supervision' subsection covering the new lifecycle commands, restart semantics, log persistence, and 'Manager: s6 (container supervisor)' status reporting. - Add 'Breaking change vs. pre-s6 images' callout naming the /init ENTRYPOINT and pointing affected wrappers at the pin workaround. website/docs/user-guide/profiles.md: - Add a note under 'Persistent services' pointing container users at the docker.md section explaining s6 supervision inside the image. Host-side systemd/launchd documentation is unchanged. skills/software-development/hermes-s6-container-supervision/SKILL.md: - New maintainer skill covering the supervision-tree map, file layout, the Architecture B rationale (cont-init.d args + halt exit-code propagation), quick recipes, and the 8 pitfalls we hit while implementing the plan (PATH-without-/command, root-owned profile dirs, SOUL.md as marker, the '143' anti-pattern, etc.). hermes_cli/doctor.py: - _check_gateway_service_linger skips on s6 (the linger concept doesn't apply inside the container). - New _check_s6_supervision section reports main-hermes/dashboard state and per-profile-gateway count (registered vs supervised up), only inside the s6 container. Host doctor output unchanged. - External Tools / Docker check no longer emits a 'docker not found' warning inside the container; prints an explanatory info line instead. Still respects an explicit TERMINAL_ENV=docker (in case the user mounted /var/run/docker.sock). hermes_cli/gateway.py: - Document _container_systemd_operational more precisely: it's NOT for our Hermes Docker image (s6-overlay handles that via detect_service_manager() == 's6'). It still covers systemd-nspawn / k8s-with-systemd-init cases, so leaving it in place is correct; the docstring just makes that explicit. Test harness (verification, no test changes in this commit): 19 passed, 0 xfailed. 66 service-manager / container-boot / profiles-s6-hooks / gateway-s6-dispatch unit tests still green. 61 doctor tests still green. Hadolint + shellcheck clean. Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
2026-06-05 07:41:39 +00:00 · 2026-05-21 17:05:32 +10:00 · 2026-05-21 17:05:32 +10:00 · 7d07dd60a8
commit 7d07dd60a8
parent 57c6e29666
5 changed files with 314 additions and 13 deletions
--- a/website/docs/user-guide/docker.md
+++ b/website/docs/user-guide/docker.md
@ -261,24 +261,51 @@ The official image is based on `debian:13.4` and includes:
 - Python 3 with all Hermes dependencies (`uv pip install -e ".[all]"`)
 - Node.js + npm (for browser automation and WhatsApp bridge)
 - Playwright with Chromium (`npx playwright install --with-deps chromium --only-shell`)
- ripgrep, ffmpeg, git, and tini as system utilities
+- ripgrep, ffmpeg, git, and `xz-utils` as system utilities
 - **`docker-cli`** — so agents running inside the container can drive the host's Docker daemon (bind-mount `/var/run/docker.sock` to opt in) for `docker build`, `docker run`, container inspection, etc.
 - **`openssh-client`** — enables the [SSH terminal backend](/docs/user-guide/configuration#ssh-backend) from inside the container. The SSH backend shells out to the system `ssh` binary; without this, it failed silently in containerized installs.
 - The WhatsApp bridge (`scripts/whatsapp-bridge/`)
+- **[`s6-overlay`](https://github.com/just-containers/s6-overlay) v3** as PID 1 (replaces the older `tini`) — supervises the dashboard and per-profile gateways with auto-restart on crash, reaps zombie subprocesses, and forwards signals.

-The entrypoint script (`docker/entrypoint.sh`) bootstraps the data volume on first run:
- Creates the directory structure (`sessions/`, `memories/`, `skills/`, etc.)
- Copies `.env.example` → `.env` if no `.env` exists
- Copies default `config.yaml` if missing
- Copies default `SOUL.md` if missing
- Syncs bundled skills using a manifest-based approach (preserves user edits)
- Optionally launches `hermes dashboard` as a background side-process when `HERMES_DASHBOARD=1` (see [Running the dashboard](#running-the-dashboard))
- Then runs `hermes` with whatever arguments you pass
+The container's `ENTRYPOINT` is s6-overlay's `/init`. On boot it:
+1. Runs `/etc/cont-init.d/01-hermes-setup` (= `docker/stage2-hook.sh`) as root: optional UID/GID remap, fixes volume ownership, seeds `.env` / `config.yaml` / `SOUL.md` on first boot, syncs bundled skills.
+2. Runs `/etc/cont-init.d/02-reconcile-profiles` (= `hermes_cli.container_boot`): walks `$HERMES_HOME/profiles/<name>/`, recreates the per-profile gateway s6 service slot under `/run/service/gateway-<profile>/`, and auto-starts only those whose last recorded state was `running` (see [Per-profile gateway supervision](#per-profile-gateway-supervision)).
+3. Starts the static `main-hermes` and `dashboard` s6-rc services.
+4. Exec's the container's CMD as the main program (`/opt/hermes/docker/main-wrapper.sh`), which routes the arguments the user passed to `docker run`:
+   - no args → `hermes` (the default)
+   - first arg is an executable on PATH (e.g. `sleep`, `bash`) → exec it directly
+   - anything else → `hermes <args>` (subcommand passthrough)
+   The container exits when this main program exits, with its exit code.

-:::warning
-Do not override the image entrypoint unless you keep `/opt/hermes/docker/entrypoint.sh` in the command chain. The entrypoint drops root privileges to the `hermes` user before gateway state files are created. Starting `hermes gateway run` as root inside the official image is refused by default because it can leave root-owned files in `/opt/data` and break later dashboard or gateway starts. Set `HERMES_ALLOW_ROOT_GATEWAY=1` only when you intentionally accept that risk.
+:::warning Breaking change vs. pre-s6 images
+The container ENTRYPOINT is now `/init` (s6-overlay), not `/usr/bin/tini`. All five documented `docker run` invocation patterns (no args, `chat -q "…"`, `sleep infinity`, `bash`, `--tui`) behave identically to the tini-based image. If you have a downstream wrapper that depended on tini-specific signal behavior or hard-coded `/usr/bin/tini --` invocation, pin to the previous image tag.
 :::

+:::warning Privilege model
+Do not override the image entrypoint unless you keep `/init` (or, equivalently, the legacy `docker/entrypoint.sh` shim that forwards to the stage2 hook) in the command chain. s6-overlay's `/init` runs as root so it can chown the volume on first boot, then drops to the `hermes` user via `s6-setuidgid` for every supervised service AND for the main program. Starting `hermes gateway run` as root inside the official image is refused by default because it can leave root-owned files in `/opt/data` and break later dashboard or gateway starts. Set `HERMES_ALLOW_ROOT_GATEWAY=1` only when you intentionally accept that risk.
+:::
+
+### Per-profile gateway supervision
+
+Inside the container, each profile created with `hermes profile create <name>` automatically gets an s6-supervised gateway service registered at `/run/service/gateway-<name>/`. The lifecycle commands you'd run on the host work the same way:
+
+```sh
+hermes profile create coder            # registers gateway-coder s6 slot
+hermes -p coder gateway start          # s6-svc -u  → supervised gateway
+hermes -p coder gateway stop           # s6-svc -d  → service down
+hermes -p coder gateway restart        # s6-svc -t  → SIGTERM the supervisor
+hermes profile delete coder            # tears down the s6 slot
+```
+
+**Supervision benefits over the pre-s6 image:**
+
+- Gateway crashes are auto-restarted by `s6-supervise` after a ~1s backoff.
+- Dashboard crashes are auto-restarted (set `HERMES_DASHBOARD=1` to start it).
+- `docker restart` preserves running gateways: the cont-init reconciler reads `$HERMES_HOME/profiles/<name>/gateway_state.json` and brings the slot back up if the last recorded state was `running`. Stopped gateways stay stopped.
+- Per-profile gateway logs persist under `$HERMES_HOME/logs/gateways/<profile>/current` (rotated by `s6-log`), and the reconciler's actions are appended to `$HERMES_HOME/logs/container-boot.log` per boot.
+
+`hermes status` inside the container reports `Manager: s6 (container supervisor)`. Use `/command/s6-svstat /run/service/gateway-<name>` for the raw supervisor view (note `/command/` is on PATH for supervision-tree processes only; pass the absolute path when calling from `docker exec`).
+
 ## Upgrading

 Pull the latest image and recreate the container. Your data directory is untouched.