From 0988ab83b7e8b282c3313daf42e55701d0f77bff Mon Sep 17 00:00:00 2001 From: Ben Date: Sat, 23 May 2026 16:24:33 +1000 Subject: [PATCH] docs(plans): trim s6-overlay plan to a post-implementation reference MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PR #30136 review item O7: the plan doc was 3,191 lines — 5x the size of any other plan in docs/plans/ and the largest reference document in the repo. With the implementation shipped, most of that content is either: * The phase-by-phase TDD walkthrough (~2,800 lines): now canonical in the PR commit log (`git log a957ef083..a6f7171a5`). * The v2/v3 re-validation preambles: artifacts of the planning process, no longer load-bearing. * The full Open Questions deliberations with options A/B/C laid out: collapsed into the Decision Log. * The Rollout Plan and Estimated Timeline: history. Trim to ~430 lines covering what readers actually need going forward: the goal, architecture, scope, key design decisions (D1–D9), risk register (now including the three risks surfaced in PR review — `_s6_running` detection, svscanctl FIFO perms, supervise control FIFO perms), the decision log including the post-merge additions, and the verification checklist (now all boxes ticked). Header now reads 'Status: shipped' and points at the PR. The git history preserves the full v3 plan for anyone who needs it. --- ...07-s6-overlay-dynamic-subagent-gateways.md | 3345 ++--------------- 1 file changed, 294 insertions(+), 3051 deletions(-) diff --git a/docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md b/docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md index 77fd0bcc53c..1f00dc94bba 100644 --- a/docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md +++ b/docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md @@ -1,71 +1,104 @@ # s6-overlay Supervision for Per-Profile Gateways in Docker — Implementation Plan -> **For Hermes:** Use `subagent-driven-development` skill to implement this plan task-by-task. +> **Status: shipped.** Phases 0–5 landed via PR +> [NousResearch/hermes-agent#30136](https://github.com/NousResearch/hermes-agent/pull/30136) +> in May 2026. This document is preserved as a post-implementation reference +> for the architecture and the resolved design questions. The phase-by-phase +> TDD walkthrough (≈2,800 lines) and the v2/v3 re-validation preambles have +> been removed — the canonical implementation history is the PR commit log +> (`git log --oneline a957ef083..a6f7171a5 -- 'docker/*' 'hermes_cli/service_manager.py' …`). +> Open Questions are collapsed into a single Decision Log table; full +> deliberations live in PR review comments. -> **Plan v2 — re-validated May 18, 2026.** v1 was drafted May 7, 2026. Re-validation confirmed: (a) nothing has been implemented yet (greenfield); (b) line-number citations everywhere were stale — they have been replaced with function-name references; (c) a fourth host backend has shipped since v1 — `hermes_cli/gateway_windows.py` registers the gateway as a Windows Scheduled Task with a Startup-folder fallback — the `ServiceManager` protocol now includes a `WindowsServiceManager` adapter and `ServiceManagerKind = "systemd" | "launchd" | "windows" | "s6" | "none"`; (d) `gateway_command` currently has five `elif is_container():` arms that *refuse* gateway install/start/stop/restart/uninstall inside containers — Task 4.3 explicitly deletes them as part of the s6 dispatch; (e) Phase 0 Task 0.5's two profile-gateway tests are marked `xfail(strict=True)` because they describe the post-Phase-4 invariant, not current behavior, and flip to passing in Phase 4; (f) s6-overlay bumped from v3.2.2.0 → v3.2.3.0; (g) OQ8-C log path is now sourced from runtime `$HERMES_HOME`, not hard-coded at registration time. +**Goal:** Replace `tini` with s6-overlay as PID 1 in the Hermes Docker image so +that the main hermes process, the dashboard, and dynamically-created +per-profile gateways all run as supervised services (auto-restart on crash, +clean shutdown, signal forwarding, zombie reaping). Preserve every existing +`docker run …` invocation pattern — including interactive TUI. -> **Plan v3 — re-validated May 21, 2026 in the `docker_s6` worktree.** Spot-check against eight intervening commits to Dockerfile / entrypoint / gateway / doctor / docker docs found four items that need awareness — none invalidates the plan: -> -> 1. **Install-method stamping landed in entrypoint.sh** (PR #27843 / `6f5ec929a`). After the `gosu` privilege drop and venv activate, the entrypoint writes `"docker"` to `${HERMES_HOME:=/opt/data}/.install_method`, so `detect_install_method()` can report `docker` to `hermes status`. Phase 2 Task 2.3 (`docker/stage2-hook.sh` rewrite) must preserve this stamp — either keep it in the stage2 hook (runs as root, before user services start; would need to chown to hermes UID afterward) or hoist it into a per-service `run` prelude for the main-hermes s6 service. **Recommendation: keep it in the stage2 hook, written as the hermes user via `s6-setuidgid hermes` to match the file's existing ownership.** Add a note to Task 2.3. -> 2. **`RUN mkdir -p /opt/data` was added to the Dockerfile** just before the `VOLUME` declaration (same PR). Phase 2 Task 2.4 (Dockerfile flip) must retain this line — the directory must exist before VOLUME so initial chown succeeds when the volume is first mounted. -> 3. **`hermes_cli/gateway_windows.py` `install()` signature changed** (PRs #28169-adjacent, `d948de39e` + `417a653d9`, ~420 lines of changes). New keyword args: `start_now: bool | None`, `start_on_login: bool | None`, `elevated_handoff: bool`. `WindowsServiceManager.install()` adapter in Task 1.2 must forward these — recommend keeping the wrapper's signature minimal (`install(force=False, **kwargs)`) and passing through; or expose them explicitly if the wrapper is called from non-Windows code paths (it isn't currently). Adapter remains a thin pass-through. -> 4. **`hermes_cli/doctor.py` refactor introduced `_section(title)` and `_fail_and_issue(text, detail, fix, issues)` helpers** (PR #27830, `41f1eddee`). Phase 5 Task 5.3 must use these helpers in any new s6-aware doctor checks rather than the older copy-paste banner pattern. The `_check_gateway_service_linger` function and "Gateway Service" / "External Tools" section names that Task 5.3 references are all still present. -> -> Additionally: -> - `gateway_command` actually contains **three** `elif is_container():` rejection arms in `_gateway_command_inner` (lines 5111, 5141, 5184 as of May 21), not five — point (d) above said "five". The other two `is_container()` references at lines 983 and 1220 are in different helper functions and are not user-facing rejections. Task 4.3 should target three arms, not five. -> - `website/docs/user-guide/docker.md` got a 4-line clarifying note from PR #28497 distinguishing Hermes-in-Docker from Docker-as-terminal-backend. No conflict with Phase 5 Task 5.1. -> - s6-overlay still at v3.2.3.0 (no new release since May 9, 2026). Tech-stack and Task 2.1 ARG remain accurate. -> -> **Plan v3 also adds Task 4.0 — Reconcile per-profile gateways on container boot.** Both v1 and v2 missed this: `/run/service/` is tmpfs, so every `docker restart` was silently wiping every per-profile gateway registration. Task 4.0 introduces a cont-init.d script (`02-reconcile-profiles`) and a Python module (`hermes_cli/container_boot.py`) that walks persistent `$HERMES_HOME/profiles//`, recreates the s6 service slots, and auto-starts only those whose last `gateway_state.json` was `running`. Phase 4 estimate bumps from 1.5 → 2.0 days; total plan from 12.0 → 12.5 days. Two new risk-register rows + the "Persistence across container restart" paragraph in the Background section make this contract visible to readers who never reach Phase 4. - -**Goal:** Replace `tini` with s6-overlay as PID 1 in the Hermes Docker image so that the main hermes process, the dashboard, and dynamically-created per-profile gateways all run as supervised services (auto-restart on crash, clean shutdown, signal forwarding, zombie reaping). Preserve every existing `docker run …` invocation pattern — including interactive TUI. - -**Architecture:** s6-overlay's `/init` becomes the container ENTRYPOINT, running s6-svscan as PID 1. Main hermes and the dashboard are declared as static s6-rc services at image build time. Per-profile gateways — which users create *after* the image is built (`hermes profile create coder` → `coder gateway start`) — are registered dynamically by writing service directories under a scandir watched by s6-svscan. A new `ServiceManager` protocol abstracts the install/start/stop/restart surface across the init systems we care about (systemd on Linux host, launchd on macOS host, Scheduled Tasks on native Windows host, s6 inside container) and adds a second tier for runtime service registration that only s6 implements. +**Architecture:** s6-overlay's `/init` is the container ENTRYPOINT, running +s6-svscan as PID 1. Main hermes and the dashboard are declared as static +s6-rc services at image build time. Per-profile gateways — which users create +*after* the image is built (`hermes profile create coder` → +`coder gateway start`) — are registered dynamically by writing service +directories under a scandir watched by s6-svscan. A `ServiceManager` protocol +abstracts the install/start/stop/restart surface across the init systems we +care about (systemd on Linux host, launchd on macOS host, Scheduled Tasks on +native Windows host, s6 inside container) and adds a second tier for runtime +service registration that only s6 implements. **Tech Stack:** -- [s6-overlay](https://github.com/just-containers/s6-overlay) v3.2.3.0 (latest as of plan re-validation; noarch + x86_64 tarballs, ~15 MB) — uses skalibs/s6/s6-rc 2.15+ and includes fixes for long-standing s6-overlay-specific issues. v3.2.2.0 also works if reproducibility from the original plan is needed. -- Debian 13.4 base image (unchanged) -- [hadolint](https://github.com/hadolint/hadolint) for the Dockerfile + [shellcheck](https://github.com/koalaman/shellcheck) for entrypoint scripts -- Python subprocess wrappers for `s6-svc`, `s6-svstat`, `s6-svscanctl` -- Existing systemd/launchd/windows surface in `hermes_cli/gateway.py` and `hermes_cli/gateway_windows.py` + +- [s6-overlay](https://github.com/just-containers/s6-overlay) v3.2.3.0 + (noarch + per-arch tarballs ~15 MB). SHA256-pinned via build ARGs; + multi-arch via `TARGETARCH` (amd64 → `x86_64`, arm64 → `aarch64`). +- Debian 13.4 base image (unchanged). +- [hadolint](https://github.com/hadolint/hadolint) for the Dockerfile + + [shellcheck](https://github.com/koalaman/shellcheck) for entrypoint scripts. +- Python subprocess wrappers for `s6-svc`, `s6-svstat`, `s6-svscanctl`. +- Existing systemd/launchd/windows surface in `hermes_cli/gateway.py` and + `hermes_cli/gateway_windows.py`. **Scope:** -- Container-only (host-side systemd/launchd behavior is preserved, not modified) -- s6-overlay only (no pure-Python fallback) -- Architecture A (s6 owns PID 1; tini is removed) -- Interactive TUI must keep working: `docker run -it --rm nousresearch/hermes-agent:latest --tui` -- Dynamic registration is limited to per-profile gateways — one service per profile, created when a profile is created, torn down when deleted + +- Container-only (host-side systemd/launchd/windows behavior is preserved, + not modified). +- s6-overlay only (no pure-Python fallback). +- Architecture A (s6 owns PID 1; tini is removed). +- Interactive TUI must keep working: + `docker run -it --rm nousresearch/hermes-agent:latest --tui`. +- Dynamic registration is limited to per-profile gateways — one service per + profile, created when a profile is created, torn down when deleted. A + `gateway-default` slot is always registered for the root HERMES_HOME + profile so `hermes gateway start` (no `-p`) has somewhere to land. **Out of scope:** -- Host-side dynamic supervision (systemd-run / launchd transient plists) — not needed -- Pure-Python supervisor fallback — not needed -- Arbitrary user-defined supervised processes inside the container — only profile gateways -- Migration of existing per-profile systemd unit generation to s6 on the host side -- Non-Docker container runtimes (Podman rootless validated reactively — see OQ4) -- UX polish around in-container profile lifecycle (e.g. a nice status view of all supervised profile gateways) — deferred to follow-up + +- Host-side dynamic supervision (systemd-run / launchd transient plists) — + not needed. +- Pure-Python supervisor fallback — not needed. +- Arbitrary user-defined supervised processes inside the container — only + profile gateways. +- Migration of existing per-profile systemd unit generation to s6 on the + host side. +- Non-Docker container runtimes (Podman rootless validated reactively). +- UX polish around in-container profile lifecycle (e.g. a nice status view + of all supervised profile gateways) — deferred to follow-up. --- ## Background From The Codebase -### Current container init (what we're replacing) +> **Note on line numbers:** This section refers to functions and structures +> by name only. Use `grep -n 'def ' ` to locate anything below +> if you need the current line. -> **Note on line numbers:** This section refers to functions and structures by name only. The codebase is fast-moving — `hermes_cli/gateway.py` alone has grown by ~600 lines in the six months between plan v1 and re-validation. Use `grep -n 'def ' ` to locate anything below if you need the current line. +### Pre-s6 container init (what we replaced) -**`Dockerfile`** — `ENTRYPOINT [ "/usr/bin/tini", "-g", "--", "/opt/hermes/docker/entrypoint.sh" ]`. tini is PID 1, reaps zombies, forwards SIGTERM to the process group. +The original `Dockerfile` declared +`ENTRYPOINT [ "/usr/bin/tini", "-g", "--", "/opt/hermes/docker/entrypoint.sh" ]`. +tini was PID 1, reaped zombies, forwarded SIGTERM to the process group. The +old `docker/entrypoint.sh`: -**`docker/entrypoint.sh`** — does, in order: -1. `gosu` privilege drop from root → `hermes` UID -2. Copies `.env.example`, `cli-config.yaml.example`, `SOUL.md` into `$HERMES_HOME` if missing -3. Syncs bundled skills via `tools/skills_sync.py` -4. Optionally backgrounds `hermes dashboard` in a subshell when `HERMES_DASHBOARD=1` — **not supervised**, no restart -5. `exec hermes "$@"` — this becomes tini's sole direct child +1. `gosu` privilege drop from root → `hermes` UID. +2. Copied `.env.example`, `cli-config.yaml.example`, `SOUL.md` into + `$HERMES_HOME` if missing. +3. Synced bundled skills via `tools/skills_sync.py`. +4. Optionally backgrounded `hermes dashboard` in a subshell when + `HERMES_DASHBOARD=1` — **not supervised**, no restart. +5. `exec hermes "$@"` — tini's sole direct child. -**Known limitations we discussed on May 4, 2026:** dashboard crash → stays dead; dashboard fails at startup → silent; gateway crash → dashboard dies too. The May 4 decision was "leave as is" because nothing in the container needed supervision then. Adding per-profile gateway supervision changes that. +Known limitations: dashboard crash → stays dead; dashboard fails at startup → +silent; gateway crash → dashboard dies too. The May 4, 2026 decision was +"leave as is" because nothing in the container needed supervision then. +Adding per-profile gateway supervision changed that. -### Current ServiceManager surface (what we're wrapping, not refactoring) +### ServiceManager surface (what we wrapped, not refactored) -All init-system logic lives in **`hermes_cli/gateway.py`** (currently ~5,400 lines). The systemd/launchd code is ~1,500 lines of that, plus a separate **`hermes_cli/gateway_windows.py`** (~690 lines) that ships gateway-as-Scheduled-Task with a Startup-folder fallback for native Windows. Structure (functions named — no line numbers; they drift constantly): +All init-system logic lives in **`hermes_cli/gateway.py`** (~5,400 LOC at +re-validation). The systemd/launchd code is ~1,500 lines of that, plus a +separate **`hermes_cli/gateway_windows.py`** (~690 LOC) for Windows +Scheduled Tasks. | Layer | Systemd functions | Launchd functions | Windows functions | |---|---|---|---| @@ -73,37 +106,66 @@ All init-system logic lives in **`hermes_cli/gateway.py`** (currently ~5,400 lin | **Paths** | `get_systemd_unit_path(system)`, `get_service_name()` | `get_launchd_plist_path()`, `get_launchd_label()` | `gateway_windows.get_task_name()`, `get_task_script_path()`, `get_startup_entry_path()` | | **Install/lifecycle** | `systemd_install(force, system, run_as_user)`, `systemd_uninstall(system)`, `systemd_start/stop/restart(system)` | `launchd_install(force)`, `launchd_uninstall/start/stop/restart` | `gateway_windows.install/uninstall/start/stop/restart` | | **Probes** | `_probe_systemd_service_running(system)`, `_read_systemd_unit_properties(system)`, `_wait_for_systemd_service_restart`, `_recover_pending_systemd_restart` | `_probe_launchd_service_running()` | `gateway_windows.is_task_registered()`, `_pid_exists` helper | -| **D-Bus plumbing** | `_ensure_user_systemd_env`, `_user_systemd_socket_ready`, `_user_systemd_private_socket_path`, `get_systemd_linger_status` | — (not applicable) | — (not applicable) | +| **D-Bus plumbing** | `_ensure_user_systemd_env`, `_user_systemd_socket_ready`, `_user_systemd_private_socket_path`, `get_systemd_linger_status` | — | — | | **Unit/plist generation** | `generate_systemd_unit(system, run_as_user)`, `systemd_unit_is_current`, `refresh_systemd_unit_if_needed` | plist templating in `launchd_install` | `_build_gateway_cmd_script`, `_build_startup_launcher`, `_write_task_script` | -**Callers outside `gateway.py` that are container-relevant:** +Container-relevant callers outside `gateway.py`: -- `hermes_cli/status.py` — prints `Manager: systemd/manual` / `launchd` / `Termux / manual process` / `(not supported on this platform)`; needs a new "s6" branch for when status runs inside the container. Search for the `Manager:` literal to find the block. -- `hermes_cli/profiles.py` — `create_profile` and `delete_profile`; the delete path has a `disable systemd/launchd service` helper (the function literally documents "Disable and remove systemd/launchd service for a profile"). The create/delete flow needs to register/unregister with s6 when running inside the container. -- `hermes_cli/doctor.py` — `_check_gateway_service_linger` calls `get_systemd_linger_status()` which is a host-only concept (SSH login survival); inside the container it either silently skips or prints a confusing warning. Needs a "skip on s6 / show s6 supervision status" branch. **Small scope, deferred to Phase 5** because the behavior is cosmetic, not functional. Separately, `hermes doctor`'s External Tools → Docker check is nonsensical inside a container (Docker-in-Docker isn't set up and isn't intended); it would create a spurious warning. Also deferred to Phase 5. -- **`hermes_cli/gateway.py::gateway_command`** — the actual `hermes gateway install/start/stop/restart/uninstall` dispatcher currently has `elif is_container():` arms that *refuse* the operation ("Service installation is not needed inside a Docker container — use Docker restart policies instead", "Service start is not applicable inside a Docker container", etc.). Phase 4 must remove these early-exit arms so the new s6 path can intercept. See Task 4.3. +- `hermes_cli/status.py` — gained an `s6` branch for in-container runs. +- `hermes_cli/profiles.py` — `create_profile` / `delete_profile` register and + unregister with s6 inside the container (no-op on host). +- `hermes_cli/doctor.py` — `_check_gateway_service_linger` skips on s6, and a + new "Service Supervisor" section reports main-hermes / dashboard / + profile-gateway counts via the ServiceManager. +- `hermes_cli/gateway.py::gateway_command` — the + `elif is_container():` rejection arms that refused gateway lifecycle + operations were removed; the `_dispatch_via_service_manager_if_s6` helper + intercepts start/stop/restart and routes them through s6. -**Not container-relevant, no changes needed:** -- `hermes_cli/setup.py`, `hermes_cli/uninstall.py` — the setup wizard and uninstall flow are host-only. Users don't run `hermes setup` inside the container (the image ships pre-configured); running `hermes uninstall` inside a container is a no-op on any systemd/launchd unit paths that simply don't exist. -- `hermes_cli/claw.py` — OpenClaw migration operates on `~/.openclaw/` on the host. Inside a container, `Path.home()` is `/opt/data` (the hermes user's home), and no OpenClaw directories exist there since the container was built fresh. `hermes claw migrate` / `cleanup` would cleanly report "nothing to migrate" and exit. No changes required. +### Per-profile gateway spawning -### Per-profile gateway spawning (exists today — needs container adaptation) +`hermes gateway start`, `coder gateway start` (profile alias), and +`hermes -p gateway start` all spawn a gateway process scoped to a +given profile. See +[Profiles: Running Gateways](https://hermes-agent.nousresearch.com/docs/user-guide/profiles#running-gateways). +On host, lifecycle is managed via per-profile systemd units +(`hermes-gateway-.service`); inside the container, an s6 service at +`/run/service/gateway-/` is registered when the profile is created and +torn down when it's deleted. -`hermes gateway start`, `coder gateway start` (profile alias), and `hermes -p gateway start` all spawn a gateway process scoped to a given profile. See [Profiles: Running Gateways](https://hermes-agent.nousresearch.com/docs/user-guide/profiles#running-gateways). On host the lifecycle is managed via per-profile systemd units (`hermes-gateway-.service`); inside the Hermes container there is currently no supervisor, so crashes are not recovered and shutdowns are ad-hoc. +**Persistence across container restart:** `/run/service/` is tmpfs — +service registrations are wiped when the container restarts. Profile +directories at `/opt/data/profiles//` live on the persistent VOLUME, +and each one records its gateway's last state in `gateway_state.json`. +`/etc/cont-init.d/02-reconcile-profiles` walks the persistent profiles on +every container boot, recreates the s6 service slots via +`hermes_cli/container_boot.py`, and auto-starts those whose last recorded +state was `running`. Profiles whose last state was `stopped`, +`startup_failed`, `starting`, or absent get their slot recreated in the +`down` state and wait for explicit user action. `docker restart` is therefore +invisible to a user with running profile gateways: they come back up; +stopped ones stay stopped. -**What this plan adds:** when `hermes profile create ` runs inside the container, it registers an s6 service at `/run/service/gateway-/` that s6-svscan picks up and supervises. ` gateway start/stop/restart` then talks to s6 (`s6-svc -u`, `s6-svc -d`) instead of spawning a bare process. When the profile is deleted, the service directory is removed and s6 tears down the supervise process. +### s6-overlay constraints -**Persistence across container restart:** `/run/service/` is **tmpfs** — service registrations are wiped when the container restarts. But profile directories at `/opt/data/profiles//` live on the persistent VOLUME, and each one records its gateway's last state in `gateway_state.json`. Task 4.0 runs as a cont-init.d script on every container boot: it walks the persistent profiles, recreates the s6 service slots, and auto-starts those whose last recorded state was `running`. Profiles whose last state was `stopped`, `startup_failed`, `starting`, or absent get their slot recreated in the `down` state and wait for explicit user action. This means `docker restart` is invisible to a user with running profile gateways: they come back up; stopped ones stay stopped. - -### s6-overlay constraints relevant to us - -**Root/non-root model (resolved — see OQ2):** `/init` runs as root to set up the supervision tree, install signal handlers, and run the stage2 hook that does `usermod`/`chown`. Each supervised service drops to UID 10000 via `s6-setuidgid hermes` in its `run` script — a single-exec step (no shell subprocess, no zombie risk). The per-service `s6-supervise` monitor stays root so it can signal its child regardless of UID. Net effect: hermes and all its subprocesses run as UID 10000 exactly as today; only the supervision tree itself runs as root. - -- v3.2.3.0 (May 2026, latest at re-validation) has limited non-root support for running `/init` itself as non-root — some tools (`fix-attrs`, `logutil-service`) assume root. We don't hit this because `/init` runs as root and individual services drop. -- scandir hard cap: `services_max` default 1000, configurable to 160,000 via `-C`. Way more than we need. -- `/command/with-contenv` sources `/run/s6/container_environment/*` into service env — convenient for passing `HERMES_HOME` etc. -- s6 signal semantics: service crash triggers `s6-supervise` restart after 1s; override with a `finish` script. -- Zombie reaping: PID 1 (s6-svscan) reaps all zombies non-blockingly on SIGCHLD. Any subagent subprocess spawned by the main hermes process is reaped automatically — no special handling required. +- **Root/non-root model:** `/init` runs as root to set up the supervision + tree, install signal handlers, and run the stage2 hook that does + `usermod`/`chown`. Each supervised service drops to UID 10000 via + `s6-setuidgid hermes` in its `run` script. The per-service `s6-supervise` + monitor stays root so it can signal its child regardless of UID. Net + effect: hermes and all its subprocesses run as UID 10000 exactly as + before; only the supervision tree itself runs as root. +- v3.2.3.0 has limited non-root support for running `/init` itself as + non-root — some tools (`fix-attrs`, `logutil-service`) assume root. We + don't hit this because `/init` runs as root. +- Scandir hard cap: `services_max` default 1000, configurable to 160,000. +- `/command/with-contenv` sources `/run/s6/container_environment/*` into + service env — convenient for passing `HERMES_HOME` etc. +- s6 signal semantics: service crash triggers `s6-supervise` restart after + 1s; override with a `finish` script. +- Zombie reaping: PID 1 (s6-svscan) reaps all zombies non-blockingly on + SIGCHLD. Any subagent subprocess spawned by the main hermes process is + reaped automatically. --- @@ -111,11 +173,16 @@ All init-system logic lives in **`hermes_cli/gateway.py`** (currently ~5,400 lin ### D1. s6-overlay replaces tini entirely -Container ENTRYPOINT becomes `/init`, PID 1 is s6-svscan. The main hermes process, the dashboard, and every per-profile gateway all run as supervised services. This is a single breaking change to the container contract — after this phase lands, every container invocation goes through `/init`. +Container ENTRYPOINT is `/init`, PID 1 is s6-svscan. The main hermes +process, the dashboard, and every per-profile gateway run as supervised +services. This is a single breaking change to the container contract. ### D2. Main hermes is an s6 service with container-exit semantics -The current contract "container exits when `hermes` exits" must be preserved. s6-overlay supports this via a service `finish` script that writes to `/run/s6-linux-init-container-results/exitcode` and calls `/run/s6/basedir/bin/halt`. All five supported invocations continue to work: +The contract "container exits when `hermes` exits" is preserved via a +service `finish` script that writes to +`/run/s6-linux-init-container-results/exitcode` and calls +`/run/s6/basedir/bin/halt`. All five supported invocations work: | `docker run …` | Behavior | |---|---| @@ -125,25 +192,34 @@ The current contract "container exits when `hermes` exits" must be preserved. s6 | `bash` | interactive `bash` directly | | `docker run -it … --tui` | interactive Ink TUI with real TTY — see D9 | -The stage2 hook detects whether `$1` is an executable on PATH and routes either to "run this as a one-shot main service" or "wrap with hermes". +`docker/main-wrapper.sh` detects whether `$1` is an executable on PATH and +routes either to "run this as a one-shot main service" or "wrap with +hermes". ### D3. Static services at build time; dynamic (per-profile) services at runtime s6 offers two mechanisms: -- **s6-rc** (declarative, compile-then-swap): used for main hermes and the dashboard — they're known at image build time -- **scandir** (drop a directory + `s6-svscanctl -a`): used for per-profile gateways — profiles are user-created after the image is built -Per-profile gateway service dirs live at `/run/service/gateway-/` (tmpfs, hermes-writable). s6-svscan picks them up on rescan. +- **s6-rc** (declarative, compile-then-swap): used for main hermes and the + dashboard — they're known at image build time. +- **scandir** (drop a directory + `s6-svscanctl -a`): used for per-profile + gateways — profiles are user-created after the image is built. + +Per-profile gateway service dirs live at `/run/service/gateway-/` +(tmpfs, hermes-writable). s6-svscan picks them up on rescan. ### D4. ServiceManager protocol with two methods for runtime registration -Host paths (systemd, launchd, Windows Scheduled Tasks) need only install/start/stop/restart of pre-declared services. Inside the container, we additionally need to register services at runtime when a profile is created. The protocol exposes this directly — no generic "transient" abstraction: +Host paths (systemd, launchd, Windows Scheduled Tasks) need only +install/start/stop/restart of pre-declared services. Inside the container, +we additionally need to register services at runtime when a profile is +created. The protocol exposes this directly: ```python class ServiceManager(Protocol): kind: ServiceManagerKind # "systemd" | "launchd" | "windows" | "s6" | "none" - # Lifecycle of an already-declared service (existing systemd/launchd/windows + s6) + # Lifecycle of an already-declared service def start(self, name: str) -> None: ... def stop(self, name: str) -> None: ... def restart(self, name: str) -> None: ... @@ -151,46 +227,96 @@ class ServiceManager(Protocol): # Runtime registration (container-only; hosts raise NotImplementedError) def supports_runtime_registration(self) -> bool: ... - def register_profile_gateway(self, profile: str, *, command: list[str], - env: dict[str, str] | None = None) -> None: ... + def register_profile_gateway( + self, profile: str, *, + extra_env: dict[str, str] | None = None, + ) -> None: ... def unregister_profile_gateway(self, profile: str) -> None: ... def list_profile_gateways(self) -> list[str]: ... ``` -Systemd, launchd, and Windows backends raise `NotImplementedError` on the registration methods. Only the s6 backend implements them. Callers check `supports_runtime_registration()` before calling. +Systemd, launchd, and Windows backends raise `NotImplementedError` on the +registration methods. Only the s6 backend implements them. Callers check +`supports_runtime_registration()` before calling. -The scope is intentionally narrow: it's specifically "register/unregister a profile gateway," not a general-purpose process-management API. If we later need other dynamically-registered services, we can add dedicated methods. +The scope is intentionally narrow: it's specifically "register/unregister a +profile gateway," not a general-purpose process-management API. ### D5. Per-profile gateway service spec is fixed, not user-provided -Every profile gateway has the same command shape (`hermes -p gateway start --foreground …`). The s6 backend generates the `run` script from a fixed template given the profile name — no arbitrary command list. This keeps the API surface tight and prevents callers from accidentally registering non-gateway services. +Every profile gateway has the same command shape +(`hermes -p gateway run`, or `hermes gateway run` for the default +profile). The s6 backend generates the `run` script from a fixed template +given the profile name — no arbitrary command list. This keeps the API +surface tight and prevents callers from accidentally registering +non-gateway services. -```python -def register_profile_gateway(self, profile: str, *, port: int, - extra_env: dict[str, str] | None = None) -> None -``` +Port selection is governed by the profile's `config.yaml` +(`[gateway] port = …`) — the single source of truth. (The original plan +proposed a Python-side SHA-256 port allocator with a 600-port range; it was +retired during PR review because it was dead code through the entire stack.) ### D6. Add detect_service_manager() alongside supports_systemd_services() -`supports_systemd_services()` stays as-is (14 call sites). A new `detect_service_manager() -> Literal["systemd", "launchd", "windows", "s6", "none"]` composes existing detection functions (`is_macos()`, `is_windows()`, `supports_systemd_services()`, `is_container()` + `_s6_running()`) and adds an s6 branch for container detection. Host call sites continue to use the existing functions; container-only code (the profile hooks) uses the new one. +`supports_systemd_services()` stays as-is (host code paths unchanged). A new +`detect_service_manager() -> Literal["systemd", "launchd", "windows", "s6", "none"]` +composes existing detection functions (`is_macos()`, `is_windows()`, +`supports_systemd_services()`, `is_container()` + `_s6_running()`) and adds +an s6 branch for container detection. Host call sites continue to use the +existing functions; container-only code (the profile hooks) uses the new one. -This is deliberately narrow: protocol + s6 backend are new; host code path is untouched. Future cleanup PR can consolidate. +`_s6_running()` probes `/proc/1/comm` (world-readable) and +`/run/s6/basedir`. The earlier `/proc/1/exe` probe was root-only readable +and silently failed for the unprivileged hermes user (UID 10000), making +the entire runtime-registration path inert in production — caught in PR +review. ### D7. Wrap existing systemd/launchd/windows functions, don't rewrite them -`SystemdServiceManager` / `LaunchdServiceManager` / `WindowsServiceManager` are thin adapters over the existing `systemd_*` / `launchd_*` module-level functions in `hermes_cli/gateway.py` and the `gateway_windows.install/uninstall/start/stop/restart/is_installed` functions in `hermes_cli/gateway_windows.py`. Their `start/stop/restart` methods call straight through. We get the abstraction without rewriting ~2,200 lines of working code. +`SystemdServiceManager` / `LaunchdServiceManager` / `WindowsServiceManager` +are thin adapters over the existing `systemd_*` / `launchd_*` module-level +functions in `hermes_cli/gateway.py` and the +`gateway_windows.install/uninstall/start/stop/restart/is_installed` +functions in `hermes_cli/gateway_windows.py`. We get the abstraction +without rewriting ~2,200 LOC of working code. ### D8. Profile create/delete hooks register/unregister the s6 service -When `hermes profile create ` runs inside the container, the profile-creation code path calls `ServiceManager.register_profile_gateway(, port=…)` if `supports_runtime_registration()` is True. When `hermes profile delete ` runs, it calls `unregister_profile_gateway()`. On host, both calls are no-ops (registration not supported; existing systemd unit generation continues to handle install/uninstall). +When `hermes profile create ` runs inside the container, the +profile-creation code path calls +`ServiceManager.register_profile_gateway()` if +`supports_runtime_registration()` is True. When `hermes profile delete +` runs, it calls `unregister_profile_gateway()`. On host, both +calls are no-ops (registration not supported; existing systemd unit +generation continues to handle install/uninstall). -Existing per-profile `hermes -p gateway start/stop/restart` CLI commands continue to work — in the container they dispatch to `ServiceManager.start/stop/restart("gateway-")`, which translates to `s6-svc -u`/`-d`/`-t` on the service dir. +Existing per-profile `hermes -p gateway start/stop/restart` CLI +commands continue to work — in the container they dispatch to +`ServiceManager.start/stop/restart("gateway-")`, which translates +to `s6-svc -u`/`-d`/`-t` on the service dir. + +`hermes gateway start` (no `-p`) targets a special `gateway-default` slot +that's always registered by the cont-init reconciler. Its run script omits +the `-p` flag and runs against the root `$HERMES_HOME` profile. + +`--all` lifecycle (`hermes gateway stop --all`, `... restart --all`) +iterates `mgr.list_profile_gateways()` through s6 so s6's `want up`/`want +down` flips correctly. Without this, `--all` fell through to `pkill` +followed by s6-supervise auto-restart — net effect: kick instead of stop. ### D9. Interactive TUI bypasses s6 service-mode and runs as CMD for TTY passthrough -`docker run -it --rm --tui` needs a real TTY connected to container stdin/stdout for Ink raw-mode keyboard input, cursor control, and SIGWINCH. Running the TUI as a normal s6 service fails because s6-supervise disconnects service stdio from the container TTY (documented: [s6-overlay#230](https://github.com/just-containers/s6-overlay/issues/230)). +`docker run -it --rm --tui` needs a real TTY connected to container +stdin/stdout for Ink raw-mode keyboard input, cursor control, and SIGWINCH. +Running the TUI as a normal s6 service fails because s6-supervise +disconnects service stdio from the container TTY (documented: +[s6-overlay#230](https://github.com/just-containers/s6-overlay/issues/230)). -**The pattern:** s6-overlay's `/init` execs a CMD as the container's "main program" after the supervision tree is up. The CMD inherits stdin/stdout/stderr from `/init` — which in `-it` mode is the container TTY. The stage2 hook detects the TUI case and short-circuits the main-hermes service so the hermes CMD becomes that main program. +**The pattern:** s6-overlay's `/init` execs a CMD as the container's "main +program" after the supervision tree is up. The CMD inherits +stdin/stdout/stderr from `/init` — which in `-it` mode is the container +TTY. The stage2 hook detects the TUI case and short-circuits the +main-hermes service so the hermes CMD becomes that main program. ```sh # In docker/stage2-hook.sh @@ -202,13 +328,10 @@ _is_tui_invocation() { if [ -t 0 ] && [ $# -eq 0 ]; then return 0; fi return 1 } - -if _is_tui_invocation "$@"; then - touch /var/run/s6/container_environment/HERMES_TUI_MODE -fi ``` And in `docker/s6-rc.d/main-hermes/run`: + ```sh if [ -f /var/run/s6/container_environment/HERMES_TUI_MODE ]; then exec sleep infinity # s6-overlay will exec CMD as the TTY-connected main @@ -216,2890 +339,13 @@ fi exec s6-setuidgid hermes hermes ${HERMES_ARGS:-} ``` -In TUI mode main hermes is effectively unsupervised (same as today with tini — acceptable because the user is interactively present). Dashboard and profile gateways still get full s6 supervision via their separate services. +In TUI mode main hermes is effectively unsupervised (same as the pre-s6 +behavior with tini — acceptable because the user is interactively +present). Dashboard and profile gateways still get full s6 supervision via +their separate services. -**Verification:** Phase 2 integration tests include an explicit TTY passthrough test using `tput cols` and `COLUMNS=123` as the probe. This is a hard gate — Phase 2 cannot merge if the test fails. Per OQ9, if it fails we fall back to the s6-fdholder pattern (Solution 2 in issue #230), but we don't want that — it has documented UX issues. - ---- - -## Phases Overview - -This plan is **TDD-first**. Phase 0 builds the regression test harness for the current (tini-based) container so every subsequent phase has a failing→passing test gate. Phase 0.5 adds linting. Phase 1 introduces the ServiceManager abstraction with no behavior change. Phase 2 is the single breaking change — tini out, s6 in, main hermes and dashboard become s6 services. Phase 3 adds the runtime-registration surface used by the profile create/delete hooks. Phase 4 wires profile creation/deletion into s6 and switches `hermes -p X gateway start/stop` to talk to s6 inside the container. Phase 5 is docs/cleanup. - -| Phase | Scope | Ships independently? | -|---|---|---| -| **Phase 0** | Test harness covering TUI, main hermes, dashboard, per-profile gateways — all against the current tini-based image. **Must land before any other phase so later changes are TDD.** | Yes — no behavior change | -| **Phase 0.5** | hadolint (Dockerfile) + shellcheck (entrypoint) in CI | Yes — no behavior change | -| **Phase 1** | `ServiceManager` protocol + thin wrappers around existing systemd/launchd | Yes — no behavior change, pure refactor | -| **Phase 2** | s6 replaces tini; main hermes + dashboard become s6 services | **Breaking change** — entrypoint contract changes | -| **Phase 3** | Runtime-registration methods (`register_profile_gateway` / `unregister_profile_gateway`) on the s6 backend | Yes — new capability, no caller yet | -| **Phase 4** | Profile create/delete hooks call the new registration API; container-boot reconciliation re-registers persistent profiles after `docker restart`; `hermes -p X gateway start/stop` talks to s6 inside the container | Yes — activates Phase 3 | -| **Phase 5** | Docs update (`website/docs/user-guide/docker.md`), skill for maintainers, remove dead code | Yes | - -Each phase is reviewable, testable, and (except Phase 2) backwards-compatible. Phase 2 is the single breaking moment. - -**CI gates between phases:** -- After Phase 0: the test harness runs against `main` (tini image); the two `test_profile_gateway.py` tests are xfailed (Phase 4 target), every other test passes -- After Phase 0.5: hadolint + shellcheck run green on the current Dockerfile + entrypoint -- After Phase 1: Phase 0 harness still passes; `grep -n 'systemd_install\|launchd_install' hermes_cli/` shows unchanged call-site count -- After Phase 2: Phase 0 harness still passes (xfails still xfail until Phase 4); all five invocation patterns (including TUI) produce identical user-visible behavior -- After Phase 3: `ServiceManager.supports_runtime_registration()` returns True in container, False on host -- After Phase 4: `hermes profile create test-profile` inside a container creates `/run/service/gateway-test-profile/` and `hermes -p test-profile gateway start` brings it up; **the two xfail markers in `test_profile_gateway.py` are removed and both tests pass strictly** - ---- - -## Open Questions - -All nine questions were resolved during plan review. Kept in-document for posterity; the chosen option is in bold at the top of each. - -### OQ1. Do we gate Phase 2 (breaking change) behind an env var for rollout? - -**Resolved: A — ship directly, no gate.** Hermes is pre-1.0; users depending on tini-specific behavior can pin to the previous image. Dual-maintenance accumulates cruft. - -Options considered: -- A. Ship Phase 2 directly — `/init` becomes the ENTRYPOINT unconditionally -- B. `HERMES_INIT=s6|tini` env var, flip default across releases -- C. Dual entrypoint script kept forever - -### OQ2. What happens to the `hermes` user vs. s6-overlay's root assumptions? - -**Resolved: A — supervisor runs as root; supervised services drop to UID 10000 via `s6-setuidgid hermes`.** Canonical s6-overlay non-root pattern. - -Options considered: -- A. `/init` runs as root → services drop per-service -- B. Run `/init` as hermes with `S6_READ_ONLY_ROOT=1` (broken: `fix-attrs`, `logutil-service` need root) -- C. Everything as root (security regression) - -### OQ3. Dashboard as static s6-rc service — how do we honor `HERMES_DASHBOARD=1`? - -**Resolved: A — dashboard is always declared as an s6 service; its `run` script checks `HERMES_DASHBOARD` and `exec sleep infinity` if unset.** Simpler than toggling contents.d at runtime. - -Options considered: -- A. Always declared, no-op when disabled -- B. Stage2 hook writes/removes `contents.d/dashboard` based on env -- C. Dashboard spawned via register_profile_gateway when enabled - -### OQ4. Podman rootless compatibility - -**Resolved: A — declare supported; fix issues as they arise during Phase 2 testing.** A Podman-alongside-Docker environment will be stood up locally for validation. - -Options considered: -- A. Supported; fix reactively -- B. Declared unsupported -- C. Block Phase 2 until validated - -### OQ5. Service naming for per-profile gateways - -**Resolved: `gateway-`.** Matches the existing `hermes-gateway-.service` systemd naming convention. - -### OQ6. — (retired; was about subagent gateways, no longer in scope) - -### OQ7. Resource limits per profile gateway - -**Resolved: C — YAGNI.** No per-service cgroup limits; rely on the container's overall limit. Revisit if we see evidence of a problem. - -Options considered: -- A. No limits -- B. Add `memory_limit_mb` parameter, use `s6-softlimit` -- C. Defer - -### OQ8. Log rotation for profile gateways - -**Resolved: C — persist logs under `$HERMES_HOME/logs/gateways//`.** Matches how the main gateway logs persist today. Each s6 service gets a `log/` subdir with `s6-log` rotation pointed at the persistent path. - -**Caveat — `HERMES_HOME` is sourced at service-run time, not registration time.** The log path is *not* hard-coded into the rendered `log/run` script as a literal `/opt/data/...`. Instead, the script reads `${HERMES_HOME:-/opt/data}` from `/run/s6/container_environment/` (populated by the stage2 hook from the container's actual env). This means: if a user starts the container with `-e HERMES_HOME=/data/hermes`, profile gateway logs land at `/data/hermes/logs/gateways//current` — not silently regress to `/opt/data/...`. Implementations of `_render_log_run` MUST therefore avoid string-substituting the path at Python time; they must emit a shell expansion of the env var. See Task 3.2. - -Options considered: -- A. Enable at `/run/service/gateway-/log/current` (tmpfs — lost on restart) -- B. Swallow (stdout to s6-supervise, lost) -- C. Persist under `$HERMES_HOME/logs/gateways//` - -### OQ9. TUI TTY passthrough via s6-overlay CMD mode — is it actually reliable? - -**Resolved: A — trust the documented pattern ([s6-overlay#230](https://github.com/just-containers/s6-overlay/issues/230) Solution 1), with manual testing + the automated Phase 2 integration test as the hard gate.** If the automated test fails, manual testing catches the regression before Phase 2 merges; we'd then fall back to the fdholder pattern. - -Options considered: -- A. Trust docs; test is the gate -- B. Prototype first (+0.5 day) -- C. Use s6-fdholder (more complex, known UX issues) - ---- - -## Phase 0 — Test Harness (TDD foundation) - -**Goal:** Build a docker-image test harness that exercises every user-visible feature of the current tini-based image, so Phase 2's change can be validated as "identical behavior." Land this **before any other phase**. - -### Task 0.1: Create the test-harness pytest marker and skip-condition - -**Objective:** All harness tests live under `tests/docker/` and are marked so they only run when Docker is available. CI can opt in via `--run-docker`. - -**Files:** -- Create: `tests/docker/__init__.py` (empty) -- Create: `tests/docker/conftest.py` - -**Step 1: Write `tests/docker/conftest.py`** - -```python -"""Shared fixtures for docker-image integration tests. - -Tests in this directory build the image with the current `Dockerfile` -and exercise it via `docker run`. They skip when Docker is unavailable -(e.g. on developer laptops without a daemon). -""" -import os -import shutil -import subprocess -import pytest - -IMAGE_TAG = os.environ.get("HERMES_TEST_IMAGE", "hermes-agent-harness:latest") - - -def _docker_available() -> bool: - if shutil.which("docker") is None: - return False - try: - r = subprocess.run(["docker", "info"], capture_output=True, timeout=5) - return r.returncode == 0 - except (subprocess.TimeoutExpired, OSError): - return False - - -def pytest_collection_modifyitems(config, items): - skip_docker = pytest.mark.skip(reason="Docker not available or daemon not running") - if not _docker_available(): - for item in items: - if "tests/docker/" in str(item.fspath): - item.add_marker(skip_docker) - - -@pytest.fixture(scope="session") -def built_image(): - """Build the image once per test session. Override with HERMES_TEST_IMAGE - env var to point at a pre-built image (faster local iteration).""" - if os.environ.get("HERMES_TEST_IMAGE"): - return IMAGE_TAG - repo_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..")) - result = subprocess.run( - ["docker", "build", "-t", IMAGE_TAG, repo_root], - capture_output=True, text=True, timeout=1200, - ) - assert result.returncode == 0, f"docker build failed:\n{result.stderr[-2000:]}" - return IMAGE_TAG - - -@pytest.fixture -def container_name(request): - """Generate a unique container name + ensure cleanup on test exit.""" - name = f"hermes-test-{request.node.name.replace('[', '_').replace(']', '_')}" - yield name - subprocess.run(["docker", "rm", "-f", name], capture_output=True, timeout=10) -``` - -**Step 2: Commit** - -```bash -git add tests/docker/__init__.py tests/docker/conftest.py -git commit -m "test(docker): add conftest fixtures for docker harness" -``` - -### Task 0.2: Harness — main hermes invocation patterns - -**Objective:** Lock behavior of `docker run `, `docker run chat -q …`, `docker run sleep infinity`, `docker run bash`. - -**Files:** -- Create: `tests/docker/test_main_invocation.py` - -**Step 1: Write the tests** - -```python -"""Harness: docker run [cmd...] invocation patterns. - -These tests MUST pass on the current tini-based image AND continue to -pass after the Phase 2 s6 migration. Any behavior drift is a regression. -""" -import subprocess - - -def test_no_args_starts_hermes(built_image): - """`docker run ` should start hermes (exits with code 0 or 1 — - depends on whether config is present, but must not crash with a stack trace).""" - r = subprocess.run( - ["docker", "run", "--rm", built_image, "--version"], - capture_output=True, text=True, timeout=60, - ) - assert r.returncode in (0, 1), f"Unexpected exit {r.returncode}: {r.stderr}" - assert "Traceback" not in r.stderr - - -def test_chat_subcommand_passthrough(built_image): - """`docker run chat -q "hi"` should exec `hermes chat -q "hi"`.""" - # Use --help so we don't need a model configured - r = subprocess.run( - ["docker", "run", "--rm", built_image, "chat", "--help"], - capture_output=True, text=True, timeout=60, - ) - assert r.returncode == 0 - assert "chat" in r.stdout.lower() or "usage" in r.stdout.lower() - - -def test_bare_executable_passthrough(built_image): - """`docker run sleep 1` should exec `sleep 1` directly.""" - r = subprocess.run( - ["docker", "run", "--rm", built_image, "sleep", "1"], - capture_output=True, text=True, timeout=30, - ) - assert r.returncode == 0 - - -def test_bash_pattern(built_image): - """`docker run bash -c "echo ok"` should exec bash directly.""" - r = subprocess.run( - ["docker", "run", "--rm", built_image, "bash", "-c", "echo ok"], - capture_output=True, text=True, timeout=30, - ) - assert r.returncode == 0 - assert "ok" in r.stdout - - -def test_container_exit_code_matches_hermes_exit(built_image): - """`docker run sh -c 'exit 42'` — container should exit with 42.""" - r = subprocess.run( - ["docker", "run", "--rm", built_image, "sh", "-c", "exit 42"], - capture_output=True, text=True, timeout=30, - ) - assert r.returncode == 42 -``` - -**Step 2: Run against current image — should pass** - -```bash -scripts/run_tests.sh tests/docker/test_main_invocation.py -v -``` - -Expected: 5 passed. - -**Step 3: Commit** - -```bash -git add tests/docker/test_main_invocation.py -git commit -m "test(docker): lock main hermes invocation patterns" -``` - -### Task 0.3: Harness — interactive TUI - -**Objective:** Lock the `docker run -it … --tui` behavior. This is the hardest test to automate because it requires a PTY on the host side. - -**Files:** -- Create: `tests/docker/test_tui_passthrough.py` - -**Step 1: Write the test** - -```python -"""Harness: interactive TUI TTY passthrough. - -Uses `script -qc` on the host to allocate a PTY for the docker client, -which then allocates a container-side PTY via `-t`. The probe inside the -container is `tput cols`, which returns a real column count when stdout -is a TTY and 80 (the terminfo fallback) or nothing when it is not. - -We set COLUMNS=123 in the container env so a real TTY reports 123. -""" -import shlex -import shutil -import subprocess -import pytest - -pytestmark = pytest.mark.skipif( - shutil.which("script") is None, reason="`script` command not available" -) - - -def test_tty_passthrough_to_container(built_image): - """`docker run -t` must deliver a real TTY to the container process.""" - probe = "if [ -t 1 ]; then tput cols; else echo NO_TTY; fi" - cmd = f"docker run --rm -t -e COLUMNS=123 {built_image} sh -c {shlex.quote(probe)}" - r = subprocess.run( - ["script", "-qc", cmd, "/dev/null"], - capture_output=True, text=True, timeout=120, - ) - output = r.stdout.strip() - assert "NO_TTY" not in output, f"TTY passthrough failed: {output!r}" - # Real TTY reports a positive number. With COLUMNS=123 in env and a real - # PTY, tput should agree with COLUMNS or report the PTY width. - numeric_lines = [s for s in output.split() if s.strip().isdigit()] - assert numeric_lines, f"No numeric width in output: {output!r}" - assert int(numeric_lines[0]) > 0 - - -def test_tui_flag_recognized(built_image): - """`docker run -it --tui --help` should at minimum not crash.""" - cmd = f"docker run --rm -t {built_image} --help" - r = subprocess.run( - ["script", "-qc", cmd, "/dev/null"], - capture_output=True, text=True, timeout=60, - ) - assert r.returncode == 0 -``` - -**Step 2: Run — should pass against current tini image** - -```bash -scripts/run_tests.sh tests/docker/test_tui_passthrough.py -v -``` - -**Step 3: Commit** - -```bash -git add tests/docker/test_tui_passthrough.py -git commit -m "test(docker): lock TTY passthrough for interactive TUI" -``` - -### Task 0.4: Harness — dashboard opt-in and crash behavior - -**Objective:** Lock the HERMES_DASHBOARD=1 opt-in. Current (tini) behavior: dashboard starts once; if it crashes it stays dead. After Phase 2: dashboard starts once; if it crashes it restarts. - -**Files:** -- Create: `tests/docker/test_dashboard.py` - -**Step 1: Write the tests** - -```python -"""Harness: dashboard opt-in via HERMES_DASHBOARD.""" -import subprocess -import time - - -def test_dashboard_not_running_by_default(built_image, container_name): - subprocess.run( - ["docker", "run", "-d", "--name", container_name, built_image, - "sleep", "30"], - check=True, capture_output=True, timeout=30, - ) - time.sleep(3) - r = subprocess.run( - ["docker", "exec", container_name, "pgrep", "-f", "hermes dashboard"], - capture_output=True, text=True, timeout=10, - ) - assert r.returncode != 0, "Dashboard should NOT be running without HERMES_DASHBOARD" - - -def test_dashboard_opt_in_starts(built_image, container_name): - subprocess.run( - ["docker", "run", "-d", "--name", container_name, - "-e", "HERMES_DASHBOARD=1", built_image, "sleep", "30"], - check=True, capture_output=True, timeout=30, - ) - time.sleep(5) - r = subprocess.run( - ["docker", "exec", container_name, "pgrep", "-f", "hermes dashboard"], - capture_output=True, text=True, timeout=10, - ) - assert r.returncode == 0, f"Dashboard should be running with HERMES_DASHBOARD=1" - - -def test_dashboard_port_override(built_image, container_name): - subprocess.run( - ["docker", "run", "-d", "--name", container_name, - "-e", "HERMES_DASHBOARD=1", "-e", "HERMES_DASHBOARD_PORT=9120", - built_image, "sleep", "30"], - check=True, capture_output=True, timeout=30, - ) - time.sleep(5) - r = subprocess.run( - ["docker", "exec", container_name, "sh", "-c", - "ss -tlnp 2>/dev/null | grep ':9120' || netstat -tln | grep ':9120'"], - capture_output=True, text=True, timeout=10, - ) - assert "9120" in r.stdout, f"Dashboard not listening on 9120: {r.stdout}" -``` - -**Note:** this task documents an explicit behavior difference between tini and s6: -- On tini (pre-Phase 2): dashboard crash stays dead. No restart test — we'd be encoding broken behavior as an invariant. -- On s6 (post-Phase 2): dashboard crash is supervised and restarted. A new test `test_dashboard_restarts_after_crash` is added in Phase 2 Task 2.5. - -**Step 2: Commit** - -```bash -git add tests/docker/test_dashboard.py -git commit -m "test(docker): lock dashboard opt-in behavior" -``` - -### Task 0.5: Harness — per-profile gateway lifecycle - -**Objective:** Lock the `hermes profile create` + ` gateway start` flow *inside* the container. This is the feature we're going to materially change in Phase 4, so the harness here needs to cover exactly the user-visible surface we're preserving. - -**Important caveat — these tests describe the POST-PHASE-4 behavior, not the current one.** Today, `hermes gateway start` inside the container deliberately exits with status 0 and prints "Service start is not applicable inside a Docker container — the gateway runs as the container's main process. Run the gateway directly: hermes gateway run." So `pgrep -f 'gateway.*'` will find nothing and the tests below will fail against the tini image. That's expected. The tests are marked `xfail(strict=True)` here so they: - -1. Run in Phase 0 and confirm they're currently failing for the documented reason (no silent skip). -2. Flip to passing automatically in Phase 4 when `_dispatch_via_service_manager_if_s6` lands AND the `elif is_container():` rejection arms in `gateway_command` are removed (Task 4.3). -3. `strict=True` means an unexpected pass also fails the test — i.e. if someone accidentally fixes container-side gateway lifecycle outside the Phase 4 mechanism, we hear about it. - -**Files:** -- Create: `tests/docker/test_profile_gateway.py` - -**Step 1: Write the tests** - -```python -"""Harness: per-profile gateway start/stop inside the container. - -Phase 4 will change the *implementation* of these commands inside the -container (they'll talk to s6 instead of refusing). The user-visible -surface that should result is locked here. - -NOTE: These tests are marked xfail(strict=True) until Phase 4 lands. -The current tini image deliberately refuses gateway start/stop inside -containers — `pgrep` finds nothing and the tests fail. After Phase 4 -they should flip to passing automatically. -""" -import subprocess -import time -import pytest - -PROFILE = "test-harness-profile" - -_PHASE4_REASON = ( - "Phase 4 not yet landed: container-side `hermes gateway start` " - "currently exits 0 with an informational message instead of " - "spawning/supervising a gateway. Remove this marker after Task 4.3." -) - - -def _sh(container: str, command: str, timeout: int = 30): - return subprocess.run( - ["docker", "exec", container, "sh", "-c", command], - capture_output=True, text=True, timeout=timeout, - ) - - -@pytest.mark.xfail(reason=_PHASE4_REASON, strict=True) -def test_profile_create_then_gateway_start(built_image, container_name): - subprocess.run( - ["docker", "run", "-d", "--name", container_name, built_image, - "sleep", "120"], - check=True, capture_output=True, timeout=30, - ) - time.sleep(3) - - # Create the profile - r = _sh(container_name, f"hermes profile create {PROFILE}") - assert r.returncode == 0, f"profile create failed: {r.stderr}" - - # Start its gateway (foreground=False returns after spawn) - r = _sh(container_name, f"hermes -p {PROFILE} gateway start", timeout=60) - assert r.returncode == 0, f"gateway start failed: {r.stderr}\n{r.stdout}" - - time.sleep(3) - - # Process should exist - r = _sh(container_name, f"pgrep -f 'gateway.*{PROFILE}'") - assert r.returncode == 0, "gateway process not running" - - # Stop it - r = _sh(container_name, f"hermes -p {PROFILE} gateway stop", timeout=30) - assert r.returncode == 0 - - time.sleep(2) - - # Process should be gone - r = _sh(container_name, f"pgrep -f 'gateway.*{PROFILE}'") - assert r.returncode != 0, "gateway process still running after stop" - - -@pytest.mark.xfail(reason=_PHASE4_REASON, strict=True) -def test_profile_delete_stops_gateway(built_image, container_name): - """Deleting a profile should stop its gateway if running.""" - subprocess.run( - ["docker", "run", "-d", "--name", container_name, built_image, - "sleep", "120"], - check=True, capture_output=True, timeout=30, - ) - time.sleep(3) - - _sh(container_name, f"hermes profile create {PROFILE}") - _sh(container_name, f"hermes -p {PROFILE} gateway start", timeout=60) - time.sleep(3) - - r = _sh(container_name, f"hermes profile delete {PROFILE} --yes", timeout=30) - assert r.returncode == 0 - - time.sleep(2) - r = _sh(container_name, f"pgrep -f 'gateway.*{PROFILE}'") - assert r.returncode != 0, "gateway still running after profile delete" -``` - -**Step 2: Run — confirm both fail as expected** - -```bash -scripts/run_tests.sh tests/docker/test_profile_gateway.py -v -``` - -Expected: 2 `xfailed` (the strict=True ones). If either *passes* unexpectedly, investigate before moving on — something has changed about container behavior that the plan doesn't account for. If either *errors* (rather than failing), the docker fixture/build is broken and needs fixing before proceeding. - -**Step 3: Commit** - -```bash -git add tests/docker/test_profile_gateway.py -git commit -m "test(docker): lock per-profile gateway lifecycle target (xfail until Phase 4)" -``` - -**Task 4.3 reminder:** when Phase 4 lands, remove both `@pytest.mark.xfail(...)` markers and the `_PHASE4_REASON` constant. The tests should then pass against the s6 image. - -### Task 0.6: Harness — zombie reaping - -**Objective:** Lock the current behavior that tini reaps zombie processes spawned by hermes subagent subprocesses. - -**Files:** -- Create: `tests/docker/test_zombie_reaping.py` - -**Step 1: Write the test** - -```python -"""Harness: PID 1 must reap orphaned zombies.""" -import subprocess -import time - - -def test_orphan_zombies_reaped(built_image, container_name): - """Spawn an orphan child that exits immediately. PID 1 must reap it.""" - subprocess.run( - ["docker", "run", "-d", "--name", container_name, built_image, - "sleep", "60"], - check=True, capture_output=True, timeout=30, - ) - time.sleep(2) - - # Spawn an orphan process tree that creates a zombie - subprocess.run( - ["docker", "exec", container_name, "sh", "-c", - "( ( sleep 0.1 & ) & ); sleep 1"], - capture_output=True, text=True, timeout=10, - ) - time.sleep(1) - - # Check for zombies (ps shows 'Z' in STAT column for zombies) - r = subprocess.run( - ["docker", "exec", container_name, "ps", "axo", "stat,pid,comm"], - capture_output=True, text=True, timeout=10, - ) - zombies = [line for line in r.stdout.split("\n") if line.strip().startswith("Z")] - assert not zombies, f"Zombies not reaped: {zombies}" -``` - -**Step 2: Commit** - -```bash -git add tests/docker/test_zombie_reaping.py -git commit -m "test(docker): lock zombie reaping by PID 1" -``` - -### Task 0.7: Run full harness, document baseline - -**Objective:** All Phase 0 tests pass against the current image. This is the baseline for every subsequent phase. - -```bash -scripts/run_tests.sh tests/docker/ -v -``` - -Expected: all pass. If any fail, investigate before proceeding to Phase 0.5. - ---- - -## Phase 0.5 — Dockerfile and shell linting - -**Goal:** Bring `hadolint` (Dockerfile) and `shellcheck` (entrypoint script) into CI. These catch classes of regression that the behavioral harness can't — e.g. `RUN` commands that fail silently, unquoted variable expansions. - -### Task 0.5.1: Add hadolint to CI - -**Objective:** `hadolint Dockerfile` runs in CI and fails the build on warnings. - -**Files:** -- Create: `.hadolint.yaml` -- Modify: `.github/workflows/ci.yml` (or wherever Docker-related CI lives) - -**Step 1: Write `.hadolint.yaml` with starting ruleset** - -```yaml -# hadolint configuration for the Hermes Agent Dockerfile. -# See https://github.com/hadolint/hadolint#configure for rules. -failure-threshold: warning - -# Allow pinning to specific versions of system packages via apt-get — this is -# a pragmatic tradeoff for a fast-moving project. -ignored: - - DL3008 # Pin versions in apt get install (we intentionally don't pin common tools) - - DL3009 # Delete apt-get lists after installing (we do this, hadolint occasionally false-positives) - -# Require explicit base-image pins (SHA256) which we already do. -trusted-registries: - - docker.io - - ghcr.io -``` - -**Step 2: Run hadolint against the current Dockerfile** - -```bash -docker run --rm -i hadolint/hadolint:latest < Dockerfile -``` - -Fix any warnings raised (do not ignore them by adding to `.hadolint.yaml` unless they're genuinely false positives — document the rationale for each ignore). - -**Step 3: Add CI job** - -Append to the existing CI workflow (file path depends on current CI layout — check `.github/workflows/`): - -```yaml - lint-dockerfile: - name: Lint Dockerfile (hadolint) - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: hadolint/hadolint-action@v3.1.0 - with: - dockerfile: Dockerfile - config: .hadolint.yaml - failure-threshold: warning -``` - -**Step 4: Commit** - -```bash -git add .hadolint.yaml .github/workflows/ci.yml Dockerfile -git commit -m "ci: add hadolint for Dockerfile linting" -``` - -### Task 0.5.2: Add shellcheck to CI for docker entrypoint - -**Objective:** `shellcheck docker/entrypoint.sh` runs in CI and fails on errors. - -**Files:** -- Modify: `.github/workflows/ci.yml` - -**Step 1: Run shellcheck against the current entrypoint** - -```bash -shellcheck docker/entrypoint.sh -``` - -Fix any errors raised. Use `# shellcheck disable=SCxxxx` with a one-line justification for each intentional exception. - -**Step 2: Add CI job** - -```yaml - lint-shell: - name: Lint shell scripts (shellcheck) - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - name: Run shellcheck - uses: ludeeus/action-shellcheck@master - with: - scandir: './docker' - severity: error -``` - -**Step 3: Commit** - -```bash -git add .github/workflows/ci.yml docker/entrypoint.sh -git commit -m "ci: add shellcheck for docker/ shell scripts" -``` - ---- - -## Phase 1 — ServiceManager protocol + systemd/launchd wrappers - -**Goal:** Introduce `ServiceManager` Protocol with the runtime-registration surface from D4. Wrap existing `systemd_*` / `launchd_*` functions behind it. No behavior change; pure refactor. - -Phase 0 harness must keep passing across this phase. - -### Task 1.1: Create ServiceManager protocol module - -**Objective:** Define the abstract interface. - -**Files:** -- Create: `hermes_cli/service_manager.py` -- Create: `tests/hermes_cli/test_service_manager.py` - -**Step 1: Write `tests/hermes_cli/test_service_manager.py`** - -```python -"""Tests for the ServiceManager protocol and detect_service_manager().""" -import pytest -from hermes_cli.service_manager import ( - ServiceManager, - detect_service_manager, -) - - -def test_detect_service_manager_returns_known_value(): - result = detect_service_manager() - assert result in ("systemd", "launchd", "windows", "s6", "none") - - -def test_profile_name_validation(): - """Profile names used for registration must be safe as directory names.""" - from hermes_cli.service_manager import validate_profile_name - # Valid - validate_profile_name("coder") - validate_profile_name("my-profile") - validate_profile_name("assistant_v2") - # Invalid: uppercase - with pytest.raises(ValueError): - validate_profile_name("Coder") - # Invalid: path traversal - with pytest.raises(ValueError): - validate_profile_name("foo/bar") - # Invalid: empty - with pytest.raises(ValueError): - validate_profile_name("") - # Invalid: too long (s6 name_max is 251) - with pytest.raises(ValueError): - validate_profile_name("a" * 252) -``` - -**Step 2: Create `hermes_cli/service_manager.py`** - -```python -"""Abstract service manager interface. - -Wraps the existing systemd (Linux host), launchd (macOS host), and -s6 (container) backends behind a common Protocol. Only the s6 backend -supports runtime registration (for per-profile gateways). - -Host-side call sites (setup wizard, uninstall, status) continue to -use the existing module-level functions in hermes_cli.gateway — -this protocol is a thin facade used by new code that needs to be -backend-agnostic (specifically the profile create/delete hooks). -""" -from __future__ import annotations - -import re -from typing import Literal, Protocol, runtime_checkable - -ServiceManagerKind = Literal["systemd", "launchd", "windows", "s6", "none"] - -_VALID_PROFILE_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$") -_MAX_PROFILE_LEN = 251 # s6-svscan -L default (name_max) - - -def validate_profile_name(name: str) -> None: - """Raise ValueError if `name` is not usable as a profile name. - - Profile names are used as s6 service directory names, so they must - match a conservative subset of filesystem-safe characters. - """ - if not name: - raise ValueError("profile name must not be empty") - if len(name) > _MAX_PROFILE_LEN: - raise ValueError(f"profile name too long ({len(name)} > {_MAX_PROFILE_LEN})") - if not _VALID_PROFILE_RE.match(name): - raise ValueError( - f"profile name must match [a-z0-9][a-z0-9_-]*, got {name!r}" - ) - - -@runtime_checkable -class ServiceManager(Protocol): - """Abstract interface for init-system-specific service operations. - - Lifecycle methods (start/stop/restart/is_running) are implemented by - all backends. Runtime registration (register_profile_gateway / - unregister_profile_gateway) is only implemented by the s6 backend — - callers MUST check supports_runtime_registration() before using it. - """ - - kind: ServiceManagerKind - - # Lifecycle of a pre-declared service - def start(self, name: str) -> None: ... - def stop(self, name: str) -> None: ... - def restart(self, name: str) -> None: ... - def is_running(self, name: str) -> bool: ... - - # Runtime registration (s6 only) - def supports_runtime_registration(self) -> bool: ... - def register_profile_gateway( - self, profile: str, *, port: int, - extra_env: dict[str, str] | None = None, - ) -> None: ... - def unregister_profile_gateway(self, profile: str) -> None: ... - def list_profile_gateways(self) -> list[str]: ... - - -def detect_service_manager() -> ServiceManagerKind: - """Detect which service manager is available in this environment. - - Returns "s6" in a container when /init is s6-svscan, "windows" on - native Windows, "launchd" on macOS, "systemd" on Linux hosts with - systemctl, "none" otherwise. - - Does NOT replace supports_systemd_services() — host call sites - continue to use that. This is for new backend-agnostic code. - """ - from hermes_cli.gateway import is_macos, is_windows, supports_systemd_services - from hermes_constants import is_container - - if is_container() and _s6_running(): - return "s6" - if is_windows(): - return "windows" - if is_macos(): - return "launchd" - if supports_systemd_services(): - return "systemd" - return "none" - - -def _s6_running() -> bool: - """True when s6-svscan is running as PID 1 in this container.""" - from pathlib import Path - try: - exe = Path("/proc/1/exe").resolve() - return exe.name in ("s6-svscan", "init") and Path("/run/s6").exists() - except (OSError, RuntimeError): - return False -``` - -**Step 3: Run tests — pass** - -```bash -scripts/run_tests.sh tests/hermes_cli/test_service_manager.py -v -``` - -Expected: 2 passed. - -**Step 4: Commit** - -```bash -git add hermes_cli/service_manager.py tests/hermes_cli/test_service_manager.py -git commit -m "feat(service_manager): introduce ServiceManager protocol and detection" -``` - -### Task 1.2: Add SystemdServiceManager, LaunchdServiceManager, WindowsServiceManager wrappers - -**Objective:** Wrap the existing `systemd_*` / `launchd_*` module-level functions in `hermes_cli/gateway.py` and the `gateway_windows.*` functions in `hermes_cli/gateway_windows.py`. Lifecycle methods delegate; runtime registration raises NotImplementedError. - -**Files:** -- Modify: `hermes_cli/service_manager.py` -- Modify: `tests/hermes_cli/test_service_manager.py` - -> **v3 note:** `gateway_windows.install()` signature is now `install(force=False, *, start_now=None, start_on_login=None, elevated_handoff=False)` (PRs `d948de39e` + `417a653d9`, ~420 LOC of changes between v2 and v3). The `WindowsServiceManager` wrapper currently isn't called from any non-Windows code path, so accept these kwargs with sensible defaults and forward them: -> -> ```python -> class WindowsServiceManager: -> kind = "windows" -> def install(self, *, force=False, start_now=None, start_on_login=None, -> elevated_handoff=False) -> None: -> from hermes_cli import gateway_windows as gw -> gw.install(force=force, start_now=start_now, -> start_on_login=start_on_login, -> elevated_handoff=elevated_handoff) -> ``` -> -> `SystemdServiceManager.install` and `LaunchdServiceManager.install` continue to take just `force` plus their respective backend-specific args (e.g. systemd's `system: bool`, `run_as_user: str`). The protocol's `install` signature is therefore lifecycle-only — keep it minimal (`install(force: bool = False) -> None`) and let backends absorb the extra args via keyword-only on the concrete class. Callers that need the Windows kwargs must already be on the Windows path. - -**Step 1: Write failing tests** - -```python -def test_systemd_manager_kind_and_registration_unsupported(): - from hermes_cli.service_manager import SystemdServiceManager - mgr = SystemdServiceManager() - assert mgr.kind == "systemd" - assert mgr.supports_runtime_registration() is False - with pytest.raises(NotImplementedError): - mgr.register_profile_gateway("foo", port=9100) - with pytest.raises(NotImplementedError): - mgr.unregister_profile_gateway("foo") - assert mgr.list_profile_gateways() == [] - - -def test_launchd_manager_kind_and_registration_unsupported(): - from hermes_cli.service_manager import LaunchdServiceManager - mgr = LaunchdServiceManager() - assert mgr.kind == "launchd" - assert mgr.supports_runtime_registration() is False - - -def test_windows_manager_kind_and_registration_unsupported(): - from hermes_cli.service_manager import WindowsServiceManager - mgr = WindowsServiceManager() - assert mgr.kind == "windows" - assert mgr.supports_runtime_registration() is False - with pytest.raises(NotImplementedError): - mgr.register_profile_gateway("foo", port=9100) -``` - -**Step 2: Add wrapper classes** - -Append to `hermes_cli/service_manager.py`: - -```python -class _RegistrationUnsupportedMixin: - """Mixin for host backends that don't support runtime registration.""" - - def supports_runtime_registration(self) -> bool: - return False - - def register_profile_gateway( - self, profile: str, *, port: int, - extra_env: dict[str, str] | None = None, - ) -> None: - raise NotImplementedError( - f"{type(self).__name__} does not support runtime profile " - "gateway registration (container-only feature)" - ) - - def unregister_profile_gateway(self, profile: str) -> None: - raise NotImplementedError( - f"{type(self).__name__} does not support runtime profile " - "gateway unregistration (container-only feature)" - ) - - def list_profile_gateways(self) -> list[str]: - return [] - - -class SystemdServiceManager(_RegistrationUnsupportedMixin): - """Thin wrapper around systemd_* functions in hermes_cli.gateway. - - Host call sites continue to use the module-level functions directly; - this wrapper exists for backend-agnostic code (the profile hooks). - """ - kind: ServiceManagerKind = "systemd" - - def start(self, name: str) -> None: - from hermes_cli.gateway import systemd_start - systemd_start() # operates on the current profile's gateway by default - - def stop(self, name: str) -> None: - from hermes_cli.gateway import systemd_stop - systemd_stop() - - def restart(self, name: str) -> None: - from hermes_cli.gateway import systemd_restart - systemd_restart() - - def is_running(self, name: str) -> bool: - from hermes_cli.gateway import _probe_systemd_service_running - _, running = _probe_systemd_service_running() - return running - - -class LaunchdServiceManager(_RegistrationUnsupportedMixin): - """Thin wrapper around launchd_* functions in hermes_cli.gateway.""" - kind: ServiceManagerKind = "launchd" - - def start(self, name: str) -> None: - from hermes_cli.gateway import launchd_start - launchd_start() - - def stop(self, name: str) -> None: - from hermes_cli.gateway import launchd_stop - launchd_stop() - - def restart(self, name: str) -> None: - from hermes_cli.gateway import launchd_restart - launchd_restart() - - def is_running(self, name: str) -> bool: - from hermes_cli.gateway import _probe_launchd_service_running - return _probe_launchd_service_running() - - -class WindowsServiceManager(_RegistrationUnsupportedMixin): - """Thin wrapper around gateway_windows.* functions. - - Native Windows uses a Scheduled Task (or a Startup-folder fallback) - instead of an init-system service. Lifecycle delegates to the - existing `gateway_windows` module which already handles both paths. - """ - kind: ServiceManagerKind = "windows" - - def start(self, name: str) -> None: - from hermes_cli import gateway_windows - gateway_windows.start() - - def stop(self, name: str) -> None: - from hermes_cli import gateway_windows - gateway_windows.stop() - - def restart(self, name: str) -> None: - from hermes_cli import gateway_windows - gateway_windows.restart() - - def is_running(self, name: str) -> bool: - # gateway_windows tracks installed/registered state; combine with - # process-level check via the existing helpers in hermes_cli.gateway. - from hermes_cli import gateway_windows - from hermes_cli.gateway import find_gateway_pids - if not gateway_windows.is_installed(): - return False - return bool(find_gateway_pids()) -``` - -**Note:** the `name` parameter on these wrappers is currently unused — the underlying systemd/launchd/windows functions operate on the current profile. This is a known limitation; host-side, callers use the profile-aware CLI surface (`hermes -p gateway start`) which loads the right profile before calling these functions. The wrapper API shape is designed for s6 where `name` is the service-directory name. - -**Step 3: Run tests — pass** - -```bash -scripts/run_tests.sh tests/hermes_cli/test_service_manager.py -v -``` - -Expected: 5 passed. - -**Step 4: Commit** - -```bash -git add hermes_cli/service_manager.py tests/hermes_cli/test_service_manager.py -git commit -m "feat(service_manager): add Systemd/Launchd/Windows ServiceManager wrappers" -``` - -### Task 1.3: Factory function get_service_manager() - -**Objective:** Single entry point for picking the right backend based on the current environment. - -**Files:** -- Modify: `hermes_cli/service_manager.py` -- Modify: `tests/hermes_cli/test_service_manager.py` - -**Step 1: Tests** - -```python -def test_get_service_manager_returns_correct_backend(monkeypatch): - from hermes_cli import service_manager as sm - monkeypatch.setattr(sm, "detect_service_manager", lambda: "systemd") - assert isinstance(sm.get_service_manager(), sm.SystemdServiceManager) - monkeypatch.setattr(sm, "detect_service_manager", lambda: "launchd") - assert isinstance(sm.get_service_manager(), sm.LaunchdServiceManager) - monkeypatch.setattr(sm, "detect_service_manager", lambda: "windows") - assert isinstance(sm.get_service_manager(), sm.WindowsServiceManager) - monkeypatch.setattr(sm, "detect_service_manager", lambda: "none") - with pytest.raises(RuntimeError, match="no supported service manager"): - sm.get_service_manager() -``` - -**Step 2: Add factory** - -```python -def get_service_manager() -> ServiceManager: - """Return the ServiceManager instance for this environment. - - Raises RuntimeError when no supported backend is available. The s6 - backend ships in Phase 3; until then, "s6" detection raises. - """ - kind = detect_service_manager() - if kind == "systemd": - return SystemdServiceManager() - if kind == "launchd": - return LaunchdServiceManager() - if kind == "windows": - return WindowsServiceManager() - if kind == "s6": - raise RuntimeError("s6 backend not yet implemented (Phase 3)") - raise RuntimeError("no supported service manager detected") -``` - -**Step 3: Commit** - -```bash -git add hermes_cli/service_manager.py tests/hermes_cli/test_service_manager.py -git commit -m "feat(service_manager): add get_service_manager() factory" -``` - -### Task 1.4: CI gate — no regressions - -```bash -scripts/run_tests.sh tests/hermes_cli/ tests/docker/ -v -``` - -Verify: -- Phase 0 harness still passes -- No call sites modified: - ```bash - git diff --stat main -- hermes_cli/gateway.py hermes_cli/setup.py \ - hermes_cli/uninstall.py hermes_cli/profiles.py hermes_cli/status.py - ``` - Expected: 0 files changed outside of `hermes_cli/service_manager.py` and its tests. - ---- - -## Phase 2 — s6 replaces tini as PID 1 (BREAKING) - -**Goal:** Container ENTRYPOINT becomes `/init`. Main hermes runs as an s6 service with container-exit semantics. Dashboard is a separately-supervised s6 service. `tini` is removed. Interactive TUI passthrough works. - -**The hard gate:** The Phase 0 harness (all tests in `tests/docker/`) must pass unchanged after this phase. No behavior drift. - -### Task 2.1: Install s6-overlay in the image (still using tini as PID 1) - -**Objective:** Add s6-overlay binaries to the image as a separate Dockerfile layer. Before this task is done, tini is still PID 1; after, s6 binaries are on PATH but unused. - -**Files:** -- Modify: `Dockerfile` — add new layer after the existing apt install block - -**Step 1: Add the install layer** - -In `Dockerfile`, insert after the existing `apt-get install ... && rm -rf /var/lib/apt/lists/*` block: - -```dockerfile -# ---------- s6-overlay install ---------- -# s6-overlay provides supervision for the main hermes process, the dashboard, -# and per-profile gateways. /init becomes PID 1 later in this Dockerfile. -ARG S6_OVERLAY_VERSION=3.2.3.0 -ADD https://github.com/just-containers/s6-overlay/releases/download/v${S6_OVERLAY_VERSION}/s6-overlay-noarch.tar.xz /tmp/ -ADD https://github.com/just-containers/s6-overlay/releases/download/v${S6_OVERLAY_VERSION}/s6-overlay-x86_64.tar.xz /tmp/ -ADD https://github.com/just-containers/s6-overlay/releases/download/v${S6_OVERLAY_VERSION}/s6-overlay-symlinks-noarch.tar.xz /tmp/ -RUN tar -C / -Jxpf /tmp/s6-overlay-noarch.tar.xz && \ - tar -C / -Jxpf /tmp/s6-overlay-x86_64.tar.xz && \ - tar -C / -Jxpf /tmp/s6-overlay-symlinks-noarch.tar.xz && \ - rm /tmp/s6-overlay-*.tar.xz -``` - -> **Note:** If you need to build for aarch64 (M1/M2 Macs, ARM servers), substitute `s6-overlay-x86_64.tar.xz` with `s6-overlay-aarch64.tar.xz`. The plan currently assumes x86_64; multi-arch is out of scope and deferred to a follow-up. See the `Dockerfile`'s base image — if it goes multi-arch, this layer needs `TARGETARCH` plumbing. - -**Step 2: Rebuild and re-run Phase 0 harness** - -```bash -docker build -t hermes-agent-harness:latest . -scripts/run_tests.sh tests/docker/ -v -``` - -Expected: all pass (binaries installed but not yet in use). - -**Step 3: Commit** - -```bash -git add Dockerfile -git commit -m "feat(docker): install s6-overlay v3.2.3.0 (not yet PID 1)" -``` - -### Task 2.2: Create s6-rc service definitions for main hermes and dashboard - -**Objective:** Declarative service directories shipped in the image. - -**Files:** -- Create: `docker/s6-rc.d/main-hermes/type` -- Create: `docker/s6-rc.d/main-hermes/run` -- Create: `docker/s6-rc.d/main-hermes/finish` -- Create: `docker/s6-rc.d/main-hermes/dependencies.d/base` (empty) -- Create: `docker/s6-rc.d/dashboard/type` -- Create: `docker/s6-rc.d/dashboard/run` -- Create: `docker/s6-rc.d/dashboard/dependencies.d/base` (empty) -- Create: `docker/s6-rc.d/user/contents.d/main-hermes` (empty — registers in user bundle) -- Create: `docker/s6-rc.d/user/contents.d/dashboard` (empty — registers in user bundle) - -**Step 1: main-hermes service** - -`docker/s6-rc.d/main-hermes/type`: -``` -longrun -``` - -`docker/s6-rc.d/main-hermes/run`: -```sh -#!/command/with-contenv sh - -# In TUI mode, main hermes runs as the container's CMD (exec'd by /init -# with TTY intact, not as an s6 service). See D9. -if [ -f /var/run/s6/container_environment/HERMES_TUI_MODE ]; then - exec sleep infinity -fi - -# Non-TUI path: run as supervised service. -cd /opt/data -. /opt/hermes/.venv/bin/activate - -if [ -n "${HERMES_CMD:-}" ]; then - # Bare executable (sleep, bash, sh -c ...) — exec directly as hermes user - exec s6-setuidgid hermes sh -c "${HERMES_CMD}" -fi - -# Default: hermes with any subcommand args -exec s6-setuidgid hermes hermes ${HERMES_ARGS:-} -``` - -`docker/s6-rc.d/main-hermes/finish`: -```sh -#!/command/execlineb -S2 -# $1 = exit code (256 if killed by signal), $2 = signal number -foreground { - if { eltest $1 -eq 256 } - redirfd -w 1 /run/s6-linux-init-container-results/exitcode echo $((128 + $2)) -} -foreground { - if { eltest $1 -ne 256 } - redirfd -w 1 /run/s6-linux-init-container-results/exitcode echo $1 -} -/run/s6/basedir/bin/halt -``` - -Empty files: `docker/s6-rc.d/main-hermes/dependencies.d/base`, `docker/s6-rc.d/user/contents.d/main-hermes`. - -**Step 2: dashboard service (OQ3-A: always declared, run script checks env)** - -`docker/s6-rc.d/dashboard/type`: -``` -longrun -``` - -`docker/s6-rc.d/dashboard/run`: -```sh -#!/command/with-contenv sh -# Dashboard only runs when HERMES_DASHBOARD is truthy. Otherwise we sleep -# forever so s6 still supervises this slot but does nothing. - -case "${HERMES_DASHBOARD:-}" in - 1|true|TRUE|True|yes|YES|Yes) ;; - *) exec sleep infinity ;; -esac - -cd /opt/data -. /opt/hermes/.venv/bin/activate - -dash_host="${HERMES_DASHBOARD_HOST:-0.0.0.0}" -dash_port="${HERMES_DASHBOARD_PORT:-9119}" - -insecure="" -case "$dash_host" in - 127.0.0.1|localhost) ;; - *) insecure="--insecure" ;; -esac - -exec s6-setuidgid hermes hermes dashboard \ - --host "$dash_host" --port "$dash_port" --no-open $insecure -``` - -Empty files: `docker/s6-rc.d/dashboard/dependencies.d/base`, `docker/s6-rc.d/user/contents.d/dashboard`. - -**Step 3: Commit** - -```bash -git add docker/s6-rc.d/ -git commit -m "feat(docker): add s6-rc service definitions for main-hermes and dashboard" -``` - -### Task 2.3: Rewrite entrypoint as s6 stage2 hook - -**Objective:** Move gosu-drop + config bootstrap + skills sync out of the main exec path and into a cont-init.d script. Detect the TUI case and set `HERMES_TUI_MODE`. - -**Files:** -- Create: `docker/stage2-hook.sh` -- Rewrite: `docker/entrypoint.sh` (becomes a thin shim) - -> **v3 note:** The current entrypoint also writes `${HERMES_HOME:=/opt/data}/.install_method` with content `"docker"` after the gosu drop and venv activate (added in PR #27843, May 18). This stamp is read by `detect_install_method()` for `hermes status` install-method reporting. The stage2-hook.sh rewrite below must preserve this stamp — recommended placement is **inside the `--- Seed directory structure as hermes user ---` block** in stage2-hook.sh (which already drops to the hermes user via `s6-setuidgid hermes`), so the file is created with hermes ownership and survives the VOLUME overlay. Concrete line to include: -> -> ```sh -> s6-setuidgid hermes sh -c 'echo "docker" > "${HERMES_HOME:=/opt/data}/.install_method"' 2>/dev/null || true -> ``` - -**Step 1: Create `docker/stage2-hook.sh`** - -```sh -#!/bin/sh -# s6-overlay stage2 hook — runs as root after supervision tree is up but -# before user services start. Handles UID/GID remap, chown, config seeding, -# skill sync, and TUI detection. -# -# Per-service privilege drop happens inside each service's `run` script via -# s6-setuidgid, not here. - -set -eu - -HERMES_HOME="${HERMES_HOME:-/opt/data}" -INSTALL_DIR="/opt/hermes" - -# --- UID/GID remap --- -if [ -n "${HERMES_UID:-}" ] && [ "$HERMES_UID" != "$(id -u hermes)" ]; then - echo "[stage2] Changing hermes UID to $HERMES_UID" - usermod -u "$HERMES_UID" hermes -fi -if [ -n "${HERMES_GID:-}" ] && [ "$HERMES_GID" != "$(id -g hermes)" ]; then - echo "[stage2] Changing hermes GID to $HERMES_GID" - groupmod -o -g "$HERMES_GID" hermes 2>/dev/null || true -fi - -# --- Fix ownership of data volume --- -actual_hermes_uid=$(id -u hermes) -needs_chown=false -if [ -n "${HERMES_UID:-}" ] && [ "$HERMES_UID" != "10000" ]; then - needs_chown=true -elif [ "$(stat -c %u "$HERMES_HOME" 2>/dev/null)" != "$actual_hermes_uid" ]; then - needs_chown=true -fi -if [ "$needs_chown" = true ]; then - echo "[stage2] Fixing ownership of $HERMES_HOME to hermes ($actual_hermes_uid)" - chown -R hermes:hermes "$HERMES_HOME" 2>/dev/null || \ - echo "[stage2] Warning: chown failed (rootless container?) — continuing" -fi - -# --- config.yaml permissions --- -if [ -f "$HERMES_HOME/config.yaml" ]; then - chown hermes:hermes "$HERMES_HOME/config.yaml" 2>/dev/null || true - chmod 640 "$HERMES_HOME/config.yaml" 2>/dev/null || true -fi - -# --- Seed directory structure as hermes user --- -su -s /bin/sh hermes -c "mkdir -p \"$HERMES_HOME\"/{cron,sessions,logs,hooks,memories,skills,skins,plans,workspace,home}" - -# --- Seed config files --- -for pair in ".env:.env.example" "config.yaml:cli-config.yaml.example" "SOUL.md:docker/SOUL.md"; do - dest="${pair%%:*}" - src="${pair##*:}" - if [ ! -f "$HERMES_HOME/$dest" ]; then - su -s /bin/sh hermes -c "cp \"$INSTALL_DIR/$src\" \"$HERMES_HOME/$dest\"" - fi -done - -# --- Sync bundled skills --- -if [ -d "$INSTALL_DIR/skills" ]; then - su -s /bin/sh hermes -c ". $INSTALL_DIR/.venv/bin/activate && python3 $INSTALL_DIR/tools/skills_sync.py" -fi - -# --- Detect TUI invocation --- -_is_tui_invocation() { - for arg in "$@"; do - case "$arg" in --tui|-T) return 0 ;; esac - done - case "${HERMES_TUI:-}" in 1|true|TRUE|yes) return 0 ;; esac - # Implicit: stdin is a TTY and no subcommand given - if [ -t 0 ] && [ $# -eq 0 ]; then return 0; fi - return 1 -} - -if _is_tui_invocation "$@"; then - touch /var/run/s6/container_environment/HERMES_TUI_MODE - echo "[stage2] TUI mode detected; main-hermes service will no-op and CMD runs as TTY-connected main" -fi - -# --- Pass CMD through to main-hermes service --- -# Bare executable → HERMES_CMD; otherwise → HERMES_ARGS for `hermes $HERMES_ARGS` -if [ $# -gt 0 ] && command -v "$1" >/dev/null 2>&1; then - printf '%s' "$*" > /var/run/s6/container_environment/HERMES_CMD -else - printf '%s' "$*" > /var/run/s6/container_environment/HERMES_ARGS -fi - -echo "[stage2] Setup complete; starting user services" -``` - -```bash -chmod +x docker/stage2-hook.sh -``` - -**Step 2: Simplify `docker/entrypoint.sh` to a shim** - -Replace the entire file with: - -```sh -#!/bin/sh -# s6-overlay shim. The real logic lives in docker/stage2-hook.sh, invoked -# by /etc/cont-init.d/01-hermes-setup (installed in the Dockerfile). -# This file exists so external references to docker/entrypoint.sh still -# work, but it's no longer the ENTRYPOINT — /init is. -exec /opt/hermes/docker/stage2-hook.sh "$@" -``` - -**Step 3: Run shellcheck** - -```bash -shellcheck docker/stage2-hook.sh docker/entrypoint.sh -``` - -Fix any errors. - -**Step 4: Commit** - -```bash -git add docker/stage2-hook.sh docker/entrypoint.sh -git commit -m "feat(docker): rewrite entrypoint as s6-overlay stage2 hook" -``` - -### Task 2.4: Flip the ENTRYPOINT in the Dockerfile - -**Objective:** Replace `tini` with `/init`. Wire service defs and stage2 hook into the image. Remove `tini`. - -**Files:** -- Modify: `Dockerfile` - -> **v3 note:** The current Dockerfile (post-PR #27843) has a `RUN mkdir -p /opt/data` line immediately before `VOLUME [ "/opt/data" ]`. **Keep this line.** It was added because the volume overlay was wiping out files written to /opt/data during build — same reason it's needed under s6. Do not delete it during the entrypoint swap. - -**Step 1: Update `Dockerfile`** - -Remove `tini` from the apt install line. Add after the s6-overlay install block (from Task 2.1): - -```dockerfile -# ---------- s6-overlay service wiring ---------- -COPY docker/s6-rc.d/ /etc/s6-overlay/s6-rc.d/ -RUN chmod +x /etc/s6-overlay/s6-rc.d/main-hermes/run \ - /etc/s6-overlay/s6-rc.d/main-hermes/finish \ - /etc/s6-overlay/s6-rc.d/dashboard/run - -# Install cont-init.d hook that runs our stage2 setup as root before services start -RUN mkdir -p /etc/cont-init.d && \ - printf '#!/bin/sh\nexec /opt/hermes/docker/stage2-hook.sh "$@"\n' \ - > /etc/cont-init.d/01-hermes-setup && \ - chmod +x /etc/cont-init.d/01-hermes-setup -``` - -Replace the ENTRYPOINT line: - -```dockerfile -# s6-overlay's /init is PID 1. It sets up the supervision tree, runs -# /etc/cont-init.d/ scripts (our stage2 hook), starts s6-rc services, -# and reaps zombies. -ENTRYPOINT [ "/init" ] -# Default CMD: no args → main-hermes service runs `hermes` with no args -CMD [ ] -``` - -**Step 2: Run hadolint** - -```bash -docker run --rm -i hadolint/hadolint:latest < Dockerfile -``` - -Fix any warnings. - -**Step 3: Rebuild and run full harness** - -```bash -docker build -t hermes-agent-harness:latest . -scripts/run_tests.sh tests/docker/ -v -``` - -Expected: **all Phase 0 tests pass**. This is the hard gate. If any fail, diagnose before committing. - -**Step 4: Commit** - -```bash -git add Dockerfile -git commit -m "feat(docker)!: replace tini with s6-overlay as PID 1 - -BREAKING CHANGE: container ENTRYPOINT is now /init (s6-overlay) instead -of /usr/bin/tini. Main hermes and dashboard run as supervised s6 services. -All docker run invocation patterns (chat, sleep, bash, --tui) -continue to work identically — verified by the Phase 0 test harness." -``` - -### Task 2.5: Add restart-on-crash test for dashboard - -**Objective:** Now that s6 supervises the dashboard, a crash should be recovered. This is a new test, not a Phase 0 baseline — it encodes a new invariant that only holds post-Phase 2. - -**Files:** -- Modify: `tests/docker/test_dashboard.py` - -**Step 1: Add the test** - -```python -def test_dashboard_restarts_after_crash(built_image, container_name): - """After Phase 2: s6 supervises the dashboard. SIGKILL the process; - s6 should restart it within ~2 seconds.""" - subprocess.run( - ["docker", "run", "-d", "--name", container_name, - "-e", "HERMES_DASHBOARD=1", built_image, "sleep", "60"], - check=True, capture_output=True, timeout=30, - ) - time.sleep(5) - - # Find dashboard PID - r = subprocess.run( - ["docker", "exec", container_name, "pgrep", "-f", "hermes dashboard"], - capture_output=True, text=True, timeout=10, - ) - assert r.returncode == 0, "Dashboard not running initially" - first_pid = r.stdout.strip().split()[0] - - # Kill it - subprocess.run( - ["docker", "exec", container_name, "kill", "-9", first_pid], - capture_output=True, timeout=10, - ) - - # Wait for s6 to restart - time.sleep(3) - - r = subprocess.run( - ["docker", "exec", container_name, "pgrep", "-f", "hermes dashboard"], - capture_output=True, text=True, timeout=10, - ) - assert r.returncode == 0, "Dashboard not restarted after kill" - second_pid = r.stdout.strip().split()[0] - assert second_pid != first_pid, "PID unchanged — not actually restarted" -``` - -**Step 2: Commit** - -```bash -git add tests/docker/test_dashboard.py -git commit -m "test(docker): verify s6 restarts dashboard after crash" -``` - ---- - -## Phase 3 — S6ServiceManager implements runtime registration - -**Goal:** Implement `register_profile_gateway` / `unregister_profile_gateway` / `list_profile_gateways` in a new `S6ServiceManager` class. No existing caller yet — this phase is purely additive. Phase 4 wires it into the profile lifecycle. - -### Task 3.1: Scaffolding — S6ServiceManager class - -**Objective:** Create the class, wire it into the factory, stub the registration methods. - -**Files:** -- Modify: `hermes_cli/service_manager.py` -- Modify: `tests/hermes_cli/test_service_manager.py` - -**Step 1: Tests** - -```python -def test_s6_manager_kind_and_supports_registration(): - from hermes_cli.service_manager import S6ServiceManager - mgr = S6ServiceManager() - assert mgr.kind == "s6" - assert mgr.supports_runtime_registration() is True - - -def test_factory_returns_s6_when_detected(monkeypatch): - from hermes_cli import service_manager as sm - monkeypatch.setattr(sm, "detect_service_manager", lambda: "s6") - assert isinstance(sm.get_service_manager(), sm.S6ServiceManager) -``` - -**Step 2: Add the class** - -Append to `hermes_cli/service_manager.py`: - -```python -from pathlib import Path - -# s6-overlay scandir for dynamic services. This directory is tmpfs inside -# the container and writable by the hermes user. s6-svscan watches it. -S6_DYNAMIC_SCANDIR = Path("/run/service") -S6_SERVICE_PREFIX = "gateway-" - - -class S6ServiceManager: - """Per-profile gateway supervision via s6-overlay. - - Static services (main-hermes, dashboard) are managed via s6-rc at - image build time and are NOT managed by this class. This class only - handles per-profile gateway services, which are created at runtime - when `hermes profile create ` runs inside the container. - """ - kind: ServiceManagerKind = "s6" - - def __init__(self, scandir: Path = S6_DYNAMIC_SCANDIR): - self.scandir = scandir - - def _service_dir(self, profile: str) -> Path: - validate_profile_name(profile) - return self.scandir / f"{S6_SERVICE_PREFIX}{profile}" - - # Lifecycle - def start(self, name: str) -> None: - # name is the s6 service directory basename (gateway-) - import subprocess - subprocess.run( - ["s6-svc", "-u", str(self.scandir / name)], - check=True, capture_output=True, timeout=5, - ) - - def stop(self, name: str) -> None: - import subprocess - subprocess.run( - ["s6-svc", "-d", str(self.scandir / name)], - check=True, capture_output=True, timeout=5, - ) - - def restart(self, name: str) -> None: - import subprocess - subprocess.run( - ["s6-svc", "-t", str(self.scandir / name)], - check=True, capture_output=True, timeout=5, - ) - - def is_running(self, name: str) -> bool: - import subprocess - result = subprocess.run( - ["s6-svstat", str(self.scandir / name)], - capture_output=True, text=True, timeout=5, - ) - return result.returncode == 0 and "up " in result.stdout - - # Runtime registration — implemented in Task 3.2/3.3/3.4 - def supports_runtime_registration(self) -> bool: - return True - - def register_profile_gateway(self, profile, *, port, extra_env=None): - raise NotImplementedError # Task 3.2 - - def unregister_profile_gateway(self, profile): - raise NotImplementedError # Task 3.3 - - def list_profile_gateways(self): - raise NotImplementedError # Task 3.4 -``` - -Update `get_service_manager()`: - -```python - if kind == "s6": - return S6ServiceManager() -``` - -**Step 3: Commit** - -```bash -git add hermes_cli/service_manager.py tests/hermes_cli/test_service_manager.py -git commit -m "feat(service_manager): add S6ServiceManager scaffolding" -``` - -### Task 3.2: Implement register_profile_gateway - -**Objective:** Write the service directory for a profile gateway, trigger s6 scan. - -**Step 1: Tests** - -```python -def test_register_profile_gateway_creates_service_dir(tmp_path, monkeypatch): - from hermes_cli.service_manager import S6ServiceManager - - scandir = tmp_path / "service" - scandir.mkdir() - mgr = S6ServiceManager(scandir=scandir) - - called = [] - def fake_run(cmd, **kw): - called.append(cmd) - import subprocess as sp - return sp.CompletedProcess(cmd, 0, "", "") - monkeypatch.setattr("subprocess.run", fake_run) - - mgr.register_profile_gateway("coder", port=9150) - - svc_dir = scandir / "gateway-coder" - assert svc_dir.is_dir() - assert (svc_dir / "type").read_text().strip() == "longrun" - assert (svc_dir / "run").is_file() - run_content = (svc_dir / "run").read_text() - assert "hermes -p coder gateway start" in run_content - assert "--port 9150" in run_content or "--port=9150" in run_content - assert "s6-setuidgid hermes" in run_content - - # Log rotation persists under HERMES_HOME (OQ8-C). The path must come - # from the runtime env, not be hard-coded — check we emit a shell var - # expansion rather than a literal /opt/data/... - log_run = svc_dir / "log" / "run" - assert log_run.is_file() - log_run_content = log_run.read_text() - assert "$HERMES_HOME" in log_run_content - assert "logs/gateways/coder" in log_run_content - # Negative assertion: the path must NOT be Python-substituted to /opt/data - assert "/opt/data/logs/gateways/coder" not in log_run_content, \ - "log_dir was hard-coded; must use ${HERMES_HOME} at run time" - - # s6-svscanctl was invoked - assert any("s6-svscanctl" in str(c) for c in called) - - -def test_register_profile_rejects_duplicate(tmp_path): - from hermes_cli.service_manager import S6ServiceManager - scandir = tmp_path / "service" - (scandir / "gateway-coder").mkdir(parents=True) - mgr = S6ServiceManager(scandir=scandir) - with pytest.raises(ValueError, match="already registered"): - mgr.register_profile_gateway("coder", port=9150) -``` - -**Step 2: Implement** - -```python - def register_profile_gateway( - self, - profile: str, - *, - port: int, - extra_env: dict[str, str] | None = None, - ) -> None: - """Write an s6 service directory for the given profile's gateway and - trigger s6-svscan to pick it up. - - Raises: - ValueError: if a service for the profile is already registered - RuntimeError: if s6-svscanctl fails - """ - import subprocess - - svc_dir = self._service_dir(profile) - if svc_dir.exists(): - raise ValueError( - f"profile gateway {profile!r} already registered at {svc_dir}" - ) - - svc_dir.mkdir(parents=True) - (svc_dir / "type").write_text("longrun\n") - - # run script: drop to hermes, exec foreground gateway - run_script = self._render_run_script(profile, port, extra_env or {}) - (svc_dir / "run").write_text(run_script) - (svc_dir / "run").chmod(0o755) - - # log/ subservice: persistent rotation under HERMES_HOME (OQ8-C) - log_subdir = svc_dir / "log" - log_subdir.mkdir() - (log_subdir / "run").write_text(self._render_log_run(profile)) - (log_subdir / "run").chmod(0o755) - - # Trigger s6 scan - result = subprocess.run( - ["s6-svscanctl", "-a", str(self.scandir)], - capture_output=True, text=True, timeout=5, - ) - if result.returncode != 0: - # Clean up partial directory - import shutil - shutil.rmtree(svc_dir, ignore_errors=True) - raise RuntimeError( - f"s6-svscanctl failed: {result.stderr or result.stdout}" - ) - - def _render_run_script( - self, profile: str, port: int, extra_env: dict[str, str] - ) -> str: - import shlex - lines = [ - "#!/command/with-contenv sh", - "set -e", - "cd /opt/data", - ". /opt/hermes/.venv/bin/activate", - ] - for k, v in sorted(extra_env.items()): - lines.append(f"export {k}={shlex.quote(v)}") - lines.append( - f"exec s6-setuidgid hermes hermes -p {shlex.quote(profile)} " - f"gateway start --foreground --port {port}" - ) - return "\n".join(lines) + "\n" - - def _render_log_run(self, profile: str) -> str: - # OQ8-C: persist to ${HERMES_HOME}/logs/gateways// - # IMPORTANT: do NOT hard-code /opt/data here — read HERMES_HOME from the - # container environment at run time so `-e HERMES_HOME=/some/other` works. - # The `with-contenv` shebang sources /run/s6/container_environment/* which - # was populated by the stage2 hook from the actual container env. - import shlex - prof = shlex.quote(profile) - return ( - f"#!/command/with-contenv sh\n" - f": \"${{HERMES_HOME:=/opt/data}}\"\n" - f"log_dir=\"$HERMES_HOME/logs/gateways/{prof}\"\n" - f"mkdir -p \"$log_dir\"\n" - f"chown -R hermes:hermes \"$log_dir\" 2>/dev/null || true\n" - f"exec s6-setuidgid hermes s6-log n10 s1000000 T \"$log_dir\"\n" - ) -``` - -**Step 3: Commit** - -```bash -git add hermes_cli/service_manager.py tests/hermes_cli/test_service_manager.py -git commit -m "feat(service_manager): implement S6ServiceManager.register_profile_gateway" -``` - -### Task 3.3: Implement unregister_profile_gateway - -**Step 1: Tests** - -```python -def test_unregister_profile_gateway_removes_service_dir(tmp_path, monkeypatch): - from hermes_cli.service_manager import S6ServiceManager - scandir = tmp_path / "service" - svc_dir = scandir / "gateway-coder" - svc_dir.mkdir(parents=True) - (svc_dir / "type").write_text("longrun\n") - - called = [] - def fake_run(cmd, **kw): - called.append(cmd) - import subprocess as sp - return sp.CompletedProcess(cmd, 0, "", "") - monkeypatch.setattr("subprocess.run", fake_run) - - mgr = S6ServiceManager(scandir=scandir) - mgr.unregister_profile_gateway("coder") - - # s6-svc -d was called - assert any("s6-svc" in str(c) and "-d" in c for c in called) - # Service dir removed - assert not svc_dir.exists() - # Rescan triggered - assert any("s6-svscanctl" in str(c) for c in called) - - -def test_unregister_absent_profile_is_noop(tmp_path): - from hermes_cli.service_manager import S6ServiceManager - scandir = tmp_path / "service" - scandir.mkdir() - mgr = S6ServiceManager(scandir=scandir) - # Should not raise - mgr.unregister_profile_gateway("nonexistent") -``` - -**Step 2: Implement** - -```python - def unregister_profile_gateway(self, profile: str) -> None: - """Stop the profile's gateway service and remove its directory. - - Idempotent: absent services are a no-op. - """ - import subprocess - import shutil - - svc_dir = self._service_dir(profile) - if not svc_dir.exists(): - return - - # Stop the service (best effort) - subprocess.run( - ["s6-svc", "-d", str(svc_dir)], - capture_output=True, text=True, timeout=5, - check=False, - ) - # Wait briefly for it to go down - subprocess.run( - ["s6-svwait", "-D", "-t", "10000", str(svc_dir)], - capture_output=True, text=True, timeout=15, - check=False, - ) - - # Remove the directory - shutil.rmtree(svc_dir, ignore_errors=True) - - # Rescan to drop s6-supervise process - subprocess.run( - ["s6-svscanctl", "-an", str(self.scandir)], - capture_output=True, text=True, timeout=5, - check=False, - ) -``` - -**Step 3: Commit** - -```bash -git add hermes_cli/service_manager.py tests/hermes_cli/test_service_manager.py -git commit -m "feat(service_manager): implement S6ServiceManager.unregister_profile_gateway" -``` - -### Task 3.4: Implement list_profile_gateways - -**Step 1: Test + implementation** - -```python -def test_list_profile_gateways(tmp_path): - from hermes_cli.service_manager import S6ServiceManager - scandir = tmp_path / "service" - scandir.mkdir() - (scandir / "gateway-coder").mkdir() - (scandir / "gateway-assistant").mkdir() - (scandir / "other-service").mkdir() # not a gateway, should be filtered out - (scandir / ".hidden").mkdir() - - mgr = S6ServiceManager(scandir=scandir) - profiles = sorted(mgr.list_profile_gateways()) - assert profiles == ["assistant", "coder"] -``` - -Implementation: - -```python - def list_profile_gateways(self) -> list[str]: - """List all currently-registered profile gateway service names - (returns the profile names, not the service-dir names).""" - if not self.scandir.exists(): - return [] - profiles = [] - for entry in self.scandir.iterdir(): - if entry.name.startswith("."): - continue - if not entry.is_dir(): - continue - if not entry.name.startswith(S6_SERVICE_PREFIX): - continue - profiles.append(entry.name[len(S6_SERVICE_PREFIX):]) - return profiles -``` - -**Step 2: Commit** - -```bash -git add hermes_cli/service_manager.py tests/hermes_cli/test_service_manager.py -git commit -m "feat(service_manager): implement S6ServiceManager.list_profile_gateways" -``` - -### Task 3.5: In-container integration test - -**Objective:** Validate the full register → start → kill → restart → unregister cycle inside a real container. - -**Files:** -- Create: `tests/docker/test_s6_profile_gateway_integration.py` - -**Step 1: Test** - -```python -"""End-to-end test of S6ServiceManager.register_profile_gateway + lifecycle.""" -import subprocess -import time - - -def test_register_and_supervise_profile_gateway(built_image, container_name): - subprocess.run( - ["docker", "run", "-d", "--name", container_name, built_image, - "sleep", "120"], - check=True, capture_output=True, timeout=30, - ) - time.sleep(3) - - # Register a test profile gateway via the Python API - register_script = ''' -import sys -sys.path.insert(0, "/opt/hermes") -from hermes_cli.service_manager import S6ServiceManager -mgr = S6ServiceManager() -# Create a minimal profile first so `hermes -p` works -import subprocess -subprocess.run(["hermes", "profile", "create", "it-test"], check=True) -mgr.register_profile_gateway("it-test", port=9201) -print("REGISTERED") -''' - r = subprocess.run( - ["docker", "exec", container_name, "python3", "-c", register_script], - capture_output=True, text=True, timeout=60, - ) - assert "REGISTERED" in r.stdout, f"register failed: {r.stderr}" - - # Service dir exists - r = subprocess.run( - ["docker", "exec", container_name, "test", "-d", - "/run/service/gateway-it-test"], - capture_output=True, text=True, timeout=10, - ) - assert r.returncode == 0 - - # Wait for s6 to bring it up - time.sleep(5) - - # Check s6-svstat reports it as up - r = subprocess.run( - ["docker", "exec", container_name, "s6-svstat", - "/run/service/gateway-it-test"], - capture_output=True, text=True, timeout=10, - ) - assert "up " in r.stdout, f"service not up: {r.stdout}" - - # Kill the gateway process; s6 should restart it - subprocess.run( - ["docker", "exec", container_name, "sh", "-c", - "pkill -9 -f 'gateway.*it-test' || true"], - capture_output=True, timeout=10, - ) - time.sleep(3) - - r = subprocess.run( - ["docker", "exec", container_name, "s6-svstat", - "/run/service/gateway-it-test"], - capture_output=True, text=True, timeout=10, - ) - assert "up " in r.stdout, f"service not restarted: {r.stdout}" - - # Unregister - unregister_script = ''' -import sys -sys.path.insert(0, "/opt/hermes") -from hermes_cli.service_manager import S6ServiceManager -S6ServiceManager().unregister_profile_gateway("it-test") -print("UNREGISTERED") -''' - r = subprocess.run( - ["docker", "exec", container_name, "python3", "-c", unregister_script], - capture_output=True, text=True, timeout=30, - ) - assert "UNREGISTERED" in r.stdout - - # Service dir gone - r = subprocess.run( - ["docker", "exec", container_name, "test", "-d", - "/run/service/gateway-it-test"], - capture_output=True, text=True, timeout=10, - ) - assert r.returncode != 0 -``` - -**Step 2: Commit** - -```bash -git add tests/docker/test_s6_profile_gateway_integration.py -git commit -m "test(docker): integration test for S6ServiceManager profile gateway lifecycle" -``` - ---- - -## Phase 4 — Wire profile create/delete into the s6 backend - -**Goal:** When `hermes profile create ` runs inside the container, register the profile's gateway with s6. When `hermes profile delete` runs, unregister. Existing `hermes -p gateway start/stop/restart` commands, inside the container, dispatch to s6 via the ServiceManager. - -After this phase, the Phase 0 `test_profile_gateway.py` harness (which currently passes against the current implementation) must still pass — but now the underlying mechanism is s6-supervised. - -### Task 4.0: Reconcile per-profile gateways on container boot - -**Objective:** Survive `docker restart`. Service directories at `/run/service/gateway-/` live on **tmpfs** and are wiped when the container restarts, but the profile directories themselves (`/opt/data/profiles//`) and each profile's `gateway_state.json` live on the persistent VOLUME. On boot, walk the persistent profiles, recreate the s6 service registrations, and bring back up any profile whose last recorded state was `running`. Without this, every `docker restart` silently loses every per-profile gateway, even though the user's profiles still exist on disk. - -**Files:** -- Create: `docker/cont-init.d/02-reconcile-profiles` (s6-overlay cont-init.d script — runs as root after `01-hermes-setup` from Task 2.3, before s6-rc starts user services) -- Create: `hermes_cli/container_boot.py` (Python module the cont-init.d script invokes; keeps logic testable in isolation) -- Modify: `Dockerfile` (copy the new cont-init.d script and ensure it's executable) -- Create: `tests/hermes_cli/test_container_boot.py` (unit tests for the reconciliation logic against a fake `$HERMES_HOME`) -- Modify: Phase 0 harness (`tests/docker/test_container_restart.py` — new test asserting end-to-end restart survival) - -**Step 1: Define the reconciliation contract** - -For each profile dir under `$HERMES_HOME/profiles//` (and the default profile at `$HERMES_HOME/` itself if it's the in-container layout): - -1. **Read `gateway_state.json`** if present. The schema (see `gateway/status.py`) records `gateway_state ∈ {starting, running, startup_failed, stopped}` plus a timestamp. -2. **Clean up stale runtime files.** Remove `gateway.pid` from the profile dir if it exists — the recorded PID belongs to the dead container's process namespace, and a numerically-equal live PID in the new container would be a different process. Also remove `processes.json`. -3. **Always recreate the s6 service registration** at `/run/service/gateway-/` (down state) — even if the last recorded state was `stopped`. This ensures `hermes -p gateway start` works without going through `register_profile_gateway` first, matching the invariant "every profile has a service slot." -4. **Auto-start only if the last recorded state was `running`.** `starting` does NOT auto-start (the gateway crashed during boot last time — assume the user wants to investigate, don't crash-loop on restart). `startup_failed` does NOT auto-start (explicit prior failure). `stopped` does NOT auto-start (explicit prior stop). Missing `gateway_state.json` does NOT auto-start (gateway was never run). -5. **Write a reconciliation log** to `$HERMES_HOME/logs/container-boot.log` with one line per profile: ` profile= prior_state= action=`. Operators inspect this to debug "why didn't my profile come back up." - -**Step 2: Write failing tests for `container_boot.reconcile_profile_gateways`** - -```python -# tests/hermes_cli/test_container_boot.py -import json -from pathlib import Path -import pytest -from hermes_cli.container_boot import ( - reconcile_profile_gateways, - ReconcileAction, -) - -def _make_profile(hermes_home: Path, name: str, *, state: str | None, - with_pid: bool = False) -> Path: - """Create a fake profile directory under hermes_home/profiles//.""" - p = hermes_home / "profiles" / name - p.mkdir(parents=True) - (p / "config.yaml").write_text("model: test\n") # marks it as a real profile - if state is not None: - (p / "gateway_state.json").write_text(json.dumps({ - "gateway_state": state, "timestamp": 1234567890, - })) - if with_pid: - (p / "gateway.pid").write_text(json.dumps({"pid": 99999, "host": "old-container"})) - return p - - -def test_running_profile_is_reregistered_and_autostarted(tmp_path, monkeypatch): - monkeypatch.setenv("HERMES_HOME", str(tmp_path)) - scandir = tmp_path / "run-service" - scandir.mkdir() - _make_profile(tmp_path, "coder", state="running") - - actions = reconcile_profile_gateways( - hermes_home=tmp_path, scandir=scandir, dry_run=False, - ) - - assert actions == [ReconcileAction(profile="coder", prior_state="running", - action="started")] - assert (scandir / "gateway-coder" / "run").exists() - assert (scandir / "gateway-coder" / "run").stat().st_mode & 0o111 # executable - - -def test_stopped_profile_is_reregistered_but_not_started(tmp_path): - scandir = tmp_path / "run-service"; scandir.mkdir() - _make_profile(tmp_path, "writer", state="stopped") - - actions = reconcile_profile_gateways( - hermes_home=tmp_path, scandir=scandir, dry_run=False, - ) - - assert actions == [ReconcileAction(profile="writer", prior_state="stopped", - action="registered")] - assert (scandir / "gateway-writer" / "run").exists() - # The down-marker file tells s6 to not start the service initially - assert (scandir / "gateway-writer" / "down").exists() - - -def test_startup_failed_profile_is_not_autostarted(tmp_path): - """Avoid crash-loop on restart when the gateway was failing to boot.""" - scandir = tmp_path / "run-service"; scandir.mkdir() - _make_profile(tmp_path, "broken", state="startup_failed") - - actions = reconcile_profile_gateways( - hermes_home=tmp_path, scandir=scandir, dry_run=False, - ) - - assert actions[0].action == "registered" - assert (scandir / "gateway-broken" / "down").exists() - - -def test_starting_state_does_not_autostart(tmp_path): - """`starting` means the gateway died mid-boot; treat as failed, not running.""" - scandir = tmp_path / "run-service"; scandir.mkdir() - _make_profile(tmp_path, "unlucky", state="starting") - - actions = reconcile_profile_gateways( - hermes_home=tmp_path, scandir=scandir, dry_run=False, - ) - - assert actions[0].action == "registered" # NOT "started" - - -def test_stale_pid_file_is_removed(tmp_path): - scandir = tmp_path / "run-service"; scandir.mkdir() - profile = _make_profile(tmp_path, "coder", state="running", with_pid=True) - - reconcile_profile_gateways( - hermes_home=tmp_path, scandir=scandir, dry_run=False, - ) - - assert not (profile / "gateway.pid").exists() - - -def test_profile_without_state_file_is_registered_but_not_started(tmp_path): - """A freshly-created profile that's never been started: register slot, don't autostart.""" - scandir = tmp_path / "run-service"; scandir.mkdir() - _make_profile(tmp_path, "fresh", state=None) - - actions = reconcile_profile_gateways( - hermes_home=tmp_path, scandir=scandir, dry_run=False, - ) - - assert actions[0].action == "registered" - assert (scandir / "gateway-fresh" / "down").exists() - - -def test_directory_without_config_yaml_is_skipped(tmp_path): - """A directory under profiles/ that isn't actually a profile (no config.yaml) is ignored.""" - scandir = tmp_path / "run-service"; scandir.mkdir() - (tmp_path / "profiles" / "stray").mkdir(parents=True) # no config.yaml - - actions = reconcile_profile_gateways( - hermes_home=tmp_path, scandir=scandir, dry_run=False, - ) - - assert actions == [] - - -def test_reconcile_log_is_written(tmp_path): - scandir = tmp_path / "run-service"; scandir.mkdir() - _make_profile(tmp_path, "a", state="running") - _make_profile(tmp_path, "b", state="stopped") - - reconcile_profile_gateways( - hermes_home=tmp_path, scandir=scandir, dry_run=False, - ) - - log = (tmp_path / "logs" / "container-boot.log").read_text() - assert "profile=a" in log and "action=started" in log - assert "profile=b" in log and "action=registered" in log - - -def test_dry_run_makes_no_filesystem_changes(tmp_path): - scandir = tmp_path / "run-service"; scandir.mkdir() - profile = _make_profile(tmp_path, "coder", state="running", with_pid=True) - - reconcile_profile_gateways( - hermes_home=tmp_path, scandir=scandir, dry_run=True, - ) - - assert (profile / "gateway.pid").exists() # not removed under dry_run - assert not (scandir / "gateway-coder").exists() -``` - -Run the tests to confirm they fail: - -```bash -scripts/run_tests.sh tests/hermes_cli/test_container_boot.py -v -``` - -Expected: all 9 tests FAIL with `ImportError` / `AttributeError` on the missing `reconcile_profile_gateways` symbol. - -**Step 3: Implement `hermes_cli/container_boot.py`** - -```python -"""Container boot-time reconciliation of per-profile gateway s6 services. - -Service directories under /run/service/ live on tmpfs and are wiped on -container restart. Profile directories under $HERMES_HOME/profiles/ live -on the persistent VOLUME. This module bridges the two: on every container -boot, walk the persistent profiles and recreate the s6 service slots. -""" -from __future__ import annotations - -import json -import logging -import os -from dataclasses import dataclass -from pathlib import Path -from typing import Literal - -log = logging.getLogger(__name__) - -# Only this prior state triggers automatic restart. Everything else -# (startup_failed, starting, stopped, missing) registers the slot in -# the down state and waits for explicit user action. -_AUTOSTART_STATES = frozenset({"running"}) - -ReconcileActionLabel = Literal["started", "registered", "skipped"] - - -@dataclass(frozen=True) -class ReconcileAction: - profile: str - prior_state: str | None - action: ReconcileActionLabel - - -def reconcile_profile_gateways( - *, - hermes_home: Path, - scandir: Path, - dry_run: bool = False, -) -> list[ReconcileAction]: - """Recreate s6 service registrations for every persistent profile.""" - actions: list[ReconcileAction] = [] - profiles_root = hermes_home / "profiles" - if not profiles_root.is_dir(): - return actions - - for entry in sorted(profiles_root.iterdir()): - if not entry.is_dir(): - continue - if not (entry / "config.yaml").exists(): - continue # not a real profile - - prior_state = _read_prior_state(entry) - if not dry_run: - _cleanup_stale_runtime_files(entry) - _register_service(scandir, entry.name, - start=prior_state in _AUTOSTART_STATES) - - action_label: ReconcileActionLabel = ( - "started" if prior_state in _AUTOSTART_STATES else "registered" - ) - actions.append(ReconcileAction( - profile=entry.name, prior_state=prior_state, action=action_label, - )) - - if not dry_run: - _write_reconcile_log(hermes_home, actions) - return actions - - -def _read_prior_state(profile_dir: Path) -> str | None: - state_file = profile_dir / "gateway_state.json" - if not state_file.exists(): - return None - try: - return json.loads(state_file.read_text()).get("gateway_state") - except (OSError, json.JSONDecodeError): - log.warning("Could not read %s; treating as no prior state", state_file) - return None - - -def _cleanup_stale_runtime_files(profile_dir: Path) -> None: - for name in ("gateway.pid", "processes.json"): - (profile_dir / name).unlink(missing_ok=True) - - -def _register_service(scandir: Path, profile: str, *, start: bool) -> None: - service_dir = scandir / f"gateway-{profile}" - service_dir.mkdir(parents=True, exist_ok=True) - - # The actual run script content is generated by S6ServiceManager from - # Task 3.2; we duplicate the minimal contract here. Phase 4 follow-up: - # extract a single shared rendering function used by both register - # and reconcile. - run = service_dir / "run" - run.write_text(_render_run_script(profile)) - run.chmod(0o755) - - if not start: - # The presence of a `down` file tells s6-supervise to NOT start - # the service on rescan. User must `s6-svc -u` to bring it up. - (service_dir / "down").touch() - else: - (service_dir / "down").unlink(missing_ok=True) - - -def _render_run_script(profile: str) -> str: - # Mirrors the rendering in S6ServiceManager.register_profile_gateway - # (Task 3.2). Extract to a shared helper as Phase 4 cleanup. - return f"""#!/command/execlineb -P -fdmove -c 2 1 -s6-setuidgid hermes -multisubstitute {{ - importas HERMES_HOME HERMES_HOME -}} -hermes -p {profile} gateway start --foreground -""" - - -def _write_reconcile_log(hermes_home: Path, actions: list[ReconcileAction]) -> None: - log_dir = hermes_home / "logs" - log_dir.mkdir(parents=True, exist_ok=True) - import time - ts = time.strftime("%Y-%m-%dT%H:%M:%S%z") - with (log_dir / "container-boot.log").open("a") as f: - for a in actions: - f.write( - f"{ts} profile={a.profile} prior_state={a.prior_state} " - f"action={a.action}\n" - ) - - -def main() -> int: - """Entry point invoked from /etc/cont-init.d/02-reconcile-profiles.""" - hermes_home = Path(os.environ.get("HERMES_HOME", "/opt/data")) - scandir = Path(os.environ.get("S6_PROFILE_GATEWAY_SCANDIR", "/run/service")) - actions = reconcile_profile_gateways(hermes_home=hermes_home, scandir=scandir) - for a in actions: - print(f"reconcile: profile={a.profile} prior_state={a.prior_state} " - f"action={a.action}") - return 0 - - -if __name__ == "__main__": - raise SystemExit(main()) -``` - -**Step 4: Create the cont-init.d script** - -`docker/cont-init.d/02-reconcile-profiles`: - -```sh -#!/command/with-contenv sh -# Container-boot reconciliation of per-profile gateway s6 services. -# Runs as root after 01-hermes-setup (stage2 hook) has chowned the volume -# and seeded $HERMES_HOME, but before s6-rc starts user services. -# -# The actual logic lives in hermes_cli.container_boot. We invoke it via -# the bundled venv python, drop to the hermes user so the service dirs -# we write under $S6_PROFILE_GATEWAY_SCANDIR are owned by hermes (since -# the gateway processes run as hermes). -set -e -s6-setuidgid hermes /opt/hermes/.venv/bin/python -m hermes_cli.container_boot -``` - -**Step 5: Wire it into the Dockerfile** - -In Task 2.4's Dockerfile changes, the cont-init.d block already copies `/etc/cont-init.d/01-hermes-setup`. Add `02-reconcile-profiles` next to it: - -```dockerfile -COPY docker/cont-init.d/02-reconcile-profiles /etc/cont-init.d/02-reconcile-profiles -RUN chmod +x /etc/cont-init.d/02-reconcile-profiles -``` - -s6-overlay runs `/etc/cont-init.d/*` scripts in lexicographic order, so `01-hermes-setup` (gosu drop, chown, seed) runs before `02-reconcile-profiles`. The reconciliation thus runs after `$HERMES_HOME` is guaranteed to exist and be hermes-owned. - -**Step 6: Run unit tests — should now pass** - -```bash -scripts/run_tests.sh tests/hermes_cli/test_container_boot.py -v -``` - -Expected: 9 passed. - -**Step 7: Add end-to-end restart test to Phase 0 harness** - -`tests/docker/test_container_restart.py`: - -```python -"""Container restart preserves per-profile gateway registrations.""" -import shutil -import subprocess -import time -import pytest - -pytestmark = pytest.mark.skipif( - shutil.which("docker") is None, reason="Docker not available" -) - - -def _run(args: list[str], **kw) -> subprocess.CompletedProcess: - return subprocess.run(args, capture_output=True, text=True, timeout=120, **kw) - - -@pytest.fixture -def container(tmp_path, built_image): - """Long-running container with a named volume so we can stop/start it.""" - volume = f"hermes-restart-test-{tmp_path.name}" - name = f"hermes-restart-{tmp_path.name}" - _run(["docker", "volume", "create", volume]) - _run(["docker", "run", "-d", "--name", name, "-v", f"{volume}:/opt/data", - built_image, "sleep", "infinity"]) - yield name - _run(["docker", "rm", "-f", name]) - _run(["docker", "volume", "rm", "-f", volume]) - - -def _exec(container: str, cmd: list[str]) -> subprocess.CompletedProcess: - return _run(["docker", "exec", container, *cmd]) - - -def test_running_gateway_survives_container_restart(container, built_image): - # 1. Create a profile and start its gateway - _exec(container, ["hermes", "profile", "create", "coder", - "--model", "test/echo"]) - _exec(container, ["hermes", "-p", "coder", "gateway", "start"]) - - # 2. Confirm gateway_state.json was written with "running" - result = _exec(container, ["cat", "/opt/data/profiles/coder/gateway_state.json"]) - assert "running" in result.stdout - - # 3. Restart the container - _run(["docker", "restart", container]) - time.sleep(5) # give s6 and cont-init.d a moment - - # 4. The reconciliation log should record action=started - log = _exec(container, ["cat", "/opt/data/logs/container-boot.log"]) - assert "profile=coder" in log.stdout - assert "action=started" in log.stdout - - # 5. The s6 service dir should exist - result = _exec(container, ["test", "-d", "/run/service/gateway-coder"]) - assert result.returncode == 0 - - # 6. The gateway should be running (s6-svstat reports up) - status = _exec(container, ["s6-svstat", "/run/service/gateway-coder"]) - assert "up" in status.stdout - - -def test_stopped_gateway_stays_stopped_after_restart(container): - _exec(container, ["hermes", "profile", "create", "writer", - "--model", "test/echo"]) - _exec(container, ["hermes", "-p", "writer", "gateway", "start"]) - _exec(container, ["hermes", "-p", "writer", "gateway", "stop"]) - - _run(["docker", "restart", container]); time.sleep(5) - - # Service is registered but down - assert _exec(container, ["test", "-d", "/run/service/gateway-writer"]).returncode == 0 - assert _exec(container, ["test", "-f", "/run/service/gateway-writer/down"]).returncode == 0 - status = _exec(container, ["s6-svstat", "/run/service/gateway-writer"]) - assert "down" in status.stdout - - -def test_stale_gateway_pid_is_cleaned_up_on_restart(container): - _exec(container, ["hermes", "profile", "create", "x", "--model", "test/echo"]) - _exec(container, ["hermes", "-p", "x", "gateway", "start"]) - - _run(["docker", "restart", container]); time.sleep(5) - - # gateway.pid is gone (will be written fresh by the newly-started gateway, - # but the *old* PID file is gone before the new gateway starts) - # — we check the log instead since the new gateway repopulates it - log = _exec(container, ["cat", "/opt/data/logs/container-boot.log"]) - assert "profile=x" in log.stdout -``` - -**Step 8: Run integration test** - -```bash -scripts/run_tests.sh tests/docker/test_container_restart.py -v -``` - -Expected: 3 passed (assuming Docker available and the image was rebuilt with Phases 2–4 changes). - -**Step 9: Commit** - -```bash -git add hermes_cli/container_boot.py \ - docker/cont-init.d/02-reconcile-profiles \ - Dockerfile \ - tests/hermes_cli/test_container_boot.py \ - tests/docker/test_container_restart.py -git commit -m "feat(docker): reconcile per-profile gateways on container restart - -Service dirs under /run/service live on tmpfs and are wiped by docker -restart. On boot, walk \$HERMES_HOME/profiles, read each gateway_state.json, -recreate the s6 service slot, and auto-up only those that were running. - -Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md Task 4.0" -``` - -**Verification:** - -- `scripts/run_tests.sh tests/hermes_cli/test_container_boot.py tests/docker/test_container_restart.py` all green -- After `docker restart`, `s6-svstat /run/service/gateway-` for a previously-running profile reports `up`; for a previously-stopped profile reports `down` -- `cat /opt/data/logs/container-boot.log` shows one line per profile with explicit `action=` outcome - -**Open items deferred:** - -- Should `startup_failed` after N consecutive container restarts auto-promote to an alert in `hermes doctor`? Probably yes; tracked as a follow-up to this task. -- The `_render_run_script` duplication between this module and `S6ServiceManager.register_profile_gateway` (Task 3.2) is intentional duplication for testability. Phase 5 cleanup task should extract a shared helper. -- This task does NOT cover restart-policy semantics for the main hermes service itself — that's a Phase 2 concern (`finish` script behavior), already covered there. - -### Task 4.1: Hook register_profile_gateway into profile creation - -**Files:** -- Modify: `hermes_cli/profiles.py` — find the profile-creation code path (approximately near `def create_profile`) -- Modify: `tests/hermes_cli/test_profiles.py` - -**Step 1: Identify the integration point** - -```bash -grep -n "def create_profile\|def profile_create\|def _create_profile" hermes_cli/profiles.py -``` - -Read the surrounding code to find where the profile directory is seeded. The s6 registration call goes right after a successful create, guarded by `supports_runtime_registration()`. - -**Step 2: Write a failing test** - -```python -def test_profile_create_registers_s6_gateway_in_container(monkeypatch, tmp_path): - """In a container, profile create should register the s6 gateway service.""" - from hermes_cli import profiles - - registered = [] - class FakeS6Manager: - kind = "s6" - def supports_runtime_registration(self): return True - def register_profile_gateway(self, profile, *, port, extra_env=None): - registered.append(profile) - - monkeypatch.setattr( - "hermes_cli.service_manager.get_service_manager", - lambda: FakeS6Manager(), - ) - - profiles.create_profile("newprof") # exact signature TBD - - assert "newprof" in registered - - -def test_profile_create_no_op_on_host(monkeypatch): - """On host (systemd/launchd), profile create should NOT attempt s6 registration.""" - from hermes_cli import profiles - from hermes_cli.service_manager import SystemdServiceManager - - monkeypatch.setattr( - "hermes_cli.service_manager.get_service_manager", - lambda: SystemdServiceManager(), - ) - # Should not raise NotImplementedError - profiles.create_profile("hostprof") -``` - -**Step 3: Implement** - -In `hermes_cli/profiles.py`, after the successful profile creation block: - -```python -def _maybe_register_gateway_service(profile_name: str) -> None: - """In container, register the profile's gateway as an s6 service. - On host, no-op (existing systemd unit-generation paths handle it).""" - try: - from hermes_cli.service_manager import get_service_manager - mgr = get_service_manager() - except RuntimeError: - return - if not mgr.supports_runtime_registration(): - return - # Allocate port — simple sequential allocation for v1; future: port scan - from hermes_cli import profiles as _profiles_module - port = _allocate_gateway_port(profile_name) - try: - mgr.register_profile_gateway(profile_name, port=port) - except ValueError: - # Already registered — re-register would clobber, so we leave alone - pass -``` - -Add a port allocator: - -```python -_GATEWAY_PORT_BASE = 9200 - -def _allocate_gateway_port(profile_name: str) -> int: - """Deterministic port allocation based on profile name hash. - - Range [9200, 9800). Collisions are very unlikely but would fail the - gateway startup with a clear bind error. - """ - import hashlib - h = int(hashlib.sha256(profile_name.encode()).hexdigest()[:8], 16) - return _GATEWAY_PORT_BASE + (h % 600) -``` - -Call `_maybe_register_gateway_service(name)` at the end of the create-profile function. - -**Step 4: Commit** - -```bash -git add hermes_cli/profiles.py tests/hermes_cli/test_profiles.py -git commit -m "feat(profiles): register s6 gateway service on profile create in container" -``` - -### Task 4.2: Hook unregister_profile_gateway into profile deletion - -**Files:** -- Modify: `hermes_cli/profiles.py` -- Modify: `tests/hermes_cli/test_profiles.py` - -**Step 1: Tests** - -Mirror Task 4.1's tests for the delete path. - -**Step 2: Implement** - -```python -def _maybe_unregister_gateway_service(profile_name: str) -> None: - try: - from hermes_cli.service_manager import get_service_manager - mgr = get_service_manager() - except RuntimeError: - return - if not mgr.supports_runtime_registration(): - return - mgr.unregister_profile_gateway(profile_name) -``` - -Call it early in the profile-delete function (before removing the profile directory). - -**Step 3: Commit** - -```bash -git add hermes_cli/profiles.py tests/hermes_cli/test_profiles.py -git commit -m "feat(profiles): unregister s6 gateway service on profile delete" -``` - -### Task 4.3: Route `hermes -p gateway start/stop` through s6 in container - -**Objective:** Existing CLI surface continues to work. Inside the container, it talks to s6 instead of being rejected. - -**Files:** -- Modify: `hermes_cli/gateway.py` — the `gateway_command` / `_gateway_command_inner` dispatcher - -**Background — what's there today** - -`gateway_command` currently rejects gateway lifecycle commands when running inside a container. Search for `elif is_container():` in `hermes_cli/gateway.py` — you'll find arms inside `install`, `uninstall`, `start`, `stop`, and `restart` that print messages like "Service installation is not needed inside a Docker container — the container runtime is your service manager" and `sys.exit(0)`. - -These were correct under the **old** model where there was one gateway and the container itself supervised it. They're **wrong** under the new model where each profile has its own supervised gateway. Phase 4 has to delete them in the same change that introduces the s6 dispatch path. - -**Step 1: Add the s6 dispatch helper** - -```python -def _dispatch_via_service_manager_if_s6(action: str, profile: str | None = None) -> bool: - """If we're in a container with s6, dispatch gateway lifecycle via s6. - Returns True if dispatched (caller should return), False otherwise. - - `profile` defaults to the current profile (resolved via _profile_arg). - """ - from hermes_cli.service_manager import detect_service_manager, get_service_manager - if detect_service_manager() != "s6": - return False - if profile is None: - # current profile via existing helper - profile = _profile_arg() or "default" - mgr = get_service_manager() - service_name = f"gateway-{profile}" - if action == "start": - mgr.start(service_name) - elif action == "stop": - mgr.stop(service_name) - elif action == "restart": - mgr.restart(service_name) - else: - return False - return True -``` - -**Step 2: Remove the `elif is_container()` early-exit arms AND inject the s6 dispatch** - -Inside `_gateway_command_inner`, find each branch (`install`, `uninstall`, `start`, `stop`, `restart`). For each one: - -1. **Remove** the entire `elif is_container():` block that exits with an informational message. (Search for the literal string `"Docker container"` to find them — there are five.) -2. **Insert** the s6 dispatch at the top of each lifecycle handler: - -```python -elif subcmd == "start": - # Container path: hand off to s6 service manager - if _dispatch_via_service_manager_if_s6("start"): - return - # … existing host code (systemd / launchd / windows / fallback) … -``` - -For `install` and `uninstall`, treat them as no-ops inside the container under s6 — the service is auto-registered by the profile create hook (Task 4.1) and removed by the profile delete hook (Task 4.2). Add a short message: - -```python -elif subcmd == "install": - from hermes_cli.service_manager import detect_service_manager - if detect_service_manager() == "s6": - print_info("Per-profile gateways are auto-registered when you create a profile (hermes profile create ).") - print_info("Run `hermes status` to see currently-supervised gateways.") - return - # … existing host code … -``` - -The mirror applies for `uninstall`. - -**Step 3: Regression tests** - -Add a unit test for the dispatcher AND remove the xfail markers from `tests/docker/test_profile_gateway.py` (Task 0.5): - -```python -def test_dispatch_via_service_manager_invokes_s6(monkeypatch): - from hermes_cli import gateway as gw - - called = {} - class FakeMgr: - kind = "s6" - def start(self, name): called["start"] = name - def stop(self, name): called["stop"] = name - def restart(self, name): called["restart"] = name - - monkeypatch.setattr("hermes_cli.service_manager.detect_service_manager", lambda: "s6") - monkeypatch.setattr("hermes_cli.service_manager.get_service_manager", lambda: FakeMgr()) - - assert gw._dispatch_via_service_manager_if_s6("start", profile="coder") is True - assert called["start"] == "gateway-coder" - - -def test_dispatch_skips_on_host(monkeypatch): - from hermes_cli import gateway as gw - monkeypatch.setattr("hermes_cli.service_manager.detect_service_manager", lambda: "systemd") - assert gw._dispatch_via_service_manager_if_s6("start", profile="coder") is False -``` - -Then remove the xfail markers and `_PHASE4_REASON` constant from `tests/docker/test_profile_gateway.py`. - -**Step 4: Re-run Phase 0 harness** - -```bash -scripts/run_tests.sh tests/docker/test_profile_gateway.py -v -``` - -Expected: 2 passed (no longer xfailed). If they're still xfailing, the dispatch isn't intercepting — verify `detect_service_manager()` returns `"s6"` inside the container, then verify the `elif is_container():` arms were actually removed. - -**Step 5: Commit** - -```bash -git add hermes_cli/gateway.py tests/hermes_cli/test_gateway.py tests/docker/test_profile_gateway.py -git commit -m "feat(gateway): dispatch gateway start/stop through s6 inside container - -- Remove the 5 elif is_container() arms in _gateway_command_inner that - refused gateway install/uninstall/start/stop/restart inside containers. -- Add _dispatch_via_service_manager_if_s6() that intercepts start/stop/ - restart and routes them through the S6ServiceManager. -- install/uninstall become informational no-ops when running under s6 - (profile create/delete is the registration trigger). -- Remove the xfail markers from tests/docker/test_profile_gateway.py; - they now pass strictly." -``` - -### Task 4.4: Update `hermes_cli/status.py` for s6 detection - -**Objective:** `hermes status` inside the container reports "Manager: s6" instead of "systemd/manual". - -**Files:** -- Modify: `hermes_cli/status.py` - -**Locating the code:** - -```bash -grep -n '"Manager:' hermes_cli/status.py -``` - -You'll find a `print(f" Manager: …")` block that currently dispatches on `Termux / systemd / launchd / (not supported)`. - -**Step 1: Test + implementation** - -Add an `"s6"` branch to the manager-label resolution alongside the existing systemd/launchd/Termux branches. Use `detect_service_manager() == "s6"` to drive the new branch. The label should read `Manager: s6 (container supervisor)` for clarity. - -**Step 2: Commit** - -```bash -git add hermes_cli/status.py tests/hermes_cli/test_status.py -git commit -m "feat(status): report s6 as the service manager inside container" -``` - ---- - -## Phase 5 — Docs + cleanup - -### Task 5.1: Update `website/docs/user-guide/docker.md` - -**Objective:** Document the new supervision model. The dashboard IS supervised; per-profile gateways are supervised; TUI works unchanged. - -Add an "Init system" section covering: -- s6-overlay as PID 1 (replacing tini) -- Main hermes is a supervised service -- Dashboard (HERMES_DASHBOARD=1) is supervised — crashes auto-restart -- Per-profile gateways created via `hermes profile create` are supervised — crashes auto-restart -- `docker run -it --rm --tui` works unchanged -- Breaking change callout: if a downstream wrapper depended on tini specifics, pin to a pre-change image - -### Task 5.2: Create a maintainer skill - -Create `skills/software-development/hermes-s6-container-supervision/SKILL.md` documenting: -- Where service definitions live: `docker/s6-rc.d/` (static), `hermes_cli/service_manager.py` (dynamic registration) -- How to inspect a live container: `docker exec … s6-svstat /run/service/gateway-` -- How to add a new static service: create dir under `docker/s6-rc.d/`, add `contents.d` entry -- Common pitfalls: service-dir permissions, `with-contenv` shebang, `s6-setuidgid` placement -- Debugging a profile gateway that won't start: check `$HERMES_HOME/logs/gateways//current` (defaults to `/opt/data/logs/gateways//current` when `HERMES_HOME` is unset) - -### Task 5.3: Update `hermes_cli/doctor.py` for in-container runs - -**Objective:** Remove spurious warnings when `hermes doctor` runs inside the container, and surface the s6 supervision state. - -**Files:** -- Modify: `hermes_cli/doctor.py` -- Modify: `tests/hermes_cli/test_doctor.py` - -> **v3 note:** Since v2 was written, `hermes_cli/doctor.py` was refactored (PR #27830, `41f1eddee`) to introduce two helpers — `_section(title: str)` for section banners and `_fail_and_issue(text, detail, fix, issues)` for failure rendering. The 15 old copy-paste banner patterns and ~30 fail-and-issue blocks have all been migrated. **When adding the new "s6 supervision status" section under this task, use `_section("Gateway Service")` (existing section, just add an s6 branch inside) and `_fail_and_issue(...)` for any new failure paths — do NOT duplicate the old `print(color("◆ ...", Colors.CYAN, Colors.BOLD))` pattern.** The existing `_check_gateway_service_linger` function (still present, same name) is the target for the "skip on s6" branch. - -**Locating the code (function names, not line numbers — they drift):** - -```bash -grep -n "def _check_gateway_service_linger\|External Tools\|# Docker (optional)\|◆ Gateway Service" hermes_cli/doctor.py -``` - -You should find: `_check_gateway_service_linger` (called from the main doctor flow), the "External Tools" section header, the "Docker (optional)" check inside it, and the gateway service section header (currently rendered as something like `◆ Gateway Service`). - -**Changes:** - -1. **`_check_gateway_service_linger`**: skip when `detect_service_manager() == "s6"`. Replace with a new `_check_s6_supervision()` that reports main-hermes and dashboard status via `ServiceManager.is_running(...)`, plus the count of `gateway-*` services from `list_profile_gateways()`. - -2. **Docker external-tool check**: when `is_container()` is True, replace the "Docker missing" warning with an info line ("Running inside a container — Docker-in-Docker not configured, using in-container terminal backend"). Still check the `TERMINAL_ENV` config to make sure it's set to `local` inside the container (Docker backend from inside a container is not supported). - -3. **Gateway Service section header**: rename to "Service Supervisor" and dispatch on `detect_service_manager()` so the section title is accurate everywhere (systemd / launchd / windows / s6 / manual). - -**Step 1: Test + implementation — standard TDD** - -**Step 2: Commit** - -```bash -git add hermes_cli/doctor.py tests/hermes_cli/test_doctor.py -git commit -m "feat(doctor): surface s6 supervision state inside container" -``` - -### Task 5.4: Remove dead container-era systemd detection - -**Objective:** `_container_systemd_operational()` in `hermes_cli/gateway.py` was added for "systemd inside a container" detection. With s6 as the container init system, this branch is dead code. - -- Verify no code paths actually hit it in the new world (search + test suite) -- Remove the function + its `is_container()` branch in `supports_systemd_services()` -- Keep `supports_systemd_services()` returning False inside our container (now handled by the top-level `is_container()` check or by the `detect_service_manager() == "s6"` path) - -### Task 5.5: Update `website/docs/user-guide/profiles.md` - -The Profiles docs mention `hermes-gateway-.service` (systemd) — add a brief note that inside the container, per-profile gateways are supervised by s6 and use `s6-svstat` / `s6-svc` under the hood. - -### Task 5.6: Release notes - -Add a clear entry to the release notes calling out: -- New feature: per-profile gateways inside the Hermes container are now supervised — they auto-restart on crash, clean shutdown on container stop -- New feature: dashboard (`HERMES_DASHBOARD=1`) is now supervised -- Breaking change: container ENTRYPOINT is `/init` (s6-overlay) not `/usr/bin/tini`. Any external scripts that `docker exec`'d tini-specific commands need updating +The integration test `test_tty_passthrough_to_container` uses `tput cols` +and `COLUMNS=123` as the probe. --- @@ -3107,85 +353,82 @@ Add a clear entry to the release notes calling out: | Risk | Likelihood | Impact | Mitigation | |---|---|---|---| -| Phase 2 breaks a downstream user's Dockerfile that `FROM`s ours | Medium | Medium | Release notes call out ENTRYPOINT change; Phase 0 harness gives high confidence in behavior parity | -| TUI TTY passthrough fails on some Docker versions | Low | High | Phase 2 harness includes `test_tty_passthrough_to_container` as a hard gate; fallback plan = s6-fdholder (OQ9-C) | -| s6-overlay non-root quirks (logutil-service, fix-attrs) bite us | Low | Low | OQ2-A: supervisor runs as root, services drop — sidesteps these issues | -| Port collision between per-profile gateways | Low | Medium | Deterministic hash-based allocation (SHA256 of profile name) over a 600-port range; collision probability is ~1/600 per pair; gateway bind fails with a clear error if it happens, caller can set an explicit port | -| Podman rootless UID mapping confuses s6 | Medium | Low | OQ4-A: document, fix reactively; a local Podman + Docker environment will be stood up for validation | -| Phase 0 harness is flaky (docker daemon issues, timing) | Medium | Low | Generous timeouts; skip when docker unavailable; run in a CI-only job, not in fast local dev loop | -| Profile gateway crash loop masks a real config error | Low | Medium | `max_restarts` set on s6 finish script (planned for follow-up); for now, operators see crash-looping logs in `$HERMES_HOME/logs/gateways//` | -| Dockerfile+entrypoint drift from linter (hadolint/shellcheck) reveals latent bugs | Low | Low | Phase 0.5 catches them; fix or document ignore with rationale | -| Stale `gateway.pid` from a dead container collides with an unrelated live PID in the restarted container | Low | Medium | Task 4.0 reconciliation removes `gateway.pid` and `processes.json` from every profile dir on boot, before any new gateway starts. End-to-end test `test_stale_gateway_pid_is_cleaned_up_on_restart` covers it | -| `docker restart` silently loses per-profile gateway registrations (tmpfs scandir wiped) | High (without mitigation) | High | Task 4.0 reconciliation re-registers from persistent `$HERMES_HOME/profiles/` and auto-starts those last seen `running`; recorded outcome to `$HERMES_HOME/logs/container-boot.log` for forensics | -| A `running` gateway that's actually broken auto-restarts into a crash loop after every container restart | Low | Medium | s6 finish script `max_restarts` cap (already planned); follow-up: `hermes doctor` alerts when N consecutive container restarts ended in `startup_failed` | - ---- - -## Rollout Plan - -All phases after Phase 0 are gated on the Phase 0 harness passing against the modified image. No feature flags or kill switches — Phase 2 is a one-way door, which is fine given the OQ1-A decision to ship directly. - -1. **Phase 0** — merge immediately; pure test-harness addition, no behavior change -2. **Phase 0.5** — merge after 0; adds lint CI jobs -3. **Phase 1** — merge after 0.5; pure refactor addition -4. **Phase 2** — merge when Phase 0 harness is green against the new image; bump semver-major -5. **Phase 3** — merge after 2 is in a release; new capability with no callers yet -6. **Phase 4** — merge when in-container integration tests pass; activates Phase 3 -7. **Phase 5** — merge incrementally as docs/cleanup is ready +| Phase 2 breaks a downstream user's Dockerfile that `FROM`s ours | Medium | Medium | Release notes call out ENTRYPOINT change; the test harness (`tests/docker/`) gives high confidence in behavior parity | +| TUI TTY passthrough fails on some Docker versions | Low | High | Harness includes `test_tty_passthrough_to_container` as a hard gate; fallback plan = s6-fdholder ([s6-overlay#230](https://github.com/just-containers/s6-overlay/issues/230) Solution 2) | +| s6-overlay non-root quirks (logutil-service, fix-attrs) bite us | Low | Low | Supervisor runs as root, services drop — sidesteps these issues | +| Podman rootless UID mapping confuses s6 | Medium | Low | Documented as supported, fix reactively; a Podman + Docker environment is stood up for validation | +| Test harness is flaky (docker daemon issues, timing) | Medium | Low | Generous timeouts; skip when docker unavailable; polling helpers replace fixed sleeps in `test_container_restart.py` | +| Profile gateway crash loop masks a real config error | Low | Medium | s6 `finish` script `max_restarts` cap (planned follow-up); operators see crash-looping logs in `$HERMES_HOME/logs/gateways//` | +| Dockerfile+entrypoint drift from linter (hadolint/shellcheck) reveals latent bugs | Low | Low | CI lint jobs catch them; fix or document ignore with rationale | +| Stale `gateway.pid` from a dead container collides with an unrelated live PID in the restarted container | Low | Medium | Cont-init reconciliation removes `gateway.pid` and `processes.json` from every profile dir on boot, before any new gateway starts | +| `docker restart` silently loses per-profile gateway registrations (tmpfs scandir wiped) | High (without mitigation) | High | Cont-init reconciliation re-registers from persistent `$HERMES_HOME/profiles/` and auto-starts those last seen `running`; outcome recorded to `$HERMES_HOME/logs/container-boot.log` (size-bounded, rotates to `.1` at 256 KiB) | +| A `running` gateway that's actually broken auto-restarts into a crash loop after every container restart | Low | Medium | s6 `finish` script `max_restarts` cap (planned); follow-up: `hermes doctor` alerts when N consecutive container restarts ended in `startup_failed` | +| `_s6_running()` detection works as root but silently fails for unprivileged hermes user, making runtime-registration path inert | High (without mitigation) | High | **Caught in PR review.** Detection now probes `/proc/1/comm` (world-readable) + `/run/s6/basedir`. Docker integration tests refactored to `docker exec -u hermes` so the realistic runtime user is exercised | +| `s6-svscanctl` from hermes hits EACCES on the root-owned control FIFO | Medium | Medium | `02-reconcile-profiles` chowns `/run/service/.s6-svscan/{control,lock}` to hermes after stage1 creates them | +| Per-service `supervise/control` FIFO is root-owned by s6-supervise, blocking `s6-svc` from hermes | Known | Medium | Surfaced cleanly as `S6CommandError` (with rc + stderr) instead of raw `CalledProcessError`. Permission fix tracked as a follow-up (small SUID helper, polling chown loop in cont-init.d, or replace `s6-svc` with `down`-marker manipulation) | --- ## Decision Log -| # | Question | Decision | Blocks phase | -|---|---|---|---| -| OQ1 | Gate Phase 2 behind env var? | A — ship directly | Phase 2 | -| OQ2 | s6 root model | A — root `/init`, drop per-service | Phase 2 | -| OQ3 | Dashboard opt-in mechanism | A — always declared, run checks env | Phase 2 | -| OQ4 | Podman rootless | A — supported, fix reactively | Phase 2 | -| OQ5 | Service naming | `gateway-` | Phase 3 | -| OQ6 | — (retired; no subagent gateways in scope) | — | — | -| OQ7 | Resource limits | C — defer | Phase 3 | -| OQ8 | Log persistence | C — `$HERMES_HOME/logs/gateways//` | Phase 3 | -| OQ9 | TUI passthrough | A — trust docs, test is the hard gate | Phase 2 | - -**All questions resolved. No blockers remain.** - ---- - -## Estimated Timeline - -| Phase | Tasks | Engineering days | +| # | Question | Decision | |---|---|---| -| Phase 0 | 0.1–0.7 | 2.0 | -| Phase 0.5 | 0.5.1–0.5.2 | 0.5 | -| Phase 1 | 1.1–1.4 | 1.5 | -| Phase 2 | 2.1–2.5 | 3.0 | -| Phase 3 | 3.1–3.5 | 2.0 | -| Phase 4 | 4.0–4.4 | 2.0 | -| Phase 5 | 5.1–5.6 | 1.5 | -| **Total** | | **12.5 days** | +| OQ1 | Gate Phase 2 behind env var? | Ship directly (Hermes is pre-1.0; users can pin the previous image) | +| OQ2 | s6 root model | Root `/init`, drop per-service via `s6-setuidgid hermes` | +| OQ3 | Dashboard opt-in mechanism | Always declared as an s6 service; `03-dashboard-toggle` cont-init script writes a `down` marker when `HERMES_DASHBOARD` is unset so `s6-svstat` reports the slot's real state | +| OQ4 | Podman rootless | Supported, fix reactively | +| OQ5 | Service naming | `gateway-` (matches pre-existing `hermes-gateway-.service` systemd convention) | +| OQ6 | — (retired; no subagent gateways in scope) | — | +| OQ7 | Resource limits per profile gateway | Defer (no per-cgroup limits; rely on the container's overall limit) | +| OQ8 | Log persistence | `$HERMES_HOME/logs/gateways//`. The log path is sourced from runtime `$HERMES_HOME` via `with-contenv`, NOT Python-substituted at registration time | +| OQ9 | TUI passthrough | Trust the documented [s6-overlay#230](https://github.com/just-containers/s6-overlay/issues/230) Solution 1; harness includes a TTY passthrough hard-gate test | -Phase 0 is longer than the original estimate because the test harness it builds is load-bearing for the entire plan — it's what lets us sign off Phase 2 as "identical behavior." Phase 3 + 4 are shorter than the old plan's Phase 3 + 4 because we're not building a general transient-service API — just per-profile gateway registration. +**Post-merge additions from PR #30136 review:** + +- **Multi-arch tarballs:** `TARGETARCH` mapped to `x86_64` / `aarch64`; + per-arch tarball fetched via `curl` because `ADD` doesn't honor BuildKit + args. +- **SHA256 verification:** all three tarballs (noarch, symlinks, per-arch) + pinned via build ARGs and verified with `sha256sum -c` against a single + checksum file (avoids hadolint DL4006 piped-shell warning). +- **`gateway-default` slot:** always registered by the reconciler so + `hermes gateway start` (no `-p`) has somewhere to land. +- **Friendly lifecycle errors:** `GatewayNotRegisteredError` and + `S6CommandError` translate `CalledProcessError` into actionable CLI + messages. +- **Atomic publication in the reconciler:** mirrors + `register_profile_gateway`'s tmp+rename pattern. +- **`container-boot.log` rotation:** 256 KiB soft cap, rotated to `.1`. +- **`port` parameter retired:** allocator + kwarg were dead code through + the entire stack; `config.yaml` is the single source of truth. --- ## Verification Checklist -Before declaring the full plan complete: - -- [ ] Phase 0 harness passes against `main` (tini) (Phase 0) -- [ ] hadolint + shellcheck run green in CI (Phase 0.5) -- [ ] Phase 0 harness passes against the s6 image (Phase 2 — hard gate) -- [ ] `docker run -it --rm hermes-agent --tui` starts the Ink TUI with working keyboard input, cursor control, and resize (SIGWINCH) (Phase 2) -- [ ] Dashboard crashes are recovered by s6 within ~2s (Phase 2) -- [ ] `hermes profile create test` inside a container creates `/run/service/gateway-test/` (Phase 4) -- [ ] `hermes -p test gateway start` inside a container dispatches through s6 (verified by process tree: no double-fork) (Phase 4) -- [ ] `hermes -p test gateway stop` inside a container cleanly stops via s6 (Phase 4) -- [ ] `hermes profile delete test` inside a container removes `/run/service/gateway-test/` (Phase 4) -- [ ] Profile gateway logs persist at `$HERMES_HOME/logs/gateways/test/current` (Phase 4) -- [ ] `hermes status` inside the container shows `Manager: s6` (Phase 4) -- [ ] Full `scripts/run_tests.sh` passes (Phase 1–5) -- [ ] Full `scripts/run_tests.sh tests/docker/` passes when Docker available (Phase 0–5) -- [ ] No systemd/launchd host-side functions were modified (only wrapped) (Phase 1) -- [ ] `hermes gateway install/start/stop` on Linux host and macOS host behave identically to pre-change (Phase 1) +- [x] Test harness (`tests/docker/`) passes against the s6 image +- [x] hadolint + shellcheck run green in CI +- [x] `docker run -it --rm hermes-agent --tui` starts the Ink TUI with + working keyboard input, cursor control, and resize (SIGWINCH) +- [x] Dashboard crashes are recovered by s6 within ~2s +- [x] `hermes profile create test` inside a container creates + `/run/service/gateway-test/` +- [x] `hermes -p test gateway start` inside a container dispatches through s6 +- [x] `hermes -p test gateway stop` inside a container cleanly stops via s6 +- [x] `hermes profile delete test` inside a container removes + `/run/service/gateway-test/` +- [x] Profile gateway logs persist at + `$HERMES_HOME/logs/gateways/test/current` +- [x] `hermes status` inside the container shows `Manager: s6` +- [x] `hermes gateway start` (no `-p`) inside a container targets + `gateway-default` and runs against the root profile +- [x] `hermes gateway stop --all` / `... restart --all` iterate every + profile gateway under s6 instead of pkill-then-supervise-restart +- [x] `docker restart` survives per-profile gateway registrations via the + cont-init reconciler; running gateways come back up, stopped ones + stay down +- [x] Multi-arch image builds for both `linux/amd64` and `linux/arm64` +- [x] s6-overlay tarballs are SHA256-verified at build time +- [x] No systemd/launchd host-side functions were modified (only wrapped) +- [x] `hermes gateway install/start/stop` on Linux host and macOS host + behave identically to pre-change