mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-09 08:21:50 +00:00

docs(plans): trim s6-overlay plan to a post-implementation reference

PR #30136 review item O7: the plan doc was 3,191 lines — 5x the
size of any other plan in docs/plans/ and the largest reference
document in the repo. With the implementation shipped, most of
that content is either:

* The phase-by-phase TDD walkthrough (~2,800 lines): now canonical
  in the PR commit log (`git log a957ef083..a6f7171a5`).
* The v2/v3 re-validation preambles: artifacts of the planning
  process, no longer load-bearing.
* The full Open Questions deliberations with options A/B/C laid
  out: collapsed into the Decision Log.
* The Rollout Plan and Estimated Timeline: history.

Trim to ~430 lines covering what readers actually need going
forward: the goal, architecture, scope, key design decisions
(D1–D9), risk register (now including the three risks surfaced
in PR review — `_s6_running` detection, svscanctl FIFO perms,
supervise control FIFO perms), the decision log including the
post-merge additions, and the verification checklist (now all
boxes ticked).

Header now reads 'Status: shipped' and points at the PR. The git
history preserves the full v3 plan for anyone who needs it.

2026-05-24 18:05:33 -07:00


s6-overlay Supervision for Per-Profile Gateways in Docker — Implementation Plan

Status: shipped. Phases 0–5 landed via PR NousResearch/hermes-agent#30136 in May 2026. This document is preserved as a post-implementation reference for the architecture and the resolved design questions. The phase-by-phase TDD walkthrough (≈2,800 lines) and the v2/v3 re-validation preambles have been removed — the canonical implementation history is the PR commit log (git log --oneline a957ef083..a6f7171a5 -- 'docker/*' 'hermes_cli/service_manager.py' …). Open Questions are collapsed into a single Decision Log table; full deliberations live in PR review comments.

Goal: Replace tini with s6-overlay as PID 1 in the Hermes Docker image so that the main hermes process, the dashboard, and dynamically-created per-profile gateways all run as supervised services (auto-restart on crash, clean shutdown, signal forwarding, zombie reaping). Preserve every existing docker run … invocation pattern — including interactive TUI.

Architecture: s6-overlay's /init is the container ENTRYPOINT, running s6-svscan as PID 1. Main hermes and the dashboard are declared as static s6-rc services at image build time. Per-profile gateways — which users create after the image is built (hermes profile create coder → coder gateway start) — are registered dynamically by writing service directories under a scandir watched by s6-svscan. A ServiceManager protocol abstracts the install/start/stop/restart surface across the init systems we care about (systemd on Linux host, launchd on macOS host, Scheduled Tasks on native Windows host, s6 inside container) and adds a second tier for runtime service registration that only s6 implements.

Tech Stack:

s6-overlay v3.2.3.0 (noarch + per-arch tarballs ~15 MB). SHA256-pinned via build ARGs; multi-arch via TARGETARCH (amd64 → x86_64, arm64 → aarch64).
Debian 13.4 base image (unchanged).
hadolint for the Dockerfile + shellcheck for entrypoint scripts.
Python subprocess wrappers for s6-svc, s6-svstat, s6-svscanctl.
Existing systemd/launchd/windows surface in hermes_cli/gateway.py and hermes_cli/gateway_windows.py.
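
As a rough illustration of the multi-arch mapping above, the sketch below turns a BuildKit TARGETARCH value into the three tarball names a build would fetch. The helper name `s6_tarballs` is hypothetical, and the file names follow the upstream s6-overlay release naming as best understood here; treat both as assumptions rather than a copy of the Dockerfile logic.

```python
# Mapping taken from the Tech Stack note: amd64 -> x86_64, arm64 -> aarch64.
S6_ARCH_BY_TARGETARCH = {"amd64": "x86_64", "arm64": "aarch64"}


def s6_tarballs(targetarch: str) -> list:
    """Return the assumed tarball file names for one Docker TARGETARCH."""
    try:
        arch = S6_ARCH_BY_TARGETARCH[targetarch]
    except KeyError:
        # Fail loudly instead of silently fetching the wrong binaries.
        raise ValueError(f"unsupported TARGETARCH: {targetarch!r}") from None
    return [
        "s6-overlay-noarch.tar.xz",
        f"s6-overlay-{arch}.tar.xz",
        "s6-overlay-symlinks-noarch.tar.xz",
    ]
```

Rejecting unknown architectures up front keeps a typo in the build arguments from producing an image with a mismatched supervisor binary.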

Scope:

Container-only (host-side systemd/launchd/windows behavior is preserved, not modified).
s6-overlay only (no pure-Python fallback).
Architecture A (s6 owns PID 1; tini is removed).
Interactive TUI must keep working: docker run -it --rm nousresearch/hermes-agent:latest --tui.
Dynamic registration is limited to per-profile gateways — one service per profile, created when a profile is created, torn down when deleted. A gateway-default slot is always registered for the root HERMES_HOME profile so hermes gateway start (no -p) has somewhere to land.

Out of scope:

Host-side dynamic supervision (systemd-run / launchd transient plists) — not needed.
Pure-Python supervisor fallback — not needed.
Arbitrary user-defined supervised processes inside the container — only profile gateways.
Migration of existing per-profile systemd unit generation to s6 on the host side.
Non-Docker container runtimes (Podman rootless validated reactively).
UX polish around in-container profile lifecycle (e.g. a nice status view of all supervised profile gateways) — deferred to follow-up.

Background From The Codebase

Note on line numbers: This section refers to functions and structures by name only. Use grep -n 'def <name>' <file> to locate anything below if you need the current line.

Pre-s6 container init (what we replaced)

The original Dockerfile declared ENTRYPOINT [ "/usr/bin/tini", "-g", "--", "/opt/hermes/docker/entrypoint.sh" ]. tini was PID 1: it reaped zombies and forwarded SIGTERM to the process group. The old docker/entrypoint.sh:

Dropped privileges from root to the hermes UID via gosu.
Copied .env.example, cli-config.yaml.example, SOUL.md into $HERMES_HOME if missing.
Synced bundled skills via tools/skills_sync.py.
Optionally backgrounded hermes dashboard in a subshell when HERMES_DASHBOARD=1; that process was not supervised and never restarted.
Ran exec hermes "$@", leaving hermes as tini's sole direct child.

Known limitations: dashboard crash → stays dead; dashboard fails at startup → silent; gateway crash → dashboard dies too. The May 4, 2026 decision was "leave as is" because nothing in the container needed supervision then. Adding per-profile gateway supervision changed that.

ServiceManager surface (what we wrapped, not refactored)

All init-system logic lives in hermes_cli/gateway.py (~5,400 LOC at re-validation). The systemd/launchd code is ~1,500 lines of that, plus a separate hermes_cli/gateway_windows.py (~690 LOC) for Windows Scheduled Tasks.

Layer	Systemd functions	Launchd functions	Windows functions
Detection	`supports_systemd_services()`, `_systemd_operational()`, `_wsl_systemd_operational()`, `_container_systemd_operational()`	`is_macos()`	`is_windows()`, `gateway_windows.is_installed()`
Paths	`get_systemd_unit_path(system)`, `get_service_name()`	`get_launchd_plist_path()`, `get_launchd_label()`	`gateway_windows.get_task_name()`, `get_task_script_path()`, `get_startup_entry_path()`
Install/lifecycle	`systemd_install(force, system, run_as_user)`, `systemd_uninstall(system)`, `systemd_start/stop/restart(system)`	`launchd_install(force)`, `launchd_uninstall/start/stop/restart`	`gateway_windows.install/uninstall/start/stop/restart`
Probes	`_probe_systemd_service_running(system)`, `_read_systemd_unit_properties(system)`, `_wait_for_systemd_service_restart`, `_recover_pending_systemd_restart`	`_probe_launchd_service_running()`	`gateway_windows.is_task_registered()`, `_pid_exists` helper
D-Bus plumbing	`_ensure_user_systemd_env`, `_user_systemd_socket_ready`, `_user_systemd_private_socket_path`, `get_systemd_linger_status`	—	—
Unit/plist generation	`generate_systemd_unit(system, run_as_user)`, `systemd_unit_is_current`, `refresh_systemd_unit_if_needed`	plist templating in `launchd_install`	`_build_gateway_cmd_script`, `_build_startup_launcher`, `_write_task_script`

Container-relevant callers outside gateway.py:

hermes_cli/status.py — gained an s6 branch for in-container runs.
hermes_cli/profiles.py — create_profile / delete_profile register and unregister with s6 inside the container (no-op on host).
hermes_cli/doctor.py — _check_gateway_service_linger skips on s6, and a new "Service Supervisor" section reports main-hermes / dashboard / profile-gateway counts via the ServiceManager.
hermes_cli/gateway.py::gateway_command — the elif is_container(): rejection arms that refused gateway lifecycle operations were removed; the _dispatch_via_service_manager_if_s6 helper intercepts start/stop/restart and routes them through s6.

Per-profile gateway spawning

hermes gateway start, coder gateway start (profile alias), and hermes -p <profile> gateway start all spawn a gateway process scoped to a given profile. See Profiles: Running Gateways. On host, lifecycle is managed via per-profile systemd units (hermes-gateway-<profile>.service); inside the container, an s6 service at /run/service/gateway-<name>/ is registered when the profile is created and torn down when it's deleted.

Persistence across container restart: /run/service/ is tmpfs — service registrations are wiped when the container restarts. Profile directories at /opt/data/profiles/<name>/ live on the persistent VOLUME, and each one records its gateway's last state in gateway_state.json. /etc/cont-init.d/02-reconcile-profiles walks the persistent profiles on every container boot, recreates the s6 service slots via hermes_cli/container_boot.py, and auto-starts those whose last recorded state was running. Profiles whose last state was stopped, startup_failed, starting, or absent get their slot recreated in the down state and wait for explicit user action. docker restart is therefore invisible to a user with running profile gateways: they come back up; stopped ones stay stopped.
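
The boot-time decision described above can be sketched as a pure function. This is a minimal illustration, not the code in hermes_cli/container_boot.py: the function names are hypothetical, the `"state"` key inside gateway_state.json is an assumption about the file's schema, and the gateway-default slot is left out.

```python
import json
from pathlib import Path
from typing import Optional


def last_gateway_state(profile_dir: Path) -> Optional[str]:
    """Read the last recorded gateway state, or None if absent or unreadable."""
    try:
        data = json.loads((profile_dir / "gateway_state.json").read_text())
    except (OSError, ValueError):  # missing file, bad JSON, bad encoding
        return None
    # ASSUMPTION: the state is stored under a top-level "state" key.
    state = data.get("state") if isinstance(data, dict) else None
    return state if isinstance(state, str) else None


def plan_reconcile(profiles_root: Path) -> dict:
    """Map each service slot name to True (auto-start) or False (leave down)."""
    plan = {}
    if not profiles_root.is_dir():
        return plan
    for profile_dir in sorted(p for p in profiles_root.iterdir() if p.is_dir()):
        # Only "running" comes back up; stopped, startup_failed, starting,
        # and absent all get a slot that waits for explicit user action.
        plan[f"gateway-{profile_dir.name}"] = (
            last_gateway_state(profile_dir) == "running"
        )
    return plan
```

Every profile gets a slot either way; the boolean only decides whether the slot is created up or down.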

s6-overlay constraints

Root/non-root model: /init runs as root to set up the supervision tree, install signal handlers, and run the stage2 hook that does usermod/chown. Each supervised service drops to UID 10000 via s6-setuidgid hermes in its run script. The per-service s6-supervise monitor stays root so it can signal its child regardless of UID. Net effect: hermes and all its subprocesses run as UID 10000 exactly as before; only the supervision tree itself runs as root.
v3.2.3.0 has only limited support for running /init itself as non-root: some tools (fix-attrs, logutil-service) assume root. We don't hit this because /init runs as root.
Scandir hard cap: services_max defaults to 1000 and can be raised to 160,000.
/command/with-contenv sources /run/s6/container_environment/* into service env — convenient for passing HERMES_HOME etc.
s6 signal semantics: service crash triggers s6-supervise restart after 1s; override with a finish script.
Zombie reaping: PID 1 (s6-svscan) reaps all zombies non-blockingly on SIGCHLD. Any subagent subprocess spawned by the main hermes process is reaped automatically.

Key Design Decisions

D1. s6-overlay replaces tini entirely

Container ENTRYPOINT is /init, PID 1 is s6-svscan. The main hermes process, the dashboard, and every per-profile gateway run as supervised services. This is a single breaking change to the container contract.

D2. Main hermes is an s6 service with container-exit semantics

The contract "container exits when hermes exits" is preserved via a service finish script that writes to /run/s6-linux-init-container-results/exitcode and calls /run/s6/basedir/bin/halt. All five supported invocations work:

`docker run <image> …`	Behavior
(no args)	`hermes` with no args, container exits when hermes exits
`chat -q "..."`	`hermes chat -q "..."`, container exits with hermes exit code
`sleep infinity`	`sleep infinity` directly (long-lived sandbox mode)
`bash`	interactive `bash` directly
`docker run -it … --tui`	interactive Ink TUI with real TTY — see D9

docker/main-wrapper.sh detects whether $1 is an executable on PATH and routes either to "run this as a one-shot main service" or "wrap with hermes".
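
The routing rule can be expressed in a few lines. The real wrapper is a shell script; the Python sketch below only mirrors the decision, and `resolve_main_command` and its `path` parameter are hypothetical names introduced for illustration.

```python
import shutil
from typing import List, Optional


def resolve_main_command(args: List[str], path: Optional[str] = None) -> List[str]:
    """Decide what the main service should exec for a given docker CMD.

    If the first argument names an executable on PATH (for example `bash` or
    `sleep`), run the arguments as given; otherwise treat them as hermes
    arguments and prepend `hermes`.
    """
    if args and shutil.which(args[0], path=path) is not None:
        return list(args)
    return ["hermes", *args]
```

With no arguments the result is plain `hermes`, which matches the first row of the table above.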

D3. Static services at build time; dynamic (per-profile) services at runtime

s6 offers two mechanisms:

s6-rc (declarative, compile-then-swap): used for main hermes and the dashboard — they're known at image build time.
scandir (drop a directory + s6-svscanctl -a): used for per-profile gateways — profiles are user-created after the image is built.

Per-profile gateway service dirs live at /run/service/gateway-<profile>/ (tmpfs, hermes-writable). s6-svscan picks them up on rescan.
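
A minimal sketch of the scandir mechanism follows, assuming a staging-directory-then-rename approach so that s6-svscan never sees a half-written service. The function name, the dot-prefixed staging name, and the injectable `rescan` callback are illustrative assumptions; only `s6-svscanctl -a` and the `run`/`down` file conventions come from s6 itself.

```python
import os
import shutil
import subprocess
from pathlib import Path
from typing import Callable


def _rescan(scandir: Path) -> None:
    # Ask s6-svscan to pick up new service directories.
    subprocess.run(["s6-svscanctl", "-a", str(scandir)], check=True)


def publish_service_dir(
    scandir: Path,
    name: str,
    run_script: str,
    *,
    start: bool = False,
    rescan: Callable[[Path], None] = _rescan,
) -> Path:
    """Create <scandir>/<name> atomically, then trigger a rescan."""
    final = scandir / name
    if final.exists():
        raise FileExistsError(f"service already registered: {name}")
    # Dot-prefixed entries are skipped by s6-svscan, so staging is invisible.
    staging = scandir / f".{name}.tmp"
    shutil.rmtree(staging, ignore_errors=True)
    staging.mkdir(parents=True)
    run = staging / "run"
    run.write_text(run_script)
    run.chmod(0o755)
    if not start:
        # A `down` file tells s6-supervise not to start the service yet.
        (staging / "down").touch()
    os.rename(staging, final)  # atomic within one filesystem
    rescan(scandir)
    return final
```

Passing `rescan` as a parameter keeps the sketch usable outside a container, where `s6-svscanctl` is not installed.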

D4. ServiceManager protocol with two methods for runtime registration

Host paths (systemd, launchd, Windows Scheduled Tasks) need only install/start/stop/restart of pre-declared services. Inside the container, we additionally need to register services at runtime when a profile is created. The protocol exposes this directly:

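from typing import Literal, Protocol

# Alias spelled out from the manager kinds named in D6.
ServiceManagerKind = Literal["systemd", "launchd", "windows", "s6", "none"]

class ServiceManager(Protocol):
    kind: ServiceManagerKind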

    # Lifecycle of an already-declared service
    def start(self, name: str) -> None: ...
    def stop(self, name: str) -> None: ...
    def restart(self, name: str) -> None: ...
    def is_running(self, name: str) -> bool: ...

    # Runtime registration (container-only; hosts raise NotImplementedError)
    def supports_runtime_registration(self) -> bool: ...
    def register_profile_gateway(
        self, profile: str, *,
        extra_env: dict[str, str] | None = None,
    ) -> None: ...
    def unregister_profile_gateway(self, profile: str) -> None: ...
    def list_profile_gateways(self) -> list[str]: ...

Systemd, launchd, and Windows backends raise NotImplementedError on the registration methods. Only the s6 backend implements them. Callers check supports_runtime_registration() before calling.

The scope is intentionally narrow: it's specifically "register/unregister a profile gateway," not a general-purpose process-management API.
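
The check-before-call convention could look like the sketch below. Both manager classes are stand-ins written for this example, and `register_if_supported` is a hypothetical helper; none of them is the code in hermes_cli/service_manager.py.

```python
from typing import Dict, List, Optional


class FakeHostManager:
    """Stand-in for a systemd/launchd/Windows backend."""

    def supports_runtime_registration(self) -> bool:
        return False

    def register_profile_gateway(
        self, profile: str, *, extra_env: Optional[Dict[str, str]] = None
    ) -> None:
        raise NotImplementedError("runtime registration is container-only")


class FakeS6Manager:
    """Stand-in for the s6 backend; records registrations in memory."""

    def __init__(self) -> None:
        self.registered: List[str] = []

    def supports_runtime_registration(self) -> bool:
        return True

    def register_profile_gateway(
        self, profile: str, *, extra_env: Optional[Dict[str, str]] = None
    ) -> None:
        self.registered.append(profile)


def register_if_supported(mgr, profile: str) -> bool:
    """Register the profile gateway only when the backend supports it."""
    if not mgr.supports_runtime_registration():
        return False  # host backends: nothing to do
    mgr.register_profile_gateway(profile)
    return True
```

Guarding on the capability flag means host callers never have to catch NotImplementedError in the normal path.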

D5. Per-profile gateway service spec is fixed, not user-provided

Every profile gateway has the same command shape (hermes -p <profile> gateway run, or hermes gateway run for the default profile). The s6 backend generates the run script from a fixed template given the profile name — no arbitrary command list. This keeps the API surface tight and prevents callers from accidentally registering non-gateway services.

Port selection is governed by the profile's config.yaml ([gateway] port = …) — the single source of truth. (The original plan proposed a Python-side SHA-256 port allocator with a 600-port range; it was retired during PR review because it was dead code through the entire stack.)
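
To make the fixed-template idea concrete, here is a hedged sketch of rendering a run script from nothing but the profile name. The function name, the profile-name pattern, and the exact script layout are assumptions; the shipped template likely carries more (logging, environment), so read this as the shape of the idea only.

```python
import re
import shlex
from typing import Optional

# ASSUMPTION: profile names are restricted to a conservative character set.
_PROFILE_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def render_run_script(profile: Optional[str]) -> str:
    """Render an s6 run script; None means the default (root) profile."""
    if profile is None:
        cmd = ["hermes", "gateway", "run"]
    else:
        if not _PROFILE_RE.fullmatch(profile):
            raise ValueError(f"invalid profile name: {profile!r}")
        cmd = ["hermes", "-p", profile, "gateway", "run"]
    # with-contenv loads the container environment (HERMES_HOME etc.);
    # s6-setuidgid drops from root to the hermes user before exec.
    return (
        "#!/command/with-contenv sh\n"
        f"exec s6-setuidgid hermes {shlex.join(cmd)}\n"
    )
```

Because the caller supplies only a name, there is no path by which an arbitrary command can end up in a supervised run script.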

D6. Add detect_service_manager() alongside supports_systemd_services()

supports_systemd_services() stays as-is (host code paths unchanged). A new detect_service_manager() -> Literal["systemd", "launchd", "windows", "s6", "none"] composes existing detection functions (is_macos(), is_windows(), supports_systemd_services(), is_container() + _s6_running()) and adds an s6 branch for container detection. Host call sites continue to use the existing functions; container-only code (the profile hooks) uses the new one.

_s6_running() probes /proc/1/comm (world-readable) and /run/s6/basedir. The earlier /proc/1/exe probe was root-only readable and silently failed for the unprivileged hermes user (UID 10000), making the entire runtime-registration path inert in production — caught in PR review.
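
The detection logic might be sketched as follows. The two probe paths come from the text above; everything else is an assumption made for illustration, including the expected `s6-svscan` process name, the requirement that both probes succeed, and the precedence order in `detect_service_manager`.

```python
from pathlib import Path


def s6_running(
    proc_comm: Path = Path("/proc/1/comm"),
    basedir: Path = Path("/run/s6/basedir"),
) -> bool:
    """Return True when PID 1 looks like s6-svscan and the s6 basedir exists."""
    try:
        pid1_name = proc_comm.read_text().strip()
    except OSError:
        return False  # no procfs, or the file is unreadable
    return pid1_name == "s6-svscan" and basedir.exists()


def detect_service_manager(
    *, macos: bool, windows: bool, systemd: bool, container: bool, s6: bool
) -> str:
    """Compose pre-computed detection results into one manager kind."""
    if container and s6:
        return "s6"  # ASSUMPTION: the container check wins over host checks
    if macos:
        return "launchd"
    if windows:
        return "windows"
    if systemd:
        return "systemd"
    return "none"
```

Taking the probe paths as parameters makes the unprivileged-user case easy to exercise in a unit test without a real container.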

D7. Wrap existing systemd/launchd/windows functions, don't rewrite them

SystemdServiceManager / LaunchdServiceManager / WindowsServiceManager are thin adapters over the existing systemd_* / launchd_* module-level functions in hermes_cli/gateway.py and the gateway_windows.install/uninstall/start/stop/restart/is_installed functions in hermes_cli/gateway_windows.py. We get the abstraction without rewriting ~2,200 LOC of working code.

D8. Profile create/delete hooks register/unregister the s6 service

When hermes profile create <name> runs inside the container, the profile-creation code path calls ServiceManager.register_profile_gateway(<name>) if supports_runtime_registration() is True. When hermes profile delete <name> runs, it calls unregister_profile_gateway(<name>). On host, both calls are no-ops (registration not supported; existing systemd unit generation continues to handle install/uninstall).

Existing per-profile hermes -p <profile> gateway start/stop/restart CLI commands continue to work — in the container they dispatch to ServiceManager.start/stop/restart("gateway-<profile>"), which translates to s6-svc -u/-d/-t on the service dir.

hermes gateway start (no -p) targets a special gateway-default slot that's always registered by the cont-init reconciler. Its run script omits the -p flag and runs against the root $HERMES_HOME profile.

--all lifecycle (hermes gateway stop --all, ... restart --all) iterates mgr.list_profile_gateways() through s6 so that s6's want-up/want-down state flips correctly. Without this, --all fell through to pkill, after which s6-supervise restarted every gateway: the net effect was a kick instead of a stop.
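
The dispatch described in this section reduces to a small lookup. In the sketch below the flag mapping and the /run/service/gateway-<profile> layout come from the text; the function names are hypothetical, and treating the list returned by list_profile_gateways() as profile names (rather than service names) is an assumption.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# start/stop/restart map onto s6-svc -u/-d/-t, as described above.
S6_SVC_FLAG = {"start": "-u", "stop": "-d", "restart": "-t"}


def s6_svc_argv(
    action: str,
    service: str,
    scandir: PurePosixPath = PurePosixPath("/run/service"),
) -> List[str]:
    """Build the s6-svc command line for one lifecycle action."""
    try:
        flag = S6_SVC_FLAG[action]
    except KeyError:
        raise ValueError(f"unknown action: {action!r}") from None
    return ["s6-svc", flag, str(scandir / service)]


def all_profiles_argv(action: str, profiles: Iterable[str]) -> List[List[str]]:
    """Build one s6-svc command per profile gateway for `--all`."""
    return [s6_svc_argv(action, f"gateway-{name}") for name in profiles]
```

Sending `-d` per service is what records the want-down state, which is why `stop --all` built this way stays stopped.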

D9. Interactive TUI bypasses s6 service-mode and runs as CMD for TTY passthrough

docker run -it --rm <image> --tui needs a real TTY connected to container stdin/stdout for Ink raw-mode keyboard input, cursor control, and SIGWINCH. Running the TUI as a normal s6 service fails because s6-supervise disconnects service stdio from the container TTY (documented: s6-overlay#230).

The pattern: s6-overlay's /init execs a CMD as the container's "main program" after the supervision tree is up. The CMD inherits stdin/stdout/stderr from /init — which in -it mode is the container TTY. The stage2 hook detects the TUI case and short-circuits the main-hermes service so the hermes CMD becomes that main program.

# In docker/stage2-hook.sh
_is_tui_invocation() {
    for arg in "$@"; do
        case "$arg" in --tui|-T) return 0 ;; esac
    done
    case "${HERMES_TUI:-}" in 1|true|TRUE|yes) return 0 ;; esac
    if [ -t 0 ] && [ $# -eq 0 ]; then return 0; fi
    return 1
}

And in docker/s6-rc.d/main-hermes/run:

if [ -f /var/run/s6/container_environment/HERMES_TUI_MODE ]; then
    exec sleep infinity   # s6-overlay will exec CMD as the TTY-connected main
fi
exec s6-setuidgid hermes hermes ${HERMES_ARGS:-}

In TUI mode main hermes is effectively unsupervised (same as the pre-s6 behavior with tini — acceptable because the user is interactively present). Dashboard and profile gateways still get full s6 supervision via their separate services.

The integration test test_tty_passthrough_to_container uses tput cols and COLUMNS=123 as the probe.

Risk Register

Risk	Likelihood	Impact	Mitigation
Phase 2 breaks a downstream user's Dockerfile that `FROM`s ours	Medium	Medium	Release notes call out ENTRYPOINT change; the test harness (`tests/docker/`) gives high confidence in behavior parity
TUI TTY passthrough fails on some Docker versions	Low	High	Harness includes `test_tty_passthrough_to_container` as a hard gate; fallback plan = s6-fdholder (s6-overlay#230 Solution 2)
s6-overlay non-root quirks (logutil-service, fix-attrs) bite us	Low	Low	Supervisor runs as root, services drop — sidesteps these issues
Podman rootless UID mapping confuses s6	Medium	Low	Documented as supported, fix reactively; a Podman + Docker environment is stood up for validation
Test harness is flaky (docker daemon issues, timing)	Medium	Low	Generous timeouts; skip when docker unavailable; polling helpers replace fixed sleeps in `test_container_restart.py`
Profile gateway crash loop masks a real config error	Low	Medium	s6 `finish` script `max_restarts` cap (planned follow-up); operators see crash-looping logs in `$HERMES_HOME/logs/gateways/<profile>/`
Linting the Dockerfile and entrypoint scripts (hadolint/shellcheck) reveals latent bugs	Low	Low	CI lint jobs catch them; fix or document ignore with rationale
Stale `gateway.pid` from a dead container collides with an unrelated live PID in the restarted container	Low	Medium	Cont-init reconciliation removes `gateway.pid` and `processes.json` from every profile dir on boot, before any new gateway starts
`docker restart` silently loses per-profile gateway registrations (tmpfs scandir wiped)	High (without mitigation)	High	Cont-init reconciliation re-registers from persistent `$HERMES_HOME/profiles/` and auto-starts those last seen `running`; outcome recorded to `$HERMES_HOME/logs/container-boot.log` (size-bounded, rotates to `.1` at 256 KiB)
A `running` gateway that's actually broken auto-restarts into a crash loop after every container restart	Low	Medium	s6 `finish` script `max_restarts` cap (planned); follow-up: `hermes doctor` alerts when N consecutive container restarts ended in `startup_failed`
`_s6_running()` detection works as root but silently fails for unprivileged hermes user, making runtime-registration path inert	High (without mitigation)	High	Caught in PR review. Detection now probes `/proc/1/comm` (world-readable) + `/run/s6/basedir`. Docker integration tests refactored to `docker exec -u hermes` so the realistic runtime user is exercised
`s6-svscanctl` from hermes hits EACCES on the root-owned control FIFO	Medium	Medium	`02-reconcile-profiles` chowns `/run/service/.s6-svscan/{control,lock}` to hermes after stage1 creates them
Per-service `supervise/control` FIFO is root-owned by s6-supervise, blocking `s6-svc` from hermes	Known	Medium	Surfaced cleanly as `S6CommandError` (with rc + stderr) instead of raw `CalledProcessError`. Permission fix tracked as a follow-up (small SUID helper, polling chown loop in cont-init.d, or replace `s6-svc` with `down`-marker manipulation)
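
For the last row, one plausible shape for translating a raw CalledProcessError into an error that carries the return code and stderr is sketched below. The class body and the `run_s6` helper are written for this example and are assumptions; the shipped S6CommandError may differ in fields and message format.

```python
import subprocess
from typing import List


class S6CommandError(RuntimeError):
    """Illustrative error type carrying the failed command's rc and stderr."""

    def __init__(self, argv: List[str], returncode: int, stderr: str) -> None:
        self.argv = list(argv)
        self.returncode = returncode
        self.stderr = stderr
        super().__init__(
            f"{' '.join(argv)} failed with exit code {returncode}: "
            f"{stderr or '(no stderr)'}"
        )


def run_s6(argv: List[str]) -> None:
    """Run an s6 tool and convert a non-zero exit into S6CommandError."""
    try:
        subprocess.run(argv, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        raise S6CommandError(
            argv, exc.returncode, (exc.stderr or "").strip()
        ) from exc
```

Keeping the original exception as the cause preserves the traceback for debugging while the CLI prints only the short message.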

Decision Log

#	Question	Decision
OQ1	Gate Phase 2 behind env var?	Ship directly (Hermes is pre-1.0; users can pin the previous image)
OQ2	s6 root model	Root `/init`, drop per-service via `s6-setuidgid hermes`
OQ3	Dashboard opt-in mechanism	Always declared as an s6 service; `03-dashboard-toggle` cont-init script writes a `down` marker when `HERMES_DASHBOARD` is unset so `s6-svstat` reports the slot's real state
OQ4	Podman rootless	Supported, fix reactively
OQ5	Service naming	`gateway-<profile>` (matches pre-existing `hermes-gateway-<profile>.service` systemd convention)
OQ6	— (retired; no subagent gateways in scope)	—
OQ7	Resource limits per profile gateway	Defer (no per-cgroup limits; rely on the container's overall limit)
OQ8	Log persistence	`$HERMES_HOME/logs/gateways/<profile>/`. The log path is sourced from runtime `$HERMES_HOME` via `with-contenv`, NOT Python-substituted at registration time
OQ9	TUI passthrough	Trust the documented s6-overlay#230 Solution 1; harness includes a TTY passthrough hard-gate test

Post-merge additions from PR #30136 review:

Multi-arch tarballs: TARGETARCH mapped to x86_64 / aarch64; per-arch tarball fetched via curl because ADD doesn't honor BuildKit args.
SHA256 verification: all three tarballs (noarch, symlinks, per-arch) pinned via build ARGs and verified with sha256sum -c against a single checksum file (avoids hadolint DL4006 piped-shell warning).
gateway-default slot: always registered by the reconciler so hermes gateway start (no -p) has somewhere to land.
Friendly lifecycle errors: GatewayNotRegisteredError and S6CommandError translate CalledProcessError into actionable CLI messages.
Atomic publication in the reconciler: mirrors register_profile_gateway's tmp+rename pattern.
container-boot.log rotation: 256 KiB soft cap, rotated to .1.
port parameter retired: allocator + kwarg were dead code through the entire stack; config.yaml is the single source of truth.
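
The size-bounded rotation noted for container-boot.log can be sketched in a few lines. The 256 KiB cap and the `.1` suffix come from the list above; the function name and the check-before-append ordering are assumptions.

```python
import os
from pathlib import Path

MAX_BYTES = 256 * 1024  # soft cap from the list above


def append_boot_log(log_path: Path, line: str, max_bytes: int = MAX_BYTES) -> None:
    """Append one line, rotating the file to `<name>.1` once it reaches the cap."""
    try:
        too_big = log_path.stat().st_size >= max_bytes
    except FileNotFoundError:
        too_big = False
    if too_big:
        # os.replace overwrites any previous .1, so at most two files exist.
        os.replace(log_path, log_path.with_name(log_path.name + ".1"))
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line.rstrip("\n") + "\n")
```

The cap is soft because the check happens before the write: the file can exceed it by one line before the next call rotates it.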

Verification Checklist

Test harness (tests/docker/) passes against the s6 image
hadolint + shellcheck run green in CI
docker run -it --rm hermes-agent --tui starts the Ink TUI with working keyboard input, cursor control, and resize (SIGWINCH)
Dashboard crashes are recovered by s6 within ~2s
hermes profile create test inside a container creates /run/service/gateway-test/
hermes -p test gateway start inside a container dispatches through s6
hermes -p test gateway stop inside a container cleanly stops via s6
hermes profile delete test inside a container removes /run/service/gateway-test/
Profile gateway logs persist at $HERMES_HOME/logs/gateways/test/current
hermes status inside the container shows Manager: s6
hermes gateway start (no -p) inside a container targets gateway-default and runs against the root profile
hermes gateway stop --all / ... restart --all iterate every profile gateway under s6 instead of pkill-then-supervise-restart
Per-profile gateway registrations survive docker restart via the cont-init reconciler; running gateways come back up, stopped ones stay down
Multi-arch image builds for both linux/amd64 and linux/arm64
s6-overlay tarballs are SHA256-verified at build time
No systemd/launchd host-side functions were modified (only wrapped)
hermes gateway install/start/stop on Linux host and macOS host behave identically to pre-change
