Two parallel public-path allowlists drifted: _PUBLIC_API_PATHS in
hermes_cli/web_server.py (legacy _SESSION_TOKEN middleware) and
_GATE_PUBLIC_PREFIXES in hermes_cli/dashboard_auth/middleware.py
(OAuth gate). The legacy list included /api/status (documented as a
non-sensitive read-only liveness target); the OAuth gate's list did not.
Effect: every wildcard-subdomain agent surfaced as STARTING/down to the
portal even though the dashboard was serving correctly. Nous account
service (src/server/agents/fly-provider.ts
getInstanceRuntimeStatus) fetches ``/api/status`` without a cookie
as its sole liveness probe; the OAuth gate's 401 looked identical to
'agent dead' on the portal side.
Fix: lift the allowlist into hermes_cli/dashboard_auth/public_paths.py
and have both middlewares import it. _path_is_public now consults
the shared frozenset first, then falls back to the gate's
auth-bootstrap/static prefix list. Future additions to the public list
hit both gates automatically.
Endpoint inventory (verified safe to remain public):
* /api/status — version, gateway state, active session count,
auth-gate shape. Portal liveness probe target.
* /api/config/defaults — config-defaults feed for the SPA's Config page
* /api/config/schema — config schema for the SPA's Config page
* /api/model/info — model catalogue metadata (context windows)
* /api/dashboard/themes — theme manifests for the skin engine
* /api/dashboard/plugins — plugin manifests for the dashboard
No user data, no session content, no secrets. Same shape an external
monitoring agent would hit on /healthz.
Tests:
* New: test_gated_status_is_public (regression guard with the NAS
fly-provider.ts liveness-probe rationale spelled out in the docstring)
* New: test_other_public_api_paths_are_public_under_gate (parametrised
over the rest of PUBLIC_API_PATHS — proves 401 / 302-to-login is
never the response)
* New: docker integration check #3 in
test_dashboard_oauth_gate_engaged_by_default — /api/status
remains 200 under the gate AND reports auth_required=True so the
portal can distinguish modes
* Updated: test_full_login_round_trip_unlocks_gated_api now probes
/api/sessions instead of /api/status (status is public, so it
can no longer distinguish 'logged in' from 'gate accidentally
disabled')
* Updated: TestApi401Envelope (the no-cookie / invalid-cookie /
dead-cookie tests) probes /api/sessions for the same reason
* Updated: docker integration check #2 in
test_dashboard_oauth_gate_engaged_by_default probes
/api/sessions to prove the gate is intercepting
* Removed: dead _login() helper in
test_dashboard_auth_status_endpoint.py (no longer needed since
/api/status is reachable cold)
Companion to docs/handover/hermes-agent-dashboard-s6-insecure-fix.md
(the --insecure flag fix that shipped earlier).
The Docker integration test job started failing on main after
fb5125362 ("docker: opt in to dashboard --insecure via env var").
Two distinct failures, both fallout from that change being more
behaviour-changing than the existing test harness anticipated.
Failure 1 — test_dashboard_port_override (silent regression in an
already-existing test)
The test starts the container with just HERMES_DASHBOARD=1, defaults
to host=0.0.0.0, no HERMES_DASHBOARD_OAUTH_CLIENT_ID, no
HERMES_DASHBOARD_INSECURE. Pre-fix that combination got --insecure
auto-injected by the s6 run script (anything non-loopback was
implicitly insecure), so the OAuth gate stayed off and start_server
bound the port. Post-fix the gate engages, no provider is
registered, and start_server raises SystemExit before binding —
under s6 the dashboard goes into a restart loop and the test's
/proc/net/tcp poll finds nothing.
Same silent regression was masking three sibling tests
(test_dashboard_slot_reports_up_when_enabled, test_dashboard_opt_in_starts,
test_dashboard_restarts_after_crash) — they all only sample pgrep
or s6-svstat and so caught the supervised process mid-restart
loop, appearing to pass while the dashboard was actually never
reaching a healthy state.
Fix: pin HERMES_DASHBOARD_INSECURE=1 on every test that enables
the dashboard but doesn't itself exercise the auth gate. Each
pinned site carries an inline comment pointing back to
test_dashboard_slot_reports_up_when_enabled for the full
rationale.
Failure 2 — test_dashboard_oauth_gate_engages_on_non_loopback_bind
(bug in the test I added in fb5125362)
The probe used urllib.request.urlopen() against /api/status. Under
the now-engaged OAuth gate /api/status no longer answers
unauthenticated callers (the gate middleware runs upstream of the
legacy _SESSION_TOKEN allowlist and 401s anything without a valid
session cookie). urlopen() raises HTTPError on the 401, the wrapper
treated that as "not ready yet", and the poll loop hit
timeout.
Fix: split the probe into a generic _http_probe() helper that
returns (status_code, body) for any HTTP response — including 401,
which IS the gate-engaged success signal. The helper feeds a
multi-line Python program over stdin via a POSIX heredoc so the
try/except branch reads naturally; far less fragile than the
earlier semicolon-laden -c one-liner.
The OAuth-gate test now verifies two independent observable
consequences of the gate being on:
1. GET /api/auth/providers (publicly reachable through the gate
so the login page can bootstrap) returns 200 with `nous` in
the provider list — proves the bundled provider registered.
2. GET /api/status returns 401 — proves the OAuth gate runs
upstream of the legacy public-paths allowlist and is
actively intercepting unauthenticated callers.
The insecure-opt-out test still hits /api/status, but now
asserts status_code == 200 first (proves the gate is bypassed)
before parsing the JSON for auth_required: false (proves the
gate-state flag is also correctly off).
Verified locally end-to-end against a fresh image build on a
real Docker daemon: all 41 tests under tests/docker/ pass in
2m38s, including the two formerly-failing dashboard tests and
the three sibling tests that were passing by accident.
The s6 dashboard run script flipped `--insecure` on whenever
`HERMES_DASHBOARD_HOST` was anything other than 127.0.0.1 / localhost.
That comment ("the dashboard refuses otherwise") predates the OAuth
auth gate: back when it was written, `start_server` would SystemExit
on any non-loopback bind, so the run script's `--insecure` was the
only way to make in-container deployments work at all.
The gate has since been replaced by `should_require_auth(host,
allow_public)`, which engages the OAuth flow when a
`DashboardAuthProvider` is registered (the bundled `dashboard_auth/nous`
provider auto-registers on `HERMES_DASHBOARD_OAUTH_CLIENT_ID`) and
fails closed with a specific operator-facing error when none is. The
host-derived `--insecure` ran upstream of all that and silently
disabled the gate on every container-deployed dashboard.
Most visible under the portal's wildcard-subdomain rollout: every Fly
machine binds 0.0.0.0 so the edge can reach Flycast, every machine
boots with the correct `HERMES_DASHBOARD_OAUTH_CLIENT_ID`, the nous
provider registers — and `/api/status` still returns
`{"auth_required": false, "auth_providers": ["nous"]}` because the
run script disabled the gate before `start_server` ever saw the
request. The dashboard SPA was served to anyone, no `/login` redirect,
no OAuth challenge.
Fix: derive `--insecure` from an explicit opt-in env var,
`HERMES_DASHBOARD_INSECURE` (truthy values matching the rest of the
s6 boolean envs: 1, true, TRUE, True, yes, YES, Yes). Operators on
trusted LANs behind a reverse proxy without the OAuth contract
(the existing `docker-compose.windows.yml` use case) opt in
explicitly; portal-managed agent deployments leave it unset and let
the gate engage.
`docker-compose.windows.yml` already passes `--insecure` on the
`command:` array directly (line 38), so it doesn't depend on the s6
auto-injection. No compose-file change required.
Tests:
* `tests/test_docker_home_override_scripts.py` — extends the existing
static-text guard with a regression assertion that the legacy
host-derived case-statement is gone and the new env-var opt-in is
present (locks against accidental revert).
* `tests/docker/test_dashboard.py` — adds two Docker-in-Docker tests
exercising the actual `/api/status` round-trip:
- 0.0.0.0 bind + `HERMES_DASHBOARD_OAUTH_CLIENT_ID` → gate engaged
- 0.0.0.0 bind + `HERMES_DASHBOARD_INSECURE=1` → gate disabled
Docs:
* `website/docs/user-guide/docker.md` + zh-Hans i18n — adds the new
env var to the table, replaces the stale prose ("the entrypoint
no longer auto-enables insecure mode" — which until this PR was
flat-out wrong) with an accurate description of the gate's
trigger conditions and the explicit opt-out.
shellcheck clean. Python static-text test passes locally. Behavioural
test will run against any future image build (CI's Docker harness).
`hermes dump` and the startup banner both call `git rev-parse HEAD` to
report the running commit, but `.dockerignore` line 2 excludes `.git` —
so inside the published image `hermes dump` shows
`version: ... [(unknown)]` and the banner drops its `· upstream <sha>`
suffix entirely. That makes support triage from container bug reports
impossible: we can't tell which commit the user is actually running.
Fix: thread the build-time SHA through as a Docker build-arg, write it
to `/opt/hermes/.hermes_build_sha` in the image, and have a new
`hermes_cli/build_info.get_build_sha()` read it as a fallback after the
existing live-git lookup fails. Output format is unchanged in both
callsites — same 8-char short SHA whether resolved live or baked.
Wiring:
- Dockerfile: `ARG HERMES_GIT_SHA=` + write-file step after the source
copy. Empty/missing arg → no file written → callers fall through to
live git (so local `docker build` without --build-arg is unchanged).
- docker-publish.yml: passes `HERMES_GIT_SHA=${{ github.sha }}` on all
four build-push-action steps (amd64/arm64, smoke-test + final push).
- dump.py:_get_git_commit() / banner.py:get_git_banner_state(): try
live git first, fall back to baked SHA, then to legacy `(unknown)`
/ None. Banner returns `upstream == local, ahead=0` because a built
image is by definition pinned to one commit.
Coverage:
- Unit tests cover build_info (file present/absent/empty/error,
truncation, whitespace), dump (live-git wins, both fallbacks,
identical output-format regression guard), and banner (no-repo +
baked, no-repo + no-sha, shallow-clone fallback).
- tests/docker/test_dump_build_sha.py is an integration regression
guard that runs against the real image, reads
`/opt/hermes/.hermes_build_sha`, and asserts `hermes dump` surfaces
its content (or stays at `(unknown)` if no file).
- Verified end-to-end: `docker build --build-arg HERMES_GIT_SHA=abc...`
→ `docker run ... dump` reports `[abc12345]`; without the build-arg
it reports `[(unknown)]` as before.
When operators ran `docker exec <c> hermes login` (or anything else
that wrote under $HERMES_HOME) they defaulted to root, leaving
/opt/data/auth.json root:root mode 0600. The supervised gateway
(UID 10000) then couldn't read its own credentials and returned
"Provider authentication failed: Hermes is not logged into Nous
Portal" on every Telegram/Discord/etc. message — even though
`docker exec <c> hermes chat -q ping` (also root) succeeded because
root could read its own root-owned file. _load_auth_store swallowed
PermissionError as a parse failure and copied the file aside as
auth.json.corrupt, making the diagnostic more misleading.
Fix: install a privilege-drop shim at /opt/hermes/bin/hermes,
prepended ahead of the venv on PATH. When invoked as root the shim
exec's the real venv binary via `s6-setuidgid hermes` — so any file
the docker-exec session writes is uid-aligned with the supervised
processes. Non-root callers (the supervised processes themselves,
`docker exec --user hermes`, kanban subagents, anything inside the
container that's not coming through docker-exec) hit a single exec
to the absolute venv path with no privilege change.
Recursion is impossible: the shim exec's the venv binary by
absolute path (/opt/hermes/.venv/bin/hermes), so the second hop
cannot re-enter the shim regardless of PATH state. No sentinel env
var needed (unlike #33583's gateway-run redirect which DOES need
HERMES_S6_SUPERVISED_CHILD because there's no absolute-path
equivalent for the s6 dispatch).
Opt-out: `docker exec -e HERMES_DOCKER_EXEC_AS_ROOT=1 …` for
diagnostic sessions where the operator deliberately wants root.
Strict truthiness (1/true/yes case-insensitive); typos like `=0`
do not silently opt out, mirroring HERMES_GATEWAY_NO_SUPERVISE in
#33583.
If `s6-setuidgid` is missing (someone stripped s6-overlay in a
downstream fork), the shim exits 126 with a remediation message
pointing at `--user hermes` and the opt-out — never silently runs
as root.
Test plan:
- tests/docker/test_docker_exec_privilege_drop.py — 11 tests
- shim drops root to hermes uid (file ownership check)
- shim short-circuits for non-root docker exec
- HERMES_DOCKER_EXEC_AS_ROOT=1 keeps root
- strict-truthiness parametrization (5 falsy values reject)
- main CMD path unaffected (recursion guard)
- E2E: every file written by docker-exec is readable by uid 10000
- Full tests/docker/ harness: 32/32 pass against fresh image build
- shellcheck --severity=error: clean
- hadolint: clean
- Manual: reproduced the original symptom (root-owned auth.json)
by bypassing the shim; confirmed default docker-exec produces
hermes-owned files; confirmed opt-out env keeps root semantics.
Known follow-up: this prevents NEW instances of the bug. Volumes
that already have root:root /opt/data/auth.json from a pre-shim
image need a one-time `chown hermes:hermes` before rebooting onto
the new image. A stage2-hook chown sweep can self-heal that, but
is deferred per scope decision.
Follow-up to #33583 (the gateway-run-supervised redirect).
Before this fix, the supervised gateway's stdout (most visibly the
"Hermes Gateway Starting…" rich-console banner) was swallowed by
`s6-log` into the rotated file at
`${HERMES_HOME}/logs/gateways/<profile>/current` and never reached
`docker logs`. Operational signal lived in two places:
* **docker logs** — saw stderr (Python `logging` defaults to
stderr), so warnings/errors were visible.
* **the rotated file** — saw stdout (rich banners, `print()`
output, third-party libs that wrote to fd 1).
This was surprising for users coming from the pre-s6 image, where
`docker run … gateway run` produced a single unified stream in
`docker logs`. They'd see partial output, conclude something was
broken, and dig around for the missing pieces.
Fix: add the `1` s6-log action directive before the file destination
so each line is forwarded to s6-log's stdout — which propagates up
the s6-supervise pipeline to /init's stdout = container stdout =
`docker logs`. The file destination is preserved as a second
destination, so the rotated log (with ISO 8601 timestamps) still
exists for `hermes logs` and for survival across container restarts.
Trade-off considered: timestamps. Putting `T` between `1` and the
file destination (not before `1`) means:
* docker logs sees raw lines — Python's logging formatter has its
own timestamps, and `docker logs --timestamps` adds another
layer when desired. No double-stamping in the common reading
path.
* The persisted file gets s6-log's ISO 8601 timestamp so even
output that lacked a Python-logger timestamp (rich banners,
third-party raw prints) is correlatable in `current`.
Verification:
* New unit-test assertion in `test_service_manager.py` locks the
`s6-log 1` directive into the rendered run-script. Mutation-
tested by reverting to the pre-fix script (no `1`); the assert
catches it cleanly.
* New docker-harness test `test_supervised_gateway_stdout_reaches_docker_logs`
builds the image, runs `docker run … gateway run`, and asserts
the unique `⚕` banner glyph reaches `docker logs`. Also verifies
the rotated file still contains the banner (no regression on
the existing file destination). Mutation-tested end-to-end: built
a deliberately-broken image without the `1` directive and the
test failed exactly as designed, citing the banner present in
`current` but absent from `docker logs`.
* `website/docs/user-guide/docker.md` gains a new `:::note Where
gateway logs go` admonition documenting both destinations and
the audit-log file at `${HERMES_HOME}/logs/container-boot.log`.
Existing functionality preserved: every other docker-harness test
still passes against the new image. Unit-test sweep across
`tests/hermes_cli/` (5561 tests) is green.
Pre-s6, `docker run nousresearch/hermes-agent gateway run` was the
standard invocation: gateway ran as the container's main process,
tini reaped zombies, container exit code matched gateway exit code,
no supervision. With s6-overlay as PID 1, the same invocation now
auto-upgrades to supervised semantics — auto-restart on crash,
dashboard supervised alongside (when HERMES_DASHBOARD=1 is set),
multiple profile gateways under the same /init.
Users get the new behavior with zero changes to their docker run
command. A loud one-line breadcrumb on stderr explains the upgrade
and points at the opt-out for users who genuinely want pre-s6
foreground semantics.
How it works:
1. `_gateway_command_inner` (the `gateway run` handler) checks if
we're inside a container with s6 as PID 1.
2. If yes, dispatches `start` to the s6 service manager (registers
and starts gateway-default), then `exec sleep infinity` to keep
the CMD process alive without binding container lifetime to
gateway PID lifetime. The supervised gateway can flap freely;
`docker stop` still tears everything down via /init stage 3.
3. If no, falls through to the existing foreground code path
unchanged. Host runs of `hermes gateway run` are unaffected.
Three gates make the redirect inert outside the intended scope:
* `detect_service_manager() != "s6"` — host/non-s6-container runs.
* `HERMES_S6_SUPERVISED_CHILD=1` env var (recursion guard) —
exported by `S6ServiceManager._render_run_script` for the
s6-supervised invocation itself. Without this guard, the
supervised `gateway run --replace` would re-enter the redirect
and recurse (run → start → run → start → ...) infinitely.
* `--no-supervise` CLI flag OR `HERMES_GATEWAY_NO_SUPERVISE=1` env
var — explicit user opt-out for CI smoke tests, debugging the
foreground startup path, or any case wanting "CMD exit =
container exit" semantics. Strict truthiness (1/true/yes,
case-insensitive); typos like `=0` do NOT silently opt out.
Tests:
* Unit tests in tests/hermes_cli/test_gateway_s6_dispatch.py
cover all five paths (host no-op, supervised fire, sentinel
recursion guard, CLI flag, env var truthy + falsy). The two
load-bearing gates (sentinel + opt-out) were mutation-tested
by removing each gate in isolation and confirming the dedicated
test fails with the expected error.
* Docker harness tests in tests/docker/test_gateway_run_supervised.py
cover the round trips end-to-end against a built image: redirect
fires (sleep-infinity heartbeat + supervised gateway-default
slot + breadcrumb), --no-supervise opt-out (foreground gateway,
no want-up on the slot), HERMES_GATEWAY_NO_SUPERVISE env var
works identically, recursion is impossible (≤1 supervised
python gateway-run + exactly 1 sleep-infinity parented to the
CMD wrapper), and HERMES_DASHBOARD=1 produces both supervised
gateway and supervised dashboard.
Docs:
* Added a `:::tip Gateway runs supervised` admonition near the
main docker.md example explaining the upgrade and pointing at
the opt-out. Pre-s6 (tini-based) images still run gateway run
as the foreground main process, so the note is scoped to the
s6 image only.
Trade-off documented in the helper docstring: container exit code
under the redirect is sleep's exit code (always 0 on SIGTERM), not
the gateway's. That was an explicit design call — the supervised
gateway is allowed to flap without taking the container with it,
which is what "supervision" means. CI users who want exit-code
forwarding can pass --no-supervise.
After the supervise-perms fix lands, the s6 lifecycle actually works
for the hermes user — hermes -p <profile> gateway start now genuinely
brings the supervised gateway up rather than silently no-op'ing on
EACCES. That exposes a latent bug in this test's assertion: it
expected 'want up' to appear literally in s6-svstat output, but
s6-svstat elides redundancies — when the slot is currently up AND
s6 wants it up, the output is just 'up (pid N pgid N) X seconds';
the explicit 'want up' token only appears when current ≠ wanted
(e.g. 'down (exitcode 1) … , want up' on a crash-loop).
Add a small helper _svstat_wants_up() that reads the want-state
correctly across both spellings:
* 'up …' → wanted up (unless explicit 'want down')
* 'down …, want up' → wanted up explicitly
* 'down …' → wanted down
Both stop and start assertions now use the helper. Also rewords
the module docstring to acknowledge that the supervised process
may succeed OR crash-loop depending on environment, but the want-
state contract holds either way.
(cherry picked from commit 02c933aedc)
PR #30136 review item O6: test_container_restart.py used fixed
`time.sleep(8)` calls after `docker restart` to wait for the
cont-init reconciler to finish. Fixed sleeps are slow when the
event happens fast and false-fail when the event happens slow.
Replace with two polling helpers:
* `_wait_for_path(container, path, kind='f' | 'd', deadline_s=...)`
— generic `test -f/-d` poller. Returns True on success, False on
timeout; callers assert with a clear message.
* `_wait_for_reconcile_log_mention(container, profile, ...)` — the
reconciler's per-profile log line is the canonical signal that
the cont-init reconcile has finished for that profile. Poll on
it instead of a sleep that hopes 8 seconds is enough.
The fixture-level setup wait is similarly migrated: it now polls
for `profile=default` in the boot log (every container always
gets a default-slot entry per item I1) and raises a clear timeout
error from the fixture if the container never finishes cont-init —
much better diagnostics than a mid-test KeyError.
The remaining `time.sleep()` calls are all internal interval_s
between probe attempts; no fixed wait points left.
PR #30136 review caught: `_allocate_gateway_port()` in profiles.py
computed a SHA-256-derived port that was threaded through
`register_profile_gateway(profile, port=N)` →
`_render_run_script(profile, port, extra_env)` → and then **ignored**.
The rendered run script picked the bind port from the profile's
config.yaml (`[gateway] port = …`), never from the allocator. So
the entire allocator + parameter chain was dead code.
Remove:
* `hermes_cli.profiles._allocate_gateway_port` (deterministic
SHA-256 → [9200, 9800) — never used).
* `port` kwarg from `ServiceManager.register_profile_gateway`
(Protocol + Mixin + S6 implementation).
* `port` positional arg from `_render_run_script(profile, port,
extra_env)` — now `_render_run_script(profile, extra_env)`.
* The pass-through call in `profiles._maybe_register_gateway_service`.
config.yaml is now the single source of truth for gateway port
selection — matches reality and reduces the API surface. Three
explanatory comments in service_manager.py / profiles.py document
the retirement so future readers don't reach for the allocator and
find a ghost.
Tests: drop the three `_allocate_gateway_port` tests; update
fakes' signatures throughout test_service_manager.py and
test_profiles_s6_hooks.py to match the new no-port API.
PR #30136 review caught a false positive: when HERMES_DASHBOARD was
unset, the dashboard run script did `exec sleep infinity`, so
`s6-svstat /run/service/dashboard` reported the slot as 'up'.
`hermes doctor` and any other s6-svstat-based health check saw the
dashboard as supervised-running even though no dashboard process
existed.
Add cont-init.d/03-dashboard-toggle: writes a `down` marker file
into `/run/service/dashboard/` when HERMES_DASHBOARD is falsy,
removes any leftover marker when it's truthy. s6-supervise honors
`down` by not starting the service, so s6-svstat reports 'down' —
matching reality.
The run script's HERMES_DASHBOARD case-statement stays in place as
a belt-and-suspenders guard, so the two layers can never disagree.
Two new integration tests lock the behavior: slot reports down
when unset; slot reports up when set to 1.
PR #30136 review surfaced two issues, both rooted in the same audit gap:
docker integration tests were running as root, not the unprivileged
`hermes` user (UID 10000) that the runtime actually uses via
`s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to
the s6 control surface worked as root in the tests but was inert in
production.
Fixes:
1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`,
which is root-only readable. For UID 10000 the symlink yields
PermissionError, `resolve()` silently returns the unresolved path,
and `exe.name == "exe"` — so detection always returned False, the
service-manager runtime-registration path was inert, and every
`hermes profile create` / `hermes -p X gateway start` silently
skipped the s6 hook. Replace with `/proc/1/comm` (world-readable)
+ `/run/s6/basedir` (s6-overlay-specific) — both required, fail
closed.
2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/`
{control,lock} to hermes so `s6-svscanctl -a/-an` works without
root. Previously the directory chown stopped at `/run/service`
and the FIFO inside stayed root-owned, so `register_profile_gateway`
from hermes failed at the rescan-trigger step with EACCES — the
wrapper in profiles.py caught the exception and printed a swallowed
warning, so profile creation appeared to succeed while the slot
was rolled back.
Audit changes to flush this class of bug next time:
- Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py`
that default to `-u hermes`. The module docstring explains why and
flags `user="root"` as opt-in only for tests that explicitly need
root (none currently do).
- Refactor every `docker exec` call in tests/docker/ through the new
helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py,
test_container_restart.py, test_s6_profile_gateway_integration.py).
- Add 5 unit tests covering `_s6_running` under various probe states
(both signals present; comm wrong; basedir missing; PermissionError
on /proc/1/comm; missing /proc — non-Linux). The PermissionError
test is the explicit regression guard for the original bug.
Known follow-up: the per-service `supervise/control` FIFO inside each
`/run/service/gateway-<profile>/supervise/` is created root-owned by
s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc
-u/-d/-t` from the hermes user will get EACCES on those. The audit
under `-u hermes` will reveal this in lifecycle tests — surfacing the
issue cleanly so it can be fixed in a focused follow-up (likely via a
small SUID helper or a polling chown loop in cont-init.d). The
detection + svscanctl fixes here are independent and complete on
their own.
Phase 4 of the s6-overlay supervision plan. Activates the Phase 3
S6ServiceManager by hooking it into the profile lifecycle and the
`hermes gateway start/stop/restart` dispatcher, and adds a cont-
init.d-time reconciliation pass that survives `docker restart`.
Task 4.0 — container-boot reconciliation:
/run/service/ is tmpfs, so every `docker restart` wipes every
per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles
invokes hermes_cli.container_boot.reconcile_profile_gateways() on
every boot, which walks $HERMES_HOME/profiles/<name>/, reads each
gateway_state.json, recreates the s6 service slot, and auto-starts
only those whose last state was 'running'. Other states
(stopped, starting, startup_failed, missing) register the slot
in the down state — avoiding crash-loops across restarts for a
gateway that was broken last boot. Per-profile outcome is recorded
to $HERMES_HOME/logs/container-boot.log.
Implementation: hermes_cli/container_boot.py + 12 unit tests.
Profile-marker is SOUL.md, not config.yaml, because `hermes profile
create` only seeds SOUL.md by default (config.yaml comes from
`hermes setup`).
Task 4.1 / 4.2 — profile create/delete hooks:
hermes_cli/profiles.py::create_profile now calls
_maybe_register_gateway_service(<canon>) at the end, which routes
through ServiceManager.register_profile_gateway when running on s6
and no-ops on host backends. delete_profile mirrors with
_maybe_unregister_gateway_service. _allocate_gateway_port produces
a deterministic SHA-256-derived port in [9200, 9800).
Task 4.3 — gateway dispatch + remove rejection arms:
_dispatch_via_service_manager_if_s6(action) intercepts
start/stop/restart at the top of each subcommand and routes them
through S6ServiceManager.{start,stop,restart}. The pre-Phase-4
`elif is_container():` rejection arms are kept as fallback for
pre-s6 containers / unsupported runtimes, but only ever fire when
detect_service_manager() != 's6'. install/uninstall under s6
print informational guidance pointing users at profile create/delete.
Removed the two xfail(strict=True) markers from
tests/docker/test_profile_gateway.py — both tests now pass strictly.
Task 4.4 — status reporting:
get_gateway_runtime_snapshot() reports
Manager: 's6 (container supervisor)' inside an s6 container instead
of 'docker (foreground)'.
Plan-vs-reality drift fixed in this commit:
- Plan's S6ServiceManager._render_run_script used
`gateway start --foreground --port {port}` — invented args; the
real CLI is `gateway run`. Switched accordingly. port arg
retained for API parity but now documented as 'currently ignored'.
- Plan's reconciler keyed on config.yaml; switched to SOUL.md
(config.yaml is created by hermes setup, not by hermes profile
create, so the original gate caught nothing).
- The plan's _dispatch helper used _profile_arg() which returns
'--profile <name>' (i.e. with the flag prefix). Switched to
_profile_suffix() which returns the bare name.
- Architecture B's docker exec doesn't get /command on PATH or
the venv on PATH; Dockerfile's runtime PATH now includes
/opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works
without sourcing the venv.
- stage2-hook now chowns $HERMES_HOME/profiles to hermes on every
boot, not just on the UID-remap path. Without this, files created
by docker-exec-as-root accumulate and the next reconciler run
fails with PermissionError reading SOUL.md.
Test harness:
19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to
passing). 78 unit tests across service_manager + container_boot +
profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck
pass cleanly.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
Phase 3 of the s6-overlay supervision plan. Implements the runtime-
registration surface from D4 — only the s6 backend supports
register_profile_gateway / unregister_profile_gateway /
list_profile_gateways; host backends continue to raise
NotImplementedError. No caller yet (Phase 4 wires in the profile
create/delete hooks).
Key implementation notes:
- Service directory shape: /run/service/gateway-<profile>/{type,run,log/run}.
Atomic register: write to gateway-<profile>.tmp, fsync via
os.rename. Cleanup on rescan failure.
- Run script uses #!/command/with-contenv sh so HERMES_HOME and any
extra_env arrive at exec time. The hermes -p <profile> gateway
start --foreground --port <port> command is wrapped in
s6-setuidgid hermes for the per-service privilege drop (OQ2-A).
- Log script (OQ8-C): persists via s6-log to
${HERMES_HOME}/logs/gateways/<profile>/. CRITICAL — HERMES_HOME is
a runtime env-var expansion in the rendered script, NOT a Python
f-string substitution. Negative-asserted in
test_s6_register_creates_service_dir_and_triggers_scan so
regressions are caught.
- PATH gotcha: /command/ is only on PATH for processes spawned by
the supervision tree (services, cont-init.d). `docker exec` and
profile-create hooks don't get it. S6ServiceManager calls all
s6-* binaries via absolute path through the new _S6_BIN_DIR
constant so callers don't have to fix up env vars.
- validate_profile_name rejects path-traversal, leading-dash (s6
would parse as a flag), uppercase, whitespace, and names >251
chars (s6-svscan default name_max).
Test coverage:
- 13 new unit tests in tests/hermes_cli/test_service_manager.py
(kind detection, run-script content, env quoting, register
rollback on rescan failure, unregister idempotence, list filter,
lifecycle dispatch, svstat parsing). Total: 36 passing.
- 2 new in-container integration tests in
tests/docker/test_s6_profile_gateway_integration.py validating
end-to-end registration against a real s6 supervision tree.
Docker harness: 14 passed, 2 xfailed (Phase 4 target unchanged).
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
BREAKING CHANGE: the container ENTRYPOINT is now /init (s6-overlay)
instead of /usr/bin/tini. Main hermes runs as the container CMD with
TTY inherited (preserving --tui), dashboard runs as a supervised s6-rc
service (HERMES_DASHBOARD=1 starts it; crashes auto-restart), and the
ground is laid for per-profile gateway supervision (Phase 3+4).
All five pre-s6 docker run invocation patterns continue to work
identically — verified by the Phase 0 docker harness:
docker run <image> → `hermes` with no args
docker run <image> chat -q "..." → `hermes chat -q ...` passthrough
docker run <image> sleep infinity → `sleep infinity` direct
docker run <image> bash → interactive bash
docker run -it <image> --tui → interactive Ink TUI
Phase 2 harness result: 12 passed, 2 xfailed (Phase 4 target). Hadolint
+ shellcheck pass cleanly.
Architecture pivot from plan v3 (documented in main-hermes/run header):
the plan called for main hermes to be an s6-supervised service, but
two real s6-overlay v3 mechanics blocked that — cont-init.d scripts
receive no arguments (CMD args are not visible to stage2-hook), and
`/run/s6/basedir/bin/halt` after writing the exit code did not
propagate the desired exit code (container exits 143). We use the
s6-overlay-native CMD pattern instead: main-wrapper.sh is the
container's main program (ENTRYPOINT prepends it so leading-dash
args like --version aren't intercepted by /init), exec's the final
program with stdin/stdout/stderr inherited, and the program's exit
code becomes the container exit code. main-hermes is now a no-op
`sleep infinity` slot kept for future supervised-gateway-container
modes. This trades "supervised restart of main hermes" for arg-
parity with the pre-s6 contract — main hermes was already unsupervised
under tini, so we lose nothing functional. Dashboard supervision is
the only new guarantee added by this phase.
Files added:
docker/main-wrapper.sh # arg routing + s6-setuidgid drop
docker/stage2-hook.sh # gosu-equivalent + chown + seed
docker/s6-rc.d/main-hermes/{type,run,dependencies.d/base}
docker/s6-rc.d/dashboard/{type,run,dependencies.d/base}
docker/s6-rc.d/user/contents.d/{main-hermes,dashboard}
Files changed:
Dockerfile: tini → s6-overlay install + ENTRYPOINT flip + service wiring
docker/entrypoint.sh: thin shim to stage2-hook.sh for back-compat
tests/docker/test_dashboard.py: add test_dashboard_restarts_after_crash
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
Two pre-existing baseline issues found while running the Phase 0 harness
against the tini image that need fixing before later phases can use the
harness as a behavior-parity oracle:
1. The autouse `_enforce_test_timeout` fixture in tests/conftest.py
hard-coded a 30s SIGALRM, which preempted any `pytest.mark.timeout`
marker (already honored by pytest-timeout). Honor the marker if
present; fall back to 30s otherwise. Docker harness tests carry a
180s marker applied at collection time in tests/docker/conftest.py.
2. test_dashboard_port_override polled via `ss -tlnp` / `netstat -tln`
— neither is installed in the Hermes image, so the probe trivially
failed even when the dashboard was bound. The dashboard also takes
8-15s to bind on cold image; the 5s sleep was insufficient. Replace
with a poll loop reading /proc/net/tcp directly (port 9120 = 0x23A0,
state 0A = LISTEN). Bump probe deadline to 60s and switch
test_dashboard_opt_in_starts to a similar poll for pgrep so we don't
regress to the same race.
Result: 11 passed, 2 xfailed (Phase 4 target) on tini image. Harness
now ready to serve as Phase 2's behavior-parity oracle.
The agent-test suite default is 30s; docker test_no_args (the dashboard
spin-up, the container restart) routinely take 60-90s. Without this
they intermittently fail in CI with TimeoutError.
Tasks 0.2-0.6 of the s6-overlay supervision plan. Locks the
user-visible behavior we must preserve through the Phase 2 init-
system swap:
- test_main_invocation.py (Task 0.2): docker run <image> with no
args, chat subcommand passthrough, bare executable passthrough,
bash pattern, exit-code propagation
- test_tui_passthrough.py (Task 0.3): TTY allocation via docker -t
using the host's script(1) for a PTY
- test_dashboard.py (Task 0.4): HERMES_DASHBOARD=1 opt-in,
HERMES_DASHBOARD_PORT override
- test_profile_gateway.py (Task 0.5): per-profile gateway
start/stop and profile-delete-stops-gateway. Both marked
xfail(strict=True) because the current tini image refuses
gateway lifecycle commands inside the container; Phase 4
Task 4.3 flips them to passing.
- test_zombie_reaping.py (Task 0.6): PID 1 reaps orphaned
zombies. tini does this today; s6-overlay's /init must
continue to.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
Task 0.1 of the s6-overlay supervision plan. Establishes the test
infrastructure for tests/docker/: skip-on-missing-Docker collection
hook, session-scoped image-build fixture (overridable via the
HERMES_TEST_IMAGE env var for faster local iteration), and a
container_name fixture that ensures cleanup on test exit.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md