PR #30136 review caught: `_allocate_gateway_port()` in profiles.py
computed a SHA-256-derived port that was threaded through
`register_profile_gateway(profile, port=N)` →
`_render_run_script(profile, port, extra_env)` → and then **ignored**.
The rendered run script picked the bind port from the profile's
config.yaml (`[gateway] port = …`), never from the allocator. So
the entire allocator + parameter chain was dead code.
Remove:
* `hermes_cli.profiles._allocate_gateway_port` (deterministic
SHA-256 → [9200, 9800) — never used).
* `port` kwarg from `ServiceManager.register_profile_gateway`
(Protocol + Mixin + S6 implementation).
* `port` positional arg from `_render_run_script(profile, port,
extra_env)` — now `_render_run_script(profile, extra_env)`.
* The pass-through call in `profiles._maybe_register_gateway_service`.
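
For readers who never saw the retired allocator, a deterministic
hash-to-port mapping of this kind might have looked roughly like the
sketch below. Only the SHA-256 input and the [9200, 9800) range come
from the description above; the function name, the 8-byte slice, and
the modulo fold are assumptions made for illustration.

```python
import hashlib

PORT_MIN = 9200  # inclusive lower bound, as stated in the list above
PORT_MAX = 9800  # exclusive upper bound, as stated in the list above


def allocate_gateway_port(profile: str) -> int:
    """Map a profile name to a stable port in [PORT_MIN, PORT_MAX)."""
    digest = hashlib.sha256(profile.encode("utf-8")).digest()
    # Assumption: fold the first 8 digest bytes into the range with a modulo.
    offset = int.from_bytes(digest[:8], "big") % (PORT_MAX - PORT_MIN)
    return PORT_MIN + offset
```

The same profile name always maps to the same port, which is why the
value looked meaningful even though nothing consumed it.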
config.yaml is now the single source of truth for gateway port
selection — matches reality and reduces the API surface. Three
explanatory comments in service_manager.py / profiles.py document
the retirement so future readers don't reach for the allocator and
find a ghost.
Tests: drop the three `_allocate_gateway_port` tests; update
fakes' signatures throughout test_service_manager.py and
test_profiles_s6_hooks.py to match the new no-port API.

PR #30136 review surfaced two issues, both rooted in the same audit gap:
docker integration tests were running as root, not the unprivileged
`hermes` user (UID 10000) that the runtime actually uses via
`s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to
the s6 control surface worked as root in the tests but was inert in
production.
Fixes:
1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`,
which is root-only readable. For UID 10000 the symlink yields
PermissionError, `resolve()` silently returns the unresolved path,
and `exe.name == "exe"` — so detection always returned False, the
service-manager runtime-registration path was inert, and every
`hermes profile create` / `hermes -p X gateway start` silently
skipped the s6 hook. Replace with `/proc/1/comm` (world-readable)
+ `/run/s6/basedir` (s6-overlay-specific) — both required, fail
closed.
2. `02-reconcile-profiles` now also chowns
`/run/service/.s6-svscan/{control,lock}` to hermes so
`s6-svscanctl -a/-an` works without root. Previously the chown
stopped at the `/run/service` directory and the FIFO inside stayed
root-owned, so `register_profile_gateway` from hermes failed at the
rescan-trigger step with EACCES. The wrapper in profiles.py caught
the exception and only printed a warning, so profile creation
appeared to succeed while the slot was rolled back.
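
The two-signal, fail-closed probe from fix 1 can be sketched as
follows. This is a minimal illustration, not the actual `_s6_running()`
body: the expected comm value `s6-svscan` is inferred from the note
that s6-svscan is PID 1, and the injectable path parameters are an
assumption added so the probe can be exercised without a real /proc.

```python
from pathlib import Path

# Assumption: PID 1 reports "s6-svscan" in /proc/1/comm under s6-overlay.
EXPECTED_COMM = "s6-svscan"


def s6_running(
    comm_path: Path = Path("/proc/1/comm"),
    basedir: Path = Path("/run/s6/basedir"),
) -> bool:
    """Return True only if both s6 signals are present; fail closed otherwise."""
    try:
        comm = comm_path.read_text(encoding="utf-8").strip()
    except OSError:
        # Covers PermissionError and FileNotFoundError (e.g. non-Linux hosts).
        return False
    if comm != EXPECTED_COMM:
        return False
    try:
        return basedir.exists()
    except OSError:
        # An unreadable parent directory also counts as "not s6".
        return False
```

Any probe error is treated as "not running under s6", so a permission
problem can no longer masquerade as a successful detection.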
Audit changes to flush this class of bug next time:
- Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py`
that default to `-u hermes`. The module docstring explains why and
flags `user="root"` as opt-in only for tests that explicitly need
root (none currently do).
- Refactor every `docker exec` call in tests/docker/ through the new
helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py,
test_container_restart.py, test_s6_profile_gateway_integration.py).
- Add 5 unit tests covering `_s6_running` under various probe states
(both signals present; comm wrong; basedir missing; PermissionError
on /proc/1/comm; missing /proc — non-Linux). The PermissionError
test is the explicit regression guard for the original bug.
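
A rough sketch of what the `docker_exec` / `docker_exec_sh` helpers
could look like is shown below. The signatures, the timeout default,
and the separate `docker_exec_argv` builder are assumptions for
illustration; only the `-u hermes` default and the opt-in
`user="root"` come from the notes above.

```python
import subprocess
from typing import Sequence


def docker_exec_argv(container: str, cmd: Sequence[str], user: str = "hermes") -> list[str]:
    """Build a `docker exec` argv that runs as the unprivileged user by default."""
    return ["docker", "exec", "-u", user, container, *cmd]


def docker_exec(
    container: str, cmd: Sequence[str], user: str = "hermes", timeout: float = 30.0
) -> subprocess.CompletedProcess:
    """Run a command in the container; callers inspect returncode/stdout/stderr."""
    return subprocess.run(
        docker_exec_argv(container, cmd, user),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )


def docker_exec_sh(
    container: str, script: str, user: str = "hermes", timeout: float = 30.0
) -> subprocess.CompletedProcess:
    """Run a shell snippet in the container, again as hermes unless told otherwise."""
    return docker_exec(container, ["sh", "-c", script], user=user, timeout=timeout)
```

Keeping the argv construction in its own function means the default
user can be asserted in a unit test without a running container.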
Known follow-up: the per-service `supervise/control` FIFO inside each
`/run/service/gateway-<profile>/supervise/` is created root-owned by
s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc
-u/-d/-t` from the hermes user will get EACCES on those. The audit
under `-u hermes` will reveal this in lifecycle tests — surfacing the
issue cleanly so it can be fixed in a focused follow-up (likely via a
small SUID helper or a polling chown loop in cont-init.d). The
detection + svscanctl fixes here are independent and complete on
their own.

Phase 3 of the s6-overlay supervision plan. Implements the
runtime-registration surface from D4: only the s6 backend supports
register_profile_gateway / unregister_profile_gateway /
list_profile_gateways; host backends continue to raise
NotImplementedError. No caller yet (Phase 4 wires in the profile
create/delete hooks).
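
The split between the s6 backend and the host backends could be
expressed with a small mixin along these lines. The class names, the
`kind` attribute, and the error wording are hypothetical; only the
three method names and the NotImplementedError behaviour come from the
description above.

```python
class RuntimeRegistrationUnsupported:
    """Hypothetical mixin for backends that cannot register gateways at runtime."""

    kind = "host"

    def _unsupported(self, op: str) -> NotImplementedError:
        return NotImplementedError(
            f"{op} is only supported by the s6 backend (this backend is {self.kind!r})"
        )

    def register_profile_gateway(self, profile: str, **kwargs: object) -> None:
        raise self._unsupported("register_profile_gateway")

    def unregister_profile_gateway(self, profile: str) -> None:
        raise self._unsupported("unregister_profile_gateway")

    def list_profile_gateways(self) -> list[str]:
        raise self._unsupported("list_profile_gateways")


class ExampleHostServiceManager(RuntimeRegistrationUnsupported):
    """Hypothetical host backend used only to demonstrate the mixin."""

    kind = "example-host"
```

Callers that may run on either backend can catch NotImplementedError
and skip registration rather than special-casing backend types.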
Key implementation notes:
- Service directory shape: /run/service/gateway-<profile>/{type,run,log/run}.
Atomic register: write to gateway-<profile>.tmp, then publish it
atomically via os.rename. Cleanup on rescan failure.
- Run script uses #!/command/with-contenv sh so HERMES_HOME and any
extra_env arrive at exec time. The hermes -p <profile> gateway
start --foreground --port <port> command is wrapped in
s6-setuidgid hermes for the per-service privilege drop (OQ2-A).
- Log script (OQ8-C): persists via s6-log to
${HERMES_HOME}/logs/gateways/<profile>/. CRITICAL — HERMES_HOME is
a runtime env-var expansion in the rendered script, NOT a Python
f-string substitution. Negative-asserted in
test_s6_register_creates_service_dir_and_triggers_scan so
regressions are caught.
- PATH gotcha: /command/ is only on PATH for processes spawned by
the supervision tree (services, cont-init.d). `docker exec` and
profile-create hooks don't get it. S6ServiceManager calls all
s6-* binaries via absolute path through the new _S6_BIN_DIR
constant so callers don't have to fix up env vars.
- validate_profile_name rejects path-traversal, leading-dash (s6
would parse as a flag), uppercase, whitespace, and names >251
chars (s6-svscan default name_max).
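
The name validation and the stage-then-rename registration described
in the notes above can be sketched together as follows. This is an
illustration under stated assumptions, not the S6ServiceManager code:
the allowed character set, the `longrun` content of the `type` file,
the `rescan` callback, and the function names other than
`validate_profile_name` are all hypothetical, and the log/run script
is omitted for brevity.

```python
import os
import re
import shutil
from pathlib import Path
from typing import Callable

NAME_MAX = 251  # s6-svscan default name_max, per the note above
# Assumption: lowercase letters, digits, "_" and "-", never starting with "-".
_NAME_RE = re.compile(r"[a-z0-9_][a-z0-9_-]*")


def validate_profile_name(name: str) -> str:
    """Reject traversal, leading dashes, uppercase, whitespace, and long names."""
    if len(name) > NAME_MAX or not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid profile name: {name!r}")
    return name


def register_service_dir(
    scan_dir: Path, profile: str, run_script: str, rescan: Callable[[], None]
) -> Path:
    """Stage the service directory, publish it with a rename, roll back on failure."""
    name = f"gateway-{validate_profile_name(profile)}"
    final = scan_dir / name
    staging = scan_dir / f"{name}.tmp"
    shutil.rmtree(staging, ignore_errors=True)  # clear leftovers from a crashed attempt
    staging.mkdir(parents=True)
    (staging / "type").write_text("longrun\n")  # assumed content
    run = staging / "run"
    run.write_text(run_script)
    run.chmod(0o755)
    os.rename(staging, final)  # atomic when both paths share a filesystem
    try:
        rescan()  # e.g. a wrapper around s6-svscanctl
    except Exception:
        shutil.rmtree(final, ignore_errors=True)  # roll the slot back
        raise
    return final
```

Because the rename is the only step that makes the directory visible
under its final name, a scan never observes a half-written service.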
Test coverage:
- 13 new unit tests in tests/hermes_cli/test_service_manager.py
(kind detection, run-script content, env quoting, register
rollback on rescan failure, unregister idempotence, list filter,
lifecycle dispatch, svstat parsing). Total: 36 passing.
- 2 new in-container integration tests in
tests/docker/test_s6_profile_gateway_integration.py validating
end-to-end registration against a real s6 supervision tree.
Docker harness: 14 passed, 2 xfailed (Phase 4 target unchanged).
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md