hermes-agent/tests/docker/test_profile_gateway.py
Ben fc39296e1f
fix(service_manager): s6 detection works for unprivileged hermes user
PR #30136 review surfaced two issues, both rooted in the same audit gap:
docker integration tests were running as root, not the unprivileged
`hermes` user (UID 10000) that the runtime actually uses via
`s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to
the s6 control surface worked as root in the tests but was inert in
production.

Fixes:

1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`,
   which is root-only readable. For UID 10000 the symlink yields
   PermissionError, `resolve()` silently returns the unresolved path,
   and `exe.name == "exe"` — so detection always returned False, the
   service-manager runtime-registration path was inert, and every
   `hermes profile create` / `hermes -p X gateway start` silently
   skipped the s6 hook. Replace with `/proc/1/comm` (world-readable)
   + `/run/s6/basedir` (s6-overlay-specific) — both required, fail
   closed.

2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/`
   {control,lock} to hermes so `s6-svscanctl -a/-an` works without
   root. Previously the directory chown stopped at `/run/service`
   and the FIFO inside stayed root-owned, so `register_profile_gateway`
   from hermes failed at the rescan-trigger step with EACCES — the
   wrapper in profiles.py caught the exception and printed a swallowed
   warning, so profile creation appeared to succeed while the slot
   was rolled back.

Audit changes to flush this class of bug next time:

- Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py`
  that default to `-u hermes`. The module docstring explains why and
  flags `user="root"` as opt-in only for tests that explicitly need
  root (none currently do).
- Refactor every `docker exec` call in tests/docker/ through the new
  helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py,
  test_container_restart.py, test_s6_profile_gateway_integration.py).
- Add 5 unit tests covering `_s6_running` under various probe states
  (both signals present; comm wrong; basedir missing; PermissionError
  on /proc/1/comm; missing /proc — non-Linux). The PermissionError
  test is the explicit regression guard for the original bug.

Known follow-up: the per-service `supervise/control` FIFO inside each
`/run/service/gateway-<profile>/supervise/` is created root-owned by
s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc
-u/-d/-t` from the hermes user will get EACCES on those. The audit
under `-u hermes` will reveal this in lifecycle tests — surfacing the
issue cleanly so it can be fixed in a focused follow-up (likely via a
small SUID helper or a polling chown loop in cont-init.d). The
detection + svscanctl fixes here are independent and complete on
their own.
2026-05-24 18:05:33 -07:00

108 lines
3.8 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

"""Harness: per-profile gateway start/stop inside the container.
Phase 4 wires `hermes -p <profile> gateway start/stop` through the s6
ServiceManager dispatch path inside the container — so the lifecycle
commands now bring up an s6-supervised gateway rather than refusing
with the pre-Phase-4 informational message.
These tests were marked ``xfail(strict=True)`` through Phase 03 and
flip to plain ``test_…`` once Phase 4 lands (now).
NB: The harness profile created here has no model/auth configured,
so the gateway process itself will exit with code 1 on every start
attempt (s6 will keep restarting it). We assert against s6's
``want up`` / ``want down`` state — which reflects the lifecycle
command's intent, not the supervised process's health.
Every ``docker exec`` here runs as the unprivileged ``hermes`` user
(via :func:`docker_exec_sh` in conftest); see the conftest module
docstring.
"""
from __future__ import annotations
import subprocess
import time
from tests.docker.conftest import docker_exec_sh
PROFILE = "test-harness-profile"
def _sh(
container: str, command: str, timeout: int = 30,
) -> subprocess.CompletedProcess[str]:
return docker_exec_sh(container, command, timeout=timeout)
def _svstat(container: str) -> str:
"""Returns the raw s6-svstat output for the test profile's slot.
/command/s6-svstat is called by absolute path because /command/
isn't on PATH for docker-exec sessions."""
r = _sh(container, f"/command/s6-svstat /run/service/gateway-{PROFILE}")
return r.stdout if r.returncode == 0 else ""
def test_profile_create_then_gateway_start(
built_image: str, container_name: str,
) -> None:
subprocess.run(
["docker", "run", "-d", "--name", container_name, built_image,
"sleep", "120"],
check=True, capture_output=True, timeout=30,
)
time.sleep(3)
r = _sh(container_name, f"hermes profile create {PROFILE}")
assert r.returncode == 0, f"profile create failed: {r.stderr}"
# Profile create's s6-register hook should have produced a service slot.
r = _sh(container_name, f"test -d /run/service/gateway-{PROFILE}")
assert r.returncode == 0, "s6 service slot not created on profile create"
r = _sh(container_name, f"hermes -p {PROFILE} gateway start", timeout=60)
assert r.returncode == 0, (
f"gateway start failed: stderr={r.stderr!r} stdout={r.stdout!r}"
)
# After start, s6's intent is "up" — even if the supervised gateway
# process spin-fails (no model/auth in the test profile), the
# supervision-state contract holds.
time.sleep(2)
state = _svstat(container_name)
assert "want up" in state, f"want up not in svstat: {state!r}"
r = _sh(container_name, f"hermes -p {PROFILE} gateway stop", timeout=30)
assert r.returncode == 0
time.sleep(2)
state = _svstat(container_name)
assert "want up" not in state, f"want up still in svstat: {state!r}"
def test_profile_delete_stops_gateway(
built_image: str, container_name: str,
) -> None:
"""Deleting a profile should stop its gateway and remove the s6
service slot."""
subprocess.run(
["docker", "run", "-d", "--name", container_name, built_image,
"sleep", "120"],
check=True, capture_output=True, timeout=30,
)
time.sleep(3)
_sh(container_name, f"hermes profile create {PROFILE}")
_sh(container_name, f"hermes -p {PROFILE} gateway start", timeout=60)
time.sleep(3)
r = _sh(
container_name,
f"hermes profile delete {PROFILE} --yes",
timeout=30,
)
assert r.returncode == 0, f"profile delete failed: {r.stderr}"
time.sleep(2)
# Service slot should be gone.
r = _sh(container_name, f"test -d /run/service/gateway-{PROFILE}")
assert r.returncode != 0, "s6 service slot still present after profile delete"