fix(service_manager): s6 detection works for unprivileged hermes user

PR #30136 review surfaced two issues, both rooted in the same audit gap: docker integration tests were running as root, not the unprivileged `hermes` user (UID 10000) that the runtime actually uses via `s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to the s6 control surface worked as root in the tests but was inert in production. Fixes: 1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`, which is root-only readable. For UID 10000 the symlink yields PermissionError, `resolve()` silently returns the unresolved path, and `exe.name == "exe"` — so detection always returned False, the service-manager runtime-registration path was inert, and every `hermes profile create` / `hermes -p X gateway start` silently skipped the s6 hook. Replace with `/proc/1/comm` (world-readable) + `/run/s6/basedir` (s6-overlay-specific) — both required, fail closed. 2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/` {control,lock} to hermes so `s6-svscanctl -a/-an` works without root. Previously the directory chown stopped at `/run/service` and the FIFO inside stayed root-owned, so `register_profile_gateway` from hermes failed at the rescan-trigger step with EACCES — the wrapper in profiles.py caught the exception and printed a swallowed warning, so profile creation appeared to succeed while the slot was rolled back. Audit changes to flush this class of bug next time: - Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py` that default to `-u hermes`. The module docstring explains why and flags `user="root"` as opt-in only for tests that explicitly need root (none currently do). - Refactor every `docker exec` call in tests/docker/ through the new helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py, test_container_restart.py, test_s6_profile_gateway_integration.py). - Add 5 unit tests covering `_s6_running` under various probe states (both signals present; comm wrong; basedir missing; PermissionError on /proc/1/comm; missing /proc — non-Linux). The PermissionError test is the explicit regression guard for the original bug. Known follow-up: the per-service `supervise/control` FIFO inside each `/run/service/gateway-<profile>/supervise/` is created root-owned by s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc -u/-d/-t` from the hermes user will get EACCES on those. The audit under `-u hermes` will reveal this in lifecycle tests — surfacing the issue cleanly so it can be fixed in a focused follow-up (likely via a small SUID helper or a polling chown loop in cont-init.d). The detection + svscanctl fixes here are independent and complete on their own.
2026-06-07 08:02:23 +00:00 · 2026-05-23 14:56:39 +10:00 · 2026-05-23 14:56:39 +10:00 · 2f8ceeab9a
commit 2f8ceeab9a
parent a6f7171a5e
9 changed files with 241 additions and 53 deletions
--- a/tests/docker/test_dashboard.py
+++ b/tests/docker/test_dashboard.py
@ -5,12 +5,18 @@ it stays dead. After Phase 2 (s6): dashboard starts once; if it crashes
 it is restarted under supervision. The restart-after-crash test lives in
 Phase 2 Task 2.5; this file only locks the opt-in surface (which must
 not change between tini and s6).
+
+Every ``docker exec`` here runs as the unprivileged ``hermes`` user
+(via :func:`docker_exec`/:func:`docker_exec_sh` in conftest), matching
+the realistic runtime context. See the conftest module docstring.
 """
 from __future__ import annotations

 import subprocess
 import time

+from tests.docker.conftest import docker_exec, docker_exec_sh
+

 def _poll(container: str, probe: str, *, deadline_s: float = 30.0,
          interval_s: float = 0.5) -> tuple[bool, str]:
@ -19,10 +25,7 @@ def _poll(container: str, probe: str, *, deadline_s: float = 30.0,
    end = time.monotonic() + deadline_s
    last = ""
    while time.monotonic() < end:
-        r = subprocess.run(
-            ["docker", "exec", container, "sh", "-c", probe],
-            capture_output=True, text=True, timeout=10,
-        )
+        r = docker_exec_sh(container, probe, timeout=10)
        last = r.stdout
        if r.returncode == 0:
            return True, last
@ -42,11 +45,7 @@ def test_dashboard_not_running_by_default(
    # Give the entrypoint enough time to finish bootstrap; if a dashboard
    # were going to start it'd be visible by now.
    time.sleep(5)
-    r = subprocess.run(
-        ["docker", "exec", container_name,
-         "pgrep", "-f", "hermes dashboard"],
-        capture_output=True, text=True, timeout=10,
-    )
+    r = docker_exec(container_name, "pgrep", "-f", "hermes dashboard")
    # pgrep exits non-zero when no match found
    assert r.returncode != 0, (
        "Dashboard should not be running without HERMES_DASHBOARD"
@ -121,10 +120,8 @@ def test_dashboard_restarts_after_crash(
    # a couple of times before giving up.
    first_pid: str | None = None
    for _attempt in range(10):
-        first_pid_result = subprocess.run(
-            ["docker", "exec", container_name,
-             "pgrep", "-f", "hermes dashboard"],
-            capture_output=True, text=True, timeout=10,
+        first_pid_result = docker_exec(
+            container_name, "pgrep", "-f", "hermes dashboard",
        )
        first_pids = first_pid_result.stdout.strip().split()
        if first_pids:
@ -133,21 +130,15 @@ def test_dashboard_restarts_after_crash(
        time.sleep(0.5)
    assert first_pid is not None, "Could not capture initial dashboard PID"

-    # Kill the dashboard.
-    subprocess.run(
-        ["docker", "exec", container_name, "kill", "-9", first_pid],
-        capture_output=True, timeout=10,
-    )
+    # Kill the dashboard. The dashboard process runs as hermes, so the
+    # hermes user can kill it (same UID).
+    docker_exec(container_name, "kill", "-9", first_pid)

    # s6 backs off ~1s before restart; allow up to 15s for the new
    # process to appear with a different PID.
    deadline = time.monotonic() + 15.0
    while time.monotonic() < deadline:
-        r = subprocess.run(
-            ["docker", "exec", container_name,
-             "pgrep", "-f", "hermes dashboard"],
-            capture_output=True, text=True, timeout=10,
-        )
+        r = docker_exec(container_name, "pgrep", "-f", "hermes dashboard")
        pids = r.stdout.strip().split() if r.returncode == 0 else []
        if pids and pids[0] != first_pid:
            return  # success