mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-31 06:51:29 +00:00
PR #30136 review surfaced two issues, both rooted in the same audit gap: docker integration tests were running as root, not the unprivileged `hermes` user (UID 10000) that the runtime actually uses via `s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to the s6 control surface worked as root in the tests but was inert in production. Fixes: 1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`, which is root-only readable. For UID 10000 the symlink yields PermissionError, `resolve()` silently returns the unresolved path, and `exe.name == "exe"` — so detection always returned False, the service-manager runtime-registration path was inert, and every `hermes profile create` / `hermes -p X gateway start` silently skipped the s6 hook. Replace with `/proc/1/comm` (world-readable) + `/run/s6/basedir` (s6-overlay-specific) — both required, fail closed. 2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/` {control,lock} to hermes so `s6-svscanctl -a/-an` works without root. Previously the directory chown stopped at `/run/service` and the FIFO inside stayed root-owned, so `register_profile_gateway` from hermes failed at the rescan-trigger step with EACCES — the wrapper in profiles.py caught the exception and printed a swallowed warning, so profile creation appeared to succeed while the slot was rolled back. Audit changes to flush this class of bug next time: - Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py` that default to `-u hermes`. The module docstring explains why and flags `user="root"` as opt-in only for tests that explicitly need root (none currently do). - Refactor every `docker exec` call in tests/docker/ through the new helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py, test_container_restart.py, test_s6_profile_gateway_integration.py). - Add 5 unit tests covering `_s6_running` under various probe states (both signals present; comm wrong; basedir missing; PermissionError on /proc/1/comm; missing /proc — non-Linux). The PermissionError test is the explicit regression guard for the original bug. Known follow-up: the per-service `supervise/control` FIFO inside each `/run/service/gateway-<profile>/supervise/` is created root-owned by s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc -u/-d/-t` from the hermes user will get EACCES on those. The audit under `-u hermes` will reveal this in lifecycle tests — surfacing the issue cleanly so it can be fixed in a focused follow-up (likely via a small SUID helper or a polling chown loop in cont-init.d). The detection + svscanctl fixes here are independent and complete on their own.
108 lines
3.8 KiB
Python
108 lines
3.8 KiB
Python
"""Harness: per-profile gateway start/stop inside the container.
|
||
|
||
Phase 4 wires `hermes -p <profile> gateway start/stop` through the s6
|
||
ServiceManager dispatch path inside the container — so the lifecycle
|
||
commands now bring up an s6-supervised gateway rather than refusing
|
||
with the pre-Phase-4 informational message.
|
||
|
||
These tests were marked ``xfail(strict=True)`` through Phase 0–3 and
|
||
flip to plain ``test_…`` once Phase 4 lands (now).
|
||
|
||
NB: The harness profile created here has no model/auth configured,
|
||
so the gateway process itself will exit with code 1 on every start
|
||
attempt (s6 will keep restarting it). We assert against s6's
|
||
``want up`` / ``want down`` state — which reflects the lifecycle
|
||
command's intent, not the supervised process's health.
|
||
|
||
Every ``docker exec`` here runs as the unprivileged ``hermes`` user
|
||
(via :func:`docker_exec_sh` in conftest); see the conftest module
|
||
docstring.
|
||
"""
|
||
from __future__ import annotations
|
||
|
||
import subprocess
|
||
import time
|
||
|
||
from tests.docker.conftest import docker_exec_sh
|
||
|
||
PROFILE = "test-harness-profile"
|
||
|
||
|
||
def _sh(
|
||
container: str, command: str, timeout: int = 30,
|
||
) -> subprocess.CompletedProcess[str]:
|
||
return docker_exec_sh(container, command, timeout=timeout)
|
||
|
||
|
||
def _svstat(container: str) -> str:
|
||
"""Returns the raw s6-svstat output for the test profile's slot.
|
||
/command/s6-svstat is called by absolute path because /command/
|
||
isn't on PATH for docker-exec sessions."""
|
||
r = _sh(container, f"/command/s6-svstat /run/service/gateway-{PROFILE}")
|
||
return r.stdout if r.returncode == 0 else ""
|
||
|
||
|
||
def test_profile_create_then_gateway_start(
|
||
built_image: str, container_name: str,
|
||
) -> None:
|
||
subprocess.run(
|
||
["docker", "run", "-d", "--name", container_name, built_image,
|
||
"sleep", "120"],
|
||
check=True, capture_output=True, timeout=30,
|
||
)
|
||
time.sleep(3)
|
||
|
||
r = _sh(container_name, f"hermes profile create {PROFILE}")
|
||
assert r.returncode == 0, f"profile create failed: {r.stderr}"
|
||
|
||
# Profile create's s6-register hook should have produced a service slot.
|
||
r = _sh(container_name, f"test -d /run/service/gateway-{PROFILE}")
|
||
assert r.returncode == 0, "s6 service slot not created on profile create"
|
||
|
||
r = _sh(container_name, f"hermes -p {PROFILE} gateway start", timeout=60)
|
||
assert r.returncode == 0, (
|
||
f"gateway start failed: stderr={r.stderr!r} stdout={r.stdout!r}"
|
||
)
|
||
|
||
# After start, s6's intent is "up" — even if the supervised gateway
|
||
# process spin-fails (no model/auth in the test profile), the
|
||
# supervision-state contract holds.
|
||
time.sleep(2)
|
||
state = _svstat(container_name)
|
||
assert "want up" in state, f"want up not in svstat: {state!r}"
|
||
|
||
r = _sh(container_name, f"hermes -p {PROFILE} gateway stop", timeout=30)
|
||
assert r.returncode == 0
|
||
|
||
time.sleep(2)
|
||
state = _svstat(container_name)
|
||
assert "want up" not in state, f"want up still in svstat: {state!r}"
|
||
|
||
|
||
def test_profile_delete_stops_gateway(
|
||
built_image: str, container_name: str,
|
||
) -> None:
|
||
"""Deleting a profile should stop its gateway and remove the s6
|
||
service slot."""
|
||
subprocess.run(
|
||
["docker", "run", "-d", "--name", container_name, built_image,
|
||
"sleep", "120"],
|
||
check=True, capture_output=True, timeout=30,
|
||
)
|
||
time.sleep(3)
|
||
|
||
_sh(container_name, f"hermes profile create {PROFILE}")
|
||
_sh(container_name, f"hermes -p {PROFILE} gateway start", timeout=60)
|
||
time.sleep(3)
|
||
|
||
r = _sh(
|
||
container_name,
|
||
f"hermes profile delete {PROFILE} --yes",
|
||
timeout=30,
|
||
)
|
||
assert r.returncode == 0, f"profile delete failed: {r.stderr}"
|
||
|
||
time.sleep(2)
|
||
# Service slot should be gone.
|
||
r = _sh(container_name, f"test -d /run/service/gateway-{PROFILE}")
|
||
assert r.returncode != 0, "s6 service slot still present after profile delete"
|