mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-07-01 12:02:05 +00:00
The register path builds each profile-gateway slot in a sibling staging dir under /run/service (the scandir s6-svscan watches), then atomically renames it to the live gateway-<profile> name. The staging dir was named gateway-<profile>.tmp, a NON-dotfile, so a concurrent `s6-svscanctl -a` rescan (fired by the cont-init reconciler registering gateway-default, or by a sibling register) would supervise the half-built slot the moment it had a valid type/run: s6-supervise spawns AS ROOT and mkdirs supervise/ root-owned 0700, then the in-flight _seed_supervise_skeleton early-returns on the now-existing supervise/ and the next `mkdir supervise/event` hits PermissionError.

That is the arm64-only CI flake on test_s6_unregister_removes_service_dir_in_live_container (PermissionError: /run/service/gateway-phase3test.tmp/supervise/event). It is arm64-only because the native-arm runner's wider scheduling jitter lets the rescan land inside the ~ms seed window; amd64 ran 30/30 clean.

Fix: dot-prefix the staging dir (.gateway-<profile>.tmp) in both register paths (S6ServiceManager.register_profile_gateway and container_boot._register_service). s6-svscan skips any scandir entry whose name begins with '.', so the half-built slot can never be supervised mid-build. The atomic rename to the dotless live name is unchanged.

Verified on a real s6 image (amd64): a non-dotted staging dir is picked up by an svscanctl -a rescan (SUPERVISED owner=root) while a dot-prefixed one is ignored (NOT-SUPERVISED). Added a docker-harness regression test that asserts both, plus a unit test that the staging dir is dot-prefixed.
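The staging-and-rename pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual `register_profile_gateway` code: `staging_name`, `svscan_would_consider`, and `register_slot` are hypothetical helpers, the `run` script body is a placeholder, and the predicate only models the dot-skipping rule the commit message attributes to s6-svscan.

```python
import os
import tempfile
from pathlib import Path


def staging_name(profile: str) -> str:
    """Dot-prefixed staging name (hypothetical helper, not the project's API)."""
    return f".gateway-{profile}.tmp"


def svscan_would_consider(name: str) -> bool:
    """Model of the rule described above: entries starting with '.' are skipped."""
    return not name.startswith(".")


def register_slot(scandir: Path, profile: str) -> Path:
    """Build the slot in a hidden staging dir, then rename it into place."""
    staging = scandir / staging_name(profile)
    live = scandir / f"gateway-{profile}"

    staging.mkdir()
    (staging / "type").write_text("longrun\n")
    run = staging / "run"
    # Placeholder run script; the real gateway run script is not shown here.
    run.write_text("#!/command/execlineb -P\n/command/s6-sleep 600\n")
    run.chmod(0o755)

    # Until this rename, the only name present in the scandir starts with '.',
    # so a rescan that lands mid-build has nothing it would supervise.
    os.rename(staging, live)
    return live


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        slot = register_slot(Path(tmp), "demo")
        print(slot.name, sorted(p.name for p in slot.iterdir()))
```

Because the staging directory and the live directory share a parent, the rename stays on one filesystem, which is what allows it to be atomic on POSIX systems.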
201 lines
8.4 KiB
Python
"""Harness: in-container integration tests for S6ServiceManager.

The unit tests in tests/hermes_cli/test_service_manager.py exercise the
class against a tmp-path scandir with a stubbed ``subprocess.run``.
These tests run the real class inside a real container against the
real s6-svc / s6-svscanctl binaries, validating end-to-end.

Phase 3 only registers the service slot; it doesn't depend on the
gateway actually starting (the binary will refuse to start without a
valid profile config). The full register → start → supervised-restart
→ unregister cycle is covered by Phase 4 once profile create/delete
hooks land.

Every ``docker exec`` here runs as the unprivileged ``hermes`` user
(via :func:`docker_exec` in conftest); see the conftest module
docstring. ``/run/service`` is chowned hermes-writable by the
``02-reconcile-profiles`` cont-init.d script, so register/unregister
operations work correctly under UID 10000.
"""
from __future__ import annotations

from tests.docker.conftest import docker_exec, start_container


_REGISTER_SCRIPT = """
import sys
sys.path.insert(0, "/opt/hermes")
from hermes_cli.service_manager import S6ServiceManager
S6ServiceManager().register_profile_gateway("phase3test")
# Don't worry about whether the gateway actually starts: we only care
# that the supervision slot was created. The gateway run script will
# likely error out (no profile config exists) but that's expected.
print("REGISTERED")
"""

_UNREGISTER_SCRIPT = """
import sys
sys.path.insert(0, "/opt/hermes")
from hermes_cli.service_manager import S6ServiceManager
S6ServiceManager().unregister_profile_gateway("phase3test")
print("UNREGISTERED")
"""


def test_s6_register_creates_service_dir_in_live_container(
    built_image: str, container_name: str,
) -> None:
    """S6ServiceManager.register_profile_gateway must create
    ``/run/service/gateway-<profile>/`` and trigger s6-svscan rescan
    against the real s6 supervision tree."""
    start_container(built_image, container_name, cmd="sleep 120")

    r = docker_exec(container_name, "python3", "-c", _REGISTER_SCRIPT, timeout=30)
    assert "REGISTERED" in r.stdout, (
        f"register failed: stderr={r.stderr!r} stdout={r.stdout!r}"
    )

    # Service directory exists with the expected structure.
    r = docker_exec(container_name, "test", "-d", "/run/service/gateway-phase3test")
    assert r.returncode == 0, "service directory not created"

    r = docker_exec(container_name, "test", "-f", "/run/service/gateway-phase3test/run")
    assert r.returncode == 0, "run script not created"

    r = docker_exec(container_name, "test", "-f",
                    "/run/service/gateway-phase3test/log/run")
    assert r.returncode == 0, "log/run script not created"

    # s6-svscan picked it up: s6-svstat works against the dir.
    # `docker exec` doesn't put /command/ on PATH (only the supervision
    # tree does), so call s6-svstat by absolute path.
    r = docker_exec(container_name, "/command/s6-svstat",
                    "/run/service/gateway-phase3test")
    assert r.returncode == 0, f"s6-svstat failed: {r.stderr or r.stdout}"

    # list_profile_gateways picks it up.
    r = docker_exec(container_name, "python3", "-c", (
        "from hermes_cli.service_manager import S6ServiceManager;"
        "print(S6ServiceManager().list_profile_gateways())"
    ))
    assert "phase3test" in r.stdout, f"list output: {r.stdout!r}"


def test_s6_unregister_removes_service_dir_in_live_container(
    built_image: str, container_name: str,
) -> None:
    """unregister_profile_gateway must stop the service, remove the
    directory, and trigger s6-svscan rescan so the supervise process
    is dropped."""
    start_container(built_image, container_name, cmd="sleep 120")

    # First register so we have something to unregister.
    r = docker_exec(container_name, "python3", "-c", _REGISTER_SCRIPT, timeout=30)
    assert "REGISTERED" in r.stdout

    # Then unregister.
    r = docker_exec(container_name, "python3", "-c", _UNREGISTER_SCRIPT, timeout=30)
    assert "UNREGISTERED" in r.stdout, (
        f"unregister failed: stderr={r.stderr!r} stdout={r.stdout!r}"
    )

    # Directory is gone.
    r = docker_exec(container_name, "test", "-d", "/run/service/gateway-phase3test")
    assert r.returncode != 0, "service directory still exists after unregister"

    # list_profile_gateways no longer includes it.
    r = docker_exec(container_name, "python3", "-c", (
        "from hermes_cli.service_manager import S6ServiceManager;"
        "print(S6ServiceManager().list_profile_gateways())"
    ))
    assert "phase3test" not in r.stdout


# Shell probe: build a service-shaped staging dir under the live scandir
# with a given NAME, fire a real `s6-svscanctl -a` rescan, wait, and
# report whether s6-svscan supervised it (which would create a root-owned
# supervise/ dir). Used to prove the dot-prefixed staging name is INVISIBLE
# to a concurrent rescan while a non-dotted one is not.
#
# Echoes one of: SUPERVISED / NOT-SUPERVISED, plus the supervise/ owner.
_SVSCAN_PICKUP_PROBE = r"""
set -eu
NAME="$1"
SCANDIR=/run/service
DIR="$SCANDIR/$NAME"
rm -rf "$DIR"
mkdir -p "$DIR"
printf 'longrun\n' > "$DIR/type"
printf '#!/command/execlineb -P\n/command/s6-sleep 600\n' > "$DIR/run"
chmod 755 "$DIR/run"
# Trigger a full rescan, exactly as register/reconcile do.
/command/s6-svscanctl -a "$SCANDIR"
# Give s6-svscan time to act (its scan is async; 200ms is the manager's
# own settle delay, use 2s here to be comfortably past it on any arch).
/command/s6-sleep 2
if [ -d "$DIR/supervise" ]; then
    owner=$(stat -c '%U' "$DIR/supervise" 2>/dev/null || echo '?')
    echo "SUPERVISED owner=$owner"
else
    echo "NOT-SUPERVISED"
fi
# Best-effort teardown so the probe leaves no live supervisor behind.
/command/s6-svc -d "$DIR" 2>/dev/null || true
/command/s6-svscanctl -an "$SCANDIR" 2>/dev/null || true
/command/s6-sleep 1
rm -rf "$DIR" 2>/dev/null || true
"""


def test_s6_dotfile_staging_dir_is_ignored_by_svscan_rescan(
    built_image: str, container_name: str,
) -> None:
    """Regression for the arm64 register-seed race.

    The register path builds the slot in a sibling staging dir and then
    atomically renames it to the live ``gateway-<profile>`` name. That
    staging dir lives INSIDE the scandir s6-svscan watches, so its NAME
    decides whether a concurrent ``s6-svscanctl -a`` rescan (fired by the
    cont-init reconciler registering ``gateway-default``, or by another
    register) supervises the half-built slot.

    - A NON-dotted name (the old ``gateway-<p>.tmp``) IS picked up: once it
      has a valid ``type``/``run``, s6-svscan spawns ``s6-supervise`` AS
      ROOT, creating a root-owned ``supervise/``, which makes the in-flight
      ``_seed_supervise_skeleton`` EACCES on ``mkdir supervise/event``. That
      is the arm64-only flake (the native-arm runner's wider scheduling
      jitter lets the rescan land inside the seed window).
    - A DOT-prefixed name (the fix, ``.gateway-<p>.tmp``) is SKIPPED by
      s6-svscan and never supervised, so no root-owned ``supervise/`` can
      appear under the staging dir.

    This proves the mechanism directly and is arch-independent (it does not
    rely on hitting the narrow timing window; it forces the rescan and
    checks pickup), so it guards the fix on the amd64 job too.
    """
    start_container(built_image, container_name, cmd="sleep 120")

    # Control: a NON-dotted service-shaped dir IS supervised by the rescan
    # (root-owned supervise/). This is the pre-fix staging-name behaviour and
    # confirms the probe actually exercises s6-svscan pickup.
    r = docker_exec(
        container_name, "sh", "-c", _SVSCAN_PICKUP_PROBE, "probe",
        "gateway-raceprobe.tmp", user="root", timeout=30,
    )
    assert "SUPERVISED" in r.stdout and "NOT-SUPERVISED" not in r.stdout, (
        "control failed: a non-dotted staging dir should be picked up by "
        f"s6-svscan. stdout={r.stdout!r} stderr={r.stderr!r}"
    )

    # The fix: a DOT-prefixed staging dir (the name register/reconcile now
    # use) must be IGNORED by the same rescan: no supervisor, no root-owned
    # supervise/, so the in-flight seed can never EACCES.
    r = docker_exec(
        container_name, "sh", "-c", _SVSCAN_PICKUP_PROBE, "probe",
        ".gateway-raceprobe.tmp", user="root", timeout=30,
    )
    assert "NOT-SUPERVISED" in r.stdout, (
        "dot-prefixed staging dir was supervised by s6-svscan; the race "
        f"that EACCESes the seed is still reachable. stdout={r.stdout!r} "
        f"stderr={r.stderr!r}"
    )