mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-01 07:01:41 +00:00
Resolves the explicit "Known follow-up" left by commit2f8ceeab9and the resulting CI failures in tests/docker/test_dashboard.py and tests/docker/test_s6_profile_gateway_integration.py. The product gap --------------- Every hermes runtime operation inside the container runs as the hermes user (UID 10000) via s6-setuidgid. But s6-supervise — spawned by s6-svscan running as PID 1 — creates each service's supervise/ and top-level event/ directories with mode 0700 owned by its effective UID (root). That left every s6-svc / s6-svstat / s6-svwait call from hermes hitting EACCES on the supervise/control FIFO and supervise/status — i.e. the entire S6ServiceManager lifecycle (register, start, stop, unregister) was inert in production. The2f8ceeab9commit message called this out and deferred the fix. The audit changes that landed alongside it (defaulting docker_exec to -u hermes) made the integration tests reproduce the bug deterministically; the fix below resolves it. The fix: pre-create the supervise/ skeleton hermes-owned ---------------------------------------------------------- Reading s6's source (src/supervision/s6-supervise.c::trymkdir + control_init), the mkdir and mkfifo calls that build the supervise tree are EEXIST-safe: if the directory or FIFO is already present, s6-supervise reuses it and skips the chown/chmod fix-up that would normally make event/ 03730 root:root. So if we lay the skeleton down with hermes ownership before triggering s6-svscanctl -a, s6-supervise inherits our layout and never touches it. The death_tally / lock / status regular files written later by s6-supervise (still as root) land mode 0644 — world-readable — which is all s6-svstat needs. New module-level helper _seed_supervise_skeleton(svc_dir) in hermes_cli/service_manager.py lays down: svc_dir/event/ hermes:hermes 03730 svc_dir/supervise/ hermes:hermes 0755 svc_dir/supervise/event/ hermes:hermes 03730 svc_dir/supervise/control hermes:hermes 0660 (FIFO) svc_dir/log/event/ hermes:hermes 03730 (if log/ present) svc_dir/log/supervise/ hermes:hermes 0755 svc_dir/log/supervise/event/ hermes:hermes 03730 svc_dir/log/supervise/control hermes:hermes 0660 (FIFO) The log/ branch matters because the logger is a second s6-supervise instance — without it, unregister rmtree races on the logger's root-owned supervise dir even after the parent slot's supervise/ is hermes-owned. The helper is idempotent and swallows PermissionError on chown so it works equally well when called from root (cont-init.d) or hermes (runtime register). Wiring ------ 1. S6ServiceManager.register_profile_gateway calls _seed_supervise_skeleton(tmp_dir) just before publishing the slot via Path.replace. Runtime-registered profile gateways are set up by hermes. 2. container_boot._register_service does the same in the cont-init.d reconciliation path so boot-time-restored profile slots inherit the same layout. 3. New cont-init.d/015-supervise-perms script chowns the supervise/ and event/ trees for STATIC s6-rc services (dashboard, main-hermes). These are spawned by s6-rc before cont-init.d gets to run, so the EEXIST-trick doesn't apply; we chown the already-existing tree instead. s6-supervise keeps using the same files; it never re-asserts ownership on a running service. The script skips s6-overlay internal services (s6rc-*, s6-linux-*) so the supervision tree itself stays root-only. 015- slot is intentional: lex-sorts between 01-hermes-setup and 02-reconcile-profiles in the container's C-locale, so the chown finishes before the reconciler walks the scandir. Unregister teardown reordering ------------------------------ S6ServiceManager.unregister_profile_gateway now fires s6-svscanctl -an BEFORE rmtree (with a 200ms grace), so s6-svscan reaps the supervise child and releases its file handles on supervise/lock + supervise/status before we try to remove the directory. Previously rmtree raced s6-supervise on a set of files inside the supervise dir, and even with the parent supervise/ now hermes-owned, the contained files (death_tally, lock, status, written by root) could still be in use. Dashboard down-state redesign ----------------------------- The original PR #30136 review fix wrote a 'down' marker file into /run/service/dashboard/ via cont-init.d/03-dashboard-toggle. That approach was broken in two ways: (a) /run/service/dashboard is a symlink to a TRANSIENT /run/s6-rc:s6-rc-init:<tmpdir>/ directory while s6-rc is mid-transaction; the touch landed in a soon-to-be-discarded tmp. (b) Even when written to the final /run/s6-rc/servicedirs/ location, the 'down' file is only consulted by s6-supervise at slot startup. s6-rc's user-bundle explicitly transitions 'dashboard' to 'up' on every boot, overriding any down marker. The right fix is the canonical s6 pattern: when HERMES_DASHBOARD is unset, the dashboard run script exits 0 and a companion finish script exits 125. Per s6-supervise(8), exit code 125 from the finish script is the 'permanent failure, do not restart' marker — equivalent to s6-svc -O. The slot reports as 'down' to s6-svstat, matching the reality that no dashboard process is running. When HERMES_DASHBOARD IS truthy, finish exits 0 and restart-on-crash semantics apply. 03-dashboard-toggle is removed (its function is now subsumed by the run/finish pair). Tests ----- Adds four unit tests for _seed_supervise_skeleton covering the produced layout, the log/ subservice case, the skip-when-no-log case, and idempotency. The live-container verification continues to live in tests/docker/test_s6_profile_gateway_integration.py and tests/docker/test_dashboard.py — both now pass against the rebuilt image. References ---------- * Skarnet skaware mailing list 2020-02-02 (Laurent Bercot + Guillermo Diaz Hartusch) on unprivileged s6 tool semantics: http://skarnet.org/lists/skaware/1424.html * just-containers/s6-overlay#130 — same EEXIST-preseed pattern, community-validated 2016 onward * https://skarnet.org/software/s6/servicedir.html — exit-code 125 semantics in finish scripts (cherry picked from commitc41f908ad4)
90 lines
3.9 KiB
Text
90 lines
3.9 KiB
Text
#!/command/with-contenv sh
|
|
# shellcheck shell=sh
|
|
# Make supervise/ trees for ALL declared s6 services queryable and
|
|
# controllable by the unprivileged hermes user (UID 10000).
|
|
#
|
|
# Background (PR #30136 review item I4): the entire s6 lifecycle
|
|
# (s6-svc, s6-svstat, s6-svwait) is dispatched as the hermes user
|
|
# inside the container (every Hermes runtime path runs under
|
|
# ``s6-setuidgid hermes``). But s6-supervise creates each service's
|
|
# ``supervise/`` and top-level ``event/`` directory with mode 0700
|
|
# owned by its effective UID — which is root, because s6-supervise
|
|
# is spawned by s6-svscan running as PID 1. So unprivileged clients
|
|
# get EACCES on every probe / control call against the slot.
|
|
#
|
|
# Two fixes, one in each registration path:
|
|
#
|
|
# 1. For RUNTIME-registered profile gateways (created via the s6
|
|
# runtime register hooks in profiles.py): the Python helper
|
|
# ``_seed_supervise_skeleton`` pre-creates supervise/ + event/ +
|
|
# supervise/control owned by hermes BEFORE s6-svscanctl -a fires.
|
|
# s6-supervise's mkdir/mkfifo are EEXIST-safe, so it inherits our
|
|
# ownership and never tries to chown back to root.
|
|
#
|
|
# 2. For STATIC s6-rc services (dashboard, main-hermes) declared at
|
|
# image-build time under /etc/s6-overlay/s6-rc.d/*: these are
|
|
# compiled by s6-rc at boot, and s6-supervise spawns BEFORE
|
|
# cont-init.d gets to run — so by the time we're here, the
|
|
# supervise/ tree is already there as root:root 0700. We chown
|
|
# it here. s6-supervise will keep using the same files; it never
|
|
# re-asserts ownership on a running service.
|
|
#
|
|
# This script runs as root after 01-hermes-setup but before
|
|
# 02-reconcile-profiles, so the chowns are settled before the
|
|
# Python reconciler walks the scandir. Lexicographic ordering
|
|
# guarantees this — the suffix is unusual because we want to slot
|
|
# in between 01 and the existing 02-reconcile-profiles without
|
|
# renumbering both (which would be a churn-noise patch on its own).
|
|
|
|
set -eu
|
|
|
|
# /run/s6-rc/servicedirs holds the live, compiled service directories
|
|
# for every static (s6-rc) service. Symlinks under /run/service/*
|
|
# point here. Per-service supervise/ + event/ both need hermes
|
|
# ownership for s6-svstat etc. to work as hermes.
|
|
SVC_ROOT=/run/s6-rc/servicedirs
|
|
|
|
if [ ! -d "$SVC_ROOT" ]; then
|
|
echo "[supervise-perms] $SVC_ROOT not present; skipping"
|
|
exit 0
|
|
fi
|
|
|
|
for svc in "$SVC_ROOT"/*; do
|
|
[ -d "$svc" ] || continue
|
|
name=$(basename "$svc")
|
|
|
|
# Skip s6-overlay-internal services (they need to stay root-only;
|
|
# the s6rc-* helpers manage the supervision tree itself).
|
|
case "$name" in
|
|
s6rc-*|s6-linux-*)
|
|
continue
|
|
;;
|
|
esac
|
|
|
|
# supervise/ tree — needed by s6-svstat / s6-svc.
|
|
if [ -d "$svc/supervise" ]; then
|
|
chown -R hermes:hermes "$svc/supervise" 2>/dev/null || \
|
|
echo "[supervise-perms] could not chown $svc/supervise"
|
|
# 0710 = group searchable. ``s6-svstat`` only needs to openat
|
|
# status, not list the dir, but giving the hermes group +x is
|
|
# the minimum that lets group members access the contents.
|
|
chmod 0710 "$svc/supervise" 2>/dev/null || true
|
|
# supervise/control is a FIFO that s6-svc writes commands
|
|
# into; the hermes user needs +w. Owner is already hermes
|
|
# after the recursive chown above; widen perms to 0660 so
|
|
# ``s6-svc`` works for any member of the hermes group too.
|
|
if [ -p "$svc/supervise/control" ]; then
|
|
chmod 0660 "$svc/supervise/control" 2>/dev/null || true
|
|
fi
|
|
fi
|
|
|
|
# Top-level event/ dir — s6-svlisten1 / s6-svwait subscribe here.
|
|
if [ -d "$svc/event" ]; then
|
|
chown hermes:hermes "$svc/event" 2>/dev/null || \
|
|
echo "[supervise-perms] could not chown $svc/event"
|
|
# Preserve s6's 03730 mode (setgid + g+rwx + sticky).
|
|
chmod 03730 "$svc/event" 2>/dev/null || true
|
|
fi
|
|
done
|
|
|
|
echo "[supervise-perms] chowned supervise/ trees for static s6-rc services"
|