hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-29 06:31:32 +00:00

Author	SHA1	Message	Date
Ben	4f416fc40c	fix(docker): make s6 lifecycle work for the unprivileged hermes user Resolves the explicit "Known follow-up" left by commit `2f8ceeab9` and the resulting CI failures in tests/docker/test_dashboard.py and tests/docker/test_s6_profile_gateway_integration.py. The product gap --------------- Every hermes runtime operation inside the container runs as the hermes user (UID 10000) via s6-setuidgid. But s6-supervise — spawned by s6-svscan running as PID 1 — creates each service's supervise/ and top-level event/ directories with mode 0700 owned by its effective UID (root). That left every s6-svc / s6-svstat / s6-svwait call from hermes hitting EACCES on the supervise/control FIFO and supervise/status — i.e. the entire S6ServiceManager lifecycle (register, start, stop, unregister) was inert in production. The `2f8ceeab9` commit message called this out and deferred the fix. The audit changes that landed alongside it (defaulting docker_exec to -u hermes) made the integration tests reproduce the bug deterministically; the fix below resolves it. The fix: pre-create the supervise/ skeleton hermes-owned ---------------------------------------------------------- Reading s6's source (src/supervision/s6-supervise.c::trymkdir + control_init), the mkdir and mkfifo calls that build the supervise tree are EEXIST-safe: if the directory or FIFO is already present, s6-supervise reuses it and skips the chown/chmod fix-up that would normally make event/ 03730 root:root. So if we lay the skeleton down with hermes ownership before triggering s6-svscanctl -a, s6-supervise inherits our layout and never touches it. The death_tally / lock / status regular files written later by s6-supervise (still as root) land mode 0644 — world-readable — which is all s6-svstat needs. New module-level helper _seed_supervise_skeleton(svc_dir) in hermes_cli/service_manager.py lays down: svc_dir/event/ hermes:hermes 03730 svc_dir/supervise/ hermes:hermes 0755 svc_dir/supervise/event/ hermes:hermes 03730 svc_dir/supervise/control hermes:hermes 0660 (FIFO) svc_dir/log/event/ hermes:hermes 03730 (if log/ present) svc_dir/log/supervise/ hermes:hermes 0755 svc_dir/log/supervise/event/ hermes:hermes 03730 svc_dir/log/supervise/control hermes:hermes 0660 (FIFO) The log/ branch matters because the logger is a second s6-supervise instance — without it, unregister rmtree races on the logger's root-owned supervise dir even after the parent slot's supervise/ is hermes-owned. The helper is idempotent and swallows PermissionError on chown so it works equally well when called from root (cont-init.d) or hermes (runtime register). Wiring ------ 1. S6ServiceManager.register_profile_gateway calls _seed_supervise_skeleton(tmp_dir) just before publishing the slot via Path.replace. Runtime-registered profile gateways are set up by hermes. 2. container_boot._register_service does the same in the cont-init.d reconciliation path so boot-time-restored profile slots inherit the same layout. 3. New cont-init.d/015-supervise-perms script chowns the supervise/ and event/ trees for STATIC s6-rc services (dashboard, main-hermes). These are spawned by s6-rc before cont-init.d gets to run, so the EEXIST-trick doesn't apply; we chown the already-existing tree instead. s6-supervise keeps using the same files; it never re-asserts ownership on a running service. The script skips s6-overlay internal services (s6rc-, s6-linux-) so the supervision tree itself stays root-only. 015- slot is intentional: lex-sorts between 01-hermes-setup and 02-reconcile-profiles in the container's C-locale, so the chown finishes before the reconciler walks the scandir. Unregister teardown reordering ------------------------------ S6ServiceManager.unregister_profile_gateway now fires s6-svscanctl -an BEFORE rmtree (with a 200ms grace), so s6-svscan reaps the supervise child and releases its file handles on supervise/lock + supervise/status before we try to remove the directory. Previously rmtree raced s6-supervise on a set of files inside the supervise dir, and even with the parent supervise/ now hermes-owned, the contained files (death_tally, lock, status, written by root) could still be in use. Dashboard down-state redesign ----------------------------- The original PR #30136 review fix wrote a 'down' marker file into /run/service/dashboard/ via cont-init.d/03-dashboard-toggle. That approach was broken in two ways: (a) /run/service/dashboard is a symlink to a TRANSIENT /run/s6-rc:s6-rc-init:<tmpdir>/ directory while s6-rc is mid-transaction; the touch landed in a soon-to-be-discarded tmp. (b) Even when written to the final /run/s6-rc/servicedirs/ location, the 'down' file is only consulted by s6-supervise at slot startup. s6-rc's user-bundle explicitly transitions 'dashboard' to 'up' on every boot, overriding any down marker. The right fix is the canonical s6 pattern: when HERMES_DASHBOARD is unset, the dashboard run script exits 0 and a companion finish script exits 125. Per s6-supervise(8), exit code 125 from the finish script is the 'permanent failure, do not restart' marker — equivalent to s6-svc -O. The slot reports as 'down' to s6-svstat, matching the reality that no dashboard process is running. When HERMES_DASHBOARD IS truthy, finish exits 0 and restart-on-crash semantics apply. 03-dashboard-toggle is removed (its function is now subsumed by the run/finish pair). Tests ----- Adds four unit tests for _seed_supervise_skeleton covering the produced layout, the log/ subservice case, the skip-when-no-log case, and idempotency. The live-container verification continues to live in tests/docker/test_s6_profile_gateway_integration.py and tests/docker/test_dashboard.py — both now pass against the rebuilt image. References ---------- * Skarnet skaware mailing list 2020-02-02 (Laurent Bercot + Guillermo Diaz Hartusch) on unprivileged s6 tool semantics: http://skarnet.org/lists/skaware/1424.html * just-containers/s6-overlay#130 — same EEXIST-preseed pattern, community-validated 2016 onward * https://skarnet.org/software/s6/servicedir.html — exit-code 125 semantics in finish scripts (cherry picked from commit `c41f908ad4`)	2026-05-25 12:23:23 +10:00
Ben	d0b1ab48dc	fix(container_boot): publish reconciled service dirs atomically PR #30136 review noted the asymmetry: `register_profile_gateway` used tmp_dir + rename to publish a new service slot atomically, but the boot-time reconciler wrote files into the slot directly. Same underlying concern (a concurrent s6-svscan rescan could observe a half-populated directory), different code path. Rewrite `container_boot._register_service` to mirror the manager: build everything in `<scandir>/gateway-<profile>.tmp/`, then `Path.replace` into place. If a previous interrupted run left a `.tmp` sibling, it's cleaned up before the new build starts. If the target already exists, it's removed before the rename so `Path.replace` doesn't error on a non-empty target (Linux `rename` overwrites empty targets only). Three new tests: atomic publication leaves no .tmp leftovers, overwriting an existing slot still leaves no .tmp leftovers, and a stale .tmp from an interrupted run is cleaned up automatically.	2026-05-24 18:05:33 -07:00
Ben	4443fb481d	fix(container_boot): rotate container-boot.log when it exceeds 256 KiB PR #30136 review noted: container-boot.log was append-only with no rotation. On a long-lived container with frequent restarts and many profiles it would grow unboundedly (~80 B per profile per reconcile pass). Add a soft cap: when the file size hits 256 KiB (`_LOG_ROTATE_BYTES`, ≈3000 reconcile lines, ≈1 year of daily reboots × 5 profiles), the current file is renamed to `container-boot.log.1` (replacing any existing one) before new entries are appended. Worst case is two files at ~512 KiB — well within visibility limits for grep/cat. Rotation is intentionally simple (no logrotate or s6-log machinery for one append-only file). Failures during rotation are logged via the module logger and treated as non-fatal — we keep appending to the existing file rather than dropping the reconcile entry. Three new unit tests cover above-threshold rotation, below-threshold non-rotation, and overwrite of an existing .1 file.	2026-05-24 18:05:33 -07:00
Ben	d735b083e8	fix(service_manager): rip out dead port parameter PR #30136 review caught: `_allocate_gateway_port()` in profiles.py computed a SHA-256-derived port that was threaded through `register_profile_gateway(profile, port=N)` → `_render_run_script(profile, port, extra_env)` → and then ignored. The rendered run script picked the bind port from the profile's config.yaml (`[gateway] port = …`), never from the allocator. So the entire allocator + parameter chain was dead code. Remove: * `hermes_cli.profiles._allocate_gateway_port` (deterministic SHA-256 → [9200, 9800) — never used). * `port` kwarg from `ServiceManager.register_profile_gateway` (Protocol + Mixin + S6 implementation). * `port` positional arg from `_render_run_script(profile, port, extra_env)` — now `_render_run_script(profile, extra_env)`. * The pass-through call in `profiles._maybe_register_gateway_service`. config.yaml is now the single source of truth for gateway port selection — matches reality and reduces the API surface. Three explanatory comments in service_manager.py / profiles.py document the retirement so future readers don't reach for the allocator and find a ghost. Tests: drop the three `_allocate_gateway_port` tests; update fakes' signatures throughout test_service_manager.py and test_profiles_s6_hooks.py to match the new no-port API.	2026-05-24 18:05:33 -07:00
Ben	b044c1ac29	fix(container_boot): always register gateway-default slot PR #30136 review caught: `hermes gateway start` (no `-p`) inside the container resolves `_profile_suffix() == ""` → service name `gateway-default`, but no such slot was ever registered. The Phase 4 profile-create hook only fired on `hermes profile create <name>`, and the root profile (which lives at the top of $HERMES_HOME, not under `profiles/`) was never one of those. So bare `hermes gateway start` landed on `s6-svc -u /run/service/gateway-default` → uncaught `CalledProcessError` → traceback to the user. Changes: 1. `reconcile_profile_gateways` now always registers a `gateway-default` slot before iterating named profiles. Its prior state is read from `$HERMES_HOME/gateway_state.json` (sibling to the profile root, not under `profiles/`); stale runtime files there are swept the same way. Auto-up only if the prior state was `running` — same rule as named profiles. 2. `S6ServiceManager._render_run_script` special-cases `profile == "default"` to emit `hermes gateway run` with NO `-p` flag. Passing `-p default` would resolve to `$HERMES_HOME/profiles/default/` — a different profile that almost certainly doesn't exist. The empty profile-suffix convention is the dispatcher's contract and the run script has to match. 3. A user-created `profiles/default/` collides with the reserved root-profile slot; the reconciler now skips it with a warning rather than producing two registrations of the same service name. Action-list ordering is stable: `default` first, then named profiles in directory order. Boot-log readers can rely on this. Tests: 8 new dedicated default-slot tests plus updates to every existing test that asserted against the action list (via the new `_named_actions` helper that drops the always-present default entry).	2026-05-24 18:05:33 -07:00
Ben	2afefc501c	feat(docker): per-profile s6 supervision + container-restart reconciliation Phase 4 of the s6-overlay supervision plan. Activates the Phase 3 S6ServiceManager by hooking it into the profile lifecycle and the `hermes gateway start/stop/restart` dispatcher, and adds a cont- init.d-time reconciliation pass that survives `docker restart`. Task 4.0 — container-boot reconciliation: /run/service/ is tmpfs, so every `docker restart` wipes every per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles invokes hermes_cli.container_boot.reconcile_profile_gateways() on every boot, which walks $HERMES_HOME/profiles/<name>/, reads each gateway_state.json, recreates the s6 service slot, and auto-starts only those whose last state was 'running'. Other states (stopped, starting, startup_failed, missing) register the slot in the down state — avoiding crash-loops across restarts for a gateway that was broken last boot. Per-profile outcome is recorded to $HERMES_HOME/logs/container-boot.log. Implementation: hermes_cli/container_boot.py + 12 unit tests. Profile-marker is SOUL.md, not config.yaml, because `hermes profile create` only seeds SOUL.md by default (config.yaml comes from `hermes setup`). Task 4.1 / 4.2 — profile create/delete hooks: hermes_cli/profiles.py::create_profile now calls _maybe_register_gateway_service(<canon>) at the end, which routes through ServiceManager.register_profile_gateway when running on s6 and no-ops on host backends. delete_profile mirrors with _maybe_unregister_gateway_service. _allocate_gateway_port produces a deterministic SHA-256-derived port in [9200, 9800). Task 4.3 — gateway dispatch + remove rejection arms: _dispatch_via_service_manager_if_s6(action) intercepts start/stop/restart at the top of each subcommand and routes them through S6ServiceManager.{start,stop,restart}. The pre-Phase-4 `elif is_container():` rejection arms are kept as fallback for pre-s6 containers / unsupported runtimes, but only ever fire when detect_service_manager() != 's6'. install/uninstall under s6 print informational guidance pointing users at profile create/delete. Removed the two xfail(strict=True) markers from tests/docker/test_profile_gateway.py — both tests now pass strictly. Task 4.4 — status reporting: get_gateway_runtime_snapshot() reports Manager: 's6 (container supervisor)' inside an s6 container instead of 'docker (foreground)'. Plan-vs-reality drift fixed in this commit: - Plan's S6ServiceManager._render_run_script used `gateway start --foreground --port {port}` — invented args; the real CLI is `gateway run`. Switched accordingly. port arg retained for API parity but now documented as 'currently ignored'. - Plan's reconciler keyed on config.yaml; switched to SOUL.md (config.yaml is created by hermes setup, not by hermes profile create, so the original gate caught nothing). - The plan's _dispatch helper used _profile_arg() which returns '--profile <name>' (i.e. with the flag prefix). Switched to _profile_suffix() which returns the bare name. - Architecture B's docker exec doesn't get /command on PATH or the venv on PATH; Dockerfile's runtime PATH now includes /opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works without sourcing the venv. - stage2-hook now chowns $HERMES_HOME/profiles to hermes on every boot, not just on the UID-remap path. Without this, files created by docker-exec-as-root accumulate and the next reconciler run fails with PermissionError reading SOUL.md. Test harness: 19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to passing). 78 unit tests across service_manager + container_boot + profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck pass cleanly. Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-24 18:05:33 -07:00

6 commits