mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-29 06:31:32 +00:00
12 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c341a2d107
|
fix(docker): align HOME for dashboard and s6 gateway services (#33481) | ||
|
|
b345323195 |
fix(docker): tee supervised gateway stdout to docker logs
Follow-up to #33583 (the gateway-run-supervised redirect). Before this fix, the supervised gateway's stdout (most visibly the "Hermes Gateway Starting…" rich-console banner) was swallowed by `s6-log` into the rotated file at `${HERMES_HOME}/logs/gateways/<profile>/current` and never reached `docker logs`. Operational signal lived in two places:
* **docker logs**: saw stderr (Python `logging` defaults to stderr), so warnings/errors were visible.
* **the rotated file**: saw stdout (rich banners, `print()` output, third-party libs that wrote to fd 1).
This was surprising for users coming from the pre-s6 image, where `docker run … gateway run` produced a single unified stream in `docker logs`. They'd see partial output, conclude something was broken, and dig around for the missing pieces.
Fix: add the `1` s6-log action directive before the file destination so each line is forwarded to s6-log's stdout, which propagates up the s6-supervise pipeline to /init's stdout = container stdout = `docker logs`. The file destination is preserved as a second destination, so the rotated log (with ISO 8601 timestamps) still exists for `hermes logs` and for survival across container restarts.
Trade-off considered: timestamps. Putting `T` between `1` and the file destination (not before `1`) means:
* docker logs sees raw lines: Python's logging formatter has its own timestamps, and `docker logs --timestamps` adds another layer when desired. No double-stamping in the common reading path.
* The persisted file gets s6-log's ISO 8601 timestamp so even output that lacked a Python-logger timestamp (rich banners, third-party raw prints) is correlatable in `current`.
Verification:
* New unit-test assertion in `test_service_manager.py` locks the `s6-log 1` directive into the rendered run-script. Mutation-tested by reverting to the pre-fix script (no `1`); the assert catches it cleanly.
* New docker-harness test `test_supervised_gateway_stdout_reaches_docker_logs` builds the image, runs `docker run … gateway run`, and asserts the unique `⚕` banner glyph reaches `docker logs`. Also verifies the rotated file still contains the banner (no regression on the existing file destination). Mutation-tested end-to-end: built a deliberately-broken image without the `1` directive and the test failed exactly as designed, citing the banner present in `current` but absent from `docker logs`.
* `website/docs/user-guide/docker.md` gains a new `:::note Where gateway logs go` admonition documenting both destinations and the audit-log file at `${HERMES_HOME}/logs/container-boot.log`.
Existing functionality preserved: every other docker-harness test still passes against the new image. Unit-test sweep across `tests/hermes_cli/` (5561 tests) is green.
|
||
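The directive ordering described in this commit can be sketched as a small renderer. This is only an illustration: `render_log_run_script` is a hypothetical name, and the real rendered script likely carries more than is shown here (privilege drop, rotation settings). The only behaviour taken from the commit message is that `1` precedes `T`, which precedes the file destination.

```python
def render_log_run_script(log_dir: str) -> str:
    """Render a hypothetical s6 `log/run` script (illustrative only).

    `log_dir` is emitted verbatim, so a literal `${HERMES_HOME}` inside it
    is expanded by the shell at run time, not by Python.
    """
    # Order matters: `1` forwards the raw line to stdout first, then `T`
    # enables ISO 8601 timestamps for the directory destination after it.
    return "#!/command/with-contenv sh\n" + f'exec s6-log 1 T "{log_dir}"\n'


script = render_log_run_script("${HERMES_HOME}/logs/gateways/default")
# Everything after `exec s6-log` is the s6-log logging script.
directives = script.splitlines()[1].split()[2:]
print(directives)
```

Swapping the order to `T 1 <dir>` would stamp the stdout copy as well, which is the double-stamping the commit chose to avoid.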
|
|
0927fb5584 |
feat(docker): auto-redirect gateway run to supervised mode inside s6 image
Pre-s6, `docker run nousresearch/hermes-agent gateway run` was the
standard invocation: gateway ran as the container's main process,
tini reaped zombies, container exit code matched gateway exit code,
no supervision. With s6-overlay as PID 1, the same invocation now
auto-upgrades to supervised semantics — auto-restart on crash,
dashboard supervised alongside (when HERMES_DASHBOARD=1 is set),
multiple profile gateways under the same /init.
Users get the new behavior with zero changes to their docker run
command. A loud one-line breadcrumb on stderr explains the upgrade
and points at the opt-out for users who genuinely want pre-s6
foreground semantics.
How it works:
1. `_gateway_command_inner` (the `gateway run` handler) checks if
we're inside a container with s6 as PID 1.
2. If yes, dispatches `start` to the s6 service manager (registers
and starts gateway-default), then `exec sleep infinity` to keep
the CMD process alive without binding container lifetime to
gateway PID lifetime. The supervised gateway can flap freely;
`docker stop` still tears everything down via /init stage 3.
3. If no, falls through to the existing foreground code path
unchanged. Host runs of `hermes gateway run` are unaffected.
Three gates make the redirect inert outside the intended scope:
* `detect_service_manager() != "s6"` — host/non-s6-container runs.
* `HERMES_S6_SUPERVISED_CHILD=1` env var (recursion guard) —
exported by `S6ServiceManager._render_run_script` for the
s6-supervised invocation itself. Without this guard, the
supervised `gateway run --replace` would re-enter the redirect
and recurse (run → start → run → start → ...) infinitely.
* `--no-supervise` CLI flag OR `HERMES_GATEWAY_NO_SUPERVISE=1` env
var — explicit user opt-out for CI smoke tests, debugging the
foreground startup path, or any case wanting "CMD exit =
container exit" semantics. Strict truthiness (1/true/yes,
case-insensitive); typos like `=0` do NOT silently opt out.
Tests:
* Unit tests in tests/hermes_cli/test_gateway_s6_dispatch.py
cover all five paths (host no-op, supervised fire, sentinel
recursion guard, CLI flag, env var truthy + falsy). The two
load-bearing gates (sentinel + opt-out) were mutation-tested
by removing each gate in isolation and confirming the dedicated
test fails with the expected error.
* Docker harness tests in tests/docker/test_gateway_run_supervised.py
cover the round trips end-to-end against a built image: redirect
fires (sleep-infinity heartbeat + supervised gateway-default
slot + breadcrumb), --no-supervise opt-out (foreground gateway,
no want-up on the slot), HERMES_GATEWAY_NO_SUPERVISE env var
works identically, recursion is impossible (≤1 supervised
python gateway-run + exactly 1 sleep-infinity parented to the
CMD wrapper), and HERMES_DASHBOARD=1 produces both supervised
gateway and supervised dashboard.
Docs:
* Added a `:::tip Gateway runs supervised` admonition near the
main docker.md example explaining the upgrade and pointing at
the opt-out. Pre-s6 (tini-based) images still run gateway run
as the foreground main process, so the note is scoped to the
s6 image only.
Trade-off documented in the helper docstring: container exit code
under the redirect is sleep's exit code (always 0 on SIGTERM), not
the gateway's. That was an explicit design call — the supervised
gateway is allowed to flap without taking the container with it,
which is what "supervision" means. CI users who want exit-code
forwarding can pass --no-supervise.
|
||
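The three gates can be read as a single predicate. The sketch below is an assumption-laden illustration, not the project's code: `should_redirect_to_supervised` and `is_truthy` are hypothetical names, and applying the same strict truthiness to the recursion-guard variable is a guess (the commit only states it for the opt-out variable).

```python
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes"}


def is_truthy(value: Optional[str]) -> bool:
    """Strict truthiness: only 1/true/yes (case-insensitive) count."""
    return value is not None and value.strip().lower() in _TRUTHY


def should_redirect_to_supervised(
    service_manager: str,
    env: Mapping[str, str],
    no_supervise_flag: bool = False,
) -> bool:
    """Return True when `gateway run` should hand off to the s6 supervisor."""
    # Gate 1: host runs and non-s6 containers keep the foreground path.
    if service_manager != "s6":
        return False
    # Gate 2: recursion guard set for the supervised invocation itself
    # (assumed here to use the same strict truthiness as the opt-out).
    if is_truthy(env.get("HERMES_S6_SUPERVISED_CHILD")):
        return False
    # Gate 3: explicit opt-out via CLI flag or environment variable.
    if no_supervise_flag or is_truthy(env.get("HERMES_GATEWAY_NO_SUPERVISE")):
        return False
    return True


# A typo such as `=0` does not opt out; `YES` does.
print(should_redirect_to_supervised("s6", {"HERMES_GATEWAY_NO_SUPERVISE": "0"}))
print(should_redirect_to_supervised("s6", {"HERMES_GATEWAY_NO_SUPERVISE": "YES"}))
```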
|
|
4f416fc40c |
fix(docker): make s6 lifecycle work for the unprivileged hermes user
Resolves the explicit "Known follow-up" left by commit |
||
|
|
5cbb132c1d
|
fix(ci): exclude tests/docker/ from regular test shards; pin read_text encoding
Two CI follow-ups to @benbarclay's #30136 salvage:
1. scripts/run_tests_parallel.py: add 'docker' to _SKIP_PARTS so the new tests/docker/ harness doesn't run in the regular test (N) matrix. The harness builds the real Dockerfile in a session fixture, which can exceed pytest-timeout's 180s ceiling on ubuntu-latest where Docker IS available; it surfaced as 6 identical setup-timeout failures across slices 1–6 on the first CI run. The docker harness has its own dedicated runner via .github/actions/hermes-smoke-test (added in #30136) plus the docker-lint workflow. Same treatment as tests/integration/ and tests/e2e/: runs separately, not in the main shards.
2. hermes_cli/service_manager.py: pin encoding='utf-8' on the /proc/1/comm read_text call. Ruff PLW1514 enforcement rolled in between Ben's last push and the salvage; pure ruff-fix, no behavior change.
|
||
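A path-component filter of the kind this commit describes might look like the sketch below. Only the name `_SKIP_PARTS` and the three excluded directories come from the commit message; the set type and the helper `runs_in_main_shards` are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Directory names whose tests run on dedicated runners, not the main shards.
_SKIP_PARTS = {"integration", "e2e", "docker"}


def runs_in_main_shards(test_path: str) -> bool:
    """True when no path component of `test_path` is in _SKIP_PARTS."""
    # Matching whole components means a file merely named
    # `test_docker_foo.py` is still collected.
    return _SKIP_PARTS.isdisjoint(PurePosixPath(test_path).parts)


print(runs_in_main_shards("tests/docker/test_dashboard.py"))
print(runs_in_main_shards("tests/hermes_cli/test_service_manager.py"))
```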
|
|
d735b083e8
|
fix(service_manager): rip out dead port parameter
PR #30136 review caught: `_allocate_gateway_port()` in profiles.py computed a SHA-256-derived port that was threaded through `register_profile_gateway(profile, port=N)` → `_render_run_script(profile, port, extra_env)` → and then **ignored**. The rendered run script picked the bind port from the profile's config.yaml (`[gateway] port = …`), never from the allocator. So the entire allocator + parameter chain was dead code.
Remove:
* `hermes_cli.profiles._allocate_gateway_port` (deterministic SHA-256 → [9200, 9800), never used).
* `port` kwarg from `ServiceManager.register_profile_gateway` (Protocol + Mixin + S6 implementation).
* `port` positional arg from `_render_run_script(profile, port, extra_env)`; now `_render_run_script(profile, extra_env)`.
* The pass-through call in `profiles._maybe_register_gateway_service`.
config.yaml is now the single source of truth for gateway port selection, which matches reality and reduces the API surface. Three explanatory comments in service_manager.py / profiles.py document the retirement so future readers don't reach for the allocator and find a ghost.
Tests: drop the three `_allocate_gateway_port` tests; update fakes' signatures throughout test_service_manager.py and test_profiles_s6_hooks.py to match the new no-port API.
|
||
|
|
b28b3f51d3
|
fix(service_manager): friendly errors for missing slots and s6-svc failures
PR #30136 review caught: `S6ServiceManager.start/stop/restart` called `subprocess.run(check=True)` on `s6-svc`, so any failure surfaced as a raw `CalledProcessError` traceback. The two cases operators actually hit are:
1. The service slot doesn't exist, most commonly because the user typed a profile name wrong (`hermes -p typo gateway start`).
2. s6-svc itself fails, most commonly EACCES on the supervise control FIFO when running unprivileged.
Both deserve named errors with actionable messages, not stacktraces.
Changes:
* Add `S6Error` base + two concrete errors in `hermes_cli.service_manager`:
  - `GatewayNotRegisteredError(profile)`: carries the unprefixed profile name; message: `no such gateway 'typo': register it with `hermes profile create typo` first, or pass an existing profile name via `-p <name>``.
  - `S6CommandError(service, action, returncode, stderr)`: carries the s6-svc rc and stderr; message: `s6-svc start on 'gateway-coder' failed (rc=111): <stderr>`.
* Factor lifecycle dispatch through `_run_svc(flag, label, name)`: pre-checks that the service directory exists (raises GatewayNotRegisteredError before invoking s6-svc), then runs s6-svc and translates any CalledProcessError into S6CommandError.
* `_dispatch_via_service_manager_if_s6` in `hermes_cli.gateway` catches both errors and prints `✗ <message>` + `sys.exit(1)` instead of letting the exception bubble. The dispatch path that used to dump a traceback at the user now gives an actionable one-liner.
Tests: 6 new tests for the error types and their CLI rendering; existing lifecycle test pre-seeds the slot directory before calling `mgr.start` etc.
|
||
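The error hierarchy and the pre-check-then-translate flow can be sketched as follows. Class names and message formats are taken from the commit message; `run_svc`, its `service_root` and `svc_bin` parameters, and the `/command/s6-svc` default are assumptions made so the sketch is self-contained.

```python
import subprocess
from pathlib import Path


class S6Error(Exception):
    """Base class for s6 service-manager failures."""


class GatewayNotRegisteredError(S6Error):
    def __init__(self, profile: str) -> None:
        self.profile = profile
        super().__init__(
            f"no such gateway '{profile}': register it with "
            f"`hermes profile create {profile}` first, or pass an existing "
            "profile name via `-p <name>`"
        )


class S6CommandError(S6Error):
    def __init__(self, service: str, action: str, returncode: int, stderr: str) -> None:
        self.service, self.action = service, action
        self.returncode, self.stderr = returncode, stderr
        super().__init__(f"s6-svc {action} on '{service}' failed (rc={returncode}): {stderr}")


def run_svc(flag: str, label: str, name: str,
            service_root: Path = Path("/run/service"),
            svc_bin: str = "/command/s6-svc") -> None:
    """Run s6-svc on a slot, raising named errors instead of tracebacks."""
    service_dir = service_root / name
    # Pre-check: a missing slot is reported before s6-svc is ever invoked.
    if not service_dir.is_dir():
        raise GatewayNotRegisteredError(name.removeprefix("gateway-"))
    try:
        subprocess.run([svc_bin, flag, str(service_dir)],
                       check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        raise S6CommandError(name, label, exc.returncode, (exc.stderr or "").strip()) from exc


# CLI-style rendering: one actionable line instead of a traceback.
try:
    run_svc("-u", "start", "gateway-typo", service_root=Path("/nonexistent-sketch-root"))
    message = ""
except S6Error as exc:
    message = f"✗ {exc}"
print(message)
```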
|
|
b044c1ac29
|
fix(container_boot): always register gateway-default slot
PR #30136 review caught: `hermes gateway start` (no `-p`) inside the container resolves `_profile_suffix() == ""` → service name `gateway-default`, but no such slot was ever registered. The Phase 4 profile-create hook only fired on `hermes profile create <name>`, and the root profile (which lives at the top of $HERMES_HOME, not under `profiles/`) was never one of those. So bare `hermes gateway start` landed on `s6-svc -u /run/service/gateway-default` → uncaught `CalledProcessError` → traceback to the user.
Changes:
1. `reconcile_profile_gateways` now always registers a `gateway-default` slot before iterating named profiles. Its prior state is read from `$HERMES_HOME/gateway_state.json` (sibling to the profile root, not under `profiles/`); stale runtime files there are swept the same way. Auto-up only if the prior state was `running`; same rule as named profiles.
2. `S6ServiceManager._render_run_script` special-cases `profile == "default"` to emit `hermes gateway run` with NO `-p` flag. Passing `-p default` would resolve to `$HERMES_HOME/profiles/default/`, a different profile that almost certainly doesn't exist. The empty profile-suffix convention is the dispatcher's contract and the run script has to match.
3. A user-created `profiles/default/` collides with the reserved root-profile slot; the reconciler now skips it with a warning rather than producing two registrations of the same service name.
Action-list ordering is stable: `default` first, then named profiles in directory order. Boot-log readers can rely on this.
Tests: 8 new dedicated default-slot tests plus updates to every existing test that asserted against the action list (via the new `_named_actions` helper that drops the always-present default entry).
|
||
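The default-slot special case and the stable ordering can be sketched in a few lines. Both helper names are hypothetical, and the real run script may pass further arguments (another commit in this log mentions `gateway run --replace`); only the "no `-p` for `default`" rule and the "default first" ordering come from the commit message.

```python
from typing import Iterable, List


def gateway_run_argv(profile: str) -> List[str]:
    """Command line for a slot; the root profile gets no `-p` flag."""
    if profile == "default":
        # `-p default` would point at $HERMES_HOME/profiles/default/ instead.
        return ["hermes", "gateway", "run"]
    return ["hermes", "-p", profile, "gateway", "run"]


def reconcile_order(profile_dirs: Iterable[str]) -> List[str]:
    """`default` first, then named profiles in the order given."""
    # A user-created `profiles/default/` collides with the reserved slot
    # and is skipped (the real reconciler also logs a warning).
    return ["default"] + [name for name in profile_dirs if name != "default"]


print(gateway_run_argv("default"))
print(gateway_run_argv("coder"))
print(reconcile_order(["coder", "default", "writer"]))
```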
|
|
fc39296e1f
|
fix(service_manager): s6 detection works for unprivileged hermes user
PR #30136 review surfaced two issues, both rooted in the same audit gap: docker integration tests were running as root, not the unprivileged `hermes` user (UID 10000) that the runtime actually uses via `s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to the s6 control surface worked as root in the tests but was inert in production.
Fixes:
1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`, which is root-only readable. For UID 10000 the symlink yields PermissionError, `resolve()` silently returns the unresolved path, and `exe.name == "exe"`, so detection always returned False, the service-manager runtime-registration path was inert, and every `hermes profile create` / `hermes -p X gateway start` silently skipped the s6 hook. Replace with `/proc/1/comm` (world-readable) + `/run/s6/basedir` (s6-overlay-specific); both required, fail closed.
2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/` {control,lock} to hermes so `s6-svscanctl -a/-an` works without root. Previously the directory chown stopped at `/run/service` and the FIFO inside stayed root-owned, so `register_profile_gateway` from hermes failed at the rescan-trigger step with EACCES; the wrapper in profiles.py caught the exception and printed a swallowed warning, so profile creation appeared to succeed while the slot was rolled back.
Audit changes to flush this class of bug next time:
- Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py` that default to `-u hermes`. The module docstring explains why and flags `user="root"` as opt-in only for tests that explicitly need root (none currently do).
- Refactor every `docker exec` call in tests/docker/ through the new helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py, test_container_restart.py, test_s6_profile_gateway_integration.py).
- Add 5 unit tests covering `_s6_running` under various probe states (both signals present; comm wrong; basedir missing; PermissionError on /proc/1/comm; missing /proc, i.e. non-Linux). The PermissionError test is the explicit regression guard for the original bug.
Known follow-up: the per-service `supervise/control` FIFO inside each `/run/service/gateway-<profile>/supervise/` is created root-owned by s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc -u/-d/-t` from the hermes user will get EACCES on those. The audit under `-u hermes` will reveal this in lifecycle tests, surfacing the issue cleanly so it can be fixed in a focused follow-up (likely via a small SUID helper or a polling chown loop in cont-init.d). The detection + svscanctl fixes here are independent and complete on their own.
|
||
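A fail-closed probe along the lines of fix 1 might look like this. The two paths come from the commit message; the function name `s6_running`, its injectable parameters, and the expected `comm` value of `s6-svscan` (inferred from the statement that s6-svscan is PID 1) are assumptions.

```python
from pathlib import Path


def s6_running(comm_path: Path = Path("/proc/1/comm"),
               basedir: Path = Path("/run/s6/basedir")) -> bool:
    """Both signals must be present; any read failure means 'not s6'."""
    try:
        comm = comm_path.read_text(encoding="utf-8").strip()
    except OSError:
        # Covers PermissionError and a missing /proc (non-Linux hosts).
        return False
    # Assumed PID 1 name under s6-overlay; adjust if the image differs.
    return comm == "s6-svscan" and basedir.exists()


# With neither signal present the probe fails closed.
print(s6_running(Path("/nonexistent/comm"), Path("/nonexistent/basedir")))
```

Taking the paths as parameters keeps the probe testable without a real s6 container, which is one way the five probe-state unit tests mentioned above could be driven.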
|
|
2afefc501c
|
feat(docker): per-profile s6 supervision + container-restart reconciliation
Phase 4 of the s6-overlay supervision plan. Activates the Phase 3
S6ServiceManager by hooking it into the profile lifecycle and the
`hermes gateway start/stop/restart` dispatcher, and adds a cont-
init.d-time reconciliation pass that survives `docker restart`.
Task 4.0 — container-boot reconciliation:
/run/service/ is tmpfs, so every `docker restart` wipes every
per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles
invokes hermes_cli.container_boot.reconcile_profile_gateways() on
every boot, which walks $HERMES_HOME/profiles/<name>/, reads each
gateway_state.json, recreates the s6 service slot, and auto-starts
only those whose last state was 'running'. Other states
(stopped, starting, startup_failed, missing) register the slot
in the down state — avoiding crash-loops across restarts for a
gateway that was broken last boot. Per-profile outcome is recorded
to $HERMES_HOME/logs/container-boot.log.
Implementation: hermes_cli/container_boot.py + 12 unit tests.
Profile-marker is SOUL.md, not config.yaml, because `hermes profile
create` only seeds SOUL.md by default (config.yaml comes from
`hermes setup`).
Task 4.1 / 4.2 — profile create/delete hooks:
hermes_cli/profiles.py::create_profile now calls
_maybe_register_gateway_service(<canon>) at the end, which routes
through ServiceManager.register_profile_gateway when running on s6
and no-ops on host backends. delete_profile mirrors with
_maybe_unregister_gateway_service. _allocate_gateway_port produces
a deterministic SHA-256-derived port in [9200, 9800).
Task 4.3 — gateway dispatch + remove rejection arms:
_dispatch_via_service_manager_if_s6(action) intercepts
start/stop/restart at the top of each subcommand and routes them
through S6ServiceManager.{start,stop,restart}. The pre-Phase-4
`elif is_container():` rejection arms are kept as fallback for
pre-s6 containers / unsupported runtimes, but only ever fire when
detect_service_manager() != 's6'. install/uninstall under s6
print informational guidance pointing users at profile create/delete.
Removed the two xfail(strict=True) markers from
tests/docker/test_profile_gateway.py — both tests now pass strictly.
Task 4.4 — status reporting:
get_gateway_runtime_snapshot() reports
Manager: 's6 (container supervisor)' inside an s6 container instead
of 'docker (foreground)'.
Plan-vs-reality drift fixed in this commit:
- Plan's S6ServiceManager._render_run_script used
`gateway start --foreground --port {port}` — invented args; the
real CLI is `gateway run`. Switched accordingly. port arg
retained for API parity but now documented as 'currently ignored'.
- Plan's reconciler keyed on config.yaml; switched to SOUL.md
(config.yaml is created by hermes setup, not by hermes profile
create, so the original gate caught nothing).
- The plan's _dispatch helper used _profile_arg() which returns
'--profile <name>' (i.e. with the flag prefix). Switched to
_profile_suffix() which returns the bare name.
- Architecture B's docker exec doesn't get /command on PATH or
the venv on PATH; Dockerfile's runtime PATH now includes
/opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works
without sourcing the venv.
- stage2-hook now chowns $HERMES_HOME/profiles to hermes on every
boot, not just on the UID-remap path. Without this, files created
by docker-exec-as-root accumulate and the next reconciler run
fails with PermissionError reading SOUL.md.
Test harness:
19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to
passing). 78 unit tests across service_manager + container_boot +
profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck
pass cleanly.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
|
||
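The Task 4.0 rule (auto-start only when the last recorded state was 'running') reduces to a small decision function. This is a sketch under assumptions: the helper names are hypothetical, and the layout of `gateway_state.json` is not given in the commit, so the top-level `"state"` key below is a guess.

```python
import json
from pathlib import Path
from typing import Optional


def read_last_state(profile_dir: Path) -> Optional[str]:
    """Return the recorded gateway state, or None if missing/unreadable."""
    try:
        data = json.loads((profile_dir / "gateway_state.json").read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    # Assumed schema: {"state": "running" | "stopped" | ...}
    state = data.get("state") if isinstance(data, dict) else None
    return state if isinstance(state, str) else None


def slot_action(last_state: Optional[str]) -> str:
    """'up' only for a gateway that was running; everything else stays down."""
    return "up" if last_state == "running" else "down"


for state in ("running", "stopped", "starting", "startup_failed", None):
    print(state, "->", slot_action(state))
```

Registering every non-running slot in the down state is what prevents a gateway that failed on the previous boot from crash-looping after `docker restart`.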
|
|
0abf661f71
|
feat(service_manager): add S6ServiceManager for runtime gateway supervision
Phase 3 of the s6-overlay supervision plan. Implements the runtime-
registration surface from D4 — only the s6 backend supports
register_profile_gateway / unregister_profile_gateway /
list_profile_gateways; host backends continue to raise
NotImplementedError. No caller yet (Phase 4 wires in the profile
create/delete hooks).
Key implementation notes:
- Service directory shape: /run/service/gateway-<profile>/{type,run,log/run}.
Atomic register: write to gateway-<profile>.tmp, fsync via
os.rename. Cleanup on rescan failure.
- Run script uses #!/command/with-contenv sh so HERMES_HOME and any
extra_env arrive at exec time. The hermes -p <profile> gateway
start --foreground --port <port> command is wrapped in
s6-setuidgid hermes for the per-service privilege drop (OQ2-A).
- Log script (OQ8-C): persists via s6-log to
${HERMES_HOME}/logs/gateways/<profile>/. CRITICAL — HERMES_HOME is
a runtime env-var expansion in the rendered script, NOT a Python
f-string substitution. Negative-asserted in
test_s6_register_creates_service_dir_and_triggers_scan so
regressions are caught.
- PATH gotcha: /command/ is only on PATH for processes spawned by
the supervision tree (services, cont-init.d). `docker exec` and
profile-create hooks don't get it. S6ServiceManager calls all
s6-* binaries via absolute path through the new _S6_BIN_DIR
constant so callers don't have to fix up env vars.
- validate_profile_name rejects path-traversal, leading-dash (s6
would parse as a flag), uppercase, whitespace, and names >251
chars (s6-svscan default name_max).
Test coverage:
- 13 new unit tests in tests/hermes_cli/test_service_manager.py
(kind detection, run-script content, env quoting, register
rollback on rescan failure, unregister idempotence, list filter,
lifecycle dispatch, svstat parsing). Total: 36 passing.
- 2 new in-container integration tests in
tests/docker/test_s6_profile_gateway_integration.py validating
end-to-end registration against a real s6 supervision tree.
Docker harness: 14 passed, 2 xfailed (Phase 4 target unchanged).
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
|
||
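The name checks listed for `validate_profile_name` can be approximated as below. The rejection categories and the 251-character limit come from the commit message; the function name `check_profile_name`, the exact traversal test, and the error wording are illustrative assumptions rather than the project's implementation.

```python
_MAX_NAME_LEN = 251  # limit cited in the commit message


def check_profile_name(name: str) -> str:
    """Return `name` unchanged if acceptable, else raise ValueError."""
    if not name:
        raise ValueError("profile name is empty")
    if len(name) > _MAX_NAME_LEN:
        raise ValueError("profile name is longer than 251 characters")
    if "/" in name or "\\" in name or name in (".", ".."):
        raise ValueError("profile name must not contain path components")
    if name.startswith("-"):
        # A leading dash would be parsed as a flag by the s6 tools.
        raise ValueError("profile name must not start with '-'")
    if any(ch.isspace() for ch in name):
        raise ValueError("profile name must not contain whitespace")
    if name != name.lower():
        raise ValueError("profile name must be lowercase")
    return name


def is_valid(name: str) -> bool:
    try:
        check_profile_name(name)
    except ValueError:
        return False
    return True


for candidate in ("coder", "../etc", "-x", "Coder", "a b"):
    print(repr(candidate), is_valid(candidate))
```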
|
|
51914b0514
|
feat(service_manager): add ServiceManager protocol + host wrappers
Phase 1 of the s6-overlay supervision plan. Pure-refactor addition: introduces the abstract interface (with runtime_checkable Protocol), detect_service_manager(), validate_profile_name(), and thin SystemdServiceManager / LaunchdServiceManager / WindowsServiceManager wrappers around the existing systemd_* / launchd_* / gateway_windows.* module-level functions.
No host call site was modified: host code continues to use the existing functions directly; the protocol is for new backend-agnostic code (Phase 4 profile create/delete hooks and the Phase 4 s6 dispatch path in 'hermes gateway start/stop/restart').
WindowsServiceManager.install() forwards the v3 kwargs (start_now, start_on_login, elevated_handoff) added in PRs #28169-adjacent so non-Windows callers (there aren't any today) can opt in.
The s6 backend lands in Phase 3; until then get_service_manager() raises a clear error if invoked on a host that detects as 's6'.
|
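For readers unfamiliar with `runtime_checkable` protocols, the pattern looks roughly like this. The method set and signatures below are assumptions chosen for illustration (the commit does not list them), and `RecordingManager` is a hypothetical stand-in for a thin backend wrapper.

```python
from typing import List, Protocol, Tuple, runtime_checkable


@runtime_checkable
class ServiceManager(Protocol):
    """Structural interface; backends need not inherit from it."""

    def start(self, name: str) -> None: ...
    def stop(self, name: str) -> None: ...
    def restart(self, name: str) -> None: ...


class RecordingManager:
    """Hypothetical backend that only records what it was asked to do."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def start(self, name: str) -> None:
        self.calls.append(("start", name))

    def stop(self, name: str) -> None:
        self.calls.append(("stop", name))

    def restart(self, name: str) -> None:
        self.calls.append(("restart", name))


mgr = RecordingManager()
# isinstance() on a runtime_checkable Protocol checks method presence only,
# not signatures or return types.
print(isinstance(mgr, ServiceManager))
print(isinstance(object(), ServiceManager))
```

Because the check is structural, existing module-level functions can be wrapped without touching host call sites, which matches the pure-refactor framing of this commit.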