Commit graph

2213 commits

Author SHA1 Message Date
Ben
4f416fc40c fix(docker): make s6 lifecycle work for the unprivileged hermes user
Resolves the explicit "Known follow-up" left by commit 2f8ceeab9 and
the resulting CI failures in tests/docker/test_dashboard.py and
tests/docker/test_s6_profile_gateway_integration.py.

The product gap
---------------
Every hermes runtime operation inside the container runs as the
hermes user (UID 10000) via s6-setuidgid. But s6-supervise — spawned
by s6-svscan running as PID 1 — creates each service's supervise/
and top-level event/ directories with mode 0700 owned by its
effective UID (root). That left every s6-svc / s6-svstat / s6-svwait
call from hermes hitting EACCES on the supervise/control FIFO and
supervise/status — i.e. the entire S6ServiceManager lifecycle
(register, start, stop, unregister) was inert in production.

The 2f8ceeab9 commit message called this out and deferred the fix.
The audit changes that landed alongside it (defaulting docker_exec
to -u hermes) made the integration tests reproduce the bug
deterministically; the fix below resolves it.

The fix: pre-create the supervise/ skeleton hermes-owned
----------------------------------------------------------
Reading s6's source (src/supervision/s6-supervise.c::trymkdir +
control_init), the mkdir and mkfifo calls that build the supervise
tree are EEXIST-safe: if the directory or FIFO is already present,
s6-supervise reuses it and skips the chown/chmod fix-up that would
normally make event/ 03730 root:root. So if we lay the skeleton
down with hermes ownership before triggering s6-svscanctl -a,
s6-supervise inherits our layout and never touches it. The
death_tally / lock / status regular files written later by
s6-supervise (still as root) land mode 0644 — world-readable —
which is all s6-svstat needs.

New module-level helper _seed_supervise_skeleton(svc_dir) in
hermes_cli/service_manager.py lays down:
  svc_dir/event/                       hermes:hermes 03730
  svc_dir/supervise/                   hermes:hermes 0755
  svc_dir/supervise/event/             hermes:hermes 03730
  svc_dir/supervise/control            hermes:hermes 0660 (FIFO)
  svc_dir/log/event/                   hermes:hermes 03730  (if log/ present)
  svc_dir/log/supervise/               hermes:hermes 0755
  svc_dir/log/supervise/event/         hermes:hermes 03730
  svc_dir/log/supervise/control        hermes:hermes 0660 (FIFO)

The log/ branch matters because the logger is a second
s6-supervise instance — without it, unregister rmtree races on
the logger's root-owned supervise dir even after the parent
slot's supervise/ is hermes-owned. The helper is idempotent and
swallows PermissionError on chown so it works equally well when
called from root (cont-init.d) or hermes (runtime register).

Wiring
------
1. S6ServiceManager.register_profile_gateway calls
   _seed_supervise_skeleton(tmp_dir) just before publishing the
   slot via Path.replace. Runtime-registered profile gateways are
   set up by hermes.

2. container_boot._register_service does the same in the cont-init.d
   reconciliation path so boot-time-restored profile slots inherit
   the same layout.

3. New cont-init.d/015-supervise-perms script chowns the supervise/
   and event/ trees for STATIC s6-rc services (dashboard,
   main-hermes). These are spawned by s6-rc before cont-init.d
   gets to run, so the EEXIST-trick doesn't apply; we chown the
   already-existing tree instead. s6-supervise keeps using the
   same files; it never re-asserts ownership on a running service.
   The script skips s6-overlay internal services (s6rc-*,
   s6-linux-*) so the supervision tree itself stays root-only.
   015- slot is intentional: lex-sorts between 01-hermes-setup
   and 02-reconcile-profiles in the container's C-locale, so
   the chown finishes before the reconciler walks the scandir.

Unregister teardown reordering
------------------------------
S6ServiceManager.unregister_profile_gateway now fires
s6-svscanctl -an BEFORE rmtree (with a 200ms grace), so
s6-svscan reaps the supervise child and releases its file
handles on supervise/lock + supervise/status before we try to
remove the directory. Previously rmtree raced s6-supervise on a
set of files inside the supervise dir, and even with the parent
supervise/ now hermes-owned, the contained files (death_tally,
lock, status, written by root) could still be in use.

Dashboard down-state redesign
-----------------------------
The original PR #30136 review fix wrote a 'down' marker file
into /run/service/dashboard/ via cont-init.d/03-dashboard-toggle.
That approach was broken in two ways:

  (a) /run/service/dashboard is a symlink to a TRANSIENT
      /run/s6-rc:s6-rc-init:<tmpdir>/ directory while s6-rc is
      mid-transaction; the touch landed in a soon-to-be-discarded
      tmp.

  (b) Even when written to the final /run/s6-rc/servicedirs/
      location, the 'down' file is only consulted by s6-supervise
      at slot startup. s6-rc's user-bundle explicitly transitions
      'dashboard' to 'up' on every boot, overriding any down
      marker.

The right fix is the canonical s6 pattern: when HERMES_DASHBOARD
is unset, the dashboard run script exits 0 and a companion
finish script exits 125. Per s6-supervise(8), exit code 125 from
the finish script is the 'permanent failure, do not restart'
marker — equivalent to s6-svc -O. The slot reports as 'down' to
s6-svstat, matching the reality that no dashboard process is
running. When HERMES_DASHBOARD IS truthy, finish exits 0 and
restart-on-crash semantics apply.

03-dashboard-toggle is removed (its function is now subsumed by
the run/finish pair).

Tests
-----
Adds four unit tests for _seed_supervise_skeleton covering the
produced layout, the log/ subservice case, the skip-when-no-log
case, and idempotency. The live-container verification continues
to live in tests/docker/test_s6_profile_gateway_integration.py and
tests/docker/test_dashboard.py — both now pass against the
rebuilt image.

References
----------
* Skarnet skaware mailing list 2020-02-02 (Laurent Bercot
  + Guillermo Diaz Hartusch) on unprivileged s6 tool semantics:
  http://skarnet.org/lists/skaware/1424.html
* just-containers/s6-overlay#130 — same EEXIST-preseed pattern,
  community-validated 2016 onward
* https://skarnet.org/software/s6/servicedir.html — exit-code 125
  semantics in finish scripts

(cherry picked from commit c41f908ad4)
2026-05-25 12:23:23 +10:00
teknium1
5cbb132c1d
fix(ci): exclude tests/docker/ from regular test shards; pin read_text encoding
Two CI follow-ups to @benbarclay's #30136 salvage:

1. scripts/run_tests_parallel.py — add 'docker' to _SKIP_PARTS so
   the new tests/docker/ harness doesn't run in the regular test (N)
   matrix. The harness builds the real Dockerfile in a session
   fixture, which can exceed pytest-timeout's 180s ceiling on
   ubuntu-latest where Docker IS available — it surfaced as 6
   identical setup-timeout failures across slices 1–6 on the first
   CI run.

   The docker harness has its own dedicated runner via
   .github/actions/hermes-smoke-test (added in #30136) plus the
   docker-lint workflow. Same treatment as tests/integration/ and
   tests/e2e/ — runs separately, not in the main shards.

2. hermes_cli/service_manager.py — pin encoding='utf-8' on the
   /proc/1/comm read_text call. Ruff PLW1514 enforcement rolled in
   between Ben's last push and the salvage; pure ruff-fix, no
   behavior change.
2026-05-24 18:23:13 -07:00
teknium1
a4092ab217
fix(profiles): short-circuit s6 hooks on host before importing service_manager
Follow-up to @benbarclay's Docker s6 PR (#30136). The Phase 4 hooks
`_maybe_register_gateway_service` and `_maybe_unregister_gateway_service`
were already documented as "no-op on host", but they reached that no-op
by:

  1. importing `hermes_cli.service_manager`
  2. calling `get_service_manager()` (which calls `detect_service_manager()`)
  3. checking `mgr.supports_runtime_registration()` and returning False

If anything in step 1 or 2 raised an unexpected exception (e.g. a host
machine with a partial s6 install — `/proc/1/comm == s6-svscan` somehow,
but `/run/s6/basedir` absent, or vice versa), the `except Exception`
in the hook would print a confusing "⚠ Could not register s6 gateway
service: ..." warning on a non-container machine that has never touched
the container.

Reorder so `detect_service_manager() != "s6"` is checked FIRST, and
return silently for any detection failure. Host machines now:

  - never import the s6 backend
  - never call get_service_manager()
  - never print an s6-shaped warning under any failure mode

E2E confirmed on host Linux (systemd):
  `_maybe_register_gateway_service(...)` produces empty stdout,
  detect_service_manager() returns "systemd".

Existing tests updated to patch `detect_service_manager` for the s6
call-through cases (they previously relied on get_service_manager
being the only gate, which is no longer true). Added one new test —
`test_register_silent_when_detect_throws` — asserting that a broken
detector cannot leak a warning to host users.

cc @benbarclay — visible behavior change vs. your branch is one
fewer code path on host. Test changes are minimal (one helper +
`_patch_detect_s6` opt-in per s6 test). Happy to revert if you
prefer the original shape.
2026-05-24 18:07:47 -07:00
Ben
d0b1ab48dc
fix(container_boot): publish reconciled service dirs atomically
PR #30136 review noted the asymmetry: `register_profile_gateway`
used tmp_dir + rename to publish a new service slot atomically,
but the boot-time reconciler wrote files into the slot directly.
Same underlying concern (a concurrent s6-svscan rescan could
observe a half-populated directory), different code path.

Rewrite `container_boot._register_service` to mirror the manager:
build everything in `<scandir>/gateway-<profile>.tmp/`, then
`Path.replace` into place. If a previous interrupted run left a
`.tmp` sibling, it's cleaned up before the new build starts. If
the target already exists, it's removed before the rename so
`Path.replace` doesn't error on a non-empty target (Linux `rename`
overwrites empty targets only).

Three new tests: atomic publication leaves no .tmp leftovers,
overwriting an existing slot still leaves no .tmp leftovers, and
a stale .tmp from an interrupted run is cleaned up automatically.
2026-05-24 18:05:33 -07:00
Ben
4443fb481d
fix(container_boot): rotate container-boot.log when it exceeds 256 KiB
PR #30136 review noted: container-boot.log was append-only with no
rotation. On a long-lived container with frequent restarts and
many profiles it would grow unboundedly (~80 B per profile per
reconcile pass).

Add a soft cap: when the file size hits 256 KiB (`_LOG_ROTATE_BYTES`,
≈3000 reconcile lines, ≈1 year of daily reboots × 5 profiles), the
current file is renamed to `container-boot.log.1` (replacing any
existing one) before new entries are appended. Worst case is two
files at ~512 KiB — well within visibility limits for grep/cat.

Rotation is intentionally simple (no logrotate or s6-log machinery
for one append-only file). Failures during rotation are logged via
the module logger and treated as non-fatal — we keep appending to
the existing file rather than dropping the reconcile entry. Three
new unit tests cover above-threshold rotation, below-threshold
non-rotation, and overwrite of an existing .1 file.
2026-05-24 18:05:33 -07:00
Ben
d735b083e8
fix(service_manager): rip out dead port parameter
PR #30136 review caught: `_allocate_gateway_port()` in profiles.py
computed a SHA-256-derived port that was threaded through
`register_profile_gateway(profile, port=N)` →
`_render_run_script(profile, port, extra_env)` → and then **ignored**.
The rendered run script picked the bind port from the profile's
config.yaml (`[gateway] port = …`), never from the allocator. So
the entire allocator + parameter chain was dead code.

Remove:

* `hermes_cli.profiles._allocate_gateway_port` (deterministic
  SHA-256 → [9200, 9800) — never used).
* `port` kwarg from `ServiceManager.register_profile_gateway`
  (Protocol + Mixin + S6 implementation).
* `port` positional arg from `_render_run_script(profile, port,
  extra_env)` — now `_render_run_script(profile, extra_env)`.
* The pass-through call in `profiles._maybe_register_gateway_service`.

config.yaml is now the single source of truth for gateway port
selection — matches reality and reduces the API surface. Three
explanatory comments in service_manager.py / profiles.py document
the retirement so future readers don't reach for the allocator and
find a ghost.

Tests: drop the three `_allocate_gateway_port` tests; update
fakes' signatures throughout test_service_manager.py and
test_profiles_s6_hooks.py to match the new no-port API.
2026-05-24 18:05:33 -07:00
Ben
b28b3f51d3
fix(service_manager): friendly errors for missing slots and s6-svc failures
PR #30136 review caught: `S6ServiceManager.start/stop/restart` called
`subprocess.run(check=True)` on `s6-svc`, so any failure surfaced as
a raw `CalledProcessError` traceback. The two cases operators
actually hit are:

  1. The service slot doesn't exist — most commonly because the user
     typed a profile name wrong (`hermes -p typo gateway start`).
  2. s6-svc itself fails — most commonly EACCES on the supervise
     control FIFO when running unprivileged.

Both deserve named errors with actionable messages, not stacktraces.

Changes:

* Add `S6Error` base + two concrete errors in `hermes_cli.service_manager`:
    - `GatewayNotRegisteredError(profile)` — carries the unprefixed
      profile name; message: `no such gateway 'typo': register it
      with `hermes profile create typo` first, or pass an existing
      profile name via `-p <name>``.
    - `S6CommandError(service, action, returncode, stderr)` — carries
      the s6-svc rc and stderr; message: `s6-svc start on
      'gateway-coder' failed (rc=111): <stderr>`.

* Factor lifecycle dispatch through `_run_svc(flag, label, name)`:
  pre-checks that the service directory exists (raises
  GatewayNotRegisteredError before invoking s6-svc), then runs
  s6-svc and translates any CalledProcessError into S6CommandError.

* `_dispatch_via_service_manager_if_s6` in `hermes_cli.gateway`
  catches both errors and prints `✗ <message>` + `sys.exit(1)`
  instead of letting the exception bubble. The dispatch path that
  used to dump a traceback at the user now gives an actionable
  one-liner.

Tests: 6 new tests for the error types and their CLI rendering;
existing lifecycle test pre-seeds the slot directory before calling
`mgr.start` etc.
2026-05-24 18:05:33 -07:00
Ben
b044c1ac29
fix(container_boot): always register gateway-default slot
PR #30136 review caught: `hermes gateway start` (no `-p`) inside
the container resolves `_profile_suffix() == ""` → service name
`gateway-default`, but no such slot was ever registered. The Phase 4
profile-create hook only fired on `hermes profile create <name>`,
and the root profile (which lives at the top of $HERMES_HOME, not
under `profiles/`) was never one of those. So bare `hermes gateway
start` landed on `s6-svc -u /run/service/gateway-default` →
uncaught `CalledProcessError` → traceback to the user.

Changes:

1. `reconcile_profile_gateways` now always registers a
   `gateway-default` slot before iterating named profiles. Its
   prior state is read from `$HERMES_HOME/gateway_state.json`
   (sibling to the profile root, not under `profiles/`); stale
   runtime files there are swept the same way. Auto-up only if the
   prior state was `running` — same rule as named profiles.

2. `S6ServiceManager._render_run_script` special-cases
   `profile == "default"` to emit `hermes gateway run` with NO
   `-p` flag. Passing `-p default` would resolve to
   `$HERMES_HOME/profiles/default/` — a different profile that
   almost certainly doesn't exist. The empty profile-suffix
   convention is the dispatcher's contract and the run script has
   to match.

3. A user-created `profiles/default/` collides with the reserved
   root-profile slot; the reconciler now skips it with a warning
   rather than producing two registrations of the same service name.

Action-list ordering is stable: `default` first, then named
profiles in directory order. Boot-log readers can rely on this.

Tests: 8 new dedicated default-slot tests plus updates to every
existing test that asserted against the action list (via the new
`_named_actions` helper that drops the always-present default
entry).
2026-05-24 18:05:33 -07:00
Ben
6dedaa4846
fix(gateway): route --all stop/restart through s6 under container
PR #30136 review caught that `hermes gateway stop --all` and
`... restart --all` were broken under s6. The Phase 4 dispatcher was
gated on `not stop_all` (and the symmetric restart_all), so `--all`
fell through to `kill_gateway_processes(all_profiles=True)`. pkill
SIGTERMed every gateway, s6-supervise observed the crashes, and
restarted every gateway ~1s later — net effect: `--all` *kicked*
gateways instead of *stopping* them.

Add `_dispatch_all_via_service_manager_if_s6(action)` that iterates
`mgr.list_profile_gateways()` and routes stop/restart through each
service slot. s6's `want up`/`want down` flips correctly, so a
stop persists. Partial failures are surfaced per-profile with a
running success count; the host pkill path is only reached when s6
isn't in play.

`start --all` isn't a CLI surface — the helper rejects it and
returns False (host code path can take over).
2026-05-24 18:05:33 -07:00
Ben
fc39296e1f
fix(service_manager): s6 detection works for unprivileged hermes user
PR #30136 review surfaced two issues, both rooted in the same audit gap:
docker integration tests were running as root, not the unprivileged
`hermes` user (UID 10000) that the runtime actually uses via
`s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to
the s6 control surface worked as root in the tests but was inert in
production.

Fixes:

1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`,
   which is root-only readable. For UID 10000 the symlink yields
   PermissionError, `resolve()` silently returns the unresolved path,
   and `exe.name == "exe"` — so detection always returned False, the
   service-manager runtime-registration path was inert, and every
   `hermes profile create` / `hermes -p X gateway start` silently
   skipped the s6 hook. Replace with `/proc/1/comm` (world-readable)
   + `/run/s6/basedir` (s6-overlay-specific) — both required, fail
   closed.

2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/`
   {control,lock} to hermes so `s6-svscanctl -a/-an` works without
   root. Previously the directory chown stopped at `/run/service`
   and the FIFO inside stayed root-owned, so `register_profile_gateway`
   from hermes failed at the rescan-trigger step with EACCES — the
   wrapper in profiles.py caught the exception and printed a swallowed
   warning, so profile creation appeared to succeed while the slot
   was rolled back.

Audit changes to flush this class of bug next time:

- Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py`
  that default to `-u hermes`. The module docstring explains why and
  flags `user="root"` as opt-in only for tests that explicitly need
  root (none currently do).
- Refactor every `docker exec` call in tests/docker/ through the new
  helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py,
  test_container_restart.py, test_s6_profile_gateway_integration.py).
- Add 5 unit tests covering `_s6_running` under various probe states
  (both signals present; comm wrong; basedir missing; PermissionError
  on /proc/1/comm; missing /proc — non-Linux). The PermissionError
  test is the explicit regression guard for the original bug.

Known follow-up: the per-service `supervise/control` FIFO inside each
`/run/service/gateway-<profile>/supervise/` is created root-owned by
s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc
-u/-d/-t` from the hermes user will get EACCES on those. The audit
under `-u hermes` will reveal this in lifecycle tests — surfacing the
issue cleanly so it can be fixed in a focused follow-up (likely via a
small SUID helper or a polling chown loop in cont-init.d). The
detection + svscanctl fixes here are independent and complete on
their own.
2026-05-24 18:05:33 -07:00
Ben
4b4c36cb61
feat(docker): remove gosu from bundled image; s6-setuidgid handles privilege drop
The s6-overlay migration replaced every runtime use of gosu with
s6-setuidgid (in stage2-hook.sh, main-wrapper.sh, per-service run
scripts, and cont-init.d hooks), but the gosu binary itself was still
being copied into the image from tianon/gosu, and several comments
across the repo still pointed to it.

Image changes:
- Drop the FROM tianon/gosu:1.19-trixie AS gosu_source stage
- Drop the COPY --from=gosu_source /gosu /usr/local/bin/ layer
- Net: one fewer base-image pull, ~12-15 MB layer eliminated

Documentation/comment refresh (no behavior change):
- Dockerfile: update root-user rationale comment + cont-init.d comment
- docker/main-wrapper.sh: drop "pre-s6 contract (gosu drop)" reference
- docker-compose.yml: update UID/GID remap comment
- .hadolint.yaml: update DL3002 ignore rationale
- website/docs/user-guide/docker.md: privilege-drop helper is s6-setuidgid now
- hermes_cli/config.py: docker_run_as_host_user docstring

tools/environments/docker.py runs *arbitrary user images* via the
terminal backend, not the bundled Hermes image. It still needs SETUID/
SETGID caps so user images that use gosu/su/s6-setuidgid all work.
Renamed the cap-list constant _GOSU_CAP_ARGS → _PRIVDROP_CAP_ARGS and
updated comments to list s6-setuidgid alongside the others as examples.
The matching test (test_security_args_include_setuid_setgid_for_gosu_drop
→ test_security_args_include_setuid_setgid_for_privdrop) was renamed
and its docstring updated; behavior is unchanged.

Verification:
- hadolint clean against .hadolint.yaml
- shellcheck clean against all docker/ shell scripts
- Image rebuilt successfully (sha 1a090924ccea)
- Docker harness: 19 passed in 41.87s (every Phase 0 test + Phase 4
  per-profile-gateway lifecycle + container-restart reconciliation)
- tests/tools/test_docker_environment.py: 23 passed (rename did not
  break test discovery; pre-existing unrelated mock warning)

The plan document (docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md)
intentionally retains its historical references to gosu — it describes
the pre-s6 entrypoint as background for understanding the migration.
2026-05-24 18:05:33 -07:00
Ben
a36221ed91
docs(s6): document container supervision; doctor + skill + user-guide updates
Phase 5 of the s6-overlay supervision plan. Documentation + small
diagnostic cleanups; no behavior changes.

website/docs/user-guide/docker.md:
  - Replace the old 'entrypoint script does the bootstrap' section
    with the s6-overlay boot flow (cont-init.d/01-hermes-setup,
    cont-init.d/02-reconcile-profiles, static main-hermes + dashboard
    services, ENTRYPOINT-as-main-program pattern).
  - Add a 'Per-profile gateway supervision' subsection covering the
    new lifecycle commands, restart semantics, log persistence, and
    'Manager: s6 (container supervisor)' status reporting.
  - Add 'Breaking change vs. pre-s6 images' callout naming the
    /init ENTRYPOINT and pointing affected wrappers at the pin
    workaround.

website/docs/user-guide/profiles.md:
  - Add a note under 'Persistent services' pointing container users
    at the docker.md section explaining s6 supervision inside the
    image. Host-side systemd/launchd documentation is unchanged.

skills/software-development/hermes-s6-container-supervision/SKILL.md:
  - New maintainer skill covering the supervision-tree map, file
    layout, the Architecture B rationale (cont-init.d args + halt
    exit-code propagation), quick recipes, and the 8 pitfalls we hit
    while implementing the plan (PATH-without-/command, root-owned
    profile dirs, SOUL.md as marker, the '143' anti-pattern, etc.).

hermes_cli/doctor.py:
  - _check_gateway_service_linger skips on s6 (the linger concept
    doesn't apply inside the container).
  - New _check_s6_supervision section reports main-hermes/dashboard
    state and per-profile-gateway count (registered vs supervised
    up), only inside the s6 container. Host doctor output unchanged.
  - External Tools / Docker check no longer emits a 'docker not
    found' warning inside the container; prints an explanatory
    info line instead. Still respects an explicit TERMINAL_ENV=docker
    (in case the user mounted /var/run/docker.sock).

hermes_cli/gateway.py:
  - Document _container_systemd_operational more precisely: it's
    NOT for our Hermes Docker image (s6-overlay handles that via
    detect_service_manager() == 's6'). It still covers
    systemd-nspawn / k8s-with-systemd-init cases, so leaving it in
    place is correct; the docstring just makes that explicit.

Test harness (verification, no test changes in this commit):
  19 passed, 0 xfailed. 66 service-manager / container-boot /
  profiles-s6-hooks / gateway-s6-dispatch unit tests still green.
  61 doctor tests still green. Hadolint + shellcheck clean.

Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
2026-05-24 18:05:33 -07:00
Ben
2afefc501c
feat(docker): per-profile s6 supervision + container-restart reconciliation
Phase 4 of the s6-overlay supervision plan. Activates the Phase 3
S6ServiceManager by hooking it into the profile lifecycle and the
`hermes gateway start/stop/restart` dispatcher, and adds a cont-
init.d-time reconciliation pass that survives `docker restart`.

Task 4.0 — container-boot reconciliation:
  /run/service/ is tmpfs, so every `docker restart` wipes every
  per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles
  invokes hermes_cli.container_boot.reconcile_profile_gateways() on
  every boot, which walks $HERMES_HOME/profiles/<name>/, reads each
  gateway_state.json, recreates the s6 service slot, and auto-starts
  only those whose last state was 'running'. Other states
  (stopped, starting, startup_failed, missing) register the slot
  in the down state — avoiding crash-loops across restarts for a
  gateway that was broken last boot. Per-profile outcome is recorded
  to $HERMES_HOME/logs/container-boot.log.

  Implementation: hermes_cli/container_boot.py + 12 unit tests.
  Profile-marker is SOUL.md, not config.yaml, because `hermes profile
  create` only seeds SOUL.md by default (config.yaml comes from
  `hermes setup`).

Task 4.1 / 4.2 — profile create/delete hooks:
  hermes_cli/profiles.py::create_profile now calls
  _maybe_register_gateway_service(<canon>) at the end, which routes
  through ServiceManager.register_profile_gateway when running on s6
  and no-ops on host backends. delete_profile mirrors with
  _maybe_unregister_gateway_service. _allocate_gateway_port produces
  a deterministic SHA-256-derived port in [9200, 9800).

Task 4.3 — gateway dispatch + remove rejection arms:
  _dispatch_via_service_manager_if_s6(action) intercepts
  start/stop/restart at the top of each subcommand and routes them
  through S6ServiceManager.{start,stop,restart}. The pre-Phase-4
  `elif is_container():` rejection arms are kept as fallback for
  pre-s6 containers / unsupported runtimes, but only ever fire when
  detect_service_manager() != 's6'. install/uninstall under s6
  print informational guidance pointing users at profile create/delete.

  Removed the two xfail(strict=True) markers from
  tests/docker/test_profile_gateway.py — both tests now pass strictly.

Task 4.4 — status reporting:
  get_gateway_runtime_snapshot() reports
  Manager: 's6 (container supervisor)' inside an s6 container instead
  of 'docker (foreground)'.

Plan-vs-reality drift fixed in this commit:
  - Plan's S6ServiceManager._render_run_script used
    `gateway start --foreground --port {port}` — invented args; the
    real CLI is `gateway run`. Switched accordingly. port arg
    retained for API parity but now documented as 'currently ignored'.
  - Plan's reconciler keyed on config.yaml; switched to SOUL.md
    (config.yaml is created by hermes setup, not by hermes profile
    create, so the original gate caught nothing).
  - The plan's _dispatch helper used _profile_arg() which returns
    '--profile <name>' (i.e. with the flag prefix). Switched to
    _profile_suffix() which returns the bare name.
  - Architecture B's docker exec doesn't get /command on PATH or
    the venv on PATH; Dockerfile's runtime PATH now includes
    /opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works
    without sourcing the venv.
  - stage2-hook now chowns $HERMES_HOME/profiles to hermes on every
    boot, not just on the UID-remap path. Without this, files created
    by docker-exec-as-root accumulate and the next reconciler run
    fails with PermissionError reading SOUL.md.

Test harness:
  19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to
  passing). 78 unit tests across service_manager + container_boot +
  profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck
  pass cleanly.

Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
2026-05-24 18:05:33 -07:00
Ben
0abf661f71
feat(service_manager): add S6ServiceManager for runtime gateway supervision
Phase 3 of the s6-overlay supervision plan. Implements the runtime-
registration surface from D4 — only the s6 backend supports
register_profile_gateway / unregister_profile_gateway /
list_profile_gateways; host backends continue to raise
NotImplementedError. No caller yet (Phase 4 wires in the profile
create/delete hooks).

Key implementation notes:

  - Service directory shape: /run/service/gateway-<profile>/{type,run,log/run}.
    Atomic register: write to gateway-<profile>.tmp, fsync via
    os.rename. Cleanup on rescan failure.

  - Run script uses #!/command/with-contenv sh so HERMES_HOME and any
    extra_env arrive at exec time. The hermes -p <profile> gateway
    start --foreground --port <port> command is wrapped in
    s6-setuidgid hermes for the per-service privilege drop (OQ2-A).

  - Log script (OQ8-C): persists via s6-log to
    ${HERMES_HOME}/logs/gateways/<profile>/. CRITICAL — HERMES_HOME is
    a runtime env-var expansion in the rendered script, NOT a Python
    f-string substitution. Negative-asserted in
    test_s6_register_creates_service_dir_and_triggers_scan so
    regressions are caught.

  - PATH gotcha: /command/ is only on PATH for processes spawned by
    the supervision tree (services, cont-init.d). `docker exec` and
    profile-create hooks don't get it. S6ServiceManager calls all
    s6-* binaries via absolute path through the new _S6_BIN_DIR
    constant so callers don't have to fix up env vars.

  - validate_profile_name rejects path-traversal, leading-dash (s6
    would parse as a flag), uppercase, whitespace, and names >251
    chars (s6-svscan default name_max).

Test coverage:
  - 13 new unit tests in tests/hermes_cli/test_service_manager.py
    (kind detection, run-script content, env quoting, register
    rollback on rescan failure, unregister idempotence, list filter,
    lifecycle dispatch, svstat parsing). Total: 36 passing.
  - 2 new in-container integration tests in
    tests/docker/test_s6_profile_gateway_integration.py validating
    end-to-end registration against a real s6 supervision tree.

Docker harness: 14 passed, 2 xfailed (Phase 4 target unchanged).

Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
2026-05-24 18:05:33 -07:00
Ben
51914b0514
feat(service_manager): add ServiceManager protocol + host wrappers
Phase 1 of the s6-overlay supervision plan. Pure-refactor addition:
introduces the abstract interface (with runtime_checkable Protocol),
detect_service_manager(), validate_profile_name(), and thin
SystemdServiceManager / LaunchdServiceManager / WindowsServiceManager
wrappers around the existing systemd_* / launchd_* / gateway_windows.*
module-level functions. No host call site was modified — host code
continues to use the existing functions directly; the protocol is for
new backend-agnostic code (Phase 4 profile create/delete hooks and the
Phase 4 s6 dispatch path in 'hermes gateway start/stop/restart').

WindowsServiceManager.install() forwards the v3 kwargs (start_now,
start_on_login, elevated_handoff) added in PRs #28169-adjacent so
non-Windows callers — there aren't any today — can opt in.

The s6 backend lands in Phase 3; until then get_service_manager()
raises a clear error if invoked on a host that detects as 's6'.
2026-05-24 18:05:14 -07:00
kshitijk4poor
00ec0b617c feat(tts): add register_tts_provider() plugin hook (closes #30398)
Adds a `TTSProvider(ABC)` + `register_tts_provider()` extension point
to the plugin context API, **alongside** the existing config-driven
`tts.providers.<name>: type: command` registry from PR #17843. This is
additive — the command-provider surface stays as the primary way to
add a TTS backend.

The hook covers cases the shell-template grammar can't reasonably
express:

- Native Python SDKs without a CLI (Cartesia, Fish Audio, etc.)
- Streaming synthesis (chunked Opus → voice-bubble delivery)
- Voice metadata API for the `hermes tools` picker
- OAuth-refreshing auth flows

None of the 10 inline built-in providers (`edge`, `openai`,
`elevenlabs`, `minimax`, `gemini`, `mistral`, `xai`, `piper`,
`kittentts`, `neutts`) are migrated to plugins. They stay inline. The
hook is for *new* engines that aren't built-in.

## Resolution order

The dispatcher's resolution order is the load-bearing invariant:

1. `tts.provider` is a built-in name → built-in dispatch. **Always wins.**
2. `tts.provider` matches `tts.providers.<name>` with `command:` set
   → command-provider dispatch (PR #17843).
3. `tts.provider` matches a plugin-registered `TTSProvider`
   → plugin dispatch (new).
4. No match → falls through to Edge TTS default (legacy behavior).

Built-ins-always-win is enforced at THREE layers:
- Registry: `register_provider()` rejects shadowing names with a warning.
- Dispatcher: `_dispatch_to_plugin_provider()` short-circuits built-in
  names defensively before consulting the registry.
- Picker: `_plugin_tts_providers()` filters built-in shadows out of
  the `hermes tools` row list defensively.

Command-providers-win-over-plugins is enforced at TWO layers:
- The caller in `text_to_speech_tool` checks
  `_resolve_command_provider_config` first.
- `_dispatch_to_plugin_provider` re-checks for a same-name command
  config defensively so a refactor of the caller can't silently break
  the invariant.

## New files

- `agent/tts_provider.py` — `TTSProvider(ABC)` with `synthesize()` (required),
  `list_voices()`, `list_models()`, `get_setup_schema()`, `stream()`,
  `voice_compatible` (all optional with sane defaults). Mirrors
  `agent/image_gen_provider.py` shape.
- `agent/tts_registry.py` — `register_provider`/`get_provider`/`list_providers`
  with `_BUILTIN_NAMES` reject-shadowing invariant. Mirrors
  `agent/image_gen_registry.py` shape.
- `plugins/tts/...` directory ready for community plugins (none shipped).

## Modified files

- `hermes_cli/plugins.py` — `register_tts_provider()` method on
  `PluginContext`. Matches the gating shape of
  `register_image_gen_provider()` / `register_browser_provider()`.
- `tools/tts_tool.py` — `_dispatch_to_plugin_provider()` +
  `_plugin_provider_is_voice_compatible()` + walrus-elif wiring into
  the main dispatcher. Built-in elif chain untouched.
- `hermes_cli/tools_config.py` — `_plugin_tts_providers()` injects
  plugin rows into the Text-to-Speech picker category alongside the
  10 hardcoded built-in rows.

## Tests

- `tests/agent/test_tts_registry.py` — 47 tests covering registration,
  lookup, ABC contract, helpers, AND a `TestBuiltinSync` regression
  test that fails if `agent.tts_registry._BUILTIN_NAMES` drifts from
  `tools.tts_tool.BUILTIN_TTS_PROVIDERS` (kept duplicated due to
  circular import constraints).
- `tests/tools/test_tts_plugin_dispatch.py` — 35 tests covering
  built-in-always-wins, command-wins-over-plugin, plugin dispatch,
  exception passthrough, voice_compatible helper.
- `tests/hermes_cli/test_tts_picker.py` — 10 tests covering the
  picker surface, builtin shadowing defense, integration with
  `_visible_providers`.
- `tests/hermes_cli/test_plugins_tts_registration.py` — 3 end-to-end
  tests via `PluginManager.discover_and_load()`.
- `tests/plugins/tts/check_parity_vs_main.py` — 9-scenario subprocess
  parity harness vs `origin/main`. The only intentional diff is
  `fallback_edge → plugin` for the `plugin-installed` scenario.

## Verification

- 95/95 new tests pass.
- 170/170 pre-existing TTS tests (test_tts_command_providers,
  test_tts_max_text_length, test_tts_speed, etc.) pass unchanged.
- Parity harness against `origin/main`: 8 OK + 1 expected DIFF.
- E2E smoke: a registered plugin's `synthesize()` is called via
  `text_to_speech_tool` with the standard JSON envelope returned.
- Ruff clean on all touched files.

## Docs

- `website/docs/user-guide/features/tts.md` — new "Python plugin
  providers" section with a decision table (command-provider vs
  plugin), minimal plugin example, and the optional-hook reference.
- `website/docs/user-guide/features/plugins.md` — TTS row updated to
  mention both surfaces (command-provider primary, plugin for
  SDK/streaming).

Closes #30398
2026-05-24 18:04:54 -07:00
JunghwanNA
243ebc7a61 Protect dashboard OAuth credentials with the same file-safety guarantees as other auth paths
The web dashboard's Anthropic OAuth helper wrote the credential file
straight to its final destination and relied on the process umask for
permissions. That left the dashboard-specific path weaker than the
existing auth writers, which already use owner-only permissions and
safer write semantics.

This change keeps the scope narrow: make the dashboard helper write via
a temp file + replace, chmod the final file to owner-only, and add a
focused regression test for both permission handling and atomic-write
behavior.

Constraint: Must preserve the existing dashboard OAuth flow and credential-pool side effects
Rejected: Broader auth-storage refactor | unnecessary scope for a single verified inconsistency
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep dashboard credential writes aligned with existing auth storage semantics; do not reintroduce direct write_text() here without matching chmod/atomic behavior
Tested: pytest -o addopts='' tests/hermes_cli/test_web_server_oauth_write.py tests/hermes_cli/test_web_server.py -q (78 passed)
Not-tested: Cross-platform permission semantics on Windows-managed filesystems
2026-05-24 17:47:24 -07:00
leeseoki0
ce529d6072 fix(kanban): scratch tasks must not inherit board.default_workdir (#28818)
Board defaults represent persistent project checkouts. Scratch workspaces
are auto-deleted on completion and must stay under the per-board scratch
root that resolve_workspace() creates. Inheriting default_workdir for a
scratch task pointed the cleanup path at the user's source tree — the
data-loss vector documented in #28818.

The containment guard in _cleanup_workspace (just added) is the safety
rail. This commit prevents the bad state from being created in the first
place: only persistent kinds (dir/worktree) inherit board defaults.

Tests updated to cover the new semantics: scratch with default_workdir
set keeps workspace_path=None; dir/worktree still inherits the board
default.

Salvaged from PR #31315 by @leeseoki0 — prevention layer on top of the
#28819 containment fix by @briandevans.

Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
2026-05-24 15:48:58 -07:00
briandevans
23115b5c0f fix(kanban): restrict managed-scratch roots to workspaces/ dirs only
Copilot review on PR #28819 flagged that `_is_managed_scratch_path` accepted
the entire `<kanban_home>/kanban` subtree as managed scratch storage. With
that, a task whose `workspace_kind='scratch'` and `workspace_path` was
mis-set to `<kanban_home>/kanban`, `.../kanban/logs`, or a board's
metadata directory (e.g. `.../kanban/boards/<slug>` without the
`workspaces/` child) would pass the containment guard and let task
completion `shutil.rmtree` Hermes' own DB, metadata, and log subtrees.

Tighten the guard:

* Allowed roots are now exclusively `workspaces/` directories — the
  `HERMES_KANBAN_WORKSPACES_ROOT` override, `<kanban_home>/kanban/workspaces`,
  and each `<kanban_home>/kanban/boards/<slug>/workspaces` discovered on
  disk.
* Require strict descendancy: a path equal to a root itself is rejected
  too, because deleting a workspaces root would wipe every task's scratch
  dir at once.

Add a regression test covering the three Copilot-named attack paths
(kanban root, kanban/logs, board root without `workspaces/`) plus the
workspaces-root-itself case, and confirm the inner task-id dir still
matches.
2026-05-24 15:48:58 -07:00
briandevans
80ad1609c8 fix(kanban): refuse to rmtree workspace_path outside managed scratch root (#28818)
A board's ``default_workdir`` (e.g. ``hermes kanban boards
set-default-workdir my-board /path/to/real/source``) is copied into
``tasks.workspace_path`` for tasks created without an explicit
``workspace_kind``. Those tasks default to ``workspace_kind='scratch'``,
so completion calls ``_cleanup_workspace`` and unconditionally runs
``shutil.rmtree(wp, ignore_errors=True)`` — deleting the user's real
source tree as if it were disposable scratch storage.

Add ``_is_managed_scratch_path()`` and gate ``_cleanup_workspace`` on
it: only delete paths under ``HERMES_KANBAN_WORKSPACES_ROOT`` (the
worker-side override the dispatcher injects) or under the active kanban
home's ``kanban/`` subtree (covering both the legacy default-board root
and per-board ``kanban/boards/<slug>/workspaces`` roots). Anything else
gets a warning log and is left alone, so a misconfigured
``default_workdir`` can no longer destroy user data on task completion.
2026-05-24 15:48:58 -07:00
helix4u
514f5020c7 fix(debug): redact BlueBubbles webhook secrets 2026-05-24 15:43:48 -07:00
Teknium
13b85bc646 feat(config): document resume-recap tuning keys in DEFAULT_CONFIG
The hardcoded constants in _display_resumed_history were exposed as
config in PR #4434; declare them in DEFAULT_CONFIG and the CLI fallback
dict so they show up in 'hermes config' diagnostics and the schema
validator.
2026-05-24 15:36:37 -07:00
Teknium
54e61f9331 fix(matrix,gateway): Matrix E2EE installs full dep set; plugins respect is_connected
Fixes #31116 — two distinct bugs in fresh-install Matrix gateway:

1. Matrix E2EE setup installed only mautrix[encryption], leaving asyncpg
   / aiosqlite / Markdown / aiohttp-socks uninstalled. The first encrypted
   connect failed with 'No module named asyncpg' deep inside
   MatrixAdapter.connect(). Root cause: the setup wizard hand-rolled a
   pip install of one package instead of using lazy_deps.ensure(
   'platform.matrix'), and check_matrix_requirements() short-circuited the
   runtime installer on 'import mautrix' alone — so the other 4 packages
   were never pulled in.

2. Discord auto-enabled itself on every gateway start, even when the user
   never selected Discord and had no DISCORD_BOT_TOKEN. Root cause:
   gateway/config.py plugin-enablement loop gated enablement on
   entry.check_fn() (just 'is the SDK importable?') and ignored
   entry.is_connected (the 'did the user configure credentials?' probe).
   Same bug class as commit 7849a3d73 fixed for _platform_status in the
   setup wizard; this is the runtime counterpart. Affects Discord, Teams,
   and Google Chat.

Changes:
- hermes_cli/setup.py::_setup_matrix — install via
  lazy_deps.ensure('platform.matrix') to pull the full feature group.
- gateway/platforms/matrix.py::_check_e2ee_deps — verify asyncpg +
  aiosqlite + PgCryptoStore in addition to OlmMachine, so E2EE failures
  surface at startup instead of at first encrypted-room connect.
- gateway/platforms/matrix.py::check_matrix_requirements — use
  feature_missing('platform.matrix') as the install gate instead of a
  single 'import mautrix' check, so partial installs trigger the lazy
  installer correctly.
- gateway/config.py plugin-enablement loop — consult entry.is_connected
  before flipping enabled=True. Explicit YAML enabled=true still wins.

Tests: 3 new in tests/gateway/test_matrix.py (asyncpg-required,
aiosqlite-required, partial-install lazy-runs), 5 new in
tests/gateway/test_platform_registry.py (is_connected=False blocks,
is_connected=True enables, is_connected=None falls back to check_fn,
raising probe doesn't enable, explicit YAML wins).

Validation: 310 tests across affected test modules pass.
2026-05-24 15:16:03 -07:00
Teknium
7ab1677362
feat(security): on-demand supply-chain audit via OSV.dev (#31460)
Adds 'hermes security audit' — a one-shot vulnerability scan against
OSV.dev covering three surfaces a Hermes user actually controls:

  1. The running Python's installed PyPI dists (importlib.metadata)
  2. Plugin requirements.txt / pyproject.toml pins under ~/.hermes/plugins/
  3. Pinned npx/uvx MCP servers in config.yaml

Zero new dependencies (stdlib urllib + importlib.metadata + tomllib +
concurrent.futures). No auth required for OSV's public batch API.

Flags: --json, --fail-on {low,moderate,high,critical} (default: critical),
       --skip-venv, --skip-plugins, --skip-mcp

Output groups findings by source, sorts by severity descending, surfaces
fixed-versions inline. Exit 1 when any finding meets the --fail-on tier.

Deliberately out of scope: globally-installed pip/npm, editor/browser
extensions, daily background scans, auto-blocking of installs. The audit
is on-demand by design — daily scans become noise the user trains
themselves to ignore.
2026-05-24 15:15:16 -07:00
hinotoi-agent
2e66eefbc3 fix(dashboard): validate WebSocket Host and Origin 2026-05-24 15:00:44 -07:00
Hinotoi-agent
3bace071bf fix(state): restrict sensitive store file permissions
response_store.db (api server) holds conversation history including tool
payloads, prompts, and results. webhook_subscriptions.json holds per-route
HMAC secrets. Under a permissive umask (e.g. 0o022, default on most
distros) both files were created mode 0o644 — readable by other local
users on shared boxes.

- gateway/platforms/api_server.py: ResponseStore tightens itself + WAL/SHM
  sidecars to 0o600 after __init__, then trusts the inode. (Original
  contributor patch chmod'd after every _commit() — wasteful on a hot
  api_server path; chmod-on-create is sufficient since SQLite preserves
  mode bits across writes.)

- hermes_cli/webhook.py: _save_subscriptions writes via tempfile.mkstemp
  (which itself creates the file with 0o600), chmods the temp before the
  atomic rename, and re-asserts 0o600 on the destination so an existing
  permissive file from before this fix gets narrowed.

Tests cover (a) creation under permissive umask leaves 0o600 and (b) an
existing 0o644 webhook_subscriptions.json gets narrowed on next save.
Tests guarded with skipif os.name=='nt' since POSIX mode bits don't apply
on Windows.

Salvaged from PR #30917 by @Hinotoi-agent. Reworked the api_server.py
side from chmod-on-every-commit to chmod-on-create.

Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
2026-05-24 04:55:18 -07:00
Jiaming Guo
ee002e7fc5
fix(dashboard): require auth for plugin rescan (#27340) 2026-05-24 04:45:07 -07:00
Teknium
be27bfed01
security: harden API server key placeholder handling (#30738) 2026-05-24 04:25:32 -07:00
Teknium
1f897b0dc9
fix(gateway): stop enabling dingtalk allow-all during setup (#30743) 2026-05-24 04:24:44 -07:00
Teknium
9732559864
fix(security): restrict dashboard websockets to loopback clients (#30741) 2026-05-24 04:24:40 -07:00
Teknium
bc3f1f4f34
feat(secrets/bitwarden): EU Cloud + self-hosted server URL support (#31378)
Closes #31370.

bws defaults to the US identity endpoint, so EU Cloud and self-hosted
machine-account tokens fail with [400 Bad Request] {"error":"invalid_client"}
during 'hermes secrets bitwarden setup'. The token is valid — it's just
being checked against the wrong region.

Add a Bitwarden region step to the wizard between the access-token and
project-list steps:

  Step 1  Install bws
  Step 2  Provide access token
  Step 3  Pick region   <-- new (US / EU / self-hosted-custom-URL)
  Step 4  Pick project  (now talks to the right endpoint)
  Step 5  Test fetch

Region is stored in config.yaml as secrets.bitwarden.server_url and
plumbed into every bws subprocess as BWS_SERVER_URL (project list,
secret list, test fetch, and the env_loader startup pull).

Also:
- Non-interactive: 'hermes secrets bitwarden setup --server-url ...'
- Pre-existing BWS_SERVER_URL in the shell is detected and reused
- Cache key includes server_url so EU/US fetches don't collide
- 'hermes secrets bitwarden status' shows the configured region
- 'invalid_client' / '400 Bad Request' from bws now triggers a hint
  pointing at the region setting instead of looking like a bad token
2026-05-24 02:19:57 -07:00
Teknium
b207dc28b3 feat(kanban): --ids bulk promote + AUTHOR_MAP entry for #29464
Adds an --ids flag to 'hermes kanban promote' mirroring the existing
block/schedule convention, so the marquee use case from issue #28822
(promote all children of a closed organizational parent in one shot)
doesn't require a shell loop. Single-id JSON output stays a flat
object for back-compat; bulk emits a list. Dedupes positional + --ids
so the same id can't be promoted twice in one call. 5 new CLI-level
tests cover bulk happy path, partial-failure exit code, JSON shapes,
and dedup.

Also adds the thedavidmurray noreply-email -> github-login mapping in
scripts/release.py so the salvage cherry-pick passes the AUTHOR_MAP
contributor-credit check.
2026-05-23 23:10:36 -07:00
David Murray
d46adad22f feat(cli): kanban promote verb for manual todo->ready recovery
Adds `hermes kanban promote <task_id>` for manual lifecycle recovery
when an auto-promote daemon misses the parent-done transition (issue
#28822). Refuses promotion unless every parent dep is done/archived
(override with --force). Emits a `promoted_manual` audit event distinct
from the automatic `promoted` kind, so audit consumers can filter
human-driven from system-driven promotions. Supports --dry-run and
--json for orchestration. Does not mutate assignee/claim state — the
dispatcher picks the card up via its normal ready polling path.

Closes #28822.
2026-05-23 23:10:36 -07:00
honor2030
6a1aa420e7 Fix CLI verbose tool progress config fallback 2026-05-23 21:03:51 -07:00
Teknium
e42fcc5625
fix(provider): make config.yaml model.provider the single source of truth (#31222)
Policy: if it ain't a secret it goes in config.yaml. HERMES_INFERENCE_PROVIDER
was leaking behavioral config into the .env surface, including from the gateway,
which bypassed config.yaml entirely.

Behavior:
- gateway/run.py: drop HERMES_INFERENCE_PROVIDER read in _resolve_runtime_agent_kwargs.
  Gateway now flows through resolve_runtime_provider() with no `requested` override,
  which reads model.provider from config.yaml first.

Docs/UX (strip env var from user-facing surface):
- --provider help text no longer mentions the env var
- cli-config.yaml.example same
- reference/environment-variables.md: remove HERMES_INFERENCE_PROVIDER row and
  the cross-reference from HERMES_INFERENCE_MODEL
- reference/cli-commands.md: blank the env-var column for --provider
- guides/xai-grok-oauth.md, guides/minimax-oauth.md: replace
  HERMES_INFERENCE_PROVIDER=x hermes invocations with config.yaml / --provider
- developer-guide/adding-providers.md, model-provider-plugin.md: reframe

Internal mechanism (kept as-is):
- hermes_cli/main.py writes HERMES_INFERENCE_PROVIDER into the TUI subprocess env
- tui_gateway/server.py reads it on TUI startup
- resolve_requested_provider() / oneshot.py / cli.py still fall through to the
  env var as a last-resort behind config.yaml, which is what makes the TUI
  parent->child handoff work
This stays. We just stop documenting it as a user knob.

Tests: tests/gateway/test_auth_fallback.py — simplify mock to fail on first
call, succeed on second; drop monkeypatch.setenv lines that no longer matter.

Supersedes #31064 (closed with credit to @novax635 who surfaced the underlying
issue but proposed aligning gateway *to* the env var rather than removing it).
2026-05-23 18:18:41 -07:00
Edison
e752c9454e feat(plugins): add register_auxiliary_task() to PluginContext API
Auxiliary LLM tasks (vision, compression, web_extract, etc.) currently
require modifications to core files for any plugin that needs its own
task slot — specifically the _AUX_TASKS list in hermes_cli/main.py and
the hardcoded env-var bridging dict in gateway/run.py. This violates
the 'plugins must not modify core files' rule and forces every memory
or context plugin that wants its own auxiliary task to either fork
core or open a coupled core+plugin PR.

This change adds a generic plugin surface for auxiliary task
registration:

    ctx.register_auxiliary_task(
        key='memory_retain_filter',
        display_name='Memory retain filter',
        description='hindsight pre-retain dedup/extract',
        defaults={'timeout': 30, 'extra_body': {'reasoning_effort': 'low'}},
    )

After registration, the task automatically:

  - Appears in 'hermes model → Configure auxiliary models' picker via
    a new _all_aux_tasks() merge of built-in + plugin tasks
  - Has its provider/model/base_url/api_key bridged from config.yaml
    to AUXILIARY_<KEY_UPPER>_* env vars at gateway startup
    (gateway/run.py now uses a dynamic bridged-keys set instead of
    a hardcoded per-task dict)
  - Gets plugin-declared defaults (timeout, extra_body, etc.) layered
    underneath user config so unconfigured plugin tasks still work
    (agent/auxiliary_client._get_auxiliary_task_config)
  - Resets to auto via 'Reset all to auto' alongside built-ins

Validation:

  - Rejects shadowing of built-in keys (vision, compression, etc.)
  - Rejects invalid key shapes (must match [A-Za-z0-9_]+)
  - Rejects cross-plugin collisions (clear error)
  - Allows same-plugin re-registration (idempotent updates)

Plugin discovery failures (rare) fall back gracefully — the aux
config UI still shows built-in tasks if get_plugin_auxiliary_tasks()
raises, and gateway env-var bridging keeps working for built-ins.

Built-in tasks remain hardcoded in _AUX_TASKS for stability — they're
the baseline UX, and DEFAULT_CONFIG already ships their defaults.
Plugin tasks layer on top.

Tests: 15 new tests in test_plugin_auxiliary_tasks.py covering API
validation, manager state lifecycle, helper sort order, _all_aux_tasks
merge semantics, _reset_aux_to_auto inclusion of plugin tasks, and
default-layering in auxiliary_client.

Updates the gateway-bridge code-parity test (test_auxiliary_config_bridge)
to assert the new dynamic shape rather than the hardcoded literal env
var names which no longer appear post-refactor.

Motivation: this unblocks PR #20262 (hindsight smart retain pipeline)
and similar plugins that need a dedicated aux task slot. The change
is non-breaking — built-in env vars (AUXILIARY_VISION_PROVIDER, etc.)
keep working since they're produced by the same f-string template
that built the hardcoded names.
2026-05-23 17:49:47 -07:00
soynchux
e8fa415a9e fix(cli): validate runtime token refresh capability in Qwen auth status 2026-05-23 17:47:36 -07:00
teknium1
4254f7dd17 refactor(skills): slim AST diagnostic to single entry point
Trim ~600 LOC off the original contribution while keeping the same
operator-facing surface and detection coverage.

- Collapse three entry points (file / dir / bundle) into one
  ast_scan_path(path) that handles both files and directories.
- Drop AstFinding dataclass + severity field — replaced with plain
  (file, line, pattern_id, description) tuples. Severity ordering was
  display-only for a diagnostic that explicitly disclaims security
  verdicts, so the field added bookkeeping without earning its place.
- Replace Rich-markup formatter with plain text grouped by file.
- Drop the 'inspect --ast-deep' surface — same scanner, same output as
  'audit --deep', single CLI entry is enough. Operators audit after
  install; pre-install inspection signal isn't worth the second surface.
- Trim test file to the cases that earn their place: bypass payload,
  syntax error survival, RecursionError survival, false-positive guard
  (importer lookalike), literal-arg false-positive guard, non-.py
  ignored, directory recursion + cache-dir skipping, missing-path,
  getattr/__dict__ detection, formatter empty + populated.

Net: tools/skills_ast_audit.py 353 -> 133 LOC,
tests/tools/test_skills_ast_audit.py 299 -> 103 LOC, full diff
+704/-12 -> +264/-6. No change to tools/skills_guard.py — Skills Guard
verdicts remain untouched per SECURITY.md §2.4.
2026-05-23 17:47:26 -07:00
Tranquil-Flow
7255050c99 feat(skills): add opt-in AST deep diagnostics
Add opt-in AST diagnostics for skill review without making Skills Guard stricter by default.

- Add hermes skills inspect --ast-deep to scan fetched skill bundles before installation
- Add hermes skills audit --deep to scan already-installed hub skills
- Keep AST analysis in tools/skills_ast_audit.py, separate from tools/skills_guard.py
- Label output as diagnostic hints, not security verdicts
- Cover dynamic import/access patterns: importlib, __import__(computed), getattr(computed), and __dict__[computed]

This follows the maintainer guidance from closed PR #7436: useful AST-level analysis belongs in an opt-in diagnostic path, not in Skills Guard's default heuristic scan.
2026-05-23 17:47:26 -07:00
Yuan Li
75643a6154 fix(env): strip null bytes from .env before python-dotenv loads
Null bytes in API key values (introduced by copy-paste) crash
    os.environ[k] = v with ValueError: embedded null byte, preventing
    hermes from starting at all.
2026-05-23 17:17:05 -07:00
0xchainer
2c34a7da87 fix(cli): prevent temp directory leak on ZIP update failure
Move shutil.rmtree into a finally block so the temp directory is always
cleaned up, even when an exception occurs during download, extraction,
or file copying.
2026-05-23 16:16:35 -07:00
Teknium
6a8e131a0a refactor(ntfy): convert built-in adapter to platform plugin
ntfy now ships as a self-contained plugin under plugins/platforms/ntfy/
instead of editing 8 core files (gateway/config.py Platform enum,
gateway/run.py factory + auth maps, cron/scheduler.py, toolsets.py,
hermes_cli/status.py, agent/prompt_builder.py, gateway/channel_directory.py,
tools/send_message_tool.py).

All routing goes through gateway/platform_registry via register_platform():
- adapter_factory, check_fn, validate_config, is_connected
- env_enablement_fn seeds PlatformConfig.extra from NTFY_* env vars so
  gateway status reflects env-only setups without instantiating httpx
- standalone_sender_fn handles deliver=ntfy cron jobs when cron runs
  out-of-process from the gateway
- allowed_users_env / allow_all_env hook into _is_user_authorized
- cron_deliver_env_var=NTFY_HOME_CHANNEL for cron home routing
- platform_hint surfaces in the system prompt
- pii_safe=True (topic names are the only identifier; no PII to redact)

Tests moved to tests/gateway/test_ntfy_plugin.py using _plugin_adapter_loader
so the module lives under plugin_adapter_ntfy in sys.modules and cannot
collide with sibling plugin-adapter tests on the same xdist worker. The
core-file grep tests (Platform.NTFY in source, hermes-ntfy in toolsets,
etc.) are replaced with plugin-shape tests covering register() metadata,
env_enablement_fn output, and standalone_sender_fn behavior.

68 tests pass under scripts/run_tests.sh.
2026-05-23 16:13:01 -07:00
sprmn24
b10f17bf1e feat(ntfy): add ntfy platform adapter with atomic reconnect, identity fix, and 81 tests 2026-05-23 16:13:01 -07:00
Teknium
ad11327db0
feat(kanban): warn users that scratch workspaces are deleted on completion (#30949)
First scratch workspace creation on an install now emits a one-shot
warning log + a 'tip_scratch_workspace' event on the task. Sentinel
file at ~/.hermes/kanban/.scratch_tip_shown silences subsequent
creations across the whole install.

Behavior unchanged — scratch is still ephemeral by design. This just
makes the design visible to new users (reported in user community:
'progress files vanished, no warning anywhere').

Docs (en + ko) updated to spell out 'Deleted when the task completes'
on the scratch bullet and 'Preserved on completion' on worktree/dir.
2026-05-23 11:27:00 -07:00
teknium1
c4b8f5efee fix(kanban): harden corrupt-db backup against CodeQL path-injection findings
Path.resolve() before any I/O and confine backup writes to the resolved
parent directory. Adds explicit parent-equality assertions so static
analyzers see the containment guarantee, and walks WAL/SHM sidecars
through the same resolved-parent path so accidental .. segments are
collapsed before shutil.copy2.

Functionally equivalent to the original PR; preserves the corrupt bytes
to <db>.corrupt.<ts>.bak in the same directory, still raises
KanbanDbCorruptError from connect(). E2E with Stefan's exact hex header
+ malformed pages still passes. 163/163 kanban tests still pass.
2026-05-23 05:51:33 -07:00
Nick
39fe4ecee3 fix(kanban): refuse corrupt db auto-init 2026-05-23 05:51:33 -07:00
QuenVix
7245bc77eb fix(fallback): merge fallback_providers with legacy fallback_model configurations 2026-05-23 05:24:57 -07:00
Teknium
b4cf5b65dd
feat(portal): one-shot setup, status CLI, and Nous-included markers (#30860)
* feat(portal): one-shot setup, status CLI, and Nous-included markers

Four small Portal-aware surfaces that drive subscription value without
adding friction for non-Portal users.

  - hermes setup --portal: one-shot Nous OAuth + provider switch + Tool
    Gateway opt-in. Shareable as a single command from docs/social.
  - hermes portal {status,open,tools}: small surface over Portal auth +
    Tool Gateway routing. Defaults to 'status' when no subcommand.
  - Tool picker (hermes tools): when the user is logged into Nous, mark
    Nous-managed provider rows with a star and 'Included with your Nous
    subscription'. Suppressed when not authed — non-subscribers see the
    picker unchanged.
  - BYOK setup hint: a single dim line 'Available through Nous Portal
    subscription.' appears when the user is being prompted for a paid
    API key (Firecrawl, FAL, ElevenLabs, Browserbase, etc.) AND the
    category has a Nous-managed sibling AND the user is not already
    authed to Nous. Suppressed in all other cases.

Tested live end-to-end in an isolated HERMES_HOME with a simulated
authed and unauthed user. Targeted suite (tests/hermes_cli/
test_tools_config.py + test_setup.py) passes 97/97.

* fix: add portal to _BUILTIN_SUBCOMMANDS so plugin discovery fast-path skips it
2026-05-23 02:39:09 -07:00
sprmn24
b183be95a2 fix(gateway-windows): atomic write for .cmd and startup launcher scripts 2026-05-23 02:30:41 -07:00
xxxigm
8bf99227f0 fix(plugins): block plugin-api path traversal + project RCE (#29156)
GHSA-5qr3-c538-wm9j — half two of the bypass chain.

``_mount_plugin_api_routes`` imports each dashboard plugin's
manifest ``api`` field as a Python module via
``importlib.util.spec_from_file_location`` — arbitrary code
execution by design.  Two primitives in the surrounding code
turned that "by design" RCE into a usable attack:

1. Absolute paths in the manifest swallow the plugin directory.
   ``Path('safe/dashboard') / '/tmp/evil.py'`` resolves to
   ``/tmp/evil.py``, so a single manifest line
   ``{"api": "/tmp/payload.py"}`` was enough to redirect the
   importer at any Python file on disk.
2. ``..`` traversal in the manifest climbs out of the dashboard
   directory.  ``Path('plugins/safe/dashboard') /
   '../../../tmp/evil.py'`` lands in ``/tmp/evil.py`` after
   ``resolve()`` — the static-asset handler
   (``serve_plugin_asset``) already defends against this via
   ``is_relative_to``; the api-mount path didn't.

Fix at three layers so a regression in any one can't re-open the
advisory:

* New ``_safe_plugin_api_relpath`` validator runs at *discovery*
  time and stores only sanitised relative paths on the plugin
  entry's ``_api_file`` field.  Absolute paths, ``..`` traversal,
  empty / non-string values, and paths that ``resolve()`` outside
  the plugin's ``dashboard/`` directory are rejected with a
  warning naming the plugin.  ``has_api`` follows the sanitised
  value so the dashboard frontend doesn't render a fake "Backend
  API" badge for plugins whose api was scrubbed.
* ``_mount_plugin_api_routes`` re-validates the resolved path
  against the live filesystem just before the import — defence in
  depth in case ``_dir`` is tampered with post-cache or a future
  caller bypasses the discovery-time validator.
* Project plugins (``source == "project"``) are refused outright
  for backend import.  ``./.hermes/plugins/`` ships with the CWD,
  so any threat model that includes "user opens a malicious repo"
  treats it as attacker-controlled; project plugins can still
  extend the UI via static JS/CSS but their Python ``api`` is no
  longer auto-imported.  Combined with the truthy env-gate fix
  from the previous commit, the original advisory chain now
  fails at two distinct choke points.
2026-05-23 01:43:52 -07:00