hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-30 06:41:51 +00:00

Author	SHA1	Message	Date
Ben	541b40532a	fix(container_boot): publish reconciled service dirs atomically PR #30136 review noted the asymmetry: `register_profile_gateway` used tmp_dir + rename to publish a new service slot atomically, but the boot-time reconciler wrote files into the slot directly. Same underlying concern (a concurrent s6-svscan rescan could observe a half-populated directory), different code path. Rewrite `container_boot._register_service` to mirror the manager: build everything in `<scandir>/gateway-<profile>.tmp/`, then `Path.replace` into place. If a previous interrupted run left a `.tmp` sibling, it's cleaned up before the new build starts. If the target already exists, it's removed before the rename so `Path.replace` doesn't error on a non-empty target (Linux `rename` overwrites empty targets only). Three new tests: atomic publication leaves no .tmp leftovers, overwriting an existing slot still leaves no .tmp leftovers, and a stale .tmp from an interrupted run is cleaned up automatically.	2026-05-23 15:34:51 +10:00
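The build-in-`.tmp`-then-rename pattern described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code in `container_boot._register_service`: the function name `publish_service_dir` and the `files` mapping are hypothetical, and only the ordering (clean stale `.tmp`, build, clear target, `Path.replace`) is taken from the commit message.

```python
import shutil
from pathlib import Path


def publish_service_dir(scandir: Path, name: str, files: dict[str, str]) -> Path:
    """Build a service slot in a .tmp sibling, then rename it into place.

    Hypothetical helper: `files` maps relative paths to file contents.
    """
    target = scandir / name
    tmp = scandir / f"{name}.tmp"

    # A previous interrupted run may have left a stale .tmp sibling behind.
    if tmp.exists():
        shutil.rmtree(tmp)

    tmp.mkdir(parents=True)
    for rel_path, content in files.items():
        dest = tmp / rel_path
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(content)

    # rename(2) will not overwrite a non-empty directory, so clear the target first.
    if target.exists():
        shutil.rmtree(target)

    # The slot becomes visible to a scanner in a single rename.
    tmp.replace(target)
    return target
```

A scanner that looks at `scandir` between the `rmtree` and the `replace` sees no slot at all rather than a half-populated one, which is the property the commit is after.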
Ben	5b1fcdd16b	fix(container_boot): rotate container-boot.log when it exceeds 256 KiB PR #30136 review noted: container-boot.log was append-only with no rotation. On a long-lived container with frequent restarts and many profiles it would grow unboundedly (~80 B per profile per reconcile pass). Add a soft cap: when the file size hits 256 KiB (`_LOG_ROTATE_BYTES`, ≈3000 reconcile lines, ≈1 year of daily reboots × 5 profiles), the current file is renamed to `container-boot.log.1` (replacing any existing one) before new entries are appended. Worst case is two files at ~512 KiB — well within visibility limits for grep/cat. Rotation is intentionally simple (no logrotate or s6-log machinery for one append-only file). Failures during rotation are logged via the module logger and treated as non-fatal — we keep appending to the existing file rather than dropping the reconcile entry. Three new unit tests cover above-threshold rotation, below-threshold non-rotation, and overwrite of an existing .1 file.	2026-05-23 15:33:11 +10:00
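A size-capped append like the one described above might look like the sketch below. The 256 KiB threshold and the `.1` suffix come from the commit message; the function name `append_with_rotation` and its signature are assumptions made for illustration.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

LOG_ROTATE_BYTES = 256 * 1024  # soft cap taken from the commit message


def append_with_rotation(log_path: Path, line: str, limit: int = LOG_ROTATE_BYTES) -> None:
    """Append one line, first rotating the file to `<name>.1` once it reaches `limit`."""
    try:
        if log_path.exists() and log_path.stat().st_size >= limit:
            # Path.replace overwrites any existing `.1` file.
            log_path.replace(log_path.with_name(log_path.name + ".1"))
    except OSError as exc:
        # Rotation failure is non-fatal: keep appending to the existing file.
        logger.warning("log rotation failed for %s: %s", log_path, exc)

    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line.rstrip("\n") + "\n")
```

With at most one rotated sibling, the worst case on disk stays near twice the cap, matching the ~512 KiB figure quoted in the commit.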
Ben	8b6733ebe2	fix(service_manager): rip out dead port parameter PR #30136 review caught: `_allocate_gateway_port()` in profiles.py computed a SHA-256-derived port that was threaded through `register_profile_gateway(profile, port=N)` → `_render_run_script(profile, port, extra_env)` → and then ignored. The rendered run script picked the bind port from the profile's config.yaml (`[gateway] port = …`), never from the allocator. So the entire allocator + parameter chain was dead code. Remove: * `hermes_cli.profiles._allocate_gateway_port` (deterministic SHA-256 → [9200, 9800) — never used). * `port` kwarg from `ServiceManager.register_profile_gateway` (Protocol + Mixin + S6 implementation). * `port` positional arg from `_render_run_script(profile, port, extra_env)` — now `_render_run_script(profile, extra_env)`. * The pass-through call in `profiles._maybe_register_gateway_service`. config.yaml is now the single source of truth for gateway port selection — matches reality and reduces the API surface. Three explanatory comments in service_manager.py / profiles.py document the retirement so future readers don't reach for the allocator and find a ghost. Tests: drop the three `_allocate_gateway_port` tests; update fakes' signatures throughout test_service_manager.py and test_profiles_s6_hooks.py to match the new no-port API.	2026-05-23 15:30:15 +10:00
Ben	1759c0f090	fix(service_manager): friendly errors for missing slots and s6-svc failures PR #30136 review caught: `S6ServiceManager.start/stop/restart` called `subprocess.run(check=True)` on `s6-svc`, so any failure surfaced as a raw `CalledProcessError` traceback. The two cases operators actually hit are: 1. The service slot doesn't exist — most commonly because the user typed a profile name wrong (`hermes -p typo gateway start`). 2. s6-svc itself fails — most commonly EACCES on the supervise control FIFO when running unprivileged. Both deserve named errors with actionable messages, not stacktraces. Changes: * Add `S6Error` base + two concrete errors in `hermes_cli.service_manager`: - `GatewayNotRegisteredError(profile)` — carries the unprefixed profile name; message: `no such gateway 'typo': register it with `hermes profile create typo` first, or pass an existing profile name via `-p <name>``. - `S6CommandError(service, action, returncode, stderr)` — carries the s6-svc rc and stderr; message: `s6-svc start on 'gateway-coder' failed (rc=111): <stderr>`. * Factor lifecycle dispatch through `_run_svc(flag, label, name)`: pre-checks that the service directory exists (raises GatewayNotRegisteredError before invoking s6-svc), then runs s6-svc and translates any CalledProcessError into S6CommandError. * `_dispatch_via_service_manager_if_s6` in `hermes_cli.gateway` catches both errors and prints `✗ <message>` + `sys.exit(1)` instead of letting the exception bubble. The dispatch path that used to dump a traceback at the user now gives an actionable one-liner. Tests: 6 new tests for the error types and their CLI rendering; existing lifecycle test pre-seeds the slot directory before calling `mgr.start` etc.	2026-05-23 15:20:41 +10:00
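The error translation described above can be sketched as below. The exception names and message formats follow the commit message, but the bodies are a guess at the shape rather than the code in `hermes_cli.service_manager`. The `command` parameter is a hypothetical stand-in for the real `s6-svc` invocation so the sketch can be exercised without s6 installed.

```python
import subprocess
from pathlib import Path


class S6Error(Exception):
    """Base class for the named s6 errors."""


class GatewayNotRegisteredError(S6Error):
    def __init__(self, profile: str) -> None:
        self.profile = profile
        super().__init__(
            f"no such gateway {profile!r}: register it with "
            f"`hermes profile create {profile}` first, or pass an existing "
            f"profile name via `-p <name>`"
        )


class S6CommandError(S6Error):
    def __init__(self, service: str, action: str, returncode: int, stderr: str) -> None:
        self.service = service
        self.action = action
        self.returncode = returncode
        self.stderr = stderr
        super().__init__(f"s6-svc {action} on {service!r} failed (rc={returncode}): {stderr}")


def run_svc(command: list[str], action: str, scandir: Path, profile: str) -> None:
    """Run `command <service-dir>`, translating failures into named errors."""
    service = f"gateway-{profile}"
    service_dir = scandir / service
    # Pre-check the slot so a typo never reaches the subprocess call.
    if not service_dir.is_dir():
        raise GatewayNotRegisteredError(profile)
    try:
        subprocess.run(
            [*command, str(service_dir)], check=True, capture_output=True, text=True
        )
    except subprocess.CalledProcessError as exc:
        stderr = (exc.stderr or "").strip()
        raise S6CommandError(service, action, exc.returncode, stderr) from exc
```

A CLI layer would then catch `S6Error`, print `✗ <message>`, and exit with status 1 instead of letting a traceback reach the user.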
Ben	367c15b1dc	fix(container_boot): always register gateway-default slot PR #30136 review caught: `hermes gateway start` (no `-p`) inside the container resolves `_profile_suffix() == ""` → service name `gateway-default`, but no such slot was ever registered. The Phase 4 profile-create hook only fired on `hermes profile create <name>`, and the root profile (which lives at the top of $HERMES_HOME, not under `profiles/`) was never one of those. So bare `hermes gateway start` landed on `s6-svc -u /run/service/gateway-default` → uncaught `CalledProcessError` → traceback to the user. Changes: 1. `reconcile_profile_gateways` now always registers a `gateway-default` slot before iterating named profiles. Its prior state is read from `$HERMES_HOME/gateway_state.json` (sibling to the profile root, not under `profiles/`); stale runtime files there are swept the same way. Auto-up only if the prior state was `running` — same rule as named profiles. 2. `S6ServiceManager._render_run_script` special-cases `profile == "default"` to emit `hermes gateway run` with NO `-p` flag. Passing `-p default` would resolve to `$HERMES_HOME/profiles/default/` — a different profile that almost certainly doesn't exist. The empty profile-suffix convention is the dispatcher's contract and the run script has to match. 3. A user-created `profiles/default/` collides with the reserved root-profile slot; the reconciler now skips it with a warning rather than producing two registrations of the same service name. Action-list ordering is stable: `default` first, then named profiles in directory order. Boot-log readers can rely on this. Tests: 8 new dedicated default-slot tests plus updates to every existing test that asserted against the action list (via the new `_named_actions` helper that drops the always-present default entry).	2026-05-23 15:16:35 +10:00
Ben	efd3569739	fix(gateway): route --all stop/restart through s6 under container PR #30136 review caught that `hermes gateway stop --all` and `... restart --all` were broken under s6. The Phase 4 dispatcher was gated on `not stop_all` (and the symmetric restart_all), so `--all` fell through to `kill_gateway_processes(all_profiles=True)`. pkill SIGTERMed every gateway, s6-supervise observed the crashes, and restarted every gateway ~1s later — net effect: `--all` kicked gateways instead of stopping them. Add `_dispatch_all_via_service_manager_if_s6(action)` that iterates `mgr.list_profile_gateways()` and routes stop/restart through each service slot. s6's `want up`/`want down` flips correctly, so a stop persists. Partial failures are surfaced per-profile with a running success count; the host pkill path is only reached when s6 isn't in play. `start --all` isn't a CLI surface — the helper rejects it and returns False (host code path can take over).	2026-05-23 15:08:17 +10:00
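The per-profile iteration with partial-failure reporting described above could be sketched as follows. The function name `dispatch_all`, the `run` callback, and the tuple return value are all assumptions for illustration. The real helper iterates `mgr.list_profile_gateways()` and, per the commit message, simply returns False for unsupported actions.

```python
from typing import Callable


def dispatch_all(
    action: str,
    profiles: list[str],
    run: Callable[[str, str], None],
) -> tuple[bool, int, dict[str, str]]:
    """Route `action` through every profile, collecting failures per profile.

    Returns (handled, succeeded, failures). `handled` is False when the action
    is not one this path supports, so a caller can fall back to another path.
    """
    if action not in ("stop", "restart"):
        return False, 0, {}

    succeeded = 0
    failures: dict[str, str] = {}
    for profile in profiles:
        try:
            run(action, profile)
        except Exception as exc:  # one bad slot should not abort the rest
            failures[profile] = str(exc)
            print(f"✗ {profile}: {exc}")
        else:
            succeeded += 1

    print(f"{action}: {succeeded}/{len(profiles)} gateways ok")
    return True, succeeded, failures
```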
Ben	2f8ceeab9a	fix(service_manager): s6 detection works for unprivileged hermes user PR #30136 review surfaced two issues, both rooted in the same audit gap: docker integration tests were running as root, not the unprivileged `hermes` user (UID 10000) that the runtime actually uses via `s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to the s6 control surface worked as root in the tests but was inert in production. Fixes: 1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`, which is root-only readable. For UID 10000 the symlink yields PermissionError, `resolve()` silently returns the unresolved path, and `exe.name == "exe"` — so detection always returned False, the service-manager runtime-registration path was inert, and every `hermes profile create` / `hermes -p X gateway start` silently skipped the s6 hook. Replace with `/proc/1/comm` (world-readable) + `/run/s6/basedir` (s6-overlay-specific) — both required, fail closed. 2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/` {control,lock} to hermes so `s6-svscanctl -a/-an` works without root. Previously the directory chown stopped at `/run/service` and the FIFO inside stayed root-owned, so `register_profile_gateway` from hermes failed at the rescan-trigger step with EACCES — the wrapper in profiles.py caught the exception and printed a swallowed warning, so profile creation appeared to succeed while the slot was rolled back. Audit changes to flush this class of bug next time: - Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py` that default to `-u hermes`. The module docstring explains why and flags `user="root"` as opt-in only for tests that explicitly need root (none currently do). - Refactor every `docker exec` call in tests/docker/ through the new helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py, test_container_restart.py, test_s6_profile_gateway_integration.py). 
- Add 5 unit tests covering `_s6_running` under various probe states (both signals present; comm wrong; basedir missing; PermissionError on /proc/1/comm; missing /proc — non-Linux). The PermissionError test is the explicit regression guard for the original bug. Known follow-up: the per-service `supervise/control` FIFO inside each `/run/service/gateway-<profile>/supervise/` is created root-owned by s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc -u/-d/-t` from the hermes user will get EACCES on those. The audit under `-u hermes` will reveal this in lifecycle tests — surfacing the issue cleanly so it can be fixed in a focused follow-up (likely via a small SUID helper or a polling chown loop in cont-init.d). The detection + svscanctl fixes here are independent and complete on their own.	2026-05-23 14:56:39 +10:00
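The two-signal, fail-closed detection described in the commit above can be sketched like this. The probe paths come from the commit message; the expected `comm` value of `s6-svscan` is an assumption (the message only says s6-svscan is PID 1), and the injectable parameters exist purely so the sketch is testable.

```python
from pathlib import Path


def s6_running(
    comm_path: Path = Path("/proc/1/comm"),
    basedir: Path = Path("/run/s6/basedir"),
    expected_comm: str = "s6-svscan",  # assumption: PID 1's comm under s6-overlay
) -> bool:
    """Return True only when both probes succeed; any error means 'not s6'."""
    try:
        comm = comm_path.read_text().strip()
        has_basedir = basedir.exists()
    except OSError:
        # PermissionError, or a missing /proc on non-Linux hosts: fail closed.
        return False
    return comm == expected_comm and has_basedir
```

Requiring both signals means a stray `/run/s6/basedir` on a host, or an unrelated PID 1 with a matching name, is not enough on its own to flip detection.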
Ben	a6f7171a5e	feat(docker): remove gosu from bundled image; s6-setuidgid handles privilege drop The s6-overlay migration replaced every runtime use of gosu with s6-setuidgid (in stage2-hook.sh, main-wrapper.sh, per-service run scripts, and cont-init.d hooks), but the gosu binary itself was still being copied into the image from tianon/gosu, and several comments across the repo still pointed to it. Image changes: - Drop the FROM tianon/gosu:1.19-trixie AS gosu_source stage - Drop the COPY --from=gosu_source /gosu /usr/local/bin/ layer - Net: one fewer base-image pull, ~12-15 MB layer eliminated Documentation/comment refresh (no behavior change): - Dockerfile: update root-user rationale comment + cont-init.d comment - docker/main-wrapper.sh: drop "pre-s6 contract (gosu drop)" reference - docker-compose.yml: update UID/GID remap comment - .hadolint.yaml: update DL3002 ignore rationale - website/docs/user-guide/docker.md: privilege-drop helper is s6-setuidgid now - hermes_cli/config.py: docker_run_as_host_user docstring tools/environments/docker.py runs arbitrary user images via the terminal backend, not the bundled Hermes image. It still needs SETUID/SETGID caps so user images that use gosu/su/s6-setuidgid all work. Renamed the cap-list constant _GOSU_CAP_ARGS → _PRIVDROP_CAP_ARGS and updated comments to list s6-setuidgid alongside the others as examples. The matching test (test_security_args_include_setuid_setgid_for_gosu_drop → test_security_args_include_setuid_setgid_for_privdrop) was renamed and its docstring updated; behavior is unchanged.
Verification: - hadolint clean against .hadolint.yaml - shellcheck clean against all docker/ shell scripts - Image rebuilt successfully (sha 1a090924ccea) - Docker harness: 19 passed in 41.87s (every Phase 0 test + Phase 4 per-profile-gateway lifecycle + container-restart reconciliation) - tests/tools/test_docker_environment.py: 23 passed (rename did not break test discovery; pre-existing unrelated mock warning) The plan document (docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md) intentionally retains its historical references to gosu — it describes the pre-s6 entrypoint as background for understanding the migration.	2026-05-22 11:47:42 +10:00
Ben	7d07dd60a8	docs(s6): document container supervision; doctor + skill + user-guide updates Phase 5 of the s6-overlay supervision plan. Documentation + small diagnostic cleanups; no behavior changes. website/docs/user-guide/docker.md: - Replace the old 'entrypoint script does the bootstrap' section with the s6-overlay boot flow (cont-init.d/01-hermes-setup, cont-init.d/02-reconcile-profiles, static main-hermes + dashboard services, ENTRYPOINT-as-main-program pattern). - Add a 'Per-profile gateway supervision' subsection covering the new lifecycle commands, restart semantics, log persistence, and 'Manager: s6 (container supervisor)' status reporting. - Add 'Breaking change vs. pre-s6 images' callout naming the /init ENTRYPOINT and pointing affected wrappers at the pin workaround. website/docs/user-guide/profiles.md: - Add a note under 'Persistent services' pointing container users at the docker.md section explaining s6 supervision inside the image. Host-side systemd/launchd documentation is unchanged. skills/software-development/hermes-s6-container-supervision/SKILL.md: - New maintainer skill covering the supervision-tree map, file layout, the Architecture B rationale (cont-init.d args + halt exit-code propagation), quick recipes, and the 8 pitfalls we hit while implementing the plan (PATH-without-/command, root-owned profile dirs, SOUL.md as marker, the '143' anti-pattern, etc.). hermes_cli/doctor.py: - _check_gateway_service_linger skips on s6 (the linger concept doesn't apply inside the container). - New _check_s6_supervision section reports main-hermes/dashboard state and per-profile-gateway count (registered vs supervised up), only inside the s6 container. Host doctor output unchanged. - External Tools / Docker check no longer emits a 'docker not found' warning inside the container; prints an explanatory info line instead. Still respects an explicit TERMINAL_ENV=docker (in case the user mounted /var/run/docker.sock). 
hermes_cli/gateway.py: - Document _container_systemd_operational more precisely: it's NOT for our Hermes Docker image (s6-overlay handles that via detect_service_manager() == 's6'). It still covers systemd-nspawn / k8s-with-systemd-init cases, so leaving it in place is correct; the docstring just makes that explicit. Test harness (verification, no test changes in this commit): 19 passed, 0 xfailed. 66 service-manager / container-boot / profiles-s6-hooks / gateway-s6-dispatch unit tests still green. 61 doctor tests still green. Hadolint + shellcheck clean. Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-22 11:47:42 +10:00
Ben	57c6e29666	feat(docker): per-profile s6 supervision + container-restart reconciliation Phase 4 of the s6-overlay supervision plan. Activates the Phase 3 S6ServiceManager by hooking it into the profile lifecycle and the `hermes gateway start/stop/restart` dispatcher, and adds a cont-init.d-time reconciliation pass that survives `docker restart`. Task 4.0 — container-boot reconciliation: /run/service/ is tmpfs, so every `docker restart` wipes every per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles invokes hermes_cli.container_boot.reconcile_profile_gateways() on every boot, which walks $HERMES_HOME/profiles/<name>/, reads each gateway_state.json, recreates the s6 service slot, and auto-starts only those whose last state was 'running'. Other states (stopped, starting, startup_failed, missing) register the slot in the down state — avoiding crash-loops across restarts for a gateway that was broken last boot. Per-profile outcome is recorded to $HERMES_HOME/logs/container-boot.log. Implementation: hermes_cli/container_boot.py + 12 unit tests. Profile-marker is SOUL.md, not config.yaml, because `hermes profile create` only seeds SOUL.md by default (config.yaml comes from `hermes setup`). Task 4.1 / 4.2 — profile create/delete hooks: hermes_cli/profiles.py::create_profile now calls _maybe_register_gateway_service(<canon>) at the end, which routes through ServiceManager.register_profile_gateway when running on s6 and no-ops on host backends. delete_profile mirrors with _maybe_unregister_gateway_service. _allocate_gateway_port produces a deterministic SHA-256-derived port in [9200, 9800). Task 4.3 — gateway dispatch + remove rejection arms: _dispatch_via_service_manager_if_s6(action) intercepts start/stop/restart at the top of each subcommand and routes them through S6ServiceManager.{start,stop,restart}.
The pre-Phase-4 `elif is_container():` rejection arms are kept as fallback for pre-s6 containers / unsupported runtimes, but only ever fire when detect_service_manager() != 's6'. install/uninstall under s6 print informational guidance pointing users at profile create/delete. Removed the two xfail(strict=True) markers from tests/docker/test_profile_gateway.py — both tests now pass strictly. Task 4.4 — status reporting: get_gateway_runtime_snapshot() reports Manager: 's6 (container supervisor)' inside an s6 container instead of 'docker (foreground)'. Plan-vs-reality drift fixed in this commit: - Plan's S6ServiceManager._render_run_script used `gateway start --foreground --port {port}` — invented args; the real CLI is `gateway run`. Switched accordingly. port arg retained for API parity but now documented as 'currently ignored'. - Plan's reconciler keyed on config.yaml; switched to SOUL.md (config.yaml is created by hermes setup, not by hermes profile create, so the original gate caught nothing). - The plan's _dispatch helper used _profile_arg() which returns '--profile <name>' (i.e. with the flag prefix). Switched to _profile_suffix() which returns the bare name. - Architecture B's docker exec doesn't get /command on PATH or the venv on PATH; Dockerfile's runtime PATH now includes /opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works without sourcing the venv. - stage2-hook now chowns $HERMES_HOME/profiles to hermes on every boot, not just on the UID-remap path. Without this, files created by docker-exec-as-root accumulate and the next reconciler run fails with PermissionError reading SOUL.md. Test harness: 19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to passing). 78 unit tests across service_manager + container_boot + profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck pass cleanly. Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-22 11:47:42 +10:00
Ben	ad5fdab092	feat(service_manager): add S6ServiceManager for runtime gateway supervision Phase 3 of the s6-overlay supervision plan. Implements the runtime-registration surface from D4 — only the s6 backend supports register_profile_gateway / unregister_profile_gateway / list_profile_gateways; host backends continue to raise NotImplementedError. No caller yet (Phase 4 wires in the profile create/delete hooks). Key implementation notes: - Service directory shape: /run/service/gateway-<profile>/{type,run,log/run}. Atomic register: write to gateway-<profile>.tmp, fsync via os.rename. Cleanup on rescan failure. - Run script uses #!/command/with-contenv sh so HERMES_HOME and any extra_env arrive at exec time. The hermes -p <profile> gateway start --foreground --port <port> command is wrapped in s6-setuidgid hermes for the per-service privilege drop (OQ2-A). - Log script (OQ8-C): persists via s6-log to ${HERMES_HOME}/logs/gateways/<profile>/. CRITICAL — HERMES_HOME is a runtime env-var expansion in the rendered script, NOT a Python f-string substitution. Negative-asserted in test_s6_register_creates_service_dir_and_triggers_scan so regressions are caught. - PATH gotcha: /command/ is only on PATH for processes spawned by the supervision tree (services, cont-init.d). `docker exec` and profile-create hooks don't get it. S6ServiceManager calls all s6-* binaries via absolute path through the new _S6_BIN_DIR constant so callers don't have to fix up env vars. - validate_profile_name rejects path-traversal, leading-dash (s6 would parse as a flag), uppercase, whitespace, and names >251 chars (s6-svscan default name_max). Test coverage: - 13 new unit tests in tests/hermes_cli/test_service_manager.py (kind detection, run-script content, env quoting, register rollback on rescan failure, unregister idempotence, list filter, lifecycle dispatch, svstat parsing). Total: 36 passing.
- 2 new in-container integration tests in tests/docker/test_s6_profile_gateway_integration.py validating end-to-end registration against a real s6 supervision tree. Docker harness: 14 passed, 2 xfailed (Phase 4 target unchanged). Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md	2026-05-22 11:47:41 +10:00
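The name checks listed for `validate_profile_name` in the commit above can be sketched as below. Only the rejected cases (path traversal, leading dash, uppercase, whitespace, more than 251 characters) come from the commit message. The function name `check_profile_name` and the exact error wording are hypothetical, and the real validator may enforce additional rules.

```python
MAX_PROFILE_NAME = 251  # limit quoted in the commit message


def check_profile_name(name: str) -> str:
    """Return `name` unchanged if it passes every check, else raise ValueError."""
    if not name:
        raise ValueError("profile name is empty")
    if len(name) > MAX_PROFILE_NAME:
        raise ValueError(f"profile name is longer than {MAX_PROFILE_NAME} characters")
    if name.startswith("-"):
        # A leading dash would be parsed as a flag by command-line tools.
        raise ValueError("profile name must not start with '-'")
    if "/" in name or "\\" in name or ".." in name or name == ".":
        raise ValueError("profile name must not contain path components")
    if any(ch.isspace() for ch in name):
        raise ValueError("profile name must not contain whitespace")
    if name != name.lower():
        raise ValueError("profile name must be lowercase")
    return name
```

Because the name is later embedded in a directory name (`gateway-<profile>`) and passed on a command line, rejecting these shapes up front keeps both uses unambiguous.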
Ben	cf6133495c	feat(service_manager): add ServiceManager protocol + host wrappers Phase 1 of the s6-overlay supervision plan. Pure-refactor addition: introduces the abstract interface (with runtime_checkable Protocol), detect_service_manager(), validate_profile_name(), and thin SystemdServiceManager / LaunchdServiceManager / WindowsServiceManager wrappers around the existing systemd_* / launchd_* / gateway_windows.* module-level functions. No host call site was modified — host code continues to use the existing functions directly; the protocol is for new backend-agnostic code (Phase 4 profile create/delete hooks and the Phase 4 s6 dispatch path in 'hermes gateway start/stop/restart'). WindowsServiceManager.install() forwards the v3 kwargs (start_now, start_on_login, elevated_handoff) added in PRs #28169-adjacent so non-Windows callers — there aren't any today — can opt in. The s6 backend lands in Phase 3; until then get_service_manager() raises a clear error if invoked on a host that detects as 's6'.	2026-05-22 11:47:41 +10:00
Teknium	2a474bcf72	fix(termux): resolve packed-refs and worktree refs in skill-sync fingerprint The bundled-skill sync stamp added in the cherry-picked salvage commit parsed .git/HEAD and looked for a loose ref file in the worktree gitdir only, so two real cases hit the unresolved branch: - repos after `git gc` where active refs live in packed-refs - linked worktrees, whose branch ref lives in <commondir>/refs/heads/ (verified on the worktree this salvage was built in) Both fell back to a constant-string fingerprint, so post-commit launches would never re-run the real skill sync. Now we resolve packed-refs and check both the worktree gitdir and the common dir for loose refs. Adds three tests covering: packed-refs resolution, worktree common-dir packed lookup, worktree common-dir loose lookup, and the explicit 'unresolved' marker (still stable + version-fallback-safe).	2026-05-21 17:19:05 -07:00
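The ref lookup order described above (loose refs in the worktree gitdir and the common dir, then `packed-refs`) can be sketched as follows. This is an illustrative helper, not the code in the skill-sync fingerprint: the name `resolve_ref` and its signature are assumptions. For a normal checkout, `gitdir` and `commondir` are the same directory.

```python
from pathlib import Path
from typing import Optional


def resolve_ref(gitdir: Path, commondir: Path, ref: str) -> Optional[str]:
    """Resolve a ref such as 'refs/heads/main' to a commit hash, or None."""
    # Loose refs win over packed ones; a linked worktree keeps branch refs
    # in the common dir, so both locations are checked.
    for base in (gitdir, commondir):
        loose = base / ref
        if loose.is_file():
            return loose.read_text().strip()

    packed = commondir / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            # Skip blanks, the '#' header, and '^' peeled-tag continuation lines.
            if not line or line[0] in "#^":
                continue
            sha, _, name = line.partition(" ")
            if name.strip() == ref:
                return sha
    return None
```

Returning `None` rather than a constant lets the caller decide how to mark the unresolved case, so it does not silently produce a fingerprint that never changes.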
adybag14-cyber	6dbbf20ff4	perf(termux): speed up non-tui cli startup	2026-05-21 17:19:05 -07:00
Teknium	3fde8c153d	fix(skills): prune dependency/venv dirs from all skill scanners (#30042) * fix(skills): skip dependency dirs in skill scan * fix(skills): widen sibling rglob scanners to use shared exclusion set Follow-up to PR #29968. The contributor's PR widened EXCLUDED_SKILL_DIRS in the canonical walker (iter_skill_index_files), which fixes the user-visible discovery path. This commit sweeps the ~12 other rglob('SKILL.md') sites that did their own ad-hoc filtering — most only checked .git/.hub, some had no filter at all — so dependency dirs (.venv, node_modules, site-packages, etc.) cannot leak ghost skills through the secondary paths. Adds agent.skill_utils.is_excluded_skill_path(path) helper. Migrates all 13 sites to use it. Removes 3 hardcoded duplicate filter sets. Sites touched: agent/curator_backup.py - skill backup file count gateway/run.py - disabled-skill response (2 sites) hermes_cli/dump.py - skill count in env dump hermes_cli/profile_describer.py - profile description (2 sites) hermes_cli/profile_distribution.py - profile install count hermes_cli/profiles.py - profile skill count hermes_cli/skills_hub.py - category detection tools/skill_manager_tool.py - skill name lookup (already used set, now uses helper) tools/skill_usage.py - usage tracking + skill dir lookup (2 sites) tools/skills_hub.py - optional skills find + scan (2 sites) tools/skills_sync.py - bundled skills sync E2E verified with the exact reported shape (bring/scripts/.venv/.../typer/.agents/skills/typer/SKILL.md): no sibling site picks up the ghost skill, all five legit-skill counts still return 1. * chore(infographic): retro-pop-grid bento for PR #30042 skill-scanner sweep --------- Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>	2026-05-21 14:18:02 -07:00
Teknium	552e9c7881	feat(secrets): Bitwarden Secrets Manager integration with lazy bws install (#30035) * feat(secrets): Bitwarden Secrets Manager integration with lazy bws install Pull API keys from Bitwarden Secrets Manager at process startup instead of storing them all in plaintext in ~/.hermes/.env. One bootstrap token (BWS_ACCESS_TOKEN) replaces N per-provider keys, and rotating a credential becomes a single change in the Bitwarden web app. Bitwarden defaults to source of truth: secrets pulled from BSM overwrite any matching env vars on startup so rotations actually take effect. Set secrets.bitwarden.override_existing: false in config.yaml to invert. The bws binary is auto-downloaded into ~/.hermes/bin/bws on first use (pinned to v2.0.0, SHA-256 verified against the GitHub release checksum file). No apt, brew, or sudo required. New surfaces: hermes secrets bitwarden setup — interactive wizard hermes secrets bitwarden status — config + binary + token state hermes secrets bitwarden sync — dry-run fetch / --apply exports hermes secrets bitwarden disable — flip enabled: false hermes secrets bitwarden install — just download the binary Failures (missing binary, bad token, no network) never block Hermes startup — they emit a one-line warning to stderr and continue with whatever credentials .env already had. Docs: website/docs/user-guide/secrets/{index,bitwarden}.md Tests: tests/test_bitwarden_secrets.py (26 tests, hermetic — bws subprocess and HTTP downloads fully mocked) * chore(infographic): add bitwarden-secrets-manager bento-grid retro-pop-grid Generated for PR #30035 — Bitwarden Secrets Manager integration. Style picked via pick_pr_infographic_style.py rotation: layout: bento-grid style: retro-pop-grid aspect: 1:1 square Saved at infographic/bitwarden-secrets-manager/infographic.png	2026-05-21 14:10:34 -07:00
teknium1	3d2f146460	fix(tui): also pass --expose-gc on the wheel-bundled launch path The original PR fixed the ext_dir and built-tui paths but missed the sibling pip-wheel path at line 1155. Without this, wheel installs would lose --expose-gc entirely (the env-var append at the call site was already removed). All three production node-launch sites now pass --expose-gc via argv consistently.	2026-05-21 13:10:34 -07:00
YarrowQiao	2ea7cf287e	fix(tui): pass --expose-gc as node argv instead of NODE_OPTIONS Node refuses to start when NODE_OPTIONS contains --expose-gc: node: --expose-gc is not allowed in NODE_OPTIONS NODE_OPTIONS is restricted to a small allowlist of flags that are safe to inject via env (since any process able to set env vars on a node child could otherwise enable arbitrary capabilities). --expose-gc is not on that list and never has been -- it must be passed as a direct CLI flag. _launch_tui() was appending --expose-gc to NODE_OPTIONS before spawning the TUI's node process, which made `hermes --tui` fail to start on every modern node release. The intent (manual GC for long sessions to avoid fatal-OOM) is preserved by inserting --expose-gc directly into the node argv in _make_tui_argv() -- same effect, but actually allowed. --max-old-space-size=8192 stays in NODE_OPTIONS: it is allowlisted, and keeping it there means downstream node spawns inherit the same heap cap without having to re-thread the flag through every spawn site. The dev paths (`tsx src/entry.tsx` and `npm start` fallback) are left alone -- they don't accept node flags directly, and the production dist path is the one users actually hit via `hermes --tui`. Repro before fix: $ hermes --tui /usr/bin/node: --expose-gc is not allowed in NODE_OPTIONS	2026-05-21 13:10:34 -07:00
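The split described above, `--expose-gc` on the node command line and the heap cap in `NODE_OPTIONS`, can be sketched as below. The function name `make_tui_launch` and the `entry` path are hypothetical; only the placement of the two flags is taken from the commit message.

```python
def make_tui_launch(entry: str, base_env: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Return (argv, env) for spawning the TUI's node process."""
    heap_flag = "--max-old-space-size=8192"

    # --expose-gc goes straight into argv; node rejects it inside NODE_OPTIONS.
    argv = ["node", "--expose-gc", entry]

    # The heap cap stays in NODE_OPTIONS so child node processes inherit it.
    env = dict(base_env)
    existing = env.get("NODE_OPTIONS", "")
    if heap_flag not in existing.split():
        env["NODE_OPTIONS"] = f"{existing} {heap_flag}".strip()
    return argv, env
```

The returned pair would typically be handed to something like `subprocess.Popen(argv, env=env)`; copying `base_env` keeps the caller's environment untouched.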
helix4u	ba9964ff0d	fix(custom): pass custom provider extra body Allow custom OpenAI-compatible providers declared under `custom_providers:` to set provider-specific `extra_body` fields and have Hermes merge them into chat-completions requests when the matching custom endpoint is active. This is a manual per-provider override rather than a model-name heuristic. OpenAI-compatible Gemma thinking support is real, but the on-wire payload shape is backend-specific: some servers want top-level `enable_thinking`, while vLLM Gemma and NIM-style endpoints expect `chat_template_kwargs`. A per-provider override is safer than picking one assumed payload. Example config: ```yaml custom_providers: - name: gemma-local base_url: http://localhost:8080/v1 model: google/gemma-4-31b-it extra_body: enable_thinking: true reasoning_effort: high ``` For vLLM Gemma or NIM-style endpoints, use the nested shape those servers expect: ```yaml extra_body: chat_template_kwargs: enable_thinking: true ``` Changes: - `hermes_cli/config.py`: preserve `extra_body` in normalized `custom_providers:` entries and allow it in the validated field set. - `hermes_cli/runtime_provider.py`: propagate custom-provider `extra_body` as `request_overrides.extra_body` for named custom runtime resolution, including credential-pool paths. - `agent/agent_init.py`: at agent init, locate the matching custom-provider entry by `base_url` (+ optional model) and merge its `extra_body` into `AIAgent.request_overrides`, with caller-provided overrides winning on conflicting top-level keys. - `plugins/model-providers/custom/__init__.py`: keep existing CustomProfile behavior (Ollama `num_ctx`, `think=False` when reasoning disabled); user-configured `extra_body` flows through `request_overrides`. - `website/docs/integrations/providers.md`: document the explicit `extra_body` override and the vLLM/Gemma `chat_template_kwargs` variant. 
- Tests cover config normalization, runtime propagation, model matching, trailing-slash equivalence, fallback when no `model` field is set, and caller-override merging precedence. Verified end-to-end against `CustomProfile` via `ChatCompletionsTransport`: configured `extra_body` reaches `kwargs.extra_body` on the wire request, and coexists with profile-generated entries (Ollama `num_ctx`, `think=False`) without clobber. Salvaged from #29022 onto current `main`. Cosmetic typing edit in `plugins/model-providers/custom/__init__.py` and a stale-base docs revert in `providers.md` were dropped during cherry-pick. Closes #29022	2026-05-21 07:48:53 -07:00
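The merge precedence described above can be sketched as follows. This reads "caller-provided overrides winning on conflicting top-level keys" as a shallow merge of the `extra_body` dictionaries, which is an interpretation rather than something the commit message spells out. The function name `merge_request_overrides` is hypothetical.

```python
from typing import Any, Optional


def merge_request_overrides(
    provider_extra_body: Optional[dict[str, Any]],
    caller_overrides: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Merge a provider's configured extra_body under the caller's overrides."""
    result = dict(caller_overrides or {})

    # Start from the provider's config, then let caller keys win on conflict.
    # The merge is shallow: nested dicts are replaced whole, not deep-merged.
    extra_body = dict(provider_extra_body or {})
    extra_body.update(result.get("extra_body") or {})

    if extra_body:
        result["extra_body"] = extra_body
    return result
```

For example, a provider configured with `enable_thinking: true` still yields `enable_thinking: False` on the wire if the caller explicitly passes that key, while untouched keys such as `chat_template_kwargs` flow through unchanged.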
ethernet	48be2e0e4d	test: use subprocesses for each test file (#29016) * ci(tests): install ripgrep from prebuilt tarball instead of apt apt-get update + install of ripgrep takes ~4 min on the GHA Ubuntu runners (the apt-get update against archive.ubuntu.com is the slow part; ripgrep itself is small). Switching to the upstream musl binary tarball cuts the step to a few seconds. - Pinned to ripgrep 15.1.0 with sha256 verification (same hash as published in the releases sha256 sidecar file). - Drops the `rg` binary into /usr/local/bin so it is on PATH for every subsequent step without GITHUB_PATH manipulation. - Applied to both the test and e2e jobs in tests.yml. * fix(cli): compile syntax check to tempdir, not source __pycache__ `_validate_critical_files_syntax` runs `py_compile.compile()` on each critical bootstrap file after a successful `git pull`. The default `py_compile` writes the resulting `.pyc` next to the source under `__pycache__/`, which causes two real problems: 1. Parallel test workers walking the same source tree (e.g. running the suite under per-file process isolation) can race against each other on the `__pycache__` write — manifests as flaky 'directory not empty' errors during teardown. 2. In production, the post-pull syntax check leaves a `.pyc` behind that the next interpreter run might pick up — fine when the interpreter version matches, sketchy if it doesn't. Fix: write the compiled output to a `tempfile.TemporaryDirectory()` that's discarded on function exit. We only care about the compile-or-not signal, not the artifact. * test(runner): per-file process isolation, drop manual state reset + xdist Replace fragile manual _reset_module_state test fixtures with robust per-file subprocess isolation. Each test file runs in a fresh `python -m pytest <file>` subprocess via ThreadPoolExecutor. No xdist, no custom pytest plugin, no shared worker state.
Key changes: * scripts/run_tests_parallel.py: new runner that discovers test files, runs N in parallel via ThreadPoolExecutor, captures stdout per file, treats exit code 5 (no tests collected) as pass, and kills all children on exit. Worker count changes from cpu_count to cpu_count * 2: the runner is I/O-bound (waiting on subprocess.communicate() from pytest children) and the parent process does almost no CPU work, so 2x oversubscription keeps more pipes full. When a file fails, immediately show the last 30 lines of pytest output (stack traces + FAILED summary) plus a ready-to-copy repro command: python -m pytest tests/agent/test_auxiliary_client.py * scripts/run_tests.sh: delegates to run_tests_parallel.py * .github/workflows/tests.yml: test step runs python scripts/run_tests_parallel.py * pyproject.toml: drop pytest-xdist, pytest-split; simplify addopts * tests/conftest.py: remove ~200 lines of manual state-reset fixtures * AGENTS.md: update Testing section for per-file design * test(runner): speed gateway test antipattern scan up * fix(test): web search provider plugin test missing xai * fix(tests): make 14 test files pass under per-file subprocess isolation Tests that relied on cross-file state pollution from xdist workers fail when run in isolation (per-file subprocess model). 
Root causes and fixes: Tool registry not populated: - test_video_generation_tool_surface_matrix: add discover_builtin_tools() - test_web_providers_brave_free/ddgs/searxng/general: autouse fixtures registering all 8 bundled web providers, reset after each test - test_website_policy: same provider registration pattern - test_web_tools_tavily: same pattern across 3 dispatch test classes - Also add is_safe_url/check_website_access mocks where SSRF check blocks example.com (DNS resolution fails in isolated envs) Stale check_fn cache: - test_kanban_tools: invalidate_check_fn_cache() + _clear_tool_defs_cache() in both kanban guidance tests (prior test cached False for kanban_show) - test_discord_tool: cache invalidation in setup/teardown - test_homeassistant_tool: invalidate_check_fn_cache() before registry queries Module-level state pollution: - test_auxiliary_client: autouse fixture clearing _aux_unhealthy_until cache - test_skill_commands: set_session_vars() instead of patch.dict(os.environ) (ContextVar takes precedence over os.environ) - test_dm_topics: overwrite sys.modules + separate telegram.constants mock + force-reimport of gateway.platforms.telegram - test_terminal_tool_requirements: removed duplicate class declaration, autouse _clear_caches fixture * change(tests): run_tests.sh explicitly includes env vars instead of manually dropping some vars, now we just only include some * fix(tests): 5 more isolation/NixOS fixes - test_approval_plugin_hooks: isolate HERMES_HOME so real user's command_allowlist doesn't short-circuit the approval path - test_google_chat: skipif when Platform.GOOGLE_CHAT not in enum (feature not merged on this branch) - test_write_deny: test systemd prefix against tmp_path instead of /etc/systemd which resolves to /nix/store on NixOS - test_pty_bridge: use shutil.which('cat') instead of /bin/cat (doesn't exist on NixOS) - profiles.py: rmtree onexc handler chmod's parent dirs too, fixing profile deletion when copytree preserved read-only modes 
from nix store * fix(tests): clear unhealthy cache in autouse fixture for auxiliary_client * fix(tests): skip send_message when telegram not installed; handle missing worker_id in browser_supervisor * fix: py3.11 rmtree onexc compat + belt-and-suspenders unhealthy cache clear for expired codex test * fix: address PR #29016 review feedback - Remove tracked .pytest-cache/ artifact and add to .gitignore - Fix stale 'xdist worker' comment in conftest.py - Deduplicate web provider registration into tests/tools/conftest.py shared helper (register_all_web_providers), replacing 8 copy-pasted blocks across 6 test files - Update PR description: remove stale recovered-test-files claim, fix worker count to match code (cpu_count * 2) * fix: eliminate race in stale-cache achievements test The background scan thread could complete and overwrite _SNAPSHOT_CACHE before evaluate_all() returned the stale data; with only 10 fake sessions the scan finished instantly. Added scan_delay param to _FakeSessionDB and set it to 2s in the stale-cache test so the background thread can't win the race.	2026-05-21 16:40:04 +05:30
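The syntax-check fix in this commit (compile to a temporary directory rather than the source tree's `__pycache__`) can be sketched roughly as follows. The function name `syntax_ok` is hypothetical and is not the project's `_validate_critical_files_syntax`; only the standard-library `py_compile` and `tempfile` calls are real.

```python
import py_compile
import tempfile
from pathlib import Path


def syntax_ok(source: Path) -> bool:
    """Return True if `source` compiles, without leaving a .pyc behind.

    Hypothetical helper: the compiled artifact goes to a throwaway
    directory because only the compile-or-not signal matters.
    """
    with tempfile.TemporaryDirectory() as tmp:
        try:
            py_compile.compile(
                str(source),
                cfile=str(Path(tmp) / "check.pyc"),
                doraise=True,  # raise instead of printing to stderr
            )
        except py_compile.PyCompileError:
            return False
    return True
```

Because `cfile` points into the temporary directory, parallel test workers never contend for the same `__pycache__` entry.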
liuhao1024	4ead464f97	fix(security): guard os.chmod(parent) against / and top-level dirs Five call sites do os.chmod(path.parent, 0o700) without checking that the parent resolves to a safe directory. If HERMES_HOME or another path env var resolves to /, the chmod strips traversal permission from the root inode and bricks the entire host. Add secure_parent_dir() to hermes_constants.py that refuses to chmod / or any top-level directory (depth < 2). Replace all 5 call sites with this helper. Fixes #25821	2026-05-20 22:56:55 -07:00
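A minimal sketch of the guard this commit describes, assuming a POSIX path layout. The commit names `secure_parent_dir()` but does not show its signature, so the parameters and return value here are assumptions made for illustration.

```python
import os
from pathlib import Path


def secure_parent_dir(path: Path, mode: int = 0o700) -> bool:
    """chmod the parent of `path` unless it is / or a top-level directory.

    Returns True when the chmod ran and False when it was refused.
    The signature is assumed; only the depth < 2 rule is from the commit.
    """
    parent = Path(path).parent.resolve()
    # parts is ('/', 'home', 'user') for /home/user, so subtract the root.
    depth = len(parent.parts) - 1
    if depth < 2:
        return False
    os.chmod(parent, mode)
    return True
```

With this rule, a path such as `/file` (parent `/`) or `/etc/file` (parent `/etc`) is refused, while `/home/user/.hermes/file` is allowed.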
Teknium	c6a992e3e3	fix(security): derive <VENDOR>_API_KEY from host as final credential fallback After #28660's host-gating fix, users with provider=custom and base_url pointed at a commercial endpoint (DeepSeek, Groq, Mistral, …) hit no-key-required even when they had the vendor-named env var set (DEEPSEEK_API_KEY, GROQ_API_KEY, …). The issue author flagged this as 'what users intuitively expect'. Adds _host_derived_api_key() to derive an env var name from the base URL host using the registrable label (second-to-last). Appended to all three api_key_candidates chains (_resolve_named_custom_runtime direct-alias path, named-custom path, _resolve_openrouter_runtime non-openrouter branch). Lookalike resistance: api.deepseek.com.attacker.test resolves to vendor label 'attacker', NOT 'deepseek' — DEEPSEEK_API_KEY stays put. IPs and loopback yield no vendor label. Already-handled vendors (OPENAI/OPENROUTER/ OLLAMA) are filtered to prevent bypass of the explicit host-gated paths. Adds 6 tests covering positive paths (DeepSeek, Groq), the lookalike attack, loopback rejection, the already-handled-vendor filter, and direct helper unit tests. Also adds erhnysr to AUTHOR_MAP.	2026-05-20 22:12:09 -07:00
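The host-to-variable derivation above might look something like the sketch below. The function name, the handling of hyphens, and the `localhost` check are assumptions; the second-to-last-label rule, the IP and loopback rejection, and the already-handled vendor filter come from the commit message.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlparse

# Vendors with explicit host-gated paths are filtered out (per the commit).
ALREADY_HANDLED = {"OPENAI", "OPENROUTER", "OLLAMA"}


def host_derived_env_name(base_url: str) -> Optional[str]:
    """Derive '<VENDOR>_API_KEY' from the second-to-last host label."""
    host = (urlparse(base_url).hostname or "").lower()
    if not host or host == "localhost":
        return None
    try:
        ipaddress.ip_address(host)
        return None  # IP literals carry no vendor label
    except ValueError:
        pass
    labels = host.split(".")
    if len(labels) < 2:
        return None
    vendor = labels[-2].upper().replace("-", "_")
    if vendor in ALREADY_HANDLED:
        return None
    return f"{vendor}_API_KEY"
```

Reading the label from the right-hand side of the host is what defeats the lookalike: for `api.deepseek.com.attacker.test` the label is `attacker`, so `DEEPSEEK_API_KEY` is never consulted.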
Erhnysr	9514ddbee2	fix(security): address review feedback from pmos69 - Preserve OPENROUTER_API_KEY for explicit mirror/proxy configs when requested provider is openrouter and OPENROUTER_BASE_URL is set - Gate OPENAI_API_KEY and OPENROUTER_API_KEY in named custom provider path (_resolve_named_custom_runtime) on authoritative hosts - Gate same keys in direct-alias path - Update tests to reflect secure-by-default behavior for local endpoints	2026-05-20 22:12:09 -07:00
Erhnysr	59088228f6	fix(security): prevent API key leakage to non-authoritative custom endpoints Custom endpoint provider was forwarding OPENAI_API_KEY and OLLAMA_API_KEY to arbitrary hosts. Keys should only be sent to their authoritative domains (openai.com, ollama.com) or when explicitly configured via pool/env. - Gate OPENAI_API_KEY to openai.com hosts only - Gate OLLAMA_API_KEY to ollama.com hosts only - Return 'no-key-required' for unrecognized custom endpoints - Update tests to reflect secure-by-default behavior Closes #28660	2026-05-20 22:12:09 -07:00
teknium1	5672772dab	fix(gateway): reorder telegram menu priority, everyday commands first Put /help, /new, /stop, /status, /resume, /sessions, /model ahead of the maintenance group (/debug, /restart, /update, /verbose, /commands) so the menu's first row matches what users actually type most often. The maintenance commands that prompted this priority list still land inside the 30-cap visible window, just not at the very top.	2026-05-20 19:14:21 -07:00
helix4u	b9b6e034d5	fix(gateway): prioritize Telegram command menu	2026-05-20 19:14:21 -07:00
EloquentBrush0x	8f92327891	fix(skills-hub): fix dedup in browse_skills() programmatic API browse_skills() is the TUI gateway's API for the web UI skills browser (tui_gateway/server.py:6574). It had the same dedup-by-name bug as do_browse() and unified_search() fixed in the parent commit: r.name is not unique for browse-sh skills (Airbnb, Booking.com, Zillow all publish "search-listings"), so the dedup loop silently dropped all but the first skill with each task name. Switch to r.identifier, which is always globally unique. Add a regression test asserting that two browse-sh skills with the same name but different hostnames both appear in the browse_skills() result.	2026-05-20 15:04:01 -07:00
EloquentBrush0x	fc7e04e9ed	fix(skills-hub): deduplicate search results by identifier, not name Browse.sh exposes skills by task name (e.g. "search-listings"), which is shared across hundreds of sites. Deduplicating by name silently dropped every browse-sh skill after the first one with a given task name — e.g. only Airbnb's "search-listings" would survive, collapsing Booking.com, Zillow, and every other site's variant into nothing. Switch unified_search() and do_browse() to use r.identifier as the dedup key. identifier is always globally unique (e.g. "browse-sh/airbnb.com/search-listings-ddgioa"), so same-named skills from different browse-sh hostnames are preserved as distinct results. Update existing TestUnifiedSearchDedup tests to model the real scenario (same identifier appearing from two sources) and add a regression test that asserts browse-sh skills with the same name but different hostnames are never collapsed.	2026-05-20 15:04:01 -07:00
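The difference between the two dedup keys can be shown with a small sketch. `SkillResult` and `dedup_by_identifier` are hypothetical stand-ins for the project's result type and loops; the Booking.com identifier is invented for the example, while the Airbnb one is quoted from the commit message.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SkillResult:
    name: str        # task name, shared across many sites
    identifier: str  # globally unique
    source: str


def dedup_by_identifier(results: Iterable[SkillResult]) -> List[SkillResult]:
    """Keep the first result per identifier, preserving order."""
    seen = set()
    unique = []
    for r in results:
        if r.identifier in seen:
            continue
        seen.add(r.identifier)
        unique.append(r)
    return unique


results = [
    SkillResult("search-listings", "browse-sh/airbnb.com/search-listings-ddgioa", "browse-sh"),
    SkillResult("search-listings", "browse-sh/booking.com/search-listings-example", "browse-sh"),
    # Same identifier surfaced a second time by another source.
    SkillResult("search-listings", "browse-sh/airbnb.com/search-listings-ddgioa", "hub"),
]
unique = dedup_by_identifier(results)
```

Keying on `name` would have collapsed all three entries into one; keying on `identifier` keeps the Airbnb and Booking.com variants and drops only the true duplicate.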
helix4u	1a7bb988fc	fix(gateway): harden kanban and provider cleanup races	2026-05-20 14:31:22 -07:00
Teknium	eeb747de25	feat(sessions): opt-in per-session JSON snapshot writer PR #29182 deleted the per-session JSON snapshot writer outright because state.db is canonical and the snapshots had no in-tree consumer. Some users have external tooling that reads `~/.hermes/sessions/session_{sid}.json` directly, so reintroduce the writer behind a config flag that defaults to off. - Add `sessions.write_json_snapshots` (default False) to DEFAULT_CONFIG - Restore `AIAgent._save_session_log` + `_clean_session_content` as gated methods. When the flag is off the call is a fast no-op; when on, the writer behaves as before (atomic write, truncation guard preserved, REASONING_SCRATCHPAD → think tag normalization) - Re-derive the target path from `agent.session_id` on each call so `/branch` and `/compress` re-points happen automatically — no need to restore the explicit re-point bookkeeping at call sites - Wire the single call site in `_persist_session` (the cleanup-on-exit hook). Did NOT restore the 7 intra-turn calls the original PR deleted — those were redundant writes within the same turn that doubled disk I/O without adding any persistence guarantee `_persist_session` does not already provide - Read the flag once at agent init via `load_config()`, cache as `agent._session_json_enabled` - Update `TestNoSessionJsonSnapshot` → `TestSessionJsonSnapshotOptIn` to pin behavior: default off (no file), opt-in true (file written), no-op method on default agents, logs_dir retained unconditionally - Update CONTRIBUTING.md and the bundled `hermes-agent` skill to document the flag and its default	2026-05-20 11:44:10 -07:00
adybag14-cyber	c29b4f55d9	perf(termux): speed up tui cold start	2026-05-20 11:41:52 -07:00
Julien Talbot	5af4b73f87	fix(xai): align migrate retirement map with docs	2026-05-20 09:18:23 -07:00
Julien Talbot	12842d32ce	feat(cli): hermes migrate xai [--apply] [--no-backup] Adds a new `migrate` top-level sub-command that delegates to `migrate xai` for now. xAI handler: - Default: dry-run. Lists every retired xAI model reference found in config.yaml, with the recommended replacement and reasoning_effort hint, and points to the official xAI migration guide. - --apply: rewrites config.yaml in-place (via the ruamel round-trip apply_migration helper from hermes_cli.xai_retirement). A timestamped backup is created automatically. - --no-backup: skips the backup when applying (opt-in only — the safe default keeps a copy). Together with the doctor + chat-startup warnings already in this stack, this gives users three escalating signals before the May 15, 2026 retirement date: green check / warning at chat startup / actionable migration command.	2026-05-20 09:18:23 -07:00
Julien Talbot	9ff98daf71	feat(xai): apply_migration — rewrite config.yaml in-place via ruamel round-trip Extends hermes_cli.xai_retirement with apply_migration(config_path, issues, backup=True), used by the upcoming `hermes migrate xai` sub-command. Uses ruamel.yaml round-trip mode so that comments, key order, indentation, quoting style, and scalar types are preserved on rewrite — config.yaml is treated as a user-edited file, not a data dump. Behavior: - Each issue rewrites parent[leaf] to issue.replacement - When issue.reasoning_effort is set (non-reasoning variants that map to grok-4.3), a sibling reasoning_effort key is added/updated alongside the model - Empty issues list or missing slots are no-ops (no backup, no rewrite) - When changes occur, a timestamped backup (.bak-pre-migrate-xai-YYYYMMDD-HHMMSS) is written first unless backup=False 17 unit tests cover dry-run/no-op, surgical replacement (each slot), comment + key-order preservation, backup creation, and idempotence (apply twice → no-op the second time).	2026-05-20 09:18:23 -07:00
Julien Talbot	a8a05c8ea7	feat(cli): warn about retired xAI models at chat startup Print a non-blocking stderr warning at the top of cmd_chat when the active config still references xAI models scheduled for retirement on May 15, 2026. Each line includes the config path, the recommended replacement, and the reasoning_effort to set for non-reasoning variants. Points to hermes doctor for full diagnostic. Wrapped in try/except — never blocks startup. After May 15 the upstream xAI API will return a clear error anyway; this is purely a heads-up to give users time to migrate before that happens.	2026-05-20 09:18:23 -07:00
Julien Talbot	b4ba42550c	feat(doctor): surface xAI model retirement in hermes doctor Add a new section in run_doctor that lists retired xAI model references found in the active config and points the user at the official xAI migration guide. Each retired reference shows its config path (principal.model, auxiliary.<slot>.model, delegation.model, tts.xai.model, or plugins.image_gen.xai.model), the recommended replacement, and whether reasoning_effort needs to be set (for non-reasoning variants that map to grok-4.3 + reasoning_effort=none). Findings are appended to manual_issues so the final doctor summary reminds the user to update their config.yaml manually (no automatic YAML rewriting in this PR — preserves comments, key order, types). Wrapped in try/except so doctor still completes if load_config or the retirement module raise unexpectedly.	2026-05-20 09:18:23 -07:00
Julien Talbot	6f3a020e62	feat(xai): detect retired xAI models (May 15, 2026) Add hermes_cli.xai_retirement module that walks a Hermes config and flags references to models being retired by xAI on May 15, 2026 per the official migration guide. Pure logic + dataclass, no I/O — testable in isolation and reusable from a future hermes migrate xai sub-command. Mappings (per https://docs.x.ai/developers/migration/may-15-retirement): - grok-4 / grok-4-0709 -> grok-4.3 - grok-4-fast{,-reasoning,-non-reasoning} -> grok-4.3 (+reasoning_effort=none for non-reasoning) - grok-4-1-fast{,-reasoning,-non-reasoning} -> grok-4.3 (+reasoning_effort=none for non-reasoning) - grok-code-fast-1 -> grok-4.3 - grok-imagine-image-pro -> grok-imagine-image-quality Slots scanned: principal.model, auxiliary.<any>.model (introspective), delegation.model, tts.xai.model, plugins.image_gen.xai.model. Provider prefix x-ai/ is normalized. 33 unit tests covering edge cases (empty/non-dict config, valid models, ambiguous variants, all retired slots, formatter).	2026-05-20 09:18:23 -07:00
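The mapping listed in this commit can be written down as a lookup table. This is a sketch under assumptions: `Replacement` and `find_replacement` are hypothetical names, not the module's API, and mapping the bare `grok-4-fast` and `grok-4-1-fast` names without a `reasoning_effort` hint is a guess, since the commit mentions ambiguous variants without saying how they resolve.

```python
from typing import NamedTuple, Optional


class Replacement(NamedTuple):
    model: str
    reasoning_effort: Optional[str]


_DEFAULT = Replacement("grok-4.3", None)
_NON_REASONING = Replacement("grok-4.3", "none")

RETIRED_XAI_MODELS = {
    "grok-4": _DEFAULT,
    "grok-4-0709": _DEFAULT,
    "grok-4-fast": _DEFAULT,            # assumption: no hint for the bare name
    "grok-4-fast-reasoning": _DEFAULT,
    "grok-4-fast-non-reasoning": _NON_REASONING,
    "grok-4-1-fast": _DEFAULT,          # assumption: no hint for the bare name
    "grok-4-1-fast-reasoning": _DEFAULT,
    "grok-4-1-fast-non-reasoning": _NON_REASONING,
    "grok-code-fast-1": _DEFAULT,
    "grok-imagine-image-pro": Replacement("grok-imagine-image-quality", None),
}


def find_replacement(model: str) -> Optional[Replacement]:
    """Return the replacement for a retired model, or None if not retired."""
    name = model.strip().lower()
    if name.startswith("x-ai/"):  # normalize the provider prefix
        name = name[len("x-ai/"):]
    return RETIRED_XAI_MODELS.get(name)
```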
H-Ali13381	697d38a3f4	feat: auto-launch Chromium-family browser for CDP Add browser CDP launch candidates for Chrome, Chromium, Brave, and Edge while preserving Chrome-first selection. Retry candidate launch failures instead of giving up after the first executable. Update /browser CLI and TUI messaging, docs, and tool descriptions from Chrome-only wording to Chromium-family browser support. Add regression coverage for Brave/Edge paths, Chrome-first precedence, fallback launches, and CDP endpoint probing.	2026-05-19 22:34:05 -07:00
xxxigm	34120a0ae2	fix(kanban): worker-initiated block must not be auto-promoted (#28712 ) When a worker calls ``kanban_block(reason="review-required: ...")`` to hand a task off for human review, the dispatcher's ``recompute_ready`` was treating the resulting ``blocked`` status as eligible for auto-promotion — exactly the same as a circuit-breaker block. On the next tick the task flipped back to ``ready``, a fresh worker spawned, found nothing to do (work already applied, review-required comment already posted), exited cleanly, got recorded as ``protocol_violation`` → ``gave_up`` → ``blocked``, and the dispatcher promoted again. Infinite loop until manual ``hermes kanban reclaim`` + ``kanban block``. Add ``_has_sticky_block`` which distinguishes the two block sources using the cheapest available signal: the most recent ``"blocked"``/``"unblocked"`` event in ``task_events``. * Worker / operator ``kanban_block`` emits ``"blocked"`` → ``_has_sticky_block`` returns True → ``recompute_ready`` skips the task entirely. ``unblock_task`` emits ``"unblocked"`` which flips the predicate back, so the only legitimate exit is the documented human-in-the-loop path. * Circuit-breaker ``_record_task_failure`` emits ``"gave_up"`` (not ``"blocked"``) → predicate stays False → original parent-completion-recovery semantics from #`40c1decb3` are preserved. * Tasks blocked purely by direct DB manipulation also recover, since they have no ``"blocked"`` event row at all — matches the existing ``test_recompute_ready_promotes_blocked_with_done_parents`` fixture behaviour.	2026-05-19 17:26:23 -07:00
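The predicate described above reduces to "find the most recent `blocked` or `unblocked` event". A minimal sketch, assuming the events arrive as a list of kind strings ordered oldest first; the real `_has_sticky_block` reads `task_events` from the database, which is not modeled here.

```python
from typing import Sequence


def has_sticky_block(event_kinds: Sequence[str]) -> bool:
    """True when the latest blocked/unblocked event is 'blocked'.

    Other kinds (for example 'gave_up' from the circuit breaker) are
    ignored, so they never make a block sticky.
    """
    for kind in reversed(event_kinds):
        if kind == "blocked":
            return True
        if kind == "unblocked":
            return False
    return False  # no blocked/unblocked rows at all
```

A task whose history ends in `gave_up`, or that has no event rows because it was blocked by direct DB manipulation, stays eligible for promotion under this rule.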
Teknium	64a9a199bb	fix(xai-oauth): pin inference base_url to x.ai origin (#28952 ) XAI_BASE_URL / HERMES_XAI_BASE_URL let users repoint the OAuth-authenticated inference endpoint, but the env override was an unguarded credential-leak vector: a tampered .env or hostile shell init setting XAI_BASE_URL=https://attacker.example/v1 would silently ship the SuperGrok OAuth bearer to a third party on every request. Add _xai_validate_inference_base_url() that pins the host to x.ai or a *.x.ai subdomain and rejects non-HTTPS. On rejection, fall back to the default with a warning rather than raise — a bad env var should not deadlock auth, but should never leak the bearer either. Apply at all three sites that read the env override for xai-oauth: - hermes_cli/auth.py resolve_xai_oauth_runtime_credentials (main path) - hermes_cli/auth.py _xai_oauth_loopback_login (initial login) - agent/auxiliary_client.py _resolve_xai_oauth_for_aux (aux client) E2E validated against four scenarios: attacker.example, lookalike api.x.ai.evil.com, http:// downgrade on api.x.ai, and legit custom.x.ai subdomain (which still resolves correctly). Discovered while comparing against the opencode-grok-auth plugin (github.com/ysnock404/opencode-grok-auth), which highlighted the same guard on the OpenCode side.	2026-05-19 14:51:21 -07:00
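A rough sketch of the origin pin, covering the four scenarios listed in the commit. The function name and the default URL constant are assumptions for illustration; the real `_xai_validate_inference_base_url()` may differ in signature and logging.

```python
import logging
from typing import Optional
from urllib.parse import urlparse

logger = logging.getLogger(__name__)

DEFAULT_XAI_BASE_URL = "https://api.x.ai/v1"  # assumed default


def validate_xai_base_url(candidate: Optional[str]) -> str:
    """Return `candidate` only if it is HTTPS on x.ai or a *.x.ai host."""
    if not candidate:
        return DEFAULT_XAI_BASE_URL
    try:
        parsed = urlparse(candidate)
        host = (parsed.hostname or "").lower()
    except ValueError:
        host, parsed = "", None
    if parsed is not None and parsed.scheme == "https" and (
        host == "x.ai" or host.endswith(".x.ai")
    ):
        return candidate
    # Fall back rather than raise: a bad env var should not block auth.
    logger.warning("Ignoring untrusted xAI base URL override")
    return DEFAULT_XAI_BASE_URL
```

Matching on the `.x.ai` suffix with the leading dot is what rejects `api.x.ai.evil.com`, whose host ends in `.evil.com`.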
helix4u	d9829ab45f	fix(model): match custom provider by active base url	2026-05-19 14:50:38 -07:00
Teknium	544c31b50b	perf(agent-loop): cut 47% of per-conversation function calls via 3 targeted hot-path optimizations (#28866 ) * perf(config): add load_config_readonly() fast path for hot agent loop `load_config()` is called from the agent loop's per-API-call hot path via `get_provider_request_timeout()` and `get_provider_stale_timeout()` — both invoked once per turn from `_resolved_api_call_timeout()` in run_agent.py. Profiling a synthetic 20-tool-call agent run revealed: - 21 invocations of `load_config()` cumulating 56ms (~17% of agent loop) - 34,398 deepcopy calls totaling 37ms (config defensive deepcopy + chain) - 8,652 `_expand_env_vars` invocations (~412 per turn) Microbench (cache-hit, real config.yaml present): load_config() 265us/call (125us deepcopy + 140us infra) load_config_readonly() 138us/call (~48% faster) `load_config_readonly()` returns the cached dict directly without the defensive deepcopy. Documented contract: caller must not mutate. Returns plain dict (not MappingProxyType) so downstream `isinstance(x, dict)` guards keep working — caught during initial implementation when MappingProxyType broke get_provider_request_timeout's guard logic. Wired into hermes_cli/timeouts.py (the two functions called per agent turn). load_config() is unchanged for the 263 other call sites that mutate the result before save_config(), are not in the hot path, or where the safety guarantee matters more than the perf. 
Profile A/B (cached config, 21-turn agent loop): BEFORE AFTER delta get_provider_request_timeout 55ms 16ms -71% total function calls 399k 160k -60% deepcopy calls (in hotspots) 34,398 ~0 ~elim Verified: - isinstance(load_config_readonly(), dict) is True - timeout/stale resolutions correct - load_config() still returns isolated mutable deepcopies - tests/hermes_cli/test_config.py / test_timeouts.py: 102/102 pass - tests/cli/ + tests/agent/test_auxiliary_client.py: 883/883 pass perf(redact): substring pre-screens skip non-matching regex chains Every log record passes through `RedactingFormatter.format` which calls `redact_sensitive_text`, which historically ran ALL 13 secret-pattern regexes against every line — including DB connection strings, JWTs, Discord mentions, Signal phone numbers, etc. — even for typical clean log records like 'INFO run_agent: API call completed'. Add cheap substring pre-checks before each regex pass. False positives still run the regex (which then matches nothing); false negatives are impossible because every pattern requires the gated substring to match its leading anchor: - `_PREFIX_RE` gated on any of 33 known credential prefix substrings - `_ENV_ASSIGN_RE` gated on `=` in text - `_JSON_FIELD_RE` gated on `:` and `"` in text - `_AUTH_HEADER_RE` gated on `uthorization`/`UTHORIZATION` in text - `_TELEGRAM_RE` gated on `:` in text - `_PRIVATE_KEY_RE` gated on `BEGIN` and `-----` - `_DB_CONNSTR_RE` gated on `://` in text - `_JWT_RE` gated on `eyJ` in text - URL userinfo/query gated on `://` - `_redact_form_body` gated on `&` and `=` - `_DISCORD_MENTION_RE` gated on `<@` - `_SIGNAL_PHONE_RE` gated on `+` Microbench (5 typical log records, 20k iterations each): BEFORE AFTER delta redact_sensitive_text per call 5.63us 1.79us -68% Real-world impact: ~244 log records emitted in a 30-turn agent loop, so the chain saves ~1ms of CPU per conversation. 
Bigger win is the reduction in regex execution and GC pressure during heavy logging sessions (verbose logging, gateway message processing). Security regression test: 30 secret-containing inputs (sk-/ghp_/JWT/DB connstr/Auth-Bearer/private key/URL userinfo/Discord/Signal/etc.) verified to produce identical redacted output before/after. All 75 existing tests/agent/test_redact.py cases pass. The `?access_token=foo&code=bar` (bare query string, no scheme) case that 'leaks' is pre-existing behavior — the URL query redaction requires a well-formed URL with scheme+host. Not a regression. * perf(run_agent): cache _needs_thinking_reasoning_pad result per (provider, model, base_url) Profile of a 31-turn synthetic agent run shows `_needs_thinking_reasoning_pad` fires 495 times (~16 per turn) and each call ran 3 helper methods, each hitting `base_url_host_matches` 1-4 times via `urlparse`. Total cost: 3,342 base_url_host_matches calls + 3,373 urlparse calls accounting for ~36ms of agent-loop overhead (~7% of the entire post-network work). Provider / model / base_url don't change during a conversation except via `switch_model` and fallback activation — both of which already overwrite those attributes atomically. Cache the result on a tuple key; since the key is derived from the very fields that would change, the cache auto-invalidates on the next read after a switch. No manual invalidation needed in switch_model / _try_activate_fallback. Profile A/B (31-turn cached-config agent run): BEFORE AFTER delta _needs_thinking_reasoning_pad cum 18ms 1ms -94% _copy_reasoning_content_for_api cum 17ms 1ms -94% base_url_host_matches calls 3,342 372 -89% urlparse calls 3,373 403 -88% total function calls 296k 223k -25% Verified: - tests/run_agent/test_deepseek_reasoning_content_echo.py: 36/36 pass - tests/run_agent/ (full): 1383/1383 pass + 3 skipped	2026-05-19 14:25:10 -07:00
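The substring pre-screen from the redaction part of this commit can be illustrated with two gates. The regular expressions below are simplified stand-ins written for this example, not the project's `_JWT_RE` or `_DB_CONNSTR_RE`; the gating idea (a cheap `in` check before each regex pass) is what the sketch is meant to show.

```python
import re

# Simplified, hypothetical patterns. Each one can only match text that
# contains its gate substring, so skipping on a miss loses nothing.
JWT_RE = re.compile(r"eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")
DB_CONNSTR_RE = re.compile(r"(\w+://[^:/\s]+:)([^@\s]+)(@)")


def redact(text: str) -> str:
    """Run each regex only when its cheap substring gate is present."""
    if "eyJ" in text:
        text = JWT_RE.sub("[REDACTED_JWT]", text)
    if "://" in text:
        text = DB_CONNSTR_RE.sub(r"\1***\3", text)
    return text


clean = redact("INFO run_agent: API call completed")
masked = redact("connecting to postgres://user:secret@db/app")
```

A clean log line fails both gates and is returned after two substring scans, with no regex engine work at all.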
teknium1	6a159be7ca	fix(runtime): treat 'ollama'/'vllm'/'llamacpp' aliases like 'custom' for base_url trust (#27132 ) When config.yaml has provider: ollama (or vllm/llamacpp/llama-cpp) with a non-loopback base_url, auth.py's resolve_provider() correctly normalises the alias to 'custom' at the top level, but two sites in runtime_provider.py were still comparing the original string against the literal 'custom': - _config_base_url_trustworthy_for_bare_custom() rejected non-loopback URLs because cfg_provider_norm was 'ollama', not 'custom'. - _resolve_openrouter_runtime() only entered the trust branch when requested_norm == 'custom'. Both sites now consult resolve_provider() and treat any alias that resolves to 'custom' identically. Result: provider: ollama + LAN IP no longer silently falls through to OpenRouter (HTTP 401), matching the behaviour of provider: custom with the same base_url. E2E verified across 6 cases (ollama/vllm/llamacpp/custom + LAN; ollama + loopback; openrouter + cloud) — all route to the configured endpoint; 'frobnicate' + LAN still rejects with AuthError as before. Also adds scripts/release.py AUTHOR_MAP entry for @stepanov1975 (PR #22074 — wizard config picker preservation, cherry-picked into the preceding commit).	2026-05-19 14:23:19 -07:00
stepanov1975	e13f242f01	fix(cli): preserve setup config picker writes Resync the setup wizard's in-memory config after the shared model picker writes to disk so the wizard's final save does not overwrite auxiliary choices or other provider updates. Adds a regression test for auxiliary task choices saved by the picker.	2026-05-19 14:23:19 -07:00
Kyle Jeong	90be1be501	fix: register browse-sh in per-source limits and --source choices - Add 'browse-sh' to _PER_SOURCE_LIMIT in both do_browse() and browse_skills() with limit=500 (covers full 171-skill catalog) - Add 'browse-sh' to --source argparse choices for both 'hermes skills browse' and 'hermes skills search' Without these, browse-sh fell back to the default cap of 50 results and was not filterable via --source.	2026-05-19 14:17:38 -07:00
nekwo	d948de39e9	fix(gateway): harden Windows gateway install lifecycle Preserve Windows profile install decisions across UAC handoff, avoid visible console windows by launching via pythonw, make repeated install/start idempotent, recreate stale Scheduled Tasks, and separate start-now from login auto-start behavior. Add Windows gateway regression coverage and systemd setup tests for the shared install flow.	2026-05-19 11:23:15 -07:00
Teknium	2a7308b7c4	fix(update): quarantine hermes.exe vs concurrent Windows instance (#26670 ) (#26677 ) * fix(update): detect concurrent hermes.exe on Windows; retry + restart-defer quarantine Closes #26670. When 'hermes update' runs on Windows with another hermes.exe alive (most commonly the Hermes Desktop Electron app's spawned backend) _quarantine_running_hermes_exe() fails to rename the venv shim with [WinError 32]. uv pip install -e . then exits 2, the git-pull fast path is silently abandoned, and the ZIP fallback runs (and fails the same way) before eventually succeeding. This change implements three of the five proposed fixes from the issue: 1. Concurrent-instance detection (preferred fix). _detect_concurrent_hermes_instances() uses psutil to enumerate processes whose .exe is one of our venv shims (hermes.exe / hermes-gateway.exe), excluding the caller's PID. When any match exists, cmd_update prints an actionable message naming the blocking PIDs and exits 2 BEFORE any destructive work. New --force flag bypasses the gate. 2. Retry + restart-deferred fallback. _quarantine_running_hermes_exe() now retries the rename up to 4 times with 100/250/500/1000 ms backoff (covers the transient AV-scanner-handle case). If all retries fail, it schedules the replacement via MoveFileExW with the OS deferred-rename flag so the new shim can land at the original path and the update completes; the old image is fully unloaded after the user's next system restart. 3. Actionable warning text. The old 'Could not quarantine: [WinError 32]' warning is replaced with one that names the likely culprits (Hermes Desktop, REPLs, gateway, AV) and points to the new --force flag. 
Tests: - 13 new tests in tests/hermes_cli/test_update_concurrent_quarantine.py covering: psutil-based enumeration, self-pid exclusion, case-insensitive matching of .EXE, no-psutil graceful degradation, off-Windows no-op, helpful warning formatting, retry-then-succeed, restart-deferred fallback, cmd_update abort + exit code 2, and --force bypass. - New autouse fixture in tests/hermes_cli/conftest.py defaults _detect_concurrent_hermes_instances to [] so the rest of the suite isn't tripped by the developer's own running hermes.exe. Opt-out marker 'real_concurrent_gate' registered in pyproject.toml. - Updating docs page (website/docs/getting-started/updating.md) gains a short section explaining the new Windows error and remediation. * chore: refresh uv.lock to match pyproject.toml exact pins aiohttp 3.13.4 -> 3.13.3 (matches pyproject pin: aiohttp==3.13.3) anthropic 0.87.0 -> 0.86.0 (matches pyproject pin: anthropic==0.86.0) hermes-agent 0.13.0 -> 0.14.0 (matches pyproject version) CI's uv lock --check was failing on the merged state because main drifted: pyproject.toml uses exact == pins for those two deps and the hermes-agent version was bumped to 0.14.0 but the lockfile still had 0.13.0.	2026-05-19 11:10:51 -07:00
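The retry step of the quarantine logic might be sketched as below. This is an assumption-heavy illustration: the name `rename_with_retry` is hypothetical, the reading "one initial attempt plus up to four retries" is one interpretation of the commit text, and the `MoveFileExW` restart-deferred fallback is deliberately left out because it is Windows-only.

```python
import os
import time
from typing import Callable, Sequence

# Backoff schedule from the commit message: 100/250/500/1000 ms.
BACKOFF_SECONDS = (0.1, 0.25, 0.5, 1.0)


def rename_with_retry(
    src: str,
    dst: str,
    rename: Callable[[str, str], None] = os.replace,
    sleep: Callable[[float], None] = time.sleep,
    delays: Sequence[float] = BACKOFF_SECONDS,
) -> bool:
    """Try the rename, retrying after each delay; False if all attempts fail."""
    for attempt in range(len(delays) + 1):
        try:
            rename(src, dst)
            return True
        except OSError:  # PermissionError (WinError 32) is a subclass
            if attempt == len(delays):
                return False
            sleep(delays[attempt])
    return False
```

Injecting `rename` and `sleep` keeps the helper testable without touching the filesystem or waiting on real timers; a caller would run the deferred-rename fallback when this returns False.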
LeonSGP43	ebe0b77122	fix(model-switch): mark bare custom provider as current	2026-05-19 10:57:35 -07:00
kshitijk4poor	7552e0f3c0	fix(kanban): also hoist idx_events_run + drop redundant inner create Extends the previous commit to cover the remaining additive-column index that sits on the same migration trap: - ``task_events.run_id`` -> ``idx_events_run`` was still in SCHEMA_SQL. A legacy ``task_events`` table predating #17805 (no ``run_id``) would still abort ``executescript`` before ``_migrate_add_optional_columns`` could add the column. Hoisted out of SCHEMA_SQL and made unconditional in the migration alongside the other three indexes. - Removed the now-redundant ``CREATE INDEX idx_tasks_idempotency`` that was nested inside the ``if "idempotency_key" not in cols`` branch. The unconditional create lower in the function makes it idempotent on both fresh and legacy DBs. - Strengthened the regression test to cover all four indexes (``idx_tasks_session_id``, ``idx_tasks_tenant``, ``idx_tasks_idempotency``, ``idx_events_run``) and to seed a pre-#17805 ``task_events`` shape that exercises the ``run_id`` migration path. The result: every ``CREATE INDEX`` that depends on an additive column now runs after the migration ensures the column exists. Verified against a realistic pre-#16081 board fixture (tasks + task_events both legacy shape) — origin/main reproduces ``no such column: session_id``; this branch migrates cleanly and creates all four indexes.	2026-05-19 08:09:11 -07:00
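The ordering rule this commit enforces (add the column first, then create the index that depends on it) can be shown with the standard-library `sqlite3` module. The table shape and the `migrate` function are simplified assumptions; only the `task_events.run_id` column and the `idx_events_run` index name come from the commit.

```python
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    """Add run_id if missing, then create its index unconditionally."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_events (id INTEGER PRIMARY KEY, kind TEXT)"
    )
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    cols = {row[1] for row in conn.execute("PRAGMA table_info(task_events)")}
    if "run_id" not in cols:
        conn.execute("ALTER TABLE task_events ADD COLUMN run_id TEXT")
    # Safe on fresh and legacy databases because the column now exists.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_events_run ON task_events(run_id)"
    )


# Legacy database: task_events exists but predates the run_id column.
legacy = sqlite3.connect(":memory:")
legacy.execute("CREATE TABLE task_events (id INTEGER PRIMARY KEY, kind TEXT)")
migrate(legacy)
migrate(legacy)  # second run is a no-op
```

Had the `CREATE INDEX` statement run before the `ALTER TABLE`, SQLite would have failed with a "no such column" error on the legacy table, which is the failure mode the commit describes.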
Michael Nguyen	7c622b6c74	fix(kanban): migrate task session index after columns	2026-05-19 08:09:11 -07:00

2157 commits