Updates curated picker lists for both the OpenRouter fallback snapshot
(`OPENROUTER_MODELS`) and the Nous Portal list (`_PROVIDER_MODELS['nous']`).
Regenerates website/static/api/model-catalog.json via
`scripts/build_model_catalog.py` to keep the docs-hosted manifest in
sync (drift guard in `test_in_repo_lists_match_manifest`).
tests/hermes_cli/test_models.py fixtures updated — they pinned the
old model id as their live-fetch sample.
* feat(mcp): Nous-approved MCP catalog with interactive picker
Adds an optional-mcps/ directory mirroring optional-skills/: curated,
Nous-approved MCP servers shipped with the repo but disabled by default.
Presence in optional-mcps/ = approval. No community tier, no trust signals.
Entries are added by merging a PR.
New surface:
hermes mcp Interactive catalog picker (default)
hermes mcp catalog Plain-text list, scriptable
hermes mcp install <name> Install a catalog entry
Picker behavior:
not installed -> install (clone/bootstrap if needed, prompt for creds)
installed/off -> enable
installed/on -> menu (disable / uninstall / reinstall)
Manifest schema (manifest_version: 1) supports:
- transport: stdio (command/args, ${INSTALL_DIR} substitution) or http (url)
- install: optional git clone + bootstrap commands (for repos that need
local venv setup, like the n8n bridge); omit for npx/uvx servers
- auth: api_key (prompts -> ~/.hermes/.env), oauth (provider-mediated
or native MCP), or none
Catalog entries are never auto-updated. Users re-run `hermes mcp install`
to refresh. Credentials always go to ~/.hermes/.env (the .env-is-for-secrets
rule), never to per-server env blocks.
Ships n8n as the reference manifest (https://github.com/CyberSamuraiX/hermes-n8n-mcp).
Tests: 19 catalog tests + E2E install/uninstall round-trip via the shipped
manifest.
* feat(mcp): tool-selection checklist + Linear catalog entry
Adds install-time tool selection so users only enable the MCP tools they
actually want, and ships Linear as a second reference catalog entry to
demonstrate the http+oauth path alongside n8n's stdio+api_key+git-bootstrap.
Tool selection flow:
install (clone/auth/credentials) ->
probe server for available tools ->
curses checklist with pre-checked rows ->
write mcp_servers.<name>.tools.include
Pre-check priority:
1. user's prior tools.include (reinstall preserves selection)
2. manifest's tools.default_enabled (curated subset)
3. all probed tools (default)
Probe-failure fallback (server unreachable, OAuth not yet complete,
backing service offline):
- manifest declared default_enabled -> applied directly
- no default declared -> no filter written (all-on when reachable)
- both cases point user at hermes mcp configure <name>
Manifest schema additions:
tools:
default_enabled: [list, of, tool, names] # optional
Updates:
- optional-mcps/linear/manifest.yaml -- new reference entry (http+oauth)
- optional-mcps/n8n/manifest.yaml -- tools.default_enabled set to the
8 read-mostly tools; mutating tools (activate/deactivate, container_logs)
pruned by default
- docs: new 'Tool selection at install time' section in features/mcp.md
Tests: 7 new tests in TestToolSelection covering probe-success / probe-fail
matrix, manifest-default filtering, reinstall-preserves-selection, and
invalid-default-enabled rejection. 26 catalog tests + 32 existing
mcp_config tests passing.
* feat(mcp): polish — picker unification, include-mode convergence, hardening
Addresses review findings on PR #30870. Lands all improvements that
belong in this PR before merge; defers separate cleanup (consolidating
two probe implementations, change-detector tests) to follow-ups.
Picker UX (mcp_picker.py)
- Unifies catalog + custom (user-added) MCPs in one view with distinct
status badges (available / enabled / installed (disabled) /
custom — enabled / custom — disabled)
- Adds 'Configure tools (probe server + re-pick)' action to both the
catalog-installed and custom-row submenus — the existing
hermes mcp configure flow was previously unreachable from the picker
- Loops until ESC/q so the user can manage several entries in one
session instead of having to re-launch
- Uninstall message now mentions .env credentials are preserved with a
pointer to clean them up manually if no longer needed
- Surfaces a 'requires a newer Hermes' warning per future-manifest
entry instead of silently hiding it
Catalog (mcp_catalog.py)
- catalog_diagnostics() exposes which manifests were skipped and why
(future_manifest vs invalid) so UIs can give actionable feedback
- _do_git_install detects SHA-shaped refs (regex /[0-9a-f]{7,40}/)
and skips the doomed 'git clone --branch <sha>' attempt — clone --branch
only accepts branches/tags, so SHAs always failed noisily before
falling back to the full-clone path
- Probe-success all-tools-enabled message now mentions that new tools
the server adds later will be auto-enabled (no-filter mode)
Convergence (tools_config.py)
- _configure_mcp_tools_interactive now writes tools.include (whitelist)
instead of tools.exclude (blacklist), matching the catalog flow and
hermes mcp configure. The on-disk config shape no longer depends on
which UI the user touched last
- Two existing tests updated to assert the new include-mode contract
Discoverability
- Setup wizard final step now prints 'Browse curated MCPs: hermes mcp'
- Three tip-corpus entries pointing at the new catalog
- Docs updated with: trust model (manifests run code locally, gated by
PR review, but read before installing), runtime ${ENV_VAR} substitution
semantics, and the manifest_version forward-compat behavior
Tests
- 7 new tests covering future-manifest diagnostics, custom MCP picker
rows, SHA-ref git-install path, branch-ref git-install path, and the
tools_config include-mode write contract
- 80 MCP-related tests passing across test_mcp_catalog.py,
test_mcp_config.py, test_mcp_tools_config.py
* fix(mcp): drop setup-wizard catalog hint to satisfy supply-chain scanner
The wizard line 'Browse curated MCPs: hermes mcp' triggered the
CI supply-chain scanner because it pattern-matches on edits to any
file named hermes_cli/setup.py — that filename matches the Python
'install-hook file' heuristic even though this setup.py is the
user-facing 'hermes setup' wizard, not a packaging install hook.
The catalog is already surfaced via three tip-corpus entries in
hermes_cli/tips.py (which the scanner doesn't flag), so dropping the
wizard mention loses no discoverability. Worth revisiting after a
scanner allowlist for this specific file lands.
Follow-up to #32087 after community report from @ethernet that 8000-char
single-line pastes get dumped raw into the input box.
A) Fallback regression revert
paste_collapse_threshold_fallback default: 0 -> 5
#32087 disabled the fallback handler by default. The fallback path
has been always-on with line_count >= 5 since #3065 (March 2026);
the previous shape was the salvaged contributor's design and didn't
match pre-existing behavior for terminals without bracketed paste
support (Windows terminals, some SSH setups). Restoring the original
on-by-default.
B) Long single-line paste guard
New config key: paste_collapse_char_threshold (default 2000)
Bracketed-paste handler and fallback handler now BOTH collapse when
line count >= line threshold OR total char length >= char threshold.
Catches the case ethernet hit: ~8000 chars of minified JSON / log
output on a single line dumped raw into the buffer.
TUI mirrors the same config via uiStore.pasteCollapseChars.
Set 0 to disable.
Defaults verified:
paste_collapse_threshold: 5
paste_collapse_threshold_fallback: 5
paste_collapse_char_threshold: 2000
Tests:
tests/hermes_cli/test_config.py: 87/87 pass
ui-tui useConfigSync.test.ts: 34/34 pass
ui-tui useComposerState.test.ts: 9/9 pass
tsc: 0 new errors in touched files
Reported via AskClaw. When config.yaml has `model: <name>` (flat string)
instead of the nested `model: {default: ..., provider: ...}` form, every
gateway `/model X --global` crashed silently with
TypeError: 'str' object does not support item assignment
The persist block did:
model_cfg = cfg.setdefault("model", {})
model_cfg["default"] = result.new_model
`setdefault` returns the existing scalar, and the next assignment blows
up. The 'switch failed' warning was logged at WARNING level and the user
never saw why their persist didn't stick.
Coerce scalar/None `model:` into a dict before mutation, in both the
gateway path (`gateway/run.py`) and the sister site in
`hermes_cli/doctor.py --fix` (same setdefault-on-string flaw). The CLI
`/model` path is unaffected because it goes through `_set_nested` which
already replaces scalar leaves with dicts.
Regression test `tests/gateway/test_model_command_flat_string_config.py`
covers the flat-string, missing, and proper-dict cases. Without the fix,
the flat-string case fails with the exact original TypeError.
`load_hermes_dotenv()` is called at module-import time from cli.py,
hermes_cli/main.py, run_agent.py, trajectory_compressor.py, gateway/run.py,
tui_gateway/server.py, acp_adapter/entry.py, and a few others. Each call
triggered `_apply_external_secret_sources()`, which re-parsed config,
re-fetched from Bitwarden Secrets Manager (its own 300s cache mostly absorbed
this), re-ran the ASCII sanitization sweep, and reprinted
Bitwarden Secrets Manager: applied N secret(s) (...)
to stderr. Users saw the status line 3-5x per CLI startup.
Guard the function with a process-level set of HERMES_HOME paths that have
already had external secrets applied. Subsequent calls for the same home_path
are no-ops. `reset_secret_source_cache()` lets tests (and any future
long-running consumer that wants to refresh after a config change) force a
re-pull.
Two posture fixes surfaced by the web-pentest skill self-test against
the dashboard (issue #32267).
1. /dashboard-plugins/<name>/<path> previously returned 200 for any
file inside the plugin's dashboard directory — including
plugin_api.py and __pycache__/*.pyc. The path is unauthenticated by
architecture (SPA loads JS via <script src> and CSS via <link href>,
neither of which can attach a custom auth header), so the fix is
not "require token" — it's "restrict to browser-fetchable suffixes."
Allowlist now: .js .mjs .css .json .html .svg .png .jpg .jpeg .gif
.webp .ico .woff .woff2 .ttf .otf .map. Everything else → 404.
This stops a private user-installed plugin's Python source from
being readable by anyone reachable on the dashboard's loopback port
(other local users on a shared box, sidecar containers sharing the
host netns).
2. save_env_value() now refuses to persist env-var names that
influence how the next subprocess executes: LD_PRELOAD,
LD_LIBRARY_PATH, LD_AUDIT, DYLD_*, PYTHONPATH, PYTHONHOME,
PYTHONSTARTUP, NODE_OPTIONS, NODE_PATH, PATH, SHELL, EDITOR,
VISUAL, PAGER, BROWSER, GIT_SSH_COMMAND, GIT_EXEC_PATH; plus
HERMES_HOME / HERMES_PROFILE / HERMES_CONFIG / HERMES_ENV.
PUT /api/env is authed but the session token lives in the SPA HTML
where any future plugin XSS or local process can read it. Without
this gate, a token-holder could plant LD_PRELOAD in .env and the
next hermes process start would load attacker code via the dotenv
to os.environ chain. This is enforced on write only — pre-existing
.env values are left alone (the gate is in save_env_value, not in
load_env). PUT /api/env now returns 400 with the explanatory
message instead of an opaque 500.
IMPORTANT: HERMES_* overall is NOT blocked — only the four runtime
location names. Integration credentials following the HERMES_*
convention (HERMES_GEMINI_*, HERMES_LANGFUSE_*, HERMES_SPOTIFY_*,
HERMES_QWEN_BASE_URL, ...) keep working.
Regression tests cover both fixes (30 new test cases). No existing
tests changed; 257 passing in tests/hermes_cli/.
Closes#32267.
Adds two new config keys:
- paste_collapse_threshold (default: 5) — line count threshold for
bracketed paste collapse in both TUI and CLI
- paste_collapse_threshold_fallback (default: 0, disabled) — same for
the fallback heuristic in terminals without bracketed paste support
TUI frontend reads these from config.get full via applyDisplay/patchUiState.
CLI reads from self.config at paste-handling time.
Closes#5626
Related: #5623
The gateway's media delivery allowlist required files live inside
`~/.hermes/cache/{documents,images,...}`, which is the wrong shape for
real agent usage. Agents naturally produce artifacts via terminal tools
(`pandoc -o /tmp/report.pdf`, `matplotlib savefig`, etc.) or
write_file into project directories — these never land under the cache.
Result: users got a raw file path in chat instead of an attachment.
This is doubly bad in deployment shapes where the cache directories
aren't writable by the agent at all: Hermes running in Docker with a
read-only mount, or with a Docker/Modal/SSH terminal backend whose
filesystem isn't the gateway host's filesystem.
Layered trust model:
1. Cache-dir allowlist (unchanged) — Hermes-managed roots always trusted.
2. Operator allowlist — `HERMES_MEDIA_ALLOW_DIRS` env var, now also
surfaced as `gateway.media_delivery_allow_dirs` in config.yaml.
3. Recency-based trust (new, default on) — files whose mtime is within
`gateway.trust_recent_files_seconds` (default 600s) of "now" are
trusted even outside the cache/operator allowlist. Old host files
(`/etc/passwd`, `~/.bashrc`, `~/.ssh/id_rsa`) have mtimes measured
in days/months, well outside the window — prompt-injection paths
pointing at pre-existing files are still rejected.
4. Hard denylist — `/etc`, `/proc`, `/sys`, `/dev`, `/root`, `/boot`,
`/var/{log,lib,run}`, plus `$HOME/.{ssh,aws,gnupg,kube,docker,config,
azure,gcloud}` and `Library/Keychains`. Denylist blocks delivery
even when recency would trust the file, in case an attacker
somehow refreshes a sensitive file's mtime.
Operators who want strict-allowlist behavior set
`gateway.trust_recent_files: false` and the system reverts to
pre-existing behavior.
Tests: 6 new cases in test_platform_base.py cover the recency window,
disabled mode, system-path denylist, and the motivating PDF-in-project
scenario. 3 existing tests (test_platform_base, test_tts_media_routing,
test_send_message_tool) that exercised the strict-allowlist path are
updated to disable recency trust explicitly.
E2E validation: real `validate_media_delivery_path()` accepts fresh
PDFs in /tmp and project dirs, rejects /etc/passwd, ~/.ssh/id_rsa, and
files older than the window; config.yaml `gateway.*` keys bridge
correctly to the env vars the validator reads.
_update_via_zip downloads a source ZIP from GitHub and calls
zipfile.ZipFile.extractall. The existing zip-slip path guard validates
each member's path stays under tmp_dir, but does not check member type
— so a ZIP containing a symlink member would still be materialized by
extractall, and a symlink target could point outside the extracted
tree (or to a sensitive system path).
This isn't a high-likelihood threat for hermes-agent's actual GitHub
source ZIPs (we don't ship symlinks), but the extractall path runs as
the user's account and a compromised mirror could plant arbitrary files
via the symlink → target → write chain.
Reject any member whose Unix mode bits (upper 16 bits of external_attr)
are S_IFLNK before extractall. Hermes source ZIPs contain only regular
files and directories; a symlink member is unambiguously suspicious.
Regression tests cover: symlink member rejection (raises ValueError,
caught by the outer try/except as a clean SystemExit, no extraction),
and the happy-path verification that a normal ZIP doesn't trigger the
symlink reject message.
Salvaged from PR #15881 by @codeblackhole1024. The remaining pieces of
that PR were already on main or contradicted explicit design decisions:
- config.yaml write-deny: already in agent/file_safety.py's
control_file_names denylist (the modern guard); the proposed addition
to build_write_denied_paths was the legacy path.
- Quick commands danger detection: contradicts the explicit
cli.py:8491-8492 comment 'shell=True is intentional: quick_commands
are user-defined shell snippets from config.yaml — not agent/LLM
controlled.'
- Memory plugin shlex.split for dep checks: already on main
(hermes_cli/memory_setup.py:133).
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
.env holds API keys and secrets. Multiple creation sites used `cp` /
`touch` / `shutil.copy2` which obey the process umask — commonly
0o022, leaving the file at 0o644 (world-readable). Apply chmod 0o600
explicitly at every site that creates or copies .env.
Sites covered:
- docker/stage2-hook.sh: after the seed_one '.env' call, applied
unconditionally (not just on first-seed) so a host-mounted .env with
loose perms gets tightened on every container restart
- hermes_cli/doctor.py: 'hermes doctor --fix' touches an empty .env
when missing
- hermes_cli/profiles.py: 'hermes profile create --clone' copies .env
from the source profile; shutil.copy2 preserves source mode, so a
source .env at 0o644 was being cloned into 0o644
- setup-hermes.sh: in-tree setup script's cp .env.example .env path,
plus the already-exists branch (mirror of install.sh which already
chmods 600 unconditionally on line 1442)
scripts/install.sh was NOT changed — it already chmod 600's the .env
unconditionally after the create/already-exists branches (line 1442).
Salvaged from PR #25726 by @dusterbloom. The docker/entrypoint.sh
portion of the original PR was dropped because main switched to an
s6-overlay shim — the .env creation logic moved to stage2-hook.sh,
which is where the chmod now lives.
Closes#25497 (subset — install.sh + setup-hermes.sh) and #8448
(subset — install.sh only) as superseded.
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
* perf(bitwarden): persist secret-fetch cache across CLI invocations
Every `hermes` invocation paid a ~380ms tax for `bws secret list` to
Bitwarden Secrets Manager because the existing cache was in-process only.
Back-to-back `hermes chat -q`, gateway-spawned agents, and cron-launched
runs all re-fetched.
Adds a disk-persisted L2 cache at `<hermes_home>/cache/bws_cache.json`
(mode 0600, never contains the access token — only the SHA-256
fingerprint prefix). Same TTL as the in-process cache. Read on miss,
write on bws success, ignored on key mismatch / corruption / expiry.
Measured on a startup profile:
load_hermes_dotenv() cold: 372ms → warm (disk cache hit): 20ms
End-to-end `hermes --version` cold→warm: 666ms → ~295ms.
In a hermes-vs-codex benchmark across 11 single- and multi-turn tasks
(framework overhead = wall − llm − tool_exec, median over 3 trials):
cohort before after saved
single-turn (median) 2.96s 2.31s -0.65s
multi-turn (5-turn) 9.40s 8.95s -0.45s (≈0.3s/turn)
Hermes now wins head-to-head on 6/11 tasks vs codex (was 4/11 before).
The remaining ~0.6s single-turn delta is mostly Python's own import
cost in hermes_cli.main, which is a separate optimization.
* perf(cli): lazy-load model catalog + dedupe config.yaml reads at startup
Two import-time wins on top of the bws disk-cache fix:
1. Lazy-load `hermes_cli.models._PROVIDER_MODELS` via PEP 562
module-level `__getattr__`. The catalog is ~55ms of work that was
eagerly imported on every CLI invocation (line 4557 `if not
_is_termux_startup_environment(): from hermes_cli.models import
_PROVIDER_MODELS`). Audit showed every internal call site already
does its own function-local import; only test code reads
`hermes_cli.main._PROVIDER_MODELS` as a module attribute, and
__getattr__ keeps that working transparently. First access triggers
the import once and caches the result on the module via
`globals()[name] = ...`, so subsequent reads are dict lookups.
2. Dedupe the double config.yaml read in the top-of-module bootstrap.
Previously: one raw yaml.safe_load for the `security.redact_secrets`
bridge, then a separate full `load_config()` (with deep-merge) for
`network.force_ipv4`. Both keys come from the same file. Merged
into one raw yaml load.
Combined with the bws cache fix in the previous commit:
hermes --version wall time:
original (cold): 666 ms
after bws fix (warm): 295 ms
after lazy-load + dedupe: 228 ms (-67 ms additional, -66% from original)
Tests:
- tests/hermes_cli/test_api_key_providers.py: 173/173 pass
(lazy __getattr__ correctly handles
`from hermes_cli.main import _PROVIDER_MODELS`)
- tests/test_ipv4_preference.py + tests/hermes_cli/test_redact_config_bridge.py +
tests/agent/test_redact.py: 93/93 pass (dedupe preserves both bridges)
- tests/test_bitwarden_secrets.py + env_loader tests: 49/49 pass
Add an opt-in Python plugin surface for speech-to-text backends,
mirroring the TTS hook pattern. New backends (OpenRouter, SenseAudio,
Gemini-STT, custom proprietary engines) can be implemented as plugins
without modifying tools/transcription_tools.py.
Built-ins always win
--------------------
The 6 built-in STT providers (local/faster-whisper, local_command,
groq, openai, mistral, xai) keep their native handlers. Plugins
attempting to register under a built-in name are rejected at
registration time with a warning and re-checked defensively at
dispatch.
Resolution order
----------------
1. stt.provider matches a built-in → built-in dispatch (unchanged)
2. stt.provider matches a registered plugin →
a. if plugin.is_available() returns False → unavailability envelope
identifying the plugin (not the generic "No STT provider"
message — the user explicitly opted into this plugin)
b. otherwise plugin.transcribe() with model + language forwarded
from stt.<provider>.{model,language} config
3. No match → legacy "No STT provider available" error (unchanged)
Per-provider config namespace
-----------------------------
Plugins read their config from stt.<provider> in config.yaml, mirroring
how built-ins read stt.openai.model / stt.mistral.model. The dispatcher
forwards `model` and `language` from this section. Caller's explicit
`model=` argument overrides the config-set model.
Files
-----
- agent/transcription_provider.py: TranscriptionProvider ABC
- agent/transcription_registry.py: register/get/list providers,
built-in shadow guard, _reset_for_tests
- hermes_cli/plugins.py: register_transcription_provider() on
PluginContext
- tools/transcription_tools.py: BUILTIN_STT_PROVIDERS frozenset,
_dispatch_to_plugin_provider() with availability gate, wire-in
after xai branch and before "No STT provider" error
- tests/agent/test_transcription_registry.py: 27 tests
- tests/hermes_cli/test_plugins_transcription_registration.py: 3 tests
- tests/tools/test_transcription_plugin_dispatch.py: 28 tests
(covering built-in short-circuit, plugin dispatch, exception
envelope, non-dict guard, availability gate, language forwarding)
- tests/plugins/transcription/check_parity_vs_main.py: 10-scenario
subprocess-pinned parity harness vs origin/main
- website/docs/user-guide/features/{tts,plugins}.md: docs
Behavior parity
---------------
10 scenarios, 8 OK + 2 expected DIFFs:
no_provider_error → plugin (plugin-installed scenario)
no_provider_error → plugin_unavailable (plugin-installed-unavailable
scenario; PR returns cleaner envelope)
Zero behavior change for users not opting into a plugin.
Issue follow-up to #30398.
Two defense-in-depth fixes on cron output path handling:
1. cron/jobs.py:update_job() rejects mutation of the immutable 'id' field
(raises ValueError). Dashboard PUT /api/cron/jobs/{id} converts this to
HTTP 400. Without this, an attacker who can reach the update endpoint
could rename a job's id to '../escape' and move its output directory
outside OUTPUT_DIR.
2. cron/jobs.py:_job_output_dir() validates job IDs before composing
paths: rejects '.', '..', '/', '\\', absolute paths, and Windows drive
prefixes. Used by save_job_output() and remove_job() so legacy unsafe
IDs (from before this guard) fail closed rather than half-applying a
shutil.rmtree or output write outside the sandbox.
Tests:
- update_job rejects {'id': '../escape'} without renaming
- remove_job(legacy '../escape' id) raises ValueError without deleting
files outside OUTPUT_DIR or removing the job from the store
- save_job_output rejects '..', './escape', 'nested/escape',
absolute paths
- dashboard PUT /api/cron/jobs/{id} with {'id': '../escape'} returns
400, job list unchanged
Salvaged from PR #29826 by @zapabob. Simplified implementation:
- Dropped a 23-line _validate_job_output_id() helper using Path.parts
semantics. The inline check (path separators + dot-components +
is_absolute) is shorter and behaviorally identical.
- Dropped the secondary OUTPUT_DIR.resolve()/relative_to() check —
redundant once we reject any path separator at the input boundary.
- Dropped the _docs/2026-05-21_cron-output-path-hardening_codex.md
planning artifact (we don't check planning docs into the repo).
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
Follow-up on top of @jacevys' PR #21437 cherry-pick:
- _provider_model_ids() now also matches normalized == 'openai-api' for
the live /v1/models fetch path, so users see the full catalog instead
of just the curated list.
- Add gpt-5.5-pro and gpt-5.3-codex to the curated list for parity with
the existing 'openai' table (used as fallback when /v1/models fails).
- Add scripts/release.py AUTHOR_MAP entry for jacevys so CI doesn't
block the salvage PR.
#27385 reports that on macOS the browser sees the xAI 'authorization
received' success page but Hermes still raises xai_callback_timeout.
The loopback HTTP handler was silent — no log line on receipt, no log
line on wait timeout — so triaging the gap between 'browser saw
success' and 'CLI saw timeout' required either a code change or
guesswork.
Adds two INFO log lines:
- Per callback hit (handler): path, has_code, has_state, has_error,
truncated User-Agent. Booleans / fingerprints only — no actual
code/state strings leak.
- On wait timeout: report whether result.code or result.error was
populated at deadline. Distinguishes three failure modes:
1. No hit log + timeout log w/ has_code=False has_error=False
→ xAI's IDP never reached the loopback (firewall, port-binding,
IPv6/IPv4 mismatch, browser blocked private-network access).
2. Hit log w/ has_code=False has_error=False + timeout log
→ xAI hit the loopback without OAuth params (the bare-URL
case the handler already 400s on).
3. Hit log w/ has_code=True + timeout log w/ has_code=False
→ result_lock contention or race; would indicate a real bug.
133/133 in tests/hermes_cli/test_auth_xai_oauth_provider.py,
tests/hermes_cli/test_xai_oauth_pkce_token_exchange.py, and
tests/run_agent/test_codex_xai_oauth_recovery.py.
Follow-up to @Strontvod's fix.
Tests:
- Five new tests in test_update_concurrent_quarantine.py cover the parent-
chain exclusion: the .exe launcher is excluded, an unrelated sibling
hermes.exe is still reported, multi-level ancestry is fully excluded,
PID cycles in the parent chain don't hang, and a partially-stubbed
psutil (no Process attribute) degrades gracefully instead of crashing.
- New _fake_psutil_with_parent_chain helper builds a fuller stand-in
(Process / NoSuchProcess / AccessDenied + process_iter) than the
process_iter-only SimpleNamespace the older tests use.
Hardening:
- Broaden the except in the parent-walk to bare Exception. The original
fix listed (NoSuchProcess, AccessDenied, ValueError), but those names
are evaluated lazily during exception matching — if psutil is a partial
stub without the attribute, the exception handler itself raises
AttributeError that escapes. The function is documented as 'never raises'
(the surrounding update flow depends on it), so the broader catch keeps
the contract regardless of how the dependency is shaped.
AUTHOR_MAP:
- Map schepers.zander1@gmail.com -> Strontvod so the salvaged commit
resolves to @Strontvod in the release notes.
All 18 detect_concurrent + quarantine tests pass.
On Windows, the setuptools-generated hermes.exe launcher is a separate native
process that spawns python.exe (the interpreter running the update code).
os.getpid() returns the Python PID, but the launcher (which holds the file
lock) is the parent. Without walking the parent chain, every 'hermes update'
reports its own launcher as a concurrent instance - a false positive.
This patch builds an exclusion set containing the Python process and its
entire ancestor chain, so the running invocation never reports itself.
Resolves the explicit "Known follow-up" left by commit 2f8ceeab9 and
the resulting CI failures in tests/docker/test_dashboard.py and
tests/docker/test_s6_profile_gateway_integration.py.
The product gap
---------------
Every hermes runtime operation inside the container runs as the
hermes user (UID 10000) via s6-setuidgid. But s6-supervise — spawned
by s6-svscan running as PID 1 — creates each service's supervise/
and top-level event/ directories with mode 0700 owned by its
effective UID (root). That left every s6-svc / s6-svstat / s6-svwait
call from hermes hitting EACCES on the supervise/control FIFO and
supervise/status — i.e. the entire S6ServiceManager lifecycle
(register, start, stop, unregister) was inert in production.
The 2f8ceeab9 commit message called this out and deferred the fix.
The audit changes that landed alongside it (defaulting docker_exec
to -u hermes) made the integration tests reproduce the bug
deterministically; the fix below resolves it.
The fix: pre-create the supervise/ skeleton hermes-owned
----------------------------------------------------------
Reading s6's source (src/supervision/s6-supervise.c::trymkdir +
control_init), the mkdir and mkfifo calls that build the supervise
tree are EEXIST-safe: if the directory or FIFO is already present,
s6-supervise reuses it and skips the chown/chmod fix-up that would
normally make event/ 03730 root:root. So if we lay the skeleton
down with hermes ownership before triggering s6-svscanctl -a,
s6-supervise inherits our layout and never touches it. The
death_tally / lock / status regular files written later by
s6-supervise (still as root) land mode 0644 — world-readable —
which is all s6-svstat needs.
New module-level helper _seed_supervise_skeleton(svc_dir) in
hermes_cli/service_manager.py lays down:
svc_dir/event/ hermes:hermes 03730
svc_dir/supervise/ hermes:hermes 0755
svc_dir/supervise/event/ hermes:hermes 03730
svc_dir/supervise/control hermes:hermes 0660 (FIFO)
svc_dir/log/event/ hermes:hermes 03730 (if log/ present)
svc_dir/log/supervise/ hermes:hermes 0755
svc_dir/log/supervise/event/ hermes:hermes 03730
svc_dir/log/supervise/control hermes:hermes 0660 (FIFO)
The log/ branch matters because the logger is a second
s6-supervise instance — without it, unregister rmtree races on
the logger's root-owned supervise dir even after the parent
slot's supervise/ is hermes-owned. The helper is idempotent and
swallows PermissionError on chown so it works equally well when
called from root (cont-init.d) or hermes (runtime register).
Wiring
------
1. S6ServiceManager.register_profile_gateway calls
_seed_supervise_skeleton(tmp_dir) just before publishing the
slot via Path.replace. Runtime-registered profile gateways are
set up by hermes.
2. container_boot._register_service does the same in the cont-init.d
reconciliation path so boot-time-restored profile slots inherit
the same layout.
3. New cont-init.d/015-supervise-perms script chowns the supervise/
and event/ trees for STATIC s6-rc services (dashboard,
main-hermes). These are spawned by s6-rc before cont-init.d
gets to run, so the EEXIST-trick doesn't apply; we chown the
already-existing tree instead. s6-supervise keeps using the
same files; it never re-asserts ownership on a running service.
The script skips s6-overlay internal services (s6rc-*,
s6-linux-*) so the supervision tree itself stays root-only.
015- slot is intentional: lex-sorts between 01-hermes-setup
and 02-reconcile-profiles in the container's C-locale, so
the chown finishes before the reconciler walks the scandir.
Unregister teardown reordering
------------------------------
S6ServiceManager.unregister_profile_gateway now fires
s6-svscanctl -an BEFORE rmtree (with a 200ms grace), so
s6-svscan reaps the supervise child and releases its file
handles on supervise/lock + supervise/status before we try to
remove the directory. Previously rmtree raced s6-supervise on a
set of files inside the supervise dir, and even with the parent
supervise/ now hermes-owned, the contained files (death_tally,
lock, status, written by root) could still be in use.
Dashboard down-state redesign
-----------------------------
The original PR #30136 review fix wrote a 'down' marker file
into /run/service/dashboard/ via cont-init.d/03-dashboard-toggle.
That approach was broken in two ways:
(a) /run/service/dashboard is a symlink to a TRANSIENT
/run/s6-rc:s6-rc-init:<tmpdir>/ directory while s6-rc is
mid-transaction; the touch landed in a soon-to-be-discarded
tmp.
(b) Even when written to the final /run/s6-rc/servicedirs/
location, the 'down' file is only consulted by s6-supervise
at slot startup. s6-rc's user-bundle explicitly transitions
'dashboard' to 'up' on every boot, overriding any down
marker.
The right fix is the canonical s6 pattern: when HERMES_DASHBOARD
is unset, the dashboard run script exits 0 and a companion
finish script exits 125. Per s6-supervise(8), exit code 125 from
the finish script is the 'permanent failure, do not restart'
marker — equivalent to s6-svc -O. The slot reports as 'down' to
s6-svstat, matching the reality that no dashboard process is
running. When HERMES_DASHBOARD IS truthy, finish exits 0 and
restart-on-crash semantics apply.
03-dashboard-toggle is removed (its function is now subsumed by
the run/finish pair).
Tests
-----
Adds four unit tests for _seed_supervise_skeleton covering the
produced layout, the log/ subservice case, the skip-when-no-log
case, and idempotency. The live-container verification continues
to live in tests/docker/test_s6_profile_gateway_integration.py and
tests/docker/test_dashboard.py — both now pass against the
rebuilt image.
References
----------
* Skarnet skaware mailing list 2020-02-02 (Laurent Bercot
+ Guillermo Diaz Hartusch) on unprivileged s6 tool semantics:
http://skarnet.org/lists/skaware/1424.html
* just-containers/s6-overlay#130 — same EEXIST-preseed pattern,
community-validated 2016 onward
* https://skarnet.org/software/s6/servicedir.html — exit-code 125
semantics in finish scripts
(cherry picked from commit c41f908ad4)
Two CI follow-ups to @benbarclay's #30136 salvage:
1. scripts/run_tests_parallel.py — add 'docker' to _SKIP_PARTS so
the new tests/docker/ harness doesn't run in the regular test (N)
matrix. The harness builds the real Dockerfile in a session
fixture, which can exceed pytest-timeout's 180s ceiling on
ubuntu-latest where Docker IS available — it surfaced as 6
identical setup-timeout failures across slices 1–6 on the first
CI run.
The docker harness has its own dedicated runner via
.github/actions/hermes-smoke-test (added in #30136) plus the
docker-lint workflow. Same treatment as tests/integration/ and
tests/e2e/ — runs separately, not in the main shards.
2. hermes_cli/service_manager.py — pin encoding='utf-8' on the
/proc/1/comm read_text call. Ruff PLW1514 enforcement rolled in
between Ben's last push and the salvage; pure ruff-fix, no
behavior change.
X Premium+ also grants Grok OAuth access — the 'SuperGrok Subscription'
wording suggested SuperGrok was the only entitlement path. Updated to
'SuperGrok / Premium+' across the picker label, setup wizard, auth flows,
and docs so Premium+ subscribers know the row applies to them too.
Follow-up to @benbarclay's Docker s6 PR (#30136). The Phase 4 hooks
`_maybe_register_gateway_service` and `_maybe_unregister_gateway_service`
were already documented as "no-op on host", but they reached that no-op
by:
1. importing `hermes_cli.service_manager`
2. calling `get_service_manager()` (which calls `detect_service_manager()`)
3. checking `mgr.supports_runtime_registration()` and returning False
If anything in step 1 or 2 raised an unexpected exception (e.g. a host
machine with a partial s6 install — `/proc/1/comm == s6-svscan` somehow,
but `/run/s6/basedir` absent, or vice versa), the `except Exception`
in the hook would print a confusing "⚠ Could not register s6 gateway
service: ..." warning on a non-container machine that has never touched
the container.
Reorder so `detect_service_manager() != "s6"` is checked FIRST, and
return silently for any detection failure. Host machines now:
- never import the s6 backend
- never call get_service_manager()
- never print an s6-shaped warning under any failure mode
E2E confirmed on host Linux (systemd):
`_maybe_register_gateway_service(...)` produces empty stdout,
detect_service_manager() returns "systemd".
Existing tests updated to patch `detect_service_manager` for the s6
call-through cases (they previously relied on get_service_manager
being the only gate, which is no longer true). Added one new test —
`test_register_silent_when_detect_throws` — asserting that a broken
detector cannot leak a warning to host users.
cc @benbarclay — visible behavior change vs. your branch is one
fewer code path on host. Test changes are minimal (one helper +
`_patch_detect_s6` opt-in per s6 test). Happy to revert if you
prefer the original shape.
Second migration of an existing built-in platform adapter after Discord
(PR #30591) — follows the same shape established by IRC / Teams / LINE /
Google Chat / SimpleX and the playbook in
`references/platform-plugin-migration.md`. Advances the umbrella refactor
in #3823.
Matches Discord's parity bar — adapter under `plugins/platforms/mattermost/`
with the standard `__init__.py` / `adapter.py` / `plugin.yaml` shell,
`register(ctx)` entry point, **no back-compat shim** at the old import
path, and full parity for all five hooks Discord uses plus the
`apply_yaml_config_fn` hook (mattermost is the second consumer of #25443
after Discord):
* `standalone_sender_fn` — out-of-process cron delivery via Mattermost
REST API. Picks up the thread_id + media_files capabilities the
legacy `_send_mattermost` lacked (parity with Discord's `_standalone_send`).
* `setup_fn` — interactive `hermes setup gateway` wizard.
* `apply_yaml_config_fn` — translates `config.yaml` `mattermost:` keys
(`require_mention`, `free_response_channels`, `allowed_channels`) into
`MATTERMOST_*` env vars (replaces the hardcoded block in
`gateway/config.py`).
* `is_connected` — declares connection state from `MATTERMOST_TOKEN` +
`MATTERMOST_URL`.
* `check_fn` — verifies aiohttp is installed and both required env vars
are set.
* plus `allowed_users_env`, `allow_all_env`, `cron_deliver_env_var`,
`max_message_length` (4000 — Mattermost practical limit), `emoji`,
`required_env`, `install_hint`.
Files
-----
* `gateway/platforms/mattermost.py` (873 LOC) →
`plugins/platforms/mattermost/adapter.py` (git rename, R071) +
appended `register()` block, hook helpers, and `_standalone_send`
with media upload + thread_id support.
* New `plugins/platforms/mattermost/{__init__.py, plugin.yaml}` with
`requires_env` / `optional_env` declarations covering MATTERMOST_URL,
MATTERMOST_TOKEN, MATTERMOST_ALLOWED_USERS, MATTERMOST_ALLOW_ALL_USERS,
MATTERMOST_HOME_CHANNEL, MATTERMOST_REPLY_MODE,
MATTERMOST_REQUIRE_MENTION, MATTERMOST_FREE_RESPONSE_CHANNELS,
MATTERMOST_ALLOWED_CHANNELS.
* `gateway/config.py`: delete 17-LOC `mattermost_cfg` YAML→env bridge
(moved into plugin's `_apply_yaml_config`).
* `gateway/run.py::_create_adapter`: delete `Platform.MATTERMOST elif` —
replaced by the existing generic plugin-registry-first dispatch.
* `tools/send_message_tool.py`: delete `_send_mattermost` (22 LOC) +
`Platform.MATTERMOST elif` in `_send_to_platform` — the `else` branch
already routes plugin platforms through `_send_via_adapter`, which
hits the registry's `standalone_sender_fn`.
* `hermes_cli/setup.py`: delete `_setup_mattermost` (44 LOC) — replaced
by the plugin's `interactive_setup`.
* `hermes_cli/gateway.py`: delete `_PLATFORMS["mattermost"]` dict entry
(3 LOC) — plugin's `setup_fn` is dispatched via the plugin path in
`_configure_platform`.
* Consumer rewrite: 5 test files (test_mattermost.py,
test_media_download_retry.py, test_send_multiple_images.py,
test_stream_consumer.py, test_ws_auth_retry.py) get
`gateway.platforms.mattermost` → `plugins.platforms.mattermost.adapter`
with the bulk-rewrite recipe from the platform-plugin-migration playbook.
Single `mock.patch` string in test_stream_consumer.py also repointed.
* `tests/tools/test_send_message_missing_platforms.py`: thin
`(token, extra, chat_id, message)` compat shim around the plugin's
`_standalone_send(pconfig, …)` so existing test bodies continue to
work without rewriting every signature.
Validation
----------
* Plugin discovery: mattermost registers from `plugins/platforms/mattermost/`
alongside discord / teams / irc / line / google_chat / simplex.
All 9 hooks present (setup_fn, standalone_sender_fn,
apply_yaml_config_fn, is_connected, check_fn, allowed_users_env,
allow_all_env, cron_deliver_env_var, max_message_length=4000).
* Mattermost-touching tests: 62/62 pass
(`test_mattermost.py` + `test_send_message_missing_platforms.py`).
* Targeted selectors (mattermost or platform_registry or stream_consumer
or ws_auth_retry or media_download_retry or send_multiple_images or
send_message_tool or platform_connected): 433/433 pass.
* Full sweep (`scripts/run_tests.sh tests/gateway/ tests/cron/
tests/tools/test_send_message_tool.py tests/tools/test_send_message_missing_platforms.py
tests/integration/`): **6220/6220 pass in 47.8s, 0 failures**.
* Lint: ruff clean on all touched files.
* Git identity verified: kshitijk4poor.
* Rename detection: R071 (similarity dropped from a hypothetical R09x
by the ~320-line appended register block — ~36% growth over the
873-LoC base, vs Discord's 5101 LoC base which kept R091).
Closes part of #3823.
PR #30136 review noted the asymmetry: `register_profile_gateway`
used tmp_dir + rename to publish a new service slot atomically,
but the boot-time reconciler wrote files into the slot directly.
Same underlying concern (a concurrent s6-svscan rescan could
observe a half-populated directory), different code path.
Rewrite `container_boot._register_service` to mirror the manager:
build everything in `<scandir>/gateway-<profile>.tmp/`, then
`Path.replace` into place. If a previous interrupted run left a
`.tmp` sibling, it's cleaned up before the new build starts. If
the target already exists, it's removed before the rename so
`Path.replace` doesn't error on a non-empty target (Linux `rename`
overwrites empty targets only).
Three new tests: atomic publication leaves no .tmp leftovers,
overwriting an existing slot still leaves no .tmp leftovers, and
a stale .tmp from an interrupted run is cleaned up automatically.
PR #30136 review noted: container-boot.log was append-only with no
rotation. On a long-lived container with frequent restarts and
many profiles it would grow unboundedly (~80 B per profile per
reconcile pass).
Add a soft cap: when the file size hits 256 KiB (`_LOG_ROTATE_BYTES`,
≈3000 reconcile lines, ≈1 year of daily reboots × 5 profiles), the
current file is renamed to `container-boot.log.1` (replacing any
existing one) before new entries are appended. Worst case is two
files at ~512 KiB — well within visibility limits for grep/cat.
Rotation is intentionally simple (no logrotate or s6-log machinery
for one append-only file). Failures during rotation are logged via
the module logger and treated as non-fatal — we keep appending to
the existing file rather than dropping the reconcile entry. Three
new unit tests cover above-threshold rotation, below-threshold
non-rotation, and overwrite of an existing .1 file.
PR #30136 review caught: `_allocate_gateway_port()` in profiles.py
computed a SHA-256-derived port that was threaded through
`register_profile_gateway(profile, port=N)` →
`_render_run_script(profile, port, extra_env)` → and then **ignored**.
The rendered run script picked the bind port from the profile's
config.yaml (`[gateway] port = …`), never from the allocator. So
the entire allocator + parameter chain was dead code.
Remove:
* `hermes_cli.profiles._allocate_gateway_port` (deterministic
SHA-256 → [9200, 9800) — never used).
* `port` kwarg from `ServiceManager.register_profile_gateway`
(Protocol + Mixin + S6 implementation).
* `port` positional arg from `_render_run_script(profile, port,
extra_env)` — now `_render_run_script(profile, extra_env)`.
* The pass-through call in `profiles._maybe_register_gateway_service`.
config.yaml is now the single source of truth for gateway port
selection — matches reality and reduces the API surface. Three
explanatory comments in service_manager.py / profiles.py document
the retirement so future readers don't reach for the allocator and
find a ghost.
Tests: drop the three `_allocate_gateway_port` tests; update
fakes' signatures throughout test_service_manager.py and
test_profiles_s6_hooks.py to match the new no-port API.
PR #30136 review caught: `S6ServiceManager.start/stop/restart` called
`subprocess.run(check=True)` on `s6-svc`, so any failure surfaced as
a raw `CalledProcessError` traceback. The two cases operators
actually hit are:
1. The service slot doesn't exist — most commonly because the user
typed a profile name wrong (`hermes -p typo gateway start`).
2. s6-svc itself fails — most commonly EACCES on the supervise
control FIFO when running unprivileged.
Both deserve named errors with actionable messages, not stacktraces.
Changes:
* Add `S6Error` base + two concrete errors in `hermes_cli.service_manager`:
- `GatewayNotRegisteredError(profile)` — carries the unprefixed
profile name; message: `no such gateway 'typo': register it
with `hermes profile create typo` first, or pass an existing
profile name via `-p <name>``.
- `S6CommandError(service, action, returncode, stderr)` — carries
the s6-svc rc and stderr; message: `s6-svc start on
'gateway-coder' failed (rc=111): <stderr>`.
* Factor lifecycle dispatch through `_run_svc(flag, label, name)`:
pre-checks that the service directory exists (raises
GatewayNotRegisteredError before invoking s6-svc), then runs
s6-svc and translates any CalledProcessError into S6CommandError.
* `_dispatch_via_service_manager_if_s6` in `hermes_cli.gateway`
catches both errors and prints `✗ <message>` + `sys.exit(1)`
instead of letting the exception bubble. The dispatch path that
used to dump a traceback at the user now gives an actionable
one-liner.
Tests: 6 new tests for the error types and their CLI rendering;
existing lifecycle test pre-seeds the slot directory before calling
`mgr.start` etc.
PR #30136 review caught: `hermes gateway start` (no `-p`) inside
the container resolves `_profile_suffix() == ""` → service name
`gateway-default`, but no such slot was ever registered. The Phase 4
profile-create hook only fired on `hermes profile create <name>`,
and the root profile (which lives at the top of $HERMES_HOME, not
under `profiles/`) was never one of those. So bare `hermes gateway
start` landed on `s6-svc -u /run/service/gateway-default` →
uncaught `CalledProcessError` → traceback to the user.
Changes:
1. `reconcile_profile_gateways` now always registers a
`gateway-default` slot before iterating named profiles. Its
prior state is read from `$HERMES_HOME/gateway_state.json`
(sibling to the profile root, not under `profiles/`); stale
runtime files there are swept the same way. Auto-up only if the
prior state was `running` — same rule as named profiles.
2. `S6ServiceManager._render_run_script` special-cases
`profile == "default"` to emit `hermes gateway run` with NO
`-p` flag. Passing `-p default` would resolve to
`$HERMES_HOME/profiles/default/` — a different profile that
almost certainly doesn't exist. The empty profile-suffix
convention is the dispatcher's contract and the run script has
to match.
3. A user-created `profiles/default/` collides with the reserved
root-profile slot; the reconciler now skips it with a warning
rather than producing two registrations of the same service name.
Action-list ordering is stable: `default` first, then named
profiles in directory order. Boot-log readers can rely on this.
Tests: 8 new dedicated default-slot tests plus updates to every
existing test that asserted against the action list (via the new
`_named_actions` helper that drops the always-present default
entry).
PR #30136 review caught that `hermes gateway stop --all` and
`... restart --all` were broken under s6. The Phase 4 dispatcher was
gated on `not stop_all` (and the symmetric restart_all), so `--all`
fell through to `kill_gateway_processes(all_profiles=True)`. pkill
SIGTERMed every gateway, s6-supervise observed the crashes, and
restarted every gateway ~1s later — net effect: `--all` *kicked*
gateways instead of *stopping* them.
Add `_dispatch_all_via_service_manager_if_s6(action)` that iterates
`mgr.list_profile_gateways()` and routes stop/restart through each
service slot. s6's `want up`/`want down` flips correctly, so a
stop persists. Partial failures are surfaced per-profile with a
running success count; the host pkill path is only reached when s6
isn't in play.
`start --all` isn't a CLI surface — the helper rejects it and
returns False (host code path can take over).
PR #30136 review surfaced two issues, both rooted in the same audit gap:
docker integration tests were running as root, not the unprivileged
`hermes` user (UID 10000) that the runtime actually uses via
`s6-setuidgid hermes`. Anything that probed PID-1 state or wrote to
the s6 control surface worked as root in the tests but was inert in
production.
Fixes:
1. `_s6_running()` previously called `Path("/proc/1/exe").resolve()`,
which is root-only readable. For UID 10000 the symlink yields
PermissionError, `resolve()` silently returns the unresolved path,
and `exe.name == "exe"` — so detection always returned False, the
service-manager runtime-registration path was inert, and every
`hermes profile create` / `hermes -p X gateway start` silently
skipped the s6 hook. Replace with `/proc/1/comm` (world-readable)
+ `/run/s6/basedir` (s6-overlay-specific) — both required, fail
closed.
2. `02-reconcile-profiles` now also chowns `/run/service/.s6-svscan/`
{control,lock} to hermes so `s6-svscanctl -a/-an` works without
root. Previously the directory chown stopped at `/run/service`
and the FIFO inside stayed root-owned, so `register_profile_gateway`
from hermes failed at the rescan-trigger step with EACCES — the
wrapper in profiles.py caught the exception and printed a swallowed
warning, so profile creation appeared to succeed while the slot
was rolled back.
Audit changes to flush this class of bug next time:
- Add `docker_exec` / `docker_exec_sh` helpers to `tests/docker/conftest.py`
that default to `-u hermes`. The module docstring explains why and
flags `user="root"` as opt-in only for tests that explicitly need
root (none currently do).
- Refactor every `docker exec` call in tests/docker/ through the new
helpers (test_dashboard.py, test_zombie_reaping.py, test_profile_gateway.py,
test_container_restart.py, test_s6_profile_gateway_integration.py).
- Add 5 unit tests covering `_s6_running` under various probe states
(both signals present; comm wrong; basedir missing; PermissionError
on /proc/1/comm; missing /proc — non-Linux). The PermissionError
test is the explicit regression guard for the original bug.
Known follow-up: the per-service `supervise/control` FIFO inside each
`/run/service/gateway-<profile>/supervise/` is created root-owned by
s6-supervise (which runs as root because s6-svscan is PID 1). `s6-svc
-u/-d/-t` from the hermes user will get EACCES on those. The audit
under `-u hermes` will reveal this in lifecycle tests — surfacing the
issue cleanly so it can be fixed in a focused follow-up (likely via a
small SUID helper or a polling chown loop in cont-init.d). The
detection + svscanctl fixes here are independent and complete on
their own.
The s6-overlay migration replaced every runtime use of gosu with
s6-setuidgid (in stage2-hook.sh, main-wrapper.sh, per-service run
scripts, and cont-init.d hooks), but the gosu binary itself was still
being copied into the image from tianon/gosu, and several comments
across the repo still pointed to it.
Image changes:
- Drop the FROM tianon/gosu:1.19-trixie AS gosu_source stage
- Drop the COPY --from=gosu_source /gosu /usr/local/bin/ layer
- Net: one fewer base-image pull, ~12-15 MB layer eliminated
Documentation/comment refresh (no behavior change):
- Dockerfile: update root-user rationale comment + cont-init.d comment
- docker/main-wrapper.sh: drop "pre-s6 contract (gosu drop)" reference
- docker-compose.yml: update UID/GID remap comment
- .hadolint.yaml: update DL3002 ignore rationale
- website/docs/user-guide/docker.md: privilege-drop helper is s6-setuidgid now
- hermes_cli/config.py: docker_run_as_host_user docstring
tools/environments/docker.py runs *arbitrary user images* via the
terminal backend, not the bundled Hermes image. It still needs SETUID/
SETGID caps so user images that use gosu/su/s6-setuidgid all work.
Renamed the cap-list constant _GOSU_CAP_ARGS → _PRIVDROP_CAP_ARGS and
updated comments to list s6-setuidgid alongside the others as examples.
The matching test (test_security_args_include_setuid_setgid_for_gosu_drop
→ test_security_args_include_setuid_setgid_for_privdrop) was renamed
and its docstring updated; behavior is unchanged.
Verification:
- hadolint clean against .hadolint.yaml
- shellcheck clean against all docker/ shell scripts
- Image rebuilt successfully (sha 1a090924ccea)
- Docker harness: 19 passed in 41.87s (every Phase 0 test + Phase 4
per-profile-gateway lifecycle + container-restart reconciliation)
- tests/tools/test_docker_environment.py: 23 passed (rename did not
break test discovery; pre-existing unrelated mock warning)
The plan document (docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md)
intentionally retains its historical references to gosu — it describes
the pre-s6 entrypoint as background for understanding the migration.
Phase 5 of the s6-overlay supervision plan. Documentation + small
diagnostic cleanups; no behavior changes.
website/docs/user-guide/docker.md:
- Replace the old 'entrypoint script does the bootstrap' section
with the s6-overlay boot flow (cont-init.d/01-hermes-setup,
cont-init.d/02-reconcile-profiles, static main-hermes + dashboard
services, ENTRYPOINT-as-main-program pattern).
- Add a 'Per-profile gateway supervision' subsection covering the
new lifecycle commands, restart semantics, log persistence, and
'Manager: s6 (container supervisor)' status reporting.
- Add 'Breaking change vs. pre-s6 images' callout naming the
/init ENTRYPOINT and pointing affected wrappers at the pin
workaround.
website/docs/user-guide/profiles.md:
- Add a note under 'Persistent services' pointing container users
at the docker.md section explaining s6 supervision inside the
image. Host-side systemd/launchd documentation is unchanged.
skills/software-development/hermes-s6-container-supervision/SKILL.md:
- New maintainer skill covering the supervision-tree map, file
layout, the Architecture B rationale (cont-init.d args + halt
exit-code propagation), quick recipes, and the 8 pitfalls we hit
while implementing the plan (PATH-without-/command, root-owned
profile dirs, SOUL.md as marker, the '143' anti-pattern, etc.).
hermes_cli/doctor.py:
- _check_gateway_service_linger skips on s6 (the linger concept
doesn't apply inside the container).
- New _check_s6_supervision section reports main-hermes/dashboard
state and per-profile-gateway count (registered vs supervised
up), only inside the s6 container. Host doctor output unchanged.
- External Tools / Docker check no longer emits a 'docker not
found' warning inside the container; prints an explanatory
info line instead. Still respects an explicit TERMINAL_ENV=docker
(in case the user mounted /var/run/docker.sock).
hermes_cli/gateway.py:
- Document _container_systemd_operational more precisely: it's
NOT for our Hermes Docker image (s6-overlay handles that via
detect_service_manager() == 's6'). It still covers
systemd-nspawn / k8s-with-systemd-init cases, so leaving it in
place is correct; the docstring just makes that explicit.
Test harness (verification, no test changes in this commit):
19 passed, 0 xfailed. 66 service-manager / container-boot /
profiles-s6-hooks / gateway-s6-dispatch unit tests still green.
61 doctor tests still green. Hadolint + shellcheck clean.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
Phase 4 of the s6-overlay supervision plan. Activates the Phase 3
S6ServiceManager by hooking it into the profile lifecycle and the
`hermes gateway start/stop/restart` dispatcher, and adds a cont-
init.d-time reconciliation pass that survives `docker restart`.
Task 4.0 — container-boot reconciliation:
/run/service/ is tmpfs, so every `docker restart` wipes every
per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles
invokes hermes_cli.container_boot.reconcile_profile_gateways() on
every boot, which walks $HERMES_HOME/profiles/<name>/, reads each
gateway_state.json, recreates the s6 service slot, and auto-starts
only those whose last state was 'running'. Other states
(stopped, starting, startup_failed, missing) register the slot
in the down state — avoiding crash-loops across restarts for a
gateway that was broken last boot. Per-profile outcome is recorded
to $HERMES_HOME/logs/container-boot.log.
Implementation: hermes_cli/container_boot.py + 12 unit tests.
Profile-marker is SOUL.md, not config.yaml, because `hermes profile
create` only seeds SOUL.md by default (config.yaml comes from
`hermes setup`).
Task 4.1 / 4.2 — profile create/delete hooks:
hermes_cli/profiles.py::create_profile now calls
_maybe_register_gateway_service(<canon>) at the end, which routes
through ServiceManager.register_profile_gateway when running on s6
and no-ops on host backends. delete_profile mirrors with
_maybe_unregister_gateway_service. _allocate_gateway_port produces
a deterministic SHA-256-derived port in [9200, 9800).
Task 4.3 — gateway dispatch + remove rejection arms:
_dispatch_via_service_manager_if_s6(action) intercepts
start/stop/restart at the top of each subcommand and routes them
through S6ServiceManager.{start,stop,restart}. The pre-Phase-4
`elif is_container():` rejection arms are kept as fallback for
pre-s6 containers / unsupported runtimes, but only ever fire when
detect_service_manager() != 's6'. install/uninstall under s6
print informational guidance pointing users at profile create/delete.
Removed the two xfail(strict=True) markers from
tests/docker/test_profile_gateway.py — both tests now pass strictly.
Task 4.4 — status reporting:
get_gateway_runtime_snapshot() reports
Manager: 's6 (container supervisor)' inside an s6 container instead
of 'docker (foreground)'.
Plan-vs-reality drift fixed in this commit:
- Plan's S6ServiceManager._render_run_script used
`gateway start --foreground --port {port}` — invented args; the
real CLI is `gateway run`. Switched accordingly. port arg
retained for API parity but now documented as 'currently ignored'.
- Plan's reconciler keyed on config.yaml; switched to SOUL.md
(config.yaml is created by hermes setup, not by hermes profile
create, so the original gate caught nothing).
- The plan's _dispatch helper used _profile_arg() which returns
'--profile <name>' (i.e. with the flag prefix). Switched to
_profile_suffix() which returns the bare name.
- Architecture B's docker exec doesn't get /command on PATH or
the venv on PATH; Dockerfile's runtime PATH now includes
/opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works
without sourcing the venv.
- stage2-hook now chowns $HERMES_HOME/profiles to hermes on every
boot, not just on the UID-remap path. Without this, files created
by docker-exec-as-root accumulate and the next reconciler run
fails with PermissionError reading SOUL.md.
Test harness:
19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to
passing). 78 unit tests across service_manager + container_boot +
profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck
pass cleanly.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
Phase 3 of the s6-overlay supervision plan. Implements the runtime-
registration surface from D4 — only the s6 backend supports
register_profile_gateway / unregister_profile_gateway /
list_profile_gateways; host backends continue to raise
NotImplementedError. No caller yet (Phase 4 wires in the profile
create/delete hooks).
Key implementation notes:
- Service directory shape: /run/service/gateway-<profile>/{type,run,log/run}.
Atomic register: write to gateway-<profile>.tmp, fsync via
os.rename. Cleanup on rescan failure.
- Run script uses #!/command/with-contenv sh so HERMES_HOME and any
extra_env arrive at exec time. The hermes -p <profile> gateway
start --foreground --port <port> command is wrapped in
s6-setuidgid hermes for the per-service privilege drop (OQ2-A).
- Log script (OQ8-C): persists via s6-log to
${HERMES_HOME}/logs/gateways/<profile>/. CRITICAL — HERMES_HOME is
a runtime env-var expansion in the rendered script, NOT a Python
f-string substitution. Negative-asserted in
test_s6_register_creates_service_dir_and_triggers_scan so
regressions are caught.
- PATH gotcha: /command/ is only on PATH for processes spawned by
the supervision tree (services, cont-init.d). `docker exec` and
profile-create hooks don't get it. S6ServiceManager calls all
s6-* binaries via absolute path through the new _S6_BIN_DIR
constant so callers don't have to fix up env vars.
- validate_profile_name rejects path-traversal, leading-dash (s6
would parse as a flag), uppercase, whitespace, and names >251
chars (s6-svscan default name_max).
Test coverage:
- 13 new unit tests in tests/hermes_cli/test_service_manager.py
(kind detection, run-script content, env quoting, register
rollback on rescan failure, unregister idempotence, list filter,
lifecycle dispatch, svstat parsing). Total: 36 passing.
- 2 new in-container integration tests in
tests/docker/test_s6_profile_gateway_integration.py validating
end-to-end registration against a real s6 supervision tree.
Docker harness: 14 passed, 2 xfailed (Phase 4 target unchanged).
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
Phase 1 of the s6-overlay supervision plan. Pure-refactor addition:
introduces the abstract interface (with runtime_checkable Protocol),
detect_service_manager(), validate_profile_name(), and thin
SystemdServiceManager / LaunchdServiceManager / WindowsServiceManager
wrappers around the existing systemd_* / launchd_* / gateway_windows.*
module-level functions. No host call site was modified — host code
continues to use the existing functions directly; the protocol is for
new backend-agnostic code (Phase 4 profile create/delete hooks and the
Phase 4 s6 dispatch path in 'hermes gateway start/stop/restart').
WindowsServiceManager.install() forwards the v3 kwargs (start_now,
start_on_login, elevated_handoff) added in PRs #28169-adjacent so
non-Windows callers — there aren't any today — can opt in.
The s6 backend lands in Phase 3; until then get_service_manager()
raises a clear error if invoked on a host that detects as 's6'.
Adds a `TTSProvider(ABC)` + `register_tts_provider()` extension point
to the plugin context API, **alongside** the existing config-driven
`tts.providers.<name>: type: command` registry from PR #17843. This is
additive — the command-provider surface stays as the primary way to
add a TTS backend.
The hook covers cases the shell-template grammar can't reasonably
express:
- Native Python SDKs without a CLI (Cartesia, Fish Audio, etc.)
- Streaming synthesis (chunked Opus → voice-bubble delivery)
- Voice metadata API for the `hermes tools` picker
- OAuth-refreshing auth flows
None of the 10 inline built-in providers (`edge`, `openai`,
`elevenlabs`, `minimax`, `gemini`, `mistral`, `xai`, `piper`,
`kittentts`, `neutts`) are migrated to plugins. They stay inline. The
hook is for *new* engines that aren't built-in.
## Resolution order
The dispatcher's resolution order is the load-bearing invariant:
1. `tts.provider` is a built-in name → built-in dispatch. **Always wins.**
2. `tts.provider` matches `tts.providers.<name>` with `command:` set
→ command-provider dispatch (PR #17843).
3. `tts.provider` matches a plugin-registered `TTSProvider`
→ plugin dispatch (new).
4. No match → falls through to Edge TTS default (legacy behavior).
Built-ins-always-win is enforced at THREE layers:
- Registry: `register_provider()` rejects shadowing names with a warning.
- Dispatcher: `_dispatch_to_plugin_provider()` short-circuits built-in
names defensively before consulting the registry.
- Picker: `_plugin_tts_providers()` filters built-in shadows out of
the `hermes tools` row list defensively.
Command-providers-win-over-plugins is enforced at TWO layers:
- The caller in `text_to_speech_tool` checks
`_resolve_command_provider_config` first.
- `_dispatch_to_plugin_provider` re-checks for a same-name command
config defensively so a refactor of the caller can't silently break
the invariant.
## New files
- `agent/tts_provider.py` — `TTSProvider(ABC)` with `synthesize()` (required),
`list_voices()`, `list_models()`, `get_setup_schema()`, `stream()`,
`voice_compatible` (all optional with sane defaults). Mirrors
`agent/image_gen_provider.py` shape.
- `agent/tts_registry.py` — `register_provider`/`get_provider`/`list_providers`
with `_BUILTIN_NAMES` reject-shadowing invariant. Mirrors
`agent/image_gen_registry.py` shape.
- `plugins/tts/...` directory ready for community plugins (none shipped).
## Modified files
- `hermes_cli/plugins.py` — `register_tts_provider()` method on
`PluginContext`. Matches the gating shape of
`register_image_gen_provider()` / `register_browser_provider()`.
- `tools/tts_tool.py` — `_dispatch_to_plugin_provider()` +
`_plugin_provider_is_voice_compatible()` + walrus-elif wiring into
the main dispatcher. Built-in elif chain untouched.
- `hermes_cli/tools_config.py` — `_plugin_tts_providers()` injects
plugin rows into the Text-to-Speech picker category alongside the
10 hardcoded built-in rows.
## Tests
- `tests/agent/test_tts_registry.py` — 47 tests covering registration,
lookup, ABC contract, helpers, AND a `TestBuiltinSync` regression
test that fails if `agent.tts_registry._BUILTIN_NAMES` drifts from
`tools.tts_tool.BUILTIN_TTS_PROVIDERS` (kept duplicated due to
circular import constraints).
- `tests/tools/test_tts_plugin_dispatch.py` — 35 tests covering
built-in-always-wins, command-wins-over-plugin, plugin dispatch,
exception passthrough, voice_compatible helper.
- `tests/hermes_cli/test_tts_picker.py` — 10 tests covering the
picker surface, builtin shadowing defense, integration with
`_visible_providers`.
- `tests/hermes_cli/test_plugins_tts_registration.py` — 3 end-to-end
tests via `PluginManager.discover_and_load()`.
- `tests/plugins/tts/check_parity_vs_main.py` — 9-scenario subprocess
parity harness vs `origin/main`. The only intentional diff is
`fallback_edge → plugin` for the `plugin-installed` scenario.
## Verification
- 95/95 new tests pass.
- 170/170 pre-existing TTS tests (test_tts_command_providers,
test_tts_max_text_length, test_tts_speed, etc.) pass unchanged.
- Parity harness against `origin/main`: 8 OK + 1 expected DIFF.
- E2E smoke: a registered plugin's `synthesize()` is called via
`text_to_speech_tool` with the standard JSON envelope returned.
- Ruff clean on all touched files.
## Docs
- `website/docs/user-guide/features/tts.md` — new "Python plugin
providers" section with a decision table (command-provider vs
plugin), minimal plugin example, and the optional-hook reference.
- `website/docs/user-guide/features/plugins.md` — TTS row updated to
mention both surfaces (command-provider primary, plugin for
SDK/streaming).
Closes#30398
The web dashboard's Anthropic OAuth helper wrote the credential file
straight to its final destination and relied on the process umask for
permissions. That left the dashboard-specific path weaker than the
existing auth writers, which already use owner-only permissions and
safer write semantics.
This change keeps the scope narrow: make the dashboard helper write via
a temp file + replace, chmod the final file to owner-only, and add a
focused regression test for both permission handling and atomic-write
behavior.
Constraint: Must preserve the existing dashboard OAuth flow and credential-pool side effects
Rejected: Broader auth-storage refactor | unnecessary scope for a single verified inconsistency
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep dashboard credential writes aligned with existing auth storage semantics; do not reintroduce direct write_text() here without matching chmod/atomic behavior
Tested: pytest -o addopts='' tests/hermes_cli/test_web_server_oauth_write.py tests/hermes_cli/test_web_server.py -q (78 passed)
Not-tested: Cross-platform permission semantics on Windows-managed filesystems
Board defaults represent persistent project checkouts. Scratch workspaces
are auto-deleted on completion and must stay under the per-board scratch
root that resolve_workspace() creates. Inheriting default_workdir for a
scratch task pointed the cleanup path at the user's source tree — the
data-loss vector documented in #28818.
The containment guard in _cleanup_workspace (just added) is the safety
rail. This commit prevents the bad state from being created in the first
place: only persistent kinds (dir/worktree) inherit board defaults.
Tests updated to cover the new semantics: scratch with default_workdir
set keeps workspace_path=None; dir/worktree still inherits the board
default.
Salvaged from PR #31315 by @leeseoki0 — prevention layer on top of the
#28819 containment fix by @briandevans.
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
Copilot review on PR #28819 flagged that `_is_managed_scratch_path` accepted
the entire `<kanban_home>/kanban` subtree as managed scratch storage. With
that, a task whose `workspace_kind='scratch'` and `workspace_path` was
mis-set to `<kanban_home>/kanban`, `.../kanban/logs`, or a board's
metadata directory (e.g. `.../kanban/boards/<slug>` without the
`workspaces/` child) would pass the containment guard and let task
completion `shutil.rmtree` Hermes' own DB, metadata, and log subtrees.
Tighten the guard:
* Allowed roots are now exclusively `workspaces/` directories — the
`HERMES_KANBAN_WORKSPACES_ROOT` override, `<kanban_home>/kanban/workspaces`,
and each `<kanban_home>/kanban/boards/<slug>/workspaces` discovered on
disk.
* Require strict descendancy: a path equal to a root itself is rejected
too, because deleting a workspaces root would wipe every task's scratch
dir at once.
Add a regression test covering the three Copilot-named attack paths
(kanban root, kanban/logs, board root without `workspaces/`) plus the
workspaces-root-itself case, and confirm the inner task-id dir still
matches.
A board's ``default_workdir`` (e.g. ``hermes kanban boards
set-default-workdir my-board /path/to/real/source``) is copied into
``tasks.workspace_path`` for tasks created without an explicit
``workspace_kind``. Those tasks default to ``workspace_kind='scratch'``,
so completion calls ``_cleanup_workspace`` and unconditionally runs
``shutil.rmtree(wp, ignore_errors=True)`` — deleting the user's real
source tree as if it were disposable scratch storage.
Add ``_is_managed_scratch_path()`` and gate ``_cleanup_workspace`` on
it: only delete paths under ``HERMES_KANBAN_WORKSPACES_ROOT`` (the
worker-side override the dispatcher injects) or under the active kanban
home's ``kanban/`` subtree (covering both the legacy default-board root
and per-board ``kanban/boards/<slug>/workspaces`` roots). Anything else
gets a warning log and is left alone, so a misconfigured
``default_workdir`` can no longer destroy user data on task completion.
The hardcoded constants in _display_resumed_history were exposed as
config in PR #4434; declare them in DEFAULT_CONFIG and the CLI fallback
dict so they show up in 'hermes config' diagnostics and the schema
validator.
Fixes#31116 — two distinct bugs in fresh-install Matrix gateway:
1. Matrix E2EE setup installed only mautrix[encryption], leaving asyncpg
/ aiosqlite / Markdown / aiohttp-socks uninstalled. The first encrypted
connect failed with 'No module named asyncpg' deep inside
MatrixAdapter.connect(). Root cause: the setup wizard hand-rolled a
pip install of one package instead of using lazy_deps.ensure(
'platform.matrix'), and check_matrix_requirements() short-circuited the
runtime installer on 'import mautrix' alone — so the other 4 packages
were never pulled in.
2. Discord auto-enabled itself on every gateway start, even when the user
never selected Discord and had no DISCORD_BOT_TOKEN. Root cause:
gateway/config.py plugin-enablement loop gated enablement on
entry.check_fn() (just 'is the SDK importable?') and ignored
entry.is_connected (the 'did the user configure credentials?' probe).
Same bug class as commit 7849a3d73 fixed for _platform_status in the
setup wizard; this is the runtime counterpart. Affects Discord, Teams,
and Google Chat.
Changes:
- hermes_cli/setup.py::_setup_matrix — install via
lazy_deps.ensure('platform.matrix') to pull the full feature group.
- gateway/platforms/matrix.py::_check_e2ee_deps — verify asyncpg +
aiosqlite + PgCryptoStore in addition to OlmMachine, so E2EE failures
surface at startup instead of at first encrypted-room connect.
- gateway/platforms/matrix.py::check_matrix_requirements — use
feature_missing('platform.matrix') as the install gate instead of a
single 'import mautrix' check, so partial installs trigger the lazy
installer correctly.
- gateway/config.py plugin-enablement loop — consult entry.is_connected
before flipping enabled=True. Explicit YAML enabled=true still wins.
Tests: 3 new in tests/gateway/test_matrix.py (asyncpg-required,
aiosqlite-required, partial-install lazy-runs), 5 new in
tests/gateway/test_platform_registry.py (is_connected=False blocks,
is_connected=True enables, is_connected=None falls back to check_fn,
raising probe doesn't enable, explicit YAML wins).
Validation: 310 tests across affected test modules pass.