- Use os.pathsep instead of literal ':' so Windows paths (C:\dir) and
the Windows separator ';' work correctly.
- Add 9 tests covering multi-root behavior: writes inside first/second
root, writes outside all roots, trailing/leading/double separators,
all-separators edge case, static deny priority, duplicate dedup.
- Update hermes_cli/tips.py tip string to mention multiple paths.
- Update docs to mention os.pathsep / ; on Windows.
Follow-up for salvaged PR #49557.
Discord does not render GFM pipe tables — raw pipe characters display
as garbage text. format_message now rewrites tables into bold-heading +
bullet groups using the shared helpers.
Fixes#21168
Co-authored-by: Yashiel Sookdeo <yashiel@skyner.co.za>
Move table-detection regex, row-splitting, and table-to-bullet
conversion into gateway/platforms/helpers.py so both Discord and
Telegram adapters can share them.
Co-authored-by: Yashiel Sookdeo <yashiel@skyner.co.za>
A MoA preset whose reference or aggregator slot points at the moa virtual
provider creates a recursive MoA tree. The runtime guards in moa_loop.py only
surface this mid-turn (references silently skipped, aggregator raises). Reject
it at the config chokepoint (_clean_slot) so it can never be saved, and hide it
from the desktop/dashboard slot pickers so it isn't offered as a dead choice.
preexec_fn=os.setsid runs Python code in the forked child before exec,
which is unsafe in multi-threaded processes (CPython docs). When the
Desktop gateway loads native libraries (onnxruntime, BLAS, provider SDKs)
with active thread pools, the fork can SIGSEGV before the child execs.
Replace all preexec_fn usage with start_new_session=True, which provides
the same setsid/process-group semantics without running Python in the
fork. This is already the pattern used throughout hermes_cli/gateway.py
and hermes_cli/_subprocess_compat.py.
Fixes#46789
_normalize_preset uses bare float() and int() to coerce
reference_temperature, aggregator_temperature, and max_tokens from
config.yaml. When a user hand-edits a non-numeric value (e.g.
max_tokens: "8k" or reference_temperature: "hot"), the coercion raises
ValueError. Since normalize_moa_config runs on every model-selection
and MoA turn (via resolve_moa_preset), the crash is unrecoverable and
blocks all MoA usage until the config is manually fixed.
Replace the bare casts with _coerce_float / _coerce_int helpers that
fall back to the default on TypeError/ValueError instead of raising.
The autonomous self-improvement review fork could still write to a pinned
skill — only external/bundled/hub-installed/protected-builtin skills were
guarded. The curator skips pinned skills from every auto-transition; the
review fork is the same kind of no-user-present actor and must too.
Adds a pin check to _background_review_write_guard so background-origin
edit/patch/delete/write_file/remove_file on a pinned skill are refused.
Stricter than the foreground _pinned_guard (delete-only) by design: with
no user in the loop there is no one to consent to an edit.
Fixes#25839
The dangerous-command approval layer already blocks `hermes gateway
(stop|restart)`, `pkill/killall hermes|gateway`, and `kill ... $(pgrep ...)`.
A reporter noted on #33071 that the agent can still achieve the same
effect by driving launchd directly against the gateway's service label
(`launchctl stop ai.hermes.gateway`, `launchctl kickstart -k
system/ai.hermes.gateway`, etc.) or by substituting `pidof` for `pgrep`
in the kill-expansion form.
This widens the "Gateway lifecycle protection" block in
`tools/approval.py` to cover both vectors:
- `launchctl (stop|kickstart|bootout|unload|kill|disable|remove)`
scoped to commands that target a Hermes label (`hermes`,
`ai.hermes`). Read-only inspection (`launchctl print …`,
`launchctl list`) and operations against unrelated labels remain
unflagged.
- `kill ... $(pidof …)` and the backtick form, alongside the existing
`pgrep` expansion. `pidof` is the BSD/Linux equivalent and is
equally opaque to the `(pkill|killall) … hermes` name pattern.
Intentionally left out of scope: plain `kill -TERM <numeric_pid>` with
a PID looked up out-of-band. Catching that would require runtime PID
state and would break the existing
`TestPgrepKillExpansion::test_safe_kill_pid_not_flagged` contract,
which guarantees that a plain literal-PID `kill 12345` stays safe.
Natural-language skill search returned a short, arbitrary list and never
surfaced NVIDIA (or OpenAI/Anthropic/HuggingFace) skills. Two causes:
1. The runtime index collapses every GitHub tap into source="github", so
there was no way to find or filter by provider at the CLI — the per-tap
identity only existed in the docs-site catalog.
2. HermesIndexSource.search matched only name/description/tags (not the
identifier or provider) and broke at the first `limit` hits in raw index
order, burying the most relevant skills. `search` also defaulted to
--limit 10 against an 86k-entry catalog.
Changes:
- GitHubSource stamps a per-tap provider label (extra.provider) on each
skill via github_provider_for(); source stays "github" so dedup/floor/
index-skip logic is untouched. Flows into the built index.
- HermesIndexSource.search now matches identifier + provider too, and
collect-then-ranks (exact > prefix > whole-word > substring) instead of
break-at-limit.
- --source nvidia|openai|anthropic|huggingface|voltagent|gstack|minimax
provider filters for browse/search (narrows merged results by provider).
- search --limit default 10 -> 25; table Source column shows the provider
label for github skills.
Tested: 181 unit tests pass; E2E against the live runtime index confirms
'nvidia'/'cuda' searches now surface NVIDIA-provider skills and
--source nvidia narrows to exactly the NVIDIA catalog.
The post-update gateway restart path relaunched the gateway with the
venv's console `python.exe` (via `get_python_path()` in
`_gateway_run_args_for_profile`). On Windows this leaves a terminal
window open permanently: uv's `venv\Scripts\python.exe` is a launcher
shim that re-execs the *base* console interpreter, which allocates its
own conhost — and `CREATE_NO_WINDOW` cannot suppress that second window.
The clean-start path (`_spawn_detached`) already dodges this by routing
through `_resolve_detached_python` to use the windowless base
`pythonw.exe`; the restart watcher did not.
Symptom (reported on Windows 11): after an in-app GUI update, a console
window for the gateway stays open and never closes. Confirmed on the
reporter's box — the running gateway was `python.exe ... gateway run
--replace` with a live conhost child and the foreground "Press Ctrl+C to
stop" banner, born exactly at the update's "Restarting Windows gateway"
log line.
Fix:
- Add `gateway_windows.windowless_gateway_restart_spec(run_argv)` which
rewrites a console-python gateway argv into the windowless `pythonw.exe`
equivalent and returns the cwd + env overlay (VIRTUAL_ENV / PYTHONPATH /
HERMES_HOME) the base interpreter needs to import `hermes_cli` without
the venv launcher's site config. No-op on POSIX.
- `_spawn_gateway_restart_watcher` now applies that rewrite on Windows and
threads cwd= / env= into the inlined respawn Popen. Covers both restart
entry points (`launch_detached_profile_gateway_restart` and
`launch_detached_gateway_restart_by_cmdline`). CREATE_NO_WINDOW |
DETACHED_PROCESS | CREATE_BREAKAWAY_FROM_JOB and the breakaway-denied
fallback are all preserved.
Verified E2E on a real Windows 11 box: drove the actual watcher against a
dummy old-pid; the respawned gateway came up as `pythonw.exe` (zero
console python, no conhost child) and booted fully (housekeeping + kanban
dispatcher started → imports resolved under the base interpreter).
Tests: TestWindowlessGatewayRestartSpec (behavior) +
TestGatewayDetachedWatcherWindowsFlags regression assert. Pre-existing
Linux-only failures on a Windows host (SIGKILL, systemd, docker-root)
confirmed identical on the bare base.
On macOS, terminal(background=true) silently failed: the process returned a
session_id and exit_code=0 but the command never ran (empty stdout, no side
effects). Root cause is two interacting issues:
1. _find_shell was aliased to _find_bash, which prefers `shutil.which("bash")`
→ /bin/bash (GNU bash 3.2, still shipped on macOS) over $SHELL (/bin/zsh).
2. process_registry.spawn_local runs [shell, "-lic", "set +m; <cmd>"] with
stdin=/dev/null. bash 3.2 as a login shell sources ~/.bash_profile, which on
many macOS setups contains `exec /bin/zsh -l`; that exec replaces bash but
drops the -c argument, so the command is swallowed (exit 0, no output).
Decouple _find_shell from _find_bash: _find_shell now prefers the user's
configured $SHELL on POSIX (the shell they actually log in with), falling back
to _find_bash when $SHELL is unset/missing. _find_bash is unchanged, so callers
that genuinely need bash (e.g. the _run_bash login-shell snapshot) keep bash
semantics. zsh handles -lic correctly even with redirected stdin.
Salvaged from #42219 by @liuhao1024 (authorship preserved via cherry-pick).
On top of the original (8 unit tests covering $SHELL-set/unset/missing/empty,
Windows-ignores-$SHELL, _find_bash-unchanged), added an E2E regression test
that reproduces the real bash-3.2 login-shell swallow (exit 0 / no file) and
asserts the shell _find_shell selects actually executes a -lic background
command. Mutation-verified: reverting _find_shell to the bash alias fails the
$SHELL-preference test. Bug reproduced directly: /bin/bash 3.2 -lic with a
.bash_profile->exec-zsh creates no file; zsh -lic does.
Closes#42203. Supersedes #42290.
When a Hermes-managed node/npm/npx shim exists but fails --version, redownload
the pinned nodejs.org bundle under HERMES_HOME/node and retry. Do not fall
back to system npm on PATH when a managed tree is present.
POSIX heal probes node, npm, and npx (npm can break while node still runs).
projects.db (per-profile project store) and kanban.db were missing from
_QUICK_STATE_FILES, so the pre-update quick snapshot never backed them up.
On a desktop upgrade, when the update flow removes/replaces the file and the
post-update schema-init re-creates an empty one, all user-created projects,
folder mappings, the active-project pointer, kanban board bindings, and tasks
vanish silently — no error.
Add the per-profile user-created stores to the snapshot set:
- projects.db — project store
- response_store.db — gateway conversation history / tool payloads (WAL)
- memory_store.db — holographic memory facts/entities (WAL)
- verification_evidence.db — agent verification audit trail
- kanban.db — default board (back-compat <root>/kanban.db)
- kanban/boards — non-default boards (<root>/kanban/boards/<slug>/kanban.db
+ metadata); workspaces/ and attachments/ subtrees
are skipped as large + regenerable.
Also: the directory-branch of create_quick_snapshot now routes *.db through the
WAL-safe _safe_copy_db (SQLite backup() API), matching the top-level file path —
previously a non-default board DB with an open WAL could be copied inconsistently.
Salvaged from #52930 by @0xDevNinja (authorship preserved via cherry-pick).
On top of the original (which covered only projects.db + the default kanban.db),
this adds: non-default-board coverage, the three sibling per-profile DBs that
meet the same upgrade-wipe criteria, WAL-safe directory copies, and a
workspaces/attachments skip to avoid snapshot bloat (×20 retained). 8 tests,
all mutation-verified; E2E verified snapshot→wipe→restore preserves all six
store types on the real code path.
Closes#52889. Supersedes #52930.
The external-drain marker .drain_request.json is written under HERMES_HOME,
which on Hermes Cloud is a persistent Fly volume (/opt/data). A begin-drain
marker therefore SURVIVES the post-update machine restart. But the disruptive
lifecycle actions a drain protects (auto-update / image migrate / env edit /
profile change) all restart the machine — which is exactly the signal the drain
is over. The freshly-restarted gateway re-read the orphaned marker on its
startup reconcile and parked itself back in 'draining', refusing every new turn
indefinitely (NS-570: ~52 min until manually cleared).
Fix: stamp the marker with an identity of THIS container/VM instantiation
(kernel boot_id + PID 1 start time, read from /proc) and treat a marker whose
epoch differs from the current instantiation as absent. A deliberate restart →
new PID 1 → new epoch → stale marker ignored → gateway boots 'running'. A marker
written during the current instantiation (the live drain) still matches; an s6
respawn of just the gateway (PID 1/init unchanged) keeps the same epoch, so an
in-flight drain is still honoured (D4a reversibility preserved).
The staleness check is lenient and never fail-closed: a legacy marker with no
epoch, a corrupt/contentless marker, or an environment with no /proc (epoch
unavailable) all degrade to the original presence-only behaviour. NAS is
untouched — it only ever POSTs begin/cancel-drain over HTTP; the marker file is
purely gateway-internal IPC.
The fix is entirely within gateway/drain_control.py; the watcher and the
dashboard endpoint go through the same drain_requested()/write_drain_request()
chokepoints and need no functional change.
## Description
On macOS 26.x, `launchctl bootstrap` and `launchctl kickstart` return exit code 5 ("Input/output error"), which Hermes already anticipates and handles by spawning a detached fallback process. However, the gateway status reporting is ambiguous:
- `gateway status` says "Gateway service is loaded" (because `launchctl list` returns exit 0)
- But `launchctl print` shows `state = not running` — launchd isn't actually supervising anything
- The detached fallback PID running is invisible to the status command
- Users can't tell whether auto-start at login and auto-restart on crash are available
### Root Cause
Two problems in `hermes_cli/gateway.py`:
1. **`_probe_launchd_service_running()`** (line 1067): Determined launchd service liveness solely by `launchctl list <label>` exit code. On macOS 26, this returns 0 even when the service is only *registered* but not running (output lacks a `"PID"` field). This caused `GatewayRuntimeSnapshot.service_running = True` incorrectly, which suppressed the process/service mismatch warning.
2. **`launchd_status()`** (line 3569): Used the same binary "loaded/not loaded" check without inspecting whether launchd actually has a PID, whether a detached fallback is running, or whether auto-start/restart are available.
### Changes
**`hermes_cli/gateway.py`:**
1. **New `_parse_launchd_pid_from_list_output()` helper** — Extracts the PID from `launchctl list` output. When launchd is actively supervising, the output includes `"PID" = <number>;`. When only registered but not running, no PID field is present.
2. **Fixed `_probe_launchd_service_running()`** — Now requires a PID in the `launchctl list` output to confirm launchd is actually supervising. This correctly sets `service_running = False` when launchd has the service registered but `state = not running`, which triggers the existing process/service mismatch detection.
3. **Reworked `launchd_status()`** — Reports clearly separated information:
- LaunchAgent plist currentness (stale or current)
- Whether launchd is actively supervising (with PID)
- Whether a detached fallback PID is running
- Whether auto-start at login and auto-restart on crash are available
- When launchd supervision is known to be unavailable, explains why
4. **Persistent unsupported marker** (`~/.hermes/.gateway-launchd-unsupported`) — Written when `_launchd_fallback_to_detached()` is called (launchd exit 5/125). Allows `launchd_status()` to explain *why* launchd can't supervise even when no fallback process is currently running. Cleared automatically when a future bootstrap/kickstart succeeds (e.g., after an OS update fixes the issue).
5. **Updated `_print_gateway_process_mismatch()`** — Distinguishes the managed detached fallback from a genuinely manual `nohup hermes gateway run`, providing accurate guidance for each case.
### Status Output Examples
**Before** (macOS 26, fallback active):
```
Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist
✓ Service definition matches the current Hermes install
✓ Gateway service is loaded
{
"Label" = "ai.hermes.gateway";
"OnDemand" = true;
...
};
```
**After** (macOS 26, fallback active):
```
Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist
✓ Service definition matches the current Hermes install
⚠ Gateway service is registered but launchd is not supervising it
launchd cannot manage the gateway on this macOS version.
✓ Detached fallback process is running (PID 12345)
Cron jobs will fire. Stop with: hermes gateway stop
⚠ Auto-start at login and auto-restart on crash are NOT available.
```
**After** (normal launchd supervision):
```
Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist
✓ Service definition matches the current Hermes install
✓ Gateway is supervised by launchd (PID 12345)
Auto-start at login and auto-restart on crash are available.
```
### Tests
Updated 5 existing tests and added 11 new tests in `tests/hermes_cli/test_gateway_service.py`:
- PID parsing from `launchctl list` output (with PID, without PID, empty, unquoted PID)
- `_probe_launchd_service_running()` requires PID presence
- Unsupport marker lifecycle (write, clear, persist across fallback)
- Marker cleared on successful bootstrap
- `launchd_status()` reporting: supervised, fallback-running, fallback-unavailable
- Existing fallback tests now verify marker creation
### Related Issues
- Issue #23387 (original macOS 26 launchd workaround)
- Issue #42524 (this issue)
A config migration (or hand-edit) that leaves an invalid toolset name in
`platform_toolsets` — e.g. the #38798 corruption that rewrote `hermes-cli` to
the non-existent `hermes` — silently disabled all affected tools:
resolve_toolset() returns [] for an unknown name, so the agent quietly lost its
tools with no error, warning, or log entry and degraded to text-only replies.
Surface it loudly at two points:
- After migration (migrate_config): validate platform_toolsets and record/print
a warning per unknown name, with a `hermes-<platform>` suggestion when that
would have been valid (the exact #38798 shape).
- At runtime (_get_platform_tools): if a platform was explicitly configured but
every toolset name is invalid, log a warning when tools are resolved for a
session — so an ALREADY-corrupted config is caught at startup, not only on the
next `hermes update`.
Logic lives in a new pure, side-effect-free helper (toolset_validation.py) with
validate_toolset injected, so it is unit-testable without the tool registry.
Note: the original v25→v26 migration that caused the corruption no longer
exists (config format is now v30; no migration step rewrites toolset names).
This change is the durable defense against the silent-failure mode regardless
of cause, matching the issue's "Expected: log a warning".
Salvaged from #39207 by @lEWFkRAD (authorship preserved via cherry-pick).
Tests: 9 helper cases (incl. the #38798 corruption shape, mixed valid/invalid,
zero-tools state, non-dict/scalar/non-string) + a runtime caplog test — both the
helper warning and the runtime guard mutation-verified to fail without the fix.
Closes#38798. Supersedes #39581 (prevent-in-v25→v26 — that path is gone),
#41006 / #40208 (repair-migration for already-corrupted configs).
The cron runtime tripwire (_scan_cron_prompt) used a 10-char invisible-unicode
set while the install-time scanner (threat_patterns.INVISIBLE_CHARS) flags 17.
The cron-local set was missing U+2062-U+2064 (invisible math operators) and
U+2066-U+2069 (directional isolates), so a directive obfuscated with one of
those codepoints (e.g. "ig<U+2063>nore all previous instructions") slipped past
the runtime cron gate while being caught at install time.
Import the canonical set so the cron tripwire and install scanner can't drift
apart again. Emoji-ZWJ protection (_zwj_has_emoji_neighbour) is unchanged.
Fixes#35075
Co-authored-by: rlaope <piyrw9754@gmail.com>
Replies on WhatsApp Cloud arrived at the agent with reply_to_id set but
reply_to_text=None, so run.py never injected the "[Replying to: ...]"
disambiguation prefix (it gates on reply_to_text). Meta's webhook context
object carries only the quoted message's id, never its text.
Index (chat_id, wamid) -> text in rich_sent_store on every inbound message
and every outbound text send -- the same store that solved the identical
Telegram rich-send problem -- then look up the quoted text in
_build_message_event_from_cloud and populate reply_to_text plus
reply_to_is_own_message, derived from context.from versus the business
number.
Tasks 2.1 + 2.2 + 2.3 of the safe-shutdown plan — the reversible
quiesce-without-restart machinery NAS drives during a lifecycle action (D4a).
These ship together because the endpoint, the control channel, and the gateway
state machine are one coherent slice.
2.2 — control channel (gateway/drain_control.py, new):
The dashboard has no HTTP path into a running gateway (guardrails: "there is NO
external control channel into a running gateway"); restart/drain is driven only
by markers the gateway reacts to. So begin/cancel-drain writes/removes a
presence-based marker .drain_request.json (HERMES_HOME-scoped, atomic write,
never-raises read; a corrupt marker reads as present-contentless → fail-safe
toward quiescing). This is Q-B option A.
2.2 — gateway state machine (gateway/run.py):
- _external_drain_active flag, DISTINCT from the shutdown _draining flag: this
one does NOT exit the process and is fully reversible.
- _enter_external_drain / _exit_external_drain: idempotent transitions that
flip gateway_state→draining / →running via _update_runtime_status (preserving
the live active_agents count). exit refuses to revert to running during a
real shutdown or after the loop stops (shutdown wins).
- _drain_control_watcher: 1s background task (modelled on _handoff_watcher)
reconciling accept-state with the marker; honours a marker that survived a
restart on its first tick. Registered alongside the other watchers in start.
- New-turn accept gate in _handle_message, placed BEFORE the session-slot
claim: when draining, refuse to START a new turn (so active_agents can only
fall → no TOCTOU race), while in-flight turns finish untouched. Internal/
system events (restart-recovery replays, bg-process completions) bypass it.
2.1 — endpoint (hermes_cli/web_server.py):
POST /api/gateway/drain {action: drain|cancel}. Authenticated by the Task-2.0a
token seam (the drain plugin registered this exact path as a token route);
attributes the request to the verified token principal. Begin writes the
marker, cancel removes it — the gateway process owns the actual transition.
Force-override (D6) is NOT here; it maps onto the existing immediate
/api/gateway/restart force path.
Tests (mocked — necessary-not-sufficient; the HARD live gate Q-B is next):
- tests/gateway/test_external_drain_control.py — marker contract (write/clear/
read/corrupt/atomic), state machine (enter/exit/idempotency/shutdown-wins/
loop-stopped), watcher reconcile-enter-then-exit, new-turn refusal, and
in-flight-not-interrupted. 15 tests.
- tests/hermes_cli/test_web_server.py — /api/gateway/drain begin/default-begin/
cancel/cancel-idempotent/bad-action-400. 6 tests.
- dashboard.drain_auth config section already added in 2.0b commit.
All touched suites green: 301 (gateway+auth) + 9 (web_server endpoints) passed.
Intentionally deferred:
- HARD live-validation gate (Q-B): real isolated `hermes gateway run`, drive a
real begin-drain marker, prove the 5-point checklist a–e.
- Spec-doc status flip + Phase-2 PR.
Build status: external-drain, restart-drain, status, dashboard-auth, drain-plugin,
token-auth, and web_server-endpoint suites green.
Task 2.0b: the concrete shared-bearer-secret auth provider, the FIRST consumer
of the generic token-auth capability (Task 2.0a). Implements decisions.md Q-A.
plugins/dashboard_auth/drain/ (bundled, discovered like dashboard_auth/basic):
- DrainSecretProvider: non-interactive provider, supports_token=True. Verifies
an inbound Authorization bearer token against a per-agent shared secret with
hmac.compare_digest (constant-time, no timing oracle) and, on a match,
vouches for the caller as the "drain-control" principal scoped to "drain".
The five interactive ABC methods raise NotImplementedError; verify_session
returns None (stacks harmlessly in the cookie-verify loop).
- assess_secret_strength(): fail-closed entropy gate. Rejects secrets shorter
than 43 url-safe-b64 chars (~256 bits), with < 16 distinct characters, or
below 128 bits Shannon entropy — so a weak/structured/repeated secret can
never be silently accepted. Enforced both at register() (friendly skip
reason) and in __init__ (raises — defence in depth).
- register(ctx): no-op + skip reason when HERMES_DASHBOARD_DRAIN_SECRET is
unset; rejects a weak secret fail-closed (drain endpoint stays gated). On a
strong secret, registers the provider AND opts /api/gateway/drain into the
generic token-auth seam via register_token_route().
Config: the secret is a CREDENTIAL → carried via HERMES_DASHBOARD_DRAIN_SECRET
(per-agent, provisioned by NAS at deploy). Behavioural knobs only
(dashboard.drain_auth.{scope,min_secret_chars}) live in config.yaml — added to
DEFAULT_CONFIG with the .env-is-for-secrets rationale documented inline.
Tests: tests/plugins/dashboard_auth/test_drain_provider.py — entropy gate
(strong pass; empty/short/repeated/few-distinct/custom-min reject), verify_token
(match → scoped principal, wrong/empty → None, custom scope), protocol
compliance, interactive-methods-raise, and register() (skip-no-secret,
fail-closed-weak-secret, strong-env-secret registers + route opt-in, config
scope + min_secret_chars). 21 new tests; drain + token-auth suites 44 passed.
Verified the plugin is discovered as dashboard_auth/drain alongside basic/nous.
Intentionally deferred:
- The begin/cancel-drain endpoint handler itself — Task 2.1.
- The dashboard→gateway control channel — Task 2.2.
Build status: dashboard-auth + drain-plugin suites green.
Task 2.0a of the safe-shutdown drain-coordination plan. Widens the dashboard
auth framework GENERICALLY to support non-interactive (service-to-service)
bearer-token auth, mirroring the existing supports_password precedent. This is
a reusable capability — any future machine-credential provider plugs in without
core changes (decisions.md Q-C). The drain bearer-secret plugin (Task 2.0b) is
the first consumer, not the definition.
- base.py: add TokenPrincipal dataclass (the token analog of Session) +
supports_token capability flag + verify_token() on the ABC (default raises
NotImplementedError so a misconfigured provider fails loud). Contract mirrors
verify_session stacking: return None for unrecognised tokens (never raise),
raise ProviderError only on a genuine backing-store outage.
- registry.py: list_token_providers() — the supports_token subset, in
registration order. Empty when none registered (token routes fail closed).
- token_auth.py (new): route-agnostic seam. Routes opt in via
register_token_route(exact path); token_auth_middleware owns the auth
decision for those routes only — authenticate via stacked providers, attach
request.state.token_principal + token_authenticated, pass through. 401 on
missing/unrecognised token, 503 when a provider was unreachable, untouched
passthrough for non-token routes. Fails closed (never open).
- web_server.py: install the seam OUTERMOST (registered last → runs first).
Both downstream gates (legacy auth_middleware + gated_auth_middleware) honour
request.state.token_authenticated and skip enforcement, so a token-authed
service request is never bounced to /login.
- audit.py: TOKEN_AUTH_SUCCESS / TOKEN_AUTH_FAILURE events.
Tests: tests/hermes_cli/test_dashboard_token_auth.py — ABC flag default,
verify_token NotImplementedError, registry filter, bearer extraction
(case-insensitive scheme, malformed/non-bearer → ""), provider stacking
(first-match-wins, unreachable-remembered, unreachable-then-valid, buggy
provider doesn't crash the gate), and the seam's passthrough/401/503/
fail-closed behaviour. 29 new tests; full dashboard-auth suite 169 passed.
Intentionally deferred:
- The concrete shared-bearer-secret provider plugin — Task 2.0b.
- The begin/cancel-drain endpoint that registers itself as a token route —
Task 2.1.
Build status: dashboard-auth + plugin-hook suites green.
The known_c2_framework threat pattern included 'praxis' in its
alternation alongside genuine offensive-security tool brands (Cobalt
Strike, Sliver, Havoc, Mythic, Metasploit, Brainworm). Unlike those
distinctive brand names, 'praxis' is a common English word (Greek for
practice/action) and a legitimate agent name, so any context file that
mentioned an agent named Praxis matched at 'context' scope and the whole
AGENTS.md / SOUL.md was replaced with a [BLOCKED] placeholder before it
reached the system prompt.
Remove 'praxis' from the alternation and add a guard comment: every
token in this list must be a distinctive tool brand, not a common word.
Real C2 brands still fire.
The new _maybe_flag_poisoned_client tests built a provider via
get_or_build_provider without an interactive stdin. Under the hermetic
test env (no TTY, no cached tokens), the non-interactive guard in
mcp_oauth_manager._make_provider raised OAuthNonInteractiveError before
the provider was built, failing 6 tests in CI parity (they passed
locally where stdin was a TTY).
Thread monkeypatch into _provider_with_token_endpoint and present an
interactive stdin, matching the sibling test_manager_builds_hermes_provider_subclass.
Fixes#36767.
Two complementary recoveries for the recurring "delete three cache files and
re-auth by hand" ritual when an MCP server's dynamically-registered OAuth
client goes dead server-side (IdP redeploy / DB wipe / rebrand):
- Auto-heal (token-endpoint subset): HermesMCPOAuthProvider now sniffs
auth-flow responses and, on a 400/401 `invalid_client` from the discovered
token endpoint, backs up + deletes `<server>.client.json` and `.meta.json`
and clears the in-memory client so the SDK re-runs RFC 7591 dynamic client
registration on the next flow. Conservative by construction: only
dynamically-registered (non config-supplied) clients, only the token
endpoint, only on a word-boundary `invalid_client` match (so RFC 7591's
`invalid_client_metadata` does not trip it); best-effort so a miss never
breaks the live flow. Covers both code-exchange and refresh when the token
endpoint was discovered. Tokens are preserved.
- `hermes mcp reauth [<name>|--all]`: the reporter's primary symptom — the
IdP's in-browser "Redirect URI Mismatch" — produces no HTTP signal (the SDK
only sees a callback timeout), so it cannot be auto-detected. The new
command re-auths one or ALL `auth: oauth` servers, serially: one browser
flow at a time, which also fixes the startup popup storm when several
servers are stale at once. Single-server reauth is factored out of
`mcp login` and shared.
Tests: +14 (poison helper x2; token-endpoint detection x5 incl. wrong-endpoint,
success-response, pre-registered, and invalid_client_metadata negative guards;
a bridge integration test driving the real async_auth_flow generator to prove
the detection hook preserves the bidirectional asend() forwarding contract;
reauth CLI x6). Verified against the pinned mcp==1.26.0: scripts/run_tests.sh
122/122 green for the touched suites; check-windows-footguns.py and ruff clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cut over the agent half of Shape A (D-Q1.5a/b.1/c) to front a SET of platforms on
one relay WS:
- relay_platform_identities() parses GATEWAY_RELAY_PLATFORMS (list) +
GATEWAY_RELAY_BOT_IDS (JSON keyed map {platform:{botId,username?}}). Cut over
from the scalar GATEWAY_RELAY_PLATFORM/_BOT_ID (no fallback, D-Q1.5c).
- self_provision_relay() loops one /relay/provision per platform under one
gatewayId+secret, partial-failure-tolerant.
- WebSocketRelayTransport takes the identity SET, sends one hello per identity
(connector accumulates the advertised set), and stamps the per-frame
OutboundFrame.platform + its matching advertised botId on outbound.
- RelayAdapter remembers each chat's underlying source.platform (mirroring the
existing guild/dm scope capture) and tags the reply's egress platform.
- send_relay_policy() declares one relevance policy per fronted platform (the
connector keys policy by (tenant,platform,instanceId)).
Single-platform deploys are byte-identical on the wire (1-element list, no per-frame
tag -> connector session-default fallback). typecheck/ruff clean; relay unit 221 pass
(+10 new); all 15 cross-repo E2E drivers green vs connector origin/main.
The gateway reconnect watcher (gateway/run.py) recovers a platform after a
fatal adapter error by building a fresh adapter and calling
connect(is_reconnect=True). Every BasePlatformAdapter implements
connect(*, is_reconnect: bool = False) for this — except RelayAdapter, whose
connect() was bare. So the watcher's recovery path raised:
TypeError: connect() got an unexpected keyword argument 'is_reconnect'
Observed live on a hosted staging agent: after a fatal relay adapter error the
watcher could never re-establish relay, so the shared-bot inbound never reached
the gateway and Discord DMs stopped (dashboard surfaced the TypeError).
Relay deliberately ignores the flag: the #46621 server-side-queue-preservation
concern doesn't apply, because relay's outage buffer is the connector's durable
buffer (replayed on the transport's re-handshake), not a gateway-side queue the
adapter owns. Routine WS drops are already handled by the transport's own
reconnect supervisor (WebSocketRelayTransport, reconnect=True); the watcher path
is fatal-error recovery, and the fatal handler disconnect()s the old adapter
(cancelling its supervisor) before a fresh adapter+transport is built, so there
is no double-dial.
Adds two regression tests (both proven red without the fix): connect(is_reconnect=True)
reaches the same transport-less RuntimeError instead of TypeError, and the
signature matches BasePlatformAdapter.connect.
When operator config has provider=anthropic with model.base_url pointing
at a non-Anthropic host (e.g. https://openrouter.ai/api/v1 with provider=anthropic),
the auxiliary Anthropic path was unconditionally applying that override.
Main-session traffic routed correctly because the main path attaches the
right credential for the actual destination, but every side-channel call
(memory extractors, reflection, vision, title generation, janus
extractor/promise) sent ANTHROPIC_API_KEY to the foreign host and 401'd.
Gate the override on hostname == api.anthropic.com. Operators routing main
through a non-Anthropic provider must use that provider's own auxiliary
client; the Anthropic aux path now stays pointed at api.anthropic.com.
Regression tests cover openrouter, openai, anthropic-with-path, empty, and
anthropic-default-base_url cases.
Ensure TUI/desktop stop targets the actual conversation thread and cancels any queued next prompt, including the lazy agent-start window, so a stopped session cannot keep running or restart itself.
Wire get_reasoning_stale_timeout_floor() into both stale detectors so known
reasoning models (Nemotron 3 Ultra, OpenAI o1/o3, Opus 4.x thinking, DeepSeek
R1, Qwen QwQ, Grok reasoning) tolerate multi-minute thinking phases instead of
the upstream gateway idle-killing the socket (BrokenPipeError) before first
token. Applied as max(default, floor) — never overrides explicit user config,
never lowers an existing threshold.
The reasoning_timeouts.py allowlist module already landed on main via #52795,
so this salvage carries only the wiring + tests (the duplicate module and the
stale-base MoA reverts from the original PR branch are dropped).
Salvaged from #52238. Fixes#52217.
The salvaged #51875 added a background-review write guard in skill_manage
that refuses mutations to skills.external_dirs skills — but it only fires
when is_background_review() is true. The curator's LLM review fork ran with
the default _memory_write_origin='assistant_tool', so the guard never
triggered during the exact curation pass it exists to protect against
(GH-47688).
- Set _memory_write_origin='background_review' on the curator review fork so
turn_context binds it onto the write-origin ContextVar and the guard fires.
- Add a regression test asserting the fork runs under the background_review
origin (the invariant linking the fork to the guard).
- AUTHOR_MAP: map yu-xin-c for the salvaged commit.
Force redact_sensitive_text(force=True) on the browser_type text arg so
recognized credentials (API keys, tokens, JWTs) are masked in tool
progress, previews, callbacks, and return payloads even when the global
security.redact_secrets opt-out is set — a typed credential reaching chat
history is a security boundary, not log hygiene. Normal typed text matches
no pattern and stays fully readable for debuggability.
Tests assert the API-key-shaped secret is masked across every surface and
that normal text passes through unchanged.
Stopping a turn while the model is streaming (stop/esc to redirect) raised
InterruptedError, set final_response to the throwaway "waiting for model
response" sentinel, and persisted messages WITHOUT the assistant text that
was already streamed to the screen. The next turn then had no record of the
half-finished reply, so the model appeared to "forget" what it just said.
Recover the on-screen text from _current_streamed_assistant_text in the
InterruptedError branch and append it as the assistant turn (and surface it
as final_response). The metadata sentinel is kept only when nothing was
streamed yet, preserving the ACP/client suppression behavior.
Completes the partial-stream recovery from 397eae5d9 (which wired the same
_current_streamed_assistant_text salvage into the connection-failure twin
but missed the user-interrupt path). The lossy handler dates to c98ee9852.