_normalize_custom_provider_entry() ran urlparse() on base_url and dropped
any entry whose value was an un-expanded placeholder, so a caller reaching
the normalizer with raw config (e.g. the Dockerized gateway path) silently
skipped the provider with a 'not a valid URL' warning. Skip URL validation
when the candidate contains a placeholder token — both ${ENV_VAR} env-refs
and bare {region}-style templates — since those are expanded at runtime.
Closes#14457
The dashboard Config tab's Model field is a flat string with no provider
info. _denormalize_config_from_web only updated model.default and kept the
stale provider, so picking an OpenRouter model while the default provider was
ollama-local left provider=ollama-local and every call 404'd.
When the model string actually changes, infer the serving provider — curated
catalog first, then a vendor/model-slug heuristic for non-aggregator providers
— and route the switch through the existing _normalize_main_model_assignment /
_apply_main_model_assignment chokepoints so stale base_url/api_mode/api_key are
cleared on a provider change and preserved on a same-provider re-pick. Saving
an unchanged model never re-detects, so unrelated config saves keep an explicit
provider.
Closes#14058
The dashboard form is built from CONFIG_SCHEMA, which doesn't enumerate
every root-level key the YAML supports. Most visibly, `custom_providers`
is in `_KNOWN_ROOT_KEYS` but is absent from the schema — so the frontend
never sends it in the PUT body. The previous full-replace save() then
silently wiped the key from disk every time the user clicked anything
that triggered a save. Other casualties (less visible because defaults
re-mask them on load) include `agent.personalities`,
`agent.reasoning_effort`, `terminal.lifetime_seconds`, etc.
Fix: read the raw on-disk config and deep-merge the incoming PUT body
on top of it before saving. The frontend can only overwrite what it
explicitly sends; everything else is preserved verbatim.
Reuses the existing `_deep_merge` helper from `hermes_cli.config`.
Tests:
- `test_round_trip_preserves_custom_providers` exercises the exact bug:
seed config with custom_providers, GET → drop the key → PUT,
assert it's still on disk.
- `test_round_trip_preserves_schema_invisible_nested_keys` covers the
shallow-vs-deep-merge case for nested dicts under `agent` etc.
Both fail on current main; both pass with this patch.
Follow-up on the gateway-picker salvage: the cherry-picked change added a
second copy of the MoA virtual-provider row in model_switch.py, duplicating
inventory._moa_provider_row (same slug/name/preset-models, identical extra
fields). Make _moa_provider_row take a bare current_provider string and reuse
it from the gateway picker path so the row shape lives in one place and the
two surfaces can't drift.
The verify-on-stop guard fired too eagerly — including on doc/markdown/skill
edits with nothing to verify, where it pushed a pointless /tmp verification
script. Three changes:
1. Default OFF for new installs: agent.verify_on_stop defaults to false
(was the "auto" surface-aware sentinel). _config_version bumped 30 -> 31.
2. One-time migration (v30 -> v31): existing installs are switched off once,
but only when the value is missing or still the "auto" sentinel — an
explicit true/false the user set is preserved.
3. Path filter: build_verify_on_stop_nudge() now drops documentation/prose
paths (.md/.mdx/.rst/.txt/LICENSE/CHANGELOG/...) so even when explicitly
enabled, a doc-only turn never nudges. Mixed doc+code turns still nudge on
the code paths.
The legacy "auto" sentinel is still honored when set explicitly (ON for
interactive coding surfaces, OFF for messaging). HERMES_VERIFY_ON_STOP env
override unchanged.
/moa no longer does a sticky model switch. It now always runs a single
prompt through the default MoA preset and restores the prior model
afterward; the whole argument is the prompt (no preset-name matching).
To switch to a MoA preset for the session, select it from the model
picker, where presets already surface under a virtual Mixture of Agents
provider on every model-selection surface.
Also fixes#53444: the TUI one-shot only set session[model_override],
which the already-built cached agent ignored, so MoA silently never ran
and the turn used the original model. The TUI now does a real in-place
agent.switch_model() via _apply_model_switch() when a live agent exists
(with a proper restore after the turn), and falls back to a model_override
for lazy/unbuilt sessions.
Removes the redundant sticky-switch branch from the CLI, gateway, and TUI
/moa handlers; updates the command description, usage string, and docs.
A model selected via the CLI (e.g. /model openrouter/<uncurated-name>) was
absent from every model picker — the main picker AND the MoA reference/
aggregator slot pickers — because each provider row only carried its curated
catalog. Inject the current model at the front of its provider's row so it is
selectable and shown everywhere.
* Return None instead of erroring on drain login failure
* Fix login on drain
* Remove login for drained endpoints flow and clean the code
* chore: drop unrelated credits changes from this PR
* Remove extra comments that were not really necessary
When an MCP server requires OAuth, the interactive `hermes` TUI froze on
startup: background MCP discovery hit the OAuth flow, which on an interactive
TTY spawns a daemon thread doing a blocking `sys.stdin.readline()` (the
"paste the redirect URL" fallback in mcp_oauth._wait_for_callback). That
thread competes with the TUI's own stdin reader for the same terminal, so
keystrokes get swallowed and the TUI appears frozen (up to the 300s OAuth
timeout). Reported symptom: "MCP OAuth: authorization required / Open this URL
... the tui is freezing, not respond to typing."
Add a thread-local `suppress_interactive_oauth()` context manager in
tools/mcp_oauth.py; `_is_interactive()` returns False while it's active, so the
stdin paste-thread and prompt are never created. Background discovery
(hermes_cli/mcp_startup.py, tui_gateway/entry.py) now runs discovery inside
that context, so OAuth-requiring servers soft-skip (raise
OAuthNonInteractiveError, already handled) instead of stealing the TUI's stdin.
A real `hermes mcp login` on the main thread is unaffected (thread-local).
Salvaged from #35945 by @zapabob (authorship preserved via cherry-pick;
resolved a conflict against main's new mcp_discovery_timeout / wait_for_mcp_
discovery refactor, keeping both). Verified E2E: with suppression the paste
prompt is NOT printed and no stdin thread spawns (raises OAuthNonInteractive
soft-skip); without it the prompt shows (the freeze). Mutation-verified
(removing the suppress check in _is_interactive fails the regression test).
76 tests pass, ruff clean.
Closes#35927.
SELF-REVIEW FIX: the original #35945 used threading.local(), which does NOT
propagate to the dedicated mcp-event-loop thread where OAuth actually runs
(discover_mcp_tools dispatches the connect via run_coroutine_threadsafe), so
the suppression was a NO-OP in production (the tests passed only by stubbing
out the cross-thread dispatch). Converted to a contextvars.ContextVar, which
asyncio copies onto the scheduled coroutine — empirically verified suppression
now holds on the mcp-event-loop thread through the real _run_on_mcp_loop path.
Added a cross-thread regression test (fails on threading.local, passes on the
ContextVar) so the no-op can't regress.
On the desktop Channels / Messaging page, the "Open setup guide" button was
rendered as a bare <a href={platform.docs_url} target="_blank"> with no guard.
Plugin-provided platforms (Microsoft Teams, Google Chat, Line, Raft, Yuanbao,
…) ship an empty docs_url, so the anchor's href was "".
In a packaged build, Electron resolves an empty href against the current
document — the app's own index.html inside the asar bundle — and
shell.openPath then fails with an OS "file not found" dialog. This is exactly
the Windows error reported for Messaging → Teams → Open guide.
Fix (3 changes):
1. fix(desktop) — Only render the "Open setup guide" button when docs_url is
non-empty, and route clicks through openExternalLink so a relative/empty
value can never be treated as a local bundle path. Fixes the whole class
(every plugin platform), not just Teams.
2. fix(messaging) — Give the Teams platform plugin a real docs_url (Microsoft
Teams setup guide) so its card shows a working button instead of nothing.
3. fix(messaging) — Give the Google Chat platform plugin a real docs_url
(Google Chat setup guide) so its card shows a working button instead of
nothing. Originally from #48940; folded in here because that PR's test
was broken (it queried the HTTP endpoint, but google_chat is a dynamic
enum member that only appears after the adapter module is imported).
Test plan:
- apps/desktop — new src/app/messaging/index.test.tsx: button is hidden when
docs_url is empty; a real URL opens via the validated external opener (does
not navigate).
- apps/desktop typecheck (tsc --noEmit) clean.
- backend — test_teams_messaging_metadata_links_setup_guide: the Teams catalog
entry exposes the setup-guide docs_url.
- backend — test_google_chat_messaging_metadata_links_setup_guide: the Google
Chat catalog entry exposes the setup-guide docs_url.
Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Co-authored-by: p-andhika <andhika.prakasiwi@gmail.com>
- Use os.pathsep instead of literal ':' so Windows paths (C:\dir) and
the Windows separator ';' work correctly.
- Add 9 tests covering multi-root behavior: writes inside first/second
root, writes outside all roots, trailing/leading/double separators,
all-separators edge case, static deny priority, duplicate dedup.
- Update hermes_cli/tips.py tip string to mention multiple paths.
- Update docs to mention os.pathsep / ; on Windows.
Follow-up for salvaged PR #49557.
A MoA preset whose reference or aggregator slot points at the moa virtual
provider creates a recursive MoA tree. The runtime guards in moa_loop.py only
surface this mid-turn (references silently skipped, aggregator raises). Reject
it at the config chokepoint (_clean_slot) so it can never be saved, and hide it
from the desktop/dashboard slot pickers so it isn't offered as a dead choice.
_normalize_preset uses bare float() and int() to coerce
reference_temperature, aggregator_temperature, and max_tokens from
config.yaml. When a user hand-edits a non-numeric value (e.g.
max_tokens: "8k" or reference_temperature: "hot"), the coercion raises
ValueError. Since normalize_moa_config runs on every model-selection
and MoA turn (via resolve_moa_preset), the crash is unrecoverable and
blocks all MoA usage until the config is manually fixed.
Replace the bare casts with _coerce_float / _coerce_int helpers that
fall back to the default on TypeError/ValueError instead of raising.
Natural-language skill search returned a short, arbitrary list and never
surfaced NVIDIA (or OpenAI/Anthropic/HuggingFace) skills. Two causes:
1. The runtime index collapses every GitHub tap into source="github", so
there was no way to find or filter by provider at the CLI — the per-tap
identity only existed in the docs-site catalog.
2. HermesIndexSource.search matched only name/description/tags (not the
identifier or provider) and broke at the first `limit` hits in raw index
order, burying the most relevant skills. `search` also defaulted to
--limit 10 against an 86k-entry catalog.
Changes:
- GitHubSource stamps a per-tap provider label (extra.provider) on each
skill via github_provider_for(); source stays "github" so dedup/floor/
index-skip logic is untouched. Flows into the built index.
- HermesIndexSource.search now matches identifier + provider too, and
collect-then-ranks (exact > prefix > whole-word > substring) instead of
break-at-limit.
- --source nvidia|openai|anthropic|huggingface|voltagent|gstack|minimax
provider filters for browse/search (narrows merged results by provider).
- search --limit default 10 -> 25; table Source column shows the provider
label for github skills.
Tested: 181 unit tests pass; E2E against the live runtime index confirms
'nvidia'/'cuda' searches now surface NVIDIA-provider skills and
--source nvidia narrows to exactly the NVIDIA catalog.
The post-update gateway restart path relaunched the gateway with the
venv's console `python.exe` (via `get_python_path()` in
`_gateway_run_args_for_profile`). On Windows this leaves a terminal
window open permanently: uv's `venv\Scripts\python.exe` is a launcher
shim that re-execs the *base* console interpreter, which allocates its
own conhost — and `CREATE_NO_WINDOW` cannot suppress that second window.
The clean-start path (`_spawn_detached`) already dodges this by routing
through `_resolve_detached_python` to use the windowless base
`pythonw.exe`; the restart watcher did not.
Symptom (reported on Windows 11): after an in-app GUI update, a console
window for the gateway stays open and never closes. Confirmed on the
reporter's box — the running gateway was `python.exe ... gateway run
--replace` with a live conhost child and the foreground "Press Ctrl+C to
stop" banner, born exactly at the update's "Restarting Windows gateway"
log line.
Fix:
- Add `gateway_windows.windowless_gateway_restart_spec(run_argv)` which
rewrites a console-python gateway argv into the windowless `pythonw.exe`
equivalent and returns the cwd + env overlay (VIRTUAL_ENV / PYTHONPATH /
HERMES_HOME) the base interpreter needs to import `hermes_cli` without
the venv launcher's site config. No-op on POSIX.
- `_spawn_gateway_restart_watcher` now applies that rewrite on Windows and
threads cwd= / env= into the inlined respawn Popen. Covers both restart
entry points (`launch_detached_profile_gateway_restart` and
`launch_detached_gateway_restart_by_cmdline`). CREATE_NO_WINDOW |
DETACHED_PROCESS | CREATE_BREAKAWAY_FROM_JOB and the breakaway-denied
fallback are all preserved.
Verified E2E on a real Windows 11 box: drove the actual watcher against a
dummy old-pid; the respawned gateway came up as `pythonw.exe` (zero
console python, no conhost child) and booted fully (housekeeping + kanban
dispatcher started → imports resolved under the base interpreter).
Tests: TestWindowlessGatewayRestartSpec (behavior) +
TestGatewayDetachedWatcherWindowsFlags regression assert. Pre-existing
Linux-only failures on a Windows host (SIGKILL, systemd, docker-root)
confirmed identical on the bare base.
projects.db (per-profile project store) and kanban.db were missing from
_QUICK_STATE_FILES, so the pre-update quick snapshot never backed them up.
On a desktop upgrade, when the update flow removes/replaces the file and the
post-update schema-init re-creates an empty one, all user-created projects,
folder mappings, the active-project pointer, kanban board bindings, and tasks
vanish silently — no error.
Add the per-profile user-created stores to the snapshot set:
- projects.db — project store
- response_store.db — gateway conversation history / tool payloads (WAL)
- memory_store.db — holographic memory facts/entities (WAL)
- verification_evidence.db — agent verification audit trail
- kanban.db — default board (back-compat <root>/kanban.db)
- kanban/boards — non-default boards (<root>/kanban/boards/<slug>/kanban.db
+ metadata); workspaces/ and attachments/ subtrees
are skipped as large + regenerable.
Also: the directory-branch of create_quick_snapshot now routes *.db through the
WAL-safe _safe_copy_db (SQLite backup() API), matching the top-level file path —
previously a non-default board DB with an open WAL could be copied inconsistently.
Salvaged from #52930 by @0xDevNinja (authorship preserved via cherry-pick).
On top of the original (which covered only projects.db + the default kanban.db),
this adds: non-default-board coverage, the three sibling per-profile DBs that
meet the same upgrade-wipe criteria, WAL-safe directory copies, and a
workspaces/attachments skip to avoid snapshot bloat (×20 retained). 8 tests,
all mutation-verified; E2E verified snapshot→wipe→restore preserves all six
store types on the real code path.
Closes#52889. Supersedes #52930.
## Description
On macOS 26.x, `launchctl bootstrap` and `launchctl kickstart` return exit code 5 ("Input/output error"), which Hermes already anticipates and handles by spawning a detached fallback process. However, the gateway status reporting is ambiguous:
- `gateway status` says "Gateway service is loaded" (because `launchctl list` returns exit 0)
- But `launchctl print` shows `state = not running` — launchd isn't actually supervising anything
- The detached fallback PID running is invisible to the status command
- Users can't tell whether auto-start at login and auto-restart on crash are available
### Root Cause
Two problems in `hermes_cli/gateway.py`:
1. **`_probe_launchd_service_running()`** (line 1067): Determined launchd service liveness solely by `launchctl list <label>` exit code. On macOS 26, this returns 0 even when the service is only *registered* but not running (output lacks a `"PID"` field). This caused `GatewayRuntimeSnapshot.service_running = True` incorrectly, which suppressed the process/service mismatch warning.
2. **`launchd_status()`** (line 3569): Used the same binary "loaded/not loaded" check without inspecting whether launchd actually has a PID, whether a detached fallback is running, or whether auto-start/restart are available.
### Changes
**`hermes_cli/gateway.py`:**
1. **New `_parse_launchd_pid_from_list_output()` helper** — Extracts the PID from `launchctl list` output. When launchd is actively supervising, the output includes `"PID" = <number>;`. When only registered but not running, no PID field is present.
2. **Fixed `_probe_launchd_service_running()`** — Now requires a PID in the `launchctl list` output to confirm launchd is actually supervising. This correctly sets `service_running = False` when launchd has the service registered but `state = not running`, which triggers the existing process/service mismatch detection.
3. **Reworked `launchd_status()`** — Reports clearly separated information:
- LaunchAgent plist currentness (stale or current)
- Whether launchd is actively supervising (with PID)
- Whether a detached fallback PID is running
- Whether auto-start at login and auto-restart on crash are available
- When launchd supervision is known to be unavailable, explains why
4. **Persistent unsupported marker** (`~/.hermes/.gateway-launchd-unsupported`) — Written when `_launchd_fallback_to_detached()` is called (launchd exit 5/125). Allows `launchd_status()` to explain *why* launchd can't supervise even when no fallback process is currently running. Cleared automatically when a future bootstrap/kickstart succeeds (e.g., after an OS update fixes the issue).
5. **Updated `_print_gateway_process_mismatch()`** — Distinguishes the managed detached fallback from a genuinely manual `nohup hermes gateway run`, providing accurate guidance for each case.
### Status Output Examples
**Before** (macOS 26, fallback active):
```
Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist
✓ Service definition matches the current Hermes install
✓ Gateway service is loaded
{
"Label" = "ai.hermes.gateway";
"OnDemand" = true;
...
};
```
**After** (macOS 26, fallback active):
```
Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist
✓ Service definition matches the current Hermes install
⚠ Gateway service is registered but launchd is not supervising it
launchd cannot manage the gateway on this macOS version.
✓ Detached fallback process is running (PID 12345)
Cron jobs will fire. Stop with: hermes gateway stop
⚠ Auto-start at login and auto-restart on crash are NOT available.
```
**After** (normal launchd supervision):
```
Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist
✓ Service definition matches the current Hermes install
✓ Gateway is supervised by launchd (PID 12345)
Auto-start at login and auto-restart on crash are available.
```
### Tests
Updated 5 existing tests and added 11 new tests in `tests/hermes_cli/test_gateway_service.py`:
- PID parsing from `launchctl list` output (with PID, without PID, empty, unquoted PID)
- `_probe_launchd_service_running()` requires PID presence
- Unsupport marker lifecycle (write, clear, persist across fallback)
- Marker cleared on successful bootstrap
- `launchd_status()` reporting: supervised, fallback-running, fallback-unavailable
- Existing fallback tests now verify marker creation
### Related Issues
- Issue #23387 (original macOS 26 launchd workaround)
- Issue #42524 (this issue)
A config migration (or hand-edit) that leaves an invalid toolset name in
`platform_toolsets` — e.g. the #38798 corruption that rewrote `hermes-cli` to
the non-existent `hermes` — silently disabled all affected tools:
resolve_toolset() returns [] for an unknown name, so the agent quietly lost its
tools with no error, warning, or log entry and degraded to text-only replies.
Surface it loudly at two points:
- After migration (migrate_config): validate platform_toolsets and record/print
a warning per unknown name, with a `hermes-<platform>` suggestion when that
would have been valid (the exact #38798 shape).
- At runtime (_get_platform_tools): if a platform was explicitly configured but
every toolset name is invalid, log a warning when tools are resolved for a
session — so an ALREADY-corrupted config is caught at startup, not only on the
next `hermes update`.
Logic lives in a new pure, side-effect-free helper (toolset_validation.py) with
validate_toolset injected, so it is unit-testable without the tool registry.
Note: the original v25→v26 migration that caused the corruption no longer
exists (config format is now v30; no migration step rewrites toolset names).
This change is the durable defense against the silent-failure mode regardless
of cause, matching the issue's "Expected: log a warning".
Salvaged from #39207 by @lEWFkRAD (authorship preserved via cherry-pick).
Tests: 9 helper cases (incl. the #38798 corruption shape, mixed valid/invalid,
zero-tools state, non-dict/scalar/non-string) + a runtime caplog test — both the
helper warning and the runtime guard mutation-verified to fail without the fix.
Closes#38798. Supersedes #39581 (prevent-in-v25→v26 — that path is gone),
#41006 / #40208 (repair-migration for already-corrupted configs).
Tasks 2.1 + 2.2 + 2.3 of the safe-shutdown plan — the reversible
quiesce-without-restart machinery NAS drives during a lifecycle action (D4a).
These ship together because the endpoint, the control channel, and the gateway
state machine are one coherent slice.
2.2 — control channel (gateway/drain_control.py, new):
The dashboard has no HTTP path into a running gateway (guardrails: "there is NO
external control channel into a running gateway"); restart/drain is driven only
by markers the gateway reacts to. So begin/cancel-drain writes/removes a
presence-based marker .drain_request.json (HERMES_HOME-scoped, atomic write,
never-raises read; a corrupt marker reads as present-contentless → fail-safe
toward quiescing). This is Q-B option A.
2.2 — gateway state machine (gateway/run.py):
- _external_drain_active flag, DISTINCT from the shutdown _draining flag: this
one does NOT exit the process and is fully reversible.
- _enter_external_drain / _exit_external_drain: idempotent transitions that
flip gateway_state→draining / →running via _update_runtime_status (preserving
the live active_agents count). exit refuses to revert to running during a
real shutdown or after the loop stops (shutdown wins).
- _drain_control_watcher: 1s background task (modelled on _handoff_watcher)
reconciling accept-state with the marker; honours a marker that survived a
restart on its first tick. Registered alongside the other watchers in start.
- New-turn accept gate in _handle_message, placed BEFORE the session-slot
claim: when draining, refuse to START a new turn (so active_agents can only
fall → no TOCTOU race), while in-flight turns finish untouched. Internal/
system events (restart-recovery replays, bg-process completions) bypass it.
2.1 — endpoint (hermes_cli/web_server.py):
POST /api/gateway/drain {action: drain|cancel}. Authenticated by the Task-2.0a
token seam (the drain plugin registered this exact path as a token route);
attributes the request to the verified token principal. Begin writes the
marker, cancel removes it — the gateway process owns the actual transition.
Force-override (D6) is NOT here; it maps onto the existing immediate
/api/gateway/restart force path.
Tests (mocked — necessary-not-sufficient; the HARD live gate Q-B is next):
- tests/gateway/test_external_drain_control.py — marker contract (write/clear/
read/corrupt/atomic), state machine (enter/exit/idempotency/shutdown-wins/
loop-stopped), watcher reconcile-enter-then-exit, new-turn refusal, and
in-flight-not-interrupted. 15 tests.
- tests/hermes_cli/test_web_server.py — /api/gateway/drain begin/default-begin/
cancel/cancel-idempotent/bad-action-400. 6 tests.
- dashboard.drain_auth config section already added in 2.0b commit.
All touched suites green: 301 (gateway+auth) + 9 (web_server endpoints) passed.
Intentionally deferred:
- HARD live-validation gate (Q-B): real isolated `hermes gateway run`, drive a
real begin-drain marker, prove the 5-point checklist a–e.
- Spec-doc status flip + Phase-2 PR.
Build status: external-drain, restart-drain, status, dashboard-auth, drain-plugin,
token-auth, and web_server-endpoint suites green.
Task 2.0b: the concrete shared-bearer-secret auth provider, the FIRST consumer
of the generic token-auth capability (Task 2.0a). Implements decisions.md Q-A.
plugins/dashboard_auth/drain/ (bundled, discovered like dashboard_auth/basic):
- DrainSecretProvider: non-interactive provider, supports_token=True. Verifies
an inbound Authorization bearer token against a per-agent shared secret with
hmac.compare_digest (constant-time, no timing oracle) and, on a match,
vouches for the caller as the "drain-control" principal scoped to "drain".
The five interactive ABC methods raise NotImplementedError; verify_session
returns None (stacks harmlessly in the cookie-verify loop).
- assess_secret_strength(): fail-closed entropy gate. Rejects secrets shorter
than 43 url-safe-b64 chars (~256 bits), with < 16 distinct characters, or
below 128 bits Shannon entropy — so a weak/structured/repeated secret can
never be silently accepted. Enforced both at register() (friendly skip
reason) and in __init__ (raises — defence in depth).
- register(ctx): no-op + skip reason when HERMES_DASHBOARD_DRAIN_SECRET is
unset; rejects a weak secret fail-closed (drain endpoint stays gated). On a
strong secret, registers the provider AND opts /api/gateway/drain into the
generic token-auth seam via register_token_route().
Config: the secret is a CREDENTIAL → carried via HERMES_DASHBOARD_DRAIN_SECRET
(per-agent, provisioned by NAS at deploy). Behavioural knobs only
(dashboard.drain_auth.{scope,min_secret_chars}) live in config.yaml — added to
DEFAULT_CONFIG with the .env-is-for-secrets rationale documented inline.
Tests: tests/plugins/dashboard_auth/test_drain_provider.py — entropy gate
(strong pass; empty/short/repeated/few-distinct/custom-min reject), verify_token
(match → scoped principal, wrong/empty → None, custom scope), protocol
compliance, interactive-methods-raise, and register() (skip-no-secret,
fail-closed-weak-secret, strong-env-secret registers + route opt-in, config
scope + min_secret_chars). 21 new tests; drain + token-auth suites 44 passed.
Verified the plugin is discovered as dashboard_auth/drain alongside basic/nous.
Intentionally deferred:
- The begin/cancel-drain endpoint handler itself — Task 2.1.
- The dashboard→gateway control channel — Task 2.2.
Build status: dashboard-auth + drain-plugin suites green.
Task 2.0a of the safe-shutdown drain-coordination plan. Widens the dashboard
auth framework GENERICALLY to support non-interactive (service-to-service)
bearer-token auth, mirroring the existing supports_password precedent. This is
a reusable capability — any future machine-credential provider plugs in without
core changes (decisions.md Q-C). The drain bearer-secret plugin (Task 2.0b) is
the first consumer, not the definition.
- base.py: add TokenPrincipal dataclass (the token analog of Session) +
supports_token capability flag + verify_token() on the ABC (default raises
NotImplementedError so a misconfigured provider fails loud). Contract mirrors
verify_session stacking: return None for unrecognised tokens (never raise),
raise ProviderError only on a genuine backing-store outage.
- registry.py: list_token_providers() — the supports_token subset, in
registration order. Empty when none registered (token routes fail closed).
- token_auth.py (new): route-agnostic seam. Routes opt in via
register_token_route(exact path); token_auth_middleware owns the auth
decision for those routes only — authenticate via stacked providers, attach
request.state.token_principal + token_authenticated, pass through. 401 on
missing/unrecognised token, 503 when a provider was unreachable, untouched
passthrough for non-token routes. Fails closed (never open).
- web_server.py: install the seam OUTERMOST (registered last → runs first).
Both downstream gates (legacy auth_middleware + gated_auth_middleware) honour
request.state.token_authenticated and skip enforcement, so a token-authed
service request is never bounced to /login.
- audit.py: TOKEN_AUTH_SUCCESS / TOKEN_AUTH_FAILURE events.
Tests: tests/hermes_cli/test_dashboard_token_auth.py — ABC flag default,
verify_token NotImplementedError, registry filter, bearer extraction
(case-insensitive scheme, malformed/non-bearer → ""), provider stacking
(first-match-wins, unreachable-remembered, unreachable-then-valid, buggy
provider doesn't crash the gate), and the seam's passthrough/401/503/
fail-closed behaviour. 29 new tests; full dashboard-auth suite 169 passed.
Intentionally deferred:
- The concrete shared-bearer-secret provider plugin — Task 2.0b.
- The begin/cancel-drain endpoint that registers itself as a token route —
Task 2.1.
Build status: dashboard-auth + plugin-hook suites green.
Fixes#36767.
Two complementary recoveries for the recurring "delete three cache files and
re-auth by hand" ritual when an MCP server's dynamically-registered OAuth
client goes dead server-side (IdP redeploy / DB wipe / rebrand):
- Auto-heal (token-endpoint subset): HermesMCPOAuthProvider now sniffs
auth-flow responses and, on a 400/401 `invalid_client` from the discovered
token endpoint, backs up + deletes `<server>.client.json` and `.meta.json`
and clears the in-memory client so the SDK re-runs RFC 7591 dynamic client
registration on the next flow. Conservative by construction: only
dynamically-registered (non config-supplied) clients, only the token
endpoint, only on a word-boundary `invalid_client` match (so RFC 7591's
`invalid_client_metadata` does not trip it); best-effort so a miss never
breaks the live flow. Covers both code-exchange and refresh when the token
endpoint was discovered. Tokens are preserved.
- `hermes mcp reauth [<name>|--all]`: the reporter's primary symptom — the
IdP's in-browser "Redirect URI Mismatch" — produces no HTTP signal (the SDK
only sees a callback timeout), so it cannot be auto-detected. The new
command re-auths one or ALL `auth: oauth` servers, serially: one browser
flow at a time, which also fixes the startup popup storm when several
servers are stale at once. Single-server reauth is factored out of
`mcp login` and shared.
Tests: +14 (poison helper x2; token-endpoint detection x5 incl. wrong-endpoint,
success-response, pre-registered, and invalid_client_metadata negative guards;
a bridge integration test driving the real async_auth_flow generator to prove
the detection hook preserves the bidirectional asend() forwarding contract;
reauth CLI x6). Verified against the pinned mcp==1.26.0: scripts/run_tests.sh
122/122 green for the touched suites; check-windows-footguns.py and ruff clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(kanban): typed block reasons + unblock-loop breaker
Stops the kanban blocked-task loop: a worker blocks a task, a cron
unblocks it, the worker re-blocks for the same reason, repeat forever.
block_task now takes a typed kind and a persistent block_recurrences
counter on the tasks table:
- kind=dependency routes to todo (parent-gated, auto-resumed), never
the human 'blocked' bucket a cron would keep unblocking.
- needs_input/capability/transient/untyped land in blocked; each
same-cause re-block after an unblock increments block_recurrences,
and at BLOCK_RECURRENCE_LIMIT (default 2) the task routes to triage
for a human instead of blocked.
- unblock_task no longer resets block_recurrences (the amnesia that
let the loop run unbounded); complete_task clears it on success.
Wired through the worker kanban_block tool (new kind arg) and the
hermes kanban block --kind CLI flag, both reporting where the task
actually landed. Docs + 11 new tests; 536 existing kanban tests green.
* test(kanban): make second-block notify test use a distinct block cause
test_notifier_second_blocked_delivers blocked the same task twice with
the same (untyped) reason, which now trips the new unblock-loop breaker
and routes the second block to triage instead of blocked — so only one
'blocked' notification fired. The test's actual intent is that TWO
distinct block cycles each notify; give the two cycles different kinds
(needs_input then capability) so they're genuinely separate blocks. The
same-cause loop→triage path is covered by test_kanban_block_kinds.py.
A readable state.db can still reject every message write through the
messages_fts* triggers when the FTS5 index is corrupt: base-table reads and
PRAGMA integrity_check pass, but INSERT INTO messages fails with 'database
disk image is malformed'. The gateway reloads conversation_history from disk
each turn, so a silently-failed write hands the next turn stale/empty history
even though the same cached AIAgent still holds the live transcript — causing
immediate same-session amnesia. (#50502)
- hermes_state.py: _db_opens_cleanly() now drives a rolled-back message write
through the FTS triggers, so write-only corruption (which the read-only
probe reported healthy) is detected. repair_state_db_schema() gains an
in-place FTS5 'rebuild' strategy (tier 0) before the dedup/drop tiers, plus
an already_healthy short-circuit. Both 'hermes sessions repair' and
'hermes doctor' route through these, so the fix covers the whole class.
- hermes_cli/doctor.py: the state.db check runs the write-health probe even on
the success (readable) path and repairs in place with --fix.
- gateway/run.py: _select_cached_agent_history() prefers the cached agent's
longer live _session_messages over a shorter persisted transcript, so an
FTS write failure can't wipe in-session context.
- tests: regressions for write-health detection, in-place repair preserving
rows + resuming writes, the already_healthy shortcut, and the gateway guard.
Combines the approaches from #50504 (@0-CYBERDYNE-SYSTEMS-0, issue author),
#52165 (@davidgut1982), and #50576 (@trevorgordon981).
The desktop scheduler can overwrite cron/jobs.json with its own small
set of internally-tracked crons after an update/restart, causing
partial loss of tool-created cron jobs. The previous guard only
checked for total loss (live_count == 0), missing the case where
live_count > 0 but less than the pre-update snapshot count.
Compare live_count against snap_count instead of checking for zero,
so both total loss (0 vs N) and partial loss (1 vs 19) trigger
restoration.
Salvaged from #52161 by @liuhao1024.
Closes#52144
Adds a CodeMirror 6 spot editor to the right-rail file preview so users can
make quick edits in-app without leaving for an IDE. Entering edit mode is a
pure in-place swap of the read view — same fixed-height header, same gutter
geometry/typography (mirrors SourceView 1:1) so nothing shifts — toggled via
the Edit button, a bare `e` when the pane is hovered/focused, or the tab.
- Save path is transport-agnostic (writeDesktopFileText): local Electron IPC
or a new hardened POST /api/fs/write-text on the dashboard server (path
validation, parent-must-exist, regular-files-only, size cap, atomic
temp-file + os.replace), behind the existing auth middleware.
- Stale-on-disk guard re-reads before writing and offers overwrite vs
discard-and-reload instead of clobbering external/agent edits.
- VS Code-style modified dot on the tab; ⌘/Ctrl+S and ⌘/Ctrl+Enter save,
Esc cancels; GitHub highlight style matched to the read view's Shiki theme.
- Typing stays render-free (draft in a ref; dirty flips once at the boundary).
The pre-update HERMES_HOME zip shipped on by default (DEFAULT_CONFIG +
runtime fallback both True), so every `hermes update` zipped the entire
~/.hermes — sessions DB, caches, skills — adding minutes to each update.
The shipped cli-config.yaml.example, the --backup help, and the example
config all already said "off by default," so the live default
contradicted its own documentation.
Flip the default to off everywhere: DEFAULT_CONFIG, the runtime
`.get(..., False)` fallback in _run_pre_update_backup, and the stale
--backup help string. Users who want the #48200 safety net opt in via
updates.pre_update_backup: true or --backup for a single run.
Updated test_default_enabled_creates_backup -> test_default_disabled_is_silent
to assert the new default (silent no-op, no zip).
* fix(cron): add default retention to per-run job output to bound disk usage (#52383)
Per-run cron output (cron/output/<job>/<timestamp>.md) is written once
per execution and was never pruned, so a frequently-scheduled job on
a long-running deploy accumulates one file per run indefinitely and
can fill the volume ('no space left on device').
save_job_output() now keeps the most recent N output files per job and
removes older ones. N defaults to 50 and is configurable via
cron.output_retention; a non-positive value disables pruning for
operators who manage cleanup externally.
Salvaged from #52402 by @0xDevNinja.
Closes#52383
* fix(config): add cron.output_retention to DEFAULT_CONFIG
Follow-up to #52383: the retention config key was functional via
get()-with-default but missing from DEFAULT_CONFIG, so the deep-merge
wouldn't auto-populate it for new installs. Add it explicitly.
---------
Co-authored-by: 0xDevNinja <manmit0x@gmail.com>
* feat(moa): expose MoA presets as selectable virtual models
Reconstructed onto current main (PR #46081's base had diverged with no common
ancestor, marking the PR dirty so CI never dispatched). MoA is now a virtual
provider: each named preset is a selectable model under provider 'moa', and the
preset's aggregator is the acting model that answers and calls tools.
Reference models fan out in parallel via a bounded ThreadPoolExecutor (the same
batch pattern delegate_task uses) — all references dispatched at once, collected
when every one finishes, then handed to the aggregator. Output order is
preserved, failures and the MoA-recursion guard stay isolated per reference.
- Removed the old mixture_of_agents model tool and moa toolset.
- Added moa as a virtual provider in the provider/model inventory.
- /moa is shortcut behavior over model selection (default preset / named preset
/ one-shot prompt).
- Dashboard + Desktop manage named presets; presets appear in model pickers.
- Parallel reference fan-out in agent/moa_loop.py with regression test.
* fix(moa): thread moa_config through _run_agent to _run_agent_inner
The reconstructed gateway MoA wiring declared moa_config on _run_agent (the
profile-scoping wrapper) and used it inside _run_agent_inner, but the wrapper
never forwarded it — _run_agent_inner had no such parameter, so the runtime hit
NameError: name 'moa_config' is not defined on the compression-failure session
sync path. Add moa_config to _run_agent_inner's signature and forward it from
both wrapper call sites (multiplex and non-multiplex). Caught by
tests/gateway/test_compression_failure_session_sync.py on CI shard test(4).
* fix(moa): classify moa as a virtual provider in the catalog
The moa virtual provider has no PROVIDER_REGISTRY/ProviderProfile entry, so
provider_catalog() fell through to the default auth_type="api_key" with no
env vars — tripping two catalog invariants:
- test_provider_catalog: api_key providers must expose a credential env var
- test_provider_parity: every hermes-model provider must be desktop-configurable
moa already declares auth_type="virtual" in HERMES_OVERLAYS; consult that
overlay as an auth_type fallback so the catalog reports moa as virtual (no real
credential, no network endpoint). Exempt virtual providers from the desktop
parity union check the same way 'custom' is exempt — derived from the catalog,
not a hardcoded slug, so future virtual providers are covered too.
In-place compaction (single durable session id, non-destructive soft-archive)
becomes the default. Rotation is now the opt-out fallback via
compression.in_place: false.
Prerequisite: #50098 (hygiene guard reads result flag not config flag) merged
first — without it, flipping the default causes permanent transcript loss on
gateway hygiene-compress and /compress when no session_db is available.
Blast radius (empirically measured on current main): 7 rotation-asserting
tests broke and are pinned to in_place=False in the companion test commit:
- tests/agent/test_compression_concurrent_fork.py (2)
- tests/agent/test_compression_logging_session_context.py (1)
- tests/agent/test_compression_rotation_state.py (1)
- tests/run_agent/test_compression_boundary_hook.py (2 _make_agent helpers)
- tests/gateway/test_compression_concurrent_sessions.py (2)
Rotation stays as a working fallback and deserves continued coverage.
Plan: .hermes/plans/in-place-compaction-38763.md
Switching sessions in the desktop app could freeze the whole UI for
several seconds on heavy, tool-rich chats. Root causes and fixes:
- Cold `session.resume` built the AIAgent (MCP discovery, prompt/skill
build) *before* returning, and the desktop awaits that RPC before it
paints — so the entire switch blocked on the build. Add an opt-in
`defer_build` resume path (the contract `session.create` already uses):
return the full display transcript immediately, register an upgradable
live session, and pre-warm the agent on a short timer. The persisted
runtime identity (model/provider/base_url/api_mode/reasoning/tier) is
restored on the deferred build so it can't drop the provider.
- Nothing bounded how many in-memory agents accumulate; a user who
reconnects often piled up detached sessions for the full 6h TTL. Add a
soft LRU cap (`max_live_sessions`, default 16) that evicts the
least-recently-active DETACHED sessions (no live client) — never a
running, awaiting-input, mid-build, or live-transport one. Reopening
re-resumes from disk.
- On the prefetch-hit cold-resume path, skip rebuilding a throwaway
merged-message array (and its 1000-entry Map) when the prefetch already
painted the exact transcript; the downstream sameMessageList guard
already drops the publish, so it was pure main-thread cost.
The desktop opts into `defer_build` for every non-watch cold resume; the
eager path stays for CLI/TUI and existing callers.