Commit graph

1682 commits

Author SHA1 Message Date
Afnath Ahamed
f98ffbc246 fix(models): live-first merge + update opencode-zen catalog + uncap aggregator picker 2026-06-27 21:23:25 -07:00
Teknium
3b23a984b5
feat(kanban): stamp handoff freshness so workers don't read stale state as current (#53973)
Multi-agent boards leak staleness: a sibling worker's parent handoff,
comment, or prior-attempt summary gets read by the next worker as live
truth even when it's a day old. build_worker_context surfaced the text
with (at best) a bare absolute timestamp, which an LLM reads as fact
regardless of age — parent results had no timestamp at all.

Adds a coarse relative-age stamp (just now / 18h ago / 3d ago) to every
recalled-state line and a one-line 'point-in-time snapshot, re-verify
against source' frame on the parent-results section, so the worker sees
when handoffs were produced and re-checks stale ones before acting.
2026-06-27 21:21:54 -07:00
kshitijk4poor
2af1678bfc fix(auth): explicit provider intent beats stale OAuth active_provider (#29285)
`resolve_provider("auto")` checked `auth.json` `active_provider` BEFORE the
config.yaml `model.provider` and env-var API-key checks. So a user who was
OAuth-logged-into one provider (e.g. Anthropic) but had set an explicit
`model.provider` or exported an API key (e.g. `OPENAI_API_KEY`) was silently
routed to the stale OAuth provider — the override was invisible and surprising.

Reorder the auto-path so explicit intent wins (the order the issue asks for):

  1. explicit CLI api_key/base_url
  2. config.yaml `model.provider`            (safety net — see below)
  3. OPENAI_API_KEY / OPENROUTER_API_KEY env
  4. OpenRouter credential pool
  5. provider-specific API-key env vars
  6. auth.json `active_provider` (OAuth)      ← demoted to last-resort
  7. AWS Bedrock credential chain
  8. error

`active_provider` is still honored — it's just a last-resort fallback chosen
only when the user expressed no other preference, instead of overriding one.

The normal chat/gateway/TUI/ACP/status path already resolves config.provider
upstream in `resolve_requested_provider()` before "auto" is reached, so this
duplicate config check is the safety net for the lone direct caller
(`main.py` `resolve_provider("auto")`) and any future bypass. Because every
surface funnels through this one resolver, the fix propagates everywhere with
a single edit — no sibling path re-implements precedence.

Also add a one-shot WARN when resolution lands on `active_provider` while a
populated `model` config dict lacks a `provider` key — surfacing the silent
override the issue reported without breaking first-install.

Synthesizes the two competing PRs: #29615 (LifeJiggy — config-before-auth +
the silent-override framing) and #29809 (Minksgo — the env-before-auth
reorder). #29809 could not be merged directly (bundled unrelated, un-opt-in
cost-tagging telemetry); its reorder idea is incorporated here and credited.

Tests: tests/hermes_cli/test_provider_precedence.py — config/env beat stale
OAuth, OAuth still used as last resort, explicit request short-circuits, WARN
fires on silent fall-through. Full provider-resolution suites: 374 passed.

Fixes #29285

Co-authored-by: LifeJiggy <141562589+LifeJiggy@users.noreply.github.com>
Co-authored-by: Minksgo <153416856+Minksgo@users.noreply.github.com>
2026-06-27 19:49:02 -07:00
郝鹏宇
98488c4be4 fix(config): prevent save_config from materialising schema defaults
Fixes #27354

Root cause:  called during init (or by any code path
that saves ) wrote injected schema defaults into
config.yaml as if the user had authored them.  Two fix layers:

1.  now only injects
    when the user actually set
    somewhere (root or agent).  A user who never set
    keeps it absent, so 's explicit-path
   detection won't treat it as user-authored.

2.  gains a  parameter and a
   new  pass that removes keys matching
    unless those paths were explicitly present in the
   **raw** (pre-normalization) config on disk.  Explicit-path detection
   uses  on  *before* any
   normalisation runs — preventing injected-in defaults from being
   mistaken for user-set values.

All migration and edit-config call sites pass
to preserve their intentional default-seeding behaviour.

New helpers:
-   — collects leaf-key paths from a raw dict
-    — removes keys matching schema defaults

Test coverage: 4 new regression tests (59 total, all passing).
2026-06-27 19:38:11 -07:00
Teknium
d3d621f7c3
revert(windows): roll back terminal-popup PRs #53791 #53810 #53829 (#53853)
* Revert "fix(windows): capture is not a no-window boundary; route flashing spawns through chokepoint (#53829)"

This reverts commit 2ecca1e7d3.

* Revert "fix(windows): stop terminal-window popups from background spawns (#53810)"

This reverts commit 5db1430af9.

* Revert "fix(windows): stop subprocess console-window popups + add CI guard (#53791)"

This reverts commit ef17cd204d.
2026-06-27 15:59:00 -07:00
brooklyn!
5db1430af9
fix(windows): stop terminal-window popups from background spawns (#53810)
* fix(windows): stop terminal-window popups from background spawns

Native-Windows desktop/gateway users saw cmd/conhost windows flash on
gateway restart, image paste, the dashboard Projects tree, voice notes,
and ~5 min after closing the app (detached cron). Two root causes:

- Console-subsystem exes (taskkill, schtasks, wmic, netstat, tasklist,
  agent-browser, git, ffmpeg, powershell, git-bash) spawned via raw
  subprocess allocate a fresh console when the launching process has
  none (pythonw desktop backend / detached gateway) - even with output
  captured.
- uv venv pythonw shims re-exec console python.exe, so Python children
  get a console regardless of how they're launched.

Fixes:
- Single hidden-spawn primitive (_subprocess_compat.run/.popen) that ORs
  CREATE_NO_WINDOW on Windows, no-op on POSIX. Route every Hermes-owned
  console-exe spawn through it.
- FreeConsole() catch-all in hermes_bootstrap: any Python child that
  exclusively owns an auto-allocated console detaches it at startup
  (GetConsoleProcessList()==1 gate leaves shared interactive consoles
  untouched).
- Replace PowerShell/wmic gateway PID scans with in-process psutil.
- Skip schtasks queries on non-interactive desktop restarts.
- Prefer native agent-browser .exe over .cmd shims.
- Guard test bans raw subprocess spawns of the Windows-only console
  tools repo-wide so the popup class can't regress.

* fix(windows): scope FreeConsole to background entry points; fix merge fallout

Console detach review (per #53810 feedback): GetConsoleProcessList()==1 can't
tell a uv pythonw->python phantom console apart from a user opening the
interactive CLI/TUI in its own fresh console (double-click, shortcut, ConPTY) —
both report a single attached process with a tty. Running FreeConsole() in the
import-time bootstrap therefore risked detaching a legitimately-interactive
terminal.

- Extract FreeConsole into explicit hermes_bootstrap.detach_orphan_console();
  remove it from apply_windows_utf8_bootstrap() (import side effect).
- Call it only from known background mains: gateway run, dashboard backend
  (start_server, what the desktop spawns), cron standalone, tui_gateway entry,
  slash worker. Interactive CLI/TUI never calls it.
- Behavior-contract tests: frees only when solo owner, leaves shared console,
  no-op without console / on POSIX, and asserts it's not an import side effect.

Merge fallout from origin/main (#53791):
- local.py: 3-way merge left a dangling **_popen_kwargs (NameError crashing
  every terminal init). _subprocess_compat.popen already hides the window, so
  drop it.
- discord adapter: merge stacked an undefined windows_hide_flags() onto the
  primitive call; drop the redundant arg.
- test_gateway: scan now goes psutil-first (zero spawn); rewrite the
  case-variant test to drive that production path.

* test(claw): mock _subprocess_compat.run seam for Windows process scan

claw.py's Windows tasklist/powershell scan routes through the hidden-spawn
primitive; the tests still patched claw_mod.subprocess, so on win32 the mock
was never hit and real spawns returned nothing. Patch the actual seam.
2026-06-27 14:02:24 -07:00
Teknium
ef17cd204d
fix(windows): stop subprocess console-window popups + add CI guard (#53791)
* fix(windows): stop subprocess console-window popups + add CI guard

The single biggest source of Windows 'terminal popup' bug reports was bare
subprocess.run/Popen calls spawning a console window. The compat helpers
(windows_hide_flags / windows_detach_popen_kwargs) already existed but the
footgun checker had no rule to stop new bare calls from reintroducing the flash.

- scripts/check-windows-footguns.py: new AST-based rule flagging subprocess
  calls that can create a new console — output-redirection-aware (capture/
  redirect/check_output exempt) and POSIX-only-program-aware (launchctl/
  systemctl/brew/etc. exempt). Comprehensive on real popups, no annotation
  burden on calls that can't flash.
- Swept all genuine window-spawning sites through windows_hide_flags()/
  windows_detach_popen_kwargs(); marked intentionally-visible launches
  (editor/terminal/foreground re-exec) with '# windows-footgun: ok'.
- tests/scripts/test_windows_footgun_subprocess_rule.py: behavior-contract
  tests + full-repo cleanliness invariant.
- CONTRIBUTING.md: documents the rule + the helper pattern.

* test: accept creationflags kwarg in psutil_android fake_subprocess_run

The Windows no-window sweep added creationflags=windows_hide_flags() to
install_psutil_android.py's subprocess.run call; the test's fake stub had a
fixed (cmd) signature and raised TypeError on the new kwarg.
2026-06-27 13:03:51 -07:00
ailthrim
25ec01f79f fix(desktop): don't purge Electron cache / mirror-retry after a late build failure
`hermes desktop` / `hermes update` recover from a corrupt Electron download by
purging the cached zip + re-downloading and retrying the pack, and then by
falling back to a public mirror. That recovery is only meaningful when the
packaged executable is MISSING — the signature of a partial/corrupt unpack.

A LATE failure such as macOS code signing (#40187) leaves
`Hermes.app/Contents/MacOS/Hermes` (or the platform equivalent) in place.
Re-downloading Electron can't repair a signing failure, so the purge +
slow mirror retry just grind through another identical failure before the
build finally errors out.

Gate both recovery blocks on `_desktop_packaged_executable(desktop_dir) is None`
so a build that already produced the executable fails fast instead of
triggering the destructive download recovery. The corrupt-download path
(executable missing) is unchanged.

Salvage of #42782, re-applied onto current main (the surrounding recovery was
refactored to `_electron_dist_ok` / `_redownload_electron_dist` since the PR
was opened). Adds a regression test asserting no purge / mirror retry runs when
the executable exists, and updates the existing retry/mirror tests to model the
corrupt-download case (executable absent) the recovery is actually for.

Related to #40187 (the residual cache-purge sub-issue; the signing failure
itself is fixed by #52591).
2026-06-28 00:29:34 +05:30
teknium1
1ef19bad90 fix(model): show MoA preset picker on selection and label MoA in the banner
Selecting 'Mixture of Agents' in the `hermes model` provider picker fell
through silently — select_provider_and_model had no moa branch, so it just
reprinted the current model/provider summary and exited. And the CLI session
banner rendered the bare preset name (e.g. 'opus-gpt · Nous Research'),
which is meaningless out of context.

- Add _model_flow_moa: always lists the available presets (even one), then
  prints the full reference-models + aggregator breakdown for the selection
  and persists model.provider=moa / model.default=<preset> (dropping stale
  base_url + endpoint creds, since moa is a virtual local provider).
- Wire the branch into select_provider_and_model.
- build_welcome_banner takes provider; when 'moa' it renders
  'MoA: <preset> · agg <aggregator>' instead of a bare slug. Both CLI call
  sites pass self.provider.

Tests: 2 new banner tests (moa + non-moa unchanged); E2E verified the picker
persists the preset and clears stale base_url/api_key.
2026-06-27 11:45:07 -07:00
Teknium
27322612b4
fix(update): route loud build/installer output to update.log instead of the terminal (#53616)
* fix(update): route loud build/installer output to update.log instead of the terminal

hermes update flooded the terminal with the full vite asset dump,
electron-builder logs, npm deprecation warnings from the desktop build,
and the cua-driver installer's 'Next steps' wall. All of that is
low-signal noise the user doesn't need on a successful update.

- Capture the desktop --build-only subprocess (vite + electron-builder)
  into ~/.hermes/logs/update.log; print a one-line status, and on
  failure surface the last 15 lines + a pointer to the full log.
- Capture the cua-driver installer's output when verbose=False (the
  hermes update refresh path); concise upgrade line is unchanged.
- Add _log_only_write() / _run_logged_subprocess() helpers that write to
  the update.log handle without echoing to the terminal.

The repo-root npm install keeps streaming (capture_output=False) — that
is the deliberate #18840 guard so a slow postinstall download doesn't
look hung. The desktop npm install is a separate Electron process with
no such progress concern and is captured.

* fix(update): persist full cua-driver installer output to update.log

The captured cua-driver installer output was only sent to logger.debug
(agent.log) on failure, so the 'Next steps' wall was lost from
update.log entirely on success. Write the full captured output straight
to the update.log handle (sys.stdout._log) on both success and failure,
matching the desktop-build capture, so update.log keeps the complete
record of everything an update did.
2026-06-27 11:43:01 -07:00
Teknium
917f6bdb00
fix(tools): let vision pick any provider+model, not just OpenRouter (#53606)
* fix(tools): let vision pick any provider+model, not just OpenRouter

hermes tools → configure → vision no longer forces an OPENROUTER_API_KEY.
It now offers the same any-provider surface as the model command: Auto
(use main model / aggregator fallback), pick any authenticated provider +
model, or a custom OpenAI-compatible endpoint. Selections persist to
auxiliary.vision.{provider,model,base_url} — the keys the vision resolver
already reads. Custom endpoint pins provider=custom so base_url routes
correctly. Reconfigure path uses the same picker instead of re-prompting
for OPENROUTER_API_KEY.

* docs: add PR infographic for vision any-provider picker
2026-06-27 04:41:42 -07:00
ms-alan
16192103f4 fix(config): accept placeholder base_url in custom provider validation
_normalize_custom_provider_entry() ran urlparse() on base_url and dropped
any entry whose value was an un-expanded placeholder, so a caller reaching
the normalizer with raw config (e.g. the Dockerized gateway path) silently
skipped the provider with a 'not a valid URL' warning. Skip URL validation
when the candidate contains a placeholder token — both ${ENV_VAR} env-refs
and bare {region}-style templates — since those are expanded at runtime.

Closes #14457
2026-06-27 04:15:27 -07:00
Teknium
5ab4136631
fix(webui): switch provider when Config-page model field changes (#53583)
The dashboard Config tab's Model field is a flat string with no provider
info. _denormalize_config_from_web only updated model.default and kept the
stale provider, so picking an OpenRouter model while the default provider was
ollama-local left provider=ollama-local and every call 404'd.

When the model string actually changes, infer the serving provider — curated
catalog first, then a vendor/model-slug heuristic for non-aggregator providers
— and route the switch through the existing _normalize_main_model_assignment /
_apply_main_model_assignment chokepoints so stale base_url/api_mode/api_key are
cleared on a provider change and preserved on a same-provider re-pick. Saving
an unchanged model never re-detects, so unrelated config saves keep an explicit
provider.

Closes #14058
2026-06-27 04:13:44 -07:00
blaryx
76af2456a2 fix(dashboard): merge PUT /api/config with existing on-disk config
The dashboard form is built from CONFIG_SCHEMA, which doesn't enumerate
every root-level key the YAML supports. Most visibly, `custom_providers`
is in `_KNOWN_ROOT_KEYS` but is absent from the schema — so the frontend
never sends it in the PUT body. The previous full-replace save() then
silently wiped the key from disk every time the user clicked anything
that triggered a save. Other casualties (less visible because defaults
re-mask them on load) include `agent.personalities`,
`agent.reasoning_effort`, `terminal.lifetime_seconds`, etc.

Fix: read the raw on-disk config and deep-merge the incoming PUT body
on top of it before saving. The frontend can only overwrite what it
explicitly sends; everything else is preserved verbatim.

Reuses the existing `_deep_merge` helper from `hermes_cli.config`.

Tests:
- `test_round_trip_preserves_custom_providers` exercises the exact bug:
  seed config with custom_providers, GET → drop the key → PUT,
  assert it's still on disk.
- `test_round_trip_preserves_schema_invisible_nested_keys` covers the
  shallow-vs-deep-merge case for nested dicts under `agent` etc.
Both fail on current main; both pass with this patch.
2026-06-27 03:48:18 -07:00
dodo-reach
ed54469d06 fix(gateway): show MoA presets in model picker 2026-06-27 03:43:38 -07:00
briandevans
17cb829991 test(moa): cover non-list/bare-dict reference_models normalization 2026-06-27 03:43:16 -07:00
Teknium
60f58a2b95
feat(verify-on-stop): default OFF, one-time migration, skip doc-only edits (#53552)
The verify-on-stop guard fired too eagerly — including on doc/markdown/skill
edits with nothing to verify, where it pushed a pointless /tmp verification
script. Three changes:

1. Default OFF for new installs: agent.verify_on_stop defaults to false
   (was the "auto" surface-aware sentinel). _config_version bumped 30 -> 31.
2. One-time migration (v30 -> v31): existing installs are switched off once,
   but only when the value is missing or still the "auto" sentinel — an
   explicit true/false the user set is preserved.
3. Path filter: build_verify_on_stop_nudge() now drops documentation/prose
   paths (.md/.mdx/.rst/.txt/LICENSE/CHANGELOG/...) so even when explicitly
   enabled, a doc-only turn never nudges. Mixed doc+code turns still nudge on
   the code paths.

The legacy "auto" sentinel is still honored when set explicitly (ON for
interactive coding surfaces, OFF for messaging). HERMES_VERIFY_ON_STOP env
override unchanged.
2026-06-27 03:23:22 -07:00
Versun
c655cdf2c1 feat(dashboard): expose cron job execution fields 2026-06-27 03:20:32 -07:00
Teknium
d712a7fd73
fix(model-picker): surface the current custom/uncurated model in picker rows (#53457)
A model selected via the CLI (e.g. /model openrouter/<uncurated-name>) was
absent from every model picker — the main picker AND the MoA reference/
aggregator slot pickers — because each provider row only carried its curated
catalog. Inject the current model at the front of its provider's row so it is
selectable and shown everywhere.
2026-06-27 00:06:34 -07:00
ethernet
bcc3eb3419 fix(ci): rip out some xdist legacy stuff... how did these ever work?? 2026-06-26 19:15:18 -07:00
Nacho Avecilla
dbe734beff
fix(dashboard-auth): exclude non-interactive providers from interactive login surfaces (#53239)
* Return None instead of erroring on drain login failure

* Fix login on drain

* Remove login for drained endpoints flow and clean the code

* chore: drop unrelated credits changes from this PR

* Remove extra comments that were not really necessary
2026-06-27 10:08:13 +10:00
kshitijk4poor
7475d125d2 test(mcp): stub mcp_oauth in backgrounding test to deflake CI
The backgrounding-contract test (test_prepare_agent_startup_backgrounds_
blocking_mcp_for_chat) failed intermittently on loaded CI shards: it stubs
tools.mcp_tool.discover_mcp_tools but NOT tools.mcp_oauth, so the background
discovery thread paid the real, cold ~0.75s 'import tools.mcp_oauth' (added by
this PR's _discover_mcp_tools_without_interactive_oauth) before calling the
stubbed discovery. On a slow/loaded runner that import plus thread scheduling
exceeded the 1.0s polling deadline, leaving calls['mcp'] == 0.

Fix: stub tools.mcp_oauth with a nullcontext suppress_interactive_oauth (the
same no-op production falls back to when mcp_oauth is unavailable), so the
test exercises the backgrounding contract without paying an unrelated cold
import in its timing window. Bumped the poll deadline 1.0s -> 3.0s as
belt-and-suspenders. Production behaviour is unchanged; the import cost was
always off the main thread.

Verified: 5/5 pass repeatedly via scripts/run_tests.sh (per-file isolation,
matching CI), ruff clean.
2026-06-27 04:59:23 +05:30
zapabob
e55ddc3e33 fix(mcp): suppress interactive OAuth stdin prompts during background discovery (#35927)
When an MCP server requires OAuth, the interactive `hermes` TUI froze on
startup: background MCP discovery hit the OAuth flow, which on an interactive
TTY spawns a daemon thread doing a blocking `sys.stdin.readline()` (the
"paste the redirect URL" fallback in mcp_oauth._wait_for_callback). That
thread competes with the TUI's own stdin reader for the same terminal, so
keystrokes get swallowed and the TUI appears frozen (up to the 300s OAuth
timeout). Reported symptom: "MCP OAuth: authorization required / Open this URL
... the tui is freezing, not respond to typing."

Add a thread-local `suppress_interactive_oauth()` context manager in
tools/mcp_oauth.py; `_is_interactive()` returns False while it's active, so the
stdin paste-thread and prompt are never created. Background discovery
(hermes_cli/mcp_startup.py, tui_gateway/entry.py) now runs discovery inside
that context, so OAuth-requiring servers soft-skip (raise
OAuthNonInteractiveError, already handled) instead of stealing the TUI's stdin.
A real `hermes mcp login` on the main thread is unaffected (thread-local).

Salvaged from #35945 by @zapabob (authorship preserved via cherry-pick;
resolved a conflict against main's new mcp_discovery_timeout / wait_for_mcp_
discovery refactor, keeping both). Verified E2E: with suppression the paste
prompt is NOT printed and no stdin thread spawns (raises OAuthNonInteractive
soft-skip); without it the prompt shows (the freeze). Mutation-verified
(removing the suppress check in _is_interactive fails the regression test).
76 tests pass, ruff clean.

Closes #35927.

SELF-REVIEW FIX: the original #35945 used threading.local(), which does NOT
propagate to the dedicated mcp-event-loop thread where OAuth actually runs
(discover_mcp_tools dispatches the connect via run_coroutine_threadsafe), so
the suppression was a NO-OP in production (the tests passed only by stubbing
out the cross-thread dispatch). Converted to a contextvars.ContextVar, which
asyncio copies onto the scheduled coroutine — empirically verified suppression
now holds on the mcp-event-loop thread through the real _run_on_mcp_loop path.
Added a cross-thread regression test (fails on threading.local, passes on the
ContextVar) so the no-op can't regress.
2026-06-27 04:59:23 +05:30
kshitijk4poor
244a6f2ceb fix(desktop): broken "Open setup guide" button for plugin platforms
On the desktop Channels / Messaging page, the "Open setup guide" button was
rendered as a bare <a href={platform.docs_url} target="_blank"> with no guard.
Plugin-provided platforms (Microsoft Teams, Google Chat, Line, Raft, Yuanbao,
…) ship an empty docs_url, so the anchor's href was "".

In a packaged build, Electron resolves an empty href against the current
document — the app's own index.html inside the asar bundle — and
shell.openPath then fails with an OS "file not found" dialog. This is exactly
the Windows error reported for Messaging → Teams → Open guide.

Fix (3 changes):

1. fix(desktop) — Only render the "Open setup guide" button when docs_url is
   non-empty, and route clicks through openExternalLink so a relative/empty
   value can never be treated as a local bundle path. Fixes the whole class
   (every plugin platform), not just Teams.

2. fix(messaging) — Give the Teams platform plugin a real docs_url (Microsoft
   Teams setup guide) so its card shows a working button instead of nothing.

3. fix(messaging) — Give the Google Chat platform plugin a real docs_url
   (Google Chat setup guide) so its card shows a working button instead of
   nothing. Originally from #48940; folded in here because that PR's test
   was broken (it queried the HTTP endpoint, but google_chat is a dynamic
   enum member that only appears after the adapter module is imported).

Test plan:
- apps/desktop — new src/app/messaging/index.test.tsx: button is hidden when
  docs_url is empty; a real URL opens via the validated external opener (does
  not navigate).
- apps/desktop typecheck (tsc --noEmit) clean.
- backend — test_teams_messaging_metadata_links_setup_guide: the Teams catalog
  entry exposes the setup-guide docs_url.
- backend — test_google_chat_messaging_metadata_links_setup_guide: the Google
  Chat catalog entry exposes the setup-guide docs_url.

Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Co-authored-by: p-andhika <andhika.prakasiwi@gmail.com>
2026-06-27 04:34:08 +05:30
Teknium
7e101e553b
fix(moa): block the moa virtual provider as a reference or aggregator slot (#53281)
A MoA preset whose reference or aggregator slot points at the moa virtual
provider creates a recursive MoA tree. The runtime guards in moa_loop.py only
surface this mid-turn (references silently skipped, aggregator raises). Reject
it at the config chokepoint (_clean_slot) so it can never be saved, and hide it
from the desktop/dashboard slot pickers so it isn't offered as a dead choice.
2026-06-26 14:42:42 -07:00
srojk34
f0678b031e fix(moa): tolerate non-numeric values in hand-edited MoA preset config
_normalize_preset uses bare float() and int() to coerce
reference_temperature, aggregator_temperature, and max_tokens from
config.yaml.  When a user hand-edits a non-numeric value (e.g.
max_tokens: "8k" or reference_temperature: "hot"), the coercion raises
ValueError.  Since normalize_moa_config runs on every model-selection
and MoA turn (via resolve_moa_preset), the crash is unrecoverable and
blocks all MoA usage until the config is manually fixed.

Replace the bare casts with _coerce_float / _coerce_int helpers that
fall back to the default on TypeError/ValueError instead of raising.
2026-06-26 14:35:38 -07:00
kyssta-exe
c0568ca95f fix(config): use read_raw_config() in migrations to prevent expanding defaults (#40821) 2026-06-26 22:40:52 +05:30
brooklyn!
5cc4009deb
Merge pull request #52828 from helix4u/fix/desktop-backend-update-indicator
fix(desktop): show remote backend updates without counts
2026-06-26 11:49:07 -05:00
kshitij
7b2c51152a
Merge pull request #52990 from NousResearch/salvage/52889-backup-projects-kanban
fix(backup): include projects.db and kanban boards in pre-update snapshot (#52889)
2026-06-26 20:09:15 +05:30
0xDevNinja
9ef49cd78f fix(backup): include projects.db, kanban boards, and sibling stores in pre-update snapshot (#52889)
projects.db (per-profile project store) and kanban.db were missing from
_QUICK_STATE_FILES, so the pre-update quick snapshot never backed them up.
On a desktop upgrade, when the update flow removes/replaces the file and the
post-update schema-init re-creates an empty one, all user-created projects,
folder mappings, the active-project pointer, kanban board bindings, and tasks
vanish silently — no error.

Add the per-profile user-created stores to the snapshot set:
- projects.db               — project store
- response_store.db         — gateway conversation history / tool payloads (WAL)
- memory_store.db           — holographic memory facts/entities (WAL)
- verification_evidence.db  — agent verification audit trail
- kanban.db                 — default board (back-compat <root>/kanban.db)
- kanban/boards             — non-default boards (<root>/kanban/boards/<slug>/kanban.db
                              + metadata); workspaces/ and attachments/ subtrees
                              are skipped as large + regenerable.

Also: the directory-branch of create_quick_snapshot now routes *.db through the
WAL-safe _safe_copy_db (SQLite backup() API), matching the top-level file path —
previously a non-default board DB with an open WAL could be copied inconsistently.

Salvaged from #52930 by @0xDevNinja (authorship preserved via cherry-pick).
On top of the original (which covered only projects.db + the default kanban.db),
this adds: non-default-board coverage, the three sibling per-profile DBs that
meet the same upgrade-wipe criteria, WAL-safe directory copies, and a
workspaces/attachments skip to avoid snapshot bloat (×20 retained). 8 tests,
all mutation-verified; E2E verified snapshot→wipe→restore preserves all six
store types on the real code path.

Closes #52889. Supersedes #52930.
2026-06-26 19:23:33 +05:30
Dr1985
e3db1ef92d fix(macos): clearly distinguish launchd supervision from detached fallback in gateway status
Some checks failed
CI / detect (push) Waiting to run
CI / tests (push) Blocked by required conditions
CI / lint (push) Blocked by required conditions
CI / typecheck (push) Blocked by required conditions
CI / docs-site (push) Blocked by required conditions
CI / history-check (push) Blocked by required conditions
CI / contributor-check (push) Blocked by required conditions
CI / uv-lockfile (push) Blocked by required conditions
CI / docker-lint (push) Blocked by required conditions
CI / supply-chain (push) Blocked by required conditions
CI / osv-scanner (push) Blocked by required conditions
CI / All required checks pass (push) Blocked by required conditions
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Has been cancelled
Docker Build and Publish / build-arm64 (push) Has been cancelled
Docker Build and Publish / merge (push) Has been cancelled
## Description

On macOS 26.x, `launchctl bootstrap` and `launchctl kickstart` return exit code 5 ("Input/output error"), which Hermes already anticipates and handles by spawning a detached fallback process. However, the gateway status reporting is ambiguous:

- `gateway status` says "Gateway service is loaded" (because `launchctl list` returns exit 0)
- But `launchctl print` shows `state = not running` — launchd isn't actually supervising anything
- The detached fallback PID running is invisible to the status command
- Users can't tell whether auto-start at login and auto-restart on crash are available

### Root Cause

Two problems in `hermes_cli/gateway.py`:

1. **`_probe_launchd_service_running()`** (line 1067): Determined launchd service liveness solely by `launchctl list <label>` exit code. On macOS 26, this returns 0 even when the service is only *registered* but not running (output lacks a `"PID"` field). This caused `GatewayRuntimeSnapshot.service_running = True` incorrectly, which suppressed the process/service mismatch warning.

2. **`launchd_status()`** (line 3569): Used the same binary "loaded/not loaded" check without inspecting whether launchd actually has a PID, whether a detached fallback is running, or whether auto-start/restart are available.

### Changes

**`hermes_cli/gateway.py`:**

1. **New `_parse_launchd_pid_from_list_output()` helper** — Extracts the PID from `launchctl list` output. When launchd is actively supervising, the output includes `"PID" = <number>;`. When only registered but not running, no PID field is present.

2. **Fixed `_probe_launchd_service_running()`** — Now requires a PID in the `launchctl list` output to confirm launchd is actually supervising. This correctly sets `service_running = False` when launchd has the service registered but `state = not running`, which triggers the existing process/service mismatch detection.

3. **Reworked `launchd_status()`** — Reports clearly separated information:
   - LaunchAgent plist currentness (stale or current)
   - Whether launchd is actively supervising (with PID)
   - Whether a detached fallback PID is running
   - Whether auto-start at login and auto-restart on crash are available
   - When launchd supervision is known to be unavailable, explains why

4. **Persistent unsupported marker** (`~/.hermes/.gateway-launchd-unsupported`) — Written when `_launchd_fallback_to_detached()` is called (launchd exit 5/125). Allows `launchd_status()` to explain *why* launchd can't supervise even when no fallback process is currently running. Cleared automatically when a future bootstrap/kickstart succeeds (e.g., after an OS update fixes the issue).

5. **Updated `_print_gateway_process_mismatch()`** — Distinguishes the managed detached fallback from a genuinely manual `nohup hermes gateway run`, providing accurate guidance for each case.

### Status Output Examples

**Before** (macOS 26, fallback active):
```
Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist
✓ Service definition matches the current Hermes install
✓ Gateway service is loaded
{
    "Label" = "ai.hermes.gateway";
    "OnDemand" = true;
    ...
};
```

**After** (macOS 26, fallback active):
```
Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist
✓ Service definition matches the current Hermes install
⚠ Gateway service is registered but launchd is not supervising it
  launchd cannot manage the gateway on this macOS version.
✓ Detached fallback process is running (PID 12345)
  Cron jobs will fire. Stop with: hermes gateway stop
  ⚠ Auto-start at login and auto-restart on crash are NOT available.
```

**After** (normal launchd supervision):
```
Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist
✓ Service definition matches the current Hermes install
✓ Gateway is supervised by launchd (PID 12345)
  Auto-start at login and auto-restart on crash are available.
```

### Tests

Updated 5 existing tests and added 11 new tests in `tests/hermes_cli/test_gateway_service.py`:
- PID parsing from `launchctl list` output (with PID, without PID, empty, unquoted PID)
- `_probe_launchd_service_running()` requires PID presence
- Unsupport marker lifecycle (write, clear, persist across fallback)
- Marker cleared on successful bootstrap
- `launchd_status()` reporting: supervised, fallback-running, fallback-unavailable
- Existing fallback tests now verify marker creation

### Related Issues

- Issue #23387 (original macOS 26 launchd workaround)
- Issue #42524 (this issue)
2026-06-26 16:30:30 +05:30
kshitij
1aa458a1e6
Merge pull request #52920 from NousResearch/salvage/38798-toolset-validation
fix(config): surface invalid platform_toolsets instead of silently dropping tools (#38798)
2026-06-26 14:14:55 +05:30
lEWFkRAD
41ede84b93 fix(config): surface invalid platform_toolsets instead of silently dropping tools (#38798)
A config migration (or hand-edit) that leaves an invalid toolset name in
`platform_toolsets` — e.g. the #38798 corruption that rewrote `hermes-cli` to
the non-existent `hermes` — silently disabled all affected tools:
resolve_toolset() returns [] for an unknown name, so the agent quietly lost its
tools with no error, warning, or log entry and degraded to text-only replies.

Surface it loudly at two points:
- After migration (migrate_config): validate platform_toolsets and record/print
  a warning per unknown name, with a `hermes-<platform>` suggestion when that
  would have been valid (the exact #38798 shape).
- At runtime (_get_platform_tools): if a platform was explicitly configured but
  every toolset name is invalid, log a warning when tools are resolved for a
  session — so an ALREADY-corrupted config is caught at startup, not only on the
  next `hermes update`.

Logic lives in a new pure, side-effect-free helper (toolset_validation.py) with
validate_toolset injected, so it is unit-testable without the tool registry.

Note: the original v25→v26 migration that caused the corruption no longer
exists (config format is now v30; no migration step rewrites toolset names).
This change is the durable defense against the silent-failure mode regardless
of cause, matching the issue's "Expected: log a warning".

Salvaged from #39207 by @lEWFkRAD (authorship preserved via cherry-pick).
Tests: 9 helper cases (incl. the #38798 corruption shape, mixed valid/invalid,
zero-tools state, non-dict/scalar/non-string) + a runtime caplog test — both the
helper warning and the runtime guard mutation-verified to fail without the fix.

Closes #38798. Supersedes #39581 (prevent-in-v25→v26 — that path is gone),
#41006 / #40208 (repair-migration for already-corrupted configs).
2026-06-26 14:07:43 +05:30
Shannon Sands
a0dc92450b Split dashboard PTY reconnect tests 2026-06-26 01:06:02 -07:00
Shannon Sands
41f8126148 Reconnect dashboard PTY chat after socket drops 2026-06-26 01:06:02 -07:00
Ben
19b2624404 feat(gateway): external drain trigger + accept-gating (begin/cancel + control channel)
Tasks 2.1 + 2.2 + 2.3 of the safe-shutdown plan — the reversible
quiesce-without-restart machinery NAS drives during a lifecycle action (D4a).
These ship together because the endpoint, the control channel, and the gateway
state machine are one coherent slice.

2.2 — control channel (gateway/drain_control.py, new):
The dashboard has no HTTP path into a running gateway (guardrails: "there is NO
external control channel into a running gateway"); restart/drain is driven only
by markers the gateway reacts to. So begin/cancel-drain writes/removes a
presence-based marker .drain_request.json (HERMES_HOME-scoped, atomic write,
never-raises read; a corrupt marker reads as present-contentless → fail-safe
toward quiescing). This is Q-B option A.

2.2 — gateway state machine (gateway/run.py):
- _external_drain_active flag, DISTINCT from the shutdown _draining flag: this
  one does NOT exit the process and is fully reversible.
- _enter_external_drain / _exit_external_drain: idempotent transitions that
  flip gateway_state→draining / →running via _update_runtime_status (preserving
  the live active_agents count). exit refuses to revert to running during a
  real shutdown or after the loop stops (shutdown wins).
- _drain_control_watcher: 1s background task (modelled on _handoff_watcher)
  reconciling accept-state with the marker; honours a marker that survived a
  restart on its first tick. Registered alongside the other watchers in start.
- New-turn accept gate in _handle_message, placed BEFORE the session-slot
  claim: when draining, refuse to START a new turn (so active_agents can only
  fall → no TOCTOU race), while in-flight turns finish untouched. Internal/
  system events (restart-recovery replays, bg-process completions) bypass it.

2.1 — endpoint (hermes_cli/web_server.py):
POST /api/gateway/drain {action: drain|cancel}. Authenticated by the Task-2.0a
token seam (the drain plugin registered this exact path as a token route);
attributes the request to the verified token principal. Begin writes the
marker, cancel removes it — the gateway process owns the actual transition.
Force-override (D6) is NOT here; it maps onto the existing immediate
/api/gateway/restart force path.

Tests (mocked — necessary-not-sufficient; the HARD live gate Q-B is next):
- tests/gateway/test_external_drain_control.py — marker contract (write/clear/
  read/corrupt/atomic), state machine (enter/exit/idempotency/shutdown-wins/
  loop-stopped), watcher reconcile-enter-then-exit, new-turn refusal, and
  in-flight-not-interrupted. 15 tests.
- tests/hermes_cli/test_web_server.py — /api/gateway/drain begin/default-begin/
  cancel/cancel-idempotent/bad-action-400. 6 tests.
- dashboard.drain_auth config section already added in 2.0b commit.

All touched suites green: 301 (gateway+auth) + 9 (web_server endpoints) passed.

Intentionally deferred:
- HARD live-validation gate (Q-B): real isolated `hermes gateway run`, drive a
  real begin-drain marker, prove the 5-point checklist a–e.
- Spec-doc status flip + Phase-2 PR.

Build status: external-drain, restart-drain, status, dashboard-auth, drain-plugin,
token-auth, and web_server-endpoint suites green.
2026-06-26 00:47:19 -07:00
Ben
cb9cb6ba1c feat(dashboard-auth): generic non-interactive API-token capability
Task 2.0a of the safe-shutdown drain-coordination plan. Widens the dashboard
auth framework GENERICALLY to support non-interactive (service-to-service)
bearer-token auth, mirroring the existing supports_password precedent. This is
a reusable capability — any future machine-credential provider plugs in without
core changes (decisions.md Q-C). The drain bearer-secret plugin (Task 2.0b) is
the first consumer, not the definition.

- base.py: add TokenPrincipal dataclass (the token analog of Session) +
  supports_token capability flag + verify_token() on the ABC (default raises
  NotImplementedError so a misconfigured provider fails loud). Contract mirrors
  verify_session stacking: return None for unrecognised tokens (never raise),
  raise ProviderError only on a genuine backing-store outage.
- registry.py: list_token_providers() — the supports_token subset, in
  registration order. Empty when none registered (token routes fail closed).
- token_auth.py (new): route-agnostic seam. Routes opt in via
  register_token_route(exact path); token_auth_middleware owns the auth
  decision for those routes only — authenticate via stacked providers, attach
  request.state.token_principal + token_authenticated, pass through. 401 on
  missing/unrecognised token, 503 when a provider was unreachable, untouched
  passthrough for non-token routes. Fails closed (never open).
- web_server.py: install the seam OUTERMOST (registered last → runs first).
  Both downstream gates (legacy auth_middleware + gated_auth_middleware) honour
  request.state.token_authenticated and skip enforcement, so a token-authed
  service request is never bounced to /login.
- audit.py: TOKEN_AUTH_SUCCESS / TOKEN_AUTH_FAILURE events.

Tests: tests/hermes_cli/test_dashboard_token_auth.py — ABC flag default,
verify_token NotImplementedError, registry filter, bearer extraction
(case-insensitive scheme, malformed/non-bearer → ""), provider stacking
(first-match-wins, unreachable-remembered, unreachable-then-valid, buggy
provider doesn't crash the gate), and the seam's passthrough/401/503/
fail-closed behaviour. 29 new tests; full dashboard-auth suite 169 passed.

Intentionally deferred:
- The concrete shared-bearer-secret provider plugin — Task 2.0b.
- The begin/cancel-drain endpoint that registers itself as a token route —
  Task 2.1.

Build status: dashboard-auth + plugin-hook suites green.
2026-06-26 00:47:19 -07:00
Max Hsu
075f93ad78 fix(mcp): auto-recover from invalid_client on stale OAuth client registration
Fixes #36767.

Two complementary recoveries for the recurring "delete three cache files and
re-auth by hand" ritual when an MCP server's dynamically-registered OAuth
client goes dead server-side (IdP redeploy / DB wipe / rebrand):

- Auto-heal (token-endpoint subset): HermesMCPOAuthProvider now sniffs
  auth-flow responses and, on a 400/401 `invalid_client` from the discovered
  token endpoint, backs up + deletes `<server>.client.json` and `.meta.json`
  and clears the in-memory client so the SDK re-runs RFC 7591 dynamic client
  registration on the next flow. Conservative by construction: only
  dynamically-registered (non config-supplied) clients, only the token
  endpoint, only on a word-boundary `invalid_client` match (so RFC 7591's
  `invalid_client_metadata` does not trip it); best-effort so a miss never
  breaks the live flow. Covers both code-exchange and refresh when the token
  endpoint was discovered. Tokens are preserved.

- `hermes mcp reauth [<name>|--all]`: the reporter's primary symptom — the
  IdP's in-browser "Redirect URI Mismatch" — produces no HTTP signal (the SDK
  only sees a callback timeout), so it cannot be auto-detected. The new
  command re-auths one or ALL `auth: oauth` servers, serially: one browser
  flow at a time, which also fixes the startup popup storm when several
  servers are stale at once. Single-server reauth is factored out of
  `mcp login` and shared.

Tests: +14 (poison helper x2; token-endpoint detection x5 incl. wrong-endpoint,
success-response, pre-registered, and invalid_client_metadata negative guards;
a bridge integration test driving the real async_auth_flow generator to prove
the detection hook preserves the bidirectional asend() forwarding contract;
reauth CLI x6). Verified against the pinned mcp==1.26.0: scripts/run_tests.sh
122/122 green for the touched suites; check-windows-footguns.py and ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 00:35:27 -07:00
Teknium
5b5c79a8ef
feat(kanban): typed block reasons + unblock-loop breaker (#52848)
* feat(kanban): typed block reasons + unblock-loop breaker

Stops the kanban blocked-task loop: a worker blocks a task, a cron
unblocks it, the worker re-blocks for the same reason, repeat forever.

block_task now takes a typed kind and a persistent block_recurrences
counter on the tasks table:

- kind=dependency routes to todo (parent-gated, auto-resumed), never
  the human 'blocked' bucket a cron would keep unblocking.
- needs_input/capability/transient/untyped land in blocked; each
  same-cause re-block after an unblock increments block_recurrences,
  and at BLOCK_RECURRENCE_LIMIT (default 2) the task routes to triage
  for a human instead of blocked.
- unblock_task no longer resets block_recurrences (the amnesia that
  let the loop run unbounded); complete_task clears it on success.

Wired through the worker kanban_block tool (new kind arg) and the
hermes kanban block --kind CLI flag, both reporting where the task
actually landed. Docs + 11 new tests; 536 existing kanban tests green.

* test(kanban): make second-block notify test use a distinct block cause

test_notifier_second_blocked_delivers blocked the same task twice with
the same (untyped) reason, which now trips the new unblock-loop breaker
and routes the second block to triage instead of blocked — so only one
'blocked' notification fired. The test's actual intent is that TWO
distinct block cycles each notify; give the two cycles different kinds
(needs_input then capability) so they're genuinely separate blocks. The
same-cause loop→triage path is covered by test_kanban_block_kinds.py.
2026-06-25 21:46:58 -07:00
helix4u
1c8594b634 fix(desktop): show remote backend updates without counts 2026-06-25 21:39:29 -06:00
liuhao1024
56cf517ccd fix(cron): detect partial job loss in restore_cron_jobs_if_emptied (#52144)
The desktop scheduler can overwrite cron/jobs.json with its own small
set of internally-tracked crons after an update/restart, causing
partial loss of tool-created cron jobs. The previous guard only
checked for total loss (live_count == 0), missing the case where
live_count > 0 but less than the pre-update snapshot count.

Compare live_count against snap_count instead of checking for zero,
so both total loss (0 vs N) and partial loss (1 vs 19) trigger
restoration.

Salvaged from #52161 by @liuhao1024.

Closes #52144
2026-06-25 18:49:18 -07:00
Teknium
208f0d7c3b
fix(update): default pre-update backup to off (#52729)
The pre-update HERMES_HOME zip shipped on by default (DEFAULT_CONFIG +
runtime fallback both True), so every `hermes update` zipped the entire
~/.hermes — sessions DB, caches, skills — adding minutes to each update.
The shipped cli-config.yaml.example, the --backup help, and the example
config all already said "off by default," so the live default
contradicted its own documentation.

Flip the default to off everywhere: DEFAULT_CONFIG, the runtime
`.get(..., False)` fallback in _run_pre_update_backup, and the stale
--backup help string. Users who want the #48200 safety net opt in via
updates.pre_update_backup: true or --backup for a single run.

Updated test_default_enabled_creates_backup -> test_default_disabled_is_silent
to assert the new default (silent no-op, no zip).
2026-06-25 16:01:09 -07:00
brooklyn!
ffa3d3c811
Merge pull request #49037 from NousResearch/bb/projects-paradigm
feat(desktop): first-class projects — sidebar, coding rail, review pane, and agent project tools
2026-06-25 17:49:05 -05:00
Gille
bf0513bca0 test(windows): align gateway restart CI coverage 2026-06-25 14:42:38 -07:00
Brooklyn Nicholson
e7811345c1 feat(kanban): link tasks to project worktrees 2026-06-25 16:40:26 -05:00
Brooklyn Nicholson
8a45ce2dd4 feat(projects): add per-profile project store 2026-06-25 16:40:26 -05:00
Teknium
c6575df927
feat(moa): expose MoA presets as selectable virtual models (#46081)
* feat(moa): expose MoA presets as selectable virtual models

Reconstructed onto current main (PR #46081's base had diverged with no common
ancestor, marking the PR dirty so CI never dispatched). MoA is now a virtual
provider: each named preset is a selectable model under provider 'moa', and the
preset's aggregator is the acting model that answers and calls tools.

Reference models fan out in parallel via a bounded ThreadPoolExecutor (the same
batch pattern delegate_task uses) — all references dispatched at once, collected
when every one finishes, then handed to the aggregator. Output order is
preserved, failures and the MoA-recursion guard stay isolated per reference.

- Removed the old mixture_of_agents model tool and moa toolset.
- Added moa as a virtual provider in the provider/model inventory.
- /moa is shortcut behavior over model selection (default preset / named preset
  / one-shot prompt).
- Dashboard + Desktop manage named presets; presets appear in model pickers.
- Parallel reference fan-out in agent/moa_loop.py with regression test.

* fix(moa): thread moa_config through _run_agent to _run_agent_inner

The reconstructed gateway MoA wiring declared moa_config on _run_agent (the
profile-scoping wrapper) and used it inside _run_agent_inner, but the wrapper
never forwarded it — _run_agent_inner had no such parameter, so the runtime hit
NameError: name 'moa_config' is not defined on the compression-failure session
sync path. Add moa_config to _run_agent_inner's signature and forward it from
both wrapper call sites (multiplex and non-multiplex). Caught by
tests/gateway/test_compression_failure_session_sync.py on CI shard test(4).

* fix(moa): classify moa as a virtual provider in the catalog

The moa virtual provider has no PROVIDER_REGISTRY/ProviderProfile entry, so
provider_catalog() fell through to the default auth_type="api_key" with no
env vars — tripping two catalog invariants:
  - test_provider_catalog: api_key providers must expose a credential env var
  - test_provider_parity: every hermes-model provider must be desktop-configurable

moa already declares auth_type="virtual" in HERMES_OVERLAYS; consult that
overlay as an auth_type fallback so the catalog reports moa as virtual (no real
credential, no network endpoint). Exempt virtual providers from the desktop
parity union check the same way 'custom' is exempt — derived from the catalog,
not a hardcoded slug, so future virtual providers are covered too.
2026-06-25 13:52:06 -07:00
sweetcornna
150afea942 fix(config): quote env values containing hash 2026-06-26 00:54:34 +05:30
Brooklyn Nicholson
1d9ed7f48a fix(desktop): ad-hoc sign macOS self-update rebuilds
The desktop self-updater rebuilds and re-signs the .app on each user's own
machine (`hermes desktop --build-only` -> electron-builder `--dir`). With
CSC_IDENTITY_AUTO_DISCOVERY on (its default), electron-builder signs the
type=distribution, hardened-runtime bundle with whatever identity is in that
user's keychain -- typically a personal "Apple Development" cert -- which
stalls/fails the sign step (no Developer ID, no provisioning profile) or
clobbers the original notarized signature with an unusable one, tripping
Gatekeeper on every post-update launch.

Force ad-hoc signing for the local packaged rebuild instead: deterministic,
and exactly what _desktop_macos_relaunchable_fixup already finishes off.
No-op for source runs, off-macOS, when a real identity is configured
(CSC_LINK / APPLE_SIGNING_IDENTITY), or when the caller already pinned the flag.
2026-06-25 12:08:29 -05:00
Ben Barclay
736e981abf
fix(auth): honor NOUS_INFERENCE_BASE_URL env override for Nous OAuth sessions (#52270)
The host-allowlist hardening (#30611) plus the refresh heal (#49735) left
the documented NOUS_INFERENCE_BASE_URL dev/staging escape hatch unreachable
for OAuth sessions, despite three code comments asserting it still works.

Root cause — resolution precedence in resolve_nous_runtime_credentials:

    inference_base_url = (
        _optional_base_url(state.get("inference_base_url"))  # stored — wins
        or os.getenv("NOUS_INFERENCE_BASE_URL")              # env — unreachable
        or DEFAULT_NOUS_INFERENCE_URL
    )

A staging OAuth login persists its inference_base_url, but the allowlist
rejects the staging host and the refresh heal rewrites the stored value to
the production default. The stored (now prod) value is then read BEFORE the
env var, so the override never takes effect — every request 401s against
prod or is pinned to prod, and setting the env var does nothing.

Fix: the user-set env override is the most-trusted source, so consult it
FIRST for the URL used to build the client / returned to callers — while
keeping the PERSISTED value the validated, network-provenance one (the
override is a runtime overlay, never written to auth.json, so unsetting it
cleanly reverts to prod). Applied at both chokepoints:

- resolve_nous_runtime_credentials (no-refresh read path AND refresh path)
- the nous_portal proxy adapter, which re-validates the resolver's returned
  base_url against the prod allowlist as defense-in-depth and would
  otherwise reject a legitimate staging override at the forward boundary.

New _nous_inference_env_override() / split of stored-vs-effective URL keep
the threat model intact: Portal-returned URLs are still allowlist-validated
at every network site, and the env path stays ungated (trusted OS user).

Also folds in the no-refresh read-path heal (supersedes the approach in
the open #50265): a poisoned stored staging host now heals to the prod
default on read even when no refresh fires.

Tests: TestEnvOverrideWins (env wins on read + refresh paths; override never
persisted; poisoned stored heals) and TestProxyAdapterEnvOverride. Verified
the 4 behavioral tests fail against pre-fix code and pass with the fix; full
inference-validation + nous-provider suites green (85 passed). E2E-validated
against a real temp HERMES_HOME exercising the real resolver + proxy adapter:
resolver→staging, persisted→prod, proxy→staging, unset→reverts to prod.
2026-06-25 00:11:15 -07:00