Commit graph

1647 commits

Author SHA1 Message Date
Shannon Sands
41f8126148 Reconnect dashboard PTY chat after socket drops 2026-06-26 01:06:02 -07:00
Ben
19b2624404 feat(gateway): external drain trigger + accept-gating (begin/cancel + control channel)
Tasks 2.1 + 2.2 + 2.3 of the safe-shutdown plan — the reversible
quiesce-without-restart machinery NAS drives during a lifecycle action (D4a).
These ship together because the endpoint, the control channel, and the gateway
state machine are one coherent slice.

2.2 — control channel (gateway/drain_control.py, new):
The dashboard has no HTTP path into a running gateway (guardrails: "there is NO
external control channel into a running gateway"); restart/drain is driven only
by markers the gateway reacts to. So begin/cancel-drain writes/removes a
presence-based marker .drain_request.json (HERMES_HOME-scoped, atomic write,
never-raises read; a corrupt marker reads as present-contentless → fail-safe
toward quiescing). This is Q-B option A.

2.2 — gateway state machine (gateway/run.py):
- _external_drain_active flag, DISTINCT from the shutdown _draining flag: this
  one does NOT exit the process and is fully reversible.
- _enter_external_drain / _exit_external_drain: idempotent transitions that
  flip gateway_state→draining / →running via _update_runtime_status (preserving
  the live active_agents count). exit refuses to revert to running during a
  real shutdown or after the loop stops (shutdown wins).
- _drain_control_watcher: 1s background task (modelled on _handoff_watcher)
  reconciling accept-state with the marker; honours a marker that survived a
  restart on its first tick. Registered alongside the other watchers in start.
- New-turn accept gate in _handle_message, placed BEFORE the session-slot
  claim: when draining, refuse to START a new turn (so active_agents can only
  fall → no TOCTOU race), while in-flight turns finish untouched. Internal/
  system events (restart-recovery replays, bg-process completions) bypass it.

2.1 — endpoint (hermes_cli/web_server.py):
POST /api/gateway/drain {action: drain|cancel}. Authenticated by the Task-2.0a
token seam (the drain plugin registered this exact path as a token route);
attributes the request to the verified token principal. Begin writes the
marker, cancel removes it — the gateway process owns the actual transition.
Force-override (D6) is NOT here; it maps onto the existing immediate
/api/gateway/restart force path.

Tests (mocked — necessary-not-sufficient; the HARD live gate Q-B is next):
- tests/gateway/test_external_drain_control.py — marker contract (write/clear/
  read/corrupt/atomic), state machine (enter/exit/idempotency/shutdown-wins/
  loop-stopped), watcher reconcile-enter-then-exit, new-turn refusal, and
  in-flight-not-interrupted. 15 tests.
- tests/hermes_cli/test_web_server.py — /api/gateway/drain begin/default-begin/
  cancel/cancel-idempotent/bad-action-400. 6 tests.
- dashboard.drain_auth config section already added in 2.0b commit.

All touched suites green: 301 (gateway+auth) + 9 (web_server endpoints) passed.

Intentionally deferred:
- HARD live-validation gate (Q-B): real isolated `hermes gateway run`, drive a
  real begin-drain marker, prove the 5-point checklist a–e.
- Spec-doc status flip + Phase-2 PR.

Build status: external-drain, restart-drain, status, dashboard-auth, drain-plugin,
token-auth, and web_server-endpoint suites green.
2026-06-26 00:47:19 -07:00
Ben
cb9cb6ba1c feat(dashboard-auth): generic non-interactive API-token capability
Task 2.0a of the safe-shutdown drain-coordination plan. Widens the dashboard
auth framework GENERICALLY to support non-interactive (service-to-service)
bearer-token auth, mirroring the existing supports_password precedent. This is
a reusable capability — any future machine-credential provider plugs in without
core changes (decisions.md Q-C). The drain bearer-secret plugin (Task 2.0b) is
the first consumer, not the definition.

- base.py: add TokenPrincipal dataclass (the token analog of Session) +
  supports_token capability flag + verify_token() on the ABC (default raises
  NotImplementedError so a misconfigured provider fails loud). Contract mirrors
  verify_session stacking: return None for unrecognised tokens (never raise),
  raise ProviderError only on a genuine backing-store outage.
- registry.py: list_token_providers() — the supports_token subset, in
  registration order. Empty when none registered (token routes fail closed).
- token_auth.py (new): route-agnostic seam. Routes opt in via
  register_token_route(exact path); token_auth_middleware owns the auth
  decision for those routes only — authenticate via stacked providers, attach
  request.state.token_principal + token_authenticated, pass through. 401 on
  missing/unrecognised token, 503 when a provider was unreachable, untouched
  passthrough for non-token routes. Fails closed (never open).
- web_server.py: install the seam OUTERMOST (registered last → runs first).
  Both downstream gates (legacy auth_middleware + gated_auth_middleware) honour
  request.state.token_authenticated and skip enforcement, so a token-authed
  service request is never bounced to /login.
- audit.py: TOKEN_AUTH_SUCCESS / TOKEN_AUTH_FAILURE events.

Tests: tests/hermes_cli/test_dashboard_token_auth.py — ABC flag default,
verify_token NotImplementedError, registry filter, bearer extraction
(case-insensitive scheme, malformed/non-bearer → ""), provider stacking
(first-match-wins, unreachable-remembered, unreachable-then-valid, buggy
provider doesn't crash the gate), and the seam's passthrough/401/503/
fail-closed behaviour. 29 new tests; full dashboard-auth suite 169 passed.

Intentionally deferred:
- The concrete shared-bearer-secret provider plugin — Task 2.0b.
- The begin/cancel-drain endpoint that registers itself as a token route —
  Task 2.1.

Build status: dashboard-auth + plugin-hook suites green.
2026-06-26 00:47:19 -07:00
Max Hsu
075f93ad78 fix(mcp): auto-recover from invalid_client on stale OAuth client registration
Fixes #36767.

Two complementary recoveries for the recurring "delete three cache files and
re-auth by hand" ritual when an MCP server's dynamically-registered OAuth
client goes dead server-side (IdP redeploy / DB wipe / rebrand):

- Auto-heal (token-endpoint subset): HermesMCPOAuthProvider now sniffs
  auth-flow responses and, on a 400/401 `invalid_client` from the discovered
  token endpoint, backs up + deletes `<server>.client.json` and `.meta.json`
  and clears the in-memory client so the SDK re-runs RFC 7591 dynamic client
  registration on the next flow. Conservative by construction: only
  dynamically-registered (non config-supplied) clients, only the token
  endpoint, only on a word-boundary `invalid_client` match (so RFC 7591's
  `invalid_client_metadata` does not trip it); best-effort so a miss never
  breaks the live flow. Covers both code-exchange and refresh when the token
  endpoint was discovered. Tokens are preserved.

- `hermes mcp reauth [<name>|--all]`: the reporter's primary symptom — the
  IdP's in-browser "Redirect URI Mismatch" — produces no HTTP signal (the SDK
  only sees a callback timeout), so it cannot be auto-detected. The new
  command re-auths one or ALL `auth: oauth` servers, serially: one browser
  flow at a time, which also fixes the startup popup storm when several
  servers are stale at once. Single-server reauth is factored out of
  `mcp login` and shared.

Tests: +14 (poison helper x2; token-endpoint detection x5 incl. wrong-endpoint,
success-response, pre-registered, and invalid_client_metadata negative guards;
a bridge integration test driving the real async_auth_flow generator to prove
the detection hook preserves the bidirectional asend() forwarding contract;
reauth CLI x6). Verified against the pinned mcp==1.26.0: scripts/run_tests.sh
122/122 green for the touched suites; check-windows-footguns.py and ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 00:35:27 -07:00
Teknium
5b5c79a8ef
feat(kanban): typed block reasons + unblock-loop breaker (#52848)
* feat(kanban): typed block reasons + unblock-loop breaker

Stops the kanban blocked-task loop: a worker blocks a task, a cron
unblocks it, the worker re-blocks for the same reason, repeat forever.

block_task now takes a typed kind and a persistent block_recurrences
counter on the tasks table:

- kind=dependency routes to todo (parent-gated, auto-resumed), never
  the human 'blocked' bucket a cron would keep unblocking.
- needs_input/capability/transient/untyped land in blocked; each
  same-cause re-block after an unblock increments block_recurrences,
  and at BLOCK_RECURRENCE_LIMIT (default 2) the task routes to triage
  for a human instead of blocked.
- unblock_task no longer resets block_recurrences (the amnesia that
  let the loop run unbounded); complete_task clears it on success.

Wired through the worker kanban_block tool (new kind arg) and the
hermes kanban block --kind CLI flag, both reporting where the task
actually landed. Docs + 11 new tests; 536 existing kanban tests green.

* test(kanban): make second-block notify test use a distinct block cause

test_notifier_second_blocked_delivers blocked the same task twice with
the same (untyped) reason, which now trips the new unblock-loop breaker
and routes the second block to triage instead of blocked — so only one
'blocked' notification fired. The test's actual intent is that TWO
distinct block cycles each notify; give the two cycles different kinds
(needs_input then capability) so they're genuinely separate blocks. The
same-cause loop→triage path is covered by test_kanban_block_kinds.py.
2026-06-25 21:46:58 -07:00
liuhao1024
56cf517ccd fix(cron): detect partial job loss in restore_cron_jobs_if_emptied (#52144)
The desktop scheduler can overwrite cron/jobs.json with its own small
set of internally-tracked crons after an update/restart, causing
partial loss of tool-created cron jobs. The previous guard only
checked for total loss (live_count == 0), missing the case where
live_count > 0 but less than the pre-update snapshot count.

Compare live_count against snap_count instead of checking for zero,
so both total loss (0 vs N) and partial loss (1 vs 19) trigger
restoration.

Salvaged from #52161 by @liuhao1024.

Closes #52144
2026-06-25 18:49:18 -07:00
Teknium
208f0d7c3b
fix(update): default pre-update backup to off (#52729)
The pre-update HERMES_HOME zip shipped on by default (DEFAULT_CONFIG +
runtime fallback both True), so every `hermes update` zipped the entire
~/.hermes — sessions DB, caches, skills — adding minutes to each update.
The shipped cli-config.yaml.example, the --backup help, and the example
config all already said "off by default," so the live default
contradicted its own documentation.

Flip the default to off everywhere: DEFAULT_CONFIG, the runtime
`.get(..., False)` fallback in _run_pre_update_backup, and the stale
--backup help string. Users who want the #48200 safety net opt in via
updates.pre_update_backup: true or --backup for a single run.

Updated test_default_enabled_creates_backup -> test_default_disabled_is_silent
to assert the new default (silent no-op, no zip).
2026-06-25 16:01:09 -07:00
brooklyn!
ffa3d3c811
Merge pull request #49037 from NousResearch/bb/projects-paradigm
feat(desktop): first-class projects — sidebar, coding rail, review pane, and agent project tools
2026-06-25 17:49:05 -05:00
Gille
bf0513bca0 test(windows): align gateway restart CI coverage 2026-06-25 14:42:38 -07:00
Brooklyn Nicholson
e7811345c1 feat(kanban): link tasks to project worktrees 2026-06-25 16:40:26 -05:00
Brooklyn Nicholson
8a45ce2dd4 feat(projects): add per-profile project store 2026-06-25 16:40:26 -05:00
Teknium
c6575df927
feat(moa): expose MoA presets as selectable virtual models (#46081)
* feat(moa): expose MoA presets as selectable virtual models

Reconstructed onto current main (PR #46081's base had diverged with no common
ancestor, marking the PR dirty so CI never dispatched). MoA is now a virtual
provider: each named preset is a selectable model under provider 'moa', and the
preset's aggregator is the acting model that answers and calls tools.

Reference models fan out in parallel via a bounded ThreadPoolExecutor (the same
batch pattern delegate_task uses) — all references dispatched at once, collected
when every one finishes, then handed to the aggregator. Output order is
preserved, failures and the MoA-recursion guard stay isolated per reference.

- Removed the old mixture_of_agents model tool and moa toolset.
- Added moa as a virtual provider in the provider/model inventory.
- /moa is shortcut behavior over model selection (default preset / named preset
  / one-shot prompt).
- Dashboard + Desktop manage named presets; presets appear in model pickers.
- Parallel reference fan-out in agent/moa_loop.py with regression test.

* fix(moa): thread moa_config through _run_agent to _run_agent_inner

The reconstructed gateway MoA wiring declared moa_config on _run_agent (the
profile-scoping wrapper) and used it inside _run_agent_inner, but the wrapper
never forwarded it — _run_agent_inner had no such parameter, so the runtime hit
NameError: name 'moa_config' is not defined on the compression-failure session
sync path. Add moa_config to _run_agent_inner's signature and forward it from
both wrapper call sites (multiplex and non-multiplex). Caught by
tests/gateway/test_compression_failure_session_sync.py on CI shard test(4).

* fix(moa): classify moa as a virtual provider in the catalog

The moa virtual provider has no PROVIDER_REGISTRY/ProviderProfile entry, so
provider_catalog() fell through to the default auth_type="api_key" with no
env vars — tripping two catalog invariants:
  - test_provider_catalog: api_key providers must expose a credential env var
  - test_provider_parity: every hermes-model provider must be desktop-configurable

moa already declares auth_type="virtual" in HERMES_OVERLAYS; consult that
overlay as an auth_type fallback so the catalog reports moa as virtual (no real
credential, no network endpoint). Exempt virtual providers from the desktop
parity union check the same way 'custom' is exempt — derived from the catalog,
not a hardcoded slug, so future virtual providers are covered too.
2026-06-25 13:52:06 -07:00
sweetcornna
150afea942 fix(config): quote env values containing hash 2026-06-26 00:54:34 +05:30
Brooklyn Nicholson
1d9ed7f48a fix(desktop): ad-hoc sign macOS self-update rebuilds
The desktop self-updater rebuilds and re-signs the .app on each user's own
machine (`hermes desktop --build-only` -> electron-builder `--dir`). With
CSC_IDENTITY_AUTO_DISCOVERY on (its default), electron-builder signs the
type=distribution, hardened-runtime bundle with whatever identity is in that
user's keychain -- typically a personal "Apple Development" cert -- which
stalls/fails the sign step (no Developer ID, no provisioning profile) or
clobbers the original notarized signature with an unusable one, tripping
Gatekeeper on every post-update launch.

Force ad-hoc signing for the local packaged rebuild instead: deterministic,
and exactly what _desktop_macos_relaunchable_fixup already finishes off.
No-op for source runs, off-macOS, when a real identity is configured
(CSC_LINK / APPLE_SIGNING_IDENTITY), or when the caller already pinned the flag.
2026-06-25 12:08:29 -05:00
Ben Barclay
736e981abf
fix(auth): honor NOUS_INFERENCE_BASE_URL env override for Nous OAuth sessions (#52270)
The host-allowlist hardening (#30611) plus the refresh heal (#49735) left
the documented NOUS_INFERENCE_BASE_URL dev/staging escape hatch unreachable
for OAuth sessions, despite three code comments asserting it still works.

Root cause — resolution precedence in resolve_nous_runtime_credentials:

    inference_base_url = (
        _optional_base_url(state.get("inference_base_url"))  # stored — wins
        or os.getenv("NOUS_INFERENCE_BASE_URL")              # env — unreachable
        or DEFAULT_NOUS_INFERENCE_URL
    )

A staging OAuth login persists its inference_base_url, but the allowlist
rejects the staging host and the refresh heal rewrites the stored value to
the production default. The stored (now prod) value is then read BEFORE the
env var, so the override never takes effect — every request 401s against
prod or is pinned to prod, and setting the env var does nothing.

Fix: the user-set env override is the most-trusted source, so consult it
FIRST for the URL used to build the client / returned to callers — while
keeping the PERSISTED value the validated, network-provenance one (the
override is a runtime overlay, never written to auth.json, so unsetting it
cleanly reverts to prod). Applied at both chokepoints:

- resolve_nous_runtime_credentials (no-refresh read path AND refresh path)
- the nous_portal proxy adapter, which re-validates the resolver's returned
  base_url against the prod allowlist as defense-in-depth and would
  otherwise reject a legitimate staging override at the forward boundary.

New _nous_inference_env_override() / split of stored-vs-effective URL keep
the threat model intact: Portal-returned URLs are still allowlist-validated
at every network site, and the env path stays ungated (trusted OS user).

Also folds in the no-refresh read-path heal (supersedes the approach in
the open #50265): a poisoned stored staging host now heals to the prod
default on read even when no refresh fires.

Tests: TestEnvOverrideWins (env wins on read + refresh paths; override never
persisted; poisoned stored heals) and TestProxyAdapterEnvOverride. Verified
the 4 behavioral tests fail against pre-fix code and pass with the fix; full
inference-validation + nous-provider suites green (85 passed). E2E-validated
against a real temp HERMES_HOME exercising the real resolver + proxy adapter:
resolver→staging, persisted→prod, proxy→staging, unset→reverts to prod.
2026-06-25 00:11:15 -07:00
kshitijk4poor
d6cf383d74 refactor(setup): simplify Z.AI picker — drop dead fallback, fix tests
- Remove dead `chosen_base or effective_base` fallback; _select_zai_endpoint
  always returns a non-empty base URL (returns current_base on cancel).
- Add .rstrip("/") to official-endpoint return for symmetry with custom-proxy
  path (both now return normalized URLs).
- Replace magic index 4 with len(ZAI_ENDPOINTS) in custom-proxy tests so they
  don't break if a 5th endpoint is added to ZAI_ENDPOINTS.
2026-06-25 12:07:01 +05:30
kshitijk4poor
d0df264213 test(setup): add ZAI endpoint picker tests, move base-URL tests to MiniMax
Z.AI now uses a curses picker instead of plain text input for base URL,
so the existing TestBaseUrlValidation tests (which used zai as their test
subject) are migrated to MiniMax, which still uses the text input path.

Add TestZaiEndpointPicker covering:
- Selecting each official endpoint (Global, China, Coding Plan Global,
  Coding Plan China) saves the correct base URL to config
- Custom proxy URL entry (valid + invalid rejection)
- Cancel keeps the existing base URL
- Current endpoint is the default choice in the picker
- Non-standard URL defaults to the Custom proxy option
2026-06-25 12:07:01 +05:30
Teknium
411faf08bd
fix(soul): installers seed the real default persona, upgrade legacy empty templates (#52246)
The desktop bootstrap (and curl/PowerShell/docker installs) seeded
~/.hermes/SOUL.md with a comment-only scaffold that contained no persona
text. That shadowed the runtime default (_ensure_default_soul_md ->
DEFAULT_SOUL_MD), since seeding is guarded by 'if SOUL.md doesn't exist'.
Result: every fresh installer install got the empty template instead of
the documented Hermes persona; desktop just made it visible in onboarding.

- install.sh / install.ps1 / docker/SOUL.md now write DEFAULT_SOUL_MD.
- _ensure_default_soul_md() upgrades a SOUL.md still matching the known
  legacy scaffold in place; customized files (any deviation, incl. a
  persona appended below the comment) are never touched.
- Detection normalizes CRLF/BOM so Windows-installer drift still matches.
2026-06-24 18:56:26 -07:00
brooklyn!
b649cdee4a
Merge pull request #52203 from NousResearch/bb/update-drain-announce
fix(update): announce gateway drain waits so desktop updates don't look hung
2026-06-24 19:28:44 -05:00
Ben
538c419d2e fix(gateway): scope dashboard liveness fallback to the profile
PR #52151 hardened the runtime-status liveness check to trust a readable
live process command line over stale gateway_state.json argv, so a recycled
PID now owned by an s6 supervisor no longer counts as a running gateway.

That fix is correct but incomplete for the reported symptom: the web
dashboard showed a named profile's gateway green while
`hermes -p <name> gateway status` showed it stopped. Two further issues:

1. Cross-profile PID reuse. In per-profile Docker supervision, one profile's
   stale `gateway_state.json` can record a PID the OS later recycled onto a
   DIFFERENT profile's live gateway. That PID's command line still
   `looks_like_gateway`, so the dead profile was reported running. The
   recorded argv has its `-p <name>` selector stripped in-process by
   `_apply_profile_override`, so it cannot disambiguate; the live `/proc`
   cmdline still carries it. `get_runtime_status_running_pid` now accepts an
   `expected_home` and validates the live command line belongs to THAT
   profile (mirroring `hermes_cli.gateway._matches_current_profile`, the
   logic the CLI scan path already uses — which is why the CLI was correct).
   `_check_gateway_running` passes the enumerated profile dir.

2. The existing regression test `test_gateway_running_check_falls_back_to_
   runtime_state` used the live pytest PID with a gateway-shaped record; once
   the live cmdline became authoritative it no longer looked like a gateway.
   Updated to mock the live cmdline to the real separate-process scenario it
   describes.

The active-profile path (`get_running_pid`) is intentionally left unscoped:
it is lock-verified and any live gateway cmdline is acceptable there. Multiplex
mode is unaffected — `running` state is only ever written to a gateway's own
home, never a secondary served profile's.

Adds coverage for: cross-profile PID reuse (named + default), matching
profile cmdline (`-p`, `--profile`, explicit HERMES_HOME=), the bare default
gateway, and the unreadable-cmdline cross-platform fallback. Each new
cross-profile assertion fails without the profile scope and passes with it.

Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>
2026-06-25 10:25:54 +10:00
AIalliAI
463bf2be25 fix(update): announce gateway drain waits so desktop updates don't look hung
On macOS, the desktop updater's stage 1 (hermes update --gateway) ends by
restarting running gateways. launchd_restart() SIGTERMs the gateway and
silently waits up to agent.restart_drain_timeout (default 180s) for the
drain; the manual profile-gateway loop waits its drain budget per gateway
the same way. Neither path prints anything before the wait, so the desktop
updater's live output goes dead for minutes right after '✓ Update
complete!' — users read it as a hung update and force-kill their gateway
processes to make it move (#44515). The systemd branch already announces
its drain ('draining (up to Ns)...'); launchd and the manual loop did not.

Print the stop/drain (with PID and budget) before the wait in both paths,
mirroring the systemd branch, and assert the message in the existing
launchd drain test.

Fixes #44515
2026-06-24 19:12:44 -05:00
kshitij
c42d44cb2f
revert(plugins): restore user dashboard plugin backend API auto-import (#43719) (#51950)
* Revert "refactor(security): centralize non-bundled plugin sources in one constant"

This reverts commit e2bea0abe6.

* Revert "fix(security): restrict dashboard plugin backend import to bundled plugins (#43719)"

This reverts commit 8845f3316c.
2026-06-24 07:46:54 -07:00
Elshayib
1a435a6d5d fix(model-switch): prevent custom-provider misattribution in model picker (#48305)
When the current provider is a custom endpoint (custom or custom:*), the model
switch pipeline must NOT auto-switch to a native provider/OpenRouter based on a
static-catalog match. The user explicitly configured their own endpoint and the
same model name may be served there; silently rewriting model.provider destroys
their config.

- detect_static_provider_for_model(): skip the static-catalog scan when the
  current provider is custom/custom:*
- switch_model() Step e: extend is_custom to cover custom:* so the
  detect_provider_for_model() last-resort fallback cannot fire

Salvaged from #48351 by Elshayib (authorship preserved).

Fixes #48305
2026-06-24 19:34:33 +05:30
kshitij
2187fd884c
Merge pull request #51027 from NousResearch/salvage/typed-model-routing
fix(model_switch): route typed configured models off openai-codex (#45006)
2026-06-24 19:32:35 +05:30
kshitijk4poor
1a174dfb50 fix(models): gate openai-codex/xai-oauth soft-accept to family-shaped slugs (#45006)
Completes the #45006 fix. PR-base commit (configured-provider routing) handles
the case where a typed model IS declared in user/custom provider config. This
commit closes the other root: when a typed model is NOT in any config and the
current provider is a soft-accepting one (openai-codex / xai-oauth), the
hidden-model soft-accept (#16172 / #19729) would accept ANY unknown name as a
hidden model — so `qwen3.5-4b` typed on a Codex-default session "succeeded" and
mislabeled the provider as "OpenAI Codex" (the exact reported symptom), then
400'd on the next turn.

Gate the soft-accept to slugs that plausibly belong to the provider's family
(openai-codex -> gpt-/codex-/o1/o3/o4; xai-oauth -> grok-). Family-shaped
unknown slugs are still soft-accepted (preserving the #16172 entitlement-gated
hidden-model intent); unrelated names are rejected with actionable guidance to
pin the right provider via `--provider <slug>` or the picker.

Adds TestCodexSoftAcceptPlausibilityGate (5 tests): unrelated names rejected on
codex/xai, family-shaped hidden slugs still accepted, real catalog models
unaffected. Verified load-bearing.
2026-06-24 19:23:53 +05:30
xxxigm
89540d592b test(cli): cover non-interactive prompt_yes_no fallback
Regression coverage for the desktop gateway-restart hang: prompt_yes_no
returns its default when HERMES_NONINTERACTIVE=1 or on a bare EOFError
(closed/redirected stdin), and still exits on KeyboardInterrupt.
2026-06-24 17:56:30 +05:30
Teknium
be78fbd70e
Revert "fix(profiles): clone auth.json so OAuth credentials carry to cloned profiles (#51719)" (#51732)
This reverts commit f504aecffe.
2026-06-23 23:58:43 -07:00
teknium1
53f8386587 test(delegation): regression for bedrock Claude target_model api_mode routing
Asserts resolve_runtime_provider honors target_model over the stale
persisted model.default when choosing the Bedrock dual-path api_mode:
Claude target -> anthropic_messages, Nova target -> bedrock_converse.
Both fail without the #49095 fix.
2026-06-23 23:49:37 -07:00
teknium1
d4be583d98 fix(telegram): raise default command-menu cap to 60 so skills stay visible
The 30-slot default could not fit Hermes's ~50 built-in commands, so
every skill command (and 20 built-ins) were silently dropped from the
Telegram \`/\` menu by default — they only worked when typed manually.
Raising the default to 60 keeps all built-ins plus common skill commands
visible out of the box while staying under Telegram's ~4KB payload limit.
Users can still tune it via platforms.telegram.extra.command_menu.
2026-06-23 23:49:22 -07:00
Thestral
dbe14ce35d feat(gateway): configure Telegram command menu priority
Adds a configurable Telegram BotCommand menu cap and priority list via
platforms.telegram.extra.command_menu (max_commands clamped 1..100;
priority_mode prepend|append|replace). Default cap stays 30; hidden
commands remain invokable when typed and /commands lists the full set.

Salvaged from PR #42021. Cherry-picked onto current main; the original
edited gateway/platforms/telegram.py, now relocated to
plugins/platforms/telegram/adapter.py.
2026-06-23 23:49:22 -07:00
Teknium
f504aecffe
fix(profiles): clone auth.json so OAuth credentials carry to cloned profiles (#51719)
Selective --clone / --clone-from / --clone-config copied .env but not
auth.json, silently dropping the credential pool — including OAuth tokens
(Anthropic `claude /login`, Codex, xAI) that never land in .env. A profile
cloned from an OAuth-authenticated default therefore resolved a different
provider (or none) than the source under provider: auto. --clone-all already
carried auth.json via the full copytree; only the selective path missed it.

Add auth.json to _CLONE_CONFIG_FILES and tighten it to 0o600 after copy,
matching .env semantics.
2026-06-23 23:44:34 -07:00
teknium1
901165b5a4 fix(cron): complete plugins.cron_providers rename in 2 missed test files
uperLu's #50958 renamed plugins/cron → plugins/cron_providers but left
two test files patching the now-gone plugins.cron.chronos.verify path,
which would fail collection. Point them at plugins.cron_providers.*.
Add uperLu to release.py AUTHOR_MAP.
2026-06-23 23:39:22 -07:00
Ben
31bced1607 fix(profiles): detect a separate-process gateway in profile status
The dashboard Profiles view showed "Gateway stopped" for a gateway that
is in fact running — while the sidebar status strip and `hermes gateway
status` (CLI) both correctly showed it running. Reported on v0.17.0
running the gateway + dashboard in one Docker container.

Root cause: three liveness surfaces with three detection strengths, all
reading the same `gateway.pid`:

  - `hermes gateway status` -> find_gateway_pids() (process-table scan)
  - sidebar /api/status     -> get_running_pid() + gateway_state.json PID
                               fallback + health-URL probe
  - Profiles view           -> _check_gateway_running() = get_running_pid()
                               ONLY, no fallback

`get_running_pid()` short-circuits to None the moment the runtime lock
(`gateway.lock`) doesn't register as held by the *calling* process —
which is always true when the reader is a separate process from the
gateway (the dashboard is its own s6 service in the container), and also
for any launch-service-managed gateway that left a fresh
`gateway_state.json` but no live PID file. So the Profiles view alone
reported the live gateway as stopped.

Fix: give _check_gateway_running the same fallback the sidebar already
has — after the pid-file/lock check misses, validate the PID recorded in
that profile's gateway_state.json against the live process table via the
existing get_runtime_status_running_pid(). read_runtime_status() gains an
optional path arg so a profile's state file can be read without mutating
the process-global HERMES_HOME (preserving the contextvar-based profile
isolation the dashboard relies on). Backward compatible: every existing
caller passes no argument.

Tests: a regression test that fails pre-fix (live gateway, lock check
returns None -> must still report running) and a guard test that a
'stopped' state file is never reported running even with a live PID.
2026-06-24 16:36:17 +10:00
Francesco Mucio
776f68e1ee fix(gateway): exit 78 (EX_CONFIG) on fatal startup errors, s6 finish script stops restart loop
Profiles without their own messaging token inherit the default
profile's token via os.getenv, hit a token collision, and exit with
startup_failed.  s6 restarts them immediately, creating ~30MB tirith
sandbox dirs in /tmp each cycle — filling the disk in hours (#51228).

Changes:
- gateway/restart.py: add GATEWAY_FATAL_CONFIG_EXIT_CODE = 78
- gateway/run.py: set exit_code=78 on non-retryable startup errors
  (token collision, no platforms)
- hermes_cli/service_manager.py: add _render_finish_script() that
  translates exit 78 → exit 125 (s6 permanent failure)
- hermes_cli/container_boot.py: write finish script alongside run
  script during profile registration

The s6 finish script pattern follows docker/s6-rc.d/dashboard/finish.

Closes #51228
2026-06-24 16:34:51 +10:00
Teknium
78e122ae1a
feat(cron): warn when gateway not running on cron create/list (#51696)
The cron ticker only runs inside the gateway (_start_cron_ticker); there
is no standalone cron daemon. When the gateway isn't running, next_run_at
passes but jobs never fire and last_run_at stays null — and manual
'hermes cron run' (which bypasses the ticker) appears to work, masking
the real cause. This is the most common cron support report (#51038).

cron list already warned; extend the same warning to cron create (the
moment the user is most likely to hit this) via a shared helper, and add
a pointer to 'hermes cron status'. Silent when a gateway is running, so
the gateway /cron path is unaffected.
2026-06-23 23:29:50 -07:00
teknium1
3d56807fbd fix(gateway): actively reap no-systemd gateway orphan before restart
Builds on @wgu9's runtime-tracking fix: now that find_gateway_pids() can
see a no-supervisor `gateway restart` runtime, have stop_profile_gateway()
fall back to an orphan-aware, profile-scoped reap (SIGTERM then SIGKILL)
when the pidfile/runtime record is missing or stale. Closes the duplicate-
accumulation path in #51325 — a follow-up restart now kills the prior
orphan instead of stacking another listener on :8644. Gated on
not supports_systemd_services() so a transient `gateway restart` argv on
supervised hosts is never killed.

Also adds the AUTHOR_MAP entry for the salvaged contributor.
2026-06-23 23:29:28 -07:00
jeremy gu
044996e403 fix(gateway): track no-systemd restart runtimes 2026-06-23 23:29:28 -07:00
Teknium
d539cd9004
fix(config): write config.yaml as UTF-8 to stop emoji/personality corruption (#51676)
atomic_yaml_write (and two sibling config writers) called yaml.dump
without allow_unicode=True. The default personalities shipped in cli.py
contain emoji/kaomoji, so PyYAML escaped astral-plane chars as 8-digit
\\UXXXXXXXX sequences inside multi-line double-quoted strings wrapped
with \\ line-continuations. Stricter/non-PyYAML parsers, editors, and
hand-edits break that structure into unclosed quotes, failing the whole
config parse -> silent fallback to defaults -> custom_providers lost.

Add allow_unicode=True to the canonical writer plus tui_gateway/server.py
and the telegram adapter's atomic config write so config is written as
readable UTF-8 with no escape/fold artifacts.

Fixes #51356
2026-06-23 23:28:21 -07:00
pefontana
667a9f5139 fix(update): reuse an existing PATH uv on Termux before pip
_ensure_uv_for_termux only checked resolve_uv() (the managed
$HERMES_HOME/bin/uv) before falling back to pip, so a uv installed via
`pkg install uv` lives on PATH but is invisible to the helper. Combined
with the cherry-picked wheel-only fallback, a Termux user with no managed
uv still hit `pip install uv`, which has no Android wheel and tried to
source-build the Rust crate, OOM-killing low-memory devices.

Probe shutil.which("uv") right after the Termux guard and reuse it before
pip. Add a regression test that keeps resolve_uv() returning None while a
uv exists on PATH and asserts pip is never invoked.
2026-06-23 18:42:05 -07:00
jinhyuk9714
3e508363f7 fix(update): avoid source-building uv on Termux 2026-06-23 18:42:05 -07:00
Brooklyn Nicholson
e495b33bf1 Merge remote-tracking branch 'origin/main' into bb/pets-merge
# Conflicts:
#	hermes_cli/commands.py
#	tui_gateway/server.py
2026-06-23 19:05:22 -05:00
lEWFkRAD
433db17c0a
fix(windows): harden gateway scheduled task (#45610)
* fix(windows): harden gateway scheduled task

* fix(windows): launch gateway scheduled task via console-less wscript

The Scheduled Task ran the gateway through cmd.exe, which allocates a
console. During logon Windows broadcasts CTRL_CLOSE_EVENT to console
process groups, reaping cmd.exe and the half-initialized gateway with
STATUS_CONTROL_C_EXIT (0xC000013A) - which Task Scheduler treats as a
user cancel, so RestartOnFailure never fires and the gateway vanishes on
every reboot (issue #45599 root cause #1).

Add a console-less .vbs launcher (wscript.exe -> pythonw.exe, both
GUI-subsystem) mirroring the gateway.cmd env + argv, and point the task
action at it. The .cmd stays for the Startup-folder fallback and /Run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jeff <jeffrobodie@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 16:07:52 -07:00
konsisumer
02050859f3 fix(tui): preserve live session identity across compression (#49041)
When a session rotates id on compression, _sync_session_key_after_compress()
re-anchored the session_key, approval-notify routing, yolo state, and slash
worker — but never moved the active-session lease, which stayed keyed to the
pre-compression id. And _find_live_session_by_key() matched live sessions on
the stale session_key, not the live agent's current agent.session_id. After
compression a resume/create path failed to recognize the existing live agent
and could build a SECOND live agent against the same DB continuation -> forked
lineage / cross-session message mixing.

- active_sessions.transfer_active_session(): move a lease in place to the new
  id under the exclusive file lock (no slot drop).
- gateway _transfer_active_session_slot(): call it inside
  _sync_session_key_after_compress(); on the rare fallback (entry pruned)
  RESERVE the new slot before releasing the old lease (reserve-before-release),
  so a concurrent gateway at the session cap cannot grab the freed slot in a
  release-then-reacquire window and leave this session with no lease; if the
  reserve fails, keep the existing lease (review fix).
- _session_lookup_key(): make live-session lookup authoritative on
  agent.session_id, wired into all stale-session_key consumers
  (_find_live_session_by_key, _session_live_item, _live_session_payload) —
  fixes the whole lookup class.

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-06-24 00:54:18 +05:30
Victor Kyriazakos
da80ac0042 feat(slack): add --no-assistant flag to manifest generation
By default `hermes slack manifest` opts the app into Slack's AI Assistant
container (assistant_view feature + assistant:write scope +
assistant_thread_* events). Slack then renders DMs as the right-hand
Assistant split-pane, where every exchange is a thread and bare slash
commands (/help, /new, ...) are not delivered as normal command events —
they only work when the bot is @mentioned. There was no way to opt out
short of hand-editing the generated JSON.

Add --no-assistant to emit a flat-DM manifest that omits those three
pieces, so DMs render as a normal chat and slash commands dispatch
inline. The regular messaging surface (Messages tab, slash commands,
Socket Mode, channel + DM scopes/events) is preserved in both modes.

Default behaviour is unchanged (assistant mode still on).

Tests: cover both manifest modes and the argparse wiring.
2026-06-23 11:30:10 -07:00
Teknium
0f741cef28 fix(tests): update cua install tests for cross-platform support
f-trycua's #50855 test file predated the cross-platform PR (#50552) and
reintroduced two stale tests asserting Linux is unsupported
(test_*_non_macos_*, patching platform.system="Linux" and expecting a
no-op/warn). Linux + Windows are supported now, so install proceeds on
those platforms. Restore main's cross-platform-correct versions:
test_*_on_unsupported_platform_* using FreeBSD as the genuinely
unsupported case.
2026-06-22 13:41:03 -07:00
Francesco Bonacci
5f1d23cfb2 fix(computer-use): delete broken pre-install asset probe; trust the upstream installer
`hermes computer-use install` refused to install on Linux, Windows, and
macOS x86_64 because the pre-install asset probe was hitting the wrong
GitHub endpoint AND duplicating tag-resolution logic the upstream
installer already does correctly.

`_check_cua_driver_asset_for_arch()` queried
`https://api.github.com/repos/trycua/cua/releases/latest`. On trycua/cua:

- cua-driver-rs releases (the binary the installer fetches) are marked
  **prerelease** on every cut. GitHub's `/releases/latest` explicitly
  skips prereleases.
- The Python package releases (`cua-agent`, `cua-computer`, `cua-train`)
  are non-prerelease and end up as the "latest" instead.

Live API check today:

  $ curl -sf https://api.github.com/repos/trycua/cua/releases/latest \
      | jq '{tag:.tag_name, asset_count: (.assets|length)}'
  { "tag": "agent-v0.8.3", "asset_count": 0 }

The probe sees zero assets, prints "Latest CUA release has no Linux
x86_64 asset", and skips install on every Linux / Windows / macOS-x86_64
host — even though the cua-driver-rs-v0.6.0 release ships 19 binary
assets covering all those platforms.

Filtering `/releases?per_page=N` for the `cua-driver-rs-v*` prefix
fixes the bug, but it duplicates tag-resolution logic the upstream
`_install-rust.sh` already does correctly via `CUA_DRIVER_RS_BAKED_VERSION`
(auto-baked by CD on every release, with a `/releases?per_page=N` API
fallback for dev checkouts). The right answer is to trust that
contract instead of mirroring it in Python where it can drift.

Two paths get the same outcome without the probe:

1. **Fresh install**: run `install.sh` directly. It has the baked
   release tag, fetches the right asset, and errors with a clear
   message on missing-arch downloads. No preflight needed.
2. **Upgrade path**: `cua_driver_update_check()` (separately added)
   shells `cua-driver check-update --json` against the installed
   binary, which returns the canonical update answer from the same
   source the installer uses.

- `hermes_cli/tools_config.py`: delete `_check_cua_driver_asset_for_arch`
  and its two call sites in `install_cua_driver`. Replace with an
  inline comment near the top of the module explaining the rationale.
- `tests/hermes_cli/test_install_cua_driver.py`: drop the
  `TestCheckCuaDriverAssetForArch` block. Add `TestArchProbeRemoval`
  with three regressions:

  - `test_probe_function_is_gone` — asserts the deleted helpers stay
    deleted.
  - `test_fresh_install_does_not_call_github_api` — asserts the
    install path doesn't hit GitHub directly from Python anymore.
  - `test_upgrade_with_binary_does_not_call_github_api_directly` —
    same for the upgrade path.

All 9 `test_install_cua_driver` tests pass.

Reported by @teknium1 while testing on a headed Ubuntu host.
2026-06-22 13:41:03 -07:00
harjothkhara
791c992b55 fix(model_switch): route typed configured models off openai-codex (#45006)
A typed `/model <name>` where `<name>` is declared under `providers.<slug>` or
`custom_providers` — but typed while the current provider is a soft-accepting
one (e.g. `openai-codex`) — stayed on the current provider and was swallowed as
an unknown hidden Codex model, instead of routing to the provider that actually
declares it.

Add configured-provider exact-match detection (`_configured_provider_matches`)
and a new Step d.5 in `switch_model`: if the typed model is declared in
user/custom provider config, route to that provider BEFORE
`detect_provider_for_model()` guesses from static catalogs and BEFORE the
common-path validation lets a soft-accepting current provider swallow the name.

- Matching is exact (case-insensitive) against explicitly-declared model
  collections only (`models`, `model`, `default_model`) — never fuzzy/family.
- Same-provider declarer → keep current provider (canonicalize the id).
- Multiple declarers → fail clearly and ask for `--provider <slug>`.
- Single declarer → route there; for `providers.<slug>` user providers, set
  `explicit_provider` so the credential block resolves base_url/key from config.
- Step e (`detect_provider_for_model`) is gated off when `config_routed`.

The deliberately-supported openai-codex / xai-oauth hidden-model soft-accept
(#16172 / #19729) is left untouched: when nothing in config matches, detection
is a no-op.

Salvaged from #45442 by harjothkhara (authorship preserved).

Tests: tests/hermes_cli/test_model_switch_configured_provider_routing.py
(7 tests). Full model_switch suite: 214 passed.

Fixes #45006
2026-06-23 02:03:21 +05:30
Austin Pickett
2a58fee1a1
fix(api): allow dashboard updates for git checkouts in containers (#51005)
Salvages #50469 by @libre-7.

_dashboard_local_update_managed_externally() previously blocked every containerized dashboard from the local update API, even when the running install was a bind-mounted git checkout that can be updated with hermes update.

Allow the dashboard updater only for git installs inside containers, while keeping hosted /opt/data, docker, and pip installs managed externally. Pip remains blocked because its apply path mutates the running container filesystem and is not the self-managed checkout case.

Adds regression coverage for docker, git, and pip install-method handling inside containers, and maps the contributor email for release attribution.

Co-authored-by: libre-7 <libre-7@users.noreply.github.com>
2026-06-22 15:55:33 -04:00
Teknium
2ba1cfeb2e
feat(goals): completion contracts for /goal — evidence-based judging (#50501)
Adds an optional structured completion contract to the standing-goal loop,
adapted from OpenAI Codex's /goal guidance (a durable objective works best
when it names what done means, how to prove it, what not to break, what's in
scope, and when to stop).

A contract has five optional fields — outcome, verification, constraints,
boundaries, stop_when. When set, the continuation prompt tells the agent to
target the verification surface and respect constraints, and the judge marks
the goal done only when the verification criterion is met with concrete
evidence (command result, file excerpt, test output) instead of a loose
"looks done" claim. This tightens the most common /goal failure mode:
premature completion / endless over-continuation on an underspecified goal.

Two ways to set a contract, both backward compatible (bare /goal <text>
behaves exactly as before):
- /goal draft <objective>  — expands plain text into a full contract via the
  goal_judge aux model (cache-safe side call), falls back to a free-form goal
  if the model is unavailable.
- /goal <text> with inline 'field: value' lines (verify:, constraints:,
  boundaries:, stop when:, ...). Plain goals with an incidental colon are not
  mangled — only known field prefixes are pulled out.
- /goal show prints the active contract.

Contracts persist in SessionDB.state_meta alongside the goal (survive /resume),
compose with /subgoal criteria, and old goal rows load unchanged. CLI + every
gateway platform via the shared GoalManager engine; zero new model tools.

Tests: +18 in tests/hermes_cli/test_goals.py (parse/serialize/judge-prompt/
draft/fallback), 73/73 green; 42/42 across the broader goal test surface;
live E2E roundtrip (set -> persist -> reload -> contract-aware prompts) green.
2026-06-22 12:20:09 -07:00
kshitij
5937b95192
Merge pull request #50773 from NousResearch/salvage/43719-dashboard-plugin-rce
fix(security): restrict dashboard plugin backend auto-import to bundled plugins — defense-in-depth (#43719)
2026-06-22 22:57:33 +05:30