hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-21 10:22:18 +00:00

Author	SHA1	Message	Date
Austin Pickett	8fe7b52ebf	test(desktop): lock GUI⊇`hermes model` provider parity; surface Bedrock Adds the end-to-end parity contract test: every CANONICAL_PROVIDERS entry (the `hermes model` universe) must be configurable on a desktop Providers tab — keys(/api/env) ∪ ids(/api/providers/oauth) ⊇ canonical. Asserted as an invariant against the live endpoints so the GUI can never silently drift from the CLI again. Surfacing this contract caught Bedrock: it's aws_sdk (no api-key vars), so it had no Keys card. /api/env now tags AWS_REGION/AWS_PROFILE to the bedrock provider card. Anthropic is whitelisted as a legitimate dual-tab provider (direct API key + subscription OAuth). Also refreshes the _OAUTH_PROVIDER_CATALOG docstring to describe its new role as the override base for _build_oauth_catalog().	2026-06-19 07:26:46 -07:00
Austin Pickett	60dfa0f31b	feat(desktop): Accounts tab derives membership from unified provider catalog /api/providers/oauth now unions the explicit hand-tuned OAuth cards (_OAUTH_PROVIDER_CATALOG — bespoke flow/status/cli, plus the api-key Anthropic PKCE card and synthetic claude-code row) with every accounts-tab provider in provider_catalog(). Any OAuth/external provider in the `hermes model` universe now appears automatically, closing the drift where google-gemini-cli and copilot-acp had no Accounts card despite being CLI-configurable. Adds read-only status cards for google-gemini-cli (via existing get_gemini_oauth_auth_status) and copilot-acp (managed-by-CLI, like claude-code). DELETE handler routes through the same _build_oauth_catalog() builder. Parity test asserts the Accounts tab offers every accounts-tab catalog provider as an invariant.	2026-06-19 07:26:46 -07:00
Austin Pickett	3be1326f8d	feat(desktop): /api/env derives provider key membership from unified catalog The Keys tab now surfaces every keys-tab provider in provider_catalog() (the `hermes model` universe), synthesizing a card even when the env var has no hand entry in OPTIONAL_ENV_VARS. Closes the drift where openai-api, kilocode, novita, tencent-tokenhub, and copilot were CLI-configurable but invisible in the desktop Providers → API keys tab. Each provider row now carries backend-derived provider/provider_label grouping hints so the desktop can group by the same provider identity the CLI picker uses. Hand OPTIONAL_ENV_VARS prose still wins where present (enrichment, not a gate). Shared non-provider credentials (e.g. tool-category GITHUB_TOKEN) are explicitly not hijacked into a provider card — Copilot uses its provider-owned COPILOT_GITHUB_TOKEN.	2026-06-19 07:26:46 -07:00
Austin Pickett	054b8c82fd	feat: unified provider_catalog() — one source for CLI picker and desktop tabs Adds hermes_cli/provider_catalog.py, deriving one descriptor per provider from the CANONICAL_PROVIDERS universe (what `hermes model` renders, auto-extended from provider plugins), joined with auth/env from PROVIDER_REGISTRY and display metadata from ProviderProfile (with canonical/env fallbacks for the four profile-less providers and the many profiles with blank display/signup fields). Each descriptor is tagged with the desktop tab it belongs on (keys vs accounts) by auth_type. This is the single source of truth the desktop Providers tabs will derive membership from, so they can no longer drift from the CLI picker. Tests assert the parity contract (catalog == hermes model universe) and tab routing as invariants, not snapshots.	2026-06-19 07:26:46 -07:00
Alex Yates	fad4b40d9d	fix(model): persist /model switch by default across sessions A plain /model <name> switch only lasted for the current session — every new session reverted to the previously-configured model, so users had to re-switch every time (e.g. glm-5.1 -> glm-5.2 on every launch). Persist-by-default is now the behavior across all three /model surfaces (CLI, gateway, TUI/dashboard), gated by a new config key model.persist_switch_by_default (default true): /model <name> switch model (persists to config.yaml) /model <name> --session switch for this session only /model <name> --global switch and persist (explicit, unchanged) The effective persistence is resolved once via resolve_persist_behavior() in hermes_cli/model_switch.py so --session opts out, --global opts in, and the config-gated default applies otherwise. --global remains a valid explicit no-op alias for the new default.	2026-06-19 07:07:06 -07:00
OYLFLMH	c1ffd4c3b4	fix(cli): make refresh_interval configurable, default to 0 (disabled) Commit `6724daa2c` added refresh_interval=1.0 to keep the idle clock ticking, but unconditional 1 Hz redraws in non-fullscreen prompt_toolkit mode cause terminal emulators (Xshell, iTerm2, Windows Terminal) to auto-scroll to the bottom on every tick — breaking scroll-up to read history. Drive it from display.cli_refresh_interval (0 = disabled, the default) so users who want the ticking clock can opt in without affecting everyone. Fixes: #48309 Related: `6724daa2c`, `8972a151a`	2026-06-19 07:06:34 -07:00
kshitijk4poor	01a6f11896	fix(debug): include gui.log (dashboard/TUI/pty/websocket) in hermes debug share gui.log was registered in hermes_cli/logs.py::LOG_FILES (and surfaced by `hermes logs gui`) but was never wired into `hermes debug share`. The share report captured agent/errors/gateway/desktop tails plus full agent/gateway/desktop logs — but nothing from gui.log, the surface the dashboard, TUI-over-PTY bridge, and websocket layer (hermes_cli.web_server / pty_bridge / tui_gateway) actually write to. A user reporting a dashboard or TUI bug shared zero breadcrumbs from the broken surface. Wire gui.log through all three share surfaces, matching the existing pattern: - _capture_default_log_snapshots(): capture the gui snapshot (redacted like the rest) - collect_debug_report(): add the gui.log summary tail block - build_debug_share(): pull gui full_text, prepend dump header + redaction banner, add to the upload loop - run_debug_share() --local branch: same, plus the local print block - _PRIVACY_NOTICE: name gui.log in both bullets Redaction is inherited for free — the gui snapshot goes through the same _capture_log_snapshot(..., redact=redact) path, so secrets are scrubbed in both the tail and full text (verified E2E: seeded key masked by default, passes through under --no-redact, raw token never leaks). Tests: seed gui.log in the fixture, add test_report_includes_gui_log, and bump the upload-count tripwire 4->5 (test_share_uploads_five_pastes).	2026-06-19 07:05:42 -07:00
Charles Power	fd92a3a5c9	fix(gateway): Windows restart no longer causes a silent outage `hermes gateway restart` on Windows could take the gateway offline with no replacement. restart() was stop() -> sleep(1.0) -> start(), but the graceful drain can run up to ~180s while the detached pythonw process stays alive. The 1s sleep let start() run against the still-draining old process; its "already running" guard then no-opped, and when the old process finally exited nothing relaunched it. Two root causes, both fixed: 1. Loose PID detection. `_scan_gateway_pids` and the gateway.status helpers used substring matches ("... gateway" in cmdline) for lifecycle decisions, so they false-matched `gateway status`/`dashboard` siblings and unrelated processes like `python -m tui_gateway`, plus stale gateway.pid records. Add a shared strict matcher `looks_like_gateway_command_line()` in gateway/status.py that requires the real `gateway run` subcommand (or the dedicated entrypoints), and route `_looks_like_gateway_process`, `_record_looks_like_gateway`, and `_scan_gateway_pids` through it. 2. restart() race. Wait until the gateway is authoritatively gone (`get_running_pid()` + strict `_gateway_pids()`) before relaunch; force-kill once if it lingers and raise rather than start a duplicate; verify the relaunch produced a running gateway and raise loudly if not (no more exit-0 silent outage). Scoped to Windows; systemd/launchd restart paths are already drain-aware. Adds tests/gateway/test_gateway_command_line_matcher.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 06:31:56 -07:00
xxxigm	e738c08336	fix(backup): exclude regeneratable dependency and cache dirs `hermes backup` walked every file under HERMES_HOME, excluding only hermes-agent / node_modules / __pycache__ / backups / checkpoints. Python dependency trees (plugin and MCP-server venvs, site-packages) and pip/uv tool caches that live under HERMES_HOME were swept in file-by-file, ballooning a backup to hundreds of thousands of entries that crawl for hours — the reported "backup stuck for days / 426543 files" symptom. Add the canonical regeneratable-dir names (.venv, venv, site-packages, .tox, .nox, .pytest_cache, .mypy_cache, .ruff_cache — mirroring agent.skill_utils.EXCLUDED_SKILL_DIRS) plus .cache to the backup's exclusion set, used by both run_backup and the pre-update/pre-migration _write_full_zip_backup. .archive is intentionally left in so the curator's restorable archived skills still get backed up. Tests cover each new dir name (excluded at any depth), that .archive and cache-resembling files are kept, and an integration check that a planted venv/site-packages/cache is pruned from the actual backup zip while skills/config survive.	2026-06-19 14:37:41 +05:30
kshitijk4poor	1ab6f34791	refactor(dashboard): align Slack allowlist validation with gateway parse - Drop empty entries before validating SLACK_ALLOWED_USERS so a trailing or interior comma (which the gateway silently tolerates in gateway/platforms/slack.py) is no longer rejected at the dashboard. - Hoist the member-ID regex to a module-level _SLACK_MEMBER_ID_RE constant and note it stays in sync with the frontend SLACK_MEMBER_ID_RE. - Add a regression test for the trailing-comma case.	2026-06-19 12:22:30 +05:30
kshitijk4poor	83c034bd5b	fix(dashboard): accept Slack allow-all wildcard in allowed-users validation The new SLACK_ALLOWED_USERS validation rejected '*', but the Slack gateway honors '*' as an allow-all wildcard (gateway/platforms/slack.py DM auth, slash-confirm, and approval-button paths). Accept '*' as a valid list entry in both the API validator and the dashboard form so a value the runtime honors is no longer blocked at setup.	2026-06-19 12:18:15 +05:30
Shannon Sands	d9190491a6	Add Slack setup hints and field validation	2026-06-19 12:16:23 +05:30
Shannon Sands	f741e70791	Add Slack allowed users setup field	2026-06-19 12:16:23 +05:30
kshitij	6278bca055	Merge pull request #48259 from NousResearch/fix/ns501-multipart-upload-salvage fix(dashboard): clean up upload temp file on client disconnect + pin python-multipart (NS-501)	2026-06-19 12:03:58 +05:30
Shannon Sands	12dfcfdf73	fix(tui): restart dashboard chat on idle exit hotkeys	2026-06-19 12:02:22 +05:30
AhmetArif0	245b95b094	fix(terminal): block gateway lifecycle commands from inside the gateway process systemctl --user restart hermes-gateway run via the terminal tool is a child of the gateway itself. When systemd delivers SIGTERM the gateway kills this subprocess before it can complete, so the service may never restart — reproducing issue #37453. The hermes gateway restart/stop guard (hermes_cli/gateway.py) and the cron-path guard (hermes_cli/cron.py) already block equivalent commands in their respective paths but the terminal tool had no such defense. Add a hard-block before command execution in terminal_tool: when _HERMES_GATEWAY=1 and the command matches _contains_gateway_lifecycle_command, return an error immediately. force=True cannot bypass it — unlike the normal dangerous-command approval flow, here even a user-approved restart would fail because the SIGTERM propagates to child processes. Also extend _GATEWAY_LIFECYCLE_PATTERNS to match systemctl with flags (e.g. systemctl --user restart) — the previous regex required the action word immediately after systemctl with no flags in between. Adds 9 regression tests: 6 blocked variants (parametrized), force bypass attempt, safe systemctl passthrough, and guard-inactive-outside-gateway.	2026-06-19 11:53:44 +05:30
Teknium	620fd59b8e	feat(model-picker): add Refresh Models control to bust stale model cache (#48691) The desktop model picker had no way to force a fresh model fetch: model.options went through the 1h-cached provider_models_cache.json, and there was no flag to bust it. When a provider's cached list expired and its next live fetch failed, the picker fell back to the curated static list — silently dropping live-only models (e.g. OpenCode Zen's free tier like deepseek-v4-flash-free) the user had been using. - Thread refresh through model.options (RPC + REST /api/model/options) -> build_models_payload -> list_authenticated_providers, which calls clear_provider_models_cache() up front when set so every row re-fetches live. - Add a 'Refresh Models' control to the desktop picker (5-locale i18n, spinning sync icon). Normal opens leave refresh=false to stay snappy on the cache. Verified: stale cache hides deepseek-v4-flash-free -> refresh busts it -> live re-fetch surfaces it. refresh=false never touches the cache.	2026-06-18 21:37:41 -07:00
kshitij	d06104a9ee	fix(dashboard): resolve chat TUI argv off event loop (#48561 ) * fix(dashboard): resolve chat TUI argv off event loop Dashboard chat now resolves its TUI launch command off the FastAPI/WebSocket event loop. The resolver can run `npm install` / `npm run build` through `_make_tui_argv()`, and doing that synchronously in `/api/pty` can block proxy keepalives and other dashboard WebSocket work long enough for reverse-proxy deployments to drop the chat connection. This keeps the current TUI build policy intact: normal production launches still run the correctness-first `npm run build` path, while `HERMES_TUI_DIR` remains the prebuilt/no-build path for distros and containers. The change only moves the potentially slow resolver work to a worker thread for the dashboard chat path, serialized by an `asyncio.Lock` so concurrent chat tabs preserve one-build-at-a-time behavior. `SystemExit` (node/npm missing) and the profile `HTTPException` path still propagate cleanly through `asyncio.to_thread()`. Salvaged from #26124 — rebased onto current main. The async wrapper now threads the `profile` parameter that `_resolve_chat_argv` gained on main since the PR was opened, so cross-profile chat is preserved. Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com> * chore: add 0xdany to AUTHOR_MAP * fix(dashboard): bind chat-argv lock to app.state; cover error propagation Self-review hardening on top of the salvaged fix: - Move `_chat_argv_lock` from a module-level `asyncio.Lock()` onto `app.state` (initialised in `_lifespan`, lazy fallback via `_get_chat_argv_lock`), mirroring `event_lock`. A module-level `asyncio.Lock()` binds to whatever event loop is active at import time, which is the exact pattern `_get_event_state`'s docstring warns against (breaks across TestClient instances / uvicorn reloads). This keeps the lock on the running loop. 
- Add two tests exercising the real `_resolve_chat_argv_async` → `asyncio.to_thread` → lock → re-raise chain: `SystemExit` (node/npm missing) and `HTTPException` (invalid profile) both propagate out of the worker thread and are caught by `pty_ws`'s existing handlers. The prior tests mocked `asyncio.to_thread` away and never covered this path. * test(dashboard): dedupe pty error-propagation tests; assert close code simplify-code cleanup pass on the salvage stack: - Extract the shared scaffolding of the two pty_ws error-propagation tests into `_assert_pty_propagates`, keeping the two tests as distinct contracts for the `except SystemExit` and `except HTTPException` arms. - Assert the stable WebSocket close code (1011) instead of relying solely on the user-facing "Chat unavailable" notice wording — a behavior contract per the AGENTS.md "behavior contracts over snapshots" rule, robust to notice rewording. The detail substring ("unknown profile") is still checked for the HTTPException case since proving the detail survives the thread hop is the point of that test. No production-code change; the helper exercises the same real _resolve_chat_argv_async -> asyncio.to_thread -> lock -> re-raise chain. --------- Co-authored-by: draihan <draihan@student.ubc.ca>	2026-06-18 22:20:52 -04:00
Ben	03d9a95a74	fix(desktop): show Hindsight memory provider (#37546) * fix(desktop): show Hindsight memory provider * feat(desktop): configure Hindsight memory provider * fix(desktop): limit Hindsight modes to supported setup * refactor(desktop): generic memory-provider config surface Replace the bespoke Hindsight settings surface with a declarative, schema-driven path so adding a memory provider is pure declaration — no per-provider page, conditional, or endpoint. - memory_providers.py: declarative registry. Each provider lists its fields {key, label, kind, default, options, secret-vs-plain}. Hindsight's mode is a select(cloud, local_external), so rejecting local_embedded falls out of generic enum validation instead of a hand-written check. - One generic endpoint pair GET/PUT /api/memory/providers/{name}/config. GET returns declared fields + current values (secrets only as is_set, never read back); PUT validates selects against their options, writes plain fields to the provider config file, secrets to the env store, and flips memory.provider. - ProviderConfigPanel renders straight from the schema, replacing hindsight-settings.tsx and the memory.provider === 'hindsight' conditional in config-settings.tsx — same pattern as toolset-config-panel.tsx off env_vars. Scoped to memory providers; storage layout is unchanged so the runtime Hindsight plugin reads the same config.json / HINDSIGHT_API_KEY / provider keys as before. Tests cover the registry, endpoint behavior (defaults, write+secret, select rejection, unknown provider, secret-never-returned), and the generic panel.	2026-06-18 16:48:47 -05:00
Victor Kyriazakos	3ead2bdd0d	feat(prompt): configurable per-platform system-prompt hint overrides Add platform_hints config so an admin can append to or replace Hermes' built-in platform hint for a single messaging platform (WhatsApp, Slack, Telegram, ...) without affecting other platforms. Enables enterprise managed profiles to steer platform-aware skills (e.g. invoke a custom table-formatting skill on WhatsApp where Markdown tables don't render) while leaving Telegram/Slack/CLI behavior unchanged. - hermes_cli/config.py: document platform_hints in DEFAULT_CONFIG - agent/agent_init.py: load platform_hints -> agent._platform_hint_overrides - agent/system_prompt.py: _resolve_platform_hint() applies append/replace (replace wins; bare string = append shorthand); defensive on bad config - tests: 16 cases covering append/replace/shorthand/isolation/malformed Override only affects the platform-hint segment of the system prompt; SOUL/context/memory tiers and general instructions are unchanged.	2026-06-18 14:28:01 -07:00
brooklyn!	2944b3c394	fix(desktop): make session delete idempotent and id-resolving (#48641) DELETE /api/sessions/{id} was the only session endpoint that didn't resolve the id (detail, messages, rename, export all call resolve_session_id) and 404'd when the row was already gone. The desktop optimistically removes the sidebar row, then RESTORES it and shows the error on any failure — so deleting a session that had just been reaped (empty-session hygiene) or removed by a concurrent client resurrected a ghost row and surfaced "session not found". /goal + auto-compression churn leaves transient empty rows that race the sidebar snapshot, which is the exact "I deleted the empty one and got 'session not found'" report. Resolve exact ids / unique prefixes, and treat an already-absent session as an idempotent success — DELETE's contract is "ensure it's gone". This mirrors the bulk-delete endpoint, which already treats ghost ids as success. Tests: deleting an absent id is idempotent (200, not 404); delete resolves a unique prefix; a real session still deletes.	2026-06-18 21:16:06 +00:00
flooryyyy	f8d8f045fa	feat(kanban): auto-subscribe calling session on kanban_create When a worker calls kanban_create from inside a session that has a persistent delivery channel, the originating session is now subscribed to the new task's completion/block events automatically. The agent that dispatched the task gets notified instead of having to poll. - Gateway sessions (telegram/discord/slack): HERMES_SESSION_PLATFORM + HERMES_SESSION_CHAT_ID ContextVars, set by the messaging gateway. - TUI / desktop sessions: HERMES_SESSION_KEY in the subprocess env. The TUI notification poller keys on platform='tui' + chat_id=<key>. - CLI / cron / test: no persistent channel, no subscription. Gated by kanban.auto_subscribe_on_create in config.yaml (default True). Disable to mirror pre-feature behaviour — users who want explicit kanban_notify-subscribe calls per task can set it to false. This config gate addresses the design concern that got PR #19718 reverted upstream (unconditional implicit auto-subscribe on tool-driven kanban_create was too aggressive for orchestrator users). HERMES_SESSION_ID is intentionally not a fallback channel — it is set by ACP/agent subprocess telemetry for every invocation, not just TUI, so treating it as a notification target would auto-subscribe every CLI session and re-introduce the over-eager behaviour. The kanban_create response now includes a 'subscribed' bool so orchestrators can react if subscription failed (e.g. by falling back to explicit kanban_notify-subscribe or to polling). Includes 6 tests covering the gateway / TUI / CLI / partial-context / gated / add_notify_sub-failure paths. All 90 tests in test_kanban_tools.py pass; 509 broader kanban tests pass.	2026-06-18 14:10:51 -07:00
teknium1	3042045540	fix(picker): keep max_models=0 distinct from unlimited; lock cap semantics Follow-up to the cap-removal salvage. The contributor guarded the new unlimited default with `[:max_models] if max_models else ...`, which conflates max_models=0 (used by slug-only callers that want an empty model list) with None (unlimited). Tighten to `is not None` at all five slicing sites in list_authenticated_providers / list_picker_providers, and add a regression test asserting the three-way contract: None=full, 0=empty, N=first N.	2026-06-18 13:47:31 -07:00
islam666	9705e7944a	fix(picker): remove max_models=50 cap in interactive model pickers The interactive model pickers (Desktop REST API, TUI model.options, CLI /model) were hard-capped at max_models=50, which truncated large provider catalogs like Kilo Gateway (336 models) to just 50 entries. This made most models undiscoverable via the picker search box. Changes: - Change build_models_payload() default from max_models=50 to None (unlimited) - Change list_authenticated_providers() default from max_models=8 to None - Change list_picker_providers() default from max_models=8 to None - Fix all [:max_models] slicing to handle None as 'no limit' - Remove max_models=50 from 5 interactive picker callers: * web_server.py: get_model_options (Desktop /api/model/options) * web_server.py: get_recommended_default_model * model_switch.py: prewarm_picker_cache_async * tui_gateway/server.py: model.options JSON-RPC * cli.py: HermesCLI model picker - Telegram/Discord inline keyboard picker (gateway/slash_commands.py) still passes max_models=50 explicitly — unchanged behavior. The total_models field was already in the response payload and is now meaningful since models.length == total_models for interactive pickers. Fixes #48279	2026-06-18 13:47:31 -07:00
Siddharth Balyan	73cd8622f9	feat(billing): /billing terminal billing — interactive TUI + CLI client (#45449 ) * feat(billing): nous_billing http client + BillingState core (phase 2b) Phase 2b terminal-billing client foundation: - hermes_cli/nous_billing.py: typed client for the 4 /api/billing/* endpoints (state/charge/poll/auto-top-up). Raises typed errors (BillingScopeRequired, BillingRateLimited, BillingAuthError) mapped from the live-verified contract; fail-open is the caller's job. Idempotency-Key enforced client-side. - agent/billing_view.py: surface-agnostic BillingState core + Decimal money parsing (server emits decimal strings, not 2dp), fail-open builder, idempotency-key gen, custom-amount validation. - 51 unit tests (decimal parse/format, payload tiering, error->exception matrix, fail-open, amount validation). Plan: docs/plans/2026-06-13-001-phase-2b-terminal-billing-tui-plan.md * feat(billing): billing:manage scope + lazy step-up re-auth (phase 2b) - NOUS_BILLING_MANAGE_SCOPE constant. - nous_token_has_billing_scope(): split-based scope check (no false-positive substring match). - step_up_nous_billing_scope(): re-runs the device flow requesting billing:manage, reusing the held credential's portal/inference URLs + client_id (so a preview stays a preview), persists like _login_nous but WITHOUT the model picker. Returns True iff the minted token carries the scope (False when NAS silently downscopes a non-admin / unticked grant). Lazy step-up (plan D-A): normal login path unchanged; 403 insufficient_scope from a billing call triggers this. 7 unit tests. * feat(billing): billing JSON-RPC methods for the TUI (phase 2b) billing.state / charge / charge_status / auto_reload / step_up in tui_gateway/server.py. 
Return STRUCTURED success envelopes (result.ok + result.error=<code>) rather than JSON-RPC-level errors, so the Ink rpc() promise always resolves and the TUI branches on the typed billing error code (insufficient_scope, rate_limited, no_payment_method, …) to render the right affordance. Money serialized as decimal STRINGS + display strings. charge mints + echoes an idempotency_key for retry reuse. 16 unit tests. * feat(billing): /billing CLI handler + command registry (phase 2b) - CommandDef("billing", subcommands=buy\|auto-reload\|limit), added to _SLACK_VIA_HERMES_ONLY so it routes via /hermes on Slack (keeps the 50-cap parity test green, same as /credits). - cli.py::_show_billing + screen helpers: all 5 screens (overview, buy→confirm→ poll, auto-reload, monthly-limit read-only). Reuses _prompt_text_input_modal / _prompt_text_input (D-C). Non-interactive (_app is None) renders text + portal deep-link, never prompts (R7). Decimal money end-to-end. 2s/5-min cancellable poll loop; 429/503 = retry not failure; settled = ledger truth. Lazy step-up on 403 insufficient_scope. no_payment_method treated as mainline funnel-to-portal. - 6 CLI tests; 156 command tests (incl. Slack/Telegram parity) green. * feat(billing): /billing Ink TUI screens + tests (phase 2b) - ui-tui/src/app/slash/commands/billing.ts: /billing TUI command covering all 5 screens — overview (text), buy <amt> → ConfirmReq → charge → non-blocking 2s/ 5-min poll loop → settled/failed/timeout branches, auto-reload <below> <to> → ConfirmReq → PATCH, limit (read-only). Reuses the existing ConfirmReq overlay (D-C) — no bespoke component. Typed-error envelope branching: insufficient_scope arms the lazy step-up confirm; no_payment_method/rate_limited/cap funnel to portal. Client-side amount validation mirrors the server (bounds + 2dp). - gatewayTypes.ts: Billing* response interfaces. - registry.ts: register billingCommands. 
- billingCommand.test.ts: 12 vitest cases (overview/gating/buy-confirm-poll- settled/no_payment_method/step-up/limit/auto-reload/validation). TUI build green; 12/12 vitest pass; slash tests pass once @hermes/ink is built. * docs(billing): scrub private cross-repo references NAS is a private repo — remove all references to it from the public PR: - drop the cross-repo planning doc (planning scaffolding, not a deliverable; the PR description documents the design) - replace 'NAS' / 'PR #412 preview' mentions in code + test comments with generic 'the server' / 'a preview deployment' * docs(billing): scrub final NAS reference in step-up docstring * docs(billing): drop dangling plan-doc refs The phase-2b plan doc was removed in the cross-repo scrub (`300afcc0b`) but two module docstrings still pointed at it. Drop the dead refs. * feat(billing): interactive /billing overlay + step-up UX, portal-URL & token fixes Adds the interactive /billing TUI overlay and hardens the terminal-billing client across CLI and TUI. - TUI: full /billing overlay state machine (overview to buy to confirm, auto-reload, read-only monthly limit) reusing the existing confirm overlay. - Step-up: surface the verification link in-transcript and open the browser via the TUI's own opener (the device flow runs in the headless gateway, so a printed URL was being dropped); run the step-up handler off the main loop and emit the link as an out-of-band event so the gateway stays responsive. - Step-up copy is scope-accurate ("Billing permission granted") and re-checks /state so it never claims "enabled" when the org kill-switch is still off. - Portal deep-links resolve to absolute URLs against the active portal base (the server emits them relative) - fixes a bare "/billing?topup=open" link. - Billing calls refresh an expired access token via the stored refresh token instead of reporting a false "not logged in". 
- Optimistic funnel: advise "set up a saved card on the portal" up front when no card is on file (advisory, not a hard gate). - Token resolution is cached briefly so the 2s charge poll loop stops re-locking + re-reading the auth store on every tick; 401 re-resolves fresh. - Remove the temporary demo-mode shims. Validation: 87 Python billing tests, 88 TS tests (billing command + gateway event handler), tsc clean, ink + ui-tui builds green. * docs(billing): add /billing TUI screenshots for PR * fix(cli): guard _last_invalidate on bare instances; update stale prompt-fallback test The UI-invalidate throttle read self._last_invalidate unconditionally, which raised AttributeError on HermesCLI instances built without __init__ (the thread-safety test's object.__new__ shell). Guard the read with getattr. The off-main-thread branch of _prompt_text_input was changed (#23185) to cancel cleanly to None instead of falling back to a bare input() that would hang on the slash-worker thread; the test still asserted the old direct-input fallback. Update it to assert the current intended behavior: returns None, calls neither run_in_terminal nor input(), and does not hang.	2026-06-19 01:53:32 +05:30
Teknium	0fa7d6f660	fix(desktop): never persist or restore a named custom provider as bare "custom" (#48547 ) * Port from cline/cline#11514: encourage parallel tool calls Add a universal system-prompt guidance block telling the model to batch independent tool calls (reads, searches, web fetches, read-only commands) into a single assistant turn instead of one call per turn. The runtime already executes independent batches concurrently (read-only tools always; non-overlapping path-scoped file ops); the open-source system prompt had nothing steering the model to PRODUCE the batch. Fewer round-trips means less resent context, which compounds over a long conversation. - prompt_builder.py: new PARALLEL_TOOL_CALL_GUIDANCE block (short, static, cache-amortised) modeled on TASK_COMPLETION_GUIDANCE. - system_prompt.py: inject right after the task-completion block, gated by agent.valid_tool_names + the new toggle. - agent_init.py: read agent.parallel_tool_call_guidance (default True). - config.py: add the default under the agent section. - test_prompt_builder.py: behavior-contract tests (batching steer, dependent carve-out, length bound) — invariants, not wording snapshots. Adapted from Cline's TypeScript tool-surface guidance to hermes-agent's Python prompt-assembly architecture and config-over-env conventions. * fix(desktop): never persist or restore a named custom provider as bare "custom" Custom providers vanish from the Desktop/TUI model picker with "No LLM provider configured" — repeatedly fixed (#44062, #44109, #45578) and repeatedly regressed (#44022, #47714) because every fix only recovered the entry identity from a persisted base_url. When a session is persisted/restored with the resolved provider "custom" and NO base_url, bare "custom" leaked through verbatim; resolve_runtime_provider("custom") routes to the OpenRouter default URL with no api_key, so the next turn/resume dies. 
Bare "custom" is the resolved billing class shared by every named providers:/ custom_providers: entry — it is not a routable identity. Centralize the "never let bare custom escape" invariant in one helper, runtime_provider.canonical_custom_identity(), and apply it at all four leak sites in tui_gateway/server.py: - _ensure_session_db_row — the ORIGIN: first DB write seeds the bad row - _runtime_model_config — live persist - _stored_session_runtime_overrides — resume restore (heals old rows; drops unrecoverable bare custom so resume falls back to config default) - _make_agent — rebuild / per-turn The helper recovers custom:<name> from the endpoint URL when present, else from config.model.provider (the durable identity left when no base_url survived). Regression tests in test_custom_provider_session_persistence.py lock the no-base_url vector at every site so it cannot regress again.	2026-06-18 11:11:51 -07:00
Teknium	c37fdec2d9	feat(dashboard): surface full per-MCP catalog detail; fix pip-install doc (#48520 ) The dashboard MCP catalog only showed name/description/transport and a non-clickable source. Users couldn't see what an entry connects to or runs before installing — the exact detail the docs trust model tells them to vet. - /api/mcp/catalog now returns transport target (url, or command+args), auth_type, git install source/ref + bootstrap commands, default-enabled tool hint, and post-install guidance per entry. - McpPage renders the endpoint URL (http) or command+args (stdio), the git install source/ref, a collapsible bootstrap-commands list, setup notes, and the source as a clickable link when it's a URL. - Docs: drop the 'uv pip install -e .[mcp]' quick-start step (Hermes does not support pip installs; MCP ships with the standard install) and note the dashboard now surfaces this detail. - Strengthen the catalog endpoint test to assert the new inspection fields.	2026-06-18 09:40:56 -07:00
Kewe63	f1254c8eaf	fix(skills): rmtree scope guard + default pre_update_backup to true (#48200) Defense-in-depth fix for the silent wipe of ~/.hermes/ documented in #48200. A `hermes update --yes` run silently destroyed a user's .env, MEMORY.md, kanban.db, custom skills, and scripts. Two changes: 1. `_rmtree_writable` in tools/skills_sync.py now refuses to rmtree anything outside SKILLS_DIR (the HERMES_HOME/skills/ root). All five call sites pass paths under SKILLS_DIR, so the guard is a no-op for current code and a loud, recoverable failure for any future regression (bad path join, malicious bundled manifest, stale path in scope after an exception). 2. The default `updates.pre_update_backup` flips from false to true in hermes_cli/config.py. A few minutes of zip per update is negligible compared to silent total data loss. Still overridable; --no-backup still works for one-off opt-out. Five new tests in TestRmtreeWritableScopeGuard (root path, hermes home, sibling dir, skills root itself, subdir) plus a flipped `test_default_enabled_creates_backup` in test_backup.py. 178/178 tests pass in the two affected files. Public method signatures unchanged, no test-stub blast radius. Closes #48200	2026-06-18 08:53:35 -07:00
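The scope guard described in f1254c8eaf can be sketched roughly as below. This is a minimal illustration, not the code in tools/skills_sync.py: the name `rmtree_within` and its signature are hypothetical, and refusing the root directory itself is an assumption the commit message does not confirm.

```python
import shutil
from pathlib import Path


def rmtree_within(target, root):
    """Delete `target` only if it resolves to a path strictly inside `root`."""
    root_path = Path(root).resolve()
    target_path = Path(target).resolve()  # collapses ".." and symlinks first
    # Assumption: the root itself is refused too, which is stricter than
    # "anything outside the root".
    if root_path not in target_path.parents:
        raise ValueError(f"refusing to remove {target_path}: not inside {root_path}")
    shutil.rmtree(target_path)
```

Resolving both paths before comparing is what catches a bad path join such as `skills/../data`, which a plain string-prefix check would let through.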
kshitijk4poor	6777916068	fix(skills): surface list-modified hint on both update paths + disambiguate diff Salvage follow-up to the cherry-picked feat/test commits: - W1: the unpack/install update path in main.py printed the '~ N user-modified (kept)' notice without the new 'hermes skills list-modified' hint that the git-pull path got. Mirror the hint to both sites so the count is actionable regardless of which update path runs. - W2: 'hermes skills diff <name>' (bundled-vs-stock) now shares the verb with the gateway write-approval 'diff <id>'. The gateway handler's docstring + truncation message pointed users to '/skills diff <id>' on the CLI, which now resolves a bundled skill by that name instead. Point at the pending JSON file and note the two diff commands are distinct. - Add an invariant test asserting every 'user-modified (kept)' notice in main.py carries the discovery hint (guards sibling drift).	2026-06-18 12:28:11 +05:30
xxxigm	085fc5d001	feat(skills): find & diff user-modified bundled skills `hermes update` keeps (won't overwrite) bundled skills the user edited locally, but only printed a count — "~ N user-modified (kept)" — with no way to learn which skills, or see what changed. Reverting already existed (`hermes skills reset <name> [--restore]`); discovery and inspection did not. Add two CLI commands (zero model-tool footprint), reusing the manifest origin-hash that sync already maintains: - `hermes skills list-modified [--json]` — list the bundled skills whose on-disk copy diverges from the last-synced origin hash (the exact test the sync loop uses to decide what to skip). - `hermes skills diff <name>` — unified diff between the user's copy and the current bundled (stock) version, so the user can confirm what changed before reverting. Both are mirrored as `/skills list-modified` and `/skills diff`. The `hermes update` notice now points at `hermes skills list-modified`. Core helpers `list_user_modified_bundled_skills()` and `diff_bundled_skill()` live in tools/skills_sync.py alongside the existing reset logic.	2026-06-18 12:26:20 +05:30
kshitij	832d5967f8	Merge pull request #48262 from kshitijk4poor/salvage-32445 feat(memory): improve OpenViking setup UX (salvage #32445)	2026-06-18 11:34:11 +05:30
kshitijk4poor	6752da9a77	fix(dashboard): clean up upload temp file on client disconnect + pin python-multipart (NS-501) Follow-up to #47663 (streaming multipart upload), fixing two issues that landed with it. 1. Temp file leaked on client disconnect. The streaming upload endpoint's except chain caught only HTTPException / PermissionError / OSError — all Exception subclasses. asyncio.CancelledError, raised when a browser aborts a large upload mid-stream (the exact NS-501 scenario), is a BaseException, so it bypassed every except clause and reached a finally that only closed the file handle and never unlinked the temp file. Every aborted large upload orphaned a partial `.{name}.*.upload` file (up to ~100 MB) in the target directory. Cleanup now lives in finally, keyed on a `renamed` success flag, so the temp file is removed on every non-success exit including BaseException paths. Added test_stream_upload_cleans_temp_on_cancellation, which fails on the pre-fix code (leaks the temp file) and passes with the fix. 2. python-multipart pinned to ==0.0.27 instead of ==0.0.20. The package was already resolved at 0.0.27 transitively (via daytona) before #47663; the explicit ==0.0.20 pin in the [web] extra and the tool.dashboard lazy-install set downgraded it. Bumped both to ==0.0.27 and regenerated with `uv lock`, keeping the lockfile coherent. The base dependency stays >=0.0.9,<1.	2026-06-18 11:32:18 +05:30
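The cleanup pattern from 6752da9a77 (temp-file removal in `finally`, keyed on a `renamed` flag so it also runs when `asyncio.CancelledError` propagates) might look like the sketch below. `save_stream` and its parameters are hypothetical names for illustration; the real endpoint's signature, chunking, and size cap are not shown in the log and are omitted here.

```python
import asyncio
import os
import tempfile
from pathlib import Path


async def save_stream(chunks, dest: Path) -> None:
    """Stream `chunks` (an async iterable of bytes) into `dest` atomically."""
    fd, tmp_name = tempfile.mkstemp(
        prefix=f".{dest.name}.", suffix=".upload", dir=dest.parent
    )
    tmp = Path(tmp_name)
    renamed = False
    try:
        with os.fdopen(fd, "wb") as fh:
            async for chunk in chunks:
                fh.write(chunk)
        os.replace(tmp, dest)  # atomic rename into place
        renamed = True
    finally:
        # Runs for every exit path, including BaseException subclasses such
        # as asyncio.CancelledError that `except Exception` would not catch.
        if not renamed:
            tmp.unlink(missing_ok=True)
```

Putting the unlink in `finally` rather than in an `except` clause means no exception type has to be enumerated: anything short of a successful rename removes the partial file.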
kshitijk4poor	1153b42b24	Merge upstream/main into OpenViking setup-UX (salvage #32445 ) Resolves conflicts from the OpenViking churn that merged after #32445 was opened (#48042/#47662 session-switch + write hardening, #47311/#47973): - plugins/memory/openviking/__init__.py: keep both __init__ field groups (the PR's _runtime_start_* alongside main's _prefetch_threads/_shutting_down). - tests/plugins/memory/test_openviking_provider.py: keep BOTH the PR's new setup-validation tests and main's session-switch/concurrency tests (disjoint additions to the same region). Two fixes layered while reconciling (contributor work otherwise preserved): - Restore the merged tenant-header contract (#22414/#21232). The PR had changed _VikingClient defaults to '' and made empty account/user OMIT the tenant headers; main's contract is that empty falls back to 'default' and the X-OpenViking-Account/User headers are ALWAYS sent (ROOT API keys need them). Reverted the constructor to 'account or os.environ.get(..., "default")' and updated the two PR tests that asserted the omit-when-empty behavior. - Close a secret-file TOCTOU in the setup writers. _write_env_vars and _write_ovcli_config wrote the api_key/root_api_key file and chmod 0600 AFTERWARD, leaving a world-readable window on newly-created files. Added _precreate_secret_file() to create with 0600 before any secret bytes land.	2026-06-18 11:28:51 +05:30
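The secret-file TOCTOU fix in 1153b42b24 (create with 0600 before any secret bytes land, instead of chmod afterward) can be illustrated as follows. This is a sketch under assumptions: the function names and bodies here are illustrative and are not taken from the plugin's `_precreate_secret_file()`.

```python
import os


def precreate_secret_file(path):
    """Ensure `path` exists with owner-only permissions before any write."""
    # O_CREAT with mode 0o600: a brand-new file is never group/world readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    os.close(fd)
    # Tighten a file that already existed with looser permissions.
    os.chmod(path, 0o600)


def write_secret(path, secret):
    """Write `secret` only after the file is already locked down."""
    precreate_secret_file(path)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(secret)
```

The ordering is the point: the restrictive mode is applied while the file is still empty, so there is no window in which secret bytes sit in a world-readable file.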
Ben Barclay	c661634537	fix(dashboard): stream file uploads via multipart instead of base64 JSON (NS-501) (#47663 ) * fix(dashboard): stream file uploads via multipart instead of base64 JSON The dashboard file manager uploaded files (including backup/restore zip archives) by reading them client-side with FileReader.readAsDataURL and POSTing a base64 data URL inside a JSON body to /api/files/upload. For a large backup this (a) inflates the payload ~33%, (b) buffers the whole file plus its decoded copy in memory, and (c) reliably trips an upstream proxy body-size/timeout limit, surfacing as a 502 with the upload appearing to hang indefinitely (NS-501). Dashboard-only hosted users have no shell fallback to place the archive, so backup restore was unusable. Add a streaming multipart endpoint POST /api/files/upload-stream (UploadFile + Form) that reads the request body in 1 MiB chunks straight to a sibling temp file, enforces the existing 100 MB size cap as it streams (413 on overflow, before buffering the whole file), and atomically renames into place so a partial/aborted/over-limit upload never clobbers an existing file. The frontend api.uploadFile now sends multipart/form-data (raw bytes, no base64, browser-set boundary) and FilesPage passes the File object directly; the dead readAsDataUrl helper is removed. The legacy base64 JSON endpoint stays for backward compat. FastAPI's UploadFile/Form require python-multipart, which is NOT pulled in by fastapi itself, so it is added to the base deps, the [web] extra, and the tool.dashboard lazy-install set (kept in sync). 
Validated: 5 new endpoint tests (roundtrip, multi-chunk >1 MiB, over-limit 413 without clobbering + no temp-file leak, overwrite=false conflict, forced-root traversal containment); existing base64 tests still pass; web typecheck + vite build clean; and a real uvicorn server E2E (5 MB multipart upload -> HTTP 200 in 0.21s, exact byte match) plus a 30 MB TestClient roundtrip confirm constant-memory streaming end to end. Reported via beta (NS-501). * build(deps): regenerate uv.lock for python-multipart (NS-501) CI ran uv lock --check / uv sync --locked which failed because the python-multipart dependency add was not reflected in uv.lock. Regenerate the lockfile (resolves to 0.0.20, matching the [web] extra pin) after merging current main.	2026-06-18 15:54:32 +10:00
Ben Barclay	9c3c5da356	fix(backup): hermes import never overwrites volatile gateway runtime state (NS-501) (#48243 ) Importing a backup wrote every file from the zip over the target home wholesale. On a hosted instance this clobbered gateway_state.json with the source machine's last recorded run/desired state — driving the container-boot reconciler (container_boot._read_desired_state, which only auto-starts a gateway whose state is "running") off stale/foreign state and leaving the gateway stuck "starting", disconnected from the Nous portal. Add _IMPORT_SKIP_NAMES (gateway_state.json, gateway.pid, cron.pid, gateway.lock, processes.json) and skip them by basename in run_import, so both the root profile and named profiles preserve the target's own runtime state. This mirrors what container_boot._STALE_RUNTIME_FILES already sweeps on every container boot, and protects against older backups that predate the backup-side exclusions. The import summary reports which files were preserved. This is the second half of NS-501 (filed separately as NS-508): the upload 502 was fixed in #47663; this fixes the import-breaks-the-instance half.	2026-06-18 15:27:45 +10:00
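The skip-by-basename rule from 9c3c5da356 can be sketched as below. The five file names come from the commit message; `partition_members` and the constant name without the leading underscore are hypothetical, and the real `run_import` logic is not reproduced.

```python
from pathlib import PurePosixPath

# Volatile runtime files named in the commit message.
IMPORT_SKIP_NAMES = frozenset(
    {"gateway_state.json", "gateway.pid", "cron.pid", "gateway.lock", "processes.json"}
)


def partition_members(names):
    """Split zip member names into (to_restore, preserved) by basename."""
    to_restore, preserved = [], []
    for name in names:
        # Matching on the basename covers both the root profile and any
        # named profile directory the file sits under.
        if PurePosixPath(name).name in IMPORT_SKIP_NAMES:
            preserved.append(name)
        else:
            to_restore.append(name)
    return to_restore, preserved
```

Returning the preserved names separately mirrors the commit's note that the import summary reports which files were kept.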
Ben Barclay	4440d77bf3	fix(update): scope install-method stamp to the code tree, not $HERMES_HOME (#48188 ) The install method (docker/git/pip/...) describes the running binary, but detect_install_method() read it from $HERMES_HOME/.install_method — a shared DATA directory. The Docker docs deliberately bind-mount $HERMES_HOME (~/.hermes:/opt/data) so config/sessions/memory persist and can be shared with a host-side Desktop/CLI install. When a containerized gateway and a host install share one $HERMES_HOME, the home-scoped stamp is a single slot describing two installs: the published image stamps 'docker' on every boot, the host install then reads 'docker' and the in-app updater refuses to run 'hermes update' ("doesn't apply inside the Docker container"). Reinstalling the Desktop app from the DMG doesn't help because the contaminated stamp is re-read every time. Fix (option 1 — code-scoped stamp): - detect_install_method() reads <install tree>/.install_method first (next to the running code, immune to the shared data dir). It falls back to the legacy $HERMES_HOME stamp for back-compat, but IGNORES a 'docker' home stamp when not actually containerized — so already-poisoned shared homes self-heal. - stamp_install_method() writes the code-scoped stamp. - install.sh stamps $INSTALL_DIR instead of $HERMES_HOME. - Dockerfile bakes 'docker' into /opt/hermes/.install_method at build time (inside the immutable block); stage2-hook.sh no longer writes the home stamp and proactively removes a stale 'docker' one to heal existing shared homes. Genuine containers still resolve to 'docker' (baked stamp, or legacy home stamp honored when containerized). Unstamped installs in generic containers still fall through to git/pip (preserves the #34397 fix).	2026-06-18 14:14:41 +10:00
Ben Barclay	c276b017ad	feat(relay): connector⇄gateway channel auth + signed-HTTP inbound receiver + enroll CLI (#48147 ) * feat(relay): authenticate the connector⇄gateway WS channel The relay gateway may be customer-managed and internet-exposed, so the connector⇄gateway channel is itself authenticated (distinct from the platform crypto the relay path sheds). Add gateway/relay/auth.py — a Python port of the connector's HMAC token + delivery-signature schemes (relayAuthToken.ts / deliverySigning.ts), verified byte-for-byte against the connector's compiled TypeScript via cross-language test vectors. Present an Authorization bearer on the /relay WS upgrade keyed by the per-gateway secret (resolved from GATEWAY_RELAY_ID / GATEWAY_RELAY_SECRET in env or config). The connector rejects an unauthenticated/invalid/ revoked upgrade with close 4401. * feat(relay): signed-HTTP inbound delivery receiver The connector delivers normalized inbound events to a tenant's gateway over a signed HTTP POST, not the outbound /relay WS: the connector instance owning a platform socket is generally not the instance a given gateway dialed out to, so inbound targets a tenant endpoint that may load-balance across gateway instances. Add gateway/relay/inbound_receiver.py — verifies x-relay-signature / x-relay-timestamp over the EXACT raw request bytes (re-serializing would break the HMAC: JS JSON.stringify is compact, Python json.dumps spaces) against the per-tenant delivery key verify list within a 300s replay window, then dispatches messages to handle_message and interrupts to the interrupt handler. Wire it into the adapter lifecycle (start in connect() when a delivery key + bind port are configured, tear down in disconnect(); a purely-outbound dev gateway runs without it). 
Refine test_relay_sheds_crypto to distinguish PLATFORM crypto (Discord ed25519, Twilio/WeCom HMAC — still shed) from the connector⇄gateway CHANNEL auth (intended): auth.py / inbound_receiver.py are exempt from the platform-symbol scan but still banned from importing platform-crypto modules, plus a positive guard that auth.py uses only stdlib hmac/hashlib. * feat(relay): hermes gateway enroll CLI Add the gateway half of zero-touch enrollment. `hermes gateway enroll` resolves a fresh Nous Portal access token (the tenant-proving identity), POSTs {enrollmentToken, gatewayId} to the connector's /relay/enroll, and persists GATEWAY_RELAY_ID / GATEWAY_RELAY_SECRET / GATEWAY_RELAY_DELIVERY_KEY to ~/.hermes/.env. The per-gateway secret authenticates the WS upgrade; the per-tenant delivery key verifies signed inbound deliveries. Refuses under is_managed() (hosted installs get the secret stamped in by the orchestrator). Added as an 'enroll' subcommand on the existing gateway subparser — not a new top-level command. * docs(relay): inbound is signed HTTP, not WS; document channel auth Fix the stale contract: §3/§5 said inbound rode the WS socket (single- instance only, predates the multi-instance socket-ownership + channel-auth model). Inbound + connector→gateway interrupt are signed HTTP POSTs to the tenant endpoint. Add §6.1 documenting the two channel-auth schemes (per- gateway WS-upgrade secret, per-tenant inbound delivery key) and how they differ from the platform crypto the relay path sheds. * test(relay): update build_gateway_parser callers for cmd_gateway_enroll The enroll subcommand added cmd_gateway_enroll as a required keyword-only arg to build_gateway_parser, but two existing parser-extraction tests still called it with only cmd_gateway/cmd_proxy — failing CI with TypeError. Thread the new handler through both call sites and add a test asserting `gateway enroll` dispatches to cmd_gateway_enroll with its flags parsed.	2026-06-18 12:01:54 +10:00
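The signed-HTTP inbound check from c276b017ad (verify over the exact raw bytes, against a list of delivery keys, within a 300s replay window) might be sketched like this. The signed-message layout (`"<timestamp>." + body`, HMAC-SHA256, hex) and the function names are assumptions for illustration; the log does not specify the connector's actual scheme.

```python
import hashlib
import hmac
import time

REPLAY_WINDOW_SECONDS = 300  # window stated in the commit message


def sign(key: bytes, timestamp: str, raw_body: bytes) -> str:
    # Assumed layout: HMAC-SHA256 over "<timestamp>." + raw body, hex-encoded.
    return hmac.new(key, timestamp.encode() + b"." + raw_body, hashlib.sha256).hexdigest()


def verify_delivery(keys, timestamp, signature, raw_body, now=None):
    """Return True if any key in `keys` validates the signature in time."""
    now = time.time() if now is None else now
    try:
        ts = int(timestamp)
    except (TypeError, ValueError):
        return False
    if abs(now - ts) > REPLAY_WINDOW_SECONDS:
        return False  # too old or too far in the future
    for key in keys:  # a verify list allows key rotation
        expected = sign(key, timestamp, raw_body)
        if hmac.compare_digest(expected.encode(), signature.encode()):
            return True
    return False
```

The raw bytes must be verified before any JSON parsing: compact JSON such as `{"a":1}` becomes `{"a": 1}` after a `json.loads` / `json.dumps` round trip in Python, so the digest no longer matches.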
Ben Barclay	fcf6cb3d73	fix(docker): supervised gateway uses --replace to take over stale holder (NS-505) (#47555 ) * fix(docker): supervised gateway uses --replace to take over stale holder Inside the s6 container image the per-profile gateway service rendered a bare `hermes gateway run` (no --replace). When a gateway is started OUTSIDE s6 — a stray shell `hermes gateway run`, an agent action, or the Open WebUI helper (scripts/setup_open_webui.sh) — it grabs the per-HERMES_HOME PID lock first. The supervised slot then execs the bare `gateway run`, hits the "Another gateway instance is already running" guard, exits non-zero, and s6 restarts it: a restart loop that floods the log every ~12s and never binds. The container looks up but the gateway is permanently down, and dashboard-only users (no shell) cannot recover. Render the supervised run script as `gateway run --replace` so s6 is authoritative for its slot: it reaps the stale holder via the hardened takeover path (takeover marker + SIGTERM->SIGKILL-with-confirmation + scoped-lock cleanup in gateway/run.py) and binds. This matches the systemd service path, which already builds its argv with --replace (_build_gateway_argv / 'nohup hermes gateway run --replace'), and the intent already documented in _maybe_redirect_run_to_s6_supervision. The existing HERMES_S6_SUPERVISED_CHILD sentinel still prevents the run->start->run redirect recursion. Each profile is scoped to its own HERMES_HOME and s6 guarantees one supervised instance per slot, so there is no legitimate supervised sibling for --replace to clobber. Reported via beta (NS-505): gateway.log showed PID 17907 'running (manual process)' with the guard error repeating every ~12s on v2026.6.5. Adds a regression test asserting every gateway-run exec line in the rendered script (default + named profile, both privilege branches) carries --replace, and updates the existing render-script assertion. 
* fix(ci): remove stray .venv symlink committed into repo The PR's commit accidentally tracked a .venv symlink pointing at the developer's local venv (mode 120000 -> /home/ben/nous/hermes-agent/.venv). The CI test/e2e/build jobs run `uv venv` to create .venv and failed with `failed to create directory .venv: File exists (os error 17)` because the checkout already contained the symlink. All test shards aborted in <15s during setup, before any test ran. Untrack the symlink and add a bare `.venv` entry to .gitignore (the existing `.venv/` rule only matches a directory, so a symlink slipped through).	2026-06-18 10:49:02 +10:00
Teknium	9ba4615db2	fix(dump): show commit date instead of release date in hermes debug (#48104) * feat(mcp): raise default tool-call timeout 120s -> 300s Port from openai/codex#28234. Long-running MCP tools (web fetches, sandboxed builds, deep-research servers) routinely exceed 120s, causing spurious timeout failures. Codex bumped its default MCP tool timeout from 120 to 300 for the same reason. - _DEFAULT_TOOL_TIMEOUT 120 -> 300 in tools/mcp_tool.py (per-server 'timeout' config override unchanged) - update test_default_timeout assertion - document the default in mcp-config-reference.md * fix(dump): show commit date instead of release date in hermes dump The version line in `hermes dump` (the top of the /debug report) appended the package release date in parentheses, which reads like a wall-clock "generated at" timestamp and confuses support triage. Replace it with the date the HEAD commit was actually made, resolved live via `git log -1 --format=%cd --date=short`, kept next to the commit SHA. On Docker/wheel installs with no .git the date resolves to '' and the suffix is simply omitted (the baked SHA still identifies the build).	2026-06-17 16:53:42 -07:00
brooklyn!	c1f9eb0ec4	fix(desktop): resolve electronDist dynamically + self-heal blocked installs (supersedes #48081/#48082) (#48091 ) * fix(desktop): resolve electronDist dynamically + self-heal blocked installs Supersedes the static-path approach (#48081) and the install-step self-heal (#48082) with a fix that removes the whole failure class instead of chasing each symptom. Three distinct faults converged into the June desktop-build outage; this closes all three. Root cause (the part #48081 left open — "Gap B"): build.electronDist was a static relative path in apps/desktop/package.json, but npm workspace hoisting is NOT deterministic — depending on the npm version and what else is installed, npm nests the workspace-only electron devDep under apps/desktop/node_modules/electron OR hoists it to the repo root. A static path matches only one layout, so a clean install intermittently fails with "The specified electronDist does not exist". #48081 re-pointed the path at the nested layout (correct today) but electron-builder reads electronDist STATICALLY, so any future hoist change silently breaks it again — only caught by a CI invariant, never self-corrected. Fix: - scripts/run-electron-builder.cjs: resolve electron the way Node's runtime does — require.resolve("electron/package.json") walks node_modules from the desktop project upward and finds electron wherever npm actually put it. The path can never drift out of sync with the install layout again, on any OS/npm version. * dist present -> pass -c.electronDist=<abs>/dist so electron-builder reuses the unpacked runtime (keeps the #38673 fast path that dodges the 26.8.x missing-binary re-unpack bug). * dist absent -> omit electronDist; electron-builder fetches Electron itself via @electron/get honoring electronVersion + ELECTRON_MIRROR. package.json: builder script now runs the wrapper; the static build.electronDist is removed (the resolver owns it). 
- main.py / install.sh / install.ps1: on a dependency-install failure where the electron package staged but its dist is missing (electron's install.js process.exit(1) on a blocked/throttled binary download — #47266/#47917/#48021), repopulate the dist via electron's downloader (canonical, then npmmirror.com) and CONTINUE to the build instead of aborting. npm runs postinstall LAST, so the only casualty is electron/dist; bailing here is what made the pack-time mirror self-heal unreachable on a blocked network. Hard-fail only when electron never staged at all (a genuine dependency error). - The pack-time mirror fallback now retries the build even when the pre-fetch can't populate the dist: the wrapper lets electron-builder download Electron itself via the mirror, so the retry is no longer a no-op (it was, when electronDist was a static path). The exact 40.10.2 pin (already on main) keeps the third mode — the native @electron-internal/extract-zip win32 binding that 40.10.3/40.10.4 ship without a published prebuild — from recurring. Tests: - test_desktop_electron_pin.py: replace the static-path-matches-lockfile invariant with contracts that there is no hardcoded electronDist to drift, the builder script routes through the resolver, and the resolver uses Node module resolution + injects -c.electronDist. - test_gui_command.py: install-failure self-heal continues to build; genuine (electron-never-staged) install failure still hard-fails; pack retries under the mirror even when the pre-fetch is blocked. Salvages/supersedes the overlapping community work in #48003 (sitkarev), #48012 (omegazheng), #48033 (james47kjv), and #48082. 
Co-authored-by: sitkarev <59806492+sitkarev@users.noreply.github.com> Co-authored-by: omegazheng <zheng@omegasys.eu> Co-authored-by: james47kjv <220877172+james47kjv@users.noreply.github.com> * fix(desktop): narrow Electron self-heal to real missing-dist failures Follow-up on #48091 to remove the remaining misdiagnosis risk from the installer/build fallback path (#46785 concern): only take the Electron repair/retry path when Electron's package files are staged and dist is actually missing/corrupt. - main.py: add _electron_pkg_staged_missing_dist() and use it to gate install failure recovery; fail fast for unrelated npm install errors. - main.py/install.sh/install.ps1: run cache purge + retry only when dist is missing; do not retry unrelated tsc/vite/build failures under an Electron-specific narrative. - install.sh/install.ps1: tighten install-stage self-heal guard to require both package.json + install.js and missing dist. - tests: add coverage that install failure hard-fails when Electron dist already exists, and update retry test to reflect the tightened recovery condition. Validation: - Python tests: 64 passed - install.sh-related tests included in the run - Real mac build on this machine: - npm ci at repo root: success - cd apps/desktop && npm run pack: success - electron-builder packaged darwin arm64 and used custom unpacked Electron dist * refactor(desktop): trim electron self-heal helpers and comments Deduplicate mirror-retry into _try_redownload_electron_dist / shell counterparts; shorten wrapper and install-script commentary without changing recovery semantics. --------- Co-authored-by: sitkarev <59806492+sitkarev@users.noreply.github.com> Co-authored-by: omegazheng <zheng@omegasys.eu> Co-authored-by: james47kjv <220877172+james47kjv@users.noreply.github.com>	2026-06-17 18:48:35 -05:00
Teknium	f8098c6b6f	fix(desktop): resolve electronDist to the actual electron install location (#48081 ) After the June lockfile regeneration (#46652) floated electron and reshuffled npm workspace hoisting, the desktop pack fails with "The specified electronDist does not exist". apps/desktop/package.json pointed electronDist at the repo root (../../node_modules/electron/dist) while npm now installs electron nested under apps/desktop/node_modules/electron. The two contradict, so a clean install can never package the app (Windows + macOS). - electronDist -> node_modules/electron/dist (resolved relative to apps/desktop, i.e. the workspace-local install npm actually produces). - hermes_cli/main.py, scripts/install.sh, scripts/install.ps1: add a runtime electron-dir resolver that prefers apps/desktop/node_modules/electron and falls back to the root hoist, so dist checks + the mirror re-download work under either npm layout. - patch-electron-builder-mac-binary.cjs: try the workspace-local Electron.app before the root hoist in the macOS binary-restore fallback (sibling site no PR touched). - test: assert build.electronDist resolves to where the lockfile installs electron, so a future hoist change (root <-> nested) can't silently break it. Salvages the overlapping work in #48003 (sitkarev), #48012 (omegazheng), and #48033 (james47kjv). Co-authored-by: sitkarev <59806492+sitkarev@users.noreply.github.com> Co-authored-by: omegazheng <zheng@omegasys.eu> Co-authored-by: james47kjv <220877172+james47kjv@users.noreply.github.com>	2026-06-17 18:08:01 -05:00
kshitij	49d7481dfb	Merge pull request #47706 from NousResearch/fix/cli-login-deprecation-graceful fix(cli): deprecated `hermes login` fails gracefully for any provider	2026-06-17 23:02:32 +05:30
definitelynotguru	eaddeaf2e6	feat(xai): add grok-composer-2.5-fast to xAI OAuth model picker The model is callable via xAI OAuth but omitted from models.dev and /v1/models listings. Merge it into the curated xAI catalog so it appears in `hermes model` without requiring a custom model name.	2026-06-17 09:49:46 -07:00
Teknium	c6c8abbadb	refactor: remove agent-callable send_message tool (#47856 ) * feat(mcp): raise default tool-call timeout 120s -> 300s Port from openai/codex#28234. Long-running MCP tools (web fetches, sandboxed builds, deep-research servers) routinely exceed 120s, causing spurious timeout failures. Codex bumped its default MCP tool timeout from 120 to 300 for the same reason. - _DEFAULT_TOOL_TIMEOUT 120 -> 300 in tools/mcp_tool.py (per-server 'timeout' config override unchanged) - update test_default_timeout assertion - document the default in mcp-config-reference.md * refactor: remove agent-callable send_message tool The agent should not decide on its own to fire off cross-platform messages or reactions. Outbound platform messaging is handled outside the agent loop — cron delivery, the gateway kanban notifier (dashboard-toggled), and the `hermes send` CLI. Removes the model-tool registration only; the send engine in send_message_tool.py (_send_to_platform, _send_via_adapter, _parse_target_ref, per-platform _send_* helpers) is kept intact for those non-agent callers. Drops the now-empty 'messaging' toolset and its `hermes tools` toggle. Yuanbao DM guidance now points at the native yb_send_dm tool.	2026-06-17 07:11:23 -07:00
Teknium	cbfa018aef	fix(auth): retry Codex device-code login on 429 with clear rate-limit message (#47860 ) The OpenAI device-code login (POST auth.openai.com/.../deviceauth/usercode) had no retry or 429 handling — a transient throttle from OpenAI surfaced as a bare "Device code request returned status 429" with no guidance, reading as a hard login failure. - Retry the device-code request with capped exponential backoff (honoring Retry-After), up to 4 attempts. - On persistent 429, raise a clear AuthError tagged CODEX_RATE_LIMITED_CODE (classified transient, not a credential problem) with a wait hint. - Apply the same 429 classification to the token-exchange step (same bug class). Unrelated to PR #47399 (Responses-API cache headers); this is the OAuth device-code path in hermes_cli/auth.py.	2026-06-17 05:48:35 -07:00
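The retry policy in cbfa018aef (capped exponential backoff, honoring Retry-After, up to 4 attempts, then a clear rate-limit error) can be sketched as below. All names are hypothetical and the delay constants are assumptions; `send` stands in for the real HTTP call so the sketch runs without a network, and only the delta-seconds form of Retry-After is handled.

```python
import time

MAX_ATTEMPTS = 4  # attempt count stated in the commit message


class RateLimitedError(Exception):
    """Raised when the endpoint is still throttling after all attempts."""


def request_with_backoff(send, sleep=time.sleep, base_delay=1.0, max_delay=30.0):
    """Call `send()` -> (status, headers, body), retrying on HTTP 429."""
    for attempt in range(MAX_ATTEMPTS):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == MAX_ATTEMPTS - 1:
            break  # no point sleeping after the final attempt
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)  # server-provided hint wins
        else:
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
        sleep(min(delay, max_delay))
    raise RateLimitedError("rate limited by the server; wait a moment and retry")
```

Raising a dedicated exception type keeps a persistent 429 distinguishable from a credential failure, which is the classification the commit describes.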
teknium1	06d907dc4e	fix(dashboard): only run runtime-pid liveness fallback against local status get_runtime_status_running_pid() validates liveness with a local os.kill(pid, 0) probe. In /api/status the runtime record can be the REMOTE health-probe body (cross-container), whose PID belongs to another host and is display-only — probing it locally is wrong and trips the test live-system guard (os.kill on a PID outside the test subtree). Run the fallback only against the local read_runtime_status() record.	2026-06-17 05:40:57 -07:00
teknium1	dc86d48a3e	fix(dashboard): use await-safe config-only scope for /api/status profile _profile_scope swaps process-global skills_tool/skill_manager module attrs under an RLock; /api/status holds that scope across the run_in_executor remote-health probe await, so a concurrent /api/skills?profile=X request can cross-restore the status profile's skill dir on its finally. Add _config_profile_scope (contextvar-only, task-local, await-safe) and use it for status, which only resolves get_hermes_home() at call time for config/env/gateway state and never needs the skills-module globals.	2026-06-17 05:40:57 -07:00
Shannon Sands	674e8b098a	Fix dashboard gateway profile scoping	2026-06-17 05:40:57 -07:00
Teknium	f80381c456	feat(prompt): scale context-file cap to model window + point agent at truncated file (#47846 ) Context files (AGENTS.md, CLAUDE.md, .hermes.md, .cursorrules, SOUL.md) were hard-capped at a flat 20K chars before head/tail truncation. Among the agent harnesses we track, only Codex caps project docs at all (32 KiB); Claude Code, OpenCode, and Cline load them whole. The flat 20K predates large context windows and silently truncates real-world AGENTS.md files. B — dynamic cap: when context_file_max_chars is unset (now the shipped default), the cap scales with the model's context window (ctx_tokens * 4 * 0.06, floor 20K, ceiling 500K). Small-context models stay at the historical 20K; a 200K model gets 48K; large models stop truncating real docs. An explicit context_file_max_chars still wins. Context length is resolved once per conversation (stable -> prompt cache untouched). C — when truncation does happen, the marker now names the concrete file path and tells the agent to read_file it for the full content. Validation: 154 targeted tests + full agent/ + hermes_cli/ + test_config (0 failures); E2E against a real 60K AGENTS.md confirms small windows truncate with the path-bearing marker, large windows load whole, and the system prompt is byte-stable across rebuilds.	2026-06-17 05:40:26 -07:00
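The dynamic cap in f80381c456 (ctx_tokens * 4 * 0.06, floor 20K, ceiling 500K, explicit setting wins) works out as in the sketch below. The function and parameter names are hypothetical; integer arithmetic (6 percent as `* 6 // 100`) is used here only to keep the example free of float rounding.

```python
def context_file_cap(ctx_tokens, explicit_cap=None, floor=20_000, ceiling=500_000):
    """Return the context-file character cap for a model's context window."""
    if explicit_cap is not None:
        return explicit_cap  # an explicit context_file_max_chars still wins
    # About 4 chars per token, 6% of the window: ctx_tokens * 4 * 0.06.
    scaled = ctx_tokens * 4 * 6 // 100
    return max(floor, min(ceiling, scaled))


print(context_file_cap(32_000))     # small window stays at the 20K floor
print(context_file_cap(200_000))    # 48000, matching the commit message
print(context_file_cap(4_000_000))  # clamped to the 500K ceiling
```

Under these numbers the floor applies to any window up to roughly 83K tokens, and the ceiling is reached only past about 2.08M tokens.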
Teknium	7bbffceb9c	feat(curator): make skill consolidation opt-in (prune stays default-on) (#47840) The curator now defaults to prune-only: the deterministic inactivity pass (mark stale / archive long-unused skills) still runs whenever the curator is enabled, but the opinionated LLM umbrella-building consolidation fork is OFF by default. - agent/curator.py: add DEFAULT_CONSOLIDATE=False + get_consolidate(); gate the forked aux-model review in run_curator_review behind it (new consolidate param, None=read config). When off, the LLM pass is skipped entirely (no aux-model cost); the run is still recorded and reported. - config.py: add curator.consolidate (default false); v29->v30 migration seeds the key for existing installs without clobbering a user-set value. - hermes_cli/curator.py: 'hermes curator run --consolidate' override; status shows consolidate state; prune-only notice on run. - docs + tests.	2026-06-17 05:20:32 -07:00


2851 commits