hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-01 12:02:05 +00:00

Author	SHA1	Message	Date
Teknium	27322612b4	fix(update): route loud build/installer output to update.log instead of the terminal (#53616 ) * fix(update): route loud build/installer output to update.log instead of the terminal hermes update flooded the terminal with the full vite asset dump, electron-builder logs, npm deprecation warnings from the desktop build, and the cua-driver installer's 'Next steps' wall. All of that is low-signal noise the user doesn't need on a successful update. - Capture the desktop --build-only subprocess (vite + electron-builder) into ~/.hermes/logs/update.log; print a one-line status, and on failure surface the last 15 lines + a pointer to the full log. - Capture the cua-driver installer's output when verbose=False (the hermes update refresh path); concise upgrade line is unchanged. - Add _log_only_write() / _run_logged_subprocess() helpers that write to the update.log handle without echoing to the terminal. The repo-root npm install keeps streaming (capture_output=False) — that is the deliberate #18840 guard so a slow postinstall download doesn't look hung. The desktop npm install is a separate Electron process with no such progress concern and is captured. * fix(update): persist full cua-driver installer output to update.log The captured cua-driver installer output was only sent to logger.debug (agent.log) on failure, so the 'Next steps' wall was lost from update.log entirely on success. Write the full captured output straight to the update.log handle (sys.stdout._log) on both success and failure, matching the desktop-build capture, so update.log keeps the complete record of everything an update did.	2026-06-27 11:43:01 -07:00
Teknium	917f6bdb00	fix(tools): let vision pick any provider+model, not just OpenRouter (#53606 ) * fix(tools): let vision pick any provider+model, not just OpenRouter hermes tools → configure → vision no longer forces an OPENROUTER_API_KEY. It now offers the same any-provider surface as the model command: Auto (use main model / aggregator fallback), pick any authenticated provider + model, or a custom OpenAI-compatible endpoint. Selections persist to auxiliary.vision.{provider,model,base_url} — the keys the vision resolver already reads. Custom endpoint pins provider=custom so base_url routes correctly. Reconfigure path uses the same picker instead of re-prompting for OPENROUTER_API_KEY. * docs: add PR infographic for vision any-provider picker	2026-06-27 04:41:42 -07:00
ms-alan	16192103f4	fix(config): accept placeholder base_url in custom provider validation _normalize_custom_provider_entry() ran urlparse() on base_url and dropped any entry whose value was an un-expanded placeholder, so a caller reaching the normalizer with raw config (e.g. the Dockerized gateway path) silently skipped the provider with a 'not a valid URL' warning. Skip URL validation when the candidate contains a placeholder token — both ${ENV_VAR} env-refs and bare {region}-style templates — since those are expanded at runtime. Closes #14457	2026-06-27 04:15:27 -07:00
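The placeholder check described in 16192103f4 can be sketched roughly as follows. This is a minimal illustration, not the body of `_normalize_custom_provider_entry()`: the helper names `has_placeholder()` and `is_acceptable_base_url()` and the exact regular expression are assumptions made for this example.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: matches ${ENV_VAR} env-refs and bare {region}-style templates.
_PLACEHOLDER_RE = re.compile(r"\$\{[^}]+\}|\{[^{}]+\}")


def has_placeholder(candidate: str) -> bool:
    """Return True when the value still contains an un-expanded placeholder token."""
    return bool(_PLACEHOLDER_RE.search(candidate))


def is_acceptable_base_url(candidate: str) -> bool:
    """Skip URL validation for placeholders; otherwise require an http(s) URL with a host."""
    if has_placeholder(candidate):
        # Expanded at runtime, so it cannot be validated yet.
        return True
    parsed = urlparse(candidate)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

With this shape, `${MY_BASE_URL}` and `https://{region}.example.com/v1` are both kept, while a value such as `not a url` is still rejected.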
Teknium	5ab4136631	fix(webui): switch provider when Config-page model field changes (#53583 ) The dashboard Config tab's Model field is a flat string with no provider info. _denormalize_config_from_web only updated model.default and kept the stale provider, so picking an OpenRouter model while the default provider was ollama-local left provider=ollama-local and every call 404'd. When the model string actually changes, infer the serving provider — curated catalog first, then a vendor/model-slug heuristic for non-aggregator providers — and route the switch through the existing _normalize_main_model_assignment / _apply_main_model_assignment chokepoints so stale base_url/api_mode/api_key are cleared on a provider change and preserved on a same-provider re-pick. Saving an unchanged model never re-detects, so unrelated config saves keep an explicit provider. Closes #14058	2026-06-27 04:13:44 -07:00
blaryx	76af2456a2	fix(dashboard): merge PUT /api/config with existing on-disk config The dashboard form is built from CONFIG_SCHEMA, which doesn't enumerate every root-level key the YAML supports. Most visibly, `custom_providers` is in `_KNOWN_ROOT_KEYS` but is absent from the schema — so the frontend never sends it in the PUT body. The previous full-replace save() then silently wiped the key from disk every time the user clicked anything that triggered a save. Other casualties (less visible because defaults re-mask them on load) include `agent.personalities`, `agent.reasoning_effort`, `terminal.lifetime_seconds`, etc. Fix: read the raw on-disk config and deep-merge the incoming PUT body on top of it before saving. The frontend can only overwrite what it explicitly sends; everything else is preserved verbatim. Reuses the existing `_deep_merge` helper from `hermes_cli.config`. Tests: - `test_round_trip_preserves_custom_providers` exercises the exact bug: seed config with custom_providers, GET → drop the key → PUT, assert it's still on disk. - `test_round_trip_preserves_schema_invisible_nested_keys` covers the shallow-vs-deep-merge case for nested dicts under `agent` etc. Both fail on current main; both pass with this patch.	2026-06-27 03:48:18 -07:00
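The merge-before-save behaviour from 76af2456a2 can be illustrated with a small recursive merge. This is a sketch under assumptions: `deep_merge()` here is a stand-in written for this example, not the `_deep_merge` helper from `hermes_cli.config`, and the `agent.max_turns` key is hypothetical.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# On-disk config holds keys the dashboard schema never sends.
on_disk = {
    "custom_providers": [{"name": "local"}],
    "agent": {"personalities": {"x": 1}, "max_turns": 10},
}
# The PUT body only carries what the form knows about.
put_body = {"agent": {"max_turns": 20}}

merged = deep_merge(on_disk, put_body)
```

A full-replace save would write `put_body` alone and lose `custom_providers` and `agent.personalities`; the merged result keeps both and only updates the key the frontend actually sent.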
teknium1	a5d1f68c74	refactor(moa): share one virtual-provider row builder across pickers Follow-up on the gateway-picker salvage: the cherry-picked change added a second copy of the MoA virtual-provider row in model_switch.py, duplicating inventory._moa_provider_row (same slug/name/preset-models, identical extra fields). Make _moa_provider_row take a bare current_provider string and reuse it from the gateway picker path so the row shape lives in one place and the two surfaces can't drift.	2026-06-27 03:43:38 -07:00
dodo-reach	ed54469d06	fix(gateway): show MoA presets in model picker	2026-06-27 03:43:38 -07:00
briandevans	8dd4e576d0	fix(moa): tolerate non-list reference_models in hand-edited MoA preset config	2026-06-27 03:43:16 -07:00
Teknium	60f58a2b95	feat(verify-on-stop): default OFF, one-time migration, skip doc-only edits (#53552 ) The verify-on-stop guard fired too eagerly — including on doc/markdown/skill edits with nothing to verify, where it pushed a pointless /tmp verification script. Three changes: 1. Default OFF for new installs: agent.verify_on_stop defaults to false (was the "auto" surface-aware sentinel). _config_version bumped 30 -> 31. 2. One-time migration (v30 -> v31): existing installs are switched off once, but only when the value is missing or still the "auto" sentinel — an explicit true/false the user set is preserved. 3. Path filter: build_verify_on_stop_nudge() now drops documentation/prose paths (.md/.mdx/.rst/.txt/LICENSE/CHANGELOG/...) so even when explicitly enabled, a doc-only turn never nudges. Mixed doc+code turns still nudge on the code paths. The legacy "auto" sentinel is still honored when set explicitly (ON for interactive coding surfaces, OFF for messaging). HERMES_VERIFY_ON_STOP env override unchanged.	2026-06-27 03:23:22 -07:00
Versun	c655cdf2c1	feat(dashboard): expose cron job execution fields	2026-06-27 03:20:32 -07:00
teknium1	50f6855217	feat(moa): make /moa one-shot only; route preset switching through the model picker /moa no longer does a sticky model switch. It now always runs a single prompt through the default MoA preset and restores the prior model afterward; the whole argument is the prompt (no preset-name matching). To switch to a MoA preset for the session, select it from the model picker, where presets already surface under a virtual Mixture of Agents provider on every model-selection surface. Also fixes #53444: the TUI one-shot only set session[model_override], which the already-built cached agent ignored, so MoA silently never ran and the turn used the original model. The TUI now does a real in-place agent.switch_model() via _apply_model_switch() when a live agent exists (with a proper restore after the turn), and falls back to a model_override for lazy/unbuilt sessions. Removes the redundant sticky-switch branch from the CLI, gateway, and TUI /moa handlers; updates the command description, usage string, and docs.	2026-06-27 03:09:09 -07:00
Teknium	d712a7fd73	fix(model-picker): surface the current custom/uncurated model in picker rows (#53457) A model selected via the CLI (e.g. /model openrouter/<uncurated-name>) was absent from every model picker — the main picker AND the MoA reference/aggregator slot pickers — because each provider row only carried its curated catalog. Inject the current model at the front of its provider's row so it is selectable and shown everywhere.	2026-06-27 00:06:34 -07:00
Nacho Avecilla	dbe734beff	fix(dashboard-auth): exclude non-interactive providers from interactive login surfaces (#53239) * Return None instead of erroring on drain login failure * Fix login on drain * Remove login for drained endpoints flow and clean the code * chore: drop unrelated credits changes from this PR * Remove extra comments that were not really necessary	2026-06-27 10:08:13 +10:00
zapabob	e55ddc3e33	fix(mcp): suppress interactive OAuth stdin prompts during background discovery (#35927) When an MCP server requires OAuth, the interactive `hermes` TUI froze on startup: background MCP discovery hit the OAuth flow, which on an interactive TTY spawns a daemon thread doing a blocking `sys.stdin.readline()` (the "paste the redirect URL" fallback in mcp_oauth._wait_for_callback). That thread competes with the TUI's own stdin reader for the same terminal, so keystrokes get swallowed and the TUI appears frozen (up to the 300s OAuth timeout). Reported symptom: "MCP OAuth: authorization required / Open this URL ... the tui is freezing, not respond to typing." Add a thread-local `suppress_interactive_oauth()` context manager in tools/mcp_oauth.py; `_is_interactive()` returns False while it's active, so the stdin paste-thread and prompt are never created. Background discovery (hermes_cli/mcp_startup.py, tui_gateway/entry.py) now runs discovery inside that context, so OAuth-requiring servers soft-skip (raise OAuthNonInteractiveError, already handled) instead of stealing the TUI's stdin. A real `hermes mcp login` on the main thread is unaffected (thread-local). Salvaged from #35945 by @zapabob (authorship preserved via cherry-pick; resolved a conflict against main's new mcp_discovery_timeout / wait_for_mcp_discovery refactor, keeping both). Verified E2E: with suppression the paste prompt is NOT printed and no stdin thread spawns (raises OAuthNonInteractive soft-skip); without it the prompt shows (the freeze). Mutation-verified (removing the suppress check in _is_interactive fails the regression test). 76 tests pass, ruff clean. Closes #35927.
SELF-REVIEW FIX: the original #35945 used threading.local(), which does NOT propagate to the dedicated mcp-event-loop thread where OAuth actually runs (discover_mcp_tools dispatches the connect via run_coroutine_threadsafe), so the suppression was a NO-OP in production (the tests passed only by stubbing out the cross-thread dispatch). Converted to a contextvars.ContextVar, which asyncio copies onto the scheduled coroutine — empirically verified suppression now holds on the mcp-event-loop thread through the real _run_on_mcp_loop path. Added a cross-thread regression test (fails on threading.local, passes on the ContextVar) so the no-op can't regress.	2026-06-27 04:59:23 +05:30
kshitijk4poor	244a6f2ceb	fix(desktop): broken "Open setup guide" button for plugin platforms On the desktop Channels / Messaging page, the "Open setup guide" button was rendered as a bare <a href={platform.docs_url} target="_blank"> with no guard. Plugin-provided platforms (Microsoft Teams, Google Chat, Line, Raft, Yuanbao, …) ship an empty docs_url, so the anchor's href was "". In a packaged build, Electron resolves an empty href against the current document — the app's own index.html inside the asar bundle — and shell.openPath then fails with an OS "file not found" dialog. This is exactly the Windows error reported for Messaging → Teams → Open guide. Fix (3 changes): 1. fix(desktop) — Only render the "Open setup guide" button when docs_url is non-empty, and route clicks through openExternalLink so a relative/empty value can never be treated as a local bundle path. Fixes the whole class (every plugin platform), not just Teams. 2. fix(messaging) — Give the Teams platform plugin a real docs_url (Microsoft Teams setup guide) so its card shows a working button instead of nothing. 3. fix(messaging) — Give the Google Chat platform plugin a real docs_url (Google Chat setup guide) so its card shows a working button instead of nothing. Originally from #48940; folded in here because that PR's test was broken (it queried the HTTP endpoint, but google_chat is a dynamic enum member that only appears after the adapter module is imported). Test plan: - apps/desktop — new src/app/messaging/index.test.tsx: button is hidden when docs_url is empty; a real URL opens via the validated external opener (does not navigate). - apps/desktop typecheck (tsc --noEmit) clean. - backend — test_teams_messaging_metadata_links_setup_guide: the Teams catalog entry exposes the setup-guide docs_url. - backend — test_google_chat_messaging_metadata_links_setup_guide: the Google Chat catalog entry exposes the setup-guide docs_url. 
Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com> Co-authored-by: p-andhika <andhika.prakasiwi@gmail.com>	2026-06-27 04:34:08 +05:30
kshitijk4poor	cdb1dfbc49	fix: use os.pathsep, add tests, update tips for multi-root support - Use os.pathsep instead of literal ':' so Windows paths (C:\dir) and the Windows separator ';' work correctly. - Add 9 tests covering multi-root behavior: writes inside first/second root, writes outside all roots, trailing/leading/double separators, all-separators edge case, static deny priority, duplicate dedup. - Update hermes_cli/tips.py tip string to mention multiple paths. - Update docs to mention os.pathsep / ; on Windows. Follow-up for salvaged PR #49557.	2026-06-27 04:01:12 +05:30
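The separator handling listed in cdb1dfbc49 (trailing, leading and doubled separators, an all-separators value, duplicate roots) can be sketched as below. The function name `split_roots()` is hypothetical, and the real multi-root parsing code is not shown in this log.

```python
import os


def split_roots(raw: str) -> list[str]:
    """Split a multi-root value on os.pathsep, dropping empty parts and duplicates."""
    roots: list[str] = []
    for part in raw.split(os.pathsep):
        part = part.strip()
        if part and part not in roots:
            roots.append(part)
    return roots
```

Splitting on `os.pathsep` rather than a literal `:` matters on Windows, where `:` appears inside drive-letter paths such as `C:\dir` and the separator is `;`.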
Teknium	7e101e553b	fix(moa): block the moa virtual provider as a reference or aggregator slot (#53281) A MoA preset whose reference or aggregator slot points at the moa virtual provider creates a recursive MoA tree. The runtime guards in moa_loop.py only surface this mid-turn (references silently skipped, aggregator raises). Reject it at the config chokepoint (_clean_slot) so it can never be saved, and hide it from the desktop/dashboard slot pickers so it isn't offered as a dead choice.	2026-06-26 14:42:42 -07:00
srojk34	f0678b031e	fix(moa): tolerate non-numeric values in hand-edited MoA preset config _normalize_preset uses bare float() and int() to coerce reference_temperature, aggregator_temperature, and max_tokens from config.yaml. When a user hand-edits a non-numeric value (e.g. max_tokens: "8k" or reference_temperature: "hot"), the coercion raises ValueError. Since normalize_moa_config runs on every model-selection and MoA turn (via resolve_moa_preset), the crash is unrecoverable and blocks all MoA usage until the config is manually fixed. Replace the bare casts with _coerce_float / _coerce_int helpers that fall back to the default on TypeError/ValueError instead of raising.	2026-06-26 14:35:38 -07:00
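The fallback coercion described in f0678b031e can be sketched as two small helpers. The names mirror `_coerce_float` / `_coerce_int` from the commit message, but the signatures and the default values used in the usage note are assumptions for illustration.

```python
def coerce_float(value, default: float) -> float:
    """Convert to float, falling back to `default` on bad input instead of raising."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def coerce_int(value, default: int) -> int:
    """Convert to int, falling back to `default` on bad input instead of raising."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default
```

With these helpers, a hand-edited `max_tokens: "8k"` or `reference_temperature: "hot"` yields the default rather than a `ValueError` on every model-selection and MoA turn.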
Nacho Avecilla	f509f6e598	fix(dashboard): offload PTY spawn/close off the event loop (#53227) * Fix blocking tasks on the dashboard * Remove unnecessary comments	2026-06-26 12:47:23 -07:00
Teknium	3d735fe156	fix(skills-hub): surface per-tap providers (NVIDIA/OpenAI/...) in runtime search (#53191) Natural-language skill search returned a short, arbitrary list and never surfaced NVIDIA (or OpenAI/Anthropic/HuggingFace) skills. Two causes: 1. The runtime index collapses every GitHub tap into source="github", so there was no way to find or filter by provider at the CLI — the per-tap identity only existed in the docs-site catalog. 2. HermesIndexSource.search matched only name/description/tags (not the identifier or provider) and broke at the first `limit` hits in raw index order, burying the most relevant skills. `search` also defaulted to --limit 10 against an 86k-entry catalog. Changes: - GitHubSource stamps a per-tap provider label (extra.provider) on each skill via github_provider_for(); source stays "github" so dedup/floor/index-skip logic is untouched. Flows into the built index. - HermesIndexSource.search now matches identifier + provider too, and collect-then-ranks (exact > prefix > whole-word > substring) instead of break-at-limit. - --source nvidia|openai|anthropic|huggingface|voltagent|gstack|minimax provider filters for browse/search (narrows merged results by provider). - search --limit default 10 -> 25; table Source column shows the provider label for github skills. Tested: 181 unit tests pass; E2E against the live runtime index confirms 'nvidia'/'cuda' searches now surface NVIDIA-provider skills and --source nvidia narrows to exactly the NVIDIA catalog.	2026-06-26 11:04:41 -07:00
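The collect-then-rank ordering from 3d735fe156 (exact > prefix > whole-word > substring) can be sketched over a flat list of names. This is an illustration only: `match_rank()` and `search()` are hypothetical functions written for this example, and the real `HermesIndexSource.search` also matches identifier, provider, description and tags.

```python
import re
from typing import Optional


def match_rank(query: str, text: str) -> Optional[int]:
    """Lower is better: 0 exact, 1 prefix, 2 whole-word, 3 substring, None for no match."""
    q, t = query.lower(), text.lower()
    if t == q:
        return 0
    if t.startswith(q):
        return 1
    if re.search(rf"\b{re.escape(q)}\b", t):
        return 2
    if q in t:
        return 3
    return None


def search(query: str, names: list[str], limit: int = 25) -> list[str]:
    """Collect every match first, then rank, then apply the limit."""
    scored = []
    for index, name in enumerate(names):
        rank = match_rank(query, name)
        if rank is not None:
            # The index keeps the original order stable within one rank.
            scored.append((rank, index, name))
    scored.sort()
    return [name for _, _, name in scored[:limit]]
```

Breaking at the first `limit` hits in raw index order would return whatever matched first; ranking after collection lets an exact match that sits late in the index still come out on top.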
Teknium	d430684d7c	fix(gateway,windows): respawn gateway windowless after GUI update (#52239) The post-update gateway restart path relaunched the gateway with the venv's console `python.exe` (via `get_python_path()` in `_gateway_run_args_for_profile`). On Windows this leaves a terminal window open permanently: uv's `venv\Scripts\python.exe` is a launcher shim that re-execs the base console interpreter, which allocates its own conhost — and `CREATE_NO_WINDOW` cannot suppress that second window. The clean-start path (`_spawn_detached`) already dodges this by routing through `_resolve_detached_python` to use the windowless base `pythonw.exe`; the restart watcher did not. Symptom (reported on Windows 11): after an in-app GUI update, a console window for the gateway stays open and never closes. Confirmed on the reporter's box — the running gateway was `python.exe ... gateway run --replace` with a live conhost child and the foreground "Press Ctrl+C to stop" banner, born exactly at the update's "Restarting Windows gateway" log line. Fix: - Add `gateway_windows.windowless_gateway_restart_spec(run_argv)` which rewrites a console-python gateway argv into the windowless `pythonw.exe` equivalent and returns the cwd + env overlay (VIRTUAL_ENV / PYTHONPATH / HERMES_HOME) the base interpreter needs to import `hermes_cli` without the venv launcher's site config. No-op on POSIX. - `_spawn_gateway_restart_watcher` now applies that rewrite on Windows and threads cwd= / env= into the inlined respawn Popen. Covers both restart entry points (`launch_detached_profile_gateway_restart` and `launch_detached_gateway_restart_by_cmdline`). CREATE_NO_WINDOW | DETACHED_PROCESS | CREATE_BREAKAWAY_FROM_JOB and the breakaway-denied fallback are all preserved.
Verified E2E on a real Windows 11 box: drove the actual watcher against a dummy old-pid; the respawned gateway came up as `pythonw.exe` (zero console python, no conhost child) and booted fully (housekeeping + kanban dispatcher started → imports resolved under the base interpreter). Tests: TestWindowlessGatewayRestartSpec (behavior) + TestGatewayDetachedWatcherWindowsFlags regression assert. Pre-existing Linux-only failures on a Windows host (SIGKILL, systemd, docker-root) confirmed identical on the bare base.	2026-06-26 17:39:46 +00:00
kyssta-exe	c0568ca95f	fix(config): use read_raw_config() in migrations to prevent expanding defaults (#40821)	2026-06-26 22:40:52 +05:30
brooklyn!	5cc4009deb	Merge pull request #52828 from helix4u/fix/desktop-backend-update-indicator fix(desktop): show remote backend updates without counts	2026-06-26 11:49:07 -05:00
kshitij	7b2c51152a	Merge pull request #52990 from NousResearch/salvage/52889-backup-projects-kanban fix(backup): include projects.db and kanban boards in pre-update snapshot (#52889)	2026-06-26 20:09:15 +05:30
0xDevNinja	9ef49cd78f	fix(backup): include projects.db, kanban boards, and sibling stores in pre-update snapshot (#52889 ) projects.db (per-profile project store) and kanban.db were missing from _QUICK_STATE_FILES, so the pre-update quick snapshot never backed them up. On a desktop upgrade, when the update flow removes/replaces the file and the post-update schema-init re-creates an empty one, all user-created projects, folder mappings, the active-project pointer, kanban board bindings, and tasks vanish silently — no error. Add the per-profile user-created stores to the snapshot set: - projects.db — project store - response_store.db — gateway conversation history / tool payloads (WAL) - memory_store.db — holographic memory facts/entities (WAL) - verification_evidence.db — agent verification audit trail - kanban.db — default board (back-compat <root>/kanban.db) - kanban/boards — non-default boards (<root>/kanban/boards/<slug>/kanban.db + metadata); workspaces/ and attachments/ subtrees are skipped as large + regenerable. Also: the directory-branch of create_quick_snapshot now routes *.db through the WAL-safe _safe_copy_db (SQLite backup() API), matching the top-level file path — previously a non-default board DB with an open WAL could be copied inconsistently. Salvaged from #52930 by @0xDevNinja (authorship preserved via cherry-pick). On top of the original (which covered only projects.db + the default kanban.db), this adds: non-default-board coverage, the three sibling per-profile DBs that meet the same upgrade-wipe criteria, WAL-safe directory copies, and a workspaces/attachments skip to avoid snapshot bloat (×20 retained). 8 tests, all mutation-verified; E2E verified snapshot→wipe→restore preserves all six store types on the real code path. Closes #52889. Supersedes #52930.	2026-06-26 19:23:33 +05:30
Dr1985	e3db1ef92d	fix(macos): clearly distinguish launchd supervision from detached fallback in gateway status ## Description On macOS 26.x, `launchctl bootstrap` and `launchctl kickstart` return exit code 5 ("Input/output error"), which Hermes already anticipates and handles by spawning a detached fallback process. However, the gateway status reporting is ambiguous: - `gateway status` says "Gateway service is loaded" (because `launchctl list` returns exit 0) - But `launchctl print` shows `state = not running` — launchd isn't actually supervising anything - The detached fallback PID running is invisible to the status command - Users can't tell whether auto-start at login and auto-restart on crash are available ### Root Cause Two problems in `hermes_cli/gateway.py`: 1. `_probe_launchd_service_running()` (line 1067): Determined launchd service liveness solely by `launchctl list <label>` exit code.
On macOS 26, this returns 0 even when the service is only registered but not running (output lacks a `"PID"` field). This caused `GatewayRuntimeSnapshot.service_running = True` incorrectly, which suppressed the process/service mismatch warning. 2. `launchd_status()` (line 3569): Used the same binary "loaded/not loaded" check without inspecting whether launchd actually has a PID, whether a detached fallback is running, or whether auto-start/restart are available. ### Changes `hermes_cli/gateway.py`: 1. New `_parse_launchd_pid_from_list_output()` helper — Extracts the PID from `launchctl list` output. When launchd is actively supervising, the output includes `"PID" = <number>;`. When only registered but not running, no PID field is present. 2. Fixed `_probe_launchd_service_running()` — Now requires a PID in the `launchctl list` output to confirm launchd is actually supervising. This correctly sets `service_running = False` when launchd has the service registered but `state = not running`, which triggers the existing process/service mismatch detection. 3. Reworked `launchd_status()` — Reports clearly separated information: - LaunchAgent plist currentness (stale or current) - Whether launchd is actively supervising (with PID) - Whether a detached fallback PID is running - Whether auto-start at login and auto-restart on crash are available - When launchd supervision is known to be unavailable, explains why 4. Persistent unsupported marker (`~/.hermes/.gateway-launchd-unsupported`) — Written when `_launchd_fallback_to_detached()` is called (launchd exit 5/125). Allows `launchd_status()` to explain why launchd can't supervise even when no fallback process is currently running. Cleared automatically when a future bootstrap/kickstart succeeds (e.g., after an OS update fixes the issue). 5. Updated `_print_gateway_process_mismatch()` — Distinguishes the managed detached fallback from a genuinely manual `nohup hermes gateway run`, providing accurate guidance for each case. 
### Status Output Examples Before (macOS 26, fallback active): ``` Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist ✓ Service definition matches the current Hermes install ✓ Gateway service is loaded { "Label" = "ai.hermes.gateway"; "OnDemand" = true; ... }; ``` After (macOS 26, fallback active): ``` Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist ✓ Service definition matches the current Hermes install ⚠ Gateway service is registered but launchd is not supervising it launchd cannot manage the gateway on this macOS version. ✓ Detached fallback process is running (PID 12345) Cron jobs will fire. Stop with: hermes gateway stop ⚠ Auto-start at login and auto-restart on crash are NOT available. ``` After (normal launchd supervision): ``` Launchd plist: ~/Library/LaunchAgents/ai.hermes.gateway.plist ✓ Service definition matches the current Hermes install ✓ Gateway is supervised by launchd (PID 12345) Auto-start at login and auto-restart on crash are available. ``` ### Tests Updated 5 existing tests and added 11 new tests in `tests/hermes_cli/test_gateway_service.py`: - PID parsing from `launchctl list` output (with PID, without PID, empty, unquoted PID) - `_probe_launchd_service_running()` requires PID presence - Unsupport marker lifecycle (write, clear, persist across fallback) - Marker cleared on successful bootstrap - `launchd_status()` reporting: supervised, fallback-running, fallback-unavailable - Existing fallback tests now verify marker creation ### Related Issues - Issue #23387 (original macOS 26 launchd workaround) - Issue #42524 (this issue)	2026-06-26 16:30:30 +05:30
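The PID check behind the fixed `_probe_launchd_service_running()` can be sketched as a small parser over `launchctl list <label>` output. This is an illustration under assumptions: `parse_launchd_pid()` and its regular expression are written for this example rather than copied from `_parse_launchd_pid_from_list_output()`, and the sample text follows the `"PID" = <number>;` shape quoted in the commit message.

```python
import re
from typing import Optional

# Accepts both the quoted form ("PID" = 123;) and an unquoted PID = 123; line.
_PID_RE = re.compile(r'\bPID\b"?\s*=\s*(\d+)')


def parse_launchd_pid(list_output: str) -> Optional[int]:
    """Return the supervised PID, or None when launchd has the job registered but not running."""
    match = _PID_RE.search(list_output or "")
    return int(match.group(1)) if match else None


def launchd_is_supervising(exit_code: int, list_output: str) -> bool:
    """Exit code 0 alone only proves registration; a PID proves supervision."""
    return exit_code == 0 and parse_launchd_pid(list_output) is not None
```

Under this rule, a registered-but-idle job (exit code 0, no PID field) reports as not supervised, which is the case that previously suppressed the process/service mismatch warning.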
kshitij	1aa458a1e6	Merge pull request #52920 from NousResearch/salvage/38798-toolset-validation fix(config): surface invalid platform_toolsets instead of silently dropping tools (#38798)	2026-06-26 14:14:55 +05:30
lEWFkRAD	41ede84b93	fix(config): surface invalid platform_toolsets instead of silently dropping tools (#38798 ) A config migration (or hand-edit) that leaves an invalid toolset name in `platform_toolsets` — e.g. the #38798 corruption that rewrote `hermes-cli` to the non-existent `hermes` — silently disabled all affected tools: resolve_toolset() returns [] for an unknown name, so the agent quietly lost its tools with no error, warning, or log entry and degraded to text-only replies. Surface it loudly at two points: - After migration (migrate_config): validate platform_toolsets and record/print a warning per unknown name, with a `hermes-<platform>` suggestion when that would have been valid (the exact #38798 shape). - At runtime (_get_platform_tools): if a platform was explicitly configured but every toolset name is invalid, log a warning when tools are resolved for a session — so an ALREADY-corrupted config is caught at startup, not only on the next `hermes update`. Logic lives in a new pure, side-effect-free helper (toolset_validation.py) with validate_toolset injected, so it is unit-testable without the tool registry. Note: the original v25→v26 migration that caused the corruption no longer exists (config format is now v30; no migration step rewrites toolset names). This change is the durable defense against the silent-failure mode regardless of cause, matching the issue's "Expected: log a warning". Salvaged from #39207 by @lEWFkRAD (authorship preserved via cherry-pick). Tests: 9 helper cases (incl. the #38798 corruption shape, mixed valid/invalid, zero-tools state, non-dict/scalar/non-string) + a runtime caplog test — both the helper warning and the runtime guard mutation-verified to fail without the fix. Closes #38798. Supersedes #39581 (prevent-in-v25→v26 — that path is gone), #41006 / #40208 (repair-migration for already-corrupted configs).	2026-06-26 14:07:43 +05:30
Shannon Sands	41f8126148	Reconnect dashboard PTY chat after socket drops	2026-06-26 01:06:02 -07:00
Ben	19b2624404	feat(gateway): external drain trigger + accept-gating (begin/cancel + control channel) Tasks 2.1 + 2.2 + 2.3 of the safe-shutdown plan — the reversible quiesce-without-restart machinery NAS drives during a lifecycle action (D4a). These ship together because the endpoint, the control channel, and the gateway state machine are one coherent slice. 2.2 — control channel (gateway/drain_control.py, new): The dashboard has no HTTP path into a running gateway (guardrails: "there is NO external control channel into a running gateway"); restart/drain is driven only by markers the gateway reacts to. So begin/cancel-drain writes/removes a presence-based marker .drain_request.json (HERMES_HOME-scoped, atomic write, never-raises read; a corrupt marker reads as present-contentless → fail-safe toward quiescing). This is Q-B option A. 2.2 — gateway state machine (gateway/run.py): - _external_drain_active flag, DISTINCT from the shutdown _draining flag: this one does NOT exit the process and is fully reversible. - _enter_external_drain / _exit_external_drain: idempotent transitions that flip gateway_state→draining / →running via _update_runtime_status (preserving the live active_agents count). exit refuses to revert to running during a real shutdown or after the loop stops (shutdown wins). - _drain_control_watcher: 1s background task (modelled on _handoff_watcher) reconciling accept-state with the marker; honours a marker that survived a restart on its first tick. Registered alongside the other watchers in start. - New-turn accept gate in _handle_message, placed BEFORE the session-slot claim: when draining, refuse to START a new turn (so active_agents can only fall → no TOCTOU race), while in-flight turns finish untouched. Internal/system events (restart-recovery replays, bg-process completions) bypass it. 2.1 — endpoint (hermes_cli/web_server.py): POST /api/gateway/drain {action: drain|cancel}.
Authenticated by the Task-2.0a token seam (the drain plugin registered this exact path as a token route); attributes the request to the verified token principal. Begin writes the marker, cancel removes it — the gateway process owns the actual transition. Force-override (D6) is NOT here; it maps onto the existing immediate /api/gateway/restart force path. Tests (mocked — necessary-not-sufficient; the HARD live gate Q-B is next): - tests/gateway/test_external_drain_control.py — marker contract (write/clear/read/corrupt/atomic), state machine (enter/exit/idempotency/shutdown-wins/loop-stopped), watcher reconcile-enter-then-exit, new-turn refusal, and in-flight-not-interrupted. 15 tests. - tests/hermes_cli/test_web_server.py — /api/gateway/drain begin/default-begin/cancel/cancel-idempotent/bad-action-400. 6 tests. - dashboard.drain_auth config section already added in 2.0b commit. All touched suites green: 301 (gateway+auth) + 9 (web_server endpoints) passed. Intentionally deferred: - HARD live-validation gate (Q-B): real isolated `hermes gateway run`, drive a real begin-drain marker, prove the 5-point checklist a–e. - Spec-doc status flip + Phase-2 PR. Build status: external-drain, restart-drain, status, dashboard-auth, drain-plugin, token-auth, and web_server-endpoint suites green.	2026-06-26 00:47:19 -07:00
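The marker contract described in 19b2624404 (atomic write, idempotent clear, never-raises read, corrupt marker treated as present but contentless) can be sketched as follows. Only the file name `.drain_request.json` comes from the commit message; the function names, the payload shape, and the choice of `os.replace` for atomicity are assumptions for this example, not the contents of `gateway/drain_control.py`.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

MARKER_NAME = ".drain_request.json"


def write_marker(home: Path, payload: dict) -> None:
    """Write to a temp file in the same directory, then rename over the marker atomically."""
    fd, tmp = tempfile.mkstemp(dir=home, prefix=".drain_request.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp, home / MARKER_NAME)
    except BaseException:
        try:
            os.unlink(tmp)
        except OSError:
            pass
        raise


def clear_marker(home: Path) -> None:
    """Remove the marker; cancelling twice is not an error."""
    try:
        (home / MARKER_NAME).unlink()
    except FileNotFoundError:
        pass


def read_marker(home: Path) -> Optional[dict]:
    """None means absent; {} means present but unreadable, which still counts as a drain request."""
    try:
        data = json.loads((home / MARKER_NAME).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return None
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}
```

A watcher built on this would treat any non-`None` result as "drain requested", so a half-written or corrupt marker errs toward quiescing rather than toward accepting new turns.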
Ben	2e322466b1	feat(dashboard-auth): drain shared-bearer-secret provider plugin Task 2.0b: the concrete shared-bearer-secret auth provider, the FIRST consumer of the generic token-auth capability (Task 2.0a). Implements decisions.md Q-A. plugins/dashboard_auth/drain/ (bundled, discovered like dashboard_auth/basic): - DrainSecretProvider: non-interactive provider, supports_token=True. Verifies an inbound Authorization bearer token against a per-agent shared secret with hmac.compare_digest (constant-time, no timing oracle) and, on a match, vouches for the caller as the "drain-control" principal scoped to "drain". The five interactive ABC methods raise NotImplementedError; verify_session returns None (stacks harmlessly in the cookie-verify loop). - assess_secret_strength(): fail-closed entropy gate. Rejects secrets shorter than 43 url-safe-b64 chars (~256 bits), with < 16 distinct characters, or below 128 bits Shannon entropy — so a weak/structured/repeated secret can never be silently accepted. Enforced both at register() (friendly skip reason) and in __init__ (raises — defence in depth). - register(ctx): no-op + skip reason when HERMES_DASHBOARD_DRAIN_SECRET is unset; rejects a weak secret fail-closed (drain endpoint stays gated). On a strong secret, registers the provider AND opts /api/gateway/drain into the generic token-auth seam via register_token_route(). Config: the secret is a CREDENTIAL → carried via HERMES_DASHBOARD_DRAIN_SECRET (per-agent, provisioned by NAS at deploy). Behavioural knobs only (dashboard.drain_auth.{scope,min_secret_chars}) live in config.yaml — added to DEFAULT_CONFIG with the .env-is-for-secrets rationale documented inline. 
Tests: tests/plugins/dashboard_auth/test_drain_provider.py — entropy gate (strong pass; empty/short/repeated/few-distinct/custom-min reject), verify_token (match → scoped principal, wrong/empty → None, custom scope), protocol compliance, interactive-methods-raise, and register() (skip-no-secret, fail-closed-weak-secret, strong-env-secret registers + route opt-in, config scope + min_secret_chars). 21 new tests; drain + token-auth suites 44 passed. Verified the plugin is discovered as dashboard_auth/drain alongside basic/nous. Intentionally deferred: - The begin/cancel-drain endpoint handler itself — Task 2.1. - The dashboard→gateway control channel — Task 2.2. Build status: dashboard-auth + drain-plugin suites green.	2026-06-26 00:47:19 -07:00
Ben	cb9cb6ba1c	feat(dashboard-auth): generic non-interactive API-token capability Task 2.0a of the safe-shutdown drain-coordination plan. Widens the dashboard auth framework GENERICALLY to support non-interactive (service-to-service) bearer-token auth, mirroring the existing supports_password precedent. This is a reusable capability — any future machine-credential provider plugs in without core changes (decisions.md Q-C). The drain bearer-secret plugin (Task 2.0b) is the first consumer, not the definition. - base.py: add TokenPrincipal dataclass (the token analog of Session) + supports_token capability flag + verify_token() on the ABC (default raises NotImplementedError so a misconfigured provider fails loud). Contract mirrors verify_session stacking: return None for unrecognised tokens (never raise), raise ProviderError only on a genuine backing-store outage. - registry.py: list_token_providers() — the supports_token subset, in registration order. Empty when none registered (token routes fail closed). - token_auth.py (new): route-agnostic seam. Routes opt in via register_token_route(exact path); token_auth_middleware owns the auth decision for those routes only — authenticate via stacked providers, attach request.state.token_principal + token_authenticated, pass through. 401 on missing/unrecognised token, 503 when a provider was unreachable, untouched passthrough for non-token routes. Fails closed (never open). - web_server.py: install the seam OUTERMOST (registered last → runs first). Both downstream gates (legacy auth_middleware + gated_auth_middleware) honour request.state.token_authenticated and skip enforcement, so a token-authed service request is never bounced to /login. - audit.py: TOKEN_AUTH_SUCCESS / TOKEN_AUTH_FAILURE events. 
Tests: tests/hermes_cli/test_dashboard_token_auth.py — ABC flag default, verify_token NotImplementedError, registry filter, bearer extraction (case-insensitive scheme, malformed/non-bearer → ""), provider stacking (first-match-wins, unreachable-remembered, unreachable-then-valid, buggy provider doesn't crash the gate), and the seam's passthrough/401/503/ fail-closed behaviour. 29 new tests; full dashboard-auth suite 169 passed. Intentionally deferred: - The concrete shared-bearer-secret provider plugin — Task 2.0b. - The begin/cancel-drain endpoint that registers itself as a token route — Task 2.1. Build status: dashboard-auth + plugin-hook suites green.	2026-06-26 00:47:19 -07:00
Max Hsu	075f93ad78	fix(mcp): auto-recover from invalid_client on stale OAuth client registration Fixes #36767. Two complementary recoveries for the recurring "delete three cache files and re-auth by hand" ritual when an MCP server's dynamically-registered OAuth client goes dead server-side (IdP redeploy / DB wipe / rebrand): - Auto-heal (token-endpoint subset): HermesMCPOAuthProvider now sniffs auth-flow responses and, on a 400/401 `invalid_client` from the discovered token endpoint, backs up + deletes `<server>.client.json` and `.meta.json` and clears the in-memory client so the SDK re-runs RFC 7591 dynamic client registration on the next flow. Conservative by construction: only dynamically-registered (non config-supplied) clients, only the token endpoint, only on a word-boundary `invalid_client` match (so RFC 7591's `invalid_client_metadata` does not trip it); best-effort so a miss never breaks the live flow. Covers both code-exchange and refresh when the token endpoint was discovered. Tokens are preserved. - `hermes mcp reauth [<name>|--all]`: the reporter's primary symptom — the IdP's in-browser "Redirect URI Mismatch" — produces no HTTP signal (the SDK only sees a callback timeout), so it cannot be auto-detected. The new command re-auths one or ALL `auth: oauth` servers, serially: one browser flow at a time, which also fixes the startup popup storm when several servers are stale at once. Single-server reauth is factored out of `mcp login` and shared. Tests: +14 (poison helper x2; token-endpoint detection x5 incl. wrong-endpoint, success-response, pre-registered, and invalid_client_metadata negative guards; a bridge integration test driving the real async_auth_flow generator to prove the detection hook preserves the bidirectional asend() forwarding contract; reauth CLI x6). Verified against the pinned mcp==1.26.0: scripts/run_tests.sh 122/122 green for the touched suites; check-windows-footguns.py and ruff clean. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 00:35:27 -07:00
brooklyn!	a2b49e60b6	Merge pull request #52412 from GodsBoy/fix/verify-on-stop-messaging-surface-leak fix(agent): gate verify-on-stop nudge off for messaging surfaces	2026-06-26 02:30:08 -05:00
Teknium	5b5c79a8ef	feat(kanban): typed block reasons + unblock-loop breaker (#52848 ) * feat(kanban): typed block reasons + unblock-loop breaker Stops the kanban blocked-task loop: a worker blocks a task, a cron unblocks it, the worker re-blocks for the same reason, repeat forever. block_task now takes a typed kind and a persistent block_recurrences counter on the tasks table: - kind=dependency routes to todo (parent-gated, auto-resumed), never the human 'blocked' bucket a cron would keep unblocking. - needs_input/capability/transient/untyped land in blocked; each same-cause re-block after an unblock increments block_recurrences, and at BLOCK_RECURRENCE_LIMIT (default 2) the task routes to triage for a human instead of blocked. - unblock_task no longer resets block_recurrences (the amnesia that let the loop run unbounded); complete_task clears it on success. Wired through the worker kanban_block tool (new kind arg) and the hermes kanban block --kind CLI flag, both reporting where the task actually landed. Docs + 11 new tests; 536 existing kanban tests green. * test(kanban): make second-block notify test use a distinct block cause test_notifier_second_blocked_delivers blocked the same task twice with the same (untyped) reason, which now trips the new unblock-loop breaker and routes the second block to triage instead of blocked — so only one 'blocked' notification fired. The test's actual intent is that TWO distinct block cycles each notify; give the two cycles different kinds (needs_input then capability) so they're genuinely separate blocks. The same-cause loop→triage path is covered by test_kanban_block_kinds.py.	2026-06-25 21:46:58 -07:00
Teknium	0b7128582f	fix(state): detect and repair FTS write corruption that silently drops gateway history (#52798 ) A readable state.db can still reject every message write through the messages_fts* triggers when the FTS5 index is corrupt: base-table reads and PRAGMA integrity_check pass, but INSERT INTO messages fails with 'database disk image is malformed'. The gateway reloads conversation_history from disk each turn, so a silently-failed write hands the next turn stale/empty history even though the same cached AIAgent still holds the live transcript — causing immediate same-session amnesia. (#50502) - hermes_state.py: _db_opens_cleanly() now drives a rolled-back message write through the FTS triggers, so write-only corruption (which the read-only probe reported healthy) is detected. repair_state_db_schema() gains an in-place FTS5 'rebuild' strategy (tier 0) before the dedup/drop tiers, plus an already_healthy short-circuit. Both 'hermes sessions repair' and 'hermes doctor' route through these, so the fix covers the whole class. - hermes_cli/doctor.py: the state.db check runs the write-health probe even on the success (readable) path and repairs in place with --fix. - gateway/run.py: _select_cached_agent_history() prefers the cached agent's longer live _session_messages over a shorter persisted transcript, so an FTS write failure can't wipe in-session context. - tests: regressions for write-health detection, in-place repair preserving rows + resuming writes, the already_healthy shortcut, and the gateway guard. Combines the approaches from #50504 (@0-CYBERDYNE-SYSTEMS-0, issue author), #52165 (@davidgut1982), and #50576 (@trevorgordon981).	2026-06-25 21:18:41 -07:00
helix4u	1c8594b634	fix(desktop): show remote backend updates without counts	2026-06-25 21:39:29 -06:00
liuhao1024	56cf517ccd	fix(cron): detect partial job loss in restore_cron_jobs_if_emptied (#52144) The desktop scheduler can overwrite cron/jobs.json with its own small set of internally-tracked crons after an update/restart, causing partial loss of tool-created cron jobs. The previous guard only checked for total loss (live_count == 0), missing the case where live_count > 0 but less than the pre-update snapshot count. Compare live_count against snap_count instead of checking for zero, so both total loss (0 vs N) and partial loss (1 vs 19) trigger restoration. Salvaged from #52161 by @liuhao1024. Closes #52144	2026-06-25 18:49:18 -07:00
Brooklyn Nicholson	ff81365988	feat(desktop): in-app spot editor for the file preview pane Adds a CodeMirror 6 spot editor to the right-rail file preview so users can make quick edits in-app without leaving for an IDE. Entering edit mode is a pure in-place swap of the read view — same fixed-height header, same gutter geometry/typography (mirrors SourceView 1:1) so nothing shifts — toggled via the Edit button, a bare `e` when the pane is hovered/focused, or the tab. - Save path is transport-agnostic (writeDesktopFileText): local Electron IPC or a new hardened POST /api/fs/write-text on the dashboard server (path validation, parent-must-exist, regular-files-only, size cap, atomic temp-file + os.replace), behind the existing auth middleware. - Stale-on-disk guard re-reads before writing and offers overwrite vs discard-and-reload instead of clobbering external/agent edits. - VS Code-style modified dot on the tab; ⌘/Ctrl+S and ⌘/Ctrl+Enter save, Esc cancels; GitHub highlight style matched to the read view's Shiki theme. - Typing stays render-free (draft in a ref; dirty flips once at the boundary).	2026-06-25 19:50:25 -05:00
Teknium	208f0d7c3b	fix(update): default pre-update backup to off (#52729) The pre-update HERMES_HOME zip shipped on by default (DEFAULT_CONFIG + runtime fallback both True), so every `hermes update` zipped the entire ~/.hermes — sessions DB, caches, skills — adding minutes to each update. The shipped cli-config.yaml.example, the --backup help, and the example config all already said "off by default," so the live default contradicted its own documentation. Flip the default to off everywhere: DEFAULT_CONFIG, the runtime `.get(..., False)` fallback in _run_pre_update_backup, and the stale --backup help string. Users who want the #48200 safety net opt in via updates.pre_update_backup: true or --backup for a single run. Updated test_default_enabled_creates_backup -> test_default_disabled_is_silent to assert the new default (silent no-op, no zip).	2026-06-25 16:01:09 -07:00
kshitij	e4ff494860	fix(cron): add default retention to per-run job output (#52383) (#52646) * fix(cron): add default retention to per-run job output to bound disk usage (#52383) Per-run cron output (cron/output/<job>/<timestamp>.md) is written once per execution and was never pruned, so a frequently-scheduled job on a long-running deploy accumulates one file per run indefinitely and can fill the volume ('no space left on device'). save_job_output() now keeps the most recent N output files per job and removes older ones. N defaults to 50 and is configurable via cron.output_retention; a non-positive value disables pruning for operators who manage cleanup externally. Salvaged from #52402 by @0xDevNinja. Closes #52383 * fix(config): add cron.output_retention to DEFAULT_CONFIG Follow-up to #52383: the retention config key was functional via get()-with-default but missing from DEFAULT_CONFIG, so the deep-merge wouldn't auto-populate it for new installs. Add it explicitly. --------- Co-authored-by: 0xDevNinja <manmit0x@gmail.com>	2026-06-25 16:00:13 -07:00
brooklyn!	ffa3d3c811	Merge pull request #49037 from NousResearch/bb/projects-paradigm feat(desktop): first-class projects — sidebar, coding rail, review pane, and agent project tools	2026-06-25 17:49:05 -05:00
Gille	e7d2f0b93c	fix(windows): suppress console flashes and harden gateway restarts	2026-06-25 14:42:38 -07:00
Brooklyn Nicholson	9f3aa1685c	fix(cli): register project command beside MoA	2026-06-25 16:40:27 -05:00
Brooklyn Nicholson	4e023f5bc9	feat(gateway): build authoritative project tree	2026-06-25 16:40:27 -05:00
Brooklyn Nicholson	e7811345c1	feat(kanban): link tasks to project worktrees	2026-06-25 16:40:26 -05:00
Brooklyn Nicholson	8a45ce2dd4	feat(projects): add per-profile project store	2026-06-25 16:40:26 -05:00
Teknium	c6575df927	feat(moa): expose MoA presets as selectable virtual models (#46081 ) * feat(moa): expose MoA presets as selectable virtual models Reconstructed onto current main (PR #46081's base had diverged with no common ancestor, marking the PR dirty so CI never dispatched). MoA is now a virtual provider: each named preset is a selectable model under provider 'moa', and the preset's aggregator is the acting model that answers and calls tools. Reference models fan out in parallel via a bounded ThreadPoolExecutor (the same batch pattern delegate_task uses) — all references dispatched at once, collected when every one finishes, then handed to the aggregator. Output order is preserved, failures and the MoA-recursion guard stay isolated per reference. - Removed the old mixture_of_agents model tool and moa toolset. - Added moa as a virtual provider in the provider/model inventory. - /moa is shortcut behavior over model selection (default preset / named preset / one-shot prompt). - Dashboard + Desktop manage named presets; presets appear in model pickers. - Parallel reference fan-out in agent/moa_loop.py with regression test. * fix(moa): thread moa_config through _run_agent to _run_agent_inner The reconstructed gateway MoA wiring declared moa_config on _run_agent (the profile-scoping wrapper) and used it inside _run_agent_inner, but the wrapper never forwarded it — _run_agent_inner had no such parameter, so the runtime hit NameError: name 'moa_config' is not defined on the compression-failure session sync path. Add moa_config to _run_agent_inner's signature and forward it from both wrapper call sites (multiplex and non-multiplex). Caught by tests/gateway/test_compression_failure_session_sync.py on CI shard test(4). 
* fix(moa): classify moa as a virtual provider in the catalog The moa virtual provider has no PROVIDER_REGISTRY/ProviderProfile entry, so provider_catalog() fell through to the default auth_type="api_key" with no env vars — tripping two catalog invariants: - test_provider_catalog: api_key providers must expose a credential env var - test_provider_parity: every hermes-model provider must be desktop-configurable moa already declares auth_type="virtual" in HERMES_OVERLAYS; consult that overlay as an auth_type fallback so the catalog reports moa as virtual (no real credential, no network endpoint). Exempt virtual providers from the desktop parity union check the same way 'custom' is exempt — derived from the catalog, not a hardcoded slug, so future virtual providers are covered too.	2026-06-25 13:52:06 -07:00
kshitij	ca714f6189	Merge pull request #52653 from kshitijk4poor/salvage/33814-env-quote-hash fix(config): quote .env values containing # to prevent token truncation (#30355)	2026-06-26 01:32:49 +05:30
kshitijk4poor	2107b86024	feat(compression): flip in_place default to True (#38763) [2/2] In-place compaction (single durable session id, non-destructive soft-archive) becomes the default. Rotation is now the opt-out fallback via compression.in_place: false. Prerequisite: #50098 (hygiene guard reads result flag not config flag) merged first — without it, flipping the default causes permanent transcript loss on gateway hygiene-compress and /compress when no session_db is available. Blast radius (empirically measured on current main): 7 rotation-asserting tests broke and are pinned to in_place=False in the companion test commit: - tests/agent/test_compression_concurrent_fork.py (2) - tests/agent/test_compression_logging_session_context.py (1) - tests/agent/test_compression_rotation_state.py (1) - tests/run_agent/test_compression_boundary_hook.py (2 _make_agent helpers) - tests/gateway/test_compression_concurrent_sessions.py (2) Rotation stays as a working fallback and deserves continued coverage. Plan: .hermes/plans/in-place-compaction-38763.md	2026-06-25 12:56:05 -07:00
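
The per-run output retention described in commit e4ff494860 above (keep the most recent N files per job, default 50, non-positive disables pruning) can be sketched roughly as below. This is an illustration only, not the project's save_job_output(): the helper name prune_job_output, its signature, and the assumption that timestamped filenames sort chronologically by name are all hypothetical.

```python
from pathlib import Path


def prune_job_output(job_dir: Path, retention: int = 50) -> list[Path]:
    """Delete all but the newest `retention` .md files in job_dir.

    Hypothetical helper for illustration; returns the paths it removed.
    """
    if retention <= 0:
        # A non-positive value disables pruning entirely.
        return []
    # Assumption: timestamped filenames sort chronologically by name.
    files = sorted(job_dir.glob("*.md"), key=lambda p: p.name)
    # Everything except the last `retention` entries is stale.
    stale = files[:-retention]
    for path in stale:
        path.unlink(missing_ok=True)
    return stale
```

Sorting by name rather than by mtime is a choice made for this sketch so that copying or restoring files does not change which runs count as newest.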

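The word-boundary `invalid_client` match from commit 075f93ad78 above can be illustrated with a small sketch. The function name is_stale_client_error and its signature are hypothetical, and the real provider additionally checks that the response came from the discovered token endpoint and that the client was dynamically registered; only the 400/401 status check and the word-boundary rule are taken from the commit message.

```python
import re

# "_" counts as a word character, so \b does not match between
# "invalid_client" and "_metadata"; the longer RFC 7591 error code
# therefore never satisfies this pattern.
_INVALID_CLIENT = re.compile(r"\binvalid_client\b")


def is_stale_client_error(status_code: int, body: str) -> bool:
    """Return True for a 400/401 response whose body names invalid_client.

    Hypothetical helper for illustration only.
    """
    if status_code not in (400, 401):
        return False
    return _INVALID_CLIENT.search(body) is not None
```

A plain substring test would also fire on `invalid_client_metadata`, which is why the sketch uses a regex with `\b` on both sides.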
3072 commits