hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-01 12:02:05 +00:00

Author	SHA1	Message	Date
Vladimir Smirnov	c080a530ae	fix(cli): redact status API keys with --all	2026-06-30 04:38:43 -07:00
Teknium	e7ca53e6b8	fix(moa): disabled presets no longer hijack a plain model switch (#55598) exact_moa_preset_name matched any bare model name equal to a preset key, regardless of the preset's enabled flag. On the no-explicit-provider switch path (PATH B in model_switch.py), a plain /model switch whose name collided with a preset key (e.g. "default") silently pivoted the session onto the MoA virtual provider, even when the user had set enabled: false to opt out (issue #55187). The LLM driving a routine model switch could land on a broken moa provider with empty default_preset / unconfigured aggregator credentials. Gate the implicit bare-name match on the per-preset enabled flag. Explicit selection via --provider moa / the model picker uses PATH A and does not go through exact_moa_preset_name, so a disabled preset stays reachable when the user explicitly asks for it.	2026-06-30 04:22:32 -07:00
teknium1	bff61f558f	feat(plugins): enable-time consent prompt for tool_override grant Builds on memosr's sink-level opt-in gate (#29249). Enabling a non-bundled plugin now surfaces the privileged allow_tool_override decision at `hermes plugins enable` time instead of leaving the operator to discover the config key after a runtime rejection. - `hermes plugins enable <name>` prompts for non-bundled plugins: 'Allow this plugin to replace built-in tools?' Default is deny (blank Enter / non-interactive stdin / EOF all fail closed). - --allow-tool-override / --no-allow-tool-override flags for non-interactive and scripted use (and a future desktop checkbox). - Bundled plugins are trusted: never prompted, no entry written. - Writes plugins.entries.<key>.allow_tool_override, the same key the sink gate reads (manifest.key == discovery key), so consent and enforcement compose end to end.	2026-06-30 04:00:42 -07:00
memosr	12f5624a76	fix(security): bind tool_override authorization to handler's defining plugin module egilewski found the prior sink gate was transient: it only applied while PluginManager executed register(ctx). A plugin could defer a direct registry.register(..., override=True) to a post-load callback/thread, after the scope was cleared, and still replace a built-in. Make authorization durable by binding it to where the handler is DEFINED (handler.__globals__['__name__']) rather than to call timing. At load, each plugin's module namespace is mapped to its allow_tool_override opt-in in a table that is never cleared. The sink resolves the handler's owning plugin module and rejects an override from any plugin namespace without opt-in, regardless of when or on which thread the call happens. Plugin namespaces with no recorded policy are treated as not-opted-in (fail-closed). Built-in and MCP handlers live outside the plugin namespace and are unaffected. Adds a regression test for the delayed/post-load direct-registry override.	2026-06-30 04:00:42 -07:00
memosr	3101222312	fix(security): enforce tool_override opt-in at registry sink to close direct-import bypass The opt-in gate lived only in PluginContext.register_tool, so a plugin could bypass it by importing tools.registry and calling registry.register(..., override=True) directly. Enforce the same gate at the sink: during plugin load, the registry rejects an override from a plugin without operator opt-in regardless of the path taken. Built-in and MCP registrations (no active plugin scope) are unaffected. Adds a regression test covering the direct-registry bypass.	2026-06-30 04:00:42 -07:00
memosr	179eb8c2a3	fix(security): require operator opt-in for plugin tool_override to prevent silent built-in tool replacement The tool_override flag landed in v0.14.0 (#26759) so plugins can replace a built-in tool with their own implementation. It works as advertised but there is no trust gate, so any enabled third-party plugin can silently override any built-in like shell_exec, write_file, or web_fetch and exfiltrate everything the agent invokes through it. The only trace is a DEBUG-level log line. Compare with ctx.llm (#23194) which does gate the equivalent privilege escalation: overriding the provider requires plugins.entries.<id>.llm.allow_provider_override: true in config.yaml. The policy shape exists, it just was not extended to tool overrides. Fix: * Add PluginToolOverrideError(PermissionError) for the gate failure. * register_tool() now checks _tool_override_allowed(name) when override=True. Bundled plugins (manifest.source == 'bundled') are trusted by default. Every other source requires plugins.entries.<plugin_id>.allow_tool_override: true in config.yaml. * fail-closed: if config.yaml cannot be loaded for any reason, _tool_override_allowed returns False. Same posture as MSGraphWebhookAdapter.connect() in #22353. Backwards compatibility: * Bundled plugins: no change (source == 'bundled' short-circuits the gate). * Third-party plugins not using override: no change (gate is only consulted when override=True). * Third-party plugins using override: registration fails until the operator opts in. The error message includes the exact config path to add, so the fix is one config edit away for legitimate use cases. Same migration path users went through for allow_provider_override after #23194 landed. Regression tests: * tests/hermes_cli/test_plugins.py::test_register_tool_override_replaces_existing and ::test_register_tool_override_on_new_name_is_noop_path were written before the gate existed. 
Updated their test configs to include allow_tool_override: true under plugins.entries.<plugin_id>, mirroring how a legitimate operator would now grant the privilege. * New regression test ::test_register_tool_override_blocked_without_operator_opt_in exercises both the PluginManager-catches-error path (built-in tool is preserved, attacker plugin is skipped) and the direct-call path (PluginToolOverrideError is raised with a message that names the config key to set). Verified the test fails without this fix and passes with it. * All 73 tests in test_plugins.py continue to pass.	2026-06-30 04:00:42 -07:00
teknium1	15e44527ab	fix(copilot): prefer endpoints.api for base URL, guard empty chat base URL Folds @trevorgordon981's #50590 into difujia's #15139: - exchange_copilot_token now prefers the authoritative endpoints.api from the token-exchange response, falling back to the proxy-ep-derived host - resolve_api_key_provider_credentials gains a copilot branch that resolves the account-specific base URL and a non-empty last-resort guard, so chat inference never wedges on an empty base URL (#50252) Co-authored-by: Trevor Gordon <trevorbgordon@gmail.com>	2026-06-30 03:27:41 -07:00
NiuNiu Xia	fbd15e285c	fix(copilot): switch to VS Code client ID and derive enterprise base URL Two changes that complete the Copilot auth story (#7731 parts 3 and 4): 1. Switch OAuth client ID from opencode (Ov23li8tweQw6odWQebz) to VS Code (Iv1.b507a08c87ecfe98). The old ID produces gho_* tokens that return 404 on /copilot_internal/v2/token, making token exchange non-functional. The new ID produces ghu_* tokens that support exchange. 2. Derive enterprise API base URL from the proxy-ep field in the exchanged token. Enterprise accounts get tokens containing e.g. "proxy-ep=proxy.enterprise.githubcopilot.com" which is converted to "https://api.enterprise.githubcopilot.com" and stored in the credential pool. Individual accounts (no proxy-ep) continue using the default URL. The COPILOT_API_BASE_URL env var remains as a user escape hatch. Tested on both Individual and Enterprise Copilot accounts: - Individual: device flow works, exchange succeeds, base_url=None (default) - Enterprise: device flow works, exchange succeeds, 39 models returned including claude-opus-4.6-1m (936K), enterprise base URL derived Parts 3 and 4 of #7731.	2026-06-30 03:27:41 -07:00
Peetwan	ebb81f10cb	fix(tui_gateway): prevent WS disconnect under GIL pressure Three targeted fixes for Desktop GUI WebSocket stability when agent turns starve the uvicorn event loop of CPU (GIL contention): 1. Loosen ws_ping_timeout for loopback binds (QW-1) - Loopback (Desktop): ping 30s interval / 60s timeout - Non-loopback (Cloudflare Tunnel): unchanged 20/20 - A GIL-heavy agent turn can stall the event loop past 20s; uvicorn's keepalive ping runs on that same starved loop, so a 20s timeout kills an otherwise-healthy local connection over a recoverable stall. 60s rides out the stall without affecting half-open detection on public binds. 2. Coalesce streaming token frames in WSTransport (CF-2) - Buffer high-frequency delta frames (message.delta, reasoning.delta, thinking.delta) and flush as a batch every ~33ms (~30fps) - Non-streaming frames (RPC responses, control/tool/completion events) flush pending tokens first — wire ordering preserved - Thread-safe via threading.Lock; worker threads return immediately instead of blocking on per-token loop wakeups - Reduces event-loop wakeup churn by orders of magnitude during model streaming, directly cutting GIL pressure 3. Loop heartbeat watchdog (CF-1) - Self-rearming call_later tick (2s) measures drift between expected and actual fire time using loop.time() (monotonic) - Logs 'event loop stalled Ns (GIL pressure suspected)' when drift >5s - Turns mysterious WS drops into diagnosable log entries - Uses call_later chain (not a task) — dies with the loop, nothing to cancel on shutdown Root cause: uvicorn's ws keepalive ping (20/20s) runs on the same starved event loop as agent turns. Under GIL pressure from heavy agent turns or delegation, the loop can't service the ping within 20s, so the websockets protocol declares the connection dead. Reconnects fail with ready_send_failed because the old process's loop is still wedged. 
None of these fixes touch the model-facing message array, prompt caching, message role alternation, or the wire protocol — they are strictly display-transport improvements plus a config tweak and a diagnostic log. Tests: 762 passed, 17 skipped (0 failures) across test_tui_gateway_ws, test_tui_gateway_server, test_web_server, and tui_gateway/ suites.	2026-06-30 03:11:13 -07:00
teknium1	35a0803a3b	fix(delegation): budget subagent summaries against parent context headroom Batch delegation returned each subagent's full final_response verbatim into the parent's context. A fan-out of N children could dump 60k+ tokens at once, blowing the parent's context window and — on rate-limited providers — triggering a compression/429 death spiral (429 misread as context-too-large -> window step-down -> retry loop -> conversation dies). Cap each summary against the parent's remaining context headroom split across the batch (not a magic char count). When trimming, mirror the web_extract convention: spill the full text to cache/delegation (mounted into remote backends via credential_files._CACHE_DIRS) and return a head+tail window (75/25, line-snapped) plus a footer with the exact read_file offset to page the omitted middle. Both the subagent's opening AND its closing (outcomes / files-changed / issues, which live at the end) survive in-context, and nothing is lost — the parent can read_file the full version on any backend. delegation.max_summary_chars (default 24000) is a static ceiling layered on top as belt-and-suspenders for models that ignore 'be concise'; 0 disables it. Child prompt tightened to lead with outcomes / bullets. Co-authored-by: rc-int <rcint@klaith.com>	2026-06-30 03:07:40 -07:00
kshitij	26f39f7b90	fix(credentials): prefer ~/.hermes/.env over stale os.environ on key rotation (#55528 ) `_resolve_api_key_provider_secret` resolved API keys via `get_env_value`, which returns the `os.environ` value first and only falls back to `~/.hermes/.env`. After a user rotates a key in `.env`, a stale value still exported in the parent shell (Codex CLI, test runner, login profile) shadows the fresh key on every request, producing persistent 401s. The credential-pool seeding path was already fixed to prefer `.env` (#18254/#18755), but the live request-time resolution path was not — so the pool re-seeded with the fresh key while `_resolve_api_key_provider_secret` kept returning the stale shell export. This closes that remaining path. - config: add `get_env_value_prefer_dotenv()` — checks `~/.hermes/.env` first, then `os.environ`. Distinct from `get_env_value()` (unchanged, os.environ-first) so only Hermes-managed credential resolution flips precedence; the generic helper's many callers are unaffected. - auth: `_resolve_api_key_provider_secret` resolves through the new helper. - tests: regression coverage for both the pool-seeding path and the auth resolution path (a rotated `.env` key must beat a stale shell export). Closes #20591. Co-authored-by: 0xDevNinja <manmit0x@gmail.com>	2026-06-30 09:49:52 +00:00
brooklyn!	1d495cfbbf	Merge pull request #55226 from NousResearch/bb/desktop-memory-graph feat(desktop): memory graph — playable timeline of memories + skills over time	2026-06-30 04:36:17 -05:00
Teknium	3f19df2a5b	fix(mcp): late-refresh must see desktop/dashboard discovery thread owner (#55514) MCP tools connected and enabled but never surfaced into the agent's session toolset on the desktop app + dashboard WebUI (#51587). There are two independent background MCP discovery thread owners by surface: tui_gateway.entry (stdio 'hermes --tui') and hermes_cli.mcp_startup (desktop app + dashboard WS sidecar via tui_gateway/ws.py, and 'hermes dashboard'). The late-refresh scheduler gates on tui_gateway.entry.mcp_discovery_in_flight(), which read ONLY the entry thread global. On the desktop/dashboard surfaces that global is None, so a server slower than the bounded build-time wait never triggered a late refresh and its tools stayed invisible for the whole session. Make mcp_discovery_in_flight() / join_mcp_discovery() consult BOTH thread owners. Adds the matching in-flight/join helpers to hermes_cli.mcp_startup and has tui_gateway.entry delegate to them as a second owner.	2026-06-30 02:08:37 -07:00
Brooklyn Nicholson	4dbd869ab3	feat(agent): restore surface-aware "auto" default for verify_on_stop #53552 flipped verify_on_stop to default OFF because the guard fired on doc/markdown/skill edits and felt like noise. That doc/markdown/skill suppression already shipped in the same change (_filter_verifiable_paths in agent/verification_stop.py), so the original noise rationale no longer holds: the guard already skips prose-only turns. Restore the surface-aware "auto" default — ON for interactive coding surfaces (CLI, TUI, desktop) and programmatic callers, OFF for conversational messaging surfaces (Telegram, Discord, etc.) where the verification narrative would reach a human as chat noise. The missing/unrecognized fallback in verify_on_stop_enabled now resolves to the same surface-aware default instead of hard OFF, so both the DEFAULT_CONFIG value and the resolver agree. Scope: this changes the shipped default for fresh installs and configs without an explicit verify_on_stop key. Existing configs that #53552/#54740 migrated to an explicit `false` are respected and unchanged — this PR does not add a force-migration of those values back to auto.	2026-06-30 01:43:08 -05:00
Brooklyn Nicholson	821d9f709f	feat(agent): add configurable coding_instructions agent.coding_instructions (a string or list) is appended to the coding brief as its own stable system block, so users can pin project-wide workflow rules without editing the shipped brief. Coding-posture only and cache-safe (resolved once per session; takes effect next session). Empty by default.	2026-06-30 00:59:59 -05:00
Brooklyn Nicholson	a10113658b	feat(agent): add pre_verify hook and verify-on-stop coding guidance Add a `pre_verify` user/plugin/shell hook fired once per turn when the agent edited code and is about to finish, after the existing verify-on-stop guard. A hook can keep the agent going one more turn (run a check, defer it, tidy the diff) by returning {"action":"continue","message":...} (the Claude-Code Stop shape {"decision":"block","reason":...} is accepted too). Hooks receive coding, attempt, final_response, and sorted changed_paths so they can self-scope and self-throttle; the path is bounded by agent.max_verify_nudges and preserves message-role alternation. Hermes still ships its default coding guidance (agent.verify_guidance, on by default), but it now rides the evidence-based verify-on-stop missing-evidence nudge instead of a separate default pre_verify continuation, so it costs no extra model turn of its own. Guidance reuses the shared utils.is_truthy_value parser rather than a local copy.	2026-06-30 00:59:29 -05:00
Brooklyn Nicholson	96552c31e3	feat(learning): profile-scoped memory + learned-skill graph API Assemble a per-profile graph of memories and learned skills over time (agent/learning_graph.py) and serve it at GET /api/learning/graph (hermes_cli/web_server.py), with tests. The radial time axis the desktop renders is derived from this payload; the REST path stays under /learning for backend compatibility.	2026-06-30 00:54:14 -05:00
teknium1	463b1dfa9c	fix(container-boot): also autostart a gateway stranded in 'degraded' degraded is the same wedge class as draining: the gateway came up with some platforms queued for retry, fell through to the running state (gateway/run.py #5196), and is serving. A hard-kill there strands gateway_state=degraded, which (like draining) is not in _AUTOSTART_STATES and is not an operator stop or a failed boot — so it would stay DOWN forever on every recreate. Add degraded to _TRANSIENT_RUNNING_STATES so the fallback path normalises it to running-intent too.	2026-06-29 21:12:36 -07:00
Ben	d3f2931b8c	fix(container-boot): autostart a gateway stranded in 'draining' state A gateway hard-killed while draining (a container/VM recreate SIGTERMs it before _stop_impl reaches its terminal-state persist) leaves gateway_state.json frozen at 'draining'. With no explicit desired_state to fall back to, container_boot read that transient value literally, found it not in _AUTOSTART_STATES, and left the gateway DOWN on every subsequent boot — dashboard up, messaging silently dark. Observed on a relay-opted-in staging instance (2026-06): the s6 gateway-default slot kept its 'down' marker across recreates and the gateway never came back. 'draining' is a transient sub-state of RUNNING (written by the drain watcher / scale-to-zero go-dormant path), never an operator stop and never a failed boot. Normalise it to 'running' in the gateway_state fallback so a stranded drain marker reads as the run-intent it represents. This extends gateway/run.py's #42675 handling (persist 'running' on an unexpected signal) to the case where the gateway died before persisting anything at all. 'starting'/'startup_failed' are deliberately NOT normalised — those mean a mid-boot death and must stay down to avoid the crash-loop the down-marker guard prevents. An explicit desired_state still wins verbatim, so an operator stop survives a transient 'draining' runtime value. Tests: draining named-profile + default-root autostart (both fail without the fix), plus a guard that an explicit desired_state=stopped still blocks a draining runtime.	2026-06-29 21:12:36 -07:00
Jaaneek	9ce79cd642	feat(xai): Imagine public-URL storage, chaining & video edit/extend Add durable public-URL output and URL-based chaining to xAI Grok Imagine: - Store generated media on files-cdn with permanent public HTTPS URLs (public_url: true, no expiry by default). - Chain by URL: generate -> edit -> extend each take a prior result's public HTTPS URL (or a data URI / local file for inputs). - Add provider-specific xai_video_edit and xai_video_extend tools. - Image generation: public-URL/storage output, multi-reference edits, and ~/ local-path support for image edits. Credentials use xAI Grok device-code OAuth (separate PR).	2026-06-29 21:11:58 -07:00
Teknium	481caa66f2	feat(display): friendly human-phrased tool labels for built-in tools (#55166 ) * feat(display): friendly human-phrased tool labels for built-in tools Built-in tools now render ChatGPT-style status verbs ('Searching the web for ...', 'Reading <file>', 'Browsing <url>') on the CLI spinner and gateway/desktop tool-progress instead of the raw tool name. - agent/display.py: _TOOL_VERBS map + build_tool_label() + set/get friendly-labels flag (default on). Custom/plugin/MCP tools fall back to the raw preview; verbose gateway mode left untouched (debug surface). - tool_executor.py / tui_gateway / gateway: route the three spinner sites, the TUI _tool_ctx, and the gateway all/new progress line through the label. - config: display.friendly_tool_labels (default True, per-platform aware). Zero new core tool / schema footprint — pure display layer. * docs: add PR infographic for friendly tool labels * fix(display): preserve arg preview in gateway friendly labels + update tests The first gateway pass re-derived the label from the callback's `args`, which is empty ({}) at the gateway tool.started callsite — the command/query lives in the `preview` string, so terminal rendered as a bare '💻 Running' and dedup collapsed consecutive commands. Now the gateway prefixes the verb onto the already-computed preview via get_tool_verb/tool_verb_connector/verb_drops_preview, preserving the command/url/query. CLI spinner path (real args) keeps build_tool_label. Tests: update test_run_progress_topics exact-format assertions to the friendly form ('💻 Running pwd'), add a format-agnostic preview extractor for the truncation tests (works for both quoted-legacy and verb-prefixed output). * test(tui): update resume-display context to friendly tool label _tool_ctx now uses build_tool_label, so the desktop resume-view context for a search_files turn reads 'Searching files for resume' instead of the bare 'resume' preview — consistent with live tool-progress. Update the assertion. 
* test(tui): harden no-race worker test against sibling shard leakage test_session_create_no_race_keeps_worker_alive flaked under -j 8: a daemon build thread leaked from a prior session.create test in the same shard process fires close/unregister against its own (foreign) session_key after this test patches the global approval hooks, polluting the captured lists. Scope the assertions to this session's own session_key so the regression intent (this session's worker/notify must survive) is preserved while the test becomes immune to shard composition. Not related to friendly-tool-labels.	2026-06-29 20:31:17 -07:00
Ben Barclay	b963d3238b	feat(gateway): suppress home-channel shutdown broadcast on flagged drains (#54824 ) Add a generic suppress_notification flag to the drain-request marker. When a drain that ends in process exit (e.g. a NAS auto-update image migration on the always-on Hermes Cloud fleet) is flagged, the gateway skips ONLY the home-channel 'gateway shutting down' broadcast — the operator-flavoured ping that would otherwise fire on every routine auto-update, dozens of times a day. The per-active-session interrupt ping is ALWAYS kept: on a drained shutdown it's empty by construction, and in the force-interrupt (deadline-exceeded) case it carries the user-valuable 'your task was cut off, message me to resume' hint. The gateway stays agnostic about WHY a drain is quiet (generic boolean, not a kind enum); the policy of which drain causes set the flag lives in the caller (NAS). Default-false so legacy/operator drains behave exactly as before. The reader reuses the NS-570 epoch-staleness check so an orphaned marker on the durable volume can never silence a fresh gateway's legitimate broadcast. - drain_control.py: write_drain_request gains suppress_notification; new drain_notification_suppressed() reader (current-epoch + truthy flag). - web_server.py: /api/gateway/drain reads + echoes the flag. - run.py: _notify_active_sessions_of_shutdown skips the home-channel loop only. Tests prove: flag round-trips; home-channel suppressed when set, kept when unset; active-session ping always fires; stale/legacy/corrupt markers never suppress.	2026-06-29 12:18:11 -07:00
Teknium	ee8cbfdc03	feat(web_extract): truncate-and-store instead of LLM summarization (#54843 ) * feat(web_extract): truncate-and-store instead of LLM summarization web_extract no longer runs an auxiliary LLM over scraped pages. The extract backends (Firecrawl/Tavily/Exa/Parallel) already return clean, boilerplate- stripped markdown, so we return it directly: pages within a char budget (default 15000, web.extract_char_limit) come back whole; larger pages get a head+tail window plus an explicit footer giving the stored full-text path and the read_file call to page through the omitted middle. The full clean text is written to cache/web (mounted read-only into remote backends like the other cache dirs), so nothing is lost. Inline base64 images are converted to [IMAGE: alt] placeholders (token bombs dropped) while real http(s) image URLs are preserved as links so the agent can still web_extract/vision_analyze them. Removes process_content_with_llm + the chunked summarizer + check_auxiliary_model + _resolve_web_extract_auxiliary. context_references._default_url_fetcher is updated to the truncate path and its stale data.documents shape read is fixed to results (it was silently returning empty). Live before/after eval (firecrawl, 4 URLs): 11.7x faster overall (176.6s -> 15.1s); 10-60x on large pages. Quality identical; findability 4/4 (answer recoverable from stored full text on every truncated page). web_search is unchanged. No own scraper added; no changes to web_search. * fix(web_extract): add char_limit to execute_code web_extract stub The new web_extract char_limit param must appear in the code_execution_tool _TOOL_STUBS signature (and doc line) or test_stubs_cover_all_schema_params fails — the stub schema must cover every real schema param.	2026-06-29 10:00:49 -07:00
Ben Barclay	f53ba9bb54	fix(s6): dot-prefix gateway staging dir so svscan ignores it mid-build (#54834) The register path builds each profile-gateway slot in a sibling staging dir under /run/service (the scandir s6-svscan watches), then atomically renames it to the live gateway-<profile> name. The staging dir was named gateway-<profile>.tmp (a NON-dotfile), so a concurrent `s6-svscanctl -a` rescan (fired by the cont-init reconciler registering gateway-default, or by a sibling register) would supervise the half-built slot the moment it had a valid type/run: s6-supervise spawns AS ROOT and mkdirs supervise/ root-owned 0700, then the in-flight _seed_supervise_skeleton early-returns on the now-existing supervise/ and the next `mkdir supervise/event` hits PermissionError. 
That is the arm64-only CI flake on test_s6_unregister_removes_service_dir_in_live_container (PermissionError: /run/service/gateway-phase3test.tmp/supervise/event) — arm64-only because the native-arm runner's wider scheduling jitter lets the rescan land inside the ~ms seed window; amd64 ran 30/30 clean. Fix: dot-prefix the staging dir (.gateway-<profile>.tmp) in both register paths (S6ServiceManager.register_profile_gateway and container_boot._register_service). s6-svscan skips any scandir entry whose name begins with '.', so the half-built slot can never be supervised mid-build. The atomic rename to the dotless live name is unchanged. Verified on a real s6 image (amd64): a non-dotted staging dir is picked up by an svscanctl -a rescan (SUPERVISED owner=root) while a dot-prefixed one is ignored (NOT-SUPERVISED). Added a docker-harness regression test that asserts both, plus a unit test that the staging dir is dot-prefixed.	2026-06-29 21:33:00 +10:00
teknium1	61f56d27db	refactor(dashboard-auth): drop redundant _interactive_providers helper list_session_providers() already filters on supports_session=True, so the new helper re-filtered an already-filtered list. Call it directly at the single auto-SSO call site.	2026-06-29 04:25:18 -07:00
Ben	f5ecbe1ec6	feat(dashboard): auto-initiate portal SSO redirect on unauthenticated load When the dashboard gateway has no local session cookie, it rendered a click-through /login interstitial — even though the Nous portal's /oauth/authorize auto-approves any current member of the dashboard's org and is a silent 302 when the user already holds a portal session. For the common case (clicking a hosted-agent dashboard link while signed in to the portal) that interstitial click is pure friction. This makes the gate auto-initiate the OAuth redirect on an unauthenticated HTML document load instead of rendering the interstitial, when exactly one interactive provider is registered. A one-shot loop-guard cookie (hermes_sso_attempt, 60s TTL) ensures that a genuinely absent portal session (the portal bounces back still-unauthenticated) falls back to the /login page after exactly one bounce rather than ping-ponging forever. The marker is cleared on a successful callback and whenever the gate falls back to /login. Security: this removes a human CLICK, not a security check. The redirect lands on the existing /auth/login route and runs the unchanged PKCE auth-code flow; token verification, audience checks, redirect-URI match, and org-membership checks are all untouched. /api/* fetches still get the 401 JSON envelope (never a 302 a fetch() would follow opaquely), and with two or more providers the /login chooser still renders. Phase 1 of the cloud-auto-discovery work.	2026-06-29 04:25:18 -07:00
Sahil-SS9	1bb7b59c5d	fix: offload blocking profiles endpoints from asyncio event loop (#54523) (cherry picked from commit `09f10e2b77`)	2026-06-29 02:35:57 -07:00
chenxiang	d5eee133eb	perf(profiles): fix list_profiles O(N*M) wrapper rescan (6.4s -> 0.4s) find_alias_for_profile re-scanned the whole wrapper dir (~/.local/bin) and read_text every file for EACH profile, including large unrelated binaries (ffmpeg etc.) read 15x over. With 16 profiles this took ~6.4s, long enough that the desktop's per-request backend calls timed out (15s) and the sidebar rendered '全部智能体 0 / 会话 0' ("All agents 0 / Sessions 0"). - Add build_alias_map(): single-pass {profile -> alias} reverse map, reads only an 8KB head slice per wrapper, skips binaries via UnicodeDecodeError. - find_alias_for_profile now delegates to it (behavior preserved). - Cache _count_skills by skills-dir mtime signature (+30s TTL). list_profiles: 6.37s -> 0.84s cold / 0.44s warm. 138 profile tests pass. (cherry picked from commit `89e593749a`)	2026-06-29 02:35:57 -07:00
Telos	fa11b11cf5	fix: propagate key_env from custom_providers into ProviderDef resolve_custom_provider() previously returned api_key_env_vars=() for every custom provider entry, silently dropping the configured key_env field. This caused 401 errors for any custom provider that required an API key via environment variable (e.g. Xiaomi MiMo Token Plan, self-hosted OpenAI-compatible servers). The key_env field is already documented in _VALID_CUSTOM_PROVIDER_FIELDS and normalized by normalize_custom_provider_entry(), so this was just an oversight in the ProviderDef construction. Also adds a regression test that verifies key_env is properly propagated into the resolved ProviderDef.	2026-06-29 02:25:48 -07:00
Teknium	bf0d8fed8e	fix(config): v32 migration flips baked-in verify_on_stop=true to false (#54740) The first ship of verify-on-stop (config v30) defaulted DEFAULT_CONFIG agent.verify_on_stop to a literal True, and migrate_config persists defaults with strip_defaults=False — so every install that updated through v30 had verify_on_stop: true written into config.yaml as a literal. The v30->v31 migration only flipped missing/'auto' values to false and deliberately preserved an explicit bool, so it skipped that entire population and left verify-on-stop ON for everyone who had updated. A literal true was never a user choice: the feature had no off-switch worth setting it against until v31 introduced one, so a true persisted before v32 is always the old machine default. v32 migration flips a literal true -> false once, for both v30 (skipped v31) and v31 (preserved-by-bug) installs. A true the user sets AFTER v32 is a deliberate opt-in and is never touched.	2026-06-29 01:51:08 -07:00
teknium1	41095fdb04	fix(camofox): register CAMOFOX_API_KEY in OPTIONAL_ENV_VARS The auth-header fix reads CAMOFOX_API_KEY but it was never registered, so it didn't surface in `hermes setup` / `hermes tools`. Add it as an advanced password-category tool env var alongside CAMOFOX_URL.	2026-06-29 01:26:24 -07:00
Ben	4125cc3b7c	fix(slack): subscribe to message.mpim + mpim scopes so group DMs work Group DMs (multi-person DMs, channel_type=mpim) were never delivered to the Slack bot. The adapter already classifies mpim as a DM and replies ambiently (adapter.py:2526, is_dm = channel_type in {im, mpim}), but the generated app manifest only subscribed to message.im / im:history — the 1:1 DM pair. Without the message.mpim event subscription Slack drops group-DM messages before the adapter ever sees them, so 1:1 DMs worked while group-DM ambient mode was dead. Add message.mpim to bot_events and mpim:history (the scope that event requires per Slack docs) + mpim:read (mirrors im:read for the conversations.info classification call) to bot_scopes. Update the SLACK_BOT_TOKEN / SLACK_APP_TOKEN setup-help strings and the Slack docs (EN + zh-Hans: scope table, event table, troubleshooting) so existing installs are told to add the new scopes and reinstall. Reported by an enterprise customer. Note: this is a manifest/scope change, so it only takes effect after the app is reinstalled and the new scopes are accepted. Tests: assert message.mpim + mpim:history + mpim:read are in the manifest (with and without assistant mode); both fail on current main and pass with this change.	2026-06-29 01:02:53 -07:00
Ben	1c75e7c9d8	feat(dashboard): list & add arbitrary custom .env keys on the Keys page The Keys page only rendered env vars present in a catalog (OPTIONAL_ENV_VARS or the provider catalog); any other key a user set in .env was invisible, and there was no way to add an arbitrary env var from the GUI (e.g. to inject a var a skill or MCP server needs). Backend: GET /api/env now also emits a row for every on-disk .env key that isn't in any catalog, flagged category="custom" + custom=true and password-masked (an unrecognised key could hold anything, so it's redacted and reveal-gated like any secret). Channel-managed credentials stay excluded. The write (PUT /api/env) and reveal (POST /api/env/reveal) paths already handle arbitrary keys, with the existing env-name guard + denylist (PATH, LD_PRELOAD, PYTHONPATH, …) enforced server-side — no new write surface. Frontend: a new "Custom Keys" section lists those custom rows and carries an add-a-key form (client-side name validation mirroring the backend regex; the new row reuses the normal edit/save flow, so on save it round-trips back from the backend as a durable custom row). i18n added for en + zh + types. Tests: behavior-contract coverage that an unknown .env key surfaces as a masked custom row and a catalogued key does not — verified to fail on the pre-fix backend.	2026-06-28 22:53:56 -07:00
Shannon Sands	476875acb9	Add dashboard backup upload and download	2026-06-28 22:35:09 -07:00
brooklyn!	388268ecde	Merge pull request #54568 from NousResearch/bb/shared-websocket-layer refactor(desktop+dashboard): shared WebSocket layer + decouple desktop from dashboard (hermes serve)	2026-06-28 23:43:49 -05:00
Ben Barclay	0943e2a272	fix(cron): don't report a false 'gateway not running' on external-provider instances (#54600) `hermes cron status` (and the create/list 'gateway not running' nag) judge whether cron will fire purely from the in-process ticker's heartbeat file + a live gateway PID. That heuristic is correct for the built-in ticker but WRONG for an external provider like Chronos: Chronos arms exactly one external one-shot per job and is fired by a NAS-mediated webhook (POST /api/cron/fire). Its `start()` returns immediately and it deliberately runs no 60s loop and writes no ticker heartbeat — that's the whole point of scale-to-zero (the machine is at zero between fires). So on a perfectly healthy Chronos instance, `cron status` always printed '✗ Gateway is not running — cron jobs will NOT fire' (or a STALLED-ticker warning), and `cron create` always appended the 'jobs won't fire automatically' nag — both false. Verified live on a staging Chronos instance: jobs fired and completed on schedule via the relay while `cron status` insisted the gateway wasn't running and the heartbeat was 370s+ stale. Fix: resolve the active provider (offline — `resolve_cron_scheduler`, whose `is_available()` contract forbids network) and, for any non-builtin provider, report the managed-scheduler state instead of the ticker heuristics, and suppress the ticker-only 'gateway not running' warning. The built-in path is byte-unchanged. Active-job summary is factored into a shared helper so both paths print it identically. New tests prove both directions (chronos: no false negative even with no gateway PID / no heartbeat; builtin: historical warning preserved) and fail without the fix.	2026-06-29 14:03:02 +10:00
lkevincc	163562bf88	fix: normalize lmstudio base urls	2026-06-28 20:46:44 -07:00
Brooklyn Nicholson	9d9a50c2bc	test(cli): pin the `hermes serve` decoupling contract Add a focused contract test for the headless `serve` command (routes to the shared dashboard handler, headless by default while `dashboard` is not, accepts the legacy --no-open, shares the same runtime/lifecycle flag surface). Also refresh the dashboard.py module docstring to cover both commands.	2026-06-28 22:11:48 -05:00
Brooklyn Nicholson	dff491a2b9	feat(cli): add headless `hermes serve` backend; desktop no longer launches `dashboard` The desktop app spawned `hermes dashboard --no-open` as its backend, which made the dashboard look like a desktop prerequisite. Add a dedicated headless `hermes serve` command that boots the same gateway (shared cmd_dashboard / start_server) but never opens a browser, and point the desktop backend spawn exclusively at it. dashboard and serve are now independent surfaces — neither launches the other. - subcommands/dashboard.py: factor shared server args; add `serve` parser (always headless; accepts legacy --no-open as a no-op) - main.py: register serve in _BUILTIN_SUBCOMMANDS + coalesce set + gui-log detection; extend stale-backend reaper patterns to match `serve` - desktop electron: spawn `serve`, rename dashboardArgs -> backendArgs, update comments + windows-child-process test assertions - docs: desktop README, desktop.md (incl. remote-backend), AGENTS.md, and cli-commands.md now describe `hermes serve` as the desktop/headless backend	2026-06-28 22:04:22 -05:00
Ben	dee41d0716	feat(dashboard): catalogue all memory-provider API keys in OPTIONAL_ENV_VARS The dashboard Keys page and `hermes setup` render API-key rows from OPTIONAL_ENV_VARS, but only Honcho had an entry — so Hindsight, Supermemory, Mem0, RetainDB, ByteRover, and OpenViking read their keys straight from os.environ yet had no place to set them in the GUI. Add catalog entries (category=tool, password-masked, with get-key URLs and the tool each powers) for all six, plus the relevant base-URL/endpoint companions. Pure declaration: the generic GET /api/env endpoint, the save/reveal write path, and the sandbox env blocklist (which auto-derives from tool-category OPTIONAL_ENV_VARS) all pick these up with no further wiring. Adds a behavior-contract test asserting every memory provider's primary credential key is catalogued, tool-categorised, and password-masked.	2026-06-28 19:17:02 -07:00
Teknium	11183e8332	fix(profiles): validate custom alias names to prevent path traversal `hermes profile alias <profile> --name <custom>` accepted arbitrary strings and used them verbatim as a filename under ~/.local/bin. Because normalize_profile_name only lowercases/strips (no regex gate), a value like `../../.bashrc` escaped the wrapper directory and clobbered arbitrary user-writable files. remove_wrapper_script had the same sink. Add validate_alias_name (reusing the profile-id regex, which forbids `/`, `.`, and `..`) and wire it into check_alias_collision, create_wrapper_script, remove_wrapper_script, and the CLI alias action so the rejection surfaces a clear "Invalid alias name" error instead of silently writing or unlinking outside the wrapper dir. Co-authored-by: Gutslabs <gutslabsxyz@gmail.com> Co-authored-by: Xowiek <xowiekk@gmail.com>	2026-06-28 18:53:33 -07:00
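The alias-name gate described in the commit above can be sketched roughly as follows. This is an illustration only: the regex `ALIAS_RE`, the `ValueError`, and the function signature are assumptions made for this example, not the repository's code. The commit states only that `validate_alias_name` reuses the profile-id regex, which forbids `/`, `.`, and `..`.

```python
import re

# Hypothetical stand-in for the profile-id regex mentioned in the commit.
# It admits lowercase letters, digits, "_" and "-", so "/" and "." (and
# therefore "..") can never appear in an accepted name.
ALIAS_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{0,63}$")


def validate_alias_name(name: str) -> str:
    """Return the name unchanged if it is safe as a bare filename.

    Raises ValueError otherwise, so callers fail before any file under
    the wrapper directory is written or unlinked.
    """
    if not ALIAS_RE.fullmatch(name):
        raise ValueError(f"Invalid alias name: {name!r}")
    return name
```

Validating before the name is ever joined to the wrapper directory means a value such as `../../.bashrc` is rejected at the boundary instead of being resolved into a path.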
aaronagent	5c1ac6c70d	fix(config): strip `export` prefix in .env parsers across three modules All three .env parsers use `line.partition("=")` without stripping the bash-compatible `export ` prefix first. A line like `export API_KEY=sk-...` produces key `"export API_KEY"` instead of `"API_KEY"`, silently ignoring the variable and causing auth failures for users who copy-paste from bash profiles or follow tutorials that include `export`. - tools/skills_tool.py: `load_env()` for skill environment - hermes_cli/config.py: `load_env()` for core config - hermes_cli/main.py: `_has_any_provider_configured()` inline parser Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-06-28 18:53:00 -07:00
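A minimal sketch of the parsing fix the commit above describes, assuming a line-oriented parser built on `line.partition("=")`. The helper name `parse_env_line` is hypothetical and is not one of the three parsers named in the commit; quoting and inline comments are deliberately ignored here.

```python
from typing import Optional, Tuple


def parse_env_line(line: str) -> Optional[Tuple[str, str]]:
    """Parse one .env line into (key, value); return None for non-assignments."""
    stripped = line.strip()
    # Skip blank lines and comments.
    if not stripped or stripped.startswith("#"):
        return None
    # Drop the bash-compatible "export " prefix before splitting on "=".
    # Without this step the key would come out as "export API_KEY".
    if stripped.startswith("export "):
        stripped = stripped[len("export "):].lstrip()
    key, sep, value = stripped.partition("=")
    if not sep:
        return None
    return key.strip(), value.strip()
```

Matching on `"export "` with the trailing space keeps a key that merely begins with those letters, such as `EXPORTED=1` or `exporter=x`, from being truncated.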
Brooklyn Nicholson	27f03243a0	fix(dashboard): stop ElevenLabs voice-list 401 log spam The /api/audio/elevenlabs/voices endpoint logged a WARNING on every failure, and the desktop re-polls it on each settings open/focus — a bad/expired/scoped ELEVENLABS_API_KEY floods agent/gui logs with identical "voice list failed: HTTP Error 401" lines indefinitely. Treat 401/403 as a persistent "integration unavailable" state: return {available: false, error: "unauthorized"} with a 200 (the dropdown already handles available:false) instead of a 502, and collapse repeated identical failures to a single log line via a small re-arming latch (logs again on recovery or when the error changes). Non-auth errors keep the 502 but are throttled the same way.	2026-06-28 17:59:28 -05:00
Teknium	980622d0ec	perf(startup): parse config + plugin manifests with libyaml CSafeLoader (#54486) The startup config/manifest reads used PyYAML's pure-Python SafeLoader, which is ~8x slower than the libyaml-backed CSafeLoader C extension. config.yaml is parsed several times during launch (cli config, raw config, early interface/redaction bridge, logging config) and every plugin manifest is parsed once — all on the slow path. Add utils.fast_safe_load (CSafeLoader-preferring, pure-Python fallback, true drop-in for safe_load) and route the hot startup parse sites through it: hermes_cli/config.py (config + manifest reads), hermes_cli/plugins.py (manifest parse), env_loader, cli.load_cli_config, hermes_logging, and the two pre-config early YAML bridges in main.py. Behavior is identical (same restricted safe tag set); only speed changes. safe_load calls on the startup path drop from ~79 to ~0, cutting the YAML parse cost from ~0.9s to ~0.15s under profiling. Adds tests/test_fast_safe_load.py asserting equivalence with safe_load across input shapes, empty-doc falsiness, C-loader preference, and that python/object tags are still rejected (safe, not full loader).	2026-06-28 15:38:39 -07:00
brooklyn!	16ff1a3b93	Merge pull request #54457 from NousResearch/bb/windows-console-launcher-repair fix(windows): repair missing console script launchers	2026-06-28 17:15:56 -05:00
奥森木	e7d4ade8cf	fix(anthropic): ignore stale non-Anthropic base_url across all resolution paths A config left with `provider: anthropic` but a leftover `base_url: https://openrouter.ai/api/v1` (e.g. after a provider switch) would route Anthropic OAuth/setup-token traffic to OpenRouter and 404. Add `_anthropic_base_url_override_ok()` and gate the three native-Anthropic resolution branches (pool, explicit, native) on it. The guard honors a configured `model.base_url` only when it plausibly speaks the Anthropic Messages protocol — official `.anthropic.com` / `.claude.com` hosts, Azure Foundry endpoints, and `/anthropic`-suffixed or Kimi `/coding` proxies — and falls back to `https://api.anthropic.com` otherwise. Aggregator URLs like openrouter.ai / api.openai.com are treated as stale. Reconstructed from @clovericbot's PR #3661 onto current main: the original patched one branch with an anthropic-only allow-list, which would have broken Azure-via-anthropic; widened to all three sites and made Azure/proxy-safe.	2026-06-28 15:12:03 -07:00
Teknium	95f2919f91	perf(startup): lazy-load gateway platform adapters (#54448) Bundled platform plugins (telegram, discord, feishu, teams, ...) were eagerly imported at plugin-discovery time on every `hermes` invocation, including plain `hermes chat` which never touches a gateway platform. Their modules import heavy platform SDKs at module level (lark_oapi, microsoft_teams, discord.py, slack_bolt, ...) — feishu alone pulled in lark_oapi (~2.6s), teams pulled microsoft_teams (~1.9s). Discovery now registers a cheap deferred loader per platform in the platform_registry; the adapter module is imported only when the gateway / cron / setup / send_message path actually asks for that platform. is_registered() and the iterate-all accessors stay correct (deferred counts as registered; plugin_entries()/all_entries() materialize all deferred loaders, since those paths genuinely need every adapter). Cold start: ~4.4s -> ~2.45s to banner. discover_and_load: 2.0s -> 0.3s (warm), and the heavy SDKs are no longer imported at all in CLI mode. Every shipped platform remains available out of the box — it just loads on first use.	2026-06-28 15:11:59 -07:00
Mibayy	b0b7ff0d75	fix(provider): auto+base_url bypasses cloud API when custom endpoint configured (#3846) When config.yaml has `provider: auto` and a non-cloud `base_url` (e.g. Ollama at localhost:11434), requests were silently sent to https://api.anthropic.com whenever ANTHROPIC_API_KEY was present in the environment, ignoring the configured local endpoint and returning HTTP 401 / "credit balance too low". Root cause: resolve_provider("auto") scans env vars and returns "anthropic" when ANTHROPIC_API_KEY is set, before config.model.base_url is ever consulted. In resolve_runtime_provider(), before calling resolve_provider(), short-circuit to the OpenAI-compatible resolver when no explicit creds were passed, provider is "auto"/unset, and a non-cloud base_url is configured. Well-known cloud roots (openrouter.ai, anthropic.com, openai.com) are matched on HOST (not substring) so look-alike hosts can't evade the bypass and leak a cloud credential. Co-authored-by: Hermes Agent <hermes@nousresearch.com>	2026-06-28 15:11:55 -07:00
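The host-versus-substring distinction in the commit above can be illustrated with a short sketch. The three cloud roots come from the commit message; the function name `is_cloud_base_url` and the exact matching rule (exact host or dot-separated subdomain) are assumptions for illustration, not the repository's implementation.

```python
from urllib.parse import urlparse

# Cloud roots named in the commit message.
CLOUD_ROOTS = ("openrouter.ai", "anthropic.com", "openai.com")


def is_cloud_base_url(base_url: str) -> bool:
    """Return True if base_url's HOST is a known cloud root or a subdomain of one."""
    # urlparse(...).hostname is None when the URL has no network location.
    host = (urlparse(base_url).hostname or "").lower()
    # Compare on the parsed host, never on the raw URL string, so a root
    # appearing in a path or as a prefix of another domain does not match.
    return any(host == root or host.endswith("." + root) for root in CLOUD_ROOTS)
```

Under this rule `https://api.anthropic.com` counts as cloud, while `http://localhost:11434/v1`, `https://anthropic.com.evil.example/v1`, and `http://evil.example/openai.com` do not; a plain substring test would have misclassified the last two.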
Gille	df8e2523fa	fix(windows): verify launchers after primary install	2026-06-28 17:02:05 -05:00
HexLab98	95994bbc56	fix(windows): repair missing hermes.exe after pip install (#52931) On Windows, uv pip install -e . can register hermes.exe in package metadata while the launcher never lands on disk. Detect missing [project.scripts] shims and reinstall entry points under the existing quarantine path in hermes update and install.ps1.	2026-06-28 17:01:31 -05:00

3164 commits