hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-27 11:22:03 +00:00

Author	SHA1	Message	Date
kshitij	1aa458a1e6	Merge pull request #52920 from NousResearch/salvage/38798-toolset-validation fix(config): surface invalid platform_toolsets instead of silently dropping tools (#38798)	2026-06-26 14:14:55 +05:30
lEWFkRAD	41ede84b93	fix(config): surface invalid platform_toolsets instead of silently dropping tools (#38798) A config migration (or hand-edit) that leaves an invalid toolset name in `platform_toolsets` — e.g. the #38798 corruption that rewrote `hermes-cli` to the non-existent `hermes` — silently disabled all affected tools: resolve_toolset() returns [] for an unknown name, so the agent quietly lost its tools with no error, warning, or log entry and degraded to text-only replies. Surface it loudly at two points: - After migration (migrate_config): validate platform_toolsets and record/print a warning per unknown name, with a `hermes-<platform>` suggestion when that would have been valid (the exact #38798 shape). - At runtime (_get_platform_tools): if a platform was explicitly configured but every toolset name is invalid, log a warning when tools are resolved for a session — so an ALREADY-corrupted config is caught at startup, not only on the next `hermes update`. Logic lives in a new pure, side-effect-free helper (toolset_validation.py) with validate_toolset injected, so it is unit-testable without the tool registry. Note: the original v25→v26 migration that caused the corruption no longer exists (config format is now v30; no migration step rewrites toolset names). This change is the durable defense against the silent-failure mode regardless of cause, matching the issue's "Expected: log a warning". Salvaged from #39207 by @lEWFkRAD (authorship preserved via cherry-pick). Tests: 9 helper cases (incl. the #38798 corruption shape, mixed valid/invalid, zero-tools state, non-dict/scalar/non-string) + a runtime caplog test — both the helper warning and the runtime guard mutation-verified to fail without the fix. Closes #38798. Supersedes #39581 (prevent-in-v25→v26 — that path is gone), #41006 / #40208 (repair-migration for already-corrupted configs).	2026-06-26 14:07:43 +05:30
helix4u	063fe4f6ef	fix(auxiliary): fallback on invalid provider responses	2026-06-26 13:49:46 +05:30
teknium1	fbfccbb3ee	fix(security): align cron invisible-unicode set with install-time scanner The cron runtime tripwire (_scan_cron_prompt) used a 10-char invisible-unicode set while the install-time scanner (threat_patterns.INVISIBLE_CHARS) flags 17. The cron-local set was missing U+2062-U+2064 (invisible math operators) and U+2066-U+2069 (directional isolates), so a directive obfuscated with one of those codepoints (e.g. "ig<U+2063>nore all previous instructions") slipped past the runtime cron gate while being caught at install time. Import the canonical set so the cron tripwire and install scanner can't drift apart again. Emoji-ZWJ protection (_zwj_has_emoji_neighbour) is unchanged. Fixes #35075 Co-authored-by: rlaope <piyrw9754@gmail.com>	2026-06-26 01:11:11 -07:00
Shannon Sands	a0dc92450b	Split dashboard PTY reconnect tests	2026-06-26 01:06:02 -07:00
Shannon Sands	41f8126148	Reconnect dashboard PTY chat after socket drops	2026-06-26 01:06:02 -07:00
Teknium	619dc4a561	fix(whatsapp_cloud): resolve reply-to text so the agent sees reply context (#52957) Replies on WhatsApp Cloud arrived at the agent with reply_to_id set but reply_to_text=None, so run.py never injected the "[Replying to: ...]" disambiguation prefix (it gates on reply_to_text). Meta's webhook context object carries only the quoted message's id, never its text. Index (chat_id, wamid) -> text in rich_sent_store on every inbound message and every outbound text send -- the same store that solved the identical Telegram rich-send problem -- then look up the quoted text in _build_message_event_from_cloud and populate reply_to_text plus reply_to_is_own_message, derived from context.from versus the business number.	2026-06-26 01:05:05 -07:00
Ben	19b2624404	feat(gateway): external drain trigger + accept-gating (begin/cancel + control channel) Tasks 2.1 + 2.2 + 2.3 of the safe-shutdown plan — the reversible quiesce-without-restart machinery NAS drives during a lifecycle action (D4a). These ship together because the endpoint, the control channel, and the gateway state machine are one coherent slice. 2.2 — control channel (gateway/drain_control.py, new): The dashboard has no HTTP path into a running gateway (guardrails: "there is NO external control channel into a running gateway"); restart/drain is driven only by markers the gateway reacts to. So begin/cancel-drain writes/removes a presence-based marker .drain_request.json (HERMES_HOME-scoped, atomic write, never-raises read; a corrupt marker reads as present-contentless → fail-safe toward quiescing). This is Q-B option A. 2.2 — gateway state machine (gateway/run.py): - _external_drain_active flag, DISTINCT from the shutdown _draining flag: this one does NOT exit the process and is fully reversible. - _enter_external_drain / _exit_external_drain: idempotent transitions that flip gateway_state→draining / →running via _update_runtime_status (preserving the live active_agents count). exit refuses to revert to running during a real shutdown or after the loop stops (shutdown wins). - _drain_control_watcher: 1s background task (modelled on _handoff_watcher) reconciling accept-state with the marker; honours a marker that survived a restart on its first tick. Registered alongside the other watchers in start. - New-turn accept gate in _handle_message, placed BEFORE the session-slot claim: when draining, refuse to START a new turn (so active_agents can only fall → no TOCTOU race), while in-flight turns finish untouched. Internal/ system events (restart-recovery replays, bg-process completions) bypass it. 2.1 — endpoint (hermes_cli/web_server.py): POST /api/gateway/drain {action: drain|cancel}. Authenticated by the Task-2.0a token seam (the drain plugin registered this exact path as a token route); attributes the request to the verified token principal. Begin writes the marker, cancel removes it — the gateway process owns the actual transition. Force-override (D6) is NOT here; it maps onto the existing immediate /api/gateway/restart force path. Tests (mocked — necessary-not-sufficient; the HARD live gate Q-B is next): - tests/gateway/test_external_drain_control.py — marker contract (write/clear/ read/corrupt/atomic), state machine (enter/exit/idempotency/shutdown-wins/ loop-stopped), watcher reconcile-enter-then-exit, new-turn refusal, and in-flight-not-interrupted. 15 tests. - tests/hermes_cli/test_web_server.py — /api/gateway/drain begin/default-begin/ cancel/cancel-idempotent/bad-action-400. 6 tests. - dashboard.drain_auth config section already added in 2.0b commit. All touched suites green: 301 (gateway+auth) + 9 (web_server endpoints) passed. Intentionally deferred: - HARD live-validation gate (Q-B): real isolated `hermes gateway run`, drive a real begin-drain marker, prove the 5-point checklist a–e. - Spec-doc status flip + Phase-2 PR. Build status: external-drain, restart-drain, status, dashboard-auth, drain-plugin, token-auth, and web_server-endpoint suites green.	2026-06-26 00:47:19 -07:00
Ben	2e322466b1	feat(dashboard-auth): drain shared-bearer-secret provider plugin Task 2.0b: the concrete shared-bearer-secret auth provider, the FIRST consumer of the generic token-auth capability (Task 2.0a). Implements decisions.md Q-A. plugins/dashboard_auth/drain/ (bundled, discovered like dashboard_auth/basic): - DrainSecretProvider: non-interactive provider, supports_token=True. Verifies an inbound Authorization bearer token against a per-agent shared secret with hmac.compare_digest (constant-time, no timing oracle) and, on a match, vouches for the caller as the "drain-control" principal scoped to "drain". The five interactive ABC methods raise NotImplementedError; verify_session returns None (stacks harmlessly in the cookie-verify loop). - assess_secret_strength(): fail-closed entropy gate. Rejects secrets shorter than 43 url-safe-b64 chars (~256 bits), with < 16 distinct characters, or below 128 bits Shannon entropy — so a weak/structured/repeated secret can never be silently accepted. Enforced both at register() (friendly skip reason) and in __init__ (raises — defence in depth). - register(ctx): no-op + skip reason when HERMES_DASHBOARD_DRAIN_SECRET is unset; rejects a weak secret fail-closed (drain endpoint stays gated). On a strong secret, registers the provider AND opts /api/gateway/drain into the generic token-auth seam via register_token_route(). Config: the secret is a CREDENTIAL → carried via HERMES_DASHBOARD_DRAIN_SECRET (per-agent, provisioned by NAS at deploy). Behavioural knobs only (dashboard.drain_auth.{scope,min_secret_chars}) live in config.yaml — added to DEFAULT_CONFIG with the .env-is-for-secrets rationale documented inline. Tests: tests/plugins/dashboard_auth/test_drain_provider.py — entropy gate (strong pass; empty/short/repeated/few-distinct/custom-min reject), verify_token (match → scoped principal, wrong/empty → None, custom scope), protocol compliance, interactive-methods-raise, and register() (skip-no-secret, fail-closed-weak-secret, strong-env-secret registers + route opt-in, config scope + min_secret_chars). 21 new tests; drain + token-auth suites 44 passed. Verified the plugin is discovered as dashboard_auth/drain alongside basic/nous. Intentionally deferred: - The begin/cancel-drain endpoint handler itself — Task 2.1. - The dashboard→gateway control channel — Task 2.2. Build status: dashboard-auth + drain-plugin suites green.	2026-06-26 00:47:19 -07:00
Ben	cb9cb6ba1c	feat(dashboard-auth): generic non-interactive API-token capability Task 2.0a of the safe-shutdown drain-coordination plan. Widens the dashboard auth framework GENERICALLY to support non-interactive (service-to-service) bearer-token auth, mirroring the existing supports_password precedent. This is a reusable capability — any future machine-credential provider plugs in without core changes (decisions.md Q-C). The drain bearer-secret plugin (Task 2.0b) is the first consumer, not the definition. - base.py: add TokenPrincipal dataclass (the token analog of Session) + supports_token capability flag + verify_token() on the ABC (default raises NotImplementedError so a misconfigured provider fails loud). Contract mirrors verify_session stacking: return None for unrecognised tokens (never raise), raise ProviderError only on a genuine backing-store outage. - registry.py: list_token_providers() — the supports_token subset, in registration order. Empty when none registered (token routes fail closed). - token_auth.py (new): route-agnostic seam. Routes opt in via register_token_route(exact path); token_auth_middleware owns the auth decision for those routes only — authenticate via stacked providers, attach request.state.token_principal + token_authenticated, pass through. 401 on missing/unrecognised token, 503 when a provider was unreachable, untouched passthrough for non-token routes. Fails closed (never open). - web_server.py: install the seam OUTERMOST (registered last → runs first). Both downstream gates (legacy auth_middleware + gated_auth_middleware) honour request.state.token_authenticated and skip enforcement, so a token-authed service request is never bounced to /login. - audit.py: TOKEN_AUTH_SUCCESS / TOKEN_AUTH_FAILURE events. Tests: tests/hermes_cli/test_dashboard_token_auth.py — ABC flag default, verify_token NotImplementedError, registry filter, bearer extraction (case-insensitive scheme, malformed/non-bearer → ""), provider stacking (first-match-wins, unreachable-remembered, unreachable-then-valid, buggy provider doesn't crash the gate), and the seam's passthrough/401/503/ fail-closed behaviour. 29 new tests; full dashboard-auth suite 169 passed. Intentionally deferred: - The concrete shared-bearer-secret provider plugin — Task 2.0b. - The begin/cancel-drain endpoint that registers itself as a token route — Task 2.1. Build status: dashboard-auth + plugin-hook suites green.	2026-06-26 00:47:19 -07:00
Teknium	099df3cd89	fix(security): stop blocking AGENTS.md/SOUL.md that name an agent 'Praxis' (#52925) The known_c2_framework threat pattern included 'praxis' in its alternation alongside genuine offensive-security tool brands (Cobalt Strike, Sliver, Havoc, Mythic, Metasploit, Brainworm). Unlike those distinctive brand names, 'praxis' is a common English word (Greek for practice/action) and a legitimate agent name, so any context file that mentioned an agent named Praxis matched at 'context' scope and the whole AGENTS.md / SOUL.md was replaced with a [BLOCKED] placeholder before it reached the system prompt. Remove 'praxis' from the alternation and add a guard comment: every token in this list must be a distinctive tool brand, not a common word. Real C2 brands still fire.	2026-06-26 00:36:01 -07:00
teknium1	4d0dd6bd52	test(mcp): make invalid_client tests interactive under hermetic env The new _maybe_flag_poisoned_client tests built a provider via get_or_build_provider without an interactive stdin. Under the hermetic test env (no TTY, no cached tokens), the non-interactive guard in mcp_oauth_manager._make_provider raised OAuthNonInteractiveError before the provider was built, failing 6 tests in CI parity (they passed locally where stdin was a TTY). Thread monkeypatch into _provider_with_token_endpoint and present an interactive stdin, matching the sibling test_manager_builds_hermes_provider_subclass.	2026-06-26 00:35:27 -07:00
Max Hsu	075f93ad78	fix(mcp): auto-recover from invalid_client on stale OAuth client registration Fixes #36767. Two complementary recoveries for the recurring "delete three cache files and re-auth by hand" ritual when an MCP server's dynamically-registered OAuth client goes dead server-side (IdP redeploy / DB wipe / rebrand): - Auto-heal (token-endpoint subset): HermesMCPOAuthProvider now sniffs auth-flow responses and, on a 400/401 `invalid_client` from the discovered token endpoint, backs up + deletes `<server>.client.json` and `.meta.json` and clears the in-memory client so the SDK re-runs RFC 7591 dynamic client registration on the next flow. Conservative by construction: only dynamically-registered (non config-supplied) clients, only the token endpoint, only on a word-boundary `invalid_client` match (so RFC 7591's `invalid_client_metadata` does not trip it); best-effort so a miss never breaks the live flow. Covers both code-exchange and refresh when the token endpoint was discovered. Tokens are preserved. - `hermes mcp reauth [<name>|--all]`: the reporter's primary symptom — the IdP's in-browser "Redirect URI Mismatch" — produces no HTTP signal (the SDK only sees a callback timeout), so it cannot be auto-detected. The new command re-auths one or ALL `auth: oauth` servers, serially: one browser flow at a time, which also fixes the startup popup storm when several servers are stale at once. Single-server reauth is factored out of `mcp login` and shared. Tests: +14 (poison helper x2; token-endpoint detection x5 incl. wrong-endpoint, success-response, pre-registered, and invalid_client_metadata negative guards; a bridge integration test driving the real async_auth_flow generator to prove the detection hook preserves the bidirectional asend() forwarding contract; reauth CLI x6). Verified against the pinned mcp==1.26.0: scripts/run_tests.sh 122/122 green for the touched suites; check-windows-footguns.py and ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 00:35:27 -07:00
Ben Barclay	6e4e5967f7	feat(relay): multi-platform-per-agent — list identity, provision-loop, N-hello, per-frame egress (Phase 1.5) (#52830) Cut over the agent half of Shape A (D-Q1.5a/b.1/c) to front a SET of platforms on one relay WS: - relay_platform_identities() parses GATEWAY_RELAY_PLATFORMS (list) + GATEWAY_RELAY_BOT_IDS (JSON keyed map {platform:{botId,username?}}). Cut over from the scalar GATEWAY_RELAY_PLATFORM/_BOT_ID (no fallback, D-Q1.5c). - self_provision_relay() loops one /relay/provision per platform under one gatewayId+secret, partial-failure-tolerant. - WebSocketRelayTransport takes the identity SET, sends one hello per identity (connector accumulates the advertised set), and stamps the per-frame OutboundFrame.platform + its matching advertised botId on outbound. - RelayAdapter remembers each chat's underlying source.platform (mirroring the existing guild/dm scope capture) and tags the reply's egress platform. - send_relay_policy() declares one relevance policy per fronted platform (the connector keys policy by (tenant,platform,instanceId)). Single-platform deploys are byte-identical on the wire (1-element list, no per-frame tag -> connector session-default fallback). typecheck/ruff clean; relay unit 221 pass (+10 new); all 15 cross-repo E2E drivers green vs connector origin/main.	2026-06-26 17:32:46 +10:00
brooklyn!	a2b49e60b6	Merge pull request #52412 from GodsBoy/fix/verify-on-stop-messaging-surface-leak fix(agent): gate verify-on-stop nudge off for messaging surfaces	2026-06-26 02:30:08 -05:00
kshitij	7d568293f9	Merge pull request #52891 from kshitijk4poor/salvage/52623-aux-host fix(auxiliary): gate Anthropic base_url override on Anthropic-compatible host (#52608)	2026-06-26 12:24:19 +05:30
konsisumer	3cf900eb67	fix(install): discard managed lockfile churn before stashing	2026-06-25 23:49:11 -07:00
Ben Barclay	cb7d1f68f8	fix(relay): accept is_reconnect kwarg in RelayAdapter.connect (#52911) The gateway reconnect watcher (gateway/run.py) recovers a platform after a fatal adapter error by building a fresh adapter and calling connect(is_reconnect=True). Every BasePlatformAdapter implements connect(*, is_reconnect: bool = False) for this — except RelayAdapter, whose connect() was bare. So the watcher's recovery path raised: TypeError: connect() got an unexpected keyword argument 'is_reconnect' Observed live on a hosted staging agent: after a fatal relay adapter error the watcher could never re-establish relay, so the shared-bot inbound never reached the gateway and Discord DMs stopped (dashboard surfaced the TypeError). Relay deliberately ignores the flag: the #46621 server-side-queue-preservation concern doesn't apply, because relay's outage buffer is the connector's durable buffer (replayed on the transport's re-handshake), not a gateway-side queue the adapter owns. Routine WS drops are already handled by the transport's own reconnect supervisor (WebSocketRelayTransport, reconnect=True); the watcher path is fatal-error recovery, and the fatal handler disconnect()s the old adapter (cancelling its supervisor) before a fresh adapter+transport is built, so there is no double-dial. Adds two regression tests (both proven red without the fix): connect(is_reconnect=True) reaches the same transport-less RuntimeError instead of TypeError, and the signature matches BasePlatformAdapter.connect.	2026-06-26 16:46:09 +10:00
Moonsong	4e66bf1f80	fix(auxiliary): gate Anthropic base_url override on Anthropic-compatible host (#52608) When operator config has provider=anthropic with model.base_url pointing at a non-Anthropic host (e.g. https://openrouter.ai/api/v1 with provider=anthropic), the auxiliary Anthropic path was unconditionally applying that override. Main-session traffic routed correctly because the main path attaches the right credential for the actual destination, but every side-channel call (memory extractors, reflection, vision, title generation, janus extractor/promise) sent ANTHROPIC_API_KEY to the foreign host and 401'd. Gate the override on hostname == api.anthropic.com. Operators routing main through a non-Anthropic provider must use that provider's own auxiliary client; the Anthropic aux path now stays pointed at api.anthropic.com. Regression tests cover openrouter, openai, anthropic-with-path, empty, and anthropic-default-base_url cases.	2026-06-26 11:21:05 +05:30
brooklyn!	6ba551e942	Merge pull request #52871 from NousResearch/bb/fix-tui-interrupt-queued fix(tui-gateway): make stop interrupt queued turns	2026-06-26 00:37:13 -05:00
Brooklyn Nicholson	594380d44a	fix(tui): make stop interrupt queued desktop turns Ensure TUI/desktop stop targets the actual conversation thread and cancels any queued next prompt, including the lazy agent-start window, so a stopped session cannot keep running or restart itself.	2026-06-26 00:31:06 -05:00
kshitij	a28b939092	Merge pull request #52678 from kshitijk4poor/salvage/52502-fuzzy-boundary fix(fuzzy-match): preserve boundary space after whitespace-normalized match (#52491)	2026-06-26 10:59:14 +05:30
DavidMetcalfe	27c486e3b1	feat(agent): apply per-reasoning-model stale-timeout floor in stream + non-stream detectors Wire get_reasoning_stale_timeout_floor() into both stale detectors so known reasoning models (Nemotron 3 Ultra, OpenAI o1/o3, Opus 4.x thinking, DeepSeek R1, Qwen QwQ, Grok reasoning) tolerate multi-minute thinking phases instead of the upstream gateway idle-killing the socket (BrokenPipeError) before first token. Applied as max(default, floor) — never overrides explicit user config, never lowers an existing threshold. The reasoning_timeouts.py allowlist module already landed on main via #52795, so this salvage carries only the wiring + tests (the duplicate module and the stale-base MoA reverts from the original PR branch are dropped). Salvaged from #52238. Fixes #52217.	2026-06-25 22:12:06 -07:00
brooklyn!	f4c656b0a0	Merge pull request #52854 from NousResearch/bb/fix-interrupt-partial-reply fix(interrupt): keep partial streamed reply when stopped mid-response	2026-06-26 00:04:37 -05:00
teknium1	4d04c652f2	fix(curator): make external-skill write guard actually fire during curation The salvaged #51875 added a background-review write guard in skill_manage that refuses mutations to skills.external_dirs skills — but it only fires when is_background_review() is true. The curator's LLM review fork ran with the default _memory_write_origin='assistant_tool', so the guard never triggered during the exact curation pass it exists to protect against (GH-47688). - Set _memory_write_origin='background_review' on the curator review fork so turn_context binds it onto the write-origin ContextVar and the guard fires. - Add a regression test asserting the fork runs under the background_review origin (the invariant linking the fork to the guard). - AUTHOR_MAP: map yu-xin-c for the salvaged commit.	2026-06-25 22:03:02 -07:00
yu-xin-c	96bc524a71	fix(curator): protect external skills from background curation	2026-06-25 22:03:02 -07:00
teknium1	6c58878e7d	fix(browser): force secret-pattern redaction on browser_type display Force redact_sensitive_text(force=True) on the browser_type text arg so recognized credentials (API keys, tokens, JWTs) are masked in tool progress, previews, callbacks, and return payloads even when the global security.redact_secrets opt-out is set — a typed credential reaching chat history is a security boundary, not log hygiene. Normal typed text matches no pattern and stays fully readable for debuggability. Tests assert the API-key-shaped secret is masked across every surface and that normal text passes through unchanged.	2026-06-25 22:02:22 -07:00
rebel	8ff426e53b	fix: redact browser typed text surfaces	2026-06-25 22:02:22 -07:00
Brooklyn Nicholson	8233598e64	fix(interrupt): keep partial streamed reply when stopped mid-response Stopping a turn while the model is streaming (stop/esc to redirect) raised InterruptedError, set final_response to the throwaway "waiting for model response" sentinel, and persisted messages WITHOUT the assistant text that was already streamed to the screen. The next turn then had no record of the half-finished reply, so the model appeared to "forget" what it just said. Recover the on-screen text from _current_streamed_assistant_text in the InterruptedError branch and append it as the assistant turn (and surface it as final_response). The metadata sentinel is kept only when nothing was streamed yet, preserving the ACP/client suppression behavior. Completes the partial-stream recovery from `397eae5d9` (which wired the same _current_streamed_assistant_text salvage into the connection-failure twin but missed the user-interrupt path). The lossy handler dates to `c98ee9852`.	2026-06-25 23:54:20 -05:00
Teknium	5b5c79a8ef	feat(kanban): typed block reasons + unblock-loop breaker (#52848) * feat(kanban): typed block reasons + unblock-loop breaker Stops the kanban blocked-task loop: a worker blocks a task, a cron unblocks it, the worker re-blocks for the same reason, repeat forever. block_task now takes a typed kind and a persistent block_recurrences counter on the tasks table: - kind=dependency routes to todo (parent-gated, auto-resumed), never the human 'blocked' bucket a cron would keep unblocking. - needs_input/capability/transient/untyped land in blocked; each same-cause re-block after an unblock increments block_recurrences, and at BLOCK_RECURRENCE_LIMIT (default 2) the task routes to triage for a human instead of blocked. - unblock_task no longer resets block_recurrences (the amnesia that let the loop run unbounded); complete_task clears it on success. Wired through the worker kanban_block tool (new kind arg) and the hermes kanban block --kind CLI flag, both reporting where the task actually landed. Docs + 11 new tests; 536 existing kanban tests green. * test(kanban): make second-block notify test use a distinct block cause test_notifier_second_blocked_delivers blocked the same task twice with the same (untyped) reason, which now trips the new unblock-loop breaker and routes the second block to triage instead of blocked — so only one 'blocked' notification fired. The test's actual intent is that TWO distinct block cycles each notify; give the two cycles different kinds (needs_input then capability) so they're genuinely separate blocks. The same-cause loop→triage path is covered by test_kanban_block_kinds.py.	2026-06-25 21:46:58 -07:00
teknium1	43b8ba4181	fix(telegram): preserve Bot API update queue on watcher reconnect After a prolonged outage the in-process network-error ladder escalates to fatal and GatewayRunner._platform_reconnect_watcher rebuilds a fresh adapter that reconnects through the bootstrap path. That path called start_polling(drop_pending_updates=True), discarding every update Telegram queued during the outage — all messages sent while the bot was down were silently lost. The in-process ladder and 409-conflict handler already passed drop_pending_updates=False; only bootstrap did not distinguish a cold first boot from a reconnect. Thread an is_reconnect signal from the watcher through _connect_adapter_with_timeout into adapter.connect(). The base BasePlatformAdapter.connect() gains a keyword-only is_reconnect=False so every adapter inherits a tolerant signature (no per-platform breakage when the runner forwards the kwarg). Telegram translates is_reconnect into drop_pending_updates=not is_reconnect on both the polling and webhook bootstrap calls. Cold boot still drops the stale queue; a watcher reconnect preserves it. Fixes #46621. Co-authored-by: annguyenNous <annguyen@nousresearch.com> Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com> Co-authored-by: Kewe63 <Kewe63@users.noreply.github.com>	2026-06-25 21:29:57 -07:00
liuhao1024	f44415e71a	fix(gateway): add init-time provider fallback to _make_agent When the primary provider raises AuthError (e.g. expired OAuth token), _make_agent now walks the configured fallback_providers/fallback_model chain before giving up — matching the behavior that cron/scheduler.py and cli_agent_setup_mixin.py already have. Fixes #47627	2026-06-25 21:21:58 -07:00
Teknium	0b7128582f	fix(state): detect and repair FTS write corruption that silently drops gateway history (#52798) A readable state.db can still reject every message write through the messages_fts* triggers when the FTS5 index is corrupt: base-table reads and PRAGMA integrity_check pass, but INSERT INTO messages fails with 'database disk image is malformed'. The gateway reloads conversation_history from disk each turn, so a silently-failed write hands the next turn stale/empty history even though the same cached AIAgent still holds the live transcript — causing immediate same-session amnesia. (#50502) - hermes_state.py: _db_opens_cleanly() now drives a rolled-back message write through the FTS triggers, so write-only corruption (which the read-only probe reported healthy) is detected. repair_state_db_schema() gains an in-place FTS5 'rebuild' strategy (tier 0) before the dedup/drop tiers, plus an already_healthy short-circuit. Both 'hermes sessions repair' and 'hermes doctor' route through these, so the fix covers the whole class. - hermes_cli/doctor.py: the state.db check runs the write-health probe even on the success (readable) path and repairs in place with --fix. - gateway/run.py: _select_cached_agent_history() prefers the cached agent's longer live _session_messages over a shorter persisted transcript, so an FTS write failure can't wipe in-session context. - tests: regressions for write-health detection, in-place repair preserving rows + resuming writes, the already_healthy shortcut, and the gateway guard. Combines the approaches from #50504 (@0-CYBERDYNE-SYSTEMS-0, issue author), #52165 (@davidgut1982), and #50576 (@trevorgordon981).	2026-06-25 21:18:41 -07:00
teknium1	85e084d60d	fix(email): reject spoofed From: header for authorization (GHSA-rxqh-5572-8m77) The email adapter authorized senders entirely off the From: header, which is attacker-controlled and unauthenticated by IMAP. An attacker could forge From: an-allowlisted-address and pass both the adapter's EMAIL_ALLOWED_USERS pre-filter and the gateway's allowlist authz (both key on the same spoofable sender_addr), getting unauthorized commands executed by the agent. Verify the From: domain against the trusted Authentication-Results header the receiving mail server stamps (SPF/DKIM/DMARC) before trusting it for authorization. Enforced only when an allowlist is in effect and allow-all is off — fail-closed. Operators whose server does not stamp the header can opt out via platforms.email.require_authenticated_sender: false (or EMAIL_TRUST_FROM_HEADER=true).	2026-06-25 21:11:02 -07:00
Ben Barclay	dedf5643d8	fix(gateway): scale-to-zero never armed — arm-gate counted disabled placeholder platforms (#52831) The scale-to-zero idle watcher never started on a correctly-opted-in, relay-only instance, so the gateway never ran its idle decision, never called go_dormant(), and never sent going_idle to the connector. Fly's autostop still suspended the machine on traffic-idle, but the connector never flipped the instance to buffered-only — so an inbound DM took the live delivery path, found no live session for the suspended machine, and was dropped fail-closed with no wake poke. The machine slept and never woke. Root cause: _scale_to_zero_should_arm() passed list(config.platforms.keys()) to messaging_is_relay_only_or_absent(). config.platforms is pre-seeded with a DISABLED placeholder PlatformConfig for every known platform (telegram, discord, slack, matrix, …), so the key set is always the full ~20-entry catalog regardless of what the instance actually runs. The relay-only check discarded "relay", saw the disabled placeholders as live direct-socket platforms, and returned False — so should_arm() was False and the watcher was never created. Verified live on a staging instance: config.platforms keys = [telegram, discord, slack, mattermost, matrix, relay] with only relay enabled=True; should_arm() = False. Fix: filter config.platforms to ENABLED entries before the relay-only check, mirroring the adapter-connect loop which already gates on `if not platform_config.enabled: continue`. This arms off the same notion of "active platform" the rest of start() already uses — no parallel concept. Also add a one-line not-armed diagnostic: when an instance IS opted in (the HERMES_SCALE_TO_ZERO stamp is set) but the watcher still doesn't arm, log why (relay_only_or_absent, the enabled platforms, wake_url present/missing). A non-opted instance stays silent. The arm path previously logged only on success, so a failed arm was invisible. Tests: the existing pure-helper tests passed bare names so they never exercised the call site that feeds the placeholder-laden config. Add behaviour-contract tests against the REAL _scale_to_zero_should_arm with a realistic config.platforms (relay enabled + others disabled). The F25 regression test (relay-only + disabled placeholders must arm) and the no-platform case are RED without this fix, GREEN with it; the genuinely-enabled-direct-platform / not-opted-in / no-wake-url cases stay correctly non-arming so the filter can't over-broaden. Wake mechanism itself verified healthy independently (direct wakeUrl GET resumed a suspended staging instance in 1.15s, clean resume signature).	2026-06-26 14:01:48 +10:00
Teknium	a4091e49f1	fix(auth): write rotated Codex/xAI pool grant through to global root (#48415 ) (#52760 ) CredentialPool._sync_device_code_entry_to_auth_store rotated single-use OAuth refresh tokens but wrote the new chain only into the active profile store. When a profile resolves a grant from the global-root fallback (read_credential_pool, #18594) and the pool then refreshes it, root was left holding a now-revoked refresh token — every other profile reading the stale root grant subsequently died with refresh_token_reused / invalid_grant once its access token expired. This is the credential-pool analog of #43589 (which fixed the non-pool xAI refresh path in _save_xai_oauth_tokens). Detect the read-from-root case (profile lacks its own providers.<id> block) BEFORE the profile save and, after it, write the rotated chain back to the global root via a best-effort, seat-belted write-through. A profile that genuinely shadows root (owns the block) is untouched; classic mode (profile == root) is a no-op; a failed root write never breaks the profile's own save. Covers openai-codex (reported), xai-oauth, and nous through the shared sync path.	2026-06-25 19:14:06 -07:00
Harjoth Khara	233ef98afe	fix(docker): skip symlinked stage2 chown targets (#52789) Prevents stage2-hook.sh recursive chown from following a symlinked $HERMES_HOME/home (or profiles/cron) and destroying the host user's home directory. Also guards top-level state-file chowns and refuses first-boot seeding through symlinks. Fixes #52781. Co-authored-by: harjoth <harjoth.khara@gmail.com>	2026-06-26 12:09:52 +10:00
DavidMetcalfe	865a09a610	fix(agent): detect thinking-timeout for reasoning models and surface actionable guidance instead of misleading file-write advice Two-part fix: Part 1 (classifier override at agent/error_classifier.py:720-738): A transport disconnect on a reasoning model — even on a large session — now routes to FailoverReason.timeout instead of context_overflow. Without this, large-session reasoning-model disconnects route to the compression branch and silently delete conversation history on a phantom context-length error. The override is strictly targeted: non-reasoning models (gpt-4o, claude-3-5-sonnet, llama-3.3-70b, etc.) still route to context_overflow on large sessions — the existing intentional behavior for chat models whose proxy doesn't idle-kill during prefill/generation. Part 2 (new agent/thinking_timeout_guidance.py + integration at agent/conversation_loop.py:3488-3567): New is_thinking_timeout() and build_thinking_timeout_guidance() helpers. When a known reasoning model (NVIDIA Nemotron 3 Ultra, OpenAI o1/o3, Anthropic Opus 4.x thinking, DeepSeek R1, Qwen QwQ, xAI Grok reasoning) hits a transport-kill on a small session (classifier says timeout directly) or after Part 1 routes correctly (large session), the user now sees reasoning-specific guidance with three actionable workarounds in priority order: 1. Set providers.<provider>.models.<model>.stale_timeout_seconds: 900 in ~/.hermes/config.yaml (Hermes's built-in floor is already 600s for known reasoning models; raise further if upstream is even tighter). 2. Lower reasoning_budget or set reasoning_effort: medium on this model if the provider supports it. 3. Use a smaller / faster reasoning model if the task doesn't require deep thinking. 
The new guidance takes precedence via if/elif over the existing _is_stream_drop block, so a reasoning-model user with a transport-kill message sees actionable advice instead of the misleading "try execute_code with Python's open() for large files" advice (which is correct for the unrelated large-file-write stream-drop case but actively wrong for the thinking-timeout case). Verified: - 478 tests passing across 9 directly-relevant files (49 new + 429 existing, zero regressions). - Ruff lint clean on all 4 modified/new files. - Negative test: 6 parametrized regression guards confirm non-reasoning models still route to context_overflow on large sessions; 4 parametrized gates confirm non-timeout classifier reasons never trigger the guidance; 5 parametrized cases confirm non-transport messages never trigger it. - Regression guard: new guidance message does NOT contain "execute_code" or "open()" — the misleading advice is fully replaced, not appended alongside. - Cross-vendor dual review via agy -p: - Gemini 3.5 Flash (Medium) — passed: true, zero blockers, one SHOULD-FIX (vprint block duplication — fixed by extracting detection into a helper module). - GPT-OSS 120B (Medium) — passed: true, zero blockers, two nits (test placement — adopted at tests/agent/test_thinking_timeout_guidance.py; primary-model capture — accepted as non-issue per Flash's nit). Dependency note for maintainers: This PR includes agent/reasoning_timeouts.py (the reasoning-model allowlist module from PR #52238) because the Layer 1 override is load-bearing on get_reasoning_stale_timeout_floor(). After PR #52238 lands on main, this PR's duplicate agent/reasoning_timeouts.py should be rebased away. Either PR can land first; the other rebase is mechanical. Fixes #52271.	2026-06-25 19:00:48 -07:00
Teknium	811df74a10	fix(gateway): defer cross-process cache cleanup off the cache lock (#52197 ) (#52761 ) The #45966 cross-process coherence guard popped the stale cached agent and then called the blocking _cleanup_agent_resources (memory-provider shutdown, tool-resource teardown, async-client teardown) while still holding _agent_cache_lock, on the gateway event-loop thread. While that ran, _sweep_idle_cached_agents (driven by _session_expiry_watcher) blocked acquiring the same lock and the asyncio loop stalled for minutes, tripping repeated Discord 'heartbeat blocked' warnings. Fix mirrors the cap-enforcer / idle-sweep paths: pop the stale entry under the lock, release it, then schedule the SOFT release on a daemon thread. The soft path (_release_evicted_agent_soft) is also more correct here than the hard teardown the regression used — the same session rebuilds a fresh agent immediately after invalidation, so its terminal sandbox / browser / bg processes (keyed on task_id) must be preserved for the rebuilt agent to inherit, not torn down. Verified the cross-process site was the only cleanup-under-lock instance; the other _cleanup_agent_resources call sites run outside the lock.	2026-06-25 18:58:47 -07:00
Teknium	ce802e932c	fix(telegram): heartbeat loop exits cleanly when bot has no get_me CI shard test_telegram_conflict.py timed out (140s) because the new _polling_heartbeat_loop, started by connect(), busy-spun under those tests: they monkeypatch asyncio.sleep to instant and pass a bot double with no get_me(), so the probe raised AttributeError (swallowed) and the loop re-entered immediately with no real pacing, starving the event loop. Guard the loop to return when bot.get_me is not callable — a real PTB Bot always exposes it, so this only triggers on a torn-down app or a test double, where there is nothing to probe. Also cancel the heartbeat task in the conflict tests that call connect() without disconnect(), matching the production disconnect() teardown. Verified: test_telegram_conflict.py now runs in ~4.5s; the 22 heartbeat/reconnect tests still pass; E2E confirms a hanging get_me still fires the reconnect ladder while a missing get_me exits without spinning.	2026-06-25 18:50:11 -07:00
agt-user	8501caf51f	fix(telegram): persistent heartbeat loop to detect CLOSE-WAIT polling sockets When a Telegram long-poll TCP socket enters CLOSE-WAIT (remote sent FIN but httpx hasn't noticed), epoll still reports it readable so no exception is raised. PTB's error_callback never fires, the reconnect ladder never engages, and the gateway silently stops receiving messages while the process stays alive — until a manual systemctl restart. The existing recovery only covers two cases: error_callback-driven reconnects (which require an exception PTB never gets) and a one-shot _verify_polling_after_reconnect probe (which runs only right after an explicit reconnect). A socket that wedges during steady-state operation is never detected. Add _polling_heartbeat_loop: a background asyncio.Task started in connect() (polling mode only) that probes get_me() every 90s on the general request pool (not the getUpdates pool, so healthy long-polls are never interrupted). On asyncio.TimeoutError/OSError it hands off to the existing _handle_polling_network_error ladder; other errors are swallowed. disconnect() cancels and awaits the task. Worst-case detection window ~105s. Complementary to #51541 (general-pool keepalive limits / fd leak) — that recycles idle pooled connections; this detects a wedged active read. Fixes #48495 Co-authored-by: agt-user <267614622+agt-user@users.noreply.github.com>	2026-06-25 18:50:11 -07:00
liuhao1024	56cf517ccd	fix(cron): detect partial job loss in restore_cron_jobs_if_emptied (#52144) The desktop scheduler can overwrite cron/jobs.json with its own small set of internally-tracked crons after an update/restart, causing partial loss of tool-created cron jobs. The previous guard only checked for total loss (live_count == 0), missing the case where live_count > 0 but less than the pre-update snapshot count. Compare live_count against snap_count instead of checking for zero, so both total loss (0 vs N) and partial loss (1 vs 19) trigger restoration. Salvaged from #52161 by @liuhao1024. Closes #52144	2026-06-25 18:49:18 -07:00
brooklyn!	41f4dce828	Merge pull request #52756 from NousResearch/bb/delegate-bg-resume-ux feat(delegation): calm "will resume" affordance for background delegate_task	2026-06-25 20:08:06 -05:00
Brooklyn Nicholson	985350dd85	feat(cli): note background delegate_task dispatch in _on_tool_complete A top-level delegate_task dispatches in the background and re-enters as a fresh turn when done. Print a one-line dispatch-time note — no spinner, nothing to poll — so the idle prompt doesn't read as "nothing happened."	2026-06-25 19:57:58 -05:00
Que0x	b8fc8c908b	fix(approval): fold Windows absolute home paths in dangerous-command detection The detector folds absolute home / Hermes-home prefixes into their canonical ~/ and ~/.hermes/ forms so static patterns catch /home/alice/.bashrc the same way they catch ~/.bashrc (`abd69b81`). On native Windows this fold never fired, so terminal commands writing to shell startup files, ~/.ssh/authorized_keys, or ~/.hermes/config.yaml / .env returned "safe" and skipped the approval prompt — and config.yaml carries the approval policy itself. Two compounding causes: 1. The fold ran after the backslash-escape strip (r\m -> rm), which dissolves the backslash separators in a Windows path (C:\Users\alice\.bashrc -> C:Usersalice...) before the fold could match. It now runs before the strip. 2. The fold only recognized POSIX absolute paths and only the home prefix, leaving multi-segment backslash suffixes (\.ssh\authorized_keys) to be mangled by the strip. Consolidated into _home_prefix_fold_regex / _fold_home_prefixes: match a home prefix with either separator, capture the rest of the path token, and normalize its separators to / so multi-segment patterns match. The degenerate-path guard generalizes count("/") >= 2 to "at least two components below the root" (also rejecting a bare drive root C:\). HOME is consulted directly because Windows' expanduser ignores it; the more specific Hermes home is folded first, longest candidate first, so neither fold clobbers the other. POSIX behavior unchanged; the r\m -> rm anti-obfuscation strip still runs. Adds TestWindowsAbsolutePathFolding, which monkeypatches a Windows-style HOME/HERMES_HOME so the behavior is also exercised on the CI runner.	2026-06-25 17:49:39 -07:00
teknium1	6dfb8326f5	fix(state): exclude delegate/branch/tool children from resume walk + reconcile salvaged fixes Follow-up to the salvage of #45035 + #48682. The two PRs touched different functions (resolve_resume_session_id vs get_compression_tip) but #45035's descendant walk followed ANY parent_session_id child, so a delegate/subagent child could hijack the resume target. Apply the same _branched_from / _delegate_from / source!='tool' exclusion the rest of hermes_state.py uses, so the resume walk only follows genuine compression continuations. Also updates the unrealistic delegation test fixture to carry the real _delegate_from marker, and updates 3 list_sessions_rich test mocks for the order_by_last_active kwarg #48682 added. AUTHOR_MAP: map PINKIIILQWQ + ailang323 salvage authors.	2026-06-25 16:29:09 -07:00
longer	6d9ca04574	fix(desktop): resume latest compression continuation	2026-06-25 16:29:09 -07:00
Pink	263f6b03eb	chore: rename test to reflect new semantics of resolve_resume_session_id	2026-06-25 16:29:09 -07:00
PINKIIILQWQ	abd6b85200	fix(state): resolve compression chain tip in resolve_resume_session_id After context compression, the parent session holds pre-compression messages and a child (or deeper descendant) holds the continuation. resolve_resume_session_id() short-circuited when the input session already had messages (row is not None -> return session_id), causing REST API endpoints, gateway resume, and CLI resume to serve stale parent messages. Remove the early-return. Walk the full descendant chain, record the deepest node that has messages (best), and return best if not None else the original session_id (preserving the empty-chain fallback). Callers (api_server.py, web_server.py, cli_agent_setup_mixin.py, cli_commands_mixin.py) all use the resolved != input -> redirect pattern and are transparent to this change.	2026-06-25 16:29:09 -07:00
Teknium	208f0d7c3b	fix(update): default pre-update backup to off (#52729) The pre-update HERMES_HOME zip shipped on by default (DEFAULT_CONFIG + runtime fallback both True), so every `hermes update` zipped the entire ~/.hermes (sessions DB, caches, skills), adding minutes to each update. The shipped cli-config.yaml.example, the --backup help, and the example config all already said "off by default," so the live default contradicted its own documentation. Flip the default to off everywhere: DEFAULT_CONFIG, the runtime `.get(..., False)` fallback in _run_pre_update_backup, and the stale --backup help string. Users who want the #48200 safety net opt in via updates.pre_update_backup: true or --backup for a single run. Updated test_default_enabled_creates_backup -> test_default_disabled_is_silent to assert the new default (silent no-op, no zip).	2026-06-25 16:01:09 -07:00
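
The compression-chain resolution described in the abd6b85200 and 6dfb8326f5 entries above can be sketched roughly as follows. This is a minimal illustration, not the code in hermes_state.py: the `Session` dataclass, its `message_count` and `markers` fields, the `is_continuation` helper, and the choice to follow the last matching child are all assumptions made for the example. Only the function name `resolve_resume_session_id`, the `_branched_from` / `_delegate_from` markers, and the `source != 'tool'` exclusion come from the commit messages.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Session:
    # Hypothetical in-memory stand-in for a stored session row.
    session_id: str
    parent_session_id: Optional[str] = None
    message_count: int = 0
    source: str = "cli"
    markers: Set[str] = field(default_factory=set)  # e.g. {"_delegate_from"}


def is_continuation(child: Session) -> bool:
    """True only for children that continue the parent after compression.

    Branch, delegate, and tool children are excluded so they cannot
    hijack the resume target.
    """
    return (
        "_branched_from" not in child.markers
        and "_delegate_from" not in child.markers
        and child.source != "tool"
    )


def resolve_resume_session_id(session_id: str, sessions: List[Session]) -> str:
    """Walk the descendant chain and return the deepest node with messages.

    Falls back to the input session_id when no node in the chain has messages.
    """
    by_id: Dict[str, Session] = {s.session_id: s for s in sessions}
    by_parent: Dict[Optional[str], List[Session]] = {}
    for s in sessions:
        by_parent.setdefault(s.parent_session_id, []).append(s)

    best: Optional[str] = None
    seen: Set[str] = set()  # guards against accidental cycles
    current = by_id.get(session_id)
    while current is not None and current.session_id not in seen:
        seen.add(current.session_id)
        if current.message_count > 0:
            best = current.session_id  # no early return: keep walking deeper
        children = [
            c for c in by_parent.get(current.session_id, []) if is_continuation(c)
        ]
        # Assumption: with several continuation children, follow the last one.
        current = children[-1] if children else None
    return best if best is not None else session_id
```

With a parent that has messages, a compression child that also has messages, and a delegate child hanging off the same parent, the walk lands on the compression child rather than stopping at the parent or following the delegate.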


6284 commits