hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-25 00:51:20 +00:00

Author	SHA1	Message	Date
Andre Kurait	a9ccb03ccc	fix(bedrock): evict cached boto3 client on stale-connection errors ## Problem When a pooled HTTPS connection to the Bedrock runtime goes stale (NAT timeout, VPN flap, server-side TCP RST, proxy idle cull), the next Converse call surfaces as one of: * botocore.exceptions.ConnectionClosedError / ReadTimeoutError / EndpointConnectionError / ConnectTimeoutError * urllib3.exceptions.ProtocolError * A bare AssertionError raised from inside urllib3 or botocore (internal connection-pool invariant check) The agent loop retries the request 3x, but the cached boto3 client in _bedrock_runtime_client_cache is reused across retries — so every attempt hits the same dead connection pool and fails identically. Only a process restart clears the cache and lets the user keep working. The bare-AssertionError variant is particularly user-hostile because str(AssertionError()) is an empty string, so the retry banner shows: ⚠️ API call failed: AssertionError 📝 Error: with no hint of what went wrong. ## Fix Add two helpers to agent/bedrock_adapter.py: * is_stale_connection_error(exc) — classifies exceptions that indicate dead-client/dead-socket state. Matches botocore ConnectionError + HTTPClientError subtrees, urllib3 ProtocolError / NewConnectionError, and AssertionError raised from a frame whose module name starts with urllib3., botocore., or boto3.. Application-level AssertionErrors are intentionally excluded. * invalidate_runtime_client(region) — per-region counterpart to the existing reset_client_cache(). Evicts a single cached client so the next call rebuilds it (and its connection pool). Wire both into the Converse call sites: * call_converse() / call_converse_stream() in bedrock_adapter.py (defense-in-depth for any future caller) * The two direct client.converse(kwargs) / client.converse_stream(kwargs) call sites in run_agent.py (the paths the agent loop actually uses) On a stale-connection exception, the client is evicted and the exception re-raised unchanged. The agent's existing retry loop then builds a fresh client on the next attempt and recovers without requiring a process restart. ## Tests tests/agent/test_bedrock_adapter.py gets three new classes (14 tests): * TestInvalidateRuntimeClient — per-region eviction correctness; non-cached region returns False. * TestIsStaleConnectionError — classifies botocore ConnectionClosedError / EndpointConnectionError / ReadTimeoutError, urllib3 ProtocolError, library-internal AssertionError (both urllib3.* and botocore.* frames), and correctly ignores application-level AssertionError and unrelated exceptions (ValueError, KeyError). * TestCallConverseInvalidatesOnStaleError — end-to-end: stale error evicts the cached client, non-stale error (validation) leaves it alone, successful call leaves it cached. All 116 tests in test_bedrock_adapter.py pass. Signed-off-by: Andre Kurait <andrekurait@gmail.com>	2026-04-24 07:26:07 -07:00
Tranquil-Flow	7dc6eb9fbf	fix(agent): handle aws_sdk auth type in resolve_provider_client Bedrock's aws_sdk auth_type had no matching branch in resolve_provider_client(), causing it to fall through to the "unhandled auth_type" warning and return (None, None). This broke all auxiliary tasks (compression, memory, summarization) for Bedrock users — the main conversation loop worked fine, but background context management silently failed. Add an aws_sdk branch that creates an AnthropicAuxiliaryClient via build_anthropic_bedrock_client(), using boto3's default credential chain (IAM roles, SSO, env vars, instance metadata). Default auxiliary model is Haiku for cost efficiency. Closes #13919	2026-04-24 07:26:07 -07:00
Andre Kurait	b290297d66	fix(bedrock): resolve context length via static table before custom-endpoint probe ## Problem `get_model_context_length()` in `agent/model_metadata.py` had a resolution order bug that caused every Bedrock model to fall back to the 128K default context length instead of reaching the static Bedrock table (200K for Claude, etc.). The root cause: `bedrock-runtime.<region>.amazonaws.com` is not listed in `_URL_TO_PROVIDER`, so `_is_known_provider_base_url()` returned False. The resolution order then ran the custom-endpoint probe (step 2) before the Bedrock branch (step 4b), which: 1. Treated Bedrock as a custom endpoint (via `_is_custom_endpoint`). 2. Called `fetch_endpoint_model_metadata()` → `GET /models` on the bedrock-runtime URL (Bedrock doesn't serve this shape). 3. Fell through to `return DEFAULT_FALLBACK_CONTEXT` (128K) at the "probe-down" branch — never reaching the Bedrock static table. Result: users on Bedrock saw 128K context for Claude models that actually support 200K on Bedrock, causing premature auto-compression. ## Fix Promote the Bedrock branch from step 4b to step 1b, so it runs before the custom-endpoint probe at step 2. The static table in `bedrock_adapter.py::get_bedrock_context_length()` is the authoritative source for Bedrock (the ListFoundationModels API doesn't expose context window sizes), so there's no reason to probe `/models` first. The original step 4b is replaced with a one-line breadcrumb comment pointing to the new location, to make the resolution-order docstring accurate. ## Changes - `agent/model_metadata.py` - Add step 1b: Bedrock static-table branch (unchanged predicate, moved). - Remove dead step 4b block, replace with breadcrumb comment. - Update resolution-order docstring to include step 1b. - `tests/agent/test_model_metadata.py` - New `TestBedrockContextResolution` class (3 tests): - `test_bedrock_provider_returns_static_table_before_probe`: confirms `provider="bedrock"` hits the static table and does NOT call `fetch_endpoint_model_metadata` (regression guard). - `test_bedrock_url_without_provider_hint`: confirms the `bedrock-runtime.*.amazonaws.com` host match works without an explicit `provider=` hint. - `test_non_bedrock_url_still_probes`: confirms the probe still fires for genuinely-custom endpoints (no over-reach). ## Testing pytest tests/agent/test_model_metadata.py -q # 83 passed in 1.95s (3 new + 80 existing) ## Risk Very low. - Predicate is identical to the original step 4b — no behaviour change for non-Bedrock paths. - Original step 4b was dead code for the user-facing case (always hit the 128K fallback first), so removing it cannot regress behaviour. - Bedrock path now short-circuits before any network I/O — faster too. - `ImportError` fall-through preserved so users without `boto3` installed are unaffected. ## Related - This is a prerequisite for accurate context-window accounting on Bedrock — the fix for #14710 (stale-connection client eviction) depends on correct context sizing to know when to compress. Signed-off-by: Andre Kurait <andrekurait@gmail.com>	2026-04-24 07:26:07 -07:00
Qi Ke	f2fba4f9a1	fix(anthropic): auto-detect Bedrock model IDs in normalize_model_name (#12295 ) Bedrock model IDs use dots as namespace separators (anthropic.claude-opus-4-7, us.anthropic.claude-sonnet-4-5-v1:0), not version separators. normalize_model_name() was unconditionally converting all dots to hyphens, producing invalid IDs that Bedrock rejects with HTTP 400/404. This affected both the main agent loop (partially mitigated by _anthropic_preserve_dots in run_agent.py) and all auxiliary client calls (compression, session_search, vision, etc.) which go through _AnthropicCompletionsAdapter and never pass preserve_dots=True. Fix: add _is_bedrock_model_id() to detect Bedrock namespace prefixes (anthropic., us., eu., ap., jp., global.) and skip dot-to-hyphen conversion for these IDs regardless of the preserve_dots flag.	2026-04-24 07:26:07 -07:00
Teknium	fcc05284fc	fix(delegate): tool-activity-aware heartbeat stale detection (#13041 ) (#15183 ) A child running a legitimately long-running tool (terminal command, browser fetch, big file read) holds current_tool set and keeps api_call_count frozen while the tool runs. The previous stale check treated that as idle after 5 heartbeat cycles (~150s), stopped touching the parent, and let the gateway kill the session. Split the threshold in two: - _HEARTBEAT_STALE_CYCLES_IDLE=5 (~150s) — applied only when current_tool is None (child wedged between turns) - _HEARTBEAT_STALE_CYCLES_IN_TOOL=20 (~600s) — applied when the child is inside a tool call Stale counter also resets when current_tool changes (new tool = progress). The hard child_timeout_seconds (default 600s) is still the final cap, so genuinely stuck tools don't get to block forever.	2026-04-24 07:25:19 -07:00
Teknium	1840c6a57d	feat(spotify): wire setup wizard into 'hermes tools' + document cron usage (#15180 ) A — 'hermes tools' activation now runs the full Spotify wizard. Previously a user had to (1) toggle the Spotify toolset on in 'hermes tools' AND (2) separately run 'hermes auth spotify' to actually use it. The second step was a discovery gap — the docs mentioned it but nothing in the TUI pointed users there. Now toggling Spotify on calls login_spotify_command as a post_setup hook. If the user has no client_id yet, the interactive wizard walks them through Spotify app creation; if they do, it skips straight to PKCE. Either way, one 'hermes tools' pass leaves Spotify toggled on AND authenticated. SystemExit from the wizard (user abort) leaves the toolset enabled and prints a 'run: hermes auth spotify' hint — it does NOT fail the toolset toggle. Dropped the TOOL_CATEGORIES env_vars list for Spotify. The wizard handles HERMES_SPOTIFY_CLIENT_ID persistence itself, and asking users to type env var names before the wizard fires was UX-backwards — the point of the wizard is that they don't HAVE a client_id yet. B — Docs page now covers cron + Spotify. New 'Scheduling: Spotify + cron' section with two working examples (morning playlist, wind-down pause) using the real 'hermes cron add' CLI surface (verified via 'cron add --help'). Covers the active-device gotcha, Premium gating, memory isolation, and links to the cron docs. Also fixed a stale '9 Spotify tools' reference in the setup copy — we consolidated to 7 tools in #15154. Validation: - scripts/run_tests.sh tests/hermes_cli/test_tools_config.py tests/hermes_cli/test_spotify_auth.py tests/tools/test_spotify_client.py → 54 passed - website: node scripts/prebuild.mjs && npx docusaurus build → SUCCESS, no new warnings	2026-04-24 07:24:28 -07:00
Blind Dev	591aa159aa	feat: allow Telegram chat allowlists for groups and forums (#15027 ) * feat: allow Telegram chat allowlists for groups and forums * chore: map web3blind noreply email for release attribution --------- Co-authored-by: web3blind <web3blind@users.noreply.github.com>	2026-04-24 07:23:14 -07:00
Teknium	c6b734e24d	chore(release): map Group B contributors in AUTHOR_MAP	2026-04-24 07:14:00 -07:00
Wooseong Kim	54146ae07c	fix(aux): refresh cached auth after 401	2026-04-24 07:14:00 -07:00
Wooseong Kim	be6b83562d	fix(aux): force anthropic oauth refresh after 401 Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 07:14:00 -07:00
5park1e	e1106772d9	fix: re-auth on stale OAuth token; read Claude Code credentials from macOS Keychain Bug 3 — Stale OAuth token not detected in 'hermes model': - _model_flow_anthropic used 'has_creds = bool(existing_key)' which treats any non-empty token (including expired OAuth tokens) as valid. - Added existing_is_stale_oauth check: if the only credential is an OAuth token (sk-ant- prefix) with no valid cc_creds fallback, mark it stale and force the re-auth menu instead of silently accepting a broken token. Bug 4 — macOS Keychain credentials never read: - Claude Code >=2.1.114 migrated from ~/.claude/.credentials.json to the macOS Keychain under service 'Claude Code-credentials'. - Added _read_claude_code_credentials_from_keychain() using the 'security' CLI tool; read_claude_code_credentials() now tries Keychain first then falls back to JSON file. - Non-Darwin platforms return None from Keychain read immediately. Tests: - tests/agent/test_anthropic_keychain.py: 11 cases covering Darwin-only guard, security command failures, JSON parsing, fallback priority. - tests/hermes_cli/test_anthropic_model_flow_stale_oauth.py: 8 cases covering stale OAuth detection, API key passthrough, cc_creds fallback. Refs: #12905	2026-04-24 07:14:00 -07:00
nightq	5383615db5	fix: recognize Claude Code OAuth tokens (cc- prefix) in _is_oauth_token Fixes NousResearch/hermes-agent#9813 Root cause: _is_oauth_token() only recognized sk-ant-* and eyJ* patterns, but Claude Code OAuth tokens from CLAUDE_CODE_OAUTH_TOKEN use cc- prefix Fix: Add cc- prefix detection so these tokens route through Bearer auth	2026-04-24 07:14:00 -07:00
Maymun	56086e3fd7	fix(auth): write Anthropic OAuth token files atomically to prevent corruption	2026-04-24 07:14:00 -07:00
Teknium	8d12fb1e6b	refactor(spotify): convert to built-in bundled plugin under plugins/spotify (#15174 ) Moves the Spotify integration from tools/ into plugins/spotify/, matching the existing pattern established by plugins/image_gen/ for third-party service integrations. Why: - tools/ should be reserved for foundational capabilities (terminal, read_file, web_search, etc.). tools/providers/ was a one-off directory created solely for spotify_client.py. - plugins/ is already the home for image_gen backends, memory providers, context engines, and standalone hook-based plugins. Spotify is a third-party service integration and belongs alongside those, not in tools/. - Future service integrations (eventually: Deezer, Apple Music, etc.) now have a pattern to copy. Changes: - tools/spotify_tool.py → plugins/spotify/tools.py (handlers + schemas) - tools/providers/spotify_client.py → plugins/spotify/client.py - tools/providers/ removed (was only used for Spotify) - New plugins/spotify/__init__.py with register(ctx) calling ctx.register_tool() × 7. The handler/check_fn wiring is unchanged. - New plugins/spotify/plugin.yaml (kind: backend, bundled, auto-load). - tests/tools/test_spotify_client.py: import paths updated. tools_config fix — _DEFAULT_OFF_TOOLSETS now wins over plugin auto-enable: - _get_platform_tools() previously auto-enabled unknown plugin toolsets for new platforms. That was fine for image_gen (which has no toolset of its own) but bad for Spotify, which explicitly requires opt-in (don't ship 7 tool schemas to users who don't use it). Added a check: if a plugin toolset is in _DEFAULT_OFF_TOOLSETS, it stays off until the user picks it in 'hermes tools'. Pre-existing test bug fix: - tests/hermes_cli/test_plugins.py::test_list_returns_sorted asserted names were sorted, but list_plugins() sorts by key (path-derived, e.g. image_gen/openai). With only image_gen plugins bundled, name and key order happened to agree. Adding plugins/spotify broke that coincidence (spotify sorts between openai-codex and xai by name but after xai by key). Updated test to assert key order, which is what the code actually documents. Validation: - scripts/run_tests.sh tests/hermes_cli/test_plugins.py \ tests/hermes_cli/test_tools_config.py \ tests/hermes_cli/test_spotify_auth.py \ tests/tools/test_spotify_client.py \ tests/tools/test_registry.py → 143 passed - E2E plugin load: 'spotify' appears in loaded plugins, all 7 tools register into the spotify toolset, check_fn gating intact.	2026-04-24 07:06:11 -07:00
Teknium	e5d41f05d4	feat(spotify): consolidate tools (9→7), add spotify skill, surface in hermes setup (#15154 ) Three quality improvements on top of #15121 / #15130 / #15135: 1. Tool consolidation (9 → 7) - spotify_saved_tracks + spotify_saved_albums → spotify_library with kind='tracks'\|'albums'. Handler code was ~90 percent identical across the two old tools; the merge is a behavioral no-op. - spotify_activity dropped. Its 'now_playing' action was a duplicate of spotify_playback.get_currently_playing (both return identical 204/empty payloads). Its 'recently_played' action moves onto spotify_playback as a new action — history belongs adjacent to live state. - Net: each API call ships 2 fewer tool schemas when the Spotify toolset is enabled, and the action surface is more discoverable (everything playback-related is on one tool). 2. Spotify skill (skills/media/spotify/SKILL.md) Teaches the agent canonical usage patterns so common requests don't balloon into 4+ tool calls: - 'play X' = one search, then play by URI (not search + scan + describe + play) - 'what's playing' = single get_currently_playing (no preflight get_state chain) - Don't retry on '403 Premium required' or '403 No active device' — both require user action - URI/URL/bare-ID format normalization - Full failure-mode reference for 204/401/403/429 3. Surfaced in 'hermes setup' tool status Adds 'Spotify (PKCE OAuth)' to the tool status list when auth.json has a Spotify access/refresh token. Matches the homeassistant pattern but reads from auth.json (OAuth-based) rather than env vars. Docs updated to reflect the new 7-tool surface, and mention the companion skill in the 'Using it' section. Tests: 54 passing (client 22, auth 15, tools_config 35 — 18 = 54 after renaming/replacing the spotify_activity tests with library + recently_played coverage). Docusaurus build clean.	2026-04-24 06:14:51 -07:00
Teknium	9d1b277e1d	chore(release): map Group H contributors in AUTHOR_MAP	2026-04-24 05:48:15 -07:00
XieNBi	4a51ab61eb	fix(cli): non-zero /model counts for native OpenAI and direct API rows	2026-04-24 05:48:15 -07:00
Brian D. Evans	7f26cea390	fix(models): strip models/ prefix in Gemini validator (#12532 ) Salvage of the Gemini-specific piece from PR #12585 by @briandevans. Gemini's OpenAI-compat /v1beta/openai/models endpoint returns IDs prefixed with 'models/' (native Gemini-API convention), so set-membership against curated bare IDs drops every model. Strip the prefix before comparison. The Anthropic static-catalog piece of #12585 was subsumed by #12618's _fetch_anthropic_models() branch landing earlier in the same salvage PR. Full branch cherry-pick was skipped because it also carried unrelated catalog-version regressions.	2026-04-24 05:48:15 -07:00
H-Ali13381	2303dd8686	fix(models): use Anthropic-native headers for model validation The generic /v1/models probe in validate_requested_model() sent a plain 'Authorization: Bearer <key>' header, which works for OpenAI-compatible endpoints but results in a 401 Unauthorized from Anthropic's API. Anthropic requires x-api-key + anthropic-version headers (or Bearer for OAuth tokens from Claude Code). Add a provider-specific branch for normalized == 'anthropic' that calls the existing _fetch_anthropic_models() helper, which already handles both regular API keys and Claude Code OAuth tokens correctly. This mirrors the pattern already used for openai-codex, copilot, and bedrock. The branch also includes: - fuzzy auto-correct (cutoff 0.9) for near-exact model ID typos - fuzzy suggestions (cutoff 0.5) when the model is not listed - graceful fall-through when the token cannot be resolved or the network is unreachable (accepts with a warning rather than hard-fail) - a note that newer/preview/snapshot model IDs can be gate-listed and may still work even if not returned by /v1/models Fixes Anthropic provider users seeing 'service unreachable' errors when running /model <claude-model> because every probe 401'd.	2026-04-24 05:48:15 -07:00
wangshengyang2004	647900e813	fix(cli): support model validation for anthropic_messages and cloudflare-protected endpoints - probe_api_models: add api_mode param; use x-api-key + anthropic-version headers for anthropic_messages mode (Anthropic's native Models API auth) - probe_api_models: add User-Agent header to avoid Cloudflare 403 blocks on third-party OpenAI-compatible endpoints - validate_requested_model: pass api_mode through from switch_model - validate_requested_model: for anthropic_messages mode, attempt probe with correct auth; if probe fails (many proxies don't implement /v1/models), accept the model with an informational warning instead of rejecting - fetch_api_models: propagate api_mode to probe_api_models	2026-04-24 05:48:15 -07:00
Teknium	25465fd8d7	test(gateway): on_session_finalize fires on idle-expiry + AUTHOR_MAP Regression test for #14981. Verifies that _session_expiry_watcher fires on_session_finalize for each session swept out of the store, matching the contract documented for /new, /reset, CLI shutdown, and gateway stop. Verified the test fails cleanly on pre-fix code (hook call list missing sess-expired) and passes with the fix applied.	2026-04-24 05:40:52 -07:00
Stefan Dimitrov	260ae62134	Invoke session finalize hooks on expiry flush	2026-04-24 05:40:52 -07:00
Teknium	9be17bb84f	docs(spotify): expand feature page with tool reference, Free/Premium matrix, troubleshooting (#15135 ) The initial Spotify docs page shipped in #15130 was a setup guide. This expands it into a full feature reference: - Per-tool parameter table for all 9 tools, extracted from the real schemas in tools/spotify_tool.py (actions, required/optional args, premium gating). - Free vs Premium feature matrix — which actions work on which tier, so Free users don't assume Spotify tools are useless to them. - Active-device prerequisite called out at the top; this is the #1 cause of '403 no active device' reports for every Spotify integration. - SSH / headless section explaining that browser auto-open is skipped when SSH_CLIENT/SSH_TTY is set, and how to tunnel the callback port. - Token lifecycle: refresh on 401, persistence across restarts, how to revoke server-side via spotify.com/account/apps. - Example prompt list so users know what to ask the agent. - Troubleshooting expanded: no-active-device, Premium-required, 204 now_playing, INVALID_CLIENT, 429, 401 refresh-revoked, wizard not opening browser. - 'Where things live' table mapping auth.json / .env / Spotify app. Verified with 'node scripts/prebuild.mjs && npx docusaurus build' — page compiles, no new warnings.	2026-04-24 05:38:02 -07:00
Teknium	fe9d9a26d8	chore(release): map Group F contributors in AUTHOR_MAP	2026-04-24 05:35:43 -07:00
Tranquil-Flow	ee83a710f0	fix(gateway,cron): activate fallback_model when primary provider auth fails When the primary provider raises AuthError (expired OAuth token, revoked API key), the error was re-raised before AIAgent was created, so fallback_model was never consulted. Now both gateway/run.py and cron/scheduler.py catch AuthError specifically and attempt to resolve credentials from the fallback_providers/fallback_model config chain before propagating the error. Closes #7230	2026-04-24 05:35:43 -07:00
vlwkaos	f7f7588893	fix(agent): only set rate-limit cooldown when leaving primary; add tests	2026-04-24 05:35:43 -07:00
LeonSGP43	a9fd8d7c88	fix(agent): default missing fallback chain on switch	2026-04-24 05:35:43 -07:00
CruxExperts	46451528a5	fix(agent): pass config_context_length in fallback activation path Try to activate fallback model after errors was calling get_model_context_length() without the config_context_length parameter, causing it to fall through to DEFAULT_FALLBACK_CONTEXT (128K) even when config.yaml has an explicit model.context_length value (e.g. 204800 for MiniMax-M2.7). This mirrors the fix already present in switch_model() at line 1988, which correctly passes config_context_length. The fallback path was missed. Fixes: context_length forced to 128K on fallback activation	2026-04-24 05:35:43 -07:00
Bartok9	4e27e498f1	fix(agent): exclude ssl.SSLError from is_local_validation_error to prevent non-retryable abort ssl.SSLError (and its subclass ssl.SSLCertVerificationError) inherits from OSError and ValueError via Python's MRO. The is_local_validation_error check used isinstance(api_error, (ValueError, TypeError)) to detect programming bugs that should abort immediately — but this inadvertently caught ssl.SSLError, treating a TLS transport failure as a non-retryable client error. The error classifier already maps SSLCertVerificationError to FailoverReason.timeout with retryable=True (its type name is in _TRANSPORT_ERROR_TYPES), but the inline isinstance guard was overriding that classification and triggering an unnecessary abort. Fix: add ssl.SSLError to the exclusion list alongside the existing UnicodeEncodeError carve-out so TLS errors fall through to the classifier's retryable path. Closes #14367	2026-04-24 05:35:43 -07:00
Teknium	ba44a3d256	fix(gemini): fail fast on missing API key + surface it in hermes dump (#15133 ) Two small fixes triggered by a support report where the user saw a cryptic 'HTTP 400 - Error 400 (Bad Request)!!1' (Google's GFE HTML error page, not a real API error) on every gemini-2.5-pro request. The underlying cause was an empty GOOGLE_API_KEY / GEMINI_API_KEY, but nothing in our output made that diagnosable: 1. hermes_cli/dump.py: the api_keys section enumerated 23 providers but omitted Google entirely, so users had no way to verify from 'hermes dump' whether the key was set. Added GOOGLE_API_KEY and GEMINI_API_KEY rows. 2. agent/gemini_native_adapter.py: GeminiNativeClient.__init__ accepted an empty/whitespace api_key and stamped it into the x-goog-api-key header, which made Google's frontend return a generic HTML 400 long before the request reached the Generative Language backend. Now we raise RuntimeError at construction with an actionable message pointing at GOOGLE_API_KEY/GEMINI_API_KEY and aistudio.google.com. Added a regression test that covers '', ' ', and None.	2026-04-24 05:35:17 -07:00
Teknium	a1caec1088	fix(agent): repair CamelCase + _tool suffix tool-call emissions (#15124 ) Claude-style and some Anthropic-tuned models occasionally emit tool names as class-like identifiers: TodoTool_tool, Patch_tool, BrowserClick_tool, PatchTool. These failed strict-dict lookup in valid_tool_names and triggered the 'Unknown tool' self-correction loop, wasting a full turn of iteration and tokens. _repair_tool_call already handled lowercase / separator / fuzzy matches but couldn't bridge the CamelCase-to-snake_case gap or the trailing '_tool' suffix that Claude sometimes tacks on. Extend it with two bounded normalization passes: 1. CamelCase -> snake_case (via regex lookbehind). 2. Strip trailing _tool / -tool / tool suffix (case-insensitive, applied twice so TodoTool_tool reduces all the way: strip _tool -> TodoTool, snake -> todo_tool, strip 'tool' -> todo). Cheap fast-paths (lowercase / separator-normalized) still run first so the common case stays zero-cost. Fuzzy match remains the last resort unchanged. Tests: tests/run_agent/test_repair_tool_call_name.py covers the three original reports (TodoTool_tool, Patch_tool, BrowserClick_tool), plus PatchTool, WriteFileTool, ReadFile_tool, write-file_Tool, patch-tool, and edge cases (empty, None, '_tool' alone, genuinely unknown names). 18 new tests + 17 existing arg-repair tests = 35/35 pass. Closes #14784	2026-04-24 05:32:08 -07:00
Teknium	05394f2f28	feat(spotify): interactive setup wizard + docs page (#15130 ) Previously 'hermes auth spotify' crashed with 'HERMES_SPOTIFY_CLIENT_ID is required' if the user hadn't manually created a Spotify developer app and set env vars. Now the command detects a missing client_id and walks the user through the one-time app registration inline: - Opens https://developer.spotify.com/dashboard in the browser - Tells the user exactly what to paste into the Spotify form (including the correct default redirect URI, 127.0.0.1:43827) - Prompts for the Client ID - Persists HERMES_SPOTIFY_CLIENT_ID to ~/.hermes/.env so subsequent runs skip the wizard - Continues straight into the PKCE OAuth flow Also prints the docs URL at both the start of the wizard and the end of a successful login so users can find the full guide. Adds website/docs/user-guide/features/spotify.md with the complete setup walkthrough, tool reference, and troubleshooting, and wires it into the sidebar under User Guide > Features > Advanced. Fixes a stale redirect URI default in the hermes_cli/tools_config.py TOOL_CATEGORIES entry (was 8888/callback from the PR description instead of the actual DEFAULT_SPOTIFY_REDIRECT_URI value 43827/spotify/callback defined in auth.py).	2026-04-24 05:30:05 -07:00
Teknium	0d32411310	chore(release): map Group D contributors in AUTHOR_MAP	2026-04-24 05:28:45 -07:00
Brian D. Evans	e87a2100f6	fix(mcp): auto-reconnect + retry once when the transport session expires (#13383 ) Streamable HTTP MCP servers may garbage-collect their server-side session state while the OAuth token remains valid — idle TTL, server restart, pod rotation, etc. Before this fix, the tool-call handler treated the resulting "Invalid or expired session" error as a plain tool failure with no recovery path, so every subsequent call on the affected server failed until the gateway was manually restarted. Reporter: #13383. The OAuth-based recovery path (``_handle_auth_error_and_retry``) already exists for 401s, but it only fires on auth errors. Session expiry slipped through because the access token is still valid — nothing 401'd, so the existing recovery branch was skipped. Fix --- Add a sibling function ``_handle_session_expired_and_retry`` that detects MCP session-expiry via ``_is_session_expired_error`` (a narrow allow-list of known-stable substrings: ``"invalid or expired session"``, ``"session expired"``, ``"session not found"``, ``"unknown session"``, etc.) and then uses the existing transport reconnect mechanism: * Sets ``MCPServerTask._reconnect_event`` — the server task's lifecycle loop already interprets this as "tear down the current ``streamablehttp_client`` + ``ClientSession`` and rebuild them, reusing the existing OAuth provider instance". * Waits up to 15 s for the new session to come back ready. * Retries the original call once. If the retry succeeds, returns its result and resets the circuit-breaker error count. If the retry raises, or if the reconnect doesn't ready in time, falls through to the caller's generic error path. Unlike the 401 path, this does not call ``handle_401`` — the access token is already valid and running an OAuth refresh would be a pointless round-trip. All 5 MCP handlers (``call_tool``, ``list_resources``, ``read_resource``, ``list_prompts``, ``get_prompt``) now consult both recovery paths before falling through: recovered = _handle_auth_error_and_retry(...) # 401 path if recovered is not None: return recovered recovered = _handle_session_expired_and_retry(...) # new if recovered is not None: return recovered # generic error response Narrow scope — explicitly not changed ------------------------------------- * Detection is string-based on a 5-entry allow-list. The MCP SDK wraps JSON-RPC errors in ``McpError`` whose exception type + attributes vary across SDK versions, so matching on message substrings is the durable path. Kept narrow to avoid false positives — a regular ``RuntimeError("Tool failed")`` will NOT trigger spurious reconnects (pinned by ``test_is_session_expired_rejects_unrelated_errors``). * No change to the existing 401 recovery flow. The new path is consulted only after the auth path declines (returns ``None``). * Retry count stays at 1. If the reconnect-then-retry also fails, we don't loop — the error surfaces normally so the model sees a failed tool call rather than a hang. * ``InterruptedError`` is explicitly excluded from session-expired detection so user-cancel signals always short-circuit the same way they did before (pinned by ``test_is_session_expired_rejects_interrupted_error``). Regression coverage ------------------- ``tests/tools/test_mcp_tool_session_expired.py`` (new, 16 cases): Unit tests for ``_is_session_expired_error``: * ``test_is_session_expired_detects_invalid_or_expired_session`` — reporter's exact wpcom-mcp text. * ``test_is_session_expired_detects_expired_session_variant`` — "Session expired" / "expired session" variants. * ``test_is_session_expired_detects_session_not_found`` — server GC variant ("session not found", "unknown session"). * ``test_is_session_expired_is_case_insensitive``. * ``test_is_session_expired_rejects_unrelated_errors`` — narrow-scope canary: random RuntimeError / ValueError / 401 don't trigger. * ``test_is_session_expired_rejects_interrupted_error`` — user cancel must never route through reconnect. * ``test_is_session_expired_rejects_empty_message``. Handler integration tests: * ``test_call_tool_handler_reconnects_on_session_expired`` — reporter's full repro: first call raises "Invalid or expired session", handler signals ``_reconnect_event``, retries once, returns the retry's success result with no ``error`` key. * ``test_call_tool_handler_non_session_expired_error_falls_through`` — preserved-behaviour canary: random tool failures do NOT trigger reconnect. * ``test_session_expired_handler_returns_none_without_loop`` — defensive: cold-start / shutdown race. * ``test_session_expired_handler_returns_none_without_server_record`` — torn-down server falls through cleanly. * ``test_session_expired_handler_returns_none_when_retry_also_fails`` — no retry loop on repeated failure. Parametrised across all 4 non-``tools/call`` handlers: * ``test_non_tool_handlers_also_reconnect_on_session_expired`` [list_resources / read_resource / list_prompts / get_prompt]. 15 of 16 fail on clean ``origin/main`` (``6fb69229``) with ``ImportError: cannot import name '_is_session_expired_error'`` — the fix's surface symbols don't exist there yet. The 1 passing test is an ordering artefact of pytest-xdist worker collection. Validation ---------- ``source venv/bin/activate && python -m pytest tests/tools/test_mcp_tool_session_expired.py -q`` → 16 passed. Broader MCP suite (5 files: ``test_mcp_tool.py``, ``test_mcp_tool_401_handling.py``, ``test_mcp_tool_session_expired.py``, ``test_mcp_reconnect_signal.py``, ``test_mcp_oauth.py``) → 230 passed, 0 regressions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-24 05:28:45 -07:00
AntAISecurityLab	8c2732a9f9	fix(security): strip MCP auth on cross-origin redirect Add event hook to httpx.AsyncClient in MCP HTTP transport that strips Authorization headers when a redirect targets a different origin, preventing credential leakage to third-party servers.	2026-04-24 05:28:45 -07:00
Alexazhu	15050fd965	fix(mcp_oauth): raise RuntimeError instead of asserting OAuth port is set ``tools/mcp_oauth.py`` relied on ``assert _oauth_port is not None`` to guard the module-level port set by ``build_oauth_auth``. Python's ``-O`` / ``-OO`` optimization flags strip ``assert`` statements entirely, so a deployment that runs ``python -O -m hermes ...`` silently loses the check: ``_oauth_port`` stays ``None`` and the failure surfaces much later as an obscure ``int()`` or ``http.server.HTTPServer((host, None))`` TypeError rather than the intended "OAuth callback port not set" signal. Replace with an explicit ``if … raise RuntimeError(...)`` so the invariant is preserved regardless of the interpreter's optimization level. Docstring updated to document the new exception. Found during a proactive audit of ``assert`` statements in non-test code paths.	2026-04-24 05:28:45 -07:00
Amanuel Tilahun Bogale	5fa2f4258a	fix: serialize Pydantic AnyUrl fields when persisting MCP OAuth state OAuth client information and token responses from the MCP SDK contain Pydantic AnyUrl fields (client_uri, redirect_uris, etc.). The previous model_dump() call returned a dict with these AnyUrl objects still as their native Python type, which then crashed json.dumps with: TypeError: Object of type AnyUrl is not JSON serializable This caused any OAuth-based MCP server (e.g. alphaxiv) to fail registration with an "OAuth flow error" traceback during startup. Adding mode="json" tells Pydantic to serialize all fields to JSON-compatible primitives (AnyUrl -> str, datetime -> ISO string, etc.) before returning the dict, so the standard json.dumps can handle it. Three call sites fixed: - HermesTokenStorage.set_tokens - HermesTokenStorage.set_client_info - build_oauth_auth pre-registration write	2026-04-24 05:28:45 -07:00
0xbyt4	4ac731c841	fix(model-normalize): pass DeepSeek V-series IDs through instead of folding to deepseek-chat `_normalize_for_deepseek` was mapping every non-reasoner input into `deepseek-chat` on the assumption that DeepSeek's API accepts only two model IDs. That assumption no longer holds — `deepseek-v4-pro` and `deepseek-v4-flash` are first-class IDs accepted by the direct API, and on aggregators `deepseek-chat` routes explicitly to V3 (DeepInfra backend returns `deepseek-chat-v3`). So a user picking V4 Pro through the model picker was being silently downgraded to V3. Verified 2026-04-24 against Nous portal's OpenAI-compat surface: - `deepseek/deepseek-v4-flash` → provider: DeepSeek, model: deepseek-v4-flash-20260423 - `deepseek/deepseek-chat` → provider: DeepInfra, model: deepseek/deepseek-chat-v3 Fix: - Add `deepseek-v4-pro` and `deepseek-v4-flash` to `_DEEPSEEK_CANONICAL_MODELS` so exact matches pass through. - Add `_DEEPSEEK_V_SERIES_RE` (`^deepseek-v\d+(...)?$`) so future V-series IDs (`deepseek-v5-*`, dated variants) keep passing through without another code change. - Update docstring + module header to reflect the new rule. Tests: - New `TestDeepseekVSeriesPassThrough` — 8 parametrized cases covering bare, vendor-prefixed, case-variant, dated, and future V-series IDs plus end-to-end `normalize_model_for_provider(..., "deepseek")`. - New `TestDeepseekCanonicalAndReasonerMapping` — regression coverage for canonical pass-through, reasoner-keyword folding, and fall-back-to-chat behaviour. - 77/77 pass. Reported on Discord (Ufonik, Don Piedro): `/model > Deepseek > deepseek-v4-pro` surfaced `Normalized 'deepseek-v4-pro' to 'deepseek-chat'`. Picker listing showed the v4 names, so validation also rejected the post-normalize `deepseek-chat` as "not in provider listing" — the contradiction users saw. Normalizer now respects the picker's choice.	2026-04-24 05:24:54 -07:00
Teknium	acd78a457e	fix(docker): reap orphaned subprocesses via tini as PID 1 (#15116 ) Install tini in the container image and route ENTRYPOINT through `/usr/bin/tini -g -- /opt/hermes/docker/entrypoint.sh`. Without a PID-1 init, orphans reparented to hermes (MCP stdio servers, git, bun, browser daemons) never get waited() on and accumulate as zombies. Long-running gateway containers eventually exhaust the PID table and hit "fork: cannot allocate memory". tini is the standard container init (same pattern Docker's --init flag and Kubernetes pause container use). It handles SIGCHLD, reaps orphans, and forwards SIGTERM/SIGINT to the entrypoint so hermes's existing graceful-shutdown handlers still run. The -g flag sends signals to the whole process group so `docker stop` cleanly terminates hermes and its descendants, not just direct children. Closes #15012. E2E-verified with a minimal reproducer image: spawning 5 orphans that reparent to PID 1 leaves 5 zombies without tini and 0 with tini.	2026-04-24 05:22:34 -07:00
Teknium	4ff7950f7f	chore(spotify): gate toolset off by default, add to hermes tools UI Follow-up on top of #15096 cherry-pick: - Remove spotify_* from _HERMES_CORE_TOOLS (keep only in the 'spotify' toolset, so the 9 Spotify tool schemas are not shipped to every user). - Add 'spotify' to CONFIGURABLE_TOOLSETS + _DEFAULT_OFF_TOOLSETS so new installs get it opt-in via 'hermes tools', matching homeassistant/rl. - Wire TOOL_CATEGORIES entry pointing at 'hermes auth spotify' for the actual PKCE login (optional HERMES_SPOTIFY_CLIENT_ID / HERMES_SPOTIFY_REDIRECT_URI env vars). - scripts/release.py: map contributor email to GitHub login.	2026-04-24 05:20:38 -07:00
Dilee	7e9dd9ca45	Add native Spotify tools with PKCE auth	2026-04-24 05:20:38 -07:00
Teknium	3392d1e422	chore(release): map Group E contributors in AUTHOR_MAP	2026-04-24 05:20:05 -07:00
konsisumer	785d168d50	fix(credential_pool): add Nous OAuth cross-process auth-store sync Concurrent Hermes processes (e.g. cron jobs) refreshing a Nous OAuth token via resolve_nous_runtime_credentials() write the rotated tokens to auth.json. The calling process's pool entry becomes stale, and the next refresh against the already-rotated token triggers a 'refresh token reuse' revocation on the Nous Portal. _sync_nous_entry_from_auth_store() reads auth.json under the same lock used by resolve_nous_runtime_credentials, and adopts the newer token pair before refreshing the pool entry. This complements #15111 (which preserved the obtained_at timestamps through seeding). Partial salvage of #10160 by @konsisumer — only the agent/credential_pool.py changes + the 3 Nous-specific regression tests. The PR also touched 10 unrelated files (Dockerfile, tips.py, various tool tests) which were dropped as scope creep. Regression tests: - test_sync_nous_entry_from_auth_store_adopts_newer_tokens - test_sync_nous_entry_noop_when_tokens_match - test_nous_exhausted_entry_recovers_via_auth_store_sync	2026-04-24 05:20:05 -07:00
Michael Steuer	cd221080ec	fix: validate nous auth status against runtime credentials	2026-04-24 05:20:05 -07:00
Prasad Subrahmanya	1fc77f995b	fix(agent): fall back on rate limit when pool has no rotation room Extracts pool-rotation-room logic into `_pool_may_recover_from_rate_limit` so single-credential pools no longer block the eager-fallback path on 429. The existing check `pool is not None and pool.has_available()` lets fallback fire only after the pool marks every entry as exhausted. With exactly one credential in the pool (the common shape for Gemini OAuth, Vertex service accounts, and any personal-key setup), `has_available()` flips back to True as soon as the cooldown expires — Hermes retries against the same entry, hits the same daily-quota 429, and burns the retry budget in a tight loop before ever reaching the configured `fallback_model`. Observed in the wild as 4+ hours of 429 noise on a single Gemini key instead of falling through to Vertex as configured. Rotation is only meaningful with more than one credential — gate on `len(pool.entries()) > 1`. Multi-credential pools keep the current wait-for-rotation behaviour unchanged. Fixes #11314. Related to #8947, #10210, #7230. Narrower scope than open PRs #8023 (classifier change) and #11492 (503/529 credential-pool bypass) — this addresses the single-credential 429 case specifically and does not conflict with either. Tests: 6 new unit tests in tests/run_agent/test_provider_fallback.py covering (a) None pool, (b) single-cred available, (c) single-cred in cooldown, (d) 2-cred available rotates, (e) multi-cred all cooling-down falls back, (f) many-cred available rotates. All 18 tests in the file pass.	2026-04-24 05:20:05 -07:00
jakubkrcmar	1af44a13c0	fix(model_picker): detect mapped-provider auth-store credentials	2026-04-24 05:20:05 -07:00
Andy	fff7ee31ae	fix: clarify auth retry guidance	2026-04-24 05:20:05 -07:00
YueLich	6fcaf5ebc2	fix: rotate credential pool on 403 (Forbidden) responses Previously _handle_credential_pool_error handled 401, 402, and 429 but silently ignored 403. When a provider returns 403 for a revoked or unauthorised credential (e.g. Nous agent_key invalidated by a newer login), the pool was never rotated and every subsequent request continued to use the same failing credential. Treat 403 the same as 402: immediately mark the current credential exhausted and rotate to the next pool entry, since a Forbidden response will not resolve itself with a retry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 05:20:05 -07:00
vominh1919	461899894e	fix: increment request_count in least_used pool strategy The least_used strategy selected entries via min(request_count) but never incremented the counter. All entries stayed at count=0, so the strategy degenerated to fill_first behavior with no actual load balancing. Now increments request_count after each selection and persists the update.	2026-04-24 05:20:05 -07:00
Teknium	b3aed6cfd8	chore(release): map l0hde and difujia in AUTHOR_MAP	2026-04-24 05:09:08 -07:00

1 2 3 4 5 ...

5771 commits