hermes-agent/tools
Brian D. Evans e87a2100f6 fix(mcp): auto-reconnect + retry once when the transport session expires (#13383)
Streamable HTTP MCP servers may garbage-collect their server-side
session state while the OAuth token remains valid — idle TTL, server
restart, pod rotation, etc.  Before this fix, the tool-call handler
treated the resulting "Invalid or expired session" error as a plain
tool failure with no recovery path, so **every subsequent call on
the affected server failed until the gateway was manually
restarted**.  Reporter: #13383.

The OAuth-based recovery path (``_handle_auth_error_and_retry``)
already exists for 401s, but it only fires on auth errors.  Session
expiry slipped through because the access token is still valid —
nothing 401'd, so the existing recovery branch was skipped.

Fix
---
Add a sibling function ``_handle_session_expired_and_retry`` that
detects MCP session-expiry via ``_is_session_expired_error`` (a
narrow allow-list of known-stable substrings: ``"invalid or expired
session"``, ``"session expired"``, ``"session not found"``,
``"unknown session"``, etc.) and then uses the existing transport
reconnect mechanism:

* Sets ``MCPServerTask._reconnect_event`` — the server task's
  lifecycle loop already interprets this as "tear down the current
  ``streamablehttp_client`` + ``ClientSession`` and rebuild them,
  reusing the existing OAuth provider instance".
* Waits up to 15 s for the new session to come back ready.
* Retries the original call once.  If the retry succeeds, returns
  its result and resets the circuit-breaker error count.  If the
  retry raises, or if the reconnect doesn't ready in time, falls
  through to the caller's generic error path.

Unlike the 401 path, this does **not** call ``handle_401`` — the
access token is already valid and running an OAuth refresh would be
a pointless round-trip.

All 5 MCP handlers (``call_tool``, ``list_resources``, ``read_resource``,
``list_prompts``, ``get_prompt``) now consult both recovery paths
before falling through:

    recovered = _handle_auth_error_and_retry(...)          # 401 path
    if recovered is not None: return recovered
    recovered = _handle_session_expired_and_retry(...)     # new
    if recovered is not None: return recovered
    # generic error response

Narrow scope — explicitly not changed
-------------------------------------
* **Detection is string-based on a 5-entry allow-list.**  The MCP
  SDK wraps JSON-RPC errors in ``McpError`` whose exception type +
  attributes vary across SDK versions, so matching on message
  substrings is the durable path.  Kept narrow to avoid false
  positives — a regular ``RuntimeError("Tool failed")`` will NOT
  trigger spurious reconnects (pinned by
  ``test_is_session_expired_rejects_unrelated_errors``).
* **No change to the existing 401 recovery flow.**  The new path is
  consulted only after the auth path declines (returns ``None``).
* **Retry count stays at 1.**  If the reconnect-then-retry also
  fails, we don't loop — the error surfaces normally so the model
  sees a failed tool call rather than a hang.
* **``InterruptedError`` is explicitly excluded** from session-expired
  detection so user-cancel signals always short-circuit the same
  way they did before (pinned by
  ``test_is_session_expired_rejects_interrupted_error``).

Regression coverage
-------------------
``tests/tools/test_mcp_tool_session_expired.py`` (new, 16 cases):

Unit tests for ``_is_session_expired_error``:
* ``test_is_session_expired_detects_invalid_or_expired_session`` —
  reporter's exact wpcom-mcp text.
* ``test_is_session_expired_detects_expired_session_variant`` —
  "Session expired" / "expired session" variants.
* ``test_is_session_expired_detects_session_not_found`` — server GC
  variant ("session not found", "unknown session").
* ``test_is_session_expired_is_case_insensitive``.
* ``test_is_session_expired_rejects_unrelated_errors`` — narrow-scope
  canary: random RuntimeError / ValueError / 401 don't trigger.
* ``test_is_session_expired_rejects_interrupted_error`` — user cancel
  must never route through reconnect.
* ``test_is_session_expired_rejects_empty_message``.

Handler integration tests:
* ``test_call_tool_handler_reconnects_on_session_expired`` — reporter's
  full repro: first call raises "Invalid or expired session", handler
  signals ``_reconnect_event``, retries once, returns the retry's
  success result with no ``error`` key.
* ``test_call_tool_handler_non_session_expired_error_falls_through``
  — preserved-behaviour canary: random tool failures do NOT trigger
  reconnect.
* ``test_session_expired_handler_returns_none_without_loop`` —
  defensive: cold-start / shutdown race.
* ``test_session_expired_handler_returns_none_without_server_record``
  — torn-down server falls through cleanly.
* ``test_session_expired_handler_returns_none_when_retry_also_fails``
  — no retry loop on repeated failure.

Parametrised across all 4 non-``tools/call`` handlers:
* ``test_non_tool_handlers_also_reconnect_on_session_expired``
  [list_resources / read_resource / list_prompts / get_prompt].

**15 of 16 fail on clean ``origin/main`` (``6fb69229``)** with
``ImportError: cannot import name '_is_session_expired_error'``
— the fix's surface symbols don't exist there yet.  The 1 passing
test is an ordering artefact of pytest-xdist worker collection.

Validation
----------
``source venv/bin/activate && python -m pytest
tests/tools/test_mcp_tool_session_expired.py -q`` → **16 passed**.

Broader MCP suite (5 files:
``test_mcp_tool.py``, ``test_mcp_tool_401_handling.py``,
``test_mcp_tool_session_expired.py``, ``test_mcp_reconnect_signal.py``,
``test_mcp_oauth.py``) → **230 passed, 0 regressions**.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-24 05:28:45 -07:00
..
browser_providers feat: ungate Tool Gateway — subscription-based access with per-tool opt-in 2026-04-16 12:36:49 -07:00
environments fix(terminal): auto-source ~/.profile and ~/.bash_profile so n/nvm PATH survives (#14534) 2026-04-23 05:15:37 -07:00
neutts_samples refactor(tts): replace NeuTTS optional skill with built-in provider + setup flow 2026-03-17 02:33:12 -07:00
providers Add native Spotify tools with PKCE auth 2026-04-24 05:20:38 -07:00
__init__.py Merge branch 'main' into rewbs/tool-use-charge-to-subscription 2026-03-31 08:48:54 +09:00
ansi_strip.py fix: strip ANSI at the source — clean terminal output before it reaches the model 2026-03-23 07:43:12 -07:00
approval.py test: cover absolute paths in project env/config approval regex 2026-04-23 14:05:36 -07:00
binary_extensions.py fix(tools): address PR review — remove _extract_raw_output, BudgetConfig everywhere, read_file hardening 2026-04-08 02:24:32 -07:00
browser_camofox.py refactor: remove remaining redundant local imports (comprehensive sweep) 2026-04-21 00:50:58 -07:00
browser_camofox_state.py feat(browser): add persistent Camofox sessions and VNC URL discovery (salvage #4400) (#4419) 2026-04-01 04:18:50 -07:00
browser_cdp_tool.py fix: sanitize tool schemas for llama.cpp backends; restore MCP in TUI (#15032) 2026-04-24 02:44:46 -07:00
browser_dialog_tool.py feat(browser): CDP supervisor — dialog detection + response + cross-origin iframe eval (#14540) 2026-04-23 22:23:37 -07:00
browser_supervisor.py feat(browser): CDP supervisor — dialog detection + response + cross-origin iframe eval (#14540) 2026-04-23 22:23:37 -07:00
browser_tool.py feat(browser): CDP supervisor — dialog detection + response + cross-origin iframe eval (#14540) 2026-04-23 22:23:37 -07:00
budget_config.py fix: preserve existing thresholds, remove pre-read byte guard 2026-04-08 02:24:32 -07:00
checkpoint_manager.py refactor: remove redundant local imports already available at module level 2026-04-21 00:50:58 -07:00
clarify_tool.py refactor: add tool_error/tool_result helpers + read_raw_config, migrate 129 callsites 2026-04-07 13:36:38 -07:00
code_execution_tool.py fix(tools): restrict RPC socket permissions to owner-only 2026-04-22 17:27:18 -07:00
credential_files.py refactor: extract shared helpers to deduplicate repeated code patterns (#7917) 2026-04-11 13:59:52 -07:00
cronjob_tools.py feat(cron): per-job workdir for project-aware cron runs (#15110) 2026-04-24 05:07:01 -07:00
debug_helpers.py refactor: codebase-wide lint cleanup — unused imports, dead code, and inefficient patterns (#5821) 2026-04-07 10:25:31 -07:00
delegate_tool.py feat(delegate): diagnostic dump when a subagent times out with 0 API calls (#15105) 2026-04-24 04:58:32 -07:00
discord_tool.py feat: add Discord server introspection and management tool (#4753) 2026-04-19 11:52:19 -07:00
env_passthrough.py fix(env_passthrough): reject Hermes provider credentials from skill passthrough (#13523) 2026-04-21 06:14:25 -07:00
feishu_doc_tool.py fix(feishu-comment): use get_hermes_home(); drop dead asyncio wrapper; AUTHOR_MAP 2026-04-17 19:04:11 -07:00
feishu_drive_tool.py fix(feishu-comment): use get_hermes_home(); drop dead asyncio wrapper; AUTHOR_MAP 2026-04-17 19:04:11 -07:00
file_operations.py feat(skills): add design-md skill for Google's DESIGN.md spec (#14876) 2026-04-23 21:51:19 -07:00
file_state.py feat(delegate): cross-agent file state coordination for concurrent subagents (#13718) 2026-04-21 16:41:26 -07:00
file_tools.py fix(file_tools): resolve bookkeeping paths against live terminal cwd 2026-04-23 15:11:52 -07:00
fuzzy_match.py fix(patch): gate 'did you mean?' to no-match + extend to v4a/skill_manage 2026-04-21 02:03:46 -07:00
homeassistant_tool.py fix: clean up description escaping, add string-data tests 2026-04-13 04:45:07 -07:00
image_generation_tool.py fix(image-gen): force-refresh plugin providers in long-lived sessions 2026-04-23 03:01:18 -07:00
interrupt.py fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace (#11907) 2026-04-17 20:39:25 -07:00
managed_tool_gateway.py fix(tools): add debug logging for token refresh and tighten domain check 2026-04-02 12:40:03 +11:00
mcp_oauth.py fix(mcp_oauth): raise RuntimeError instead of asserting OAuth port is set 2026-04-24 05:28:45 -07:00
mcp_oauth_manager.py fix(mcp-oauth): bidirectional auth_flow bridge + absolute expires_at (salvage #12025) (#12717) 2026-04-19 16:31:07 -07:00
mcp_tool.py fix(mcp): auto-reconnect + retry once when the transport session expires (#13383) 2026-04-24 05:28:45 -07:00
memory_tool.py fix: nest msvcrt import inside fcntl except block 2026-04-14 10:18:05 -07:00
mixture_of_agents_tool.py Fix (mixture_of_agents): replace deprecated Gemini model and forward max_tokens to OpenRouter (#6621) 2026-04-23 15:14:11 -07:00
neutts_synth.py fix(tts): document NeuTTS provider and align install guidance (#1903) 2026-03-18 02:55:30 -07:00
openrouter_client.py refactor: route ad-hoc LLM consumers through centralized provider router 2026-03-11 20:02:36 -07:00
osv_check.py feat: OSV malware check for MCP extension packages (#5305) 2026-04-05 12:46:07 -07:00
patch_parser.py fix(patch): gate 'did you mean?' to no-match + extend to v4a/skill_manage 2026-04-21 02:03:46 -07:00
path_security.py refactor: extract shared helpers to deduplicate repeated code patterns (#7917) 2026-04-11 13:59:52 -07:00
process_registry.py refactor: remove redundant local imports already available at module level 2026-04-21 00:50:58 -07:00
registry.py fix: tighten AST check to module-level only 2026-04-14 21:12:29 -07:00
rl_training_tool.py refactor: codebase-wide lint cleanup — unused imports, dead code, and inefficient patterns (#5821) 2026-04-07 10:25:31 -07:00
schema_sanitizer.py fix: sanitize tool schemas for llama.cpp backends; restore MCP in TUI (#15032) 2026-04-24 02:44:46 -07:00
send_message_tool.py refactor: remove remaining redundant local imports (comprehensive sweep) 2026-04-21 00:50:58 -07:00
session_search_tool.py fix(aux): add session_search extra_body and concurrency controls 2026-04-20 00:47:39 -07:00
skill_manager_tool.py feat(skills-guard): gate agent-created scanner on config.skills.guard_agent_created (default off) 2026-04-23 06:20:47 -07:00
skills_guard.py feat(skills-guard): gate agent-created scanner on config.skills.guard_agent_created (default off) 2026-04-23 06:20:47 -07:00
skills_hub.py feat(skills): add MiniMax-AI/cli as default skill tap 2026-04-23 02:35:13 -07:00
skills_sync.py feat(skills_sync): surface collision with reset-hint 2026-04-23 05:09:08 -07:00
skills_tool.py fix(skills): follow symlinked category dirs consistently 2026-04-23 14:05:47 -07:00
spotify_tool.py Add native Spotify tools with PKCE auth 2026-04-24 05:20:38 -07:00
terminal_tool.py feat(skills): add design-md skill for Google's DESIGN.md spec (#14876) 2026-04-23 21:51:19 -07:00
tirith_security.py fix: guard against None tirith path in security scanner 2026-04-23 03:08:53 -07:00
todo_tool.py fix(tools): enforce ID uniqueness in TODO store during replace operations 2026-04-11 16:22:50 -07:00
tool_backend_helpers.py fix(fal): extend whitespace-only FAL_KEY handling to all call sites 2026-04-21 02:04:21 -07:00
tool_output_limits.py feat(skills): add design-md skill for Google's DESIGN.md spec (#14876) 2026-04-23 21:51:19 -07:00
tool_result_storage.py fix(tools): neutralize shell injection in _write_to_sandbox via path quoting (#7940) 2026-04-11 14:26:11 -07:00
transcription_tools.py fix(transcription): fall back to CPU when CUDA runtime libs are missing 2026-04-24 02:50:14 -07:00
tts_tool.py fix(tts): use per-provider input-character caps instead of global 4000 (#13743) 2026-04-21 17:49:39 -07:00
url_safety.py feat(security): add global toggle to allow private/internal URL resolution 2026-04-22 14:38:59 -07:00
vision_tools.py fix: vision tool respects auxiliary.vision.temperature from config (#4661) 2026-04-20 00:32:09 -07:00
voice_mode.py fix: point optional-dep install hints at the venv's python (#11938) 2026-04-17 21:16:33 -07:00
web_tools.py feat(web): support TAVILY_BASE_URL env var for custom proxy endpoints 2026-04-22 17:36:33 -07:00
website_policy.py refactor: codebase-wide lint cleanup — unused imports, dead code, and inefficient patterns (#5821) 2026-04-07 10:25:31 -07:00
xai_http.py feat(xai): upgrade to Responses API, add TTS provider 2026-04-16 02:24:08 -07:00