mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-02 02:01:47 +00:00
[verified] fix(mcp-oauth): seed token_expiry_time + pre-flight AS discovery on cold-load
PR #11383's consolidation fixed external-refresh reloading and 401 dedup but left two latent bugs that surfaced on BetterStack and any other OAuth MCP with a split-origin authorization server: 1. HermesTokenStorage persisted only a relative 'expires_in', which is meaningless after a process restart. The MCP SDK's OAuthContext does NOT seed token_expiry_time in _initialize, so is_token_valid() returned True for any reloaded token regardless of age. Expired tokens shipped to servers, and app-level auth failures (e.g. BetterStack's 'No teams found. Please check your authentication.') were invisible to the transport-layer 401 handler. 2. Even once preemptive refresh did fire, the SDK's _refresh_token falls back to {server_url}/token when oauth_metadata isn't cached. For providers whose AS is at a different origin (BetterStack: mcp.betterstack.com for MCP, betterstack.com/oauth/token for the token endpoint), that fallback 404s and drops into full browser re-auth on every process restart. Fix set: - HermesTokenStorage.set_tokens persists an absolute wall-clock expires_at alongside the SDK's OAuthToken JSON (time.time() + TTL at write time). - HermesTokenStorage.get_tokens reconstructs expires_in from max(expires_at - now, 0), clamping expired tokens to zero TTL. Legacy files without expires_at fall back to file-mtime as a best-effort wall-clock proxy, self-healing on the next set_tokens. - HermesMCPOAuthProvider._initialize calls super(), then update_token_expiry on the reloaded tokens so token_expiry_time reflects actual remaining TTL. If tokens are loaded but oauth_metadata is missing, pre-flight PRM + ASM discovery runs via httpx.AsyncClient using the MCP SDK's own URL builders and response handlers (build_protected_resource_metadata_discovery_urls, handle_auth_metadata_response, etc.) so the SDK sees the correct token_endpoint before the first refresh attempt. Pre-flight is skipped when there are no stored tokens to keep fresh-install paths zero-cost. Test coverage (tests/tools/test_mcp_oauth_cold_load_expiry.py): - set_tokens persists absolute expires_at - set_tokens skips expires_at when token has no expires_in - get_tokens round-trips expires_at -> remaining expires_in - expired tokens reload with expires_in=0 - legacy files without expires_at fall back to mtime proxy - _initialize seeds token_expiry_time from stored tokens - _initialize flags expired-on-disk tokens as is_token_valid=False - _initialize pre-flights PRM + ASM discovery with mock transport - _initialize skips pre-flight when no tokens are stored Verified against BetterStack MCP: hermes mcp test betterstack -> Connected (2508ms), 83 tools mcp_betterstack_telemetry_list_teams_tool -> real team data, not 'No teams found. Please check your authentication.' Reference: mcp-oauth-token-diagnosis skill, Fix A.
This commit is contained in:
parent
3eeab4bc06
commit
cc06beaf13
3 changed files with 718 additions and 1 deletions
|
|
@ -111,6 +111,131 @@ def _make_hermes_provider_class() -> Optional[type]:
|
|||
super().__init__(*args, **kwargs)
|
||||
self._hermes_server_name = server_name
|
||||
|
||||
async def _initialize(self) -> None:
|
||||
"""Load stored tokens + client info AND seed token_expiry_time.
|
||||
|
||||
Also eagerly fetches OAuth authorization-server metadata (PRM +
|
||||
ASM) when we have stored tokens but no cached metadata, so the
|
||||
SDK's ``_refresh_token`` can build the correct token_endpoint
|
||||
URL on the preemptive-refresh path. Without this, the SDK
|
||||
falls back to ``{mcp_server_url}/token`` (wrong for providers
|
||||
whose AS is a different origin — BetterStack's MCP lives at
|
||||
``https://mcp.betterstack.com`` but its token endpoint is at
|
||||
``https://betterstack.com/oauth/token``), the refresh 404s, and
|
||||
we drop through to full browser reauth.
|
||||
|
||||
The SDK's base ``_initialize`` populates ``current_tokens`` but
|
||||
does NOT call ``update_token_expiry``, so ``token_expiry_time``
|
||||
stays ``None`` and ``is_token_valid()`` returns True for any
|
||||
loaded token regardless of actual age. After a process restart
|
||||
this ships stale Bearer tokens to the server; some providers
|
||||
return HTTP 401 (caught by the 401 handler), others return 200
|
||||
with an app-level auth error (invisible to the transport layer,
|
||||
e.g. BetterStack returning "No teams found. Please check your
|
||||
authentication.").
|
||||
|
||||
Seeding ``token_expiry_time`` from the reloaded token fixes that:
|
||||
``is_token_valid()`` correctly reports False for expired tokens,
|
||||
``async_auth_flow`` takes the ``can_refresh_token()`` branch,
|
||||
and the SDK quietly refreshes before the first real request.
|
||||
|
||||
Paired with :class:`HermesTokenStorage` persisting an absolute
|
||||
``expires_at`` timestamp (``mcp_oauth.py:set_tokens``) so the
|
||||
remaining TTL we compute here reflects real wall-clock age.
|
||||
"""
|
||||
await super()._initialize()
|
||||
tokens = self.context.current_tokens
|
||||
if tokens is not None and tokens.expires_in is not None:
|
||||
self.context.update_token_expiry(tokens)
|
||||
|
||||
# Pre-flight OAuth AS discovery so ``_refresh_token`` has a
|
||||
# correct ``token_endpoint`` before the first refresh attempt.
|
||||
# Only runs when we have tokens on cold-load but no cached
|
||||
# metadata — i.e. the exact scenario where the SDK's built-in
|
||||
# 401-branch discovery hasn't had a chance to run yet.
|
||||
if (
|
||||
tokens is not None
|
||||
and self.context.oauth_metadata is None
|
||||
):
|
||||
try:
|
||||
await self._prefetch_oauth_metadata()
|
||||
except Exception as exc: # pragma: no cover — defensive
|
||||
# Non-fatal: if discovery fails, the SDK's normal 401-
|
||||
# branch discovery will run on the next request.
|
||||
logger.debug(
|
||||
"MCP OAuth '%s': pre-flight metadata discovery "
|
||||
"failed (non-fatal): %s",
|
||||
self._hermes_server_name, exc,
|
||||
)
|
||||
|
||||
async def _prefetch_oauth_metadata(self) -> None:
|
||||
"""Fetch PRM + ASM from the well-known endpoints, cache on context.
|
||||
|
||||
Mirrors the SDK's 401-branch discovery (oauth2.py ~line 511-551)
|
||||
but runs synchronously before the first request instead of
|
||||
inside the httpx auth_flow generator. Uses the SDK's own URL
|
||||
builders and response handlers so we track whatever the SDK
|
||||
version we're pinned to expects.
|
||||
"""
|
||||
import httpx # local import: httpx is an MCP SDK dependency
|
||||
from mcp.client.auth.utils import (
|
||||
build_oauth_authorization_server_metadata_discovery_urls,
|
||||
build_protected_resource_metadata_discovery_urls,
|
||||
create_oauth_metadata_request,
|
||||
handle_auth_metadata_response,
|
||||
handle_protected_resource_response,
|
||||
)
|
||||
|
||||
server_url = self.context.server_url
|
||||
async with httpx.AsyncClient(timeout=10.0) as client:
|
||||
# Step 1: PRM discovery to learn the authorization_server URL.
|
||||
for url in build_protected_resource_metadata_discovery_urls(
|
||||
None, server_url
|
||||
):
|
||||
req = create_oauth_metadata_request(url)
|
||||
try:
|
||||
resp = await client.send(req)
|
||||
except httpx.HTTPError as exc:
|
||||
logger.debug(
|
||||
"MCP OAuth '%s': PRM discovery to %s failed: %s",
|
||||
self._hermes_server_name, url, exc,
|
||||
)
|
||||
continue
|
||||
prm = await handle_protected_resource_response(resp)
|
||||
if prm:
|
||||
self.context.protected_resource_metadata = prm
|
||||
if prm.authorization_servers:
|
||||
self.context.auth_server_url = str(
|
||||
prm.authorization_servers[0]
|
||||
)
|
||||
break
|
||||
|
||||
# Step 2: ASM discovery against the auth_server_url (or
|
||||
# server_url fallback for legacy providers).
|
||||
for url in build_oauth_authorization_server_metadata_discovery_urls(
|
||||
self.context.auth_server_url, server_url
|
||||
):
|
||||
req = create_oauth_metadata_request(url)
|
||||
try:
|
||||
resp = await client.send(req)
|
||||
except httpx.HTTPError as exc:
|
||||
logger.debug(
|
||||
"MCP OAuth '%s': ASM discovery to %s failed: %s",
|
||||
self._hermes_server_name, url, exc,
|
||||
)
|
||||
continue
|
||||
ok, asm = await handle_auth_metadata_response(resp)
|
||||
if not ok:
|
||||
break
|
||||
if asm:
|
||||
self.context.oauth_metadata = asm
|
||||
logger.debug(
|
||||
"MCP OAuth '%s': pre-flight ASM discovered "
|
||||
"token_endpoint=%s",
|
||||
self._hermes_server_name, asm.token_endpoint,
|
||||
)
|
||||
break
|
||||
|
||||
async def async_auth_flow(self, request): # type: ignore[override]
|
||||
# Pre-flow hook: ask the manager to refresh from disk if needed.
|
||||
# Any failure here is non-fatal — we just log and proceed with
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue