hermes-agent/hermes_cli/dashboard_auth/middleware.py
Ben Barclay c10ccaaf51
feat(dashboard-auth): rotate dashboard sessions via refresh token (#37247)
* feat(dashboard-auth): rotate dashboard sessions via refresh token

The dashboard auth-code grant now issues a 24h rotating refresh token
(server side: NousResearch/nous-account-service#293). This wires up the
Hermes client half so an expired access token is transparently refreshed
instead of bouncing the user to /login every 15 minutes.

plugins/dashboard_auth/nous:
- refresh_session() now POSTs grant_type=refresh_token to Portal's token
  endpoint and returns a Session carrying the ROTATED refresh token (was
  an unconditional RefreshExpiredError under the old "no RT in V1"
  contract). The RT is sent in BOTH the request body (Portal's schema
  requires it there) and the X-Refresh-Token header (log redaction) —
  verified against the #293 preview deploy: header-only is rejected as
  invalid_request, body is accepted.
- A 400 from Portal (expired / revoked / reuse-detected) maps to
  RefreshExpiredError so the middleware forces a clean re-login; network
  errors map to ProviderError; empty RT fast-fails without a network call.
- complete_login now captures the initial refresh token Portal returns
  (forward-tolerant: empty string if a deploy omits it).
- Extracted the shared token-response handling into
  _token_response_to_session, parameterised on the 400 exception type so
  the auth-code path raises InvalidCodeError and the refresh path raises
  RefreshExpiredError.
- revoke_session stays a best-effort no-op: Portal exposes no public
  token-endpoint revocation grant (revocation is the authenticated
  /sessions UI, keyed by sessionId+userId), so logout is cookie-clearing
  and the 24h session expires on its own. Documented for a future
  revoke grant.

hermes_cli/dashboard_auth/middleware:
- On an expired/invalid access token the gate now attempts refresh via
  the session's RT BEFORE forcing re-login. On success it serves the
  request and re-sets the rotated cookies on the response (mandatory:
  Portal rotates the RT every refresh and reuse-detects, so a stale RT
  cookie would revoke the whole session on the next refresh). On
  RefreshExpiredError (or no RT) it falls through to clear-and-relogin.
- ProviderError during refresh (Portal unreachable) forces a clean
  re-login rather than 500-ing the request.
- Uses the existing REFRESH_SUCCESS / REFRESH_FAILURE audit events.

Validation:
- 176 dashboard-auth unit/integration tests pass.
- Live E2E against the #293 preview deploy: refresh_session(bad rt) ->
  RefreshExpiredError through the real token endpoint; live JWKS fetch +
  RS256 verification rejects a forged token; empty-RT fast-fail. The
  successful happy-path rotation is covered by unit tests (a live run
  needs an interactive browser OAuth round trip + registered agent:*
  client).

Depends on: NousResearch/nous-account-service#293 (server-side RT issuance).

* fix(dashboard-auth): use Portal's x-nous-refresh-token header name

The refresh-token header must match Portal's REFRESH_TOKEN_HEADER exactly
("x-nous-refresh-token"); the initial cut used "X-Refresh-Token", which
Portal silently ignores (harmless since the RT is also in the body, which
is what the schema requires — but the header redaction was a no-op).
Confirmed against the NAS token route + re-validated live against the
#293 preview deploy.

* fix(dashboard-auth): refresh session when access-token cookie has been evicted

The gated middleware bounced users to /login the instant the access-token
cookie was absent, without ever consulting the refresh token:

    at, _rt = read_session_cookies(request)
    if not at:
        return _unauth_response(...)   # bailed here

This made transparent refresh effectively dead for the common case. The
access-token cookie is set with Max-Age = access_token_expires_in (~15 min),
so a real browser EVICTS hermes_session_at the moment the token lapses while
hermes_session_rt persists (30-day Max-Age). From that point the browser
sends only the refresh-token cookie — and the old guard rejected it before
_attempt_refresh could run. The _attempt_refresh path only fired for a
present-but-invalid access token, which never happens in a browser.

Fix: only hard-bounce when NEITHER cookie is present. A request carrying
just the refresh token now skips verification (no AT to verify) and flows
into the existing refresh path, which rotates both cookies and serves the
request transparently. A dead/expired RT still raises RefreshExpiredError
and falls through to clear-and-relogin.

This failure mode escaped the original tests + manual refresh button because
both kept the access-token cookie present; only a real browser evicting the
cookie at Max-Age exposes it. Added 3 regression tests covering: AT-evicted +
RT-present (transparent refresh), no-cookies (still bounces), and RT-only with
a dead RT (clean 401, no 500).
2026-06-02 21:16:41 +10:00

344 lines
14 KiB
Python

"""Auth-gate middleware for the dashboard.
Engaged when ``app.state.auth_required is True``. The gate's job:
1. Allow a small set of routes through unauthenticated (login page,
``/auth/*`` OAuth round trip, ``/api/auth/providers``, static
assets).
2. For everything else, demand a valid session cookie and attach the
verified :class:`Session` to ``request.state.session``.
3. On HTML routes, redirect missing/invalid cookies to ``/login``.
On ``/api/*`` routes, return 401 JSON.
The middleware is a no-op when ``auth_required`` is False (loopback
mode); the legacy ``_SESSION_TOKEN`` ``auth_middleware`` handles those
binds.
"""
from __future__ import annotations
import logging
from typing import Awaitable, Callable
from fastapi import Request
from fastapi.responses import JSONResponse, RedirectResponse, Response
from hermes_cli.dashboard_auth import list_providers
from hermes_cli.dashboard_auth.audit import AuditEvent, audit_log
from hermes_cli.dashboard_auth.base import ProviderError, RefreshExpiredError
from hermes_cli.dashboard_auth.cookies import read_session_cookies
from hermes_cli.dashboard_auth.public_paths import PUBLIC_API_PATHS
_log = logging.getLogger(__name__)
# Prefixes that bypass the auth gate. Match via ``path == prefix`` or
# ``path.startswith(prefix)`` — so ``/assets/`` (with trailing slash)
# matches ``/assets/foo.css`` but not ``/assetsleak``. Auth-bootstrap
# (login page, OAuth round trip, provider listing) and static asset
# mounts go here.
_GATE_PUBLIC_PREFIXES: tuple[str, ...] = (
"/auth/login",
"/auth/callback",
"/auth/logout",
"/login",
"/api/auth/providers",
"/assets/",
"/favicon.ico",
"/ds-assets/",
"/fonts/",
"/fonts-terminal/",
)
def _path_is_public(path: str) -> bool:
"""True if ``path`` bypasses the OAuth auth gate.
Two sources of public-ness:
* :data:`PUBLIC_API_PATHS` — the shared ``/api/*`` allowlist that
the legacy ``_SESSION_TOKEN`` middleware also honours. Matched
exactly (no prefix expansion) so adding ``/api/status`` doesn't
accidentally expose ``/api/status/secret-extension``.
* :data:`_GATE_PUBLIC_PREFIXES` — auth-bootstrap routes and static
mounts. Prefix-matched so ``/assets/foo.css`` lights up via
``/assets/``.
"""
if path in PUBLIC_API_PATHS:
return True
return any(
path == prefix or path.startswith(prefix)
for prefix in _GATE_PUBLIC_PREFIXES
)
def _client_ip(request: Request) -> str:
fwd = request.headers.get("x-forwarded-for", "")
if fwd:
return fwd.split(",")[0].strip()
return request.client.host if request.client else ""
def _unauth_response(request: Request, *, reason: str) -> Response:
"""API routes → 401 JSON with ``login_url``; HTML routes → 302 → /login.
The JSON envelope carries a ``login_url`` field with a ``next=`` query
string so the SPA's global 401 handler can drop the user back where
they were after re-auth. The contract is intentionally simple so any
fetch-wrapper can implement the redirect without parsing details:
if response.status === 401 && body.error in ("unauthenticated",
"session_expired"):
window.location.assign(body.login_url);
HTML redirects also carry the ``next=`` query string so direct
navigation to ``/sessions`` (etc.) without a cookie comes back to
``/sessions`` after login.
Under a reverse proxy with ``X-Forwarded-Prefix: /hermes``, the
``login_url`` is prefixed (``/hermes/login?next=...``) so the
browser's window.location.assign / Location: follow lands on the
proxied login page rather than the bare ``/login`` (which the
proxy doesn't route to the dashboard).
"""
from hermes_cli.dashboard_auth.prefix import prefix_from_request
path = request.url.path
next_param = _safe_next_target(request)
prefix = prefix_from_request(request)
login_url = (
f"{prefix}/login?next={next_param}" if next_param
else f"{prefix}/login"
)
if path.startswith("/api/"):
# API routes never get redirects: the browser fetch() API would
# follow a 302 into the cross-origin OAuth dance opaquely. Return
# 401 with a structured envelope so the SPA can full-page-navigate
# to login_url.
error_code = (
"session_expired"
if reason == "invalid_or_expired_session"
else "unauthenticated"
)
return JSONResponse(
{
"error": error_code,
"detail": "Unauthorized",
"reason": reason,
"login_url": login_url,
},
status_code=401,
)
return RedirectResponse(url=login_url, status_code=302)
def _safe_next_target(request: Request) -> str:
"""Build the URL-encoded ``next`` query value, or empty string.
Only same-origin relative paths are accepted; absolute URLs or
``//evil.com`` open-redirect attempts are silently dropped. The empty
string return means the caller produces a bare ``/login`` URL — fine,
user lands at the dashboard root after re-auth.
"""
path = request.url.path
# Reject anything that doesn't start with "/" or starts with "//"
# (protocol-relative URL — would open-redirect to an attacker host).
if not path or not path.startswith("/") or path.startswith("//"):
return ""
# Don't redirect back to the auth routes themselves — that loops.
if any(
path == p or path.startswith(p)
for p in ("/login", "/auth/", "/api/auth/")
):
return ""
# Reject ALL ``/api/*`` paths. The 401-envelope code path fires for
# any unauthenticated SPA fetch (e.g. ``GET /api/analytics/models``
# from ModelsPage), and the SPA's global 401 handler full-page
# navigates to ``login_url``. After the OAuth round trip the user
# would land on the API URL and see raw JSON instead of the
# dashboard. SPA routes survive (they don't start with ``/api/``);
# the SPA's own ``sessionStorage["hermes.lastLocation"]`` fallback
# in ``web/src/lib/api.ts`` covers the deep-link case.
if path == "/api" or path.startswith("/api/"):
return ""
# Preserve query string if present (e.g. /sessions?page=2).
query = request.url.query
target = f"{path}?{query}" if query else path
# urlencode the whole thing as a single value.
from urllib.parse import quote
return quote(target, safe="")
async def gated_auth_middleware(
request: Request,
call_next: Callable[[Request], Awaitable[Response]],
) -> Response:
"""Engaged only when ``app.state.auth_required is True``.
No-op pass-through in loopback mode so the legacy auth_middleware can
handle those binds via ``_SESSION_TOKEN``.
"""
if not getattr(request.app.state, "auth_required", False):
return await call_next(request)
path = request.url.path
if _path_is_public(path):
return await call_next(request)
at, _rt = read_session_cookies(request)
if not at and not _rt:
# Neither token present — no session at all. Nothing to verify or
# refresh; force login.
return _unauth_response(request, reason="no_cookie")
# Try every registered provider's verify_session in turn. Providers
# MUST return None for tokens they don't recognise (not raise). This
# lets multiple providers stack — the first one that recognises a
# token wins.
#
# When the access-token cookie is absent but a refresh-token cookie is
# present, skip verification and go straight to the refresh path below.
# This is the COMMON expiry case, not an edge case: the access-token
# cookie is set with ``Max-Age = access_token_expires_in`` (~15 min), so
# the browser EVICTS it the moment the token lapses, while the
# refresh-token cookie lives for 30 days. From that point the browser
# sends only ``hermes_session_rt``. If we bailed on ``not at`` here we'd
# bounce the user to /login on every expiry despite holding a perfectly
# good refresh token — defeating the whole transparent-refresh feature.
session = None
if at:
for provider in list_providers():
try:
session = provider.verify_session(access_token=at)
except ProviderError as e:
_log.warning(
"dashboard-auth: provider %r unreachable during verify: %s",
provider.name, e,
)
audit_log(
AuditEvent.SESSION_VERIFY_FAILURE,
provider=provider.name,
reason="provider_unreachable",
ip=_client_ip(request),
)
return JSONResponse(
{"detail": f"Auth provider {provider.name!r} unreachable"},
status_code=503,
)
if session is not None:
break
if session is None:
# Access token is expired/invalid. Before forcing re-login, try to
# rotate it using the refresh token (if the session cookie carries
# one). On success we re-set the rotated cookies on the response and
# serve the request transparently; on RefreshExpiredError (RT dead /
# revoked / reuse-detected) we fall through to clear-and-relogin.
refreshed = _attempt_refresh(request, refresh_token=_rt)
if refreshed is not None:
new_session, refreshing_provider = refreshed
request.state.session = new_session
response = await call_next(request)
# Persist the ROTATED tokens. Portal rotates the refresh token on
# every refresh and runs reuse-detection, so writing the new RT
# back is mandatory: a stale RT cookie would replay a rotated
# token on the next refresh and (outside Portal's grace) revoke
# the whole session. Bind cookie Secure/Path to the request shape.
from hermes_cli.dashboard_auth.cookies import (
detect_https,
set_session_cookies,
)
from hermes_cli.dashboard_auth.prefix import prefix_from_request
set_session_cookies(
response,
access_token=new_session.access_token,
refresh_token=new_session.refresh_token,
access_token_expires_in=_expires_in_seconds(new_session),
use_https=detect_https(request),
prefix=prefix_from_request(request),
)
audit_log(
AuditEvent.REFRESH_SUCCESS,
provider=refreshing_provider,
user_id=new_session.user_id,
ip=_client_ip(request),
)
return response
audit_log(
AuditEvent.SESSION_VERIFY_FAILURE,
reason="no_provider_recognises",
ip=_client_ip(request),
)
response = _unauth_response(request, reason="invalid_or_expired_session")
# Clear the dead cookies so the browser doesn't keep sending them.
# Refresh already failed (or there was no RT), so the only correct
# next step is full re-auth via /login. Importing locally avoids a
# cycle with cookies → middleware at module load. Pass the active
# prefix so the deletion's Path matches the set-Path (otherwise
# the browser ignores it).
from hermes_cli.dashboard_auth.cookies import clear_session_cookies
from hermes_cli.dashboard_auth.prefix import prefix_from_request
clear_session_cookies(response, prefix=prefix_from_request(request))
return response
request.state.session = session
return await call_next(request)
def _expires_in_seconds(session) -> int:
"""Seconds until the access token's ``exp``, floored at 60.
Mirrors the auth-route's ``max(60, exp - now)`` so the access-token
cookie's Max-Age tracks the token lifetime even on a slightly skewed
clock. ``time`` imported locally to keep the module's import surface
minimal.
"""
import time
return max(60, int(session.expires_at) - int(time.time()))
def _attempt_refresh(request: Request, *, refresh_token):
"""Try to rotate an expired session via the refresh token.
Returns ``(new_session, provider_name)`` on success, or ``None`` if
there's no RT or every provider's ``refresh_session`` failed with
``RefreshExpiredError`` (dead/revoked/reuse-detected RT → force re-login).
A ``ProviderError`` (Portal unreachable) is NOT swallowed into a re-login
here — re-raising would 500 the request; instead we log and return None so
the caller forces a clean re-login, which is the safer UX than a hard
error on a transient network blip during the narrow refresh window.
"""
if not refresh_token:
return None
for provider in list_providers():
try:
new_session = provider.refresh_session(refresh_token=refresh_token)
except RefreshExpiredError:
# This provider owns the RT but it's dead — stop trying others
# (an RT belongs to exactly one provider) and force re-login.
audit_log(
AuditEvent.REFRESH_FAILURE,
provider=provider.name,
reason="refresh_expired",
ip=_client_ip(request),
)
return None
except ProviderError as e:
_log.warning(
"dashboard-auth: provider %r unreachable during refresh: %s",
provider.name, e,
)
audit_log(
AuditEvent.REFRESH_FAILURE,
provider=provider.name,
reason="provider_unreachable",
ip=_client_ip(request),
)
return None
if new_session is not None:
return new_session, provider.name
return None