list_session_providers() already filters on supports_session=True, so the
new helper re-filtered an already-filtered list. Call it directly at the
single auto-SSO call site.
When the dashboard gateway has no local session cookie, it rendered a
click-through /login interstitial — even though the Nous portal's
/oauth/authorize auto-approves any current member of the dashboard's org
and is a silent 302 when the user already holds a portal session. For the
common case (clicking a hosted-agent dashboard link while signed in to the
portal) that interstitial click is pure friction.
This makes the gate auto-initiate the OAuth redirect on an unauthenticated
HTML document load instead of rendering the interstitial, when exactly one
interactive provider is registered. A one-shot loop-guard cookie
(hermes_sso_attempt, 60s TTL) ensures that a genuinely absent portal
session (the portal bounces back still-unauthenticated) falls back to the
/login page after exactly one bounce rather than ping-ponging forever. The
marker is cleared on a successful callback and whenever the gate falls back
to /login.
Security: this removes a human CLICK, not a security check. The redirect
lands on the existing /auth/login route and runs the unchanged PKCE
auth-code flow; token verification, audience checks, redirect-URI match,
and org-membership checks are all untouched. /api/* fetches still get the
401 JSON envelope (never a 302 a fetch() would follow opaquely), and with
two or more providers the /login chooser still renders.
Phase 1 of the cloud-auto-discovery work.
* Return None instead of erroring on drain login failure
* Fix login on drain
* Remove login for drained endpoints flow and clean the code
* chore: drop unrelated credits changes from this PR
* Remove extra comments that were not really necessary
Task 2.0a of the safe-shutdown drain-coordination plan. Widens the dashboard
auth framework GENERICALLY to support non-interactive (service-to-service)
bearer-token auth, mirroring the existing supports_password precedent. This is
a reusable capability — any future machine-credential provider plugs in without
core changes (decisions.md Q-C). The drain bearer-secret plugin (Task 2.0b) is
the first consumer, not the definition.
- base.py: add TokenPrincipal dataclass (the token analog of Session) +
supports_token capability flag + verify_token() on the ABC (default raises
NotImplementedError so a misconfigured provider fails loud). Contract mirrors
verify_session stacking: return None for unrecognised tokens (never raise),
raise ProviderError only on a genuine backing-store outage.
- registry.py: list_token_providers() — the supports_token subset, in
registration order. Empty when none registered (token routes fail closed).
- token_auth.py (new): route-agnostic seam. Routes opt in via
register_token_route(exact path); token_auth_middleware owns the auth
decision for those routes only — authenticate via stacked providers, attach
request.state.token_principal + token_authenticated, pass through. 401 on
missing/unrecognised token, 503 when a provider was unreachable, untouched
passthrough for non-token routes. Fails closed (never open).
- web_server.py: install the seam OUTERMOST (registered last → runs first).
Both downstream gates (legacy auth_middleware + gated_auth_middleware) honour
request.state.token_authenticated and skip enforcement, so a token-authed
service request is never bounced to /login.
- audit.py: TOKEN_AUTH_SUCCESS / TOKEN_AUTH_FAILURE events.
Tests: tests/hermes_cli/test_dashboard_token_auth.py — ABC flag default,
verify_token NotImplementedError, registry filter, bearer extraction
(case-insensitive scheme, malformed/non-bearer → ""), provider stacking
(first-match-wins, unreachable-remembered, unreachable-then-valid, buggy
provider doesn't crash the gate), and the seam's passthrough/401/503/
fail-closed behaviour. 29 new tests; full dashboard-auth suite 169 passed.
Intentionally deferred:
- The concrete shared-bearer-secret provider plugin — Task 2.0b.
- The begin/cancel-drain endpoint that registers itself as a token route —
Task 2.1.
Build status: dashboard-auth + plugin-hook suites green.
Live-test finding: the Chronos fire webhook was only on the APIServerAdapter
(aiohttp), but hosted agents expose `hermes dashboard` (the FastAPI web_server
app on :9119) as their public URL — NOT the api_server adapter. So NAS's relay
callback to {callback_url}/api/cron/fire could never reach the verifier on a
hosted agent (the exact target environment). Two layers were wrong:
1. Wrong server: /api/cron/fire didn't exist on the dashboard app. Added
cron_fire_webhook there, alongside the existing /api/cron/* dashboard routes.
It resolves the job's profile (_find_cron_job_profile) and runs fire_due via
the resolved provider under the cron-profile retarget lock
(_fire_cron_job_for_profile, mirroring _call_cron_for_profile) so the CAS
claim + run_one_job operate on the right profile's jobs.json. Runs with no
live adapters (delivery falls back to the per-platform send path, like the
desktop cron path). 202 + background so a long turn never trips NAS's
timeout; the store CAS de-dupes a NAS retry. job-not-found -> 200 "gone".
2. Auth gate: the dashboard auth middleware 401s any non-cookie request before
the handler runs. Added /api/cron/fire to the shared PUBLIC_API_PATHS so the
NAS bearer-JWT callback reaches the verifier — the JWT (purpose=cron_fire),
not the cookie, is the real gate. One shared frozenset feeds both the
loopback and OAuth middlewares, so no drift.
Kept the APIServerAdapter route too (valid self-host api_server surface).
Contract doc updated to name the dashboard app as the hosted-agent callback
surface.
Tests: test_cron_fire_dashboard (6) — route registered on the dashboard app,
in PUBLIC_API_PATHS, 401 on bad token WITH the cookie gate engaged (proves it's
reachable past the gate + JWT is the gate), 400 missing job_id, 200 gone for
unknown job, 202 + fire_due invoked for the resolved profile on a valid token.
Full hermes_cli + cron + chronos + webhook suites green (7637).
Why the original tests missed it: the api_server webhook test built an
APIServerAdapter client directly and never asserted which server the hosted
public URL exposes — green-but-wrong-integration. The new test pins the route
to the dashboard app.
A non-empty HERMES_DASHBOARD_PUBLIC_URL / dashboard.public_url value that
fails URL validation (overwhelmingly: a missing http(s):// scheme, e.g.
"hermes.domain.com") was silently discarded by resolve_public_url(),
falling back to reconstructing the OAuth redirect_uri from request
headers. Behind a reverse proxy that doesn't forward X-Forwarded-Proto
reliably, that yields an http:// callback even though the operator
explicitly set the public URL — with no signal as to why (#42780).
Emit a deduplicated operator-facing WARNING (once per distinct value,
since resolve_public_url runs per request) naming the offending value
and the required scheme. Turns a silent footgun into a self-diagnosing
one; behaviour is otherwise unchanged.
Tests assert the warning fires for a scheme-less value, is deduplicated
across repeated calls, and stays silent for a valid value — all three
fail without the fix.
The desktop OAuth remote-gateway path gated connectivity on
hasOauthSessionCookie(), which checks only the access-token cookie
(hermes_session_at, ~15 min TTL). The moment that cookie's Max-Age
lapsed, Electron's cookie jar dropped it and both resolveRemoteBackend()
and sanitizeDesktopConnectionConfig() reported "not signed in" — forcing
a full IDP re-login every ~15 min — even though a valid 24h refresh-token
cookie (hermes_session_rt) was sitting in the same jar.
The desktop OAuth code (2026-06-04) was written against the obsolete
"contract v1 issues no refresh token" model, two days after #37247
re-introduced server-side transparent refresh: Portal now issues a 24h
rotating, reuse-detected refresh token, and the gateway middleware
(_attempt_refresh) rotates a fresh AT from the RT on the next
authenticated request. So an expired-AT/live-RT session is fully
connectable — the desktop just never let the request through.
Fix:
- connection-config.cjs: add RT_COOKIE_VARIANTS + cookiesHaveLiveSession()
(true when EITHER a live AT or RT cookie is present). Keep
cookiesHaveSession() AT-only for callers that need that specific signal.
- main.cjs: add hasLiveOauthSession(); resolveRemoteBackend()'s oauth
branch now early-outs only when NEITHER cookie is present, otherwise
uses the ws-ticket mint as the authoritative liveness probe (that POST
carries the RT cookie and triggers the server-side AT rotation). A real
401 still surfaces as needsOauthLogin. Settings indicator + oauth-logout
report against the same AT-or-RT notion.
- Remove the stale "contract v1 / NO refresh token" docstrings in
cookies.py and the verify_session comments in the Nous provider that
contradicted #37247.
Tests: +57 lines in connection-config.test.cjs covering the RT-only
"still connectable" case. node --test: 32/32. dashboard-auth +
nous-provider Python suites: 223/223.
Note: server-side files (hermes_cli/dashboard_auth/, plugins/dashboard_auth/)
are comment/docstring-only here, but this touches outside apps/desktop/ so
it needs Teknium review.
The gated dashboard verifies a session cookie by trying each registered
DashboardAuthProvider's verify_session in turn (the session cookie stores
only the access token, not which provider issued it). A provider that
doesn't recognise a token returns None; a provider whose IDP/JWKS is
unreachable raises ProviderError.
The loop used to return HTTP 503 on the FIRST ProviderError, before any
later provider got a turn. With multiple providers stacked, that means an
unreachable IDP for a session you didn't even use blocks login through a
different, reachable provider.
Concrete repro: a self-hosted-OIDC session hits the 'nous' provider first
(registered earlier); nous tries to reach Nous Portal's JWKS, which is
unreachable in a self-hosted deployment, so it raises — and the gate
503s before the 'self-hosted' provider can verify the token. Hit live
while testing the new self-hosted OIDC plugin against a local Keycloak.
Fix: a ProviderError from one provider is logged and the loop continues
to the next. A 503 is returned only if NO provider verified the token
AND at least one was unreachable — distinguishing a transient IDP outage
(don't force a needless re-login) from a token that's genuinely invalid
(fall through to refresh/relogin). Single-provider behaviour is
unchanged.
Tests: adds an _UnreachableProvider stub and three cases — unreachable
provider first must not block a working second; all-unreachable still
503s; reachable-but-unrecognised falls through to 401/relogin (not 503).
Mutation-tested: reverting the fix makes the first case fail with the
exact 503 bug.
The dashboard auth gate was OAuth-only: a DashboardAuthProvider could
authenticate only via a redirect to an IDP (start_login -> /auth/callback
-> complete_login). There was no first-class path for username/password
auth, so self-hosters who just want a password on their dashboard had no
clean option short of an external OAuth IDP.
Extend the provider framework with a parallel, non-redirect front door
that converges on the same Session + cookie + refresh machinery:
- base.py: add the optional supports_password flag and
complete_password_login(username, password) -> Session (default
raises NotImplementedError so an OAuth-only provider that forgets the
flag fails loudly). Add InvalidCredentialsError. OAuth providers are
unaffected (flag defaults False; the method is never called).
- routes.py: add POST /auth/password-login, mirroring the cookie-minting
tail of /auth/callback but skipping PKCE/state/code. Returns JSON
{ok, next} (the form POSTs via fetch). Generic 401 for both unknown
user and wrong password (no enumeration oracle); 404 hides whether a
provider exists or supports passwords; per-IP sliding-window rate
limit (10/min -> 429). /api/auth/providers now reports
supports_password so the login page can branch.
- middleware.py: allowlist /auth/password-login (a bootstrap route).
verify/refresh/revoke/ws-tickets/logout need zero changes — a password
session is just a Session with provider-minted opaque tokens.
- login_page.py: render a credential form (instead of a redirect button)
for supports_password providers, wired by a small inline script that
POSTs to /auth/password-login and navigates on success. OAuth-only
pages stay script-free.
Follow-up to Ben's PR #37892. Adds a TestInternalCredential block to
test_dashboard_auth_ws_tickets.py exercising the mint-once stability,
multi-use, unminted-rejection, empty-value, wrong-value, reset-and-remint,
and ticket-store-independence branches directly (previously only covered
indirectly via _ws_auth_ok, which left the unminted and empty-value
branches unexercised).
Also corrects the consume_internal_credential docstring: the returned
identity dict is discarded by the current _ws_auth_ok caller (which only
needs the boolean outcome), so the prior 'carry it into its session log'
wording over-promised.
The embedded-TUI PTY child attaches to two server-internal WebSockets:
/api/ws (its primary JSON-RPC gateway backend) and /api/pub (the event
sidecar). Both URLs are built server-side in web_server.py and handed to
the child via its environment.
In OAuth-gated mode (auth_required=true, every hosted Fly agent), _ws_auth_ok
unconditionally rejects the legacy ?token=<_SESSION_TOKEN> path — a leaked
session token must not grant WS access once the gate is engaged. But
_build_gateway_ws_url() still only emitted ?token=, with no gated-mode
branch (its sibling _build_sidecar_url had been given a ticket branch; the
gateway-url builder was missed). So the TUI child's /api/ws upgrade was
rejected 4401 -> 'gateway websocket connection failed' -> 'gateway startup
timeout', leaving the embedded chat unusable on every gated deployment.
A single-use 30s browser ticket is the wrong shape for this link: the child
reads its attach URL once at startup and reuses it on every reconnect, and
on a slow cold boot it may not dial within the TTL. (_build_sidecar_url's
own docstring already flagged this fragility.)
Fix: add a process-lifetime, multi-use internal credential to
dashboard_auth.ws_tickets (internal_ws_credential / consume_internal_credential),
minted once per process and NEVER injected into the SPA — it only leaves the
process via a spawned child's env, so browser-side XSS can't read it, and a
leak grants no more than a ticket already does. _ws_auth_ok accepts it via
?internal= in gated mode only. Both _build_gateway_ws_url and
_build_sidecar_url now use it, so the child can reconnect both sockets.
Loopback / --insecure behavior is unchanged (still ?token=).
Needs review: touches _ws_auth_ok + dashboard_auth (core auth surface).
* feat(dashboard-auth): rotate dashboard sessions via refresh token
The dashboard auth-code grant now issues a 24h rotating refresh token
(server side: NousResearch/nous-account-service#293). This wires up the
Hermes client half so an expired access token is transparently refreshed
instead of bouncing the user to /login every 15 minutes.
plugins/dashboard_auth/nous:
- refresh_session() now POSTs grant_type=refresh_token to Portal's token
endpoint and returns a Session carrying the ROTATED refresh token (was
an unconditional RefreshExpiredError under the old "no RT in V1"
contract). The RT is sent in BOTH the request body (Portal's schema
requires it there) and the X-Refresh-Token header (log redaction) —
verified against the #293 preview deploy: header-only is rejected as
invalid_request, body is accepted.
- A 400 from Portal (expired / revoked / reuse-detected) maps to
RefreshExpiredError so the middleware forces a clean re-login; network
errors map to ProviderError; empty RT fast-fails without a network call.
- complete_login now captures the initial refresh token Portal returns
(forward-tolerant: empty string if a deploy omits it).
- Extracted the shared token-response handling into
_token_response_to_session, parameterised on the 400 exception type so
the auth-code path raises InvalidCodeError and the refresh path raises
RefreshExpiredError.
- revoke_session stays a best-effort no-op: Portal exposes no public
token-endpoint revocation grant (revocation is the authenticated
/sessions UI, keyed by sessionId+userId), so logout is cookie-clearing
and the 24h session expires on its own. Documented for a future
revoke grant.
hermes_cli/dashboard_auth/middleware:
- On an expired/invalid access token the gate now attempts refresh via
the session's RT BEFORE forcing re-login. On success it serves the
request and re-sets the rotated cookies on the response (mandatory:
Portal rotates the RT every refresh and reuse-detects, so a stale RT
cookie would revoke the whole session on the next refresh). On
RefreshExpiredError (or no RT) it falls through to clear-and-relogin.
- ProviderError during refresh (Portal unreachable) forces a clean
re-login rather than 500-ing the request.
- Uses the existing REFRESH_SUCCESS / REFRESH_FAILURE audit events.
Validation:
- 176 dashboard-auth unit/integration tests pass.
- Live E2E against the #293 preview deploy: refresh_session(bad rt) ->
RefreshExpiredError through the real token endpoint; live JWKS fetch +
RS256 verification rejects a forged token; empty-RT fast-fail. The
successful happy-path rotation is covered by unit tests (a live run
needs an interactive browser OAuth round trip + registered agent:*
client).
Depends on: NousResearch/nous-account-service#293 (server-side RT issuance).
* fix(dashboard-auth): use Portal's x-nous-refresh-token header name
The refresh-token header must match Portal's REFRESH_TOKEN_HEADER exactly
("x-nous-refresh-token"); the initial cut used "X-Refresh-Token", which
Portal silently ignores (harmless since the RT is also in the body, which
is what the schema requires — but the header redaction was a no-op).
Confirmed against the NAS token route + re-validated live against the
#293 preview deploy.
* fix(dashboard-auth): refresh session when access-token cookie has been evicted
The gated middleware bounced users to /login the instant the access-token
cookie was absent, without ever consulting the refresh token:
at, _rt = read_session_cookies(request)
if not at:
return _unauth_response(...) # bailed here
This made transparent refresh effectively dead for the common case. The
access-token cookie is set with Max-Age = access_token_expires_in (~15 min),
so a real browser EVICTS hermes_session_at the moment the token lapses while
hermes_session_rt persists (30-day Max-Age). From that point the browser
sends only the refresh-token cookie — and the old guard rejected it before
_attempt_refresh could run. The _attempt_refresh path only fired for a
present-but-invalid access token, which never happens in a browser.
Fix: only hard-bounce when NEITHER cookie is present. A request carrying
just the refresh token now skips verification (no AT to verify) and flows
into the existing refresh path, which rotates both cookies and serves the
request transparently. A dead/expired RT still raises RefreshExpiredError
and falls through to clear-and-relogin.
This failure mode escaped the original tests + manual refresh button because
both kept the access-token cookie present; only a real browser evicting the
cookie at Max-Age exposes it. Added 3 regression tests covering: AT-evicted +
RT-present (transparent refresh), no-cookies (still bounces), and RT-only with
a dead RT (clean 401, no 500).
When an unauthenticated SPA fetch hit a gated /api/* endpoint (e.g.
GET /api/analytics/models?days=30 fired from ModelsPage on mount or
after a session expiry), the gated middleware stamped the request's
own path into next= on the 401 envelope's login_url. The SPA's global
401 handler in web/src/lib/api.ts full-page-navigated to that URL,
the PKCE cookie carried the encoded /api/* value through the OAuth
round trip to Portal, and /auth/callback's _validate_post_login_target
accepted it as same-origin and redirected the user to the raw JSON
endpoint instead of the dashboard.
Symptom Ben reported: after the OAuth screen he kept landing on
$DOMAIN/api/analytics/models?days=30 (raw JSON) rather than /models.
The bug was deterministic per page — whichever /api/* call ModelsPage,
AnalyticsPage, or SessionsPage fired first owned the redirect race.
Fix: both validators now reject /api/* targets in addition to the
existing /login, /auth/, /api/auth/ exclusions:
- _safe_next_target in middleware.py drops the value before it ever
enters login_url, so the SPA's 401 handler navigates to a bare
/login (which the SPA itself can return-from via its own
sessionStorage["hermes.lastLocation"] fallback that was already
saving the actual browser location).
- _validate_post_login_target in routes.py drops it as second-line
defence at the callback boundary, so a legacy cookie, a regressed
middleware, or an attacker-crafted /auth/login?next=/api/... value
can't smuggle the redirect through. Either layer alone is enough;
pairing them means a regression in one is caught by the other.
The match is anchored: ``decoded == "/api"`` or
``decoded.startswith("/api/")``. SPA route lookalikes like /apidocs
or /api-keys remain valid landing targets — tests pin that.
Test additions in test_dashboard_auth_401_reauth.py:
- TestApi401Envelope: rewrote test_login_url_carries_next_for_deep_
api_path (which asserted the pre-fix behaviour) as
test_login_url_drops_next_for_deep_api_path, plus added the
specific analytics-models repro case from Ben's report.
- TestNextSameOriginValidation: rejects-api-paths + does-not-reject-
api-prefix-lookalikes (covers /apidocs, /api-keys).
- TestAuthCallbackNext: end-to-end test_callback_with_api_next_
lands_at_root drives /auth/login?next=/api/... through to the
callback and asserts the user lands at "/", not the API URL.
- TestValidatePostLoginTarget: new class covering the callback-side
validator directly, including the URL-encoded ``%2Fapi%2F...``
form the PKCE cookie actually carries.
Mutation-tested: reverting both validators causes exactly the 5 new
or rewritten /api/*-related assertions to fail (each fix layer is
independently tested), while the 31 other assertions in the file
remain green. Full tests/hermes_cli/ suite (288 files, 5,938 tests)
passes with the fix applied.
Two parallel public-path allowlists drifted: _PUBLIC_API_PATHS in
hermes_cli/web_server.py (legacy _SESSION_TOKEN middleware) and
_GATE_PUBLIC_PREFIXES in hermes_cli/dashboard_auth/middleware.py
(OAuth gate). The legacy list included /api/status (documented as a
non-sensitive read-only liveness target); the OAuth gate's list did not.
Effect: every wildcard-subdomain agent surfaced as STARTING/down to the
portal even though the dashboard was serving correctly. Nous account
service (src/server/agents/fly-provider.ts
getInstanceRuntimeStatus) fetches ``/api/status`` without a cookie
as its sole liveness probe; the OAuth gate's 401 looked identical to
'agent dead' on the portal side.
Fix: lift the allowlist into hermes_cli/dashboard_auth/public_paths.py
and have both middlewares import it. _path_is_public now consults
the shared frozenset first, then falls back to the gate's
auth-bootstrap/static prefix list. Future additions to the public list
hit both gates automatically.
Endpoint inventory (verified safe to remain public):
* /api/status — version, gateway state, active session count,
auth-gate shape. Portal liveness probe target.
* /api/config/defaults — config-defaults feed for the SPA's Config page
* /api/config/schema — config schema for the SPA's Config page
* /api/model/info — model catalogue metadata (context windows)
* /api/dashboard/themes — theme manifests for the skin engine
* /api/dashboard/plugins — plugin manifests for the dashboard
No user data, no session content, no secrets. Same shape an external
monitoring agent would hit on /healthz.
Tests:
* New: test_gated_status_is_public (regression guard with the NAS
fly-provider.ts liveness-probe rationale spelled out in the docstring)
* New: test_other_public_api_paths_are_public_under_gate (parametrised
over the rest of PUBLIC_API_PATHS — proves 401 / 302-to-login is
never the response)
* New: docker integration check #3 in
test_dashboard_oauth_gate_engaged_by_default — /api/status
remains 200 under the gate AND reports auth_required=True so the
portal can distinguish modes
* Updated: test_full_login_round_trip_unlocks_gated_api now probes
/api/sessions instead of /api/status (status is public, so it
can no longer distinguish 'logged in' from 'gate accidentally
disabled')
* Updated: TestApi401Envelope (the no-cookie / invalid-cookie /
dead-cookie tests) probes /api/sessions for the same reason
* Updated: docker integration check #2 in
test_dashboard_oauth_gate_engaged_by_default probes
/api/sessions to prove the gate is intercepting
* Removed: dead _login() helper in
test_dashboard_auth_status_endpoint.py (no longer needed since
/api/status is reachable cold)
Companion to docs/handover/hermes-agent-dashboard-s6-insecure-fix.md
(the --insecure flag fix that shipped earlier).
Operators behind reverse proxies that don't reliably forward
X-Forwarded-Host / X-Forwarded-Proto / X-Forwarded-Prefix (manual
nginx setups, on-prem ingresses, custom-domain Fly deploys with
incomplete proxy chains) had no way to force the absolute base URL
the OAuth callback redirects from. The dashboard would reconstruct
the redirect_uri from request headers, the IDP would echo it back,
and the user would land on the wrong host or wrong path — 404.
Add `dashboard.public_url` to config.yaml with env override
HERMES_DASHBOARD_PUBLIC_URL. When set, it is the complete authority —
scheme + host + optional path prefix (e.g. https://example.com/hermes) —
and becomes the base for the OAuth `redirect_uri`. X-Forwarded-Prefix
is IGNORED on this code path because the operator has explicitly
declared the public URL; we no longer need to guess from proxy
headers, and stacking the prefix on top would double-prefix the
common case where the prefix is already baked into public_url.
When unset, the existing proxy_headers + X-Forwarded-Prefix
reconstruction runs untouched. Existing Fly.io deploys continue to
work without configuration — this is purely additive.
Precedence mirrors dashboard.oauth.client_id:
env (non-empty) > config.yaml > reconstructed from request
Implementation:
- hermes_cli/config.py: add dashboard.public_url to DEFAULT_CONFIG
with a multi-paragraph doc comment explaining the use case,
the X-Forwarded-Prefix interaction, and the validation rules.
- hermes_cli/dashboard_auth/prefix.py: factored out the existing
_REJECT_CHARS frozenset, added _normalise_public_url() validator
(requires http/https scheme + non-empty host + no header-injection
chars), _load_dashboard_section() loader (robust to load_config
raising, non-dict shapes), and resolve_public_url() entry point
with the env-overrides-config precedence. A malformed value
silently falls through to ""; the caller treats "" as "reconstruct
from request" so a typo never breaks the login flow.
- hermes_cli/dashboard_auth/routes.py: rewrite _redirect_uri()
docstring to spell out the three resolution tiers; add the
public_url short-circuit before the existing X-Forwarded-Prefix
splicing. Source-level comment notes that X-Forwarded-Prefix is
intentionally ignored when public_url is set so a future reader
doesn't try to "fix" the missing prefix layering.
- cli-config.yaml.example: extend the existing dashboard section
with a public_url block.
- website/docs/user-guide/features/web-dashboard.md: new "Public
URL override" section between the provider configuration and
the OAuth flow walkthrough. Documents the env-vs-config table,
the validation rules, and the `http://` `public_url` ↔ Secure
cookie footgun.
Test coverage — new TestPublicUrlOverride class (8 tests):
- env var overrides request reconstruction (the primary motivating
case)
- config.yaml used when env unset
- env wins over config (precedence pin)
- public_url with a path prefix already baked in (the Q1-a case the
user explicitly chose)
- public_url suppresses X-Forwarded-Prefix layering (defends
against the double-prefix bug)
- trailing slash stripped from public_url (no //auth/callback)
- malformed public_url falls through to reconstruction (six
hostile inputs: javascript:, ftp:, missing scheme, missing host,
quote chars, CRLF injection)
- empty env string doesn't shadow config.yaml entry (CI / Fly
provisioned-but-empty secret case)
Mutation-tested: flipping the precedence in resolve_public_url() trips
exactly test_env_overrides_config_public_url; weakening the validator
(accept any scheme) trips exactly test_malformed_public_url_falls_through_to_reconstruction.
Both other tests in each pair stay green, confirming the suite
discriminates the specific regression each test pins.
The login page is the first surface the user sees on a gated dashboard
and shipped with off-the-shelf system fonts and a generic orange
accent that didn't match the React dashboard waiting on the other
side of the OAuth round trip. Apply the same visual language the SPA
uses (the @nous-research/ui package) so the auth flow feels like one
product, not two.
What changes (visual only — no functional changes):
Typography
- Body: Collapse (regular + bold), served from /fonts/ — the same
woff2 files the dashboard SPA loads via the design-system's
fonts.css.
- Display: Rules Compressed (regular + medium) for the brand
wordmark and the page heading.
- Brand chrome (heading, buttons, footer) uses the DS idiom:
uppercase + letter-spacing 0.2em (matching the DS Button class).
Colour
- Background: #170d02 (deep brown-black; --background-base in DS).
- Accent: #ffac02 (amber; --midground in DS).
- Foreground: #ffffff.
- Hairlines: color-mix() of the midground at 18% / 35%, mirroring
the DS "@theme inline" derived tokens.
Button surface
- Solid amber surface with dark text, no rounded corners (DS Button
is squared). Inset bevel — — directly mirrors the DS
Button SHADOW_DEFAULT (). :active uses filter:invert(1) which matches the DS
Button's .
Atmosphere
- Subtle 3px dither (repeating-conic-gradient at 4% midground) +
a midground radial glow at top — same idioms as the DS .dither
utility and the SPA's panel chrome.
- slide-up fade-in entrance animation matching DS @keyframes
slide-up (0.6s ease-out). Honours prefers-reduced-motion.
Brand wordmark
- 'NOUS · RESEARCH' above the card in Rules Compressed, amber,
0.32em tracking. Establishes ownership before the user squints
at the buttons.
Empty-state page
- The 'Sign-in unavailable' fallback (no providers registered)
got the same colour-token and typography treatment so the
misconfigured-deploy experience is also coherent.
Fonts are served from /fonts/*.woff2 — a path the dashboard-auth gate
already allowlists pre-auth (see _GATE_PUBLIC_PREFIXES in
middleware.py:42), so the login page renders with the brand typeface
without needing the React bundle loaded. The page is still entirely
static HTML+CSS with no JS — the original constraint (no SPA
dependency, no session token) is preserved.
The class="provider-btn" selector is unchanged — the existing test
suite extracts the anchor href via that class, and a regression that
renamed it would silently break tests/hermes_cli/test_dashboard_auth_401_reauth.py.
A docstring note on the module flags this so future visual tweaks
don't break the contract by accident.
Visual smoke-test: rendered both the happy path (multiple providers
listed) and the empty-state page in a browser and verified all five
DS criteria — brown-black bg, amber accent, uppercase wide-tracking
type, inset-bevel buttons, Nous · Research wordmark — render
correctly with no unstyled fallbacks. 208/208 dashboard-auth tests
remain green.
Mission-control style deploys reverse-proxy the dashboard at a path
prefix (e.g. mission-control.tilos.com/hermes/* -> :9119) and inject
X-Forwarded-Prefix: /hermes on every request. The SPA mount already
honoured this for asset URLs and the bootstrap __HERMES_BASE_PATH__,
but the OAuth gate didn't:
1. The gate's Location: header to /login and the 401 envelope's
login_url were built bare ("/login?next=..."). Under a /hermes
prefix the browser follows that to mission-control.tilos.com/login
which the proxy doesn't route to the dashboard.
2. _redirect_uri (the OAuth callback URL handed to the IDP) used
request.url_for() which doesn't honour X-Forwarded-Prefix
(Starlette/uvicorn only proxy_headers Host + Proto + For). The
IDP redirects back to /auth/callback instead of /hermes/auth/
callback → 404 in the user's browser.
3. Cookies were set with Path=/ which leaks them to other apps on
the same origin and won't be sent back on requests under the
prefix in the first place.
Fix threads the normalised prefix through every boundary:
* New hermes_cli/dashboard_auth/prefix.py — single source of truth
for X-Forwarded-Prefix parsing. web_server._normalise_prefix
becomes a re-export so the SPA mount, the gate, and the cookies
helper all agree.
* middleware._unauth_response builds login_url = f"{prefix}/login".
* routes._redirect_uri splices the prefix into the path component
of the IDP-bound URL (with full validation of the header).
* cookies.{set,clear}_{session,pkce}_cookie now take prefix="".
Path attribute switches to /hermes when set; cookie name switches
name variant (see below). Every caller passes the request's
normalised prefix.
Cookie hardening (Teknium's lesser-note #1 in the PR review): adopt
the __Host- / __Secure- cookie name prefixes per draft-west-cookie-
prefixes. The variant is selected from (use_https, prefix):
* Loopback HTTP → bare "hermes_session_at" (both prefixes require
Secure, incompatible with HTTP).
* HTTPS, direct deploy (Path=/) → "__Host-hermes_session_at".
Strongest spec: bound to exact origin, no Domain attribute, Secure
required.
* HTTPS, behind a proxy prefix (Path=/hermes) →
"__Secure-hermes_session_at". __Host- forbids Path != "/"; the
explicit Path=/hermes covers same-origin app isolation.
Setter and reader BOTH consult the prefix because the cookie *name*
changes — a reader that looked up the bare name when the setter wrote
__Secure- would never find the value. The reader falls back across
all three variants so a request whose shape changed mid-session (e.g.
post-deploy from no-prefix to /hermes) still picks up the existing
cookie until it expires.
Test coverage:
- tests/hermes_cli/test_dashboard_auth_prefix.py — new file. 11 tests
pinning:
• Location: /hermes/login on the gate's HTML redirect
• 401 envelope login_url carries the prefix
• Malformed X-Forwarded-Prefix is ignored (header-injection
defence; the script-tag value is normalised to empty string)
• _redirect_uri splices /hermes into the path (the property
that prevents the IDP-returns-to-404 failure)
• PKCE cookie uses Path=/hermes + __Secure- when proxied
• Session cookies use __Host- when direct, __Secure- when
proxied, bare on loopback HTTP
• End-to-end round trip with hand-managed PKCE cookie carriage
(TestClient can't simulate a Path=/hermes cookie automatically)
- tests/hermes_cli/test_dashboard_auth_cookies.py — rewritten to pin
each (use_https, prefix) shape produces its expected cookie name,
plus reader-side coverage that __Host- and __Secure- variants are
both recognised.
- Existing tests across middleware / 401-reauth / etc. updated to
match the new cookie names (substring contains instead of
startswith).
Mutation-tested: reverting _unauth_response to build the bare
"/login" URL trips exactly the two tests that pin the prefix
carriage, confirming the suite discriminates the regression.
The gate's _unauth_response set next=<path> on the /login redirect URL,
but nothing downstream read it: render_login_html ignored next=,
auth_login dropped it, and auth_callback read next= from its own query
string — which an IDP never sets on the callback URL (real IDPs only
echo back code+state). The _validate_post_login_target plumbing in the
callback was unreachable on the happy path, so users always landed on
"/" regardless of what they originally requested.
Worse: reading next= from the callback URL was a latent open-redirect
sink, since an attacker could craft /auth/callback?...&next=/admin and
have the server honour it post-auth.
Fix carries next= through the round trip on a server-controlled channel:
1. login_page reads request.query_params['next'] and passes it (post-
validation) to render_login_html.
2. render_login_html threads next= URL-encoded into each provider
button's href, with HTML-attribute escaping as defence in depth.
3. auth_login accepts ?next= as a query param, re-validates, and
appends it as a fourth segment (next=<urlquoted>) in the PKCE
cookie payload alongside provider/state/verifier.
4. auth_callback no longer accepts a next: str = "" query param. It
parses next= out of the PKCE cookie and validates that with the
same same-origin rules. Any attacker-supplied ?next= on the
callback URL is silently ignored — server-only carrier.
Test coverage adds three classes:
- TestAuthCallbackNext drives /login → /auth/login → IDP-bounce →
/auth/callback end-to-end without smuggling next= onto the callback
URL (which is what the previous tests did and why they didn't
catch the bug). Includes test_attacker_callback_next_param_is_ignored
to pin the security property that the URL value is never read.
- TestRenderLoginHtmlNext covers the rendering function at the
unit boundary so a regression that drops next_path is caught
without spinning up the full app.
- TestAuthLoginPkceCookieNext inspects the Set-Cookie header on
/auth/login responses so a regression in cookie encoding is caught
without driving the full round trip.
Mutation-tested: reverting auth_callback to read next= from the URL
trips 3 of 6 TestAuthCallbackNext tests (the safe-path and attacker-
hardening ones), confirming the suite discriminates between the cookie
read and the URL read.
Contract V1 of nous-account-service PR #180 ships no refresh tokens, so
the original Phase 6 silent-refresh design is replaced with a thinner
'401 → redirect to /login' UX. The dashboard's gated middleware now
emits a structured envelope on any auth failure; the SPA's fetch
wrapper sees it and full-page-navigates the user through re-auth.
hermes_cli/dashboard_auth/cookies.py:
set_session_cookies(refresh_token='') SKIPS writing the
hermes_session_rt cookie. Forward-compat: a non-empty refresh_token
still emits the cookie unchanged, so a future Portal contract that
starts issuing RTs flips the persistence on with no other change.
clear_session_cookies still emits a Max-Age=0 deletion for the RT
cookie so stale cookies from earlier deployments get flushed on
logout / session expiry. Deprecation marker + rationale in
module docstring per the user's docstring-only deprecation pattern.
hermes_cli/dashboard_auth/middleware.py:
_unauth_response now builds a structured JSON envelope for API 401s:
{ error: 'session_expired' | 'unauthenticated',
detail: 'Unauthorized',
reason: <internal>,
login_url: '/login?next=<safe-path>' }
HTML redirects also carry next= so a user landing on /sessions
without a cookie bounces back to /sessions after re-auth.
_safe_next_target validates same-origin: drops protocol-relative
paths (//evil.com), absolute URLs, and any /login or /auth/* loop.
Dead cookies are cleared on the 401 path so the browser stops
replaying invalid tokens.
hermes_cli/dashboard_auth/routes.py:
/auth/callback accepts next= query param and validates via
_validate_post_login_target (same rules as the gate's
_safe_next_target — defence-in-depth because next= survived a full
IDP round trip and attacker-controlled state can re-enter via the
callback URL). Open-redirect attempts land at '/' instead.
web/src/lib/api.ts:
fetchJSON parses the 401 envelope and full-page-navigates to
body.login_url ONLY on the known session-expiry error codes.
Domain-level 401s (e.g. permission errors) bubble up as regular
errors. credentials: 'include' added so cookie auth works for all
fetches routed through this wrapper. sessionStorage.lastLocation is
preserved for future use by AuthWidget / hermes_status.
Test files marked with pytest.mark.xdist_group so the four files that
mutate web_server.app.state.auth_required serialize onto the same xdist
worker — eliminates 'works locally, fails in CI' app-state bleed.
20 new tests in test_dashboard_auth_401_reauth.py:
- set_session_cookies(refresh_token='') skips RT cookie
- clear_session_cookies still emits RT deletion
- 401 envelope shape (unauthenticated vs session_expired)
- dead cookie cleared on invalid-token 401
- login_url carries next= for deep paths
- login loop avoided when path is /login/auth/api-auth
- protocol-relative URL rejected
- _safe_next_target unit tests (accept same-origin, reject loops/abs)
- /auth/callback respects safe next= but rejects open redirects
2 pre-existing tests updated to accept the new /login?next=%2F shape.
Full dashboard-auth suite: 168 passed, 1 skipped (Phase 0 pre-existing).
Phase 5 task 5.1. Browsers cannot set Authorization on a WebSocket
upgrade, so in gated mode the SPA needs an alternative way to bind the
upgrade to its authenticated session.
hermes_cli/dashboard_auth/ws_tickets.py — in-memory single-use ticket
store with 30s TTL. Thread-safe (threading.Lock), token_urlsafe(32)
values, ticket value truncated to 8 chars in error messages for log
hygiene. Module-level state with _reset_for_tests() helper.
hermes_cli/dashboard_auth/routes.py — adds POST /api/auth/ws-ticket.
Auth-required (the gate middleware already attaches Session to
request.state.session). Returns {ticket, ttl_seconds}; emits
WS_TICKET_MINTED audit event with user_id + provider + ip.
hermes_cli/dashboard_auth/audit.py — adds WS_TICKET_REJECTED enum
value for the consume-side rejection event (wired into the WS
endpoints in task 5.2).
11 new tests covering round-trip, single-use, TTL boundary, unknown
ticket rejection, secret-hygiene truncation in error messages, and
concurrent mint+consume from 20 threads.
Phase 3, Tasks 3.2 + 3.3 + 3.4. These three pieces are mutually
dependent so they land together.
middleware.py - gated_auth_middleware engages when app.state.auth_required
is True. Allowlists /login, /auth/*, /api/auth/providers, and static
asset paths; everything else demands a valid session_at cookie. Verifies
by trying every registered provider's verify_session in turn (multi-
provider stack); attaches verified Session to request.state.session.
Returns 401 JSON for /api/* and 302 -> /login for HTML. ProviderError
during verify -> 503.
routes.py - APIRouter with:
GET /login server-rendered HTML
GET /auth/login?provider=N 302 to IDP + PKCE cookie
GET /auth/callback?code,state completes login, sets session cookies
POST /auth/logout clears cookies + best-effort revoke
GET /api/auth/providers public bootstrap endpoint (503 if zero)
GET /api/auth/me verified session as JSON (auth-required)
login_page.py - Inline-CSS HTML template, no React, no JavaScript.
web_server.py - Mounted gated_auth_middleware between host_header and
auth_middleware (FastAPI runs middlewares in registration order: host
check -> cookie auth -> token auth). auth_middleware short-circuits
when auth_required so cookie auth is authoritative in gated mode.
Router is included before mount_spa so the catch-all doesn't swallow
/login or /auth/*.
17 new behavioural tests; loopback regression harness still green.
Phase 3, Task 3.1. Three cookies:
- hermes_session_at: OAuth access token (HttpOnly, TTL = token TTL)
- hermes_session_rt: OAuth refresh token (HttpOnly, 30d max-age)
- hermes_session_pkce: PKCE state + verifier + provider hint (10min)
All SameSite=Lax + Path=/. Secure flag is set ONLY when the request
scheme is https — uvicorn proxy_headers=True (enabled in gated mode at
Phase 3.5) rewrites scheme from X-Forwarded-Proto so Fly's TLS
terminator works.
Phase 1, Task 1.4. Records every auth event (login start/success/failure,
logout, refresh success/failure, revoke, session verify failure, WS
ticket mint) as one JSON object per line. Token-like kwargs (access_token,
refresh_token, code, code_verifier, state, ticket, cookie, Authorization)
are dropped before serialisation so the log never contains live secrets.
Write failures log at WARNING but never raise — auth flows must not fail
because the audit logger broke.
Phase 1, Task 1.1. New package hermes_cli/dashboard_auth/ contains:
base.py - DashboardAuthProvider ABC with 5 abstract methods
(start_login, complete_login, verify_session,
refresh_session, revoke_session), Session + LoginStart
frozen dataclasses, three exception types
(ProviderError / InvalidCodeError / RefreshExpiredError),
and assert_protocol_compliance() for plugins to call
in their own tests.
registry.py - Module-level register/get/list/clear with a lock.
Nothing reads the registry yet — Phase 2 adds the StubAuthProvider and
Phase 3 wires the gate middleware. The plugin hook lands in Task 1.3.