hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-14 14:12:44 +00:00

Author	SHA1	Message	Date
teknium1	486d632cc2	fix(auxiliary): coerce None final.output to empty list in Codex aux adapter Closes #33368. `_CodexCompletionsAdapter.create()` iterates `final.output` from the Codex Responses stream. The event-driven consumer (introduced in #33042) always sets `final.output` to a list, so this shape can't come from our own code path. But: - Mocked clients in tests can return a typed Response with `output=None` - Third-party shims / compatibility layers that bypass the consumer can do the same - A future code path that wraps a different consumer could regress The old code `getattr(final, "output", [])` returns `None` (not the default `[]`) when the attribute EXISTS but is `None`. Iterating `None` then raises `TypeError: 'NoneType' object is not iterable` — the exact error logged by title-generation when this fires. Fix: `getattr(final, "output", None) or []` — single-line defensive coerce. Cheap; zero risk. Regression test asserts the auxiliary path handles a final whose `.output` is `None` (via monkey-patched consumer) without raising and returns the expected chat.completions-shaped response. Reporter: @pavegrid-1 (issue #33368).	2026-05-27 11:08:21 -07:00
Teknium	9919caff46	feat(image_gen): add Krea provider plugin (Krea 2 Medium + Large) (#33236 ) * feat(image_gen): add Krea provider plugin (Krea 2 Medium + Large) New built-in image_gen backend wrapping Krea's Krea 2 foundation image model family. Auto-discovered like the other image_gen plugins and appears in 'hermes tools' → Image Generation → Krea. Krea's API is asynchronous — submit returns a job_id, poll /jobs/{id} until terminal. The provider hides that behind the synchronous ImageGenProvider.generate() contract: submit, poll every 2s with light backoff (max 5s), 3-minute ceiling matching Krea's hosted-tool timeout. Result URL is materialised to $HERMES_HOME/cache/images/ to avoid CDN-expiry 404s downstream (same fix as xAI #26942). Models: - krea-2-medium (default — Krea's 'start here' recommendation) - krea-2-large Aspect ratios map landscape→16:9, square→1:1, portrait→9:16. Resolution: 1K (Krea's only current option). Kwarg passthrough: seed, creativity (raw/low/medium/high), styles, image_style_references (capped 10), moodboards (capped 1) — matches Krea's per-request limits. Unknown kwargs are ignored. Config knobs (config.yaml): image_gen.provider: krea image_gen.krea.model: krea-2-medium \| krea-2-large image_gen.krea.creativity: raw \| low \| medium \| high Env overrides: KREA_API_KEY (required), KREA_IMAGE_MODEL. KREA_API_KEY is registered in OPTIONAL_ENV_VARS so 'hermes setup' prompts for it. 31 new tests; image_gen suite + picker + tools_config: 211/211. * fix(image_gen/krea): address review feedback - Update KREA_API_KEY setup URL to the canonical token-creation page (https://www.krea.ai/app/api/tokens). The previous URL returned 404. - Fail fast on non-retryable HTTP statuses during poll. The previous loop retried every HTTPError for the full 180s deadline, so an auth (401), billing (402), forbidden (403), or not-found (404) response would make image_generate hang for three minutes. Only retry transient statuses (408/409/425/429/5xx); surface everything else immediately. - Add 5 tests covering fail-fast on 401/403/404 and retry on 429/503. * fix(krea): point users at the real API token dashboard URL Three call sites linked users to dashboard pages that don't exist: - hermes_cli/config.py: https://www.krea.ai/app/api/tokens - plugins/image_gen/krea/__init__.py get_setup_schema: https://www.krea.ai/api-keys - plugins/image_gen/krea/__init__.py auth_required error: https://www.krea.ai/api-keys Per Krea's own docs (https://docs.krea.ai/developers/api-keys-and-billing), the real dashboard URL is https://www.krea.ai/settings/api-tokens. All three sites now point there.	2026-05-27 11:01:47 -07:00
Dora (kyra-nest)	bcae3fcc4e	fix(honcho): align user context peer perspective Use the shared observer/target resolver for session context so peer='user' and explicit configured peer IDs query Honcho from the same assistant-observed perspective when allowed. Add regression coverage for user alias, explicit peer, and self-observer fallback.	2026-05-27 10:49:33 -07:00
David Doan	1800a1c796	fix(honcho): align peer-card read and write paths honcho_profile(peer="user") returned an empty card even when Honcho held a populated peer card for the user. Two independent bugs combined to produce the symptom: 1. Read path: get_peer_card() called _fetch_peer_card(observer, target=user), which hits GET /peers/{observer}/card?target={user} — the observer's local card of the user. On self-hosted Honcho v3 this slot is empty unless writes also use it. The peer card lives on the user peer itself (GET /peers/{user}/card). Add a fallback: when the observer-target slot is empty and a target exists, retry against the target peer's own card. 2. Write path: set_peer_card() resolved only the target peer and called user_peer.set_card(card). The read path uses the assistant peer as observer, so writes and reads addressed different Honcho card scopes. Align set_peer_card() with _resolve_observer_target() so writes go to assistant_peer.set_card(card, target=user_peer_id), matching the read. Both paths now use the same observer/target resolution, and the read path additionally falls back to the target's own card for compatibility with deployments where cards were written directly to the peer. Closes: related to #13375, #17124, #20729	2026-05-27 10:49:33 -07:00
Erosika	1a8e67076a	fix(honcho): cover pinUserPeer + aiPeer edge cases in setup, clone, and gateway cache Three related regressions stemming from the pinUserPeer alias landing: - Setup wizard read host-only fields when detecting current shape but the parser supports root-level config and gives host pinUserPeer higher precedence than pinPeerName. Re-running setup could mis-detect shape and silently flip routing. Detection now uses the same resolver order as HonchoClientConfig, and each shape branch scrubs every peer-mapping key before writing so a stale pinUserPeer=false can't outrank a freshly written pinPeerName=true. Multi no longer auto-writes userPeerAliases={} (was silently masking root-level baselines). - clone_honcho_for_profile inherited pinPeerName but not pinUserPeer, so a default profile configured with the newer key produced cloned profiles without the pin. - Gateway cache-busting signature fingerprinted Honcho user-peer fields but not ai_peer. Since HonchoSessionManager freezes cfg.ai_peer at init, mid-flight aiPeer edits kept assistant writes on the old peer until an unrelated cache eviction. ai_peer is now part of the signature.	2026-05-27 10:49:33 -07:00
Erosika	939499beed	chore(honcho): trim PR-history narration from docs and tests Remove "PR #14984 / #27371 / #1969" references and "the original key / legacy / backwards-compatible / Port #N" narration from the honcho plugin README, tests, and one stale code comment. These artefacts age poorly: they describe how a change happened rather than what the code does today, and they tax readers who weren't around for the original work. Also drop a dangling reference to scratch/memory-plugin-ux-specs.md in __init__.py — the file isn't in the repo or git history. No behaviour change.	2026-05-27 10:49:33 -07:00
Erosika	6feb2afd50	fix(honcho): plug pinPeerName transition gaps Three correctness gaps when honcho.json's identity-mapping config changes mid-flight: 1. The gateway's agent cache signature ignored honcho identity keys, so editing peerName / pinPeerName / userPeerAliases / runtimePeerPrefix was silently dropped until an unrelated cache eviction. Extend _extract_cache_busting_config to fingerprint the resolved honcho config so the AIAgent rebuilds on the next message. 2. cmd_setup let single → multi flips orphan the pinned-pool history under peerName without warning. Detect the transition, warn that runtime users will resolve to fresh empty peers, and auto-steer to hybrid (alias the operator's runtime IDs back to peerName) so the operator's own continuity survives. yes / no overrides available. 3. README didn't document the orphaning behaviour. Add a "Migrating single → multi" callout under Deployment shapes. Tests: - TestPinTransition (test_pin_peer_name.py): fresh-manager flip resolves to runtime, in-process flip is gated by the per-key session cache (documents the gateway-cache-must-bust contract), 3 cache-bust signature tests for pin / aliases / prefix. - TestProfilePeerUniqueness: two profiles pinned to distinct peerNames resolve to distinct peers; host-level peerName overrides root when pinned. - test_single_to_multi_steers_to_hybrid_by_default and test_single_to_multi_yes_override_keeps_multi (test_cli.py): wizard guard end-to-end coverage.	2026-05-27 10:49:33 -07:00
erosika	3cf5e8225d	refactor(honcho): accept pinUserPeer as backwards-compatible alias for pinPeerName The original key 'pinPeerName' from #14984 is ambiguous: a fresh reader can't tell whether it pins the user peer or the AI peer from the name alone. The resolver only ever pins the user-side (_resolve_user_peer_id short-circuits when pin_peer_name is true; the AI peer is already pinned by construction via aiPeer). Add 'pinUserPeer' as the canonical alias. Both keys land on the same internal pin_peer_name field; precedence is host pinUserPeer → host pinPeerName → root pinUserPeer → root pinPeerName → default. Host-level always beats root-level regardless of alias, so a host block can still explicitly disable a root-level pin even via the new key. Make _resolve_bool variadic so it can express the four-value precedence chain. All existing callers pass two positional args + default keyword, which the new signature accepts unchanged. Internal var name (pin_peer_name) stays the same to keep the cherry-picked #27371 commits clean and avoid a noisy rename diff.	2026-05-27 10:49:33 -07:00
erosika	0bac880991	feat(honcho-setup): add deployment-shape step to identity-mapping wizard The PR #27371 resolver introduced three identity-mapping config keys (pinPeerName, userPeerAliases, runtimePeerPrefix), but operators had no guided way to set them — they had to read the README, understand the resolver ladder, and hand-edit honcho.json. This commit adds an interactive step to 'hermes honcho setup' that asks one question ('what's your deployment shape?') and writes the right combination of keys. Three shapes cover the realistic deployments: * single -- pinPeerName=true. All gateway users collapse to your peerName. Recommended for personal/single-operator use. * multi -- pinPeerName=false, no aliases. Each runtime user gets their own peer. Optional runtimePeerPrefix for cross- platform namespace isolation. * hybrid -- pinPeerName=false, with userPeerAliases mapping YOUR runtime IDs (Telegram UID, Discord snowflake, Slack user, Matrix MXID) to peerName. Multi-user gateway where you are a privileged operator. A 'skip' option leaves existing identity-mapping config untouched — critical because re-running setup must not silently wipe operator- curated aliases. The wizard detects the current shape from existing config so the prompt's default matches what the operator already has.	2026-05-27 10:49:33 -07:00
erosika	c03960decd	fix(honcho): include user_id in agent cache signature to prevent shared-thread peer contamination PR #27371 introduced a per-user-peer resolver in HonchoSessionManager, but the resolved runtime identity is frozen into the manager at first- message init. When the gateway session_key intentionally omits the participant ID (the default for threads via thread_sessions_per_user= False), a cached AIAgent created by user A is reused for user B's messages, attributing B's writes to A's resolved Honcho peer and breaking #27371's per-user-peer contract. Fix by including user_id and user_id_alt in _agent_config_signature so the cache key distinguishes participants in shared threads. Each user in a shared thread now triggers a fresh AIAgent build (trading prompt- cache warmth for memory-attribution correctness — the right tradeoff for an external-memory backend where misattribution is unrecoverable). The default-None case keeps the signature byte-identical to pre-fix behavior so this change doesn't invalidate in-flight caches on deploy.	2026-05-27 10:49:33 -07:00
erosika	00e6830204	fix(honcho): inherit identity-mapping config in cloned profile blocks PR #27371 added host-scoped userPeerAliases, runtimePeerPrefix, and pinPeerName, but the cloned-profile allowlist in plugins/memory/honcho/cli.py::clone_honcho_for_profile() omitted them. A new profile created via 'hermes honcho setup' or similar would silently drop the operator's identity-mapping config, causing gateway users to resolve to raw runtime IDs and fragmenting Honcho memory across an unintended set of peers. Add the three keys to the allowlist and a regression test class covering all three plus the unset case.	2026-05-27 10:49:33 -07:00
mavrickdeveloper	30b391ab36	Avoid Honcho runtime peer collisions (cherry picked from commit `4ae3c1a228`)	2026-05-27 10:49:33 -07:00
mavrickdeveloper	382b1fc1b6	Cover Honcho runtime peer edge cases (cherry picked from commit `d89a57ea40`)	2026-05-27 10:49:33 -07:00
mavrickdeveloper	2e3c6627ce	Add Honcho runtime peer mapping (cherry picked from commit `864cdb3d2e`)	2026-05-27 10:49:33 -07:00
zccyman	2e181602a1	fix(agent): isolate credential pool on provider fallback Closes #33163. When _try_activate_fallback() switches from one provider to another (e.g. openai-codex → openrouter), the credential pool still belongs to the primary provider. This causes two compounding bugs: 1. The pool retains the primary's base_url. Downstream pool recovery (rate_limit / billing / auth) calls _swap_credential() with a primary entry which overwrites the agent's base_url back to the primary's endpoint. Every fallback request then 404s against the wrong host. 2. Pool recovery acting on errors from the FALLBACK provider mutates the PRIMARY's pool state (#33088 reported a related corruption pattern), exhausting/rotating entries that have nothing to do with the failure. Two layered fixes: a) try_activate_fallback (agent/chat_completion_helpers.py): on fallback activation, clear agent._credential_pool when the fallback provider doesn't match the pool's provider. Pool is preserved when the fallback shares the pool's provider (e.g. multiple openrouter entries). b) recover_with_credential_pool (agent/agent_runtime_helpers.py): defensive guard rejects any pool mutation when agent.provider doesn't match pool.provider. Defense-in-depth — should never fire after (a) is in place, but covers any future path that attaches a stale pool. Salvaged from @zccyman's PR #33217. The original PR was written against the pre-refactor monolithic run_agent.py; both target functions have since been extracted to module-level helpers. Behavior is identical — the guards live in the canonical extracted locations. Tests - New tests/run_agent/test_fallback_credential_isolation.py (7 tests covering: fallback clears mismatched pool, fallback preserves matching pool, recovery rejects mismatched pool, recovery accepts matching pool, 429-from-z.ai-doesn't-exhaust-codex-pool, _client_kwargs base_url survives pool clear, _swap_credential doesn't restore primary URL after fallback). - Cross-verified: 77/77 passing across fallback isolation tests + agent/test_credential_pool.py — no regression. Co-authored-by: zccyman <16263913+zccyman@users.noreply.github.com>	2026-05-27 10:45:26 -07:00
JohnC1009	414a5bc924	fix(auth): fall back to global auth.json in _load_provider_state In profile mode, _load_provider_state previously returned None when a provider was absent from the profile's auth.json — even if the user had authenticated at the global root. This broke runtime credential resolvers that read state directly (resolve_nous_access_token, resolve_nous_runtime_credentials), causing profiles without their own nous login to fail with 'Hermes is not logged into Nous Portal' despite a valid global session. Push the existing read-only global fallback (already used by get_provider_auth_state and read_credential_pool) into _load_provider_state so every caller benefits, and simplify get_provider_auth_state into a thin wrapper. Writes still target the profile only — profile state continues to shadow global state on the next read after a per-profile login. Behavior in classic (non-profile) mode is unchanged because _load_global_auth_store returns an empty dict. Adds 5 tests covering the new contract on _load_provider_state directly. Existing 770 auth/credential/nous tests still pass.	2026-05-27 09:38:58 -07:00
LeonSGP43	458a94e425	fix(cli): keep destructive slash modal on Linux	2026-05-27 05:57:01 -07:00
Teknium	f0de3cd0a0	fix(agent): roll back switch_model() state when client rebuild fails (#33228 ) Closes #33175. switch_model() in agent/agent_runtime_helpers.py mutated agent.model and agent.provider before rebuilding the client, with no try/except to restore them on failure. If the rebuild raised (bad API key, network error, build_anthropic_client failure, etc.) the agent was left with the new model+provider name paired with the OLD client — producing HTTP 400s like "claude-sonnet-4-6 is not supported on openai-codex" on the next turn. Callers in cli.py, gateway/run.py, and tui_gateway/server.py already catch the exception and warn the user, but the warning was misleading because the swap had partially succeeded; the agent's state was torn. Snapshot every mutated field before the swap, wrap the swap+rebuild block in try/except, and restore the snapshot on failure before re-raising so the caller's warning surfaces. Reported by @amirariff91. Tests cover both branches (chat_completions and anthropic_messages) and the cross-branch case (anthropic -> openai).	2026-05-27 05:43:20 -07:00
Teknium	b4eea187d5	fix(xai-oauth): gate slash-enum strip on model name + add regression tests (#28490 ) Three additions on top of @Nami4D's salvage: 1. Gate the preflight slash-enum strip on the model name pattern (grok-* / x-ai/grok-*). The original PR stripped slash-containing enum values from every codex_responses request, but native Codex (OpenAI) and GitHub Models DO accept slash enums — stripping them there would silently degrade tool-schema constraints. xAI is the only Responses-API surface that rejects the shape. 2. Resolve the merge conflict in agent/transports/codex.py by preserving both the timeout-forwarding block that landed on main between the PR's branch point and now AND the new service_tier strip. Behavioural intent of both is preserved. 3. Six new tests in tests/agent/transports/test_codex_transport.py covering: - TestCodexTransportXaiServiceTierStrip (3 tests): xAI strips service_tier from request_overrides; non-xAI codex_responses and GitHub Models both KEEP service_tier (regression guards so the strip stays xAI-only). - TestPreflightSlashEnumStrip (3 tests): Grok and aggregator- prefixed Grok model names both trigger the safety-net strip; non-Grok models preserve slash enums as a regression guard against the strip becoming too broad. 51/51 in tests/agent/transports/test_codex_transport.py. Co-authored-by: Nami4D <hello@nami4d.tech>	2026-05-27 05:25:38 -07:00
Teknium	0325e18f34	fix(gateway): keep Telegram heartbeat + interim commentary on; edit heartbeat in place (#33187 ) #33151 flipped THREE Telegram display defaults to false: - tool_progress: new -> off (kept: per-tool stream is too chatty) - interim_assistant_messages: T -> F (REVERTED here) - long_running_notifications: T -> F (REVERTED here) - busy_ack_detail: T -> F (kept: verbose iteration counter) The two reverts were wrong. interim_assistant_messages = the model's REAL words mid-turn ("I'll inspect the repo first.", "Let me check both files in parallel"). That is signal, not noise. Suppressing it left Telegram users staring at "typing..." for the entire turn duration with no feedback. long_running_notifications = the periodic heartbeat. Silent agent for 30 minutes is worse than one bubble updating every 3 minutes. Changes: - gateway/display_config.py: Telegram tier-1 inbox keeps both defaults on (only tool_progress and busy_ack_detail stay off). - gateway/run.py _notify_long_running(): edit a single heartbeat message in place (where the adapter supports it) instead of posting a new "Still working..." bubble each interval. Telegram, Discord, Slack, Matrix all qualify. Falls back to send-new when edit fails. - gateway/run.py: tighten heartbeat text. "⏳ Still working... (12 min elapsed — iteration 21/60, running: terminal)" -> "⏳ Working — 12 min, terminal". Verbose iteration detail moves behind busy_ack_detail (one knob now controls both busy acks AND heartbeat verbosity). - tests/, cli-config.yaml.example, website/docs/user-guide/messaging: updated to reflect the corrected story.	2026-05-27 05:21:53 -07:00
Teknium	69dfcdcc15	fix(auth): codex chat path falls back to credential_pool when singleton is empty Closes #32992. The chat path resolves Codex credentials via `resolve_codex_runtime_credentials` which only reads `providers.openai-codex.tokens` (the singleton). The auxiliary path uses `_read_codex_access_token` which checks the credential_pool first. For users whose tokens live only in the pool — manual seed, partial re-auth, restore from backup, or any state where the singleton is empty but the pool is healthy — the chat path raised AuthError or (worse, since OpenAI(api_key='') silently attaches no header) the wire saw HTTP 401 "Missing Authentication header" while the auxiliary path worked fine. This adds a pool fallback to `resolve_codex_runtime_credentials`: when the singleton has no usable access_token, scan `credential_pool.openai-codex` for the first entry that has a non-empty access_token and isn't in an exhaustion cooldown window (`last_error_reset_at` in the future). If found, return that token with `source="credential_pool"`. If no usable entry exists, the original AuthError propagates as before. Regression tests cover: - Empty singleton + healthy pool entry → pool token returned - Pool fallback skips entries currently in cooldown - Empty singleton + empty/wedged pool → AuthError propagates (existing contract preserved)	2026-05-27 03:43:51 -07:00
helix4u	ea34925002	fix(discord): recover Windows voice opus decoding	2026-05-27 03:35:33 -07:00
Teknium	0b6ace6498	test(verbose): align with telegram tier-1 inbox default Two tests in test_verbose_command.py asserted Telegram's tool_progress default was "new" and expected /verbose to cycle that to "all". The default has since been overridden to "off" in gateway/display_config.py (_PLATFORM_DEFAULTS for telegram — tier-1 inbox preset that keeps mobile chats final-answer-first), making the first /verbose invocation cycle off → new, not all → verbose. The behavioral change was intentional; the tests were stale and missing from the same commit. Surfaced as a pre-existing failure on origin/main during CI for the unrelated #33164 / #33168 Codex auth salvages.	2026-05-27 03:13:15 -07:00
konsisumer	f1422ffd77	fix(gateway): classify Codex 429 quota as rate-limit, not missing credentials When the Codex OAuth token endpoint returns 429 (usage-limit / quota exhaustion), refresh_codex_oauth_pure raised a generic auth error that the gateway surfaced as 'Primary provider auth failed: No Codex credentials stored. Run hermes auth', prompting re-auth that cannot lift a quota cap. Classify 429 distinctly (codex_rate_limited, relogin_required=False) with a non-alarming quota message that honors Retry-After, log it as 'Primary provider rate-limited (429)', and stop format_auth_error from appending the re-authenticate remediation. Also log the fallback provider's literal config key instead of the resolved runtime category. Refs #32790	2026-05-27 03:13:15 -07:00
konsisumer	2bbd53493d	fix(cli): sync credential_pool on Codex re-auth Codex re-auth via `hermes setup` / `hermes model` wrote fresh OAuth tokens to providers.openai-codex.tokens but left the credential_pool device_code entry holding the consumed refresh token and stale error markers. Since the runtime selects from the pool, the next request spent a dead token and got a 401 token_invalidated. Update the singleton-seeded pool entries in lockstep and clear their error state. Fixes #33000	2026-05-27 03:02:06 -07:00
houenyang-momo	60f84c6c28	gateway: quiet Telegram operational chatter	2026-05-27 02:41:24 -07:00
Robert DaSilva	efa952531b	fix: ignore Telegram start pings	2026-05-27 02:41:24 -07:00
sir-ad	8807b1c727	fix(gateway): hide telegram compaction status noise	2026-05-27 02:41:24 -07:00
chaconne67	9c69204d87	fix(codex_responses_adapter): drop foreign-issuer reasoning on replay reasoning.encrypted_content is sealed to the Responses endpoint that minted it. When a session switches model providers mid-conversation — say the user runs /model gpt-5.5 after several turns on grok-4.3, or vice versa — the persisted codex_reasoning_items carry blobs the new endpoint cannot decrypt, and every subsequent turn fails with HTTP 400 invalid_encrypted_content. This is the cross-issuer prevention layer. Pairs with: * PR #33035 — runtime recovery when the HTTP 400 fires anyway * PR #33146 — prevention for transient rs_tmp_* items Stamps each reasoning item with the issuer kind that minted it (codex_backend / xai_responses / github_responses / other:<url>) at normalize time, then drops items at replay time when the active endpoint differs from the stamp. Unstamped (legacy) items pass through for backwards compatibility. Cherry-picked from @chaconne67's PR #31629. Conflict against current main (#33035's replay_encrypted_reasoning parameter) resolved as 'keep both' — the two guards compose: replay_encrypted_reasoning=False is the session-wide kill switch, current_issuer_kind is the per-item filter that runs only when replay is still enabled.	2026-05-27 02:40:03 -07:00
Krishna	b1a46b3047	fix(codex): drop transient rs_tmp reasoning replay state	2026-05-27 02:25:59 -07:00
Teknium	187cf0f257	tools(terminal): nudge homebrewed CI pollers at the tool surface (#33142 ) Background processes whose command contains `gh pr view --json statusCheckRollup` or `gh pr checks \| jq` now get a runtime hint in the result pointing at the canonical green-ci-policy snippets. The homebrew shape has caused at least seven silent CI-watcher failures in the past two weeks (#31329, #31448, #31695, #31709, #31745, #32264, #33131) — each one a different jq/awk/grep variation of the same fundamental problem (stdout buffering, jq null-key edge cases, conclusion-vs-status confusion, TTY-only banner grepping). The skill that documents this anti-pattern is excellent, but a skill only fires if the agent loads it. The tool surface fires on every misuse. This is the embed-footguns-in-tool-surface pattern from PR #31289 applied to a recurring failure mode that's outgrown skill-only enforcement. Detector is deliberately narrow — flags two specific shapes: 1. Any command containing `statusCheckRollup` (the JSON-API path — conclusion vs status field semantics keep burning us). 2. `gh pr view` / `gh pr checks` combined with `jq` (gh pr checks doesn't emit JSON, so any `\| jq` here is confused intent; the canonical column-2 poller uses awk-on-tabs, not jq). Does NOT flag the blessed column-2 awk-on-tabs poller (which uses `awk -F"\t" "\==\"pending\""`) or the exit-code-driven `gh pr checks $PR >/dev/null` snippet. Hint composes with the existing background-without-notify_on_complete hint — both can fire on the same call. Each is independently actionable. Tests: - 4 new cases in tests/tools/test_notify_on_complete.py - test_homebrew_ci_poller_via_statusCheckRollup_emits_hint (positive) - test_homebrew_ci_poller_via_gh_pr_checks_piped_to_jq_emits_hint (positive) - test_canonical_column2_awk_poller_does_not_emit_homebrew_hint (negative) - test_canonical_gh_pr_checks_exit_code_loop_does_not_emit_hint (negative) - test_non_ci_background_command_does_not_emit_homebrew_hint (negative) - 30/30 passing (was 26)	2026-05-27 02:22:08 -07:00
Ben	a890389b69	feat(dashboard-auth): HERMES_DASHBOARD_PUBLIC_URL / dashboard.public_url override Operators behind reverse proxies that don't reliably forward X-Forwarded-Host / X-Forwarded-Proto / X-Forwarded-Prefix (manual nginx setups, on-prem ingresses, custom-domain Fly deploys with incomplete proxy chains) had no way to force the absolute base URL the OAuth callback redirects from. The dashboard would reconstruct the redirect_uri from request headers, the IDP would echo it back, and the user would land on the wrong host or wrong path — 404. Add `dashboard.public_url` to config.yaml with env override HERMES_DASHBOARD_PUBLIC_URL. When set, it is the complete authority — scheme + host + optional path prefix (e.g. https://example.com/hermes) — and becomes the base for the OAuth `redirect_uri`. X-Forwarded-Prefix is IGNORED on this code path because the operator has explicitly declared the public URL; we no longer need to guess from proxy headers, and stacking the prefix on top would double-prefix the common case where the prefix is already baked into public_url. When unset, the existing proxy_headers + X-Forwarded-Prefix reconstruction runs untouched. Existing Fly.io deploys continue to work without configuration — this is purely additive. Precedence mirrors dashboard.oauth.client_id: env (non-empty) > config.yaml > reconstructed from request Implementation: - hermes_cli/config.py: add dashboard.public_url to DEFAULT_CONFIG with a multi-paragraph doc comment explaining the use case, the X-Forwarded-Prefix interaction, and the validation rules. - hermes_cli/dashboard_auth/prefix.py: factored out the existing _REJECT_CHARS frozenset, added _normalise_public_url() validator (requires http/https scheme + non-empty host + no header-injection chars), _load_dashboard_section() loader (robust to load_config raising, non-dict shapes), and resolve_public_url() entry point with the env-overrides-config precedence. A malformed value silently falls through to ""; the caller treats "" as "reconstruct from request" so a typo never breaks the login flow. - hermes_cli/dashboard_auth/routes.py: rewrite _redirect_uri() docstring to spell out the three resolution tiers; add the public_url short-circuit before the existing X-Forwarded-Prefix splicing. Source-level comment notes that X-Forwarded-Prefix is intentionally ignored when public_url is set so a future reader doesn't try to "fix" the missing prefix layering. - cli-config.yaml.example: extend the existing dashboard section with a public_url block. - website/docs/user-guide/features/web-dashboard.md: new "Public URL override" section between the provider configuration and the OAuth flow walkthrough. Documents the env-vs-config table, the validation rules, and the `http://` `public_url` ↔ Secure cookie footgun. Test coverage — new TestPublicUrlOverride class (8 tests): - env var overrides request reconstruction (the primary motivating case) - config.yaml used when env unset - env wins over config (precedence pin) - public_url with a path prefix already baked in (the Q1-a case the user explicitly chose) - public_url suppresses X-Forwarded-Prefix layering (defends against the double-prefix bug) - trailing slash stripped from public_url (no //auth/callback) - malformed public_url falls through to reconstruction (six hostile inputs: javascript:, ftp:, missing scheme, missing host, quote chars, CRLF injection) - empty env string doesn't shadow config.yaml entry (CI / Fly provisioned-but-empty secret case) Mutation-tested: flipping the precedence in resolve_public_url() trips exactly test_env_overrides_config_public_url; weakening the validator (accept any scheme) trips exactly test_malformed_public_url_falls_through_to_reconstruction. Both other tests in each pair stay green, confirming the suite discriminates the specific regression each test pins.	2026-05-27 02:12:27 -07:00
Ben	61dcc33893	feat(dashboard-auth): config.yaml as canonical surface for dashboard.oauth Per AGENTS.md, ~/.hermes/.env is reserved for API keys / secrets and config.yaml is the surface for non-secret configuration. The Nous Portal plugin previously read HERMES_DASHBOARD_OAUTH_CLIENT_ID and HERMES_DASHBOARD_PORTAL_URL from the environment only, which forced local-dev / on-prem operators to put non-secret per-instance configuration in .env — violating the convention. Add dashboard.oauth.{client_id,portal_url} to DEFAULT_CONFIG and have the plugin resolve each setting with env-overrides-config precedence: 1. Env var when set to a non-empty value (Fly.io platform-secret injection — what pushes per-deploy client_ids without baking them into the image). 2. config.yaml entry (canonical surface for local dev / on-prem). 3. Plugin default (no provider registered when client_id is empty; portal_url defaults to https://portal.nousresearch.com). Empty env values are explicitly treated as unset so a provisioned-but- not-populated Fly secret can't accidentally shadow a valid config.yaml entry with an empty string — operators would otherwise lose the gate. Implementation: - hermes_cli/config.py: add dashboard.oauth.{client_id,portal_url} block to DEFAULT_CONFIG with full doc comment explaining the override precedence and Fly.io rationale. - plugins/dashboard_auth/nous/__init__.py: add _load_config_oauth_section, _resolve_client_id, _resolve_portal_url helpers; replace the two direct os.environ.get() calls in register() with the resolvers. Update the skip-reason string to mention BOTH surfaces so an operator looking at the fail-closed bind error knows config.yaml is a valid alternative to the env var. - plugins/dashboard_auth/nous/plugin.yaml: update description to name both surfaces. requires_env stays pointing at the env var name — it's metadata-only (not used by the plugin loader for gating) so this is documentation/UX, not enforcement. - cli-config.yaml.example: append commented dashboard.oauth block with the same override rationale operators see in code. - website/docs/user-guide/features/web-dashboard.md: rewrite the 'Default provider: Nous Research' section to lead with config.yaml, present env vars as operator overrides (Fly.io's primary path). Updated the example fail-closed bind error to match the new skip-reason text. Test coverage — new TestConfigYamlSource class (8 tests) pinning every tier of the precedence chain: - config-yaml-only path registers correctly - both config-yaml fields (client_id + portal_url) honoured - env var overrides config for client_id (Fly.io critical path) - env var overrides config for portal_url - empty env string does NOT shadow config (CI/Fly edge case) - neither source set → skip with reason mentioning BOTH surfaces - load_config() raising falls through to env-only path (resilience) - non-dict oauth section falls through cleanly (typo resilience) Mutation-tested: flipping the precedence to config-wins-over-env trips exactly test_env_overrides_config_client_id while the other 7 stay green, confirming the suite discriminates the order, not just the sources. This closes the last item in Teknium's PR review (PR #30156).	2026-05-27 02:12:27 -07:00
Ben	b26d81d536	feat(dashboard-auth): honour X-Forwarded-Prefix + __Host-/__Secure- cookies Mission-control style deploys reverse-proxy the dashboard at a path prefix (e.g. mission-control.tilos.com/hermes/* -> :9119) and inject X-Forwarded-Prefix: /hermes on every request. The SPA mount already honoured this for asset URLs and the bootstrap __HERMES_BASE_PATH__, but the OAuth gate didn't: 1. The gate's Location: header to /login and the 401 envelope's login_url were built bare ("/login?next=..."). Under a /hermes prefix the browser follows that to mission-control.tilos.com/login which the proxy doesn't route to the dashboard. 2. _redirect_uri (the OAuth callback URL handed to the IDP) used request.url_for() which doesn't honour X-Forwarded-Prefix (Starlette/uvicorn only proxy_headers Host + Proto + For). The IDP redirects back to /auth/callback instead of /hermes/auth/ callback → 404 in the user's browser. 3. Cookies were set with Path=/ which leaks them to other apps on the same origin and won't be sent back on requests under the prefix in the first place. Fix threads the normalised prefix through every boundary: * New hermes_cli/dashboard_auth/prefix.py — single source of truth for X-Forwarded-Prefix parsing. web_server._normalise_prefix becomes a re-export so the SPA mount, the gate, and the cookies helper all agree. * middleware._unauth_response builds login_url = f"{prefix}/login". * routes._redirect_uri splices the prefix into the path component of the IDP-bound URL (with full validation of the header). * cookies.{set,clear}_{session,pkce}_cookie now take prefix="". Path attribute switches to /hermes when set; cookie name switches name variant (see below). Every caller passes the request's normalised prefix. Cookie hardening (Teknium's lesser-note #1 in the PR review): adopt the __Host- / __Secure- cookie name prefixes per draft-west-cookie- prefixes. The variant is selected from (use_https, prefix): * Loopback HTTP → bare "hermes_session_at" (both prefixes require Secure, incompatible with HTTP). * HTTPS, direct deploy (Path=/) → "__Host-hermes_session_at". Strongest spec: bound to exact origin, no Domain attribute, Secure required. * HTTPS, behind a proxy prefix (Path=/hermes) → "__Secure-hermes_session_at". __Host- forbids Path != "/"; the explicit Path=/hermes covers same-origin app isolation. Setter and reader BOTH consult the prefix because the cookie name changes — a reader that looked up the bare name when the setter wrote __Secure- would never find the value. The reader falls back across all three variants so a request whose shape changed mid-session (e.g. post-deploy from no-prefix to /hermes) still picks up the existing cookie until it expires. Test coverage: - tests/hermes_cli/test_dashboard_auth_prefix.py — new file. 11 tests pinning: • Location: /hermes/login on the gate's HTML redirect • 401 envelope login_url carries the prefix • Malformed X-Forwarded-Prefix is ignored (header-injection defence; the script-tag value is normalised to empty string) • _redirect_uri splices /hermes into the path (the property that prevents the IDP-returns-to-404 failure) • PKCE cookie uses Path=/hermes + __Secure- when proxied • Session cookies use __Host- when direct, __Secure- when proxied, bare on loopback HTTP • End-to-end round trip with hand-managed PKCE cookie carriage (TestClient can't simulate a Path=/hermes cookie automatically) - tests/hermes_cli/test_dashboard_auth_cookies.py — rewritten to pin each (use_https, prefix) shape produces its expected cookie name, plus reader-side coverage that __Host- and __Secure- variants are both recognised. - Existing tests across middleware / 401-reauth / etc. updated to match the new cookie names (substring contains instead of startswith). Mutation-tested: reverting _unauth_response to build the bare "/login" URL trips exactly the two tests that pin the prefix carriage, confirming the suite discriminates the regression.	2026-05-27 02:12:27 -07:00
Ben	034ad95fed	fix(dashboard-auth): propagate next= through login page + PKCE cookie The gate's _unauth_response set next=<path> on the /login redirect URL, but nothing downstream read it: render_login_html ignored next=, auth_login dropped it, and auth_callback read next= from its own query string — which an IDP never sets on the callback URL (real IDPs only echo back code+state). The _validate_post_login_target plumbing in the callback was unreachable on the happy path, so users always landed on "/" regardless of what they originally requested. Worse: reading next= from the callback URL was a latent open-redirect sink, since an attacker could craft /auth/callback?...&next=/admin and have the server honour it post-auth. Fix carries next= through the round trip on a server-controlled channel: 1. login_page reads request.query_params['next'] and passes it (post- validation) to render_login_html. 2. render_login_html threads next= URL-encoded into each provider button's href, with HTML-attribute escaping as defence in depth. 3. auth_login accepts ?next= as a query param, re-validates, and appends it as a fourth segment (next=<urlquoted>) in the PKCE cookie payload alongside provider/state/verifier. 4. auth_callback no longer accepts a next: str = "" query param. It parses next= out of the PKCE cookie and validates that with the same same-origin rules. Any attacker-supplied ?next= on the callback URL is silently ignored — server-only carrier. Test coverage adds three classes: - TestAuthCallbackNext drives /login → /auth/login → IDP-bounce → /auth/callback end-to-end without smuggling next= onto the callback URL (which is what the previous tests did and why they didn't catch the bug). Includes test_attacker_callback_next_param_is_ignored to pin the security property that the URL value is never read. - TestRenderLoginHtmlNext covers the rendering function at the unit boundary so a regression that drops next_path is caught without spinning up the full app. - TestAuthLoginPkceCookieNext inspects the Set-Cookie header on /auth/login responses so a regression in cookie encoding is caught without driving the full round trip. Mutation-tested: reverting auth_callback to read next= from the URL trips 3 of 6 TestAuthCallbackNext tests (the safe-path and attacker- hardening ones), confirming the suite discriminates between the cookie read and the URL read.	2026-05-27 02:12:27 -07:00
Ben	c3104195b8	fix(dashboard-auth): bypass loopback WS peer check in gated mode When the OAuth gate is active, start_server runs uvicorn with proxy_headers=True so the dashboard can honour X-Forwarded-Proto from Fly's TLS terminator (cookies, redirect URI reconstruction). A side effect: ws.client.host is rewritten to the X-Forwarded-For value, which on Fly is the real internet client IP — never loopback. The loopback peer guard in _ws_client_is_allowed then rejected every WS upgrade in gated mode (4403 close) even after a successful OAuth round trip and ticket consumption, silently breaking /api/pty, /api/ws, /api/pub, and /api/events. Fix: in gated mode, bypass the peer-IP check. The OAuth gate + single-use ticket is the auth. The Host/Origin guard in _ws_host_origin_is_allowed still runs and is what protects against DNS-rebinding here, not the peer IP. Loopback mode behaviour is unchanged: the legacy ?token= path is the only auth there and we don't want LAN hosts guessing tokens. Regression coverage: TestWsRequestIsAllowedGated pins all four behaviours — non-loopback peer allowed in gated mode, non-loopback peer rejected in loopback mode, loopback peer allowed in loopback mode, and the Host/Origin guard still firing on a rebinding attempt with gated mode + matching peer.	2026-05-27 02:12:27 -07:00
Ben	866cc988b5	fix(dashboard-auth): use fixed-length sig suffix in stub token framing The stub auth provider's _sign/_unsign helpers joined payload and HMAC with a 'b"."' separator and recovered the parts via bytes.rsplit. HMAC-SHA256 digests are random bytes, so ~12% of the time the digest contains 0x2E ('.') and rsplit picks the wrong split point -- HMAC verification then spuriously rejects valid tokens. test_stub_refresh_round_trips was failing ~25% of the time in isolation because of this. Switch to a fixed-length suffix (32 bytes, sliced off in _unsign): no separator means no collision class. After the fix, 10/10 runs pass.	2026-05-27 02:12:27 -07:00
Ben	c598076b76	test(dashboard-auth): strip HERMES_DASHBOARD_OAUTH_* env vars in hermetic fixture When these vars are set in the developer's shell, every /api/status call triggers load_gateway_config() -> discover_plugins() -> the bundled dashboard_auth/nous plugin auto-registers itself, leaking a provider into the registry across tests on the same xdist worker. That breaks assertions like 'auth_providers == []' (loopback) and '== ["stub"]' (gated) in test_dashboard_auth_status_endpoint.py. CI never has these set, so this only surfaced locally -- exactly the hermeticity gap _hermetic_environment is meant to close. Add them to _HERMES_BEHAVIORAL_VARS so the autouse fixture strips them, and to the unset list in scripts/run_tests.sh as belt-and-suspenders for direct pytest invocations.	2026-05-27 02:12:27 -07:00
Ben	a498485631	feat(dashboard-auth-nous): surface token iss/aud in verification-failure error When jwt.decode raises InvalidTokenError, decode the token a second time without signature verification (safe — we never trust the values, just display them) and append the actual iss/aud claims plus our configured expected values to the error message. Lets operators see config drift between HERMES_DASHBOARD_PORTAL_URL / HERMES_DASHBOARD_OAUTH_CLIENT_ID and what Portal is actually emitting without having to hand-decode the JWT from the browser cookie.	2026-05-27 02:12:27 -07:00
Ben	b3dc539304	feat(dashboard-auth): Nous plugin always-on; default portal URL; specific error messages The Nous OAuth provider plugin (plugins/dashboard_auth/nous) is bundled and auto-loaded — same as before — but previously refused to register unless BOTH HERMES_DASHBOARD_OAUTH_CLIENT_ID and HERMES_DASHBOARD_PORTAL_URL were set, then the gate's fail-closed branch told the operator 'install the default Nous provider'. That message is misleading: the provider IS installed; it's just unconfigured. And the contract only really needs the per-instance client_id — the portal URL is the same for everyone in production. Three changes: 1. plugins/dashboard_auth/nous/__init__.py: - HERMES_DASHBOARD_PORTAL_URL is now optional and defaults to 'https://portal.nousresearch.com'. Override only for staging (portal.rewbs.uk) or a custom deployment. Empty string also falls back to the default so an empty Fly secret can't point the dashboard at nowhere. - Plugin exposes a module-level LAST_SKIP_REASON: str that the gate reads when no providers register. Cleared on each register() call. Skip reasons are human-readable and actionable ('HERMES_DASHBOARD_OAUTH_CLIENT_ID is not set. The Nous Portal provisions this env var…'). 2. plugins/dashboard_auth/nous/plugin.yaml: - requires_env drops HERMES_DASHBOARD_PORTAL_URL; only the client_id is mandatory. Description updated to reflect this. 3. hermes_cli/web_server.py: - When the gate fail-closes for 'no providers', it now reads each bundled plugin's LAST_SKIP_REASON and embeds them in the SystemExit message. Operator sees the specific config fix needed: Bundled providers reported these issues: • nous: HERMES_DASHBOARD_OAUTH_CLIENT_ID is not set. … instead of the prior generic 'Install the default Nous provider'. Tests: - TestPluginRegister rewritten to assert the new defaults + LAST_SKIP_REASON contents (6 tests, +1 new for empty-string env). - New gate test test_start_server_surfaces_nous_skip_reason_when_unconfigured. - test_get_method_is_not_allowed widened to handle the SPA-shell 200 path explicitly — assertion now verifies no JSON ticket leaks rather than asserting a specific status code (covers all four of 401/404/405/200). Docs updated: web-dashboard.md's 'Default provider' section now shows the env-var table with required/optional columns and embeds the fail-closed error message verbatim so operators can match what they see at the prompt.	2026-05-27 02:12:27 -07:00
Ben	2fc4615fc4	feat(dashboard-auth): Phase 7 — SPA AuthWidget + /api/status auth fields Phase 7 surfaces the OAuth gate state to users. web/src/components/AuthWidget.tsx (new): Sidebar widget that fetches /api/auth/me on mount and renders a compact 'Logged in as <user_id…> via <provider>' row with a logout icon. Contract V1 (Nous Portal) emits no email/display_name claims, so user_id is the display value (truncated to 14 chars + ellipsis); display_name and email fallthroughs are forward-compat for OQ-C1. Renders nothing on 401 from /api/auth/me — that's the signal the gate isn't engaged (loopback mode), in which case the widget would be confusing. Logout POSTs /auth/logout (which clears cookies + redirects to /login) then full-page-navigates to /login itself; the SPA's fetch wrapper doesn't follow that redirect, so the navigation is explicit. web/src/App.tsx: mounts <AuthWidget /> above <SidebarFooter />. Component is self-hiding in loopback mode so there's no need for a conditional mount. web/src/lib/api.ts: - getAuthMe() + logout() helpers - AuthMeResponse type - StatusResponse gets optional auth_required + auth_providers fields so the existing StatusPage can render a gated/loopback badge. hermes_cli/web_server.py: /api/status payload now includes - auth_required: bool — whether app.state.auth_required is True - auth_providers: list[str] — registered DashboardAuthProvider names Lazy-imports list_providers so early-startup status calls don't crash if the dashboard_auth module is still being set up. tests/hermes_cli/test_dashboard_auth_status_endpoint.py: 3 new tests covering the new status fields in both gated and loopback modes plus a regression that no existing field got dropped from the payload. The hermes status CLI is unchanged in this commit — that command tracks model providers + OAuth credentials, not running-dashboard state. The /api/status endpoint is the canonical place to query dashboard auth-gate state, consumed by the React StatusPage already.	2026-05-27 02:12:27 -07:00
Ben	5e9308b5b8	feat(dashboard-auth): Phase 6 — 401 re-auth envelope + next= propagation Contract V1 of nous-account-service PR #180 ships no refresh tokens, so the original Phase 6 silent-refresh design is replaced with a thinner '401 → redirect to /login' UX. The dashboard's gated middleware now emits a structured envelope on any auth failure; the SPA's fetch wrapper sees it and full-page-navigates the user through re-auth. hermes_cli/dashboard_auth/cookies.py: set_session_cookies(refresh_token='') SKIPS writing the hermes_session_rt cookie. Forward-compat: a non-empty refresh_token still emits the cookie unchanged, so a future Portal contract that starts issuing RTs flips the persistence on with no other change. clear_session_cookies still emits a Max-Age=0 deletion for the RT cookie so stale cookies from earlier deployments get flushed on logout / session expiry. Deprecation marker + rationale in module docstring per the user's docstring-only deprecation pattern. hermes_cli/dashboard_auth/middleware.py: _unauth_response now builds a structured JSON envelope for API 401s: { error: 'session_expired' \| 'unauthenticated', detail: 'Unauthorized', reason: <internal>, login_url: '/login?next=<safe-path>' } HTML redirects also carry next= so a user landing on /sessions without a cookie bounces back to /sessions after re-auth. _safe_next_target validates same-origin: drops protocol-relative paths (//evil.com), absolute URLs, and any /login or /auth/* loop. Dead cookies are cleared on the 401 path so the browser stops replaying invalid tokens. hermes_cli/dashboard_auth/routes.py: /auth/callback accepts next= query param and validates via _validate_post_login_target (same rules as the gate's _safe_next_target — defence-in-depth because next= survived a full IDP round trip and attacker-controlled state can re-enter via the callback URL). Open-redirect attempts land at '/' instead. web/src/lib/api.ts: fetchJSON parses the 401 envelope and full-page-navigates to body.login_url ONLY on the known session-expiry error codes. Domain-level 401s (e.g. permission errors) bubble up as regular errors. credentials: 'include' added so cookie auth works for all fetches routed through this wrapper. sessionStorage.lastLocation is preserved for future use by AuthWidget / hermes_status. Test files marked with pytest.mark.xdist_group so the four files that mutate web_server.app.state.auth_required serialize onto the same xdist worker — eliminates 'works locally, fails in CI' app-state bleed. 20 new tests in test_dashboard_auth_401_reauth.py: - set_session_cookies(refresh_token='') skips RT cookie - clear_session_cookies still emits RT deletion - 401 envelope shape (unauthenticated vs session_expired) - dead cookie cleared on invalid-token 401 - login_url carries next= for deep paths - login loop avoided when path is /login/auth/api-auth - protocol-relative URL rejected - _safe_next_target unit tests (accept same-origin, reject loops/abs) - /auth/callback respects safe next= but rejects open redirects 2 pre-existing tests updated to accept the new /login?next=%2F shape. Full dashboard-auth suite: 168 passed, 1 skipped (Phase 0 pre-existing).	2026-05-27 02:12:27 -07:00
Ben	b2360ba44e	feat(dashboard-auth): _ws_auth_ok helper + ticket auth on all 4 WS endpoints Phase 5 task 5.2. Four WebSocket endpoints — /api/pty, /api/ws, /api/pub, /api/events — previously authed with the same constant-time check against `_SESSION_TOKEN`. Replaced with a single helper that branches on `app.state.auth_required`: Loopback / --insecure: legacy ?token=<_SESSION_TOKEN> path (unchanged). Gated: ?ticket=<single-use> consumed against the dashboard-auth ticket store. Critical security property: gated mode UNCONDITIONALLY rejects the ?token= path. A leaked _SESSION_TOKEN value from a log line is not replayable for WS access in gated deployments. `_build_sidecar_url` now branches too: loopback uses the legacy token; gated mode mints a server-internal ticket via mint_ticket() with pseudo-user 'pty-sidecar' / provider 'server-internal' so audit logs can distinguish PTY-internal sidecar tickets from browser tickets. PTY children open /api/pub exactly once at startup so single-use suffices. Ticket rejections audit-log as WS_TICKET_REJECTED with truncated reason + client IP + WS path. Operators debugging 'WS keeps closing' issues see which endpoint and why. 17 new tests: - POST /api/auth/ws-ticket: 200 with cookie, 401/302 without, distinct per call, GET-not-allowed. - _ws_auth_ok loopback: token accept/reject, missing-token reject, ticket-param-ignored. - _ws_auth_ok gated: ticket accept, single-use rejection, unknown reject, legacy-token-rejected-in-gated assertion, audit-log emission. - _build_sidecar_url: loopback uses token=, gated uses ticket=, no-bound returns None.	2026-05-27 02:12:27 -07:00
Ben	b69fce9c86	feat(dashboard-auth): single-use WS tickets + POST /api/auth/ws-ticket Phase 5 task 5.1. Browsers cannot set Authorization on a WebSocket upgrade, so in gated mode the SPA needs an alternative way to bind the upgrade to its authenticated session. hermes_cli/dashboard_auth/ws_tickets.py — in-memory single-use ticket store with 30s TTL. Thread-safe (threading.Lock), token_urlsafe(32) values, ticket value truncated to 8 chars in error messages for log hygiene. Module-level state with _reset_for_tests() helper. hermes_cli/dashboard_auth/routes.py — adds POST /api/auth/ws-ticket. Auth-required (the gate middleware already attaches Session to request.state.session). Returns {ticket, ttl_seconds}; emits WS_TICKET_MINTED audit event with user_id + provider + ip. hermes_cli/dashboard_auth/audit.py — adds WS_TICKET_REJECTED enum value for the consume-side rejection event (wired into the WS endpoints in task 5.2). 11 new tests covering round-trip, single-use, TTL boundary, unknown ticket rejection, secret-hygiene truncation in error messages, and concurrent mint+consume from 20 threads.	2026-05-27 02:12:27 -07:00
Ben	848baeb0a8	feat(dashboard-auth): plugins/dashboard_auth/nous — contract-compliant Nous OAuth provider Bundled, kind=backend, auto-loads. Activates ONLY when Portal-injected env vars are present: HERMES_DASHBOARD_OAUTH_CLIENT_ID — agent:{instance_id} HERMES_DASHBOARD_PORTAL_URL — Portal base URL Loopback / --insecure operators leave both unset and never see this plugin register anything. The fail-closed branch in start_server handles the 'public bind + zero providers' case independently. Implementation follows nous-account-service PR #180's published OAuth contract verbatim: - client_id is per-instance (agent:{instance_id}); the suffix is cross-checked against the token's agent_instance_id claim as defense-in-depth (contract C9). - scope is agent_dashboard:access only (contract C3). - aud is the bare client_id, no hermes-cli: prefix (contract C2). - RS256 JWT verification against /.well-known/jwks.json with 5-minute cache (contract C7). - No refresh tokens in V1: refresh_session always raises RefreshExpiredError; revoke_session is a no-op (contract C5). - oauth_contract_version claim: missing → warn + proceed; present and != 1 → refuse (contract C11, OQ-C2 tolerant treatment). - redirect_uri validated client-side as defense before bouncing to Portal; authoritative check is server-side per agent-redirect-uri.ts. 41 new tests covering construction, plugin-entry env gating, start_login shape, complete_login httpx-mocked happy path + error mapping, verify_session JWT verification (RSA keypair fixture, full claim-check matrix), refresh_session always raising, revoke_session no-op. PyJWT + cryptography are already in the venv (jose was previously suggested; switched to pyjwt[crypto] since the latter is already pulled in transitively).	2026-05-27 02:12:27 -07:00
Ben	53736b3922	feat(dashboard-auth): fail-closed on no providers; proxy_headers when gated; suppress _SESSION_TOKEN injection Phase 3, Task 3.5. Three changes to web_server.py: 1. start_server replaces the legacy SystemExit-refusing-to-bind guard with: if app.state.auth_required and no providers registered, exit with a clear message; otherwise log the gate-on banner. --insecure keeps its existing behaviour. 2. uvicorn proxy_headers flag is computed from app.state.auth_required. Loopback / --insecure keep it False (so _ws_client_is_allowed sees the real peer for the loopback gate); gated mode flips it True so X-Forwarded-Proto from Fly's TLS terminator is honoured for cookie Secure-flag decisions in detect_https(). 3. _serve_index no longer injects window.__HERMES_SESSION_TOKEN__ when the gate is on — the SPA reads identity from /api/auth/me using cookie auth instead. window.__HERMES_AUTH_REQUIRED__ flag lets the SPA pick between ticket-auth (gated) and token-auth (loopback) for /api/pty + /api/ws (Phase 5 will wire this in the React layer). 4 new behavioural tests; loopback regression harness still green.	2026-05-27 02:12:27 -07:00
Ben	5b17eab67a	feat(dashboard-auth): auth gate middleware + /auth/* routes + /login HTML Phase 3, Tasks 3.2 + 3.3 + 3.4. These three pieces are mutually dependent so they land together. middleware.py - gated_auth_middleware engages when app.state.auth_required is True. Allowlists /login, /auth/, /api/auth/providers, and static asset paths; everything else demands a valid session_at cookie. Verifies by trying every registered provider's verify_session in turn (multi- provider stack); attaches verified Session to request.state.session. Returns 401 JSON for /api/ and 302 -> /login for HTML. ProviderError during verify -> 503. routes.py - APIRouter with: GET /login server-rendered HTML GET /auth/login?provider=N 302 to IDP + PKCE cookie GET /auth/callback?code,state completes login, sets session cookies POST /auth/logout clears cookies + best-effort revoke GET /api/auth/providers public bootstrap endpoint (503 if zero) GET /api/auth/me verified session as JSON (auth-required) login_page.py - Inline-CSS HTML template, no React, no JavaScript. web_server.py - Mounted gated_auth_middleware between host_header and auth_middleware (FastAPI runs middlewares in registration order: host check -> cookie auth -> token auth). auth_middleware short-circuits when auth_required so cookie auth is authoritative in gated mode. Router is included before mount_spa so the catch-all doesn't swallow /login or /auth/*. 17 new behavioural tests; loopback regression harness still green.	2026-05-27 02:12:27 -07:00
Ben	a30c4d8ebd	feat(dashboard-auth): cookie helpers for session_at/session_rt/pkce Phase 3, Task 3.1. Three cookies: - hermes_session_at: OAuth access token (HttpOnly, TTL = token TTL) - hermes_session_rt: OAuth refresh token (HttpOnly, 30d max-age) - hermes_session_pkce: PKCE state + verifier + provider hint (10min) All SameSite=Lax + Path=/. Secure flag is set ONLY when the request scheme is https — uvicorn proxy_headers=True (enabled in gated mode at Phase 3.5) rewrites scheme from X-Forwarded-Proto so Fly's TLS terminator works.	2026-05-27 02:12:27 -07:00
Ben	628a52fce2	test(dashboard-auth): stub auth provider for E2E gate testing Phase 2, Task 2.1. Self-contained fake IDP — start_login redirects straight back to {redirect_uri}?code=stub_code&state=<s> so tests can walk the OAuth round trip in-process. Tokens are HMAC-signed JSON blobs (not real JWTs) — enough structure for verify_session to detect tamper and expiry without pulling in pyjwt. Lives in tests/ only — never registered as a real plugin. Phase 3's end-to-end tests import StubAuthProvider directly. Convention: exp <= now counts as expired (TTL=0 means born-expired) — matches what Phase 6's silent-refresh test will need.	2026-05-27 02:12:27 -07:00
Ben	865cae4f61	feat(dashboard-auth): json-lines audit log at $HERMES_HOME/logs/dashboard-auth.log Phase 1, Task 1.4. Records every auth event (login start/success/failure, logout, refresh success/failure, revoke, session verify failure, WS ticket mint) as one JSON object per line. Token-like kwargs (access_token, refresh_token, code, code_verifier, state, ticket, cookie, Authorization) are dropped before serialisation so the log never contains live secrets. Write failures log at WARNING but never raise — auth flows must not fail because the audit logger broke.	2026-05-27 02:12:27 -07:00

1 2 3 4 5 ...

4414 commits