hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-17 14:42:06 +00:00

Author	SHA1	Message	Date
Robin Fernandes	406901b27d	feat(auth): normalise how we check whether a user has free/paid access to Nous Portal, so we can expose behaviour and error messages accordingly.	2026-05-28 00:19:31 -07:00
Squiddy	3ba8962738	fix(kanban): add Windows init lock guard	2026-05-27 23:28:51 -07:00
Squiddy	90b6b3d18f	fix(kanban): harden sqlite connection concurrency	2026-05-27 23:28:51 -07:00
Ben	b924b22a9d	fix(docker): `hermes update` prints `docker pull` guidance instead of bogus git error Inside the published Docker image, `hermes update` was hitting the ".git missing → reinstall via curl" fallback: ✗ Not a git repository. Please reinstall: curl -fsSL https://raw.githubusercontent.com/.../install.sh | bash That message is wrong on two counts: 1. It tells the user to run the host-side installer, which would install a new Hermes on the host — not update the running container. 2. It doesn't mention `docker pull` at all, leaving Docker users to figure out the right action from scratch. `hermes update --check` was worse: it bailed with "Not a git repository — cannot check for updates." and nothing else. Fix: detect the Docker install method (already stamped by `docker/stage2-hook.sh` and surfaced by `detect_install_method()`) in both update entry points and print a long-form message that covers: - The right command: `docker pull nousresearch/hermes-agent:latest` - Restart guidance (`docker compose up -d --force-recreate` / re-run `docker run`) - How to verify the new version after restart - Tag-pinning caveat (`:latest` doesn't move a pinned tag) - Config persistence across upgrades (state under `HERMES_HOME` / `/opt/data` is bind-mounted and survives) - Fork escape hatch (build your own image with the repo's Dockerfile) Exit code is 1 (matches `managed_error` semantic for "tried to update but can't update this way"). Plumbing: - hermes_cli/config.py: new `format_docker_update_message()` helper sits next to the existing `_NIX_UPDATE_MSG` / `format_managed_message()` family so the wording lives in one place and both call sites (apply path + check path) consume it. - hermes_cli/main.py: * `cmd_update()`: bail right after the `is_managed()` gate, before any of the apply-path branches. * `_cmd_update_check()`: bail at the top of the function, before the existing `method == "pip"` branch. Neither path touches subprocess.run / git when method == "docker".
Coverage: - 7 new tests in `tests/hermes_cli/test_cmd_update_docker.py`: * `hermes update` in Docker → message + exit 1, no git calls * `hermes update --check` (via cmd_update) → same * `--yes` / `--force` don't bypass (intentional) * `_cmd_update_check` called directly → bails too * git/pip installs still take their normal paths (regression guards) * `format_docker_update_message` content-lock test pinning the five user-actionable bits the message must contain - Existing test_cmd_update.py (21 tests) + test_managed_installs.py (5 tests) still pass — no regression on the source-install path. - Verified end-to-end in a real container: `docker run ... update` and `docker run ... update --check` both render the message and exit 1.	2026-05-28 15:50:25 +10:00
Ben	66489f38c7	fix(docker): bake build-time git SHA into the image `hermes dump` and the startup banner both call `git rev-parse HEAD` to report the running commit, but `.dockerignore` line 2 excludes `.git` — so inside the published image `hermes dump` shows `version: ... [(unknown)]` and the banner drops its `· upstream <sha>` suffix entirely. That makes support triage from container bug reports impossible: we can't tell which commit the user is actually running. Fix: thread the build-time SHA through as a Docker build-arg, write it to `/opt/hermes/.hermes_build_sha` in the image, and have a new `hermes_cli/build_info.get_build_sha()` read it as a fallback after the existing live-git lookup fails. Output format is unchanged in both callsites — same 8-char short SHA whether resolved live or baked. Wiring: - Dockerfile: `ARG HERMES_GIT_SHA=` + write-file step after the source copy. Empty/missing arg → no file written → callers fall through to live git (so local `docker build` without --build-arg is unchanged). - docker-publish.yml: passes `HERMES_GIT_SHA=${{ github.sha }}` on all four build-push-action steps (amd64/arm64, smoke-test + final push). - dump.py:_get_git_commit() / banner.py:get_git_banner_state(): try live git first, fall back to baked SHA, then to legacy `(unknown)` / None. Banner returns `upstream == local, ahead=0` because a built image is by definition pinned to one commit. Coverage: - Unit tests cover build_info (file present/absent/empty/error, truncation, whitespace), dump (live-git wins, both fallbacks, identical output-format regression guard), and banner (no-repo + baked, no-repo + no-sha, shallow-clone fallback). - tests/docker/test_dump_build_sha.py is an integration regression guard that runs against the real image, reads `/opt/hermes/.hermes_build_sha`, and asserts `hermes dump` surfaces its content (or stays at `(unknown)` if no file). - Verified end-to-end: `docker build --build-arg HERMES_GIT_SHA=abc...` → `docker run ... 
dump` reports `[abc12345]`; without the build-arg it reports `[(unknown)]` as before.	2026-05-28 15:14:05 +10:00
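The baked-SHA fallback described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's `hermes_cli/build_info` module: the file path and the 8-character truncation come from the commit message, while the function signature, the `length` parameter, and the exact error handling are assumptions.

```python
from pathlib import Path

# Path named in the commit message; written at image build time.
BUILD_SHA_FILE = Path("/opt/hermes/.hermes_build_sha")


def get_build_sha(path=BUILD_SHA_FILE, length=8):
    """Return the baked short SHA, or None if the file is absent, empty, or unreadable.

    Hypothetical signature: the real helper may take no arguments.
    """
    try:
        text = Path(path).read_text(encoding="utf-8").strip()
    except (OSError, UnicodeDecodeError):
        # Missing file or unreadable content: the caller falls through
        # to its legacy "(unknown)" / None behaviour.
        return None
    return text[:length] or None
```

A caller would try the live `git rev-parse HEAD` lookup first and consult this helper only when that fails.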
teknium1	ebe04c66cd	fix(kanban): close kanban.db FD after every connect() in long-lived processes `sqlite3.Connection.__exit__` commits/rollbacks but does NOT close the underlying FD. `with kb.connect() as conn:` in long-lived processes (gateway `run_slash`, dashboard `decompose_task_endpoint`) therefore leaks one FD to `kanban.db` per call. After enough operations the gateway dies with `[Errno 24] Too many open files` (~4 days uptime in the production report — #33159). Fix: add a `connect_closing()` context manager in `hermes_cli/kanban_db` that wraps `connect()` with a real `try/finally: conn.close()`. Switch the 42 leak-prone call sites in `hermes_cli/kanban.py` (35), `hermes_cli/kanban_decompose.py` (4), and `hermes_cli/kanban_specify.py` (3) over to it. `kanban.py` matters because `run_slash` (called from the gateway for every `/kanban` slash command) parses argparse and dispatches to those `_cmd_*` functions in-process — each one was leaking one FD per invocation. Tests inside `tests/` are untouched: short-lived processes where OS cleanup masks the leak. Regression tests added in `test_kanban_db.py` cover both happy-path and exception-path closure, plus an explicit assertion that bare `with kb.connect()` still does NOT close (documenting the upstream sqlite3 behaviour we're working around). Closes #33159.	2026-05-27 22:07:49 -07:00
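The FD leak in the commit above follows from documented `sqlite3` behaviour: using a connection as a context manager commits or rolls back, but does not close it. A minimal sketch of a closing wrapper, assuming a plain path argument (the real `connect_closing()` in `hermes_cli/kanban_db` wraps the project's own `connect()` and may differ), could look like this:

```python
import contextlib
import sqlite3


@contextlib.contextmanager
def connect_closing(path):
    """Yield a connection that is committed/rolled back AND closed on exit."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commit on success, rollback on exception
            yield conn
    finally:
        conn.close()  # releases the file descriptor on every path


def is_closed(conn):
    """Helper for illustration: a closed connection rejects further use."""
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError:
        return True
    return False
```

With this wrapper, `is_closed(conn)` is true after the `with connect_closing(...)` block exits, whereas a bare `with sqlite3.connect(...)` leaves the connection open.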
Dusk	c341a2d107	fix(docker): align HOME for dashboard and s6 gateway services (#33481)	2026-05-28 13:42:27 +10:00
Ben Barclay	b345323195	fix(docker): tee supervised gateway stdout to docker logs Follow-up to #33583 (the gateway-run-supervised redirect). Before this fix, the supervised gateway's stdout (most visibly the "Hermes Gateway Starting…" rich-console banner) was swallowed by `s6-log` into the rotated file at `${HERMES_HOME}/logs/gateways/<profile>/current` and never reached `docker logs`. Operational signal lived in two places: * docker logs — saw stderr (Python `logging` defaults to stderr), so warnings/errors were visible. * the rotated file — saw stdout (rich banners, `print()` output, third-party libs that wrote to fd 1). This was surprising for users coming from the pre-s6 image, where `docker run … gateway run` produced a single unified stream in `docker logs`. They'd see partial output, conclude something was broken, and dig around for the missing pieces. Fix: add the `1` s6-log action directive before the file destination so each line is forwarded to s6-log's stdout — which propagates up the s6-supervise pipeline to /init's stdout = container stdout = `docker logs`. The file destination is preserved as a second destination, so the rotated log (with ISO 8601 timestamps) still exists for `hermes logs` and for survival across container restarts. Trade-off considered: timestamps. Putting `T` between `1` and the file destination (not before `1`) means: * docker logs sees raw lines — Python's logging formatter has its own timestamps, and `docker logs --timestamps` adds another layer when desired. No double-stamping in the common reading path. * The persisted file gets s6-log's ISO 8601 timestamp so even output that lacked a Python-logger timestamp (rich banners, third-party raw prints) is correlatable in `current`. Verification: * New unit-test assertion in `test_service_manager.py` locks the `s6-log 1` directive into the rendered run-script. Mutation- tested by reverting to the pre-fix script (no `1`); the assert catches it cleanly. 
* New docker-harness test `test_supervised_gateway_stdout_reaches_docker_logs` builds the image, runs `docker run … gateway run`, and asserts the unique `⚕` banner glyph reaches `docker logs`. Also verifies the rotated file still contains the banner (no regression on the existing file destination). Mutation-tested end-to-end: built a deliberately-broken image without the `1` directive and the test failed exactly as designed, citing the banner present in `current` but absent from `docker logs`. * `website/docs/user-guide/docker.md` gains a new `:::note Where gateway logs go` admonition documenting both destinations and the audit-log file at `${HERMES_HOME}/logs/container-boot.log`. Existing functionality preserved: every other docker-harness test still passes against the new image. Unit-test sweep across `tests/hermes_cli/` (5561 tests) is green.	2026-05-28 13:18:41 +10:00
brooklyn!	912e6e2274	fix(tui): suppress mouse-residue leaks during Python launcher startup (#31213) * fix(tui): suppress mouse-residue leaks during Python launcher startup `hermes --tui …` spends ~100–300ms inside the Python launcher (lazy imports, arg parsing, session resolution) before exec'ing the Node TUI binary. During that window stdin is still in cooked + echo mode. If a prior session left DEC mouse tracking asserted (or the user spammed mouse movement while the previous session was opening), the terminal keeps emitting `\x1b[<…M` SGR motion reports that get echoed straight back into the user's shell scrollback as literal `^[[<…M` text and sit there above the TUI banner until the next clear. The Node side already calls `resetTerminalModes()` in `entry.tsx`, but by then the race is already lost — the bytes echoed during the Python warmup window were committed to the scrollback before Node started. Fix: write the mouse-tracking disable sequence at the very top of `hermes_cli.main`, before every heavy import. The terminal stops emitting motion events as soon as the bytes hit the wire (one TTY round-trip), shrinking the race window from hundreds of milliseconds to a few. `HERMES_TUI_NO_EARLY_DISABLE=1` opts out for diagnostics. * test(tui): drop dead _reload_main, hoist import out of patch context Addresses Copilot review on PR #31213. The tests used to import `hermes_cli.main` inside the `patch("os.write")` context, which Copilot pointed out is order-dependent: if the module is already loaded (e.g. imported by a prior test in the same process), the import is a no-op and the patch only sees the explicit `_suppress_mouse_residue_early()` call. Either way the assertion can flake when run alongside other tests. Move the import to module scope — every subprocess gets a fresh `hermes_cli.main`, whose module-level invocation is a no-op under pytest argv. Tests then exercise `_suppress_mouse_residue_early()` directly inside their own patch context.
Also drop the unused `_reload_main` helper. * fix(tui): skip early mouse-disable when stdout is not a TTY Addresses Copilot review on PR #31213. `hermes --tui … >log` or CI capture pipes fd 1 away from the terminal. The disable bytes can't reach the terminal in that case but would still get written into the log file as raw CSI sequences. Guard with `os.isatty(1)` inside the existing `try/except OSError` block so the 'never break startup' contract holds. * docs(tui): rephrase 'raw cooked mode' as 'cooked + echo mode' Copilot review nit on PR #31213 — the original wording was self- contradictory. Pre-TUI stdin state is cooked + echo (kernel TTY discipline still owns the line buffer and echoes input back). The TUI switches it to raw mode later when Ink mounts.	2026-05-27 22:03:45 -05:00
Ben Barclay	0927fb5584	feat(docker): auto-redirect `gateway run` to supervised mode inside s6 image Pre-s6, `docker run nousresearch/hermes-agent gateway run` was the standard invocation: gateway ran as the container's main process, tini reaped zombies, container exit code matched gateway exit code, no supervision. With s6-overlay as PID 1, the same invocation now auto-upgrades to supervised semantics — auto-restart on crash, dashboard supervised alongside (when HERMES_DASHBOARD=1 is set), multiple profile gateways under the same /init. Users get the new behavior with zero changes to their docker run command. A loud one-line breadcrumb on stderr explains the upgrade and points at the opt-out for users who genuinely want pre-s6 foreground semantics. How it works: 1. `_gateway_command_inner` (the `gateway run` handler) checks if we're inside a container with s6 as PID 1. 2. If yes, dispatches `start` to the s6 service manager (registers and starts gateway-default), then `exec sleep infinity` to keep the CMD process alive without binding container lifetime to gateway PID lifetime. The supervised gateway can flap freely; `docker stop` still tears everything down via /init stage 3. 3. If no, falls through to the existing foreground code path unchanged. Host runs of `hermes gateway run` are unaffected. Three gates make the redirect inert outside the intended scope: * `detect_service_manager() != "s6"` — host/non-s6-container runs. * `HERMES_S6_SUPERVISED_CHILD=1` env var (recursion guard) — exported by `S6ServiceManager._render_run_script` for the s6-supervised invocation itself. Without this guard, the supervised `gateway run --replace` would re-enter the redirect and recurse (run → start → run → start → ...) infinitely. * `--no-supervise` CLI flag OR `HERMES_GATEWAY_NO_SUPERVISE=1` env var — explicit user opt-out for CI smoke tests, debugging the foreground startup path, or any case wanting "CMD exit = container exit" semantics. 
Strict truthiness (1/true/yes, case-insensitive); typos like `=0` do NOT silently opt out. Tests: * Unit tests in tests/hermes_cli/test_gateway_s6_dispatch.py cover all five paths (host no-op, supervised fire, sentinel recursion guard, CLI flag, env var truthy + falsy). The two load-bearing gates (sentinel + opt-out) were mutation-tested by removing each gate in isolation and confirming the dedicated test fails with the expected error. * Docker harness tests in tests/docker/test_gateway_run_supervised.py cover the round trips end-to-end against a built image: redirect fires (sleep-infinity heartbeat + supervised gateway-default slot + breadcrumb), --no-supervise opt-out (foreground gateway, no want-up on the slot), HERMES_GATEWAY_NO_SUPERVISE env var works identically, recursion is impossible (≤1 supervised python gateway-run + exactly 1 sleep-infinity parented to the CMD wrapper), and HERMES_DASHBOARD=1 produces both supervised gateway and supervised dashboard. Docs: * Added a `:::tip Gateway runs supervised` admonition near the main docker.md example explaining the upgrade and pointing at the opt-out. Pre-s6 (tini-based) images still run gateway run as the foreground main process, so the note is scoped to the s6 image only. Trade-off documented in the helper docstring: container exit code under the redirect is sleep's exit code (always 0 on SIGTERM), not the gateway's. That was an explicit design call — the supervised gateway is allowed to flap without taking the container with it, which is what "supervision" means. CI users who want exit-code forwarding can pass --no-supervise.	2026-05-28 12:42:13 +10:00
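The strict-truthiness rule for `HERMES_GATEWAY_NO_SUPERVISE` described above (1/true/yes, case-insensitive, with `=0` not opting out) can be illustrated with a small helper. The function name, the `environ` parameter, and the whitespace stripping are assumptions made for this sketch, not the project's code.

```python
import os

# Accepted spellings per the commit message: 1 / true / yes, case-insensitive.
_TRUTHY = frozenset({"1", "true", "yes"})


def env_flag(name, environ=None):
    """Return True only for an explicitly truthy value of the variable.

    Hypothetical helper: unset, empty, "0", "false", or any typo is False,
    so a mistyped value never silently opts out of supervision.
    """
    env = os.environ if environ is None else environ
    # Stripping surrounding whitespace is an assumption of this sketch.
    return env.get(name, "").strip().lower() in _TRUTHY
```

For example, `env_flag("HERMES_GATEWAY_NO_SUPERVISE", {"HERMES_GATEWAY_NO_SUPERVISE": "YES"})` is true, while a value of `"0"` or `"on"` is false.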
Stephen Chin	ffdc937c18	fix(kanban): hoist zombie reaper out of dispatch_once Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows. Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix. Squashes three sibling commits from PR #32301 into one logical change for batch review.	2026-05-27 14:31:55 -07:00
steveonjava	99c19eb2fe	fix(kanban): add post-commit page_count invariant check to write_txn Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint) Refs: #30445, #30896, #30908 (corruption reports)	2026-05-27 14:31:55 -07:00
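The invariant in the commit above rests on the SQLite file format: header bytes 16-17 hold the page size and bytes 28-31 hold the database size in pages, both big-endian. A rough standalone sketch of the comparison follows; the function name and the handling of very short files are assumptions, and the real check runs inside `write_txn` after every COMMIT.

```python
import os
import sqlite3
import struct


def check_page_count(path):
    """Raise sqlite3.DatabaseError if the file is shorter than its header claims."""
    with open(path, "rb") as fh:
        header = fh.read(100)  # the SQLite database header is 100 bytes
    if len(header) < 100:
        return  # assumption: nothing to verify on an empty or brand-new file
    (page_size,) = struct.unpack(">H", header[16:18])
    if page_size == 1:  # the file format encodes 65536 as 1
        page_size = 65536
    if page_size == 0:
        raise sqlite3.DatabaseError("invalid page size in header")
    (page_count,) = struct.unpack(">I", header[28:32])
    actual_pages = os.path.getsize(path) // page_size
    if actual_pages < page_count:
        # Torn extend: the header promises more pages than exist on disk.
        raise sqlite3.DatabaseError(
            f"torn extend: header says {page_count} pages, file has {actual_pages}"
        )
```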
Stephen Chin	c002668ff0	fix(kanban): add grace period to detect_crashed_workers `detect_crashed_workers` calls `_pid_alive` on every `running` task whose claim is held by this host. The check can transiently return False for a freshly-spawned worker (fork → /proc-visibility lag, or reap-race between SIGCHLD and parent reaping). When a second dispatcher ticks inside that window it reclaims the task and spawns a duplicate worker. Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an `HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override. `detect_crashed_workers` skips the liveness check when `time.time() - started_at < grace`. The existing 15-minute claim TTL still reclaims genuinely-crashed workers; grace only suppresses the launch-window false positive. `HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home` fixture in `test_kanban_core_functionality.py` so existing tests that assert immediate reclaim retain pre-fix semantics. Companion to merged PR #23442 (`release_stale_claims`, closes #23025), which addressed the same multi-dispatcher race in the stale-claim path. Related: #20015 (`_pid_alive` false-negative behaviour),	2026-05-27 14:31:55 -07:00
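The grace-period rule above reduces to one comparison, `time.time() - started_at < grace`. A minimal sketch under stated assumptions: the constant and environment variable names come from the commit message, while both function names, the float parsing, and the fallback on an unparseable value are hypothetical.

```python
import os
import time

DEFAULT_CRASH_GRACE_SECONDS = 30
ENV_VAR = "HERMES_KANBAN_CRASH_GRACE_SECONDS"


def crash_grace_seconds(environ=None):
    """Resolve the grace period, honouring the env-var override."""
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR)
    if raw is None:
        return float(DEFAULT_CRASH_GRACE_SECONDS)
    try:
        value = float(raw)
    except ValueError:
        # Assumption: an unparseable override falls back to the default.
        return float(DEFAULT_CRASH_GRACE_SECONDS)
    return max(value, 0.0)


def in_grace_window(started_at, grace, now=None):
    """True while a freshly spawned worker should be exempt from the liveness check."""
    now = time.time() if now is None else now
    return now - started_at < grace
```

Setting the override to `0` makes `in_grace_window` always false, which matches the immediate-reclaim semantics the test fixture relies on.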
Stephen Chin	e83252dc46	fix(kanban): preserve original exception when write_txn rollback fails When code inside a write_txn block raises an OperationalError that SQLite has already auto-rolled-back (typical for disk I/O error, database is locked, and database disk image is malformed), the explicit ROLLBACK in write_txn.__exit__ itself raises cannot rollback - no transaction is active and the secondary exception replaces the original in the traceback. Operators see a misleading error and lose the diagnostic information they need. Swallow the rollback-time OperationalError so the caller always sees the original cause. Confirmed reproducer: tests/hermes_cli/test_kanban_db.py:: test_write_txn_preserves_original_exception_when_rollback_fails	2026-05-27 14:31:55 -07:00
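The exception-masking problem above can be reproduced with the standard `sqlite3` module. The following is a simplified sketch, not the project's `write_txn`: it assumes a connection opened with `isolation_level=None` (so transactions are managed manually) and uses `BEGIN IMMEDIATE` purely for illustration.

```python
import contextlib
import sqlite3


@contextlib.contextmanager
def write_txn(conn):
    """Run a write transaction; never let a failed ROLLBACK hide the real error."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
    except BaseException:
        try:
            conn.execute("ROLLBACK")
        except sqlite3.OperationalError:
            # SQLite may already have rolled back on its own; the explicit
            # ROLLBACK then fails with "no transaction is active". Swallow it.
            pass
        raise  # re-raise the ORIGINAL exception
    else:
        conn.execute("COMMIT")
```

The bare `raise` runs after the inner handler has finished, so the caller sees the exception from the transaction body rather than the rollback failure.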
Stephen Chin	6416dd5187	fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL) Production corruption #6 left b-tree pages with zeroed headers but intact old cell content — the Bug E pattern. This fix applies three pragma calls on every connect(): - synchronous=FULL (was NORMAL): closes the WAL-checkpoint reordering window where a crash between WAL commit and main-DB write leaves a partially-written b-tree page header. Cost is <1ms per commit on local SSD; negligible at kanban write volume. - secure_delete=ON: forces SQLite to zero freed page bytes on disk. If a torn write or hardware fault later corrupts a page, the underlying cell content is zero, so corruption is detectable and no stale rows can resurface as live data. - cell_size_check=ON: adds a read-side guard so corrupt cells surface as errors at read time rather than as silent wrong-data returns. All three are connection-scoped and re-applied on every connect(). secure_delete also writes a persistent flag into the DB header on the first call against a fresh DB, making the protection durable across processes for new DBs. Tests added for all four required cases: each pragma active on a fresh connection, and all three re-applied after close+reopen. Also adds the required negative test (migration path does not reset pragmas).	2026-05-27 14:31:55 -07:00
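The three pragmas named above are standard SQLite pragmas and can be applied right after opening a connection. A minimal sketch, assuming a bare path argument and a hypothetical function name (the project's `connect()` does more than this):

```python
import sqlite3


def connect_hardened(path):
    """Open a connection and apply the three hardening pragmas from the commit."""
    conn = sqlite3.connect(path)
    # Sync at every commit point, closing the reordering window described above.
    conn.execute("PRAGMA synchronous=FULL")
    # Overwrite freed content with zeros so stale rows cannot resurface.
    conn.execute("PRAGMA secure_delete=ON")
    # Validate cell sizes when pages are read, so corruption surfaces as an error.
    conn.execute("PRAGMA cell_size_check=ON")
    return conn
```

Because the settings are applied per connection, every new connection has to repeat these calls, which is why the commit re-applies them on every `connect()`.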
wysie	f040710d04	fix: backfill official optional skill provenance	2026-05-27 13:39:58 -07:00
wysie	a38e283395	fix: preserve nested official skill install paths	2026-05-27 13:39:58 -07:00
Franci Penov	6f2a2f157f	fix: check upstream even when origin/main has no new commits The upstream sync logic only ran after a successful origin pull, so forks whose origin/main was already in sync with local (but behind upstream/main) would bail out with "Already up to date!" without ever checking upstream.	2026-05-27 13:10:50 -07:00
Teknium	e8955f222c	fix(codex): drop dead model slugs that HTTP 400 on ChatGPT Pro (#33424) DEFAULT_CODEX_MODELS shipped three slugs that the chatgpt.com Codex backend rejects with HTTP 400 'The <slug> model is not supported when using Codex with a ChatGPT account.' on every account tested live: gpt-5.2-codex gpt-5.1-codex-max gpt-5.1-codex-mini Live verified against https://chatgpt.com/backend-api/codex/models which returns gpt-5.5, gpt-5.4, gpt-5.4-mini, gpt-5.3-codex, gpt-5.3-codex-spark, gpt-5.2 for ChatGPT Pro accounts. When _fetch_models_from_api fell back to DEFAULT_CODEX_MODELS (offline first-run, transient API failure) the picker surfaced these dead slugs and crashed on selection. The forward-compat synthesis table chained them downstream too. If OpenAI re-enables them on the OAuth-backed Codex backend, live discovery will pick them up automatically — the defaults list is only consulted when live discovery is unavailable. Test fixture pivoted to use gpt-5.3-codex (templated by 4 entries) as the synthesis driver so the forward-compat test still exercises the synthesis path.	2026-05-27 12:16:15 -07:00
Teknium	9919caff46	feat(image_gen): add Krea provider plugin (Krea 2 Medium + Large) (#33236) * feat(image_gen): add Krea provider plugin (Krea 2 Medium + Large) New built-in image_gen backend wrapping Krea's Krea 2 foundation image model family. Auto-discovered like the other image_gen plugins and appears in 'hermes tools' → Image Generation → Krea. Krea's API is asynchronous — submit returns a job_id, poll /jobs/{id} until terminal. The provider hides that behind the synchronous ImageGenProvider.generate() contract: submit, poll every 2s with light backoff (max 5s), 3-minute ceiling matching Krea's hosted-tool timeout. Result URL is materialised to $HERMES_HOME/cache/images/ to avoid CDN-expiry 404s downstream (same fix as xAI #26942). Models: - krea-2-medium (default — Krea's 'start here' recommendation) - krea-2-large Aspect ratios map landscape→16:9, square→1:1, portrait→9:16. Resolution: 1K (Krea's only current option). Kwarg passthrough: seed, creativity (raw/low/medium/high), styles, image_style_references (capped 10), moodboards (capped 1) — matches Krea's per-request limits. Unknown kwargs are ignored. Config knobs (config.yaml): image_gen.provider: krea image_gen.krea.model: krea-2-medium | krea-2-large image_gen.krea.creativity: raw | low | medium | high Env overrides: KREA_API_KEY (required), KREA_IMAGE_MODEL. KREA_API_KEY is registered in OPTIONAL_ENV_VARS so 'hermes setup' prompts for it. 31 new tests; image_gen suite + picker + tools_config: 211/211. * fix(image_gen/krea): address review feedback - Update KREA_API_KEY setup URL to the canonical token-creation page (https://www.krea.ai/app/api/tokens). The previous URL returned 404. - Fail fast on non-retryable HTTP statuses during poll. The previous loop retried every HTTPError for the full 180s deadline, so an auth (401), billing (402), forbidden (403), or not-found (404) response would make image_generate hang for three minutes.
Only retry transient statuses (408/409/425/429/5xx); surface everything else immediately. - Add 5 tests covering fail-fast on 401/403/404 and retry on 429/503. * fix(krea): point users at the real API token dashboard URL Three call sites linked users to dashboard pages that don't exist: - hermes_cli/config.py: https://www.krea.ai/app/api/tokens - plugins/image_gen/krea/__init__.py get_setup_schema: https://www.krea.ai/api-keys - plugins/image_gen/krea/__init__.py auth_required error: https://www.krea.ai/api-keys Per Krea's own docs (https://docs.krea.ai/developers/api-keys-and-billing), the real dashboard URL is https://www.krea.ai/settings/api-tokens. All three sites now point there.	2026-05-27 11:01:47 -07:00
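The fail-fast rule from the review-feedback commit above (retry only 408/409/425/429/5xx, surface everything else immediately) amounts to a small status classifier. The function name is hypothetical and the sketch leaves out the polling loop itself.

```python
# Transient statuses listed in the commit message; 5xx is handled as a range.
_RETRYABLE_4XX = frozenset({408, 409, 425, 429})


def is_retryable_status(status):
    """Return True if a poll response with this HTTP status should be retried."""
    return status in _RETRYABLE_4XX or 500 <= status <= 599
```

A poll loop would keep waiting while this returns true and raise at once on statuses such as 401, 402, 403, or 404, instead of burning the full 180-second deadline.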
JohnC1009	414a5bc924	fix(auth): fall back to global auth.json in _load_provider_state In profile mode, _load_provider_state previously returned None when a provider was absent from the profile's auth.json — even if the user had authenticated at the global root. This broke runtime credential resolvers that read state directly (resolve_nous_access_token, resolve_nous_runtime_credentials), causing profiles without their own nous login to fail with 'Hermes is not logged into Nous Portal' despite a valid global session. Push the existing read-only global fallback (already used by get_provider_auth_state and read_credential_pool) into _load_provider_state so every caller benefits, and simplify get_provider_auth_state into a thin wrapper. Writes still target the profile only — profile state continues to shadow global state on the next read after a per-profile login. Behavior in classic (non-profile) mode is unchanged because _load_global_auth_store returns an empty dict. Adds 5 tests covering the new contract on _load_provider_state directly. Existing 770 auth/credential/nous tests still pass.	2026-05-27 09:38:58 -07:00
Teknium	69dfcdcc15	fix(auth): codex chat path falls back to credential_pool when singleton is empty Closes #32992. The chat path resolves Codex credentials via `resolve_codex_runtime_credentials` which only reads `providers.openai-codex.tokens` (the singleton). The auxiliary path uses `_read_codex_access_token` which checks the credential_pool first. For users whose tokens live only in the pool — manual seed, partial re-auth, restore from backup, or any state where the singleton is empty but the pool is healthy — the chat path raised AuthError or (worse, since OpenAI(api_key='') silently attaches no header) the wire saw HTTP 401 "Missing Authentication header" while the auxiliary path worked fine. This adds a pool fallback to `resolve_codex_runtime_credentials`: when the singleton has no usable access_token, scan `credential_pool.openai-codex` for the first entry that has a non-empty access_token and isn't in an exhaustion cooldown window (`last_error_reset_at` in the future). If found, return that token with `source="credential_pool"`. If no usable entry exists, the original AuthError propagates as before. Regression tests cover: - Empty singleton + healthy pool entry → pool token returned - Pool fallback skips entries currently in cooldown - Empty singleton + empty/wedged pool → AuthError propagates (existing contract preserved)	2026-05-27 03:43:51 -07:00
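The pool fallback described above picks the first entry with a non-empty `access_token` that is not inside a cooldown window. A rough sketch follows; the entry shape (a list of dicts) and the interpretation of `last_error_reset_at` as epoch seconds are assumptions for illustration, and the real resolver also reports `source="credential_pool"` and raises `AuthError` when nothing is usable.

```python
import time


def pick_pool_token(entries, now=None):
    """Return the first usable access token from a credential pool, else None.

    Hypothetical helper: `entries` is assumed to be a list of dicts.
    """
    now = time.time() if now is None else now
    for entry in entries:
        token = entry.get("access_token") or ""
        # Assumed to be an epoch timestamp; a future value means "still cooling down".
        reset_at = entry.get("last_error_reset_at") or 0
        if token and reset_at <= now:
            return token
    return None
```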
konsisumer	f1422ffd77	fix(gateway): classify Codex 429 quota as rate-limit, not missing credentials When the Codex OAuth token endpoint returns 429 (usage-limit / quota exhaustion), refresh_codex_oauth_pure raised a generic auth error that the gateway surfaced as 'Primary provider auth failed: No Codex credentials stored. Run hermes auth', prompting re-auth that cannot lift a quota cap. Classify 429 distinctly (codex_rate_limited, relogin_required=False) with a non-alarming quota message that honors Retry-After, log it as 'Primary provider rate-limited (429)', and stop format_auth_error from appending the re-authenticate remediation. Also log the fallback provider's literal config key instead of the resolved runtime category. Refs #32790	2026-05-27 03:13:15 -07:00
konsisumer	2bbd53493d	fix(cli): sync credential_pool on Codex re-auth Codex re-auth via `hermes setup` / `hermes model` wrote fresh OAuth tokens to providers.openai-codex.tokens but left the credential_pool device_code entry holding the consumed refresh token and stale error markers. Since the runtime selects from the pool, the next request spent a dead token and got a 401 token_invalidated. Update the singleton-seeded pool entries in lockstep and clear their error state. Fixes #33000	2026-05-27 03:02:06 -07:00
Robert DaSilva	efa952531b	fix: ignore Telegram start pings	2026-05-27 02:41:24 -07:00
Ben	a890389b69	feat(dashboard-auth): HERMES_DASHBOARD_PUBLIC_URL / dashboard.public_url override Operators behind reverse proxies that don't reliably forward X-Forwarded-Host / X-Forwarded-Proto / X-Forwarded-Prefix (manual nginx setups, on-prem ingresses, custom-domain Fly deploys with incomplete proxy chains) had no way to force the absolute base URL the OAuth callback redirects from. The dashboard would reconstruct the redirect_uri from request headers, the IDP would echo it back, and the user would land on the wrong host or wrong path — 404. Add `dashboard.public_url` to config.yaml with env override HERMES_DASHBOARD_PUBLIC_URL. When set, it is the complete authority — scheme + host + optional path prefix (e.g. https://example.com/hermes) — and becomes the base for the OAuth `redirect_uri`. X-Forwarded-Prefix is IGNORED on this code path because the operator has explicitly declared the public URL; we no longer need to guess from proxy headers, and stacking the prefix on top would double-prefix the common case where the prefix is already baked into public_url. When unset, the existing proxy_headers + X-Forwarded-Prefix reconstruction runs untouched. Existing Fly.io deploys continue to work without configuration — this is purely additive. Precedence mirrors dashboard.oauth.client_id: env (non-empty) > config.yaml > reconstructed from request Implementation: - hermes_cli/config.py: add dashboard.public_url to DEFAULT_CONFIG with a multi-paragraph doc comment explaining the use case, the X-Forwarded-Prefix interaction, and the validation rules. - hermes_cli/dashboard_auth/prefix.py: factored out the existing _REJECT_CHARS frozenset, added _normalise_public_url() validator (requires http/https scheme + non-empty host + no header-injection chars), _load_dashboard_section() loader (robust to load_config raising, non-dict shapes), and resolve_public_url() entry point with the env-overrides-config precedence. 
A malformed value silently falls through to ""; the caller treats "" as "reconstruct from request" so a typo never breaks the login flow. - hermes_cli/dashboard_auth/routes.py: rewrite _redirect_uri() docstring to spell out the three resolution tiers; add the public_url short-circuit before the existing X-Forwarded-Prefix splicing. Source-level comment notes that X-Forwarded-Prefix is intentionally ignored when public_url is set so a future reader doesn't try to "fix" the missing prefix layering. - cli-config.yaml.example: extend the existing dashboard section with a public_url block. - website/docs/user-guide/features/web-dashboard.md: new "Public URL override" section between the provider configuration and the OAuth flow walkthrough. Documents the env-vs-config table, the validation rules, and the `http://` `public_url` ↔ Secure cookie footgun. Test coverage — new TestPublicUrlOverride class (8 tests): - env var overrides request reconstruction (the primary motivating case) - config.yaml used when env unset - env wins over config (precedence pin) - public_url with a path prefix already baked in (the Q1-a case the user explicitly chose) - public_url suppresses X-Forwarded-Prefix layering (defends against the double-prefix bug) - trailing slash stripped from public_url (no //auth/callback) - malformed public_url falls through to reconstruction (six hostile inputs: javascript:, ftp:, missing scheme, missing host, quote chars, CRLF injection) - empty env string doesn't shadow config.yaml entry (CI / Fly provisioned-but-empty secret case) Mutation-tested: flipping the precedence in resolve_public_url() trips exactly test_env_overrides_config_public_url; weakening the validator (accept any scheme) trips exactly test_malformed_public_url_falls_through_to_reconstruction. Both other tests in each pair stay green, confirming the suite discriminates the specific regression each test pins.	2026-05-27 02:12:27 -07:00
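The validation and precedence rules above (http/https scheme, non-empty host, no header-injection characters, trailing slash stripped, env over config, malformed falls through to `""`) can be sketched as below. This is an illustration only: the rejected character set, both function signatures, and the choice to validate whichever raw value wins precedence are assumptions, not the contents of `hermes_cli/dashboard_auth/prefix.py`.

```python
from urllib.parse import urlsplit

ENV_VAR = "HERMES_DASHBOARD_PUBLIC_URL"
# Assumed reject list; the project's _REJECT_CHARS may differ.
_REJECT_CHARS = frozenset("\r\n\t\0\"'<> ")


def normalise_public_url(value):
    """Return a cleaned base URL, or "" when the value is unusable."""
    if not value or any(ch in _REJECT_CHARS for ch in value):
        return ""
    try:
        parts = urlsplit(value)
    except ValueError:  # e.g. a malformed bracketed host
        return ""
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return ""
    return value.rstrip("/")  # avoid "//auth/callback" when joining paths


def resolve_public_url(environ, config):
    """Env (non-empty) wins over config; "" means reconstruct from the request."""
    section = config.get("dashboard") if isinstance(config, dict) else None
    from_config = section.get("public_url") if isinstance(section, dict) else ""
    raw = environ.get(ENV_VAR) or from_config or ""
    return normalise_public_url(raw if isinstance(raw, str) else "")
```

An empty environment value is falsy, so it does not shadow the `config.yaml` entry, which matches the provisioned-but-empty secret case listed in the tests.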
Ben	0af37ff272	style(dashboard-auth): redesign /login page to match Nous design system The login page is the first surface the user sees on a gated dashboard and shipped with off-the-shelf system fonts and a generic orange accent that didn't match the React dashboard waiting on the other side of the OAuth round trip. Apply the same visual language the SPA uses (the @nous-research/ui package) so the auth flow feels like one product, not two. What changes (visual only — no functional changes): Typography - Body: Collapse (regular + bold), served from /fonts/ — the same woff2 files the dashboard SPA loads via the design-system's fonts.css. - Display: Rules Compressed (regular + medium) for the brand wordmark and the page heading. - Brand chrome (heading, buttons, footer) uses the DS idiom: uppercase + letter-spacing 0.2em (matching the DS Button class). Colour - Background: #170d02 (deep brown-black; --background-base in DS). - Accent: #ffac02 (amber; --midground in DS). - Foreground: #ffffff. - Hairlines: color-mix() of the midground at 18% / 35%, mirroring the DS "@theme inline" derived tokens. Button surface - Solid amber surface with dark text, no rounded corners (DS Button is squared). Inset bevel — — directly mirrors the DS Button SHADOW_DEFAULT (). :active uses filter:invert(1) which matches the DS Button's . Atmosphere - Subtle 3px dither (repeating-conic-gradient at 4% midground) + a midground radial glow at top — same idioms as the DS .dither utility and the SPA's panel chrome. - slide-up fade-in entrance animation matching DS @keyframes slide-up (0.6s ease-out). Honours prefers-reduced-motion. Brand wordmark - 'NOUS · RESEARCH' above the card in Rules Compressed, amber, 0.32em tracking. Establishes ownership before the user squints at the buttons. Empty-state page - The 'Sign-in unavailable' fallback (no providers registered) got the same colour-token and typography treatment so the misconfigured-deploy experience is also coherent. 
Fonts are served from /fonts/*.woff2 — a path the dashboard-auth gate already allowlists pre-auth (see _GATE_PUBLIC_PREFIXES in middleware.py:42), so the login page renders with the brand typeface without needing the React bundle loaded. The page is still entirely static HTML+CSS with no JS — the original constraint (no SPA dependency, no session token) is preserved. The class="provider-btn" selector is unchanged — the existing test suite extracts the anchor href via that class, and a regression that renamed it would silently break tests/hermes_cli/test_dashboard_auth_401_reauth.py. A docstring note on the module flags this so future visual tweaks don't break the contract by accident. Visual smoke-test: rendered both the happy path (multiple providers listed) and the empty-state page in a browser and verified all five DS criteria — brown-black bg, amber accent, uppercase wide-tracking type, inset-bevel buttons, Nous · Research wordmark — render correctly with no unstyled fallbacks. 208/208 dashboard-auth tests remain green.	2026-05-27 02:12:27 -07:00
Ben	61dcc33893	feat(dashboard-auth): config.yaml as canonical surface for dashboard.oauth Per AGENTS.md, ~/.hermes/.env is reserved for API keys / secrets and config.yaml is the surface for non-secret configuration. The Nous Portal plugin previously read HERMES_DASHBOARD_OAUTH_CLIENT_ID and HERMES_DASHBOARD_PORTAL_URL from the environment only, which forced local-dev / on-prem operators to put non-secret per-instance configuration in .env — violating the convention. Add dashboard.oauth.{client_id,portal_url} to DEFAULT_CONFIG and have the plugin resolve each setting with env-overrides-config precedence: 1. Env var when set to a non-empty value (Fly.io platform-secret injection — what pushes per-deploy client_ids without baking them into the image). 2. config.yaml entry (canonical surface for local dev / on-prem). 3. Plugin default (no provider registered when client_id is empty; portal_url defaults to https://portal.nousresearch.com). Empty env values are explicitly treated as unset so a provisioned-but- not-populated Fly secret can't accidentally shadow a valid config.yaml entry with an empty string — operators would otherwise lose the gate. Implementation: - hermes_cli/config.py: add dashboard.oauth.{client_id,portal_url} block to DEFAULT_CONFIG with full doc comment explaining the override precedence and Fly.io rationale. - plugins/dashboard_auth/nous/__init__.py: add _load_config_oauth_section, _resolve_client_id, _resolve_portal_url helpers; replace the two direct os.environ.get() calls in register() with the resolvers. Update the skip-reason string to mention BOTH surfaces so an operator looking at the fail-closed bind error knows config.yaml is a valid alternative to the env var. - plugins/dashboard_auth/nous/plugin.yaml: update description to name both surfaces. requires_env stays pointing at the env var name — it's metadata-only (not used by the plugin loader for gating) so this is documentation/UX, not enforcement. 
- cli-config.yaml.example: append commented dashboard.oauth block with the same override rationale operators see in code. - website/docs/user-guide/features/web-dashboard.md: rewrite the 'Default provider: Nous Research' section to lead with config.yaml, present env vars as operator overrides (Fly.io's primary path). Updated the example fail-closed bind error to match the new skip-reason text. Test coverage — new TestConfigYamlSource class (8 tests) pinning every tier of the precedence chain: - config-yaml-only path registers correctly - both config-yaml fields (client_id + portal_url) honoured - env var overrides config for client_id (Fly.io critical path) - env var overrides config for portal_url - empty env string does NOT shadow config (CI/Fly edge case) - neither source set → skip with reason mentioning BOTH surfaces - load_config() raising falls through to env-only path (resilience) - non-dict oauth section falls through cleanly (typo resilience) Mutation-tested: flipping the precedence to config-wins-over-env trips exactly test_env_overrides_config_client_id while the other 7 stay green, confirming the suite discriminates the order, not just the sources. This closes the last item in Teknium's PR review (PR #30156).	2026-05-27 02:12:27 -07:00
Ben	b26d81d536	feat(dashboard-auth): honour X-Forwarded-Prefix + __Host-/__Secure- cookies Mission-control style deploys reverse-proxy the dashboard at a path prefix (e.g. mission-control.tilos.com/hermes/* -> :9119) and inject X-Forwarded-Prefix: /hermes on every request. The SPA mount already honoured this for asset URLs and the bootstrap __HERMES_BASE_PATH__, but the OAuth gate didn't: 1. The gate's Location: header to /login and the 401 envelope's login_url were built bare ("/login?next=..."). Under a /hermes prefix the browser follows that to mission-control.tilos.com/login which the proxy doesn't route to the dashboard. 2. _redirect_uri (the OAuth callback URL handed to the IDP) used request.url_for() which doesn't honour X-Forwarded-Prefix (Starlette/uvicorn proxy_headers only cover Host + Proto + For). The IDP redirects back to /auth/callback instead of /hermes/auth/callback → 404 in the user's browser. 3. Cookies were set with Path=/ which leaks them to other apps on the same origin and won't be sent back on requests under the prefix in the first place. Fix threads the normalised prefix through every boundary: * New hermes_cli/dashboard_auth/prefix.py — single source of truth for X-Forwarded-Prefix parsing. web_server._normalise_prefix becomes a re-export so the SPA mount, the gate, and the cookies helper all agree. * middleware._unauth_response builds login_url = f"{prefix}/login". * routes._redirect_uri splices the prefix into the path component of the IDP-bound URL (with full validation of the header). * cookies.{set,clear}_{session,pkce}_cookie now take prefix="". Path attribute switches to /hermes when set; cookie name switches variant (see below). Every caller passes the request's normalised prefix. Cookie hardening (Teknium's lesser-note #1 in the PR review): adopt the __Host- / __Secure- cookie name prefixes per draft-west-cookie-prefixes. 
The variant is selected from (use_https, prefix): * Loopback HTTP → bare "hermes_session_at" (both prefixes require Secure, incompatible with HTTP). * HTTPS, direct deploy (Path=/) → "__Host-hermes_session_at". Strongest spec: bound to exact origin, no Domain attribute, Secure required. * HTTPS, behind a proxy prefix (Path=/hermes) → "__Secure-hermes_session_at". __Host- forbids Path != "/"; the explicit Path=/hermes covers same-origin app isolation. Setter and reader BOTH consult the prefix because the cookie name changes — a reader that looked up the bare name when the setter wrote __Secure- would never find the value. The reader falls back across all three variants so a request whose shape changed mid-session (e.g. post-deploy from no-prefix to /hermes) still picks up the existing cookie until it expires. Test coverage: - tests/hermes_cli/test_dashboard_auth_prefix.py — new file. 11 tests pinning: • Location: /hermes/login on the gate's HTML redirect • 401 envelope login_url carries the prefix • Malformed X-Forwarded-Prefix is ignored (header-injection defence; the script-tag value is normalised to empty string) • _redirect_uri splices /hermes into the path (the property that prevents the IDP-returns-to-404 failure) • PKCE cookie uses Path=/hermes + __Secure- when proxied • Session cookies use __Host- when direct, __Secure- when proxied, bare on loopback HTTP • End-to-end round trip with hand-managed PKCE cookie carriage (TestClient can't simulate a Path=/hermes cookie automatically) - tests/hermes_cli/test_dashboard_auth_cookies.py — rewritten to pin each (use_https, prefix) shape produces its expected cookie name, plus reader-side coverage that __Host- and __Secure- variants are both recognised. - Existing tests across middleware / 401-reauth / etc. updated to match the new cookie names (substring contains instead of startswith). 
Mutation-tested: reverting _unauth_response to build the bare "/login" URL trips exactly the two tests that pin the prefix carriage, confirming the suite discriminates the regression.	2026-05-27 02:12:27 -07:00
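The (use_https, prefix) selection rule for the cookie name can be sketched as below. The helper names and signatures are hypothetical (the log only says that the cookies helpers "now take prefix"), so treat this as an illustration of the three-way choice and of the reader-side fallback, not as the module's actual API.

```python
BASE_NAME = "hermes_session_at"


def session_cookie_name(use_https: bool, prefix: str = "", base: str = BASE_NAME) -> str:
    """Pick the cookie name variant for a request shape."""
    if not use_https:
        # Both __Host- and __Secure- require the Secure attribute, so plain
        # loopback HTTP has to keep the bare name.
        return base
    if prefix in ("", "/"):
        # Direct deploy: Path=/ is allowed, so the stricter __Host- prefix applies.
        return f"__Host-{base}"
    # Behind a path prefix: __Host- forbids Path != "/", so fall back to __Secure-.
    return f"__Secure-{base}"


def cookie_path(prefix: str = "") -> str:
    """Path attribute: the proxy prefix when present, otherwise the root."""
    return prefix or "/"


def candidate_names(base: str = BASE_NAME) -> list[str]:
    """Names a reader could try in turn, so a session survives a change of request shape."""
    return [f"__Host-{base}", f"__Secure-{base}", base]
```

The order in `candidate_names` is an assumption; the commit message only states that the reader falls back across all three variants.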
Ben	034ad95fed	fix(dashboard-auth): propagate next= through login page + PKCE cookie The gate's _unauth_response set next=<path> on the /login redirect URL, but nothing downstream read it: render_login_html ignored next=, auth_login dropped it, and auth_callback read next= from its own query string — which an IDP never sets on the callback URL (real IDPs only echo back code+state). The _validate_post_login_target plumbing in the callback was unreachable on the happy path, so users always landed on "/" regardless of what they originally requested. Worse: reading next= from the callback URL was a latent open-redirect sink, since an attacker could craft /auth/callback?...&next=/admin and have the server honour it post-auth. Fix carries next= through the round trip on a server-controlled channel: 1. login_page reads request.query_params['next'] and passes it (post- validation) to render_login_html. 2. render_login_html threads next= URL-encoded into each provider button's href, with HTML-attribute escaping as defence in depth. 3. auth_login accepts ?next= as a query param, re-validates, and appends it as a fourth segment (next=<urlquoted>) in the PKCE cookie payload alongside provider/state/verifier. 4. auth_callback no longer accepts a next: str = "" query param. It parses next= out of the PKCE cookie and validates that with the same same-origin rules. Any attacker-supplied ?next= on the callback URL is silently ignored — server-only carrier. Test coverage adds three classes: - TestAuthCallbackNext drives /login → /auth/login → IDP-bounce → /auth/callback end-to-end without smuggling next= onto the callback URL (which is what the previous tests did and why they didn't catch the bug). Includes test_attacker_callback_next_param_is_ignored to pin the security property that the URL value is never read. - TestRenderLoginHtmlNext covers the rendering function at the unit boundary so a regression that drops next_path is caught without spinning up the full app. 
- TestAuthLoginPkceCookieNext inspects the Set-Cookie header on /auth/login responses so a regression in cookie encoding is caught without driving the full round trip. Mutation-tested: reverting auth_callback to read next= from the URL trips 3 of 6 TestAuthCallbackNext tests (the safe-path and attacker-hardening ones), confirming the suite discriminates between the cookie read and the URL read.	2026-05-27 02:12:27 -07:00
Ben	c3104195b8	fix(dashboard-auth): bypass loopback WS peer check in gated mode When the OAuth gate is active, start_server runs uvicorn with proxy_headers=True so the dashboard can honour X-Forwarded-Proto from Fly's TLS terminator (cookies, redirect URI reconstruction). A side effect: ws.client.host is rewritten to the X-Forwarded-For value, which on Fly is the real internet client IP — never loopback. The loopback peer guard in _ws_client_is_allowed then rejected every WS upgrade in gated mode (4403 close) even after a successful OAuth round trip and ticket consumption, silently breaking /api/pty, /api/ws, /api/pub, and /api/events. Fix: in gated mode, bypass the peer-IP check. The OAuth gate + single-use ticket is the auth. The Host/Origin guard in _ws_host_origin_is_allowed still runs and is what protects against DNS-rebinding here, not the peer IP. Loopback mode behaviour is unchanged: the legacy ?token= path is the only auth there and we don't want LAN hosts guessing tokens. Regression coverage: TestWsRequestIsAllowedGated pins all four behaviours — non-loopback peer allowed in gated mode, non-loopback peer rejected in loopback mode, loopback peer allowed in loopback mode, and the Host/Origin guard still firing on a rebinding attempt with gated mode + matching peer.	2026-05-27 02:12:27 -07:00
Ben	42729775db	fix(dashboard): trigger plugin discovery in cmd_dashboard before start_server The argparse-setup plugin discovery path is gated on _plugin_cli_discovery_needed(), which returns False for any built-in subcommand including 'dashboard' (to save ~500ms startup on hot paths like --tui). As a result, plugins/dashboard_auth/nous never registered its DashboardAuthProvider, and start_server's fail-closed gate check tripped for any non-loopback bind even when the Nous provider was bundled and ready to run. Call discover_plugins() explicitly in cmd_dashboard so the provider registry is populated before the gate check runs. discover_plugins() is idempotent (per its docstring), so this is safe to call regardless of whether the argparse path already ran it.	2026-05-27 02:12:27 -07:00
Ben	b3dc539304	feat(dashboard-auth): Nous plugin always-on; default portal URL; specific error messages The Nous OAuth provider plugin (plugins/dashboard_auth/nous) is bundled and auto-loaded — same as before — but previously refused to register unless BOTH HERMES_DASHBOARD_OAUTH_CLIENT_ID and HERMES_DASHBOARD_PORTAL_URL were set, then the gate's fail-closed branch told the operator 'install the default Nous provider'. That message is misleading: the provider IS installed; it's just unconfigured. And the contract only really needs the per-instance client_id — the portal URL is the same for everyone in production. Three changes: 1. plugins/dashboard_auth/nous/__init__.py: - HERMES_DASHBOARD_PORTAL_URL is now optional and defaults to 'https://portal.nousresearch.com'. Override only for staging (portal.rewbs.uk) or a custom deployment. Empty string also falls back to the default so an empty Fly secret can't point the dashboard at nowhere. - Plugin exposes a module-level LAST_SKIP_REASON: str that the gate reads when no providers register. Cleared on each register() call. Skip reasons are human-readable and actionable ('HERMES_DASHBOARD_OAUTH_CLIENT_ID is not set. The Nous Portal provisions this env var…'). 2. plugins/dashboard_auth/nous/plugin.yaml: - requires_env drops HERMES_DASHBOARD_PORTAL_URL; only the client_id is mandatory. Description updated to reflect this. 3. hermes_cli/web_server.py: - When the gate fail-closes for 'no providers', it now reads each bundled plugin's LAST_SKIP_REASON and embeds them in the SystemExit message. Operator sees the specific config fix needed: Bundled providers reported these issues: • nous: HERMES_DASHBOARD_OAUTH_CLIENT_ID is not set. … instead of the prior generic 'Install the default Nous provider'. Tests: - TestPluginRegister rewritten to assert the new defaults + LAST_SKIP_REASON contents (6 tests, +1 new for empty-string env). - New gate test test_start_server_surfaces_nous_skip_reason_when_unconfigured. 
- test_get_method_is_not_allowed widened to handle the SPA-shell 200 path explicitly — assertion now verifies no JSON ticket leaks rather than asserting a specific status code (covers all four of 401/404/405/200). Docs updated: web-dashboard.md's 'Default provider' section now shows the env-var table with required/optional columns and embeds the fail-closed error message verbatim so operators can match what they see at the prompt.	2026-05-27 02:12:27 -07:00
Ben	2fc4615fc4	feat(dashboard-auth): Phase 7 — SPA AuthWidget + /api/status auth fields Phase 7 surfaces the OAuth gate state to users. web/src/components/AuthWidget.tsx (new): Sidebar widget that fetches /api/auth/me on mount and renders a compact 'Logged in as <user_id…> via <provider>' row with a logout icon. Contract V1 (Nous Portal) emits no email/display_name claims, so user_id is the display value (truncated to 14 chars + ellipsis); display_name and email fallthroughs are forward-compat for OQ-C1. Renders nothing on 401 from /api/auth/me — that's the signal the gate isn't engaged (loopback mode), in which case the widget would be confusing. Logout POSTs /auth/logout (which clears cookies + redirects to /login) then full-page-navigates to /login itself; the SPA's fetch wrapper doesn't follow that redirect, so the navigation is explicit. web/src/App.tsx: mounts <AuthWidget /> above <SidebarFooter />. Component is self-hiding in loopback mode so there's no need for a conditional mount. web/src/lib/api.ts: - getAuthMe() + logout() helpers - AuthMeResponse type - StatusResponse gets optional auth_required + auth_providers fields so the existing StatusPage can render a gated/loopback badge. hermes_cli/web_server.py: /api/status payload now includes - auth_required: bool — whether app.state.auth_required is True - auth_providers: list[str] — registered DashboardAuthProvider names Lazy-imports list_providers so early-startup status calls don't crash if the dashboard_auth module is still being set up. tests/hermes_cli/test_dashboard_auth_status_endpoint.py: 3 new tests covering the new status fields in both gated and loopback modes plus a regression that no existing field got dropped from the payload. The hermes status CLI is unchanged in this commit — that command tracks model providers + OAuth credentials, not running-dashboard state. The /api/status endpoint is the canonical place to query dashboard auth-gate state, consumed by the React StatusPage already.	2026-05-27 02:12:27 -07:00
Ben	5e9308b5b8	feat(dashboard-auth): Phase 6 — 401 re-auth envelope + next= propagation Contract V1 of nous-account-service PR #180 ships no refresh tokens, so the original Phase 6 silent-refresh design is replaced with a thinner '401 → redirect to /login' UX. The dashboard's gated middleware now emits a structured envelope on any auth failure; the SPA's fetch wrapper sees it and full-page-navigates the user through re-auth. hermes_cli/dashboard_auth/cookies.py: set_session_cookies(refresh_token='') SKIPS writing the hermes_session_rt cookie. Forward-compat: a non-empty refresh_token still emits the cookie unchanged, so a future Portal contract that starts issuing RTs flips the persistence on with no other change. clear_session_cookies still emits a Max-Age=0 deletion for the RT cookie so stale cookies from earlier deployments get flushed on logout / session expiry. Deprecation marker + rationale in module docstring per the user's docstring-only deprecation pattern. hermes_cli/dashboard_auth/middleware.py: _unauth_response now builds a structured JSON envelope for API 401s: { error: 'session_expired' \| 'unauthenticated', detail: 'Unauthorized', reason: <internal>, login_url: '/login?next=<safe-path>' } HTML redirects also carry next= so a user landing on /sessions without a cookie bounces back to /sessions after re-auth. _safe_next_target validates same-origin: drops protocol-relative paths (//evil.com), absolute URLs, and any /login or /auth/* loop. Dead cookies are cleared on the 401 path so the browser stops replaying invalid tokens. hermes_cli/dashboard_auth/routes.py: /auth/callback accepts next= query param and validates via _validate_post_login_target (same rules as the gate's _safe_next_target — defence-in-depth because next= survived a full IDP round trip and attacker-controlled state can re-enter via the callback URL). Open-redirect attempts land at '/' instead. 
web/src/lib/api.ts: fetchJSON parses the 401 envelope and full-page-navigates to body.login_url ONLY on the known session-expiry error codes. Domain-level 401s (e.g. permission errors) bubble up as regular errors. credentials: 'include' added so cookie auth works for all fetches routed through this wrapper. sessionStorage.lastLocation is preserved for future use by AuthWidget / hermes_status. Test files marked with pytest.mark.xdist_group so the four files that mutate web_server.app.state.auth_required serialize onto the same xdist worker — eliminates 'works locally, fails in CI' app-state bleed. 20 new tests in test_dashboard_auth_401_reauth.py: - set_session_cookies(refresh_token='') skips RT cookie - clear_session_cookies still emits RT deletion - 401 envelope shape (unauthenticated vs session_expired) - dead cookie cleared on invalid-token 401 - login_url carries next= for deep paths - login loop avoided when path is /login/auth/api-auth - protocol-relative URL rejected - _safe_next_target unit tests (accept same-origin, reject loops/abs) - /auth/callback respects safe next= but rejects open redirects 2 pre-existing tests updated to accept the new /login?next=%2F shape. Full dashboard-auth suite: 168 passed, 1 skipped (Phase 0 pre-existing).	2026-05-27 02:12:27 -07:00
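The same-origin rules listed for `_safe_next_target` (drop protocol-relative paths, absolute URLs, and /login or /auth/* loops) could look roughly like this. It is a minimal sketch under assumptions: the function name drops the underscore, the fallback target "/" follows the "land at '/' instead" wording, and the backslash and control-character checks are an extra precaution of this example that the commit message does not mention.

```python
from urllib.parse import urlsplit


def safe_next_target(raw: str, default: str = "/") -> str:
    """Return raw if it is a same-origin path that cannot loop, else default."""
    if not raw or not raw.startswith("/") or raw.startswith("//"):
        return default  # relative junk, absolute URLs, protocol-relative //evil.com
    # Assumption of this sketch: also refuse backslashes and control characters,
    # which some browsers normalise into a different URL.
    if "\\" in raw or any(ord(ch) < 0x20 for ch in raw):
        return default
    parts = urlsplit(raw)
    if parts.scheme or parts.netloc:
        return default
    path = parts.path
    if path == "/login" or path == "/auth" or path.startswith("/auth/"):
        return default  # would bounce the user straight back into the login flow
    return raw
```

A gate response would then URL-encode the result into `login_url`, for example with `urllib.parse.quote(target, safe="")`, which turns "/" into the `%2F` seen in the updated tests.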
Ben	b2360ba44e	feat(dashboard-auth): _ws_auth_ok helper + ticket auth on all 4 WS endpoints Phase 5 task 5.2. Four WebSocket endpoints — /api/pty, /api/ws, /api/pub, /api/events — previously authed with the same constant-time check against `_SESSION_TOKEN`. Replaced with a single helper that branches on `app.state.auth_required`: Loopback / --insecure: legacy ?token=<_SESSION_TOKEN> path (unchanged). Gated: ?ticket=<single-use> consumed against the dashboard-auth ticket store. Critical security property: gated mode UNCONDITIONALLY rejects the ?token= path. A leaked _SESSION_TOKEN value from a log line is not replayable for WS access in gated deployments. `_build_sidecar_url` now branches too: loopback uses the legacy token; gated mode mints a server-internal ticket via mint_ticket() with pseudo-user 'pty-sidecar' / provider 'server-internal' so audit logs can distinguish PTY-internal sidecar tickets from browser tickets. PTY children open /api/pub exactly once at startup so single-use suffices. Ticket rejections audit-log as WS_TICKET_REJECTED with truncated reason + client IP + WS path. Operators debugging 'WS keeps closing' issues see which endpoint and why. 17 new tests: - POST /api/auth/ws-ticket: 200 with cookie, 401/302 without, distinct per call, GET-not-allowed. - _ws_auth_ok loopback: token accept/reject, missing-token reject, ticket-param-ignored. - _ws_auth_ok gated: ticket accept, single-use rejection, unknown reject, legacy-token-rejected-in-gated assertion, audit-log emission. - _build_sidecar_url: loopback uses token=, gated uses ticket=, no-bound returns None.	2026-05-27 02:12:27 -07:00
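The branching in `_ws_auth_ok` can be pictured with the sketch below. The signature is hypothetical: the real helper reads `app.state.auth_required` and the WebSocket's query string itself, whereas this version takes them as arguments so it stays self-contained. `consume_ticket` stands in for the ticket store's consume call and is assumed to raise `KeyError` on an unknown, used, or expired ticket.

```python
import hmac
from typing import Callable, Mapping


def ws_auth_ok(
    query: Mapping[str, str],
    *,
    auth_required: bool,
    session_token: str,
    consume_ticket: Callable[[str], object],
) -> bool:
    """Gated mode accepts only ?ticket=; loopback mode accepts only ?token=."""
    if auth_required:
        # Gated: the legacy ?token= parameter is never consulted here, so a
        # leaked session token cannot be replayed for WebSocket access.
        ticket = query.get("ticket", "")
        if not ticket:
            return False
        try:
            consume_ticket(ticket)
        except KeyError:
            return False
        return True
    # Loopback / --insecure: constant-time comparison against the session token.
    token = query.get("token", "")
    return bool(token) and hmac.compare_digest(token.encode(), session_token.encode())
```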
Ben	b69fce9c86	feat(dashboard-auth): single-use WS tickets + POST /api/auth/ws-ticket Phase 5 task 5.1. Browsers cannot set Authorization on a WebSocket upgrade, so in gated mode the SPA needs an alternative way to bind the upgrade to its authenticated session. hermes_cli/dashboard_auth/ws_tickets.py — in-memory single-use ticket store with 30s TTL. Thread-safe (threading.Lock), token_urlsafe(32) values, ticket value truncated to 8 chars in error messages for log hygiene. Module-level state with _reset_for_tests() helper. hermes_cli/dashboard_auth/routes.py — adds POST /api/auth/ws-ticket. Auth-required (the gate middleware already attaches Session to request.state.session). Returns {ticket, ttl_seconds}; emits WS_TICKET_MINTED audit event with user_id + provider + ip. hermes_cli/dashboard_auth/audit.py — adds WS_TICKET_REJECTED enum value for the consume-side rejection event (wired into the WS endpoints in task 5.2). 11 new tests covering round-trip, single-use, TTL boundary, unknown ticket rejection, secret-hygiene truncation in error messages, and concurrent mint+consume from 20 threads.	2026-05-27 02:12:27 -07:00
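A single-use ticket store with the properties named above (30s TTL, `threading.Lock`, `token_urlsafe(32)` values, 8-character truncation in error messages) might be sketched like this. The class shape, the injectable clock, and the use of `KeyError` are assumptions made for a self-contained example; the commit describes module-level state in ws_tickets.py and does not show its exception types.

```python
import secrets
import threading
import time

TTL_SECONDS = 30.0


class TicketStore:
    """In-memory store where each ticket can be consumed at most once."""

    def __init__(self, ttl: float = TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so tests can move time forward
        self._lock = threading.Lock()
        self._tickets: dict[str, tuple[float, str, str]] = {}

    def mint(self, user_id: str, provider: str) -> str:
        ticket = secrets.token_urlsafe(32)
        with self._lock:
            self._tickets[ticket] = (self._clock() + self._ttl, user_id, provider)
        return ticket

    def consume(self, ticket: str) -> tuple[str, str]:
        # pop() under the lock is what makes the ticket single-use: a second
        # consumer racing on the same value finds nothing left to take.
        with self._lock:
            entry = self._tickets.pop(ticket, None)
        if entry is None:
            raise KeyError(f"unknown ticket {ticket[:8]}...")
        expiry, user_id, provider = entry
        if self._clock() >= expiry:
            raise KeyError(f"expired ticket {ticket[:8]}...")
        return user_id, provider
```

Popping before checking expiry means an expired ticket is also removed on first use, so stale entries do not linger once a client presents them.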
Ben	53736b3922	feat(dashboard-auth): fail-closed on no providers; proxy_headers when gated; suppress _SESSION_TOKEN injection Phase 3, Task 3.5. Three changes to web_server.py: 1. start_server replaces the legacy SystemExit-refusing-to-bind guard with: if app.state.auth_required and no providers registered, exit with a clear message; otherwise log the gate-on banner. --insecure keeps its existing behaviour. 2. uvicorn proxy_headers flag is computed from app.state.auth_required. Loopback / --insecure keep it False (so _ws_client_is_allowed sees the real peer for the loopback gate); gated mode flips it True so X-Forwarded-Proto from Fly's TLS terminator is honoured for cookie Secure-flag decisions in detect_https(). 3. _serve_index no longer injects window.__HERMES_SESSION_TOKEN__ when the gate is on — the SPA reads identity from /api/auth/me using cookie auth instead. window.__HERMES_AUTH_REQUIRED__ flag lets the SPA pick between ticket-auth (gated) and token-auth (loopback) for /api/pty + /api/ws (Phase 5 will wire this in the React layer). 4 new behavioural tests; loopback regression harness still green.	2026-05-27 02:12:27 -07:00
Ben	5b17eab67a	feat(dashboard-auth): auth gate middleware + /auth/* routes + /login HTML Phase 3, Tasks 3.2 + 3.3 + 3.4. These three pieces are mutually dependent so they land together. middleware.py - gated_auth_middleware engages when app.state.auth_required is True. Allowlists /login, /auth/, /api/auth/providers, and static asset paths; everything else demands a valid session_at cookie. Verifies by trying every registered provider's verify_session in turn (multi- provider stack); attaches verified Session to request.state.session. Returns 401 JSON for /api/ and 302 -> /login for HTML. ProviderError during verify -> 503. routes.py - APIRouter with: GET /login server-rendered HTML GET /auth/login?provider=N 302 to IDP + PKCE cookie GET /auth/callback?code,state completes login, sets session cookies POST /auth/logout clears cookies + best-effort revoke GET /api/auth/providers public bootstrap endpoint (503 if zero) GET /api/auth/me verified session as JSON (auth-required) login_page.py - Inline-CSS HTML template, no React, no JavaScript. web_server.py - Mounted gated_auth_middleware between host_header and auth_middleware (FastAPI runs middlewares in registration order: host check -> cookie auth -> token auth). auth_middleware short-circuits when auth_required so cookie auth is authoritative in gated mode. Router is included before mount_spa so the catch-all doesn't swallow /login or /auth/*. 17 new behavioural tests; loopback regression harness still green.	2026-05-27 02:12:27 -07:00
Ben	a30c4d8ebd	feat(dashboard-auth): cookie helpers for session_at/session_rt/pkce Phase 3, Task 3.1. Three cookies: - hermes_session_at: OAuth access token (HttpOnly, TTL = token TTL) - hermes_session_rt: OAuth refresh token (HttpOnly, 30d max-age) - hermes_session_pkce: PKCE state + verifier + provider hint (10min) All SameSite=Lax + Path=/. Secure flag is set ONLY when the request scheme is https — uvicorn proxy_headers=True (enabled in gated mode at Phase 3.5) rewrites scheme from X-Forwarded-Proto so Fly's TLS terminator works.	2026-05-27 02:12:27 -07:00
Ben	865cae4f61	feat(dashboard-auth): json-lines audit log at $HERMES_HOME/logs/dashboard-auth.log Phase 1, Task 1.4. Records every auth event (login start/success/failure, logout, refresh success/failure, revoke, session verify failure, WS ticket mint) as one JSON object per line. Token-like kwargs (access_token, refresh_token, code, code_verifier, state, ticket, cookie, Authorization) are dropped before serialisation so the log never contains live secrets. Write failures log at WARNING but never raise — auth flows must not fail because the audit logger broke.	2026-05-27 02:12:27 -07:00
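The two guarantees in this commit (token-like kwargs never reach the log, and write failures never raise) can be sketched as follows. Function names, the record layout (`ts`, `event`), and the logger name are assumptions for illustration; only the list of dropped keys and the WARNING-on-failure behaviour come from the commit message. Key matching is done case-insensitively here so that `Authorization` is caught as well.

```python
import json
import logging
import time
from pathlib import Path

logger = logging.getLogger("dashboard_auth.audit")

# Keys named in the commit message, lower-cased for comparison.
SECRET_KEYS = frozenset({
    "access_token", "refresh_token", "code", "code_verifier",
    "state", "ticket", "cookie", "authorization",
})


def build_record(event: str, **fields) -> dict:
    """Drop token-like fields before the record is serialised."""
    safe = {k: v for k, v in fields.items() if k.lower() not in SECRET_KEYS}
    return {**safe, "ts": time.time(), "event": event}


def write_event(log_path: Path, event: str, **fields) -> None:
    """Append one JSON object per line; never let a write failure escape."""
    try:
        log_path.parent.mkdir(parents=True, exist_ok=True)
        with log_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(build_record(event, **fields), default=str) + "\n")
    except OSError as exc:
        logger.warning("audit log write failed: %s", exc)
```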
Ben	c32b17f557	feat(plugins): add register_dashboard_auth_provider hook on PluginContext Phase 1, Task 1.3. Mirrors the existing register_image_gen_provider pattern (plugins.py:531) — wrong-type or duplicate-name registrations log at WARNING and silently return rather than raising, so a misbehaving auth plugin cannot crash the host. Deviation from plan: the plan's draft raised TypeError on non-provider input; switched to silent-warn to match the established image_gen convention. Test updated to match.	2026-05-27 02:12:27 -07:00
Ben	2dc6d03a3d	feat(dashboard-auth): define DashboardAuthProvider ABC + Session dataclass Phase 1, Task 1.1. New package hermes_cli/dashboard_auth/ contains: base.py - DashboardAuthProvider ABC with 5 abstract methods (start_login, complete_login, verify_session, refresh_session, revoke_session), Session + LoginStart frozen dataclasses, three exception types (ProviderError / InvalidCodeError / RefreshExpiredError), and assert_protocol_compliance() for plugins to call in their own tests. registry.py - Module-level register/get/list/clear with a lock. Nothing reads the registry yet — Phase 2 adds the StubAuthProvider and Phase 3 wires the gate middleware. The plugin hook lands in Task 1.3.	2026-05-27 02:12:27 -07:00
Ben	949ad95e4b	feat(dashboard): stash auth_required flag on app.state Phase 0, Task 0.3. start_server now computes should_require_auth(host, allow_public) and records it on app.state.auth_required BEFORE the existing legacy SystemExit guard fires. This gives middleware, the SPA token-injection path, and WS endpoints a consistent read source for 'is the gate active'. The flag is set but no one reads it yet — Phase 3 registers the gate middleware. Note: 4 pre-existing test failures in tests/hermes_cli/test_web_server.py (PtyWebSocket) + test_update_hangup_protection.py reproduce on pristine HEAD and are unrelated to this change (starlette TestClient WS regression).	2026-05-27 02:12:27 -07:00
Ben	8773bbf186	feat(dashboard): add should_require_auth predicate for OAuth gate Phase 0, Task 0.2. Single source of truth for 'is the auth gate active?'. Reuses the existing _LOOPBACK_HOST_VALUES frozenset so this stays in sync with the DNS-rebinding host-header check. RFC1918/CGNAT/link-local are treated as public — exact threat model the gate exists for.	2026-05-27 02:12:27 -07:00
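A predicate of this kind could be as small as the sketch below. Two things are assumptions rather than facts from the log: the contents of the loopback set (the real `_LOOPBACK_HOST_VALUES` is not shown), and the meaning of `allow_public`, which is taken here to be the `--insecure` opt-out that keeps the gate off.

```python
# Assumed stand-in for the real _LOOPBACK_HOST_VALUES frozenset.
LOOPBACK_HOST_VALUES = frozenset({"127.0.0.1", "localhost", "::1"})


def should_require_auth(host: str, allow_public: bool = False) -> bool:
    """True when the bind address is anything other than loopback."""
    if allow_public:
        # Assumption: the operator explicitly opted out of the gate.
        return False
    # Exact membership, with no "private range" carve-out: RFC1918, CGNAT and
    # link-local addresses all count as public and therefore need the gate.
    return host.strip().lower() not in LOOPBACK_HOST_VALUES
```

Using set membership instead of an `ipaddress`-based "is private" test is what keeps LAN and CGNAT binds on the gated side.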
orcool	f0fdb5e67d	feat(catalog): add qwen3.7-max to alibaba + alibaba-coding-plan model lists Alibaba's latest flagship Qwen model is released but not yet present in the DashScope (alibaba) or Alibaba Coding Plan curated catalogs. Add it so it shows up in the /model picker and setup wizard for those providers. OpenCode Go routing for qwen3.7-max already landed via #32780 (commit `2fc77c53f`). OpenRouter + Nous catalog entries already landed via #32809 (commit `ccd3d04fc`). This salvage picks up the remaining alibaba / alibaba-coding-plan entries from #32806 — the AI Gateway entry is dropped because Vercel AI Gateway was removed in #33067.	2026-05-27 02:05:58 -07:00
Teknium	febc4cfec0	remove Vercel AI Gateway and Vercel Sandbox (#33067) * remove Vercel AI Gateway provider and Vercel Sandbox terminal backend Both Vercel-hosted integrations are removed end-to-end. Users on the AI Gateway should switch to OpenRouter or one of the other aggregators (Nous Portal, Kilo Code). Users on the Vercel Sandbox backend should switch to Docker, Modal, Daytona, or SSH. What's removed: - `plugins/model-providers/ai-gateway/` provider plugin - `hermes_cli/vercel_auth.py` Vercel-Sandbox auth helper - `tools/environments/vercel_sandbox.py` terminal backend - `ai-gateway` provider wiring across auth, doctor, setup, models, config, status, providers, main, web_server, model_normalize, dump - `vercel_sandbox` backend wiring across terminal_tool, file_tools, code_execution_tool, file_operations, approval, skills_tool, environments/local, credential_files, lazy_deps, prompt_builder, cli, gateway/run - `AI_GATEWAY_BASE_URL` constant, `_AI_GATEWAY_HEADERS` auxiliary-client header set, run_agent base-URL header/reasoning special-cases - `[vercel]` pyproject extra and `vercel`/`vercel-workers` from uv.lock - env vars: `AI_GATEWAY_API_KEY`, `AI_GATEWAY_BASE_URL`, `VERCEL_TOKEN`, `VERCEL_PROJECT_ID`, `VERCEL_TEAM_ID`, `VERCEL_OIDC_TOKEN`, `TERMINAL_VERCEL_RUNTIME` - Tests: deletes test_ai_gateway_models.py and test_vercel_sandbox_environment.py; scrubs references across 23 surviving test files (no entire tests deleted unless they were dedicated to AI Gateway / Sandbox) - Docs: provider tables, env-var reference, setup guides, security notes, tool config, terminal-backend tables — English plus zh-Hans i18n parity - `hermes-agent` skill: provider table entry and remote-backend list What stays (intentional): - `popular-web-designs/templates/vercel.md` — CSS design reference, unrelated to Vercel-the-AI-product - `x-vercel-id` in `stream_diag.py` headers — generic Vercel CDN response header, useful diag signal on any Vercel-hosted endpoint - `vercel-labs/agent-browser` URL in browser config — lightpanda browser project, different OSS effort - `userStories.json` historical contributor entry mentioning Vercel Sandbox — archive, not active docs Validation: - 1153 tests in the 22 targeted files pass (`scripts/run_tests.sh`) - Full repo `py_compile` clean - Live import of every touched module + invariant check (no `ai-gateway` in `PROVIDER_REGISTRY`, no `_AI_GATEWAY_HEADERS`, no `vercel_sandbox` in `_REMOTE_TERMINAL_BACKENDS`) * test: convert profile-count check from change-detector to invariant The hardcoded "== 34" assertion broke when ai-gateway was removed. Per AGENTS.md change-detector-test guidance, assert the relationship (registry count >= number of plugin dirs) instead of a literal count. Counts shift when providers are added/removed; that's expected.	2026-05-27 00:43:32 -07:00
emozilla	3d9a26afad	Merge remote-tracking branch 'origin/main' into jq/hermes-update-branch-flag	2026-05-27 00:48:25 -04:00
beardthelion	2fc77c53f0	feat(opencode-go): route qwen3.7-max via anthropic_messages qwen3.7-max on OpenCode Go rejects the OpenAI-compatible (oa-compat) format with HTTP 401 but works correctly via the Anthropic Messages endpoint (/v1/messages with x-api-key auth). Route it the same way MiniMax models are routed: anthropic_messages api_mode. Changes: - hermes_cli/models.py: add qwen3.7-max routing + curated list - hermes_cli/setup.py: add to setup wizard model list - hermes_cli/auth.py: update provider comment - tests: add assertions for qwen3.7-max api_mode routing	2026-05-26 20:44:43 -07:00
Teknium	bb4703c761	docs(auth): replace stale 'hermes login' references with 'hermes auth add' 'hermes login' was removed (the command now just prints a deprecation message and exits). The bundled hermes-agent SKILL.md, in-code error messages, the tip rotation, the proxy adapters, and the docs site still pointed agents and users at the dead command — so models loading the skill kept running 'hermes login --provider openai-codex' and getting a dead-end print. Replacements use the canonical 'hermes auth add <provider>' surface (or bare 'hermes auth' for the interactive manager). Files: - skills/autonomous-ai-agents/hermes-agent/SKILL.md (+ regenerated docs page) - hermes_cli/tips.py (tip rotation) - agent/google_oauth.py (gemini-cli error message) - agent/conversation_loop.py (nous re-auth troubleshooting line) - agent/credential_sources.py (docstring) - hermes_cli/proxy/cli.py + hermes_cli/proxy/adapters/nous_portal.py (proxy auth hints) - tests/hermes_cli/test_proxy.py (updated assertions) - website/docs/reference/faq.md, website/docs/user-guide/features/subscription-proxy.md - zh-Hans i18n mirrors for the above 'hermes logout' is still a live command and is left untouched. The 'hermes login' stub in hermes_cli/auth.py:login_command() and the cli-commands.md 'Deprecated' rows are intentionally kept as the discoverable deprecation surface.	2026-05-26 15:41:11 -07:00
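
Commit 2dc6d03a3d above describes a provider ABC, two frozen dataclasses, three exception types, and a lock-guarded registry. The following is a minimal sketch of what such a layout could look like, not the code in `hermes_cli/dashboard_auth/`. The class, method, and exception names come from the commit message; every field, every method signature, the exception hierarchy, and the registry function names (`register_provider`, `get_provider`, `list_providers`, `clear_providers`) are assumptions made for illustration.

```python
import threading
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


class ProviderError(Exception):
    """Generic provider failure (hierarchy below is an assumption)."""


class InvalidCodeError(ProviderError):
    """The login code handed to complete_login was rejected."""


class RefreshExpiredError(ProviderError):
    """The refresh credential can no longer be used."""


@dataclass(frozen=True)
class LoginStart:
    # Assumed fields: where to send the browser, plus an anti-CSRF state value.
    redirect_url: str
    state: str


@dataclass(frozen=True)
class Session:
    # Assumed fields; the real dataclass may carry different data.
    user_id: str
    access_token: str
    refresh_token: str
    expires_at: float


class DashboardAuthProvider(ABC):
    """Five abstract methods, as listed in the commit message."""

    name: str = "base"  # assumed registry key

    @abstractmethod
    def start_login(self, redirect_uri: str) -> LoginStart: ...

    @abstractmethod
    def complete_login(self, code: str, state: str) -> Session: ...

    @abstractmethod
    def verify_session(self, access_token: str) -> Session: ...

    @abstractmethod
    def refresh_session(self, refresh_token: str) -> Session: ...

    @abstractmethod
    def revoke_session(self, refresh_token: str) -> None: ...


# Module-level registry; one lock serialises every read and write.
_lock = threading.Lock()
_providers: Dict[str, DashboardAuthProvider] = {}


def register_provider(provider: DashboardAuthProvider) -> None:
    with _lock:
        _providers[provider.name] = provider


def get_provider(name: str) -> Optional[DashboardAuthProvider]:
    with _lock:
        return _providers.get(name)


def list_providers() -> List[str]:
    with _lock:
        return sorted(_providers)


def clear_providers() -> None:
    with _lock:
        _providers.clear()
```

Freezing the dataclasses keeps a verified session from being mutated after the fact, and a module-level lock is the simplest way to keep registration safe if plugins load from more than one thread.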


2292 commits
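
Commit 8773bbf186 in the list above states that `should_require_auth` reuses `_LOOPBACK_HOST_VALUES` and treats RFC1918, CGNAT, and link-local addresses as public. A rough sketch of that host classification follows. The function name `gate_is_active` and the contents of the frozenset are hypothetical, and the commit message does not say how `allow_public` affects the result, so that parameter is left out here.

```python
# Assumed contents; the real frozenset in the repository may differ.
_LOOPBACK_HOST_VALUES = frozenset({"127.0.0.1", "localhost", "::1"})


def gate_is_active(host: str) -> bool:
    """Hypothetical stand-in for the host check inside should_require_auth.

    Anything that is not an exact loopback value counts as public, which
    includes private LAN, CGNAT, and link-local addresses.
    """
    return host.strip().lower() not in _LOOPBACK_HOST_VALUES


if __name__ == "__main__":
    for host in ("127.0.0.1", "localhost", "192.168.1.10", "100.64.0.1", "169.254.1.1"):
        print(host, gate_is_active(host))
```

Checking membership in one shared set, instead of classifying IP ranges, is what keeps this predicate aligned with the DNS-rebinding host-header check the commit mentions.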