Commit graph

2307 commits

Author SHA1 Message Date
Teknium
e0572a6def
fix(skills-hub): stop ellipsis-truncating the Identifier column (#33810)
`hermes skills search` rendered the Identifier column with the default
overflow behaviour, so long slugs (notably browse-sh — every browse-sh
skill ends in a `-XXXXXX` hash that's part of the identifier) were cut
to `browse-sh/weathe…`. Users copied the visible string into
`hermes skills install` and got a not-found error because the hash was
gone.

Set overflow="fold" on the Identifier column in both search tables
(`do_search` and the `_resolve_short_name` multi-match table) so long
slugs wrap onto a second line instead of getting eaten. Also add a
`--json` flag to `hermes skills search` (and the `/skills search`
slash variant) for scripting — emits a list of {name, identifier,
source, trust_level, description} objects with the full identifier,
which is the right shape for copy-paste pipelines too.

Closes #33674.
2026-05-28 04:53:13 -07:00
Teknium
5e1f793430
chore(web): remove web_crawl tool + provider crawl plumbing (#33824)
The web_crawl_tool() function was an orphan — no model schema registered
it, no skill or CLI command called it, and the agent had no way to invoke
it. PR #32608 proposed wiring it up as a model-callable tool; we've
decided not to expose crawl as a separate capability since web_search +
web_extract cover the use cases we want models to have.

Removed:
- tools/web_tools.py: web_crawl_tool() (~230 LOC)
- plugins/web/firecrawl/provider.py: supports_crawl() + crawl()
- plugins/web/tavily/provider.py: supports_crawl() + crawl()
- plugins/web/xai/provider.py: supports_crawl() override
- agent/web_search_provider.py: supports_crawl() + crawl() ABC methods
- agent/web_search_registry.py: get_active_crawl_provider() +
  the 'crawl' branch in _resolve()
- agent/display.py: web_crawl tool-progress rendering
- hermes_cli/config.py: 'web_crawl' from TAVILY_API_KEY.tools
- tools/website_policy.py: stale comment reference
- Tests: removed TestWebCrawlTavily class, the two website-policy
  web_crawl tests, the searxng/ddgs/brave-free crawl-error tests,
  the integration test_web_crawl method, and the
  test_unconfigured_crawl_emits_top_level_error test. Trimmed the
  capability-flag parametrize list and the WebSearchProvider ABC
  conformance tests.
- Docs: trimmed the Crawl column from capability tables in both EN
  and zh-Hans, updated the developer-guide ABC table.

Net: 25 files, +115/-1067.

Closes #33762 (the schema-text bug only existed if #32608 landed).
Supersedes #32608.
2026-05-28 04:52:42 -07:00
teknium1
6f9182cb34 fix(kanban): content-addressed corrupt-DB backup filename
Repeated quarantines of an unchanged corrupt kanban.db used to amplify
disk usage by N: the gateway dispatcher's 5-minute retry loop, multi-
profile fleets sharing one DB, and manual reopen attempts each produced
a fresh '.corrupt.<timestamp>.bak' copy of the same bytes. After 10
retries on a 100KB DB you had 11x the disk footprint of duplicate
corrupt data.

Derive the backup filename from a sha256 of the main DB instead of a
timestamp + collision counter. Same bytes → same filename → skip the
copy on retries. Different bytes (partial repair, further damage) →
different filename → preserve separately. Sidecar (-wal/-shm) backups
inherit the same content-addressed name.

Inspired by @hanzckernel's PR #33529, simplified down to ~30 LOC: drop
the persistent JSON marker file, drop the atomic temp+fsync+rename
helper (shutil.copy2 is fine for a quarantine-only path), drop the
gateway-side WAL/SHM fingerprint extension (the existing
(path, mtime, size) tuple still gives the 5-minute retry semantics it
needs), and drop the gateway-side helper extraction. The backup file
existing IS the marker; no separate state needed.

Test: tests/hermes_cli/test_kanban_db.py::test_repeated_corrupt_open_reuses_single_backup
proves 10 retries on the same corrupt bytes produce 1 backup (was 11),
and mutating the corrupt bytes produces a second backup with a
different fingerprint.

Refs #33529
Co-authored-by: hanzckernel <zhicheng.han@mathematik.uni-goettingen.de>
2026-05-28 03:38:09 -07:00
Teknium
432a691758
fix(update): stream + idle-kill npm run build so a stalled webui-build can't soft-brick the install (#33803)
`hermes update` ran the webui build with `capture_output=True` and no timeout. On low-memory hosts (WSL2's 4 GB default, small VPSes, antivirus stalls) Vite goes silent for minutes; users see a frozen terminal, decide the update is hung, and reboot. The reboot lands *after* `pip install -e .` has already touched the install but *before* the build completes, leaving the `hermes` launcher in place while `hermes_cli` is no longer importable — i.e. `ModuleNotFoundError: No module named 'hermes_cli'` (#33788, same class as #32384).

Changes:

- New `_run_with_idle_timeout()` helper: streams subprocess output line-by-line (so the user sees Vite progress in real time) and kills the process if no bytes appear on stdout/stderr for 180s. The existing stale-dist fallback (#23817) then serves the previous build instead of failing the update.
- `_build_web_ui()` uses the helper for `npm run build` (the actual stall site). `npm install` keeps `subprocess.run` + capture_output to preserve the existing EPERM-retry-on-Windows contract.
- Both `cmd_update` call sites print `→ Core update complete. Building dashboard (optional)...` before the webui build. The CLI is fully functional at this point; a webui-build failure only affects `hermes dashboard`. Telegraphing the boundary explicitly stops users from rebooting through the build step.

Tests:

- `tests/hermes_cli/test_run_with_idle_timeout.py` — 4 tests covering streaming success, nonzero exit, idle-kill, and missing-binary cases. Uses real `subprocess.Popen` on tiny Python scripts; isolated in its own file so per-file canonical-runner parallelism doesn't pair it with the mock-heavy tests.
- `tests/hermes_cli/test_web_ui_build.py` — updated existing tests to patch `_run_with_idle_timeout` for the build step in addition to `subprocess.run` for the install step.
- `tests/hermes_cli/test_cmd_update.py::test_update_refreshes_repo_and_tui_node_dependencies` — same update.

Full suite: `scripts/run_tests.sh tests/hermes_cli/` → 5646 passed, 0 failed.

Fixes #33788.
2026-05-28 03:34:47 -07:00
Teknium
10ee4a729b
fix(gateway): drain on Windows hermes gateway stop so sessions survive restart (#33798)
Sessions now survive `hermes gateway stop` / `restart` on native Windows.
Previously the gateway died on schtasks `/End` + os.kill SIGTERM without
ever running the drain loop, so the v0.13.0 session-resume feature (#21192)
silently broke on Windows: `resume_pending=True` was never written, and
the next boot started with a blank conversation history (issue #33778).

Root cause is twofold and the reporter only identified half of it:

1. `hermes_cli/gateway_windows.py::stop()` did not write the
   `planned_stop_marker` before signalling. The reporter caught this.

2. The bigger reason: `asyncio.add_signal_handler` raises
   NotImplementedError for SIGTERM/SIGINT on Windows, so even if the
   marker had been written, the gateway's existing SIGTERM handler
   (which is what calls `runner.stop()` and the `mark_resume_pending`
   loop) was never invoked. Writing the marker would have been
   necessary-but-insufficient.

The fix has two parts:

* gateway/run.py: new `_run_planned_stop_watcher` daemon thread polls
  for the planned-stop marker file every 0.5s. When the marker appears
  it `loop.call_soon_threadsafe(shutdown_signal_handler, None)` — the
  same shutdown path a real SIGTERM would have driven, including the
  pre-drain `mark_resume_pending` writes (run.py:5977) and graceful
  drain wait. The existing signal handler already accepts
  `received_signal=None` and falls through to
  `consume_planned_stop_marker_for_self()`, so no handler changes
  needed. Runs on every platform as cheap belt-and-suspenders.

* hermes_cli/gateway_windows.py: `stop()` now writes the marker for
  the running gateway PID and waits up to `agent.restart_drain_timeout`
  (default 30s) for the PID to exit cleanly. On clean drain, the kill
  sweep is non-forceful; on timeout, escalates to
  `kill_gateway_processes(force=True)` which routes to taskkill /T /F
  per `references/windows-native-support.md`.

Validation:

* 7 new tests in tests/gateway/test_planned_stop_watcher.py covering:
  marker→handler dispatch, no-marker idle, already-draining skip,
  not-yet-running skip, stop_event responsiveness, fire-once
  semantics, error tolerance.
* 8 new tests in tests/hermes_cli/test_gateway_windows.py covering:
  marker-before-kill ordering, clean-drain skips force-kill,
  drain-timeout escalates to force=True, no-pid-skips-drain,
  invalid-pid handling, fast-exit success, timeout failure,
  marker-write-failure tolerance.
* E2E (Linux, detached orphan): write_planned_stop_marker(pid) +
  `_drain_gateway_pid(pid, 5.0)` returns True in 0.5s after the
  victim sees the marker and exits. Tested with a double-forked
  subprocess so the test parent isn't holding it as a zombie.
* Targeted: tests/gateway/{restart_drain,restart_resume_pending,
  signal,signal_format,status,shutdown_forensics,approve_deny_commands,
  planned_stop_watcher} + tests/hermes_cli/{gateway_windows,
  gateway_service} → 519/519.

What was wrong with the reporter's claim (for future archaeology): they
described the symptom as "no `resume_pending=True` written to
`sessions.json`" — but Hermes uses `state.db` (SQLite), not
`sessions.json`, and `mark_resume_pending` is called regardless of
the marker (the marker only affects exit code 0 vs 1 for systemd
revival semantics). The real session-loss path is the missing drain
on Windows, not a missing marker. Both halves are fixed here.

Closes #33778.
2026-05-28 03:25:32 -07:00
Aditya Rajesh Gadgil
031983bbf8 fix: limit pre-update state snapshots 2026-05-28 02:45:25 -07:00
sprmn24
4ed482549f fix(xai-proxy): handle 429 rate-limit responses in proxy retry path
get_retry_credential only triggered on 401; a 429 Too Many Requests from
xAI was silently streamed back with no key rotation or back-off signal.

- server.py: widen retry gate from == 401 to in {401, 429}
- xai.py: on 429, skip token refresh and call mark_exhausted_and_rotate
  to stamp the 1-hour cooldown on the rate-limited key and return the
  next available credential. Returns None if pool is exhausted.
2026-05-28 02:36:37 -07:00
Dusk1e
aa3466063b fix(android): reject unsafe tar members in psutil compatibility installer 2026-05-28 02:36:09 -07:00
teknium1
9b5dae17a5 feat(context-engine): host contract for external context engines
Condenses the substance of PRs #16453, #17453, #16451, #17600, and #13373
into a minimal generic host contract that external context engine plugins
(e.g. hermes-lcm) need to integrate cleanly. Drops scaffolding that
duplicated existing infrastructure or had marginal value.

Five concrete changes:

1. `_transition_context_engine_session()` on AIAgent — generic lifecycle
   helper that fires on_session_end → on_session_reset → on_session_start
   → optional carry_over_new_session_context. Engines implement only the
   hooks they need; missing hooks are skipped. Built-in compressor keeps
   its existing reset-only behavior because callers default to no
   metadata. `reset_session_state()` now optionally accepts
   previous_messages / old_session_id / carry_over_context and delegates
   to the transition helper when provided. (#16453)

2. `conversation_id` passed to `on_session_start()` — both the
   agent-init call site and the compression-boundary call site now
   forward `self._gateway_session_key` so plugin engines have a stable
   conversation identity that survives session_id rotation (compression
   splits, /new, resume). The key already existed on AIAgent; it just
   wasn't reaching engines. (#16453)

3. Canonical cache buckets forwarded to engines — the usage dict passed
   to `update_from_response()` now includes input_tokens, output_tokens,
   cache_read_tokens, cache_write_tokens, and reasoning_tokens on top of
   the legacy prompt/completion/total keys. Engines can make decisions on
   cache-hit ratios and reasoning costs instead of only aggregates. ABC
   docstring updated. (#17453)

4. Plugin-registered context engines visible in the picker —
   `_discover_context_engines()` in plugins_cmd.py now also includes
   engines registered via `ctx.register_context_engine()` from plugin
   manifests, deduplicating by name so repo-shipped descriptions win on
   collision. (#16451)

5. `_EngineCollector.register_command()` — context engines using the
   standard `register(ctx)` pattern can now expose slash commands (e.g.
   `/lcm`). Routes to the global plugin command registry with the same
   conflict-rejection policy regular plugins use (no shadowing built-ins,
   no clobbering other plugins). Previously these calls hit a no-op and
   the slash commands silently never appeared. (#17600)

Dropped from the original 5 PRs:

- Compression boundary signal (`boundary_reason="compression"`) from
  #16453 — already on main at `agent/conversation_compression.py:412-424`,
  landed via the bg-review extraction.

- `discover_plugins()` before fallback in run_agent.py from #16451 —
  redundant: `get_plugin_context_engine()` already routes through
  `_ensure_plugins_discovered()` which is idempotent.

- Runtime identity diagnostics method + helpers from #13373 (+251 LOC) —
  operators can already read engine state via `engine.get_status()`;
  the diagnostics view added marginal value relative to its surface area.

- The 553-LOC slash-command machinery from #17600 — replaced with a
  20-LOC `register_command` method on the collector that reuses the
  existing plugin command registry instead of building a parallel one.

Net: ~215 LOC of host-contract changes + 282 LOC of focused tests, vs
~1,176 LOC across the original 5 PRs.

Co-authored-by: Tosko4 <1294707+Tosko4@users.noreply.github.com>

Closes #16453.
Closes #17453.
Closes #16451.
Closes #17600.
Closes #13373.
Related: stephenschoettler/hermes-lcm#68.
2026-05-28 01:45:30 -07:00
Teknium
09a5cd8084
fix(auth): sync manual:device_code Codex pool entries on re-auth (#33744)
#33164 made _save_codex_tokens sync the singleton-seeded `device_code`
pool entry on Codex OAuth re-auth. That fixed the #33000 path but missed
`manual:device_code` entries created by `hermes auth add openai-codex`
(the recommended workaround for users who hit #33000 before #33164
landed).

Every subsequent re-auth would refresh the device_code entry but leave
the manual:device_code entry holding the consumed refresh token plus
stale last_error_* markers — immediately recreating the 401
token_invalidated symptom on the next request, exactly as reported in
#33538.

Extend the refreshable source set to include `manual:device_code`.
Completing the device-code OAuth flow proves the user owns the ChatGPT
account, so it is safe to refresh every device-code-backed entry. Keep
`manual:api_key` and other non-device-code manual sources untouched —
those represent independent credentials.

Closes #33538.
2026-05-28 01:33:10 -07:00
Stephen Schoettler
8595281f3c fix: expose context engine tools with saved toolsets 2026-05-28 00:28:42 -07:00
Dusk1e
1a9ef83147 fix(security): require API_SERVER_KEY before dispatching API server work 2026-05-28 00:25:08 -07:00
LeonSGP43
442a9203c0 Fix xAI OAuth timeout manual fallback 2026-05-28 00:24:17 -07:00
Robin Fernandes
dc52b82d53 test(auth): update entitlement CI expectations 2026-05-28 00:19:31 -07:00
Robin Fernandes
1cf5e639b3 fix(auth): refresh Nous entitlement in tool menus 2026-05-28 00:19:31 -07:00
Robin Fernandes
406901b27d feat(auth) normalise the way in which we check whether a user has free/paid access to nous portal so we can expose behaviour and error messages accordingly. 2026-05-28 00:19:31 -07:00
Squiddy
3ba8962738 fix(kanban): add Windows init lock guard 2026-05-27 23:28:51 -07:00
Squiddy
90b6b3d18f fix(kanban): harden sqlite connection concurrency 2026-05-27 23:28:51 -07:00
Ben
b924b22a9d fix(docker): hermes update prints docker pull guidance instead of bogus git error
Inside the published Docker image, `hermes update` was hitting the
".git missing → reinstall via curl" fallback:

    ✗ Not a git repository. Please reinstall:
      curl -fsSL https://raw.githubusercontent.com/.../install.sh | bash

That message is wrong on two counts:
  1. It tells the user to run the host-side installer, which would
     install a *new* Hermes on the host — not update the running
     container.
  2. It doesn't mention `docker pull` at all, leaving Docker users
     to figure out the right action from scratch.

`hermes update --check` was worse: it bailed with "Not a git
repository — cannot check for updates." and nothing else.

Fix: detect the Docker install method (already stamped by
`docker/stage2-hook.sh` and surfaced by `detect_install_method()`)
in both update entry points and print a long-form message that
covers:

  - The right command: `docker pull nousresearch/hermes-agent:latest`
  - Restart guidance (`docker compose up -d --force-recreate` /
    re-run `docker run`)
  - How to verify the new version after restart
  - Tag-pinning caveat (`:latest` doesn't move a pinned tag)
  - Config persistence across upgrades (state under `HERMES_HOME` /
    `/opt/data` is bind-mounted and survives)
  - Fork escape hatch (build your own image with the repo's Dockerfile)

Exit code is 1 (matches `managed_error` semantic for "tried to
update but can't update this way").

Plumbing:
  - hermes_cli/config.py: new `format_docker_update_message()` helper
    sits next to the existing `_NIX_UPDATE_MSG` /
    `format_managed_message()` family so the wording lives in one
    place and both call sites (apply path + check path) consume it.
  - hermes_cli/main.py:
      * `cmd_update()`: bail right after the `is_managed()` gate, before
        any of the apply-path branches.
      * `_cmd_update_check()`: bail at the top of the function, before
        the existing `method == "pip"` branch.
    Neither path touches subprocess.run / git when method == "docker".

Coverage:
  - 7 new tests in `tests/hermes_cli/test_cmd_update_docker.py`:
      * `hermes update` in Docker → message + exit 1, no git calls
      * `hermes update --check` (via cmd_update) → same
      * `--yes` / `--force` don't bypass (intentional)
      * `_cmd_update_check` called directly → bails too
      * git/pip installs still take their normal paths (regression guards)
      * `format_docker_update_message` content-lock test pinning the
        five user-actionable bits the message must contain
  - Existing test_cmd_update.py (21 tests) + test_managed_installs.py
    (5 tests) still pass — no regression on the source-install path.
  - Verified end-to-end in a real container: `docker run ... update`
    and `docker run ... update --check` both render the message and
    exit 1.
2026-05-28 15:50:25 +10:00
Ben
66489f38c7 fix(docker): bake build-time git SHA into the image
`hermes dump` and the startup banner both call `git rev-parse HEAD` to
report the running commit, but `.dockerignore` line 2 excludes `.git` —
so inside the published image `hermes dump` shows
`version: ... [(unknown)]` and the banner drops its `· upstream <sha>`
suffix entirely.  That makes support triage from container bug reports
impossible: we can't tell which commit the user is actually running.

Fix: thread the build-time SHA through as a Docker build-arg, write it
to `/opt/hermes/.hermes_build_sha` in the image, and have a new
`hermes_cli/build_info.get_build_sha()` read it as a fallback after the
existing live-git lookup fails.  Output format is unchanged in both
callsites — same 8-char short SHA whether resolved live or baked.

Wiring:
  - Dockerfile: `ARG HERMES_GIT_SHA=` + write-file step after the source
    copy.  Empty/missing arg → no file written → callers fall through to
    live git (so local `docker build` without --build-arg is unchanged).
  - docker-publish.yml: passes `HERMES_GIT_SHA=${{ github.sha }}` on all
    four build-push-action steps (amd64/arm64, smoke-test + final push).
  - dump.py:_get_git_commit() / banner.py:get_git_banner_state(): try
    live git first, fall back to baked SHA, then to legacy `(unknown)`
    / None.  Banner returns `upstream == local, ahead=0` because a built
    image is by definition pinned to one commit.

Coverage:
  - Unit tests cover build_info (file present/absent/empty/error,
    truncation, whitespace), dump (live-git wins, both fallbacks,
    identical output-format regression guard), and banner (no-repo +
    baked, no-repo + no-sha, shallow-clone fallback).
  - tests/docker/test_dump_build_sha.py is an integration regression
    guard that runs against the real image, reads
    `/opt/hermes/.hermes_build_sha`, and asserts `hermes dump` surfaces
    its content (or stays at `(unknown)` if no file).
  - Verified end-to-end: `docker build --build-arg HERMES_GIT_SHA=abc...`
    → `docker run ... dump` reports `[abc12345]`; without the build-arg
    it reports `[(unknown)]` as before.
2026-05-28 15:14:05 +10:00
teknium1
ebe04c66cd fix(kanban): close kanban.db FD after every connect() in long-lived processes
`sqlite3.Connection.__exit__` commits/rollbacks but does NOT close the
underlying FD. `with kb.connect() as conn:` in long-lived processes
(gateway `run_slash`, dashboard `decompose_task_endpoint`) therefore
leaks one FD to `kanban.db` per call. After enough operations the
gateway dies with `[Errno 24] Too many open files` (~4 days uptime
in the production report — #33159).

Fix: add a `connect_closing()` context manager in `hermes_cli/kanban_db`
that wraps `connect()` with a real `try/finally: conn.close()`. Switch
the 42 leak-prone call sites in `hermes_cli/kanban.py` (35),
`hermes_cli/kanban_decompose.py` (4), and `hermes_cli/kanban_specify.py`
(3) over to it.

`kanban.py` matters because `run_slash` (called from the gateway for
every `/kanban` slash command) parses argparse and dispatches to those
`_cmd_*` functions in-process — each one was leaking one FD per
invocation.

Tests inside `tests/` are untouched: short-lived processes where OS
cleanup masks the leak. Regression tests added in
`test_kanban_db.py` cover both happy-path and exception-path closure,
plus an explicit assertion that bare `with kb.connect()` still does
NOT close (documenting the upstream sqlite3 behaviour we're working
around).

Closes #33159.
2026-05-27 22:07:49 -07:00
Dusk
c341a2d107
fix(docker): align HOME for dashboard and s6 gateway services (#33481) 2026-05-28 13:42:27 +10:00
Ben Barclay
b345323195 fix(docker): tee supervised gateway stdout to docker logs
Follow-up to #33583 (the gateway-run-supervised redirect).

Before this fix, the supervised gateway's stdout (most visibly the
"Hermes Gateway Starting…" rich-console banner) was swallowed by
`s6-log` into the rotated file at
`${HERMES_HOME}/logs/gateways/<profile>/current` and never reached
`docker logs`. Operational signal lived in two places:

  * **docker logs** — saw stderr (Python `logging` defaults to
    stderr), so warnings/errors were visible.
  * **the rotated file** — saw stdout (rich banners, `print()`
    output, third-party libs that wrote to fd 1).

This was surprising for users coming from the pre-s6 image, where
`docker run … gateway run` produced a single unified stream in
`docker logs`. They'd see partial output, conclude something was
broken, and dig around for the missing pieces.

Fix: add the `1` s6-log action directive before the file destination
so each line is forwarded to s6-log's stdout — which propagates up
the s6-supervise pipeline to /init's stdout = container stdout =
`docker logs`. The file destination is preserved as a second
destination, so the rotated log (with ISO 8601 timestamps) still
exists for `hermes logs` and for survival across container restarts.

Trade-off considered: timestamps. Putting `T` between `1` and the
file destination (not before `1`) means:

  * docker logs sees raw lines — Python's logging formatter has its
    own timestamps, and `docker logs --timestamps` adds another
    layer when desired. No double-stamping in the common reading
    path.
  * The persisted file gets s6-log's ISO 8601 timestamp so even
    output that lacked a Python-logger timestamp (rich banners,
    third-party raw prints) is correlatable in `current`.

Verification:

  * New unit-test assertion in `test_service_manager.py` locks the
    `s6-log 1` directive into the rendered run-script. Mutation-
    tested by reverting to the pre-fix script (no `1`); the assert
    catches it cleanly.
  * New docker-harness test `test_supervised_gateway_stdout_reaches_docker_logs`
    builds the image, runs `docker run … gateway run`, and asserts
    the unique `⚕` banner glyph reaches `docker logs`. Also verifies
    the rotated file still contains the banner (no regression on
    the existing file destination). Mutation-tested end-to-end: built
    a deliberately-broken image without the `1` directive and the
    test failed exactly as designed, citing the banner present in
    `current` but absent from `docker logs`.
  * `website/docs/user-guide/docker.md` gains a new `:::note Where
    gateway logs go` admonition documenting both destinations and
    the audit-log file at `${HERMES_HOME}/logs/container-boot.log`.

Existing functionality preserved: every other docker-harness test
still passes against the new image. Unit-test sweep across
`tests/hermes_cli/` (5561 tests) is green.
2026-05-28 13:18:41 +10:00
brooklyn!
912e6e2274
fix(tui): suppress mouse-residue leaks during Python launcher startup (#31213)
* fix(tui): suppress mouse-residue leaks during Python launcher startup

`hermes --tui …` spends ~100–300ms inside the Python launcher (lazy
imports, arg parsing, session resolution) before exec'ing the Node TUI
binary. During that window stdin is still in cooked + echo mode. If a
prior session left DEC mouse tracking asserted (or the user spammed
mouse movement while the previous session was opening), the terminal
keeps emitting `\\x1b[<…M` SGR motion reports that get echoed straight
back into the user's shell scrollback as literal `^[[<…M` text and
sit there above the TUI banner until the next clear.

The Node side already calls `resetTerminalModes()` in `entry.tsx`, but
by then the race is already lost — the bytes echoed during the Python
warmup window were committed to the scrollback before Node started.

Fix: write the mouse-tracking disable sequence at the very top of
`hermes_cli.main`, before every heavy import. The terminal stops
emitting motion events as soon as the bytes hit the wire (one TTY
round-trip), shrinking the race window from hundreds of milliseconds
to a few. `HERMES_TUI_NO_EARLY_DISABLE=1` opts out for diagnostics.

* test(tui): drop dead _reload_main, hoist import out of patch context

Addresses Copilot review on PR #31213.

The tests used to import `hermes_cli.main` inside the `patch("os.write")`
context, which Copilot pointed out is order-dependent: if the module
is already loaded (e.g. imported by a prior test in the same process),
the import is a no-op and the patch only sees the explicit
`_suppress_mouse_residue_early()` call. Either way the assertion can
flake when run alongside other tests.

Move the import to module scope — every subprocess gets a fresh
`hermes_cli.main`, whose module-level invocation is a no-op under
pytest argv. Tests then exercise `_suppress_mouse_residue_early()`
directly inside their own patch context. Also drop the unused
`_reload_main` helper.

* fix(tui): skip early mouse-disable when stdout is not a TTY

Addresses Copilot review on PR #31213.

`hermes --tui … >log` or CI capture pipes fd 1 away from the terminal.
The disable bytes can't reach the terminal in that case but would
still get written into the log file as raw CSI sequences. Guard with
`os.isatty(1)` inside the existing `try/except OSError` block so the
'never break startup' contract holds.

* docs(tui): rephrase 'raw cooked mode' as 'cooked + echo mode'

Copilot review nit on PR #31213 — the original wording was self-
contradictory. Pre-TUI stdin state is cooked + echo (kernel TTY
discipline still owns the line buffer and echoes input back). The
TUI switches it to raw mode later when Ink mounts.
2026-05-27 22:03:45 -05:00
Ben Barclay
0927fb5584 feat(docker): auto-redirect gateway run to supervised mode inside s6 image
Pre-s6, `docker run nousresearch/hermes-agent gateway run` was the
standard invocation: gateway ran as the container's main process,
tini reaped zombies, container exit code matched gateway exit code,
no supervision. With s6-overlay as PID 1, the same invocation now
auto-upgrades to supervised semantics — auto-restart on crash,
dashboard supervised alongside (when HERMES_DASHBOARD=1 is set),
multiple profile gateways under the same /init.

Users get the new behavior with zero changes to their docker run
command. A loud one-line breadcrumb on stderr explains the upgrade
and points at the opt-out for users who genuinely want pre-s6
foreground semantics.

How it works:

  1. `_gateway_command_inner` (the `gateway run` handler) checks if
     we're inside a container with s6 as PID 1.
  2. If yes, dispatches `start` to the s6 service manager (registers
     and starts gateway-default), then `exec sleep infinity` to keep
     the CMD process alive without binding container lifetime to
     gateway PID lifetime. The supervised gateway can flap freely;
     `docker stop` still tears everything down via /init stage 3.
  3. If no, falls through to the existing foreground code path
     unchanged. Host runs of `hermes gateway run` are unaffected.

Three gates make the redirect inert outside the intended scope:

  * `detect_service_manager() != "s6"` — host/non-s6-container runs.
  * `HERMES_S6_SUPERVISED_CHILD=1` env var (recursion guard) —
    exported by `S6ServiceManager._render_run_script` for the
    s6-supervised invocation itself. Without this guard, the
    supervised `gateway run --replace` would re-enter the redirect
    and recurse (run → start → run → start → ...) infinitely.
  * `--no-supervise` CLI flag OR `HERMES_GATEWAY_NO_SUPERVISE=1` env
    var — explicit user opt-out for CI smoke tests, debugging the
    foreground startup path, or any case wanting "CMD exit =
    container exit" semantics. Strict truthiness (1/true/yes,
    case-insensitive); typos like `=0` do NOT silently opt out.

Tests:

  * Unit tests in tests/hermes_cli/test_gateway_s6_dispatch.py
    cover all five paths (host no-op, supervised fire, sentinel
    recursion guard, CLI flag, env var truthy + falsy). The two
    load-bearing gates (sentinel + opt-out) were mutation-tested
    by removing each gate in isolation and confirming the dedicated
    test fails with the expected error.
  * Docker harness tests in tests/docker/test_gateway_run_supervised.py
    cover the round trips end-to-end against a built image: redirect
    fires (sleep-infinity heartbeat + supervised gateway-default
    slot + breadcrumb), --no-supervise opt-out (foreground gateway,
    no want-up on the slot), HERMES_GATEWAY_NO_SUPERVISE env var
    works identically, recursion is impossible (≤1 supervised
    python gateway-run + exactly 1 sleep-infinity parented to the
    CMD wrapper), and HERMES_DASHBOARD=1 produces both supervised
    gateway and supervised dashboard.

Docs:

  * Added a `:::tip Gateway runs supervised` admonition near the
    main docker.md example explaining the upgrade and pointing at
    the opt-out. Pre-s6 (tini-based) images still run gateway run
    as the foreground main process, so the note is scoped to the
    s6 image only.

Trade-off documented in the helper docstring: container exit code
under the redirect is sleep's exit code (always 0 on SIGTERM), not
the gateway's. That was an explicit design call — the supervised
gateway is allowed to flap without taking the container with it,
which is what "supervision" means. CI users who want exit-code
forwarding can pass --no-supervise.
2026-05-28 12:42:13 +10:00
Stephen Chin
ffdc937c18 fix(kanban): hoist zombie reaper out of dispatch_once
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows.

Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix.

Squashes three sibling commits from PR #32301 into one logical change for batch review.
2026-05-27 14:31:55 -07:00
steveonjava
99c19eb2fe fix(kanban): add post-commit page_count invariant check to write_txn
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint)
Refs: #30445, #30896, #30908 (corruption reports)
2026-05-27 14:31:55 -07:00
Stephen Chin
c002668ff0 fix(kanban): add grace period to detect_crashed_workers
`detect_crashed_workers` calls `_pid_alive` on every `running` task whose
claim is held by this host. The check can transiently return False for a
freshly-spawned worker (fork → /proc-visibility lag, or reap-race
between SIGCHLD and parent reaping). When a second dispatcher ticks
inside that window it reclaims the task and spawns a duplicate worker.

Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an
`HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override.
`detect_crashed_workers` skips the liveness check when
`time.time() - started_at < grace`. The existing 15-minute claim TTL
still reclaims genuinely-crashed workers; grace only suppresses the
launch-window false positive.

`HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home`
fixture in `test_kanban_core_functionality.py` so existing tests that
assert immediate reclaim retain pre-fix semantics.

Companion to merged PR #23442 (`release_stale_claims`, closes #23025),
which addressed the same multi-dispatcher race in the stale-claim path.
Related: #20015 (`_pid_alive` false-negative behaviour),
2026-05-27 14:31:55 -07:00
Stephen Chin
e83252dc46 fix(kanban): preserve original exception when write_txn rollback fails
When code inside a write_txn block raises an OperationalError that SQLite
has already auto-rolled-back (typical for disk I/O error,
database is locked, and database disk image is malformed), the
explicit ROLLBACK in write_txn.__exit__ itself raises
cannot rollback - no transaction is active and the secondary exception
replaces the original in the traceback. Operators see a misleading error
and lose the diagnostic information they need.

Swallow the rollback-time OperationalError so the caller always sees the
original cause.

Confirmed reproducer: tests/hermes_cli/test_kanban_db.py::
test_write_txn_preserves_original_exception_when_rollback_fails
2026-05-27 14:31:55 -07:00
Stephen Chin
6416dd5187 fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL)
Production corruption #6 left b-tree pages with zeroed headers but intact old cell content — the Bug E pattern. This fix applies three pragma calls on every connect():

- synchronous=FULL (was NORMAL): closes the WAL-checkpoint reordering window where a crash between WAL commit and main-DB write leaves a partially-written b-tree page header. Cost is <1ms per commit on local SSD; negligible at kanban write volume.

- secure_delete=ON: forces SQLite to zero freed page bytes on disk. If a torn write or hardware fault later corrupts a page, the underlying cell content is zero, so corruption is detectable and no stale rows can resurface as live data.

- cell_size_check=ON: adds a read-side guard so corrupt cells surface as errors at read time rather than as silent wrong-data returns.

All three are connection-scoped and re-applied on every connect(). secure_delete also writes a persistent flag into the DB header on the first call against a fresh DB, making the protection durable across processes for new DBs.

Tests added for all four required cases: each pragma active on a fresh connection, and all three re-applied after close+reopen. Also adds the required negative test (migration path does not reset pragmas).
2026-05-27 14:31:55 -07:00
wysie
f040710d04 fix: backfill official optional skill provenance 2026-05-27 13:39:58 -07:00
wysie
a38e283395 fix: preserve nested official skill install paths 2026-05-27 13:39:58 -07:00
Franci Penov
6f2a2f157f fix: check upstream even when origin/main has no new commits
The upstream sync logic only ran after a successful origin pull,
so forks whose origin/main was already in sync with local (but
behind upstream/main) would bail out with "Already up to date!"
without ever checking upstream.
2026-05-27 13:10:50 -07:00
Teknium
e8955f222c
fix(codex): drop dead model slugs that HTTP 400 on ChatGPT Pro (#33424)
DEFAULT_CODEX_MODELS shipped three slugs that the chatgpt.com Codex
backend rejects with HTTP 400 'The <slug> model is not supported when
using Codex with a ChatGPT account.' on every account tested live:

  gpt-5.2-codex
  gpt-5.1-codex-max
  gpt-5.1-codex-mini

Live verified against https://chatgpt.com/backend-api/codex/models
which returns gpt-5.5, gpt-5.4, gpt-5.4-mini, gpt-5.3-codex,
gpt-5.3-codex-spark, gpt-5.2 for ChatGPT Pro accounts.

When _fetch_models_from_api fell back to DEFAULT_CODEX_MODELS (offline
first-run, transient API failure) the picker surfaced these dead slugs
and crashed on selection. The forward-compat synthesis table chained
them downstream too.

If OpenAI re-enables them on the OAuth-backed Codex backend, live
discovery will pick them up automatically — the defaults list is only
consulted when live discovery is unavailable.

Test fixture pivoted to use gpt-5.3-codex (templated by 4 entries) as
the synthesis driver so the forward-compat test still exercises the
synthesis path.
2026-05-27 12:16:15 -07:00
Teknium
9919caff46
feat(image_gen): add Krea provider plugin (Krea 2 Medium + Large) (#33236)
* feat(image_gen): add Krea provider plugin (Krea 2 Medium + Large)

New built-in image_gen backend wrapping Krea's Krea 2 foundation
image model family. Auto-discovered like the other image_gen plugins
and appears in 'hermes tools' → Image Generation → Krea.

Krea's API is asynchronous — submit returns a job_id, poll /jobs/{id}
until terminal. The provider hides that behind the synchronous
ImageGenProvider.generate() contract: submit, poll every 2s with
light backoff (max 5s), 3-minute ceiling matching Krea's hosted-tool
timeout. Result URL is materialised to $HERMES_HOME/cache/images/
to avoid CDN-expiry 404s downstream (same fix as xAI #26942).

Models:
- krea-2-medium (default — Krea's 'start here' recommendation)
- krea-2-large

Aspect ratios map landscape→16:9, square→1:1, portrait→9:16.
Resolution: 1K (Krea's only current option).

Kwarg passthrough: seed, creativity (raw/low/medium/high), styles,
image_style_references (capped 10), moodboards (capped 1) — matches
Krea's per-request limits. Unknown kwargs are ignored.

Config knobs (config.yaml):
  image_gen.provider: krea
  image_gen.krea.model: krea-2-medium | krea-2-large
  image_gen.krea.creativity: raw | low | medium | high
Env overrides: KREA_API_KEY (required), KREA_IMAGE_MODEL.

KREA_API_KEY is registered in OPTIONAL_ENV_VARS so 'hermes setup'
prompts for it.

31 new tests; image_gen suite + picker + tools_config: 211/211.

* fix(image_gen/krea): address review feedback

- Update KREA_API_KEY setup URL to the canonical token-creation page
  (https://www.krea.ai/app/api/tokens). The previous URL returned 404.

- Fail fast on non-retryable HTTP statuses during poll. The previous
  loop retried every HTTPError for the full 180s deadline, so an auth
  (401), billing (402), forbidden (403), or not-found (404) response
  would make image_generate hang for three minutes. Only retry
  transient statuses (408/409/425/429/5xx); surface everything else
  immediately.

- Add 5 tests covering fail-fast on 401/403/404 and retry on 429/503.

* fix(krea): point users at the real API token dashboard URL

Three call sites linked users to dashboard pages that don't exist:
- hermes_cli/config.py: https://www.krea.ai/app/api/tokens
- plugins/image_gen/krea/__init__.py get_setup_schema: https://www.krea.ai/api-keys
- plugins/image_gen/krea/__init__.py auth_required error: https://www.krea.ai/api-keys

Per Krea's own docs (https://docs.krea.ai/developers/api-keys-and-billing),
the real dashboard URL is https://www.krea.ai/settings/api-tokens. All three
sites now point there.
2026-05-27 11:01:47 -07:00
JohnC1009
414a5bc924 fix(auth): fall back to global auth.json in _load_provider_state
In profile mode, _load_provider_state previously returned None when a
provider was absent from the profile's auth.json — even if the user had
authenticated at the global root. This broke runtime credential resolvers
that read state directly (resolve_nous_access_token,
resolve_nous_runtime_credentials), causing profiles without their own
nous login to fail with 'Hermes is not logged into Nous Portal' despite
a valid global session.

Push the existing read-only global fallback (already used by
get_provider_auth_state and read_credential_pool) into _load_provider_state
so every caller benefits, and simplify get_provider_auth_state into a thin
wrapper. Writes still target the profile only — profile state continues to
shadow global state on the next read after a per-profile login. Behavior in
classic (non-profile) mode is unchanged because _load_global_auth_store
returns an empty dict.

Adds 5 tests covering the new contract on _load_provider_state directly.
Existing 770 auth/credential/nous tests still pass.
2026-05-27 09:38:58 -07:00
Teknium
69dfcdcc15 fix(auth): codex chat path falls back to credential_pool when singleton is empty
Closes #32992.

The chat path resolves Codex credentials via `resolve_codex_runtime_credentials`
which only reads `providers.openai-codex.tokens` (the singleton). The auxiliary
path uses `_read_codex_access_token` which checks the credential_pool first.
For users whose tokens live only in the pool — manual seed, partial re-auth,
restore from backup, or any state where the singleton is empty but the pool
is healthy — the chat path raised AuthError or (worse, since OpenAI(api_key='')
silently attaches no header) the wire saw HTTP 401 "Missing Authentication header"
while the auxiliary path worked fine.

This adds a pool fallback to `resolve_codex_runtime_credentials`: when the
singleton has no usable access_token, scan `credential_pool.openai-codex` for
the first entry that has a non-empty access_token and isn't in an exhaustion
cooldown window (`last_error_reset_at` in the future). If found, return that
token with `source="credential_pool"`. If no usable entry exists, the original
AuthError propagates as before.

Regression tests cover:
- Empty singleton + healthy pool entry → pool token returned
- Pool fallback skips entries currently in cooldown
- Empty singleton + empty/wedged pool → AuthError propagates (existing contract preserved)
2026-05-27 03:43:51 -07:00
konsisumer
f1422ffd77 fix(gateway): classify Codex 429 quota as rate-limit, not missing credentials
When the Codex OAuth token endpoint returns 429 (usage-limit / quota
exhaustion), refresh_codex_oauth_pure raised a generic auth error that the
gateway surfaced as 'Primary provider auth failed: No Codex credentials
stored. Run hermes auth', prompting re-auth that cannot lift a quota cap.

Classify 429 distinctly (codex_rate_limited, relogin_required=False) with a
non-alarming quota message that honors Retry-After, log it as
'Primary provider rate-limited (429)', and stop format_auth_error from
appending the re-authenticate remediation. Also log the fallback provider's
literal config key instead of the resolved runtime category.

Refs #32790
2026-05-27 03:13:15 -07:00
konsisumer
2bbd53493d fix(cli): sync credential_pool on Codex re-auth
Codex re-auth via `hermes setup` / `hermes model` wrote fresh OAuth
tokens to providers.openai-codex.tokens but left the credential_pool
device_code entry holding the consumed refresh token and stale error
markers. Since the runtime selects from the pool, the next request
spent a dead token and got a 401 token_invalidated. Update the
singleton-seeded pool entries in lockstep and clear their error state.

Fixes #33000
2026-05-27 03:02:06 -07:00
Robert DaSilva
efa952531b fix: ignore Telegram start pings 2026-05-27 02:41:24 -07:00
Ben
a890389b69 feat(dashboard-auth): HERMES_DASHBOARD_PUBLIC_URL / dashboard.public_url override
Operators behind reverse proxies that don't reliably forward
X-Forwarded-Host / X-Forwarded-Proto / X-Forwarded-Prefix (manual
nginx setups, on-prem ingresses, custom-domain Fly deploys with
incomplete proxy chains) had no way to force the absolute base URL
the OAuth callback redirects from. The dashboard would reconstruct
the redirect_uri from request headers, the IDP would echo it back,
and the user would land on the wrong host or wrong path — 404.

Add `dashboard.public_url` to config.yaml with env override
HERMES_DASHBOARD_PUBLIC_URL. When set, it is the complete authority —
scheme + host + optional path prefix (e.g. https://example.com/hermes) —
and becomes the base for the OAuth `redirect_uri`. X-Forwarded-Prefix
is IGNORED on this code path because the operator has explicitly
declared the public URL; we no longer need to guess from proxy
headers, and stacking the prefix on top would double-prefix the
common case where the prefix is already baked into public_url.

When unset, the existing proxy_headers + X-Forwarded-Prefix
reconstruction runs untouched. Existing Fly.io deploys continue to
work without configuration — this is purely additive.

Precedence mirrors dashboard.oauth.client_id:

  env (non-empty) > config.yaml > reconstructed from request

Implementation:

  - hermes_cli/config.py: add dashboard.public_url to DEFAULT_CONFIG
    with a multi-paragraph doc comment explaining the use case,
    the X-Forwarded-Prefix interaction, and the validation rules.
  - hermes_cli/dashboard_auth/prefix.py: factored out the existing
    _REJECT_CHARS frozenset, added _normalise_public_url() validator
    (requires http/https scheme + non-empty host + no header-injection
    chars), _load_dashboard_section() loader (robust to load_config
    raising, non-dict shapes), and resolve_public_url() entry point
    with the env-overrides-config precedence. A malformed value
    silently falls through to ""; the caller treats "" as "reconstruct
    from request" so a typo never breaks the login flow.
  - hermes_cli/dashboard_auth/routes.py: rewrite _redirect_uri()
    docstring to spell out the three resolution tiers; add the
    public_url short-circuit before the existing X-Forwarded-Prefix
    splicing. Source-level comment notes that X-Forwarded-Prefix is
    intentionally ignored when public_url is set so a future reader
    doesn't try to "fix" the missing prefix layering.
  - cli-config.yaml.example: extend the existing dashboard section
    with a public_url block.
  - website/docs/user-guide/features/web-dashboard.md: new "Public
    URL override" section between the provider configuration and
    the OAuth flow walkthrough. Documents the env-vs-config table,
    the validation rules, and the `http://` `public_url` ↔ Secure
    cookie footgun.

Test coverage — new TestPublicUrlOverride class (8 tests):

  - env var overrides request reconstruction (the primary motivating
    case)
  - config.yaml used when env unset
  - env wins over config (precedence pin)
  - public_url with a path prefix already baked in (the Q1-a case the
    user explicitly chose)
  - public_url suppresses X-Forwarded-Prefix layering (defends
    against the double-prefix bug)
  - trailing slash stripped from public_url (no //auth/callback)
  - malformed public_url falls through to reconstruction (six
    hostile inputs: javascript:, ftp:, missing scheme, missing host,
    quote chars, CRLF injection)
  - empty env string doesn't shadow config.yaml entry (CI / Fly
    provisioned-but-empty secret case)

Mutation-tested: flipping the precedence in resolve_public_url() trips
exactly test_env_overrides_config_public_url; weakening the validator
(accept any scheme) trips exactly test_malformed_public_url_falls_through_to_reconstruction.
Both other tests in each pair stay green, confirming the suite
discriminates the specific regression each test pins.
2026-05-27 02:12:27 -07:00
Ben
0af37ff272 style(dashboard-auth): redesign /login page to match Nous design system
The login page is the first surface the user sees on a gated dashboard
and shipped with off-the-shelf system fonts and a generic orange
accent that didn't match the React dashboard waiting on the other
side of the OAuth round trip. Apply the same visual language the SPA
uses (the @nous-research/ui package) so the auth flow feels like one
product, not two.

What changes (visual only — no functional changes):

  Typography
    - Body: Collapse (regular + bold), served from /fonts/ — the same
      woff2 files the dashboard SPA loads via the design-system's
      fonts.css.
    - Display: Rules Compressed (regular + medium) for the brand
      wordmark and the page heading.
    - Brand chrome (heading, buttons, footer) uses the DS idiom:
      uppercase + letter-spacing 0.2em (matching the DS Button class).

  Colour
    - Background: #170d02 (deep brown-black; --background-base in DS).
    - Accent: #ffac02 (amber; --midground in DS).
    - Foreground: #ffffff.
    - Hairlines: color-mix() of the midground at 18% / 35%, mirroring
      the DS "@theme inline" derived tokens.

  Button surface
    - Solid amber surface with dark text, no rounded corners (DS Button
      is squared). Inset bevel —  — directly mirrors the DS
      Button SHADOW_DEFAULT (). :active uses filter:invert(1) which matches the DS
      Button's .

  Atmosphere
    - Subtle 3px dither (repeating-conic-gradient at 4% midground) +
      a midground radial glow at top — same idioms as the DS .dither
      utility and the SPA's panel chrome.
    - slide-up fade-in entrance animation matching DS @keyframes
      slide-up (0.6s ease-out). Honours prefers-reduced-motion.

  Brand wordmark
    - 'NOUS · RESEARCH' above the card in Rules Compressed, amber,
      0.32em tracking. Establishes ownership before the user squints
      at the buttons.

  Empty-state page
    - The 'Sign-in unavailable' fallback (no providers registered)
      got the same colour-token and typography treatment so the
      misconfigured-deploy experience is also coherent.

Fonts are served from /fonts/*.woff2 — a path the dashboard-auth gate
already allowlists pre-auth (see _GATE_PUBLIC_PREFIXES in
middleware.py:42), so the login page renders with the brand typeface
without needing the React bundle loaded. The page is still entirely
static HTML+CSS with no JS — the original constraint (no SPA
dependency, no session token) is preserved.

The class="provider-btn" selector is unchanged — the existing test
suite extracts the anchor href via that class, and a regression that
renamed it would silently break tests/hermes_cli/test_dashboard_auth_401_reauth.py.
A docstring note on the module flags this so future visual tweaks
don't break the contract by accident.

Visual smoke-test: rendered both the happy path (multiple providers
listed) and the empty-state page in a browser and verified all five
DS criteria — brown-black bg, amber accent, uppercase wide-tracking
type, inset-bevel buttons, Nous · Research wordmark — render
correctly with no unstyled fallbacks. 208/208 dashboard-auth tests
remain green.
2026-05-27 02:12:27 -07:00
Ben
61dcc33893 feat(dashboard-auth): config.yaml as canonical surface for dashboard.oauth
Per AGENTS.md, ~/.hermes/.env is reserved for API keys / secrets and
config.yaml is the surface for non-secret configuration. The Nous
Portal plugin previously read HERMES_DASHBOARD_OAUTH_CLIENT_ID and
HERMES_DASHBOARD_PORTAL_URL from the environment only, which forced
local-dev / on-prem operators to put non-secret per-instance
configuration in .env — violating the convention.

Add dashboard.oauth.{client_id,portal_url} to DEFAULT_CONFIG and have
the plugin resolve each setting with env-overrides-config precedence:

  1. Env var when set to a non-empty value (Fly.io platform-secret
     injection — what pushes per-deploy client_ids without baking
     them into the image).
  2. config.yaml entry (canonical surface for local dev / on-prem).
  3. Plugin default (no provider registered when client_id is empty;
     portal_url defaults to https://portal.nousresearch.com).

Empty env values are explicitly treated as unset so a provisioned-but-
not-populated Fly secret can't accidentally shadow a valid config.yaml
entry with an empty string — operators would otherwise lose the gate.

Implementation:

  - hermes_cli/config.py: add dashboard.oauth.{client_id,portal_url}
    block to DEFAULT_CONFIG with full doc comment explaining the
    override precedence and Fly.io rationale.
  - plugins/dashboard_auth/nous/__init__.py: add _load_config_oauth_section,
    _resolve_client_id, _resolve_portal_url helpers; replace the two
    direct os.environ.get() calls in register() with the resolvers.
    Update the skip-reason string to mention BOTH surfaces so an
    operator looking at the fail-closed bind error knows config.yaml
    is a valid alternative to the env var.
  - plugins/dashboard_auth/nous/plugin.yaml: update description to
    name both surfaces. requires_env stays pointing at the env var
    name — it's metadata-only (not used by the plugin loader for
    gating) so this is documentation/UX, not enforcement.
  - cli-config.yaml.example: append commented dashboard.oauth block
    with the same override rationale operators see in code.
  - website/docs/user-guide/features/web-dashboard.md: rewrite the
    'Default provider: Nous Research' section to lead with config.yaml,
    present env vars as operator overrides (Fly.io's primary path).
    Updated the example fail-closed bind error to match the new
    skip-reason text.

Test coverage — new TestConfigYamlSource class (8 tests) pinning
every tier of the precedence chain:

  - config-yaml-only path registers correctly
  - both config-yaml fields (client_id + portal_url) honoured
  - env var overrides config for client_id (Fly.io critical path)
  - env var overrides config for portal_url
  - empty env string does NOT shadow config (CI/Fly edge case)
  - neither source set → skip with reason mentioning BOTH surfaces
  - load_config() raising falls through to env-only path (resilience)
  - non-dict oauth section falls through cleanly (typo resilience)

Mutation-tested: flipping the precedence to config-wins-over-env trips
exactly test_env_overrides_config_client_id while the other 7 stay
green, confirming the suite discriminates the order, not just the
sources.

This closes the last item in Teknium's PR review (PR #30156).
2026-05-27 02:12:27 -07:00
Ben
b26d81d536 feat(dashboard-auth): honour X-Forwarded-Prefix + __Host-/__Secure- cookies
Mission-control style deploys reverse-proxy the dashboard at a path
prefix (e.g. mission-control.tilos.com/hermes/* -> :9119) and inject
X-Forwarded-Prefix: /hermes on every request. The SPA mount already
honoured this for asset URLs and the bootstrap __HERMES_BASE_PATH__,
but the OAuth gate didn't:

  1. The gate's Location: header to /login and the 401 envelope's
     login_url were built bare ("/login?next=..."). Under a /hermes
     prefix the browser follows that to mission-control.tilos.com/login
     which the proxy doesn't route to the dashboard.
  2. _redirect_uri (the OAuth callback URL handed to the IDP) used
     request.url_for() which doesn't honour X-Forwarded-Prefix
     (Starlette/uvicorn only proxy_headers Host + Proto + For). The
     IDP redirects back to /auth/callback instead of /hermes/auth/
     callback → 404 in the user's browser.
  3. Cookies were set with Path=/ which leaks them to other apps on
     the same origin and won't be sent back on requests under the
     prefix in the first place.

Fix threads the normalised prefix through every boundary:

  * New hermes_cli/dashboard_auth/prefix.py — single source of truth
    for X-Forwarded-Prefix parsing. web_server._normalise_prefix
    becomes a re-export so the SPA mount, the gate, and the cookies
    helper all agree.
  * middleware._unauth_response builds login_url = f"{prefix}/login".
  * routes._redirect_uri splices the prefix into the path component
    of the IDP-bound URL (with full validation of the header).
  * cookies.{set,clear}_{session,pkce}_cookie now take prefix="".
    Path attribute switches to /hermes when set; cookie name switches
    name variant (see below). Every caller passes the request's
    normalised prefix.

Cookie hardening (Teknium's lesser-note #1 in the PR review): adopt
the __Host- / __Secure- cookie name prefixes per draft-west-cookie-
prefixes. The variant is selected from (use_https, prefix):

  * Loopback HTTP → bare "hermes_session_at" (both prefixes require
    Secure, incompatible with HTTP).
  * HTTPS, direct deploy (Path=/) → "__Host-hermes_session_at".
    Strongest spec: bound to exact origin, no Domain attribute, Secure
    required.
  * HTTPS, behind a proxy prefix (Path=/hermes) →
    "__Secure-hermes_session_at". __Host- forbids Path != "/"; the
    explicit Path=/hermes covers same-origin app isolation.

Setter and reader BOTH consult the prefix because the cookie *name*
changes — a reader that looked up the bare name when the setter wrote
__Secure- would never find the value. The reader falls back across
all three variants so a request whose shape changed mid-session (e.g.
post-deploy from no-prefix to /hermes) still picks up the existing
cookie until it expires.

Test coverage:

  - tests/hermes_cli/test_dashboard_auth_prefix.py — new file. 11 tests
    pinning:
      • Location: /hermes/login on the gate's HTML redirect
      • 401 envelope login_url carries the prefix
      • Malformed X-Forwarded-Prefix is ignored (header-injection
        defence; the script-tag value is normalised to empty string)
      • _redirect_uri splices /hermes into the path (the property
        that prevents the IDP-returns-to-404 failure)
      • PKCE cookie uses Path=/hermes + __Secure- when proxied
      • Session cookies use __Host- when direct, __Secure- when
        proxied, bare on loopback HTTP
      • End-to-end round trip with hand-managed PKCE cookie carriage
        (TestClient can't simulate a Path=/hermes cookie automatically)
  - tests/hermes_cli/test_dashboard_auth_cookies.py — rewritten to pin
    each (use_https, prefix) shape produces its expected cookie name,
    plus reader-side coverage that __Host- and __Secure- variants are
    both recognised.
  - Existing tests across middleware / 401-reauth / etc. updated to
    match the new cookie names (substring contains instead of
    startswith).

Mutation-tested: reverting _unauth_response to build the bare
"/login" URL trips exactly the two tests that pin the prefix
carriage, confirming the suite discriminates the regression.
2026-05-27 02:12:27 -07:00
Ben
034ad95fed fix(dashboard-auth): propagate next= through login page + PKCE cookie
The gate's _unauth_response set next=<path> on the /login redirect URL,
but nothing downstream read it: render_login_html ignored next=,
auth_login dropped it, and auth_callback read next= from its own query
string — which an IDP never sets on the callback URL (real IDPs only
echo back code+state). The _validate_post_login_target plumbing in the
callback was unreachable on the happy path, so users always landed on
"/" regardless of what they originally requested.

Worse: reading next= from the callback URL was a latent open-redirect
sink, since an attacker could craft /auth/callback?...&next=/admin and
have the server honour it post-auth.

Fix carries next= through the round trip on a server-controlled channel:

  1. login_page reads request.query_params['next'] and passes it (post-
     validation) to render_login_html.
  2. render_login_html threads next= URL-encoded into each provider
     button's href, with HTML-attribute escaping as defence in depth.
  3. auth_login accepts ?next= as a query param, re-validates, and
     appends it as a fourth segment (next=<urlquoted>) in the PKCE
     cookie payload alongside provider/state/verifier.
  4. auth_callback no longer accepts a next: str = "" query param. It
     parses next= out of the PKCE cookie and validates that with the
     same same-origin rules. Any attacker-supplied ?next= on the
     callback URL is silently ignored — server-only carrier.

Test coverage adds three classes:

  - TestAuthCallbackNext drives /login → /auth/login → IDP-bounce →
    /auth/callback end-to-end without smuggling next= onto the callback
    URL (which is what the previous tests did and why they didn't
    catch the bug). Includes test_attacker_callback_next_param_is_ignored
    to pin the security property that the URL value is never read.
  - TestRenderLoginHtmlNext covers the rendering function at the
    unit boundary so a regression that drops next_path is caught
    without spinning up the full app.
  - TestAuthLoginPkceCookieNext inspects the Set-Cookie header on
    /auth/login responses so a regression in cookie encoding is caught
    without driving the full round trip.

Mutation-tested: reverting auth_callback to read next= from the URL
trips 3 of 6 TestAuthCallbackNext tests (the safe-path and attacker-
hardening ones), confirming the suite discriminates between the cookie
read and the URL read.
2026-05-27 02:12:27 -07:00
Ben
c3104195b8 fix(dashboard-auth): bypass loopback WS peer check in gated mode
When the OAuth gate is active, start_server runs uvicorn with
proxy_headers=True so the dashboard can honour X-Forwarded-Proto from
Fly's TLS terminator (cookies, redirect URI reconstruction). A side
effect: ws.client.host is rewritten to the X-Forwarded-For value, which
on Fly is the real internet client IP — never loopback. The loopback
peer guard in _ws_client_is_allowed then rejected every WS upgrade in
gated mode (4403 close) even after a successful OAuth round trip and
ticket consumption, silently breaking /api/pty, /api/ws, /api/pub, and
/api/events.

Fix: in gated mode, bypass the peer-IP check. The OAuth gate +
single-use ticket is the auth. The Host/Origin guard in
_ws_host_origin_is_allowed still runs and is what protects against
DNS-rebinding here, not the peer IP.

Loopback mode behaviour is unchanged: the legacy ?token= path is the
only auth there and we don't want LAN hosts guessing tokens.

Regression coverage: TestWsRequestIsAllowedGated pins all four
behaviours — non-loopback peer allowed in gated mode, non-loopback peer
rejected in loopback mode, loopback peer allowed in loopback mode, and
the Host/Origin guard still firing on a rebinding attempt with gated
mode + matching peer.
2026-05-27 02:12:27 -07:00
Ben
42729775db fix(dashboard): trigger plugin discovery in cmd_dashboard before start_server
The argparse-setup plugin discovery path is gated on
_plugin_cli_discovery_needed(), which returns False for any built-in
subcommand including 'dashboard' (to save ~500ms startup on hot paths
like --tui). As a result, plugins/dashboard_auth/nous never registered
its DashboardAuthProvider, and start_server's fail-closed gate check
tripped for any non-loopback bind even when the Nous provider was
bundled and ready to run.

Call discover_plugins() explicitly in cmd_dashboard so the provider
registry is populated before the gate check runs. discover_plugins() is
idempotent (per its docstring), so this is safe to call regardless of
whether the argparse path already ran it.
2026-05-27 02:12:27 -07:00
Ben
b3dc539304 feat(dashboard-auth): Nous plugin always-on; default portal URL; specific error messages
The Nous OAuth provider plugin (plugins/dashboard_auth/nous) is bundled
and auto-loaded — same as before — but previously refused to register
unless BOTH HERMES_DASHBOARD_OAUTH_CLIENT_ID and HERMES_DASHBOARD_PORTAL_URL
were set, then the gate's fail-closed branch told the operator 'install
the default Nous provider'. That message is misleading: the provider IS
installed; it's just unconfigured. And the contract only really needs
the per-instance client_id — the portal URL is the same for everyone
in production.

Three changes:

1. plugins/dashboard_auth/nous/__init__.py:
   - HERMES_DASHBOARD_PORTAL_URL is now optional and defaults to
     'https://portal.nousresearch.com'. Override only for staging
     (portal.rewbs.uk) or a custom deployment. Empty string also
     falls back to the default so an empty Fly secret can't point
     the dashboard at nowhere.
   - Plugin exposes a module-level LAST_SKIP_REASON: str that the gate
     reads when no providers register. Cleared on each register() call.
     Skip reasons are human-readable and actionable
     ('HERMES_DASHBOARD_OAUTH_CLIENT_ID is not set. The Nous Portal
     provisions this env var…').

2. plugins/dashboard_auth/nous/plugin.yaml:
   - requires_env drops HERMES_DASHBOARD_PORTAL_URL; only the client_id
     is mandatory. Description updated to reflect this.

3. hermes_cli/web_server.py:
   - When the gate fail-closes for 'no providers', it now reads each
     bundled plugin's LAST_SKIP_REASON and embeds them in the SystemExit
     message. Operator sees the specific config fix needed:
       Bundled providers reported these issues:
         • nous: HERMES_DASHBOARD_OAUTH_CLIENT_ID is not set. …
     instead of the prior generic 'Install the default Nous provider'.

Tests:
  - TestPluginRegister rewritten to assert the new defaults +
    LAST_SKIP_REASON contents (6 tests, +1 new for empty-string env).
  - New gate test test_start_server_surfaces_nous_skip_reason_when_unconfigured.
  - test_get_method_is_not_allowed widened to handle the SPA-shell 200
    path explicitly — assertion now verifies no JSON ticket leaks
    rather than asserting a specific status code (covers all four of
    401/404/405/200).

Docs updated: web-dashboard.md's 'Default provider' section now shows
the env-var table with required/optional columns and embeds the
fail-closed error message verbatim so operators can match what they
see at the prompt.
2026-05-27 02:12:27 -07:00
Ben
2fc4615fc4 feat(dashboard-auth): Phase 7 — SPA AuthWidget + /api/status auth fields
Phase 7 surfaces the OAuth gate state to users.

web/src/components/AuthWidget.tsx (new):
  Sidebar widget that fetches /api/auth/me on mount and renders a
  compact 'Logged in as <user_id…> via <provider>' row with a logout
  icon. Contract V1 (Nous Portal) emits no email/display_name claims,
  so user_id is the display value (truncated to 14 chars + ellipsis);
  display_name and email fallthroughs are forward-compat for OQ-C1.
  Renders nothing on 401 from /api/auth/me — that's the signal the
  gate isn't engaged (loopback mode), in which case the widget would
  be confusing.
  Logout POSTs /auth/logout (which clears cookies + redirects to
  /login) then full-page-navigates to /login itself; the SPA's fetch
  wrapper doesn't follow that redirect, so the navigation is explicit.

web/src/App.tsx: mounts <AuthWidget /> above <SidebarFooter />.
  Component is self-hiding in loopback mode so there's no need for a
  conditional mount.

web/src/lib/api.ts:
  - getAuthMe() + logout() helpers
  - AuthMeResponse type
  - StatusResponse gets optional auth_required + auth_providers fields
    so the existing StatusPage can render a gated/loopback badge.

hermes_cli/web_server.py: /api/status payload now includes
  - auth_required: bool — whether app.state.auth_required is True
  - auth_providers: list[str] — registered DashboardAuthProvider names
  Lazy-imports list_providers so early-startup status calls don't
  crash if the dashboard_auth module is still being set up.

tests/hermes_cli/test_dashboard_auth_status_endpoint.py: 3 new tests
covering the new status fields in both gated and loopback modes plus
a regression that no existing field got dropped from the payload.

The hermes status CLI is unchanged in this commit — that command
tracks model providers + OAuth credentials, not running-dashboard
state. The /api/status endpoint is the canonical place to query
dashboard auth-gate state, consumed by the React StatusPage already.
2026-05-27 02:12:27 -07:00
Ben
5e9308b5b8 feat(dashboard-auth): Phase 6 — 401 re-auth envelope + next= propagation
Contract V1 of nous-account-service PR #180 ships no refresh tokens, so
the original Phase 6 silent-refresh design is replaced with a thinner
'401 → redirect to /login' UX. The dashboard's gated middleware now
emits a structured envelope on any auth failure; the SPA's fetch
wrapper sees it and full-page-navigates the user through re-auth.

hermes_cli/dashboard_auth/cookies.py:
  set_session_cookies(refresh_token='') SKIPS writing the
  hermes_session_rt cookie. Forward-compat: a non-empty refresh_token
  still emits the cookie unchanged, so a future Portal contract that
  starts issuing RTs flips the persistence on with no other change.
  clear_session_cookies still emits a Max-Age=0 deletion for the RT
  cookie so stale cookies from earlier deployments get flushed on
  logout / session expiry. Deprecation marker + rationale in
  module docstring per the user's docstring-only deprecation pattern.

hermes_cli/dashboard_auth/middleware.py:
  _unauth_response now builds a structured JSON envelope for API 401s:
    { error: 'session_expired' | 'unauthenticated',
      detail: 'Unauthorized',
      reason: <internal>,
      login_url: '/login?next=<safe-path>' }
  HTML redirects also carry next= so a user landing on /sessions
  without a cookie bounces back to /sessions after re-auth.
  _safe_next_target validates same-origin: drops protocol-relative
  paths (//evil.com), absolute URLs, and any /login or /auth/* loop.
  Dead cookies are cleared on the 401 path so the browser stops
  replaying invalid tokens.

hermes_cli/dashboard_auth/routes.py:
  /auth/callback accepts next= query param and validates via
  _validate_post_login_target (same rules as the gate's
  _safe_next_target — defence-in-depth because next= survived a full
  IDP round trip and attacker-controlled state can re-enter via the
  callback URL). Open-redirect attempts land at '/' instead.

web/src/lib/api.ts:
  fetchJSON parses the 401 envelope and full-page-navigates to
  body.login_url ONLY on the known session-expiry error codes.
  Domain-level 401s (e.g. permission errors) bubble up as regular
  errors. credentials: 'include' added so cookie auth works for all
  fetches routed through this wrapper. sessionStorage.lastLocation is
  preserved for future use by AuthWidget / hermes_status.

Test files marked with pytest.mark.xdist_group so the four files that
mutate web_server.app.state.auth_required serialize onto the same xdist
worker — eliminates 'works locally, fails in CI' app-state bleed.

20 new tests in test_dashboard_auth_401_reauth.py:
  - set_session_cookies(refresh_token='') skips RT cookie
  - clear_session_cookies still emits RT deletion
  - 401 envelope shape (unauthenticated vs session_expired)
  - dead cookie cleared on invalid-token 401
  - login_url carries next= for deep paths
  - login loop avoided when path is /login/auth/api-auth
  - protocol-relative URL rejected
  - _safe_next_target unit tests (accept same-origin, reject loops/abs)
  - /auth/callback respects safe next= but rejects open redirects

2 pre-existing tests updated to accept the new /login?next=%2F shape.

Full dashboard-auth suite: 168 passed, 1 skipped (Phase 0 pre-existing).
2026-05-27 02:12:27 -07:00