hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-14 14:12:44 +00:00

Author	SHA1	Message	Date
kshitijk4poor	66827f8947	chore: prune unused imports and duplicate import redefinitions Remove unused imports (F401) and duplicate/shadowed import redefinitions (F811) across the codebase using ruff's safe autofixes. No behavioral changes -- imports only. - ~1400 safe autofixes applied across 644 files (net -1072 lines) - __init__.py re-exports preserved (excluded from F401 removal so public re-export surfaces stay intact) - Re-exports that are imported or monkeypatched by tests but look unused in their defining module are kept with explicit # noqa: F401 (gateway/run.py load_dotenv; run_agent re-exports from agent.message_sanitization, agent.context_compressor, agent.retry_utils, agent.prompt_builder, agent.process_bootstrap, agent.codex_responses_adapter) - Unsafe F841 (unused-variable) fixes deliberately skipped -- those can change behavior when the RHS has side effects - ruff lints remain disabled in pyproject.toml (only PLW1514 is selected); this is a one-time cleanup, not a config change Verification: - python -m compileall: clean - pytest --collect-only: all 27161 tests collect (zero import errors) - core entry points import clean (run_agent, model_tools, cli, toolsets, hermes_state, batch_runner, gateway) - static scan: every name any test imports directly from an edited module still resolves	2026-05-28 22:26:25 -07:00
Teknium	69b74c15a3	fix(kanban): CLI dispatch honors max_in_progress/max_spawn from config; swap missing 'avoid-ai-writing' skill for bundled humanizer (#33488 , #29415 ) (#34337 ) Two small bugs in the kanban dispatcher's CLI surface that were silently degrading two distinct workflows. Bundled because the test files and the surrounding code surface overlap. ## #33488: hermes kanban dispatch ignored kanban.max_in_progress / max_spawn The CLI wrapper in hermes_cli/kanban.py:_cmd_dispatch only passed default_assignee and max_in_progress_per_profile through to dispatch_once. The global concurrency cap (kanban.max_in_progress) and the per-tick spawn limit (kanban.max_spawn) were silently dropped, so operators using 'hermes kanban dispatch' as a one-shot or in a custom loop couldn't reach either cap from config — only the gateway embedded dispatcher honored them. Fix: read both keys from config in the same coerce-positive-int helper that already handled max_in_progress_per_profile. CLI --max still wins over config kanban.max_spawn when both are present (explicit operator signal beats default), but absent --max falls back to config. ## #29415: synthesizer crashed in retry loop on missing skill hermes_cli/kanban_swarm.py:212 hardcoded skills=['avoid-ai-writing'], a skill that doesn't exist in the bundled skills/ directory or any registered hub source. Every synthesizer worker spawn failed at CLI startup with 'Unknown skill(s): avoid-ai-writing' before the agent loop even started — the dispatcher retried up to failure_limit (default 2), then auto-blocked the task, then dependency rules could re-promote it, looping forever until manual intervention. Fix: replace with 'humanizer' which is bundled at skills/creative/humanizer/SKILL.md (description: 'Humanize text: strip AI-isms and add real voice'). That's the obvious intent behind the 'avoid-ai-writing' name, and the skill is platform-portable (linux/macos/windows) so it works on every supported runtime. ## Tests tests/hermes_cli/test_kanban_cli_dispatch_passthrough.py — 4 cases: - CLI passes max_in_progress / max_spawn / default_assignee / max_in_progress_per_profile from config to dispatch_once - CLI --max flag overrides config kanban.max_spawn - Invalid cap values (0, -1, 'abc', '1.5') silently fall through to None - kanban_swarm.py no longer references 'avoid-ai-writing' AND the replacement 'humanizer' skill exists at the expected on-disk path Kanban suite: 468/468 pass (was 464; +4 new regression tests).	2026-05-28 21:00:46 -07:00
Ben	a618789dba	fix(dashboard-auth): share /api/* public allowlist between legacy and OAuth gates Two parallel public-path allowlists drifted: _PUBLIC_API_PATHS in hermes_cli/web_server.py (legacy _SESSION_TOKEN middleware) and _GATE_PUBLIC_PREFIXES in hermes_cli/dashboard_auth/middleware.py (OAuth gate). The legacy list included /api/status (documented as a non-sensitive read-only liveness target); the OAuth gate's list did not. Effect: every wildcard-subdomain agent surfaced as STARTING/down to the portal even though the dashboard was serving correctly. Nous account service (src/server/agents/fly-provider.ts getInstanceRuntimeStatus) fetches ``/api/status`` without a cookie as its sole liveness probe; the OAuth gate's 401 looked identical to 'agent dead' on the portal side. Fix: lift the allowlist into hermes_cli/dashboard_auth/public_paths.py and have both middlewares import it. _path_is_public now consults the shared frozenset first, then falls back to the gate's auth-bootstrap/static prefix list. Future additions to the public list hit both gates automatically. Endpoint inventory (verified safe to remain public): * /api/status — version, gateway state, active session count, auth-gate shape. Portal liveness probe target. * /api/config/defaults — config-defaults feed for the SPA's Config page * /api/config/schema — config schema for the SPA's Config page * /api/model/info — model catalogue metadata (context windows) * /api/dashboard/themes — theme manifests for the skin engine * /api/dashboard/plugins — plugin manifests for the dashboard No user data, no session content, no secrets. Same shape an external monitoring agent would hit on /healthz. Tests: * New: test_gated_status_is_public (regression guard with the NAS fly-provider.ts liveness-probe rationale spelled out in the docstring) * New: test_other_public_api_paths_are_public_under_gate (parametrised over the rest of PUBLIC_API_PATHS — proves 401 / 302-to-login is never the response) * New: docker integration check #3 in test_dashboard_oauth_gate_engaged_by_default — /api/status remains 200 under the gate AND reports auth_required=True so the portal can distinguish modes * Updated: test_full_login_round_trip_unlocks_gated_api now probes /api/sessions instead of /api/status (status is public, so it can no longer distinguish 'logged in' from 'gate accidentally disabled') * Updated: TestApi401Envelope (the no-cookie / invalid-cookie / dead-cookie tests) probes /api/sessions for the same reason * Updated: docker integration check #2 in test_dashboard_oauth_gate_engaged_by_default probes /api/sessions to prove the gate is intercepting * Removed: dead _login() helper in test_dashboard_auth_status_endpoint.py (no longer needed since /api/status is reachable cold) Companion to docs/handover/hermes-agent-dashboard-s6-insecure-fix.md (the --insecure flag fix that shipped earlier).	2026-05-29 12:17:12 +10:00
Teknium	3b6347af15	feat(kanban): default_assignee fallback + per-profile concurrency cap (#27145 , #21582 ) (#34244 ) Two related dispatcher behaviors that have been missing for a while. ## kanban.default_assignee (#27145) Reporter (@agarzon): dashboard creates a task without an assignee, task parks in 'ready' forever even though the operator's intent ('default') is perfectly clear. The dispatcher already had a 'skipped_unassigned' bucket but no fallback routing — users had to manually type 'default' in the assignee field every time. Behavior: when 'kanban.default_assignee' is set in config.yaml, the dispatcher applies that assignee to any unassigned ready task before deciding whether to spawn. The row is mutated (assignee column + an 'assigned' event with source='kanban.default_assignee' for the audit trail). Empty/whitespace config value = no fallback, preserving the existing skipped_unassigned behavior. Dry-run mode reports what WOULD happen via the new 'auto_assigned_default' bucket on DispatchResult, but does NOT mutate the DB — operators using 'hermes kanban dispatch --dry-run' see the routing decision before committing. ## kanban.max_in_progress_per_profile (#21582) Reporter (@edwardchenchen, @simlu, 4 reactions): fan-out workloads saturate one profile's local model / API quota / browser pool while other profiles sit idle. The existing global 'max_in_progress' caps total workers but doesn't balance across profiles. Behavior: when 'kanban.max_in_progress_per_profile' is set to a positive int, the dispatcher tracks per-assignee running counts (one query at tick start) and refuses to spawn for any assignee already at the cap. Tasks blocked this way go to a new 'skipped_per_profile_capped' bucket on DispatchResult as (task_id, assignee, current_running_count) tuples — NOT an operator-actionable failure, just 'try again next tick when the profile has capacity'. Pre-existing 'running' tasks count against the cap (verified via regression test). The cap respects dry_run mode by incrementing its in-memory counter on each would-be spawn so dry_run reports the same balanced subset that a real tick would. Invalid cap values (0, negative, non-int, None) are treated as 'no cap', preserving the existing behavior. Backward-compatible for installs that don't set the config. ## Surfaces - 'hermes kanban dispatch' CLI now prints 'Auto-assigned to kanban.default_assignee=X: ...' and 'Deferred (X at per-profile cap, N running): ...' lines, plus matching JSON keys in --json output. - Gateway dispatcher logs the configured values at startup ('default_assignee=X', 'max_in_progress_per_profile=N'). - 'kanban.max_in_progress_per_profile' added to DEFAULT_CONFIG with inline docs. ## Validation - tests/hermes_cli/test_kanban_default_assignee.py (6 cases): no-cap baseline, auto-assign + DB mutation, dry-run reports without mutating, whitespace treated as None, explicit assignees untouched, DispatchResult field schema. - tests/hermes_cli/test_kanban_per_profile_cap.py (9 cases including 4 parametrized): no-cap baseline, balanced 2-profile fan-out, pre-existing running counts against cap, invalid cap values (0/-1/'abc'/None), capped tasks dispatched on next tick after running task completes, DispatchResult field schema. - Broader kanban suite: 464/464 pass (was 449 baseline; +15 new regression tests across both features). ## Credit #27145 — Jimmy Johansson reported the dispatcher skipped-unassigned gap; @agarzon scoped the simpler 'honor kanban.default_assignee' fix that matches the existing config knob. #21582 — @edwardchenchen filed the per-profile cap ask after hitting model 429s on fan-out research projects; @simlu confirmed the same pain on local-model setups.	2026-05-28 19:02:55 -07:00
Teknium	769ee86cd2	feat(kanban): attach images referenced in task bodies to worker vision (#34210 ) Kanban workers now scan the task body for local image paths and http(s) image URLs and attach them to the worker's first user turn — matching the CLI/gateway behaviour for inbound images. Before, a user pasting `/home/me/screenshot.png` or `https://example.com/img.png` into a kanban task description had it sent to the model as plain text and the pixels were never seen. How it works: * agent/image_routing.py gains extract_image_refs(text) → (paths, urls) that mirrors gateway/platforms/base.py:extract_local_files (absolute / ~-relative paths, image extensions only, ignores fenced/inline code). * build_native_content_parts() accepts an optional image_urls= kwarg and emits passthrough image_url parts for remote URLs alongside the base64 data: URLs used for local paths. * cli.py (single-query/quiet branch — the path every dispatcher-spawned worker takes) detects HERMES_KANBAN_TASK, reads the task body via kanban_db.get_task, runs extract_image_refs, and threads the results into the existing image-routing decision (native vs text). Best-effort: enrichment failures never block worker startup. Tested: * tests/agent/test_image_routing.py — 22 new tests for extract_image_refs and URL pass-through in build_native_content_parts. * tests/hermes_cli/test_kanban_worker_image_extraction.py — 10 new tests driving real kanban_db round-trip (create task → read body → extract refs → build parts). * E2E: created a fake kanban task with a body referencing both a local PNG and an https URL; verified the worker pipeline produces a multimodal user turn with 1 text part + 2 image_url parts (data URL for the local file, passthrough URL for the remote).	2026-05-28 17:50:42 -07:00
teknium1	f30db14ced	fix(kanban): SIGTERM on worker must terminate the process (#28181 ) The single-query signal handler in cli.py raises KeyboardInterrupt on SIGTERM/SIGHUP. For interactive 'hermes chat -q' that unwinds the main thread cleanly. For kanban workers spawned by the dispatcher, the worker process is likely to have a non-daemon thread alive (terminal _wait_for_process, custom plugins, etc.). With KeyboardInterrupt only the main thread unwinds; the non-daemon thread keeps the process alive, the gateway has already restarted, and the dispatcher's _pid_alive check returns True forever — task stuck in 'running' indefinitely. When HERMES_KANBAN_TASK is set (dispatcher-spawned worker), flush logging + stdout/stderr, then os._exit(0) instead of raising KeyboardInterrupt. The kernel reclaims the PID immediately, and the existing zombie-state detection in _pid_alive flips the task to crashed on the next dispatcher tick. detect_crashed_workers then re-spawns it on the following tick — no manual recovery needed. A SIGALRM(2s) deadman is armed before the flush so a pathological blocking-I/O flush can't wedge the worker forever. In practice the reporter measured flush in <1ms; the alarm is a failsafe, never the common path. Interactive (non-kanban) chat -q is unchanged — the env-gated branch only fires for dispatcher-spawned workers. Live verification on this machine: - Without HERMES_KANBAN_TASK + non-daemon thread alive: process hangs alive 4+ seconds after SIGTERM. Dispatcher's _pid_alive returns True → task stuck. - With HERMES_KANBAN_TASK + same non-daemon thread: process exits in 0.10s via os._exit(0). Dispatcher reclaims on next tick. Tests: - tests/hermes_cli/test_signal_handler_kanban_worker.py (3 cases): end-to-end subprocess test with a non-daemon thread, HERMES_KANBAN_TASK env, SIGTERM, dispatcher-style _pid_alive check. Plus a source-level invariant test catching future refactors that drop the env-gated exit. - 452/452 kanban tests pass. Co-authored-by: andrewhosf <andrewho.sf@gmail.com>	2026-05-28 11:59:58 -07:00
Teknium	3a9bc9d88a	fix(model picker): unify /model and `hermes model` lists, add disk cache (#33867 ) * fix(model picker): unify /model and `hermes model` model lists, add disk cache The /model slash picker and `hermes model` were drifting apart. /model read the raw static `OPENROUTER_MODELS` list (31 entries, including 5 that fail at runtime — no tool-call support or absent from live catalog), while `hermes model` ran the same list through the live OpenRouter /v1/models tool-support filter and showed 26 valid entries. Same problem existed for every other authed provider: /model used curated static lists, `hermes model` used live /v1/models. Unifies both surfaces on `provider_model_ids()` and adds a generic disk-cached wrapper so the picker stays snappy. Changes - hermes_cli/models.py: new `cached_provider_model_ids()` — ~/.hermes/provider_models_cache.json, 1h TTL, per-provider entries keyed by credential fingerprint (env vars + OAuth file mtimes). Stale-data-beats-no-data on transient failures. Pair with `clear_provider_models_cache(provider=None)`. - hermes_cli/models.py: `provider_model_ids("nous")` now falls back to the docs-hosted manifest (not the in-repo snapshot) when the live Portal /models call fails — preserves the model_catalog regression guarantee while still going through the unified pathway. - hermes_cli/model_switch.py: `list_authenticated_providers` routes sections 1, 2, and 2b through `cached_provider_model_ids(slug)` with curated fallback when the live fetcher comes up empty. - hermes_cli/model_switch.py: `parse_model_flags` extended to a 4-tuple, parses `--refresh`. - cli.py / gateway/run.py / tui_gateway/server.py: updated unpacking; CLI + gateway wire `--refresh` to `clear_provider_models_cache()`. - hermes_cli/main.py: `hermes model --refresh` argparse flag. - hermes_cli/commands.py: `/model` args_hint advertises `--refresh`. - tests/hermes_cli/test_inventory.py: refresh stale comment. Live PTY parity verification - /model → OpenRouter row: `(26 models)` (was 31, with broken entries) - `hermes model` → OpenRouter: 26 models (unchanged) - The 5 dropped entries: `pareto-code` (no tool-call support), `gemini-3-pro-image-preview` (no tool-call support), `elephant-alpha`, `hy3-preview:free`, `ring-2.6-1t:free` (gone from OpenRouter's live catalog). Live PTY timing - First /model open, empty cache: 4624 ms (full network round trip across every authed provider) - Second /model open, warm cache: 51 ms (90× faster) - `/model --refresh` clears the disk cache and re-fetches. Cache schema (~/.hermes/provider_models_cache.json, ~3 KB): { "anthropic": {"fp": "<sha256:16>", "at": 1748..., "models": [...]}, ... } Targeted tests: tests/hermes_cli/ + gateway model tests + tui_gateway — 5855/5855 pass. * fix(model picker): use blake2b for cache fingerprint to silence CodeQL py/weak-sensitive-data-hashing flagged the sha256 call in _credential_fingerprint() as a high-severity alert because the input includes env var values whose names contain _API_KEY / _TOKEN. The hash is used solely as a cache-bust identity — never reversed, never stored, collisions are harmless (worst case: cache miss → live re-fetch). blake2b serves the same purpose and isn't flagged by this rule. Functional behavior identical: 16-hex-char digest, cache hit/miss logic unchanged. Live re-verified — 26 OpenRouter models, warm-cache 78ms.	2026-05-28 11:33:16 -07:00
kshitij	a82c88bac0	fix(xai-oauth): accept bare-code manual paste (state=None) (#26923 ) (#33880 ) xAI's consent page renders the authorization code in-page rather than redirecting through the 127.0.0.1 callback, so on remote/headless setups (GCP Cloud Shell, Codespaces, container consoles, headless VPS) the only value the user can paste is the opaque code with no `code=`/`state=` query parameters. `_parse_pasted_callback` correctly returns `state=None` for that input, but `_xai_oauth_loopback_login` then validated state unconditionally and raised `xai_state_mismatch`, making the documented bare-code paste path unreachable. PKCE (code_verifier) still binds the token exchange to this client, so the local state-equality check is redundant when there is no state to compare. On the manual-paste path only, substitute the locally generated state when the callback returned none — the rest of the validation chain (code presence, error field, token exchange) is unchanged. The loopback HTTP-server path still requires a matching state (a real browser redirect always carries one). Also: clarify the manual-paste prompt to mention xAI's in-page code rendering so users know pasting the bare code on its own is expected. Root-cause analysis from #26923 comment by @AccursedGalaxy (2026-05-20). Tests ----- * test_xai_loopback_login_manual_paste_bare_code_succeeds — positive end-to-end through the token exchange with state=None. * test_xai_loopback_login_loopback_path_rejects_missing_state — the HTTP-server path still rejects state=None as a regression guard (the bare-code relaxation must NOT widen the loopback path). * Existing test_xai_loopback_login_manual_paste_state_mismatch_raises continues to verify wrong (non-None) state is rejected on manual-paste. Closes #26923.	2026-05-28 05:47:30 -07:00
Teknium	e0572a6def	fix(skills-hub): stop ellipsis-truncating the Identifier column (#33810 ) `hermes skills search` rendered the Identifier column with the default overflow behaviour, so long slugs (notably browse-sh — every browse-sh skill ends in a `-XXXXXX` hash that's part of the identifier) were cut to `browse-sh/weathe…`. Users copied the visible string into `hermes skills install` and got a not-found error because the hash was gone. Set overflow="fold" on the Identifier column in both search tables (`do_search` and the `_resolve_short_name` multi-match table) so long slugs wrap onto a second line instead of getting eaten. Also add a `--json` flag to `hermes skills search` (and the `/skills search` slash variant) for scripting — emits a list of {name, identifier, source, trust_level, description} objects with the full identifier, which is the right shape for copy-paste pipelines too. Closes #33674.	2026-05-28 04:53:13 -07:00
teknium1	6f9182cb34	fix(kanban): content-addressed corrupt-DB backup filename Repeated quarantines of an unchanged corrupt kanban.db used to amplify disk usage by N: the gateway dispatcher's 5-minute retry loop, multi- profile fleets sharing one DB, and manual reopen attempts each produced a fresh '.corrupt.<timestamp>.bak' copy of the same bytes. After 10 retries on a 100KB DB you had 11x the disk footprint of duplicate corrupt data. Derive the backup filename from a sha256 of the main DB instead of a timestamp + collision counter. Same bytes → same filename → skip the copy on retries. Different bytes (partial repair, further damage) → different filename → preserve separately. Sidecar (-wal/-shm) backups inherit the same content-addressed name. Inspired by @hanzckernel's PR #33529, simplified down to ~30 LOC: drop the persistent JSON marker file, drop the atomic temp+fsync+rename helper (shutil.copy2 is fine for a quarantine-only path), drop the gateway-side WAL/SHM fingerprint extension (the existing (path, mtime, size) tuple still gives the 5-minute retry semantics it needs), and drop the gateway-side helper extraction. The backup file existing IS the marker; no separate state needed. Test: tests/hermes_cli/test_kanban_db.py::test_repeated_corrupt_open_reuses_single_backup proves 10 retries on the same corrupt bytes produce 1 backup (was 11), and mutating the corrupt bytes produces a second backup with a different fingerprint. Refs #33529 Co-authored-by: hanzckernel <zhicheng.han@mathematik.uni-goettingen.de>	2026-05-28 03:38:09 -07:00
Teknium	432a691758	fix(update): stream + idle-kill `npm run build` so a stalled webui-build can't soft-brick the install (#33803 ) `hermes update` ran the webui build with `capture_output=True` and no timeout. On low-memory hosts (WSL2's 4 GB default, small VPSes, antivirus stalls) Vite goes silent for minutes; users see a frozen terminal, decide the update is hung, and reboot. The reboot lands after `pip install -e .` has already touched the install but before the build completes, leaving the `hermes` launcher in place while `hermes_cli` is no longer importable — i.e. `ModuleNotFoundError: No module named 'hermes_cli'` (#33788, same class as #32384). Changes: - New `_run_with_idle_timeout()` helper: streams subprocess output line-by-line (so the user sees Vite progress in real time) and kills the process if no bytes appear on stdout/stderr for 180s. The existing stale-dist fallback (#23817) then serves the previous build instead of failing the update. - `_build_web_ui()` uses the helper for `npm run build` (the actual stall site). `npm install` keeps `subprocess.run` + capture_output to preserve the existing EPERM-retry-on-Windows contract. - Both `cmd_update` call sites print `→ Core update complete. Building dashboard (optional)...` before the webui build. The CLI is fully functional at this point; a webui-build failure only affects `hermes dashboard`. Telegraphing the boundary explicitly stops users from rebooting through the build step. Tests: - `tests/hermes_cli/test_run_with_idle_timeout.py` — 4 tests covering streaming success, nonzero exit, idle-kill, and missing-binary cases. Uses real `subprocess.Popen` on tiny Python scripts; isolated in its own file so per-file canonical-runner parallelism doesn't pair it with the mock-heavy tests. - `tests/hermes_cli/test_web_ui_build.py` — updated existing tests to patch `_run_with_idle_timeout` for the build step in addition to `subprocess.run` for the install step. - `tests/hermes_cli/test_cmd_update.py::test_update_refreshes_repo_and_tui_node_dependencies` — same update. Full suite: `scripts/run_tests.sh tests/hermes_cli/` → 5646 passed, 0 failed. Fixes #33788.	2026-05-28 03:34:47 -07:00
Teknium	10ee4a729b	fix(gateway): drain on Windows `hermes gateway stop` so sessions survive restart (#33798 ) Sessions now survive `hermes gateway stop` / `restart` on native Windows. Previously the gateway died on schtasks `/End` + os.kill SIGTERM without ever running the drain loop, so the v0.13.0 session-resume feature (#21192) silently broke on Windows: `resume_pending=True` was never written, and the next boot started with a blank conversation history (issue #33778). Root cause is twofold and the reporter only identified half of it: 1. `hermes_cli/gateway_windows.py::stop()` did not write the `planned_stop_marker` before signalling. The reporter caught this. 2. The bigger reason: `asyncio.add_signal_handler` raises NotImplementedError for SIGTERM/SIGINT on Windows, so even if the marker had been written, the gateway's existing SIGTERM handler (which is what calls `runner.stop()` and the `mark_resume_pending` loop) was never invoked. Writing the marker would have been necessary-but-insufficient. The fix has two parts: * gateway/run.py: new `_run_planned_stop_watcher` daemon thread polls for the planned-stop marker file every 0.5s. When the marker appears it `loop.call_soon_threadsafe(shutdown_signal_handler, None)` — the same shutdown path a real SIGTERM would have driven, including the pre-drain `mark_resume_pending` writes (run.py:5977) and graceful drain wait. The existing signal handler already accepts `received_signal=None` and falls through to `consume_planned_stop_marker_for_self()`, so no handler changes needed. Runs on every platform as cheap belt-and-suspenders. * hermes_cli/gateway_windows.py: `stop()` now writes the marker for the running gateway PID and waits up to `agent.restart_drain_timeout` (default 30s) for the PID to exit cleanly. On clean drain, the kill sweep is non-forceful; on timeout, escalates to `kill_gateway_processes(force=True)` which routes to taskkill /T /F per `references/windows-native-support.md`. Validation: * 7 new tests in tests/gateway/test_planned_stop_watcher.py covering: marker→handler dispatch, no-marker idle, already-draining skip, not-yet-running skip, stop_event responsiveness, fire-once semantics, error tolerance. * 8 new tests in tests/hermes_cli/test_gateway_windows.py covering: marker-before-kill ordering, clean-drain skips force-kill, drain-timeout escalates to force=True, no-pid-skips-drain, invalid-pid handling, fast-exit success, timeout failure, marker-write-failure tolerance. * E2E (Linux, detached orphan): write_planned_stop_marker(pid) + `_drain_gateway_pid(pid, 5.0)` returns True in 0.5s after the victim sees the marker and exits. Tested with a double-forked subprocess so the test parent isn't holding it as a zombie. * Targeted: tests/gateway/{restart_drain,restart_resume_pending, signal,signal_format,status,shutdown_forensics,approve_deny_commands, planned_stop_watcher} + tests/hermes_cli/{gateway_windows, gateway_service} → 519/519. What was wrong with the reporter's claim (for future archaeology): they described the symptom as "no `resume_pending=True` written to `sessions.json`" — but Hermes uses `state.db` (SQLite), not `sessions.json`, and `mark_resume_pending` is called regardless of the marker (the marker only affects exit code 0 vs 1 for systemd revival semantics). The real session-loss path is the missing drain on Windows, not a missing marker. Both halves are fixed here. Closes #33778.	2026-05-28 03:25:32 -07:00
teknium1	c7f7783e5c	test(xai-proxy): regression coverage for #28932 429 handling Three new tests in tests/hermes_cli/test_proxy.py: - xai_adapter_retry_rotates_pool_entry_on_429 — headline #28932 case. Two-entry pool, 429 on first entry, must rotate to second entry AND must NOT call refresh_xai_oauth_pure (refresh is irrelevant for rate limits). - xai_adapter_retry_returns_none_on_429_when_pool_exhausted — single-entry pool: 429 returns None so the rate-limit response flows back to the client unchanged (existing behavior preserved). - xai_adapter_retry_returns_none_for_unrelated_status — non-{401, 429} statuses must not trigger any retry path at all; guards against the gate becoming too broad in future changes. Each test asserts that refresh_xai_oauth_pure is never called on the 429 path — refresh is a 401-specific concern. 39/39 in tests/hermes_cli/test_proxy.py.	2026-05-28 02:36:37 -07:00
Dusk1e	aa3466063b	fix(android): reject unsafe tar members in psutil compatibility installer	2026-05-28 02:36:09 -07:00
Teknium	09a5cd8084	fix(auth): sync manual:device_code Codex pool entries on re-auth (#33744 ) #33164 made _save_codex_tokens sync the singleton-seeded `device_code` pool entry on Codex OAuth re-auth. That fixed the #33000 path but missed `manual:device_code` entries created by `hermes auth add openai-codex` (the recommended workaround for users who hit #33000 before #33164 landed). Every subsequent re-auth would refresh the device_code entry but leave the manual:device_code entry holding the consumed refresh token plus stale last_error_* markers — immediately recreating the 401 token_invalidated symptom on the next request, exactly as reported in #33538. Extend the refreshable source set to include `manual:device_code`. Completing the device-code OAuth flow proves the user owns the ChatGPT account, so it is safe to refresh every device-code-backed entry. Keep `manual:api_key` and other non-device-code manual sources untouched — those represent independent credentials. Closes #33538.	2026-05-28 01:33:10 -07:00
Stephen Schoettler	8595281f3c	fix: expose context engine tools with saved toolsets	2026-05-28 00:28:42 -07:00
LeonSGP43	442a9203c0	Fix xAI OAuth timeout manual fallback	2026-05-28 00:24:17 -07:00
Robin Fernandes	1cf5e639b3	fix(auth): refresh Nous entitlement in tool menus	2026-05-28 00:19:31 -07:00
Robin Fernandes	406901b27d	feat(auth) normalise the way in which we check whether a user has free/paid access to nous portal so we can expose behaviour and error messages accordingly.	2026-05-28 00:19:31 -07:00
Squiddy	3ba8962738	fix(kanban): add Windows init lock guard	2026-05-27 23:28:51 -07:00
Squiddy	90b6b3d18f	fix(kanban): harden sqlite connection concurrency	2026-05-27 23:28:51 -07:00
Ben	875d930ac7	test(docker-update): stub subprocess.run in git-install regression guard The regression-guard test `test_cmd_update_on_git_install_does_not_print_docker_message` mocked `is_managed` and `detect_install_method` but not `subprocess.run`, so once `cmd_update(check=True)` decided this was a git install it shelled out to a real `git fetch upstream` / `git fetch origin`. On CI runners the worktree has no `upstream` remote configured and the fetch hung past the 30s pytest-timeout — test (4) slice failed in #33659 CI. Fix: stub `subprocess.run` with a successful CompletedProcess-shaped object whose stdout is `"0\n"`, so: - no real git command is ever invoked - the rev-list parsing later in the flow (`int(stdout.strip())`) succeeds rather than `ValueError`-ing through the test's SystemExit catch - the flow proceeds far enough to confirm the docker banner is absent (the actual assertion) Also broaden the except clause to `(SystemExit, Exception)`: the only assertion in this test is the negative-banner check on captured stdout; any further failure in the rest of the update flow is irrelevant to that contract. Verified locally: all 7 tests in `tests/hermes_cli/test_cmd_update_docker.py` pass in 0.39s (previously the regression-guard test alone consumed 30s+ and got SIGTERM'd).	2026-05-28 15:50:25 +10:00
Ben	b924b22a9d	fix(docker): `hermes update` prints `docker pull` guidance instead of bogus git error Inside the published Docker image, `hermes update` was hitting the ".git missing → reinstall via curl" fallback: ✗ Not a git repository. Please reinstall: curl -fsSL https://raw.githubusercontent.com/.../install.sh \| bash That message is wrong on two counts: 1. It tells the user to run the host-side installer, which would install a new Hermes on the host — not update the running container. 2. It doesn't mention `docker pull` at all, leaving Docker users to figure out the right action from scratch. `hermes update --check` was worse: it bailed with "Not a git repository — cannot check for updates." and nothing else. Fix: detect the Docker install method (already stamped by `docker/stage2-hook.sh` and surfaced by `detect_install_method()`) in both update entry points and print a long-form message that covers: - The right command: `docker pull nousresearch/hermes-agent:latest` - Restart guidance (`docker compose up -d --force-recreate` / re-run `docker run`) - How to verify the new version after restart - Tag-pinning caveat (`:latest` doesn't move a pinned tag) - Config persistence across upgrades (state under `HERMES_HOME` / `/opt/data` is bind-mounted and survives) - Fork escape hatch (build your own image with the repo's Dockerfile) Exit code is 1 (matches `managed_error` semantic for "tried to update but can't update this way"). Plumbing: - hermes_cli/config.py: new `format_docker_update_message()` helper sits next to the existing `_NIX_UPDATE_MSG` / `format_managed_message()` family so the wording lives in one place and both call sites (apply path + check path) consume it. - hermes_cli/main.py: * `cmd_update()`: bail right after the `is_managed()` gate, before any of the apply-path branches. * `_cmd_update_check()`: bail at the top of the function, before the existing `method == "pip"` branch. Neither path touches subprocess.run / git when method == "docker". Coverage: - 7 new tests in `tests/hermes_cli/test_cmd_update_docker.py`: * `hermes update` in Docker → message + exit 1, no git calls * `hermes update --check` (via cmd_update) → same * `--yes` / `--force` don't bypass (intentional) * `_cmd_update_check` called directly → bails too * git/pip installs still take their normal paths (regression guards) * `format_docker_update_message` content-lock test pinning the five user-actionable bits the message must contain - Existing test_cmd_update.py (21 tests) + test_managed_installs.py (5 tests) still pass — no regression on the source-install path. - Verified end-to-end in a real container: `docker run ... update` and `docker run ... update --check` both render the message and exit 1.	2026-05-28 15:50:25 +10:00
Ben	66489f38c7	fix(docker): bake build-time git SHA into the image `hermes dump` and the startup banner both call `git rev-parse HEAD` to report the running commit, but `.dockerignore` line 2 excludes `.git` — so inside the published image `hermes dump` shows `version: ... [(unknown)]` and the banner drops its `· upstream <sha>` suffix entirely. That makes support triage from container bug reports impossible: we can't tell which commit the user is actually running. Fix: thread the build-time SHA through as a Docker build-arg, write it to `/opt/hermes/.hermes_build_sha` in the image, and have a new `hermes_cli/build_info.get_build_sha()` read it as a fallback after the existing live-git lookup fails. Output format is unchanged in both callsites — same 8-char short SHA whether resolved live or baked. Wiring: - Dockerfile: `ARG HERMES_GIT_SHA=` + write-file step after the source copy. Empty/missing arg → no file written → callers fall through to live git (so local `docker build` without --build-arg is unchanged). - docker-publish.yml: passes `HERMES_GIT_SHA=${{ github.sha }}` on all four build-push-action steps (amd64/arm64, smoke-test + final push). - dump.py:_get_git_commit() / banner.py:get_git_banner_state(): try live git first, fall back to baked SHA, then to legacy `(unknown)` / None. Banner returns `upstream == local, ahead=0` because a built image is by definition pinned to one commit. Coverage: - Unit tests cover build_info (file present/absent/empty/error, truncation, whitespace), dump (live-git wins, both fallbacks, identical output-format regression guard), and banner (no-repo + baked, no-repo + no-sha, shallow-clone fallback). - tests/docker/test_dump_build_sha.py is an integration regression guard that runs against the real image, reads `/opt/hermes/.hermes_build_sha`, and asserts `hermes dump` surfaces its content (or stays at `(unknown)` if no file). - Verified end-to-end: `docker build --build-arg HERMES_GIT_SHA=abc...` → `docker run ... dump` reports `[abc12345]`; without the build-arg it reports `[(unknown)]` as before.	2026-05-28 15:14:05 +10:00
teknium1	ebe04c66cd	fix(kanban): close kanban.db FD after every connect() in long-lived processes `sqlite3.Connection.__exit__` commits/rollbacks but does NOT close the underlying FD. `with kb.connect() as conn:` in long-lived processes (gateway `run_slash`, dashboard `decompose_task_endpoint`) therefore leaks one FD to `kanban.db` per call. After enough operations the gateway dies with `[Errno 24] Too many open files` (~4 days uptime in the production report — #33159). Fix: add a `connect_closing()` context manager in `hermes_cli/kanban_db` that wraps `connect()` with a real `try/finally: conn.close()`. Switch the 42 leak-prone call sites in `hermes_cli/kanban.py` (35), `hermes_cli/kanban_decompose.py` (4), and `hermes_cli/kanban_specify.py` (3) over to it. `kanban.py` matters because `run_slash` (called from the gateway for every `/kanban` slash command) parses argparse and dispatches to those `_cmd_*` functions in-process — each one was leaking one FD per invocation. Tests inside `tests/` are untouched: short-lived processes where OS cleanup masks the leak. Regression tests added in `test_kanban_db.py` cover both happy-path and exception-path closure, plus an explicit assertion that bare `with kb.connect()` still does NOT close (documenting the upstream sqlite3 behaviour we're working around). Closes #33159.	2026-05-27 22:07:49 -07:00
Dusk	c341a2d107	fix(docker): align HOME for dashboard and s6 gateway services (#33481 )	2026-05-28 13:42:27 +10:00
Ben Barclay	b345323195	fix(docker): tee supervised gateway stdout to docker logs Follow-up to #33583 (the gateway-run-supervised redirect). Before this fix, the supervised gateway's stdout (most visibly the "Hermes Gateway Starting…" rich-console banner) was swallowed by `s6-log` into the rotated file at `${HERMES_HOME}/logs/gateways/<profile>/current` and never reached `docker logs`. Operational signal lived in two places: * docker logs — saw stderr (Python `logging` defaults to stderr), so warnings/errors were visible. * the rotated file — saw stdout (rich banners, `print()` output, third-party libs that wrote to fd 1). This was surprising for users coming from the pre-s6 image, where `docker run … gateway run` produced a single unified stream in `docker logs`. They'd see partial output, conclude something was broken, and dig around for the missing pieces. Fix: add the `1` s6-log action directive before the file destination so each line is forwarded to s6-log's stdout — which propagates up the s6-supervise pipeline to /init's stdout = container stdout = `docker logs`. The file destination is preserved as a second destination, so the rotated log (with ISO 8601 timestamps) still exists for `hermes logs` and for survival across container restarts. Trade-off considered: timestamps. Putting `T` between `1` and the file destination (not before `1`) means: * docker logs sees raw lines — Python's logging formatter has its own timestamps, and `docker logs --timestamps` adds another layer when desired. No double-stamping in the common reading path. * The persisted file gets s6-log's ISO 8601 timestamp so even output that lacked a Python-logger timestamp (rich banners, third-party raw prints) is correlatable in `current`. Verification: * New unit-test assertion in `test_service_manager.py` locks the `s6-log 1` directive into the rendered run-script. Mutation- tested by reverting to the pre-fix script (no `1`); the assert catches it cleanly. * New docker-harness test `test_supervised_gateway_stdout_reaches_docker_logs` builds the image, runs `docker run … gateway run`, and asserts the unique `⚕` banner glyph reaches `docker logs`. Also verifies the rotated file still contains the banner (no regression on the existing file destination). Mutation-tested end-to-end: built a deliberately-broken image without the `1` directive and the test failed exactly as designed, citing the banner present in `current` but absent from `docker logs`. * `website/docs/user-guide/docker.md` gains a new `:::note Where gateway logs go` admonition documenting both destinations and the audit-log file at `${HERMES_HOME}/logs/container-boot.log`. Existing functionality preserved: every other docker-harness test still passes against the new image. Unit-test sweep across `tests/hermes_cli/` (5561 tests) is green.	2026-05-28 13:18:41 +10:00
brooklyn!	912e6e2274	fix(tui): suppress mouse-residue leaks during Python launcher startup (#31213 ) * fix(tui): suppress mouse-residue leaks during Python launcher startup `hermes --tui …` spends ~100–300ms inside the Python launcher (lazy imports, arg parsing, session resolution) before exec'ing the Node TUI binary. During that window stdin is still in cooked + echo mode. If a prior session left DEC mouse tracking asserted (or the user spammed mouse movement while the previous session was opening), the terminal keeps emitting `\\x1b[<…M` SGR motion reports that get echoed straight back into the user's shell scrollback as literal `^[[<…M` text and sit there above the TUI banner until the next clear. The Node side already calls `resetTerminalModes()` in `entry.tsx`, but by then the race is already lost — the bytes echoed during the Python warmup window were committed to the scrollback before Node started. Fix: write the mouse-tracking disable sequence at the very top of `hermes_cli.main`, before every heavy import. The terminal stops emitting motion events as soon as the bytes hit the wire (one TTY round-trip), shrinking the race window from hundreds of milliseconds to a few. `HERMES_TUI_NO_EARLY_DISABLE=1` opts out for diagnostics. * test(tui): drop dead _reload_main, hoist import out of patch context Addresses Copilot review on PR #31213. The tests used to import `hermes_cli.main` inside the `patch("os.write")` context, which Copilot pointed out is order-dependent: if the module is already loaded (e.g. imported by a prior test in the same process), the import is a no-op and the patch only sees the explicit `_suppress_mouse_residue_early()` call. Either way the assertion can flake when run alongside other tests. Move the import to module scope — every subprocess gets a fresh `hermes_cli.main`, whose module-level invocation is a no-op under pytest argv. Tests then exercise `_suppress_mouse_residue_early()` directly inside their own patch context. Also drop the unused `_reload_main` helper. * fix(tui): skip early mouse-disable when stdout is not a TTY Addresses Copilot review on PR #31213. `hermes --tui … >log` or CI capture pipes fd 1 away from the terminal. The disable bytes can't reach the terminal in that case but would still get written into the log file as raw CSI sequences. Guard with `os.isatty(1)` inside the existing `try/except OSError` block so the 'never break startup' contract holds. * docs(tui): rephrase 'raw cooked mode' as 'cooked + echo mode' Copilot review nit on PR #31213 — the original wording was self- contradictory. Pre-TUI stdin state is cooked + echo (kernel TTY discipline still owns the line buffer and echoes input back). The TUI switches it to raw mode later when Ink mounts.	2026-05-27 22:03:45 -05:00
Ben Barclay	0927fb5584	feat(docker): auto-redirect `gateway run` to supervised mode inside s6 image Pre-s6, `docker run nousresearch/hermes-agent gateway run` was the standard invocation: gateway ran as the container's main process, tini reaped zombies, container exit code matched gateway exit code, no supervision. With s6-overlay as PID 1, the same invocation now auto-upgrades to supervised semantics — auto-restart on crash, dashboard supervised alongside (when HERMES_DASHBOARD=1 is set), multiple profile gateways under the same /init. Users get the new behavior with zero changes to their docker run command. A loud one-line breadcrumb on stderr explains the upgrade and points at the opt-out for users who genuinely want pre-s6 foreground semantics. How it works: 1. `_gateway_command_inner` (the `gateway run` handler) checks if we're inside a container with s6 as PID 1. 2. If yes, dispatches `start` to the s6 service manager (registers and starts gateway-default), then `exec sleep infinity` to keep the CMD process alive without binding container lifetime to gateway PID lifetime. The supervised gateway can flap freely; `docker stop` still tears everything down via /init stage 3. 3. If no, falls through to the existing foreground code path unchanged. Host runs of `hermes gateway run` are unaffected. Three gates make the redirect inert outside the intended scope: * `detect_service_manager() != "s6"` — host/non-s6-container runs. * `HERMES_S6_SUPERVISED_CHILD=1` env var (recursion guard) — exported by `S6ServiceManager._render_run_script` for the s6-supervised invocation itself. Without this guard, the supervised `gateway run --replace` would re-enter the redirect and recurse (run → start → run → start → ...) infinitely. * `--no-supervise` CLI flag OR `HERMES_GATEWAY_NO_SUPERVISE=1` env var — explicit user opt-out for CI smoke tests, debugging the foreground startup path, or any case wanting "CMD exit = container exit" semantics. Strict truthiness (1/true/yes, case-insensitive); typos like `=0` do NOT silently opt out. Tests: * Unit tests in tests/hermes_cli/test_gateway_s6_dispatch.py cover all five paths (host no-op, supervised fire, sentinel recursion guard, CLI flag, env var truthy + falsy). The two load-bearing gates (sentinel + opt-out) were mutation-tested by removing each gate in isolation and confirming the dedicated test fails with the expected error. * Docker harness tests in tests/docker/test_gateway_run_supervised.py cover the round trips end-to-end against a built image: redirect fires (sleep-infinity heartbeat + supervised gateway-default slot + breadcrumb), --no-supervise opt-out (foreground gateway, no want-up on the slot), HERMES_GATEWAY_NO_SUPERVISE env var works identically, recursion is impossible (≤1 supervised python gateway-run + exactly 1 sleep-infinity parented to the CMD wrapper), and HERMES_DASHBOARD=1 produces both supervised gateway and supervised dashboard. Docs: * Added a `:::tip Gateway runs supervised` admonition near the main docker.md example explaining the upgrade and pointing at the opt-out. Pre-s6 (tini-based) images still run gateway run as the foreground main process, so the note is scoped to the s6 image only. Trade-off documented in the helper docstring: container exit code under the redirect is sleep's exit code (always 0 on SIGTERM), not the gateway's. That was an explicit design call — the supervised gateway is allowed to flap without taking the container with it, which is what "supervision" means. CI users who want exit-code forwarding can pass --no-supervise.	2026-05-28 12:42:13 +10:00
kshitijk4poor	2d5dcfabc3	test(kanban): update dispatcher tick counter for hoisted zombie reaper The reaper hoist in the prior commit adds an extra `asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every dispatcher tick (before the per-board work). The existing `test_gateway_dispatcher_disables_corrupt_board_without_traceback` mocks `to_thread` with a 4-call cap that previously matched 2 full dispatch ticks. With the reaper hoist each tick is now 3 `to_thread` calls instead of 2, so the cap is raised to 6 to preserve the same number of dispatch ticks. The `connect == 5` assertion is unchanged. Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP alongside `steve@steveonjava.com` so contributor-audit passes for both identities used across the salvaged commits. Salvage follow-up for PR #32857.	2026-05-27 14:31:55 -07:00
Stephen Chin	ffdc937c18	fix(kanban): hoist zombie reaper out of dispatch_once Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows. Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix. Squashes three sibling commits from PR #32301 into one logical change for batch review.	2026-05-27 14:31:55 -07:00
steveonjava	99c19eb2fe	fix(kanban): add post-commit page_count invariant check to write_txn Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint) Refs: #30445, #30896, #30908 (corruption reports)	2026-05-27 14:31:55 -07:00
Stephen Chin	c002668ff0	fix(kanban): add grace period to detect_crashed_workers `detect_crashed_workers` calls `_pid_alive` on every `running` task whose claim is held by this host. The check can transiently return False for a freshly-spawned worker (fork → /proc-visibility lag, or reap-race between SIGCHLD and parent reaping). When a second dispatcher ticks inside that window it reclaims the task and spawns a duplicate worker. Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an `HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override. `detect_crashed_workers` skips the liveness check when `time.time() - started_at < grace`. The existing 15-minute claim TTL still reclaims genuinely-crashed workers; grace only suppresses the launch-window false positive. `HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home` fixture in `test_kanban_core_functionality.py` so existing tests that assert immediate reclaim retain pre-fix semantics. Companion to merged PR #23442 (`release_stale_claims`, closes #23025), which addressed the same multi-dispatcher race in the stale-claim path. Related: #20015 (`_pid_alive` false-negative behaviour),	2026-05-27 14:31:55 -07:00
Stephen Chin	e83252dc46	fix(kanban): preserve original exception when write_txn rollback fails When code inside a write_txn block raises an OperationalError that SQLite has already auto-rolled-back (typical for disk I/O error, database is locked, and database disk image is malformed), the explicit ROLLBACK in write_txn.__exit__ itself raises cannot rollback - no transaction is active and the secondary exception replaces the original in the traceback. Operators see a misleading error and lose the diagnostic information they need. Swallow the rollback-time OperationalError so the caller always sees the original cause. Confirmed reproducer: tests/hermes_cli/test_kanban_db.py:: test_write_txn_preserves_original_exception_when_rollback_fails	2026-05-27 14:31:55 -07:00
Stephen Chin	5c49cd0ed0	fix(state): never silently downgrade WAL to DELETE on transient EIO apply_wal_with_fallback() treated "disk i/o error" as a permanent WAL-incompatibility marker, identical to "locking protocol" (NFS) and "not authorized" (FUSE). But EIO during PRAGMA journal_mode=WAL is typically TRANSIENT — page-cache pressure, brief lock contention, recoverable storage hiccups — not a permanent filesystem property. Treating transient EIO as a permanent downgrade signal produces the mixed-journal-mode-across-processes corruption pattern: 1. Process A opens kanban.db, hits transient EIO on the WAL pragma, silently downgrades to journal_mode=DELETE. 2. Process B (no EIO) opens the same file moments later and successfully sets journal_mode=WAL. 3. A writes rollback-journal frames while B writes WAL frames. SQLite documents this as unsupported and corrupts the file: https://www.sqlite.org/wal.html ("all connections to the same database must use the same locking protocol"). This was the root cause of repeated kanban.db corruption on hosts with multiple gateway processes plus CLI invocations against the same DB (observed pattern: corruption shortly after gateway startup, after the process logged "WAL journal_mode unsupported on this filesystem (disk I/O error) — falling back to journal_mode=DELETE"). The fallback warning told the truth — fallback DID happen — but the premise ("unsupported on this filesystem") was wrong; the EIO was a one-shot event and sibling processes successfully used WAL. Fix has two layers: 1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. EIO now re-raises so callers can retry instead of silently corrupting the DB. The two remaining markers ("locking protocol", "not authorized") are deterministic per filesystem so they remain safe permanent-downgrade signals. 2. Belt-and-suspenders: before downgrading on ANY marker match, peek the on-disk journal mode. If the header says WAL, refuse to downgrade and re-raise the original error. This guards against any future addition to _WAL_INCOMPAT_MARKERS turning out to be transient in some environment we haven't yet seen. Tests: - tests/test_hermes_state_wal_fallback.py: * Flipped test_falls_back_on_disk_io_error → test_reraises_on_disk_io_error asserting EIO is re-raised, not silently swallowed. * Added test_does_not_downgrade_when_disk_says_wal covering the on-disk-header safety guard for the existing legitimate markers. - tests/hermes_cli/test_kanban_db.py: * test_connect_falls_back_to_delete_on_locking_protocol now uses a truly-fresh DB (instead of the kanban_home fixture which pre-inits in WAL). On NFS the very first process touching the file legitimately downgrades; on a file already in WAL the new guard correctly refuses. A standalone reproducer lives at /tmp/kanban-stress/repro_bugD_eio_wal_downgrade.py (not committed): without fix the DB silently flips from WAL to DELETE mid-process; with fix the EIO surfaces and the file stays WAL. Refs: Bug D in the kanban-corruption investigation series (Bugs A and C shipped in ebe7374f3 and e02147d5e respectively). Bug D explains every corruption incident this week including those that survived A's single-dispatcher mitigation, because every CLI invocation is a separate process whose WAL pragma can transiently fail.	2026-05-27 14:31:55 -07:00
Stephen Chin	6416dd5187	fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL) Production corruption #6 left b-tree pages with zeroed headers but intact old cell content — the Bug E pattern. This fix applies three pragma calls on every connect(): - synchronous=FULL (was NORMAL): closes the WAL-checkpoint reordering window where a crash between WAL commit and main-DB write leaves a partially-written b-tree page header. Cost is <1ms per commit on local SSD; negligible at kanban write volume. - secure_delete=ON: forces SQLite to zero freed page bytes on disk. If a torn write or hardware fault later corrupts a page, the underlying cell content is zero, so corruption is detectable and no stale rows can resurface as live data. - cell_size_check=ON: adds a read-side guard so corrupt cells surface as errors at read time rather than as silent wrong-data returns. All three are connection-scoped and re-applied on every connect(). secure_delete also writes a persistent flag into the DB header on the first call against a fresh DB, making the protection durable across processes for new DBs. Tests added for all four required cases: each pragma active on a fresh connection, and all three re-applied after close+reopen. Also adds the required negative test (migration path does not reset pragmas).	2026-05-27 14:31:55 -07:00
wysie	a38e283395	fix: preserve nested official skill install paths	2026-05-27 13:39:58 -07:00
kshitijk4poor	53bdef5775	test(cli): regression test for hermes update fork upstream sync (#26172 ) Asserts that when hermes update runs on a fork whose local HEAD matches origin/main but commit_count == 0, the early-return path still consults _sync_with_upstream_if_needed() before printing "Already up to date!". Locks in the fix from the parent commit so the upstream-sync call cannot silently regress out of the commit_count == 0 branch.	2026-05-27 13:10:50 -07:00
Teknium	e8955f222c	fix(codex): drop dead model slugs that HTTP 400 on ChatGPT Pro (#33424 ) DEFAULT_CODEX_MODELS shipped three slugs that the chatgpt.com Codex backend rejects with HTTP 400 'The <slug> model is not supported when using Codex with a ChatGPT account.' on every account tested live: gpt-5.2-codex gpt-5.1-codex-max gpt-5.1-codex-mini Live verified against https://chatgpt.com/backend-api/codex/models which returns gpt-5.5, gpt-5.4, gpt-5.4-mini, gpt-5.3-codex, gpt-5.3-codex-spark, gpt-5.2 for ChatGPT Pro accounts. When _fetch_models_from_api fell back to DEFAULT_CODEX_MODELS (offline first-run, transient API failure) the picker surfaced these dead slugs and crashed on selection. The forward-compat synthesis table chained them downstream too. If OpenAI re-enables them on the OAuth-backed Codex backend, live discovery will pick them up automatically — the defaults list is only consulted when live discovery is unavailable. Test fixture pivoted to use gpt-5.3-codex (templated by 4 entries) as the synthesis driver so the forward-compat test still exercises the synthesis path.	2026-05-27 12:16:15 -07:00
Donovan Yohan	c94ad89818	fix(kanban): retry corrupt-board dispatch after quarantine	2026-05-27 11:48:23 -07:00
JohnC1009	414a5bc924	fix(auth): fall back to global auth.json in _load_provider_state In profile mode, _load_provider_state previously returned None when a provider was absent from the profile's auth.json — even if the user had authenticated at the global root. This broke runtime credential resolvers that read state directly (resolve_nous_access_token, resolve_nous_runtime_credentials), causing profiles without their own nous login to fail with 'Hermes is not logged into Nous Portal' despite a valid global session. Push the existing read-only global fallback (already used by get_provider_auth_state and read_credential_pool) into _load_provider_state so every caller benefits, and simplify get_provider_auth_state into a thin wrapper. Writes still target the profile only — profile state continues to shadow global state on the next read after a per-profile login. Behavior in classic (non-profile) mode is unchanged because _load_global_auth_store returns an empty dict. Adds 5 tests covering the new contract on _load_provider_state directly. Existing 770 auth/credential/nous tests still pass.	2026-05-27 09:38:58 -07:00
Teknium	69dfcdcc15	fix(auth): codex chat path falls back to credential_pool when singleton is empty Closes #32992. The chat path resolves Codex credentials via `resolve_codex_runtime_credentials` which only reads `providers.openai-codex.tokens` (the singleton). The auxiliary path uses `_read_codex_access_token` which checks the credential_pool first. For users whose tokens live only in the pool — manual seed, partial re-auth, restore from backup, or any state where the singleton is empty but the pool is healthy — the chat path raised AuthError or (worse, since OpenAI(api_key='') silently attaches no header) the wire saw HTTP 401 "Missing Authentication header" while the auxiliary path worked fine. This adds a pool fallback to `resolve_codex_runtime_credentials`: when the singleton has no usable access_token, scan `credential_pool.openai-codex` for the first entry that has a non-empty access_token and isn't in an exhaustion cooldown window (`last_error_reset_at` in the future). If found, return that token with `source="credential_pool"`. If no usable entry exists, the original AuthError propagates as before. Regression tests cover: - Empty singleton + healthy pool entry → pool token returned - Pool fallback skips entries currently in cooldown - Empty singleton + empty/wedged pool → AuthError propagates (existing contract preserved)	2026-05-27 03:43:51 -07:00
konsisumer	f1422ffd77	fix(gateway): classify Codex 429 quota as rate-limit, not missing credentials When the Codex OAuth token endpoint returns 429 (usage-limit / quota exhaustion), refresh_codex_oauth_pure raised a generic auth error that the gateway surfaced as 'Primary provider auth failed: No Codex credentials stored. Run hermes auth', prompting re-auth that cannot lift a quota cap. Classify 429 distinctly (codex_rate_limited, relogin_required=False) with a non-alarming quota message that honors Retry-After, log it as 'Primary provider rate-limited (429)', and stop format_auth_error from appending the re-authenticate remediation. Also log the fallback provider's literal config key instead of the resolved runtime category. Refs #32790	2026-05-27 03:13:15 -07:00
konsisumer	2bbd53493d	fix(cli): sync credential_pool on Codex re-auth Codex re-auth via `hermes setup` / `hermes model` wrote fresh OAuth tokens to providers.openai-codex.tokens but left the credential_pool device_code entry holding the consumed refresh token and stale error markers. Since the runtime selects from the pool, the next request spent a dead token and got a 401 token_invalidated. Update the singleton-seeded pool entries in lockstep and clear their error state. Fixes #33000	2026-05-27 03:02:06 -07:00
Ben	a890389b69	feat(dashboard-auth): HERMES_DASHBOARD_PUBLIC_URL / dashboard.public_url override Operators behind reverse proxies that don't reliably forward X-Forwarded-Host / X-Forwarded-Proto / X-Forwarded-Prefix (manual nginx setups, on-prem ingresses, custom-domain Fly deploys with incomplete proxy chains) had no way to force the absolute base URL the OAuth callback redirects from. The dashboard would reconstruct the redirect_uri from request headers, the IDP would echo it back, and the user would land on the wrong host or wrong path — 404. Add `dashboard.public_url` to config.yaml with env override HERMES_DASHBOARD_PUBLIC_URL. When set, it is the complete authority — scheme + host + optional path prefix (e.g. https://example.com/hermes) — and becomes the base for the OAuth `redirect_uri`. X-Forwarded-Prefix is IGNORED on this code path because the operator has explicitly declared the public URL; we no longer need to guess from proxy headers, and stacking the prefix on top would double-prefix the common case where the prefix is already baked into public_url. When unset, the existing proxy_headers + X-Forwarded-Prefix reconstruction runs untouched. Existing Fly.io deploys continue to work without configuration — this is purely additive. Precedence mirrors dashboard.oauth.client_id: env (non-empty) > config.yaml > reconstructed from request Implementation: - hermes_cli/config.py: add dashboard.public_url to DEFAULT_CONFIG with a multi-paragraph doc comment explaining the use case, the X-Forwarded-Prefix interaction, and the validation rules. - hermes_cli/dashboard_auth/prefix.py: factored out the existing _REJECT_CHARS frozenset, added _normalise_public_url() validator (requires http/https scheme + non-empty host + no header-injection chars), _load_dashboard_section() loader (robust to load_config raising, non-dict shapes), and resolve_public_url() entry point with the env-overrides-config precedence. A malformed value silently falls through to ""; the caller treats "" as "reconstruct from request" so a typo never breaks the login flow. - hermes_cli/dashboard_auth/routes.py: rewrite _redirect_uri() docstring to spell out the three resolution tiers; add the public_url short-circuit before the existing X-Forwarded-Prefix splicing. Source-level comment notes that X-Forwarded-Prefix is intentionally ignored when public_url is set so a future reader doesn't try to "fix" the missing prefix layering. - cli-config.yaml.example: extend the existing dashboard section with a public_url block. - website/docs/user-guide/features/web-dashboard.md: new "Public URL override" section between the provider configuration and the OAuth flow walkthrough. Documents the env-vs-config table, the validation rules, and the `http://` `public_url` ↔ Secure cookie footgun. Test coverage — new TestPublicUrlOverride class (8 tests): - env var overrides request reconstruction (the primary motivating case) - config.yaml used when env unset - env wins over config (precedence pin) - public_url with a path prefix already baked in (the Q1-a case the user explicitly chose) - public_url suppresses X-Forwarded-Prefix layering (defends against the double-prefix bug) - trailing slash stripped from public_url (no //auth/callback) - malformed public_url falls through to reconstruction (six hostile inputs: javascript:, ftp:, missing scheme, missing host, quote chars, CRLF injection) - empty env string doesn't shadow config.yaml entry (CI / Fly provisioned-but-empty secret case) Mutation-tested: flipping the precedence in resolve_public_url() trips exactly test_env_overrides_config_public_url; weakening the validator (accept any scheme) trips exactly test_malformed_public_url_falls_through_to_reconstruction. Both other tests in each pair stay green, confirming the suite discriminates the specific regression each test pins.	2026-05-27 02:12:27 -07:00
Ben	b26d81d536	feat(dashboard-auth): honour X-Forwarded-Prefix + __Host-/__Secure- cookies Mission-control style deploys reverse-proxy the dashboard at a path prefix (e.g. mission-control.tilos.com/hermes/* -> :9119) and inject X-Forwarded-Prefix: /hermes on every request. The SPA mount already honoured this for asset URLs and the bootstrap __HERMES_BASE_PATH__, but the OAuth gate didn't: 1. The gate's Location: header to /login and the 401 envelope's login_url were built bare ("/login?next=..."). Under a /hermes prefix the browser follows that to mission-control.tilos.com/login which the proxy doesn't route to the dashboard. 2. _redirect_uri (the OAuth callback URL handed to the IDP) used request.url_for() which doesn't honour X-Forwarded-Prefix (Starlette/uvicorn only proxy_headers Host + Proto + For). The IDP redirects back to /auth/callback instead of /hermes/auth/ callback → 404 in the user's browser. 3. Cookies were set with Path=/ which leaks them to other apps on the same origin and won't be sent back on requests under the prefix in the first place. Fix threads the normalised prefix through every boundary: * New hermes_cli/dashboard_auth/prefix.py — single source of truth for X-Forwarded-Prefix parsing. web_server._normalise_prefix becomes a re-export so the SPA mount, the gate, and the cookies helper all agree. * middleware._unauth_response builds login_url = f"{prefix}/login". * routes._redirect_uri splices the prefix into the path component of the IDP-bound URL (with full validation of the header). * cookies.{set,clear}_{session,pkce}_cookie now take prefix="". Path attribute switches to /hermes when set; cookie name switches name variant (see below). Every caller passes the request's normalised prefix. Cookie hardening (Teknium's lesser-note #1 in the PR review): adopt the __Host- / __Secure- cookie name prefixes per draft-west-cookie- prefixes. The variant is selected from (use_https, prefix): * Loopback HTTP → bare "hermes_session_at" (both prefixes require Secure, incompatible with HTTP). * HTTPS, direct deploy (Path=/) → "__Host-hermes_session_at". Strongest spec: bound to exact origin, no Domain attribute, Secure required. * HTTPS, behind a proxy prefix (Path=/hermes) → "__Secure-hermes_session_at". __Host- forbids Path != "/"; the explicit Path=/hermes covers same-origin app isolation. Setter and reader BOTH consult the prefix because the cookie name changes — a reader that looked up the bare name when the setter wrote __Secure- would never find the value. The reader falls back across all three variants so a request whose shape changed mid-session (e.g. post-deploy from no-prefix to /hermes) still picks up the existing cookie until it expires. Test coverage: - tests/hermes_cli/test_dashboard_auth_prefix.py — new file. 11 tests pinning: • Location: /hermes/login on the gate's HTML redirect • 401 envelope login_url carries the prefix • Malformed X-Forwarded-Prefix is ignored (header-injection defence; the script-tag value is normalised to empty string) • _redirect_uri splices /hermes into the path (the property that prevents the IDP-returns-to-404 failure) • PKCE cookie uses Path=/hermes + __Secure- when proxied • Session cookies use __Host- when direct, __Secure- when proxied, bare on loopback HTTP • End-to-end round trip with hand-managed PKCE cookie carriage (TestClient can't simulate a Path=/hermes cookie automatically) - tests/hermes_cli/test_dashboard_auth_cookies.py — rewritten to pin each (use_https, prefix) shape produces its expected cookie name, plus reader-side coverage that __Host- and __Secure- variants are both recognised. - Existing tests across middleware / 401-reauth / etc. updated to match the new cookie names (substring contains instead of startswith). Mutation-tested: reverting _unauth_response to build the bare "/login" URL trips exactly the two tests that pin the prefix carriage, confirming the suite discriminates the regression.	2026-05-27 02:12:27 -07:00
Ben	034ad95fed	fix(dashboard-auth): propagate next= through login page + PKCE cookie The gate's _unauth_response set next=<path> on the /login redirect URL, but nothing downstream read it: render_login_html ignored next=, auth_login dropped it, and auth_callback read next= from its own query string — which an IDP never sets on the callback URL (real IDPs only echo back code+state). The _validate_post_login_target plumbing in the callback was unreachable on the happy path, so users always landed on "/" regardless of what they originally requested. Worse: reading next= from the callback URL was a latent open-redirect sink, since an attacker could craft /auth/callback?...&next=/admin and have the server honour it post-auth. Fix carries next= through the round trip on a server-controlled channel: 1. login_page reads request.query_params['next'] and passes it (post- validation) to render_login_html. 2. render_login_html threads next= URL-encoded into each provider button's href, with HTML-attribute escaping as defence in depth. 3. auth_login accepts ?next= as a query param, re-validates, and appends it as a fourth segment (next=<urlquoted>) in the PKCE cookie payload alongside provider/state/verifier. 4. auth_callback no longer accepts a next: str = "" query param. It parses next= out of the PKCE cookie and validates that with the same same-origin rules. Any attacker-supplied ?next= on the callback URL is silently ignored — server-only carrier. Test coverage adds three classes: - TestAuthCallbackNext drives /login → /auth/login → IDP-bounce → /auth/callback end-to-end without smuggling next= onto the callback URL (which is what the previous tests did and why they didn't catch the bug). Includes test_attacker_callback_next_param_is_ignored to pin the security property that the URL value is never read. - TestRenderLoginHtmlNext covers the rendering function at the unit boundary so a regression that drops next_path is caught without spinning up the full app. - TestAuthLoginPkceCookieNext inspects the Set-Cookie header on /auth/login responses so a regression in cookie encoding is caught without driving the full round trip. Mutation-tested: reverting auth_callback to read next= from the URL trips 3 of 6 TestAuthCallbackNext tests (the safe-path and attacker- hardening ones), confirming the suite discriminates between the cookie read and the URL read.	2026-05-27 02:12:27 -07:00
Ben	c3104195b8	fix(dashboard-auth): bypass loopback WS peer check in gated mode When the OAuth gate is active, start_server runs uvicorn with proxy_headers=True so the dashboard can honour X-Forwarded-Proto from Fly's TLS terminator (cookies, redirect URI reconstruction). A side effect: ws.client.host is rewritten to the X-Forwarded-For value, which on Fly is the real internet client IP — never loopback. The loopback peer guard in _ws_client_is_allowed then rejected every WS upgrade in gated mode (4403 close) even after a successful OAuth round trip and ticket consumption, silently breaking /api/pty, /api/ws, /api/pub, and /api/events. Fix: in gated mode, bypass the peer-IP check. The OAuth gate + single-use ticket is the auth. The Host/Origin guard in _ws_host_origin_is_allowed still runs and is what protects against DNS-rebinding here, not the peer IP. Loopback mode behaviour is unchanged: the legacy ?token= path is the only auth there and we don't want LAN hosts guessing tokens. Regression coverage: TestWsRequestIsAllowedGated pins all four behaviours — non-loopback peer allowed in gated mode, non-loopback peer rejected in loopback mode, loopback peer allowed in loopback mode, and the Host/Origin guard still firing on a rebinding attempt with gated mode + matching peer.	2026-05-27 02:12:27 -07:00
Ben	866cc988b5	fix(dashboard-auth): use fixed-length sig suffix in stub token framing The stub auth provider's _sign/_unsign helpers joined payload and HMAC with a 'b"."' separator and recovered the parts via bytes.rsplit. HMAC-SHA256 digests are random bytes, so ~12% of the time the digest contains 0x2E ('.') and rsplit picks the wrong split point -- HMAC verification then spuriously rejects valid tokens. test_stub_refresh_round_trips was failing ~25% of the time in isolation because of this. Switch to a fixed-length suffix (32 bytes, sliced off in _unsign): no separator means no collision class. After the fix, 10/10 runs pass.	2026-05-27 02:12:27 -07:00
Ben	b3dc539304	feat(dashboard-auth): Nous plugin always-on; default portal URL; specific error messages The Nous OAuth provider plugin (plugins/dashboard_auth/nous) is bundled and auto-loaded — same as before — but previously refused to register unless BOTH HERMES_DASHBOARD_OAUTH_CLIENT_ID and HERMES_DASHBOARD_PORTAL_URL were set, then the gate's fail-closed branch told the operator 'install the default Nous provider'. That message is misleading: the provider IS installed; it's just unconfigured. And the contract only really needs the per-instance client_id — the portal URL is the same for everyone in production. Three changes: 1. plugins/dashboard_auth/nous/__init__.py: - HERMES_DASHBOARD_PORTAL_URL is now optional and defaults to 'https://portal.nousresearch.com'. Override only for staging (portal.rewbs.uk) or a custom deployment. Empty string also falls back to the default so an empty Fly secret can't point the dashboard at nowhere. - Plugin exposes a module-level LAST_SKIP_REASON: str that the gate reads when no providers register. Cleared on each register() call. Skip reasons are human-readable and actionable ('HERMES_DASHBOARD_OAUTH_CLIENT_ID is not set. The Nous Portal provisions this env var…'). 2. plugins/dashboard_auth/nous/plugin.yaml: - requires_env drops HERMES_DASHBOARD_PORTAL_URL; only the client_id is mandatory. Description updated to reflect this. 3. hermes_cli/web_server.py: - When the gate fail-closes for 'no providers', it now reads each bundled plugin's LAST_SKIP_REASON and embeds them in the SystemExit message. Operator sees the specific config fix needed: Bundled providers reported these issues: • nous: HERMES_DASHBOARD_OAUTH_CLIENT_ID is not set. … instead of the prior generic 'Install the default Nous provider'. Tests: - TestPluginRegister rewritten to assert the new defaults + LAST_SKIP_REASON contents (6 tests, +1 new for empty-string env). - New gate test test_start_server_surfaces_nous_skip_reason_when_unconfigured. - test_get_method_is_not_allowed widened to handle the SPA-shell 200 path explicitly — assertion now verifies no JSON ticket leaks rather than asserting a specific status code (covers all four of 401/404/405/200). Docs updated: web-dashboard.md's 'Default provider' section now shows the env-var table with required/optional columns and embeds the fail-closed error message verbatim so operators can match what they see at the prompt.	2026-05-27 02:12:27 -07:00

1 2 3 4 5 ...

1121 commits