Commit graph

9770 commits

Author SHA1 Message Date
Biser Perchinkov
b5495db701 fix(agent): re-pad reasoning_content on cross-provider fallback to require-side providers
api_messages is built once before the retry loop while the primary provider
is active. When a mid-conversation fallback switches to a require-side thinking
provider (DeepSeek/Kimi/MiMo), assistant turns built under a non-require primary
(e.g. Codex) go out without reasoning_content and the new provider rejects the
request with HTTP 400 ("reasoning_content must be passed back").

Re-apply the echo-back pad against the current provider immediately before
building the request kwargs. Idempotent and a no-op unless the active provider
enforces echo-back, so it covers all fallback paths without affecting normal or
reject-side operation.

Drafted by Claude (Opus 4.7) under human review while fixing a personal deployment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:21:00 -07:00
Indigo Karasu
9179396cb7 fix(stream-consumer): only set _final_content_delivered when final response confirmed delivered
In GatewayStreamConsumer._run(), _final_content_delivered was set to True
based on the success of a mid-stream finalize edit, before the final
finalize edit was attempted. When the final edit later failed (Telegram
flood control, retry-after), _final_response_sent stayed False but
_final_content_delivered was already True, so gateway/run.py suppressed
its normal final send and the user saw a partial / fallback message
instead of the real answer.

Changes in gateway/stream_consumer.py:
- Remove the premature _final_content_delivered = True at the top of
  the got_done block.
- Set _final_content_delivered = True only when the actual final send /
  edit succeeds, in each finalize branch (no-finalize adapter,
  _message_id finalize, no-_already_sent send).
- _send_fallback_final: don't set _final_response_sent = True when only
  some chunks were delivered; the gateway should still attempt a
  complete final send. Set _final_content_delivered = True alongside
  _final_response_sent on the success path and short-text path.
- Cancellation handler: set _final_content_delivered = True alongside
  _final_response_sent when the best-effort final edit succeeds.

Adds TestFinalContentDeliveredGuard with 3 regression tests covering
the core bug scenario, the happy path, and partial fallback.

Closes #33708
Closes #25010
Refs #29200

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
2026-05-28 03:15:19 -07:00
Dusk1e
a91b1c8b31 fix(tirith): reject non-regular tar members during auto-install process 2026-05-28 02:49:26 -07:00
teknium1
247b24b49f chore(release): add AUTHOR_MAP entry for AdityaRajeshGadgil 2026-05-28 02:45:25 -07:00
Aditya Rajesh Gadgil
031983bbf8 fix: limit pre-update state snapshots 2026-05-28 02:45:25 -07:00
Teknium
8b6beaab5f
docs: 30-day overhaul — correctness audit, PR coverage, Nous Portal weave, sidebar reorg (#33782)
* docs(audit): correctness pass across getting-started, reference, features, messaging, developer-guide, guides, integrations, user-guide

* docs: add PR coverage for last 30d + Nous Portal weave + nav reorg + build fixes

- Add docs for top user-visible PRs that shipped without docs (api-server
  session control, kanban features, telegram pin/edit, provider client tag,
  xAI retired-model migration, cron name lookup, --branch update flag, etc.)
- Apply Nous Portal weave across 23 pages (tasteful one-liners on
  getting-started/learning-path, configuration, overview, vision, x-search,
  credential-pools, provider-routing, cron, codex-runtime, profiles, docker,
  messaging/index, multiple guides, plus FAQ + index promotion)
- Reorganize sidebar: split Messaging into Popular/M365/Chinese/Other,
  Reference into Command/Configuration/Tools-Skills sub-categories, add
  orphan developer-guide pages (web-search-provider-plugin,
  browser-supervisor), move features from Integrations back to Features,
  fold lone spotify into Media & Web.
- Regenerate skill stubs + catalogs (kanban-codex-lane, hermes-s6-container-
  supervision, web-pentest)
- Fix broken anchor links (security/cron, configuration/fallback, telegram
  large-files, adding-platform-adapters step-by-step)
2026-05-28 02:41:36 -07:00
teknium1
c7f7783e5c test(xai-proxy): regression coverage for #28932 429 handling
Three new tests in tests/hermes_cli/test_proxy.py:

- xai_adapter_retry_rotates_pool_entry_on_429 — headline #28932 case.
  Two-entry pool, 429 on first entry, must rotate to second entry
  AND must NOT call refresh_xai_oauth_pure (refresh is irrelevant
  for rate limits).
- xai_adapter_retry_returns_none_on_429_when_pool_exhausted —
  single-entry pool: 429 returns None so the rate-limit response
  flows back to the client unchanged (existing behavior preserved).
- xai_adapter_retry_returns_none_for_unrelated_status — non-{401,
  429} statuses must not trigger any retry path at all; guards
  against the gate becoming too broad in future changes.

Each test asserts that refresh_xai_oauth_pure is never called on the
429 path — refresh is a 401-specific concern.

39/39 in tests/hermes_cli/test_proxy.py.
2026-05-28 02:36:37 -07:00
sprmn24
4ed482549f fix(xai-proxy): handle 429 rate-limit responses in proxy retry path
get_retry_credential only triggered on 401; a 429 Too Many Requests from
xAI was silently streamed back with no key rotation or back-off signal.

- server.py: widen retry gate from == 401 to in {401, 429}
- xai.py: on 429, skip token refresh and call mark_exhausted_and_rotate
  to stamp the 1-hour cooldown on the rate-limited key and return the
  next available credential. Returns None if pool is exhausted.
2026-05-28 02:36:37 -07:00
Dusk1e
aa3466063b fix(android): reject unsafe tar members in psutil compatibility installer 2026-05-28 02:36:09 -07:00
Teknium
bb0ac5ced2 chore(release): AUTHOR_MAP entry for vynxevainglory-ai
PR #29233 salvage.
2026-05-28 02:33:51 -07:00
Teknium
70abae8e3b fix(kanban): show horizontal scrollbar instead of wrapping columns
Salvage follow-up on top of @vynxevainglory-ai's PR #29233. Keep the
column-body flex:1 + min-height:0 fix (tall columns scroll internally
now), but drop the flex-wrap: wrap part — instead just stop hiding
the existing horizontal scrollbar.

PR #523254b34 (sadiksaifi, May 18) deliberately moved the kanban board
from a wrapping grid to a single-row pinned-width flex so the board
stays as one stable horizontal row. The mistake in that PR was the
scrollbar-width: none + ::-webkit-scrollbar { display: none } pair,
which hid the affordance so columns past the viewport became visually
inaccessible. Fixing that hidden-scrollbar bug while keeping the
single-row design honors both contributors' intent.
2026-05-28 02:33:51 -07:00
Vynxe Vainglory
538f0fa339 fix(kanban): wrap columns into rows and fix vertical overflow
Two CSS issues in the kanban dashboard:

1. Columns overflow horizontally with no way to reach them — the
   original scrollbar-width: none hid the scrollbar entirely, and
   even with a scrollbar, a wrapping layout is better UX for a board
   with 8+ columns. Changed to flex-wrap: wrap and removed the
   overflow-x: auto + hidden scrollbar rules. Columns now flow into
   multiple rows (~3 per row on a typical viewport) instead of
   running off-screen.

2. .hermes-kanban-column-body lacked flex: 1 and min-height: 0,
   so the flex child's implicit min-height: auto prevented it from
   shrinking below its content size. Columns with many cards pushed
   past the parent max-height instead of scrolling internally.

Verified: 9 columns wrap into 3 rows, all visible without
horizontal scroll. Done column (53 tasks) scrolls vertically
within its column bounds.
2026-05-28 02:33:51 -07:00
teknium1
9b5dae17a5 feat(context-engine): host contract for external context engines
Condenses the substance of PRs #16453, #17453, #16451, #17600, and #13373
into a minimal generic host contract that external context engine plugins
(e.g. hermes-lcm) need to integrate cleanly. Drops scaffolding that
duplicated existing infrastructure or had marginal value.

Five concrete changes:

1. `_transition_context_engine_session()` on AIAgent — generic lifecycle
   helper that fires on_session_end → on_session_reset → on_session_start
   → optional carry_over_new_session_context. Engines implement only the
   hooks they need; missing hooks are skipped. Built-in compressor keeps
   its existing reset-only behavior because callers default to no
   metadata. `reset_session_state()` now optionally accepts
   previous_messages / old_session_id / carry_over_context and delegates
   to the transition helper when provided. (#16453)

2. `conversation_id` passed to `on_session_start()` — both the
   agent-init call site and the compression-boundary call site now
   forward `self._gateway_session_key` so plugin engines have a stable
   conversation identity that survives session_id rotation (compression
   splits, /new, resume). The key already existed on AIAgent; it just
   wasn't reaching engines. (#16453)

3. Canonical cache buckets forwarded to engines — the usage dict passed
   to `update_from_response()` now includes input_tokens, output_tokens,
   cache_read_tokens, cache_write_tokens, and reasoning_tokens on top of
   the legacy prompt/completion/total keys. Engines can make decisions on
   cache-hit ratios and reasoning costs instead of only aggregates. ABC
   docstring updated. (#17453)

4. Plugin-registered context engines visible in the picker —
   `_discover_context_engines()` in plugins_cmd.py now also includes
   engines registered via `ctx.register_context_engine()` from plugin
   manifests, deduplicating by name so repo-shipped descriptions win on
   collision. (#16451)

5. `_EngineCollector.register_command()` — context engines using the
   standard `register(ctx)` pattern can now expose slash commands (e.g.
   `/lcm`). Routes to the global plugin command registry with the same
   conflict-rejection policy regular plugins use (no shadowing built-ins,
   no clobbering other plugins). Previously these calls hit a no-op and
   the slash commands silently never appeared. (#17600)

Dropped from the original 5 PRs:

- Compression boundary signal (`boundary_reason="compression"`) from
  #16453 — already on main at `agent/conversation_compression.py:412-424`,
  landed via the bg-review extraction.

- `discover_plugins()` before fallback in run_agent.py from #16451 —
  redundant: `get_plugin_context_engine()` already routes through
  `_ensure_plugins_discovered()` which is idempotent.

- Runtime identity diagnostics method + helpers from #13373 (+251 LOC) —
  operators can already read engine state via `engine.get_status()`;
  the diagnostics view added marginal value relative to its surface area.

- The 553-LOC slash-command machinery from #17600 — replaced with a
  20-LOC `register_command` method on the collector that reuses the
  existing plugin command registry instead of building a parallel one.

Net: ~215 LOC of host-contract changes + 282 LOC of focused tests, vs
~1,176 LOC across the original 5 PRs.

Co-authored-by: Tosko4 <1294707+Tosko4@users.noreply.github.com>

Closes #16453.
Closes #17453.
Closes #16451.
Closes #17600.
Closes #13373.
Related: stephenschoettler/hermes-lcm#68.
2026-05-28 01:45:30 -07:00
Teknium
fb9f3a4ef9
fix(skills): pull full ClawHub catalog into the skills index (200 → 20k+) (#33748)
* fix(skills): pull full ClawHub catalog into the skills index

The website was showing 200 ClawHub skills out of 20k+ because
`ClawHubSource.search("")` for empty queries went straight to a single
unpaginated request. ClawHub's API caps any single page at 200 items and
returns a `nextCursor`; we grabbed page 1 and stopped, so the cached
index served from hermes-agent.nousresearch.com had a silent 99%
truncation.

End users never hit clawhub.ai directly (the index is rebuilt twice
daily by .github/workflows/skills-index.yml and served as a static JSON
on the docs site), so the cap-and-cache architecture is correct — it
just wasn't being filled.

Changes:
- `ClawHubSource.search(query="")` now routes through the existing
  `_load_catalog_index()` paginating walker instead of the unpaginated
  listing fallback (non-empty queries still hit the fast catalog search).
- `_load_catalog_index()` max_pages 50 → 250 (50k-skill ceiling; live
  catalog is ~20k as of May 2026, with headroom for growth).
- `build_skills_index.py`: per-source crawl limits split out — ClawHub
  and LobeHub get 100k, others keep their effective caps.
- `EXPECTED_FLOORS["clawhub"]` 50 → 5000 so the next pagination
  regression hard-fails the CI build instead of silently shipping a
  degenerate index.

Test plan:
- New unit test `test_search_empty_query_paginates_full_catalog`
  exercises the cursor-following path with three mocked pages (450
  total items) and asserts all pages are walked.
- Existing 9 ClawHub tests + 127 broader skills_hub tests all pass.
- E2E against live ClawHub API: walker reached 9700+ skills across 49
  pages before this commit landed, paginating well past the previous
  50-page cap.

* fix(skills): raise ClawHub ceilings — live catalog is 50k, not 20k

E2E walk against live ClawHub API hit my initial 250-page cap at 49,698
skills with cursor=yes still pending. The catalog is roughly 2.5x larger
than the docstring estimate.

- max_pages 250 → 750 (150k ceiling, walks terminate on cursor=None
  well before this in practice)
- SOURCE_LIMITS['clawhub'] 100k → 200k
- EXPECTED_FLOORS['clawhub'] 5000 → 20000
2026-05-28 01:42:19 -07:00
Teknium
09a5cd8084
fix(auth): sync manual:device_code Codex pool entries on re-auth (#33744)
#33164 made _save_codex_tokens sync the singleton-seeded `device_code`
pool entry on Codex OAuth re-auth. That fixed the #33000 path but missed
`manual:device_code` entries created by `hermes auth add openai-codex`
(the recommended workaround for users who hit #33000 before #33164
landed).

Every subsequent re-auth would refresh the device_code entry but leave
the manual:device_code entry holding the consumed refresh token plus
stale last_error_* markers — immediately recreating the 401
token_invalidated symptom on the next request, exactly as reported in
#33538.

Extend the refreshable source set to include `manual:device_code`.
Completing the device-code OAuth flow proves the user owns the ChatGPT
account, so it is safe to refresh every device-code-backed entry. Keep
`manual:api_key` and other non-device-code manual sources untouched —
those represent independent credentials.

Closes #33538.
2026-05-28 01:33:10 -07:00
Dusk1e
43abc51f66 fix(security): require source CIDR allowlisting for public msgraph webhook binds 2026-05-28 01:26:18 -07:00
Teknium
986abb3cf7
docs: drop stale Kimi/DeepSeek vision example (#33736)
Kimi K2.6 is natively multimodal — flagged by Shengyuan from the Kimi
growth team. Replace the named-vendor example with a model-agnostic
phrasing so the row doesn't go stale as more vendors ship vision.
2026-05-28 01:23:38 -07:00
Teknium
87e5b2fae0
feat(mcp): support TLS client certificates (mTLS) for HTTP and SSE servers (#33721)
Adds first-class `client_cert` / `client_key` config keys so MCP servers
behind mTLS work without an external TLS-terminating proxy. Resolves
inbound community question (Jeremy W.).

Schema (per `mcp_servers.<name>`, HTTP/SSE only):

- `client_cert: "/path/to/combined.pem"` — single PEM with cert + key
- `client_cert: "/path/to/cert"` + `client_key: "/path/to/key"` — separate
- `client_cert: [cert, key]` or `[cert, key, password]` — list form,
  with optional passphrase for encrypted keys

Paths support `~` expansion. Missing files raise a server-scoped
`FileNotFoundError` at connect time rather than failing later with an
opaque TLS handshake error.

Wiring:

- New SDK HTTP path (mcp >= 1.24): `cert=` on the user-owned
  `httpx.AsyncClient` alongside the existing `verify=` handling.
- SSE path: routed through an `httpx_client_factory` that wraps the
  SDK's defaults (follow_redirects=True) and layers `verify` + `cert`
  on top. The factory is only injected when needed, so the SDK's
  built-in `create_mcp_http_client` keeps being used in the default
  case.
- Deprecated mcp<1.24 path left untouched — that SDK's
  `streamablehttp_client` signature doesn't expose `cert`, and adding
  it would be dead code.

Also documents the previously-undocumented `ssl_verify` key (bool or
CA bundle path) in the MCP config reference.

Tests:

- `tests/tools/test_mcp_client_cert.py` (new, 19 tests):
  - `_resolve_client_cert` helper: all three input forms, `~` expansion,
    missing-file and validation errors.
  - HTTP transport: `cert=` forwarded into `httpx.AsyncClient` for
    string and tuple forms; absent when unset; missing-file error
    propagates.
  - SSE transport: factory only injected when cert or non-default
    verify is set; factory applies cert, custom CA bundle, and
    preserves `follow_redirects=True` + forwarded headers/auth.
- Existing tests: 200/200 in `test_mcp_tool.py` + `test_mcp_sse_transport.py`
  still pass.
2026-05-28 00:55:55 -07:00
Stephen Schoettler
8595281f3c fix: expose context engine tools with saved toolsets 2026-05-28 00:28:42 -07:00
Dusk1e
1a9ef83147 fix(security): require API_SERVER_KEY before dispatching API server work 2026-05-28 00:25:08 -07:00
LeonSGP43
442a9203c0 Fix xAI OAuth timeout manual fallback 2026-05-28 00:24:17 -07:00
helix4u
459d7694d3 fix(agent): preload jiter native parser 2026-05-28 00:20:11 -07:00
Robin Fernandes
dc52b82d53 test(auth): update entitlement CI expectations 2026-05-28 00:19:31 -07:00
Robin Fernandes
1cf5e639b3 fix(auth): refresh Nous entitlement in tool menus 2026-05-28 00:19:31 -07:00
Robin Fernandes
406901b27d feat(auth) normalise the way in which we check whether a user has free/paid access to nous portal so we can expose behaviour and error messages accordingly. 2026-05-28 00:19:31 -07:00
stephenschoettler
0bf9b867cf fix(website): pin serialize-javascript and uuid via npm overrides
Resolves the two Dependabot alerts currently open against the website
lockfile:

- serialize-javascript: pin to ^7.0.5 (was 6.0.2 — high-severity RCE
  via RegExp.flags + Date.prototype.to*, plus medium-severity DoS)
- uuid: pin to ^14.0.0 (was 8.3.2 — medium buffer bounds check miss
  in v3/v5/v6 when buf is provided)

Lockfile regenerated against current main (not the stale lockfile
from the original PR — several Dependabot bumps for mermaid,
webpack-dev-server, @babel/plugin-transform-modules-systemjs,
fast-uri, lodash-es+langium, lodash, follow-redirects, and dompurify
have landed since #30036 was opened, so the website portion was
re-applied surgically on top of those).

Salvaged the website half of PR #30036. The TUI test half landed
on main separately, so this PR is web-only.
2026-05-28 00:07:54 -07:00
kshitijk4poor
7b778db472 chore(release): map MoonRay305 contributor email for #32759 salvage
Adds `squiddy@2rook.ai → MoonRay305` to AUTHOR_MAP so contributor_audit.py
passes for the salvaged commits in #33482-followup PR.
2026-05-27 23:28:51 -07:00
Squiddy
3ba8962738 fix(kanban): add Windows init lock guard 2026-05-27 23:28:51 -07:00
Squiddy
90b6b3d18f fix(kanban): harden sqlite connection concurrency 2026-05-27 23:28:51 -07:00
Brian D. Evans
3ad46933d3
docs(voice): use uv pip install faster-whisper in STT install hints (#29800)
* docs(voice): use `uv pip install faster-whisper` in STT install hints

Three runtime messages told users to `pip install faster-whisper`
(reported in #29782 for the gateway STT failure message under
Telegram-in-Docker, where the user hit `bash: pip: command not
found`). The Hermes Docker image is built on `ghcr.io/astral-sh/uv`
with a uv-managed venv that doesn't ship `pip` on PATH; users on
modern `uv tool install` / `uv venv` installs see the same problem.

The canonical install command in this repo is `uv pip install`
(see `tools/lazy_deps.py:509` `feature_install_command()`), which
works in Docker (uv image), in `uv tool install` venvs, and in
pip-based venvs that already have uv on PATH.

Changed three locations to match:

- `gateway/run.py` — Telegram/Discord/Slack/WhatsApp/etc. voice
  reply when no STT provider is configured. Suggests
  `uv pip install faster-whisper` and notes that
  `pip install faster-whisper` also works if `pip` is on PATH.
- `tools/voice_mode.py` — `/voice` status line for missing STT.
- `cli.py` — Voice-mode startup error, "Option 1".

No behavior change beyond the user-facing text. No production
code path was touched.

* docs(voice): add pip fallback to cli + voice_mode STT hints

Copilot flagged that cli.py and tools/voice_mode.py recommend
`uv pip install faster-whisper` without a fallback for environments
where uv isn't on PATH. The gateway/run.py message already lists
`pip install faster-whisper` as an alternative; this commit aligns
the two remaining call sites to match.

Addresses inline Copilot review on #29800.

---------

Co-authored-by: briandevans <252620095+briandevans@users.noreply.github.com>
2026-05-28 16:23:14 +10:00
Teknium
4e702fe2d9
test(ci): harden two flaky tests against CI noise (#33675)
Some checks are pending
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker / shell lint / Lint Dockerfile (hadolint) (push) Waiting to run
Docker / shell lint / Lint docker/ shell scripts (shellcheck) (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Lint (ruff + ty) / ruff + ty diff (push) Waiting to run
Lint (ruff + ty) / ruff enforcement (blocking) (push) Waiting to run
Lint (ruff + ty) / Windows footguns (blocking) (push) Waiting to run
Nix / nix (macos-latest) (push) Waiting to run
Nix / nix (ubuntu-latest) (push) Waiting to run
OSV-Scanner / Scan lockfiles (push) Waiting to run
Tests / test (1) (push) Waiting to run
Tests / test (2) (push) Waiting to run
Tests / test (3) (push) Waiting to run
Tests / test (4) (push) Waiting to run
Tests / test (5) (push) Waiting to run
Tests / test (6) (push) Waiting to run
Tests / save-durations (push) Blocked by required conditions
Tests / e2e (push) Waiting to run
uv.lock check / uv lock --check (push) Waiting to run
Two unrelated transient failures on PR #33661's initial CI run, both
pre-existing on main and recovered on rerun. Hardening:

1. tests/cron/test_scheduler.py::TestRunJobConfigLogging — added mocks for
   resolve_runtime_provider() and discover_mcp_tools(). The yaml-warning
   tests intend to exercise only the warning-log path, but
   _run_job_impl continues into provider resolution and MCP discovery
   after the warning. Both can spawn subprocesses / hit the network and
   pushed the test over its 30s budget under GHA load.

2. tests/tools/test_browser_supervisor.py — wrapped Chrome teardown
   against the stdlib subprocess._wait() race (bpo-38630). When SIGCHLD
   arrives during proc.wait(), _try_wait(WNOHANG) can return a foreign
   pid and the 'assert pid == self.pid or pid == 0' fires. Fixture now
   catches AssertionError/TimeoutExpired, force-kills, and always reaps
   so no zombie escapes. Same hardening applied to the early-skip branch.
2026-05-27 23:15:41 -07:00
Ben
875d930ac7 test(docker-update): stub subprocess.run in git-install regression guard
The regression-guard test
`test_cmd_update_on_git_install_does_not_print_docker_message` mocked
`is_managed` and `detect_install_method` but not `subprocess.run`, so
once `cmd_update(check=True)` decided this was a git install it shelled
out to a real `git fetch upstream` / `git fetch origin`. On CI runners
the worktree has no `upstream` remote configured and the fetch hung
past the 30s pytest-timeout — test (4) slice failed in #33659 CI.

Fix: stub `subprocess.run` with a successful CompletedProcess-shaped
object whose stdout is `"0\n"`, so:
  - no real git command is ever invoked
  - the rev-list parsing later in the flow (`int(stdout.strip())`)
    succeeds rather than `ValueError`-ing through the test's
    SystemExit catch
  - the flow proceeds far enough to confirm the docker banner is
    absent (the actual assertion)

Also broaden the except clause to `(SystemExit, Exception)`: the only
assertion in this test is the negative-banner check on captured stdout;
any further failure in the rest of the update flow is irrelevant to
that contract.

Verified locally: all 7 tests in
`tests/hermes_cli/test_cmd_update_docker.py` pass in 0.39s (previously
the regression-guard test alone consumed 30s+ and got SIGTERM'd).
2026-05-28 15:50:25 +10:00
Ben
b924b22a9d fix(docker): hermes update prints docker pull guidance instead of bogus git error
Inside the published Docker image, `hermes update` was hitting the
".git missing → reinstall via curl" fallback:

    ✗ Not a git repository. Please reinstall:
      curl -fsSL https://raw.githubusercontent.com/.../install.sh | bash

That message is wrong on two counts:
  1. It tells the user to run the host-side installer, which would
     install a *new* Hermes on the host — not update the running
     container.
  2. It doesn't mention `docker pull` at all, leaving Docker users
     to figure out the right action from scratch.

`hermes update --check` was worse: it bailed with "Not a git
repository — cannot check for updates." and nothing else.

Fix: detect the Docker install method (already stamped by
`docker/stage2-hook.sh` and surfaced by `detect_install_method()`)
in both update entry points and print a long-form message that
covers:

  - The right command: `docker pull nousresearch/hermes-agent:latest`
  - Restart guidance (`docker compose up -d --force-recreate` /
    re-run `docker run`)
  - How to verify the new version after restart
  - Tag-pinning caveat (`:latest` doesn't move a pinned tag)
  - Config persistence across upgrades (state under `HERMES_HOME` /
    `/opt/data` is bind-mounted and survives)
  - Fork escape hatch (build your own image with the repo's Dockerfile)

Exit code is 1 (matches `managed_error` semantic for "tried to
update but can't update this way").

Plumbing:
  - hermes_cli/config.py: new `format_docker_update_message()` helper
    sits next to the existing `_NIX_UPDATE_MSG` /
    `format_managed_message()` family so the wording lives in one
    place and both call sites (apply path + check path) consume it.
  - hermes_cli/main.py:
      * `cmd_update()`: bail right after the `is_managed()` gate, before
        any of the apply-path branches.
      * `_cmd_update_check()`: bail at the top of the function, before
        the existing `method == "pip"` branch.
    Neither path touches subprocess.run / git when method == "docker".

Coverage:
  - 7 new tests in `tests/hermes_cli/test_cmd_update_docker.py`:
      * `hermes update` in Docker → message + exit 1, no git calls
      * `hermes update --check` (via cmd_update) → same
      * `--yes` / `--force` don't bypass (intentional)
      * `_cmd_update_check` called directly → bails too
      * git/pip installs still take their normal paths (regression guards)
      * `format_docker_update_message` content-lock test pinning the
        five user-actionable bits the message must contain
  - Existing test_cmd_update.py (21 tests) + test_managed_installs.py
    (5 tests) still pass — no regression on the source-install path.
  - Verified end-to-end in a real container: `docker run ... update`
    and `docker run ... update --check` both render the message and
    exit 1.
2026-05-28 15:50:25 +10:00
stephenschoettler
4a6f1863ac test: cover ci-unblocker production regressions
Snapshot review_agent._session_messages before teardown so close() can
clean per-session state without dropping the user-visible
self-improvement summary. Adds two regressions:

- bg-review summarizer receives captured review-agent tool messages
  after review_agent.close() runs
- context-compressor protected-head handoff rehydration populates
  _previous_summary and keeps the old handoff out of newly summarized
  turns

Salvaged from PR #26039 onto current main after agent/background_review.py
extraction. Original commit 63eaf6055; bg-review test updated to patch
the module-level summarize_background_review_actions in
agent.background_review instead of the now-forwarder
AIAgent._summarize_background_review_actions.
2026-05-27 22:14:53 -07:00
Ben
66489f38c7 fix(docker): bake build-time git SHA into the image
`hermes dump` and the startup banner both call `git rev-parse HEAD` to
report the running commit, but `.dockerignore` line 2 excludes `.git` —
so inside the published image `hermes dump` shows
`version: ... [(unknown)]` and the banner drops its `· upstream <sha>`
suffix entirely.  That makes support triage from container bug reports
impossible: we can't tell which commit the user is actually running.

Fix: thread the build-time SHA through as a Docker build-arg, write it
to `/opt/hermes/.hermes_build_sha` in the image, and have a new
`hermes_cli/build_info.get_build_sha()` read it as a fallback after the
existing live-git lookup fails.  Output format is unchanged in both
callsites — same 8-char short SHA whether resolved live or baked.

Wiring:
  - Dockerfile: `ARG HERMES_GIT_SHA=` + write-file step after the source
    copy.  Empty/missing arg → no file written → callers fall through to
    live git (so local `docker build` without --build-arg is unchanged).
  - docker-publish.yml: passes `HERMES_GIT_SHA=${{ github.sha }}` on all
    four build-push-action steps (amd64/arm64, smoke-test + final push).
  - dump.py:_get_git_commit() / banner.py:get_git_banner_state(): try
    live git first, fall back to baked SHA, then to legacy `(unknown)`
    / None.  Banner returns `upstream == local, ahead=0` because a built
    image is by definition pinned to one commit.

Coverage:
  - Unit tests cover build_info (file present/absent/empty/error,
    truncation, whitespace), dump (live-git wins, both fallbacks,
    identical output-format regression guard), and banner (no-repo +
    baked, no-repo + no-sha, shallow-clone fallback).
  - tests/docker/test_dump_build_sha.py is an integration regression
    guard that runs against the real image, reads
    `/opt/hermes/.hermes_build_sha`, and asserts `hermes dump` surfaces
    its content (or stays at `(unknown)` if no file).
  - Verified end-to-end: `docker build --build-arg HERMES_GIT_SHA=abc...`
    → `docker run ... dump` reports `[abc12345]`; without the build-arg
    it reports `[(unknown)]` as before.
2026-05-28 15:14:05 +10:00
teknium1
ebe04c66cd fix(kanban): close kanban.db FD after every connect() in long-lived processes
`sqlite3.Connection.__exit__` commits/rollbacks but does NOT close the
underlying FD. `with kb.connect() as conn:` in long-lived processes
(gateway `run_slash`, dashboard `decompose_task_endpoint`) therefore
leaks one FD to `kanban.db` per call. After enough operations the
gateway dies with `[Errno 24] Too many open files` (~4 days uptime
in the production report — #33159).

Fix: add a `connect_closing()` context manager in `hermes_cli/kanban_db`
that wraps `connect()` with a real `try/finally: conn.close()`. Switch
the 42 leak-prone call sites in `hermes_cli/kanban.py` (35),
`hermes_cli/kanban_decompose.py` (4), and `hermes_cli/kanban_specify.py`
(3) over to it.

`kanban.py` matters because `run_slash` (called from the gateway for
every `/kanban` slash command) parses argparse and dispatches to those
`_cmd_*` functions in-process — each one was leaking one FD per
invocation.

Tests inside `tests/` are untouched: short-lived processes where OS
cleanup masks the leak. Regression tests added in
`test_kanban_db.py` cover both happy-path and exception-path closure,
plus an explicit assertion that bare `with kb.connect()` still does
NOT close (documenting the upstream sqlite3 behaviour we're working
around).

Closes #33159.
2026-05-27 22:07:49 -07:00
Teknium
6d947e4d78
feat(image_gen/fal): add Krea 2 Medium + Large to FAL catalog (#33506)
fal announced Krea 2 day-0 as an official API partner on 2026-05-27.
Add both variants to the FAL_MODELS catalog so they appear in the
'hermes tools' model picker alongside flux-2, gpt-image, nano-banana,
etc. Users who already bill through FAL or Nous Portal subscription
can now use Krea without registering directly with Krea.

Model IDs (as listed in fal's launch announcement):
  fal-ai/krea/v2/medium/text-to-image  — $0.030 / image
  fal-ai/krea/v2/large/text-to-image   — $0.060 / image

Both share the same parameter schema:
  - aspect_ratio (1:1, 4:3, 3:2, 16:9, 2.35:1, 4:5, 2:3, 9:16)
    mapped from our 3 abstract ratios via size_style='aspect_ratio'
  - creativity (raw|low|medium|high; default medium)
  - seed (reproducibility)
  - image_style_references (up to 10 per Krea's API spec)

No num_inference_steps / guidance_scale / num_images — Krea 2 does
not expose those, and the supports-set filter strips them defensively
if the agent ever passes them.

This is the FAL-routed variant. The separate native-Krea-API plugin
shipped in PR #33236 (plugins/image_gen/krea/) remains available for
users who want to bill directly through Krea's API with their own
key. Both routes converge on the same underlying model.

Nous Portal managed-FAL gateway: this commit makes the model IDs
known to the catalog and the picker. The Portal team will need to
allowlist these two endpoint slugs on the fal-queue origin server-side
for them to flow through the managed billing path.
2026-05-27 21:42:52 -07:00
Wesley Simplicio
10f13c3881
fix(web): allow mobile dashboard scrolling (#28051) (#28577)
* fix(web): allow mobile dashboard scrolling

* fix(web): combine mobile root scroll rules

---------

Co-authored-by: Wesley Simplicio <wesley.simplicio.ext@siemens-energy.com>
2026-05-28 00:02:50 -04:00
Austin Pickett
c9410b3462
feat(web): add collapsible sidebar for the dashboard (#33421)
* feat(web): add collapsible sidebar for the dashboard

The desktop sidebar can now be collapsed to an icon-only rail via a
toggle button in the sidebar header.  State is persisted in
localStorage so it survives page reloads.

When collapsed (lg+ only):
- Sidebar shrinks from w-64 to w-14 with a smooth width transition
- Nav items show only their icon with a native title tooltip
- Brand text, plugin headings, system actions, theme/language
  switchers, auth widget, and footer are hidden
- Mobile drawer behavior is unchanged (always full-width)

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(web): align sidebar tooltips to sidebar edge consistently

Tooltip left position now uses the sidebar's right edge instead of the
anchor element's right edge, so narrow anchors (theme/language switchers)
align with full-width anchors (nav links, system actions).

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(web): add tooltip animations, restore theme label, rename Sessions tab

- Sidebar tooltips now animate in with a subtle 120ms ease-out slide;
  subsequent tooltips within the same hover sequence appear instantly
  (no delay/animation) following Emil Kowalski's tooltip pattern
- Restore theme name label when sidebar is expanded
- Rename Sessions segment tab to "History" across all 16 locales

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(web): smooth sidebar collapse animation

- Remove icon centering on collapse; icons stay left-aligned at px-5
  so they don't jump during the width transition
- Text labels fade out with opacity transition instead of instant
  display:none, clipped naturally by overflow-hidden
- Slow collapse duration from 450ms to 600ms for a more relaxed feel
- Gateway dot always rendered with opacity toggle so it doesn't
  slide in from the right on collapse
- Pin gateway dot at fixed left offset (pl-[1.625rem]) to align
  with nav icons
- Align header toggle button with justify-center when collapsed
- Bottom switchers use items-start when collapsed to prevent reflow

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-27 23:58:41 -04:00
Dusk
c341a2d107
fix(docker): align HOME for dashboard and s6 gateway services (#33481) 2026-05-28 13:42:27 +10:00
teknium1
71b4a6b18e fix(docker): install python-is-python3 so bare python resolves in containers
Debian 13 ships only `python3` — there's no `/usr/bin/python` symlink. When
the agent emits bash commands using bare `python` (which models do frequently
from their training prior), every such call fails with:

    /usr/bin/bash: python: command not found
    Tool terminal returned error … exit_code 127

The agent then retries with different approaches, sessions take longer, and
agent.log fills with WARNING noise.

`python-is-python3` is the standard Debian package that drops a
`/usr/bin/python → python3` symlink. ~30 KB, zero behavior change for
anything calling `python3` directly; transparent fix for everything else.

Fixes #33178.
2026-05-28 13:37:17 +10:00
Ben Barclay
aeb992d343 fix(docker): drop docker exec to hermes uid before invoking the CLI
When operators ran `docker exec <c> hermes login` (or anything else
that wrote under $HERMES_HOME) they defaulted to root, leaving
/opt/data/auth.json root:root mode 0600. The supervised gateway
(UID 10000) then couldn't read its own credentials and returned
"Provider authentication failed: Hermes is not logged into Nous
Portal" on every Telegram/Discord/etc. message — even though
`docker exec <c> hermes chat -q ping` (also root) succeeded because
root could read its own root-owned file. _load_auth_store swallowed
PermissionError as a parse failure and copied the file aside as
auth.json.corrupt, making the diagnostic more misleading.

Fix: install a privilege-drop shim at /opt/hermes/bin/hermes,
prepended ahead of the venv on PATH. When invoked as root the shim
exec's the real venv binary via `s6-setuidgid hermes` — so any file
the docker-exec session writes is uid-aligned with the supervised
processes. Non-root callers (the supervised processes themselves,
`docker exec --user hermes`, kanban subagents, anything inside the
container that's not coming through docker-exec) hit a single exec
to the absolute venv path with no privilege change.

Recursion is impossible: the shim exec's the venv binary by
absolute path (/opt/hermes/.venv/bin/hermes), so the second hop
cannot re-enter the shim regardless of PATH state. No sentinel env
var needed (unlike #33583's gateway-run redirect which DOES need
HERMES_S6_SUPERVISED_CHILD because there's no absolute-path
equivalent for the s6 dispatch).

Opt-out: `docker exec -e HERMES_DOCKER_EXEC_AS_ROOT=1 …` for
diagnostic sessions where the operator deliberately wants root.
Strict truthiness (1/true/yes case-insensitive); typos like `=0`
do not silently opt out, mirroring HERMES_GATEWAY_NO_SUPERVISE in
#33583.

If `s6-setuidgid` is missing (someone stripped s6-overlay in a
downstream fork), the shim exits 126 with a remediation message
pointing at `--user hermes` and the opt-out — never silently runs
as root.

Test plan:
- tests/docker/test_docker_exec_privilege_drop.py — 11 tests
  - shim drops root to hermes uid (file ownership check)
  - shim short-circuits for non-root docker exec
  - HERMES_DOCKER_EXEC_AS_ROOT=1 keeps root
  - strict-truthiness parametrization (5 falsy values reject)
  - main CMD path unaffected (recursion guard)
  - E2E: every file written by docker-exec is readable by uid 10000
- Full tests/docker/ harness: 32/32 pass against fresh image build
- shellcheck --severity=error: clean
- hadolint: clean
- Manual: reproduced the original symptom (root-owned auth.json)
  by bypassing the shim; confirmed default docker-exec produces
  hermes-owned files; confirmed opt-out env keeps root semantics.

Known follow-up: this prevents NEW instances of the bug. Volumes
that already have root:root /opt/data/auth.json from a pre-shim
image need a one-time `chown hermes:hermes` before rebooting onto
the new image. A stage2-hook chown sweep can self-heal that, but
is deferred per scope decision.
2026-05-28 13:30:36 +10:00
Ben Barclay
b345323195 fix(docker): tee supervised gateway stdout to docker logs
Follow-up to #33583 (the gateway-run-supervised redirect).

Before this fix, the supervised gateway's stdout (most visibly the
"Hermes Gateway Starting…" rich-console banner) was swallowed by
`s6-log` into the rotated file at
`${HERMES_HOME}/logs/gateways/<profile>/current` and never reached
`docker logs`. Operational signal lived in two places:

  * **docker logs** — saw stderr (Python `logging` defaults to
    stderr), so warnings/errors were visible.
  * **the rotated file** — saw stdout (rich banners, `print()`
    output, third-party libs that wrote to fd 1).

This was surprising for users coming from the pre-s6 image, where
`docker run … gateway run` produced a single unified stream in
`docker logs`. They'd see partial output, conclude something was
broken, and dig around for the missing pieces.

Fix: add the `1` s6-log action directive before the file destination
so each line is forwarded to s6-log's stdout — which propagates up
the s6-supervise pipeline to /init's stdout = container stdout =
`docker logs`. The file destination is preserved as a second
destination, so the rotated log (with ISO 8601 timestamps) still
exists for `hermes logs` and for survival across container restarts.

Trade-off considered: timestamps. Putting `T` between `1` and the
file destination (not before `1`) means:

  * docker logs sees raw lines — Python's logging formatter has its
    own timestamps, and `docker logs --timestamps` adds another
    layer when desired. No double-stamping in the common reading
    path.
  * The persisted file gets s6-log's ISO 8601 timestamp so even
    output that lacked a Python-logger timestamp (rich banners,
    third-party raw prints) is correlatable in `current`.

Verification:

  * New unit-test assertion in `test_service_manager.py` locks the
    `s6-log 1` directive into the rendered run-script. Mutation-
    tested by reverting to the pre-fix script (no `1`); the assert
    catches it cleanly.
  * New docker-harness test `test_supervised_gateway_stdout_reaches_docker_logs`
    builds the image, runs `docker run … gateway run`, and asserts
    the unique `⚕` banner glyph reaches `docker logs`. Also verifies
    the rotated file still contains the banner (no regression on
    the existing file destination). Mutation-tested end-to-end: built
    a deliberately-broken image without the `1` directive and the
    test failed exactly as designed, citing the banner present in
    `current` but absent from `docker logs`.
  * `website/docs/user-guide/docker.md` gains a new `:::note Where
    gateway logs go` admonition documenting both destinations and
    the audit-log file at `${HERMES_HOME}/logs/container-boot.log`.

Existing functionality preserved: every other docker-harness test
still passes against the new image. Unit-test sweep across
`tests/hermes_cli/` (5561 tests) is green.
2026-05-28 13:18:41 +10:00
brooklyn!
912e6e2274
fix(tui): suppress mouse-residue leaks during Python launcher startup (#31213)
* fix(tui): suppress mouse-residue leaks during Python launcher startup

`hermes --tui …` spends ~100–300ms inside the Python launcher (lazy
imports, arg parsing, session resolution) before exec'ing the Node TUI
binary. During that window stdin is still in cooked + echo mode. If a
prior session left DEC mouse tracking asserted (or the user spammed
mouse movement while the previous session was opening), the terminal
keeps emitting `\\x1b[<…M` SGR motion reports that get echoed straight
back into the user's shell scrollback as literal `^[[<…M` text and
sit there above the TUI banner until the next clear.

The Node side already calls `resetTerminalModes()` in `entry.tsx`, but
by then the race is already lost — the bytes echoed during the Python
warmup window were committed to the scrollback before Node started.

Fix: write the mouse-tracking disable sequence at the very top of
`hermes_cli.main`, before every heavy import. The terminal stops
emitting motion events as soon as the bytes hit the wire (one TTY
round-trip), shrinking the race window from hundreds of milliseconds
to a few. `HERMES_TUI_NO_EARLY_DISABLE=1` opts out for diagnostics.

* test(tui): drop dead _reload_main, hoist import out of patch context

Addresses Copilot review on PR #31213.

The tests used to import `hermes_cli.main` inside the `patch("os.write")`
context, which Copilot pointed out is order-dependent: if the module
is already loaded (e.g. imported by a prior test in the same process),
the import is a no-op and the patch only sees the explicit
`_suppress_mouse_residue_early()` call. Either way the assertion can
flake when run alongside other tests.

Move the import to module scope — every subprocess gets a fresh
`hermes_cli.main`, whose module-level invocation is a no-op under
pytest argv. Tests then exercise `_suppress_mouse_residue_early()`
directly inside their own patch context. Also drop the unused
`_reload_main` helper.

* fix(tui): skip early mouse-disable when stdout is not a TTY

Addresses Copilot review on PR #31213.

`hermes --tui … >log` or CI capture pipes fd 1 away from the terminal.
The disable bytes can't reach the terminal in that case but would
still get written into the log file as raw CSI sequences. Guard with
`os.isatty(1)` inside the existing `try/except OSError` block so the
'never break startup' contract holds.

* docs(tui): rephrase 'raw cooked mode' as 'cooked + echo mode'

Copilot review nit on PR #31213 — the original wording was self-
contradictory. Pre-TUI stdin state is cooked + echo (kernel TTY
discipline still owns the line buffer and echoes input back). The
TUI switches it to raw mode later when Ink mounts.
2026-05-27 22:03:45 -05:00
Ben Barclay
0927fb5584 feat(docker): auto-redirect gateway run to supervised mode inside s6 image
Pre-s6, `docker run nousresearch/hermes-agent gateway run` was the
standard invocation: gateway ran as the container's main process,
tini reaped zombies, container exit code matched gateway exit code,
no supervision. With s6-overlay as PID 1, the same invocation now
auto-upgrades to supervised semantics — auto-restart on crash,
dashboard supervised alongside (when HERMES_DASHBOARD=1 is set),
multiple profile gateways under the same /init.

Users get the new behavior with zero changes to their docker run
command. A loud one-line breadcrumb on stderr explains the upgrade
and points at the opt-out for users who genuinely want pre-s6
foreground semantics.

How it works:

  1. `_gateway_command_inner` (the `gateway run` handler) checks if
     we're inside a container with s6 as PID 1.
  2. If yes, dispatches `start` to the s6 service manager (registers
     and starts gateway-default), then `exec sleep infinity` to keep
     the CMD process alive without binding container lifetime to
     gateway PID lifetime. The supervised gateway can flap freely;
     `docker stop` still tears everything down via /init stage 3.
  3. If no, falls through to the existing foreground code path
     unchanged. Host runs of `hermes gateway run` are unaffected.

Three gates make the redirect inert outside the intended scope:

  * `detect_service_manager() != "s6"` — host/non-s6-container runs.
  * `HERMES_S6_SUPERVISED_CHILD=1` env var (recursion guard) —
    exported by `S6ServiceManager._render_run_script` for the
    s6-supervised invocation itself. Without this guard, the
    supervised `gateway run --replace` would re-enter the redirect
    and recurse (run → start → run → start → ...) infinitely.
  * `--no-supervise` CLI flag OR `HERMES_GATEWAY_NO_SUPERVISE=1` env
    var — explicit user opt-out for CI smoke tests, debugging the
    foreground startup path, or any case wanting "CMD exit =
    container exit" semantics. Strict truthiness (1/true/yes,
    case-insensitive); typos like `=0` do NOT silently opt out.

Tests:

  * Unit tests in tests/hermes_cli/test_gateway_s6_dispatch.py
    cover all five paths (host no-op, supervised fire, sentinel
    recursion guard, CLI flag, env var truthy + falsy). The two
    load-bearing gates (sentinel + opt-out) were mutation-tested
    by removing each gate in isolation and confirming the dedicated
    test fails with the expected error.
  * Docker harness tests in tests/docker/test_gateway_run_supervised.py
    cover the round trips end-to-end against a built image: redirect
    fires (sleep-infinity heartbeat + supervised gateway-default
    slot + breadcrumb), --no-supervise opt-out (foreground gateway,
    no want-up on the slot), HERMES_GATEWAY_NO_SUPERVISE env var
    works identically, recursion is impossible (≤1 supervised
    python gateway-run + exactly 1 sleep-infinity parented to the
    CMD wrapper), and HERMES_DASHBOARD=1 produces both supervised
    gateway and supervised dashboard.

Docs:

  * Added a `:::tip Gateway runs supervised` admonition near the
    main docker.md example explaining the upgrade and pointing at
    the opt-out. Pre-s6 (tini-based) images still run gateway run
    as the foreground main process, so the note is scoped to the
    s6 image only.

Trade-off documented in the helper docstring: container exit code
under the redirect is sleep's exit code (always 0 on SIGTERM), not
the gateway's. That was an explicit design call — the supervised
gateway is allowed to flap without taking the container with it,
which is what "supervision" means. CI users who want exit-code
forwarding can pass --no-supervise.
2026-05-28 12:42:13 +10:00
teknium1
36c99af37a test(kanban): align two tests with recent kanban hardening
Two pre-existing test failures on main, both pointing at code that
was hardened recently — not behaviour bugs, test expectations that
fell out of date.

1. tests/tools/test_kanban_tools.py::test_worker_complete_rejects_stale_run_id
   c002668ff ("fix(kanban): add grace period to detect_crashed_workers")
   gates each running task behind a launch-window grace period so
   freshly-spawned workers whose PID isn't yet visible on /proc don't
   get reclaimed. The test creates a worker_env fixture moments before
   asserting reclamation, so the default 30s grace skips the liveness
   check and detect_crashed_workers returns []. Fix: set
   HERMES_KANBAN_CRASH_GRACE_SECONDS=0 in the test so we get the
   immediate-reclaim semantics the assertion expects.

2. tests/tools/test_windows_native_support.py::
     TestKanbanWaitpidWindowsGuard::test_source_gates_waitpid_loop
   ffdc937c1 ("fix(kanban): hoist zombie reaper out of dispatch_once")
   reshaped reap_worker_zombies to use an early-return Windows guard
   (\`if os.name == "nt": return []\`) instead of an inverted gate
   (\`if os.name != "nt":\`). Both correctly keep the waitpid loop off
   Windows — the early-return form is stronger because the rest of the
   function never runs. Fix: accept either gate pattern in the source
   scan.

Both failures reproduce verbatim on \`origin/main\` in a clean env;
neither relates to in-flight work on #33564 (the FD-leak fix). Filing
this as a separate fix-it PR per green-CI-policy so the kanban CI
shard stays green for downstream PRs.
2026-05-27 18:26:44 -07:00
kshitijk4poor
2d5dcfabc3 test(kanban): update dispatcher tick counter for hoisted zombie reaper
The reaper hoist in the prior commit adds an extra
`asyncio.to_thread(_kb.reap_worker_zombies)` call at the top of every
dispatcher tick (before the per-board work). The existing
`test_gateway_dispatcher_disables_corrupt_board_without_traceback`
mocks `to_thread` with a 4-call cap that previously matched 2 full
dispatch ticks. With the reaper hoist each tick is now 3
`to_thread` calls instead of 2, so the cap is raised to 6 to preserve
the same number of dispatch ticks. The `connect == 5` assertion is
unchanged.

Also add the contributor's `steveonjava@gmail.com` to AUTHOR_MAP
alongside `steve@steveonjava.com` so contributor-audit passes for
both identities used across the salvaged commits.

Salvage follow-up for PR #32857.
2026-05-27 14:31:55 -07:00
Stephen Chin
dc98314fbd fix(kanban): skip redundant WAL pragma on already-WAL connections
apply_wal_with_fallback() issued PRAGMA journal_mode=WAL on every call,
including connections to DBs already in WAL mode. This triggered the WAL
init code path, causing SQLite to acquire EXCLUSIVE, checkpoint, and unlink
kanban.db-{wal,shm}. Other open connections received (deleted) FDs and
raised sqlite3.OperationalError: disk I/O error.

Add a cheap read probe (PRAGMA journal_mode, no flock/checkpoint/unlink)
before the set-pragma path. If already wal, return early. The set-pragma
and DELETE fallback paths are unchanged.

Closes #31158. Addresses root cause that PRs #32226 and #32322 attempted
via connection-sharing/caching approaches.
2026-05-27 14:31:55 -07:00
Stephen Chin
ffdc937c18 fix(kanban): hoist zombie reaper out of dispatch_once
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows.

Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix.

Squashes three sibling commits from PR #32301 into one logical change for batch review.
2026-05-27 14:31:55 -07:00
steveonjava
99c19eb2fe fix(kanban): add post-commit page_count invariant check to write_txn
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint)
Refs: #30445, #30896, #30908 (corruption reports)
2026-05-27 14:31:55 -07:00