Salvages #27372 by @oemtalks. The dispatcher unconditionally injected
`--skills kanban-worker` into every worker spawn, but worker profiles
sometimes don't have that bundled skill in their skills dir, which is
fatal at CLI startup (`ValueError: Unknown skill(s): kanban-worker`).
Adds `_kanban_worker_skill_available(hermes_home)` and only injects the
flag when the skill resolves. The MANDATORY lifecycle still ships via
KANBAN_GUIDANCE in the system prompt, so omitting the flag is safe.
Salvages #28301 by @Ade5954. If WAL setup, PRAGMA application, or schema
init raises after sqlite3.connect() succeeds, the new connection was
leaking. Wrap the body in try/except so the connection is closed before
the exception propagates.
The checkbox label echoed its state ("Auto (default)" / "Manual") instead
of describing the action, so a checked box reading "Auto" parsed as a
status indicator rather than a control. The accompanying sub-description
was also static and started with "When on, ...", which read awkwardly
when the box was unchecked.
Replace the dynamic label with a static action label
("Auto-decompose triage tasks") and flip the sub-description between the
two modes so it stays accurate either way. The top-of-page Orchestration
pill is unchanged — that one is intentionally a status badge / toggle.
Fixes#28178
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Salvages #25579 by @wesleysimplicio. Stamps task_runs.metadata.worker_session_id
from HERMES_SESSION_ID on kanban_complete. Cherry-picked the substantive
commit (not the AUTHOR_MAP fixup tip) onto current main.
Prevents ValueError crash in dashboard get_board() when a task has
an ISO timestamp (e.g. "2026-05-10T15:00:00Z") instead of a unix epoch
int. Adds _to_epoch() helper that normalises both formats.
When a systemic failure (provider outage, auth expiry, OOM) crashes
multiple workers simultaneously, detect_crashed_workers increments
each task failure counter independently. The circuit breaker only
trips after N × failure_limit retries across the fleet.
Fingerprint crash errors by normalizing host-specific details (PIDs,
timestamps). When 3+ tasks crash with the same fingerprint in a
single detection cycle, immediately trip the circuit breaker
(failure_limit=1) instead of waiting for repeated failures.
Isolated crashes (unique fingerprints) retain their normal retry
budget. Protocol violations continue to trip immediately.
Includes regression tests for systemic and isolated crash paths.
When a task is manually unblocked (blocked → ready/todo), the
consecutive_failures counter and last_failure_error were left intact.
The next failure would immediately re-trip the circuit breaker because
the counter was still at or above the failure limit.
Reset both fields on unblock so the task gets a fresh retry budget.
Includes a regression test that verifies counters are zeroed.
max_runtime_seconds=0 was being silently coerced to None due to a falsy
check (if max_runtime_seconds). Zero is a valid value that causes the
dispatcher to immediately time out a task. The adjacent max_retries
parameter already used the correct 'is not None' pattern.
Fixes the inconsistency by aligning max_runtime_seconds with max_retries.
recompute_ready only scanned 'todo' tasks for promotion, ignoring
'blocked' tasks entirely. When a task was blocked (e.g. by the circuit
breaker) and its parent dependencies later completed, the task stayed
stuck in 'blocked' forever unless manually unblocked.
Now recompute_ready also scans 'blocked' tasks. When all parents are
done/archived, the blocked task is promoted to 'ready' with failure
counters reset — equivalent to an automatic unblock.
Includes a regression test for the blocked-parent-done promotion path.
Archiving or deleting a board via remove_board() leaves the path's
"schema already initialized" entry in the module-level cache. A
concurrent connect(board=<slug>) call (e.g. the dashboard event-stream
poll loop) then:
1. resolves the same kanban.db path,
2. recreates the directory + an empty sqlite file because
connect() does mkdir(parents=True, exist_ok=True),
3. skips the CREATE TABLE pass because the cache entry says the
schema is already in place,
4. errors on the next read with `no such table: task_events`.
Drop the cache entry before mutating the filesystem so the fresh file
gets a proper schema init on next connect(). Applies to both
archive=True (rename) and archive=False (rmtree) branches.
Fixes#23833.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fix in 061a1830 added an outer try/except in plugin_api._task_dict
so that a future failure mode in kanban_db.task_age (anything _safe_int
doesn't already absorb) cannot 500 the GET /board response. The
_safe_int / task_age corruption paths got regression coverage in
tests/hermes_cli/test_kanban_db.py, but the OUTER fallback contract
remained untested -- meaning a refactor that drops the try/except would
not be caught by CI.
Pin that contract from both consumers of _task_dict:
- GET /board returns 200 with the literal fallback age dict for the
affected card (other cards continue to render via the same path)
- GET /tasks/:id (drawer view) returns 200 with the same fallback,
so a single corrupt task can't block its own drawer
Both tests force task_age to raise RuntimeError rather than ValueError
on '%s', because ValueError is absorbed by _safe_int and never reaches
the outer try/except -- testing that path would only re-cover what
test_kanban_db.py already pins.
Manually verified the regression discipline:
git checkout 061a1830^ -- plugins/kanban/dashboard/plugin_api.py
pytest -k task_age_exception # both FAIL with 500
git checkout HEAD -- plugins/kanban/dashboard/plugin_api.py
pytest -k task_age_exception # both PASS
- Add model_override field to Task class and tasks schema
- Add migration for existing databases
- Spawn worker with -m model when model_override is set
Wrap existing box-drawing diagrams with ascii-guard markers so docs-site checks pass when website docs are touched.
Co-authored-by: Cursor <cursoragent@cursor.com>
Tests (``tests/hermes_cli/test_auth_manual_paste.py``):
* 9 parametrised + scalar cases for ``_is_remote_session`` covering
the new Cloud Shell / Codespaces / Gitpod / Replit / StackBlitz
env vars (plus the existing SSH ones).
* 9 cases for ``_parse_pasted_callback`` covering every paste form
(full URL, https URL with extra params, bare ``?code=...``, bare
``code=...`` fragment, bare opaque value, error+description,
empty, whitespace-only, malformed URL).
* 3 cases for ``_prompt_manual_callback_paste`` (happy path, EOF,
Ctrl-C).
* 3 end-to-end ``_xai_oauth_loopback_login(manual_paste=True)``
cases: the HTTP server MUST NOT be started (asserted via a
callable that raises if invoked), wrong state still rejected
with ``xai_state_mismatch`` (no CSRF bypass), and empty paste
surfaces ``xai_code_missing``.
* SSH-hint mention test ensures the ``--manual-paste`` instruction
is printed in the remote-session hint.
Docs:
* ``oauth-over-ssh.md`` — new "Browser-only remote (Cloud Shell /
Codespaces / EC2 Instance Connect)" section with the
``--manual-paste`` recipe, plus a TL;DR note for the new flag.
* ``xai-grok-oauth.md`` — short subsection pointing at the same
recipe and the OAuth-over-SSH guide anchor.
Register the new ``--manual-paste`` flag on both entry points and
thread it through to the xAI loopback login:
* ``hermes auth add xai-oauth --manual-paste`` — pool-add path,
forwarded inside ``auth_commands.handle_auth_add``.
* ``hermes model --manual-paste`` — model-picker path, forwarded
by ``_model_flow_xai_oauth`` into the synthetic ``argparse.Namespace``
it passes to ``_login_xai_oauth``. The picker also now forwards
``--no-browser`` and ``--timeout`` for consistency (previously
hardcoded to defaults regardless of CLI flags).
Help text on both flags points at #26923 and names the
browser-only remote consoles (Cloud Shell, Codespaces, EC2
Instance Connect) so users searching ``hermes --help`` can find
the workaround.
xAI Grok OAuth (and Spotify) use a loopback redirect to
``http://127.0.0.1:<port>/callback`` to capture the authorization
code. That works when the browser and Hermes run on the same
machine, and the SSH tunnel recipe handles the regular remote
case. It breaks completely on **browser-only remote consoles**
(GCP Cloud Shell, GitHub Codespaces, AWS EC2 Instance Connect,
Gitpod, Replit, …) where the user has a browser but no real SSH
client to forward a port — the redirect to 127.0.0.1 on the
remote VM simply isn't reachable from the laptop, and there's
nothing the existing flow can do about it (#26923).
This commit adds the foundation for a manual-paste fallback:
* ``_is_remote_session`` now also recognises Cloud Shell,
Codespaces, Gitpod, Replit, StackBlitz (in addition to SSH),
so the existing tunnel hint at least fires in those
environments.
* ``_parse_pasted_callback`` accepts any of: a full
``http(s)://...?code=...&state=...`` URL, a bare ``?code=...``
query string, a bare ``code=...&state=...`` fragment, or a
bare opaque code value. Returns the same dict shape the HTTP
callback handler produces, so the caller's state / error
validation works unchanged (no CSRF bypass).
* ``_prompt_manual_callback_paste`` reads stdin with a clear
multi-line explanation of what's happening and what to paste.
* ``_xai_oauth_loopback_login`` gains a ``manual_paste`` kwarg
that skips the HTTP listener entirely. The redirect_uri,
PKCE verifier, state, and nonce are byte-identical to the
loopback path so xAI's token endpoint can't tell the
difference at the protocol level.
* ``_print_loopback_ssh_hint`` now also mentions
``--manual-paste`` so users without a real SSH client see a
path forward instead of a dead-end tunnel recipe.
* ``_login_xai_oauth`` threads ``args.manual_paste`` into the
loopback helper.
Salvages #19964 by @Beandon13. Adds `hermes kanban archive --rm` to
permanently remove already-archived tasks with cascading cleanup of
links, comments, events, runs, and notify-subs. Safety guard: only
archived tasks can be deleted; active/blocked/done must be archived
first.
Cherry-picked from #19964 onto current main (severe stale base, applied
manually to preserve substance only).
Two Mattermost thread-related bugs:
1. _resolve_root_id() — Mattermost CRT requires root_id to be the
thread root post. Using any reply's own ID as root_id causes
'400 Invalid RootId'. Add _resolve_root_id() that walks up the
post chain via API to find the actual root, and apply it in
send(), _send_url_as_file(), and _send_local_file().
2. _progress_reply_to — The condition in run.py only checked
Platform.FEISHU, missing Mattermost entirely. This caused tool
progress messages to always land in the main channel instead of
the thread. Add Platform.MATTERMOST to the condition so
progress messages are routed to threads when reply_mode=thread.
Impact: Tool progress messages now appear in Mattermost threads
instead of flooding the main channel; thread replies no longer
fail with Invalid RootId when the reply target is itself a reply.
Tests:
* ``test_refresh_xai_oauth_pure_403_marked_tier_denied_not_relogin`` —
refresh-403 raises ``xai_oauth_tier_denied`` with
``relogin_required=False`` and the API-key fallback hint in body.
* ``test_format_auth_error_tier_denied_does_not_suggest_relogin`` —
the renderer does not append "Run ``hermes model``" for the new
code.
* ``test_recover_with_credential_pool_skips_refresh_on_bare_403_for_xai_oauth`` —
bare ``{"reason":"forbidden","message":"Forbidden"}`` body (which
does not match the existing keyword heuristic) still short-circuits
``try_refresh_current`` on xai-oauth.
Docs:
* Drop the "(any active tier)" claim from the xai-grok-oauth guide,
add a top-of-page warning callout, and a Troubleshooting section
for the 403-after-login case pointing at ``XAI_API_KEY`` +
``provider: xai`` as the documented fallback.