Repeated quarantines of an unchanged corrupt kanban.db used to amplify
disk usage by N: the gateway dispatcher's 5-minute retry loop, multi-
profile fleets sharing one DB, and manual reopen attempts each produced
a fresh '.corrupt.<timestamp>.bak' copy of the same bytes. After 10
retries on a 100KB DB you had 11x the disk footprint of duplicate
corrupt data.
Derive the backup filename from a sha256 of the main DB instead of a
timestamp + collision counter. Same bytes → same filename → skip the
copy on retries. Different bytes (partial repair, further damage) →
different filename → preserve separately. Sidecar (-wal/-shm) backups
inherit the same content-addressed name.
Inspired by @hanzckernel's PR #33529, simplified down to ~30 LOC: drop
the persistent JSON marker file, drop the atomic temp+fsync+rename
helper (shutil.copy2 is fine for a quarantine-only path), drop the
gateway-side WAL/SHM fingerprint extension (the existing
(path, mtime, size) tuple still gives the 5-minute retry semantics it
needs), and drop the gateway-side helper extraction. The backup file
existing IS the marker; no separate state needed.
Test: tests/hermes_cli/test_kanban_db.py::test_repeated_corrupt_open_reuses_single_backup
proves 10 retries on the same corrupt bytes produce 1 backup (was 11),
and mutating the corrupt bytes produces a second backup with a
different fingerprint.
Refs #33529
Co-authored-by: hanzckernel <zhicheng.han@mathematik.uni-goettingen.de>
`sqlite3.Connection.__exit__` commits/rollbacks but does NOT close the
underlying FD. `with kb.connect() as conn:` in long-lived processes
(gateway `run_slash`, dashboard `decompose_task_endpoint`) therefore
leaks one FD to `kanban.db` per call. After enough operations the
gateway dies with `[Errno 24] Too many open files` (~4 days uptime
in the production report — #33159).
Fix: add a `connect_closing()` context manager in `hermes_cli/kanban_db`
that wraps `connect()` with a real `try/finally: conn.close()`. Switch
the 42 leak-prone call sites in `hermes_cli/kanban.py` (35),
`hermes_cli/kanban_decompose.py` (4), and `hermes_cli/kanban_specify.py`
(3) over to it.
`kanban.py` matters because `run_slash` (called from the gateway for
every `/kanban` slash command) parses argparse and dispatches to those
`_cmd_*` functions in-process — each one was leaking one FD per
invocation.
Tests inside `tests/` are untouched: short-lived processes where OS
cleanup masks the leak. Regression tests added in
`test_kanban_db.py` cover both happy-path and exception-path closure,
plus an explicit assertion that bare `with kb.connect()` still does
NOT close (documenting the upstream sqlite3 behaviour we're working
around).
Closes#33159.
Reaper now runs at the top of every dispatcher tick regardless of per-board connect() failures. Previously the reaper sat inside dispatch_once after the kanban_db.connect() call — any EIO during connect would skip reaping for that tick, accumulating zombie workers and stale claim_lock rows.
Also: reap_worker_zombies now returns the list of reaped pids (the dispatcher logs them) and a test indentation fix.
Squashes three sibling commits from PR #32301 into one logical change for batch review.
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().
Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint)
Refs: #30445, #30896, #30908 (corruption reports)
`detect_crashed_workers` calls `_pid_alive` on every `running` task whose
claim is held by this host. The check can transiently return False for a
freshly-spawned worker (fork → /proc-visibility lag, or reap-race
between SIGCHLD and parent reaping). When a second dispatcher ticks
inside that window it reclaims the task and spawns a duplicate worker.
Add `DEFAULT_CRASH_GRACE_SECONDS = 30` and an
`HERMES_KANBAN_CRASH_GRACE_SECONDS` env-var override.
`detect_crashed_workers` skips the liveness check when
`time.time() - started_at < grace`. The existing 15-minute claim TTL
still reclaims genuinely-crashed workers; grace only suppresses the
launch-window false positive.
`HERMES_KANBAN_CRASH_GRACE_SECONDS=0` is set on the `kanban_home`
fixture in `test_kanban_core_functionality.py` so existing tests that
assert immediate reclaim retain pre-fix semantics.
Companion to merged PR #23442 (`release_stale_claims`, closes#23025),
which addressed the same multi-dispatcher race in the stale-claim path.
Related: #20015 (`_pid_alive` false-negative behaviour),
When code inside a write_txn block raises an OperationalError that SQLite
has already auto-rolled-back (typical for disk I/O error,
database is locked, and database disk image is malformed), the
explicit ROLLBACK in write_txn.__exit__ itself raises
cannot rollback - no transaction is active and the secondary exception
replaces the original in the traceback. Operators see a misleading error
and lose the diagnostic information they need.
Swallow the rollback-time OperationalError so the caller always sees the
original cause.
Confirmed reproducer: tests/hermes_cli/test_kanban_db.py::
test_write_txn_preserves_original_exception_when_rollback_fails
apply_wal_with_fallback() treated "disk i/o error" as a permanent
WAL-incompatibility marker, identical to "locking protocol" (NFS) and
"not authorized" (FUSE). But EIO during PRAGMA journal_mode=WAL is
typically TRANSIENT — page-cache pressure, brief lock contention,
recoverable storage hiccups — not a permanent filesystem property.
Treating transient EIO as a permanent downgrade signal produces the
mixed-journal-mode-across-processes corruption pattern:
1. Process A opens kanban.db, hits transient EIO on the WAL pragma,
silently downgrades to journal_mode=DELETE.
2. Process B (no EIO) opens the same file moments later and
successfully sets journal_mode=WAL.
3. A writes rollback-journal frames while B writes WAL frames. SQLite
documents this as unsupported and corrupts the file:
https://www.sqlite.org/wal.html ("all connections to the same
database must use the same locking protocol").
This was the root cause of repeated kanban.db corruption on hosts with
multiple gateway processes plus CLI invocations against the same DB
(observed pattern: corruption shortly after gateway startup, after the
process logged "WAL journal_mode unsupported on this filesystem (disk
I/O error) — falling back to journal_mode=DELETE"). The fallback
warning told the truth — fallback DID happen — but the premise
("unsupported on this filesystem") was wrong; the EIO was a one-shot
event and sibling processes successfully used WAL.
Fix has two layers:
1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. EIO now re-raises
so callers can retry instead of silently corrupting the DB. The two
remaining markers ("locking protocol", "not authorized") are
deterministic per filesystem so they remain safe permanent-downgrade
signals.
2. Belt-and-suspenders: before downgrading on ANY marker match, peek the
on-disk journal mode. If the header says WAL, refuse to downgrade and
re-raise the original error. This guards against any future addition
to _WAL_INCOMPAT_MARKERS turning out to be transient in some
environment we haven't yet seen.
Tests:
- tests/test_hermes_state_wal_fallback.py:
* Flipped test_falls_back_on_disk_io_error → test_reraises_on_disk_io_error
asserting EIO is re-raised, not silently swallowed.
* Added test_does_not_downgrade_when_disk_says_wal covering the
on-disk-header safety guard for the existing legitimate markers.
- tests/hermes_cli/test_kanban_db.py:
* test_connect_falls_back_to_delete_on_locking_protocol now uses a
truly-fresh DB (instead of the kanban_home fixture which pre-inits
in WAL). On NFS the very first process touching the file legitimately
downgrades; on a file already in WAL the new guard correctly refuses.
A standalone reproducer lives at /tmp/kanban-stress/repro_bugD_eio_wal_downgrade.py
(not committed): without fix the DB silently flips from WAL to DELETE
mid-process; with fix the EIO surfaces and the file stays WAL.
Refs: Bug D in the kanban-corruption investigation series (Bugs A and C
shipped in ebe7374f3 and e02147d5e respectively). Bug D explains every
corruption incident this week including those that survived A's
single-dispatcher mitigation, because every CLI invocation is a
separate process whose WAL pragma can transiently fail.
Production corruption #6 left b-tree pages with zeroed headers but intact old cell content — the Bug E pattern. This fix applies three pragma calls on every connect():
- synchronous=FULL (was NORMAL): closes the WAL-checkpoint reordering window where a crash between WAL commit and main-DB write leaves a partially-written b-tree page header. Cost is <1ms per commit on local SSD; negligible at kanban write volume.
- secure_delete=ON: forces SQLite to zero freed page bytes on disk. If a torn write or hardware fault later corrupts a page, the underlying cell content is zero, so corruption is detectable and no stale rows can resurface as live data.
- cell_size_check=ON: adds a read-side guard so corrupt cells surface as errors at read time rather than as silent wrong-data returns.
All three are connection-scoped and re-applied on every connect(). secure_delete also writes a persistent flag into the DB header on the first call against a fresh DB, making the protection durable across processes for new DBs.
Tests added for all four required cases: each pragma active on a fresh connection, and all three re-applied after close+reopen. Also adds the required negative test (migration path does not reset pragmas).
Board defaults represent persistent project checkouts. Scratch workspaces
are auto-deleted on completion and must stay under the per-board scratch
root that resolve_workspace() creates. Inheriting default_workdir for a
scratch task pointed the cleanup path at the user's source tree — the
data-loss vector documented in #28818.
The containment guard in _cleanup_workspace (just added) is the safety
rail. This commit prevents the bad state from being created in the first
place: only persistent kinds (dir/worktree) inherit board defaults.
Tests updated to cover the new semantics: scratch with default_workdir
set keeps workspace_path=None; dir/worktree still inherits the board
default.
Salvaged from PR #31315 by @leeseoki0 — prevention layer on top of the
#28819 containment fix by @briandevans.
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
Copilot review on PR #28819 flagged that `_is_managed_scratch_path` accepted
the entire `<kanban_home>/kanban` subtree as managed scratch storage. With
that, a task whose `workspace_kind='scratch'` and `workspace_path` was
mis-set to `<kanban_home>/kanban`, `.../kanban/logs`, or a board's
metadata directory (e.g. `.../kanban/boards/<slug>` without the
`workspaces/` child) would pass the containment guard and let task
completion `shutil.rmtree` Hermes' own DB, metadata, and log subtrees.
Tighten the guard:
* Allowed roots are now exclusively `workspaces/` directories — the
`HERMES_KANBAN_WORKSPACES_ROOT` override, `<kanban_home>/kanban/workspaces`,
and each `<kanban_home>/kanban/boards/<slug>/workspaces` discovered on
disk.
* Require strict descendancy: a path equal to a root itself is rejected
too, because deleting a workspaces root would wipe every task's scratch
dir at once.
Add a regression test covering the three Copilot-named attack paths
(kanban root, kanban/logs, board root without `workspaces/`) plus the
workspaces-root-itself case, and confirm the inner task-id dir still
matches.
A board's ``default_workdir`` (e.g. ``hermes kanban boards
set-default-workdir my-board /path/to/real/source``) is copied into
``tasks.workspace_path`` for tasks created without an explicit
``workspace_kind``. Those tasks default to ``workspace_kind='scratch'``,
so completion calls ``_cleanup_workspace`` and unconditionally runs
``shutil.rmtree(wp, ignore_errors=True)`` — deleting the user's real
source tree as if it were disposable scratch storage.
Add ``_is_managed_scratch_path()`` and gate ``_cleanup_workspace`` on
it: only delete paths under ``HERMES_KANBAN_WORKSPACES_ROOT`` (the
worker-side override the dispatcher injects) or under the active kanban
home's ``kanban/`` subtree (covering both the legacy default-board root
and per-board ``kanban/boards/<slug>/workspaces`` roots). Anything else
gets a warning log and is left alone, so a misconfigured
``default_workdir`` can no longer destroy user data on task completion.
First scratch workspace creation on an install now emits a one-shot
warning log + a 'tip_scratch_workspace' event on the task. Sentinel
file at ~/.hermes/kanban/.scratch_tip_shown silences subsequent
creations across the whole install.
Behavior unchanged — scratch is still ephemeral by design. This just
makes the design visible to new users (reported in user community:
'progress files vanished, no warning anywhere').
Docs (en + ko) updated to spell out 'Deleted when the task completes'
on the scratch bullet and 'Preserved on completion' on worktree/dir.
Extends the previous commit to cover the remaining additive-column index
that sits on the same migration trap:
- ``task_events.run_id`` -> ``idx_events_run`` was still in SCHEMA_SQL.
A legacy ``task_events`` table predating #17805 (no ``run_id``) would
still abort ``executescript`` before ``_migrate_add_optional_columns``
could add the column. Hoisted out of SCHEMA_SQL and made unconditional
in the migration alongside the other three indexes.
- Removed the now-redundant ``CREATE INDEX idx_tasks_idempotency`` that
was nested inside the ``if "idempotency_key" not in cols`` branch.
The unconditional create lower in the function makes it idempotent
on both fresh and legacy DBs.
- Strengthened the regression test to cover all four indexes
(``idx_tasks_session_id``, ``idx_tasks_tenant``, ``idx_tasks_idempotency``,
``idx_events_run``) and to seed a pre-#17805 ``task_events`` shape that
exercises the ``run_id`` migration path.
The result: every ``CREATE INDEX`` that depends on an additive column now
runs after the migration ensures the column exists. Verified against a
realistic pre-#16081 board fixture (tasks + task_events both legacy
shape) — origin/main reproduces ``no such column: session_id``; this
branch migrates cleanly and creates all four indexes.
Follow-up to #28455. The respawn guard's blocker_auth rule (last error
matched a quota/auth/429 pattern) was auto-blocking the task on first
occurrence. That's too aggressive: transient rate limits typically
clear in seconds to minutes, but the auto-block puts the task in
'blocked' status which requires manual unblock.
Now treats blocker_auth the same as recent_success and active_pr:
defer the spawn this tick, leave the task in 'ready', let the next
tick try again. If the auth error genuinely persists, the existing
consecutive_failures counter trips the auto-block circuit breaker
after failure_limit failures via the normal path — so a persistent
401/403/quota-exhausted still ends up blocked, just not on first hit.
Also documents the respawn_guarded event in kanban.md's events table
with the three guard reasons.
Updated test_dispatch_respawn_guard_auto_blocks_auth_error → renamed
to test_dispatch_respawn_guard_defers_auth_error_without_auto_block;
asserts task stays in 'ready' and the guard reason is recorded.
Follow-up to #28452. detect_stale_running() was calling
_record_task_failure() on every reclaim, which ticked the
consecutive_failures counter. With the default failure_limit=2,
two legitimately long-running tasks (>4 h without explicit
heartbeat) would auto-block via the spawn-failure circuit
breaker — even though no worker actually failed.
Stale reclaim is dispatcher-side absence-of-heartbeat detection,
not a worker fault. Removed the _record_task_failure() call;
the 'stale' event in task_events is still the audit surface,
but the failure counter is now reserved for spawn_failed /
timed_out / crashed (real failures).
Also documents the heartbeat requirement:
- KANBAN_GUIDANCE in agent/prompt_builder.py now states the
rule ('call kanban_heartbeat at least once an hour for tasks
running longer than 1 hour') so workers learn the contract.
- kanban.md adds the stale event row to the events table and
flags the heartbeat requirement in the worker lifecycle list.
New regression test: test_detect_stale_does_not_tick_failure_counter
locks in the new behaviour.
Sweep of all CI failures on origin/main, grouped by drift source:
Telegram allowlist gate (db50af910 added user-authz to _should_process_message):
- Hardcoded "[Telegram]" prefix in the logger.warning so the call no
longer dereferences self.name → self.platform, which test fixtures
built via object.__new__ never set.
- test_telegram_format / test_allowed_channels_widening fixtures stub
_is_callback_user_authorized → True so the new gate doesn't reject
guest-mode / allowed-channels test messages.
- test_telegram_approval_buttons::test_update_prompt_callback_not_affected
sets TELEGRAM_ALLOWED_USERS="*" so the fail-closed default doesn't
reject the callback before it writes .update_response.
Approval surface (6d495d9e7 renamed status, 214b95392 detached stdin):
- test_no_callback_returns_approval_required: status is now
"pending_approval" (was "approval_required").
- test_close_stdin_allows_eof_driven_process_to_finish: switch to
use_pty=True; non-PTY now uses stdin=DEVNULL.
Mattermost (send() now resolves root_id via _api_get first):
- test_send_with_thread_reply mocks _session.get with a thread-root
response so the new resolver doesn't TypeError on a bare AsyncMock.
Kanban (d8ad431de rename, f55d94a1e review column, _kanban_worker_skill_available):
- _safe_int → _to_epoch in the two test_kanban_db tests.
- Spawn-skills tests (×3) monkey-patch _kanban_worker_skill_available
to True since the isolated kanban_home fixture has no devops/kanban-worker tree.
- test_gateway_dispatcher_disables_corrupt_board: connect count
3 → 5 (review-column probe now also runs per tick).
Aux-config severity at_or_above (a94ddd807):
- test_diagnostics_endpoint_severity_filter expects warning filter to
include error+critical now (was exact-match).
Anthropic error handling (conversation loop extracted from run_agent):
- _no_backoff_wait fixture patches BOTH run_agent.jittered_backoff AND
agent.conversation_loop.jittered_backoff. The latter is the actual
call site; without the second patch tests burn ~2s per retry and
hit the 30s SIGALRM timeout on CI.
Other test pollution / drift:
- test_auto_does_not_select_copilot_from_github_token: patch
agent.bedrock_adapter.has_aws_credentials → False so boto3's
credential chain can't auto-pick Bedrock from developer ~/.aws.
- test_setup_openclaw_migration: patch hermes_cli.gateway.get_env_value
in addition to setup_mod.get_env_value — _platform_status reads
through the gateway module's binding.
- test_gateway_prefix: COMPONENT_PREFIXES["gateway"] now includes
"hermes_plugins" too.
- test_recommended_update_command_defaults_to_hermes_update: also
short-circuit get_managed_update_command in case a stray
~/.hermes/.managed marker is present.
- test_user_id_is_not_explicit: _parse_target_ref now returns
is_explicit=False for Slack U.../W... IDs (chat.postMessage rejects
them — a DM must be opened first via conversations.open).
Salvages #28125 by @Jpalmer95. Adds:
- Drag-to-delete trash zone in the kanban dashboard
- Bulk delete endpoint with cascading delete_task cleanup
- Frontend updates (drag visual + drop handler)
- Confirmation prompt before delete
Resolved end-of-file test conflict by appending both halves.
Salvages #24533 by @roycepersonalassistant. Adds a first-class
'scheduled' Kanban status for time-delay follow-ups that aren't
waiting on human input.
- hermes kanban schedule <task_id> [reason] CLI command
- Dashboard/API transitions to/from Scheduled
- unblock_task() now releases both 'blocked' AND 'scheduled' tasks
(re-checking parent dependencies before moving to ready/todo)
- i18n + docs updates
Resolved conflicts: kept HEAD's failure-counter reset on unblock
alongside the PR's scheduled state, kept HEAD's 'running' direct-set
rejection, combined both bulk-status branches. Dropped the dist/
bundle changes (months-stale; would need rebuild from source).
Salvages #26496 by @aqilaziz. Adds branch_name column + CLI flag so
tasks with workspace_kind='worktree' can pin a target branch on
create. Schema migration added to _migrate_add_optional_columns.
- Task.branch_name field + DB column + migration
- create_task accepts branch_name kwarg
- hermes kanban create --branch <name> flag
- kanban show output includes 'Branch: <name>' when set
Cherry-picked the substantive commit (a7558cf27); the PR's tip was
an unrelated service-path-dirs commit. Resolved 2 INSERT-column-list
and show-output conflicts alongside main's session_id and
max_runtime_seconds additions; kept all three.
Salvages #27484 by @fardoche6. Adds a respawn guard that skips worker
spawn for tasks where:
- a recent run already succeeded (recent_success — within guard window)
- the previous run hit a quota/auth error (blocker_auth, also auto-blocks)
- a recent task comment includes a GitHub PR URL (active_pr)
The guard prevents repeat worker storms on the same bug/task. Includes
the contributor's review-findings fixup (regex hardening, observability,
auth coverage).
Resolved a small DispatchResult conflict alongside main's 'stale' field;
kept both. Authorship preserved via rebase merge.
Salvages #26745 by @nehaaprasaad. Exposes filtering for the existing
workflow_template_id and current_step_key columns:
- list_tasks() accepts workflow_template_id and current_step_key kwargs
- 'hermes kanban list' adds matching CLI flags
- dashboard plugin_api also exposes the filters
Resolved a small conflict in list_tasks signature alongside main's
session_id and order_by additions; combined all three into the single
filter list.
Salvages #23790 by @thewillhuang. Adds detect_stale_running() to
the dispatcher cycle. Running tasks that have been started for longer
than dispatch_stale_timeout_seconds (default 14400 = 4h) without a
heartbeat in the last hour are auto-reclaimed to ready.
- New config kanban.dispatch_stale_timeout_seconds (default 14400, 0 disables)
- New 'stale' field on DispatchResult
- detect_stale_running() in kanban_db.py with heartbeat freshness check
- Records outcome='stale' on run close + 'stale' event; ticks failure counter
- Wires config through gateway embedded dispatcher
- Updates _cmd_dispatch verbose/JSON output and daemon logging
Resolved test-file end-of-file conflict by appending both halves.
Salvages #23772 by @thewillhuang. Adds 'review' as a valid kanban task
status and extends dispatch_once to monitor the review column as a
second dispatch source (in addition to the existing ready column).
- Adds 'review' to VALID_STATUSES
- Adds claim_review_task() — atomically transitions review → running
- Adds has_spawnable_review() — health telemetry mirror
- Extends dispatch_once with a review column dispatch loop
- Review agents get 'sdlc-review' skill auto-loaded
Resolved 2 conflicts (VALID_STATUSES merge with main's 'scheduled' state,
test file additions). Adapted claim_review_task to main's
ttl_seconds: Optional[int] = None convention (matches claim_task).
Salvages #23208 by @awizemann. Tracks which chat session created a
kanban task so clients can render a per-session board without falling
back to tenant + time-window heuristics.
- Schema: tasks gains nullable session_id TEXT column with index
(additive migration in _migrate_add_optional_columns).
- ACP: server.py exposes the originating session id via HERMES_SESSION_ID
with save/restore around the agent loop.
- Tool: kanban_create reads HERMES_SESSION_ID (with explicit override).
- CLI: 'hermes kanban list --session <id>' filter; JSON output exposes
session_id.
Salvages #25745 by @LizerAIDev. Adds --sort {created,created-desc,
priority,priority-desc,status,assignee,title,updated} to 'hermes kanban
list'. Validated against VALID_SORT_ORDERS map; invalid values raise
ValueError. Default behaviour (priority DESC, created ASC) is unchanged
when --sort is omitted.
Salvages #22981 by @SimbaKingjoe. Adds 'kanban.max_in_progress' config
that caps simultaneously running tasks. When the board already has N
running, dispatcher skips spawning so slow workers (local LLMs,
resource-constrained hosts) don't pile up and time out.
Threads through dispatch_once(max_in_progress=) and gateway dispatcher
config parsing with validation (warns on invalid/below-1 values).
When a systemic failure (provider outage, auth expiry, OOM) crashes
multiple workers simultaneously, detect_crashed_workers increments
each task failure counter independently. The circuit breaker only
trips after N × failure_limit retries across the fleet.
Fingerprint crash errors by normalizing host-specific details (PIDs,
timestamps). When 3+ tasks crash with the same fingerprint in a
single detection cycle, immediately trip the circuit breaker
(failure_limit=1) instead of waiting for repeated failures.
Isolated crashes (unique fingerprints) retain their normal retry
budget. Protocol violations continue to trip immediately.
Includes regression tests for systemic and isolated crash paths.
When a task is manually unblocked (blocked → ready/todo), the
consecutive_failures counter and last_failure_error were left intact.
The next failure would immediately re-trip the circuit breaker because
the counter was still at or above the failure limit.
Reset both fields on unblock so the task gets a fresh retry budget.
Includes a regression test that verifies counters are zeroed.
recompute_ready only scanned 'todo' tasks for promotion, ignoring
'blocked' tasks entirely. When a task was blocked (e.g. by the circuit
breaker) and its parent dependencies later completed, the task stayed
stuck in 'blocked' forever unless manually unblocked.
Now recompute_ready also scans 'blocked' tasks. When all parents are
done/archived, the blocked task is promoted to 'ready' with failure
counters reset — equivalent to an automatic unblock.
Includes a regression test for the blocked-parent-done promotion path.
Salvages #19964 by @Beandon13. Adds `hermes kanban archive --rm` to
permanently remove already-archived tasks with cascading cleanup of
links, comments, events, runs, and notify-subs. Safety guard: only
archived tasks can be deleted; active/blocked/done must be archived
first.
Cherry-picked from #19964 onto current main (severe stale base, applied
manually to preserve substance only).
Workers running slow models (e.g. kimi-k2.6) can spend longer than
DEFAULT_CLAIM_TTL_SECONDS inside a single tool-free LLM call, making
no tool calls and therefore not heartbeating. release_stale_claims
previously reclaimed these healthy workers, producing the
spawn-then-immediately-reclaim loop reported in #23025.
When a stale-by-TTL claim's host-local worker PID is still alive,
extend the claim (emit a claim_extended event) rather than killing
it. enforce_max_runtime / detect_crashed_workers remain the upper
bounds for genuinely wedged or dead workers. Reclaim events now also
record claim_expires, last_heartbeat_at, worker_pid, and host_local
so operators can see why a worker was killed.
Follow-up to the previous commit's safe-int task_age fix.
The original PR shipped without test coverage. This commit adds:
- test_safe_int_accepts_int_and_int_string — sanity for the well-typed
path so the helper itself can't quietly start swallowing valid values.
- test_safe_int_returns_none_on_corrupt_inputs — the failure modes
(None, '%s', 'abc', '', '1.5', random objects). Covers both the
ValueError and TypeError catch branches.
- test_task_age_handles_corrupt_created_at — the headline regression:
a task with created_at='%s' used to raise ValueError and turn
GET /api/plugins/kanban/board into a 500.
- test_task_age_handles_corrupt_started_and_completed — confirms the
safe-int treatment is consistent across all three timestamp fields.
- test_task_age_well_formed_task — regression that the safe path
doesn't change observable output for normal data.
- test_task_dict_survives_corrupt_created_at — defense in depth.
Writes a corrupt row directly via SQL, reads it back through the
ORM, and confirms task_age + the surrounding plugin_api guard
degrade gracefully instead of crashing.
Also adds the AUTHOR_MAP entry for the contributor's GitHub-noreply
email so release notes credit @baocin (the commit was authored locally
as `aoi <aoi@hino.local>` — re-attributed during salvage to the
github noreply form).
Follow-up to the previous commit's contributor cherry-pick.
The cherry-picked change replaced the bare ``["hermes", ...]`` spawn with
``[sys.executable, "-m", "hermes", ...]``. The intent was right (avoid
PATH dependence — cron, systemd User= services, launchd jobs, and other
detached dispatcher invocations routinely run with a stripped $PATH that
doesn't include the venv's bin/, breaking the bare-shim spawn) but the
module name is wrong: there is no top-level ``hermes`` package. The
console-script entry point in pyproject.toml is
``hermes = "hermes_cli.main:main"``, and ``python -m hermes`` fails with
``No module named hermes``. The cherry-picked form would have replaced a
sometimes-broken spawn with an always-broken one.
This commit:
- Adds ``_resolve_hermes_argv()``, mirroring ``gateway.run._resolve_hermes_bin``.
Tries ``shutil.which("hermes")`` first (preferred — keeps existing ``ps``
output and log lines familiar in the common case) and falls back to
``[sys.executable, "-m", "hermes_cli.main"]`` when the shim is not on
PATH. The fallback goes through the running interpreter so it's
PATH-independent. Kept as a local helper rather than imported from
gateway because ``hermes_cli`` sits below ``gateway`` in the dependency
order.
- Switches the dispatcher's ``cmd`` list to use ``*_resolve_hermes_argv()``.
- Adds three regression tests:
* ``test_resolve_hermes_argv_prefers_path_shim`` — pins the PATH-first
branch so a future refactor doesn't silently flip the order.
* ``test_resolve_hermes_argv_falls_back_to_module_form_when_no_path_shim`` —
pins the correct module name (``hermes_cli.main``, NOT ``hermes``).
Direct regression guard for the form that shipped in the original PR.
* ``test_resolve_hermes_argv_module_actually_runs`` — runs the fallback
invocation as a real subprocess and asserts ``--version`` works, so
losing ``hermes_cli.main``'s ``__main__`` handling can't slip past the
string-match test.
Verified end-to-end: with the shim on PATH the resolver returns
``[/.../hermes]`` and ``--version`` works; with the shim removed the
resolver returns ``[python, -m, hermes_cli.main]`` and ``--version``
still works; the original PR's ``python -m hermes`` invocation fails as
expected (``No module named hermes``).
ALTER TABLE calls inside _migrate_add_optional_columns were guarded by a
snapshot of PRAGMA table_info taken at function entry. When the gateway
dispatcher opens the kanban DB twice per tick (once in _tick_once_for_board
and once via init_db's discard-and-reconnect path), a second connection can
run the same migration before the first one commits, causing:
sqlite3.OperationalError: duplicate column name: consecutive_failures
This crashed the dispatcher on every first tick after a gateway restart
(subsequent ticks succeeded because the columns were then present).
Fix: introduce _add_column_if_missing() which wraps ALTER TABLE in a
try/except that swallows OperationalError whose message contains
'duplicate column name'. All ALTER TABLE calls in
_migrate_add_optional_columns are routed through this helper.
Closes#21708
Enforce the parent-completion invariant at claim_task (the single
ready->running chokepoint) and re-gate unblock_task so blocked->ready
only fires when parents are done. Prevents child tasks from running
ahead of in-progress parents under the create-then-link race.
Also adds a stress test that races concurrent create+link against
hammered claim_task and asserts no child runs while any parent is undone.
Ref: kanban/boards/cookai/workspaces/t_a6acd07d/root-cause.md
Refs: t_8d6af9d6
Problem:
unlink_tasks() removes a parent→child dependency edge but does not trigger
recompute_ready(). A child whose last blocking parent is unlinked stays
stuck in 'todo' indefinitely — it only promotes to 'ready' on the next
dispatcher tick or a manual 'hermes kanban recompute'. For CLI-only users
without a dispatcher, the child is permanently stuck.
Root cause:
complete_task() and unblock_task() both call recompute_ready() after their
write transaction so downstream children are evaluated immediately.
unlink_tasks() was missing this call — removing a dependency is
semantically equivalent to completing one, so the same recompute is needed.
Fix:
Capture the rowcount result before the write_txn exits, then call
recompute_ready(conn) outside the transaction when a row was actually
deleted (so the child sees the updated task_links state).
Tests:
Added test_unlink_tasks_triggers_recompute_ready in
tests/hermes_cli/test_kanban_db.py: creates parent A (done) + parent C
(running), child B with both parents (todo), unlinks C→B, asserts B is
ready immediately. Stash-verified: FAILS without fix (child stays todo),
PASSES with fix.
62/62 tests green in tests/hermes_cli/test_kanban_db.py.
Closes#22459.
SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode
On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
- /resume, /title, /history, /branch returned 'Session database not
available.' with no cause
- gateway logged the init failure at DEBUG (invisible in errors.log)
- kanban dispatcher crashed every 60s, driving the known migration race
(duplicate column name: consecutive_failures, #21708 / #21374)
Changes:
- hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
WARNING explaining why
- hermes_state.get_last_init_error() + format_session_db_unavailable():
capture the init failure cause and surface it in user-facing strings
(with an NFS/SMB pointer for 'locking protocol')
- hermes_cli/kanban_db.connect(): use the shared helper
- gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
(matches cli.py's existing correct behavior)
- cli.py (4 sites) + gateway/run.py (5 sites): replace bare
'Session database not available.' with format_session_db_unavailable()
Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.
Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.
closes#22032
The kanban-worker skill (built into the gateway dispatcher's spawn
prompt) instructs every worker to hand off via
``kanban_complete(summary=..., metadata=...)``. That writes the summary
onto the closing ``task_runs`` row, NOT onto ``tasks.result`` — the
latter is left NULL unless the caller passes ``result=`` explicitly.
Result: a glance at the dashboard or ``hermes kanban show <id>`` shows
a blank "Result:" section even when the worker did real work, which
on 2026-05-05 caused a Mac false-alarm ("Hermes did nothing") on a
task that had a 10-line completion summary on its run.
This patch surfaces the latest non-null run summary as
``latest_summary`` so the worker's actual handoff lands in front of
operators.
* New helpers ``kanban_db.latest_summary(conn, task_id)`` and
``kanban_db.latest_summaries(conn, task_ids)``. The batch variant
uses a single window-function SELECT so the dashboard board endpoint
doesn't pay an N+1 cost on multi-hundred-task boards.
* CLI ``hermes kanban show <id>`` prints a "Latest summary:" block
when ``tasks.result`` is empty but a run has produced a summary
(the existing "Result:" section still wins when populated, so the
back-compat path for hand-edited results is untouched). JSON output
gains a top-level ``latest_summary`` field.
* Dashboard ``/board`` and ``/tasks/{id}`` now include a
``latest_summary`` field on every task. Cards on /board carry a
200-character preview (cheap to render, plenty for "what did this
worker do?" at a glance); the drawer/detail endpoint returns the
full summary.
* Five new tests cover: empty-runs case, post-complete surface,
newest-of-multiple selection, empty-string skip, batch with
missing tasks + empty input.
Smoke-tested locally against the live profile DB on the three
acceptance-criterion targets (t_f08fef91 cron-hygiene-audit,
t_007b7f1c EMA-analysis, t_05746fa4 self-assessment) — all three now
return their populated summaries via both ``latest_summary`` and
``latest_summaries``.
Test plan: 255/255 kanban tests pass + 91/91 dashboard plugin tests
pass. No regression on tasks where ``tasks.result`` is explicitly
populated (the existing "Result:" branch is preserved).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>