fix(kanban): stale reclaim must not tick failure counter (#28680)

Follow-up to #28452. detect_stale_running() was calling _record_task_failure() on every reclaim, which ticked the consecutive_failures counter. With the default failure_limit=2, two legitimately long-running tasks (>4 h without explicit heartbeat) would auto-block via the spawn-failure circuit breaker — even though no worker actually failed. Stale reclaim is dispatcher-side absence-of-heartbeat detection, not a worker fault. Removed the _record_task_failure() call; the 'stale' event in task_events is still the audit surface, but the failure counter is now reserved for spawn_failed / timed_out / crashed (real failures). Also documents the heartbeat requirement: - KANBAN_GUIDANCE in agent/prompt_builder.py now states the rule ('call kanban_heartbeat at least once an hour for tasks running longer than 1 hour') so workers learn the contract. - kanban.md adds the stale event row to the events table and flags the heartbeat requirement in the worker lifecycle list. New regression test: test_detect_stale_does_not_tick_failure_counter locks in the new behaviour.
2026-06-06 07:51:53 +00:00 · 2026-05-19 03:15:18 -07:00 · 2026-05-19 03:15:18 -07:00 · 88ee58f7d2
commit 88ee58f7d2
parent 7f253f5557
4 changed files with 79 additions and 23 deletions
--- a/website/docs/user-guide/features/kanban.md
+++ b/website/docs/user-guide/features/kanban.md
@ -334,7 +334,7 @@ Any profile that should be able to work kanban tasks must load the `kanban-worke

 1. On spawn, call `kanban_show()` to read title + body + parent handoffs + prior attempts + full comment thread.
 2. `cd $HERMES_KANBAN_WORKSPACE` (via the terminal tool) and do the work there.
-3. Call `kanban_heartbeat(note="...")` every few minutes during long operations.
+3. Call `kanban_heartbeat(note="...")` every few minutes during long operations. **If your work may run longer than 1 hour, call `kanban_heartbeat` at least once an hour** — the dispatcher reclaims tasks that have been running past `kanban.dispatch_stale_timeout_seconds` (default 4 h) with no heartbeat in the last hour, on the assumption the worker crashed without cleanup. A reclaim is benign (the task goes back to `ready` for re-dispatch without a failure-counter tick) but you lose your current run's progress.
 4. Complete with `kanban_complete(summary="...", metadata={...})`, or `kanban_block(reason="...")` if stuck.

 That final `kanban_complete` / `kanban_block` call is part of the worker
@ -833,6 +833,7 @@ Every transition appends a row to `task_events`. Each row carries an optional `r
 | `reclaimed` | `{stale_lock}` | Claim TTL expired without a completion; task goes back to `ready`. |
 | `crashed` | `{pid, claimer}` | Worker PID no longer alive but TTL hadn't expired yet. |
 | `timed_out` | `{pid, elapsed_seconds, limit_seconds, sigkill}` | `max_runtime_seconds` exceeded; dispatcher SIGTERM'd (then SIGKILL'd after 5 s grace) and re-queued. |
+| `stale` | `{elapsed_seconds, last_heartbeat_at, heartbeat_age_seconds, timeout_seconds, pid, terminated}` | Task ran longer than `kanban.dispatch_stale_timeout_seconds` (default 4 h) AND no `kanban_heartbeat` arrived in the last hour. Dispatcher SIGTERM'd the host-local worker (if any), reset the task to `ready` for re-dispatch. Does NOT tick the failure counter (stale is dispatcher-side absence detection, not a worker fault). Workers running long operations should call `kanban_heartbeat` at least once an hour to avoid this. |
 | `spawn_failed` | `{error, failures}` | One spawn attempt failed (missing PATH, workspace unmountable, …). Counter increments; task returns to `ready` for retry. |
 | `protocol_violation` | `{pid, claimer, exit_code}` | Worker exited successfully while the task was still `running`, usually because it answered without calling `kanban_complete` or `kanban_block`. The dispatcher also emits `gave_up` and auto-blocks immediately instead of retrying. |
 | `gave_up` | `{failures, effective_limit, limit_source, error}` | Circuit breaker fired after N consecutive non-successful attempts. Task auto-blocks with the last error. The effective limit resolves as task `max_retries`, then dispatcher `failure_limit` / `kanban.failure_limit`, then the built-in default. |