fix(kanban): stale reclaim must not tick failure counter (#28680)

Follow-up to #28452. detect_stale_running() was calling _record_task_failure() on every reclaim, which ticked the consecutive_failures counter. With the default failure_limit=2, two legitimately long-running tasks (>4 h without explicit heartbeat) would auto-block via the spawn-failure circuit breaker, even though no worker actually failed.

Stale reclaim is dispatcher-side absence-of-heartbeat detection, not a worker fault. Removed the _record_task_failure() call; the 'stale' event in task_events is still the audit surface, but the failure counter is now reserved for spawn_failed / timed_out / crashed (real failures).

Also documents the heartbeat requirement:

- KANBAN_GUIDANCE in agent/prompt_builder.py now states the rule ('call kanban_heartbeat at least once an hour for tasks running longer than 1 hour') so workers learn the contract.
- kanban.md adds the stale event row to the events table and flags the heartbeat requirement in the worker lifecycle list.

New regression test: test_detect_stale_does_not_tick_failure_counter locks in the new behaviour.
2026-06-07 08:02:23 +00:00 · 2026-05-19 03:15:18 -07:00 · 88ee58f7d2
commit 88ee58f7d2
parent 7f253f5557
4 changed files with 79 additions and 23 deletions
--- a/hermes_cli/kanban_db.py
+++ b/hermes_cli/kanban_db.py
@@ -4066,27 +4066,15 @@ def detect_stale_running(
            )
            reclaimed.append(tid)

-        # Increment failure counter. The task is already ``ready`` and the
-        # run is already closed; this just ticks the counter and may trip
-        # the circuit breaker.
-        _record_task_failure(
-            conn, tid,
-            error=(
-                f"no heartbeat for {int(hb_age)}s "
-                if hb_age is not None
-                else "no heartbeat ever"
-            ) + f" after {int(elapsed)}s running",
-            outcome="stale",
-            release_claim=False,
-            end_run=False,
-            event_payload_extra={
-                "elapsed_seconds": int(elapsed),
-                "heartbeat_age_seconds": (
-                    int(hb_age) if hb_age is not None else None
-                ),
-                "timeout_seconds": stale_timeout_seconds,
-            },
-        )
+        # Intentionally NOT calling _record_task_failure here. Stale reclaim
+        # is dispatcher-side detection of an absent heartbeat; the task is
+        # going straight back to ``ready`` for re-dispatch. Counting it as
+        # a worker failure would let two legitimately-long-running tasks
+        # (>4h without explicit heartbeat) trip the circuit breaker and
+        # auto-block, even though no worker actually failed. The 'stale'
+        # event already lives in task_events for auditability; that's the
+        # right surface for "this happened" without conflating with the
+        # spawn_failed / timed_out / crashed counters.

    return reclaimed
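
For illustration only, here is a minimal in-memory sketch of the behaviour this change is meant to lock in. It is not the code in hermes_cli/kanban_db.py: the `Task` class and the `record_failure` / `reclaim_stale` helpers are hypothetical stand-ins, and the only values taken from the commit message are the default failure_limit of 2 and the three outcomes that count as real failures.

```python
from dataclasses import dataclass, field

# Default limit quoted in the commit message (failure_limit=2).
FAILURE_LIMIT = 2
# Outcomes the commit message treats as real worker failures.
REAL_FAILURES = {"spawn_failed", "timed_out", "crashed"}


@dataclass
class Task:
    """Hypothetical stand-in for a kanban task row."""
    status: str = "running"
    consecutive_failures: int = 0
    events: list = field(default_factory=list)


def record_failure(task: Task, outcome: str) -> None:
    """Toy circuit breaker: count a real failure, block at the limit."""
    if outcome not in REAL_FAILURES:
        raise ValueError(f"{outcome!r} is not a worker failure")
    task.events.append(outcome)
    task.consecutive_failures += 1
    if task.consecutive_failures >= FAILURE_LIMIT:
        task.status = "blocked"
    else:
        task.status = "ready"


def reclaim_stale(task: Task) -> None:
    """Toy stale reclaim: log the event, requeue, leave the counter alone."""
    task.events.append("stale")
    task.status = "ready"


# A long-running task reclaimed twice stays dispatchable.
slow = Task()
reclaim_stale(slow)
slow.status = "running"  # dispatcher picks it up again
reclaim_stale(slow)

# A task whose worker crashes twice trips the breaker.
broken = Task()
record_failure(broken, "crashed")
record_failure(broken, "crashed")
```

In this sketch `slow` ends with two 'stale' events, a counter of 0, and status `ready`, while `broken` ends up `blocked`. The real regression test, test_detect_stale_does_not_tick_failure_counter, presumably asserts something similar against the actual database layer.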