fix(kanban): stale reclaim must not tick failure counter (#28680)

Follow-up to #28452. detect_stale_running() was calling
_record_task_failure() on every reclaim, which ticked the
consecutive_failures counter. With the default failure_limit=2,
two legitimately long-running tasks (>4 h without explicit
heartbeat) would auto-block via the spawn-failure circuit
breaker — even though no worker actually failed.

Stale reclaim is dispatcher-side absence-of-heartbeat detection,
not a worker fault. Removed the _record_task_failure() call;
the 'stale' event in task_events is still the audit surface,
but the failure counter is now reserved for spawn_failed /
timed_out / crashed (real failures).
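
The counter semantics described above can be sketched as follows. This is a minimal model, not the project's code: `Task`, `record_outcome`, and `COUNTED_OUTCOMES` are hypothetical names, and tripping the breaker at `>= FAILURE_LIMIT` is an assumption. Only `failure_limit=2`, `consecutive_failures`, and the outcome strings come from this commit.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for illustration only.
FAILURE_LIMIT = 2  # mirrors the default failure_limit=2 mentioned above
COUNTED_OUTCOMES = {"spawn_failed", "timed_out", "crashed"}


@dataclass
class Task:
    status: str = "running"
    consecutive_failures: int = 0
    events: list = field(default_factory=list)


def record_outcome(task: Task, outcome: str) -> None:
    """Log every outcome as an event; tick the counter only for real failures."""
    task.events.append(outcome)  # audit trail, including 'stale'
    if outcome not in COUNTED_OUTCOMES:
        task.status = "ready"  # stale reclaim: back to the queue, counter untouched
        return
    task.consecutive_failures += 1
    if task.consecutive_failures >= FAILURE_LIMIT:
        task.status = "blocked"  # circuit breaker trips (assumed threshold)
    else:
        task.status = "ready"


task = Task()
record_outcome(task, "stale")
record_outcome(task, "stale")
print(task.status, task.consecutive_failures, task.events)
# ready 0 ['stale', 'stale']
```

In this model two stale reclaims leave the task dispatchable while both events stay visible for auditing; two counted outcomes in a row would still block it.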

Also documents the heartbeat requirement:
- KANBAN_GUIDANCE in agent/prompt_builder.py now states the
  rule ('call kanban_heartbeat at least once an hour for tasks
  running longer than 1 hour') so workers learn the contract.
- kanban.md adds the stale event row to the events table and
  flags the heartbeat requirement in the worker lifecycle list.
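
On the worker side, the hourly heartbeat rule could look roughly like this. It is a sketch under assumptions: `run_with_heartbeat` is a hypothetical helper, and `send_heartbeat` is a placeholder callable standing in for however a worker actually invokes `kanban_heartbeat`, whose real signature is not shown in this commit.

```python
import time
from typing import Callable, Iterable

HEARTBEAT_INTERVAL_SECONDS = 3600  # "at least once an hour", per the guidance


def run_with_heartbeat(
    steps: Iterable[Callable[[], None]],
    send_heartbeat: Callable[[], None],
    interval: float = HEARTBEAT_INTERVAL_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run steps in order, heartbeating once `interval` seconds have elapsed.

    Returns the number of heartbeats sent.
    """
    sent = 0
    last_beat = clock()
    for step in steps:
        step()
        now = clock()
        if now - last_beat >= interval:
            send_heartbeat()  # placeholder for the real kanban_heartbeat call
            last_beat = now
            sent += 1
    return sent


# Simulated clock: each of the four steps takes 30 minutes.
fake_time = iter([0, 1800, 3600, 5400, 7200])
beats = run_with_heartbeat(
    steps=[lambda: None] * 4,
    send_heartbeat=lambda: None,
    clock=lambda: next(fake_time),
)
print(beats)  # 2
```

The check only runs between steps, so a single step that outlasts the interval would still miss a beat; a real worker would likely use a shorter interval than the limit to leave headroom.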

New regression test: test_detect_stale_does_not_tick_failure_counter
locks in the new behaviour.
Teknium 2026-05-19 03:15:18 -07:00 committed by GitHub
parent 7f253f5557
commit 88ee58f7d2
4 changed files with 79 additions and 23 deletions

@@ -4066,27 +4066,15 @@ def detect_stale_running(
         )
         reclaimed.append(tid)
-        # Increment failure counter. The task is already ``ready`` and the
-        # run is already closed; this just ticks the counter and may trip
-        # the circuit breaker.
-        _record_task_failure(
-            conn, tid,
-            error=(
-                f"no heartbeat for {int(hb_age)}s "
-                if hb_age is not None
-                else "no heartbeat ever"
-            ) + f" after {int(elapsed)}s running",
-            outcome="stale",
-            release_claim=False,
-            end_run=False,
-            event_payload_extra={
-                "elapsed_seconds": int(elapsed),
-                "heartbeat_age_seconds": (
-                    int(hb_age) if hb_age is not None else None
-                ),
-                "timeout_seconds": stale_timeout_seconds,
-            },
-        )
+        # Intentionally NOT calling _record_task_failure here. Stale reclaim
+        # is dispatcher-side detection of an absent heartbeat; the task is
+        # going straight back to ``ready`` for re-dispatch. Counting it as
+        # a worker failure would let two legitimately-long-running tasks
+        # (>4h without explicit heartbeat) trip the circuit breaker and
+        # auto-block, even though no worker actually failed. The 'stale'
+        # event already lives in task_events for auditability; that's the
+        # right surface for "this happened" without conflating with the
+        # spawn_failed / timed_out / crashed counters.
     return reclaimed