fix(kanban): stale reclaim must not tick failure counter (#28680)

Follow-up to #28452. detect_stale_running() was calling
_record_task_failure() on every reclaim, which ticked the
consecutive_failures counter. With the default failure_limit=2,
two legitimately long-running tasks (>4 h without explicit
heartbeat) would auto-block via the spawn-failure circuit
breaker — even though no worker actually failed.

Stale reclaim is dispatcher-side absence-of-heartbeat detection,
not a worker fault. Removed the _record_task_failure() call;
the 'stale' event in task_events is still the audit surface,
but the failure counter is now reserved for spawn_failed /
timed_out / crashed (real failures).

Also documents the heartbeat requirement:
- KANBAN_GUIDANCE in agent/prompt_builder.py now states the
  rule ('call kanban_heartbeat at least once an hour for tasks
  running longer than 1 hour') so workers learn the contract.
- kanban.md adds the stale event row to the events table and
  flags the heartbeat requirement in the worker lifecycle list.

New regression test: test_detect_stale_does_not_tick_failure_counter
locks in the new behaviour.
This commit is contained in:
Teknium 2026-05-19 03:15:18 -07:00 committed by GitHub
parent 7f253f5557
commit 88ee58f7d2
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
4 changed files with 79 additions and 23 deletions

View file

@ -206,7 +206,12 @@ KANBAN_GUIDANCE = (
"files outside it unless the task explicitly asks.\n"
"3. **Heartbeat on long operations.** Call `kanban_heartbeat(note=...)` "
"every few minutes during long subprocesses (training, encoding, crawling). "
"Skip heartbeats for short tasks.\n"
"Skip heartbeats for short tasks. **If your task may run longer than 1 hour, "
"you MUST call `kanban_heartbeat` at least once an hour** — the dispatcher "
"reclaims tasks running past `kanban.dispatch_stale_timeout_seconds` "
"(default 4 hours) when no heartbeat has arrived in the last hour. A "
"reclaim re-queues the task as `ready` without penalty (no failure counter "
"tick), but you lose your current run's progress.\n"
"4. **Block on genuine ambiguity.** If you need a human decision you cannot "
"infer (missing credentials, UX choice, paywalled source, peer output you "
"need first), call `kanban_block(reason=\"...\")` and stop. Don't guess. "