mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-07 08:02:23 +00:00
fix(kanban): stale reclaim must not tick failure counter (#28680)
Follow-up to #28452. detect_stale_running() was calling _record_task_failure() on every reclaim, which ticked the consecutive_failures counter. With the default failure_limit=2, two legitimately long-running tasks (>4 h without explicit heartbeat) would auto-block via the spawn-failure circuit breaker, even though no worker actually failed.

Stale reclaim is dispatcher-side absence-of-heartbeat detection, not a worker fault. Removed the _record_task_failure() call; the 'stale' event in task_events is still the audit surface, but the failure counter is now reserved for spawn_failed / timed_out / crashed (real failures).

Also documents the heartbeat requirement:

- KANBAN_GUIDANCE in agent/prompt_builder.py now states the rule ('call kanban_heartbeat at least once an hour for tasks running longer than 1 hour') so workers learn the contract.
- kanban.md adds the stale event row to the events table and flags the heartbeat requirement in the worker lifecycle list.

New regression test: test_detect_stale_does_not_tick_failure_counter locks in the new behaviour.
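The heartbeat contract described above could be honoured by a worker along these lines. This is only a minimal sketch, not the project's worker code: `run_with_heartbeat`, `send_heartbeat`, `steps` and `clock` are hypothetical names introduced for illustration, and the real `kanban_heartbeat` tool's signature is not shown in this commit.

```python
import time
from typing import Callable, Iterable

# "At least once an hour", per the guidance text quoted in the commit message.
HEARTBEAT_INTERVAL_SECONDS = 3600


def run_with_heartbeat(
    steps: Iterable[Callable[[], None]],
    send_heartbeat: Callable[[], None],
    interval: float = HEARTBEAT_INTERVAL_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run each step in order and call send_heartbeat (a hypothetical
    stand-in for the kanban_heartbeat tool) whenever at least `interval`
    seconds have passed since the last heartbeat.

    Returns the number of heartbeats sent.
    """
    sent = 0
    last = clock()
    for step in steps:
        step()
        now = clock()
        if now - last >= interval:
            send_heartbeat()
            sent += 1
            last = now
    return sent
```

Checking the clock between units of work keeps the sketch single-threaded; a step that itself runs for hours would need a background timer instead.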
parent 7f253f5557
commit 88ee58f7d2
4 changed files with 79 additions and 23 deletions
@@ -4066,27 +4066,15 @@ def detect_stale_running(
         )
         reclaimed.append(tid)

-        # Increment failure counter. The task is already ``ready`` and the
-        # run is already closed; this just ticks the counter and may trip
-        # the circuit breaker.
-        _record_task_failure(
-            conn, tid,
-            error=(
-                f"no heartbeat for {int(hb_age)}s "
-                if hb_age is not None
-                else "no heartbeat ever"
-            ) + f" after {int(elapsed)}s running",
-            outcome="stale",
-            release_claim=False,
-            end_run=False,
-            event_payload_extra={
-                "elapsed_seconds": int(elapsed),
-                "heartbeat_age_seconds": (
-                    int(hb_age) if hb_age is not None else None
-                ),
-                "timeout_seconds": stale_timeout_seconds,
-            },
-        )
+        # Intentionally NOT calling _record_task_failure here. Stale reclaim
+        # is dispatcher-side detection of an absent heartbeat; the task is
+        # going straight back to ``ready`` for re-dispatch. Counting it as
+        # a worker failure would let two legitimately-long-running tasks
+        # (>4h without explicit heartbeat) trip the circuit breaker and
+        # auto-block, even though no worker actually failed. The 'stale'
+        # event already lives in task_events for auditability; that's the
+        # right surface for "this happened" without conflating with the
+        # spawn_failed / timed_out / crashed counters.

     return reclaimed
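The effect of this hunk can be illustrated with a toy model of the counter and circuit breaker. It is a sketch under stated assumptions rather than the repository's implementation: `Task`, `record_outcome` and `REAL_FAILURES` are hypothetical names, and the "block once the counter reaches the limit" rule is inferred from the commit message's description of failure_limit=2.

```python
from dataclasses import dataclass, field

FAILURE_LIMIT = 2  # default failure_limit cited in the commit message

# Outcomes the commit message treats as real worker failures.
REAL_FAILURES = {"spawn_failed", "timed_out", "crashed"}


@dataclass
class Task:
    """Hypothetical in-memory stand-in for a kanban task row."""
    status: str = "running"
    consecutive_failures: int = 0
    events: list = field(default_factory=list)


def record_outcome(task: Task, outcome: str) -> None:
    """Log every outcome as an event; only real failures tick the counter."""
    task.events.append(outcome)  # audit trail, including 'stale'
    if outcome in REAL_FAILURES:
        task.consecutive_failures += 1
        # Assumption: the breaker trips once the counter reaches the limit.
        if task.consecutive_failures >= FAILURE_LIMIT:
            task.status = "blocked"
            return
    task.status = "ready"  # stale reclaim goes straight back to ready
```

In this model two consecutive `"stale"` outcomes leave the task `ready` with a counter of 0 and two recorded events, while two `"crashed"` outcomes block it, which is the distinction the new regression test is described as locking in.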