mirror of https://github.com/NousResearch/hermes-agent.git
synced 2026-06-08 08:11:38 +00:00
fix(kanban): stale reclaim must not tick failure counter (#28680)
Follow-up to #28452. detect_stale_running() was calling _record_task_failure() on every reclaim, which ticked the consecutive_failures counter. With the default failure_limit=2, two legitimately long-running tasks (>4 h without explicit heartbeat) would auto-block via the spawn-failure circuit breaker, even though no worker actually failed.

Stale reclaim is dispatcher-side absence-of-heartbeat detection, not a worker fault. Removed the _record_task_failure() call; the 'stale' event in task_events is still the audit surface, but the failure counter is now reserved for spawn_failed / timed_out / crashed (real failures).

Also documents the heartbeat requirement:
- KANBAN_GUIDANCE in agent/prompt_builder.py now states the rule ('call kanban_heartbeat at least once an hour for tasks running longer than 1 hour') so workers learn the contract.
- kanban.md adds the stale event row to the events table and flags the heartbeat requirement in the worker lifecycle list.

New regression test: test_detect_stale_does_not_tick_failure_counter locks in the new behaviour.
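The behaviour change described above can be illustrated with a minimal in-memory sketch. This is not the repository's code: `Task`, `record_event`, and `reclaim_stale` are hypothetical names invented for illustration. Only the event names, `consecutive_failures`, and the default limit of 2 come from the commit message.

```python
from dataclasses import dataclass, field

# Outcomes that count as real worker failures, per the commit message.
REAL_FAILURES = {"spawn_failed", "timed_out", "crashed"}
FAILURE_LIMIT = 2  # default failure_limit named in the commit message


@dataclass
class Task:
    # Hypothetical in-memory stand-in for a kanban task row.
    status: str = "running"
    consecutive_failures: int = 0
    events: list = field(default_factory=list)


def record_event(task: Task, kind: str) -> None:
    """Append an audit event; tick the counter only for real failures."""
    task.events.append(kind)
    if kind in REAL_FAILURES:
        task.consecutive_failures += 1
        if task.consecutive_failures >= FAILURE_LIMIT:
            task.status = "blocked"  # circuit breaker trips
            return
    task.status = "ready"  # re-queue for another attempt


def reclaim_stale(task: Task) -> None:
    """Stale reclaim: audit the event, leave the failure counter alone."""
    record_event(task, "stale")
```

In this sketch, any number of stale reclaims leaves `consecutive_failures` at 0 and the task `ready`, while two `crashed` events in a row move it to `blocked`.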
parent 7f253f5557
commit 88ee58f7d2
4 changed files with 79 additions and 23 deletions
@@ -206,7 +206,12 @@ KANBAN_GUIDANCE = (
     "files outside it unless the task explicitly asks.\n"
     "3. **Heartbeat on long operations.** Call `kanban_heartbeat(note=...)` "
     "every few minutes during long subprocesses (training, encoding, crawling). "
-    "Skip heartbeats for short tasks.\n"
+    "Skip heartbeats for short tasks. **If your task may run longer than 1 hour, "
+    "you MUST call `kanban_heartbeat` at least once an hour** — the dispatcher "
+    "reclaims tasks running past `kanban.dispatch_stale_timeout_seconds` "
+    "(default 4 hours) when no heartbeat has arrived in the last hour. A "
+    "reclaim re-queues the task as `ready` without penalty (no failure counter "
+    "tick), but you lose your current run's progress.\n"
     "4. **Block on genuine ambiguity.** If you need a human decision you cannot "
     "infer (missing credentials, UX choice, paywalled source, peer output you "
     "need first), call `kanban_block(reason=\"...\")` and stop. Don't guess. "
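The heartbeat rule in the guidance above could be followed by a worker loop along these lines. This is a sketch under assumptions: `run_with_heartbeat` is a hypothetical helper, and the `heartbeat` callable stands in for however the worker actually invokes `kanban_heartbeat(note=...)`. The 300-second default is illustrative ("every few minutes"), not a value taken from the repository.

```python
import time
from typing import Callable, Iterable


def run_with_heartbeat(
    steps: Iterable[Callable[[], None]],
    heartbeat: Callable[[str], None],
    interval_seconds: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run steps in order, sending a heartbeat once interval_seconds has elapsed.

    `heartbeat` is a hypothetical stand-in for the kanban_heartbeat tool call.
    Returns the number of heartbeats sent.
    """
    sent = 0
    last_beat = clock()
    for index, step in enumerate(steps, start=1):
        step()
        now = clock()
        # Only beat when enough time has passed, so short tasks stay quiet.
        if now - last_beat >= interval_seconds:
            heartbeat(f"finished step {index}")
            last_beat = now
            sent += 1
    return sent
```

Because the check runs between steps, a single step that itself runs longer than an hour would still go without a heartbeat. Such a step would need to be split up or send heartbeats from inside.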