hermes-agent/website/docs/user-guide
Teknium 88ee58f7d2
fix(kanban): stale reclaim must not tick failure counter (#28680)
Follow-up to #28452. detect_stale_running() was calling
_record_task_failure() on every reclaim, which ticked the
consecutive_failures counter. With the default failure_limit=2,
two legitimately long-running tasks (>4 h without explicit
heartbeat) would auto-block via the spawn-failure circuit
breaker — even though no worker actually failed.

Stale reclaim is dispatcher-side absence-of-heartbeat detection,
not a worker fault. Removed the _record_task_failure() call;
the 'stale' event in task_events is still the audit surface,
but the failure counter is now reserved for spawn_failed /
timed_out / crashed (real failures).
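
The distinction above can be illustrated with a minimal sketch. This is not the hermes-agent implementation: the `Task` class and the `reclaim_stale` and `record_failure` helpers are hypothetical names chosen for illustration. Only `consecutive_failures`, the default limit of 2, and the event names come from the commit message.

```python
from dataclasses import dataclass, field

FAILURE_LIMIT = 2  # default failure_limit quoted in the commit message
REAL_FAILURES = {"spawn_failed", "timed_out", "crashed"}


@dataclass
class Task:
    """Hypothetical stand-in for a kanban task row."""
    status: str = "running"
    consecutive_failures: int = 0
    events: list = field(default_factory=list)


def reclaim_stale(task: Task) -> None:
    """Dispatcher-side reclaim: audit the 'stale' event, leave the counter alone."""
    task.events.append("stale")
    task.status = "ready"


def record_failure(task: Task, kind: str) -> None:
    """Real worker failure: tick the counter and block once the limit is reached."""
    if kind not in REAL_FAILURES:
        raise ValueError(f"not a real failure kind: {kind}")
    task.events.append(kind)
    task.consecutive_failures += 1
    if task.consecutive_failures >= FAILURE_LIMIT:
        task.status = "blocked"
    else:
        task.status = "ready"
```

In this sketch, any number of stale reclaims leaves `consecutive_failures` at 0, while two real failures in a row trip the breaker.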

Also documents the heartbeat requirement:
- KANBAN_GUIDANCE in agent/prompt_builder.py now states the
  rule ('call kanban_heartbeat at least once an hour for tasks
  running longer than 1 hour') so workers learn the contract.
- kanban.md adds the stale event row to the events table and
  flags the heartbeat requirement in the worker lifecycle list.
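
As a rough illustration of the heartbeat contract, a long-running worker could pace its heartbeats with a small helper like the one below. Everything here is an assumption: `HeartbeatPacer` is a hypothetical class, the `send` callback stands in for whatever actually invokes `kanban_heartbeat` (its real signature is not shown in this commit), and the 45-minute default is an arbitrary margin chosen to stay under the documented one-hour rule.

```python
import time
from typing import Callable

# Assumed safety margin: send every 45 minutes to stay under the 1-hour rule.
DEFAULT_INTERVAL_S = 45 * 60


class HeartbeatPacer:
    """Hypothetical helper that calls `send` once the interval has elapsed."""

    def __init__(
        self,
        send: Callable[[], None],
        interval_s: float = DEFAULT_INTERVAL_S,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._interval_s = interval_s
        self._clock = clock
        self._last = clock()  # treat task start as the first reference point

    def tick(self) -> bool:
        """Call between units of work; returns True if a heartbeat was sent."""
        now = self._clock()
        if now - self._last >= self._interval_s:
            self._send()
            self._last = now
            return True
        return False
```

A worker would call `tick()` between units of work; the injectable `clock` exists only so the pacing can be tested without waiting.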

New regression test: test_detect_stale_does_not_tick_failure_counter
locks in the new behaviour.
2026-05-19 03:15:18 -07:00
features fix(kanban): stale reclaim must not tick failure counter (#28680) 2026-05-19 03:15:18 -07:00
messaging docs: comprehensive 2-week sweep of feature/PR coverage gaps (#28497) 2026-05-18 23:55:25 -07:00
skills docs: align kanban readiness docs and smoke tests 2026-05-18 21:07:03 -07:00
_category_.json feat: add documentation website (Docusaurus) 2026-03-05 05:24:55 -08:00
checkpoints-and-rollback.md feat(checkpoints): v2 single-store rewrite with real pruning + disk guardrails (#20709) 2026-05-06 05:44:35 -07:00
cli.md docs: comprehensive 2-week sweep of feature/PR coverage gaps (#28497) 2026-05-18 23:55:25 -07:00
configuration.md Revert "feat(telegram): support quick-command-only menus" 2026-05-18 23:59:57 -07:00
configuring-models.md docs(session_search): update all docs for the single-shape rewrite (#27840) 2026-05-18 00:36:17 -07:00
docker.md docs: comprehensive 2-week sweep of feature/PR coverage gaps (#28497) 2026-05-18 23:55:25 -07:00
git-worktrees.md docs: restructure site navigation — promote features and platforms to top-level (#4116) 2026-03-30 18:39:51 -07:00
profile-distributions.md docs(profiles): full user guide for profile distributions (#22017) 2026-05-08 11:13:45 -07:00
profiles.md feat(kanban): orchestrator-driven auto-decomposition on triage (#27572) 2026-05-17 13:54:12 -07:00
security.md docs: comprehensive 2-week sweep of feature/PR coverage gaps (#28497) 2026-05-18 23:55:25 -07:00
sessions.md docs: comprehensive 2-week sweep of feature/PR coverage gaps (#28497) 2026-05-18 23:55:25 -07:00
tui.md docs: comprehensive 2-week sweep of feature/PR coverage gaps (#28497) 2026-05-18 23:55:25 -07:00
windows-native.md docs: comprehensive 2-week sweep of feature/PR coverage gaps (#28497) 2026-05-18 23:55:25 -07:00
windows-wsl-quickstart.md docs: deep audit — fix stale config keys, missing commands, and registry drift (#22784) 2026-05-09 13:19:51 -07:00