mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-09 08:21:50 +00:00
Reporter diagnosed three independent gaps that together allowed infinite 'unblock → re-stuck' loops with no surfacing or escalation:

GAP 1: `_rule_stuck_in_blocked` resets timer on any `commented`/`unblocked` event, so a task that cycles every few minutes is invisible to it regardless of how many times it cycles.

Fix: new `_rule_block_unblock_cycling` rule (`hermes_cli/kanban_diagnostics.py`) that counts block→unblock cycles in a sliding window. Default threshold 3 cycles within 24h, configurable via `block_cycle_threshold` / `block_cycle_window_seconds`. Walks events in arrival order (event id) since multiple events can share the same `created_at` second. Fires as a warning with a CLI hint to inspect the block reasons.

GAP 2: Iteration-budget-exhausted runs in kanban workers map to `kanban_block` (status=blocked, but a clean exit from the kernel's perspective). `_rule_repeated_failures` reads `consecutive_failures`, which `_record_task_failure` increments only for crashed/timed_out/spawn_failed. The `blocked` outcome bypasses the failure counter, so the `kanban.failure_limit` circuit breaker never trips on budget-exhaustion loops.

Fix: `agent/conversation_loop.py` budget-exhaustion path now calls `_record_task_failure(outcome="timed_out")` instead of `kanban_block`. Budget exhaustion is genuinely a timeout-shaped failure (the task ran out of allowed iterations), so this is more honest semantics; it also routes through the unified failure counter, so repeated budget exhaustions trip the circuit breaker and the task auto-blocks with `gave_up` after `failure_limit` retries.

GAP 3: `release_stale_claims` uses `_pid_alive(worker_pid)` only and ignores `last_heartbeat_at`. Reporter observed a 91-min run that held its claim with frozen heartbeat because the worker entered a logic loop with no tool calls: `_pid_alive` kept returning True, so the claim was extended every 15 minutes indefinitely.

Fix: heartbeat-stale backstop. If `last_heartbeat_at` is set AND older than `DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS` (default 1h), reclaim even if the PID is alive. NULL `last_heartbeat_at` preserves backward compatibility (no heartbeat yet = extend, as before). The reclaim event payload now includes a `heartbeat_stale` boolean so operators see why a live-PID worker was reclaimed.

This works cleanly in concert with PR #34418 (#31752 runtime → heartbeat bridge): once `_touch_activity` keeps `last_heartbeat_at` fresh as a side effect of normal API traffic, the backstop only fires for genuinely wedged workers (no chunks, no tool results, no progress at all).

Co-authored-by: baofuen <45189813+baofuen@users.noreply.github.com>
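The GAP 1 and GAP 3 fixes can be illustrated with a minimal sketch. This is not the code from `hermes_cli/kanban_diagnostics.py` or `release_stale_claims`; the `Event` class, the function names `count_block_unblock_cycles` and `should_reclaim`, and the `"blocked"` event kind are hypothetical and chosen only for illustration. The defaults (24h window, 1h stale limit) are taken from the message above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Event:
    """Hypothetical task event; field names are assumptions."""
    id: int          # monotonically increasing arrival order
    kind: str        # e.g. "blocked", "unblocked", "commented"
    created_at: int  # Unix seconds; several events may share one second


def count_block_unblock_cycles(
    events: Iterable[Event], now: int, window_seconds: int = 24 * 3600
) -> int:
    """Count blocked -> unblocked transitions inside the sliding window."""
    cutoff = now - window_seconds
    cycles = 0
    blocked = False
    # Sort by id, not created_at: timestamps have one-second resolution,
    # so a block and its unblock can carry the same created_at value.
    for ev in sorted(events, key=lambda e: e.id):
        if ev.created_at < cutoff:
            continue  # outside the window
        if ev.kind == "blocked":
            blocked = True
        elif ev.kind == "unblocked" and blocked:
            cycles += 1
            blocked = False
    return cycles


def should_reclaim(
    pid_alive: bool,
    last_heartbeat_at: Optional[int],
    now: int,
    max_stale_seconds: int = 3600,
) -> Tuple[bool, bool]:
    """Return (reclaim, heartbeat_stale) for one claimed task."""
    if not pid_alive:
        return True, False   # dead worker: reclaim as before
    if last_heartbeat_at is None:
        return False, False  # no heartbeat yet: extend, as before
    stale = (now - last_heartbeat_at) > max_stale_seconds
    return stale, stale      # live PID is reclaimed only when stale
```

Under these assumptions, a task would be flagged when `count_block_unblock_cycles(...)` reaches the configured `block_cycle_threshold`, and a worker like the 91-minute run described above (live PID, frozen heartbeat) would be reclaimed with `heartbeat_stale` set to true.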
||
|---|---|---|
| lib | ||
| tests | ||
| whatsapp-bridge | ||
| benchmark_browser_eval.py | ||
| build_model_catalog.py | ||
| build_skills_index.py | ||
| check-windows-footguns.py | ||
| contributor_audit.py | ||
| discord-voice-doctor.py | ||
| hermes-gateway | ||
| install.cmd | ||
| install.ps1 | ||
| install.sh | ||
| install_psutil_android.py | ||
| keystroke_diagnostic.py | ||
| kill_modal.sh | ||
| lint_diff.py | ||
| profile-tui.py | ||
| release.py | ||
| run_tests.sh | ||
| run_tests_parallel.py | ||
| sample_and_compress.py | ||
| setup_open_webui.sh | ||