hermes-agent/hermes_cli
Teknium 1fc8733a69
fix(kanban): unify failure counter across spawn/timeout/crash outcomes (#20410)
The dispatcher's circuit breaker only protected against spawn-side
failures (profile missing, workspace mount error, exec failure).
Workers that successfully spawned but then timed out or crashed
re-queued to ``ready`` with no counter increment, so the next tick
re-spawned them — loops forever until someone noticed. Reported
externally on Twitter (Forbidden Seeds) and confirmed by walking the
kernel: ``enforce_max_runtime`` flipped the task back to ready, emitted
a ``timed_out`` event, and never touched ``spawn_failures``; same for
``detect_crashed_workers``.

Fix: unify the counter across all non-success outcomes.

Schema
------
* ``tasks.spawn_failures`` → ``tasks.consecutive_failures``
* ``tasks.last_spawn_error`` → ``tasks.last_failure_error``
* Migration renames the columns in-place on existing DBs (``ALTER
  TABLE RENAME COLUMN`` — SQLite >= 3.25) so historical counter
  values are preserved. Row mappers fall through to the legacy names
  if both column renames and a migration somehow got out of sync.

Counter lifecycle
-----------------
New helper ``_record_task_failure(conn, task_id, error, *, outcome,
release_claim, end_run, event_payload_extra)`` is the single point
every non-success outcome funnels through:

* ``spawn_failed``  → ``_record_spawn_failure`` (kept as alias)
  calls it with ``release_claim=True, end_run=True`` — transitions
  running→ready, clears claim, closes run.
* ``timed_out`` → ``enforce_max_runtime`` already does the status
  transition + run close + event emission, then calls
  ``_record_task_failure`` with ``release_claim=False, end_run=False``
  just to bump the counter (and trip the breaker if needed).
* ``crashed`` → ``detect_crashed_workers`` same pattern, but the
  counter increment runs after the main write_txn closes (SQLite
  doesn't nest write transactions).

If the counter hits the breaker threshold (``DEFAULT_FAILURE_LIMIT=5``,
same as before), the task transitions to ``blocked`` with a ``gave_up``
event on top of whatever outcome-specific event was already emitted.

Reset semantics changed: the counter now clears only on successful
``complete_task`` (and operator ``reclaim_task`` — an explicit "I've
looked at this, try again with a fresh budget"). Previously
``_clear_spawn_failures`` ran on every successful spawn, which would
have wiped the counter before a timeout could accumulate past threshold
— exactly the loop this fix prevents.

Diagnostics
-----------
* ``_rule_repeated_spawn_failures`` → ``_rule_repeated_failures``. Now
  fires regardless of which outcome is at fault. Classifies the most
  recent failure (spawn_failed / timed_out / crashed) from the run
  history so the title ("Agent timeout x3", "Agent crash x4", "Agent
  spawn x5") and suggested action (``doctor`` for spawn, ``log`` for
  timeout/crash) stay outcome-specific without N duplicate rules.
* ``_rule_repeated_crashes`` kept as a narrower early-warning at
  threshold 2 (vs 3 for the unified rule), but now suppresses itself
  when the unified rule would also fire — avoids double-flagging.
* Diagnostic ``data`` payload now carries
  ``{consecutive_failures, most_recent_outcome, last_error}`` instead
  of spawn-specific keys.

CLI
---
* ``Task.consecutive_failures`` / ``Task.last_failure_error`` are the
  public fields now. Existing callers that referenced the old names
  get migrated (tests updated in this commit).
* Backward-compat: ``DEFAULT_SPAWN_FAILURE_LIMIT``,
  ``_clear_spawn_failures``, ``_record_spawn_failure`` stay as aliases.

Tests
-----
* 6 new kernel tests: timeout increments counter, 3 consecutive
  timeouts trip the breaker (was the reported gap), crash increments
  counter, reclaim clears counter, completion clears counter, spawn
  success does NOT clear counter.
* Diagnostic tests: updated ``repeated_spawn_failures`` cases to use
  the new kind name and add a timeout-loop test.
* Dashboard API test: spawn_failures column update → consecutive_failures.

389/389 kanban-suite tests pass.

Live verification
-----------------
Seeded 4 tasks in an isolated HERMES_HOME: 3 timeouts, 4 crashes,
2-spawn-failed + 2-timed-out, and a task that had prior failures but
completed successfully. Board correctly shows "!! 3 tasks need
attention" (the successful one has no badge because the counter
reset). Drawer for the timeout-loop task renders "Agent timeout x3"
with most_recent_outcome=timed_out and the "Check logs" suggested
action (not the spawn-flavoured "Verify profile"). The successful
task has zero diagnostics.

Closes the Forbidden-Seeds-reported gap.
2026-05-05 13:55:37 -07:00
..
__init__.py
_parser.py
auth.py feat: provider modules — ProviderProfile ABC, 33 providers, fetch_models, transport single-path 2026-05-05 13:40:01 -07:00
auth_commands.py
azure_detect.py
backup.py
banner.py
browser_connect.py
callbacks.py
claw.py fix(claw): handle missing dir in _scan_workspace_state 2026-05-05 06:08:14 -07:00
cli_output.py
clipboard.py
codex_models.py
colors.py
commands.py feat(telegram): /topic off + help + auth gate + screenshot debounce 2026-05-04 12:07:17 -07:00
completion.py
config.py docs(bedrock): fix IAM permissions, add quickstart entry, add fallback provider, fix deployment section 2026-05-05 13:41:14 -07:00
copilot_auth.py
cron.py feat(cron): add no_agent mode for script-only cron jobs (watchdog pattern) (#19709) 2026-05-04 12:31:01 -07:00
curator.py feat(curator): add archive and prune subcommands (#20200) 2026-05-05 05:15:54 -07:00
curses_ui.py
debug.py
default_soul.py
dingtalk_auth.py
doctor.py feat: provider modules — ProviderProfile ABC, 33 providers, fetch_models, transport single-path 2026-05-05 13:40:01 -07:00
dump.py refactor(env): use shared Hermes dotenv loader 2026-05-05 10:13:13 -07:00
env_loader.py
fallback_cmd.py
gateway.py fix(gateway): handle planned service stops 2026-05-04 16:00:49 -07:00
goals.py
hooks.py
kanban.py feat(kanban): generic diagnostics engine for task distress signals (#20332) 2026-05-05 13:32:42 -07:00
kanban_db.py fix(kanban): unify failure counter across spawn/timeout/crash outcomes (#20410) 2026-05-05 13:55:37 -07:00
kanban_diagnostics.py fix(kanban): unify failure counter across spawn/timeout/crash outcomes (#20410) 2026-05-05 13:55:37 -07:00
logs.py
main.py fix(tui): close slash parity gaps with CLI (#20339) 2026-05-05 15:42:39 -05:00
mcp_config.py
memory_setup.py
model_catalog.py
model_normalize.py
model_switch.py feat(cli): add list_picker_providers for credential-filtered picker 2026-05-05 10:18:58 -07:00
models.py feat: provider modules — ProviderProfile ABC, 33 providers, fetch_models, transport single-path 2026-05-05 13:40:01 -07:00
nous_subscription.py
oneshot.py
pairing.py
platforms.py
plugins.py feat(providers): make all 33 providers pluggable under plugins/model-providers/ 2026-05-05 13:40:01 -07:00
plugins_cmd.py
profiles.py
providers.py
pty_bridge.py
relaunch.py
runtime_provider.py
setup.py fix(cli): sanitize bracketed paste markers during setup 2026-05-05 06:12:42 -07:00
skills_config.py
skills_hub.py
skin_engine.py
slack_cli.py
status.py fix(status): add missing popular provider API keys to hermes status display 2026-05-04 05:14:13 -07:00
timeouts.py
tips.py docs: refresh stale platform/LOC/test counts; clarify gateway vs plugin platforms 2026-05-05 13:45:47 -07:00
tools_config.py
uninstall.py
vercel_auth.py
voice.py fix(tui): respect voice.record_key config (supersedes #19028, #19339) (#19835) 2026-05-04 15:49:28 -07:00
web_server.py
webhook.py