Mirror of https://github.com/NousResearch/hermes-agent.git (synced 2026-05-11 03:31:55 +00:00)
fix(kanban): unify failure counter across spawn/timeout/crash outcomes (#20410)
The dispatcher's circuit breaker only protected against spawn-side
failures (profile missing, workspace mount error, exec failure).
Workers that successfully spawned but then timed out or crashed
were re-queued to ``ready`` with no counter increment, so the next tick
re-spawned them — a loop that ran forever until someone noticed. Reported
externally on Twitter (Forbidden Seeds) and confirmed by walking the
kernel: ``enforce_max_runtime`` flipped the task back to ready, emitted
a ``timed_out`` event, and never touched ``spawn_failures``; same for
``detect_crashed_workers``.
Fix: unify the counter across all non-success outcomes.
Schema
------
* ``tasks.spawn_failures`` → ``tasks.consecutive_failures``
* ``tasks.last_spawn_error`` → ``tasks.last_failure_error``
* Migration renames the columns in place on existing DBs (``ALTER
TABLE RENAME COLUMN`` — SQLite >= 3.25) so historical counter
values are preserved; see the sketch after this list. Row mappers
fall back to the legacy names in case the column renames and the
migration somehow get out of sync.
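A minimal sketch of the rename-or-add decision, condensed from the
migration in the diff below. The ``sqlite3.sqlite_version_info`` guard is
illustrative only (the commit itself assumes SQLite >= 3.25 is available)
and is not part of this change::

    import sqlite3

    def _rename_or_add_counter(conn: sqlite3.Connection, cols: set[str]) -> None:
        # Prefer an in-place rename so historical counter values survive.
        can_rename = sqlite3.sqlite_version_info >= (3, 25, 0)
        if "consecutive_failures" not in cols:
            if "spawn_failures" in cols and can_rename:
                conn.execute(
                    "ALTER TABLE tasks RENAME COLUMN spawn_failures TO consecutive_failures"
                )
            else:
                # Fresh DB (or too-old SQLite): add the column with its default.
                conn.execute(
                    "ALTER TABLE tasks ADD COLUMN consecutive_failures "
                    "INTEGER NOT NULL DEFAULT 0"
                )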
Counter lifecycle
-----------------
New helper ``_record_task_failure(conn, task_id, error, *, outcome,
release_claim, end_run, event_payload_extra)`` is the single point
every non-success outcome funnels through:
* ``spawn_failed`` → ``_record_spawn_failure`` (kept as alias)
calls it with ``release_claim=True, end_run=True`` — transitions
running→ready, clears claim, closes run.
* ``timed_out`` → ``enforce_max_runtime`` already does the status
transition + run close + event emission, then calls
``_record_task_failure`` with ``release_claim=False, end_run=False``
just to bump the counter (and trip the breaker if needed).
* ``crashed`` → ``detect_crashed_workers`` follows the same pattern, but
the counter increment runs after the main write_txn closes (SQLite
doesn't nest write transactions).
If the counter hits the breaker threshold (``DEFAULT_FAILURE_LIMIT=5``,
same as before), the task transitions to ``blocked`` with a ``gave_up``
event on top of whatever outcome-specific event was already emitted.
Reset semantics changed: the counter now clears only on successful
``complete_task`` (and operator ``reclaim_task`` — an explicit "I've
looked at this, try again with a fresh budget"). Previously
``_clear_spawn_failures`` ran on every successful spawn, which would
have wiped the counter before timeouts could accumulate past the
threshold — exactly the loop this fix prevents.
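The three call shapes, condensed from the kernel diff below. The import
path is a hypothetical placeholder and the error strings are illustrative::

    from kanban.kernel import _record_task_failure  # hypothetical module path

    def _failure_call_shapes(conn, task_id: str) -> None:
        # Spawn path: the task is still ``running`` with an open run, so the
        # helper releases the claim and closes the run itself.
        _record_task_failure(
            conn, task_id, "exec failed: profile not found",
            outcome="spawn_failed", release_claim=True, end_run=True,
        )
        # Timeout path: enforce_max_runtime already flipped the task to
        # ``ready`` and closed the run; this call only bumps the counter and
        # may trip the breaker.
        _record_task_failure(
            conn, task_id, "elapsed 3700s > limit 3600s",
            outcome="timed_out", release_claim=False, end_run=False,
        )
        # Crash path: same bump-only shape, invoked after the main write_txn
        # in detect_crashed_workers has committed.
        _record_task_failure(
            conn, task_id, "pid 4242 not alive",
            outcome="crashed", release_claim=False, end_run=False,
        )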
Diagnostics
-----------
* ``_rule_repeated_spawn_failures`` → ``_rule_repeated_failures``. Now
fires regardless of which outcome is at fault. Classifies the most
recent failure (spawn_failed / timed_out / crashed) from the run
history so the title ("Agent timeout x3", "Agent crash x4", "Agent
spawn x5") and suggested action (``doctor`` for spawn, ``log`` for
timeout/crash) stay outcome-specific without N duplicate rules.
* ``_rule_repeated_crashes`` kept as a narrower early-warning at
threshold 2 (vs 3 for the unified rule), but now suppresses itself
when the unified rule would also fire — avoids double-flagging.
* Diagnostic ``data`` payload now carries
``{consecutive_failures, most_recent_outcome, last_error}`` instead
of spawn-specific keys (example below).
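Shape of the new payload as a dashboard or CLI consumer sees it; the
values here are made up::

    diag_data = {
        "consecutive_failures": 3,
        # one of "spawn_failed" / "timed_out" / "crashed", or None when no
        # run outcome could be classified from the history
        "most_recent_outcome": "timed_out",
        "last_error": "elapsed 3700s > limit 3600s",
    }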
CLI
---
* ``Task.consecutive_failures`` / ``Task.last_failure_error`` are now
the public fields. Existing callers that referenced the old names
have been migrated (tests updated in this commit).
* Backward-compat: ``DEFAULT_SPAWN_FAILURE_LIMIT``,
``_clear_spawn_failures``, ``_record_spawn_failure`` stay as aliases.
Tests
-----
* 6 new kernel tests: timeout increments the counter, 3 consecutive
timeouts trip the breaker (the reported gap; one test is sketched
after this list), crash increments the counter, reclaim clears the
counter, completion clears the counter, spawn success does NOT clear
the counter.
* Diagnostic tests: updated ``repeated_spawn_failures`` cases to use
the new kind name and add a timeout-loop test.
* Dashboard API test: spawn_failures column update → consecutive_failures.
389/389 kanban-suite tests pass.
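Sketch of the breaker-trip test's shape. The DB/seed helpers
(``_open_test_db``, ``_seed_ready_task``) and the import path are
hypothetical stand-ins for whatever fixtures the suite actually uses;
only ``_record_task_failure`` and the ``consecutive_failures`` column come
from this commit, and the explicit ``failure_limit=3`` keeps the example
short::

    from kanban.kernel import _record_task_failure  # hypothetical module path

    def test_three_consecutive_timeouts_trip_breaker(tmp_path):
        conn = _open_test_db(tmp_path)       # hypothetical helper
        task_id = _seed_ready_task(conn)     # hypothetical helper
        blocked = False
        for _ in range(3):
            blocked = _record_task_failure(
                conn, task_id, "elapsed 120s > limit 60s",
                outcome="timed_out", failure_limit=3,
                release_claim=False, end_run=False,
            )
        assert blocked is True
        row = conn.execute(
            "SELECT status, consecutive_failures FROM tasks WHERE id = ?",
            (task_id,),
        ).fetchone()
        assert row["status"] == "blocked"
        assert row["consecutive_failures"] == 3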
Live verification
-----------------
Seeded 4 tasks in an isolated HERMES_HOME: one with 3 timeouts, one with
4 crashes, one with 2 spawn failures + 2 timeouts, and one that had prior
failures but completed successfully. Board correctly shows "!! 3 tasks need
attention" (the successful one has no badge because the counter
reset). Drawer for the timeout-loop task renders "Agent timeout x3"
with most_recent_outcome=timed_out and the "Check logs" suggested
action (not the spawn-flavoured "Verify profile"). The successful
task has zero diagnostics.
Closes the Forbidden-Seeds-reported gap.
This commit is contained in:
parent 587ef55f2c
commit 1fc8733a69
5 changed files with 630 additions and 125 deletions
@@ -573,9 +573,18 @@ class Task:
     tenant: Optional[str]
     result: Optional[str] = None
     idempotency_key: Optional[str] = None
-    spawn_failures: int = 0
+    # Unified non-success counter. Incremented on any of:
+    # * spawn failure (dispatcher couldn't launch the worker)
+    # * timed_out outcome (worker exceeded max_runtime_seconds)
+    # * crashed outcome (worker PID vanished)
+    # Reset to 0 only on a successful completion. See
+    # ``_record_task_failure`` for the circuit-breaker trip rule.
+    # (Pre-rename column: ``spawn_failures``.)
+    consecutive_failures: int = 0
     worker_pid: Optional[int] = None
-    last_spawn_error: Optional[str] = None
+    # Short excerpt of the last failure's error text (any outcome, not
+    # just spawn). Pre-rename column: ``last_spawn_error``.
+    last_failure_error: Optional[str] = None
     max_runtime_seconds: Optional[int] = None
     last_heartbeat_at: Optional[int] = None
     current_run_id: Optional[int] = None
@@ -617,9 +626,15 @@ class Task:
             tenant=row["tenant"] if "tenant" in keys else None,
             result=row["result"] if "result" in keys else None,
             idempotency_key=row["idempotency_key"] if "idempotency_key" in keys else None,
-            spawn_failures=row["spawn_failures"] if "spawn_failures" in keys else 0,
+            consecutive_failures=(
+                row["consecutive_failures"] if "consecutive_failures" in keys
+                else (row["spawn_failures"] if "spawn_failures" in keys else 0)
+            ),
             worker_pid=row["worker_pid"] if "worker_pid" in keys else None,
-            last_spawn_error=row["last_spawn_error"] if "last_spawn_error" in keys else None,
+            last_failure_error=(
+                row["last_failure_error"] if "last_failure_error" in keys
+                else (row["last_spawn_error"] if "last_spawn_error" in keys else None)
+            ),
             max_runtime_seconds=(
                 row["max_runtime_seconds"] if "max_runtime_seconds" in keys else None
             ),
@@ -735,9 +750,14 @@ CREATE TABLE IF NOT EXISTS tasks (
     tenant TEXT,
     result TEXT,
     idempotency_key TEXT,
-    spawn_failures INTEGER NOT NULL DEFAULT 0,
+    -- Unified consecutive-failure counter. Incremented on spawn
+    -- failure, timeout, or crash; reset only on successful completion.
+    -- The circuit breaker in _record_task_failure trips when this
+    -- exceeds DEFAULT_FAILURE_LIMIT consecutive non-successes.
+    consecutive_failures INTEGER NOT NULL DEFAULT 0,
     worker_pid INTEGER,
-    last_spawn_error TEXT,
+    -- Short excerpt of the most recent failure's error text.
+    last_failure_error TEXT,
     max_runtime_seconds INTEGER,
     last_heartbeat_at INTEGER,
     -- Pointer into task_runs for the currently-active run (NULL if no
@@ -933,14 +953,31 @@ def _migrate_add_optional_columns(conn: sqlite3.Connection) -> None:
         "CREATE INDEX IF NOT EXISTS idx_tasks_idempotency "
         "ON tasks(idempotency_key)"
     )
-    if "spawn_failures" not in cols:
-        conn.execute(
-            "ALTER TABLE tasks ADD COLUMN spawn_failures INTEGER NOT NULL DEFAULT 0"
-        )
+    # Legacy column rename: ``spawn_failures`` → ``consecutive_failures``
+    # and ``last_spawn_error`` → ``last_failure_error``. The counter was
+    # originally spawn-only; it's now unified across spawn/timeout/
+    # crash outcomes. Rename when only the legacy columns exist to
+    # preserve historical counter values across upgrades. Add fresh
+    # otherwise.
+    if "consecutive_failures" not in cols:
+        if "spawn_failures" in cols:
+            conn.execute(
+                "ALTER TABLE tasks RENAME COLUMN spawn_failures TO consecutive_failures"
+            )
+        else:
+            conn.execute(
+                "ALTER TABLE tasks ADD COLUMN consecutive_failures "
+                "INTEGER NOT NULL DEFAULT 0"
+            )
     if "worker_pid" not in cols:
         conn.execute("ALTER TABLE tasks ADD COLUMN worker_pid INTEGER")
-    if "last_spawn_error" not in cols:
-        conn.execute("ALTER TABLE tasks ADD COLUMN last_spawn_error TEXT")
+    if "last_failure_error" not in cols:
+        if "last_spawn_error" in cols:
+            conn.execute(
+                "ALTER TABLE tasks RENAME COLUMN last_spawn_error TO last_failure_error"
+            )
+        else:
+            conn.execute("ALTER TABLE tasks ADD COLUMN last_failure_error TEXT")
     if "max_runtime_seconds" not in cols:
         conn.execute("ALTER TABLE tasks ADD COLUMN max_runtime_seconds INTEGER")
     if "last_heartbeat_at" not in cols:
@@ -1895,6 +1932,11 @@ def reclaim_task(
             },
             run_id=run_id,
         )
+    # Operator intervention — they've looked at the task, so the
+    # consecutive-failures counter is now stale. Give the next retry
+    # a fresh budget. (_clear_failure_counter opens its own write_txn,
+    # so it runs after the enclosing one commits.)
+    _clear_failure_counter(conn, task_id)
     return True
 
 
@@ -2186,6 +2228,11 @@ def complete_task(
             },
             run_id=run_id,
         )
+    # Successful completion — wipe the consecutive-failures counter.
+    # Failure history stays on the event log for audit; the counter
+    # just tracks "is there a current pathology the breaker should
+    # care about", and a success resets that question.
+    _clear_failure_counter(conn, task_id)
     # Recompute ready status for dependents (separate txn so children see done).
     recompute_ready(conn)
     return True
@@ -2444,7 +2491,9 @@ def set_workspace_path(
 # stops retrying and parks the task in ``blocked`` with a reason so a human
 # can investigate. Prevents the dispatcher from thrashing forever on a task
 # whose profile doesn't exist, whose workspace is unmountable, etc.
-DEFAULT_SPAWN_FAILURE_LIMIT = 5
+DEFAULT_FAILURE_LIMIT = 5
+# Legacy alias — callers / tests still reference the old name.
+DEFAULT_SPAWN_FAILURE_LIMIT = DEFAULT_FAILURE_LIMIT
 
 # Max bytes to keep in a single worker log file. The dispatcher truncates
 # and rotates on spawn if the file is larger than this at spawn time.
@@ -2668,6 +2717,20 @@ def enforce_max_runtime(
                 conn, tid, "timed_out", payload, run_id=run_id,
             )
             timed_out.append(tid)
+        # Increment the unified failure counter. Outside the write_txn
+        # above because ``_record_task_failure`` opens its own. If the
+        # breaker trips, this flips the task ``ready → blocked`` and
+        # emits a ``gave_up`` event on top of the ``timed_out`` we
+        # already emitted.
+        if cur.rowcount == 1:
+            _record_task_failure(
+                conn, tid,
+                error=f"elapsed {int(elapsed)}s > limit {int(row['max_runtime_seconds'])}s",
+                outcome="timed_out",
+                release_claim=False,
+                end_run=False,
+                event_payload_extra={"pid": pid, "sigkill": killed},
+            )
     return timed_out
 
 
@@ -2699,6 +2762,10 @@ def detect_crashed_workers(conn: sqlite3.Connection) -> list[str]:
     dispatcher (the whole design is single-host).
     """
     crashed: list[str] = []
+    # Per-crash details collected inside the main txn, used after it
+    # closes to run ``_record_task_failure`` (which needs its own
+    # write_txn so can't nest).
+    crash_details: list[tuple[str, int, str]] = []  # (task_id, pid, claimer)
     with write_txn(conn):
         rows = conn.execute(
             "SELECT id, worker_pid, claim_lock FROM tasks "
@@ -2734,67 +2801,169 @@ def detect_crashed_workers(conn: sqlite3.Connection) -> list[str]:
                 run_id=run_id,
             )
             crashed.append(row["id"])
+            crash_details.append(
+                (row["id"], int(row["worker_pid"]), row["claim_lock"])
+            )
+    # Outside the main txn: increment the unified failure counter for
+    # each crashed task. If the breaker trips, the task transitions
+    # ready → blocked with a ``gave_up`` event on top of the ``crashed``
+    # event we already emitted.
+    for tid, pid, claimer in crash_details:
+        _record_task_failure(
+            conn, tid,
+            error=f"pid {pid} not alive",
+            outcome="crashed",
+            release_claim=False,
+            end_run=False,
+            event_payload_extra={"pid": pid, "claimer": claimer},
+        )
     return crashed
 
 
+def _record_task_failure(
+    conn: sqlite3.Connection,
+    task_id: str,
+    error: str,
+    *,
+    outcome: str,
+    failure_limit: int = None,
+    release_claim: bool = False,
+    end_run: bool = False,
+    event_payload_extra: Optional[dict] = None,
+) -> bool:
+    """Record a non-success outcome (spawn_failed / crashed / timed_out)
+    and maybe trip the circuit breaker.
+
+    Unified replacement for the old spawn-only ``_record_spawn_failure``.
+    Every path that ends a task with a non-success outcome funnels
+    through here so the ``consecutive_failures`` counter and the
+    auto-block threshold stay consistent.
+
+    Returns True when the task was auto-blocked (counter reached
+    ``failure_limit``), False when it was just updated in place.
+
+    Modes:
+
+    * ``release_claim=True, end_run=True`` — spawn-failure path.
+      Caller has a running task with an open run; this transitions
+      it back to ``ready`` (or ``blocked`` when the breaker trips),
+      releases the claim, and closes the run with ``outcome=<outcome>``.
+
+    * ``release_claim=False, end_run=False`` — timeout/crash path.
+      Caller has ALREADY flipped the task to ``ready`` and closed the
+      run with the appropriate outcome. This just increments the
+      counter; if the breaker trips, the task is re-transitioned
+      ``ready → blocked`` and a ``gave_up`` event is emitted.
+
+    ``event_payload_extra`` merges into the ``gave_up`` event payload
+    when the breaker trips, so callers can include outcome-specific
+    context (e.g. pid on crash, elapsed on timeout).
+    """
+    if failure_limit is None:
+        failure_limit = DEFAULT_FAILURE_LIMIT
+    blocked = False
+    with write_txn(conn):
+        row = conn.execute(
+            "SELECT consecutive_failures, status FROM tasks WHERE id = ?", (task_id,),
+        ).fetchone()
+        if row is None:
+            return False
+        failures = int(row["consecutive_failures"]) + 1
+        cur_status = row["status"]
+
+        if failures >= failure_limit:
+            # Trip the breaker.
+            if release_claim:
+                # Spawn path: still running, also clear claim state.
+                conn.execute(
+                    "UPDATE tasks SET status = 'blocked', claim_lock = NULL, "
+                    "claim_expires = NULL, worker_pid = NULL, "
+                    "consecutive_failures = ?, last_failure_error = ? "
+                    "WHERE id = ? AND status IN ('running', 'ready')",
+                    (failures, error[:500], task_id),
+                )
+            else:
+                # Timeout/crash path: task is already at ``ready``
+                # with claim cleared; just flip to blocked + update
+                # counter fields.
+                conn.execute(
+                    "UPDATE tasks SET status = 'blocked', "
+                    "consecutive_failures = ?, last_failure_error = ? "
+                    "WHERE id = ? AND status IN ('ready', 'running')",
+                    (failures, error[:500], task_id),
+                )
+            run_id = None
+            if end_run:
+                # Only the spawn path has an open run to close.
+                run_id = _end_run(
+                    conn, task_id,
+                    outcome="gave_up", status="gave_up",
+                    error=error[:500],
+                    metadata={"failures": failures, "trigger_outcome": outcome},
+                )
+            payload = {
+                "failures": failures,
+                "error": error[:500],
+                "trigger_outcome": outcome,
+            }
+            if event_payload_extra:
+                payload.update(event_payload_extra)
+            _append_event(
+                conn, task_id, "gave_up", payload, run_id=run_id,
+            )
+            blocked = True
+        else:
+            # Below threshold.
+            if release_claim:
+                # Spawn path: transition running → ready + clear claim.
+                conn.execute(
+                    "UPDATE tasks SET status = 'ready', claim_lock = NULL, "
+                    "claim_expires = NULL, worker_pid = NULL, "
+                    "consecutive_failures = ?, last_failure_error = ? "
+                    "WHERE id = ? AND status = 'running'",
+                    (failures, error[:500], task_id),
+                )
+            else:
+                # Timeout/crash path: task is already at ``ready`` via
+                # its own UPDATE. Just bookkeep the counter + last error.
+                conn.execute(
+                    "UPDATE tasks SET consecutive_failures = ?, "
+                    "last_failure_error = ? WHERE id = ?",
+                    (failures, error[:500], task_id),
+                )
+            if end_run:
+                # Spawn path: close the open run with outcome.
+                run_id = _end_run(
+                    conn, task_id,
+                    outcome=outcome, status=outcome,
+                    error=error[:500],
+                    metadata={"failures": failures},
+                )
+                _append_event(
+                    conn, task_id, outcome,
+                    {"error": error[:500], "failures": failures},
+                    run_id=run_id,
+                )
+            # Timeout/crash path's caller already emitted its own event.
+    return blocked
+
+
+# Backward-compat alias. Old name is referenced from tests and possibly
+# third-party callers. New code should call ``_record_task_failure``.
 def _record_spawn_failure(
     conn: sqlite3.Connection,
     task_id: str,
     error: str,
     *,
-    failure_limit: int = DEFAULT_SPAWN_FAILURE_LIMIT,
+    failure_limit: int = None,
 ) -> bool:
     """Release the claim, increment the failure counter, maybe auto-block.
 
     Returns True when the task was auto-blocked (N failures exceeded),
     False when it was just released back to ``ready`` for another try.
     """
-    blocked = False
-    with write_txn(conn):
-        row = conn.execute(
-            "SELECT spawn_failures FROM tasks WHERE id = ?", (task_id,),
-        ).fetchone()
-        failures = int(row["spawn_failures"]) + 1 if row else 1
-        if failures >= failure_limit:
-            conn.execute(
-                "UPDATE tasks SET status = 'blocked', claim_lock = NULL, "
-                "claim_expires = NULL, worker_pid = NULL, "
-                "spawn_failures = ?, last_spawn_error = ? "
-                "WHERE id = ? AND status IN ('running', 'ready')",
-                (failures, error[:500], task_id),
-            )
-            run_id = _end_run(
-                conn, task_id,
-                outcome="gave_up", status="gave_up",
-                error=error[:500],
-                metadata={"failures": failures},
-            )
-            _append_event(
-                conn, task_id, "gave_up",
-                {"failures": failures, "error": error[:500]},
-                run_id=run_id,
-            )
-            blocked = True
-        else:
-            conn.execute(
-                "UPDATE tasks SET status = 'ready', claim_lock = NULL, "
-                "claim_expires = NULL, worker_pid = NULL, "
-                "spawn_failures = ?, last_spawn_error = ? "
-                "WHERE id = ? AND status = 'running'",
-                (failures, error[:500], task_id),
-            )
-            run_id = _end_run(
-                conn, task_id,
-                outcome="spawn_failed", status="spawn_failed",
-                error=error[:500],
-                metadata={"failures": failures},
-            )
-            _append_event(
-                conn, task_id, "spawn_failed",
-                {"error": error[:500], "failures": failures},
-                run_id=run_id,
-            )
-    return blocked
+    return _record_task_failure(
+        conn, task_id, error,
+        outcome="spawn_failed",
+        failure_limit=failure_limit,
+        release_claim=True,
+        end_run=True,
+    )
 
 
 def _set_worker_pid(conn: sqlite3.Connection, task_id: str, pid: int) -> None:
@@ -2818,16 +2987,28 @@ def _set_worker_pid(conn: sqlite3.Connection, task_id: str, pid: int) -> None:
     _append_event(conn, task_id, "spawned", {"pid": int(pid)}, run_id=run_id)
 
 
-def _clear_spawn_failures(conn: sqlite3.Connection, task_id: str) -> None:
-    """Reset the failure counter after a successful spawn."""
+def _clear_failure_counter(conn: sqlite3.Connection, task_id: str) -> None:
+    """Reset the unified consecutive-failures counter.
+
+    Called from ``complete_task`` on successful completion — a fresh
+    success means the task + profile combination is working and any
+    past failures are history. NOT called on spawn success anymore:
+    a successful spawn proves the worker could start but says nothing
+    about whether the run will succeed, so we need to let timeouts and
+    crashes accumulate across spawn boundaries.
+    """
     with write_txn(conn):
         conn.execute(
-            "UPDATE tasks SET spawn_failures = 0, last_spawn_error = NULL "
-            "WHERE id = ?",
+            "UPDATE tasks SET consecutive_failures = 0, "
+            "last_failure_error = NULL WHERE id = ?",
            (task_id,),
        )
 
 
+# Legacy alias for test-code and anything else that still imports it.
+_clear_spawn_failures = _clear_failure_counter
+
+
 def has_spawnable_ready(conn: sqlite3.Connection) -> bool:
     """Return True iff there is at least one ready+assigned+unclaimed task
     whose assignee maps to a real Hermes profile.
@@ -2964,7 +3145,13 @@ def dispatch_once(
             pid = _spawn(claimed, str(workspace))
             if pid:
                 _set_worker_pid(conn, claimed.id, int(pid))
-                _clear_spawn_failures(conn, claimed.id)
+                # NOTE: we intentionally do NOT reset consecutive_failures
+                # here. A successful spawn proves the worker can start but
+                # doesn't prove the run will succeed. Under unified
+                # failure counting, resetting on spawn would let a task
+                # that keeps timing out after spawn loop forever. The
+                # counter is cleared only on successful completion (see
+                # complete_task).
                 result.spawned.append((claimed.id, claimed.assignee or "", str(workspace)))
                 spawned += 1
         except Exception as exc:
@@ -312,21 +312,57 @@ def _rule_prose_phantom_refs(task, events, runs, now, cfg) -> list[Diagnostic]:
     )]
 
 
-def _rule_repeated_spawn_failures(task, events, runs, now, cfg) -> list[Diagnostic]:
-    """Task's ``spawn_failures`` counter is climbing — worker can't
-    even start. Usually a profile misconfiguration (missing config.yaml,
-    bad PATH/venv, wrong credentials).
+def _rule_repeated_failures(task, events, runs, now, cfg) -> list[Diagnostic]:
+    """Task's unified ``consecutive_failures`` counter is climbing —
+    something about this task+profile combo is broken and each retry
+    fails the same way. Triggers regardless of the specific failure
+    mode (spawn error, timeout, crash) because operationally they
+    all look the same: the kernel keeps retrying and the operator
+    needs to intervene.
 
-    Threshold: cfg["spawn_failure_threshold"] (default 3).
+    Threshold: cfg["failure_threshold"] (default 3). A threshold of 3
+    is one below the circuit-breaker's default (5), so the diagnostic
+    surfaces BEFORE the breaker trips — giving operators a window to
+    fix the problem while the dispatcher's still retrying.
+
+    Accepts the legacy ``spawn_failure_threshold`` config key for
+    back-compat.
     """
-    threshold = int(cfg.get("spawn_failure_threshold", 3))
-    failures = _task_field(task, "spawn_failures", 0)
+    threshold = int(cfg.get(
+        "failure_threshold",
+        cfg.get("spawn_failure_threshold", 3),
+    ))
+    # Read the new unified counter name, with a fallback to the legacy
+    # column name so this rule keeps working against old DB rows the
+    # caller somehow materialised without running the migration.
+    failures = (
+        _task_field(task, "consecutive_failures", None)
+        if _task_field(task, "consecutive_failures", None) is not None
+        else _task_field(task, "spawn_failures", 0)
+    )
     if failures is None or failures < threshold:
         return []
-    last_err = _task_field(task, "last_spawn_error")
+    last_err = (
+        _task_field(task, "last_failure_error", None)
+        if _task_field(task, "last_failure_error", None) is not None
+        else _task_field(task, "last_spawn_error", None)
+    )
     assignee = _task_field(task, "assignee")
+
+    # Classify the most recent failure by peeking at run outcomes so
+    # the title + suggested action can be specific without a separate
+    # per-outcome rule.
+    ordered_runs = sorted(runs, key=lambda r: _task_field(r, "id", 0))
+    most_recent_outcome = None
+    for r in reversed(ordered_runs):
+        oc = _task_field(r, "outcome")
+        if oc in ("spawn_failed", "timed_out", "crashed"):
+            most_recent_outcome = oc
+            break
+
     actions: list[DiagnosticAction] = []
-    if assignee and assignee != "default":
+    if most_recent_outcome == "spawn_failed" and assignee and assignee != "default":
+        # Spawn is failing specifically — profile setup issue.
         actions.append(DiagnosticAction(
             kind="cli_hint",
             label=f"Verify profile: hermes -p {assignee} doctor",
@@ -338,28 +374,49 @@ def _rule_repeated_spawn_failures(task, events, runs, now, cfg) -> list[Diagnostic]:
             label=f"Fix profile auth: hermes -p {assignee} auth",
             payload={"command": f"hermes -p {assignee} auth"},
         ))
         actions.extend(_generic_recovery_actions(task, running=False))
+    elif most_recent_outcome in ("timed_out", "crashed"):
+        # Worker got off the ground but died. Logs are the right place
+        # to diagnose; reclaim/reassign are the recovery levers.
+        task_id = _task_field(task, "id")
+        if task_id:
+            actions.append(DiagnosticAction(
+                kind="cli_hint",
+                label=f"Check logs: hermes kanban log {task_id}",
+                payload={"command": f"hermes kanban log {task_id}"},
+                suggested=True,
+            ))
+        actions.extend(_generic_recovery_actions(
+            task, running=_task_field(task, "status") == "running",
+        ))
 
     severity = "critical" if failures >= threshold * 2 else "error"
     err_text = (last_err or "").strip() if last_err else ""
     err_snippet = err_text[:500] + ("…" if len(err_text) > 500 else "") if err_text else ""
+    outcome_label = {
+        "spawn_failed": "spawn",
+        "timed_out": "timeout",
+        "crashed": "crash",
+    }.get(most_recent_outcome or "", "failure")
     if err_snippet:
-        title = f"Agent spawn failed {failures}x: {err_snippet.splitlines()[0][:160]}"
+        title = f"Agent {outcome_label} x{failures}: {err_snippet.splitlines()[0][:160]}"
         detail = (
-            f"The dispatcher tried to launch a worker {failures} times "
-            f"and failed every time. Full last error:\n\n{err_snippet}\n\n"
-            f"Common causes: missing config.yaml, bad venv/PATH, or "
-            f"missing credentials for the profile's configured provider."
+            f"This task has failed {failures} times in a row "
+            f"(most recent: {outcome_label}). Full last error:\n\n"
+            f"{err_snippet}\n\n"
+            f"The dispatcher will keep retrying until the consecutive-"
+            f"failures counter trips the circuit breaker (default 5), "
+            f"at which point the task auto-blocks. Fix the root cause "
+            f"and reclaim to retry."
         )
     else:
-        title = f"Agent spawn failed {failures}x (no error recorded)"
+        title = f"Agent {outcome_label} x{failures} (no error recorded)"
         detail = (
-            f"The dispatcher tried to launch a worker {failures} times "
-            f"and failed every time, but no error text was captured. "
-            f"Usually a profile configuration issue — check profile "
-            f"health with the suggested command."
+            f"This task has failed {failures} times in a row "
+            f"(most recent: {outcome_label}) but no error text was "
+            f"captured. Check the suggested command or the worker log."
        )
     return [Diagnostic(
-        kind="repeated_spawn_failures",
+        kind="repeated_failures",
         severity=severity,
         title=title,
         detail=detail,
@@ -367,7 +424,11 @@ def _rule_repeated_spawn_failures(task, events, runs, now, cfg) -> list[Diagnostic]:
         first_seen_at=now,
         last_seen_at=now,
         count=failures,
-        data={"spawn_failures": failures, "last_spawn_error": last_err},
+        data={
+            "consecutive_failures": failures,
+            "most_recent_outcome": most_recent_outcome,
+            "last_error": last_err,
+        },
     )]
 
 
@@ -378,7 +439,23 @@ def _rule_repeated_crashes(task, events, runs, now, cfg) -> list[Diagnostic]:
     broken (OOM, missing dependency, tool it needs is down).
 
     Threshold: cfg["crash_threshold"] (default 2).
+
+    Narrower than ``repeated_failures`` — fires earlier (2 crashes vs 3
+    total failures) so the operator gets a crash-specific heads-up
+    before the unified rule kicks in. Suppresses itself when the
+    unified rule is also about to fire, to avoid double-flagging.
     """
+    failure_threshold = int(cfg.get(
+        "failure_threshold",
+        cfg.get("spawn_failure_threshold", 3),
+    ))
+    unified_counter = (
+        _task_field(task, "consecutive_failures", 0) or 0
+    )
+    # Unified rule will catch this — let it handle to avoid double fire.
+    if unified_counter >= failure_threshold:
+        return []
+
     threshold = int(cfg.get("crash_threshold", 2))
     ordered = sorted(runs, key=lambda r: _task_field(r, "id", 0))
     # Count trailing consecutive 'crashed' outcomes.
@@ -498,7 +575,7 @@ def _rule_stuck_in_blocked(task, events, runs, now, cfg) -> list[Diagnostic]:
 _RULES: list[RuleFn] = [
     _rule_hallucinated_cards,
     _rule_prose_phantom_refs,
-    _rule_repeated_spawn_failures,
+    _rule_repeated_failures,
     _rule_repeated_crashes,
     _rule_stuck_in_blocked,
 ]
@@ -509,13 +586,15 @@ _RULES: list[RuleFn] = [
 DIAGNOSTIC_KINDS = (
     "hallucinated_cards",
     "prose_phantom_refs",
-    "repeated_spawn_failures",
+    "repeated_failures",
     "repeated_crashes",
     "stuck_in_blocked",
 )
 
 
 DEFAULT_CONFIG = {
+    "failure_threshold": 3,
+    # Legacy alias accepted at read time by _rule_repeated_failures.
     "spawn_failure_threshold": 3,
     "crash_threshold": 2,
     "blocked_stale_hours": 24,