hermes-agent/tests/hermes_cli/test_kanban_diagnostics.py
Teknium f67063ba81
feat(kanban): generic diagnostics engine for task distress signals (#20332)
* feat(kanban): generic diagnostics engine for task distress signals

Replaces the hallucination-specific ``warnings`` / ``RecoverySection``
surface (shipped in PR #20232) with a reusable diagnostic-rule engine
that covers five distress kinds in v1 and can be extended without
touching UI code. The "something's wrong with this task" signal is
no longer limited to phantom card ids.

Closes the follow-up from #20232 discussion.

New module
----------
``hermes_cli/kanban_diagnostics.py`` — stateless, no-side-effect rule
engine. Each rule is a pure function of
``(task, events, runs, now, config) -> list[Diagnostic]``. Registry
is a simple list; adding a new distress kind is one function + one
import, no UI or API changes required.
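
A minimal sketch of the shapes involved (the rule signature, the
registry list, and the ``Diagnostic`` / ``Action`` fields that the
tests below assert on; treat the dataclass layout and defaults as
illustrative rather than the exact shipped code):

  from dataclasses import dataclass, field
  import time


  @dataclass
  class Action:
      kind: str             # reclaim | reassign | unblock | cli_hint | comment | open_docs
      label: str
      suggested: bool = False


  @dataclass
  class Diagnostic:
      kind: str             # e.g. "repeated_spawn_failures"
      severity: str         # "warning" | "error" | "critical"
      title: str
      detail: str = ""
      data: dict = field(default_factory=dict)
      actions: list = field(default_factory=list)


  _RULES = []  # one entry per distress kind; extend this list to add a rule


  def compute_task_diagnostics(task, events, runs, now=None, config=None):
      now = int(now if now is not None else time.time())
      diags = []
      for rule in _RULES:
          try:
              diags.extend(rule(task, events, runs, now, config))
          except Exception:
              continue  # a broken rule is isolated; the other rules still run
      order = {"critical": 0, "error": 1, "warning": 2}
      diags.sort(key=lambda d: order.get(d.severity, 99))
      return diags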

v1 rule set
-----------
* ``hallucinated_cards`` (error) — folds the existing
  ``completion_blocked_hallucination`` event into the new surface.
* ``prose_phantom_refs`` (warning) — folds
  ``suspected_hallucinated_references``.
* ``repeated_spawn_failures`` (error → critical at 2x threshold) —
  fires when ``tasks.spawn_failures >= 3``; suggests
  ``hermes -p <profile> doctor`` / ``auth``.
* ``repeated_crashes`` (error → critical) — fires after N consecutive
  ``crashed`` run outcomes with no successful completion between;
  suggests ``hermes kanban log <id>``.
* ``stuck_in_blocked`` (warning) — fires after 24h in ``blocked``
  state with no comments / unblock attempts; suggests commenting.
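
To illustrate how small a rule stays, the spawn-failure rule could look
roughly like this (thresholds and wording follow the bullets above; the
body reuses the ``Diagnostic`` / ``Action`` shapes sketched earlier and
is a sketch, not the shipped implementation):

  def _rule_repeated_spawn_failures(task, events, runs, now, config):
      # Mapping access shown for brevity; the real rules also handle
      # dataclass tasks in addition to dicts and sqlite3.Row objects.
      failures = task["spawn_failures"] or 0
      threshold = 3
      if failures < threshold:
          return []
      severity = "critical" if failures >= 2 * threshold else "error"
      err = (task["last_spawn_error"] or "no error recorded").strip()
      first_line = err.splitlines()[0] if err else "no error recorded"
      return [Diagnostic(
          kind="repeated_spawn_failures",
          severity=severity,
          title=f"Agent spawn failed {failures}x: {first_line[:160]}",
          detail=err[:500],
          data={"spawn_failures": failures},
          actions=[
              Action("cli_hint", "hermes -p <profile> doctor", suggested=True),
              Action("reassign", "Reassign"),
              Action("reclaim", "Reclaim"),
          ],
      )]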

Every diagnostic carries structured ``actions`` (reclaim, reassign,
unblock, cli_hint, comment, open_docs) that render consistently in
both CLI and dashboard. Suggested actions are highlighted; generic
recovery actions (reclaim / reassign) are available on every kind as
fallbacks.

Diagnostics auto-clear when the underlying failure resolves — a
clean ``completed``/``edited`` event drops hallucination diagnostics,
a successful run drops crash diagnostics, a comment drops
stuck-blocked diagnostics. Audit events persist; the badge goes away.
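
The clearing behaviour falls out of rules being pure functions over the
full event history rather than any explicit "dismiss" state. Sketched
for the hallucination rule (the event kinds are the real ones; the walk
and the title string are illustrative, and payloads are assumed to be
already-decoded dicts):

  def _rule_hallucinated_cards(task, events, runs, now, config):
      active = None
      for ev in events:  # chronological order
          if ev["kind"] == "completion_blocked_hallucination":
              active = ev
          elif ev["kind"] in ("completed", "edited"):
              active = None  # a later clean completion/edit drops the badge
      if active is None:
          return []
      payload = active["payload"] or {}
      return [Diagnostic(
          kind="hallucinated_cards",
          severity="error",
          title="Completion blocked: referenced card ids do not exist",
          data={"phantom_ids": payload.get("phantom_cards", [])},
          actions=[Action("comment", "Comment", suggested=True),
                   Action("reassign", "Reassign"),
                   Action("reclaim", "Reclaim")],
      )]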

API
---
``plugin_api.py``:
* ``/board`` now attaches ``diagnostics`` (full list) and
  ``warnings`` (compact summary with ``highest_severity``) per task.
* ``/tasks/{id}`` attaches diagnostics so the drawer's Diagnostics
  section auto-opens on flagged tasks.
* NEW ``/diagnostics`` endpoint — fleet-wide listing, filterable by
  severity, sorted critical-first.
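
For orientation, a flagged task in the ``/board`` response carries
something like the following, shown as a Python literal (values are
invented; only the ``diagnostics`` / ``warnings`` / ``highest_severity``
names and the old ``count`` / ``kinds`` / ``latest_at`` keys come from
this PR, and the kind-to-count mapping inside ``kinds`` is an
assumption):

  {
      "id": "t_1a2b3c4d",
      "status": "blocked",
      "diagnostics": [
          {"kind": "repeated_spawn_failures",
           "severity": "error",
           "title": "Agent spawn failed 3x: ...",
           "detail": "...",
           "data": {"spawn_failures": 3},
           "actions": [{"kind": "cli_hint",
                        "label": "hermes -p <profile> doctor",
                        "suggested": True}]},
      ],
      "warnings": {"count": 1,
                   "kinds": {"repeated_spawn_failures": 1},
                   "latest_at": 1700000000,
                   "highest_severity": "error"},
  }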

CLI
---
* NEW ``hermes kanban diagnostics [--severity X] [--task id]
  [--json]`` — fleet view or single-task view, matches dashboard rule
  output so CLI users see the same picture.
* ``hermes kanban show <id>`` now renders a Diagnostics section near
  the top with severity markers + suggested actions.

Dashboard
---------
* Card badge is severity-coloured (⚠ amber warning, !! orange error,
  !!! red critical) using ``warnings.highest_severity``.
* Attention strip above the toolbar counts EVERY task with active
  diagnostics (not just hallucinations), severity-coloured, lists
  affected tasks with Open buttons when expanded.
* Drawer's old ``RecoverySection`` replaced with generic
  ``DiagnosticsSection`` rendering a card per active diagnostic:
  title + detail + structured data (task-id chips when payload keys
  look like id lists) + action buttons. Reassign profile picker is
  inline per-diagnostic. Clipboard fallback uses ``.catch()`` for
  environments where writeText rejects.
* Three-rung severity palette: amber for warning, orange for error,
  red for critical. Uses CSS variables so theming is straightforward.

Tests
-----
* NEW ``tests/hermes_cli/test_kanban_diagnostics.py`` — 14 unit tests
  covering each rule's positive/negative/threshold paths, severity
  sorting, broken-rule isolation, and sqlite3.Row integration.
* Dashboard plugin tests extended: ``/diagnostics`` endpoint (empty,
  populated, severity-filtered), ``/board`` exposes both diagnostic
  list and compact summary with ``highest_severity``.
* Existing hallucination-specific test (``test_board_surfaces_
  warnings_field_for_hallucinated_completions``) updated to reflect
  the new contract: warning summary keys by diagnostic kind
  (``hallucinated_cards``) not event kind.

379 kanban-suite tests pass (+16 net from this PR).

Live verification
-----------------
Seeded all 5 diagnostic kinds + one clean + one plain-running task
(7 total) into an isolated HERMES_HOME, spun up the dashboard, and
verified:

* Attention strip: shows ``!! 5 tasks need attention`` in the
  error-severity orange; Show expands to a list of 5 rows ordered
  critical > error > warning.
* Card badges: error tasks render ``!!`` orange, warning tasks
  render ``⚠`` amber, clean and plain-running tasks render no badge.
* Each of the 5 rules opens a correctly-coloured, correctly-styled
  diagnostic card in the drawer with its specific suggested action.
* Live reassign from a diagnostic card flipped
  ``broken-ml-worker → alice`` and the drawer refreshed with the
  new assignee + the same diagnostic still firing (correct:
  spawn_failures counter hasn't reset yet).
* CLI ``hermes kanban diagnostics`` prints all 5 in severity order;
  ``--severity error`` narrows to 3; ``kanban show <id>`` includes
  the Diagnostics block at the top with suggested action hint.

Migration note
--------------
The old ``warnings`` shape (``{count, kinds, latest_at}``) is
preserved on the API but ``kinds`` now keys by diagnostic kind
(``hallucinated_cards``) instead of event kind
(``completion_blocked_hallucination``). ``highest_severity`` is a
new required field. The dashboard was the only consumer and has
been updated in the same commit; external API consumers of the
``warnings`` field will need to update their kind-match logic.
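
Concretely (values illustrative; assuming ``kinds`` maps kind to count):

  # before (PR #20232): keyed by event kind
  {"count": 2,
   "kinds": {"completion_blocked_hallucination": 2},
   "latest_at": 1700000000}

  # after this PR: keyed by diagnostic kind, plus the new required field
  {"count": 2,
   "kinds": {"hallucinated_cards": 2},
   "latest_at": 1700000000,
   "highest_severity": "error"}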

* feat(kanban/diagnostics): lead titles with the actual error text

The generic 'Worker crashed N runs in a row' / 'Worker failed to spawn
N times' titles buried the actual cause in the data section. Operators
had to open logs or expand the diagnostic to see WHY the worker is
stuck — rate-limit vs insufficient quota vs bad auth vs context
overflow vs network blip all looked identical at a glance.

New titles:

  Agent crashed 3x: openai: 429 Too Many Requests - rate limit reached
  Agent crashed 3x: anthropic: 402 insufficient_quota - credit balance
  Agent crashed 3x: provider auth error: 401 Unauthorized
  Agent spawn failed 4x: insufficient_quota: You exceeded your current

Detail keeps the full error snippet (capped at 500 chars + ellipsis
for tracebacks). Title takes the first line capped at 160 chars.
Fallback title if no error recorded stays honest ('no error recorded').
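
A sketch of the capping (the helper name is made up; only the 160/500
character limits and the fallback string come from this change):

  def _title_and_detail(error):
      if not error or not error.strip():
          return "no error recorded", ""
      first_line = error.strip().splitlines()[0]
      title = first_line[:160]                      # one line, <= 160 chars
      detail = error if len(error) <= 500 else error[:500] + "…"
      return title, detail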

Tests: 4 new cases covering 429/billing/spawn/truncation. 383 total
pass (+4).

Live-verified on dashboard with 6 seeded scenarios
(rate-limit, billing, auth, context, network, spawn-billing) —
each card title leads with the actionable error text.
2026-05-05 13:32:42 -07:00


"""Tests for hermes_cli.kanban_diagnostics — rule-engine that produces
structured distress signals (diagnostics) for kanban tasks.
These tests exercise each rule in isolation using minimal in-memory
task/event/run fixtures (no DB) plus a few integration-style cases
that round-trip through the real kanban_db to make sure the rule
engine works on sqlite3.Row objects as well as dataclasses.
"""
from __future__ import annotations
import time
from pathlib import Path
import pytest
from hermes_cli import kanban_db as kb
from hermes_cli import kanban_diagnostics as kd
# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------
@pytest.fixture
def kanban_home(tmp_path, monkeypatch):
    home = tmp_path / ".hermes"
    home.mkdir()
    monkeypatch.setenv("HERMES_HOME", str(home))
    monkeypatch.setattr(Path, "home", lambda: tmp_path)
    kb.init_db()
    return home


def _task(**overrides):
    base = {
        "id": "t_demo00",
        "title": "demo task",
        "assignee": "demo",
        "status": "ready",
        "spawn_failures": 0,
        "last_spawn_error": None,
    }
    base.update(overrides)
    return base


def _event(kind, ts=None, **payload):
    return {
        "kind": kind,
        "created_at": int(ts if ts is not None else time.time()),
        "payload": payload or None,
    }


def _run(outcome="completed", run_id=1, error=None):
    return {
        "id": run_id,
        "outcome": outcome,
        "error": error,
    }

# ---------------------------------------------------------------------------
# Each rule — positive + negative + clearing
# ---------------------------------------------------------------------------
def test_hallucinated_cards_fires_on_blocked_event():
    task = _task(status="ready")
    events = [
        _event("created", ts=100),
        _event("completion_blocked_hallucination", ts=200,
               phantom_cards=["t_bad1", "t_bad2"],
               verified_cards=["t_good1"]),
    ]
    diags = kd.compute_task_diagnostics(task, events, [])
    assert len(diags) == 1
    d = diags[0]
    assert d.kind == "hallucinated_cards"
    assert d.severity == "error"
    assert d.data["phantom_ids"] == ["t_bad1", "t_bad2"]
    # Generic recovery actions always available; comment action too.
    kinds = [a.kind for a in d.actions]
    assert "comment" in kinds
    assert "reassign" in kinds


def test_hallucinated_cards_clears_on_subsequent_completion():
    task = _task(status="done")
    events = [
        _event("completion_blocked_hallucination", ts=100, phantom_cards=["t_x"]),
        _event("completed", ts=200, summary="retry worked"),
    ]
    diags = kd.compute_task_diagnostics(task, events, [])
    assert diags == []


def test_prose_phantom_refs_fires_after_clean_completion():
    # Prose scan emits its event AFTER the completed event in the DB
    # path, but a subsequent clean completion clears it. Phantom id
    # must be valid hex — the scanner regex is ``t_[a-f0-9]{8,}``.
    task = _task(status="done")
    events = [
        _event("completed", ts=100, summary="referenced t_bad", result_len=0),
        _event("suspected_hallucinated_references", ts=101,
               phantom_refs=["t_deadbeef99"], source="completion_summary"),
    ]
    diags = kd.compute_task_diagnostics(task, events, [])
    assert len(diags) == 1
    assert diags[0].kind == "prose_phantom_refs"
    assert diags[0].severity == "warning"
    assert diags[0].data["phantom_refs"] == ["t_deadbeef99"]


def test_prose_phantom_refs_clears_on_later_clean_edit():
    task = _task(status="done")
    events = [
        _event("completed", ts=100, summary="bad"),
        _event("suspected_hallucinated_references", ts=101,
               phantom_refs=["t_ffff0000cc"]),
        _event("edited", ts=200, fields=["result", "summary"]),
    ]
    diags = kd.compute_task_diagnostics(task, events, [])
    assert diags == []


def test_repeated_spawn_failures_fires_at_threshold():
    task = _task(status="blocked", spawn_failures=3,
                 last_spawn_error="Profile 'debugger' does not exist")
    diags = kd.compute_task_diagnostics(task, [], [])
    assert len(diags) == 1
    d = diags[0]
    assert d.kind == "repeated_spawn_failures"
    assert d.severity == "error"
    # CLI hints are what operators actually need here.
    suggested = [a.label for a in d.actions if a.suggested]
    assert any("doctor" in s for s in suggested)


def test_repeated_spawn_failures_escalates_to_critical():
    task = _task(spawn_failures=6, last_spawn_error="boom")
    diags = kd.compute_task_diagnostics(task, [], [])
    assert diags[0].severity == "critical"


def test_repeated_spawn_failures_below_threshold_silent():
    task = _task(spawn_failures=2)
    assert kd.compute_task_diagnostics(task, [], []) == []


def test_repeated_crashes_counts_trailing_streak_only():
    task = _task(status="ready", assignee="crashy")
    runs = [
        _run(outcome="completed", run_id=1),
        _run(outcome="crashed", run_id=2, error="OOM"),
        _run(outcome="crashed", run_id=3, error="OOM again"),
    ]
    diags = kd.compute_task_diagnostics(task, [], runs)
    assert len(diags) == 1
    d = diags[0]
    assert d.kind == "repeated_crashes"
    # 2 consecutive crashes at the end → default threshold 2 → error severity.
    assert d.severity == "error"
    assert d.data["consecutive_crashes"] == 2


def test_repeated_crashes_breaks_on_recent_success():
    task = _task(status="ready", assignee="fixed")
    runs = [
        _run(outcome="crashed", run_id=1),
        _run(outcome="crashed", run_id=2),
        _run(outcome="completed", run_id=3),
    ]
    assert kd.compute_task_diagnostics(task, [], runs) == []


def test_repeated_crashes_escalates_on_many_crashes():
    task = _task(status="ready", assignee="x")
    runs = [_run(outcome="crashed", run_id=i) for i in range(1, 6)]  # 5 in a row
    diags = kd.compute_task_diagnostics(task, [], runs)
    assert diags[0].severity == "critical"


def test_stuck_in_blocked_fires_past_threshold():
    now = int(time.time())
    task = _task(status="blocked")
    events = [
        _event("blocked", ts=now - 3600 * 48, reason="needs approval"),
    ]
    diags = kd.compute_task_diagnostics(
        task, events, [], now=now,
    )
    assert len(diags) == 1
    d = diags[0]
    assert d.kind == "stuck_in_blocked"
    assert d.severity == "warning"
    assert d.data["age_hours"] >= 48


def test_stuck_in_blocked_silent_with_recent_comment():
    now = int(time.time())
    task = _task(status="blocked")
    events = [
        _event("blocked", ts=now - 3600 * 48),
        _event("commented", ts=now - 3600 * 2, author="human"),
    ]
    assert kd.compute_task_diagnostics(task, events, [], now=now) == []


def test_stuck_in_blocked_silent_when_not_blocked():
    task = _task(status="ready")
    events = [_event("blocked", ts=1000)]
    assert kd.compute_task_diagnostics(task, events, [], now=9999999) == []


def test_repeated_crashes_surfaces_actual_error_in_title():
    """The title should lead with the actual error text so operators
    see WHAT broke (e.g. rate-limit, auth, OOM) without opening logs.
    """
    task = _task(status="ready", assignee="x")
    runs = [
        _run(outcome="crashed", run_id=1, error="openai: 429 Too Many Requests"),
        _run(outcome="crashed", run_id=2, error="openai: 429 Too Many Requests"),
    ]
    diags = kd.compute_task_diagnostics(task, [], runs)
    assert len(diags) == 1
    d = diags[0]
    assert "429" in d.title
    assert "Too Many Requests" in d.title
    # Full error in detail.
    assert "429 Too Many Requests" in d.detail


def test_repeated_crashes_no_error_fallback_title():
    task = _task(status="ready", assignee="x")
    runs = [
        _run(outcome="crashed", run_id=1, error=None),
        _run(outcome="crashed", run_id=2, error=None),
    ]
    diags = kd.compute_task_diagnostics(task, [], runs)
    assert "no error recorded" in diags[0].title


def test_repeated_spawn_failures_surfaces_actual_error_in_title():
    task = _task(spawn_failures=5,
                 last_spawn_error="insufficient_quota: billing limit reached")
    diags = kd.compute_task_diagnostics(task, [], [])
    assert len(diags) == 1
    d = diags[0]
    assert "insufficient_quota" in d.title or "billing limit" in d.title
    assert "insufficient_quota" in d.detail


def test_repeated_crashes_truncates_huge_tracebacks():
    """Full Python tracebacks can be tens of KB. The title stays one
    line (≤160 chars); the detail caps at 500 chars + ellipsis so the
    card doesn't explode visually."""
    huge = "Traceback (most recent call last):\n" + (" File\n" * 500)
    task = _task(status="ready")
    runs = [
        _run(outcome="crashed", run_id=1, error=huge),
        _run(outcome="crashed", run_id=2, error=huge),
    ]
    diags = kd.compute_task_diagnostics(task, [], runs)
    d = diags[0]
    # Title only the first line, capped.
    assert "\n" not in d.title
    assert len(d.title) < 250
    # Detail contains the snippet with ellipsis.
    assert d.detail.endswith("…") or len(d.detail) < 700
# ---------------------------------------------------------------------------
# Severity sorting
# ---------------------------------------------------------------------------
def test_diagnostics_sorted_critical_first():
    """A task with both a critical (many spawn failures) and a warning
    (prose phantoms) diagnostic should list the critical one first."""
    task = _task(status="done", spawn_failures=10,
                 last_spawn_error="nope")
    events = [
        _event("completed", ts=100, summary="referenced t_missing"),
        _event("suspected_hallucinated_references", ts=101,
               phantom_refs=["t_missing11"]),
    ]
    diags = kd.compute_task_diagnostics(task, events, [])
    kinds = [d.kind for d in diags]
    assert kinds[0] == "repeated_spawn_failures"  # critical
    assert "prose_phantom_refs" in kinds
# ---------------------------------------------------------------------------
# Integration — runs through real kanban_db so sqlite.Row fields work
# ---------------------------------------------------------------------------
def test_engine_works_on_sqlite_row_objects(kanban_home):
    """Regression: the rule functions must handle sqlite3.Row (which
    supports mapping access but not attribute access and isn't a dict)
    as well as dataclass Task / plain dict. The API layer passes Row
    objects directly.
    """
    conn = kb.connect()
    try:
        parent = kb.create_task(conn, title="p", assignee="w")
        real = kb.create_task(conn, title="r", assignee="x", created_by="w")
        with pytest.raises(kb.HallucinatedCardsError):
            kb.complete_task(
                conn, parent,
                summary="with phantom", created_cards=[real, "t_deadbeef1"],
            )
        # Pull Row objects the way the API helper does.
        row = conn.execute(
            "SELECT * FROM tasks WHERE id = ?", (parent,),
        ).fetchone()
        events = list(conn.execute(
            "SELECT * FROM task_events WHERE task_id = ? ORDER BY id",
            (parent,),
        ).fetchall())
        runs = list(conn.execute(
            "SELECT * FROM task_runs WHERE task_id = ? ORDER BY id",
            (parent,),
        ).fetchall())
        diags = kd.compute_task_diagnostics(row, events, runs)
        assert len(diags) == 1
        assert diags[0].kind == "hallucinated_cards"
        assert "t_deadbeef1" in diags[0].data["phantom_ids"]
    finally:
        conn.close()
# ---------------------------------------------------------------------------
# Error-tolerance: a broken rule shouldn't 500 the whole compute call
# ---------------------------------------------------------------------------
def test_broken_rule_is_isolated(monkeypatch):
    def _bad_rule(task, events, runs, now, cfg):
        raise RuntimeError("synthetic rule bug")

    # Insert a broken rule at the front of the registry; subsequent
    # rules should still run and produce their diagnostics.
    monkeypatch.setattr(kd, "_RULES", [_bad_rule] + kd._RULES)
    task = _task(spawn_failures=5, last_spawn_error="e")
    diags = kd.compute_task_diagnostics(task, [], [])
    # The broken rule silently drops, the real one still fires.
    kinds = [d.kind for d in diags]
    assert "repeated_spawn_failures" in kinds