fix(kanban): suppress dispatcher stuck-warn when ready queue holds only non-spawnable assignees

After PR #20105 (dispatcher skips ready tasks whose assignee fails
``profile_exists()`` to prevent the orion-cc/orion-research crash
loop), the gateway and CLI emit a spurious "kanban dispatcher stuck:
ready queue non-empty for N consecutive ticks but 0 workers spawned"
warning every 5 minutes on multi-lane setups where the queue is
steadily full of human-pulled work assigned to terminal lanes.

The warning is intended to catch real failure modes (broken PATH,
missing venv, credential loss for a real Hermes profile). On a
multi-lane host it fires forever even though everything is healthy:
the dispatcher correctly chose not to spawn, and there is nothing
for the operator to fix.
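
A rough sketch of the health check described above, for illustration
only. ``StuckDetector``, ``threshold`` and ``observe`` are hypothetical
names, and the threshold value is an assumption; the gateway's real
counter and warn interval are not reproduced here. The point is that the
streak should only grow when spawnable work is waiting:

```python
from dataclasses import dataclass


@dataclass
class StuckDetector:
    """Hypothetical tick counter; not the gateway's real implementation."""

    threshold: int = 3  # assumed value, for illustration only
    streak: int = 0

    def observe(self, spawnable_ready: bool, spawned: int) -> bool:
        """Record one dispatcher tick; return True when a warning is due."""
        if spawnable_ready and spawned == 0:
            self.streak += 1
        else:
            # Either a worker was started or nothing spawnable is waiting.
            # Both count as healthy, so the streak resets.
            self.streak = 0
        return self.streak >= self.threshold


detector = StuckDetector(threshold=3)
# Queue holds only terminal-lane tasks: the detector never warns.
idle = [detector.observe(spawnable_ready=False, spawned=0) for _ in range(5)]
# Spawnable work waits for three ticks with no spawn: warns on the third.
stuck = [detector.observe(spawnable_ready=True, spawned=0) for _ in range(3)]
```

With a probe that reports only spawnable work, a queue full of
terminal-lane tasks keeps the streak at zero.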

Changes:

* ``DispatchResult`` gains a ``skipped_nonspawnable`` field
  (separate from ``skipped_unassigned``) so callers can distinguish
  "task missing an owner — operator should route it" from "task
  owned by a control-plane lane — terminal will pull it".
* ``dispatch_once`` routes the ``not profile_exists(assignee)`` skip
  into the new bucket (was lumped into ``skipped_unassigned``).
* New helper ``has_spawnable_ready(conn)`` returns True iff at least
  one ready+assigned+unclaimed task in the DB has an assignee that
  maps to a real Hermes profile. Falls back to legacy "any
  ready+assigned" when ``profile_exists`` is unimportable so degraded
  installs still surface the original warn.
* The gateway dispatcher (``gateway/run.py``) and the CLI standalone
  daemon (``hermes_cli/kanban.py``) both switch their cheap
  ``ready_nonempty`` probe to ``has_spawnable_ready``. The stuck-warn
  now fires only when there is genuine spawnable work the dispatcher
  failed to start.
* CLI dispatch output prints ``Skipped (non-spawnable assignee —
  terminal lane, OK)`` for visibility without alarm.
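
The probe and the bucketing rule can be tried against an in-memory
database. This is a minimal sketch under stated assumptions: the
``tasks`` schema below contains only the columns the probe's query
touches, ``profile_exists`` is a local stand-in for
``hermes_cli.profiles.profile_exists``, and ``REAL_PROFILES``,
``bucket_for`` and ``make_db`` are hypothetical helpers that do not exist
in the codebase.

```python
import sqlite3
from typing import Iterable, Optional

# Assumption: stand-in for hermes_cli.profiles.profile_exists.
REAL_PROFILES = {"researcher"}  # hypothetical profile name


def profile_exists(name: str) -> bool:
    return name in REAL_PROFILES


def has_spawnable_ready(conn: sqlite3.Connection) -> bool:
    """True iff some ready, assigned, unclaimed task has a real profile."""
    rows = conn.execute(
        "SELECT DISTINCT assignee FROM tasks "
        "WHERE status = 'ready' AND assignee IS NOT NULL "
        "AND claim_lock IS NULL"
    ).fetchall()
    return any(profile_exists(row["assignee"]) for row in rows)


def bucket_for(assignee: Optional[str]) -> str:
    """Simplified view of which DispatchResult list a ready task lands in."""
    if assignee is None:
        return "skipped_unassigned"  # operator should route it
    if not profile_exists(assignee):
        return "skipped_nonspawnable"  # terminal lane will pull it
    return "spawned"


def make_db(assignees: Iterable[Optional[str]]) -> sqlite3.Connection:
    """Build a throwaway DB with one ready, unclaimed task per assignee."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allows row["assignee"] lookups
    conn.execute(
        "CREATE TABLE tasks ("
        "id TEXT PRIMARY KEY, status TEXT, assignee TEXT, claim_lock TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks (id, status, assignee, claim_lock) "
        "VALUES (?, 'ready', ?, NULL)",
        [(f"t{i}", who) for i, who in enumerate(assignees)],
    )
    return conn


terminal_only = has_spawnable_ready(make_db(["orion-cc", "orion-research"]))
mixed = has_spawnable_ready(make_db(["orion-cc", "researcher"]))
```

Here ``terminal_only`` is False (correctly idle) and ``mixed`` is True
(a zero-spawn tick would count toward the stuck-warn).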

Tests:

* New ``has_spawnable_ready`` cases (empty queue, terminal-lane
  only, mixed real+terminal).
* New ``test_dispatch_skips_nonspawnable_into_separate_bucket``
  verifies the bucketing change.
* Updated ``test_dispatch_skips_unassigned`` to assert no
  cross-leak.
* Added ``all_assignees_spawnable`` fixture in
  ``tests/hermes_cli/conftest.py`` and threaded it through dispatcher
  tests that use synthetic assignees ("alice", "bob"). PR #20105
  (the parent commit) silently broke 8 such tests: those assignees
  fail ``profile_exists()``, so the dispatcher skipped them instead
  of spawning. This PR repairs them as part of the same code area.
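
The fixture body is not part of the diff shown here, so the following is
only a guess at the idea behind it: temporarily make every assignee look
like a real profile. It uses a context manager and a ``SimpleNamespace``
stand-in for ``hermes_cli.profiles`` so it runs without pytest or the
Hermes package; in ``conftest.py`` the same patch would presumably be
wrapped in a pytest fixture. ``profiles`` and ``is_spawnable`` are
hypothetical names.

```python
import types
from contextlib import contextmanager
from unittest import mock

# Assumption: stand-in for the hermes_cli.profiles module, where only a
# profile named "default" exists.
profiles = types.SimpleNamespace(
    profile_exists=lambda name: name == "default"
)


def is_spawnable(assignee: str) -> bool:
    """Look the function up at call time so patches take effect."""
    return profiles.profile_exists(assignee)


@contextmanager
def all_assignees_spawnable():
    """Treat every assignee as a real profile for the duration of a test."""
    with mock.patch.object(profiles, "profile_exists", new=lambda name: True):
        yield


before = is_spawnable("alice")
with all_assignees_spawnable():
    during = is_spawnable("alice")
after = is_spawnable("alice")
```

The patch is undone on exit, so tests that rely on the real lookup are
unaffected.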

Verified locally: 246/246 kanban-suite tests pass.

Stacks on top of fix/kanban-dispatcher-skip-missing-profile-2026-05-05
(PR #20105). Reviewer: this PR is meant to merge AFTER #20105.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brecht-H 2026-05-05 09:43:06 +00:00 committed by Teknium
parent ca5595fe7b
commit f25d3ec917
6 changed files with 152 additions and 25 deletions


@@ -1274,6 +1274,7 @@ def _cmd_dispatch(args: argparse.Namespace) -> int:
for (tid, who, ws) in res.spawned
],
"skipped_unassigned": res.skipped_unassigned,
"skipped_nonspawnable": res.skipped_nonspawnable,
}, indent=2))
return 0
print(f"Reclaimed: {res.reclaimed}")
@@ -1293,6 +1294,11 @@ def _cmd_dispatch(args: argparse.Namespace) -> int:
print(f" - {tid} -> {who} @ {ws or '-'}{tag}")
if res.skipped_unassigned:
print(f"Skipped (unassigned): {', '.join(res.skipped_unassigned)}")
if res.skipped_nonspawnable:
print(
f"Skipped (non-spawnable assignee — terminal lane, OK): "
f"{', '.join(res.skipped_nonspawnable)}"
)
return 0
@@ -1404,16 +1410,18 @@ def _cmd_daemon(args: argparse.Namespace) -> int:
)
def _ready_queue_nonempty() -> bool:
-"""Cheap SELECT — just asks whether there's at least one ready
-task with an assignee that the dispatcher could have picked up."""
+"""Cheap probe — is there at least one ready+assigned+unclaimed
+task whose assignee maps to a real Hermes profile (i.e. one the
+dispatcher would actually try to spawn for)?
+Filters out tasks assigned to control-plane lanes
+(e.g. ``orion-cc``, ``orion-research``) that are pulled by
+terminals via ``claim_task`` directly; those are correctly idle
+from the dispatcher's perspective, not stuck.
+"""
try:
with kb.connect() as conn:
-row = conn.execute(
-"SELECT 1 FROM tasks "
-"WHERE status = 'ready' AND assignee IS NOT NULL "
-" AND claim_lock IS NULL LIMIT 1"
-).fetchone()
-return row is not None
+return kb.has_spawnable_ready(conn)
except Exception:
return False


@@ -2118,6 +2118,15 @@ class DispatchResult:
spawned: list[tuple[str, str, str]] = field(default_factory=list)
"""List of ``(task_id, assignee, workspace_path)`` triples."""
skipped_unassigned: list[str] = field(default_factory=list)
"""Ready task ids skipped because they have no assignee at all.
Operator-actionable: usually a misfiled task waiting for routing."""
skipped_nonspawnable: list[str] = field(default_factory=list)
"""Ready task ids skipped because their assignee names a control-plane
lane (a Claude Code terminal like ``orion-cc``) rather than a Hermes
profile. Expected steady-state on multi-lane setups; NOT an
operator-actionable failure. Tracked separately so health telemetry
can distinguish "real stuck" (nothing spawned but spawnable work
available) from "correctly idle" (nothing spawnable in the queue)."""
crashed: list[str] = field(default_factory=list)
"""Task ids reclaimed because their worker PID disappeared."""
auto_blocked: list[str] = field(default_factory=list)
@@ -2459,6 +2468,38 @@ def _clear_spawn_failures(conn: sqlite3.Connection, task_id: str) -> None:
)
def has_spawnable_ready(conn: sqlite3.Connection) -> bool:
"""Return True iff there is at least one ready+assigned+unclaimed task
whose assignee maps to a real Hermes profile.
Used by the gateway- and CLI-embedded dispatchers' health telemetry to
decide whether ``0 spawned`` is a "stuck" condition (real spawnable
work waiting) or a "correctly idle" condition (only control-plane
lanes like ``orion-cc`` / ``orion-research`` waiting on terminals
that pull tasks via ``claim_task`` directly).
Falls back to "any ready+assigned" if ``profile_exists`` is not
importable (e.g. partial install); preserves the old behavior so
the warning still fires in degraded environments.
"""
rows = conn.execute(
"SELECT DISTINCT assignee FROM tasks "
"WHERE status = 'ready' AND assignee IS NOT NULL "
" AND claim_lock IS NULL"
).fetchall()
if not rows:
return False
try:
from hermes_cli.profiles import profile_exists # local import: avoids cycle
except Exception:
# Can't introspect — assume spawnable, preserve legacy behavior.
return True
for row in rows:
if profile_exists(row["assignee"]):
return True
return False
def dispatch_once(
conn: sqlite3.Connection,
*,
@@ -2521,7 +2562,13 @@ def dispatch_once(
except Exception:
profile_exists = None # type: ignore[assignment]
if profile_exists is not None and not profile_exists(row["assignee"]):
-result.skipped_unassigned.append(row["id"])
+# Bucket separately from skipped_unassigned: the operator
+# cannot fix this by assigning a profile (the assignee IS the
+# intended owner — a terminal lane). Health telemetry uses
+# this distinction to suppress spurious "stuck" warnings on
+# multi-lane setups where the ready queue is steadily full
+# of human-pulled work.
+result.skipped_nonspawnable.append(row["id"])
continue
if dry_run:
result.spawned.append((row["id"], row["assignee"], ""))