mirror of https://github.com/NousResearch/hermes-agent.git (synced 2026-05-12 03:42:08 +00:00)
fix(kanban): suppress dispatcher stuck-warn when ready queue holds only non-spawnable assignees
After PR #20105 (dispatcher skips ready tasks whose assignee fails ``profile_exists()`` to prevent the orion-cc/orion-research crash loop), the gateway and CLI emit a spurious "kanban dispatcher stuck: ready queue non-empty for N consecutive ticks but 0 workers spawned" warning every 5 minutes on multi-lane setups where the queue is steadily full of human-pulled work assigned to terminal lanes.

The warn is intended to catch real failure modes (broken PATH, missing venv, credential loss for a real Hermes profile). On a multi-lane host it fires forever even though everything is healthy: the dispatcher correctly chose not to spawn, and there is nothing for the operator to fix.

Changes:

* ``DispatchResult`` gains a ``skipped_nonspawnable`` field (separate from ``skipped_unassigned``) so callers can distinguish "task missing an owner: operator should route it" from "task owned by a control-plane lane: terminal will pull it".
* ``dispatch_once`` routes the ``not profile_exists(assignee)`` skip into the new bucket (it was previously lumped into ``skipped_unassigned``).
* New helper ``has_spawnable_ready(conn)`` returns True iff at least one ready+assigned+unclaimed task in the DB has an assignee that maps to a real Hermes profile. It falls back to the legacy "any ready+assigned" check when ``profile_exists`` is unimportable, so degraded installs still surface the original warn.
* The gateway dispatcher (``gateway/run.py``) and the CLI standalone daemon (``hermes_cli/kanban.py``) both swap their cheap ``ready_nonempty`` probe to use ``has_spawnable_ready``. Stuck-warn now fires only when there is genuine spawnable work the dispatcher failed to start.
* CLI dispatch output prints ``Skipped (non-spawnable assignee — terminal lane, OK)`` for visibility without alarm.

Tests:

* New ``has_spawnable_ready`` cases (empty queue, terminal-lane only, mixed real+terminal).
* New ``test_dispatch_skips_nonspawnable_into_separate_bucket`` verifies the bucketing change.
* Updated ``test_dispatch_skips_unassigned`` to assert no cross-leak.
* Added ``all_assignees_spawnable`` fixture in ``tests/hermes_cli/conftest.py`` and threaded it through dispatcher tests that use synthetic assignees ("alice", "bob"). PR #20105 (the parent commit) silently broke 8 such tests by routing those assignees into ``skipped_nonspawnable`` instead of spawning; this PR repairs them as part of the same code area.

Verified locally: 246/246 kanban-suite tests pass.

Stacks on top of fix/kanban-dispatcher-skip-missing-profile-2026-05-05 (PR #20105). Reviewer: this PR is meant to merge AFTER #20105.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent ca5595fe7b
commit f25d3ec917
6 changed files with 152 additions and 25 deletions
@@ -3906,7 +3906,17 @@ class GatewayRunner:
             return out
 
         def _ready_nonempty() -> bool:
-            """Cheap probe: is there a ready+assigned+unclaimed task on ANY board?"""
+            """Cheap probe: is there at least one ready+assigned+unclaimed
+            task on ANY board whose assignee maps to a real Hermes profile
+            (i.e. one the dispatcher would actually spawn for)?
+
+            Tasks assigned to control-plane lanes (e.g. ``orion-cc``,
+            ``orion-research``) are pulled by terminals via
+            ``claim_task`` directly and never spawnable, so a queue full
+            of those is "correctly idle", not "stuck". Filtering them out
+            here keeps the stuck-warn fire only on real failures (broken
+            PATH, missing venv, credential loss for a real Hermes profile).
+            """
             try:
                 boards = _kb.list_boards(include_archived=False)
             except Exception:
@@ -3916,12 +3926,7 @@ class GatewayRunner:
                 conn = None
                 try:
                     conn = _kb.connect(board=slug)
-                    row = conn.execute(
-                        "SELECT 1 FROM tasks "
-                        "WHERE status = 'ready' AND assignee IS NOT NULL "
-                        " AND claim_lock IS NULL LIMIT 1"
-                    ).fetchone()
-                    if row is not None:
+                    if _kb.has_spawnable_ready(conn):
                         return True
                 except Exception:
                     continue
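For context, the sketch below shows one way a probe like ``_ready_nonempty`` could feed the stuck-warn. Only the warning text comes from the commit message. The ``StuckDetector`` class, its ``threshold`` default, and the choice to reset the counter whenever nothing spawnable is ready are all hypothetical and are not taken from the gateway code.

```python
import logging

logger = logging.getLogger("kanban.dispatcher")


class StuckDetector:
    """Hypothetical counter of consecutive ticks with spawnable work but no spawns."""

    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold  # assumed value, not from the source
        self.idle_ticks = 0

    def observe(self, spawned: int, spawnable_ready: bool) -> bool:
        """Record one dispatcher tick; return True if the warn fired."""
        if spawned > 0 or not spawnable_ready:
            # Either work started, or the queue holds nothing the
            # dispatcher could spawn for: "correctly idle", not "stuck".
            self.idle_ticks = 0
            return False
        self.idle_ticks += 1
        if self.idle_ticks >= self.threshold:
            logger.warning(
                "kanban dispatcher stuck: ready queue non-empty for %d "
                "consecutive ticks but 0 workers spawned",
                self.idle_ticks,
            )
            return True
        return False
```

Under this sketch, a queue that holds only terminal-lane tasks never advances the counter, so the warn stays silent however long those tasks sit in ``ready``.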