From ca5595fe7b707f7285147623ea2473fb7460b7e7 Mon Sep 17 00:00:00 2001
From: Brecht-H <73849650+Brecht-H@users.noreply.github.com>
Date: Tue, 5 May 2026 07:47:20 +0000
Subject: [PATCH] fix(kanban): dispatcher skips ready tasks whose assignee is not a real profile
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The kanban dispatcher's `_default_spawn` invokes ``hermes -p chat -q ...``.
When ``assignee`` names a control-plane lane (e.g. an interactive Claude
Code terminal like ``orion-cc`` / ``orion-research``) instead of a real
Hermes profile, the subprocess fails on startup with "Profile 'X' does
not exist", gets reaped as a zombie, the TTL/crash detector marks the
task back to ``ready``, and the next tick re-spawns the same crashing
worker. Result: a permanent crash loop emitting
``spawned=2 crashed=2 every tick`` in the gateway log and burning CPU
forever.

Reproduce on a fresh Hermes-agent install:

    # 1. Create a kanban task whose assignee names a non-profile.
    hermes kanban create --assignee orion-cc --status ready \
        --title "Review PR #N" --body "..."

    # 2. Start the gateway with the embedded dispatcher.
    hermes gateway run
    # gateway.log lines every minute:
    #   kanban dispatcher: tick spawned=1 reclaimed=0 crashed=1 ...

    # 3. ps -ef | grep '[h]ermes.*defunct' shows zombies.

Fix
---
``dispatch_once()`` now pre-checks
``hermes_cli.profiles.profile_exists(assignee)`` before claiming. If
False, the row is added to ``skipped_unassigned`` (it's effectively
"unassigned-to-an-executable-profile") and the dispatcher moves on
without claiming, spawning, or counting a crash.

The check is opt-in safe: if the import fails (e.g. test isolation,
profile module restructured), ``profile_exists`` falls back to ``None``
and the original behaviour is preserved unchanged.

This addresses the explicit hint in the kanban task body
(``t_2bab06e3``): "Should ready-state tasks auto-spawn at all, or only
on explicit orion-cc claim? If spurious, gate the auto-spawn behind a
config flag (e.g. only assignee=hermes or assignee=auto)."

Profile-existence is a tighter gate than a config flag: it
self-documents (the user already knows whether they have an
``orion-cc`` profile), and it doesn't require Mac to maintain an
allowlist as new lane names appear. New lanes that ARE real profiles
(created via ``hermes profile create``) auto-qualify the moment the
profile dir is created.

Validated live
--------------
On Orion's hermes-agent install, two ``orion-research``-assigned tasks
(Bug A and Bug C investigations) had been crash-looping since
2026-05-05 06:58 local. After applying the patch + restarting the
gateway:

- Stale ``running`` claims released to ``ready`` cleanly.
- New gateway emitted ``kanban dispatcher: embedded`` and has ticked
  silently for 2+ minutes: no spawned=, crashed=, or stuck= log lines
  (all spawn skips are quiet).
- Tasks remain ``ready`` with ``claim_lock=None``, ``worker_pid=None``,
  ``spawn_failures=0``.
- Dashboard + telegram + freqtrade unaffected.

Confidence: high (live verified on Orion).
Scope-risk: narrow (additive guard inside one function).
Not-tested: behaviour when a profile is renamed mid-tick; current code
  re-imports ``profile_exists`` per row so a freshly created profile
  auto-qualifies on the next tick.
Machine: orion-terminal

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 hermes_cli/kanban_db.py | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/hermes_cli/kanban_db.py b/hermes_cli/kanban_db.py
index a58e542ac6..fb20278392 100644
--- a/hermes_cli/kanban_db.py
+++ b/hermes_cli/kanban_db.py
@@ -2506,6 +2506,23 @@ def dispatch_once(
         if not row["assignee"]:
             result.skipped_unassigned.append(row["id"])
             continue
+        # Skip ready tasks whose assignee is not a real Hermes profile.
+        # `_default_spawn` invokes ``hermes -p `` which fails
+        # with "Profile 'X' does not exist" when the assignee names a
+        # control-plane lane (e.g. an interactive Claude Code terminal
+        # like ``orion-cc`` / ``orion-research``) rather than a Hermes
+        # profile. Those task lanes are pulled by terminals via
+        # ``claim_task`` directly and should NEVER auto-spawn: the
+        # subprocess would crash on startup, get reaped as a zombie,
+        # the task would loop back to ``ready`` on next tick, and we'd
+        # burn CPU forever (#kanban-dispatcher-crash-loop 2026-05-05).
+        try:
+            from hermes_cli.profiles import profile_exists  # local import: avoids cycle
+        except Exception:
+            profile_exists = None  # type: ignore[assignment]
+        if profile_exists is not None and not profile_exists(row["assignee"]):
+            result.skipped_unassigned.append(row["id"])
+            continue
         if dry_run:
             result.spawned.append((row["id"], row["assignee"], ""))
             continue
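
The guard described above can be sketched in isolation. This is a minimal stand-in under stated assumptions, not the code in ``hermes_cli/kanban_db.py``: ``dispatch_ready``, ``DispatchResult``, the row dicts, and the ``known_profiles`` set are hypothetical names used only for illustration, and ``profile_exists`` is passed in as a parameter rather than imported per row. It shows the two behaviours the commit message claims: rows whose assignee is not a known profile land in ``skipped_unassigned`` without spawning, and a ``None`` checker (the import-failure fallback) preserves the old spawn-everything behaviour.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DispatchResult:
    # Hypothetical result container mirroring the two fields the patch touches.
    spawned: list = field(default_factory=list)
    skipped_unassigned: list = field(default_factory=list)


def dispatch_ready(
    rows: list,
    profile_exists: Optional[Callable[[str], bool]],
    spawn: Callable[[dict], None],
) -> DispatchResult:
    """Illustrative dispatch loop with the profile-existence guard."""
    result = DispatchResult()
    for row in rows:
        assignee = row.get("assignee")
        if not assignee:
            # Pre-existing behaviour: no assignee, nothing to spawn.
            result.skipped_unassigned.append(row["id"])
            continue
        # The guard: skip when a checker is available and rejects the name.
        # A checker of None models the import-failure fallback.
        if profile_exists is not None and not profile_exists(assignee):
            result.skipped_unassigned.append(row["id"])
            continue
        spawn(row)
        result.spawned.append((row["id"], assignee))
    return result


rows = [
    {"id": "t_1", "assignee": "orion-cc"},  # lane name, not a profile
    {"id": "t_2", "assignee": "hermes"},    # assumed to be a real profile
    {"id": "t_3", "assignee": None},        # unassigned
]
known_profiles = {"hermes"}  # assumption for this sketch
spawned_rows = []

# With a checker: only the row with a known profile is spawned.
result = dispatch_ready(rows, known_profiles.__contains__, spawned_rows.append)

# Without a checker: every assigned row is spawned, as before the patch.
fallback = dispatch_ready(rows, None, lambda row: None)
```

Passing the checker as an argument keeps the sketch free of any dependency on the real ``hermes_cli.profiles`` module; the patch itself resolves it with a local import inside the loop.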