Monitoring — Watch the Pipeline + Intervene
After setup.sh fires the kanban, the work runs autonomously. The role of
this skill in the execution phase is to help the user (and the AI overseeing
the session) detect problems early and intervene effectively.
Live monitoring commands
# Live event stream — task spawns, status changes, heartbeats, completions
hermes kanban watch --tenant <project-slug>
# Snapshot of the board
hermes kanban list --tenant <project-slug>
hermes kanban list --tenant <project-slug> --json # machine-readable
# Per-status counts + oldest-ready age
hermes kanban stats --tenant <project-slug>
# Visual dashboard (browser)
hermes dashboard
# Inspect a specific task (includes comments + events)
hermes kanban show <task-id>
# Follow a single task's event stream
hermes kanban tail <task-id>
Verify available subcommands with hermes kanban --help — the kanban CLI
ships with init / create / list / show / assign / link / unlink / claim / comment / complete / block / unblock / archive / tail / dispatch / watch / stats / heartbeat / log / runs / context / gc.
The companion scripts/monitor.py polls the kanban via the CLI and surfaces
common issues (stuck tasks, missing heartbeats, repeated retries, dependency
deadlocks).
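A minimal sketch of the polling step such a monitor might perform. It assumes `hermes kanban list --json` prints a JSON array of task objects to stdout; that output shape is an assumption here, so check the real output before relying on it. `fetch_tasks` and its `run` parameter are hypothetical names, the latter being a hook that lets the command be faked in tests.

```python
import json
import subprocess


def fetch_tasks(tenant, run=subprocess.run):
    """Return the board snapshot for one tenant as parsed JSON.

    Assumption: `hermes kanban list --json` writes a JSON array to stdout.
    """
    result = run(
        ["hermes", "kanban", "list", "--tenant", tenant, "--json"],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return json.loads(result.stdout)
```

Passing the argument list directly (no shell) avoids quoting problems if a tenant slug ever contains unusual characters.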
What to watch for
Healthy pipeline indicators
- Tasks transition `READY → RUNNING → DONE` in roughly the expected order
- Renderers emit periodic `kanban_heartbeat` events with progress (e.g. "frame 240/720")
- Each task's runtime is well under its `max_runtime_seconds` cap
- No task accumulates more than 1 retry
- Dependency arrows resolve (children unblock as parents complete)
Warning signs
| Symptom | Likely cause | Action |
|---|---|---|
| Task RUNNING but no heartbeat in 2+ min | Worker stuck, infinite loop, blocked on input | hermes kanban show <id> — read the worker's last events. The dispatcher SIGTERMs tasks that exceed their max-runtime; if you need to stop one earlier, hermes kanban block <id> then hermes kanban archive <id>, and create a re-run task. |
| Same task retried 2+ times | Reproducible failure (missing key, bad spec, broken tool) | hermes kanban show <id> to read failure events. Fix root cause before re-running. |
| RUNNING longer than max_runtime | Task is slow but progressing OR genuinely stuck | Check heartbeats with hermes kanban tail <id>. If progressing, the dispatcher will SIGTERM eventually anyway — raise max-runtime on a re-created task. |
| Child task READY but parents still RUNNING for >2× expected | Cascade slow, dependency miswired | Check the dependency graph. Inspect the parent: sometimes it completed but its handoff fields (summary, metadata) were empty so the child has nothing to consume. |
| New tasks not appearing | Director is hung in decomposition | Inspect director task with kanban show. Often a malformed kanban_create call. |
| Specialist tasks completing instantly | Decomposition created tasks without bodies | Director didn't pass enough context. Re-create with explicit body content. |
| Tasks created but never picked up | Profile not running, or tenant mismatch, or dispatcher not running | Check hermes profile list (profile exists?), hermes status (gateway/dispatcher up?), and verify tenant. |
| Specific renderer task fails → review note → renderer redoes → fails again | Brief is asking for the impossible | Pivot the brief, not the renderer. |
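Three of the symptoms in the table are simple threshold checks, which can be sketched as a pure function over one task record. The key names (`status`, `seconds_since_heartbeat`, `retries`, `runtime_s`, `max_runtime_s`) and the label strings are hypothetical; map them to whatever the real task JSON provides.

```python
def warning_signs(task, heartbeat_limit_s=120, retry_limit=2):
    """Return the warning labels that apply to one task record.

    All dictionary keys used here are assumed names, not a documented schema.
    """
    signs = []
    running = task.get("status") == "RUNNING"

    # RUNNING but no heartbeat in 2+ minutes
    if running and task.get("seconds_since_heartbeat", 0) >= heartbeat_limit_s:
        signs.append("no-heartbeat")

    # Same task retried 2+ times
    if task.get("retries", 0) >= retry_limit:
        signs.append("repeated-retries")

    # RUNNING longer than its max-runtime cap
    cap = task.get("max_runtime_s")
    if running and cap is not None and task.get("runtime_s", 0) > cap:
        signs.append("over-max-runtime")

    return signs
```

The remaining rows (hung director, empty task bodies, impossible briefs) need a human to read the task's events, so they are left out of the sketch.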
Intervention recipes
Rejecting bad output
When a renderer ships a clip that doesn't pass review:
# 1. Comment on the renderer's task with specific feedback
hermes kanban comment <renderer-task-id> "Scene 3 looks too sparse \
— increase visual density. Tighten color palette to brand spec."
# 2. Create a re-render task with the original as parent
hermes kanban create "Scene 3 — re-render with feedback" \
--assignee renderer-ascii \
--parent <renderer-task-id> \
--workspace dir:"$HOME/projects/video-pipeline/<slug>" \
--tenant <slug> \
--skill ascii-video \
--max-runtime 30m
Adding a new dependency mid-flight
When the editor needs an asset that wasn't originally planned (e.g., a captions file):
# 1. Create the new task and capture its id
NEW_TASK_ID=$(hermes kanban create "Generate SRT captions from voiceover" \
--assignee captioner \
--workspace dir:"$HOME/projects/video-pipeline/<slug>" \
--tenant <slug> \
--json | python3 -c "import json,sys;print(json.load(sys.stdin)['id'])")
# 2. Wire it as a parent of the editor's task with `kanban link`
hermes kanban link "$NEW_TASK_ID" <editor-task-id>
kanban link takes parent_id child_id (parent first). Use kanban unlink
to remove a dependency.
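Because a mid-flight link can accidentally make two tasks wait on each other, it may be worth checking for a cycle before wiring the new dependency. The sketch below is one way to do that, assuming you have already collected the existing links as `(parent_id, child_id)` pairs; how you obtain them is outside this example, and `would_create_cycle` is a hypothetical helper, not part of the kanban CLI.

```python
def would_create_cycle(edges, parent, child):
    """True if adding parent -> child would make the dependency graph cyclic.

    `edges` is an iterable of (parent_id, child_id) pairs, in the same
    argument order as `kanban link`.
    """
    children = {}
    for p, c in edges:
        children.setdefault(p, set()).add(c)

    # A cycle appears when `parent` is already downstream of `child`,
    # so walk the graph from `child` and look for `parent`.
    stack, seen = [child], set()
    while stack:
        node = stack.pop()
        if node == parent:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, ()))
    return False
```

For example, with links `a -> b` and `b -> c` already in place, linking `c` as a parent of `a` would be rejected, while linking `a` as a parent of `c` is fine.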
Stopping a worker that's stuck
The kanban dispatcher will SIGTERM (then SIGKILL) any task that exceeds its
--max-runtime automatically. To stop one sooner:
# Mark blocked so the dispatcher leaves it alone, then archive
hermes kanban block <task-id>
hermes kanban archive <task-id>
# Diagnose what happened
hermes kanban show <task-id> # task body, comments, recent events
hermes kanban tail <task-id> # follow the live event stream
hermes kanban log <task-id> # worker process log
After stopping, decide: fix root cause + re-create the task, or skip and adjust dependent tasks.
Pivoting the brief
If during execution the user wants something fundamentally different:
- Stop the active director task and all RUNNING children (hermes kanban block <id>, then hermes kanban archive <id>)
- Edit `brief.md` and `TEAM.md`
- Re-fire the initial `hermes kanban create` for the director
Don't try to "edit while running" — the kanban's audit trail makes a clean pivot more legible than mid-stream changes.
Periodic check-in script
A simple polling pattern for hands-off monitoring:
while true; do
clear
hermes kanban list --tenant <slug>
echo "---"
hermes kanban stats --tenant <slug>
sleep 30
done
For a live event feed, run hermes kanban watch --tenant <slug> in a
separate terminal — it streams task lifecycle events as they happen.
For automated intervention (auto-restart stuck tasks, auto-create re-render on
review failure), see the scripts/monitor.py patterns.
When to call it done
The pipeline is finished when:
- All RENDER tasks complete and pass review
- The editor's `output/final.mp4` exists and `ffprobe` confirms expected duration + streams
- The reviewer (if present) has approved
- Optional masterer variants exist
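The `ffprobe` check can be sketched as a function over ffprobe's JSON report. This assumes the final cut should carry at least one video and one audio stream; the half-second tolerance and the name `final_cut_ok` are also assumptions for illustration.

```python
def final_cut_ok(probe, expected_s, tolerance_s=0.5):
    """Check a parsed ffprobe JSON report for duration and stream types.

    `probe` is the dict produced by json.loads() on ffprobe's JSON output.
    Requiring both video and audio is an assumption about this pipeline.
    """
    duration = float(probe["format"]["duration"])  # ffprobe reports a string
    kinds = {s.get("codec_type") for s in probe.get("streams", [])}
    duration_ok = abs(duration - expected_s) <= tolerance_s
    return duration_ok and {"video", "audio"} <= kinds
```

The report itself can be produced with `ffprobe -v error -print_format json -show_format -show_streams output/final.mp4` and parsed with `json.loads`.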
At this point, present the final.mp4 path to the user along with any review notes. Do NOT delete the workspace — the user may want to iterate on a single scene without re-running the whole pipeline.
Common gotchas
- Tenant mismatches. A task created with the wrong tenant won't appear in monitoring. Always pass `--tenant <slug>` consistently.
- Profile process not running. Tasks queue indefinitely in READY if no worker for that profile is online. Check `hermes profile list` and start any missing profiles.
- Workspace permissions. All profiles need read+write to the workspace directory. `chmod -R u+rw <workspace>` if any worker reports permission errors.
- Audio/visual sync. The editor's clip stitching must match the renderer's actual output durations. Don't hardcode scene durations in the editor — read from the renderer's handoff metadata.
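
For the audio/visual sync point, a minimal sketch of deriving the edit timeline from reported durations instead of hardcoded ones. The `scene` and `duration_s` keys are hypothetical names for whatever the renderer's handoff metadata actually contains.

```python
def scene_offsets(handoffs):
    """Map each scene id to its (start, end) time on the edit timeline.

    `handoffs` is an ordered list of dicts, one per rendered clip.
    The keys `scene` and `duration_s` are assumed, not a documented schema.
    """
    timeline, cursor = {}, 0.0
    for item in handoffs:
        start = cursor
        cursor += float(item["duration_s"])  # use the measured clip length
        timeline[item["scene"]] = (start, cursor)
    return timeline
```

If a renderer re-renders a scene at a different length, recomputing the offsets from the fresh metadata keeps later scenes and the voiceover aligned.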