
Worked Examples

Six concrete pipelines covering different video styles. Each shows the team composition, task graph, and skill/tool choices the orchestrator would make for that brief. These are illustrative, not templates — adapt to the actual brief.

Example 1 — Narrative short film (text-to-image → image-to-video → cut)

Brief: A 90-second noir-style short. A detective walks through a rainy city. Voiceover narration. AI-generated visuals.

Team:

  • director — vision, decomposition, approval
  • writer — script + voiceover copy (loads humanizer for natural voice)
  • storyboarder — beat-by-beat shot list (loads excalidraw)
  • image-generator — generates each shot's still via local ComfyUI workflows (loads comfyui)
  • image-to-video-generator — animates each still (Runway/Kling, OR ComfyUI's AnimateDiff/WAN workflows via comfyui)
  • voice-talent — narration via ElevenLabs
  • audio-mixer — VO + ambient pad
  • editor — assembly + transitions
  • reviewer — final QA

Task graph:

T0  director         decompose
T1  writer           script + voiceover.md                    (parent: T0)
T2  storyboarder     shot list with framing per beat          (parent: T1)
T3  image-generator  one still per shot (~12 shots)           (parent: T2)
T4  image-to-video   animate each still                       (parent: T3)
T5  voice-talent     generate narration audio                 (parent: T1)
T6  audio-mixer      mix VO + ambient                         (parent: T5)
T7  editor           cut + transitions + audio mux            (parents: T4, T6)
T8  reviewer         final QA                                 (parent: T7)

Key choices:

  • Local ComfyUI via comfyui skill is preferred over external API for cost/control — but external APIs are fine if ComfyUI isn't installed
  • editor profile is ffmpeg-only, no Hermes skill required beyond kanban-worker (an assembly sketch follows this list)
  • Storyboarder produces storyboard.excalidraw alongside the markdown
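
As a concrete sketch of the ffmpeg-only editor role: concatenate the T4 shot clips and mux in the T6 audio mix. File names and encoder settings here are hypothetical, and a real cut would use crossfades (e.g. ffmpeg's xfade filter) rather than hard cuts.

```python
# Sketch of the editor step (T7), assuming T4 wrote shots/shot_*.mp4 and
# T6 wrote audio/mix.wav. All paths and settings are assumptions.
import subprocess
from pathlib import Path

shots = sorted(Path("shots").glob("shot_*.mp4"))

# ffmpeg's concat demuxer reads a list of "file '<path>'" lines.
Path("concat.txt").write_text(
    "".join(f"file '{s.resolve()}'\n" for s in shots)
)

subprocess.run([
    "ffmpeg", "-y",
    "-f", "concat", "-safe", "0", "-i", "concat.txt",  # video: stitched shots
    "-i", "audio/mix.wav",                             # audio: VO + ambient mix
    "-map", "0:v", "-map", "1:a",
    "-c:v", "libx264", "-crf", "18",
    "-c:a", "aac", "-shortest",
    "final.mp4",
], check=True)
```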

Example 2 — Product / marketing teaser

Brief: A 30-second product teaser for a developer tool. Shows code + terminal + UI screen recordings, voiceover, CTA at end. Square 1:1.

Team:

  • director
  • copywriter — taglines, voiceover script, CTA (loads humanizer)
  • concept-artist — style frames (loads claude-design for UI mockups)
  • renderer-motion-graphics — animated UI sequences (Remotion CLI)
  • renderer-ascii — terminal-style demo scenes (loads ascii-video)
  • voice-talent — VO via ElevenLabs
  • editor — assembly + brand-color treatment
  • audio-mixer — VO + light music bed
  • captioner — burned subtitles for muted-autoplay platforms
  • masterer — produces 1:1 + 9:16 + 16:9 variants

Task graph:

T0  director              decompose
T1  copywriter            copy.md + cta + vo script               (parent: T0)
T2  concept-artist        visual-spec.md + style frames           (parent: T1)
T3a renderer-motion-graphics  scene 1: UI sequence                (parent: T2)
T3b renderer-ascii        scene 2: terminal demo                  (parent: T2)
T3c renderer-motion-graphics  scene 3: feature highlight          (parent: T2)
T3d renderer-motion-graphics  scene 4: CTA card                   (parent: T2)
T4  voice-talent          narration                               (parent: T1)
T5  audio-mixer           VO + music bed                          (parent: T4)
T6  editor                cut + transitions                       (parents: T3*, T5)
T7  captioner             SRT + burned subtitles                  (parent: T6)
T8  masterer              1:1, 9:16, 16:9 variants                (parent: T7)

Key choices:

  • Multiple specialized renderers (motion-graphics + ASCII) coexist
  • Captioner is included because muted autoplay is the norm on social platforms (the caption-burn and mastering steps are sketched after this list)
  • claude-design skill for UI mockups maps directly to the product video idiom
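
The captioner and masterer steps are similarly plain ffmpeg. A minimal sketch, assuming T6 produced a 1080x1080 cut.mp4 and the captioner wrote subs.srt; names and resolutions are assumptions:

```python
# Sketch of T7 (burn subtitles) and T8 (aspect-ratio variants).
# Assumes a 1080x1080 master; filenames and sizes are assumptions.
import subprocess

def ff(*args: str) -> None:
    subprocess.run(["ffmpeg", "-y", *args], check=True)

# T7: burn subs for muted autoplay (libass via the subtitles filter).
ff("-i", "cut.mp4", "-vf", "subtitles=subs.srt", "-c:a", "copy",
   "teaser_1x1.mp4")

# T8: derive 9:16 and 16:9 from the square master by padding.
# Cropping is the other option; which to use is an editorial call.
ff("-i", "teaser_1x1.mp4",
   "-vf", "pad=1080:1920:(ow-iw)/2:(oh-ih)/2",  # center on a 9:16 canvas
   "teaser_9x16.mp4")
ff("-i", "teaser_1x1.mp4",
   "-vf", "pad=1920:1080:(ow-iw)/2:(oh-ih)/2",  # center on a 16:9 canvas
   "teaser_16x9.mp4")
```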

Example 3 — Music video (synced to provided track)

Brief: A 3-minute music video for a provided lo-fi hip-hop track. Visuals should pulse with the beat. Generative + ASCII hybrid. Vertical 9:16.

Team:

  • director
  • music-supervisor — analyze track, emit audio/beats.json (loads songsee)
  • storyboarder — beat-aligned shot list (loads excalidraw)
  • renderer-ascii — ASCII scenes synced to bass kicks (loads ascii-video)
  • renderer-p5js — generative particle scenes synced to highs (loads p5js)
  • editor — beat-cut assembly using beats.json
  • reviewer — sync QA

Task graph:

T0  director              decompose
T1  music-supervisor      analyze track → beats.json + spectrogram  (parent: T0)
T2  storyboarder          shot list aligned to beats                (parents: T1, T0)
T3a renderer-ascii        scene 1: bass-driven ASCII                (parent: T2)
T3b renderer-p5js         scene 2: high-end particle field          (parent: T2)
... (more scenes)
T4  editor                cut to beats + mux track                  (parents: T3*, T1)
T5  reviewer              sync QA + final approval                  (parent: T4)

Key choices:

  • music-supervisor runs FIRST — beats.json gates the renderers
  • editor uses beats.json directly to align cuts to bass kicks (see the sketch after this list)
  • No voice-talent — music is the audio
  • Two specialized renderers (ascii-video + p5js) for visual variety
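
A sketch of the editor consuming beats.json. The schema here, a flat list of beat timestamps in seconds, is an assumption; songsee's actual output may differ:

```python
# Sketch: turn beat timestamps into segment cut points for the editor (T4).
# Assumes beats.json is a flat list of seconds, e.g. [0.0, 0.52, 1.04, ...];
# songsee's real schema may differ.
import json

with open("audio/beats.json") as f:
    beats: list[float] = json.load(f)

# Cut on every 4th beat (one bar in 4/4) so the edit breathes rather
# than strobing; which beats to cut on is an editorial choice.
cut_points = beats[::4]

# Pair consecutive cut points into (start, duration) segments, then
# assign scenes round-robin across the rendered clips.
scenes = ["scene_ascii.mp4", "scene_p5js.mp4"]
segments = [
    (start, end - start, scenes[i % len(scenes)])
    for i, (start, end) in enumerate(zip(cut_points, cut_points[1:]))
]
for start, dur, clip in segments:
    print(f"take {dur:.2f}s of {clip} at track time {start:.2f}s")
```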

Example 4 — Math/algorithm explainer

Brief: A 2-minute explainer of an algorithm. 3Blue1Brown-style. Animated diagrams, equations, narration. Square 1:1.

Team:

  • director
  • writer — narration script (loads humanizer)
  • cinematographer — visual spec (loads manim-video)
  • renderer-manim — all animated scenes (loads manim-video)
  • voice-talent — narration via ElevenLabs
  • editor — assembly + audio mux
  • captioner — burned subtitles

Task graph:

T0  director           decompose
T1  writer             script + narration                  (parent: T0)
T2  cinematographer    visual spec for all scenes           (parent: T1)
T3a-Tn renderer-manim  scenes 1..N                          (parent: T2)
T4  voice-talent       narration audio                      (parent: T1)
T5  editor             cut + mux                            (parents: T3*, T4)
T6  captioner          SRT + burn                           (parent: T5)

Key choices:

  • manim-video skill drives both the cinematographer (visual language) and the renderer (actual scene production); a minimal scene sketch follows this list
  • The manim-video skill's reference docs (animation-design-thinking, scene-planning, equations) auto-load when needed via the renderer's pinned skill
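
For orientation, a minimal scene of the kind renderer-manim produces. This is generic Manim Community code with a stand-in algorithm; the manim-video skill's own scene conventions would shape the real thing:

```python
# Minimal Manim Community scene sketch: fade in a title, write an
# equation, morph it into its closed form. The algorithm is a stand-in.
from manim import Scene, MathTex, Text, FadeIn, Write, TransformMatchingTex, UP

class BinarySearchIntro(Scene):
    def construct(self):
        title = Text("Binary search").to_edge(UP)
        step = MathTex(r"T(n) = T(n/2) + O(1)")
        closed = MathTex(r"T(n) = O(\log n)")

        self.play(FadeIn(title))
        self.play(Write(step))
        self.wait(1)
        # Morph the recurrence into its closed form.
        self.play(TransformMatchingTex(step, closed))
        self.wait(2)
```

Each T3 task renders one such scene (e.g. `manim -qm scene.py BinarySearchIntro`); the editor then only cuts and muxes.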

Example 5 — ASCII video, music-track-only

Brief: A 60-second pure-ASCII video reactive to an existing track. No voiceover, no other tools. Square 1:1.

Team:

  • director
  • music-supervisor — track analysis (loads songsee)
  • renderer-ascii — all visuals (loads ascii-video)
  • editor — assembly + audio mux

Task graph:

T0  director           decompose
T1  music-supervisor   analyze track                       (parent: T0)
T2a renderer-ascii     scene 1                             (parents: T1, T0)
T2b renderer-ascii     scene 2                             (parents: T1, T0)
T2c renderer-ascii     scene 3                             (parents: T1, T0)
T3  editor             stitch + mux audio                  (parents: T2*)

Key choices:

  • Minimal team (4 profiles) for a focused single-tool project
  • No reviewer — short experimental piece, director approves directly
  • All scenes run through one renderer-ascii profile because the ascii-video skill covers everything

This example illustrates the rule: don't over-decompose. Three scenes through one renderer is fine. Don't spawn three renderer profiles.
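
The gating semantics behind every task graph above are the same: a task becomes eligible once all of its parents are done. A toy model of Example 5's graph (the kanban plugin manages this for real; this only illustrates the dependency semantics):

```python
# Toy model of Example 5's task graph: a task is ready once all its
# parents are done. Real scheduling lives in the kanban plugin.
GRAPH = {
    "T0": [],                    # director: decompose
    "T1": ["T0"],                # music-supervisor: analyze track
    "T2a": ["T1", "T0"],         # renderer-ascii: scene 1
    "T2b": ["T1", "T0"],         # renderer-ascii: scene 2
    "T2c": ["T1", "T0"],         # renderer-ascii: scene 3
    "T3": ["T2a", "T2b", "T2c"], # editor: stitch + mux audio
}

def ready(done: set[str]) -> list[str]:
    """Tasks whose parents are all done and which aren't done themselves."""
    return [t for t, parents in GRAPH.items()
            if t not in done and all(p in done for p in parents)]

assert ready(set()) == ["T0"]
assert ready({"T0", "T1"}) == ["T2a", "T2b", "T2c"]  # scenes unblock together
assert ready({"T0", "T1", "T2a", "T2b", "T2c"}) == ["T3"]
```

The three scene tasks unblock together: parallelism comes from the graph shape, not from multiplying renderer profiles.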

Example 6 — Real-time / installation art

Brief: A 2-minute audio-reactive visual for a gallery installation. Driven by an audio input feed. TouchDesigner-based. 16:9 4K.

Team:

  • director
  • cinematographer — visual language spec (loads touchdesigner-mcp)
  • renderer-touchdesigner — all visuals + record-to-disk (loads touchdesigner-mcp)
  • audio-mixer — final loudness pass on the captured audio (optional if the source is already mixed)
  • editor — assemble final clip from TouchDesigner recording
  • reviewer — visual QA

Task graph:

T0  director                decompose
T1  cinematographer         TD operator graph spec           (parent: T0)
T2  renderer-touchdesigner  build TD network + record output (parent: T1)
T3  editor                  trim + audio mux                 (parent: T2)
T4  reviewer                final QA                         (parent: T3)

Key choices:

  • touchdesigner-mcp controls a running TouchDesigner instance — the cinematographer designs the operator graph, renderer builds it
  • Output is a recording from the running TD network, not a render-to-frames process; editor mostly just trims

Pattern recognition

When the user describes a video, look for these signals to map to an example (a toy keyword router sketching this mapping follows the list):

  • Plot, characters, scripted dialogue → Example 1 (narrative)
  • Specific product, CTA, brand colors, voiceover → Example 2 (marketing)
  • Track file provided, "synced to music" → Example 3 (music video)
  • "Explain how X works", math/algorithm/concept walkthrough → Example 4 (manim explainer)
  • Terminal aesthetic, ASCII, retro pixel → Example 5 (ASCII)
  • "Audio-reactive", "real-time", "installation" → Example 6 (TouchDesigner)
  • Comic-style narrative → use renderer-comic (baoyu-comic skill)
  • Retro game / pixel-art aesthetic → use renderer-pixel (pixel-art skill)
  • 3D scene, photoreal environment → use renderer-3d (blender-mcp)
  • Generative art, particle system, shader → use renderer-p5js (p5js)
  • AI-generated photoreal stills + animation → use renderer-comfyui (comfyui) for both stills and image-to-video
  • "video about how the system works", recursive demo → composable from any of the above; the recursion is a rendering technique, not a style

The actual team should be derived from the specific brief — these examples are starting points, not endpoints.