mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-07 02:51:50 +00:00

SHL0MS 0dd8e3f8d8 rename: video-orchestrator → kanban-video-orchestrator

The kanban prefix makes the skill discoverable alongside `kanban-orchestrator`
and `kanban-worker`, and signals up front that this skill drives the kanban
plugin rather than being a generic video tool.

Updated:
- directory rename
- SKILL.md frontmatter `name:` and H1
- setup.sh.tmpl header

2026-05-03 10:26:54 -07:00

7.8 KiB

Raw Blame History

Intake — Discovery Question Banks

The discovery process is adaptive. Always start with three baseline questions to identify the broad style category, then drill into a per-style question bank. Ask 2-4 questions at a time, listen, then proceed. Make reasonable assumptions whenever the user implies an answer.

Tier 0 — Baseline (always ask)

What is the video? — One-sentence pitch
How long? — Approximate duration
Aspect ratio + target platform? — 16:9 / 9:16 / 1:1 / 4:5; X, IG, YouTube, internal, etc.

From these answers, classify the style category and pick the relevant Tier 1 follow-ups. Do not continue asking until you have at least these three.

Style classification

Map the brief to one of these archetypes (or a hybrid):

Archetype	Tells
Narrative film	Plot, characters, scenes-with-events, dialogue, location
Product / marketing	A specific product or feature being shown / sold; CTA at end
Music video	A specific track exists; visuals sync to music
Explainer / educational	A concept being taught; voiceover-driven
Tutorial / changelog	Software demo, terminal-heavy, technical
ASCII / terminal art	Retro terminal aesthetic explicit, character-grid
Abstract / loop	Generative, no plot, often perfect-loop
Documentary / interview cut	Real footage, transcription-driven
Real-time / installation	Audio-reactive, gallery installation, VJ output

If ambiguous, ask which category fits — don't guess. Hybrids are common (e.g., a product video with a narrative arc); decompose into the dominant mode + secondary modifiers.

Recursive / meta ("a video that shows its own production") is a rendering technique, not a separate style — compose it from any of the above by adding a two-pass render step where pass 2 uses pass 1's output as texture inside the final scene.

Tier 1 — Per-style follow-ups

Narrative film

Setting / world? — When and where the story takes place
Characters? — How many, archetypes, who carries dialogue
Beat list or full script? — Has the user written the story or do we draft it
Dialogue language? — Spoken lines, on-screen subs only, silent
Visual generation approach? — Text-to-image (FAL/Midjourney/Imagen) → image-to-video (Runway/Kling), 3D animation (Blender), 2D animation, procedural, or hybrid
Voice approach? — TTS (which voice), recorded VO, no dialogue
Music / score? — Commissioned (via songwriting-and-ai-music Suno prompts, or local heartmula), licensed track provided, silent

Product / marketing

Product? — Name, what it does, key feature being shown
Target audience? — Who's watching, what they care about
CTA? — Visit URL, install, sign up, etc.
Tone? — Serious, playful, technical, premium, edgy
Brand assets available? — Logo files, color palette, fonts, existing footage
Animation style? — Motion graphics (Remotion / AE-style), screen recording, generative, illustrated
Voiceover? — Yes (which voice / language) or text-only
Music? — Track provided, license-free needed, custom-composed

Music video

Track file? — Path to the audio (essential — we'll analyze BPM + beats)
Track length to use? — Full song or a section
Genre / energy? — Tells what visual rhythm and density to use
Lyric / narrative content? — Are there lyrics to render on screen, or is it purely visual?
Visual reference style? — Existing music videos / artists for reference
Performer footage? — None, has clips, will provide
Visual generation approach? — Per-beat generative, edit-driven cuts of stock footage, illustrated, hybrid

Explainer / educational

What concept is being taught? — One-sentence concept, key takeaway
Audience expertise? — Beginner / intermediate / expert
Diagram density? — Heavy math / formulas / code / abstract concepts
Voiceover? — TTS / recorded / on-screen text only
Tool preference? — manim-video (math), p5js (generative), Remotion (UI motion graphics), comfyui (AI-generated visuals), ascii-video (technical/retro), hybrid
Pacing? — Fast and dense (3Blue1Brown) or slow and contemplative

Tutorial / changelog / software demo

Software being demonstrated? — Name, what it does
Demo script? — Sequence of commands / screens to show
Terminal-only or with GUI?
Voiceover for narration?
Diagram support needed? — Often these benefit from a diagram skill alongside the screen-capture/render step (excalidraw, architecture-diagram, concept-diagrams)

ASCII / terminal art

Source material? — Generative / driven by audio / converting existing video / static image starting point
Color palette? — Brand-driven (gold/black/blue), Matrix green, full rainbow, monochrome
Audio reactivity? — None / loose mood / tight beat sync / FFT-driven
Character set? — ASCII only / Unicode block-drawing / mystic glyphs
Loop or narrative? — Perfect loop or one-shot

Abstract / loop

Mood / emotion? — One word that captures the feel
Motion type? — Zoom-into-itself, particle drift, wave, geometric, organic
Loop required? — Perfect loop (Droste-style) or just satisfying ending
Audio? — Silent, ambient pad, beat-synced

Documentary / interview cut

Source footage? — Provided clips, length per clip
Transcript / subtitles? — Provided or to be generated
Story structure? — Chronological / thematic / arc
B-roll approach? — Generated, stock library, none

Real-time / installation

Output environment? — Gallery wall, projector, screen, web embed
Audio source? — Live audio input, pre-recorded track, both
Reactivity tightness? — Mood-level (loose) vs. tight beat-sync vs. live parameter control
Tool preference? — touchdesigner-mcp for full TD operator graphs; p5js for web-canvas; comfyui for generative-AI fed by audio features

Tier 2 — Always ask near the end

Brand assets path? — Where logo / color palette / fonts / music library lives
Output format requirements? — Codec preference, target file size, accepted alternates (vertical cut, GIF, audio-only)
Deadline? — Affects task max_runtime_seconds and acceptable scope
Quality bar? — Rough draft for review / polished final / archival
Existing footage / assets to reuse? — Anything that should appear, not just inform

Reasonable assumption defaults

When the user under-specifies, fill in these defaults rather than asking:

Question	Default
Frame rate	30 fps for X / IG; 60 fps for tutorials/explainers; 24 fps for narrative film
Resolution	1080×1080 for square, 1920×1080 for 16:9, 1080×1920 for 9:16
Codec	H.264 / yuv420p, CRF 18
Audio codec	AAC 192 kbps
Voice	Provider's mid-range neutral voice unless brand calls for distinctive timbre
Music	Silent (require user to specify if music is wanted)
Captions	On for explainer/tutorial; off for narrative/abstract unless requested
Quality bar	Polished final unless user says draft

State the assumption explicitly: "Assuming 30fps and AAC audio unless you say otherwise — proceed?"

Anti-patterns

Asking 10 questions at once. Maximum 4 per turn.
Asking for things the brief already implies. If the user said "music video for my track," do not ask "is there a track?"
Failing to classify before drilling in. Tier-1 questions depend on classification; mixing them up wastes turns.
Treating "make a video" as enough to proceed. Always confirm the three baseline questions.

7.8 KiB Raw Blame History Unescape Escape