mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-07 02:51:50 +00:00

Siddharth Balyan 8163d37192

fix(skill): reference built-in video_analyze/vision_analyze tools in kanban-video-orchestrator (#19562)

The tool-matrix.md had a vague 'Gemini multimodal / Claude vision' entry
in the external tools table that didn't point to the actual built-in
Hermes tools. Now that video_analyze exists (merged in #19301), update
the skill to reference it properly:

- Add 'Built-in Hermes tools for media review' section with proper
  toolset names, enablement instructions, and capability details
- Add video + vision toolsets to cinematographer, editor, and reviewer
  profile configs
- Update role-archetypes.md to reference tools by name
- Update API key table to explain video_analyze routing

2026-05-04 12:54:50 +05:30


Tool Matrix — Skills + Toolsets per Role

Maps each role archetype to the Hermes skills it should always_load and the toolsets it needs. Only references skills that ship in the public hermes-agent repository (under skills/ or optional-skills/). External APIs and CLIs are called from the terminal toolset; they don't appear in always_load.

Hermes skills relevant to video production

Visual / rendering skills (`hermes-agent/skills/creative/`)

| Skill | What it does | Best fit for |
| --- | --- | --- |
| `ascii-video` | Production pipeline for ASCII art video: generative, audio-reactive, video-to-ASCII | Renderer for ASCII / terminal / retro pixel content; cinematographer for ASCII projects |
| `ascii-art` | Static ASCII art generation | Concept artist for ASCII style frames; secondary tool for ASCII renderer |
| `manim-video` | Manim CE animations: math, algorithms, 3Blue1Brown-style explainers | Renderer for math, algorithm walkthroughs, technical concept explainers |
| `p5js` | p5.js sketches: generative art, shaders, interactive, 3D | Renderer for generative art, particle systems, organic motion, web-canvas content |
| `comfyui` | Generate images, video, audio with ComfyUI workflows (image-to-image, image-to-video, etc.) | image-generator, image-to-video-generator, or general renderer for AI-generated content |
| `touchdesigner-mcp` | Control a running TouchDesigner instance: real-time visuals, audio-reactive installation art, VJ | Renderer for real-time/audio-reactive content; installation art; live performance |
| `blender-mcp` (optional) | Control Blender 4.3+ via MCP: 3D modeling, animation, rendering | Renderer for 3D scenes, photoreal environments, character animation |
| `pixel-art` | Pixel art with era palettes (NES, Game Boy, PICO-8) | Renderer for retro game aesthetic; concept artist for pixel-style frames |
| `baoyu-comic` | Knowledge-comic generation (educational, biography, tutorial) | Renderer for comic-style narrative; explainer in panel form |
| `baoyu-infographic` | Infographic generation | Renderer for data-driven explainer scenes |
| `meme-generation` (optional) | Generate meme images by overlaying text on templates | Generator for satirical/social content; meme-style stills |

Design / pre-production skills (`hermes-agent/skills/creative/`)

| Skill | What it does | Best fit for |
| --- | --- | --- |
| `claude-design` | Design one-off HTML artifacts (landing, deck, prototype) | Concept artist for product video style frames; storyboarder for UI-heavy content |
| `design-md` | Design markdown docs | Concept artist documenting visual specs |
| `popular-web-designs` | Reference patterns for popular web designs | Concept artist; cinematographer when matching a known UI aesthetic |
| `sketch` | Throwaway HTML mockups (2-3 design variants to compare) | Concept artist exploring directions; storyboarder for UI flows |
| `excalidraw` | Excalidraw-style hand-drawn diagrams | Storyboarder; concept artist for sketch-style frames |
| `architecture-diagram` | Software architecture diagrams | Storyboarder for technical content; explainer scenes about systems |
| `concept-diagrams` (optional) | Flat, minimal SVG diagrams (educational visual language; physics, chemistry, math, anatomy, etc.) | Renderer / storyboarder for explainer scenes with clean educational diagrams |
| `pretext` | Mathematical/scientific content authoring | Writer / cinematographer for technical-explainer pretexts |
| `creative-ideation` | Constraint-driven project ideation | Director / cinematographer when the brief is wide-open and needs framing |
| `humanizer` | Strip AI-isms from text, add real voice | Writer / copywriter post-process to avoid AI-tells in scripts and VO copy |

Audio / media skills (`hermes-agent/skills/creative/` + `skills/media/`)

| Skill | What it does | Best fit for |
| --- | --- | --- |
| `songwriting-and-ai-music` | Songwriting craft + Suno prompt patterns | Music supervisor when commissioning a track via Suno |
| `heartmula` | Open-source music generation (Apache-2.0, Suno-like) | Music supervisor generating bespoke tracks without external APIs |
| `songsee` | Spectrograms, mel/chroma/MFCC of audio files | Music supervisor analyzing tracks; foley-designer designing to a beat; editor visualizing a mix |
| `spotify` | Spotify control: play, search, queue, manage playlists | Music supervisor sourcing existing tracks; reference research |
| `youtube-content` | Fetch transcripts + transform to chapters/summaries/posts | Documentary cut, content adaptation, research for explainers |
| `gif-search` | Find existing GIFs | Editor / concept artist sourcing references |
| `gifs` | GIF tooling | Masterer producing GIF deliverables |

Kanban infrastructure (`hermes-agent/skills/devops/`)

| Skill | What it does | When to load |
| --- | --- | --- |
| `kanban-orchestrator` | Decomposition playbook + anti-temptation rules for orchestrator profiles | Director only |
| `kanban-worker` | Pitfalls, examples, edge cases for kanban workers (deeper than auto-injected guidance) | Any profile; load when handling tricky multi-step workflows |

The kanban plugin auto-injects baseline orchestration guidance into every worker's system prompt — the kanban_create fan-out pattern, claim/handoff lifecycle, and the "decompose, don't execute" rule for orchestrators. kanban-orchestrator and kanban-worker are deeper playbooks loaded when a profile needs them.

External tools (called from terminal toolset)

These are not Hermes skills but external CLIs / APIs that profiles invoke. They don't appear in always_load; instead the role's terminal commands hit them directly.

| Tool | What it does | Profile that uses it |
| --- | --- | --- |
| `ffmpeg` | Video / audio encode, splice, mux | renderer, editor, audio-mixer, masterer |
| `ffprobe` | Inspect media | All media-touching profiles |
| Whisper (CLI or API) | Speech-to-text for captions | captioner |
| Text-to-image API (FAL / Replicate / OpenAI / Midjourney) | Stills generation | image-generator (alternative to local `comfyui`) |
| Image-to-video API (Runway / Kling / Luma / Pika) | Animate stills | image-to-video-generator |
| Text-to-speech API (ElevenLabs / OpenAI TTS / etc.) | Voiceover generation | voice-talent |
| Suno API or web | Track composition (paired with `songwriting-and-ai-music`) | music-supervisor |
| Remotion CLI (`npx remotion render`) | React-based motion graphics | renderer-motion-graphics |
| Manim CE (`manim`) | Math animation render (driven by `manim-video` skill's recipes) | renderer-manim |
| Blender (`blender -b`) | 3D rendering (alternative to `blender-mcp`) | renderer-3d |
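
As an illustration of how a media-touching profile might call one of these CLIs from the terminal toolset, here is a minimal Python sketch around `ffprobe`. It assumes `ffprobe` is on `PATH`; the helper names (`ffprobe_command`, `summarize_probe`, `inspect_media`) are hypothetical and not part of Hermes.

```python
import json
import subprocess


def ffprobe_command(path):
    """Build an ffprobe invocation that prints container and stream info as JSON."""
    return [
        "ffprobe",
        "-v", "error",      # only log real errors
        "-show_format",     # container-level fields such as duration
        "-show_streams",    # one entry per audio/video stream
        "-of", "json",      # machine-readable output
        path,
    ]


def summarize_probe(probe):
    """Reduce ffprobe's JSON to a few fields a reviewer is likely to check first."""
    streams = probe.get("streams", [])
    video = next((s for s in streams if s.get("codec_type") == "video"), None)
    audio = next((s for s in streams if s.get("codec_type") == "audio"), None)
    fmt = probe.get("format", {})
    has_size = bool(video) and "width" in video and "height" in video
    return {
        # ffprobe reports duration as a string, so convert it here
        "duration_s": float(fmt["duration"]) if "duration" in fmt else None,
        "video_codec": video.get("codec_name") if video else None,
        "resolution": (video["width"], video["height"]) if has_size else None,
        "audio_codec": audio.get("codec_name") if audio else None,
    }


def inspect_media(path):
    """Run ffprobe on a file and return the summary (raises if ffprobe fails)."""
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True, check=True
    )
    return summarize_probe(json.loads(result.stdout))
```

Keeping the command builder and the JSON summary separate from the `subprocess` call means the parsing logic can be exercised without any media files present.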

Built-in Hermes tools for media review

These are native Hermes tools — not invoked via terminal but through their own toolsets. Enable them per-profile by adding the toolset to the profile config.

| Tool | Toolset | What it does | Profile that uses it |
| --- | --- | --- | --- |
| `video_analyze` | `video` (opt-in: `hermes tools enable video`) | Native video understanding: sends the full clip to a multimodal LLM (Gemini via OpenRouter) for review without frame extraction. Supports mp4, webm, mov, avi, mkv. 50 MB cap. Model: `AUXILIARY_VIDEO_MODEL` env → `AUXILIARY_VISION_MODEL` fallback. | reviewer, cinematographer, editor |
| `vision_analyze` | `vision` (core, enabled by default) | Image/frame analysis: review stills, thumbnails, exported frames. Already available to all profiles without opt-in. | reviewer, cinematographer, concept-artist |
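
The constraints listed for `video_analyze` (supported containers, the 50 MB cap, and the model fallback order) can be checked before a clip is handed to the tool. The sketch below is one way to do that, not the tool's actual implementation. It assumes "50 MB" means 50 MiB and that an empty environment variable counts as unset; `preflight` and `resolve_video_model` are hypothetical helper names.

```python
from pathlib import Path

# Containers listed for video_analyze in the table above.
SUPPORTED_EXTENSIONS = {".mp4", ".webm", ".mov", ".avi", ".mkv"}

# Assumption: the documented "50 MB" cap is read as 50 MiB.
MAX_BYTES = 50 * 1024 * 1024


def preflight(path, size_bytes):
    """Return the reasons a clip would likely be rejected; an empty list means OK."""
    problems = []
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported container: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"{size_bytes} bytes exceeds the {MAX_BYTES}-byte cap")
    return problems


def resolve_video_model(env):
    """Follow the documented lookup order: AUXILIARY_VIDEO_MODEL, then AUXILIARY_VISION_MODEL."""
    # Assumption: an empty string is treated the same as an unset variable.
    return env.get("AUXILIARY_VIDEO_MODEL") or env.get("AUXILIARY_VISION_MODEL")
```

In practice a profile would call `preflight(path, os.path.getsize(path))` and `resolve_video_model(os.environ)`. A clip that fails the size check could be re-encoded with `ffmpeg` or reviewed as exported frames through `vision_analyze` instead.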

Standard toolset configurations per role

director

toolsets:
  - kanban
  - terminal
  - file
skills:
  always_load:
    - kanban-orchestrator

The director has terminal access by convention, but its SOUL.md rules forbid it from executing work itself. Audit logs catch violations.

writer / copywriter

toolsets:
  - kanban
  - file
skills:
  always_load:
    - kanban-worker
    - humanizer            # post-process scripts to strip AI-tells

No terminal — writers don't need it.

concept-artist

toolsets:
  - kanban
  - terminal
  - file
skills:
  always_load:
    - kanban-worker
    # plus one or more (style-dependent):
    # - claude-design       (UI / web product video)
    # - sketch              (quick mockup variants)
    # - excalidraw          (hand-drawn frames)
    # - ascii-art           (ASCII style frames)
    # - pixel-art           (retro/game aesthetic)
    # - popular-web-designs (matching known web aesthetic)
    # - design-md           (text-based design docs)

storyboarder

toolsets:
  - kanban
  - file
skills:
  always_load:
    - kanban-worker
    # one of:
    # - excalidraw              (sketch storyboards)
    # - architecture-diagram    (technical/system content)
    # - concept-diagrams        (educational / scientific content)

cinematographer

toolsets:
  - kanban
  - terminal
  - file
  - video               # video_analyze — review full clips natively
  - vision              # vision_analyze — review stills / exported frames
skills:
  always_load:
    - kanban-worker
    # the visual skill that matches the project, e.g.:
    # - ascii-video            (ASCII projects)
    # - manim-video            (math/explainer)
    # - p5js                   (generative)
    # - comfyui                (AI-generated visuals)
    # - blender-mcp            (3D)
    # - touchdesigner-mcp      (real-time/installation)

renderer (specialized variants)

toolsets:
  - kanban
  - terminal
  - file
skills:
  always_load:
    - kanban-worker
    # ONE skill per renderer variant (or empty for external-API renderers):
    # - ascii-video               (renderer-ascii)
    # - manim-video               (renderer-manim)
    # - p5js                      (renderer-p5js)
    # - comfyui                   (renderer-comfyui — img/video AI gen)
    # - touchdesigner-mcp         (renderer-touchdesigner)
    # - blender-mcp               (renderer-3d)
    # - pixel-art                 (renderer-pixel)
    # - baoyu-comic               (renderer-comic)
    # - meme-generation           (renderer-meme)

For external-API renderers (image-to-video-generator using Runway, voice-talent using ElevenLabs, renderer-motion-graphics using Remotion), always_load only contains kanban-worker — the role's work is API-driven and the API key + terminal commands suffice.

For multi-skill renderer setups (rare; one variant per skill is usually cleaner), use --skill <name> on individual kanban_create calls to override which skill loads for that specific task.

image-generator / image-to-video-generator / voice-talent

toolsets:
  - kanban
  - terminal
  - file
skills:
  always_load:
    - kanban-worker
    # for image-generator that drives ComfyUI locally:
    # - comfyui
env_required:
  # populate based on the chosen API:
  - FAL_KEY                 # or REPLICATE_API_TOKEN, OPENAI_API_KEY for image-gen
  - RUNWAY_API_KEY          # or KLING_API_KEY, LUMA_API_KEY for image-to-video
  - ELEVENLABS_API_KEY      # or OPENAI_API_KEY for TTS

If the user's setup has ComfyUI installed locally, the comfyui skill can replace the external image-gen API entirely (cheaper, more control, supports custom workflows for image-to-video too).

music-supervisor

toolsets:
  - kanban
  - terminal
  - file
skills:
  always_load:
    - kanban-worker
    - songsee                         # spectrograms / audio analysis
    # plus (depending on what the project needs):
    # - songwriting-and-ai-music      (commissioning Suno tracks)
    # - heartmula                     (commissioning open-source local generation)
    # - spotify                       (sourcing existing tracks)

editor / audio-mixer / captioner / masterer

toolsets:
  - kanban
  - terminal
  - file
  - video              # video_analyze — editor reviews assembled cuts natively
  - vision             # vision_analyze — spot-check frames
skills:
  always_load:
    - kanban-worker

These roles are mostly ffmpeg-driven, so no skill is needed beyond kanban-worker. For the captioner, add Whisper invocation patterns to its SOUL.md.

reviewer / brand-cop

toolsets:
  - kanban
  - terminal           # for media inspection (ffprobe, etc.)
  - file
  - video              # video_analyze — review full clips natively
  - vision             # vision_analyze — review stills / exported frames
skills:
  always_load:
    - kanban-worker
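
The per-role configs above follow a regular pattern, which the following Python sketch captures as a single function. It is only an illustration of the pattern: `profile_config` and the role sets are hypothetical names, and the returned dict mirrors the `toolsets` / `skills.always_load` keys used in the configs rather than any confirmed Hermes schema.

```python
BASE_TOOLSETS = ["kanban", "terminal", "file"]

# Roles whose configs above omit the terminal toolset.
NO_TERMINAL_ROLES = {"writer", "copywriter", "storyboarder"}

# Roles whose configs above add the media-review toolsets.
MEDIA_REVIEW_ROLES = {
    "cinematographer",
    "editor", "audio-mixer", "captioner", "masterer",
    "reviewer", "brand-cop",
}


def profile_config(role, extra_skills=()):
    """Build the toolsets/skills mapping for a role, following the matrix above."""
    toolsets = [
        t for t in BASE_TOOLSETS
        if not (t == "terminal" and role in NO_TERMINAL_ROLES)
    ]
    if role in MEDIA_REVIEW_ROLES:
        toolsets += ["video", "vision"]
    # Only the director loads the orchestrator playbook; everyone else is a worker.
    base_skill = "kanban-orchestrator" if role == "director" else "kanban-worker"
    return {
        "toolsets": toolsets,
        "skills": {"always_load": [base_skill, *extra_skills]},
    }
```

For example, `profile_config("writer", ["humanizer"])` reproduces the writer / copywriter config, and `profile_config("music-supervisor", ["songsee"])` reproduces the music-supervisor baseline before any project-specific skills are added.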

API key requirements

Track these in the project setup. The setup script should verify each required key is present in ~/.hermes/.env (or macOS Keychain) before firing the kanban.

| Service | Env var | Used by |
| --- | --- | --- |
| ElevenLabs | `ELEVENLABS_API_KEY` | voice-talent |
| OpenAI | `OPENAI_API_KEY` | image-generator (DALL-E), voice-talent (TTS) |
| OpenRouter | `OPENROUTER_API_KEY` | reviewer, cinematographer, editor (`video_analyze` routes through `AUXILIARY_VIDEO_MODEL` → OpenRouter) |
| FAL | `FAL_KEY` | image-generator (FAL flux models) |
| Replicate | `REPLICATE_API_TOKEN` | image-generator (alternate provider) |
| Runway | `RUNWAY_API_KEY` | image-to-video-generator |
| Kling | `KLING_API_KEY` | image-to-video-generator (alternate) |
| Luma | `LUMA_API_KEY` | image-to-video-generator (alternate) |
| Suno | `SUNO_API_KEY` | music-supervisor (paired with `songwriting-and-ai-music`) |
| Spotify | `SPOTIFY_CLIENT_ID` + `SPOTIFY_CLIENT_SECRET` | music-supervisor (paired with `spotify` skill) |
| Anthropic | `ANTHROPIC_API_KEY` | every Hermes profile (Claude) |

If a key is missing, prompt the user to add it. Storage methods, in order of preference: macOS Keychain → ~/.hermes/.env → environment variable.
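
A setup script's key check could look roughly like the Python sketch below. It covers only the `~/.hermes/.env` file and the process environment; a macOS Keychain lookup is deliberately left out. The function names are hypothetical, and the parser handles plain `KEY=VALUE` lines only, not the full dotenv syntax.

```python
from pathlib import Path


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop surrounding quotes so KEY='abc' and KEY=abc are equivalent.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_hermes_env(path=None):
    """Read ~/.hermes/.env (or the given path); a missing file yields no values."""
    env_path = Path(path) if path else Path.home() / ".hermes" / ".env"
    return parse_dotenv(env_path.read_text()) if env_path.exists() else {}


def missing_keys(required, dotenv_values, environ):
    """Return required keys that are absent or empty in both sources."""
    return [k for k in required if not (dotenv_values.get(k) or environ.get(k))]
```

A script would typically call `missing_keys(required, load_hermes_env(), os.environ)` with the env vars for the chosen providers and prompt the user for whatever comes back.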

Skill version pinning

If a specific skill version is desired, pass it via the per-task --skill <name>=<version> flag. The default is whatever's installed.
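
To make the flag shape concrete, here is a small Python helper that formats the `--skill <name>` and `--skill <name>=<version>` forms. `skill_flag` is a hypothetical name, and the version string `1.2.0` in the usage note is a placeholder, since the version format is not specified here.

```python
def skill_flag(name, version=None):
    """Format a per-task skill override, pinned to a version when one is given."""
    if not name or "=" in name or any(ch.isspace() for ch in name):
        raise ValueError(f"invalid skill name: {name!r}")
    if version is None:
        # No pin: the installed version is used.
        return f"--skill {name}"
    if not version or any(ch.isspace() for ch in version):
        raise ValueError(f"invalid version: {version!r}")
    return f"--skill {name}={version}"
```

`skill_flag("manim-video")` yields the unpinned form, and `skill_flag("manim-video", "1.2.0")` yields `--skill manim-video=1.2.0`.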

Adding a new skill to the matrix

When a new Hermes-public video skill ships:

1. Add a row to the relevant table at the top of this file.
2. If it warrants a specialized renderer variant, add it to role-archetypes.md.
3. Update relevant per-style examples in examples.md.
