mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-06 02:41:48 +00:00

James 20859cc408 docs(skill): sync hyperframes skill with upstream changes

Pulls the hyperframes skill up to the latest state of heygen-com/hyperframes
skill content. Opened 2026-04-17; upstream has shipped CLI, layout, and path
changes since.

- SKILL.md: promote the visual-style check to a proper HARD-GATE
  (DESIGN.md > named style > ask 3 questions, with the #333/#3b82f6/Roboto
  tells); expand Step 6 to cover audio-reactive (mandatory per-frame
  tl.call() sampling loop — a single long tween does NOT react to audio),
  caption exit guarantee (hard tl.set kill after group.end), marker
  highlighting, and scene transitions; add the animation-map script to
  Verification; link the new features.md.

- references/cli.md: add capture and validate (both shipped commands, both
  referenced from the workflow but missing from the reference). Add
  --lang to tts with the voice-prefix auto-inference table and espeak-ng
  dependency note (heygen-com/hyperframes#351, 2026-04-20 — after this
  PR opened).

- references/website-to-video.md: update all paths to the capture/
  subfolder layout introduced in heygen-com/hyperframes#345
  (capture/screenshots/, capture/assets/, capture/extracted/tokens.json).
  Old captured/ prefix was broken — agents following the skill were
  looking for files in wrong locations.

- references/features.md (new): distilled coverage for captions (language
  rule, tone table, word grouping, fitTextFontSize, exit guarantee), TTS
  (multilingual phonemization, speed tuning), audio-reactive (data
  format, mapping table, sampling pattern), marker highlighting
  (highlight/circle/burst/scribble/sketchout), and transitions (energy/
  mood tables, presets, shader-compatible CSS rules). Five topics the
  original PR didn't cover.

2026-05-04 14:13:17 -07:00

14 KiB

Raw Blame History

HyperFrames Feature Reference

Load this file when a composition needs captions, TTS narration, audio-reactive visuals, marker-style text highlighting, or scene transitions. All patterns here are deterministic (no Math.random(), no Date.now(), no runtime audio analysis) and live on the same GSAP timeline as the rest of the composition.

Captions

Language Rule (Non-Negotiable)

Never use .en whisper models unless the audio is confirmed English. .en models TRANSLATE non-English audio into English instead of transcribing it.

User says the language → npx hyperframes transcribe audio.mp3 --model small --language <code> (no .en)
User confirms English → --model small.en
Language unknown → --model small (auto-detects)

Style Detection

If the user doesn't specify a caption style, detect it from the transcript tone:

Tone	Font mood	Animation	Color	Size
Hype / launch	Heavy condensed, 800-900	Scale-pop, `back.out(1.7)`, 0.1-0.2s	Bright on dark	72-96px
Corporate	Clean sans, 600-700	Fade+slide, `power3.out`, 0.3s	White / neutral + muted accent	56-72px
Tutorial	Mono / clean sans, 500-600	Typewriter or fade, 0.4-0.5s	High contrast, minimal	48-64px
Storytelling	Serif / elegant, 400-500	Slow fade, `power2.out`, 0.5-0.6s	Warm muted tones	44-56px
Social	Rounded sans, 700-800	Bounce, `elastic.out`, word-by-word	Playful, colored pills	56-80px

Word Grouping

High energy: 2-3 words, quick turnover.
Conversational: 3-5 words, natural phrases.
Measured / calm: 4-6 words.

Break on sentence boundaries, 150ms+ pauses, or a max word count.

Positioning

Landscape (1920x1080): bottom 80-120px, centered.
Portrait (1080x1920): ~600-700px from bottom, centered.
Never cover the subject's face. position: absolute (never relative). One caption group visible at a time.

Text Overflow Prevention

Use the runtime helper so captions never overflow:

const result = window.__hyperframes.fitTextFontSize(group.text.toUpperCase(), {
  fontFamily: "Outfit",
  fontWeight: 900,
  maxWidth: 1600, // 1600 landscape, 900 portrait
});
el.style.fontSize = result.fontSize + "px";

When per-word styling uses scale > 1.0, compute maxWidth = safeWidth / maxScale to leave headroom. Container needs overflow: visible (not hidden — hidden clips scaled emphasis words and glow).

Caption Exit Guarantee

Every group MUST have a hard kill after its exit tween — otherwise groups leak into later ones:

tl.to(groupEl, { opacity: 0, scale: 0.95, duration: 0.12, ease: "power2.in" }, group.end - 0.12);
tl.set(groupEl, { opacity: 0, visibility: "hidden" }, group.end); // deterministic kill

Per-Word Styling

Scan the transcript for words that deserve distinct treatment:

Brand / product names — larger, unique color.
ALL CAPS — scale boost, flash, accent color.
Numbers / statistics — bold weight, accent color.
Emotional keywords — exaggerated animation (overshoot, bounce).
Call-to-action — highlight, underline, color pop.

TTS (Kokoro-82M)

Local, no API key. Runs on CPU. Model downloads on first use (~311 MB + ~27 MB voices, cached in ~/.cache/hyperframes/tts/).

Voice Selection

Content type	Voice	Why
Product demo	`af_heart` / `af_nova`	Warm, professional
Tutorial	`am_adam` / `bf_emma`	Neutral, easy to follow
Marketing	`af_sky` / `am_michael`	Energetic or authoritative
Documentation	`bf_emma` / `bm_george`	Clear British English
Casual	`af_heart` / `af_sky`	Approachable, natural

Run npx hyperframes tts --list for all 54 voices across 8 languages.

Multilingual Phonemization

Voice ID first letter encodes language: a=American English, b=British English, e=Spanish, f=French, h=Hindi, i=Italian, j=Japanese, p=Brazilian Portuguese, z=Mandarin. The CLI auto-infers the phonemizer locale from that prefix — you don't need --lang when voice and text match.

npx hyperframes tts "La reunión empieza a las nueve" --voice ef_dora --output es.wav
npx hyperframes tts "今日はいい天気ですね"            --voice jf_alpha --output ja.wav

Pass --lang only to override auto-detection (e.g. stylized accents):

npx hyperframes tts "Hello there" --voice af_heart --lang fr-fr --output accented.wav

Valid --lang codes: en-us, en-gb, es, fr-fr, hi, it, pt-br, ja, zh. Non-English phonemization requires espeak-ng installed system-wide (apt-get install espeak-ng / brew install espeak-ng).

Speed

0.7-0.8 — tutorial, complex content
1.0 — natural (default)
1.1-1.2 — intros, upbeat content
1.5+ — rarely appropriate

TTS + Captions Workflow

npx hyperframes tts script.txt --voice af_heart --output narration.wav
npx hyperframes transcribe narration.wav   # → transcript.json (word-level)

Audio-Reactive Visuals

Drive visuals from music, voice, or sound. Any GSAP-tweenable property can respond to pre-extracted audio data.

Data format

const AUDIO_DATA = {
  fps: 30,
  totalFrames: 900,
  frames: [{ bands: [0.82, 0.45, 0.31, /* ... */] }, /* ... */],
};

frames[i].bands[] are frequency band amplitudes, 0-1. Index 0 = bass, higher indices = treble. Each band is normalized independently across the full track.

Mapping audio to visuals

Audio signal	Visual property	Effect
Bass (`bands[0]`)	`scale`	Pulse on beat
Treble (`bands[12-14]`)	`textShadow`, `boxShadow`	Glow intensity
Overall amplitude	`opacity`, `y`, `backgroundColor`	Breathe, lift, color shift
Mid-range (`bands[4-8]`)	`borderRadius`, `width`	Shape morphing

Any GSAP-tweenable property works — clipPath, filter, SVG attributes, CSS custom properties. Let content guide the visual and let audio drive its behavior. Never add equalizer bars, spectrum analyzers, waveform displays, rainbow cycling, or generic particle systems — they look cheap.

Sampling pattern (required)

Audio reactivity needs per-frame sampling via a for loop of tl.call(), NOT a single tween. A single long tween does NOT react to audio:

for (let f = 0; f < AUDIO_DATA.totalFrames; f++) {
  tl.call(
    ((frame) => () => draw(frame))(AUDIO_DATA.frames[f]),
    [],
    f / AUDIO_DATA.fps,
  );
}

Gotchas

textShadow on a container with semi-transparent children (e.g. inactive caption words at rgba(255,255,255,0.3)) renders a visible glow rectangle behind every child. Apply the glow to active words individually, not to the container.
Subtlety for text — 3-6% scale variation, soft glow. Heavy pulsing makes text unreadable.
Go bigger on non-text — backgrounds and shapes can handle 10-30% swings.
Deterministic only — pre-extracted audio data, no Web Audio API, no runtime analysis.

Marker-Style Highlighting

Deterministic CSS + GSAP implementations of the classic "highlight / circle / burst / scribble / sketchout" drawing modes for emphasizing text. Fully seekable — no animated SVG filters, no JS timers.

Highlight (yellow marker sweep)

<span class="mh-highlight-wrap">
  <span class="mh-highlight-bar" id="hl-1"></span>
  <span class="mh-highlight-text">highlighted text</span>
</span>

.mh-highlight-wrap { position: relative; display: inline; }
.mh-highlight-bar {
  position: absolute; inset: 0 -6px;
  background: #fdd835; opacity: 0.35;
  transform: scaleX(0); transform-origin: left center;
  border-radius: 3px; z-index: 0;
}
.mh-highlight-text { position: relative; z-index: 1; }

tl.to("#hl-1", { scaleX: 1, duration: 0.5, ease: "power2.out" }, 0.6);

Multi-line: apply to .mh-highlight-bar with stagger: 0.3.

Circle

Hand-drawn ellipse around a word. Use a positioned ::before with border-radius: 50%, slight rotation, and clip-path to avoid covering the letters. Animate clip-path or stroke-dashoffset on an inline SVG circle.

Burst

Short radiating lines around a word. Render 6-12 small <span> elements positioned in a radial pattern; animate scaleY from 0.

Scribble

A chaotic overlay created by animating stroke-dashoffset on an inline SVG <path> with a d attribute describing a zig-zag. Seed values, never Math.random().

Sketchout

A rough rectangle outline. Two <rect>s with slight transform offsets, animated via stroke-dashoffset.

All five modes tween CSS transforms or stroke-dashoffset only — both tween cleanly, are deterministic, and seek correctly.

Scene Transitions

Every multi-scene composition MUST use transitions. No jump cuts.

Energy → primary transition

Energy	CSS primary	Shader primary	Accent	Duration	Easing
Calm (wellness, brand, luxury)	Blur crossfade, focus pull	Cross-warp morph, thermal distortion	Light leak, circle iris	0.5-0.8s	`sine.inOut`, `power1`
Medium (corporate, SaaS)	Push slide, staggered blocks	Whip pan, cinematic zoom	Squeeze, vertical push	0.3-0.5s	`power2`, `power3`
High (promos, sports, launch)	Zoom through, overexposure	Ridged burn, glitch, chromatic split	Staggered blocks, gravity drop	0.15-0.3s	`power4`, `expo`

Pick ONE primary (60-70% of scene changes) plus 1-2 accents. Never use a different transition for every scene.

Mood → transition type

Mood	Transitions
Warm / inviting	Light leak, blur crossfade, focus pull, film burn · Shader: thermal distortion, cross-warp morph
Cold / clinical	Squeeze, zoom out, blinds, shutter, grid dissolve · Shader: gravitational lens
Editorial / magazine	Push slide, vertical push, diagonal split, shutter · Shader: whip pan
Tech / futuristic	Grid dissolve, staggered blocks, blinds · Shader: glitch, chromatic split
Tense / edgy	Glitch, VHS, chromatic aberration, ripple · Shader: ridged burn, domain warp
Playful / fun	Elastic push, 3D flip, circle iris, morph circle · Shader: swirl vortex, ripple waves
Dramatic / cinematic	Zoom through, gravity drop, overexposure · Shader: cinematic zoom, gravitational lens
Premium / luxury	Focus pull, blur crossfade, color dip to black · Shader: cross-warp morph
Retro / analog	Film burn, light leak, VHS, clock wipe · Shader: light leak

Presets

Preset	Duration	Easing
`snappy`	0.2s	`power4.inOut`
`smooth`	0.4s	`power2.inOut`
`gentle`	0.6s	`sine.inOut`
`dramatic`	0.5s	`power3.in` → out
`instant`	0.15s	`expo.inOut`
`luxe`	0.7s	`power1.inOut`

Install a shader transition

npx hyperframes add flash-through-white
npx hyperframes add --list

CSS vs shader

CSS transitions animate scene containers with opacity, transforms, clip-path, and filters. Simpler to set up.
Shader transitions composite both scene textures per-pixel on a WebGL canvas — can warp, dissolve, and morph in ways CSS cannot. Import from @hyperframes/shader-transitions instead of writing raw GLSL.

Don't mix CSS and shader transitions in the same composition — once a composition uses shader transitions, the WebGL canvas replaces DOM-based scene switching for every transition.

Shader-compatible CSS rules

Shader transitions capture DOM scenes to WebGL textures via html2canvas. The canvas 2D pipeline doesn't match CSS exactly:

No transparent keyword in gradients — use the target color at zero alpha: rgba(200,117,51,0) not transparent. (Canvas interpolates transparent as rgba(0,0,0,0) creating dark fringes.)
No gradient backgrounds on elements thinner than 4px. Use solid background-color on thin accent lines.
No CSS variables (var()) on elements visible during capture — html2canvas doesn't reliably resolve custom properties. Use literal color values.
Mark uncapturable decoratives with data-no-capture — they stay on the live DOM but are absent from the shader texture.
No gradient opacity below 0.15 — renders differently in canvas vs CSS.
Every .scene div must have explicit background-color, AND pass the same color as bgColor in the init() config. Without either, the texture renders as black.

These rules only apply to shader transition compositions. CSS-only compositions have no restrictions.

Don't

Mix CSS and shader transitions in one composition.
Use exit animations on any scene except the final scene — the transition IS the exit.
Introduce a new transition type every scene — pick one primary + 1-2 accents.
Use transitions that create visible geometric repetition (grids, hex cells, uniform dots) — they look artificial regardless of the math behind them. Prefer organic noise (FBM, domain warping).

14 KiB Raw Blame History