hermes-agent/tools/video_generation_tool.py
Teknium 9d42c2c286
feat(video_gen): unified video_generate tool with pluggable provider backends (#25126)
* feat(video_gen): unified video_generate tool with pluggable provider backends

One core video_generate tool, every backend a plugin. Mirrors the
image_gen + memory_provider + context_engine architecture: ABC, registry,
plugin-context registration hook, and per-plugin model catalogs surfaced
through hermes tools.

Surface (one schema, every backend):
- operation: generate / edit / extend
- modalities: text-to-video (prompt only), image-to-video (prompt +
  image_url), video edit (prompt + video_url), video extend (video_url)
- reference_image_urls, duration, aspect_ratio, resolution,
  negative_prompt, audio, seed, model override
- Providers ignore unknown kwargs and declare what they support via
  VideoGenProvider.capabilities() — backend-specific quirks stay in the
  backend, the agent learns one tool
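
A rough sketch of the contract in the last bullet: a backend accepts the whole keyword surface and quietly drops what it does not support. `SketchProvider`, `EchoProvider`, and the `supported_params` capability key are hypothetical stand-ins for illustration, not the real `VideoGenProvider` ABC in agent/video_gen_provider.py.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class SketchProvider(ABC):
    """Hypothetical stand-in for the provider ABC (not the real class)."""

    @abstractmethod
    def capabilities(self) -> Dict[str, Any]:
        """Declare what this backend supports."""

    @abstractmethod
    def generate(self, prompt: str, **kwargs: Any) -> Dict[str, Any]:
        """Accept the whole tool surface; unsupported kwargs are ignored."""


class EchoProvider(SketchProvider):
    """Toy backend that supports only duration and aspect_ratio."""

    SUPPORTED = ("duration", "aspect_ratio")

    def capabilities(self) -> Dict[str, Any]:
        return {"modalities": ["text"], "supported_params": list(self.SUPPORTED)}

    def generate(self, prompt: str, **kwargs: Any) -> Dict[str, Any]:
        # Keep only the parameters this backend declares; drop the rest silently.
        used = {k: v for k, v in kwargs.items() if k in self.SUPPORTED}
        return {"success": True, "prompt": prompt, "used": used}


# seed and audio are not declared by EchoProvider, so they never reach "used".
result = EchoProvider().generate("a red kite", duration=5, seed=42, audio=True)
```

The caller never needs to know which backend is active; it passes everything and the backend filters.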

Backends shipped:
- plugins/video_gen/xai/  — Grok-Imagine, full generate/edit/extend +
  image-to-video + reference images (salvaged from PR #10600 by
  @Jaaneek, reshaped into the plugin interface)
- plugins/video_gen/fal/  — Veo 3.1 (t2v + i2v), Kling O3 i2v,
  Pixverse v6 i2v with model-aware payload building that drops keys a
  model doesn't declare

Wiring:
- agent/video_gen_provider.py — VideoGenProvider ABC, normalize_operation,
  success_response / error_response, save_b64_video / save_bytes_video,
  $HERMES_HOME/cache/videos/
- agent/video_gen_registry.py — thread-safe register/get/list +
  get_active_provider() reading video_gen.provider from config.yaml
- hermes_cli/plugins.py — PluginContext.register_video_gen_provider()
- hermes_cli/tools_config.py — Video Generation category in
  hermes tools, plugin-only providers list, model picker per plugin,
  config write to video_gen.{provider,model}
- toolsets.py — new video_gen toolset
- tests: 31 new tests covering ABC, registry, tool dispatch, both plugins
- docs: developer-guide/video-gen-provider-plugin.md (parallel to the
  image-gen guide), sidebar + toolsets-reference + plugin guides updated
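
For orientation, the register/get/list plus active-provider lookup described above could look roughly like this. `SketchRegistry` and its method names are assumptions made for this sketch; the real module is agent/video_gen_registry.py and may differ in every detail.

```python
import threading
from typing import Any, Dict, List, Optional


class SketchRegistry:
    """Hypothetical thread-safe name -> provider map (illustration only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._providers: Dict[str, Any] = {}

    def register(self, name: str, provider: Any) -> None:
        with self._lock:
            self._providers[name] = provider

    def get(self, name: str) -> Optional[Any]:
        with self._lock:
            return self._providers.get(name)

    def names(self) -> List[str]:
        with self._lock:
            return sorted(self._providers)

    def get_active(self, config: Dict[str, Any]) -> Optional[Any]:
        # Mirrors the idea of reading video_gen.provider from a parsed config.
        section = config.get("video_gen") or {}
        name = section.get("provider")
        return self.get(name) if isinstance(name, str) else None


reg = SketchRegistry()
reg.register("fal", "fal-provider")
reg.register("xai", "xai-provider")
active = reg.get_active({"video_gen": {"provider": "xai"}})
```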

Supersedes: #25035 (FAL), #17972 (FAL), #14543 (xAI), #13847 (HappyHorse),
#10458 (provider categories), #10786 (xAI media+search bundle), #2984
(FAL duplicate), #19086 (Google Veo standalone — easy port to plugin
interface).

Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>

* feat(video_gen): dynamic schema reflects active backend's capabilities

Address the 'capability variance' question — instead of one tool with a
static schema that lies about what every backend supports, the
video_generate tool now rebuilds its description at get_definitions()
time based on the configured video_gen.provider and video_gen.model.

The agent sees backend-specific guidance up-front:
- 'fal-ai/veo3.1/image-to-video': 'image-to-video only — image_url is
  REQUIRED; text-only prompts will be rejected'
- 'fal-ai/veo3.1' (t2v): no image_url restriction shown
- xAI grok-imagine-video: 'operations: generate, edit, extend; up to 7
  reference_image_urls'
- Backends without edit/extend: 'not supported on this backend — surface
  that they need to switch backends via hermes tools'

This is the same pattern PR #22694 used for delegate_task self-capping —
documented in the dynamic-tool-schemas skill. Cache invalidation is
free: get_tool_definitions() already memoizes on config.yaml mtime, so a
mid-session backend swap rebuilds the schema automatically.
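
A minimal sketch of mtime-keyed memoization, to show why a config edit is enough to trigger a rebuild. `make_mtime_cached` is a hypothetical helper written for this note; it is not the code in get_tool_definitions().

```python
import os
from typing import Any, Callable, Dict, Optional, Tuple


def make_mtime_cached(
    path: str, build: Callable[[], Dict[str, Any]]
) -> Callable[[], Dict[str, Any]]:
    """Return a zero-arg function that rebuilds only when path's mtime changes."""
    cache: Dict[str, Tuple[Optional[int], Dict[str, Any]]] = {}

    def get() -> Dict[str, Any]:
        try:
            mtime: Optional[int] = os.stat(path).st_mtime_ns
        except OSError:
            mtime = None  # missing file: "no config" acts as its own cache key
        hit = cache.get("entry")
        if hit is not None and hit[0] == mtime:
            return hit[1]  # file untouched since the last build
        value = build()
        cache["entry"] = (mtime, value)
        return value

    return get
```

Writing a new provider or model to the config file changes its mtime, so the next call rebuilds the schema without any explicit invalidation hook.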

Tested:
- Empirical FAL OpenAPI schema check confirms image-to-video models
  require image_url (FAL returns HTTP 422 otherwise) — client-side
  rejection in FALVideoGenProvider.generate() now prevents the wasted
  round-trip
- Live E2E: fal-ai/veo3.1/image-to-video + prompt-only → clean
  missing_image_url error; fal-ai/veo3.1 + prompt-only → dispatches
- 6 new tests cover the builder (no config / image-only / full-surface /
  text-only / unknown provider / registry wiring), all passing
- 37/37 in the slice, 134/134 in the broader regression set

* test(video_gen/xai): full surface integration tests + cleaner schema

Verified end-to-end that the xAI plugin handles every documented mode
from PR #10600's surface: text-to-video, image-to-video,
reference-images-to-video, video edit, video extend (with and without
prompt). All five modes route to the correct xAI endpoint
(/videos/generations, /videos/edits, /videos/extensions) with the right
payload shape (image / reference_images / video keys), and all five
client-side rejections fire before the network: edit-without-prompt,
extend-without-video_url, image+refs conflict, >7 references, and
duration/aspect_ratio clamping.

15 new integration tests grouped into four classes (endpoint routing,
modalities, validation, clamping). httpx is stubbed via a small fake
AsyncClient that records POSTs so the tests assert the actual payload
the plugin would send to xAI — not just the success/error envelope.
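
The recording-stub idea can be sketched as below. `FakeAsyncClient`, `FakeResponse`, and `submit` are hypothetical names for this illustration and are not the fixtures used in the actual test suite; only the `/videos/generations` path comes from the text above.

```python
import asyncio
from typing import Any, Dict, List, Optional, Tuple


class FakeResponse:
    """Just enough of a response object for the caller below."""

    def __init__(self, payload: Dict[str, Any]) -> None:
        self._payload = payload
        self.status_code = 200

    def json(self) -> Dict[str, Any]:
        return self._payload

    def raise_for_status(self) -> None:
        return None


class FakeAsyncClient:
    """Records every POST instead of touching the network."""

    posts: List[Tuple[str, Dict[str, Any]]] = []  # class-level so tests can inspect it

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        pass

    async def __aenter__(self) -> "FakeAsyncClient":
        return self

    async def __aexit__(self, *exc: Any) -> bool:
        return False

    async def post(
        self, url: str, json: Optional[Dict[str, Any]] = None, **kwargs: Any
    ) -> FakeResponse:
        FakeAsyncClient.posts.append((url, json or {}))
        return FakeResponse({"request_id": "fake-123"})


async def submit(client_cls: Any, prompt: str) -> Dict[str, Any]:
    # Hypothetical caller; a real plugin would construct httpx.AsyncClient here.
    async with client_cls() as client:
        resp = await client.post("/videos/generations", json={"prompt": prompt})
        resp.raise_for_status()
        return resp.json()


result = asyncio.run(submit(FakeAsyncClient, "a paper boat"))
```

Because the stub keeps the outbound URL and body, a test can assert on the payload shape rather than only on the returned envelope.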

Also cleaned up a description redundancy: when a model's operations
match the backend's overall set, we no longer print the duplicate
'operations supported by this model' line. xAI's description now reads:

    Active backend: xAI . model: grok-imagine-video
    - operations supported by this backend: edit, extend, generate
    - modalities supported by this backend: image, reference_images, text
    - aspect_ratio choices: 16:9, 1:1, 2:3, 3:2, 3:4, 4:3, 9:16
    - resolution choices: 480p, 720p
    - duration range: 1-15s
    - reference_image_urls: up to 7 images

Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>

* feat(video_gen): collapse surface to t2v + i2v, family-based auto-routing

Two design changes per Teknium:

1) Drop edit/extend from the tool surface entirely. Only text-to-video
and image-to-video remain. The agent sees a clean tool with two
modalities; backend-specific quirks like xAI's edit/extend endpoints
stay out of the unified schema.

2) FAL: pick a model FAMILY once, the plugin routes between the
family's text-to-video and image-to-video endpoints based on whether
image_url was passed. Users no longer pick 'fal-ai/veo3.1' AND
'fal-ai/veo3.1/image-to-video' as separate options — they pick
'veo3.1', and the plugin handles the rest.

Catalog rewritten as families:

    veo3.1            fal-ai/veo3.1                                /  fal-ai/veo3.1/image-to-video
    pixverse-v6       fal-ai/pixverse/v6/text-to-video             /  fal-ai/pixverse/v6/image-to-video
    kling-o3-standard fal-ai/kling-video/o3/standard/text-to-video /  fal-ai/kling-video/o3/standard/image-to-video
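
The routing rule is small enough to sketch. The endpoint strings are copied from the catalog above; the dict layout, `SKETCH_FAMILIES`, and `route` are assumptions for illustration, not the plugin's real data structure.

```python
from typing import Dict, Optional, Tuple

# Endpoint pairs copied from the catalog above. The dict layout itself is an
# assumption, not the plugin's actual family table.
SKETCH_FAMILIES: Dict[str, Dict[str, str]] = {
    "veo3.1": {
        "text_endpoint": "fal-ai/veo3.1",
        "image_endpoint": "fal-ai/veo3.1/image-to-video",
    },
    "pixverse-v6": {
        "text_endpoint": "fal-ai/pixverse/v6/text-to-video",
        "image_endpoint": "fal-ai/pixverse/v6/image-to-video",
    },
    "kling-o3-standard": {
        "text_endpoint": "fal-ai/kling-video/o3/standard/text-to-video",
        "image_endpoint": "fal-ai/kling-video/o3/standard/image-to-video",
    },
}


def route(family: str, image_url: Optional[str] = None) -> Tuple[str, str]:
    """Return (endpoint, modality) for a family, keyed on image_url presence."""
    entry = SKETCH_FAMILIES[family]
    if image_url:
        return entry["image_endpoint"], "image"
    return entry["text_endpoint"], "text"


text_route = route("veo3.1")
image_route = route("pixverse-v6", "https://example.com/still.png")
```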

xAI uses a single endpoint (/videos/generations) for both modes,
routed by the presence of the 'image' field in the payload — no
edit/extend exposure.

Schema changes:
- VIDEO_GENERATE_SCHEMA: drop operation, drop video_url. Final params:
  prompt (required), image_url, reference_image_urls, duration,
  aspect_ratio, resolution, negative_prompt, audio, seed, model.
- VideoGenProvider ABC: drop normalize_operation, VALID_OPERATIONS,
  DEFAULT_OPERATION. capabilities() drops 'operations' key.
- success_response: add 'modality' field ('text' | 'image') so the
  agent and logs can see which endpoint was actually hit.

Dynamic schema builder simplified — no operations bullet, no
'switch backends if you need edit/extend' guidance. When the active
backend supports both modalities (the common case), description reads:

    Active backend: FAL . model: pixverse-v6
    - supports both text-to-video (omit image_url) and image-to-video
      (pass image_url) - routes automatically
    - aspect_ratio choices: 16:9, 9:16, 1:1
    - resolution choices: 360p, 540p, 720p, 1080p
    - duration range: 1-15s
    - audio: pass audio=true to enable native audio (pricing tier)
    - negative_prompt: supported

Tests: 51 in the video_gen slice, 216 across the broader image+video
sweep, all passing. New FAL routing tests prove pixverse-v6 + no image
hits text-to-video endpoint, pixverse-v6 + image_url hits
image-to-video endpoint, same for veo3.1 and kling-o3-standard.

Docs updated: developer-guide page rewrites the 'model families' pattern
as a first-class section so external plugin authors know the convention.
toolsets-reference and toolsets.py descriptions match the new surface.

Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>

* feat(video_gen/fal): expand catalog to 6 families, cheap + premium tiers

Catalog now covers everything Teknium specced from FAL:

  Cheap tier:
    ltx-2.3        fal-ai/ltx-2.3-22b/text-to-video       / image-to-video
    pixverse-v6    fal-ai/pixverse/v6/text-to-video       / image-to-video

  Premium tier:
    veo3.1         fal-ai/veo3.1                          / fal-ai/veo3.1/image-to-video
    seedance-2.0   bytedance/seedance-2.0/text-to-video   / image-to-video
    kling-v3-4k    fal-ai/kling-video/v3/4k/text-to-video / image-to-video
    happy-horse    fal-ai/happy-horse/text-to-video       / image-to-video

DEFAULT_MODEL moved from veo3.1 (premium) to pixverse-v6 (cheap, sane
defaults, both modalities) — better first-run UX for users who haven't
explicitly picked a model.

New family-entry knob: image_param_key. Kling v3 4K's image-to-video
endpoint expects start_image_url instead of image_url; declaring
image_param_key='start_image_url' on the family lets _build_payload
remap correctly. Other families default to plain image_url.
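
A hedged sketch of how such a remap might sit inside a payload builder. `build_payload`, the `params` key, and the two family dicts are hypothetical; only `image_param_key` and `start_image_url` come from the description above, and the real `_build_payload` may be structured differently.

```python
from typing import Any, Dict, Optional


def build_payload(
    family: Dict[str, Any],
    prompt: str,
    image_url: Optional[str] = None,
    **params: Any,
) -> Dict[str, Any]:
    """Hypothetical builder: remap the image key, drop undeclared params."""
    payload: Dict[str, Any] = {"prompt": prompt}
    if image_url:
        # Families without an explicit key fall back to plain image_url.
        payload[family.get("image_param_key", "image_url")] = image_url
    declared = set(family.get("params", ()))
    for key, value in params.items():
        if key in declared and value is not None:
            payload[key] = value
    return payload


# Illustrative family entries; the declared params are made up for the sketch.
kling_4k = {"image_param_key": "start_image_url", "params": ("duration", "aspect_ratio")}
ltx = {"params": ()}

kling_payload = build_payload(
    kling_4k, "a fox", image_url="https://example.com/fox.png", duration=5, seed=7
)
ltx_payload = build_payload(
    ltx, "a fox", image_url="https://example.com/fox.png", duration=5
)
```

In this sketch the Kling entry sends `start_image_url` and drops the undeclared `seed`, while the minimal LTX entry sends only the prompt and `image_url`.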

Per-family capability flags reflect each model's docs:
- LTX 2.3 + Happy Horse: minimal payloads (no duration/aspect/resolution
  enum exposed by FAL — let endpoint apply defaults)
- Seedance: 6 aspect ratios incl 21:9, durations 4-15, audio supported,
  negative prompts NOT supported per docs
- Kling v3 4K: 16:9/9:16/1:1, 3-15s, audio + negative
- Veo 3.1: unchanged, 16:9/9:16, 4/6/8s

Tests: +5 covering the new families (full catalog, Kling 4K
start_image_url remap, Seedance routing, LTX payload minimality, Happy
Horse minimality). 56/56 in the slice green.

Note: I did NOT add the FAL-hosted xAI Grok-Imagine variant. Hermes
already has a direct xAI plugin that talks to xAI's own API; routing
the same model through FAL's wrapper would duplicate the surface
without adding capabilities. Users on FAL who want Grok-Imagine should
use the xAI plugin directly; flag if you want both routes available.

* test(video_gen): tool-surface routing matrix — every model x modality

End-to-end matrix test driven through _handle_video_generate() — the
actual function the agent's video_generate tool call lands in. Writes
config.yaml, invokes the registered handler with a raw args dict, then
asserts the outbound HTTP/SDK call hit the right endpoint with the right
payload shape.

Parametrized over FAL_FAMILIES.keys() so the matrix auto-discovers new
families as they're added (add a family to FAL_FAMILIES and you get
both modalities tested for free).

Coverage:
- All 6 FAL families x {text-only, text+image} = 12 cases
- xAI x {text-only, text+image} = 2 cases
- tool-level model= arg overrides config = 2 cases

For each case, verifies:
- result['success'] is True
- result['modality'] matches input shape ('text' if no image_url, 'image' otherwise)
- outbound endpoint URL matches the family's text_endpoint or image_endpoint
- text-only payloads carry no image-shaped keys
- text+image payloads carry the family's image key (image_url for most,
  start_image_url for kling-v3-4k, wrapped 'image' object for xAI)

All 16 cases passing. Confirms the tool surface routes every
(provider, model, modality) combination correctly with zero leakage.

* feat(video_gen): keep video_gen out of first-run setup, surface in status

Two changes:

1. video_gen joins _DEFAULT_OFF_TOOLSETS, so it is NOT pre-selected in
   the first-run toolset checklist. Video gen is niche, paid, and slow —
   most users don't want it nagging them during initial setup. Anyone
   who wants it opts in via 'hermes tools' -> Video Generation, which
   already routes to the provider+model picker.

2. The 'hermes setup' status panel learns about video_gen — but only
   shows the row when a plugin reports available. Users without
   FAL_KEY/XAI_API_KEY see nothing about video gen; users with one of
   those keys see 'Video Generation (FAL) ✓' as confirmation it's wired.

Verified live:
- Fresh install (no creds): zero video_gen mentions in wizard.
- With FAL_KEY: status row appears with active backend name.
- 160/160 in the setup + tools_config + video_gen test slice.

Rationale: image_gen is on by default because it's a featured creative
tool used in casual chat (Telegram, etc.). Video gen is heavier: long
waits and paid per-second pricing. Default-off matches user intent better.

---------

Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
2026-05-13 16:39:41 -07:00


#!/usr/bin/env python3
"""
Video Generation Tool
=====================

Single ``video_generate`` tool that dispatches to a plugin-registered
video generation provider. Mirrors the ``image_generate`` design:

- ``agent/video_gen_provider.py`` defines the :class:`VideoGenProvider` ABC.
- ``agent/video_gen_registry.py`` holds the active providers (populated by
  plugins at import time).
- Each provider lives under ``plugins/video_gen/<name>/``.

The tool itself is intentionally backend-agnostic and ships **no in-tree
provider**. Turn on a backend by enabling a plugin (``hermes plugins
enable video_gen/<name>``) and selecting it in ``hermes tools`` → Video
Generation.

Unified surface
---------------
One tool covers two modalities, text-to-video and image-to-video, with a
compact schema:

    prompt                text instruction (required)
    image_url             when set, routes to image-to-video; when omitted,
                          routes to text-to-video
    reference_image_urls  list, up to provider-declared cap
    duration              seconds (provider clamps)
    aspect_ratio          "16:9" | "9:16" | "1:1" | ...
    resolution            "480p" | "540p" | "720p" | "1080p"
    negative_prompt       optional (Pixverse/Kling style)
    audio                 optional (Veo3/Pixverse pricing tier)
    seed                  optional
    model                 optional, override the active provider's default

Providers ignore parameters they do not support. The tool layer does
**lightweight** validation (type/required-prompt) and lets each provider
do its own clamping inside :meth:`VideoGenProvider.generate`; that keeps
the tool surface stable as new providers ship with different capabilities.
"""
from __future__ import annotations

import json
import logging
from typing import Any, Dict, List, Optional

from agent.video_gen_provider import (
    COMMON_ASPECT_RATIOS,
    COMMON_RESOLUTIONS,
    DEFAULT_ASPECT_RATIO,
    DEFAULT_RESOLUTION,
    error_response,
)
from tools.registry import registry, tool_error

logger = logging.getLogger(__name__)

VIDEO_GENERATE_SCHEMA: Dict[str, Any] = {
    "name": "video_generate",
    # Placeholder — the real description is built dynamically at
    # get_tool_definitions() time so it reflects the active backend's
    # actual capabilities (which modalities / resolutions / duration
    # ranges the user's currently-selected model supports).
    # See _build_dynamic_video_schema() below and the dynamic-tool-schemas
    # skill at github/hermes-agent-dev/references/dynamic-tool-schemas.md.
    "description": "(rebuilt at get_definitions() time — see _build_dynamic_video_schema)",
    "parameters": {
        "type": "object",
        "properties": {
            "prompt": {
                "type": "string",
                "description": (
                    "Text instruction describing the desired video, motion, "
                    "subject, style, camera movement, etc."
                ),
            },
            "image_url": {
                "type": "string",
                "description": (
                    "Optional public URL of a still image. When provided, "
                    "the active backend routes to its image-to-video "
                    "endpoint (animate the image); when omitted, it routes "
                    "to text-to-video. Pass either a URL the user supplied "
                    "or a path/URL from the conversation."
                ),
            },
            "reference_image_urls": {
                "type": "array",
                "items": {"type": "string"},
                "description": (
                    "Optional list of reference image URLs (style or "
                    "character refs). Only supported by some backends; "
                    "the active backend's description below indicates whether "
                    "this is honored and what the max is."
                ),
            },
            "duration": {
                "type": "integer",
                "description": (
                    "Desired video duration in seconds. Providers clamp to "
                    "their supported range (commonly 4-15s). Omit to use the "
                    "provider's default."
                ),
            },
            "aspect_ratio": {
                "type": "string",
                "enum": list(COMMON_ASPECT_RATIOS),
                "description": (
                    "Output aspect ratio. Providers clamp to their supported "
                    "set."
                ),
                "default": DEFAULT_ASPECT_RATIO,
            },
            "resolution": {
                "type": "string",
                "enum": list(COMMON_RESOLUTIONS),
                "description": (
                    "Output resolution. Providers clamp to their supported "
                    "set."
                ),
                "default": DEFAULT_RESOLUTION,
            },
            "negative_prompt": {
                "type": "string",
                "description": (
                    "Optional negative prompt — content to avoid in the "
                    "output. Supported by Pixverse, Kling, and similar; "
                    "ignored by providers that do not support it."
                ),
            },
            "audio": {
                "type": "boolean",
                "description": (
                    "Optional audio generation toggle. Supported by Veo3 and "
                    "Pixverse (affects pricing tier); ignored elsewhere."
                ),
            },
            "seed": {
                "type": "integer",
                "description": (
                    "Optional seed for reproducible outputs (provider-"
                    "dependent)."
                ),
            },
            "model": {
                "type": "string",
                "description": (
                    "Optional model override. If omitted, the user's "
                    "configured ``video_gen.model`` (set via `hermes tools` "
                    "→ Video Generation) is used. Models that the active "
                    "provider does not know are rejected."
                ),
            },
        },
        "required": ["prompt"],
    },
}

# ---------------------------------------------------------------------------
# Config readers (mirror image_generation_tool.py)
# ---------------------------------------------------------------------------

def _read_video_gen_section() -> Dict[str, Any]:
    try:
        from hermes_cli.config import load_config

        cfg = load_config()
        section = cfg.get("video_gen") if isinstance(cfg, dict) else None
        return section if isinstance(section, dict) else {}
    except Exception as exc:
        logger.debug("Could not read video_gen config: %s", exc)
        return {}


def _read_configured_video_provider() -> Optional[str]:
    value = _read_video_gen_section().get("provider")
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def _read_configured_video_model() -> Optional[str]:
    value = _read_video_gen_section().get("model")
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None

# ---------------------------------------------------------------------------
# Availability check
# ---------------------------------------------------------------------------

def check_video_generation_requirements() -> bool:
    """Return True when at least one registered provider reports available.

    Triggers plugin discovery (idempotent) so user-installed plugins are
    visible to the toolset gate.
    """
    try:
        from agent.video_gen_registry import list_providers
        from hermes_cli.plugins import _ensure_plugins_discovered

        _ensure_plugins_discovered()
        for provider in list_providers():
            try:
                if provider.is_available():
                    return True
            except Exception:
                continue
    except Exception:
        pass
    return False

# ---------------------------------------------------------------------------
# Dispatch
# ---------------------------------------------------------------------------

def _resolve_active_provider():
    """Return the active provider object or None.

    Forces plugin discovery before checking the registry — handles cases
    where a long-lived session was started before a plugin was installed.
    """
    try:
        from agent.video_gen_registry import get_active_provider
        from hermes_cli.plugins import _ensure_plugins_discovered

        _ensure_plugins_discovered()
        provider = get_active_provider()
        if provider is None:
            _ensure_plugins_discovered(force=True)
            provider = get_active_provider()
        return provider
    except Exception as exc:
        logger.debug("video_gen provider resolution failed: %s", exc)
        return None


def _missing_provider_error(configured: Optional[str]) -> str:
    if configured:
        msg = (
            f"video_gen.provider='{configured}' is set but no plugin "
            f"registered that name. Run `hermes plugins list` to see "
            f"installed video gen backends, or `hermes tools` → Video "
            f"Generation to pick one."
        )
        return json.dumps(error_response(
            error=msg, error_type="provider_not_registered",
            provider=configured,
        ))
    msg = (
        "No video generation backend is configured. Run `hermes tools` → "
        "Video Generation to enable one (xAI, FAL, or Google Veo)."
    )
    return json.dumps(error_response(
        error=msg, error_type="no_provider_configured",
    ))

# ---------------------------------------------------------------------------
# Handler
# ---------------------------------------------------------------------------

def _coerce_int(value: Any) -> Optional[int]:
    if value is None or value == "":
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def _coerce_bool(value: Any) -> Optional[bool]:
    if value is None:
        return None
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        v = value.strip().lower()
        if v in ("true", "1", "yes", "on"):
            return True
        if v in ("false", "0", "no", "off"):
            return False
    return None


def _normalize_reference_images(value: Any) -> Optional[List[str]]:
    if value is None:
        return None
    if isinstance(value, str):
        value = [value]
    if not isinstance(value, (list, tuple)):
        return None
    out: List[str] = []
    for item in value:
        if isinstance(item, str) and item.strip():
            out.append(item.strip())
    return out or None


def _handle_video_generate(args: Dict[str, Any], **_kw: Any) -> str:
    prompt = (args.get("prompt") or "").strip()
    image_url = (args.get("image_url") or "").strip() or None
    reference_image_urls = _normalize_reference_images(args.get("reference_image_urls"))
    duration = _coerce_int(args.get("duration"))
    aspect_ratio = (args.get("aspect_ratio") or DEFAULT_ASPECT_RATIO).strip() or DEFAULT_ASPECT_RATIO
    resolution = (args.get("resolution") or DEFAULT_RESOLUTION).strip() or DEFAULT_RESOLUTION
    negative_prompt = (args.get("negative_prompt") or "").strip() or None
    audio = _coerce_bool(args.get("audio"))
    seed = _coerce_int(args.get("seed"))
    model_override = (args.get("model") or "").strip() or None

    # Soft validation — providers do their own. Prompt is required by the
    # schema; the backend may still accept image-only on its image-to-video
    # endpoint but our surface always needs a prompt.
    if not prompt:
        return tool_error("prompt is required for video generation")

    # Resolve the active provider.
    configured = _read_configured_video_provider()
    provider = _resolve_active_provider()
    if provider is None:
        return _missing_provider_error(configured)

    # Resolve model: explicit arg wins, then config, then provider default.
    model = model_override or _read_configured_video_model() or provider.default_model()

    kwargs: Dict[str, Any] = {
        "model": model,
        "image_url": image_url,
        "reference_image_urls": reference_image_urls,
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "resolution": resolution,
        "negative_prompt": negative_prompt,
        "audio": audio,
        "seed": seed,
    }
    # Drop None entries so providers see clean defaults.
    kwargs = {k: v for k, v in kwargs.items() if v is not None}

    try:
        result = provider.generate(prompt=prompt, **kwargs)
    except TypeError as exc:
        # A provider that hasn't widened its signature is a bug, not a
        # caller error — log and surface a clear contract message.
        logger.warning(
            "video_gen provider '%s' rejected kwargs (signature too narrow): %s",
            getattr(provider, "name", "?"), exc,
        )
        return json.dumps(error_response(
            error=(
                f"Provider '{getattr(provider, 'name', '?')}' signature is "
                f"out of date with the video_generate schema. Report this "
                f"to the plugin author."
            ),
            error_type="provider_contract",
            provider=getattr(provider, "name", ""),
            model=model or "",
            prompt=prompt,
        ))
    except Exception as exc:
        logger.warning(
            "video_gen provider '%s' raised: %s",
            getattr(provider, "name", "?"), exc,
        )
        return json.dumps(error_response(
            error=f"Provider '{getattr(provider, 'name', '?')}' error: {exc}",
            error_type="provider_exception",
            provider=getattr(provider, "name", ""),
            model=model or "",
            prompt=prompt,
        ))

    if not isinstance(result, dict):
        return json.dumps(error_response(
            error="Provider returned a non-dict result",
            error_type="provider_contract",
            provider=getattr(provider, "name", ""),
            model=model or "",
            prompt=prompt,
        ))
    return json.dumps(result)
# ---------------------------------------------------------------------------
# Dynamic schema — reflect the active backend's actual capabilities
# ---------------------------------------------------------------------------
#
# Why dynamic: the user's configured backend determines which modalities
# (text / image / refs), aspect ratios, resolutions, durations, and
# audio/negative-prompt flags are real. A model
# that calls video_generate without knowing the active backend wastes a
# turn on something like "fal-ai/veo3.1/image-to-video requires image_url".
# Surfacing the per-model surface in the description means the model
# usually gets the call right on the first try.
#
# Memoization: model_tools.get_tool_definitions() keys its cache on
# config.yaml mtime, so when the user changes provider/model via
# `hermes tools` or `/skills`, the schema rebuilds automatically.

_GENERIC_DESCRIPTION = (
    "Generate a video from a text prompt (text-to-video) or animate a "
    "still image (image-to-video) using the user's configured video "
    "generation backend. Pass `image_url` to animate that image; omit it "
    "to generate from text alone. The backend auto-routes to the right "
    "endpoint. The backend and model family are user-configured via "
    "`hermes tools` → Video Generation; the agent does not pick them. "
    "Long-running generations may take 30 seconds to several minutes — "
    "the call blocks until the video is ready. Returns either an HTTP "
    "URL or an absolute file path in the `video` field; display it with "
    "markdown ![description](url-or-path) and the gateway will deliver it."
)


def _format_model_caveats(
    model_meta: Dict[str, Any],
    backend_caps: Dict[str, Any],
) -> List[str]:
    """Pull human-readable caveats out of one model's catalog metadata.

    Only surfaces things that meaningfully differ from the backend's
    overall capabilities — repeating defaults is noise.
    """
    caveats: List[str] = []
    modalities = set(model_meta.get("modalities") or [])
    modality = model_meta.get("modality")  # FAL's plugin uses this key for single-modality entries
    if modality:
        modalities.add(modality)
    if "image" in modalities and "text" not in modalities:
        caveats.append(
            "this model is image-to-video only — image_url is REQUIRED; "
            "text-only calls will be rejected"
        )
    elif "text" in modalities and "image" not in modalities:
        caveats.append(
            "this model is text-to-video only — image_url is not supported"
        )
    return caveats


def _build_dynamic_video_schema() -> Dict[str, Any]:
    """Build a description that reflects the active backend's actual surface.

    Cheap: reads config (already memoized by the caller), asks the active
    provider for `capabilities()` and the active model's catalog entry,
    and formats a few lines of prose. Falls back to the generic
    description when no provider is configured or registered.
    """
    parts: List[str] = [_GENERIC_DESCRIPTION]
    configured = _read_configured_video_provider()
    configured_model = _read_configured_video_model()

    if not configured:
        parts.append(
            "\nNo video backend is configured. Calls will return an error "
            "until the user picks one via `hermes tools` → Video Generation."
        )
        return {"description": "\n".join(parts)}

    try:
        from agent.video_gen_registry import get_provider
        from hermes_cli.plugins import _ensure_plugins_discovered

        _ensure_plugins_discovered()
        provider = get_provider(configured)
    except Exception:
        provider = None

    if provider is None:
        parts.append(
            f"\nActive backend: {configured} (plugin not yet loaded — the "
            f"tool will retry discovery on first call)."
        )
        return {"description": "\n".join(parts)}

    try:
        caps = provider.capabilities() or {}
    except Exception:
        caps = {}
    try:
        models = provider.list_models() or []
    except Exception:
        models = []

    active_model = configured_model or provider.default_model()
    model_meta = next(
        (m for m in models if isinstance(m, dict) and m.get("id") == active_model),
        {},
    )

    backend_label = provider.display_name
    line = f"\nActive backend: {backend_label}"
    if active_model:
        line += f" · model: {active_model}"
    parts.append(line)

    # Model-specific caveats (the high-signal stuff)
    for c in _format_model_caveats(model_meta, caps):
        parts.append(f"- {c}")

    # Backend modality summary — only useful when the backend supports
    # both text and image. Single-modality backends are already covered by
    # the model caveat above.
    modalities = set(caps.get("modalities") or [])
    if "text" in modalities and "image" in modalities and not model_meta.get("modality"):
        parts.append(
            "- supports both text-to-video (omit image_url) and "
            "image-to-video (pass image_url) — routes automatically"
        )

    if caps.get("aspect_ratios"):
        parts.append(f"- aspect_ratio choices: {', '.join(caps['aspect_ratios'])}")
    if caps.get("resolutions"):
        parts.append(f"- resolution choices: {', '.join(caps['resolutions'])}")
    if caps.get("min_duration") and caps.get("max_duration"):
        parts.append(
            f"- duration range: {caps['min_duration']}-{caps['max_duration']}s"
        )
    if caps.get("supports_audio"):
        parts.append("- audio: pass `audio=true` to enable native audio (pricing tier)")
    if caps.get("supports_negative_prompt"):
        parts.append("- negative_prompt: supported")
    max_refs = caps.get("max_reference_images") or 0
    if max_refs:
        parts.append(f"- reference_image_urls: up to {max_refs} images")

    return {"description": "\n".join(parts)}
# ---------------------------------------------------------------------------
# Registry
# ---------------------------------------------------------------------------

registry.register(
    name="video_generate",
    toolset="video_gen",
    schema=VIDEO_GENERATE_SCHEMA,
    handler=_handle_video_generate,
    check_fn=check_video_generation_requirements,
    requires_env=[],
    is_async=False,
    emoji="🎬",
    dynamic_schema_overrides=_build_dynamic_video_schema,
)