feat(tts): add register_tts_provider() plugin hook (closes #30398)

Adds a `TTSProvider(ABC)` + `register_tts_provider()` extension point
to the plugin context API, **alongside** the existing config-driven
`tts.providers.<name>: type: command` registry from PR #17843. This is
additive — the command-provider surface stays as the primary way to
add a TTS backend.

The hook covers cases the shell-template grammar can't reasonably
express:

- Native Python SDKs without a CLI (Cartesia, Fish Audio, etc.)
- Streaming synthesis (chunked Opus → voice-bubble delivery)
- Voice metadata API for the `hermes tools` picker
- OAuth-refreshing auth flows

None of the 10 inline built-in providers (`edge`, `openai`,
`elevenlabs`, `minimax`, `gemini`, `mistral`, `xai`, `piper`,
`kittentts`, `neutts`) are migrated to plugins. They stay inline. The
hook is for *new* engines that aren't built-in.

## Resolution order

The dispatcher's resolution order is the load-bearing invariant:

1. `tts.provider` is a built-in name → built-in dispatch. **Always wins.**
2. `tts.provider` matches `tts.providers.<name>` with `command:` set
   → command-provider dispatch (PR #17843).
3. `tts.provider` matches a plugin-registered `TTSProvider`
   → plugin dispatch (new).
4. No match → falls through to Edge TTS default (legacy behavior).
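
The four steps above can be read as a single lookup function. The following is a minimal sketch, not the dispatcher from this commit: `resolve_layer`, its return labels, and the `plugin_names` argument are hypothetical names introduced for illustration, and the `command` value in the usage note is a placeholder.

```python
from typing import Any, Dict, Iterable, Optional

# The ten built-in names listed in this commit message.
BUILTIN_TTS_PROVIDERS = frozenset({
    "edge", "openai", "elevenlabs", "minimax", "gemini",
    "mistral", "xai", "piper", "kittentts", "neutts",
})


def resolve_layer(
    provider: Optional[str],
    tts_config: Dict[str, Any],
    plugin_names: Iterable[str],
) -> str:
    """Return which layer would handle ``provider`` (hypothetical helper)."""
    key = (provider or "").lower().strip()

    # 1. Built-in names always win.
    if key in BUILTIN_TTS_PROVIDERS:
        return "builtin"

    # 2. A tts.providers.<name> entry with `command:` set wins over plugins.
    providers = tts_config.get("providers") if isinstance(tts_config, dict) else None
    named = providers.get(key) if isinstance(providers, dict) else None
    if isinstance(named, dict) and named.get("command"):
        return "command"

    # 3. A plugin registered under the same name.
    if key in set(plugin_names):
        return "plugin"

    # 4. Nothing matched: legacy Edge TTS default.
    return "edge_default"
```

For example, with `cfg = {"providers": {"mytts": {"type": "command", "command": "my-tts-cli"}}}`, the call `resolve_layer("mytts", cfg, {"mytts"})` reports the command layer even though a plugin of the same name is registered.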

Built-ins-always-win is enforced at THREE layers:
- Registry: `register_provider()` rejects shadowing names with a warning.
- Dispatcher: `_dispatch_to_plugin_provider()` short-circuits built-in
  names defensively before consulting the registry.
- Picker: `_plugin_tts_providers()` filters built-in shadows out of
  the `hermes tools` row list defensively.

Command-providers-win-over-plugins is enforced at TWO layers:
- The caller in `text_to_speech_tool` checks
  `_resolve_command_provider_config` first.
- `_dispatch_to_plugin_provider` re-checks for a same-name command
  config defensively so a refactor of the caller can't silently break
  the invariant.
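
As an illustration of the layered checks, here is a rough sketch of a registry that rejects shadowing names plus a picker-side filter that repeats the check. The signatures are assumptions: the real `register_provider()` in `agent/tts_registry.py` may take different arguments, and `visible_plugin_rows` is a hypothetical stand-in for `_plugin_tts_providers()`.

```python
import logging
from typing import Dict, List, Optional

logger = logging.getLogger(__name__)

# Assumption: mirrors the ten built-in names listed in this commit message.
_BUILTIN_NAMES = frozenset({
    "edge", "openai", "elevenlabs", "minimax", "gemini",
    "mistral", "xai", "piper", "kittentts", "neutts",
})

_providers: Dict[str, object] = {}


def register_provider(name: str, provider: object) -> bool:
    """Registry layer: refuse names that would shadow a built-in."""
    key = name.lower().strip()
    if key in _BUILTIN_NAMES:
        logger.warning("TTS plugin %r shadows a built-in provider; ignoring", key)
        return False
    _providers[key] = provider
    return True


def get_provider(name: str) -> Optional[object]:
    """Look up a plugin provider by normalized name."""
    return _providers.get(name.lower().strip())


def visible_plugin_rows() -> List[str]:
    """Picker layer: filter built-in names again, even though the
    registry should never contain them."""
    return sorted(n for n in _providers if n not in _BUILTIN_NAMES)
```

Repeating the same membership test in each layer is redundant by design: a bug or refactor in one layer cannot on its own let a plugin take over a built-in name.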

## New files

- `agent/tts_provider.py` — `TTSProvider(ABC)` with `synthesize()` (required),
  `list_voices()`, `list_models()`, `get_setup_schema()`, `stream()`,
  `voice_compatible` (all optional with sane defaults). Mirrors
  `agent/image_gen_provider.py` shape.
- `agent/tts_registry.py` — `register_provider`/`get_provider`/`list_providers`
  with `_BUILTIN_NAMES` reject-shadowing invariant. Mirrors
  `agent/image_gen_registry.py` shape.
- `plugins/tts/...` directory ready for community plugins (none shipped).
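
To make the contract concrete, the sketch below shows one way an ABC with a required `synthesize()` and optional hooks could look, together with a toy provider. It is an assumption-laden illustration, not the contents of `agent/tts_provider.py`: the keyword arguments follow the call made in `_dispatch_to_plugin_provider()`, while the abstract `name` property, the `list_voices()` return shape, and `SilenceProvider` are hypothetical.

```python
import abc
import wave
from typing import Any, Dict, List, Optional


class TTSProvider(abc.ABC):
    """Sketch of the plugin contract; only ``synthesize`` does real work."""

    @property
    @abc.abstractmethod
    def name(self) -> str:
        """Name matched against ``tts.provider`` (assumed abstract here)."""

    @abc.abstractmethod
    def synthesize(
        self,
        text: str,
        output_path: str,
        *,
        voice: Optional[str] = None,
        model: Optional[str] = None,
        speed: Optional[float] = None,
        format: str = "mp3",
    ) -> str:
        """Write audio to ``output_path`` and return the path written."""

    def list_voices(self) -> List[Dict[str, Any]]:
        # Optional hook: default to "no voice metadata available".
        return []

    @property
    def voice_compatible(self) -> bool:
        # Optional hook: voice-bubble delivery is opt-in.
        return False


class SilenceProvider(TTSProvider):
    """Hypothetical provider that writes silent 16 kHz mono WAV audio."""

    @property
    def name(self) -> str:
        return "silence"

    def synthesize(self, text, output_path, *, voice=None, model=None,
                   speed=None, format="wav"):
        seconds = max(1, len(text) // 20)
        with wave.open(output_path, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)       # 16-bit samples
            wav.setframerate(16000)
            wav.writeframes(b"\x00\x00" * 16000 * seconds)
        return output_path
```

Because `synthesize` is abstract, instantiating `TTSProvider` directly raises `TypeError`, while a subclass that implements only the required members inherits the safe defaults for everything else.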

## Modified files

- `hermes_cli/plugins.py` — `register_tts_provider()` method on
  `PluginContext`. Matches the gating shape of
  `register_image_gen_provider()` / `register_browser_provider()`.
- `tools/tts_tool.py` — `_dispatch_to_plugin_provider()` +
  `_plugin_provider_is_voice_compatible()` + walrus-elif wiring into
  the main dispatcher. Built-in elif chain untouched.
- `hermes_cli/tools_config.py` — `_plugin_tts_providers()` injects
  plugin rows into the Text-to-Speech picker category alongside the
  10 hardcoded built-in rows.
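
The "walrus-elif" wiring mentioned for `tools/tts_tool.py` relies on an assignment expression inside an `elif` condition: the branch is taken only when the dispatcher returns a path, and a `None` result lets control continue down the chain. A simplified sketch of that pattern follows; `route`, `command_names`, and the string results are hypothetical and stand in for the real dispatcher.

```python
from typing import Callable, Optional, Set

BUILTIN = frozenset({"edge", "elevenlabs"})  # shortened list for the sketch


def route(
    provider: str,
    command_names: Set[str],
    plugin_dispatch: Callable[[str], Optional[str]],
) -> str:
    if provider in command_names:
        return f"command:{provider}"
    # The walrus binds `path` only if the dispatcher found a plugin.
    # `and` short-circuits, so built-in names never call the dispatcher.
    elif provider not in BUILTIN and (path := plugin_dispatch(provider)) is not None:
        return f"plugin:{path}"
    elif provider == "elevenlabs":
        return "builtin:elevenlabs"
    else:
        # Unknown names land on the default at the bottom of the chain.
        return "builtin:edge"
```

With `plugins = {"cartesia": "/tmp/out.mp3"}`, `route("cartesia", set(), plugins.get)` takes the plugin branch, whereas `route("unknown", set(), plugins.get)` falls through to the Edge default because the dispatcher returned `None`.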

## Tests

- `tests/agent/test_tts_registry.py` — 47 tests covering registration,
  lookup, ABC contract, helpers, AND a `TestBuiltinSync` regression
  test that fails if `agent.tts_registry._BUILTIN_NAMES` drifts from
  `tools.tts_tool.BUILTIN_TTS_PROVIDERS` (kept duplicated due to
  circular import constraints).
- `tests/tools/test_tts_plugin_dispatch.py` — 35 tests covering
  built-in-always-wins, command-wins-over-plugin, plugin dispatch,
  exception passthrough, voice_compatible helper.
- `tests/hermes_cli/test_tts_picker.py` — 10 tests covering the
  picker surface, builtin shadowing defense, integration with
  `_visible_providers`.
- `tests/hermes_cli/test_plugins_tts_registration.py` — 3 end-to-end
  tests via `PluginManager.discover_and_load()`.
- `tests/plugins/tts/check_parity_vs_main.py` — 9-scenario subprocess
  parity harness vs `origin/main`. The only intentional diff is
  `fallback_edge → plugin` for the `plugin-installed` scenario.
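
The `TestBuiltinSync` idea, guarding two deliberately duplicated name sets against drift, can be sketched as a small comparison helper. This is an illustrative version only; `builtin_drift` and `assert_builtins_in_sync` are hypothetical names, and the real test presumably imports `_BUILTIN_NAMES` and `BUILTIN_TTS_PROVIDERS` from their modules.

```python
from typing import Dict, FrozenSet, Iterable


def builtin_drift(
    registry_names: Iterable[str],
    tool_names: Iterable[str],
) -> Dict[str, FrozenSet[str]]:
    """Report names present in one set but not the other."""
    registry, tool = frozenset(registry_names), frozenset(tool_names)
    return {
        "missing_from_registry": tool - registry,
        "missing_from_tool": registry - tool,
    }


def assert_builtins_in_sync(
    registry_names: Iterable[str],
    tool_names: Iterable[str],
) -> None:
    """Raise AssertionError if the duplicated name sets have diverged."""
    drift = builtin_drift(registry_names, tool_names)
    if any(drift.values()):
        raise AssertionError(f"built-in TTS name sets drifted: {drift}")
```

Reporting both directions of the difference makes a failure actionable: it names the provider that was added to one module but not the other.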

## Verification

- 95/95 new tests pass.
- 170/170 pre-existing TTS tests (test_tts_command_providers,
  test_tts_max_text_length, test_tts_speed, etc.) pass unchanged.
- Parity harness against `origin/main`: 8 OK + 1 expected DIFF.
- E2E smoke: a registered plugin's `synthesize()` is called via
  `text_to_speech_tool` with the standard JSON envelope returned.
- Ruff clean on all touched files.

## Docs

- `website/docs/user-guide/features/tts.md` — new "Python plugin
  providers" section with a decision table (command-provider vs
  plugin), minimal plugin example, and the optional-hook reference.
- `website/docs/user-guide/features/plugins.md` — TTS row updated to
  mention both surfaces (command-provider primary, plugin for
  SDK/streaming).

Closes #30398
kshitijk4poor 2026-05-22 17:58:07 +05:30 committed by Teknium
parent 782681f904
commit 00ec0b617c
13 changed files with 2037 additions and 1 deletions


@@ -419,6 +419,123 @@ def _resolve_command_provider_config(
    return None


def _dispatch_to_plugin_provider(
    text: str,
    output_path: str,
    provider: str,
    tts_config: Dict[str, Any],
) -> Optional[str]:
    """Route the call to a plugin-registered TTS provider, or return None.

    Returns the path to the written audio file on dispatch, or ``None``
    to fall through to the next resolution layer (built-in dispatch or
    Edge TTS default).

    Resolution invariants enforced here (matches issue #30398):

    1. Built-in provider names short-circuit and never reach the plugin
       registry. The caller is responsible for the elif chain that
       handles ``edge``/``openai``/etc.; this function explicitly
       rejects those names defensively.
    2. Command-type providers declared under
       ``tts.providers.<name>: type: command`` (PR #17843) win over a
       plugin with the same name. The caller reaches us only when its
       own command-provider check returned None; we re-verify here so
       a refactor of the caller can't silently break the invariant.
    3. Plugin dispatch fires only when ``provider`` matches a registered
       :class:`TTSProvider` whose ``name`` equals the configured value.
       Unknown names return None (caller falls through to Edge default).

    Exceptions raised by the plugin's ``synthesize()`` propagate
    unchanged; the outer ``text_to_speech_tool`` try/except converts
    them to the standard error envelope, matching how command-provider
    failures surface.
    """
    if not provider:
        return None
    key = provider.lower().strip()
    if key in BUILTIN_TTS_PROVIDERS:
        return None

    # Defense in depth: command-provider check should already have
    # short-circuited the caller. If a same-name command config exists,
    # bail so the command path wins.
    if _is_command_provider_config(_get_named_provider_config(tts_config, key)):
        return None

    try:
        from agent.tts_registry import get_provider
        from hermes_cli.plugins import _ensure_plugins_discovered

        _ensure_plugins_discovered()
        plugin_provider = get_provider(key)
        if plugin_provider is None:
            # Long-lived sessions may have discovered plugins before the
            # bundled backend was patched in or before config changed.
            # Retry once with a forced refresh before surfacing
            # fall-through. Mirrors the image_gen / browser dispatcher
            # recovery pattern.
            _ensure_plugins_discovered(force=True)
            plugin_provider = get_provider(key)
    except Exception as exc:  # noqa: BLE001 - discovery failure is non-fatal
        logger.debug("tts plugin dispatch skipped (discovery failed): %s", exc)
        return None

    if plugin_provider is None:
        return None

    # Resolve voice / model / format from tts_config. Providers should
    # treat all of these as optional and fall back to their own defaults
    # when None is passed (matches the ABC contract documented on
    # ``TTSProvider.synthesize``).
    voice = tts_config.get("voice") if isinstance(tts_config, dict) else None
    model = tts_config.get("model") if isinstance(tts_config, dict) else None
    speed = tts_config.get("speed") if isinstance(tts_config, dict) else None
    fmt = (
        tts_config.get("output_format", DEFAULT_COMMAND_TTS_OUTPUT_FORMAT)
        if isinstance(tts_config, dict)
        else DEFAULT_COMMAND_TTS_OUTPUT_FORMAT
    )

    logger.info(
        "Generating speech with plugin TTS provider '%s'...", key,
    )
    written = plugin_provider.synthesize(
        text,
        output_path,
        voice=voice if isinstance(voice, str) and voice else None,
        model=model if isinstance(model, str) and model else None,
        speed=float(speed) if isinstance(speed, (int, float)) else None,
        format=str(fmt).lower() if fmt else "mp3",
    )
    # Provider contract: returns the (possibly rewritten) output path.
    # Defensive against a provider returning None or a non-string:
    # fall back to the caller's expected output_path.
    return written if isinstance(written, str) and written else output_path


def _plugin_provider_is_voice_compatible(provider: str) -> bool:
    """Return True when the registered plugin provider opts into voice
    bubble delivery via its ``voice_compatible`` property.

    Defensive: any registry or property access failure means False
    (matches the safe default for the command-provider path).
    """
    if not provider:
        return False
    key = provider.lower().strip()
    if key in BUILTIN_TTS_PROVIDERS:
        return False
    try:
        from agent.tts_registry import get_provider

        plugin_provider = get_provider(key)
        if plugin_provider is None:
            return False
        return bool(plugin_provider.voice_compatible)
    except Exception as exc:  # noqa: BLE001
        logger.debug(
            "tts plugin voice_compatible check failed for '%s': %s", key, exc,
        )
        return False


def _iter_command_providers(tts_config: Dict[str, Any]):
    """Yield (name, config) pairs for every declared command-type provider."""
    if not isinstance(tts_config, dict):
@@ -1787,6 +1904,21 @@ def text_to_speech_tool(
                text, file_str, provider, command_provider_config, tts_config,
            )

        # Plugin-registered TTS backend (issue #30398). Fires when the
        # configured provider is neither a built-in nor a command-type
        # entry, AND a plugin is registered under that name. The walrus
        # binds `_plugin_path` only when the dispatcher returns a path
        # (i.e. a plugin was actually found); a None return falls
        # through to the built-in elif chain so unknown names hit the
        # Edge TTS default at the bottom. The dispatcher itself enforces
        # built-ins-always-win + command-wins-over-plugin defensively.
        elif provider not in BUILTIN_TTS_PROVIDERS and (
            _plugin_path := _dispatch_to_plugin_provider(
                text, file_str, provider, tts_config,
            )
        ) is not None:
            file_str = _plugin_path

        elif provider == "elevenlabs":
            try:
                _import_elevenlabs()
@@ -1925,6 +2057,18 @@ def text_to_speech_tool(
                    if opus_path:
                        file_str = opus_path
                voice_compatible = file_str.endswith(".ogg")
        elif provider not in BUILTIN_TTS_PROVIDERS:
            # Plugin-registered provider (issue #30398). Voice-bubble
            # delivery opts in via ``TTSProvider.voice_compatible``
            # (mirrors the command-provider opt-in). Plugins that
            # already write Opus skip the ffmpeg conversion.
            plugin_voice_compatible = _plugin_provider_is_voice_compatible(provider)
            if plugin_voice_compatible:
                if not file_str.endswith(".ogg"):
                    opus_path = _convert_to_opus(file_str)
                    if opus_path:
                        file_str = opus_path
                voice_compatible = file_str.endswith(".ogg")
        elif (
            want_opus
            and provider in {"edge", "neutts", "minimax", "xai", "kittentts", "piper"}