feat(tts): add register_tts_provider() plugin hook (closes #30398)

Adds a `TTSProvider(ABC)` + `register_tts_provider()` extension point to the plugin context API, **alongside** the existing config-driven `tts.providers.<name>: type: command` registry from PR #17843. This is additive — the command-provider surface stays as the primary way to add a TTS backend. The hook covers cases the shell-template grammar can't reasonably express: - Native Python SDKs without a CLI (Cartesia, Fish Audio, etc.) - Streaming synthesis (chunked Opus → voice-bubble delivery) - Voice metadata API for the `hermes tools` picker - OAuth-refreshing auth flows None of the 10 inline built-in providers (`edge`, `openai`, `elevenlabs`, `minimax`, `gemini`, `mistral`, `xai`, `piper`, `kittentts`, `neutts`) are migrated to plugins. They stay inline. The hook is for *new* engines that aren't built-in. ## Resolution order The dispatcher's resolution order is the load-bearing invariant: 1. `tts.provider` is a built-in name → built-in dispatch. **Always wins.** 2. `tts.provider` matches `tts.providers.<name>` with `command:` set → command-provider dispatch (PR #17843). 3. `tts.provider` matches a plugin-registered `TTSProvider` → plugin dispatch (new). 4. No match → falls through to Edge TTS default (legacy behavior). Built-ins-always-win is enforced at THREE layers: - Registry: `register_provider()` rejects shadowing names with a warning. - Dispatcher: `_dispatch_to_plugin_provider()` short-circuits built-in names defensively before consulting the registry. - Picker: `_plugin_tts_providers()` filters built-in shadows out of the `hermes tools` row list defensively. Command-providers-win-over-plugins is enforced at TWO layers: - The caller in `text_to_speech_tool` checks `_resolve_command_provider_config` first. - `_dispatch_to_plugin_provider` re-checks for a same-name command config defensively so a refactor of the caller can't silently break the invariant. ## New files - `agent/tts_provider.py` — `TTSProvider(ABC)` with `synthesize()` (required), `list_voices()`, `list_models()`, `get_setup_schema()`, `stream()`, `voice_compatible` (all optional with sane defaults). Mirrors `agent/image_gen_provider.py` shape. - `agent/tts_registry.py` — `register_provider`/`get_provider`/`list_providers` with `_BUILTIN_NAMES` reject-shadowing invariant. Mirrors `agent/image_gen_registry.py` shape. - `plugins/tts/...` directory ready for community plugins (none shipped). ## Modified files - `hermes_cli/plugins.py` — `register_tts_provider()` method on `PluginContext`. Matches the gating shape of `register_image_gen_provider()` / `register_browser_provider()`. - `tools/tts_tool.py` — `_dispatch_to_plugin_provider()` + `_plugin_provider_is_voice_compatible()` + walrus-elif wiring into the main dispatcher. Built-in elif chain untouched. - `hermes_cli/tools_config.py` — `_plugin_tts_providers()` injects plugin rows into the Text-to-Speech picker category alongside the 10 hardcoded built-in rows. ## Tests - `tests/agent/test_tts_registry.py` — 47 tests covering registration, lookup, ABC contract, helpers, AND a `TestBuiltinSync` regression test that fails if `agent.tts_registry._BUILTIN_NAMES` drifts from `tools.tts_tool.BUILTIN_TTS_PROVIDERS` (kept duplicated due to circular import constraints). - `tests/tools/test_tts_plugin_dispatch.py` — 35 tests covering built-in-always-wins, command-wins-over-plugin, plugin dispatch, exception passthrough, voice_compatible helper. - `tests/hermes_cli/test_tts_picker.py` — 10 tests covering the picker surface, builtin shadowing defense, integration with `_visible_providers`. - `tests/hermes_cli/test_plugins_tts_registration.py` — 3 end-to-end tests via `PluginManager.discover_and_load()`. - `tests/plugins/tts/check_parity_vs_main.py` — 9-scenario subprocess parity harness vs `origin/main`. The only intentional diff is `fallback_edge → plugin` for the `plugin-installed` scenario. ## Verification - 95/95 new tests pass. - 170/170 pre-existing TTS tests (test_tts_command_providers, test_tts_max_text_length, test_tts_speed, etc.) pass unchanged. - Parity harness against `origin/main`: 8 OK + 1 expected DIFF. - E2E smoke: a registered plugin's `synthesize()` is called via `text_to_speech_tool` with the standard JSON envelope returned. - Ruff clean on all touched files. ## Docs - `website/docs/user-guide/features/tts.md` — new "Python plugin providers" section with a decision table (command-provider vs plugin), minimal plugin example, and the optional-hook reference. - `website/docs/user-guide/features/plugins.md` — TTS row updated to mention both surfaces (command-provider primary, plugin for SDK/streaming). Closes #30398
2026-06-04 07:31:58 +00:00 · 2026-05-22 17:58:07 +05:30 · 2026-05-22 17:58:07 +05:30 · 00ec0b617c
commit 00ec0b617c
parent 782681f904
13 changed files with 2037 additions and 1 deletions
--- a/website/docs/user-guide/features/plugins.md
+++ b/website/docs/user-guide/features/plugins.md
@ -234,7 +234,7 @@ The table above shows the four plugin categories, but within "General plugins" t
 | A **context-compression strategy** | Context-engine plugin — `ctx.register_context_engine()` | [Context Engine Plugins](/docs/developer-guide/context-engine-plugin) |
 | An **image-generation backend** (DALL·E, SDXL, …) | Backend plugin — `ctx.register_image_gen_provider()` | [Image Generation Provider Plugins](/docs/developer-guide/image-gen-provider-plugin) |
 | A **video-generation backend** (Veo, Kling, Pixverse, Grok-Imagine, Runway, …) | Backend plugin — `ctx.register_video_gen_provider()` | [Video Generation Provider Plugins](/docs/developer-guide/video-gen-provider-plugin) |
-| A **TTS backend** (any CLI — Piper, VoxCPM, Kokoro, xtts, voice-cloning scripts, …) | Config-driven — declare under `tts.providers.<name>` with `type: command` in `config.yaml` | [TTS setup](/docs/user-guide/features/tts#custom-command-providers) |
+| A **TTS backend** (any CLI — Piper, VoxCPM, Kokoro, xtts, voice-cloning scripts, …) | Config-driven (recommended) — declare under `tts.providers.<name>` with `type: command` in `config.yaml`. OR Python backend plugin — `ctx.register_tts_provider()` for Python-SDK / streaming engines that need more than a shell template. | [TTS Setup](/docs/user-guide/features/tts#custom-command-providers) · [Python plugin guide](/docs/user-guide/features/tts#python-plugin-providers) |
 | An **STT backend** (custom whisper binary, local ASR CLI) | Config-driven — set `HERMES_LOCAL_STT_COMMAND` env var to a shell template | [Voice Message Transcription (STT)](/docs/user-guide/features/tts#voice-message-transcription-stt) |
 | **External tools via MCP** (filesystem, GitHub, Linear, Notion, any MCP server) | Config-driven — declare `mcp_servers.<name>` with `command:` / `url:` in `config.yaml`. Hermes auto-discovers the server's tools and registers them alongside built-ins. | [MCP](/docs/user-guide/features/mcp) |
 | **Additional skill sources** (custom GitHub repos, private skill indexes) | CLI — `hermes skills tap add <repo>` | [Skills Hub](/docs/user-guide/features/skills#skills-hub) · [Publishing a custom tap](/docs/user-guide/features/skills#publishing-a-custom-skill-tap) |
--- a/website/docs/user-guide/features/tts.md
+++ b/website/docs/user-guide/features/tts.md
@ -297,6 +297,85 @@ Use `{{` and `}}` for literal braces.

 Command-type providers run whatever shell command you configure, with your user's permissions. Hermes quotes placeholder values and enforces the configured timeout, but the command template itself is trusted local input — treat it the same way you would a shell script on your PATH.

+### Python plugin providers
+
+For TTS engines that can't be expressed as a single shell command — Python SDKs without a CLI, streaming engines, voice-listing APIs, OAuth-refreshing auth — register a Python plugin via `ctx.register_tts_provider()`. The plugin **coexists with** (does not replace) the [Custom command providers](#custom-command-providers) registry; pick the surface that fits your engine.
+
+#### When to pick which
+
+| Your backend has… | Use |
+|---|---|
+| A single CLI reading text from a file/stdin and writing audio to a file/stdout | **Command provider** (no Python needed) |
+| Two or three CLIs chained with shell pipes | **Command provider** |
+| A Python SDK only — no CLI | **Plugin** |
+| Streaming bytes you want to deliver chunked (mid-generation voice bubbles) | **Plugin** (override `stream()`) |
+| A voice-listing API used by `hermes setup` | **Plugin** (override `list_voices()`) |
+| OAuth refresh flow (not a static bearer token) | **Plugin** |
+
+Built-ins always win, and command providers win over a same-name plugin — so plugins are safe to register against any non-built-in name without worrying about shadowing your existing config.
+
+#### Minimal plugin
+
+Drop this in `~/.hermes/plugins/my-tts/`:
+
+`plugin.yaml`:
+```yaml
+name: my-tts
+version: 0.1.0
+description: "My custom Python TTS backend"
+```
+
+`__init__.py`:
+```python
+from agent.tts_provider import TTSProvider
+
+
+class MyTTSProvider(TTSProvider):
+    @property
+    def name(self) -> str:
+        return "my-tts"  # what tts.provider matches against
+
+    @property
+    def display_name(self) -> str:
+        return "My Custom TTS"
+
+    def is_available(self) -> bool:
+        # Return False when credentials/deps are missing — picker skips
+        # this row but the dispatcher still routes here on explicit config.
+        import os
+        return bool(os.environ.get("MY_TTS_API_KEY"))
+
+    def synthesize(self, text, output_path, *, voice=None, model=None,
+                   speed=None, format="mp3", **extra) -> str:
+        # Write audio bytes to output_path, return the path.
+        # Raise on failure — the dispatcher converts exceptions to a
+        # standard error envelope.
+        import my_tts_sdk
+        client = my_tts_sdk.Client()
+        audio_bytes = client.synthesize(text=text, voice=voice or "default")
+        with open(output_path, "wb") as f:
+            f.write(audio_bytes)
+        return output_path
+
+
+def register(ctx):
+    ctx.register_tts_provider(MyTTSProvider())
+```
+
+Enable it (`hermes plugins enable my-tts`), point `tts.provider` at it (`tts.provider: my-tts` in `config.yaml`), and the `text_to_speech` tool will route through your plugin.
+
+#### Optional hooks
+
+Override these on your provider class for richer integration:
+
+- `list_voices()` → list of `{id, display, language, gender, preview_url}` dicts shown in `hermes tools`.
+- `list_models()` → list of `{id, display, languages, max_text_length}` dicts.
+- `get_setup_schema()` → return `{name, badge, tag, env_vars: [{key, prompt, url}]}` to power the picker row in `hermes tools` / `hermes setup`. Without this, the plugin still works but its row in the picker is minimal.
+- `stream(text, *, voice, model, format, **extra)` → iterator yielding audio bytes for streaming delivery (default raises `NotImplementedError`).
+- `voice_compatible` property → set `True` if your output is Opus-compatible and the gateway should deliver it as a voice bubble (default `False` = regular audio attachment).
+
+See `agent/tts_provider.py` for the full ABC including docstrings.
+
 ## Voice Message Transcription (STT)

 Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.