mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-06 07:51:53 +00:00
feat(tts): add register_tts_provider() plugin hook (closes #30398)
Adds a `TTSProvider(ABC)` + `register_tts_provider()` extension point to the plugin context API, **alongside** the existing config-driven `tts.providers.<name>: type: command` registry from PR #17843. This is additive — the command-provider surface stays as the primary way to add a TTS backend. The hook covers cases the shell-template grammar can't reasonably express: - Native Python SDKs without a CLI (Cartesia, Fish Audio, etc.) - Streaming synthesis (chunked Opus → voice-bubble delivery) - Voice metadata API for the `hermes tools` picker - OAuth-refreshing auth flows None of the 10 inline built-in providers (`edge`, `openai`, `elevenlabs`, `minimax`, `gemini`, `mistral`, `xai`, `piper`, `kittentts`, `neutts`) are migrated to plugins. They stay inline. The hook is for *new* engines that aren't built-in. ## Resolution order The dispatcher's resolution order is the load-bearing invariant: 1. `tts.provider` is a built-in name → built-in dispatch. **Always wins.** 2. `tts.provider` matches `tts.providers.<name>` with `command:` set → command-provider dispatch (PR #17843). 3. `tts.provider` matches a plugin-registered `TTSProvider` → plugin dispatch (new). 4. No match → falls through to Edge TTS default (legacy behavior). Built-ins-always-win is enforced at THREE layers: - Registry: `register_provider()` rejects shadowing names with a warning. - Dispatcher: `_dispatch_to_plugin_provider()` short-circuits built-in names defensively before consulting the registry. - Picker: `_plugin_tts_providers()` filters built-in shadows out of the `hermes tools` row list defensively. Command-providers-win-over-plugins is enforced at TWO layers: - The caller in `text_to_speech_tool` checks `_resolve_command_provider_config` first. - `_dispatch_to_plugin_provider` re-checks for a same-name command config defensively so a refactor of the caller can't silently break the invariant. ## New files - `agent/tts_provider.py` — `TTSProvider(ABC)` with `synthesize()` (required), `list_voices()`, `list_models()`, `get_setup_schema()`, `stream()`, `voice_compatible` (all optional with sane defaults). Mirrors `agent/image_gen_provider.py` shape. - `agent/tts_registry.py` — `register_provider`/`get_provider`/`list_providers` with `_BUILTIN_NAMES` reject-shadowing invariant. Mirrors `agent/image_gen_registry.py` shape. - `plugins/tts/...` directory ready for community plugins (none shipped). ## Modified files - `hermes_cli/plugins.py` — `register_tts_provider()` method on `PluginContext`. Matches the gating shape of `register_image_gen_provider()` / `register_browser_provider()`. - `tools/tts_tool.py` — `_dispatch_to_plugin_provider()` + `_plugin_provider_is_voice_compatible()` + walrus-elif wiring into the main dispatcher. Built-in elif chain untouched. - `hermes_cli/tools_config.py` — `_plugin_tts_providers()` injects plugin rows into the Text-to-Speech picker category alongside the 10 hardcoded built-in rows. ## Tests - `tests/agent/test_tts_registry.py` — 47 tests covering registration, lookup, ABC contract, helpers, AND a `TestBuiltinSync` regression test that fails if `agent.tts_registry._BUILTIN_NAMES` drifts from `tools.tts_tool.BUILTIN_TTS_PROVIDERS` (kept duplicated due to circular import constraints). - `tests/tools/test_tts_plugin_dispatch.py` — 35 tests covering built-in-always-wins, command-wins-over-plugin, plugin dispatch, exception passthrough, voice_compatible helper. - `tests/hermes_cli/test_tts_picker.py` — 10 tests covering the picker surface, builtin shadowing defense, integration with `_visible_providers`. - `tests/hermes_cli/test_plugins_tts_registration.py` — 3 end-to-end tests via `PluginManager.discover_and_load()`. - `tests/plugins/tts/check_parity_vs_main.py` — 9-scenario subprocess parity harness vs `origin/main`. The only intentional diff is `fallback_edge → plugin` for the `plugin-installed` scenario. ## Verification - 95/95 new tests pass. - 170/170 pre-existing TTS tests (test_tts_command_providers, test_tts_max_text_length, test_tts_speed, etc.) pass unchanged. - Parity harness against `origin/main`: 8 OK + 1 expected DIFF. - E2E smoke: a registered plugin's `synthesize()` is called via `text_to_speech_tool` with the standard JSON envelope returned. - Ruff clean on all touched files. ## Docs - `website/docs/user-guide/features/tts.md` — new "Python plugin providers" section with a decision table (command-provider vs plugin), minimal plugin example, and the optional-hook reference. - `website/docs/user-guide/features/plugins.md` — TTS row updated to mention both surfaces (command-provider primary, plugin for SDK/streaming). Closes #30398
This commit is contained in:
parent
782681f904
commit
00ec0b617c
13 changed files with 2037 additions and 1 deletions
274
agent/tts_provider.py
Normal file
274
agent/tts_provider.py
Normal file
|
|
@ -0,0 +1,274 @@
|
|||
"""
|
||||
Text-to-Speech Provider ABC
|
||||
============================
|
||||
|
||||
Defines the pluggable-backend interface for text-to-speech synthesis.
|
||||
Providers register instances via
|
||||
``PluginContext.register_tts_provider()``; the active one (selected via
|
||||
``tts.provider`` in ``config.yaml``) services every ``text_to_speech``
|
||||
tool call **only when the configured name is neither a built-in nor a
|
||||
command-type provider declared under ``tts.providers.<name>``**.
|
||||
|
||||
Three coexisting TTS extension surfaces — in resolution order:
|
||||
|
||||
1. **Built-in providers** (``BUILTIN_TTS_PROVIDERS`` in
|
||||
:mod:`tools.tts_tool`) — native Python implementations (edge, openai,
|
||||
elevenlabs, …). **Always win** — plugins cannot shadow them.
|
||||
2. **Command-type providers** declared under ``tts.providers.<name>:
|
||||
type: command`` (PR #17843, commit ``2facea7f7``). Wire any local
|
||||
CLI into Hermes with shell-template placeholders. **Wins over a
|
||||
same-name plugin** — config is more local than plugin install.
|
||||
3. **Plugin-registered providers** (this ABC). For backends that need a
|
||||
Python SDK, streaming bytes, OAuth refresh, or voice-listing APIs
|
||||
the shell-template grammar can't reasonably express.
|
||||
|
||||
Built-ins-always-win is enforced at registration time
|
||||
(:func:`agent.tts_registry.register_provider` rejects names in
|
||||
``BUILTIN_TTS_PROVIDERS`` with a warning) AND at dispatch time
|
||||
(:func:`tools.tts_tool._dispatch_to_plugin_provider` re-checks
|
||||
defensively). The dispatcher also rejects plugin dispatch when a same-
|
||||
name command provider is configured.
|
||||
|
||||
Providers live in ``<repo>/plugins/tts/<name>/`` (built-in plugins, no
|
||||
shipped today) or ``~/.hermes/plugins/tts/<name>/`` (user-installed).
|
||||
None ship in-tree as of issue #30398 — the hook is additive
|
||||
infrastructure waiting for a real consumer (Cartesia, Fish Audio, …).
|
||||
|
||||
Response contract
|
||||
-----------------
|
||||
:meth:`TTSProvider.synthesize` writes the audio bytes to ``output_path``
|
||||
and returns the path as a string. Implementations should raise on
|
||||
failure — the dispatcher converts exceptions into the standard
|
||||
``{success: False, error: …}`` JSON envelope the rest of Hermes
|
||||
expects.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import abc
|
||||
import logging
|
||||
from typing import Any, Dict, Iterator, List, Optional
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
DEFAULT_OUTPUT_FORMAT = "mp3"
|
||||
VALID_OUTPUT_FORMATS = frozenset({"mp3", "wav", "ogg", "opus", "flac"})
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# ABC
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TTSProvider(abc.ABC):
|
||||
"""Abstract base class for a text-to-speech backend.
|
||||
|
||||
Subclasses must implement :attr:`name` and :meth:`synthesize`.
|
||||
Everything else has sane defaults — override only what your provider
|
||||
needs.
|
||||
"""
|
||||
|
||||
@property
|
||||
@abc.abstractmethod
|
||||
def name(self) -> str:
|
||||
"""Stable short identifier used in ``tts.provider`` config.
|
||||
|
||||
Lowercase, no spaces. Examples: ``cartesia``, ``fishaudio``,
|
||||
``deepgram``. Names that collide with a built-in TTS provider
|
||||
(``edge``, ``openai``, ``elevenlabs``, ``minimax``, ``gemini``,
|
||||
``mistral``, ``xai``, ``piper``, ``kittentts``, ``neutts``) are
|
||||
rejected at registration time.
|
||||
"""
|
||||
|
||||
@property
|
||||
def display_name(self) -> str:
|
||||
"""Human-readable label shown in ``hermes tools``.
|
||||
|
||||
Defaults to ``name.title()`` (e.g. ``Cartesia`` for ``cartesia``).
|
||||
"""
|
||||
return self.name.title()
|
||||
|
||||
def is_available(self) -> bool:
|
||||
"""Return True when this provider can service calls.
|
||||
|
||||
Typically checks for a required API key + that the SDK is
|
||||
importable. Default: True (providers with no external
|
||||
dependencies are always available).
|
||||
|
||||
Must NOT raise — used by the picker and ``hermes setup`` for
|
||||
availability displays and should fail gracefully.
|
||||
"""
|
||||
return True
|
||||
|
||||
def list_voices(self) -> List[Dict[str, Any]]:
|
||||
"""Return voice catalog entries.
|
||||
|
||||
Each entry::
|
||||
|
||||
{
|
||||
"id": "voice-abc-123", # required
|
||||
"display": "Aria — neutral female", # optional; defaults to id
|
||||
"language": "en-US", # optional
|
||||
"gender": "female", # optional
|
||||
"preview_url": "https://...mp3", # optional
|
||||
}
|
||||
|
||||
Default: empty list (provider has no enumerable voices or
|
||||
doesn't surface them via API).
|
||||
"""
|
||||
return []
|
||||
|
||||
def list_models(self) -> List[Dict[str, Any]]:
|
||||
"""Return model catalog entries.
|
||||
|
||||
Each entry::
|
||||
|
||||
{
|
||||
"id": "sonic-2", # required
|
||||
"display": "Sonic 2", # optional
|
||||
"languages": ["en", "es", "fr"], # optional
|
||||
"max_text_length": 5000, # optional
|
||||
}
|
||||
|
||||
Default: empty list (provider has a single fixed model or
|
||||
doesn't expose model selection).
|
||||
"""
|
||||
return []
|
||||
|
||||
def get_setup_schema(self) -> Dict[str, Any]:
|
||||
"""Return provider metadata for the ``hermes tools`` picker.
|
||||
|
||||
Used by ``tools_config.py`` to inject this provider as a row in
|
||||
the Text-to-Speech provider list. Shape::
|
||||
|
||||
{
|
||||
"name": "Cartesia", # picker label
|
||||
"badge": "paid", # optional short tag
|
||||
"tag": "Ultra-low-latency streaming", # optional subtitle
|
||||
"env_vars": [ # keys to prompt for
|
||||
{"key": "CARTESIA_API_KEY",
|
||||
"prompt": "Cartesia API key",
|
||||
"url": "https://play.cartesia.ai/console"},
|
||||
],
|
||||
}
|
||||
|
||||
Default: minimal entry derived from ``display_name`` with no
|
||||
env vars. Override to expose API key prompts and custom badges.
|
||||
"""
|
||||
return {
|
||||
"name": self.display_name,
|
||||
"badge": "",
|
||||
"tag": "",
|
||||
"env_vars": [],
|
||||
}
|
||||
|
||||
def default_model(self) -> Optional[str]:
|
||||
"""Return the default model id, or None if not applicable."""
|
||||
models = self.list_models()
|
||||
if models:
|
||||
return models[0].get("id")
|
||||
return None
|
||||
|
||||
def default_voice(self) -> Optional[str]:
|
||||
"""Return the default voice id, or None if not applicable."""
|
||||
voices = self.list_voices()
|
||||
if voices:
|
||||
return voices[0].get("id")
|
||||
return None
|
||||
|
||||
@abc.abstractmethod
|
||||
def synthesize(
|
||||
self,
|
||||
text: str,
|
||||
output_path: str,
|
||||
*,
|
||||
voice: Optional[str] = None,
|
||||
model: Optional[str] = None,
|
||||
speed: Optional[float] = None,
|
||||
format: str = DEFAULT_OUTPUT_FORMAT,
|
||||
**extra: Any,
|
||||
) -> str:
|
||||
"""Synthesize ``text`` and write audio bytes to ``output_path``.
|
||||
|
||||
Returns the absolute path to the written file as a string
|
||||
(typically just echoes ``output_path``). Raises on failure —
|
||||
the dispatcher converts exceptions to the standard
|
||||
``{success: False, error: ...}`` JSON envelope.
|
||||
|
||||
Args:
|
||||
text: The text to synthesize. Already truncated to the
|
||||
provider's max length by the dispatcher.
|
||||
output_path: Absolute path where the audio file should be
|
||||
written. Parent directory is guaranteed to exist.
|
||||
voice: Voice identifier from :meth:`list_voices`, or None
|
||||
to use :meth:`default_voice`.
|
||||
model: Model identifier from :meth:`list_models`, or None
|
||||
to use :meth:`default_model`.
|
||||
speed: Optional speech-rate multiplier (1.0 = normal).
|
||||
Providers that don't support speed control should
|
||||
ignore this argument.
|
||||
format: Output audio format. Implementations should match
|
||||
the requested format when possible; if unsupported,
|
||||
pick the closest equivalent and ensure ``output_path``
|
||||
ends with the correct extension.
|
||||
**extra: Forward-compat parameters future schema versions
|
||||
may expose. Implementations should ignore unknown keys.
|
||||
"""
|
||||
|
||||
def stream(
|
||||
self,
|
||||
text: str,
|
||||
*,
|
||||
voice: Optional[str] = None,
|
||||
model: Optional[str] = None,
|
||||
format: str = "opus",
|
||||
**extra: Any,
|
||||
) -> Iterator[bytes]:
|
||||
"""Stream synthesized audio bytes.
|
||||
|
||||
Optional. Providers that don't support streaming raise
|
||||
:class:`NotImplementedError` (the default) and the dispatcher
|
||||
falls back to :meth:`synthesize` + read-whole-file.
|
||||
|
||||
Args mirror :meth:`synthesize`. Default ``format`` is ``opus``
|
||||
because the primary streaming use case is voice-bubble
|
||||
delivery (Telegram et al.) which requires Opus.
|
||||
"""
|
||||
raise NotImplementedError(
|
||||
f"TTS provider {self.name!r} does not implement streaming "
|
||||
"synthesis. Use synthesize() instead, or implement stream() "
|
||||
"if your backend supports it."
|
||||
)
|
||||
|
||||
@property
|
||||
def voice_compatible(self) -> bool:
|
||||
"""Whether output is suitable for voice-bubble delivery.
|
||||
|
||||
Mirrors the ``tts.providers.<name>.voice_compatible`` field
|
||||
from PR #17843. When True, the gateway's voice-message
|
||||
delivery pipeline runs ffmpeg conversion to Opus if needed.
|
||||
When False, output is delivered as a regular audio attachment.
|
||||
|
||||
Default: False (safe — providers opt in explicitly).
|
||||
"""
|
||||
return False
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def resolve_output_format(value: Optional[str]) -> str:
|
||||
"""Clamp an output_format value to the valid set.
|
||||
|
||||
Invalid values are coerced to :data:`DEFAULT_OUTPUT_FORMAT` rather
|
||||
than rejected so the tool surface is forgiving of agent mistakes.
|
||||
"""
|
||||
if not isinstance(value, str):
|
||||
return DEFAULT_OUTPUT_FORMAT
|
||||
v = value.strip().lower()
|
||||
if v in VALID_OUTPUT_FORMATS:
|
||||
return v
|
||||
return DEFAULT_OUTPUT_FORMAT
|
||||
133
agent/tts_registry.py
Normal file
133
agent/tts_registry.py
Normal file
|
|
@ -0,0 +1,133 @@
|
|||
"""
|
||||
TTS Provider Registry
|
||||
=====================
|
||||
|
||||
Central map of registered TTS providers. Populated by plugins at
|
||||
import-time via :meth:`PluginContext.register_tts_provider`; consumed
|
||||
by :mod:`tools.tts_tool` to dispatch ``text_to_speech`` tool calls to
|
||||
the active plugin backend **when** the configured ``tts.provider``
|
||||
name is neither a built-in nor a command-type provider.
|
||||
|
||||
Built-ins-always-win
|
||||
--------------------
|
||||
Plugin names that collide with a built-in TTS provider (``edge``,
|
||||
``openai``, ``elevenlabs``, ``minimax``, ``gemini``, ``mistral``,
|
||||
``xai``, ``piper``, ``kittentts``, ``neutts``) are rejected at
|
||||
registration with a warning. This invariant is also re-checked at
|
||||
dispatch time in :func:`tools.tts_tool._dispatch_to_plugin_provider`.
|
||||
|
||||
Command-providers-win-over-plugins
|
||||
----------------------------------
|
||||
This registry doesn't enforce the command-vs-plugin precedence — that
|
||||
lives in the dispatcher, which checks for a same-name
|
||||
``tts.providers.<name>: type: command`` entry before consulting the
|
||||
registry. The rationale is locality: a name declared in the user's
|
||||
``config.yaml`` is more specific to their setup than a plugin that
|
||||
happens to be installed.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import threading
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
from agent.tts_provider import TTSProvider
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# Names reserved for native built-in TTS handlers. Plugins cannot
|
||||
# register a name in this set — the registration call is rejected with
|
||||
# a warning. **Kept in sync with ``BUILTIN_TTS_PROVIDERS`` in
|
||||
# :mod:`tools.tts_tool`** — a regression test in
|
||||
# ``tests/agent/test_tts_registry.py::TestBuiltinSync`` fails if the
|
||||
# two lists drift. Importing from ``tools.tts_tool`` directly would
|
||||
# create a circular dependency (``tools.tts_tool`` imports
|
||||
# ``agent.tts_registry`` for dispatch).
|
||||
_BUILTIN_NAMES = frozenset({
|
||||
"edge",
|
||||
"elevenlabs",
|
||||
"openai",
|
||||
"minimax",
|
||||
"xai",
|
||||
"mistral",
|
||||
"gemini",
|
||||
"neutts",
|
||||
"kittentts",
|
||||
"piper",
|
||||
})
|
||||
|
||||
|
||||
_providers: Dict[str, TTSProvider] = {}
|
||||
_lock = threading.Lock()
|
||||
|
||||
|
||||
def register_provider(provider: TTSProvider) -> None:
|
||||
"""Register a TTS provider.
|
||||
|
||||
Rejects:
|
||||
|
||||
- Non-:class:`TTSProvider` instances (raises :class:`TypeError`).
|
||||
- Empty/whitespace ``.name`` (raises :class:`ValueError`).
|
||||
- Names colliding with a built-in (logs a warning, silently
|
||||
ignores — built-ins-always-win invariant).
|
||||
|
||||
Re-registration (same ``name``) overwrites the previous entry and
|
||||
logs a debug message — makes hot-reload scenarios (tests, dev
|
||||
loops) behave predictably.
|
||||
"""
|
||||
if not isinstance(provider, TTSProvider):
|
||||
raise TypeError(
|
||||
f"register_provider() expects a TTSProvider instance, "
|
||||
f"got {type(provider).__name__}"
|
||||
)
|
||||
name = provider.name
|
||||
if not isinstance(name, str) or not name.strip():
|
||||
raise ValueError("TTS provider .name must be a non-empty string")
|
||||
key = name.strip().lower()
|
||||
if key in _BUILTIN_NAMES:
|
||||
logger.warning(
|
||||
"TTS provider '%s' shadows a built-in name; registration ignored. "
|
||||
"Built-in TTS providers (%s) always win — pick a different name.",
|
||||
key, ", ".join(sorted(_BUILTIN_NAMES)),
|
||||
)
|
||||
return
|
||||
with _lock:
|
||||
existing = _providers.get(key)
|
||||
_providers[key] = provider
|
||||
if existing is not None:
|
||||
logger.debug(
|
||||
"TTS provider '%s' re-registered (was %r)",
|
||||
key, type(existing).__name__,
|
||||
)
|
||||
else:
|
||||
logger.debug(
|
||||
"Registered TTS provider '%s' (%s)",
|
||||
key, type(provider).__name__,
|
||||
)
|
||||
|
||||
|
||||
def list_providers() -> List[TTSProvider]:
|
||||
"""Return all registered providers, sorted by name."""
|
||||
with _lock:
|
||||
items = list(_providers.values())
|
||||
return sorted(items, key=lambda p: p.name)
|
||||
|
||||
|
||||
def get_provider(name: str) -> Optional[TTSProvider]:
|
||||
"""Return the provider registered under *name*, or None.
|
||||
|
||||
Name matching is case-insensitive and whitespace-tolerant — mirrors
|
||||
how ``tools.tts_tool._get_provider`` normalizes the configured
|
||||
``tts.provider`` value.
|
||||
"""
|
||||
if not isinstance(name, str):
|
||||
return None
|
||||
return _providers.get(name.strip().lower())
|
||||
|
||||
|
||||
def _reset_for_tests() -> None:
|
||||
"""Clear the registry. **Test-only.**"""
|
||||
with _lock:
|
||||
_providers.clear()
|
||||
Loading…
Add table
Add a link
Reference in a new issue