feat(tts): add command-type provider registry under tts.providers.<name> (#17843)

Reshape of PR #17211 (@versun). Lets users wire any local or external TTS CLI into Hermes without adding engine-specific Python code. Users declare any number of named providers in config.yaml and switch between them with tts.provider: <name>, alongside the built-ins (edge, openai, elevenlabs, …).

Config shape:

    tts:
      provider: piper-en
      providers:
        piper-en:
          type: command
          command: 'piper -m ~/model.onnx -f {output_path} < {input_path}'
          output_format: wav

Placeholders: {input_path}, {text_path}, {output_path}, {format}, {voice}, {model}, {speed}. Use {{ / }} for literal braces.

Key behavior:
- Built-in provider names always win: a tts.providers.openai entry cannot shadow the native OpenAI provider.
- type: command is the default when command: is set.
- Placeholder values are shell-quote-aware (bare / single / double context), so paths with spaces and shell metacharacters are safe.
- Default delivery is a regular audio attachment. voice_compatible: true opts in to Telegram voice-bubble delivery via ffmpeg Opus conversion.
- Command failures (non-zero exit, timeout, empty output) surface to the agent with stderr/stdout included so you can debug from chat.
- Process-tree kill on timeout (Unix killpg, Windows taskkill /T).
- max_text_length defaults to 5000 for command providers; override under tts.providers.<name>.max_text_length.

Tests: tests/tools/test_tts_command_providers.py. 42 new tests cover provider resolution, shell-quote context, placeholder rendering with injection payloads, timeout, non-zero exit, empty output, voice_compatible opt-in, and end-to-end dispatch through text_to_speech_tool. All 88 pre-existing TTS tests still pass.

Docs: new "Custom command providers" section in website/docs/user-guide/features/tts.md with three worked examples (Piper, VoxCPM, MLX-Kokoro), placeholder reference, optional keys, behavior notes, and security caveat.

E2E-verified live: isolated HERMES_HOME, command provider declared in config.yaml, text_to_speech_tool dispatches through the registered shell command and the output file is produced as expected.

Co-authored-by: Versun <me+github7604@versun.org>
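
The commit message says placeholder values are quoted according to the shell context they land in (bare, single-quoted, double-quoted) and that `{{` / `}}` produce literal braces. The following is a minimal sketch of how such rendering could work, not the code from this commit. The names `render_command` and `_quote_for_context` are hypothetical, and the quote tracking is deliberately simplified.

```python
import shlex


def _quote_for_context(value: str, context: str) -> str:
    """Escape a value for the quoting context it is substituted into (hypothetical helper)."""
    if context == "single":
        # Inside '...': close the quote, emit an escaped quote, reopen.
        return value.replace("'", "'\\''")
    if context == "double":
        # Inside "...": backslash-escape the characters the shell still interprets.
        out = []
        for ch in value:
            if ch in '\\"$`':
                out.append("\\")
            out.append(ch)
        return "".join(out)
    # Bare context: let shlex produce a single safely quoted word.
    return shlex.quote(value)


def render_command(template: str, values: dict[str, str]) -> str:
    """Substitute {name} placeholders, tracking bare / single / double quote state."""
    out = []
    context = "bare"
    i, n = 0, len(template)
    while i < n:
        ch = template[i]
        if template.startswith("{{", i):      # literal '{'
            out.append("{")
            i += 2
            continue
        if template.startswith("}}", i):      # literal '}'
            out.append("}")
            i += 2
            continue
        if ch == "{":                         # placeholder
            end = template.find("}", i)
            if end == -1:
                raise ValueError("unclosed placeholder in command template")
            name = template[i + 1:end]
            if name not in values:
                raise ValueError(f"unknown placeholder: {name}")
            out.append(_quote_for_context(values[name], context))
            i = end + 1
            continue
        if ch == "\\" and context != "single" and i + 1 < n:
            # Copy a backslash escape verbatim so \" does not toggle the quote state.
            out.append(template[i:i + 2])
            i += 2
            continue
        if ch == "'" and context != "double":
            context = "single" if context == "bare" else "bare"
        elif ch == '"' and context != "single":
            context = "double" if context == "bare" else "bare"
        out.append(ch)
        i += 1
    return "".join(out)
```

With this sketch, a value such as `x; rm -rf /` substituted for a bare `{voice}` stays a single shell word instead of being parsed as a second command, which is the kind of injection payload the commit's tests are described as covering.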
2026-05-02 02:01:47 +00:00 · 2026-04-30 02:29:08 -07:00 · 2026-04-30 02:29:08 -07:00 · 2facea7f71
commit 2facea7f71
parent 5b85a7d351
3 changed files with 1040 additions and 13 deletions
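
The commit message describes killing the whole process tree on timeout (Unix `killpg`, Windows `taskkill /T`) and reporting failures with stdout/stderr attached. Below is a hedged sketch of that pattern using only the Python standard library. `run_command` and `CommandResult` are hypothetical names, and the `/F` flag passed to `taskkill` is an assumption added here, not something the commit states.

```python
import os
import signal
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommandResult:
    """Outcome of one provider command run (hypothetical container)."""
    returncode: Optional[int]
    stdout: str
    stderr: str
    timed_out: bool


def run_command(command: str, timeout: float) -> CommandResult:
    """Run a shell command; on timeout, kill the process and its children."""
    is_windows = sys.platform == "win32"
    # On Unix, start a new session so the child leads its own process group.
    extra = {} if is_windows else {"start_new_session": True}
    proc = subprocess.Popen(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        **extra,
    )
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
        return CommandResult(proc.returncode, stdout, stderr, timed_out=False)
    except subprocess.TimeoutExpired:
        if is_windows:
            # /T terminates the child tree; /F (assumed here) forces termination.
            subprocess.run(
                ["taskkill", "/F", "/T", "/PID", str(proc.pid)],
                capture_output=True,
            )
        else:
            try:
                os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process exited between the timeout and the kill
        # Drain whatever output was produced before the kill.
        stdout, stderr = proc.communicate()
        return CommandResult(proc.returncode, stdout, stderr, timed_out=True)
```

A caller could treat `timed_out`, a non-zero `returncode`, or a missing or empty output file as failure and include `stdout` and `stderr` in the error returned to the agent.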
--- a/website/docs/user-guide/features/tts.md
+++ b/website/docs/user-guide/features/tts.md
@@ -116,6 +116,73 @@ Without ffmpeg, Edge TTS, MiniMax TTS, NeuTTS, and KittenTTS audio are sent as r
 If you want voice bubbles without installing ffmpeg, switch to the OpenAI, ElevenLabs, or Mistral provider.
 :::

+### Custom command providers
+
+If a TTS engine you want isn't natively supported (Piper, VoxCPM, MLX-Kokoro, XTTS CLI, a voice-cloning script, anything else that exposes a CLI), you can wire it in as a **command-type provider** without writing any Python. Hermes writes the input text to a temp UTF-8 file, runs your shell command, and reads the audio file the command produced.
+
+Declare one or more providers under `tts.providers.<name>` and switch between them with `tts.provider: <name>` — the same way you switch between built-ins like `edge` and `openai`.
+
+```yaml
+tts:
+  provider: piper-en               # pick any name under tts.providers
+  providers:
+    piper-en:
+      type: command
+      command: "piper -m ~/models/en_US-amy.onnx -f {output_path} < {input_path}"
+      output_format: wav
+
+    voxcpm:
+      type: command
+      command: "voxcpm --ref ~/voice.wav --text-file {input_path} --out {output_path}"
+      output_format: mp3
+      timeout: 180
+      voice_compatible: true       # try to deliver as a Telegram voice bubble
+
+    mlx-kokoro:
+      type: command
+      command: "python -m mlx_kokoro --in {input_path} --out {output_path} --voice {voice}"
+      voice: af_sky
+      output_format: wav
+```
+
+#### Placeholders
+
+Your command template can reference these placeholders. Hermes substitutes them at render time and shell-quotes each value for the surrounding context (bare / single-quoted / double-quoted), so paths with spaces and other shell-sensitive characters are safe.
+
+| Placeholder      | Meaning                                              |
+|------------------|------------------------------------------------------|
+| `{input_path}`   | Path to the temp UTF-8 text file Hermes wrote        |
+| `{text_path}`    | Alias for `{input_path}`                             |
+| `{output_path}`  | Path the command must write audio to                 |
+| `{format}`       | `mp3` / `wav` / `ogg` / `flac`                       |
+| `{voice}`        | `tts.providers.<name>.voice`, empty when unset       |
+| `{model}`        | `tts.providers.<name>.model`, empty when unset       |
+| `{speed}`        | Resolved speed multiplier (provider or global)       |
+
+Use `{{` and `}}` for literal braces.
+
+#### Optional keys
+
+| Key                | Default | Meaning                                                                                                    |
+|--------------------|---------|------------------------------------------------------------------------------------------------------------|
+| `timeout`          | `120`   | Seconds; the process tree is killed on expiry (Unix `killpg`, Windows `taskkill /T`).                       |
+| `output_format`    | `mp3`   | One of `mp3` / `wav` / `ogg` / `flac`. Auto-inferred from the output extension if Hermes picks a path.      |
+| `voice_compatible` | `false` | When `true`, Hermes converts MP3/WAV output to Opus/OGG via ffmpeg so Telegram renders a voice bubble.      |
+| `max_text_length`  | `5000`  | Input is truncated to this length before rendering the command.                                             |
+| `voice` / `model`  | empty   | Passed to the command as placeholder values only.                                                           |
+
+#### Behavior notes
+
+- **Built-in names always win.** A `tts.providers.openai` entry never shadows the native OpenAI provider, so no user config can silently replace a built-in.
+- **Default delivery is a document.** Command providers deliver as regular audio attachments on every platform. Opt in to voice-bubble delivery per-provider with `voice_compatible: true`.
+- **Command failures surface to the agent.** A non-zero exit, empty output, or a timeout each returns an error that includes the command's stderr/stdout, so you can debug the provider from the conversation.
+- **`type: command` is the default when `command:` is set.** Writing `type: command` explicitly is good practice but not required; an entry with a non-empty `command` string is treated as a command provider.
+- **`{input_path}` / `{text_path}` are interchangeable.** Use whichever reads better in your command.
+
+#### Security
+
+Command-type providers run whatever shell command you configure, with your user's permissions. Hermes quotes placeholder values and enforces the configured timeout, but the command template itself is trusted local input — treat it the same way you would a shell script on your PATH.
+
 ## Voice Message Transcription (STT)

 Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.