feat(tts): add Gemini TTS provider

Add Google's Gemini speech-generation API as the 8th TTS backend. The API returns base64-encoded signed 16-bit PCM at 24 kHz mono, which is wrapped in a WAV container natively via the stdlib wave module. Optional ffmpeg conversion produces mp3/ogg output for Telegram voice bubbles. The API key is read from GEMINI_API_KEY, with GOOGLE_API_KEY as a fallback. Supports 30 prebuilt voices and a configurable model (flash/pro).

Cherry-picked from #10922 by @zhonghui5207. Fixes #10918.
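
The WAV wrapping and key fallback described above can be sketched roughly as follows. This is an illustrative sketch, not the code from this commit: the function names `pcm_b64_to_wav_bytes` and `resolve_api_key` are hypothetical, and the audio parameters are taken from the commit message (signed 16-bit, 24 kHz, mono).

```python
import base64
import io
import os
import wave

# Audio parameters as stated in the commit message.
SAMPLE_RATE_HZ = 24_000
CHANNELS = 1
SAMPLE_WIDTH_BYTES = 2  # signed 16-bit samples


def pcm_b64_to_wav_bytes(pcm_b64: str) -> bytes:
    """Hypothetical helper: decode base64 PCM and wrap it in a WAV container."""
    pcm = base64.b64decode(pcm_b64)
    buf = io.BytesIO()
    # The wave module writes the RIFF header; no ffmpeg is needed for this step.
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()


def resolve_api_key(env=None):
    """Hypothetical helper: prefer GEMINI_API_KEY, fall back to GOOGLE_API_KEY."""
    env = os.environ if env is None else env
    return env.get("GEMINI_API_KEY") or env.get("GOOGLE_API_KEY")
```

Writing to an in-memory buffer keeps the sketch free of temporary files; a real provider would more likely write straight to its output path.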
2026-05-08 03:01:47 +00:00 · 2026-04-16 20:47:29 +05:30 · 2026-04-16 20:47:29 +05:30 · 0671201c05
commit 0671201c05
parent f19ca50cd9
7 changed files with 388 additions and 26 deletions
--- a/website/docs/user-guide/features/tts.md
+++ b/website/docs/user-guide/features/tts.md
@@ -10,7 +10,7 @@ Hermes Agent supports both text-to-speech output and voice message transcription

 ## Text-to-Speech

-Convert text to speech with six providers:
+Convert text to speech with seven providers:

 | Provider | Quality | Cost | API Key |
 |----------|---------|------|---------|
@@ -20,6 +20,7 @@ Convert text to speech with six providers:
 | **MiniMax TTS** | Excellent | Paid | `MINIMAX_API_KEY` |
 | **Mistral (Voxtral TTS)** | Excellent | Paid | `MISTRAL_API_KEY` |
 | **NeuTTS** | Good | Free | None needed |
+| **Gemini TTS** | Excellent | Paid (free tier) | `GEMINI_API_KEY` |

 ### Platform Delivery

@@ -62,6 +63,9 @@ tts:
    ref_text: ''
    model: neuphonic/neutts-air-q4-gguf
    device: cpu
+  gemini:
+    model: "gemini-2.5-flash-preview-tts"  # or gemini-2.5-pro-preview-tts
+    voice: "Kore"                          # 30 prebuilt voices (Zephyr, Puck, Charon, ...)
 ```

 **Speed control**: The global `tts.speed` value applies to all providers by default. Each provider can override it with its own `speed` setting (e.g., `tts.openai.speed: 1.5`). Provider-specific speed takes precedence over the global value. Default is `1.0` (normal speed).
@@ -74,6 +78,7 @@ Telegram voice bubbles require Opus/OGG audio format:
 - **Edge TTS** (default) outputs MP3 and needs **ffmpeg** to convert:
 - **MiniMax TTS** outputs MP3 and needs **ffmpeg** to convert for Telegram voice bubbles
 - **NeuTTS** outputs WAV and also needs **ffmpeg** to convert for Telegram voice bubbles
+- **Gemini TTS** returns raw PCM (wrapped in WAV natively) and needs **ffmpeg** to convert for Telegram voice bubbles

 ```bash
 # Ubuntu/Debian