feat(tts): add Gemini TTS provider

Add Google's Gemini speech-generation API as the 8th TTS backend.
The API returns base64-encoded signed 16-bit PCM (24 kHz, mono),
which is wrapped in a WAV container natively with the stdlib wave
module. Conversion to mp3/ogg for Telegram voice bubbles is optional
and uses ffmpeg.
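
A minimal sketch of the WAV wrapping described above, assuming the audio
arrives as a base64 string of signed 16-bit little-endian PCM at 24 kHz
mono. The function names are hypothetical and are not taken from this
commit:

```python
import base64
import io
import wave


def pcm_to_wav_bytes(
    pcm: bytes,
    sample_rate: int = 24000,  # 24 kHz, per the commit message
    channels: int = 1,         # mono
    sample_width: int = 2,     # 16-bit samples = 2 bytes each
) -> bytes:
    """Wrap raw PCM frames in a WAV container using only the stdlib."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    # wave does not close a file object it did not open, so buf is still readable.
    return buf.getvalue()


def b64_pcm_to_wav_bytes(b64_audio: str) -> bytes:
    """Decode a base64 PCM payload (hypothetical helper) and wrap it in WAV."""
    return pcm_to_wav_bytes(base64.b64decode(b64_audio))
```

The resulting bytes can be written straight to a `.wav` file; only the
mp3/ogg conversion needs an external tool.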

Supports GEMINI_API_KEY with GOOGLE_API_KEY as a fallback, 30
prebuilt voices, and a configurable model (flash/pro).
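
The key fallback could look roughly like the sketch below. The helper
name is hypothetical, and treating a blank value as unset is an
assumption rather than behavior confirmed by this commit:

```python
import os
from typing import Mapping, Optional


def resolve_gemini_api_key(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return GEMINI_API_KEY if set, otherwise fall back to GOOGLE_API_KEY."""
    env = os.environ if env is None else env
    for name in ("GEMINI_API_KEY", "GOOGLE_API_KEY"):
        value = env.get(name, "").strip()
        if value:  # assumption: an empty or whitespace-only value counts as unset
            return value
    return None
```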

Cherry-picked from #10922 by @zhonghui5207. Fixes #10918.
zhonghui5207 2026-04-16 20:47:29 +05:30 committed by kshitijk4poor
parent f19ca50cd9
commit 0671201c05
7 changed files with 388 additions and 26 deletions

@@ -10,7 +10,7 @@ Hermes Agent supports both text-to-speech output and voice message transcription
## Text-to-Speech
-Convert text to speech with six providers:
+Convert text to speech with seven providers:
| Provider | Quality | Cost | API Key |
|----------|---------|------|---------|
@@ -20,6 +20,7 @@ Convert text to speech with six providers:
| **MiniMax TTS** | Excellent | Paid | `MINIMAX_API_KEY` |
| **Mistral (Voxtral TTS)** | Excellent | Paid | `MISTRAL_API_KEY` |
| **NeuTTS** | Good | Free | None needed |
+| **Gemini TTS** | Excellent | Paid (free tier) | `GEMINI_API_KEY` |
### Platform Delivery
@@ -62,6 +63,9 @@ tts:
     ref_text: ''
     model: neuphonic/neutts-air-q4-gguf
     device: cpu
+  gemini:
+    model: "gemini-2.5-flash-preview-tts"  # or gemini-2.5-pro-preview-tts
+    voice: "Kore"  # 30 prebuilt voices (Zephyr, Puck, Charon, ...)
```
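
As a rough illustration of how the `gemini` settings above might map onto a request, the sketch below builds a URL and JSON body following the publicly documented Gemini `generateContent` REST shape for speech generation. The helper name and the default values are assumptions, and nothing here is taken from the Hermes Agent implementation; check the field names against the current Gemini API reference before relying on them.

```python
from typing import Any, Dict, Tuple

# Assumed endpoint template for the Gemini REST API (v1beta).
GEMINI_ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent"
)


def build_tts_request(
    text: str,
    model: str = "gemini-2.5-flash-preview-tts",  # value from the config example
    voice: str = "Kore",                          # value from the config example
) -> Tuple[str, Dict[str, Any]]:
    """Return (url, json_body) for a speech-generation call (hypothetical helper)."""
    body = {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }
    return GEMINI_ENDPOINT.format(model=model), body
```

The API key would be sent separately (for example in an `x-goog-api-key` header), and the base64 audio is expected under the first candidate's inline data.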
**Speed control**: The global `tts.speed` value applies to all providers by default. Each provider can override it with its own `speed` setting (e.g., `tts.openai.speed: 1.5`). Provider-specific speed takes precedence over the global value. Default is `1.0` (normal speed).
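
The precedence rule above can be sketched as follows. This is an illustration only: `resolve_speed` is a hypothetical helper, and the config is assumed to be the parsed `tts:` mapping.

```python
from typing import Any, Mapping

DEFAULT_SPEED = 1.0  # normal speed, per the paragraph above


def resolve_speed(tts_config: Mapping[str, Any], provider: str) -> float:
    """Provider-specific speed wins, then the global tts.speed, then the default."""
    provider_cfg = tts_config.get(provider) or {}
    provider_speed = provider_cfg.get("speed")
    if provider_speed is not None:
        return float(provider_speed)
    global_speed = tts_config.get("speed")
    if global_speed is not None:
        return float(global_speed)
    return DEFAULT_SPEED


config = {"speed": 1.2, "openai": {"speed": 1.5}, "gemini": {"voice": "Kore"}}
openai_speed = resolve_speed(config, "openai")  # provider override applies
gemini_speed = resolve_speed(config, "gemini")  # falls back to the global value
```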
@@ -74,6 +78,7 @@ Telegram voice bubbles require Opus/OGG audio format:
- **Edge TTS** (default) outputs MP3 and needs **ffmpeg** to convert:
- **MiniMax TTS** outputs MP3 and needs **ffmpeg** to convert for Telegram voice bubbles
- **NeuTTS** outputs WAV and also needs **ffmpeg** to convert for Telegram voice bubbles
+- **Gemini TTS** returns raw PCM (wrapped in WAV natively) and needs **ffmpeg** to convert for Telegram voice bubbles
```bash
# Ubuntu/Debian