hermes-agent/website/docs/user-guide/features/tts.md
Teknium 4ca6668daf
docs: comprehensive update for recent merged PRs (#9019)
Audit and update documentation across 12 files to match changes from
~50 recently merged PRs. Key updates:

Slash commands (slash-commands.md):
- Add 5 missing commands: /snapshot, /fast, /image, /debug, /restart
- Fix /status incorrectly labeled as messaging-only (available in both)
- Add --global flag to /model docs
- Add [focus topic] arg to /compress docs

CLI commands (cli-commands.md):
- Add hermes debug share section with options and examples
- Add hermes backup section with --quick and --label flags
- Add hermes import section

Feature docs:
- TTS: document global tts.speed and per-provider speed for Edge/OpenAI
- Web dashboard: add docs for 5 missing pages (Sessions, Logs,
  Analytics, Cron, Skills) and 15+ API endpoints
- WhatsApp: add streaming, 4K chunking, and markdown formatting docs
- Skills: add GitHub rate-limit/GITHUB_TOKEN troubleshooting tip
- Budget: document CLI notification on iteration budget exhaustion

Config migration (compression.summary_* → auxiliary.compression.*):
- Update configuration.md, environment-variables.md,
  fallback-providers.md, cli.md, and context-compression-and-caching.md
- Replace legacy compression.summary_model/provider/base_url references
  with auxiliary.compression.model/provider/base_url
- Add legacy migration info boxes explaining auto-migration

Minor fixes:
- wecom-callback.md: clarify 'text only' limitation (input only)
- Escape {session_id}/{job_id} in web-dashboard.md headings for MDX
2026-04-13 10:50:59 -07:00

150 lines
6.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
sidebar_position: 9
title: "Voice & TTS"
description: "Text-to-speech and voice message transcription across all platforms"
---
# Voice & TTS
Hermes Agent supports both text-to-speech output and voice message transcription across all messaging platforms.
## Text-to-Speech
Convert text to speech with six providers:
| Provider | Quality | Cost | API Key |
|----------|---------|------|---------|
| **Edge TTS** (default) | Good | Free | None needed |
| **ElevenLabs** | Excellent | Paid | `ELEVENLABS_API_KEY` |
| **OpenAI TTS** | Good | Paid | `VOICE_TOOLS_OPENAI_KEY` |
| **MiniMax TTS** | Excellent | Paid | `MINIMAX_API_KEY` |
| **Mistral (Voxtral TTS)** | Excellent | Paid | `MISTRAL_API_KEY` |
| **NeuTTS** | Good | Free | None needed |
### Platform Delivery
| Platform | Delivery | Format |
|----------|----------|--------|
| Telegram | Voice bubble (plays inline) | Opus `.ogg` |
| Discord | Voice bubble (Opus/OGG), falls back to file attachment | Opus/MP3 |
| WhatsApp | Audio file attachment | MP3 |
| CLI | Saved to `~/.hermes/audio_cache/` | MP3 |
### Configuration
```yaml
# In ~/.hermes/config.yaml
tts:
provider: "edge" # "edge" | "elevenlabs" | "openai" | "minimax" | "mistral" | "neutts"
speed: 1.0 # Global speed multiplier (provider-specific settings override this)
edge:
voice: "en-US-AriaNeural" # 322 voices, 74 languages
speed: 1.0 # Converted to rate percentage (+/-%)
elevenlabs:
voice_id: "pNInz6obpgDQGcFmaJgB" # Adam
model_id: "eleven_multilingual_v2"
openai:
model: "gpt-4o-mini-tts"
voice: "alloy" # alloy, echo, fable, onyx, nova, shimmer
base_url: "https://api.openai.com/v1" # Override for OpenAI-compatible TTS endpoints
speed: 1.0 # 0.25 - 4.0
minimax:
model: "speech-2.8-hd" # speech-2.8-hd (default), speech-2.8-turbo
voice_id: "English_Graceful_Lady" # See https://platform.minimax.io/faq/system-voice-id
speed: 1 # 0.5 - 2.0
vol: 1 # 0 - 10
pitch: 0 # -12 - 12
mistral:
model: "voxtral-mini-tts-2603"
voice_id: "c69964a6-ab8b-4f8a-9465-ec0925096ec8" # Paul - Neutral (default)
neutts:
ref_audio: ''
ref_text: ''
model: neuphonic/neutts-air-q4-gguf
device: cpu
```
**Speed control**: The global `tts.speed` value applies to all providers by default. Each provider can override it with its own `speed` setting (e.g., `tts.openai.speed: 1.5`). Provider-specific speed takes precedence over the global value. Default is `1.0` (normal speed).
### Telegram Voice Bubbles & ffmpeg
Telegram voice bubbles require Opus/OGG audio format:
- **OpenAI, ElevenLabs, and Mistral** produce Opus natively — no extra setup
- **Edge TTS** (default) outputs MP3 and needs **ffmpeg** to convert:
- **MiniMax TTS** outputs MP3 and needs **ffmpeg** to convert for Telegram voice bubbles
- **NeuTTS** outputs WAV and also needs **ffmpeg** to convert for Telegram voice bubbles
```bash
# Ubuntu/Debian
sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Fedora
sudo dnf install ffmpeg
```
Without ffmpeg, Edge TTS, MiniMax TTS, and NeuTTS audio are sent as regular audio files (playable, but shown as a rectangular player instead of a voice bubble).
:::tip
If you want voice bubbles without installing ffmpeg, switch to the OpenAI, ElevenLabs, or Mistral provider.
:::
## Voice Message Transcription (STT)
Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.
| Provider | Quality | Cost | API Key |
|----------|---------|------|---------|
| **Local Whisper** (default) | Good | Free | None needed |
| **Groq Whisper API** | GoodBest | Free tier | `GROQ_API_KEY` |
| **OpenAI Whisper API** | GoodBest | Paid | `VOICE_TOOLS_OPENAI_KEY` or `OPENAI_API_KEY` |
:::info Zero Config
Local transcription works out of the box when `faster-whisper` is installed. If that's unavailable, Hermes can also use a local `whisper` CLI from common install locations (like `/opt/homebrew/bin`) or a custom command via `HERMES_LOCAL_STT_COMMAND`.
:::
### Configuration
```yaml
# In ~/.hermes/config.yaml
stt:
provider: "local" # "local" | "groq" | "openai" | "mistral"
local:
model: "base" # tiny, base, small, medium, large-v3
openai:
model: "whisper-1" # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
mistral:
model: "voxtral-mini-latest" # voxtral-mini-latest, voxtral-mini-2602
```
### Provider Details
**Local (faster-whisper)** — Runs Whisper locally via [faster-whisper](https://github.com/SYSTRAN/faster-whisper). Uses CPU by default, GPU if available. Model sizes:
| Model | Size | Speed | Quality |
|-------|------|-------|---------|
| `tiny` | ~75 MB | Fastest | Basic |
| `base` | ~150 MB | Fast | Good (default) |
| `small` | ~500 MB | Medium | Better |
| `medium` | ~1.5 GB | Slower | Great |
| `large-v3` | ~3 GB | Slowest | Best |
**Groq API** — Requires `GROQ_API_KEY`. Good cloud fallback when you want a free hosted STT option.
**OpenAI API** — Accepts `VOICE_TOOLS_OPENAI_KEY` first and falls back to `OPENAI_API_KEY`. Supports `whisper-1`, `gpt-4o-mini-transcribe`, and `gpt-4o-transcribe`.
**Mistral API (Voxtral Transcribe)** — Requires `MISTRAL_API_KEY`. Uses Mistral's [Voxtral Transcribe](https://docs.mistral.ai/capabilities/audio/speech_to_text/) models. Supports 13 languages, speaker diarization, and word-level timestamps. Install with `pip install hermes-agent[mistral]`.
**Custom local CLI fallback** — Set `HERMES_LOCAL_STT_COMMAND` if you want Hermes to call a local transcription command directly. The command template supports `{input_path}`, `{output_dir}`, `{language}`, and `{model}` placeholders.
### Fallback Behavior
If your configured provider isn't available, Hermes automatically falls back:
- **Local faster-whisper unavailable** → Tries a local `whisper` CLI or `HERMES_LOCAL_STT_COMMAND` before cloud providers
- **Groq key not set** → Falls back to local transcription, then OpenAI
- **OpenAI key not set** → Falls back to local transcription, then Groq
- **Mistral key/SDK not set** → Skipped in auto-detect; falls through to next available provider
- **Nothing available** → Voice messages pass through with an accurate note to the user