hermes-agent/website/docs/user-guide/features/tts.md
Teknium 8b6beaab5f
docs: 30-day overhaul — correctness audit, PR coverage, Nous Portal weave, sidebar reorg (#33782)
* docs(audit): correctness pass across getting-started, reference, features, messaging, developer-guide, guides, integrations, user-guide

* docs: add PR coverage for last 30d + Nous Portal weave + nav reorg + build fixes

- Add docs for top user-visible PRs that shipped without docs (api-server
  session control, kanban features, telegram pin/edit, provider client tag,
  xAI retired-model migration, cron name lookup, --branch update flag, etc.)
- Apply Nous Portal weave across 23 pages (tasteful one-liners on
  getting-started/learning-path, configuration, overview, vision, x-search,
  credential-pools, provider-routing, cron, codex-runtime, profiles, docker,
  messaging/index, multiple guides, plus FAQ + index promotion)
- Reorganize sidebar: split Messaging into Popular/M365/Chinese/Other,
  Reference into Command/Configuration/Tools-Skills sub-categories, add
  orphan developer-guide pages (web-search-provider-plugin,
  browser-supervisor), move features from Integrations back to Features,
  fold lone spotify into Media & Web.
- Regenerate skill stubs + catalogs (kanban-codex-lane, hermes-s6-container-
  supervision, web-pentest)
- Fix broken anchor links (security/cron, configuration/fallback, telegram
  large-files, adding-platform-adapters step-by-step)
2026-05-28 02:41:36 -07:00

32 KiB
Raw Blame History

sidebar_position title description
9 Voice & TTS Text-to-speech and voice message transcription across all platforms

Voice & TTS

Hermes Agent supports both text-to-speech output and voice message transcription across all messaging platforms.

:::tip Nous Subscribers If you have a paid Nous Portal subscription, OpenAI TTS is available through the Tool Gateway without a separate OpenAI API key. New installs can run hermes setup --portal to log in and turn on every gateway tool at once; existing installs can pick Nous Subscription for just TTS via hermes model or hermes tools. :::

Text-to-Speech

Convert text to speech with ten providers:

Provider Quality Cost API Key
Edge TTS (default) Good Free None needed
ElevenLabs Excellent Paid ELEVENLABS_API_KEY
OpenAI TTS Good Paid VOICE_TOOLS_OPENAI_KEY
MiniMax TTS Excellent Paid MINIMAX_API_KEY
Mistral (Voxtral TTS) Excellent Paid MISTRAL_API_KEY
Google Gemini TTS Excellent Free tier GEMINI_API_KEY
xAI TTS Excellent Paid XAI_API_KEY
NeuTTS Good Free (local) None needed
KittenTTS Good Free (local) None needed
Piper Good Free (local) None needed

Platform Delivery

Platform Delivery Format
Telegram Voice bubble (plays inline) Opus .ogg
Discord Voice bubble (Opus/OGG), falls back to file attachment Opus/MP3
WhatsApp Audio file attachment MP3
CLI Saved to ~/.hermes/audio_cache/ MP3

Configuration

# In ~/.hermes/config.yaml
tts:
  provider: "edge"              # "edge" | "elevenlabs" | "openai" | "minimax" | "mistral" | "gemini" | "xai" | "neutts" | "kittentts" | "piper"
  speed: 1.0                    # Global speed multiplier (provider-specific settings override this)
  edge:
    voice: "en-US-AriaNeural"   # 322 voices, 74 languages
    speed: 1.0                  # Converted to rate percentage (+/-%)
  elevenlabs:
    voice_id: "pNInz6obpgDQGcFmaJgB"  # Adam
    model_id: "eleven_multilingual_v2"
  openai:
    model: "gpt-4o-mini-tts"
    voice: "alloy"              # alloy, echo, fable, onyx, nova, shimmer
    base_url: "https://api.openai.com/v1"  # Override for OpenAI-compatible TTS endpoints
    speed: 1.0                  # 0.25 - 4.0
  minimax:
    model: "speech-2.8-hd"     # speech-2.8-hd (default), speech-2.8-turbo
    voice_id: "English_Graceful_Lady"  # See https://platform.minimax.io/faq/system-voice-id
    speed: 1                    # 0.5 - 2.0
    vol: 1                      # 0 - 10
    pitch: 0                    # -12 - 12
  mistral:
    model: "voxtral-mini-tts-2603"
    voice_id: "c69964a6-ab8b-4f8a-9465-ec0925096ec8"  # Paul - Neutral (default)
  gemini:
    model: "gemini-2.5-flash-preview-tts"  # or gemini-2.5-pro-preview-tts
    voice: "Kore"               # 30 prebuilt voices: Zephyr, Puck, Kore, Enceladus, Gacrux, etc.
  xai:
    voice_id: "eve"             # or a custom voice ID — see docs below
    language: "en"              # ISO 639-1 code
    sample_rate: 24000          # 22050 / 24000 (default) / 44100 / 48000
    bit_rate: 128000            # MP3 bitrate; only applies when codec=mp3
    # base_url: "https://api.x.ai/v1"   # Override via XAI_BASE_URL env var
  neutts:
    ref_audio: ''
    ref_text: ''
    model: neuphonic/neutts-air-q4-gguf
    device: cpu
  kittentts:
    model: KittenML/kitten-tts-nano-0.8-int8   # 25MB int8; also: kitten-tts-micro-0.8 (41MB), kitten-tts-mini-0.8 (80MB)
    voice: Jasper                               # Jasper, Bella, Luna, Bruno, Rosie, Hugo, Kiki, Leo
    speed: 1.0                                  # 0.5 - 2.0
    clean_text: true                            # Expand numbers, currencies, units
  piper:
    voice: en_US-lessac-medium                  # voice name (auto-downloaded) OR absolute path to .onnx
    # voices_dir: ''                            # default: ~/.hermes/cache/piper-voices/
    # use_cuda: false                           # requires onnxruntime-gpu
    # length_scale: 1.0                         # 2.0 = twice as slow
    # noise_scale: 0.667
    # noise_w_scale: 0.8
    # volume: 1.0                               # 0.5 = half as loud
    # normalize_audio: true

Speed control: The global tts.speed value applies to all providers by default. Each provider can override it with its own speed setting (e.g., tts.openai.speed: 1.5). Provider-specific speed takes precedence over the global value. Default is 1.0 (normal speed).

Input length limits

Each provider has a documented per-request input-character cap. Hermes truncates text before calling the provider so requests never fail with a length error:

Provider Default cap (chars)
Edge TTS 5000
OpenAI 4096
xAI 15000
MiniMax 10000
Mistral 4000
Google Gemini 5000
ElevenLabs Model-aware (see below)
NeuTTS 2000
KittenTTS 2000
Piper 5000

ElevenLabs picks a cap from the configured model_id:

model_id Cap (chars)
eleven_flash_v2_5 40000
eleven_flash_v2 30000
eleven_multilingual_v2 (default), eleven_multilingual_v1, eleven_english_sts_v2, eleven_english_sts_v1 10000
eleven_v3, eleven_ttv_v3 5000
Unknown model Falls back to provider default (10000)

Override per provider with max_text_length: under the provider section of your TTS config:

tts:
  openai:
    max_text_length: 8192   # raise or lower the provider cap

Only positive integers are honored. Zero, negative, non-numeric, or boolean values fall through to the provider default, so a broken config can't accidentally disable truncation.

Telegram Voice Bubbles & ffmpeg

Telegram voice bubbles require Opus/OGG audio format:

  • OpenAI, ElevenLabs, and Mistral produce Opus natively — no extra setup
  • Edge TTS (default) outputs MP3 and needs ffmpeg to convert:
  • MiniMax TTS outputs MP3 and needs ffmpeg to convert for Telegram voice bubbles
  • Google Gemini TTS outputs raw PCM and uses ffmpeg to encode Opus directly for Telegram voice bubbles
  • xAI TTS outputs MP3 and needs ffmpeg to convert for Telegram voice bubbles
  • NeuTTS outputs WAV and also needs ffmpeg to convert for Telegram voice bubbles
  • KittenTTS outputs WAV and also needs ffmpeg to convert for Telegram voice bubbles
  • Piper outputs WAV and also needs ffmpeg to convert for Telegram voice bubbles
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Fedora
sudo dnf install ffmpeg

Without ffmpeg, Edge TTS, MiniMax TTS, NeuTTS, KittenTTS, and Piper audio are sent as regular audio files (playable, but shown as a rectangular player instead of a voice bubble).

:::tip If you want voice bubbles without installing ffmpeg, switch to the OpenAI, ElevenLabs, or Mistral provider. :::

xAI Custom Voices (voice cloning)

xAI supports cloning your voice and using it with TTS. Create a custom voice in the xAI Console, then set the resulting voice_id in your config:

tts:
  provider: xai
  xai:
    voice_id: "nlbqfwie"   # your custom voice ID

See the xAI Custom Voices docs for details on recording, supported formats, and limits.

Piper (local, 44 languages)

Piper is a fast, local neural TTS engine from the Open Home Foundation (the Home Assistant maintainers). It runs entirely on CPU, supports 44 languages with pre-trained voices, and needs no API key.

Install via hermes tools → Voice & TTS → Piper — Hermes runs pip install piper-tts for you. Or install manually: pip install piper-tts.

Switch to Piper:

tts:
  provider: piper
  piper:
    voice: en_US-lessac-medium

On the first TTS call for a voice that isn't cached locally, Hermes runs python -m piper.download_voices <name> and downloads the model (~20-90MB depending on quality tier) into ~/.hermes/cache/piper-voices/. Subsequent calls reuse the cached model.

Picking a voice. The full voice catalog covers English, Spanish, French, German, Italian, Dutch, Portuguese, Russian, Polish, Turkish, Chinese, Arabic, Hindi, and more — each with x_low / low / medium / high quality tiers. Sample voices at rhasspy.github.io/piper-samples.

Using a pre-downloaded voice. Set tts.piper.voice to an absolute path ending in .onnx:

tts:
  piper:
    voice: /path/to/my-custom-voice.onnx

Advanced knobs (tts.piper.length_scale / noise_scale / noise_w_scale / volume / normalize_audio, use_cuda) correspond 1:1 to Piper's SynthesisConfig. They're ignored on older piper-tts versions.

Custom command providers

If a TTS engine you want isn't natively supported (VoxCPM, MLX-Kokoro, XTTS CLI, a voice-cloning script, anything else that exposes a CLI), you can wire it in as a command-type provider without writing any Python. Hermes writes the input text to a temp UTF-8 file, runs your shell command, and reads the audio file the command produced.

Declare one or more providers under tts.providers.<name> and switch between them with tts.provider: <name> — the same way you switch between built-ins like edge and openai.

tts:
  provider: voxcpm                 # pick any name under tts.providers
  providers:
    voxcpm:
      type: command
      command: "voxcpm --ref ~/voice.wav --text-file {input_path} --out {output_path}"
      output_format: mp3
      timeout: 180
      voice_compatible: true       # try to deliver as a Telegram voice bubble

    mlx-kokoro:
      type: command
      command: "python -m mlx_kokoro --in {input_path} --out {output_path} --voice {voice}"
      voice: af_sky
      output_format: wav

    piper-custom:                  # native Piper also supports custom .onnx via tts.piper.voice
      type: command
      command: "piper -m /path/to/custom.onnx -f {output_path} < {input_path}"
      output_format: wav

Example: Doubao (Chinese seed-tts-2.0)

For high-quality Chinese TTS via ByteDance's seed-tts-2.0 bidirectional-streaming API, install the doubao-speech PyPI package and wire it in as a command provider:

pip install doubao-speech
export VOLCENGINE_APP_ID="your-app-id"
export VOLCENGINE_ACCESS_TOKEN="your-access-token"
tts:
  provider: doubao
  providers:
    doubao:
      type: command
      command: "doubao-speech say --text-file {input_path} --out {output_path}"
      output_format: mp3
      max_text_length: 1024
      timeout: 30

Credentials come from your shell environment (VOLCENGINE_APP_ID / VOLCENGINE_ACCESS_TOKEN) or ~/.doubao-speech/config.yaml. Pick a voice by adding --voice zh-female-warm (or any other alias from doubao-speech list-voices) to the command. doubao-speech also bundles streaming ASR — see the STT section below for Hermes integration. Source and full docs: github.com/Hypnus-Yuan/doubao-speech.

Placeholders

Your command template can reference these placeholders. Hermes substitutes them at render time and shell-quotes each value for the surrounding context (bare / single-quoted / double-quoted), so paths with spaces and other shell-sensitive characters are safe.

Placeholder Meaning
{input_path} Path to the temp UTF-8 text file Hermes wrote
{text_path} Alias for {input_path}
{output_path} Path the command must write audio to
{format} mp3 / wav / ogg / flac
{voice} tts.providers.<name>.voice, empty when unset
{model} tts.providers.<name>.model
{speed} Resolved speed multiplier (provider or global)

Use {{ and }} for literal braces.

Optional keys

Key Default Meaning
timeout 120 Seconds; the process tree is killed on expiry (Unix killpg, Windows taskkill /T).
output_format mp3 One of mp3 / wav / ogg / flac. Auto-inferred from the output extension if Hermes picks a path.
voice_compatible false When true, Hermes converts MP3/WAV output to Opus/OGG via ffmpeg so Telegram renders a voice bubble.
max_text_length 5000 Input is truncated to this length before rendering the command.
voice / model empty Passed to the command as placeholder values only.

Behavior notes

  • Built-in names always win. A tts.providers.openai entry never shadows the native OpenAI provider, so no user config can silently replace a built-in.
  • Default delivery is a document. Command providers deliver as regular audio attachments on every platform. Opt in to voice-bubble delivery per-provider with voice_compatible: true.
  • Command failures surface to the agent. Non-zero exit, empty output, or timeout all return an error with the command's stderr/stdout included so you can debug the provider from the conversation.
  • type: command is the default when command: is set. Writing type: command explicitly is good practice but not required; an entry with a non-empty command string is treated as a command provider.
  • {input_path} / {text_path} are interchangeable. Use whichever reads better in your command.

Security

Command-type providers run whatever shell command you configure, with your user's permissions. Hermes quotes placeholder values and enforces the configured timeout, but the command template itself is trusted local input — treat it the same way you would a shell script on your PATH.

Python plugin providers

For TTS engines that can't be expressed as a single shell command — Python SDKs without a CLI, streaming engines, voice-listing APIs, OAuth-refreshing auth — register a Python plugin via ctx.register_tts_provider(). The plugin coexists with (does not replace) the Custom command providers registry; pick the surface that fits your engine.

When to pick which

Your backend has… Use
A single CLI reading text from a file/stdin and writing audio to a file/stdout Command provider (no Python needed)
Two or three CLIs chained with shell pipes Command provider
A Python SDK only — no CLI Plugin
Streaming bytes you want to deliver chunked (mid-generation voice bubbles) Plugin (override stream())
A voice-listing API used by hermes setup Plugin (override list_voices())
OAuth refresh flow (not a static bearer token) Plugin

Built-ins always win, and command providers win over a same-name plugin — so plugins are safe to register against any non-built-in name without worrying about shadowing your existing config.

Minimal plugin

Drop this in ~/.hermes/plugins/my-tts/:

plugin.yaml:

name: my-tts
version: 0.1.0
description: "My custom Python TTS backend"

__init__.py:

from agent.tts_provider import TTSProvider


class MyTTSProvider(TTSProvider):
    @property
    def name(self) -> str:
        return "my-tts"  # what tts.provider matches against

    @property
    def display_name(self) -> str:
        return "My Custom TTS"

    def is_available(self) -> bool:
        # Return False when credentials/deps are missing — picker skips
        # this row but the dispatcher still routes here on explicit config.
        import os
        return bool(os.environ.get("MY_TTS_API_KEY"))

    def synthesize(self, text, output_path, *, voice=None, model=None,
                   speed=None, format="mp3", **extra) -> str:
        # Write audio bytes to output_path, return the path.
        # Raise on failure — the dispatcher converts exceptions to a
        # standard error envelope.
        import my_tts_sdk
        client = my_tts_sdk.Client()
        audio_bytes = client.synthesize(text=text, voice=voice or "default")
        with open(output_path, "wb") as f:
            f.write(audio_bytes)
        return output_path


def register(ctx):
    ctx.register_tts_provider(MyTTSProvider())

Enable it (hermes plugins enable my-tts), point tts.provider at it (tts.provider: my-tts in config.yaml), and the text_to_speech tool will route through your plugin.

Optional hooks

Override these on your provider class for richer integration:

  • list_voices() → list of {id, display, language, gender, preview_url} dicts shown in hermes tools.
  • list_models() → list of {id, display, languages, max_text_length} dicts.
  • get_setup_schema() → return {name, badge, tag, env_vars: [{key, prompt, url}]} to power the picker row in hermes tools / hermes setup. Without this, the plugin still works but its row in the picker is minimal.
  • stream(text, *, voice, model, format, **extra) → iterator yielding audio bytes for streaming delivery (default raises NotImplementedError).
  • voice_compatible property → set True if your output is Opus-compatible and the gateway should deliver it as a voice bubble (default False = regular audio attachment).

See agent/tts_provider.py for the full ABC including docstrings.

Voice Message Transcription (STT)

Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.

Provider Quality Cost API Key
Local Whisper (default) Good Free None needed
Groq Whisper API GoodBest Free tier GROQ_API_KEY
OpenAI Whisper API GoodBest Paid VOICE_TOOLS_OPENAI_KEY or OPENAI_API_KEY

:::info Zero Config Local transcription works out of the box when faster-whisper is installed. If that's unavailable, Hermes can also use a local whisper CLI from common install locations (like /opt/homebrew/bin) or a custom command via HERMES_LOCAL_STT_COMMAND. :::

Configuration

# In ~/.hermes/config.yaml
stt:
  provider: "local"           # "local" | "groq" | "openai" | "mistral" | "xai"
  local:
    model: "base"             # tiny, base, small, medium, large-v3
  openai:
    model: "whisper-1"        # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
  mistral:
    model: "voxtral-mini-latest"  # voxtral-mini-latest, voxtral-mini-2602
  xai:
    model: "grok-stt"         # xAI Grok STT

Provider Details

Local (faster-whisper) — Runs Whisper locally via faster-whisper. Uses CPU by default, GPU if available. Model sizes:

Model Size Speed Quality
tiny ~75 MB Fastest Basic
base ~150 MB Fast Good (default)
small ~500 MB Medium Better
medium ~1.5 GB Slower Great
large-v3 ~3 GB Slowest Best

Groq API — Requires GROQ_API_KEY. Good cloud fallback when you want a free hosted STT option.

OpenAI API — Accepts VOICE_TOOLS_OPENAI_KEY first and falls back to OPENAI_API_KEY. Supports whisper-1, gpt-4o-mini-transcribe, and gpt-4o-transcribe.

Mistral API (Voxtral Transcribe) — Requires MISTRAL_API_KEY. Uses Mistral's Voxtral Transcribe models. Supports 13 languages, speaker diarization, and word-level timestamps. Install with pip install hermes-agent[mistral].

xAI Grok STT — Requires XAI_API_KEY. Posts to https://api.x.ai/v1/stt as multipart/form-data. Good choice if you're already using xAI for chat or TTS and want one API key for everything. Auto-detection order puts it after Groq — explicitly set stt.provider: xai to force it.

Custom local CLI fallback — Set HERMES_LOCAL_STT_COMMAND if you want Hermes to call a local transcription command directly. The command template supports {input_path}, {output_dir}, {language}, and {model} placeholders. Your command must write a .txt transcript somewhere under {output_dir}.

Example: Doubao / Volcengine ASR

If you use doubao-speech for Doubao TTS (see above), the same package handles speech-to-text via the local-command STT surface:

pip install doubao-speech
export VOLCENGINE_APP_ID="your-app-id"
export VOLCENGINE_ACCESS_TOKEN="your-access-token"
export HERMES_LOCAL_STT_COMMAND='doubao-speech transcribe {input_path} --out {output_dir}/transcript.txt'
stt:
  provider: local_command

Hermes writes the incoming voice message to {input_path}, runs the command, and reads the .txt file produced under {output_dir}. Language is auto-detected by the Volcengine bigmodel endpoint.

Fallback Behavior

If your configured provider isn't available, Hermes automatically falls back:

  • Local faster-whisper unavailable → Tries a local whisper CLI or HERMES_LOCAL_STT_COMMAND before cloud providers
  • Groq key not set → Falls back to local transcription, then OpenAI
  • OpenAI key not set → Falls back to local transcription, then Groq
  • Mistral key/SDK not set → Skipped in auto-detect; falls through to next available provider
  • Nothing available → Voice messages pass through with an accurate note to the user

STT custom command providers

If the STT engine you want isn't natively supported (Doubao ASR, NVIDIA Parakeet, a whisper.cpp build, an open-source SenseVoice CLI, anything else that exposes a shell command), wire it in as a command-type provider without writing any Python. Hermes runs your shell command against the audio file and reads back the transcript.

Declare one or more providers under stt.providers.<name> and switch between them with stt.provider: <name> — same shape as the TTS command-provider registry, adapted for the input=audio → output=transcript direction.

stt:
  provider: parakeet                # pick any name under stt.providers
  providers:
    parakeet:
      type: command
      command: "parakeet-asr --model nvidia/parakeet-tdt-0.6b-v2 --in {input_path} --out {output_path}"
      format: txt
      language: en
      timeout: 300

    whispercpp:
      type: command
      command: "whisper-cli -m ~/models/ggml-large-v3.bin -f {input_path} -otxt -of {output_dir}/transcript"
      format: txt

    sensevoice:
      type: command
      command: "sensevoice-cli {input_path} --json | tee {output_path}"
      format: json

This complements the legacy HERMES_LOCAL_STT_COMMAND escape hatch — that env var still works untouched via the built-in local_command path. Use stt.providers.<name> when you want multiple shell-driven STT engines, a name you can pick via stt.provider, or anything that needs per-provider language / model / timeout.

STT placeholders

Your command template can reference these placeholders. Hermes substitutes them at render time and shell-quotes each value for the surrounding context (bare / single-quoted / double-quoted), so paths with spaces are safe.

Placeholder Meaning
{input_path} Absolute path to the input audio file (original location, read-only)
{output_path} Absolute path the command should write the transcript to
{output_dir} Parent directory of {output_path} (handy for whisper-style tools)
{format} Configured output format: txt / json / srt / vtt
{language} Configured language code (defaults to en)
{model} stt.providers.<name>.model, empty when unset

Use {{ and }} for literal braces (handy when embedding JSON snippets in the command).

How the transcript is read back

After your command exits successfully:

  1. If {output_path} exists and is non-empty → Hermes reads it as UTF-8 text.
  2. Otherwise, if the command wrote to stdout → Hermes uses that.
  3. Otherwise → error: "Command STT provider wrote no output file and produced no stdout".

This lets you use the registry for both file-writing CLIs (whisper-cli, parakeet-asr) and curl-style one-liners that emit transcript to stdout (curl … | jq -r .text).

For format: json / srt / vtt, Hermes returns the raw file content as the transcript field. Extracting .text from JSON is out of scope for the runner — either configure format: txt, or post-process JSON downstream.

STT command-provider optional keys

Key Default Meaning
timeout 300 Seconds; the process tree is killed on expiry (Unix start_new_session, Windows taskkill /T).
format txt One of txt / json / srt / vtt. Sets the extension of {output_path}.
language en Forwarded to {language}. Defaults to stt.language then en.
model empty Forwarded to {model}. The model= argument to transcribe_audio() overrides this.

STT command-provider behavior notes

  • Built-ins always win. Declaring stt.providers.openai: type: command does NOT override the real OpenAI Whisper handler. The built-in name is short-circuited before the command-provider resolver runs.
  • Process-tree cleanup. A command running over timeout has its entire process tree killed, not just the shell wrapper. Long-running ASR pipelines that fork model-loading subprocesses are reaped reliably.
  • Shell-quoting is automatic. Placeholders inside '…' get single-quote-safe escaping; inside "…" get $/`/" escaping; outside quotes get shlex.quote. Don't pre-quote placeholder values.

STT command-provider security

The shell command runs under the same user as Hermes with full filesystem access — same trust model as tts.providers.<name>: type: command and HERMES_LOCAL_STT_COMMAND. Only declare command providers from sources you trust.

Python plugin providers (STT)

For STT engines that aren't built-in AND can't be expressed as a shell command (need a Python SDK, OAuth-refreshing auth, streaming chunks, etc.), register a Python plugin via ctx.register_transcription_provider(). The plugin coexists with the 6 built-in providers (local, local_command, groq, openai, mistral, xai) and the stt.providers.<name>: type: command registry — built-ins keep their native implementations and always win on name collision; command providers win over plugins of the same name (config is more local than plugin install).

When to pick which (STT)

Backend has… Use
A single shell command that takes an audio file and emits text stt.providers.<name>: type: command (no Python needed)
Only the legacy single-command escape hatch is wanted HERMES_LOCAL_STT_COMMAND env var (preserved for back-compat)
A Python SDK with no CLI register_transcription_provider() plugin
OAuth-refreshing auth, streaming chunks, voice-list metadata register_transcription_provider() plugin
A built-in already covers it (local, groq, openai, …) Set stt.provider: <name> — built-ins are inline

Resolution order

  1. stt.provider is a built-in name → built-in dispatch. Always wins.
  2. stt.provider matches stt.providers.<name> with command: set → command-provider runner (see STT custom command providers). Wins over a same-name plugin.
  3. stt.provider matches a plugin-registered TranscriptionProvider → plugin dispatch:
    • if the plugin's is_available() returns False (missing creds or SDK), the call surfaces an unavailability error envelope identifying the plugin — not the generic "No STT provider available" message.
    • otherwise the plugin's transcribe() is called with model (from the public model= arg, falling back to stt.<provider>.model) and language (from stt.<provider>.language).
  4. No match → "No STT provider available" error.

Per-provider config namespace

Plugins read their per-provider configuration from stt.<provider> in config.yaml, mirroring how built-ins read stt.openai.model / stt.mistral.model:

stt:
  provider: my-stt
  my-stt:
    model: whisper-large-v3
    language: ja          # forwarded as language= to transcribe()
    # any other plugin-specific keys go here; read them via your
    # own config.yaml access in __init__/is_available/transcribe

The dispatcher forwards model and language from this section; everything else, the plugin can read itself.

Minimal plugin

Drop this in ~/.hermes/plugins/my-stt/:

plugin.yaml:

name: my-stt
version: 0.1.0
description: "My custom Python STT backend"

__init__.py:

from agent.transcription_provider import TranscriptionProvider


class MySTTProvider(TranscriptionProvider):
    @property
    def name(self) -> str:
        return "my-stt"  # what stt.provider matches against

    @property
    def display_name(self) -> str:
        return "My Custom STT"

    def is_available(self) -> bool:
        # Return False when credentials/deps are missing — picker skips
        # this row but the dispatcher still routes here on explicit config.
        import os
        return bool(os.environ.get("MY_STT_API_KEY"))

    def transcribe(self, file_path, *, model=None, language=None, **extra):
        # Return the standard transcribe envelope:
        #   {"success": bool, "transcript": str, "provider": str, "error": str}
        # Do NOT raise — convert exceptions to the error envelope so the
        # gateway/CLI caller sees a consistent shape on failure.
        try:
            import my_stt_sdk
            client = my_stt_sdk.Client()
            text = client.transcribe(open(file_path, "rb"))
            return {
                "success": True,
                "transcript": text,
                "provider": "my-stt",
            }
        except Exception as exc:
            return {
                "success": False,
                "transcript": "",
                "error": f"my-stt failed: {exc}",
                "provider": "my-stt",
            }


def register(ctx):
    ctx.register_transcription_provider(MySTTProvider())

Enable it (hermes plugins enable my-stt), set stt.provider: my-stt in config.yaml, and voice-message transcription will route through your plugin.

Optional hooks

Override these on your provider class for richer integration:

  • list_models() → list of {id, display, languages, max_audio_seconds} dicts.
  • default_model() → string returned when the user doesn't override the model.
  • get_setup_schema() → return {name, badge, tag, env_vars: [{key, prompt, url}]} to power picker rows in hermes tools / hermes setup (the picker category for STT is not yet shipped — this metadata is available to plugins for forward compatibility).

See agent/transcription_provider.py for the full ABC including docstrings.