feat(telegram): skip-STT audio path + 2GB cap via local Bot API server

Two coordinated changes that unblock downstream audio pipelines (diarization, custom transcription, archival) on attachments larger than the public Bot API's 20MB getFile ceiling.

- `stt.enabled: false` no longer drops voice/audio with a generic "transcription disabled" note. The gateway probes the cached file's duration (wave → mutagen → ffprobe ladder) and surfaces `[The user sent a voice message: <abs path> (duration: M:SS)]` to the agent so a skill or tool can pick up the raw file. The previous placeholder is replaced rather than appended when present.
- `platforms.telegram.extra.base_url` set → adapter auto-lifts its document size cap from 20MB to 2GB (the local telegram-bot-api `--local` ceiling) and the "too large" reply reports the active limit dynamically. No new config knob; presence of `base_url` is the opt-in.
- `platforms.telegram.extra.local_mode: true` wires `Application.builder().local_mode(True)` on the python-telegram-bot builder. PTB then reads files from disk instead of HTTP, which is required when telegram-bot-api runs in `--local` mode (the server returns absolute filesystem paths, not `/file/bot...` URLs).

- gateway/run.py: rewrites the `stt.enabled: false` branch of `_enrich_message_with_transcription`. New `_format_duration` + `_probe_audio_duration` helpers.
- gateway/platforms/telegram.py: `_max_doc_bytes` instance attribute derived from `extra.base_url`; `local_mode` builder wiring; dynamic "too large" message.
- tests/gateway/test_stt_config.py: covers path-surfacing with and without an existing user message, and placeholder replacement.
- tests/gateway/test_telegram_max_doc_bytes.py: 3 cases: default 20MB without base_url, 2GB when set, empty-string base_url keeps default.
- website/docs/user-guide/messaging/telegram.md: new "Skipping STT" subsection under Voice Messages and a full "Large Files (>20MB) via Local Bot API Server" walkthrough (api_id/api_hash, docker-compose, one-time `logOut` migration, `platforms.telegram.extra` config, the `local_mode` disk-access requirement, the silent HTTP-fallback 404).
- website/docs/user-guide/features/voice-mode.md: documents the `stt.enabled` knob in the config reference.

- `pytest tests/gateway/test_telegram_max_doc_bytes.py tests/gateway/test_stt_config.py` → 9/9 passing.
- Verified end-to-end on a live deployment: gateway log shows `Using custom Telegram base_url: http://...` and `Using Telegram local_mode (read files from disk)` on startup; voice messages above 20MB cache to disk and surface their path to the agent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-05 07:41:39 +00:00 · 2026-05-14 00:38:32 +02:00 · 2026-05-14 00:38:32 +02:00 · ad2531be08
commit ad2531be08
parent 6265b3a132
6 changed files with 327 additions and 14 deletions
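
The commit message describes a duration probe that tries `wave`, then mutagen, then ffprobe, and a marker with an `M:SS` duration that is surfaced to the agent. The sketch below illustrates that ladder under stated assumptions; it is not the code from `gateway/run.py`. The names `format_duration`, `probe_audio_duration`, and `voice_marker` are hypothetical stand-ins for the real helpers, and omitting the duration suffix when every probe fails is a guess at the fallback behaviour.

```python
import shutil
import subprocess
import wave
from typing import Optional


def format_duration(seconds: float) -> str:
    """Render a duration in seconds as M:SS (minutes are not capped at 59)."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"


def probe_audio_duration(path: str) -> Optional[float]:
    """Try progressively heavier probes; return None if all of them fail."""
    # 1) Standard library: only understands WAV, but costs nothing.
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            if rate > 0:
                return wav.getnframes() / rate
    except (wave.Error, EOFError, OSError):
        pass

    # 2) mutagen, if it happens to be installed (optional dependency).
    try:
        import mutagen

        audio = mutagen.File(path)
        if audio is not None and audio.info is not None:
            return float(audio.info.length)
    except Exception:
        pass

    # 3) ffprobe, if the binary is on PATH.
    if shutil.which("ffprobe"):
        try:
            result = subprocess.run(
                ["ffprobe", "-v", "error",
                 "-show_entries", "format=duration",
                 "-of", "default=noprint_wrappers=1:nokey=1", path],
                capture_output=True, text=True, timeout=10, check=True,
            )
            return float(result.stdout.strip())
        except (subprocess.SubprocessError, ValueError, OSError):
            pass
    return None


def voice_marker(path: str, duration: Optional[float]) -> str:
    """Build the inbound-message marker; the no-duration form is an assumption."""
    if duration is None:
        return f"[The user sent a voice message: {path}]"
    return (
        f"[The user sent a voice message: {path} "
        f"(duration: {format_duration(duration)})]"
    )
```

Ordering the probes from cheapest to most expensive means a WAV file never spawns a subprocess, while formats such as OGG fall through to whichever heavier tool is available.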
--- a/website/docs/user-guide/features/voice-mode.md
+++ b/website/docs/user-guide/features/voice-mode.md
@@ -391,6 +391,11 @@ voice:

 # Speech-to-Text
 stt:
+  enabled: true                     # set to false to skip auto-transcription —
+                                    # the gateway still caches the audio file and
+                                    # passes its path to the agent as part of the
+                                    # inbound message, useful for custom pipelines
+                                    # (diarization, alignment, archival, etc.)
  provider: "local"                  # "local" (free) | "groq" | "openai"
  local:
    model: "base"                    # tiny, base, small, medium, large-v3
--- a/website/docs/user-guide/messaging/telegram.md
+++ b/website/docs/user-guide/messaging/telegram.md
@@ -276,6 +276,25 @@ Voice messages you send on Telegram are automatically transcribed by Hermes's co
 - `groq` uses Groq Whisper and requires `GROQ_API_KEY`
 - `openai` uses OpenAI Whisper and requires `VOICE_TOOLS_OPENAI_KEY`

+#### Skipping STT: pass the raw audio file to the agent
+
+If you'd rather have the **agent itself** handle audio — for diarization, a custom transcription tool, or just archiving the recording — set `stt.enabled: false` in `~/.hermes/config.yaml`:
+
+```yaml
+stt:
+  enabled: false
+```
+
+With STT disabled, the gateway still downloads the voice/audio attachment into Hermes's audio cache, but **does not transcribe it**. The agent receives the message with a marker like:
+
+```
+[The user sent a voice message: /home/<user>/.hermes/cache/audio/<hash>.ogg]
+```
+
+Your tools or skills can then read that path directly (e.g., hand it off to a local diarization pipeline, a richer transcription model, or upload it to long-term storage). The file extension reflects the original format Telegram delivered (`.ogg` for voice notes, `.mp3`/`.m4a`/etc. for audio attachments).
+
+This pairs naturally with the [local Bot API server](#large-files-20mb-via-local-bot-api-server) section below, which lifts Telegram's 20MB getFile ceiling to 2GB, so recordings too large for the public Bot API can still reach the agent.
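
A tool or skill that wants the raw file first has to pull the path out of the inbound message. The following is a minimal sketch of that step, assuming the marker has the shape shown above with an optional ` (duration: M:SS)` suffix as described in the commit message. `VoiceAttachment` and `find_voice_attachment` are hypothetical names, not part of Hermes.

```python
import re
from typing import NamedTuple, Optional


class VoiceAttachment(NamedTuple):
    path: str
    duration: Optional[str]  # "M:SS" text, or None when the suffix is absent


# Path is matched lazily so the optional duration suffix is not swallowed.
_MARKER = re.compile(
    r"\[The user sent a voice message: (?P<path>.+?)"
    r"(?: \(duration: (?P<duration>\d+:\d{2})\))?\]"
)


def find_voice_attachment(message: str) -> Optional[VoiceAttachment]:
    """Return the first voice marker found in the message, if any."""
    match = _MARKER.search(message)
    if match is None:
        return None
    return VoiceAttachment(match.group("path"), match.group("duration"))
```

The returned path can then be handed to whatever pipeline the agent uses; the sketch deliberately does not check that the file exists.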
+
 ### Outgoing Voice (Text-to-Speech)

 When the agent generates audio via TTS, it's delivered as native Telegram **voice bubbles** — the round, inline-playable kind.
@@ -295,6 +314,135 @@ Without ffmpeg, Edge TTS audio is sent as a regular audio file (still playable,

 Configure the TTS provider in your `config.yaml` under the `tts.provider` key.

+## Large Files (>20MB) via Local Bot API Server
+
+Telegram's **public** Bot API caps `getFile` downloads at **20 MB**, so Hermes rejects any voice note, audio file, video, or document larger than that with a "too large" reply. The documented way around this is to run a **local** [telegram-bot-api](https://github.com/tdlib/telegram-bot-api) daemon: the same server software Telegram uses, but running on your own network. A local server raises the file ceiling to **2 GB**, and Hermes auto-lifts its own internal cap when it sees a custom `base_url` configured.
+
+This unlocks workflows like:
+
+- Sending long voice memos (45-minute meetings, podcasts) to the bot
+- Uploading large videos for vision-tool processing
+- Archiving raw audio for offline pipelines like diarization, alignment, or training data
+
+### Step 1: Obtain Telegram API credentials
+
+The local server talks directly to Telegram's MTProto layer (not the public Bot API), so it needs **MTProto credentials**:
+
+1. Visit [my.telegram.org/apps](https://my.telegram.org/apps) and sign in with your Telegram account.
+2. Create a new application (any name and short description will do).
+3. Copy the `api_id` and `api_hash` — both are required.
+
+### Step 2: Run the telegram-bot-api server
+
+The community-maintained [`aiogram/telegram-bot-api`](https://hub.docker.com/r/aiogram/telegram-bot-api) Docker image is the easiest path. A minimal `docker-compose.yaml` (use `--local` mode to enable the higher limits):
+
+```yaml
+services:
+  tg-bot-api:
+    image: aiogram/telegram-bot-api:latest
+    container_name: tg-bot-api
+    restart: unless-stopped
+    ports:
+      - "127.0.0.1:8081:8081"   # bind to loopback only; see security note
+    environment:
+      TELEGRAM_API_ID: "12345"           # your api_id from Step 1
+      TELEGRAM_API_HASH: "abcdef..."     # your api_hash from Step 1
+      TELEGRAM_LOCAL: "1"                # enable --local mode (raises 20MB → 2GB)
+    volumes:
+      - ./tg-bot-api-data:/var/lib/telegram-bot-api
+```
+
+Bring it up:
+
+```bash
+docker compose up -d tg-bot-api
+docker logs --tail 20 tg-bot-api
+```
+
+:::warning Security
+The local Bot API server takes your bot token in the URL path (e.g. `/bot<TOKEN>/getMe`) with **no additional auth**. Anyone who can reach the port can fully control your bot — read every message it can see, send messages as it, etc. Bind the container to `127.0.0.1` and/or front it with a reverse proxy on a private network. **Never expose port 8081 to the public internet.**
+:::
+
+### Step 3: Log the bot out of the public API (one-time)
+
+A bot can only be active on **one** Bot API server at a time. If your bot was already running against `api.telegram.org` (which it almost certainly was), you must explicitly log it out there before the local server will accept it:
+
+```bash
+curl "https://api.telegram.org/bot<YOUR_BOT_TOKEN>/logOut"
+# expected response: {"ok":true,"result":true}
+```
+
+This is a one-shot migration step — you don't repeat it on every restart. Telegram delivers any messages received after `logOut` through the new server instead.
+
+Verify the local server can talk to Telegram on the bot's behalf:
+
+```bash
+curl "http://127.0.0.1:8081/bot<YOUR_BOT_TOKEN>/getMe"
+# expected response: {"ok":true,"result":{"id":...,"is_bot":true,...}}
+```
+
+### Step 4: Point Hermes at the local server
+
+Add the server URLs and the `local_mode` flag under `platforms.telegram.extra` in `~/.hermes/config.yaml`:
+
+```yaml
+platforms:
+  telegram:
+    extra:
+      base_url: "http://127.0.0.1:8081/bot"
+      base_file_url: "http://127.0.0.1:8081/file/bot"
+      local_mode: true        # see Step 5 below — only set this if the bot's data
+                              # directory is readable by the Hermes process
+```
+
+:::caution Use `platforms.telegram.extra`, not `telegram.extra`
+At the moment only the `platforms.<name>.extra` form is deep-merged into the platform config. Keys placed directly under a top-level `telegram.extra` block are silently dropped.
+:::
+
+When `base_url` is set, Hermes:
+
+- Builds the python-telegram-bot client against the local server
+- Auto-lifts its internal document/audio size cap from 20 MB → 2 GB
+- Reports the active limit in the "too large" error message (`Maximum: 2048 MB.`) so it's obvious which mode you're in
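
The cap selection in the list above can be sketched as follows. This is an illustration rather than the adapter's code: `max_doc_bytes` and `too_large_note` are hypothetical names, and the binary megabyte (1024 × 1024 bytes) is assumed because it is what makes 2 GB render as `2048 MB`.

```python
from typing import Any, Mapping

MB = 1024 * 1024
DEFAULT_MAX_DOC_BYTES = 20 * MB      # public Bot API getFile ceiling
LOCAL_MAX_DOC_BYTES = 2 * 1024 * MB  # local telegram-bot-api --local ceiling


def max_doc_bytes(extra: Mapping[str, Any]) -> int:
    """Pick the cap from the presence of a non-empty base_url."""
    # An empty string is falsy, so it keeps the default cap.
    return LOCAL_MAX_DOC_BYTES if extra.get("base_url") else DEFAULT_MAX_DOC_BYTES


def too_large_note(limit_bytes: int) -> str:
    """Render the active limit the way the error message reports it."""
    return f"Maximum: {limit_bytes // MB} MB."
```

Treating an empty `base_url` as unset mirrors the third test case listed in the commit message.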
+
+Restart the gateway and look for a confirmation log line:
+
+```bash
+hermes gateway restart
+grep -E "Using custom Telegram base_url|Using Telegram local_mode" ~/.hermes/logs/gateway.log | tail
+```
+
+### Step 5: `local_mode` — file access on disk
+
+The local server has **two ways** to deliver files:
+
+1. **Without `--local`** (the default): files are served over HTTP at `/file/bot<TOKEN>/<path>`, same as the public Bot API. The 20MB ceiling stays in effect. Useful as a network-fix only (e.g. when `api.telegram.org` is unreachable but you can self-host); not what you want for the size lift.
+2. **With `--local`** (set via `TELEGRAM_LOCAL=1` above): files are written to the server's filesystem and the `getFile` response returns an **absolute path** instead of an HTTP URL. The 20MB ceiling is lifted. Hermes must then read the bytes **from disk**, not over HTTP.
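
The difference between the two modes comes down to what `getFile` puts in its `file_path` field. A minimal sketch of how a client could branch on it is shown below; `resolve_file_source` is a hypothetical helper and is not how python-telegram-bot implements the decision internally.

```python
from typing import Tuple


def resolve_file_source(file_path: str, base_file_url: str, token: str) -> Tuple[str, str]:
    """Classify a getFile result as a disk read or an HTTP download."""
    if file_path.startswith("/"):
        # --local mode: the server returned an absolute filesystem path.
        return ("disk", file_path)
    # Default mode: a relative path served at /file/bot<TOKEN>/<path>.
    return ("http", f"{base_file_url}{token}/{file_path}")
```

With `base_file_url` set to `http://127.0.0.1:8081/file/bot`, a relative path resolves to the same URL layout the public Bot API uses, while an absolute path is returned untouched for a direct read.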
+
+To make the disk-read path work, set `local_mode: true` in the config above **and** make sure the Hermes process can read the path the server returns. Two scenarios:
+
+- **Same machine** — telegram-bot-api and Hermes run on the same host. Bind-mount the data volume to a directory that Hermes can read (e.g., `/var/lib/telegram-bot-api`), and make sure the file ownership matches. The container drops privileges to its internal `telegram-bot-api` user (uid varies by image); the simplest fix is to add `user: "<UID>:<GID>"` to the compose service so files are owned by a uid Hermes already runs as.
+- **Different machines** — the bot server runs on one host (e.g., a NAS, a separate VM) and Hermes on another. The server's data directory must be shared with the Hermes machine at the **same absolute path** the server reports (typically `/var/lib/telegram-bot-api`). NFS works well for this; CIFS/SMB with `uid=` mount remapping is friendlier if you don't want to deal with uid mismatches at the filesystem level.
+
+If `local_mode: true` is set but Hermes can't `stat` the returned file path (permissions or wrong mount), python-telegram-bot silently falls back to an HTTP `getFile` against the local server — which in `--local` mode responds with `404 Not Found`. The symptom shows up in `gateway.log` as:
+
+```
+[Telegram] Failed to cache voice: Not Found
+telegram.error.InvalidToken: Not Found
+```
+
+If you see that, the cap-lift is working but the file-share isn't. Verify `ls -la /var/lib/telegram-bot-api/<TOKEN>/voice/` from the Hermes host as the user the gateway runs as, and confirm a single file is `cat`-able without a permission error.
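
The same check can be scripted. The sketch below is a hypothetical helper (`check_readable` is not part of Hermes) that reports whether a path returned by the server is actually readable; to be meaningful it must run as the user the gateway runs as.

```python
import os


def check_readable(path: str) -> str:
    """Return "ok", "missing", "permission denied", or "unreadable"."""
    if not os.path.exists(path):
        # Also covers a wrong mount or a parent directory that cannot be traversed.
        return "missing"
    try:
        with open(path, "rb") as handle:
            handle.read(1)  # one byte is enough to prove read access
    except PermissionError:
        return "permission denied"
    except OSError:
        return "unreadable"
    return "ok"
```

A result of `missing` points at the mount or path mapping, while `permission denied` points at uid or ownership mismatches on the shared directory.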
+
+### Step 6: Test it
+
+Send the bot a voice note or audio file that's bigger than 20 MB. Tail the gateway log:
+
+```bash
+tail -f ~/.hermes/logs/gateway.log | grep -iE "telegram|cache"
+```
+
+You should see a `[Telegram] Cached user voice at /home/<user>/.hermes/cache/audio/...` line and **no** "too large" rejection. Combined with `stt.enabled: false` (above), the path to the original audio file then lands in the agent's inbound message for downstream processing.
+
 ## Group Chat Usage

 Hermes Agent works in Telegram group chats with a few considerations: