---
name: whisper
description: OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [openai-whisper, transformers, torch]
metadata:
  hermes:
    tags: [Whisper, Speech Recognition, ASR, Multimodal, Multilingual, OpenAI, Speech-To-Text, Transcription, Translation, Audio Processing]
---

# Whisper - Robust Speech Recognition

OpenAI's multilingual speech recognition model.

## When to use Whisper

**Use when:**
- Speech-to-text transcription (99 languages)
- Podcast/video transcription
- Meeting notes automation
- Translation to English
- Noisy audio transcription
- Multilingual audio processing

**Metrics**:
- **72,900+ GitHub stars**
- 99 languages supported
- Trained on 680,000 hours of audio
- MIT License

**Use alternatives instead**:
- **AssemblyAI**: Managed API, speaker diarization
- **Deepgram**: Real-time streaming ASR
- **Google Speech-to-Text**: Cloud-based

## Quick start

### Installation

```bash
# Requires Python 3.8-3.11
pip install -U openai-whisper

# Requires ffmpeg
# macOS: brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
# Windows: choco install ffmpeg
```

### Basic transcription

```python
import whisper

# Load model
model = whisper.load_model("base")

# Transcribe
result = model.transcribe("audio.mp3")

# Print text
print(result["text"])

# Access segments
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
```

## Model sizes

```python
# Available models
models = ["tiny", "base", "small", "medium", "large", "turbo"]

# Load specific model
model = whisper.load_model("turbo")  # Fastest, good quality
```

| Model  | Parameters | English-only variant | Multilingual | Relative speed (vs. large) | VRAM   |
|--------|------------|----------------------|--------------|----------------------------|--------|
| tiny   | 39M        | ✓                    | ✓            | ~32x                       | ~1 GB  |
| base   | 74M        | ✓                    | ✓            | ~16x                       | ~1 GB  |
| small  | 244M       | ✓                    | ✓            | ~6x                        | ~2 GB  |
| medium | 769M       | ✓                    | ✓            | ~2x                        | ~5 GB  |
| large  | 1550M      | ✗                    | ✓            | 1x                         | ~10 GB |
| turbo  | 809M       | ✗                    | ✓            | ~8x                        | ~6 GB  |

**Recommendation**: Use `turbo` for the best speed/quality trade-off and `base` for prototyping.
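The VRAM column can be turned into a simple model picker. A minimal sketch, assuming a single CUDA GPU; the size-to-VRAM mapping is an approximation taken from the table above, not an official API:

```python
import torch
import whisper

# Approximate VRAM needs in GB, largest model first (assumption based on the table above)
VRAM_GB = {"turbo": 6, "medium": 5, "small": 2, "base": 1, "tiny": 1}

def pick_model() -> str:
    """Pick the largest multilingual model that should fit on the current GPU."""
    if not torch.cuda.is_available():
        return "base"  # CPU fallback: keep it small
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    for name, need in VRAM_GB.items():
        if total_gb >= need:
            return name
    return "tiny"

model = whisper.load_model(pick_model())
```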

## Transcription options

### Language specification

```python
# Auto-detect language
result = model.transcribe("audio.mp3")

# Specify language (faster)
result = model.transcribe("audio.mp3", language="en")

# Supported: en, es, fr, de, it, pt, ru, ja, ko, zh, and 89 more
```

### Task selection

```python
# Transcription (default)
result = model.transcribe("audio.mp3", task="transcribe")

# Translation to English
result = model.transcribe("spanish.mp3", task="translate")
# Input: Spanish audio → Output: English text
```

### Initial prompt

```python
# Improve accuracy with context
result = model.transcribe(
    "audio.mp3",
    initial_prompt="This is a technical podcast about machine learning and AI."
)

# Helps with:
# - Technical terms
# - Proper nouns
# - Domain-specific vocabulary
```

### Timestamps

```python
# Word-level timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} ({word['start']:.2f}s - {word['end']:.2f}s)")
```

### Temperature fallback

```python
# Retry with progressively higher temperatures when a decode fails quality checks
result = model.transcribe(
    "audio.mp3",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
)
```

## Command line usage

```bash
# Basic transcription
whisper audio.mp3

# Specify model
whisper audio.mp3 --model turbo

# Output formats
whisper audio.mp3 --output_format txt   # Plain text
whisper audio.mp3 --output_format srt   # Subtitles
whisper audio.mp3 --output_format vtt   # WebVTT
whisper audio.mp3 --output_format json  # JSON with timestamps

# Language
whisper audio.mp3 --language Spanish

# Translation
whisper spanish.mp3 --task translate
```

## Batch processing

```python
import whisper

model = whisper.load_model("base")
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]

for audio_file in audio_files:
    print(f"Transcribing {audio_file}...")
    result = model.transcribe(audio_file)

    # Save to file
    output_file = audio_file.replace(".mp3", ".txt")
    with open(output_file, "w") as f:
        f.write(result["text"])
```

## Real-time transcription

```python
# For streaming audio, use faster-whisper
# pip install faster-whisper

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cuda", compute_type="float16")

# transcribe() returns a lazy generator of segments
segments, info = model.transcribe("audio.mp3", beam_size=5)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

## GPU acceleration

```python
import whisper

# Automatically uses GPU if available
model = whisper.load_model("turbo")

# Force CPU
model = whisper.load_model("turbo", device="cpu")

# Force GPU
model = whisper.load_model("turbo", device="cuda")

# 10-20× faster on GPU
```

## Integration with other tools

### Subtitle generation

```bash
# Generate SRT subtitles
whisper video.mp4 --output_format srt --language English

# Output: video.srt
```

### With LangChain

```python
from langchain.document_loaders import WhisperTranscriptionLoader

loader = WhisperTranscriptionLoader(file_path="audio.mp3")
docs = loader.load()

# Use transcription in RAG
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
```

### Extract audio from video

```bash
# Use ffmpeg to extract audio
ffmpeg -i video.mp4 -vn -acodec pcm_s16le audio.wav

# Then transcribe
whisper audio.wav
```

## Best practices

1. **Use turbo model** - Best speed/quality for English
2. **Specify language** - Faster than auto-detect
3. **Add initial prompt** - Improves technical terms
4. **Use GPU** - 10-20× faster
5. **Batch process** - More efficient
6. **Convert to WAV** - Better compatibility
7. **Split long audio** - Process in chunks under 30 minutes (see the sketch after this list)
8. **Check language support** - Quality varies by language
9. **Use faster-whisper** - 4× faster than openai-whisper
10. **Monitor VRAM** - Scale model size to hardware
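
For practice 7, one way to chunk a long recording is with pydub. A minimal sketch, assuming pydub and ffmpeg are installed (pydub is not among this skill's dependencies); the 25-minute chunk length and file names are arbitrary placeholders:

```python
import whisper
from pydub import AudioSegment  # pip install pydub (assumption: not bundled with this skill)

CHUNK_MS = 25 * 60 * 1000  # 25-minute chunks, comfortably under 30 minutes

def transcribe_long(path: str, model_name: str = "turbo") -> str:
    """Split a long recording into fixed-length chunks and join the transcripts."""
    model = whisper.load_model(model_name)
    audio = AudioSegment.from_file(path)      # length is reported in milliseconds
    texts = []
    for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
        chunk_path = f"chunk_{i}.wav"
        audio[start:start + CHUNK_MS].export(chunk_path, format="wav")
        texts.append(model.transcribe(chunk_path)["text"])
    return " ".join(texts)

print(transcribe_long("long_podcast.mp3"))
```

Splitting on silence instead of fixed offsets (e.g., pydub's `split_on_silence`) avoids cutting words mid-sentence, at the cost of extra tuning.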

## Performance

| Model | Real-time factor (CPU) | Real-time factor (GPU) |
|-------|------------------------|------------------------|
| tiny  | ~0.32                  | ~0.01                  |
| base  | ~0.16                  | ~0.01                  |
| turbo | ~0.08                  | ~0.01                  |
| large | ~1.0                   | ~0.05                  |

*Real-time factor = processing time ÷ audio duration, so 0.1 means 10× faster than real-time; at that rate a 60-minute recording transcribes in roughly 6 minutes.*
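
These factors vary widely with hardware, so it can be worth measuring on your own machine. A small sketch, assuming a local `audio.mp3` and the `base` model (both placeholders):

```python
import time
import whisper

model = whisper.load_model("base")

# Duration from the decoded waveform (whisper.load_audio resamples to 16 kHz mono)
audio = whisper.load_audio("audio.mp3")
duration_s = len(audio) / 16000

start = time.perf_counter()
model.transcribe(audio)
elapsed_s = time.perf_counter() - start

print(f"Real-time factor: {elapsed_s / duration_s:.2f}")
```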

## Language support

Top-supported languages:
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Japanese (ja)
- Korean (ko)
- Chinese (zh)

Full list: 99 languages total
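
Whisper can also identify the spoken language before transcribing. This sketch follows the upstream `detect_language` usage from the project README; `audio.mp3` is a placeholder, and only the first 30 seconds are inspected because that is the window the encoder expects:

```python
import whisper

model = whisper.load_model("turbo")

# Load the audio and pad/trim it to the 30-second window the encoder expects
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Log-Mel spectrogram on the model's device (turbo/large-v3 use 128 mel bins)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Probability distribution over all supported languages; keep the most likely one
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```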

## Limitations

1. **Hallucinations** - May repeat or invent text, especially over silence or music (see the sketch after this list)
2. **Long-form accuracy** - Degrades on >30 min audio
3. **Speaker identification** - No diarization
4. **Accents** - Quality varies
5. **Background noise** - Can affect accuracy
6. **Real-time latency** - Not suitable for live captioning out of the box
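
The decoding thresholds below are existing `transcribe()` parameters in openai-whisper and can reduce repetition on silence-heavy recordings; the specific values are assumptions to tune per dataset, not recommended defaults:

```python
import whisper

model = whisper.load_model("turbo")

result = model.transcribe(
    "meeting.mp3",                     # placeholder path
    condition_on_previous_text=False,  # don't feed prior output back in; limits repetition loops
    no_speech_threshold=0.5,           # slightly more willing to mark a segment as silence
    logprob_threshold=-0.8,            # treat low-confidence decodes as failures and retry hotter
    temperature=(0.0, 0.2, 0.4),       # shorter fallback ladder than the default six steps
)
print(result["text"])
```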

## Resources

- **GitHub**: https://github.com/openai/whisper ⭐ 72,900+
- **Paper**: https://arxiv.org/abs/2212.04356
- **Model Card**: https://github.com/openai/whisper/blob/main/model-card.md
- **Colab**: Available in repo
- **License**: MIT