mirror of https://github.com/NousResearch/hermes-agent.git (synced 2026-04-25 00:51:20 +00:00)
skills: move 7 niche mlops/mcp skills to optional (#12474)
Built-in → optional-skills/:
- mlops/training/peft → optional-skills/mlops/peft
- mlops/training/pytorch-fsdp → optional-skills/mlops/pytorch-fsdp
- mlops/models/clip → optional-skills/mlops/clip
- mlops/models/stable-diffusion → optional-skills/mlops/stable-diffusion
- mlops/models/whisper → optional-skills/mlops/whisper
- mlops/cloud/modal → optional-skills/mlops/modal
- mcp/mcporter → optional-skills/mcp/mcporter

Built-in mlops training kept: axolotl, trl-fine-tuning, unsloth. Built-in mlops models kept: audiocraft, segment-anything. Built-in mlops evaluation/research/huggingface-hub/inference all kept. native-mcp stays built-in (documents the native MCP tool); mcporter was a redundant alternative CLI.

Also: removed now-empty skills/mlops/cloud/ dir, refreshed skills/mlops/models/DESCRIPTION.md and skills/mcp/DESCRIPTION.md to match what's left, and synchronized both catalog pages (skills-catalog.md, optional-skills-catalog.md).
This commit is contained in:
parent 957ca79e8e
commit 66ee081dc1
22 changed files with 10 additions and 20 deletions
122 optional-skills/mcp/mcporter/SKILL.md (Normal file)
@@ -0,0 +1,122 @@
---
|
||||
name: mcporter
|
||||
description: Use the mcporter CLI to list, configure, auth, and call MCP servers/tools directly (HTTP or stdio), including ad-hoc servers, config edits, and CLI/type generation.
|
||||
version: 1.0.0
|
||||
author: community
|
||||
license: MIT
|
||||
metadata:
|
||||
hermes:
|
||||
tags: [MCP, Tools, API, Integrations, Interop]
|
||||
homepage: https://mcporter.dev
|
||||
prerequisites:
|
||||
commands: [npx]
|
||||
---
|
||||
|
||||
# mcporter
|
||||
|
||||
Use `mcporter` to discover, call, and manage [MCP (Model Context Protocol)](https://modelcontextprotocol.io/) servers and tools directly from the terminal.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Requires Node.js:
|
||||
```bash
|
||||
# No install needed (runs via npx)
|
||||
npx mcporter list
|
||||
|
||||
# Or install globally
|
||||
npm install -g mcporter
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# List MCP servers already configured on this machine
|
||||
mcporter list
|
||||
|
||||
# List tools for a specific server with schema details
|
||||
mcporter list <server> --schema
|
||||
|
||||
# Call a tool
|
||||
mcporter call <server.tool> key=value
|
||||
```
|
||||
|
||||
## Discovering MCP Servers
|
||||
|
||||
mcporter auto-discovers servers configured by other MCP clients (Claude Desktop, Cursor, etc.) on the machine. To find new servers to use, browse registries like [mcpfinder.dev](https://mcpfinder.dev) or [mcp.so](https://mcp.so), then connect ad-hoc:
|
||||
|
||||
```bash
|
||||
# Connect to any MCP server by URL (no config needed)
|
||||
mcporter list --http-url https://some-mcp-server.com --name my_server
|
||||
|
||||
# Or run a stdio server on the fly
|
||||
mcporter list --stdio "npx -y @modelcontextprotocol/server-filesystem" --name fs
|
||||
```
|
||||
|
||||
## Calling Tools
|
||||
|
||||
```bash
|
||||
# Key=value syntax
|
||||
mcporter call linear.list_issues team=ENG limit=5
|
||||
|
||||
# Function syntax
|
||||
mcporter call "linear.create_issue(title: \"Bug fix needed\")"
|
||||
|
||||
# Ad-hoc HTTP server (no config needed)
|
||||
mcporter call https://api.example.com/mcp.fetch url=https://example.com
|
||||
|
||||
# Ad-hoc stdio server
|
||||
mcporter call --stdio "bun run ./server.ts" scrape url=https://example.com
|
||||
|
||||
# JSON payload
|
||||
mcporter call <server.tool> --args '{"limit": 5}'
|
||||
|
||||
# Machine-readable output (recommended for Hermes)
|
||||
mcporter call <server.tool> key=value --output json
|
||||
```
|
||||
|
||||
## Auth and Config
|
||||
|
||||
```bash
|
||||
# OAuth login for a server
|
||||
mcporter auth <server | url> [--reset]
|
||||
|
||||
# Manage config
|
||||
mcporter config list
|
||||
mcporter config get <key>
|
||||
mcporter config add <server>
|
||||
mcporter config remove <server>
|
||||
mcporter config import <path>
|
||||
```
|
||||
|
||||
Config file location: `./config/mcporter.json` (override with `--config`).
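Assuming `--config` can be passed alongside the normal subcommands, pointing at a project-local file might look like:

```bash
mcporter list --config ./config/mcporter.json
```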
|
||||
|
||||
## Daemon
|
||||
|
||||
For persistent server connections:
|
||||
```bash
|
||||
mcporter daemon start
|
||||
mcporter daemon status
|
||||
mcporter daemon stop
|
||||
mcporter daemon restart
|
||||
```
|
||||
|
||||
## Code Generation
|
||||
|
||||
```bash
|
||||
# Generate a CLI wrapper for an MCP server
|
||||
mcporter generate-cli --server <name>
|
||||
mcporter generate-cli --command <url>
|
||||
|
||||
# Inspect a generated CLI
|
||||
mcporter inspect-cli <path> [--json]
|
||||
|
||||
# Generate TypeScript types/client
|
||||
mcporter emit-ts <server> --mode client
|
||||
mcporter emit-ts <server> --mode types
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Use `--output json` for structured output that's easier to parse
|
||||
- Ad-hoc servers (HTTP URL or `--stdio` command) work without any config — useful for one-off calls
|
||||
- OAuth auth may require interactive browser flow — use `terminal(command="mcporter auth <server>", pty=true)` if needed
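Combining the notes above, a hedged sketch of a one-off call against an ad-hoc stdio server with machine-readable output (the `read_file` tool name is illustrative; list the server first to see its real tools):

```bash
# Run a stdio server ad hoc and inspect its tools
mcporter list --stdio "npx -y @modelcontextprotocol/server-filesystem" --name fs

# Call one of its tools with JSON output for easy parsing
mcporter call --stdio "npx -y @modelcontextprotocol/server-filesystem" read_file path=README.md --output json
```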
|
||||
256 optional-skills/mlops/clip/SKILL.md (Normal file)
@@ -0,0 +1,256 @@
---
|
||||
name: clip
|
||||
description: OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
|
||||
version: 1.0.0
|
||||
author: Orchestra Research
|
||||
license: MIT
|
||||
dependencies: [transformers, torch, pillow]
|
||||
metadata:
|
||||
hermes:
|
||||
tags: [Multimodal, CLIP, Vision-Language, Zero-Shot, Image Classification, OpenAI, Image Search, Cross-Modal Retrieval, Content Moderation]
|
||||
|
||||
---
|
||||
|
||||
# CLIP - Contrastive Language-Image Pre-Training
|
||||
|
||||
OpenAI's vision-language model that learns image understanding from natural language supervision.
|
||||
|
||||
## When to use CLIP
|
||||
|
||||
**Use when:**
|
||||
- Zero-shot image classification (no training data needed)
|
||||
- Image-text similarity/matching
|
||||
- Semantic image search
|
||||
- Content moderation (detect NSFW, violence)
|
||||
- Visual question answering
|
||||
- Cross-modal retrieval (image→text, text→image)
|
||||
|
||||
**Metrics**:
|
||||
- **25,300+ GitHub stars**
|
||||
- Trained on 400M image-text pairs
|
||||
- Matches ResNet-50 on ImageNet (zero-shot)
|
||||
- MIT License
|
||||
|
||||
**Use alternatives instead**:
|
||||
- **BLIP-2**: Better captioning
|
||||
- **LLaVA**: Vision-language chat
|
||||
- **Segment Anything**: Image segmentation
|
||||
|
||||
## Quick start
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
pip install git+https://github.com/openai/CLIP.git
|
||||
pip install torch torchvision ftfy regex tqdm
|
||||
```
|
||||
|
||||
### Zero-shot classification
|
||||
|
||||
```python
|
||||
import torch
|
||||
import clip
|
||||
from PIL import Image
|
||||
|
||||
# Load model
|
||||
device = "cuda" if torch.cuda.is_available() else "cpu"
|
||||
model, preprocess = clip.load("ViT-B/32", device=device)
|
||||
|
||||
# Load image
|
||||
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
|
||||
|
||||
# Define possible labels
|
||||
text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)
|
||||
|
||||
# Compute similarity
|
||||
with torch.no_grad():
|
||||
image_features = model.encode_image(image)
|
||||
text_features = model.encode_text(text)
|
||||
|
||||
# Cosine similarity
|
||||
logits_per_image, logits_per_text = model(image, text)
|
||||
probs = logits_per_image.softmax(dim=-1).cpu().numpy()
|
||||
|
||||
# Print results
|
||||
labels = ["a dog", "a cat", "a bird", "a car"]
|
||||
for label, prob in zip(labels, probs[0]):
|
||||
print(f"{label}: {prob:.2%}")
|
||||
```
|
||||
|
||||
## Available models
|
||||
|
||||
```python
|
||||
# Models (sorted by size)
|
||||
models = [
|
||||
"RN50", # ResNet-50
|
||||
"RN101", # ResNet-101
|
||||
"ViT-B/32", # Vision Transformer (recommended)
|
||||
"ViT-B/16", # Better quality, slower
|
||||
"ViT-L/14", # Best quality, slowest
|
||||
]
|
||||
|
||||
model, preprocess = clip.load("ViT-B/32")
|
||||
```
|
||||
|
||||
| Model | Parameters | Speed | Quality |
|
||||
|-------|------------|-------|---------|
|
||||
| RN50 | 102M | Fast | Good |
|
||||
| ViT-B/32 | 151M | Medium | Better |
|
||||
| ViT-L/14 | 428M | Slow | Best |
|
||||
|
||||
## Image-text similarity
|
||||
|
||||
```python
|
||||
# Compute embeddings
|
||||
image_features = model.encode_image(image)
|
||||
text_features = model.encode_text(text)
|
||||
|
||||
# Normalize
|
||||
image_features /= image_features.norm(dim=-1, keepdim=True)
|
||||
text_features /= text_features.norm(dim=-1, keepdim=True)
|
||||
|
||||
# Cosine similarity
|
||||
similarity = (image_features @ text_features.T).item()  # .item() assumes a single image and a single text prompt
|
||||
print(f"Similarity: {similarity:.4f}")
|
||||
```
|
||||
|
||||
## Semantic image search
|
||||
|
||||
```python
|
||||
# Index images
|
||||
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
|
||||
image_embeddings = []
|
||||
|
||||
for img_path in image_paths:
|
||||
image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
|
||||
with torch.no_grad():
|
||||
embedding = model.encode_image(image)
|
||||
embedding /= embedding.norm(dim=-1, keepdim=True)
|
||||
image_embeddings.append(embedding)
|
||||
|
||||
image_embeddings = torch.cat(image_embeddings)
|
||||
|
||||
# Search with text query
|
||||
query = "a sunset over the ocean"
|
||||
text_input = clip.tokenize([query]).to(device)
|
||||
with torch.no_grad():
|
||||
text_embedding = model.encode_text(text_input)
|
||||
text_embedding /= text_embedding.norm(dim=-1, keepdim=True)
|
||||
|
||||
# Find most similar images
|
||||
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
|
||||
top_k = similarities.topk(3)
|
||||
|
||||
for idx, score in zip(top_k.indices, top_k.values):
|
||||
print(f"{image_paths[idx]}: {score:.3f}")
|
||||
```
|
||||
|
||||
## Content moderation
|
||||
|
||||
```python
|
||||
# Define categories
|
||||
categories = [
|
||||
"safe for work",
|
||||
"not safe for work",
|
||||
"violent content",
|
||||
"graphic content"
|
||||
]
|
||||
|
||||
text = clip.tokenize(categories).to(device)
|
||||
|
||||
# Check image
|
||||
with torch.no_grad():
|
||||
logits_per_image, _ = model(image, text)
|
||||
probs = logits_per_image.softmax(dim=-1)
|
||||
|
||||
# Get classification
|
||||
max_idx = probs.argmax().item()
|
||||
max_prob = probs[0, max_idx].item()
|
||||
|
||||
print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
|
||||
```
|
||||
|
||||
## Batch processing
|
||||
|
||||
```python
|
||||
# Process multiple images
|
||||
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
|
||||
images = torch.stack(images).to(device)
|
||||
|
||||
with torch.no_grad():
|
||||
image_features = model.encode_image(images)
|
||||
image_features /= image_features.norm(dim=-1, keepdim=True)
|
||||
|
||||
# Batch text
|
||||
texts = ["a dog", "a cat", "a bird"]
|
||||
text_tokens = clip.tokenize(texts).to(device)
|
||||
|
||||
with torch.no_grad():
|
||||
text_features = model.encode_text(text_tokens)
|
||||
text_features /= text_features.norm(dim=-1, keepdim=True)
|
||||
|
||||
# Similarity matrix (10 images × 3 texts)
|
||||
similarities = image_features @ text_features.T
|
||||
print(similarities.shape) # (10, 3)
|
||||
```
|
||||
|
||||
## Integration with vector databases
|
||||
|
||||
```python
|
||||
# Store CLIP embeddings in Chroma/FAISS
|
||||
import chromadb
|
||||
|
||||
client = chromadb.Client()
|
||||
collection = client.create_collection("image_embeddings")
|
||||
|
||||
# Add image embeddings
|
||||
for img_path, embedding in zip(image_paths, image_embeddings):
|
||||
collection.add(
|
||||
embeddings=[embedding.cpu().numpy().tolist()],
|
||||
metadatas=[{"path": img_path}],
|
||||
ids=[img_path]
|
||||
)
|
||||
|
||||
# Query with text
|
||||
query = "a sunset"
|
||||
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))[0]
|
||||
results = collection.query(
|
||||
query_embeddings=[text_embedding.cpu().numpy().tolist()],
|
||||
n_results=5
|
||||
)
|
||||
```
|
||||
|
||||
## Best practices
|
||||
|
||||
1. **Use ViT-B/32 for most cases** - Good balance
|
||||
2. **Normalize embeddings** - Required for cosine similarity
|
||||
3. **Batch processing** - More efficient
|
||||
4. **Cache embeddings** - Expensive to recompute
|
||||
5. **Use descriptive labels** - Better zero-shot performance
|
||||
6. **GPU recommended** - 10-50× faster
|
||||
7. **Preprocess images** - Use provided preprocess function
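A minimal sketch of practice 4 (caching embeddings), reusing `model`, `preprocess`, `device`, and `image_paths` from the examples above; the cache file name is arbitrary:

```python
import os
import torch
from PIL import Image

CACHE_PATH = "clip_image_embeddings.pt"  # arbitrary cache location

if os.path.exists(CACHE_PATH):
    image_embeddings = torch.load(CACHE_PATH)
else:
    feats = []
    for img_path in image_paths:
        image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
        with torch.no_grad():
            emb = model.encode_image(image)
        emb /= emb.norm(dim=-1, keepdim=True)  # normalize for cosine similarity
        feats.append(emb.cpu())
    image_embeddings = torch.cat(feats)
    torch.save(image_embeddings, CACHE_PATH)
```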
|
||||
|
||||
## Performance
|
||||
|
||||
| Operation | CPU | GPU (V100) |
|
||||
|-----------|-----|------------|
|
||||
| Image encoding | ~200ms | ~20ms |
|
||||
| Text encoding | ~50ms | ~5ms |
|
||||
| Similarity compute | <1ms | <1ms |
|
||||
|
||||
## Limitations
|
||||
|
||||
1. **Not for fine-grained tasks** - Best for broad categories
|
||||
2. **Requires descriptive text** - Vague labels perform poorly
|
||||
3. **Trained on web data** - May inherit dataset biases
|
||||
4. **No bounding boxes** - Whole image only
|
||||
5. **Limited spatial understanding** - Position/counting weak
|
||||
|
||||
## Resources
|
||||
|
||||
- **GitHub**: https://github.com/openai/CLIP ⭐ 25,300+
|
||||
- **Paper**: https://arxiv.org/abs/2103.00020
|
||||
- **Colab**: https://colab.research.google.com/github/openai/clip/
|
||||
- **License**: MIT
|
||||
|
||||
|
||||
207 optional-skills/mlops/clip/references/applications.md (Normal file)
@@ -0,0 +1,207 @@
# CLIP Applications Guide
|
||||
|
||||
Practical applications and use cases for CLIP.
|
||||
|
||||
## Zero-shot image classification
|
||||
|
||||
```python
|
||||
import torch
|
||||
import clip
|
||||
from PIL import Image
|
||||
|
||||
model, preprocess = clip.load("ViT-B/32")
|
||||
|
||||
# Define categories
|
||||
categories = [
|
||||
"a photo of a dog",
|
||||
"a photo of a cat",
|
||||
"a photo of a bird",
|
||||
"a photo of a car",
|
||||
"a photo of a person"
|
||||
]
|
||||
|
||||
# Prepare image
|
||||
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
|
||||
text = clip.tokenize(categories)
|
||||
|
||||
# Classify
|
||||
with torch.no_grad():
|
||||
image_features = model.encode_image(image)
|
||||
text_features = model.encode_text(text)
|
||||
|
||||
logits_per_image, _ = model(image, text)
|
||||
probs = logits_per_image.softmax(dim=-1).cpu().numpy()
|
||||
|
||||
# Print results
|
||||
for category, prob in zip(categories, probs[0]):
|
||||
print(f"{category}: {prob:.2%}")
|
||||
```
|
||||
|
||||
## Semantic image search
|
||||
|
||||
```python
|
||||
# Index images
|
||||
image_database = []
|
||||
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
|
||||
|
||||
for img_path in image_paths:
|
||||
image = preprocess(Image.open(img_path)).unsqueeze(0)
|
||||
with torch.no_grad():
|
||||
features = model.encode_image(image)
|
||||
features /= features.norm(dim=-1, keepdim=True)
|
||||
image_database.append((img_path, features))
|
||||
|
||||
# Search with text
|
||||
query = "a sunset over mountains"
|
||||
text_input = clip.tokenize([query])
|
||||
|
||||
with torch.no_grad():
|
||||
text_features = model.encode_text(text_input)
|
||||
text_features /= text_features.norm(dim=-1, keepdim=True)
|
||||
|
||||
# Find matches
|
||||
similarities = []
|
||||
for img_path, img_features in image_database:
|
||||
similarity = (text_features @ img_features.T).item()
|
||||
similarities.append((img_path, similarity))
|
||||
|
||||
# Sort by similarity
|
||||
similarities.sort(key=lambda x: x[1], reverse=True)
|
||||
for img_path, score in similarities[:3]:
|
||||
print(f"{img_path}: {score:.3f}")
|
||||
```
|
||||
|
||||
## Content moderation
|
||||
|
||||
```python
|
||||
# Define safety categories
|
||||
categories = [
|
||||
"safe for work content",
|
||||
"not safe for work content",
|
||||
"violent or graphic content",
|
||||
"hate speech or offensive content",
|
||||
"spam or misleading content"
|
||||
]
|
||||
|
||||
text = clip.tokenize(categories)
|
||||
|
||||
# Check image
|
||||
with torch.no_grad():
|
||||
logits, _ = model(image, text)
|
||||
probs = logits.softmax(dim=-1)
|
||||
|
||||
# Get classification
|
||||
max_idx = probs.argmax().item()
|
||||
confidence = probs[0, max_idx].item()
|
||||
|
||||
if confidence > 0.7:
|
||||
print(f"Classified as: {categories[max_idx]} ({confidence:.2%})")
|
||||
else:
|
||||
print(f"Uncertain classification (confidence: {confidence:.2%})")
|
||||
```
|
||||
|
||||
## Image-to-text retrieval
|
||||
|
||||
```python
|
||||
# Text database
|
||||
captions = [
|
||||
"A beautiful sunset over the ocean",
|
||||
"A cute dog playing in the park",
|
||||
"A modern city skyline at night",
|
||||
"A delicious pizza with toppings"
|
||||
]
|
||||
|
||||
# Encode captions
|
||||
caption_features = []
|
||||
for caption in captions:
|
||||
text = clip.tokenize([caption])
|
||||
with torch.no_grad():
|
||||
features = model.encode_text(text)
|
||||
features /= features.norm(dim=-1, keepdim=True)
|
||||
caption_features.append(features)
|
||||
|
||||
caption_features = torch.cat(caption_features)
|
||||
|
||||
# Find matching captions for image
|
||||
with torch.no_grad():
|
||||
image_features = model.encode_image(image)
|
||||
image_features /= image_features.norm(dim=-1, keepdim=True)
|
||||
|
||||
similarities = (image_features @ caption_features.T).squeeze(0)
|
||||
top_k = similarities.topk(3)
|
||||
|
||||
for idx, score in zip(top_k.indices, top_k.values):
|
||||
print(f"{captions[idx]}: {score:.3f}")
|
||||
```
|
||||
|
||||
## Visual question answering
|
||||
|
||||
```python
|
||||
# Create yes/no questions
|
||||
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
|
||||
|
||||
questions = [
|
||||
"a photo showing people",
|
||||
"a photo showing animals",
|
||||
"a photo taken indoors",
|
||||
"a photo taken outdoors",
|
||||
"a photo taken during daytime",
|
||||
"a photo taken at night"
|
||||
]
|
||||
|
||||
text = clip.tokenize(questions)
|
||||
|
||||
with torch.no_grad():
|
||||
logits, _ = model(image, text)
|
||||
probs = logits.softmax(dim=-1)
|
||||
|
||||
# Answer questions
|
||||
for question, prob in zip(questions, probs[0]):
|
||||
    # probs are softmax-normalized across all prompts, so "Yes" means this prompt dominates the distribution
    answer = "Yes" if prob > 0.5 else "No"
|
||||
print(f"{question}: {answer} ({prob:.2%})")
|
||||
```
|
||||
|
||||
## Image deduplication
|
||||
|
||||
```python
|
||||
# Detect duplicate/similar images
|
||||
def compute_similarity(img1_path, img2_path):
|
||||
img1 = preprocess(Image.open(img1_path)).unsqueeze(0)
|
||||
img2 = preprocess(Image.open(img2_path)).unsqueeze(0)
|
||||
|
||||
with torch.no_grad():
|
||||
feat1 = model.encode_image(img1)
|
||||
feat2 = model.encode_image(img2)
|
||||
|
||||
feat1 /= feat1.norm(dim=-1, keepdim=True)
|
||||
feat2 /= feat2.norm(dim=-1, keepdim=True)
|
||||
|
||||
similarity = (feat1 @ feat2.T).item()
|
||||
|
||||
return similarity
|
||||
|
||||
# Check for duplicates
|
||||
threshold = 0.95
|
||||
image_pairs = [("img1.jpg", "img2.jpg"), ("img1.jpg", "img3.jpg")]
|
||||
|
||||
for img1, img2 in image_pairs:
|
||||
sim = compute_similarity(img1, img2)
|
||||
if sim > threshold:
|
||||
print(f"{img1} and {img2} are duplicates (similarity: {sim:.3f})")
|
||||
```
|
||||
|
||||
## Best practices
|
||||
|
||||
1. **Use descriptive labels** - "a photo of X" works better than just "X"
|
||||
2. **Normalize embeddings** - Always normalize for cosine similarity
|
||||
3. **Batch processing** - Process multiple images/texts together
|
||||
4. **Cache embeddings** - Expensive to recompute
|
||||
5. **Set appropriate thresholds** - Test on validation data
|
||||
6. **Use GPU** - 10-50× faster than CPU
|
||||
7. **Consider model size** - ViT-B/32 good default, ViT-L/14 for best quality
|
||||
|
||||
## Resources
|
||||
|
||||
- **Paper**: https://arxiv.org/abs/2103.00020
|
||||
- **GitHub**: https://github.com/openai/CLIP
|
||||
- **Colab**: https://colab.research.google.com/github/openai/clip/
|
||||
344 optional-skills/mlops/modal/SKILL.md (Normal file)
@@ -0,0 +1,344 @@
---
|
||||
name: modal-serverless-gpu
|
||||
description: Serverless GPU cloud platform for running ML workloads. Use when you need on-demand GPU access without infrastructure management, deploying ML models as APIs, or running batch jobs with automatic scaling.
|
||||
version: 1.0.0
|
||||
author: Orchestra Research
|
||||
license: MIT
|
||||
dependencies: [modal>=0.64.0]
|
||||
metadata:
|
||||
hermes:
|
||||
tags: [Infrastructure, Serverless, GPU, Cloud, Deployment, Modal]
|
||||
|
||||
---
|
||||
|
||||
# Modal Serverless GPU
|
||||
|
||||
Comprehensive guide to running ML workloads on Modal's serverless GPU cloud platform.
|
||||
|
||||
## When to use Modal
|
||||
|
||||
**Use Modal when:**
|
||||
- Running GPU-intensive ML workloads without managing infrastructure
|
||||
- Deploying ML models as auto-scaling APIs
|
||||
- Running batch processing jobs (training, inference, data processing)
|
||||
- Need pay-per-second GPU pricing without idle costs
|
||||
- Prototyping ML applications quickly
|
||||
- Running scheduled jobs (cron-like workloads)
|
||||
|
||||
**Key features:**
|
||||
- **Serverless GPUs**: T4, L4, A10G, L40S, A100, H100, H200, B200 on-demand
|
||||
- **Python-native**: Define infrastructure in Python code, no YAML
|
||||
- **Auto-scaling**: Scale to zero, scale to 100+ GPUs instantly
|
||||
- **Sub-second cold starts**: Rust-based infrastructure for fast container launches
|
||||
- **Container caching**: Image layers cached for rapid iteration
|
||||
- **Web endpoints**: Deploy functions as REST APIs with zero-downtime updates
|
||||
|
||||
**Use alternatives instead:**
|
||||
- **RunPod**: For longer-running pods with persistent state
|
||||
- **Lambda Labs**: For reserved GPU instances
|
||||
- **SkyPilot**: For multi-cloud orchestration and cost optimization
|
||||
- **Kubernetes**: For complex multi-service architectures
|
||||
|
||||
## Quick start
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
pip install modal
|
||||
modal setup # Opens browser for authentication
|
||||
```
|
||||
|
||||
### Hello World with GPU
|
||||
|
||||
```python
|
||||
import modal
|
||||
|
||||
app = modal.App("hello-gpu")
|
||||
|
||||
@app.function(gpu="T4")
|
||||
def gpu_info():
|
||||
import subprocess
|
||||
return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
|
||||
|
||||
@app.local_entrypoint()
|
||||
def main():
|
||||
print(gpu_info.remote())
|
||||
```
|
||||
|
||||
Run: `modal run hello_gpu.py`
|
||||
|
||||
### Basic inference endpoint
|
||||
|
||||
```python
|
||||
import modal
|
||||
|
||||
app = modal.App("text-generation")
|
||||
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")
|
||||
|
||||
@app.cls(gpu="A10G", image=image)
|
||||
class TextGenerator:
|
||||
@modal.enter()
|
||||
def load_model(self):
|
||||
from transformers import pipeline
|
||||
self.pipe = pipeline("text-generation", model="gpt2", device=0)
|
||||
|
||||
@modal.method()
|
||||
def generate(self, prompt: str) -> str:
|
||||
return self.pipe(prompt, max_length=100)[0]["generated_text"]
|
||||
|
||||
@app.local_entrypoint()
|
||||
def main():
|
||||
print(TextGenerator().generate.remote("Hello, world"))
|
||||
```
|
||||
|
||||
## Core concepts
|
||||
|
||||
### Key components
|
||||
|
||||
| Component | Purpose |
|
||||
|-----------|---------|
|
||||
| `App` | Container for functions and resources |
|
||||
| `Function` | Serverless function with compute specs |
|
||||
| `Cls` | Class-based functions with lifecycle hooks |
|
||||
| `Image` | Container image definition |
|
||||
| `Volume` | Persistent storage for models/data |
|
||||
| `Secret` | Secure credential storage |
|
||||
|
||||
### Execution modes
|
||||
|
||||
| Command | Description |
|
||||
|---------|-------------|
|
||||
| `modal run script.py` | Execute and exit |
|
||||
| `modal serve script.py` | Development with live reload |
|
||||
| `modal deploy script.py` | Persistent cloud deployment |
|
||||
|
||||
## GPU configuration
|
||||
|
||||
### Available GPUs
|
||||
|
||||
| GPU | VRAM | Best For |
|
||||
|-----|------|----------|
|
||||
| `T4` | 16GB | Budget inference, small models |
|
||||
| `L4` | 24GB | Inference, Ada Lovelace arch |
|
||||
| `A10G` | 24GB | Training/inference, 3.3x faster than T4 |
|
||||
| `L40S` | 48GB | Recommended for inference (best cost/perf) |
|
||||
| `A100-40GB` | 40GB | Large model training |
|
||||
| `A100-80GB` | 80GB | Very large models |
|
||||
| `H100` | 80GB | Fastest, FP8 + Transformer Engine |
|
||||
| `H200` | 141GB | Auto-upgrade from H100, 4.8TB/s bandwidth |
|
||||
| `B200` | Latest | Blackwell architecture |
|
||||
|
||||
### GPU specification patterns
|
||||
|
||||
```python
|
||||
# Single GPU
|
||||
@app.function(gpu="A100")
|
||||
|
||||
# Specific memory variant
|
||||
@app.function(gpu="A100-80GB")
|
||||
|
||||
# Multiple GPUs (up to 8)
|
||||
@app.function(gpu="H100:4")
|
||||
|
||||
# GPU with fallbacks
|
||||
@app.function(gpu=["H100", "A100", "L40S"])
|
||||
|
||||
# Any available GPU
|
||||
@app.function(gpu="any")
|
||||
```
|
||||
|
||||
## Container images
|
||||
|
||||
```python
|
||||
# Basic image with pip
|
||||
image = modal.Image.debian_slim(python_version="3.11").pip_install(
|
||||
"torch==2.1.0", "transformers==4.36.0", "accelerate"
|
||||
)
|
||||
|
||||
# From CUDA base
|
||||
image = modal.Image.from_registry(
|
||||
"nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
|
||||
add_python="3.11"
|
||||
).pip_install("torch", "transformers")
|
||||
|
||||
# With system packages
|
||||
image = modal.Image.debian_slim().apt_install("git", "ffmpeg").pip_install("openai-whisper")
|
||||
```
|
||||
|
||||
## Persistent storage
|
||||
|
||||
```python
|
||||
volume = modal.Volume.from_name("model-cache", create_if_missing=True)
|
||||
|
||||
@app.function(gpu="A10G", volumes={"/models": volume})
|
||||
def load_model():
|
||||
import os
|
||||
model_path = "/models/llama-7b"
|
||||
if not os.path.exists(model_path):
|
||||
model = download_model()
|
||||
model.save_pretrained(model_path)
|
||||
volume.commit() # Persist changes
|
||||
return load_from_path(model_path)
|
||||
```
|
||||
|
||||
## Web endpoints
|
||||
|
||||
### FastAPI endpoint decorator
|
||||
|
||||
```python
|
||||
@app.function()
|
||||
@modal.fastapi_endpoint(method="POST")
|
||||
def predict(text: str) -> dict:
|
||||
return {"result": model.predict(text)}
|
||||
```
|
||||
|
||||
### Full ASGI app
|
||||
|
||||
```python
|
||||
from fastapi import FastAPI
|
||||
web_app = FastAPI()
|
||||
|
||||
@web_app.post("/predict")
|
||||
async def predict(text: str):
|
||||
return {"result": await model.predict.remote.aio(text)}
|
||||
|
||||
@app.function()
|
||||
@modal.asgi_app()
|
||||
def fastapi_app():
|
||||
return web_app
|
||||
```
|
||||
|
||||
### Web endpoint types
|
||||
|
||||
| Decorator | Use Case |
|
||||
|-----------|----------|
|
||||
| `@modal.fastapi_endpoint()` | Simple function → API |
|
||||
| `@modal.asgi_app()` | Full FastAPI/Starlette apps |
|
||||
| `@modal.wsgi_app()` | Django/Flask apps |
|
||||
| `@modal.web_server(port)` | Arbitrary HTTP servers |
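The first two rows are demonstrated above. For `@modal.web_server`, a hedged sketch (any process listening on the declared port works; `python -m http.server` is only a stand-in):

```python
@app.function()
@modal.web_server(8000)
def file_server():
    import subprocess
    # Launch an HTTP server on the declared port; Modal proxies requests to it
    subprocess.Popen("python -m http.server 8000", shell=True)
```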
|
||||
|
||||
## Dynamic batching
|
||||
|
||||
```python
|
||||
@app.function()
|
||||
@modal.batched(max_batch_size=32, wait_ms=100)
|
||||
async def batch_predict(inputs: list[str]) -> list[dict]:
|
||||
# Inputs automatically batched
|
||||
return model.batch_predict(inputs)
|
||||
```
|
||||
|
||||
## Secrets management
|
||||
|
||||
```bash
|
||||
# Create secret
|
||||
modal secret create huggingface HF_TOKEN=hf_xxx
|
||||
```
|
||||
|
||||
```python
|
||||
@app.function(secrets=[modal.Secret.from_name("huggingface")])
|
||||
def download_model():
|
||||
import os
|
||||
token = os.environ["HF_TOKEN"]
|
||||
```
|
||||
|
||||
## Scheduling
|
||||
|
||||
```python
|
||||
@app.function(schedule=modal.Cron("0 0 * * *")) # Daily midnight
|
||||
def daily_job():
|
||||
pass
|
||||
|
||||
@app.function(schedule=modal.Period(hours=1))
|
||||
def hourly_job():
|
||||
pass
|
||||
```
|
||||
|
||||
## Performance optimization
|
||||
|
||||
### Cold start mitigation
|
||||
|
||||
```python
|
||||
@app.function(
|
||||
container_idle_timeout=300, # Keep warm 5 min
|
||||
allow_concurrent_inputs=10, # Handle concurrent requests
|
||||
)
|
||||
def inference():
|
||||
pass
|
||||
```
|
||||
|
||||
### Model loading best practices
|
||||
|
||||
```python
|
||||
@app.cls(gpu="A100")
|
||||
class Model:
|
||||
@modal.enter() # Run once at container start
|
||||
def load(self):
|
||||
self.model = load_model() # Load during warm-up
|
||||
|
||||
@modal.method()
|
||||
def predict(self, x):
|
||||
return self.model(x)
|
||||
```
|
||||
|
||||
## Parallel processing
|
||||
|
||||
```python
|
||||
@app.function()
|
||||
def process_item(item):
|
||||
return expensive_computation(item)
|
||||
|
||||
@app.function()
|
||||
def run_parallel():
|
||||
items = list(range(1000))
|
||||
# Fan out to parallel containers
|
||||
results = list(process_item.map(items))
|
||||
return results
|
||||
```
|
||||
|
||||
## Common configuration
|
||||
|
||||
```python
|
||||
@app.function(
|
||||
gpu="A100",
|
||||
memory=32768, # 32GB RAM
|
||||
cpu=4, # 4 CPU cores
|
||||
timeout=3600, # 1 hour max
|
||||
    container_idle_timeout=120,  # Keep warm 2 min
|
||||
retries=3, # Retry on failure
|
||||
concurrency_limit=10, # Max concurrent containers
|
||||
)
|
||||
def my_function():
|
||||
pass
|
||||
```
|
||||
|
||||
## Debugging
|
||||
|
||||
```python
|
||||
# Test locally
|
||||
if __name__ == "__main__":
|
||||
result = my_function.local()
|
||||
|
||||
# View logs
|
||||
# modal app logs my-app
|
||||
```
|
||||
|
||||
## Common issues
|
||||
|
||||
| Issue | Solution |
|
||||
|-------|----------|
|
||||
| Cold start latency | Increase `container_idle_timeout`, use `@modal.enter()` |
|
||||
| GPU OOM | Use larger GPU (`A100-80GB`), enable gradient checkpointing |
|
||||
| Image build fails | Pin dependency versions, check CUDA compatibility |
|
||||
| Timeout errors | Increase `timeout`, add checkpointing |
|
||||
|
||||
## References
|
||||
|
||||
- **[Advanced Usage](references/advanced-usage.md)** - Multi-GPU, distributed training, cost optimization
|
||||
- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions
|
||||
|
||||
## Resources
|
||||
|
||||
- **Documentation**: https://modal.com/docs
|
||||
- **Examples**: https://github.com/modal-labs/modal-examples
|
||||
- **Pricing**: https://modal.com/pricing
|
||||
- **Discord**: https://discord.gg/modal
|
||||
503 optional-skills/mlops/modal/references/advanced-usage.md (Normal file)
@@ -0,0 +1,503 @@
# Modal Advanced Usage Guide
|
||||
|
||||
## Multi-GPU Training
|
||||
|
||||
### Single-node multi-GPU
|
||||
|
||||
```python
|
||||
import modal
|
||||
|
||||
app = modal.App("multi-gpu-training")
|
||||
image = modal.Image.debian_slim().pip_install("torch", "transformers", "accelerate")
|
||||
|
||||
@app.function(gpu="H100:4", image=image, timeout=7200)
|
||||
def train_multi_gpu():
|
||||
from accelerate import Accelerator
|
||||
|
||||
accelerator = Accelerator()
|
||||
    # model, optimizer, dataloader are assumed to be constructed earlier in the function
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
|
||||
|
||||
for batch in dataloader:
|
||||
outputs = model(**batch)
|
||||
loss = outputs.loss
|
||||
accelerator.backward(loss)
|
||||
optimizer.step()
|
||||
```
|
||||
|
||||
### DeepSpeed integration
|
||||
|
||||
```python
|
||||
image = modal.Image.debian_slim().pip_install(
|
||||
"torch", "transformers", "deepspeed", "accelerate"
|
||||
)
|
||||
|
||||
@app.function(gpu="A100:8", image=image, timeout=14400)
|
||||
def deepspeed_train(config: dict):
|
||||
from transformers import Trainer, TrainingArguments
|
||||
|
||||
args = TrainingArguments(
|
||||
output_dir="/outputs",
|
||||
deepspeed="ds_config.json",
|
||||
fp16=True,
|
||||
per_device_train_batch_size=4,
|
||||
gradient_accumulation_steps=4
|
||||
)
|
||||
|
||||
trainer = Trainer(model=model, args=args, train_dataset=dataset)
|
||||
trainer.train()
|
||||
```
|
||||
|
||||
### Multi-GPU considerations
|
||||
|
||||
For frameworks that re-execute the Python entrypoint (like PyTorch Lightning), use:
|
||||
- `ddp_spawn` or `ddp_notebook` strategy
|
||||
- Run training as a subprocess to avoid issues
|
||||
|
||||
```python
|
||||
@app.function(gpu="H100:4")
|
||||
def train_with_subprocess():
|
||||
import subprocess
|
||||
subprocess.run(["python", "-m", "torch.distributed.launch", "train.py"])
|
||||
```
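For the `ddp_spawn` route, a hedged PyTorch Lightning sketch (assumes `pytorch-lightning` is added to the image and that `LitModel` and `train_loader` are defined elsewhere):

```python
@app.function(gpu="H100:4", image=image.pip_install("pytorch-lightning"), timeout=7200)
def train_lightning():
    import pytorch_lightning as pl

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,
        strategy="ddp_spawn",  # spawn workers instead of re-executing the entrypoint
    )
    trainer.fit(LitModel(), train_loader)  # LitModel / train_loader assumed defined elsewhere
```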
|
||||
|
||||
## Advanced Container Configuration
|
||||
|
||||
### Multi-stage builds for caching
|
||||
|
||||
```python
|
||||
# Stage 1: Base dependencies (cached)
|
||||
base_image = modal.Image.debian_slim().pip_install("torch", "numpy", "scipy")
|
||||
|
||||
# Stage 2: ML libraries (cached separately)
|
||||
ml_image = base_image.pip_install("transformers", "datasets", "accelerate")
|
||||
|
||||
# Stage 3: Custom code (rebuilt on changes)
|
||||
final_image = ml_image.copy_local_dir("./src", "/app/src")
|
||||
```
|
||||
|
||||
### Custom Dockerfiles
|
||||
|
||||
```python
|
||||
image = modal.Image.from_dockerfile("./Dockerfile")
|
||||
```
|
||||
|
||||
### Installing from Git
|
||||
|
||||
```python
|
||||
image = modal.Image.debian_slim().pip_install(
|
||||
"git+https://github.com/huggingface/transformers.git@main"
|
||||
)
|
||||
```
|
||||
|
||||
### Using uv for faster installs
|
||||
|
||||
```python
|
||||
image = modal.Image.debian_slim().uv_pip_install(
|
||||
"torch", "transformers", "accelerate"
|
||||
)
|
||||
```
|
||||
|
||||
## Advanced Class Patterns
|
||||
|
||||
### Lifecycle hooks
|
||||
|
||||
```python
|
||||
@app.cls(gpu="A10G")
|
||||
class InferenceService:
|
||||
@modal.enter()
|
||||
def startup(self):
|
||||
"""Called once when container starts"""
|
||||
self.model = load_model()
|
||||
self.tokenizer = load_tokenizer()
|
||||
|
||||
@modal.exit()
|
||||
def shutdown(self):
|
||||
"""Called when container shuts down"""
|
||||
cleanup_resources()
|
||||
|
||||
@modal.method()
|
||||
def predict(self, text: str):
|
||||
return self.model(self.tokenizer(text))
|
||||
```
|
||||
|
||||
### Concurrent request handling
|
||||
|
||||
```python
|
||||
@app.cls(
|
||||
gpu="A100",
|
||||
allow_concurrent_inputs=20, # Handle 20 requests per container
|
||||
container_idle_timeout=300
|
||||
)
|
||||
class BatchInference:
|
||||
@modal.enter()
|
||||
def load(self):
|
||||
self.model = load_model()
|
||||
|
||||
@modal.method()
|
||||
def predict(self, inputs: list):
|
||||
return self.model.batch_predict(inputs)
|
||||
```
|
||||
|
||||
### Input concurrency vs batching
|
||||
|
||||
- **Input concurrency**: Multiple requests processed simultaneously (async I/O)
|
||||
- **Dynamic batching**: Requests accumulated and processed together (GPU efficiency)
|
||||
|
||||
```python
|
||||
# Input concurrency - good for I/O-bound
|
||||
@app.function(allow_concurrent_inputs=10)
|
||||
async def fetch_data(url: str):
|
||||
async with aiohttp.ClientSession() as session:
|
||||
return await session.get(url)
|
||||
|
||||
# Dynamic batching - good for GPU inference
|
||||
@app.function()
|
||||
@modal.batched(max_batch_size=32, wait_ms=100)
|
||||
async def batch_embed(texts: list[str]) -> list[list[float]]:
|
||||
return model.encode(texts)
|
||||
```
|
||||
|
||||
## Advanced Volumes
|
||||
|
||||
### Volume operations
|
||||
|
||||
```python
|
||||
volume = modal.Volume.from_name("my-volume", create_if_missing=True)
|
||||
|
||||
@app.function(volumes={"/data": volume})
|
||||
def volume_operations():
|
||||
import os
|
||||
|
||||
# Write data
|
||||
with open("/data/output.txt", "w") as f:
|
||||
f.write("Results")
|
||||
|
||||
# Commit changes (persist to volume)
|
||||
volume.commit()
|
||||
|
||||
# Reload from remote (get latest)
|
||||
volume.reload()
|
||||
```
|
||||
|
||||
### Shared volumes between functions
|
||||
|
||||
```python
|
||||
shared_volume = modal.Volume.from_name("shared-data", create_if_missing=True)
|
||||
|
||||
@app.function(volumes={"/shared": shared_volume})
|
||||
def writer():
|
||||
with open("/shared/data.txt", "w") as f:
|
||||
f.write("Hello from writer")
|
||||
shared_volume.commit()
|
||||
|
||||
@app.function(volumes={"/shared": shared_volume})
|
||||
def reader():
|
||||
shared_volume.reload() # Get latest
|
||||
with open("/shared/data.txt", "r") as f:
|
||||
return f.read()
|
||||
```
|
||||
|
||||
### Cloud bucket mounts
|
||||
|
||||
```python
|
||||
# Mount S3 bucket
|
||||
bucket = modal.CloudBucketMount(
|
||||
bucket_name="my-bucket",
|
||||
secret=modal.Secret.from_name("aws-credentials")
|
||||
)
|
||||
|
||||
@app.function(volumes={"/s3": bucket})
|
||||
def process_s3_data():
|
||||
# Access S3 files like local filesystem
|
||||
data = open("/s3/data.parquet").read()
|
||||
```
|
||||
|
||||
## Function Composition
|
||||
|
||||
### Chaining functions
|
||||
|
||||
```python
|
||||
@app.function()
|
||||
def preprocess(data):
|
||||
return cleaned_data
|
||||
|
||||
@app.function(gpu="T4")
|
||||
def inference(data):
|
||||
return predictions
|
||||
|
||||
@app.function()
|
||||
def postprocess(predictions):
|
||||
return formatted_results
|
||||
|
||||
@app.function()
|
||||
def pipeline(raw_data):
|
||||
cleaned = preprocess.remote(raw_data)
|
||||
predictions = inference.remote(cleaned)
|
||||
results = postprocess.remote(predictions)
|
||||
return results
|
||||
```
|
||||
|
||||
### Parallel fan-out
|
||||
|
||||
```python
|
||||
@app.function()
|
||||
def process_item(item):
|
||||
return expensive_computation(item)
|
||||
|
||||
@app.function()
|
||||
def parallel_pipeline(items):
|
||||
# Fan out: process all items in parallel
|
||||
results = list(process_item.map(items))
|
||||
return results
|
||||
```
|
||||
|
||||
### Starmap for multiple arguments
|
||||
|
||||
```python
|
||||
@app.function()
|
||||
def process(x, y, z):
|
||||
return x + y + z
|
||||
|
||||
@app.function()
|
||||
def orchestrate():
|
||||
args = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
|
||||
results = list(process.starmap(args))
|
||||
return results
|
||||
```
|
||||
|
||||
## Advanced Web Endpoints
|
||||
|
||||
### WebSocket support
|
||||
|
||||
```python
|
||||
from fastapi import FastAPI, WebSocket
|
||||
|
||||
app = modal.App("websocket-app")
|
||||
web_app = FastAPI()
|
||||
|
||||
@web_app.websocket("/ws")
|
||||
async def websocket_endpoint(websocket: WebSocket):
|
||||
await websocket.accept()
|
||||
while True:
|
||||
data = await websocket.receive_text()
|
||||
await websocket.send_text(f"Processed: {data}")
|
||||
|
||||
@app.function()
|
||||
@modal.asgi_app()
|
||||
def ws_app():
|
||||
return web_app
|
||||
```
|
||||
|
||||
### Streaming responses
|
||||
|
||||
```python
|
||||
from fastapi.responses import StreamingResponse
|
||||
|
||||
@app.function(gpu="A100")
|
||||
def generate_stream(prompt: str):
|
||||
for token in model.generate_stream(prompt):
|
||||
yield token
|
||||
|
||||
@web_app.get("/stream")
|
||||
async def stream_response(prompt: str):
|
||||
return StreamingResponse(
|
||||
generate_stream.remote_gen(prompt),
|
||||
media_type="text/event-stream"
|
||||
)
|
||||
```
|
||||
|
||||
### Authentication
|
||||
|
||||
```python
|
||||
from fastapi import Depends, HTTPException, Header
|
||||
|
||||
async def verify_token(authorization: str = Header(None)):
|
||||
if not authorization or not authorization.startswith("Bearer "):
|
||||
raise HTTPException(status_code=401)
|
||||
token = authorization.split(" ")[1]
|
||||
if not verify_jwt(token):
|
||||
raise HTTPException(status_code=403)
|
||||
return token
|
||||
|
||||
@web_app.post("/predict")
|
||||
async def predict(data: dict, token: str = Depends(verify_token)):
|
||||
return model.predict(data)
|
||||
```
|
||||
|
||||
## Cost Optimization
|
||||
|
||||
### Right-sizing GPUs
|
||||
|
||||
```python
|
||||
# For inference: smaller GPUs often sufficient
|
||||
@app.function(gpu="L40S") # 48GB, best cost/perf for inference
|
||||
def inference():
|
||||
pass
|
||||
|
||||
# For training: larger GPUs for throughput
|
||||
@app.function(gpu="A100-80GB")
|
||||
def training():
|
||||
pass
|
||||
```
|
||||
|
||||
### GPU fallbacks for availability
|
||||
|
||||
```python
|
||||
@app.function(gpu=["H100", "A100", "L40S"]) # Try in order
|
||||
def flexible_compute():
|
||||
pass
|
||||
```
|
||||
|
||||
### Scale to zero
|
||||
|
||||
```python
|
||||
# Default behavior: scale to zero when idle
|
||||
@app.function(gpu="A100")
|
||||
def on_demand():
|
||||
pass
|
||||
|
||||
# Keep containers warm for low latency (costs more)
|
||||
@app.function(gpu="A100", keep_warm=1)
|
||||
def always_ready():
|
||||
pass
|
||||
```
|
||||
|
||||
### Batch processing for efficiency
|
||||
|
||||
```python
|
||||
# Process in batches to reduce cold starts
|
||||
@app.function(gpu="A100")
|
||||
def batch_process(items: list):
|
||||
return [process(item) for item in items]
|
||||
|
||||
# Better than individual calls
|
||||
results = batch_process.remote(all_items)
|
||||
```
|
||||
|
||||
## Monitoring and Observability
|
||||
|
||||
### Structured logging
|
||||
|
||||
```python
|
||||
import json
|
||||
import logging
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@app.function()
|
||||
def structured_logging(request_id: str, data: dict):
|
||||
logger.info(json.dumps({
|
||||
"event": "inference_start",
|
||||
"request_id": request_id,
|
||||
"input_size": len(data)
|
||||
}))
|
||||
|
||||
result = process(data)
|
||||
|
||||
logger.info(json.dumps({
|
||||
"event": "inference_complete",
|
||||
"request_id": request_id,
|
||||
"output_size": len(result)
|
||||
}))
|
||||
|
||||
return result
|
||||
```
|
||||
|
||||
### Custom metrics
|
||||
|
||||
```python
|
||||
@app.function(gpu="A100")
|
||||
def monitored_inference(inputs):
|
||||
import time
|
||||
|
||||
start = time.time()
|
||||
results = model.predict(inputs)
|
||||
latency = time.time() - start
|
||||
|
||||
# Log metrics (visible in Modal dashboard)
|
||||
print(f"METRIC latency={latency:.3f}s batch_size={len(inputs)}")
|
||||
|
||||
return results
|
||||
```
|
||||
|
||||
## Production Deployment
|
||||
|
||||
### Environment separation
|
||||
|
||||
```python
|
||||
import os
|
||||
|
||||
env = os.environ.get("MODAL_ENV", "dev")
|
||||
app = modal.App(f"my-service-{env}")
|
||||
|
||||
# Environment-specific config
|
||||
if env == "prod":
|
||||
gpu_config = "A100"
|
||||
timeout = 3600
|
||||
else:
|
||||
gpu_config = "T4"
|
||||
timeout = 300
|
||||
```
|
||||
|
||||
### Zero-downtime deployments
|
||||
|
||||
Modal automatically handles zero-downtime deployments:
|
||||
1. New containers are built and started
|
||||
2. Traffic gradually shifts to new version
|
||||
3. Old containers drain existing requests
|
||||
4. Old containers are terminated
|
||||
|
||||
### Health checks
|
||||
|
||||
```python
|
||||
@app.function()
|
||||
@modal.web_endpoint()
|
||||
def health():
|
||||
return {
|
||||
"status": "healthy",
|
||||
"model_loaded": hasattr(Model, "_model"),
|
||||
"gpu_available": torch.cuda.is_available()
|
||||
}
|
||||
```
|
||||
|
||||
## Sandboxes
|
||||
|
||||
### Interactive execution environments
|
||||
|
||||
```python
|
||||
@app.function()
|
||||
def run_sandbox():
|
||||
sandbox = modal.Sandbox.create(
|
||||
app=app,
|
||||
image=image,
|
||||
gpu="T4"
|
||||
)
|
||||
|
||||
# Execute code in sandbox
|
||||
    proc = sandbox.exec("python", "-c", "print('Hello from sandbox')")
    result = proc.stdout.read()  # read the process output
|
||||
|
||||
sandbox.terminate()
|
||||
return result
|
||||
```
|
||||
|
||||
## Invoking Deployed Functions
|
||||
|
||||
### From external code
|
||||
|
||||
```python
|
||||
# Call deployed function from any Python script
|
||||
import modal
|
||||
|
||||
f = modal.Function.lookup("my-app", "my_function")
|
||||
result = f.remote(arg1, arg2)
|
||||
```
|
||||
|
||||
### REST API invocation
|
||||
|
||||
```bash
|
||||
# Deployed endpoints accessible via HTTPS
|
||||
curl -X POST https://your-workspace--my-app-predict.modal.run \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"text": "Hello world"}'
|
||||
```
|
||||
494 optional-skills/mlops/modal/references/troubleshooting.md (Normal file)
@@ -0,0 +1,494 @@
# Modal Troubleshooting Guide
|
||||
|
||||
## Installation Issues
|
||||
|
||||
### Authentication fails
|
||||
|
||||
**Error**: `modal setup` doesn't complete or token is invalid
|
||||
|
||||
**Solutions**:
|
||||
```bash
|
||||
# Re-authenticate
|
||||
modal token new
|
||||
|
||||
# Check current token
|
||||
modal config show
|
||||
|
||||
# Set token via environment
|
||||
export MODAL_TOKEN_ID=ak-...
|
||||
export MODAL_TOKEN_SECRET=as-...
|
||||
```
|
||||
|
||||
### Package installation issues
|
||||
|
||||
**Error**: `pip install modal` fails
|
||||
|
||||
**Solutions**:
|
||||
```bash
|
||||
# Upgrade pip
|
||||
pip install --upgrade pip
|
||||
|
||||
# Install with specific Python version
|
||||
python3.11 -m pip install modal
|
||||
|
||||
# Install from wheel
|
||||
pip install modal --prefer-binary
|
||||
```
|
||||
|
||||
## Container Image Issues
|
||||
|
||||
### Image build fails
|
||||
|
||||
**Error**: `ImageBuilderError: Failed to build image`
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Pin package versions to avoid conflicts
|
||||
image = modal.Image.debian_slim().pip_install(
|
||||
"torch==2.1.0",
|
||||
"transformers==4.36.0", # Pin versions
|
||||
"accelerate==0.25.0"
|
||||
)
|
||||
|
||||
# Use compatible CUDA versions
|
||||
image = modal.Image.from_registry(
|
||||
"nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04", # Match PyTorch CUDA
|
||||
add_python="3.11"
|
||||
)
|
||||
```
|
||||
|
||||
### Dependency conflicts
|
||||
|
||||
**Error**: `ERROR: Cannot install package due to conflicting dependencies`
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Layer dependencies separately
|
||||
base = modal.Image.debian_slim().pip_install("torch")
|
||||
ml = base.pip_install("transformers") # Install after torch
|
||||
|
||||
# Use uv for better resolution
|
||||
image = modal.Image.debian_slim().uv_pip_install(
|
||||
"torch", "transformers"
|
||||
)
|
||||
```
|
||||
|
||||
### Large image builds timeout
|
||||
|
||||
**Error**: Image build exceeds time limit
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Split into multiple layers (better caching)
|
||||
base = modal.Image.debian_slim().pip_install("torch") # Cached
|
||||
ml = base.pip_install("transformers", "datasets") # Cached
|
||||
app = ml.copy_local_dir("./src", "/app") # Rebuilds on code change
|
||||
|
||||
# Download models during build, not runtime
|
||||
image = modal.Image.debian_slim().pip_install("transformers").run_commands(
|
||||
"python -c 'from transformers import AutoModel; AutoModel.from_pretrained(\"bert-base\")'"
|
||||
)
|
||||
```
|
||||
|
||||
## GPU Issues
|
||||
|
||||
### GPU not available
|
||||
|
||||
**Error**: `RuntimeError: CUDA not available`
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Ensure GPU is specified
|
||||
@app.function(gpu="T4") # Must specify GPU
|
||||
def my_function():
|
||||
import torch
|
||||
assert torch.cuda.is_available()
|
||||
|
||||
# Check CUDA compatibility in image
|
||||
image = modal.Image.from_registry(
|
||||
"nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
|
||||
add_python="3.11"
|
||||
).pip_install(
|
||||
"torch",
|
||||
index_url="https://download.pytorch.org/whl/cu121" # Match CUDA
|
||||
)
|
||||
```
|
||||
|
||||
### GPU out of memory
|
||||
|
||||
**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Use larger GPU
|
||||
@app.function(gpu="A100-80GB") # More VRAM
|
||||
def train():
|
||||
pass
|
||||
|
||||
# Enable memory optimization
|
||||
@app.function(gpu="A100")
|
||||
def memory_optimized():
|
||||
import torch
|
||||
torch.backends.cuda.enable_flash_sdp(True)
|
||||
|
||||
# Use gradient checkpointing
|
||||
model.gradient_checkpointing_enable()
|
||||
|
||||
# Mixed precision
|
||||
with torch.autocast(device_type="cuda", dtype=torch.float16):
|
||||
outputs = model(**inputs)
|
||||
```
|
||||
|
||||
### Wrong GPU allocated
|
||||
|
||||
**Error**: Got different GPU than requested
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Use strict GPU selection
|
||||
@app.function(gpu="H100!") # H100! prevents auto-upgrade to H200
|
||||
|
||||
# Specify exact memory variant
|
||||
@app.function(gpu="A100-80GB") # Not just "A100"
|
||||
|
||||
# Check GPU at runtime
|
||||
@app.function(gpu="A100")
|
||||
def check_gpu():
|
||||
import subprocess
|
||||
result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
|
||||
print(result.stdout)
|
||||
```
|
||||
|
||||
## Cold Start Issues
|
||||
|
||||
### Slow cold starts
|
||||
|
||||
**Problem**: First request takes too long
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Keep containers warm
|
||||
@app.function(
|
||||
container_idle_timeout=600, # Keep warm 10 min
|
||||
keep_warm=1 # Always keep 1 container ready
|
||||
)
|
||||
def low_latency():
|
||||
pass
|
||||
|
||||
# Load model during container start
|
||||
@app.cls(gpu="A100")
|
||||
class Model:
|
||||
@modal.enter()
|
||||
def load(self):
|
||||
# This runs once at container start, not per request
|
||||
self.model = load_heavy_model()
|
||||
|
||||
# Cache model in volume
|
||||
volume = modal.Volume.from_name("models", create_if_missing=True)
|
||||
|
||||
@app.function(volumes={"/cache": volume})
|
||||
def cached_model():
|
||||
if os.path.exists("/cache/model"):
|
||||
model = load_from_disk("/cache/model")
|
||||
else:
|
||||
model = download_model()
|
||||
save_to_disk(model, "/cache/model")
|
||||
volume.commit()
|
||||
```
|
||||
|
||||
### Container keeps restarting
|
||||
|
||||
**Problem**: Containers are killed and restarted frequently
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Increase memory
|
||||
@app.function(memory=32768) # 32GB RAM
|
||||
def memory_heavy():
|
||||
pass
|
||||
|
||||
# Increase timeout
|
||||
@app.function(timeout=3600) # 1 hour
|
||||
def long_running():
|
||||
pass
|
||||
|
||||
# Handle signals gracefully
|
||||
import signal
|
||||
|
||||
def handler(signum, frame):
|
||||
cleanup()
|
||||
exit(0)
|
||||
|
||||
signal.signal(signal.SIGTERM, handler)
|
||||
```
|
||||
|
||||
## Volume Issues
|
||||
|
||||
### Volume changes not persisting
|
||||
|
||||
**Error**: Data written to volume disappears
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
volume = modal.Volume.from_name("my-volume", create_if_missing=True)
|
||||
|
||||
@app.function(volumes={"/data": volume})
|
||||
def write_data():
|
||||
with open("/data/file.txt", "w") as f:
|
||||
f.write("data")
|
||||
|
||||
# CRITICAL: Commit changes!
|
||||
volume.commit()
|
||||
```
|
||||
|
||||
### Volume read shows stale data
|
||||
|
||||
**Error**: Reading outdated data from volume
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
@app.function(volumes={"/data": volume})
|
||||
def read_data():
|
||||
# Reload to get latest
|
||||
volume.reload()
|
||||
|
||||
with open("/data/file.txt", "r") as f:
|
||||
return f.read()
|
||||
```
|
||||
|
||||
### Volume mount fails
|
||||
|
||||
**Error**: `VolumeError: Failed to mount volume`
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Ensure volume exists
|
||||
volume = modal.Volume.from_name("my-volume", create_if_missing=True)
|
||||
|
||||
# Use absolute path
|
||||
@app.function(volumes={"/data": volume}) # Not "./data"
|
||||
def my_function():
|
||||
pass
|
||||
|
||||
# Check volume in dashboard
|
||||
# modal volume list
|
||||
```
|
||||
|
||||
## Web Endpoint Issues
|
||||
|
||||
### Endpoint returns 502
|
||||
|
||||
**Error**: Gateway timeout or bad gateway
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Increase timeout
|
||||
@app.function(timeout=300) # 5 min
|
||||
@modal.web_endpoint()
|
||||
def slow_endpoint():
|
||||
pass
|
||||
|
||||
# Return streaming response for long operations
|
||||
from fastapi.responses import StreamingResponse
|
||||
|
||||
@app.function()
|
||||
@modal.asgi_app()
|
||||
def streaming_app():
|
||||
async def generate():
|
||||
for i in range(100):
|
||||
yield f"data: {i}\n\n"
|
||||
await process_chunk(i)
|
||||
return StreamingResponse(generate(), media_type="text/event-stream")
|
||||
```
|
||||
|
||||
### Endpoint not accessible
|
||||
|
||||
**Error**: 404 or cannot reach endpoint
|
||||
|
||||
**Solutions**:
|
||||
```bash
|
||||
# Check deployment status
|
||||
modal app list
|
||||
|
||||
# Redeploy
|
||||
modal deploy my_app.py
|
||||
|
||||
# Check logs
|
||||
modal app logs my-app
|
||||
```
|
||||
|
||||
### CORS errors
|
||||
|
||||
**Error**: Cross-origin request blocked
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
from fastapi import FastAPI
|
||||
from fastapi.middleware.cors import CORSMiddleware
|
||||
|
||||
web_app = FastAPI()
|
||||
web_app.add_middleware(
|
||||
CORSMiddleware,
|
||||
allow_origins=["*"],
|
||||
allow_credentials=True,
|
||||
allow_methods=["*"],
|
||||
allow_headers=["*"],
|
||||
)
|
||||
|
||||
@app.function()
|
||||
@modal.asgi_app()
|
||||
def cors_enabled():
|
||||
return web_app
|
||||
```
|
||||
|
||||
## Secret Issues
|
||||
|
||||
### Secret not found
|
||||
|
||||
**Error**: `SecretNotFound: Secret 'my-secret' not found`
|
||||
|
||||
**Solutions**:
|
||||
```bash
|
||||
# Create secret via CLI
|
||||
modal secret create my-secret KEY=value
|
||||
|
||||
# List secrets
|
||||
modal secret list
|
||||
|
||||
# Check secret name matches exactly
|
||||
```
|
||||
|
||||
### Secret value not accessible
|
||||
|
||||
**Error**: Environment variable is empty
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Ensure secret is attached
|
||||
@app.function(secrets=[modal.Secret.from_name("my-secret")])
|
||||
def use_secret():
|
||||
import os
|
||||
value = os.environ.get("KEY") # Use get() to handle missing
|
||||
if not value:
|
||||
raise ValueError("KEY not set in secret")
|
||||
```
|
||||
|
||||
## Scheduling Issues
|
||||
|
||||
### Scheduled job not running
|
||||
|
||||
**Error**: Cron job doesn't execute
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Verify cron syntax
|
||||
@app.function(schedule=modal.Cron("0 0 * * *")) # Daily at midnight UTC
|
||||
def daily_job():
|
||||
pass
|
||||
|
||||
# Check timezone (Modal uses UTC)
|
||||
# "0 8 * * *" = 8am UTC, not local time
|
||||
|
||||
# Ensure app is deployed
|
||||
# modal deploy my_app.py
|
||||
```
|
||||
|
||||
### Job runs multiple times
|
||||
|
||||
**Problem**: Scheduled job executes more than expected
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Implement idempotency
|
||||
@app.function(schedule=modal.Cron("0 * * * *"))
|
||||
def hourly_job():
|
||||
job_id = get_current_hour_id()
|
||||
if already_processed(job_id):
|
||||
return
|
||||
process()
|
||||
mark_processed(job_id)
|
||||
```
|
||||
|
||||
## Debugging Tips
|
||||
|
||||
### Enable debug logging
|
||||
|
||||
```python
|
||||
import logging
|
||||
logging.basicConfig(level=logging.DEBUG)
|
||||
|
||||
@app.function()
|
||||
def debug_function():
|
||||
logging.debug("Debug message")
|
||||
logging.info("Info message")
|
||||
```
|
||||
|
||||
### View container logs
|
||||
|
||||
```bash
|
||||
# Stream logs
|
||||
modal app logs my-app
|
||||
|
||||
# View specific function
|
||||
modal app logs my-app --function my_function
|
||||
|
||||
# View historical logs
|
||||
modal app logs my-app --since 1h
|
||||
```
|
||||
|
||||
### Test locally
|
||||
|
||||
```python
|
||||
# Run function locally without Modal
|
||||
if __name__ == "__main__":
|
||||
result = my_function.local() # Runs on your machine
|
||||
print(result)
|
||||
```
|
||||
|
||||
### Inspect container
|
||||
|
||||
```python
|
||||
@app.function(gpu="T4")
|
||||
def debug_environment():
|
||||
import subprocess
|
||||
import sys
|
||||
|
||||
# System info
|
||||
print(f"Python: {sys.version}")
|
||||
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
|
||||
print(subprocess.run(["pip", "list"], capture_output=True, text=True).stdout)
|
||||
|
||||
# CUDA info
|
||||
import torch
|
||||
print(f"CUDA available: {torch.cuda.is_available()}")
|
||||
print(f"CUDA version: {torch.version.cuda}")
|
||||
print(f"GPU: {torch.cuda.get_device_name(0)}")
|
||||
```
|
||||
|
||||
## Common Error Messages
|
||||
|
||||
| Error | Cause | Solution |
|
||||
|-------|-------|----------|
|
||||
| `FunctionTimeoutError` | Function exceeded timeout | Increase `timeout` parameter |
|
||||
| `ContainerMemoryExceeded` | OOM killed | Increase `memory` parameter |
|
||||
| `ImageBuilderError` | Build failed | Check dependencies, pin versions |
|
||||
| `ResourceExhausted` | No GPUs available | Use GPU fallbacks, try later |
|
||||
| `AuthenticationError` | Invalid token | Run `modal token new` |
|
||||
| `VolumeNotFound` | Volume doesn't exist | Use `create_if_missing=True` |
|
||||
| `SecretNotFound` | Secret doesn't exist | Create secret via CLI |
|
||||
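Most of these fixes are just parameters on the function or resource definition. A minimal sketch (names and values are illustrative):

```python
import modal

app = modal.App("example-app")  # illustrative name
volume = modal.Volume.from_name("my-data", create_if_missing=True)  # avoids VolumeNotFound

@app.function(
    timeout=600,                                    # raise for FunctionTimeoutError
    memory=8192,                                    # MB; raise for ContainerMemoryExceeded
    gpu="A10G",                                     # pick an available type if ResourceExhausted
    retries=2,                                      # retry transient failures
    volumes={"/data": volume},
    secrets=[modal.Secret.from_name("my-secret")],  # must exist, see SecretNotFound
)
def resilient_job():
    ...
```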
|
||||
## Getting Help
|
||||
|
||||
1. **Documentation**: https://modal.com/docs
|
||||
2. **Examples**: https://github.com/modal-labs/modal-examples
|
||||
3. **Discord**: https://discord.gg/modal
|
||||
4. **Status**: https://status.modal.com
|
||||
|
||||
### Reporting Issues
|
||||
|
||||
Include:
|
||||
- Modal client version: `modal --version`
|
||||
- Python version: `python --version`
|
||||
- Full error traceback
|
||||
- Minimal reproducible code
|
||||
- GPU type if relevant
|
||||
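A quick way to gather most of this from the shell:

```bash
modal --version
python --version
pip show modal | head -n 2   # package name and installed version
```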
434
optional-skills/mlops/peft/SKILL.md
Normal file
|
|
@ -0,0 +1,434 @@
|
|||
---
|
||||
name: peft-fine-tuning
|
||||
description: Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.
|
||||
version: 1.0.0
|
||||
author: Orchestra Research
|
||||
license: MIT
|
||||
dependencies: [peft>=0.13.0, transformers>=4.45.0, torch>=2.0.0, bitsandbytes>=0.43.0]
|
||||
metadata:
|
||||
hermes:
|
||||
tags: [Fine-Tuning, PEFT, LoRA, QLoRA, Parameter-Efficient, Adapters, Low-Rank, Memory Optimization, Multi-Adapter]
|
||||
|
||||
---
|
||||
|
||||
# PEFT (Parameter-Efficient Fine-Tuning)
|
||||
|
||||
Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.
|
||||
|
||||
## When to use PEFT
|
||||
|
||||
**Use PEFT/LoRA when:**
|
||||
- Fine-tuning 7B-70B models on a single GPU (e.g. RTX 4090, A100)
|
||||
- Need to train <1% parameters (6MB adapters vs 16GB full model)
|
||||
- Want fast iteration with multiple task-specific adapters
|
||||
- Deploying multiple fine-tuned variants from one base model
|
||||
|
||||
**Use QLoRA (PEFT + quantization) when:**
|
||||
- Fine-tuning 70B models on a single 48GB GPU
|
||||
- Memory is the primary constraint
|
||||
- Can accept ~5% quality trade-off vs full fine-tuning
|
||||
|
||||
**Use full fine-tuning instead when:**
|
||||
- Training small models (<1B parameters)
|
||||
- Need maximum quality and have compute budget
|
||||
- Significant domain shift requires updating all weights
|
||||
|
||||
## Quick start
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
# Basic installation
|
||||
pip install peft
|
||||
|
||||
# With quantization support (recommended)
|
||||
pip install peft bitsandbytes
|
||||
|
||||
# Full stack
|
||||
pip install peft transformers accelerate bitsandbytes datasets
|
||||
```
|
||||
|
||||
### LoRA fine-tuning (standard)
|
||||
|
||||
```python
|
||||
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
|
||||
from peft import get_peft_model, LoraConfig, TaskType
|
||||
from datasets import load_dataset
|
||||
|
||||
# Load base model
|
||||
model_name = "meta-llama/Llama-3.1-8B"
|
||||
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
tokenizer.pad_token = tokenizer.eos_token
|
||||
|
||||
# LoRA configuration
|
||||
lora_config = LoraConfig(
|
||||
task_type=TaskType.CAUSAL_LM,
|
||||
r=16, # Rank (8-64, higher = more capacity)
|
||||
lora_alpha=32, # Scaling factor (typically 2*r)
|
||||
lora_dropout=0.05, # Dropout for regularization
|
||||
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # Attention layers
|
||||
bias="none" # Don't train biases
|
||||
)
|
||||
|
||||
# Apply LoRA
|
||||
model = get_peft_model(model, lora_config)
|
||||
model.print_trainable_parameters()
|
||||
# Output: trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%
|
||||
|
||||
# Prepare dataset
|
||||
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
|
||||
|
||||
def tokenize(example):
|
||||
text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
|
||||
return tokenizer(text, truncation=True, max_length=512, padding="max_length")
|
||||
|
||||
tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
tokenized.set_format("torch")  # return tensors so the collator below can stack them
|
||||
|
||||
# Training
|
||||
training_args = TrainingArguments(
|
||||
output_dir="./lora-llama",
|
||||
num_train_epochs=3,
|
||||
per_device_train_batch_size=4,
|
||||
gradient_accumulation_steps=4,
|
||||
learning_rate=2e-4,
|
||||
fp16=True,
|
||||
logging_steps=10,
|
||||
save_strategy="epoch"
|
||||
)
|
||||
|
||||
trainer = Trainer(
|
||||
model=model,
|
||||
args=training_args,
|
||||
train_dataset=tokenized,
|
||||
data_collator=lambda data: {"input_ids": torch.stack([f["input_ids"] for f in data]),
|
||||
"attention_mask": torch.stack([f["attention_mask"] for f in data]),
|
||||
"labels": torch.stack([f["input_ids"] for f in data])}
|
||||
)
|
||||
|
||||
trainer.train()
|
||||
|
||||
# Save adapter only (6MB vs 16GB)
|
||||
model.save_pretrained("./lora-llama-adapter")
|
||||
```
|
||||
|
||||
### QLoRA fine-tuning (memory-efficient)
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
|
||||
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training
|
||||
|
||||
# 4-bit quantization config
|
||||
bnb_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_quant_type="nf4", # NormalFloat4 (best for LLMs)
|
||||
bnb_4bit_compute_dtype="bfloat16", # Compute in bf16
|
||||
bnb_4bit_use_double_quant=True # Nested quantization
|
||||
)
|
||||
|
||||
# Load quantized model
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-70B",
|
||||
quantization_config=bnb_config,
|
||||
device_map="auto"
|
||||
)
|
||||
|
||||
# Prepare for training (enables gradient checkpointing)
|
||||
model = prepare_model_for_kbit_training(model)
|
||||
|
||||
# LoRA config for QLoRA
|
||||
lora_config = LoraConfig(
|
||||
r=64, # Higher rank for 70B
|
||||
lora_alpha=128,
|
||||
lora_dropout=0.1,
|
||||
target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
|
||||
bias="none",
|
||||
task_type="CAUSAL_LM"
|
||||
)
|
||||
|
||||
model = get_peft_model(model, lora_config)
|
||||
# 70B model now fits on a single 48GB GPU
|
||||
```
|
||||
|
||||
## LoRA parameter selection
|
||||
|
||||
### Rank (r) - capacity vs efficiency
|
||||
|
||||
| Rank | Trainable Params | Memory | Quality | Use Case |
|
||||
|------|-----------------|--------|---------|----------|
|
||||
| 4 | ~3M | Minimal | Lower | Simple tasks, prototyping |
|
||||
| **8** | ~7M | Low | Good | **Recommended starting point** |
|
||||
| **16** | ~14M | Medium | Better | **General fine-tuning** |
|
||||
| 32 | ~27M | Higher | High | Complex tasks |
|
||||
| 64 | ~54M | High | Highest | Domain adaptation, 70B models |
|
||||
|
||||
### Alpha (lora_alpha) - scaling factor
|
||||
|
||||
```python
|
||||
# Rule of thumb: alpha = 2 * rank
|
||||
LoraConfig(r=16, lora_alpha=32) # Standard
|
||||
LoraConfig(r=16, lora_alpha=16) # Conservative (lower learning rate effect)
|
||||
LoraConfig(r=16, lora_alpha=64) # Aggressive (higher learning rate effect)
|
||||
```
|
||||
|
||||
### Target modules by architecture
|
||||
|
||||
```python
|
||||
# Llama / Mistral / Qwen
|
||||
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
|
||||
|
||||
# GPT-2 / GPT-Neo
|
||||
target_modules = ["c_attn", "c_proj", "c_fc"]
|
||||
|
||||
# Falcon
|
||||
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
|
||||
|
||||
# BLOOM
|
||||
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
|
||||
|
||||
# Auto-detect all linear layers
|
||||
target_modules = "all-linear" # PEFT 0.6.0+
|
||||
```
|
||||
|
||||
## Loading and merging adapters
|
||||
|
||||
### Load trained adapter
|
||||
|
||||
```python
|
||||
from peft import PeftModel, AutoPeftModelForCausalLM
|
||||
from transformers import AutoModelForCausalLM
|
||||
|
||||
# Option 1: Load with PeftModel
|
||||
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
|
||||
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")
|
||||
|
||||
# Option 2: Load directly (recommended)
|
||||
model = AutoPeftModelForCausalLM.from_pretrained(
|
||||
"./lora-llama-adapter",
|
||||
device_map="auto"
|
||||
)
|
||||
```
|
||||
|
||||
### Merge adapter into base model
|
||||
|
||||
```python
|
||||
# Merge for deployment (no adapter overhead)
|
||||
merged_model = model.merge_and_unload()
|
||||
|
||||
# Save merged model
|
||||
merged_model.save_pretrained("./llama-merged")
|
||||
tokenizer.save_pretrained("./llama-merged")
|
||||
|
||||
# Push to Hub
|
||||
merged_model.push_to_hub("username/llama-finetuned")
|
||||
```
|
||||
|
||||
### Multi-adapter serving
|
||||
|
||||
```python
|
||||
from peft import PeftModel
|
||||
|
||||
# Load base with first adapter
|
||||
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1")
|
||||
|
||||
# Load additional adapters
|
||||
model.load_adapter("./adapter-task2", adapter_name="task2")
|
||||
model.load_adapter("./adapter-task3", adapter_name="task3")
|
||||
|
||||
# Switch between adapters at runtime
|
||||
model.set_adapter("task1") # Use task1 adapter
|
||||
output1 = model.generate(**inputs)
|
||||
|
||||
model.set_adapter("task2") # Switch to task2
|
||||
output2 = model.generate(**inputs)
|
||||
|
||||
# Disable adapters (use base model)
|
||||
with model.disable_adapter():
|
||||
base_output = model.generate(**inputs)
|
||||
```
|
||||
|
||||
## PEFT methods comparison
|
||||
|
||||
| Method | Trainable % | Memory | Speed | Best For |
|
||||
|--------|------------|--------|-------|----------|
|
||||
| **LoRA** | 0.1-1% | Low | Fast | General fine-tuning |
|
||||
| **QLoRA** | 0.1-1% | Very Low | Medium | Memory-constrained |
|
||||
| AdaLoRA | 0.1-1% | Low | Medium | Automatic rank selection |
|
||||
| IA3 | 0.01% | Minimal | Fastest | Few-shot adaptation |
|
||||
| Prefix Tuning | 0.1% | Low | Medium | Generation control |
|
||||
| Prompt Tuning | 0.001% | Minimal | Fast | Simple task adaptation |
|
||||
| P-Tuning v2 | 0.1% | Low | Medium | NLU tasks |
|
||||
|
||||
### IA3 (minimal parameters)
|
||||
|
||||
```python
|
||||
from peft import IA3Config
|
||||
|
||||
ia3_config = IA3Config(
|
||||
target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
|
||||
feedforward_modules=["down_proj"]
|
||||
)
|
||||
model = get_peft_model(model, ia3_config)
|
||||
# Trains only 0.01% of parameters!
|
||||
```
|
||||
|
||||
### Prefix Tuning
|
||||
|
||||
```python
|
||||
from peft import PrefixTuningConfig
|
||||
|
||||
prefix_config = PrefixTuningConfig(
|
||||
task_type="CAUSAL_LM",
|
||||
num_virtual_tokens=20, # Prepended tokens
|
||||
prefix_projection=True # Use MLP projection
|
||||
)
|
||||
model = get_peft_model(model, prefix_config)
|
||||
```
|
||||
|
||||
## Integration patterns
|
||||
|
||||
### With TRL (SFTTrainer)
|
||||
|
||||
```python
|
||||
from trl import SFTTrainer, SFTConfig
|
||||
from peft import LoraConfig
|
||||
|
||||
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model=model,
|
||||
args=SFTConfig(output_dir="./output", max_seq_length=512),
|
||||
train_dataset=dataset,
|
||||
peft_config=lora_config, # Pass LoRA config directly
|
||||
)
|
||||
trainer.train()
|
||||
```
|
||||
|
||||
### With Axolotl (YAML config)
|
||||
|
||||
```yaml
|
||||
# axolotl config.yaml
|
||||
adapter: lora
|
||||
lora_r: 16
|
||||
lora_alpha: 32
|
||||
lora_dropout: 0.05
|
||||
lora_target_modules:
|
||||
- q_proj
|
||||
- v_proj
|
||||
- k_proj
|
||||
- o_proj
|
||||
lora_target_linear: true # Target all linear layers
|
||||
```
|
||||
|
||||
### With vLLM (inference)
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
from vllm.lora.request import LoRARequest
|
||||
|
||||
# Load base model with LoRA support
|
||||
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)
|
||||
|
||||
# Serve with adapter
|
||||
outputs = llm.generate(
|
||||
prompts,
|
||||
lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
|
||||
)
|
||||
```
|
||||
|
||||
## Performance benchmarks
|
||||
|
||||
### Memory usage (Llama 3.1 8B)
|
||||
|
||||
| Method | GPU Memory | Trainable Params |
|
||||
|--------|-----------|------------------|
|
||||
| Full fine-tuning | 60+ GB | 8B (100%) |
|
||||
| LoRA r=16 | 18 GB | 14M (0.17%) |
|
||||
| QLoRA r=16 | 6 GB | 14M (0.17%) |
|
||||
| IA3 | 16 GB | 800K (0.01%) |
|
||||
|
||||
### Training speed (A100 80GB)
|
||||
|
||||
| Method | Tokens/sec | vs Full FT |
|
||||
|--------|-----------|------------|
|
||||
| Full FT | 2,500 | 1x |
|
||||
| LoRA | 3,200 | 1.3x |
|
||||
| QLoRA | 2,100 | 0.84x |
|
||||
|
||||
### Quality (MMLU benchmark)
|
||||
|
||||
| Model | Full FT | LoRA | QLoRA |
|
||||
|-------|---------|------|-------|
|
||||
| Llama 2-7B | 45.3 | 44.8 | 44.1 |
|
||||
| Llama 2-13B | 54.8 | 54.2 | 53.5 |
|
||||
|
||||
## Common issues
|
||||
|
||||
### CUDA OOM during training
|
||||
|
||||
```python
|
||||
# Solution 1: Enable gradient checkpointing
|
||||
model.gradient_checkpointing_enable()
|
||||
|
||||
# Solution 2: Reduce batch size + increase accumulation
|
||||
TrainingArguments(
|
||||
per_device_train_batch_size=1,
|
||||
gradient_accumulation_steps=16
|
||||
)
|
||||
|
||||
# Solution 3: Use QLoRA
|
||||
from transformers import BitsAndBytesConfig
|
||||
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
|
||||
```
|
||||
|
||||
### Adapter not applying
|
||||
|
||||
```python
|
||||
# Verify adapter is active
|
||||
print(model.active_adapters) # Should show adapter name
|
||||
|
||||
# Check trainable parameters
|
||||
model.print_trainable_parameters()
|
||||
|
||||
# Ensure model in training mode
|
||||
model.train()
|
||||
```
|
||||
|
||||
### Quality degradation
|
||||
|
||||
```python
|
||||
# Increase rank
|
||||
LoraConfig(r=32, lora_alpha=64)
|
||||
|
||||
# Target more modules
|
||||
target_modules = "all-linear"
|
||||
|
||||
# Use more training data and epochs
|
||||
TrainingArguments(num_train_epochs=5)
|
||||
|
||||
# Lower learning rate
|
||||
TrainingArguments(learning_rate=1e-4)
|
||||
```
|
||||
|
||||
## Best practices
|
||||
|
||||
1. **Start with r=8-16**, increase if quality insufficient
|
||||
2. **Use alpha = 2 * rank** as starting point
|
||||
3. **Target attention + MLP layers** for best quality/efficiency
|
||||
4. **Enable gradient checkpointing** for memory savings
|
||||
5. **Save adapters frequently** (small files, easy rollback)
|
||||
6. **Evaluate on held-out data** before merging
|
||||
7. **Use QLoRA for 70B+ models** on consumer hardware
|
||||
|
||||
## References
|
||||
|
||||
- **[Advanced Usage](references/advanced-usage.md)** - DoRA, LoftQ, rank stabilization, custom modules
|
||||
- **[Troubleshooting](references/troubleshooting.md)** - Common errors, debugging, optimization
|
||||
|
||||
## Resources
|
||||
|
||||
- **GitHub**: https://github.com/huggingface/peft
|
||||
- **Docs**: https://huggingface.co/docs/peft
|
||||
- **LoRA Paper**: arXiv:2106.09685
|
||||
- **QLoRA Paper**: arXiv:2305.14314
|
||||
- **Models**: https://huggingface.co/models?library=peft
|
||||
514
optional-skills/mlops/peft/references/advanced-usage.md
Normal file
|
|
@ -0,0 +1,514 @@
|
|||
# PEFT Advanced Usage Guide
|
||||
|
||||
## Advanced LoRA Variants
|
||||
|
||||
### DoRA (Weight-Decomposed Low-Rank Adaptation)
|
||||
|
||||
DoRA decomposes weights into magnitude and direction components, often achieving better results than standard LoRA:
|
||||
|
||||
```python
|
||||
from peft import LoraConfig
|
||||
|
||||
dora_config = LoraConfig(
|
||||
r=16,
|
||||
lora_alpha=32,
|
||||
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
|
||||
use_dora=True, # Enable DoRA
|
||||
task_type="CAUSAL_LM"
|
||||
)
|
||||
|
||||
model = get_peft_model(model, dora_config)
|
||||
```
|
||||
|
||||
**When to use DoRA**:
|
||||
- Consistently outperforms LoRA on instruction-following tasks
|
||||
- Slightly higher memory (~10%) due to magnitude vectors
|
||||
- Best for quality-critical fine-tuning
|
||||
|
||||
### AdaLoRA (Adaptive Rank)
|
||||
|
||||
Automatically adjusts rank per layer based on importance:
|
||||
|
||||
```python
|
||||
from peft import AdaLoraConfig
|
||||
|
||||
adalora_config = AdaLoraConfig(
|
||||
init_r=64, # Initial rank
|
||||
target_r=16, # Target average rank
|
||||
tinit=200, # Warmup steps
|
||||
tfinal=1000, # Final pruning step
|
||||
deltaT=10, # Rank update frequency
|
||||
beta1=0.85,
|
||||
beta2=0.85,
|
||||
orth_reg_weight=0.5, # Orthogonality regularization
|
||||
target_modules=["q_proj", "v_proj"],
|
||||
task_type="CAUSAL_LM"
|
||||
)
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Allocates more rank to important layers
|
||||
- Can reduce total parameters while maintaining quality
|
||||
- Good for exploring optimal rank distribution
|
||||
|
||||
### LoRA+ (Asymmetric Learning Rates)
|
||||
|
||||
Different learning rates for A and B matrices:
|
||||
|
||||
```python
|
||||
from peft import LoraConfig
|
||||
|
||||
# LoRA+ uses higher LR for B matrix
|
||||
lora_plus_config = LoraConfig(
|
||||
r=16,
|
||||
lora_alpha=32,
|
||||
target_modules="all-linear",
|
||||
use_rslora=True, # Rank-stabilized LoRA (related technique)
|
||||
)
|
||||
|
||||
# Manual implementation of LoRA+
|
||||
from torch.optim import AdamW
|
||||
|
||||
# Group parameters
|
||||
lora_A_params = [p for n, p in model.named_parameters() if "lora_A" in n]
|
||||
lora_B_params = [p for n, p in model.named_parameters() if "lora_B" in n]
|
||||
|
||||
optimizer = AdamW([
|
||||
{"params": lora_A_params, "lr": 1e-4},
|
||||
{"params": lora_B_params, "lr": 1e-3}, # 10x higher for B
|
||||
])
|
||||
```
|
||||
|
||||
### rsLoRA (Rank-Stabilized LoRA)
|
||||
|
||||
Scales LoRA outputs to stabilize training with different ranks:
|
||||
|
||||
```python
|
||||
lora_config = LoraConfig(
|
||||
r=64,
|
||||
lora_alpha=64,
|
||||
use_rslora=True, # Enables rank-stabilized scaling
|
||||
target_modules="all-linear"
|
||||
)
|
||||
```
|
||||
|
||||
**When to use**:
|
||||
- When experimenting with different ranks
|
||||
- Helps maintain consistent behavior across rank values
|
||||
- Recommended for r > 32
|
||||
|
||||
## LoftQ (LoRA-Fine-Tuning-aware Quantization)
|
||||
|
||||
Initializes LoRA weights to compensate for quantization error:
|
||||
|
||||
```python
|
||||
from peft import LoftQConfig, LoraConfig, get_peft_model
|
||||
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
|
||||
|
||||
# LoftQ configuration
|
||||
loftq_config = LoftQConfig(
|
||||
loftq_bits=4, # Quantization bits
|
||||
loftq_iter=5, # Alternating optimization iterations
|
||||
)
|
||||
|
||||
# LoRA config with LoftQ initialization
|
||||
lora_config = LoraConfig(
|
||||
r=16,
|
||||
lora_alpha=32,
|
||||
target_modules="all-linear",
|
||||
init_lora_weights="loftq",
|
||||
loftq_config=loftq_config,
|
||||
task_type="CAUSAL_LM"
|
||||
)
|
||||
|
||||
# Load quantized model
|
||||
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B",
|
||||
quantization_config=bnb_config
|
||||
)
|
||||
|
||||
model = get_peft_model(model, lora_config)
|
||||
```
|
||||
|
||||
**Benefits over standard QLoRA**:
|
||||
- Better initial quality after quantization
|
||||
- Faster convergence
|
||||
- ~1-2% better final accuracy on benchmarks
|
||||
|
||||
## Custom Module Targeting
|
||||
|
||||
### Target specific layers
|
||||
|
||||
```python
|
||||
# Target only first and last transformer layers
|
||||
lora_config = LoraConfig(
|
||||
r=16,
|
||||
lora_alpha=32,
|
||||
target_modules=["model.layers.0.self_attn.q_proj",
|
||||
"model.layers.0.self_attn.v_proj",
|
||||
"model.layers.31.self_attn.q_proj",
|
||||
"model.layers.31.self_attn.v_proj"],
|
||||
layers_to_transform=[0, 31] # Alternative approach
|
||||
)
|
||||
```
|
||||
|
||||
### Layer pattern matching
|
||||
|
||||
```python
|
||||
# Target layers 0-10 only
|
||||
lora_config = LoraConfig(
|
||||
r=16,
|
||||
lora_alpha=32,
|
||||
target_modules="all-linear",
|
||||
layers_to_transform=list(range(11)), # Layers 0-10
|
||||
layers_pattern="model.layers"
|
||||
)
|
||||
```
|
||||
|
||||
### Exclude specific layers
|
||||
|
||||
```python
|
||||
lora_config = LoraConfig(
|
||||
r=16,
|
||||
target_modules="all-linear",
|
||||
modules_to_save=["lm_head"], # Train these fully (not LoRA)
|
||||
)
|
||||
```
|
||||
|
||||
## Embedding and LM Head Training
|
||||
|
||||
### Train embeddings with LoRA
|
||||
|
||||
```python
|
||||
from peft import LoraConfig
|
||||
|
||||
# Include embeddings
|
||||
lora_config = LoraConfig(
|
||||
r=16,
|
||||
lora_alpha=32,
|
||||
target_modules=["q_proj", "v_proj", "embed_tokens"], # Include embeddings
|
||||
modules_to_save=["lm_head"], # Train lm_head fully
|
||||
)
|
||||
```
|
||||
|
||||
### Extending vocabulary with LoRA
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
from peft import get_peft_model, LoraConfig
|
||||
|
||||
# Add new tokens
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
|
||||
new_tokens = ["<custom_token_1>", "<custom_token_2>"]
|
||||
tokenizer.add_tokens(new_tokens)
|
||||
|
||||
# Resize model embeddings
|
||||
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
|
||||
model.resize_token_embeddings(len(tokenizer))
|
||||
|
||||
# Configure LoRA to train new embeddings
|
||||
lora_config = LoraConfig(
|
||||
r=16,
|
||||
target_modules="all-linear",
|
||||
modules_to_save=["embed_tokens", "lm_head"], # Train these fully
|
||||
)
|
||||
|
||||
model = get_peft_model(model, lora_config)
|
||||
```
|
||||
|
||||
## Multi-Adapter Patterns
|
||||
|
||||
### Adapter composition
|
||||
|
||||
```python
|
||||
from peft import AutoPeftModelForCausalLM
|
||||
|
||||
# Load model with multiple adapters
|
||||
model = AutoPeftModelForCausalLM.from_pretrained("./base-adapter")
|
||||
model.load_adapter("./style-adapter", adapter_name="style")
|
||||
model.load_adapter("./task-adapter", adapter_name="task")
|
||||
|
||||
# Combine adapters (weighted sum)
|
||||
model.add_weighted_adapter(
|
||||
adapters=["style", "task"],
|
||||
weights=[0.7, 0.3],
|
||||
adapter_name="combined",
|
||||
combination_type="linear" # or "cat", "svd"
|
||||
)
|
||||
|
||||
model.set_adapter("combined")
|
||||
```
|
||||
|
||||
### Adapter stacking
|
||||
|
||||
```python
|
||||
# Stack adapters (apply sequentially)
|
||||
model.add_weighted_adapter(
|
||||
adapters=["base", "domain", "task"],
|
||||
weights=[1.0, 1.0, 1.0],
|
||||
adapter_name="stacked",
|
||||
combination_type="cat" # Concatenate adapter outputs
|
||||
)
|
||||
```
|
||||
|
||||
### Dynamic adapter switching
|
||||
|
||||
```python
|
||||
import torch
from peft import AutoPeftModelForCausalLM
|
||||
|
||||
class MultiAdapterModel:
|
||||
def __init__(self, tokenizer, default_adapter_path, extra_adapters):
|
||||
self.tokenizer = tokenizer
self.model = AutoPeftModelForCausalLM.from_pretrained(default_adapter_path)
|
||||
for name, path in extra_adapters.items():  # extra_adapters: {adapter_name: path}
|
||||
self.model.load_adapter(path, adapter_name=name)
|
||||
|
||||
def generate(self, prompt, adapter_name="default"):
|
||||
self.model.set_adapter(adapter_name)
|
||||
return self.model.generate(**self.tokenizer(prompt, return_tensors="pt").to(self.model.device))
|
||||
|
||||
def generate_ensemble(self, prompt, adapters, weights):
|
||||
"""Generate with weighted adapter ensemble"""
|
||||
outputs = []
|
||||
for adapter, weight in zip(adapters, weights):
|
||||
self.model.set_adapter(adapter)
|
||||
logits = self.model(**self.tokenizer(prompt, return_tensors="pt").to(self.model.device)).logits
|
||||
outputs.append(weight * logits)
|
||||
return torch.stack(outputs).sum(dim=0)
|
||||
```
|
||||
|
||||
## Memory Optimization
|
||||
|
||||
### Gradient checkpointing with LoRA
|
||||
|
||||
```python
|
||||
from peft import prepare_model_for_kbit_training
|
||||
|
||||
# Enable gradient checkpointing
|
||||
model = prepare_model_for_kbit_training(
|
||||
model,
|
||||
use_gradient_checkpointing=True,
|
||||
gradient_checkpointing_kwargs={"use_reentrant": False}
|
||||
)
|
||||
```
|
||||
|
||||
### CPU offloading for training
|
||||
|
||||
```python
|
||||
from accelerate import Accelerator
|
||||
|
||||
accelerator = Accelerator(
|
||||
mixed_precision="bf16",
|
||||
gradient_accumulation_steps=8,
|
||||
# Note: offloading optimizer states to CPU is configured via a DeepSpeed or FSDP plugin,
# not a plain Accelerator keyword argument
|
||||
)
|
||||
|
||||
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
|
||||
```
|
||||
|
||||
### Memory-efficient attention with LoRA
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM
|
||||
|
||||
# Combine Flash Attention 2 with LoRA
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B",
|
||||
attn_implementation="flash_attention_2",
|
||||
torch_dtype=torch.bfloat16
|
||||
)
|
||||
|
||||
# Apply LoRA
|
||||
model = get_peft_model(model, lora_config)
|
||||
```
|
||||
|
||||
## Inference Optimization
|
||||
|
||||
### Merge for deployment
|
||||
|
||||
```python
|
||||
# Merge adapter weights into base model
|
||||
merged_model = model.merge_and_unload()
|
||||
|
||||
# Quantize merged model for inference
|
||||
from transformers import BitsAndBytesConfig
|
||||
|
||||
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"./merged-model",
|
||||
quantization_config=bnb_config
|
||||
)
|
||||
```
|
||||
|
||||
### Export to different formats
|
||||
|
||||
```python
|
||||
# Export to GGUF (llama.cpp)
|
||||
# First merge, then convert
|
||||
merged_model.save_pretrained("./merged-model")
|
||||
|
||||
# Use llama.cpp converter
|
||||
# python convert-hf-to-gguf.py ./merged-model --outfile model.gguf
|
||||
|
||||
# Export to ONNX
|
||||
from optimum.onnxruntime import ORTModelForCausalLM
|
||||
|
||||
ort_model = ORTModelForCausalLM.from_pretrained(
|
||||
"./merged-model",
|
||||
export=True
|
||||
)
|
||||
ort_model.save_pretrained("./onnx-model")
|
||||
```
|
||||
|
||||
### Batch adapter inference
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
from vllm.lora.request import LoRARequest
|
||||
|
||||
# Initialize with LoRA support
|
||||
llm = LLM(
|
||||
model="meta-llama/Llama-3.1-8B",
|
||||
enable_lora=True,
|
||||
max_lora_rank=64,
|
||||
max_loras=4 # Max concurrent adapters
|
||||
)
|
||||
|
||||
# Batch with different adapters
|
||||
requests = [
|
||||
("prompt1", LoRARequest("adapter1", 1, "./adapter1")),
|
||||
("prompt2", LoRARequest("adapter2", 2, "./adapter2")),
|
||||
("prompt3", LoRARequest("adapter1", 1, "./adapter1")),
|
||||
]
|
||||
|
||||
outputs = llm.generate(
|
||||
[r[0] for r in requests],
|
||||
lora_request=[r[1] for r in requests]
|
||||
)
|
||||
```
|
||||
|
||||
## Training Recipes
|
||||
|
||||
### Instruction tuning recipe
|
||||
|
||||
```python
|
||||
lora_config = LoraConfig(
|
||||
r=16,
|
||||
lora_alpha=32,
|
||||
lora_dropout=0.05,
|
||||
target_modules="all-linear",
|
||||
bias="none",
|
||||
task_type="CAUSAL_LM"
|
||||
)
|
||||
|
||||
training_args = TrainingArguments(
|
||||
output_dir="./output",
|
||||
num_train_epochs=3,
|
||||
per_device_train_batch_size=4,
|
||||
gradient_accumulation_steps=4,
|
||||
learning_rate=2e-4,
|
||||
lr_scheduler_type="cosine",
|
||||
warmup_ratio=0.03,
|
||||
bf16=True,
|
||||
logging_steps=10,
|
||||
save_strategy="steps",
|
||||
save_steps=100,
|
||||
eval_strategy="steps",
|
||||
eval_steps=100,
|
||||
)
|
||||
```
|
||||
|
||||
### Code generation recipe
|
||||
|
||||
```python
|
||||
lora_config = LoraConfig(
|
||||
r=32, # Higher rank for code
|
||||
lora_alpha=64,
|
||||
lora_dropout=0.1,
|
||||
target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
|
||||
bias="none",
|
||||
task_type="CAUSAL_LM"
|
||||
)
|
||||
|
||||
training_args = TrainingArguments(
|
||||
learning_rate=1e-4, # Lower LR for code
|
||||
num_train_epochs=2,
|
||||
# Longer sequences (e.g. 2048) are set on SFTConfig / the tokenizer, not on TrainingArguments
|
||||
)
|
||||
```
|
||||
|
||||
### Conversational/Chat recipe
|
||||
|
||||
```python
|
||||
from trl import SFTTrainer
|
||||
|
||||
lora_config = LoraConfig(
|
||||
r=16,
|
||||
lora_alpha=16, # alpha = r for chat
|
||||
lora_dropout=0.05,
|
||||
target_modules="all-linear"
|
||||
)
|
||||
|
||||
# Use chat template
|
||||
def format_chat(example):
|
||||
messages = [
|
||||
{"role": "user", "content": example["instruction"]},
|
||||
{"role": "assistant", "content": example["response"]}
|
||||
]
|
||||
return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
|
||||
|
||||
trainer = SFTTrainer(
|
||||
model=model,
|
||||
peft_config=lora_config,
|
||||
train_dataset=dataset.map(format_chat),
|
||||
max_seq_length=1024,
|
||||
)
|
||||
```
|
||||
|
||||
## Debugging and Validation
|
||||
|
||||
### Verify adapter application
|
||||
|
||||
```python
|
||||
# Check which modules have LoRA
|
||||
for name, module in model.named_modules():
|
||||
if hasattr(module, "lora_A"):
|
||||
print(f"LoRA applied to: {name}")
|
||||
|
||||
# Print detailed config
|
||||
print(model.peft_config)
|
||||
|
||||
# Check adapter state
|
||||
print(f"Active adapters: {model.active_adapters}")
|
||||
print(f"Trainable: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
|
||||
```
|
||||
|
||||
### Compare with base model
|
||||
|
||||
```python
|
||||
# Generate with adapter
|
||||
model.set_adapter("default")
|
||||
adapter_output = model.generate(**inputs)
|
||||
|
||||
# Generate without adapter
|
||||
with model.disable_adapter():
|
||||
base_output = model.generate(**inputs)
|
||||
|
||||
print(f"Adapter: {tokenizer.decode(adapter_output[0])}")
|
||||
print(f"Base: {tokenizer.decode(base_output[0])}")
|
||||
```
|
||||
|
||||
### Monitor training metrics
|
||||
|
||||
```python
|
||||
from transformers import TrainerCallback
|
||||
|
||||
class LoRACallback(TrainerCallback):
|
||||
def on_log(self, args, state, control, logs=None, **kwargs):
|
||||
if "loss" in logs:
|
||||
# Log adapter-specific metrics
|
||||
model = kwargs["model"]
|
||||
lora_params = sum(p.numel() for n, p in model.named_parameters()
|
||||
if "lora" in n and p.requires_grad)
|
||||
print(f"Step {state.global_step}: loss={logs['loss']:.4f}, lora_params={lora_params}")
|
||||
```
|
||||
480
optional-skills/mlops/peft/references/troubleshooting.md
Normal file
|
|
@ -0,0 +1,480 @@
|
|||
# PEFT Troubleshooting Guide
|
||||
|
||||
## Installation Issues
|
||||
|
||||
### bitsandbytes CUDA Error
|
||||
|
||||
**Error**: `CUDA Setup failed despite GPU being available`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check CUDA version
|
||||
nvcc --version
|
||||
|
||||
# Install matching bitsandbytes
|
||||
pip uninstall bitsandbytes
|
||||
pip install bitsandbytes --no-cache-dir
|
||||
|
||||
# Or compile from source for specific CUDA
|
||||
git clone https://github.com/TimDettmers/bitsandbytes.git
|
||||
cd bitsandbytes
|
||||
CUDA_VERSION=118 make cuda11x # Adjust for your CUDA
|
||||
pip install .
|
||||
```
|
||||
|
||||
### Triton Import Error
|
||||
|
||||
**Error**: `ModuleNotFoundError: No module named 'triton'`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Install triton (Linux only)
|
||||
pip install triton
|
||||
|
||||
# Windows: Triton not supported, use CUDA backend
|
||||
# Optionally pin the job to a specific GPU instead:
|
||||
export CUDA_VISIBLE_DEVICES=0
|
||||
```
|
||||
|
||||
### PEFT Version Conflicts
|
||||
|
||||
**Error**: `AttributeError: 'LoraConfig' object has no attribute 'use_dora'`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Upgrade to latest PEFT
|
||||
pip install "peft>=0.13.0" --upgrade
|
||||
|
||||
# Check version
|
||||
python -c "import peft; print(peft.__version__)"
|
||||
```
|
||||
|
||||
## Training Issues
|
||||
|
||||
### CUDA Out of Memory
|
||||
|
||||
**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Enable gradient checkpointing**:
|
||||
```python
|
||||
from peft import prepare_model_for_kbit_training
|
||||
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
|
||||
```
|
||||
|
||||
2. **Reduce batch size**:
|
||||
```python
|
||||
TrainingArguments(
|
||||
per_device_train_batch_size=1,
|
||||
gradient_accumulation_steps=16 # Maintain effective batch size
|
||||
)
|
||||
```
|
||||
|
||||
3. **Use QLoRA**:
|
||||
```python
|
||||
from transformers import BitsAndBytesConfig
|
||||
|
||||
bnb_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_quant_type="nf4",
|
||||
bnb_4bit_use_double_quant=True
|
||||
)
|
||||
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
|
||||
```
|
||||
|
||||
4. **Lower LoRA rank**:
|
||||
```python
|
||||
LoraConfig(r=8) # Instead of r=16 or higher
|
||||
```
|
||||
|
||||
5. **Target fewer modules**:
|
||||
```python
|
||||
target_modules=["q_proj", "v_proj"] # Instead of all-linear
|
||||
```
|
||||
|
||||
### Loss Not Decreasing
|
||||
|
||||
**Problem**: Training loss stays flat or increases.
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Check learning rate**:
|
||||
```python
|
||||
# Start lower
|
||||
TrainingArguments(learning_rate=1e-4) # Not 2e-4 or higher
|
||||
```
|
||||
|
||||
2. **Verify adapter is active**:
|
||||
```python
|
||||
model.print_trainable_parameters()
|
||||
# Should show >0 trainable params
|
||||
|
||||
# Check adapter applied
|
||||
print(model.peft_config)
|
||||
```
|
||||
|
||||
3. **Check data formatting**:
|
||||
```python
|
||||
# Verify tokenization
|
||||
sample = dataset[0]
|
||||
decoded = tokenizer.decode(sample["input_ids"])
|
||||
print(decoded) # Should look correct
|
||||
```
|
||||
|
||||
4. **Increase rank**:
|
||||
```python
|
||||
LoraConfig(r=32, lora_alpha=64) # More capacity
|
||||
```
|
||||
|
||||
### NaN Loss
|
||||
|
||||
**Error**: `Loss is NaN`
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Use bf16 instead of fp16
|
||||
TrainingArguments(bf16=True, fp16=False)
|
||||
|
||||
# Or enable loss scaling
|
||||
TrainingArguments(fp16=True, fp16_full_eval=True)
|
||||
|
||||
# Lower learning rate
|
||||
TrainingArguments(learning_rate=5e-5)
|
||||
|
||||
# Check for data issues
|
||||
for batch in dataloader:
|
||||
if torch.isnan(batch["input_ids"].float()).any():
|
||||
print("NaN in input!")
|
||||
```
|
||||
|
||||
### Adapter Not Training
|
||||
|
||||
**Problem**: `trainable params: 0` or model not updating.
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Verify LoRA applied to correct modules
|
||||
for name, module in model.named_modules():
|
||||
if "lora" in name.lower():
|
||||
print(f"Found LoRA: {name}")
|
||||
|
||||
# Check target_modules match model architecture
|
||||
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING
|
||||
print(TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING.get(model.config.model_type))
|
||||
|
||||
# Ensure model in training mode
|
||||
model.train()
|
||||
|
||||
# Check requires_grad
|
||||
for name, param in model.named_parameters():
|
||||
if param.requires_grad:
|
||||
print(f"Trainable: {name}")
|
||||
```
|
||||
|
||||
## Loading Issues
|
||||
|
||||
### Adapter Loading Fails
|
||||
|
||||
**Error**: `ValueError: Can't find adapter weights`
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Check adapter files exist
|
||||
import os
|
||||
print(os.listdir("./adapter-path"))
|
||||
# Should contain: adapter_config.json, adapter_model.safetensors
|
||||
|
||||
# Load with correct structure
|
||||
from peft import PeftModel, PeftConfig
|
||||
|
||||
# Check config
|
||||
config = PeftConfig.from_pretrained("./adapter-path")
|
||||
print(config)
|
||||
|
||||
# Load base model first
|
||||
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
|
||||
model = PeftModel.from_pretrained(base_model, "./adapter-path")
|
||||
```
|
||||
|
||||
### Base Model Mismatch
|
||||
|
||||
**Error**: `RuntimeError: size mismatch`
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Ensure base model matches adapter
|
||||
from peft import PeftConfig
|
||||
|
||||
config = PeftConfig.from_pretrained("./adapter-path")
|
||||
print(f"Base model: {config.base_model_name_or_path}")
|
||||
|
||||
# Load exact same base model
|
||||
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
|
||||
```
|
||||
|
||||
### Offline Loading and Save Format
|
||||
|
||||
**Error**: `ValueError: We couldn't connect to 'https://huggingface.co'`
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Force local loading
|
||||
model = PeftModel.from_pretrained(
|
||||
base_model,
|
||||
"./adapter-path",
|
||||
local_files_only=True
|
||||
)
|
||||
|
||||
# Or specify format
|
||||
model.save_pretrained("./adapter", safe_serialization=True) # safetensors
|
||||
model.save_pretrained("./adapter", safe_serialization=False) # pytorch
|
||||
```
|
||||
|
||||
## Inference Issues
|
||||
|
||||
### Slow Generation
|
||||
|
||||
**Problem**: Inference much slower than expected.
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Merge adapter for deployment**:
|
||||
```python
|
||||
merged_model = model.merge_and_unload()
|
||||
# No adapter overhead during inference
|
||||
```
|
||||
|
||||
2. **Use optimized inference engine**:
|
||||
```python
|
||||
from vllm import LLM
|
||||
llm = LLM(model="./merged-model", dtype="half")
|
||||
```
|
||||
|
||||
3. **Enable Flash Attention**:
|
||||
```python
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
model_name,
|
||||
attn_implementation="flash_attention_2"
|
||||
)
|
||||
```
|
||||
|
||||
### Output Quality Issues
|
||||
|
||||
**Problem**: Fine-tuned model produces worse outputs.
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Check evaluation without adapter**:
|
||||
```python
|
||||
with model.disable_adapter():
|
||||
base_output = model.generate(**inputs)
|
||||
# Compare with adapter output
|
||||
```
|
||||
|
||||
2. **Lower temperature during eval**:
|
||||
```python
|
||||
model.generate(**inputs, temperature=0.1, do_sample=False)
|
||||
```
|
||||
|
||||
3. **Retrain with more data**:
|
||||
```python
|
||||
# Increase training samples
|
||||
# Use higher quality data
|
||||
# Train for more epochs
|
||||
```
|
||||
|
||||
### Wrong Adapter Active
|
||||
|
||||
**Problem**: Model using wrong adapter or no adapter.
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Check active adapters
|
||||
print(model.active_adapters)
|
||||
|
||||
# Explicitly set adapter
|
||||
model.set_adapter("your-adapter-name")
|
||||
|
||||
# List all adapters
|
||||
print(model.peft_config.keys())
|
||||
```
|
||||
|
||||
## QLoRA Specific Issues
|
||||
|
||||
### Quantization Errors
|
||||
|
||||
**Error**: `RuntimeError: mat1 and mat2 shapes cannot be multiplied`
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Ensure compute dtype matches
|
||||
bnb_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_compute_dtype=torch.bfloat16, # Match model dtype
|
||||
bnb_4bit_quant_type="nf4"
|
||||
)
|
||||
|
||||
# Load with correct dtype
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
model_name,
|
||||
quantization_config=bnb_config,
|
||||
torch_dtype=torch.bfloat16
|
||||
)
|
||||
```
|
||||
|
||||
### QLoRA OOM
|
||||
|
||||
**Error**: OOM even with 4-bit quantization.
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Enable double quantization
|
||||
bnb_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_use_double_quant=True # Further memory reduction
|
||||
)
|
||||
|
||||
# Use offloading
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
model_name,
|
||||
quantization_config=bnb_config,
|
||||
device_map="auto",
|
||||
max_memory={0: "20GB", "cpu": "100GB"}
|
||||
)
|
||||
```
|
||||
|
||||
### QLoRA Merge Fails
|
||||
|
||||
**Error**: `RuntimeError: expected scalar type BFloat16 but found Float`
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Dequantize before merging
|
||||
from peft import PeftModel
|
||||
|
||||
# Load in higher precision for merging
|
||||
base_model = AutoModelForCausalLM.from_pretrained(
|
||||
base_model_name,
|
||||
torch_dtype=torch.float16, # Not quantized
|
||||
device_map="auto"
|
||||
)
|
||||
|
||||
# Load adapter
|
||||
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
|
||||
|
||||
# Now merge
|
||||
merged = model.merge_and_unload()
|
||||
```
|
||||
|
||||
## Multi-Adapter Issues
|
||||
|
||||
### Adapter Conflict
|
||||
|
||||
**Error**: `ValueError: Adapter with name 'default' already exists`
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Use unique names
|
||||
model.load_adapter("./adapter1", adapter_name="task1")
|
||||
model.load_adapter("./adapter2", adapter_name="task2")
|
||||
|
||||
# Or delete existing
|
||||
model.delete_adapter("default")
|
||||
```
|
||||
|
||||
### Mixed Precision Adapters
|
||||
|
||||
**Error**: Adapters trained with different dtypes.
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Convert adapter precision
|
||||
model = PeftModel.from_pretrained(base_model, "./adapter")
|
||||
model = model.to(torch.bfloat16)
|
||||
|
||||
# Or load with specific dtype
|
||||
model = PeftModel.from_pretrained(
|
||||
base_model,
|
||||
"./adapter",
|
||||
torch_dtype=torch.bfloat16
|
||||
)
|
||||
```
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### Memory Profiling
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
def print_memory():
|
||||
if torch.cuda.is_available():
|
||||
allocated = torch.cuda.memory_allocated() / 1e9
|
||||
reserved = torch.cuda.memory_reserved() / 1e9
|
||||
print(f"Allocated: {allocated:.2f}GB, Reserved: {reserved:.2f}GB")
|
||||
|
||||
# Profile during training
|
||||
print_memory() # Before
|
||||
model.train()
|
||||
loss = model(**batch).loss
|
||||
loss.backward()
|
||||
print_memory() # After
|
||||
```
|
||||
|
||||
### Speed Profiling
|
||||
|
||||
```python
|
||||
import time
|
||||
import torch
|
||||
|
||||
def benchmark_generation(model, tokenizer, prompt, n_runs=5):
|
||||
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
|
||||
|
||||
# Warmup
|
||||
model.generate(**inputs, max_new_tokens=10)
|
||||
torch.cuda.synchronize()
|
||||
|
||||
# Benchmark
|
||||
times = []
|
||||
for _ in range(n_runs):
|
||||
start = time.perf_counter()
|
||||
outputs = model.generate(**inputs, max_new_tokens=100)
|
||||
torch.cuda.synchronize()
|
||||
times.append(time.perf_counter() - start)
|
||||
|
||||
tokens = outputs.shape[1] - inputs.input_ids.shape[1]
|
||||
avg_time = sum(times) / len(times)
|
||||
print(f"Speed: {tokens/avg_time:.2f} tokens/sec")
|
||||
|
||||
# Compare adapter vs merged
|
||||
benchmark_generation(adapter_model, tokenizer, "Hello")
|
||||
benchmark_generation(merged_model, tokenizer, "Hello")
|
||||
```
|
||||
|
||||
## Getting Help
|
||||
|
||||
1. **Check PEFT GitHub Issues**: https://github.com/huggingface/peft/issues
|
||||
2. **HuggingFace Forums**: https://discuss.huggingface.co/
|
||||
3. **PEFT Documentation**: https://huggingface.co/docs/peft
|
||||
|
||||
### Debugging Template
|
||||
|
||||
When reporting issues, include:
|
||||
|
||||
```python
|
||||
# System info
|
||||
import peft
|
||||
import transformers
|
||||
import torch
|
||||
|
||||
print(f"PEFT: {peft.__version__}")
|
||||
print(f"Transformers: {transformers.__version__}")
|
||||
print(f"PyTorch: {torch.__version__}")
|
||||
print(f"CUDA: {torch.version.cuda}")
|
||||
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")
|
||||
|
||||
# Config
|
||||
print(model.peft_config)
|
||||
model.print_trainable_parameters()
|
||||
```
|
||||
129
optional-skills/mlops/pytorch-fsdp/SKILL.md
Normal file
File diff suppressed because one or more lines are too long
7
optional-skills/mlops/pytorch-fsdp/references/index.md
Normal file
|
|
@ -0,0 +1,7 @@
|
|||
# PyTorch FSDP Documentation Index
|
||||
|
||||
## Categories
|
||||
|
||||
### Other
|
||||
**File:** `other.md`
|
||||
**Pages:** 15
|
||||
4261
optional-skills/mlops/pytorch-fsdp/references/other.md
Normal file
File diff suppressed because it is too large
522
optional-skills/mlops/stable-diffusion/SKILL.md
Normal file
|
|
@ -0,0 +1,522 @@
|
|||
---
|
||||
name: stable-diffusion-image-generation
|
||||
description: State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use when generating images from text prompts, performing image-to-image translation, inpainting, or building custom diffusion pipelines.
|
||||
version: 1.0.0
|
||||
author: Orchestra Research
|
||||
license: MIT
|
||||
dependencies: [diffusers>=0.30.0, transformers>=4.41.0, accelerate>=0.31.0, torch>=2.0.0]
|
||||
metadata:
|
||||
hermes:
|
||||
tags: [Image Generation, Stable Diffusion, Diffusers, Text-to-Image, Multimodal, Computer Vision]
|
||||
|
||||
---
|
||||
|
||||
# Stable Diffusion Image Generation
|
||||
|
||||
Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.
|
||||
|
||||
## When to use Stable Diffusion
|
||||
|
||||
**Use Stable Diffusion when:**
|
||||
- Generating images from text descriptions
|
||||
- Performing image-to-image translation (style transfer, enhancement)
|
||||
- Inpainting (filling in masked regions)
|
||||
- Outpainting (extending images beyond boundaries)
|
||||
- Creating variations of existing images
|
||||
- Building custom image generation workflows
|
||||
|
||||
**Key features:**
|
||||
- **Text-to-Image**: Generate images from natural language prompts
|
||||
- **Image-to-Image**: Transform existing images with text guidance
|
||||
- **Inpainting**: Fill masked regions with context-aware content
|
||||
- **ControlNet**: Add spatial conditioning (edges, poses, depth)
|
||||
- **LoRA Support**: Efficient fine-tuning and style adaptation
|
||||
- **Multiple Models**: SD 1.5, SDXL, SD 3.0, Flux support
|
||||
|
||||
**Use alternatives instead:**
|
||||
- **DALL-E 3**: For API-based generation without GPU
|
||||
- **Midjourney**: For artistic, stylized outputs
|
||||
- **Imagen**: For Google Cloud integration
|
||||
- **Leonardo.ai**: For web-based creative workflows
|
||||
|
||||
## Quick start
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
pip install diffusers transformers accelerate torch
|
||||
pip install xformers # Optional: memory-efficient attention
|
||||
```
|
||||
|
||||
### Basic text-to-image
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
|
||||
# Load pipeline (auto-detects model type)
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16
|
||||
)
|
||||
pipe.to("cuda")
|
||||
|
||||
# Generate image
|
||||
image = pipe(
|
||||
"A serene mountain landscape at sunset, highly detailed",
|
||||
num_inference_steps=50,
|
||||
guidance_scale=7.5
|
||||
).images[0]
|
||||
|
||||
image.save("output.png")
|
||||
```
|
||||
|
||||
### Using SDXL (higher quality)
|
||||
|
||||
```python
|
||||
from diffusers import AutoPipelineForText2Image
|
||||
import torch
|
||||
|
||||
pipe = AutoPipelineForText2Image.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
torch_dtype=torch.float16,
|
||||
variant="fp16"
|
||||
)
|
||||
pipe.to("cuda")
|
||||
|
||||
# Enable memory optimization
|
||||
pipe.enable_model_cpu_offload()
|
||||
|
||||
image = pipe(
|
||||
prompt="A futuristic city with flying cars, cinematic lighting",
|
||||
height=1024,
|
||||
width=1024,
|
||||
num_inference_steps=30
|
||||
).images[0]
|
||||
```
|
||||
|
||||
## Architecture overview
|
||||
|
||||
### Three-pillar design
|
||||
|
||||
Diffusers is built around three core components:
|
||||
|
||||
```
|
||||
Pipeline (orchestration)
|
||||
├── Model (neural networks)
|
||||
│ ├── UNet / Transformer (noise prediction)
|
||||
│ ├── VAE (latent encoding/decoding)
|
||||
│ └── Text Encoder (CLIP/T5)
|
||||
└── Scheduler (denoising algorithm)
|
||||
```
|
||||
|
||||
### Pipeline inference flow
|
||||
|
||||
```
|
||||
Text Prompt → Text Encoder → Text Embeddings
|
||||
↓
|
||||
Random Noise → [Denoising Loop] ← Scheduler
|
||||
↓
|
||||
Predicted Noise
|
||||
↓
|
||||
VAE Decoder → Final Image
|
||||
```
|
||||
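The same three pillars are exposed as attributes on any loaded pipeline, which is handy for inspecting or swapping parts. A small sketch:

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)

# Models (neural networks)
print(type(pipe.unet).__name__)          # noise prediction (UNet2DConditionModel)
print(type(pipe.vae).__name__)           # latent encode/decode (AutoencoderKL)
print(type(pipe.text_encoder).__name__)  # prompt embedding (CLIP text encoder)

# Scheduler (denoising algorithm) is a separate, swappable component
print(type(pipe.scheduler).__name__)
```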
|
||||
## Core concepts
|
||||
|
||||
### Pipelines
|
||||
|
||||
Pipelines orchestrate complete workflows:
|
||||
|
||||
| Pipeline | Purpose |
|
||||
|----------|---------|
|
||||
| `StableDiffusionPipeline` | Text-to-image (SD 1.x/2.x) |
|
||||
| `StableDiffusionXLPipeline` | Text-to-image (SDXL) |
|
||||
| `StableDiffusion3Pipeline` | Text-to-image (SD 3.0) |
|
||||
| `FluxPipeline` | Text-to-image (Flux models) |
|
||||
| `StableDiffusionImg2ImgPipeline` | Image-to-image |
|
||||
| `StableDiffusionInpaintPipeline` | Inpainting |
|
||||
|
||||
### Schedulers
|
||||
|
||||
Schedulers control the denoising process:
|
||||
|
||||
| Scheduler | Steps | Quality | Use Case |
|
||||
|-----------|-------|---------|----------|
|
||||
| `EulerDiscreteScheduler` | 20-50 | Good | Default choice |
|
||||
| `EulerAncestralDiscreteScheduler` | 20-50 | Good | More variation |
|
||||
| `DPMSolverMultistepScheduler` | 15-25 | Excellent | Fast, high quality |
|
||||
| `DDIMScheduler` | 50-100 | Good | Deterministic |
|
||||
| `LCMScheduler` | 4-8 | Good | Very fast |
|
||||
| `UniPCMultistepScheduler` | 15-25 | Excellent | Fast convergence |
|
||||
|
||||
### Swapping schedulers
|
||||
|
||||
```python
|
||||
from diffusers import DPMSolverMultistepScheduler
|
||||
|
||||
# Swap for faster generation
|
||||
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
|
||||
pipe.scheduler.config
|
||||
)
|
||||
|
||||
# Now generate with fewer steps
|
||||
image = pipe(prompt, num_inference_steps=20).images[0]
|
||||
```
|
||||
|
||||
## Generation parameters
|
||||
|
||||
### Key parameters
|
||||
|
||||
| Parameter | Default | Description |
|
||||
|-----------|---------|-------------|
|
||||
| `prompt` | Required | Text description of desired image |
|
||||
| `negative_prompt` | None | What to avoid in the image |
|
||||
| `num_inference_steps` | 50 | Denoising steps (more = better quality) |
|
||||
| `guidance_scale` | 7.5 | Prompt adherence (7-12 typical) |
|
||||
| `height`, `width` | 512/1024 | Output dimensions (multiples of 8) |
|
||||
| `generator` | None | Torch generator for reproducibility |
|
||||
| `num_images_per_prompt` | 1 | Batch size |
|
||||
|
||||
### Reproducible generation
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
generator = torch.Generator(device="cuda").manual_seed(42)
|
||||
|
||||
image = pipe(
|
||||
prompt="A cat wearing a top hat",
|
||||
generator=generator,
|
||||
num_inference_steps=50
|
||||
).images[0]
|
||||
```
|
||||
|
||||
### Negative prompts
|
||||
|
||||
```python
|
||||
image = pipe(
|
||||
prompt="Professional photo of a dog in a garden",
|
||||
negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
|
||||
guidance_scale=7.5
|
||||
).images[0]
|
||||
```
|
||||
|
||||
## Image-to-image
|
||||
|
||||
Transform existing images with text guidance:
|
||||
|
||||
```python
|
||||
from diffusers import AutoPipelineForImage2Image
|
||||
from PIL import Image
import torch
|
||||
|
||||
pipe = AutoPipelineForImage2Image.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
|
||||
init_image = Image.open("input.jpg").resize((512, 512))
|
||||
|
||||
image = pipe(
|
||||
prompt="A watercolor painting of the scene",
|
||||
image=init_image,
|
||||
strength=0.75, # How much to transform (0-1)
|
||||
num_inference_steps=50
|
||||
).images[0]
|
||||
```
|
||||
|
||||
## Inpainting
|
||||
|
||||
Fill masked regions:
|
||||
|
||||
```python
|
||||
from diffusers import AutoPipelineForInpainting
|
||||
from PIL import Image
import torch
|
||||
|
||||
pipe = AutoPipelineForInpainting.from_pretrained(
|
||||
"runwayml/stable-diffusion-inpainting",
|
||||
torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
|
||||
image = Image.open("photo.jpg")
|
||||
mask = Image.open("mask.png") # White = inpaint region
|
||||
|
||||
result = pipe(
|
||||
prompt="A red car parked on the street",
|
||||
image=image,
|
||||
mask_image=mask,
|
||||
num_inference_steps=50
|
||||
).images[0]
|
||||
```
|
||||
|
||||
## ControlNet
|
||||
|
||||
Add spatial conditioning for precise control:
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
|
||||
import torch
|
||||
|
||||
# Load ControlNet for edge conditioning
|
||||
controlnet = ControlNetModel.from_pretrained(
|
||||
"lllyasviel/control_v11p_sd15_canny",
|
||||
torch_dtype=torch.float16
|
||||
)
|
||||
|
||||
pipe = StableDiffusionControlNetPipeline.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
controlnet=controlnet,
|
||||
torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
|
||||
# Use Canny edge image as control
|
||||
control_image = get_canny_image(input_image)  # input_image: your source PIL image; helper sketched below
|
||||
|
||||
image = pipe(
|
||||
prompt="A beautiful house in the style of Van Gogh",
|
||||
image=control_image,
|
||||
num_inference_steps=30
|
||||
).images[0]
|
||||
```
|
||||
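`get_canny_image` is not part of Diffusers; one common way to build the control image is with OpenCV (a sketch, assuming `opencv-python` and `numpy` are installed):

```python
import cv2
import numpy as np
from PIL import Image

def get_canny_image(image: Image.Image, low: int = 100, high: int = 200) -> Image.Image:
    """Convert an input PIL image into a 3-channel Canny edge map."""
    edges = cv2.Canny(np.array(image), low, high)  # single-channel edge map
    edges = np.stack([edges] * 3, axis=-1)         # replicate to RGB for the pipeline
    return Image.fromarray(edges)
```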
|
||||
### Available ControlNets
|
||||
|
||||
| ControlNet | Input Type | Use Case |
|
||||
|------------|------------|----------|
|
||||
| `canny` | Edge maps | Preserve structure |
|
||||
| `openpose` | Pose skeletons | Human poses |
|
||||
| `depth` | Depth maps | 3D-aware generation |
|
||||
| `normal` | Normal maps | Surface details |
|
||||
| `mlsd` | Line segments | Architectural lines |
|
||||
| `scribble` | Rough sketches | Sketch-to-image |
|
||||
|
||||
## LoRA adapters
|
||||
|
||||
Load fine-tuned style adapters:
|
||||
|
||||
```python
|
||||
from diffusers import DiffusionPipeline
|
||||
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
|
||||
# Load LoRA weights
|
||||
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")
|
||||
|
||||
# Generate with LoRA style
|
||||
image = pipe("A portrait in the trained style").images[0]
|
||||
|
||||
# Adjust LoRA strength
|
||||
pipe.fuse_lora(lora_scale=0.8)
|
||||
|
||||
# Unload LoRA
|
||||
pipe.unload_lora_weights()
|
||||
```
|
||||
|
||||
### Multiple LoRAs
|
||||
|
||||
```python
|
||||
# Load multiple LoRAs
|
||||
pipe.load_lora_weights("lora1", adapter_name="style")
|
||||
pipe.load_lora_weights("lora2", adapter_name="character")
|
||||
|
||||
# Set weights for each
|
||||
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])
|
||||
|
||||
image = pipe("A portrait").images[0]
|
||||
```
|
||||
|
||||
## Memory optimization
|
||||
|
||||
### Enable CPU offloading
|
||||
|
||||
```python
|
||||
# Model CPU offload - moves models to CPU when not in use
|
||||
pipe.enable_model_cpu_offload()
|
||||
|
||||
# Sequential CPU offload - more aggressive, slower
|
||||
pipe.enable_sequential_cpu_offload()
|
||||
```
|
||||
|
||||
### Attention slicing
|
||||
|
||||
```python
|
||||
# Reduce memory by computing attention in chunks
|
||||
pipe.enable_attention_slicing()
|
||||
|
||||
# Or request maximum slicing (lowest memory, slowest)
|
||||
pipe.enable_attention_slicing("max")
|
||||
```
|
||||
|
||||
### xFormers memory-efficient attention
|
||||
|
||||
```python
|
||||
# Requires xformers package
|
||||
pipe.enable_xformers_memory_efficient_attention()
|
||||
```
|
||||
|
||||
### VAE slicing for large images
|
||||
|
||||
```python
|
||||
# Decode latents in tiles for large images
|
||||
pipe.enable_vae_slicing()
|
||||
pipe.enable_vae_tiling()
|
||||
```
|
||||
|
||||
## Model variants
|
||||
|
||||
### Loading different precisions
|
||||
|
||||
```python
|
||||
# FP16 (recommended for GPU)
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"model-id",
|
||||
torch_dtype=torch.float16,
|
||||
variant="fp16"
|
||||
)
|
||||
|
||||
# BF16 (wider dynamic range, more numerically stable; requires Ampere or newer GPUs)
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"model-id",
|
||||
torch_dtype=torch.bfloat16
|
||||
)
|
||||
```
|
||||
|
||||
### Loading specific components
|
||||
|
||||
```python
|
||||
from diffusers import UNet2DConditionModel, AutoencoderKL
|
||||
|
||||
# Load custom VAE
|
||||
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
|
||||
|
||||
# Use with pipeline
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
vae=vae,
|
||||
torch_dtype=torch.float16
|
||||
)
|
||||
```
|
||||
|
||||
## Batch generation
|
||||
|
||||
Generate multiple images efficiently:
|
||||
|
||||
```python
|
||||
# Multiple prompts
|
||||
prompts = [
|
||||
"A cat playing piano",
|
||||
"A dog reading a book",
|
||||
"A bird painting a picture"
|
||||
]
|
||||
|
||||
images = pipe(prompts, num_inference_steps=30).images
|
||||
|
||||
# Multiple images per prompt
|
||||
images = pipe(
|
||||
"A beautiful sunset",
|
||||
num_images_per_prompt=4,
|
||||
num_inference_steps=30
|
||||
).images
|
||||
```
|
||||
|
||||
## Common workflows
|
||||
|
||||
### Workflow 1: High-quality generation
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
|
||||
import torch
|
||||
|
||||
# 1. Load SDXL with optimizations
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
torch_dtype=torch.float16,
|
||||
variant="fp16"
|
||||
)
|
||||
pipe.to("cuda")
|
||||
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
pipe.enable_model_cpu_offload()
|
||||
|
||||
# 2. Generate with quality settings
|
||||
image = pipe(
|
||||
prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
|
||||
negative_prompt="blurry, low quality, cartoon, anime, sketch",
|
||||
num_inference_steps=30,
|
||||
guidance_scale=7.5,
|
||||
height=1024,
|
||||
width=1024
|
||||
).images[0]
|
||||
```
|
||||
|
||||
### Workflow 2: Fast prototyping
|
||||
|
||||
```python
|
||||
from diffusers import AutoPipelineForText2Image, LCMScheduler
|
||||
import torch
|
||||
|
||||
# Use LCM for 4-8 step generation
|
||||
pipe = AutoPipelineForText2Image.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
|
||||
# Load LCM LoRA for fast generation
|
||||
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
|
||||
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
|
||||
pipe.fuse_lora()
|
||||
|
||||
# Generate in ~1 second
|
||||
image = pipe(
|
||||
"A beautiful landscape",
|
||||
num_inference_steps=4,
|
||||
guidance_scale=1.0
|
||||
).images[0]
|
||||
```
|
||||
|
||||
## Common issues
|
||||
|
||||
**CUDA out of memory:**
|
||||
```python
|
||||
# Enable memory optimizations
|
||||
pipe.enable_model_cpu_offload()
|
||||
pipe.enable_attention_slicing()
|
||||
pipe.enable_vae_slicing()
|
||||
|
||||
# Or use lower precision
|
||||
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
|
||||
```
|
||||
|
||||
**Black/noise images:**
|
||||
```python
|
||||
# Check VAE configuration
|
||||
# Use safety checker bypass if needed
|
||||
pipe.safety_checker = None
|
||||
|
||||
# Ensure proper dtype consistency
|
||||
pipe = pipe.to(dtype=torch.float16)
|
||||
```
|
||||
|
||||
**Slow generation:**
|
||||
```python
|
||||
# Use faster scheduler
|
||||
from diffusers import DPMSolverMultistepScheduler
|
||||
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
|
||||
|
||||
# Reduce steps
|
||||
image = pipe(prompt, num_inference_steps=20).images[0]
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
- **[Advanced Usage](references/advanced-usage.md)** - Custom pipelines, fine-tuning, deployment
|
||||
- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions
|
||||
|
||||
## Resources
|
||||
|
||||
- **Documentation**: https://huggingface.co/docs/diffusers
|
||||
- **Repository**: https://github.com/huggingface/diffusers
|
||||
- **Model Hub**: https://huggingface.co/models?library=diffusers
|
||||
- **Discord**: https://discord.gg/diffusers
|
||||
|
|
@ -0,0 +1,716 @@
|
|||
# Stable Diffusion Advanced Usage Guide
|
||||
|
||||
## Custom Pipelines
|
||||
|
||||
### Building from components
|
||||
|
||||
```python
|
||||
from diffusers import (
|
||||
UNet2DConditionModel,
|
||||
AutoencoderKL,
|
||||
DDPMScheduler,
|
||||
StableDiffusionPipeline
|
||||
)
|
||||
from transformers import CLIPTextModel, CLIPTokenizer
|
||||
import torch
|
||||
|
||||
# Load components individually
|
||||
unet = UNet2DConditionModel.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
subfolder="unet"
|
||||
)
|
||||
vae = AutoencoderKL.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
subfolder="vae"
|
||||
)
|
||||
text_encoder = CLIPTextModel.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
subfolder="text_encoder"
|
||||
)
|
||||
tokenizer = CLIPTokenizer.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
subfolder="tokenizer"
|
||||
)
|
||||
scheduler = DDPMScheduler.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
subfolder="scheduler"
|
||||
)
|
||||
|
||||
# Assemble pipeline
|
||||
pipe = StableDiffusionPipeline(
|
||||
unet=unet,
|
||||
vae=vae,
|
||||
text_encoder=text_encoder,
|
||||
tokenizer=tokenizer,
|
||||
scheduler=scheduler,
|
||||
safety_checker=None,
|
||||
feature_extractor=None,
|
||||
requires_safety_checker=False
|
||||
)
|
||||
```
|
||||
|
||||
### Custom denoising loop
|
||||
|
||||
```python
|
||||
from diffusers import DDIMScheduler, AutoencoderKL, UNet2DConditionModel
|
||||
from transformers import CLIPTextModel, CLIPTokenizer
|
||||
import torch
from PIL import Image
|
||||
|
||||
def custom_generate(
|
||||
prompt: str,
|
||||
num_steps: int = 50,
|
||||
guidance_scale: float = 7.5,
|
||||
height: int = 512,
|
||||
width: int = 512
|
||||
):
|
||||
# Load components
|
||||
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
|
||||
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
|
||||
unet = UNet2DConditionModel.from_pretrained("sd-model", subfolder="unet")
|
||||
vae = AutoencoderKL.from_pretrained("sd-model", subfolder="vae")
|
||||
scheduler = DDIMScheduler.from_pretrained("sd-model", subfolder="scheduler")
|
||||
|
||||
device = "cuda"
|
||||
text_encoder.to(device)
|
||||
unet.to(device)
|
||||
vae.to(device)
|
||||
|
||||
# Encode prompt
|
||||
text_input = tokenizer(
|
||||
prompt,
|
||||
padding="max_length",
|
||||
max_length=77,
|
||||
truncation=True,
|
||||
return_tensors="pt"
|
||||
)
|
||||
text_embeddings = text_encoder(text_input.input_ids.to(device))[0]
|
||||
|
||||
# Unconditional embeddings for classifier-free guidance
|
||||
uncond_input = tokenizer(
|
||||
"",
|
||||
padding="max_length",
|
||||
max_length=77,
|
||||
return_tensors="pt"
|
||||
)
|
||||
uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]
|
||||
|
||||
# Concatenate for batch processing
|
||||
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
|
||||
|
||||
# Initialize latents
|
||||
latents = torch.randn(
|
||||
(1, 4, height // 8, width // 8),
|
||||
device=device
|
||||
)
|
||||
latents = latents * scheduler.init_noise_sigma
|
||||
|
||||
# Denoising loop
|
||||
scheduler.set_timesteps(num_steps)
|
||||
for t in scheduler.timesteps:
|
||||
latent_model_input = torch.cat([latents] * 2)
|
||||
latent_model_input = scheduler.scale_model_input(latent_model_input, t)
|
||||
|
||||
# Predict noise
|
||||
with torch.no_grad():
|
||||
noise_pred = unet(
|
||||
latent_model_input,
|
||||
t,
|
||||
encoder_hidden_states=text_embeddings
|
||||
).sample
|
||||
|
||||
# Classifier-free guidance
|
||||
noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
|
||||
noise_pred = noise_pred_uncond + guidance_scale * (
|
||||
noise_pred_cond - noise_pred_uncond
|
||||
)
|
||||
|
||||
# Update latents
|
||||
latents = scheduler.step(noise_pred, t, latents).prev_sample
|
||||
|
||||
# Decode latents
|
||||
latents = latents / vae.config.scaling_factor
|
||||
with torch.no_grad():
|
||||
image = vae.decode(latents).sample
|
||||
|
||||
# Convert to PIL
|
||||
image = (image / 2 + 0.5).clamp(0, 1)
|
||||
image = image.cpu().permute(0, 2, 3, 1).numpy()
|
||||
image = (image * 255).round().astype("uint8")[0]
|
||||
|
||||
return Image.fromarray(image)
|
||||
```
|
||||
|
||||
## IP-Adapter
|
||||
|
||||
Use image prompts alongside text:
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionPipeline
|
||||
from diffusers.utils import load_image
|
||||
import torch
|
||||
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
|
||||
# Load IP-Adapter
|
||||
pipe.load_ip_adapter(
|
||||
"h94/IP-Adapter",
|
||||
subfolder="models",
|
||||
weight_name="ip-adapter_sd15.bin"
|
||||
)
|
||||
|
||||
# Set IP-Adapter scale
|
||||
pipe.set_ip_adapter_scale(0.6)
|
||||
|
||||
# Load reference image
|
||||
ip_image = load_image("reference_style.jpg")
|
||||
|
||||
# Generate with image + text prompt
|
||||
image = pipe(
|
||||
prompt="A portrait in a garden",
|
||||
ip_adapter_image=ip_image,
|
||||
num_inference_steps=50
|
||||
).images[0]
|
||||
```
|
||||
|
||||
### Multiple IP-Adapter images
|
||||
|
||||
```python
|
||||
# Use multiple reference images
|
||||
pipe.set_ip_adapter_scale([0.5, 0.7])
|
||||
|
||||
images = [
|
||||
load_image("style_reference.jpg"),
|
||||
load_image("composition_reference.jpg")
|
||||
]
|
||||
|
||||
result = pipe(
|
||||
prompt="A landscape painting",
|
||||
ip_adapter_image=images,
|
||||
num_inference_steps=50
|
||||
).images[0]
|
||||
```
|
||||
|
||||
## SDXL Refiner
|
||||
|
||||
Two-stage generation for higher quality:
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
|
||||
import torch
|
||||
|
||||
# Load base model
|
||||
base = StableDiffusionXLPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
torch_dtype=torch.float16,
|
||||
variant="fp16"
|
||||
).to("cuda")
|
||||
|
||||
# Load refiner
|
||||
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-refiner-1.0",
|
||||
torch_dtype=torch.float16,
|
||||
variant="fp16"
|
||||
).to("cuda")
|
||||
|
||||
# Generate with base (partial denoising)
|
||||
image = base(
|
||||
prompt="A majestic eagle soaring over mountains",
|
||||
num_inference_steps=40,
|
||||
denoising_end=0.8,
|
||||
output_type="latent"
|
||||
).images
|
||||
|
||||
# Refine with refiner
|
||||
refined = refiner(
|
||||
prompt="A majestic eagle soaring over mountains",
|
||||
image=image,
|
||||
num_inference_steps=40,
|
||||
denoising_start=0.8
|
||||
).images[0]
|
||||
```
|
||||
|
||||
## T2I-Adapter
|
||||
|
||||
Lightweight conditioning without full ControlNet:
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter
|
||||
import torch
|
||||
|
||||
# Load adapter
|
||||
adapter = T2IAdapter.from_pretrained(
|
||||
"TencentARC/t2i-adapter-canny-sdxl-1.0",
|
||||
torch_dtype=torch.float16
|
||||
)
|
||||
|
||||
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
adapter=adapter,
|
||||
torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
|
||||
# Get canny edges
|
||||
canny_image = get_canny_image(input_image)
|
||||
|
||||
image = pipe(
|
||||
prompt="A colorful anime character",
|
||||
image=canny_image,
|
||||
num_inference_steps=30,
|
||||
adapter_conditioning_scale=0.8
|
||||
).images[0]
|
||||
```
|
||||
|
||||
## Fine-tuning with DreamBooth
|
||||
|
||||
Train on custom subjects:
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionPipeline, DDPMScheduler
|
||||
from diffusers.optimization import get_scheduler
|
||||
import torch
|
||||
from torch.utils.data import Dataset, DataLoader
|
||||
from PIL import Image
|
||||
import os
import numpy as np
|
||||
|
||||
class DreamBoothDataset(Dataset):
|
||||
def __init__(self, instance_images_path, instance_prompt, tokenizer, size=512):
|
||||
self.instance_images_path = instance_images_path
|
||||
self.instance_prompt = instance_prompt
|
||||
self.tokenizer = tokenizer
|
||||
self.size = size
|
||||
|
||||
self.instance_images = [
|
||||
os.path.join(instance_images_path, f)
|
||||
for f in os.listdir(instance_images_path)
|
||||
if f.endswith(('.png', '.jpg', '.jpeg'))
|
||||
]
|
||||
|
||||
def __len__(self):
|
||||
return len(self.instance_images)
|
||||
|
||||
def __getitem__(self, idx):
|
||||
image = Image.open(self.instance_images[idx]).convert("RGB")
|
||||
image = image.resize((self.size, self.size))
|
||||
image = torch.tensor(np.array(image)).permute(2, 0, 1) / 127.5 - 1.0
|
||||
|
||||
tokens = self.tokenizer(
|
||||
self.instance_prompt,
|
||||
padding="max_length",
|
||||
max_length=77,
|
||||
truncation=True,
|
||||
return_tensors="pt"
|
||||
)
|
||||
|
||||
return {"image": image, "input_ids": tokens.input_ids.squeeze()}
|
||||
|
||||
def train_dreambooth(
|
||||
pretrained_model: str,
|
||||
instance_data_dir: str,
|
||||
instance_prompt: str,
|
||||
output_dir: str,
|
||||
learning_rate: float = 5e-6,
|
||||
max_train_steps: int = 800,
|
||||
train_batch_size: int = 1
|
||||
):
|
||||
# Load pipeline
|
||||
pipe = StableDiffusionPipeline.from_pretrained(pretrained_model)
|
||||
|
||||
unet = pipe.unet
|
||||
vae = pipe.vae
|
||||
text_encoder = pipe.text_encoder
|
||||
tokenizer = pipe.tokenizer
|
||||
noise_scheduler = DDPMScheduler.from_pretrained(pretrained_model, subfolder="scheduler")
|
||||
|
||||
# Freeze VAE and text encoder
|
||||
vae.requires_grad_(False)
|
||||
text_encoder.requires_grad_(False)
|
||||
|
||||
# Create dataset
|
||||
dataset = DreamBoothDataset(
|
||||
instance_data_dir, instance_prompt, tokenizer
|
||||
)
|
||||
dataloader = DataLoader(dataset, batch_size=train_batch_size, shuffle=True)
|
||||
|
||||
# Setup optimizer
|
||||
optimizer = torch.optim.AdamW(unet.parameters(), lr=learning_rate)
|
||||
lr_scheduler = get_scheduler(
|
||||
"constant",
|
||||
optimizer=optimizer,
|
||||
num_warmup_steps=0,
|
||||
num_training_steps=max_train_steps
|
||||
)
|
||||
|
||||
# Training loop
|
||||
unet.train()
|
||||
device = "cuda"
|
||||
unet.to(device)
|
||||
vae.to(device)
|
||||
text_encoder.to(device)
|
||||
|
||||
global_step = 0
|
||||
for epoch in range(max_train_steps // len(dataloader) + 1):
|
||||
for batch in dataloader:
|
||||
if global_step >= max_train_steps:
|
||||
break
|
||||
|
||||
# Encode images to latents
|
||||
latents = vae.encode(batch["image"].to(device)).latent_dist.sample()
|
||||
latents = latents * vae.config.scaling_factor
|
||||
|
||||
# Sample noise
|
||||
noise = torch.randn_like(latents)
|
||||
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
|
||||
timesteps = timesteps.to(device)
|
||||
|
||||
# Add noise
|
||||
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
|
||||
|
||||
# Get text embeddings
|
||||
encoder_hidden_states = text_encoder(batch["input_ids"].to(device))[0]
|
||||
|
||||
# Predict noise
|
||||
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
|
||||
|
||||
# Compute loss
|
||||
loss = torch.nn.functional.mse_loss(noise_pred, noise)
|
||||
|
||||
# Backprop
|
||||
loss.backward()
|
||||
optimizer.step()
|
||||
lr_scheduler.step()
|
||||
optimizer.zero_grad()
|
||||
|
||||
global_step += 1
|
||||
|
||||
if global_step % 100 == 0:
|
||||
print(f"Step {global_step}, Loss: {loss.item():.4f}")
|
||||
|
||||
# Save model
|
||||
pipe.unet = unet
|
||||
pipe.save_pretrained(output_dir)
|
||||
```
|
||||
|
||||
## LoRA Training
|
||||
|
||||
Efficient fine-tuning with Low-Rank Adaptation:
|
||||
|
||||
```python
|
||||
from peft import LoraConfig, get_peft_model
|
||||
from diffusers import StableDiffusionPipeline
|
||||
import torch
|
||||
|
||||
def train_lora(
|
||||
base_model: str,
|
||||
train_dataset,
|
||||
output_dir: str,
|
||||
lora_rank: int = 4,
|
||||
learning_rate: float = 1e-4,
|
||||
max_train_steps: int = 1000
|
||||
):
|
||||
pipe = StableDiffusionPipeline.from_pretrained(base_model)
|
||||
unet = pipe.unet
|
||||
|
||||
# Configure LoRA
|
||||
lora_config = LoraConfig(
|
||||
r=lora_rank,
|
||||
lora_alpha=lora_rank,
|
||||
target_modules=["to_q", "to_v", "to_k", "to_out.0"],
|
||||
lora_dropout=0.1
|
||||
)
|
||||
|
||||
# Apply LoRA to UNet
|
||||
unet = get_peft_model(unet, lora_config)
|
||||
unet.print_trainable_parameters() # Shows ~0.1% trainable
|
||||
|
||||
# Train (similar to DreamBooth but only LoRA params)
|
||||
optimizer = torch.optim.AdamW(
|
||||
unet.parameters(),
|
||||
lr=learning_rate
|
||||
)
|
||||
|
||||
# ... training loop ...
|
||||
|
||||
# Save LoRA weights only
|
||||
unet.save_pretrained(output_dir)
|
||||
```
|
||||
|
||||
## Textual Inversion
|
||||
|
||||
Learn new concepts through embeddings:
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionPipeline
|
||||
import torch
|
||||
|
||||
# Load with textual inversion
|
||||
pipe = StableDiffusionPipeline.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
|
||||
# Load learned embedding
|
||||
pipe.load_textual_inversion(
|
||||
"sd-concepts-library/cat-toy",
|
||||
token="<cat-toy>"
|
||||
)
|
||||
|
||||
# Use in prompts
|
||||
image = pipe("A photo of <cat-toy> on a beach").images[0]
|
||||
```
|
||||
|
||||
## Quantization
|
||||
|
||||
Reduce memory with quantization:
|
||||
|
||||
```python
|
||||
from diffusers import BitsAndBytesConfig, StableDiffusionXLPipeline
|
||||
import torch
|
||||
|
||||
# 8-bit quantization
# Note: depending on the diffusers version, quantization may need to be applied
# to individual components (e.g. the UNet) rather than the whole pipeline
|
||||
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
|
||||
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
quantization_config=quantization_config,
|
||||
torch_dtype=torch.float16
|
||||
)
|
||||
```
|
||||
|
||||
### NF4 quantization (4-bit)
|
||||
|
||||
```python
|
||||
quantization_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_quant_type="nf4",
|
||||
bnb_4bit_compute_dtype=torch.float16
|
||||
)
|
||||
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
```
|
||||
|
||||
## Production Deployment
|
||||
|
||||
### FastAPI server
|
||||
|
||||
```python
|
||||
from fastapi import FastAPI, HTTPException
|
||||
from pydantic import BaseModel
|
||||
from diffusers import DiffusionPipeline
|
||||
import torch
|
||||
import base64
|
||||
from io import BytesIO
|
||||
|
||||
app = FastAPI()
|
||||
|
||||
# Load model at startup
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"stable-diffusion-v1-5/stable-diffusion-v1-5",
|
||||
torch_dtype=torch.float16
|
||||
).to("cuda")
|
||||
pipe.enable_model_cpu_offload()
|
||||
|
||||
class GenerationRequest(BaseModel):
|
||||
prompt: str
|
||||
negative_prompt: str = ""
|
||||
num_inference_steps: int = 30
|
||||
guidance_scale: float = 7.5
|
||||
width: int = 512
|
||||
height: int = 512
|
||||
seed: int = None
|
||||
|
||||
class GenerationResponse(BaseModel):
|
||||
image_base64: str
|
||||
seed: int
|
||||
|
||||
@app.post("/generate", response_model=GenerationResponse)
|
||||
async def generate(request: GenerationRequest):
|
||||
try:
|
||||
generator = None
|
||||
seed = request.seed or torch.randint(0, 2**32, (1,)).item()
|
||||
generator = torch.Generator("cuda").manual_seed(seed)
|
||||
|
||||
image = pipe(
|
||||
prompt=request.prompt,
|
||||
negative_prompt=request.negative_prompt,
|
||||
num_inference_steps=request.num_inference_steps,
|
||||
guidance_scale=request.guidance_scale,
|
||||
width=request.width,
|
||||
height=request.height,
|
||||
generator=generator
|
||||
).images[0]
|
||||
|
||||
# Convert to base64
|
||||
buffer = BytesIO()
|
||||
image.save(buffer, format="PNG")
|
||||
image_base64 = base64.b64encode(buffer.getvalue()).decode()
|
||||
|
||||
return GenerationResponse(image_base64=image_base64, seed=seed)
|
||||
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=500, detail=str(e))
|
||||
|
||||
@app.get("/health")
|
||||
async def health():
|
||||
return {"status": "healthy"}
|
||||
```
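A quick smoke test for the endpoint above, assuming the code is saved as `server.py` and run locally:

```bash
# Start the API (the model loads once at startup)
uvicorn server:app --host 0.0.0.0 --port 8000

# Request an image and decode the base64 payload to out.png
curl -s -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A lighthouse at sunset", "num_inference_steps": 20}' \
  | python3 -c "import sys, json, base64; open('out.png', 'wb').write(base64.b64decode(json.load(sys.stdin)['image_base64']))"
```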
|
||||
|
||||
### Docker deployment
|
||||
|
||||
```dockerfile
|
||||
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
|
||||
|
||||
RUN apt-get update && apt-get install -y python3 python3-pip
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
COPY requirements.txt .
|
||||
RUN pip3 install -r requirements.txt
|
||||
|
||||
COPY . .
|
||||
|
||||
# Pre-download model
|
||||
RUN python3 -c "from diffusers import DiffusionPipeline; DiffusionPipeline.from_pretrained('stable-diffusion-v1-5/stable-diffusion-v1-5')"
|
||||
|
||||
EXPOSE 8000
|
||||
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
|
||||
```
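The Dockerfile copies a `requirements.txt` that isn't shown here. An illustrative minimal version for the FastAPI server above (pin versions to whatever you have tested):

```text
torch
diffusers
transformers
accelerate
safetensors
fastapi
uvicorn[standard]
```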
|
||||
|
||||
### Kubernetes deployment
|
||||
|
||||
```yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: stable-diffusion
|
||||
spec:
|
||||
replicas: 2
|
||||
selector:
|
||||
matchLabels:
|
||||
app: stable-diffusion
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: stable-diffusion
|
||||
spec:
|
||||
containers:
|
||||
- name: sd
|
||||
image: your-registry/stable-diffusion:latest
|
||||
ports:
|
||||
- containerPort: 8000
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1
|
||||
memory: "16Gi"
|
||||
requests:
|
||||
nvidia.com/gpu: 1
|
||||
memory: "8Gi"
|
||||
env:
|
||||
- name: TRANSFORMERS_CACHE
|
||||
value: "/cache/huggingface"
|
||||
volumeMounts:
|
||||
- name: model-cache
|
||||
mountPath: /cache
|
||||
volumes:
|
||||
- name: model-cache
|
||||
persistentVolumeClaim:
|
||||
claimName: model-cache-pvc
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: stable-diffusion
|
||||
spec:
|
||||
selector:
|
||||
app: stable-diffusion
|
||||
ports:
|
||||
- port: 80
|
||||
targetPort: 8000
|
||||
type: LoadBalancer
|
||||
```
|
||||
|
||||
## Callback System
|
||||
|
||||
Monitor and modify generation:
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionPipeline
|
||||
from diffusers.callbacks import PipelineCallback
|
||||
import torch
|
||||
|
||||
class ProgressCallback(PipelineCallback):
|
||||
def __init__(self):
|
||||
self.progress = []
|
||||
|
||||
def callback_fn(self, pipe, step_index, timestep, callback_kwargs):
|
||||
self.progress.append({
|
||||
"step": step_index,
|
||||
"timestep": timestep.item()
|
||||
})
|
||||
|
||||
# Optionally modify latents
|
||||
latents = callback_kwargs["latents"]
|
||||
|
||||
return callback_kwargs
|
||||
|
||||
# Use callback
|
||||
callback = ProgressCallback()
|
||||
|
||||
image = pipe(
|
||||
prompt="A sunset",
|
||||
callback_on_step_end=callback.callback_fn,
|
||||
callback_on_step_end_tensor_inputs=["latents"]
|
||||
).images[0]
|
||||
|
||||
print(f"Generation completed in {len(callback.progress)} steps")
|
||||
```
|
||||
|
||||
### Early stopping
|
||||
|
||||
```python
|
||||
def early_stop_callback(pipe, step_index, timestep, callback_kwargs):
|
||||
# Stop after 20 steps
|
||||
if step_index >= 20:
|
||||
pipe._interrupt = True
|
||||
return callback_kwargs
|
||||
|
||||
image = pipe(
|
||||
prompt="A landscape",
|
||||
num_inference_steps=50,
|
||||
callback_on_step_end=early_stop_callback
|
||||
).images[0]
|
||||
```
|
||||
|
||||
## Multi-GPU Inference
|
||||
|
||||
### Balanced device map
|
||||
|
||||
```python
|
||||
from diffusers import StableDiffusionXLPipeline
|
||||
|
||||
pipe = StableDiffusionXLPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0",
|
||||
device_map="auto", # Automatically distribute across GPUs
|
||||
torch_dtype=torch.float16
|
||||
)
|
||||
```
|
||||
|
||||
### Manual distribution
|
||||
|
||||
```python
|
||||
from accelerate import infer_auto_device_map, dispatch_model
|
||||
|
||||
# Create device map
|
||||
device_map = infer_auto_device_map(
|
||||
pipe.unet,
|
||||
max_memory={0: "10GiB", 1: "10GiB"}
|
||||
)
|
||||
|
||||
# Dispatch model
|
||||
pipe.unet = dispatch_model(pipe.unet, device_map=device_map)
|
||||
```
|
||||
|
|
@ -0,0 +1,555 @@
|
|||
# Stable Diffusion Troubleshooting Guide
|
||||
|
||||
## Installation Issues
|
||||
|
||||
### Package conflicts
|
||||
|
||||
**Error**: `ImportError: cannot import name 'cached_download' from 'huggingface_hub'`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Update huggingface_hub
|
||||
pip install --upgrade huggingface_hub
|
||||
|
||||
# Reinstall diffusers
|
||||
pip install --upgrade diffusers
|
||||
```
|
||||
|
||||
### xFormers installation fails
|
||||
|
||||
**Error**: `RuntimeError: CUDA error: no kernel image is available for execution`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check CUDA version
|
||||
nvcc --version
|
||||
|
||||
# Install matching xformers
|
||||
pip install xformers --index-url https://download.pytorch.org/whl/cu121 # For CUDA 12.1
|
||||
|
||||
# Or build from source
|
||||
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
|
||||
```
|
||||
|
||||
### Torch/CUDA mismatch
|
||||
|
||||
**Error**: `RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check versions
|
||||
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
|
||||
|
||||
# Reinstall PyTorch with correct CUDA
|
||||
pip uninstall torch torchvision
|
||||
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
|
||||
```
|
||||
|
||||
## Memory Issues
|
||||
|
||||
### CUDA out of memory
|
||||
|
||||
**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
|
||||
|
||||
**Solutions**:
|
||||
|
||||
```python
|
||||
# Solution 1: Enable CPU offloading
|
||||
pipe.enable_model_cpu_offload()
|
||||
|
||||
# Solution 2: Sequential CPU offload (more aggressive)
|
||||
pipe.enable_sequential_cpu_offload()
|
||||
|
||||
# Solution 3: Attention slicing
|
||||
pipe.enable_attention_slicing()
|
||||
|
||||
# Solution 4: VAE slicing for large images
|
||||
pipe.enable_vae_slicing()
|
||||
|
||||
# Solution 5: Use lower precision
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"model-id",
|
||||
torch_dtype=torch.float16 # or torch.bfloat16
|
||||
)
|
||||
|
||||
# Solution 6: Reduce batch size
|
||||
image = pipe(prompt, num_images_per_prompt=1).images[0]
|
||||
|
||||
# Solution 7: Generate smaller images
|
||||
image = pipe(prompt, height=512, width=512).images[0]
|
||||
|
||||
# Solution 8: Clear cache between generations
|
||||
import gc
|
||||
torch.cuda.empty_cache()
|
||||
gc.collect()
|
||||
```
|
||||
|
||||
### Memory grows over time
|
||||
|
||||
**Problem**: Memory usage increases with each generation
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
import gc
|
||||
import torch
|
||||
|
||||
def generate_with_cleanup(pipe, prompt, **kwargs):
|
||||
try:
|
||||
image = pipe(prompt, **kwargs).images[0]
|
||||
return image
|
||||
finally:
|
||||
# Clear cache after generation
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.empty_cache()
|
||||
gc.collect()
|
||||
```
|
||||
|
||||
### Large model loading fails
|
||||
|
||||
**Error**: `RuntimeError: Unable to load model weights`
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Use low CPU memory mode
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"large-model-id",
|
||||
low_cpu_mem_usage=True,
|
||||
torch_dtype=torch.float16
|
||||
)
|
||||
```
|
||||
|
||||
## Generation Issues
|
||||
|
||||
### Black images
|
||||
|
||||
**Problem**: Output images are completely black
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Solution 1: Disable safety checker
|
||||
pipe.safety_checker = None
|
||||
|
||||
# Solution 2: Check VAE scaling
|
||||
# The issue might be with VAE encoding/decoding
|
||||
latents = latents / pipe.vae.config.scaling_factor # Before decode
|
||||
|
||||
# Solution 3: Ensure proper dtype
|
||||
pipe = pipe.to(dtype=torch.float16)
|
||||
pipe.vae = pipe.vae.to(dtype=torch.float32) # VAE often needs fp32
|
||||
|
||||
# Solution 4: Check guidance scale
|
||||
# Too high can cause issues
|
||||
image = pipe(prompt, guidance_scale=7.5).images[0] # Not 20+
|
||||
```
|
||||
|
||||
### Noise/static images
|
||||
|
||||
**Problem**: Output looks like random noise
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Solution 1: Increase inference steps
|
||||
image = pipe(prompt, num_inference_steps=50).images[0]
|
||||
|
||||
# Solution 2: Check scheduler configuration
|
||||
pipe.scheduler = pipe.scheduler.from_config(pipe.scheduler.config)
|
||||
|
||||
# Solution 3: Verify model was loaded correctly
|
||||
print(pipe.unet) # Should show model architecture
|
||||
```
|
||||
|
||||
### Blurry images
|
||||
|
||||
**Problem**: Output images are low quality or blurry
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Solution 1: Use more steps
|
||||
image = pipe(prompt, num_inference_steps=50).images[0]
|
||||
|
||||
# Solution 2: Use better VAE
|
||||
from diffusers import AutoencoderKL
|
||||
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
|
||||
pipe.vae = vae
|
||||
|
||||
# Solution 3: Use SDXL or refiner
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"stabilityai/stable-diffusion-xl-base-1.0"
|
||||
)
|
||||
|
||||
# Solution 4: Upscale with img2img
|
||||
upscale_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(...)
|
||||
upscaled = upscale_pipe(
|
||||
prompt=prompt,
|
||||
image=image.resize((1024, 1024)),
|
||||
strength=0.3
|
||||
).images[0]
|
||||
```
|
||||
|
||||
### Prompt not being followed
|
||||
|
||||
**Problem**: Generated image doesn't match the prompt
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Solution 1: Increase guidance scale
|
||||
image = pipe(prompt, guidance_scale=10.0).images[0]
|
||||
|
||||
# Solution 2: Use negative prompts
|
||||
image = pipe(
|
||||
prompt="A red car",
|
||||
negative_prompt="blue, green, yellow, wrong color",
|
||||
guidance_scale=7.5
|
||||
).images[0]
|
||||
|
||||
# Solution 3: Use prompt weighting
# Note: the "(word:1.5)" syntax is not parsed by plain diffusers pipelines;
# use a prompt-weighting helper such as Compel (see the sketch after this block)
prompt = "A (red:1.5) car on a street"
|
||||
|
||||
# Solution 4: Use longer, more detailed prompts
|
||||
prompt = """
|
||||
A bright red sports car, ferrari style, parked on a city street,
|
||||
photorealistic, high detail, 8k, professional photography
|
||||
"""
|
||||
```
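Plain diffusers pipelines do not parse `(word:1.5)`-style weights; a helper library such as Compel converts weighted prompts into embeddings the pipeline can use. A minimal sketch, assuming `pip install compel` and an SD 1.5 pipeline loaded as `pipe`:

```python
from compel import Compel

# Build a prompt processor from the pipeline's own tokenizer and text encoder
compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)

# "++" increases emphasis; "(red)1.5" sets an explicit weight in Compel syntax
prompt_embeds = compel_proc("A red++ car on a street")

image = pipe(prompt_embeds=prompt_embeds, num_inference_steps=30).images[0]
```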
|
||||
|
||||
### Distorted faces/hands
|
||||
|
||||
**Problem**: Faces and hands look deformed
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Solution 1: Use negative prompts
|
||||
negative_prompt = """
|
||||
bad hands, bad anatomy, deformed, ugly, blurry,
|
||||
extra fingers, mutated hands, poorly drawn hands,
|
||||
poorly drawn face, mutation, deformed face
|
||||
"""
|
||||
|
||||
# Solution 2: Use face-specific models
|
||||
# ADetailer or similar post-processing
|
||||
|
||||
# Solution 3: Use ControlNet for poses
|
||||
# Load pose estimation and condition generation
|
||||
|
||||
# Solution 4: Inpaint problematic areas
|
||||
mask = create_face_mask(image)
|
||||
fixed = inpaint_pipe(
|
||||
prompt="beautiful detailed face",
|
||||
image=image,
|
||||
mask_image=mask
|
||||
).images[0]
|
||||
```
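`create_face_mask` above is not a library function. A rough, illustrative sketch that builds a white-on-black mask over faces detected with OpenCV's bundled Haar cascade (dedicated tools such as ADetailer handle this more robustly):

```python
import cv2
import numpy as np
from PIL import Image

def create_face_mask(image: Image.Image, padding: int = 20) -> Image.Image:
    """Return a mask where detected face regions are white (inpaint) and the rest black."""
    arr = np.array(image.convert("RGB"))
    gray = cv2.cvtColor(arr, cv2.COLOR_RGB2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    mask = np.zeros(gray.shape, dtype=np.uint8)
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        # Pad the box so the inpainted region blends with its surroundings
        x0, y0 = max(x - padding, 0), max(y - padding, 0)
        x1, y1 = min(x + w + padding, mask.shape[1]), min(y + h + padding, mask.shape[0])
        mask[y0:y1, x0:x1] = 255
    return Image.fromarray(mask)
```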
|
||||
|
||||
## Scheduler Issues
|
||||
|
||||
### Scheduler not compatible
|
||||
|
||||
**Error**: `ValueError: Scheduler ... is not compatible with pipeline`
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
from diffusers import EulerDiscreteScheduler
|
||||
|
||||
# Create scheduler from config
|
||||
pipe.scheduler = EulerDiscreteScheduler.from_config(
|
||||
pipe.scheduler.config
|
||||
)
|
||||
|
||||
# Check compatible schedulers
|
||||
print(pipe.scheduler.compatibles)
|
||||
```
|
||||
|
||||
### Wrong number of steps
|
||||
|
||||
**Problem**: Model generates different quality with same steps
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Reset timesteps explicitly
|
||||
pipe.scheduler.set_timesteps(num_inference_steps)
|
||||
|
||||
# Check scheduler's step count
|
||||
print(len(pipe.scheduler.timesteps))
|
||||
```
|
||||
|
||||
## LoRA Issues
|
||||
|
||||
### LoRA weights not loading
|
||||
|
||||
**Error**: `RuntimeError: Error(s) in loading state_dict for UNet2DConditionModel`
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Check weight file format
|
||||
# Should be .safetensors or .bin
|
||||
|
||||
# Load with correct key prefix
|
||||
pipe.load_lora_weights(
|
||||
"path/to/lora",
|
||||
weight_name="lora.safetensors"
|
||||
)
|
||||
|
||||
# Try loading into specific component
|
||||
pipe.unet.load_attn_procs("path/to/lora")
|
||||
```
|
||||
|
||||
### LoRA not affecting output
|
||||
|
||||
**Problem**: Generated images look the same with/without LoRA
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Fuse LoRA weights
|
||||
pipe.fuse_lora(lora_scale=1.0)
|
||||
|
||||
# Or set scale explicitly
|
||||
pipe.set_adapters(["lora_name"], adapter_weights=[1.0])
|
||||
|
||||
# Verify LoRA is loaded
|
||||
print(list(pipe.unet.attn_processors.keys()))
|
||||
```
|
||||
|
||||
### Multiple LoRAs conflict
|
||||
|
||||
**Problem**: Multiple LoRAs produce artifacts
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Load with different adapter names
|
||||
pipe.load_lora_weights("lora1", adapter_name="style")
|
||||
pipe.load_lora_weights("lora2", adapter_name="subject")
|
||||
|
||||
# Balance weights
|
||||
pipe.set_adapters(
|
||||
["style", "subject"],
|
||||
adapter_weights=[0.5, 0.5] # Lower weights
|
||||
)
|
||||
|
||||
# Or use LoRA merge before loading
|
||||
# Merge LoRAs offline with appropriate ratios
|
||||
```
|
||||
|
||||
## ControlNet Issues
|
||||
|
||||
### ControlNet not conditioning
|
||||
|
||||
**Problem**: ControlNet has no effect on output
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Check control image format
|
||||
# Should be RGB, matching generation size
|
||||
control_image = control_image.resize((512, 512))
|
||||
|
||||
# Increase conditioning scale
|
||||
image = pipe(
|
||||
prompt=prompt,
|
||||
image=control_image,
|
||||
controlnet_conditioning_scale=1.0, # Try 0.5-1.5
|
||||
num_inference_steps=30
|
||||
).images[0]
|
||||
|
||||
# Verify ControlNet is loaded
|
||||
print(pipe.controlnet)
|
||||
```
|
||||
|
||||
### Control image preprocessing
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
from controlnet_aux import CannyDetector
|
||||
|
||||
# Proper preprocessing
|
||||
canny = CannyDetector()
|
||||
control_image = canny(input_image)
|
||||
|
||||
# Ensure correct format
|
||||
control_image = control_image.convert("RGB")
|
||||
control_image = control_image.resize((512, 512))
|
||||
```
|
||||
|
||||
## Hub/Download Issues
|
||||
|
||||
### Model download fails
|
||||
|
||||
**Error**: `requests.exceptions.ConnectionError`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Set longer timeout
|
||||
export HF_HUB_DOWNLOAD_TIMEOUT=600
|
||||
|
||||
# Use mirror if available
|
||||
export HF_ENDPOINT=https://hf-mirror.com
|
||||
|
||||
# Or download manually
|
||||
huggingface-cli download stable-diffusion-v1-5/stable-diffusion-v1-5
|
||||
```
|
||||
|
||||
### Cache issues
|
||||
|
||||
**Error**: `OSError: Can't load model from cache`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Clear cache
|
||||
rm -rf ~/.cache/huggingface/hub
|
||||
|
||||
# Or set different cache location
|
||||
export HF_HOME=/path/to/cache
|
||||
|
||||
# Force re-download
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"model-id",
|
||||
force_download=True
|
||||
)
|
||||
```
|
||||
|
||||
### Access denied for gated models
|
||||
|
||||
**Error**: `401 Client Error: Unauthorized`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Login to Hugging Face
|
||||
huggingface-cli login
|
||||
|
||||
# Or use token
|
||||
pipe = DiffusionPipeline.from_pretrained(
|
||||
"model-id",
|
||||
token="hf_xxxxx"
|
||||
)
|
||||
|
||||
# Accept model license on Hub website first
|
||||
```
|
||||
|
||||
## Performance Issues
|
||||
|
||||
### Slow generation
|
||||
|
||||
**Problem**: Generation takes too long
|
||||
|
||||
**Solutions**:
|
||||
```python
|
||||
# Solution 1: Use faster scheduler
|
||||
from diffusers import DPMSolverMultistepScheduler
|
||||
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
|
||||
pipe.scheduler.config
|
||||
)
|
||||
|
||||
# Solution 2: Reduce steps
|
||||
image = pipe(prompt, num_inference_steps=20).images[0]
|
||||
|
||||
# Solution 3: Use LCM
|
||||
from diffusers import LCMScheduler
|
||||
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
|
||||
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
|
||||
image = pipe(prompt, num_inference_steps=4, guidance_scale=1.0).images[0]
|
||||
|
||||
# Solution 4: Enable xFormers
|
||||
pipe.enable_xformers_memory_efficient_attention()
|
||||
|
||||
# Solution 5: Compile model
|
||||
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
|
||||
```
|
||||
|
||||
### First generation is slow
|
||||
|
||||
**Problem**: First image takes much longer
|
||||
|
||||
**Fix**:
|
||||
```python
|
||||
# Warm up the model
|
||||
_ = pipe("warmup", num_inference_steps=1)
|
||||
|
||||
# Then run actual generation
|
||||
image = pipe(prompt, num_inference_steps=50).images[0]
|
||||
|
||||
# Compile for faster subsequent runs
|
||||
pipe.unet = torch.compile(pipe.unet)
|
||||
```
|
||||
|
||||
## Debugging Tips
|
||||
|
||||
### Enable debug logging
|
||||
|
||||
```python
|
||||
import logging
|
||||
logging.basicConfig(level=logging.DEBUG)
|
||||
|
||||
# Or for specific modules
|
||||
logging.getLogger("diffusers").setLevel(logging.DEBUG)
|
||||
logging.getLogger("transformers").setLevel(logging.DEBUG)
|
||||
```
|
||||
|
||||
### Check model components
|
||||
|
||||
```python
|
||||
# Print pipeline components
|
||||
print(pipe.components)
|
||||
|
||||
# Check model config
|
||||
print(pipe.unet.config)
|
||||
print(pipe.vae.config)
|
||||
print(pipe.scheduler.config)
|
||||
|
||||
# Verify device placement
|
||||
print(pipe.device)
|
||||
for name, module in pipe.components.items():
|
||||
if hasattr(module, 'device'):
|
||||
print(f"{name}: {module.device}")
|
||||
```
|
||||
|
||||
### Validate inputs
|
||||
|
||||
```python
|
||||
# Check image dimensions
|
||||
print(f"Height: {height}, Width: {width}")
|
||||
assert height % 8 == 0, "Height must be divisible by 8"
|
||||
assert width % 8 == 0, "Width must be divisible by 8"
|
||||
|
||||
# Check prompt tokenization
|
||||
tokens = pipe.tokenizer(prompt, return_tensors="pt")
|
||||
print(f"Token count: {tokens.input_ids.shape[1]}") # Max 77 for SD
|
||||
```
|
||||
|
||||
### Save intermediate results
|
||||
|
||||
```python
|
||||
def save_latents_callback(pipe, step_index, timestep, callback_kwargs):
|
||||
latents = callback_kwargs["latents"]
|
||||
|
||||
# Decode and save intermediate
|
||||
with torch.no_grad():
|
||||
image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
|
||||
image = (image / 2 + 0.5).clamp(0, 1)
|
||||
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
|
||||
Image.fromarray((image * 255).astype("uint8")).save(f"step_{step_index}.png")
|
||||
|
||||
return callback_kwargs
|
||||
|
||||
image = pipe(
|
||||
prompt,
|
||||
callback_on_step_end=save_latents_callback,
|
||||
callback_on_step_end_tensor_inputs=["latents"]
|
||||
).images[0]
|
||||
```
|
||||
|
||||
## Getting Help
|
||||
|
||||
1. **Documentation**: https://huggingface.co/docs/diffusers
|
||||
2. **GitHub Issues**: https://github.com/huggingface/diffusers/issues
|
||||
3. **Discord**: https://discord.gg/diffusers
|
||||
4. **Forum**: https://discuss.huggingface.co
|
||||
|
||||
### Reporting Issues
|
||||
|
||||
Include (the commands after this list collect most of this automatically):
|
||||
- Diffusers version: `pip show diffusers`
|
||||
- PyTorch version: `python -c "import torch; print(torch.__version__)"`
|
||||
- CUDA version: `nvcc --version`
|
||||
- GPU model: `nvidia-smi`
|
||||
- Full error traceback
|
||||
- Minimal reproducible code
|
||||
- Model name/ID used
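Most of the environment details above can be gathered in one step; diffusers ships a small CLI helper for bug reports (assuming diffusers is installed in the environment in question):

```bash
# Prints diffusers, PyTorch, and platform versions in a copy-pasteable block
diffusers-cli env

# GPU model and driver details
nvidia-smi
```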
|
||||
320
optional-skills/mlops/whisper/SKILL.md
Normal file
320
optional-skills/mlops/whisper/SKILL.md
Normal file
|
|
@ -0,0 +1,320 @@
|
|||
---
|
||||
name: whisper
|
||||
description: OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.
|
||||
version: 1.0.0
|
||||
author: Orchestra Research
|
||||
license: MIT
|
||||
dependencies: [openai-whisper, transformers, torch]
|
||||
metadata:
|
||||
hermes:
|
||||
tags: [Whisper, Speech Recognition, ASR, Multimodal, Multilingual, OpenAI, Speech-To-Text, Transcription, Translation, Audio Processing]
|
||||
|
||||
---
|
||||
|
||||
# Whisper - Robust Speech Recognition
|
||||
|
||||
OpenAI's multilingual speech recognition model.
|
||||
|
||||
## When to use Whisper
|
||||
|
||||
**Use when:**
|
||||
- Speech-to-text transcription (99 languages)
|
||||
- Podcast/video transcription
|
||||
- Meeting notes automation
|
||||
- Translation to English
|
||||
- Noisy audio transcription
|
||||
- Multilingual audio processing
|
||||
|
||||
**Metrics**:
|
||||
- **72,900+ GitHub stars**
|
||||
- 99 languages supported
|
||||
- Trained on 680,000 hours of audio
|
||||
- MIT License
|
||||
|
||||
**Use alternatives instead**:
|
||||
- **AssemblyAI**: Managed API, speaker diarization
|
||||
- **Deepgram**: Real-time streaming ASR
|
||||
- **Google Speech-to-Text**: Cloud-based
|
||||
|
||||
## Quick start
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
# Requires Python 3.8-3.11
|
||||
pip install -U openai-whisper
|
||||
|
||||
# Requires ffmpeg
|
||||
# macOS: brew install ffmpeg
|
||||
# Ubuntu: sudo apt install ffmpeg
|
||||
# Windows: choco install ffmpeg
|
||||
```
|
||||
|
||||
### Basic transcription
|
||||
|
||||
```python
|
||||
import whisper
|
||||
|
||||
# Load model
|
||||
model = whisper.load_model("base")
|
||||
|
||||
# Transcribe
|
||||
result = model.transcribe("audio.mp3")
|
||||
|
||||
# Print text
|
||||
print(result["text"])
|
||||
|
||||
# Access segments
|
||||
for segment in result["segments"]:
|
||||
print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
|
||||
```
|
||||
|
||||
## Model sizes
|
||||
|
||||
```python
|
||||
# Available models
|
||||
models = ["tiny", "base", "small", "medium", "large", "turbo"]
|
||||
|
||||
# Load specific model
|
||||
model = whisper.load_model("turbo") # Fastest, good quality
|
||||
```
|
||||
|
||||
| Model | Parameters | English-only | Multilingual | Speed | VRAM |
|
||||
|-------|------------|--------------|--------------|-------|------|
|
||||
| tiny | 39M | ✓ | ✓ | ~32x | ~1 GB |
|
||||
| base | 74M | ✓ | ✓ | ~16x | ~1 GB |
|
||||
| small | 244M | ✓ | ✓ | ~6x | ~2 GB |
|
||||
| medium | 769M | ✓ | ✓ | ~2x | ~5 GB |
|
||||
| large | 1550M | ✗ | ✓ | 1x | ~10 GB |
|
||||
| turbo | 809M | ✗ | ✓ | ~8x | ~6 GB |
|
||||
|
||||
**Recommendation**: Use `turbo` for best speed/quality, `base` for prototyping
|
||||
|
||||
## Transcription options
|
||||
|
||||
### Language specification
|
||||
|
||||
```python
|
||||
# Auto-detect language
|
||||
result = model.transcribe("audio.mp3")
|
||||
|
||||
# Specify language (faster)
|
||||
result = model.transcribe("audio.mp3", language="en")
|
||||
|
||||
# Supported: en, es, fr, de, it, pt, ru, ja, ko, zh, and 89 more
|
||||
```
|
||||
|
||||
### Task selection
|
||||
|
||||
```python
|
||||
# Transcription (default)
|
||||
result = model.transcribe("audio.mp3", task="transcribe")
|
||||
|
||||
# Translation to English
|
||||
result = model.transcribe("spanish.mp3", task="translate")
|
||||
# Input: Spanish audio → Output: English text
|
||||
```
|
||||
|
||||
### Initial prompt
|
||||
|
||||
```python
|
||||
# Improve accuracy with context
|
||||
result = model.transcribe(
|
||||
"audio.mp3",
|
||||
initial_prompt="This is a technical podcast about machine learning and AI."
|
||||
)
|
||||
|
||||
# Helps with:
|
||||
# - Technical terms
|
||||
# - Proper nouns
|
||||
# - Domain-specific vocabulary
|
||||
```
|
||||
|
||||
### Timestamps
|
||||
|
||||
```python
|
||||
# Word-level timestamps
|
||||
result = model.transcribe("audio.mp3", word_timestamps=True)
|
||||
|
||||
for segment in result["segments"]:
|
||||
for word in segment["words"]:
|
||||
print(f"{word['word']} ({word['start']:.2f}s - {word['end']:.2f}s)")
|
||||
```
|
||||
|
||||
### Temperature fallback
|
||||
|
||||
```python
|
||||
# Retry with different temperatures if confidence low
|
||||
result = model.transcribe(
|
||||
"audio.mp3",
|
||||
temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
|
||||
)
|
||||
```
|
||||
|
||||
## Command line usage
|
||||
|
||||
```bash
|
||||
# Basic transcription
|
||||
whisper audio.mp3
|
||||
|
||||
# Specify model
|
||||
whisper audio.mp3 --model turbo
|
||||
|
||||
# Output formats
|
||||
whisper audio.mp3 --output_format txt # Plain text
|
||||
whisper audio.mp3 --output_format srt # Subtitles
|
||||
whisper audio.mp3 --output_format vtt # WebVTT
|
||||
whisper audio.mp3 --output_format json # JSON with timestamps
|
||||
|
||||
# Language
|
||||
whisper audio.mp3 --language Spanish
|
||||
|
||||
# Translation
|
||||
whisper spanish.mp3 --task translate
|
||||
```
|
||||
|
||||
## Batch processing
|
||||
|
||||
```python
|
||||
import os
|
||||
|
||||
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]
|
||||
|
||||
for audio_file in audio_files:
|
||||
print(f"Transcribing {audio_file}...")
|
||||
result = model.transcribe(audio_file)
|
||||
|
||||
# Save to file
|
||||
output_file = audio_file.replace(".mp3", ".txt")
|
||||
with open(output_file, "w") as f:
|
||||
f.write(result["text"])
|
||||
```
|
||||
|
||||
## Real-time transcription
|
||||
|
||||
```python
|
||||
# For streaming audio, use faster-whisper
|
||||
# pip install faster-whisper
|
||||
|
||||
from faster_whisper import WhisperModel
|
||||
|
||||
model = WhisperModel("base", device="cuda", compute_type="float16")
|
||||
|
||||
# Transcribe with streaming
|
||||
segments, info = model.transcribe("audio.mp3", beam_size=5)
|
||||
|
||||
for segment in segments:
|
||||
print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
|
||||
```
|
||||
|
||||
## GPU acceleration
|
||||
|
||||
```python
|
||||
import whisper
|
||||
|
||||
# Automatically uses GPU if available
|
||||
model = whisper.load_model("turbo")
|
||||
|
||||
# Force CPU
|
||||
model = whisper.load_model("turbo", device="cpu")
|
||||
|
||||
# Force GPU
|
||||
model = whisper.load_model("turbo", device="cuda")
|
||||
|
||||
# 10-20× faster on GPU
|
||||
```
|
||||
|
||||
## Integration with other tools
|
||||
|
||||
### Subtitle generation
|
||||
|
||||
```bash
|
||||
# Generate SRT subtitles
|
||||
whisper video.mp4 --output_format srt --language English
|
||||
|
||||
# Output: video.srt
|
||||
```
|
||||
|
||||
### With LangChain
|
||||
|
||||
```python
|
||||
from langchain.document_loaders import WhisperTranscriptionLoader
|
||||
|
||||
loader = WhisperTranscriptionLoader(file_path="audio.mp3")
|
||||
docs = loader.load()
|
||||
|
||||
# Use transcription in RAG
|
||||
from langchain_chroma import Chroma
|
||||
from langchain_openai import OpenAIEmbeddings
|
||||
|
||||
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
|
||||
```
|
||||
|
||||
### Extract audio from video
|
||||
|
||||
```bash
|
||||
# Use ffmpeg to extract audio
|
||||
ffmpeg -i video.mp4 -vn -acodec pcm_s16le audio.wav
|
||||
|
||||
# Then transcribe
|
||||
whisper audio.wav
|
||||
```
|
||||
|
||||
## Best practices
|
||||
|
||||
1. **Use turbo model** - Best speed/quality for English
|
||||
2. **Specify language** - Faster than auto-detect
|
||||
3. **Add initial prompt** - Improves technical terms
|
||||
4. **Use GPU** - 10-20× faster
|
||||
5. **Batch process** - More efficient
|
||||
6. **Convert to WAV** - Better compatibility
|
||||
7. **Split long audio** - <30 min chunks (see the chunking sketch after this list)
|
||||
8. **Check language support** - Quality varies by language
|
||||
9. **Use faster-whisper** - 4× faster than openai-whisper
|
||||
10. **Monitor VRAM** - Scale model size to hardware
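For item 7, a minimal chunking sketch using pydub (`pip install pydub`, which also relies on ffmpeg); the 25-minute chunk length is an arbitrary choice under the ~30-minute guideline, and chunk transcripts are simply concatenated:

```python
from pydub import AudioSegment
import whisper

model = whisper.load_model("turbo")
audio = AudioSegment.from_file("long_recording.mp3")

chunk_ms = 25 * 60 * 1000  # 25-minute chunks
texts = []
for i, start in enumerate(range(0, len(audio), chunk_ms)):
    chunk_path = f"chunk_{i}.mp3"
    audio[start:start + chunk_ms].export(chunk_path, format="mp3")
    texts.append(model.transcribe(chunk_path)["text"].strip())

print(" ".join(texts))
```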
|
||||
|
||||
## Performance
|
||||
|
||||
| Model | Real-time factor (CPU) | Real-time factor (GPU) |
|
||||
|-------|------------------------|------------------------|
|
||||
| tiny | ~0.32 | ~0.01 |
|
||||
| base | ~0.16 | ~0.01 |
|
||||
| turbo | ~0.08 | ~0.01 |
|
||||
| large | ~1.0 | ~0.05 |
|
||||
|
||||
*Real-time factor: 0.1 = 10× faster than real-time*
|
||||
|
||||
## Language support
|
||||
|
||||
Top-supported languages:
|
||||
- English (en)
|
||||
- Spanish (es)
|
||||
- French (fr)
|
||||
- German (de)
|
||||
- Italian (it)
|
||||
- Portuguese (pt)
|
||||
- Russian (ru)
|
||||
- Japanese (ja)
|
||||
- Korean (ko)
|
||||
- Chinese (zh)
|
||||
|
||||
Full list: 99 languages total
|
||||
|
||||
## Limitations
|
||||
|
||||
1. **Hallucinations** - May repeat or invent text
|
||||
2. **Long-form accuracy** - Degrades on >30 min audio
|
||||
3. **Speaker identification** - No diarization
|
||||
4. **Accents** - Quality varies
|
||||
5. **Background noise** - Can affect accuracy
|
||||
6. **Real-time latency** - Not suitable for live captioning
|
||||
|
||||
## Resources
|
||||
|
||||
- **GitHub**: https://github.com/openai/whisper ⭐ 72,900+
|
||||
- **Paper**: https://arxiv.org/abs/2212.04356
|
||||
- **Model Card**: https://github.com/openai/whisper/blob/main/model-card.md
|
||||
- **Colab**: Available in repo
|
||||
- **License**: MIT
|
||||
|
||||
|
||||
189
optional-skills/mlops/whisper/references/languages.md
Normal file
189
optional-skills/mlops/whisper/references/languages.md
Normal file
|
|
@ -0,0 +1,189 @@
|
|||
# Whisper Language Support Guide
|
||||
|
||||
Complete guide to Whisper's multilingual capabilities.
|
||||
|
||||
## Supported languages (99 total)
|
||||
|
||||
### Top-tier support (WER < 10%)
|
||||
|
||||
- English (en)
|
||||
- Spanish (es)
|
||||
- French (fr)
|
||||
- German (de)
|
||||
- Italian (it)
|
||||
- Portuguese (pt)
|
||||
- Dutch (nl)
|
||||
- Polish (pl)
|
||||
- Russian (ru)
|
||||
- Japanese (ja)
|
||||
- Korean (ko)
|
||||
- Chinese (zh)
|
||||
|
||||
### Good support (WER 10-20%)
|
||||
|
||||
- Arabic (ar)
|
||||
- Turkish (tr)
|
||||
- Vietnamese (vi)
|
||||
- Swedish (sv)
|
||||
- Finnish (fi)
|
||||
- Czech (cs)
|
||||
- Romanian (ro)
|
||||
- Hungarian (hu)
|
||||
- Danish (da)
|
||||
- Norwegian (no)
|
||||
- Thai (th)
|
||||
- Hebrew (he)
|
||||
- Greek (el)
|
||||
- Indonesian (id)
|
||||
- Malay (ms)
|
||||
|
||||
### Full list (99 languages)
|
||||
|
||||
Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Burmese, Cantonese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Moldavian, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Pashto, Persian, Polish, Portuguese, Punjabi, Pushto, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yiddish, Yoruba
|
||||
|
||||
## Usage examples
|
||||
|
||||
### Auto-detect language
|
||||
|
||||
```python
|
||||
import whisper
|
||||
|
||||
model = whisper.load_model("turbo")
|
||||
|
||||
# Auto-detect language
|
||||
result = model.transcribe("audio.mp3")
|
||||
|
||||
print(f"Detected language: {result['language']}")
|
||||
print(f"Text: {result['text']}")
|
||||
```
|
||||
|
||||
### Specify language (faster)
|
||||
|
||||
```python
|
||||
# Specify language for faster transcription
|
||||
result = model.transcribe("audio.mp3", language="es") # Spanish
|
||||
result = model.transcribe("audio.mp3", language="fr") # French
|
||||
result = model.transcribe("audio.mp3", language="ja") # Japanese
|
||||
```
|
||||
|
||||
### Translation to English
|
||||
|
||||
```python
|
||||
# Translate any language to English
|
||||
result = model.transcribe(
|
||||
"spanish_audio.mp3",
|
||||
task="translate" # Translates to English
|
||||
)
|
||||
|
||||
print(f"Original language: {result['language']}")
|
||||
print(f"English translation: {result['text']}")
|
||||
```
|
||||
|
||||
## Language-specific tips
|
||||
|
||||
### Chinese
|
||||
|
||||
```python
|
||||
# Chinese works well with larger models
|
||||
model = whisper.load_model("large")
|
||||
|
||||
result = model.transcribe(
|
||||
"chinese_audio.mp3",
|
||||
language="zh",
|
||||
initial_prompt="这是一段关于技术的讨论" # Context helps
|
||||
)
|
||||
```
|
||||
|
||||
### Japanese
|
||||
|
||||
```python
|
||||
# Japanese benefits from initial prompt
|
||||
result = model.transcribe(
|
||||
"japanese_audio.mp3",
|
||||
language="ja",
|
||||
initial_prompt="これは技術的な会議の録音です"
|
||||
)
|
||||
```
|
||||
|
||||
### Arabic
|
||||
|
||||
```python
|
||||
# Arabic: Use large model for best results
|
||||
model = whisper.load_model("large")
|
||||
|
||||
result = model.transcribe(
|
||||
"arabic_audio.mp3",
|
||||
language="ar"
|
||||
)
|
||||
```
|
||||
|
||||
## Model size recommendations
|
||||
|
||||
| Language Tier | Recommended Model | WER |
|
||||
|---------------|-------------------|-----|
|
||||
| Top-tier (en, es, fr, de) | base/turbo | < 10% |
|
||||
| Good (ar, tr, vi) | medium/large | 10-20% |
|
||||
| Lower-resource | large | 20-30% |
|
||||
|
||||
## Performance by language
|
||||
|
||||
### English
|
||||
|
||||
- **tiny**: WER ~15%
|
||||
- **base**: WER ~8%
|
||||
- **small**: WER ~5%
|
||||
- **medium**: WER ~4%
|
||||
- **large**: WER ~3%
|
||||
- **turbo**: WER ~3.5%
|
||||
|
||||
### Spanish
|
||||
|
||||
- **tiny**: WER ~20%
|
||||
- **base**: WER ~12%
|
||||
- **medium**: WER ~6%
|
||||
- **large**: WER ~4%
|
||||
|
||||
### Chinese
|
||||
|
||||
- **small**: WER ~15%
|
||||
- **medium**: WER ~8%
|
||||
- **large**: WER ~5%
|
||||
|
||||
## Best practices
|
||||
|
||||
1. **Use English-only models** - Better for small models (tiny/base); see the example after this list
|
||||
2. **Specify language** - Faster than auto-detect
|
||||
3. **Add initial prompt** - Improves accuracy for technical terms
|
||||
4. **Use larger models** - For low-resource languages
|
||||
5. **Test on sample** - Quality varies by accent/dialect
|
||||
6. **Consider audio quality** - Clear audio = better results
|
||||
7. **Check language codes** - Use ISO 639-1 codes (2 letters)
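For item 1, the English-only checkpoints are selected with a `.en` suffix:

```python
import whisper

# English-only variants exist for tiny, base, small, and medium
model = whisper.load_model("base.en")
result = model.transcribe("english_audio.mp3")
print(result["text"])
```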
|
||||
|
||||
## Language detection
|
||||
|
||||
```python
|
||||
# Detect language only (no transcription)
|
||||
import whisper
|
||||
|
||||
model = whisper.load_model("base")
|
||||
|
||||
# Load audio
|
||||
audio = whisper.load_audio("audio.mp3")
|
||||
audio = whisper.pad_or_trim(audio)
|
||||
|
||||
# Make log-Mel spectrogram
|
||||
mel = whisper.log_mel_spectrogram(audio).to(model.device)
|
||||
|
||||
# Detect language
|
||||
_, probs = model.detect_language(mel)
|
||||
detected_language = max(probs, key=probs.get)
|
||||
|
||||
print(f"Detected language: {detected_language}")
|
||||
print(f"Confidence: {probs[detected_language]:.2%}")
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
- **Paper**: https://arxiv.org/abs/2212.04356
|
||||
- **GitHub**: https://github.com/openai/whisper
|
||||
- **Model Card**: https://github.com/openai/whisper/blob/main/model-card.md
|
||||