skills: move 7 niche mlops/mcp skills to optional (#12474)

Built-in → optional-skills/:
  mlops/training/peft         → optional-skills/mlops/peft
  mlops/training/pytorch-fsdp → optional-skills/mlops/pytorch-fsdp
  mlops/models/clip           → optional-skills/mlops/clip
  mlops/models/stable-diffusion → optional-skills/mlops/stable-diffusion
  mlops/models/whisper        → optional-skills/mlops/whisper
  mlops/cloud/modal           → optional-skills/mlops/modal
  mcp/mcporter                → optional-skills/mcp/mcporter

Built-in mlops training kept: axolotl, trl-fine-tuning, unsloth.
Built-in mlops models kept: audiocraft, segment-anything.
Built-in mlops evaluation/research/huggingface-hub/inference all kept.
native-mcp stays built-in (documents the native MCP tool); mcporter was a
redundant alternative CLI.

Also: removed now-empty skills/mlops/cloud/ dir, refreshed
skills/mlops/models/DESCRIPTION.md and skills/mcp/DESCRIPTION.md to match
what's left, and synchronized both catalog pages (skills-catalog.md,
optional-skills-catalog.md).
Teknium 2026-04-19 05:14:17 -07:00 committed by GitHub
parent 957ca79e8e
commit 66ee081dc1
22 changed files with 10 additions and 20 deletions


@ -0,0 +1,122 @@
---
name: mcporter
description: Use the mcporter CLI to list, configure, auth, and call MCP servers/tools directly (HTTP or stdio), including ad-hoc servers, config edits, and CLI/type generation.
version: 1.0.0
author: community
license: MIT
metadata:
  hermes:
    tags: [MCP, Tools, API, Integrations, Interop]
    homepage: https://mcporter.dev
prerequisites:
  commands: [npx]
---
# mcporter
Use `mcporter` to discover, call, and manage [MCP (Model Context Protocol)](https://modelcontextprotocol.io/) servers and tools directly from the terminal.
## Prerequisites
Requires Node.js:
```bash
# No install needed (runs via npx)
npx mcporter list
# Or install globally
npm install -g mcporter
```
## Quick Start
```bash
# List MCP servers already configured on this machine
mcporter list
# List tools for a specific server with schema details
mcporter list <server> --schema
# Call a tool
mcporter call <server.tool> key=value
```
## Discovering MCP Servers
mcporter auto-discovers servers configured by other MCP clients (Claude Desktop, Cursor, etc.) on the machine. To find new servers to use, browse registries like [mcpfinder.dev](https://mcpfinder.dev) or [mcp.so](https://mcp.so), then connect ad-hoc:
```bash
# Connect to any MCP server by URL (no config needed)
mcporter list --http-url https://some-mcp-server.com --name my_server
# Or run a stdio server on the fly
mcporter list --stdio "npx -y @modelcontextprotocol/server-filesystem" --name fs
```
## Calling Tools
```bash
# Key=value syntax
mcporter call linear.list_issues team=ENG limit:5
# Function syntax
mcporter call "linear.create_issue(title: \"Bug fix needed\")"
# Ad-hoc HTTP server (no config needed)
mcporter call https://api.example.com/mcp.fetch url=https://example.com
# Ad-hoc stdio server
mcporter call --stdio "bun run ./server.ts" scrape url=https://example.com
# JSON payload
mcporter call <server.tool> --args '{"limit": 5}'
# Machine-readable output (recommended for Hermes)
mcporter call <server.tool> key=value --output json
```
## Auth and Config
```bash
# OAuth login for a server
mcporter auth <server | url> [--reset]
# Manage config
mcporter config list
mcporter config get <key>
mcporter config add <server>
mcporter config remove <server>
mcporter config import <path>
```
Config file location: `./config/mcporter.json` (override with `--config`).
## Daemon
For persistent server connections:
```bash
mcporter daemon start
mcporter daemon status
mcporter daemon stop
mcporter daemon restart
```
## Code Generation
```bash
# Generate a CLI wrapper for an MCP server
mcporter generate-cli --server <name>
mcporter generate-cli --command <url>
# Inspect a generated CLI
mcporter inspect-cli <path> [--json]
# Generate TypeScript types/client
mcporter emit-ts <server> --mode client
mcporter emit-ts <server> --mode types
```
## Notes
- Use `--output json` for structured output that's easier to parse (see the example below)
- Ad-hoc servers (HTTP URL or `--stdio` command) work without any config — useful for one-off calls
- OAuth auth may require interactive browser flow — use `terminal(command="mcporter auth <server>", pty=true)` if needed
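
For scripting, the call syntax and JSON output combine well; a minimal sketch using only flags documented above, assuming `jq` is available for parsing (server and tool names are placeholders):

```bash
# Call a tool and parse the structured result in a script
npx mcporter call <server.tool> key=value --output json | jq '.'
```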


@ -0,0 +1,256 @@
---
name: clip
description: OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [transformers, torch, pillow]
metadata:
  hermes:
    tags: [Multimodal, CLIP, Vision-Language, Zero-Shot, Image Classification, OpenAI, Image Search, Cross-Modal Retrieval, Content Moderation]
---
# CLIP - Contrastive Language-Image Pre-Training
OpenAI's model that understands images through natural-language supervision.
## When to use CLIP
**Use when:**
- Zero-shot image classification (no training data needed)
- Image-text similarity/matching
- Semantic image search
- Content moderation (detect NSFW, violence)
- Visual question answering
- Cross-modal retrieval (image→text, text→image)
**Metrics**:
- **25,300+ GitHub stars**
- Trained on 400M image-text pairs
- Matches ResNet-50 on ImageNet (zero-shot)
- MIT License
**Use alternatives instead**:
- **BLIP-2**: Better captioning
- **LLaVA**: Vision-language chat
- **Segment Anything**: Image segmentation
## Quick start
### Installation
```bash
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
```
### Zero-shot classification
```python
import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
labels = ["a dog", "a cat", "a bird", "a car"]
text = clip.tokenize(labels).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```
## Available models
```python
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]
model, preprocess = clip.load("ViT-B/32")
```
| Model | Parameters | Speed | Quality |
|-------|------------|-------|---------|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |
## Image-text similarity
```python
# Compute embeddings
image_features = model.encode_image(image)
text_features = model.encode_text(text)
# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
# Cosine similarity
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
```
## Semantic image search
```python
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []

for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
    embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)

image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)
for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
```
## Content moderation
```python
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content"
]
text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()
print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
```
## Batch processing
```python
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # (10, 3)
```
## Integration with vector databases
```python
# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path]
    )

# Query with text
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
results = collection.query(
    query_embeddings=[text_embedding[0].cpu().numpy().tolist()],
    n_results=5
)
```
## Best practices
1. **Use ViT-B/32 for most cases** - Good balance
2. **Normalize embeddings** - Required for cosine similarity
3. **Batch processing** - More efficient
4. **Cache embeddings** - Expensive to recompute (see the caching sketch below)
5. **Use descriptive labels** - Better zero-shot performance
6. **GPU recommended** - 10-50× faster
7. **Preprocess images** - Use provided preprocess function
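As a complement to practice 4, here is a minimal caching sketch; it reuses `model`, `preprocess`, and `device` from the snippets above, and the cache filename is illustrative:
```python
import os
import torch
from PIL import Image

CACHE_PATH = "clip_image_embeddings.pt"  # illustrative cache file

def get_image_embeddings(image_paths):
    """Encode images once and reuse the cached, normalized embeddings."""
    if os.path.exists(CACHE_PATH):
        return torch.load(CACHE_PATH)
    embeddings = []
    for path in image_paths:
        img = preprocess(Image.open(path)).unsqueeze(0).to(device)
        with torch.no_grad():
            emb = model.encode_image(img)
        embeddings.append(emb / emb.norm(dim=-1, keepdim=True))
    embeddings = torch.cat(embeddings)
    torch.save(embeddings, CACHE_PATH)
    return embeddings
```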
## Performance
| Operation | CPU | GPU (V100) |
|-----------|-----|------------|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |
## Limitations
1. **Not for fine-grained tasks** - Best for broad categories
2. **Requires descriptive text** - Vague labels perform poorly
3. **Biased on web data** - May have dataset biases
4. **No bounding boxes** - Whole image only
5. **Limited spatial understanding** - Position/counting weak
## Resources
- **GitHub**: https://github.com/openai/CLIP ⭐ 25,300+
- **Paper**: https://arxiv.org/abs/2103.00020
- **Colab**: https://colab.research.google.com/github/openai/clip/
- **License**: MIT


@ -0,0 +1,207 @@
# CLIP Applications Guide
Practical applications and use cases for CLIP.
## Zero-shot image classification
```python
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32")

# Define categories
categories = [
    "a photo of a dog",
    "a photo of a cat",
    "a photo of a bird",
    "a photo of a car",
    "a photo of a person"
]

# Prepare image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = clip.tokenize(categories)

# Classify
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
for category, prob in zip(categories, probs[0]):
    print(f"{category}: {prob:.2%}")
```
## Semantic image search
```python
# Index images
image_database = []
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]

for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0)
    with torch.no_grad():
        features = model.encode_image(image)
    features /= features.norm(dim=-1, keepdim=True)
    image_database.append((img_path, features))

# Search with text
query = "a sunset over mountains"
text_input = clip.tokenize([query])
with torch.no_grad():
    text_features = model.encode_text(text_input)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Find matches
similarities = []
for img_path, img_features in image_database:
    similarity = (text_features @ img_features.T).item()
    similarities.append((img_path, similarity))

# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
for img_path, score in similarities[:3]:
    print(f"{img_path}: {score:.3f}")
```
## Content moderation
```python
# Define safety categories
categories = [
    "safe for work content",
    "not safe for work content",
    "violent or graphic content",
    "hate speech or offensive content",
    "spam or misleading content"
]
text = clip.tokenize(categories)

# Check image
with torch.no_grad():
    logits, _ = model(image, text)
    probs = logits.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
confidence = probs[0, max_idx].item()

if confidence > 0.7:
    print(f"Classified as: {categories[max_idx]} ({confidence:.2%})")
else:
    print(f"Uncertain classification (confidence: {confidence:.2%})")
```
## Image-to-text retrieval
```python
# Text database
captions = [
    "A beautiful sunset over the ocean",
    "A cute dog playing in the park",
    "A modern city skyline at night",
    "A delicious pizza with toppings"
]

# Encode captions
caption_features = []
for caption in captions:
    text = clip.tokenize([caption])
    with torch.no_grad():
        features = model.encode_text(text)
    features /= features.norm(dim=-1, keepdim=True)
    caption_features.append(features)

caption_features = torch.cat(caption_features)

# Find matching captions for image
with torch.no_grad():
    image_features = model.encode_image(image)
image_features /= image_features.norm(dim=-1, keepdim=True)

similarities = (image_features @ caption_features.T).squeeze(0)
top_k = similarities.topk(3)
for idx, score in zip(top_k.indices, top_k.values):
    print(f"{captions[idx]}: {score:.3f}")
```
## Visual question answering
```python
# Create yes/no questions
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
questions = [
    "a photo showing people",
    "a photo showing animals",
    "a photo taken indoors",
    "a photo taken outdoors",
    "a photo taken during daytime",
    "a photo taken at night"
]
text = clip.tokenize(questions)

with torch.no_grad():
    logits, _ = model(image, text)
    probs = logits.softmax(dim=-1)

# Answer questions
for question, prob in zip(questions, probs[0]):
    answer = "Yes" if prob > 0.5 else "No"
    print(f"{question}: {answer} ({prob:.2%})")
```
## Image deduplication
```python
# Detect duplicate/similar images
def compute_similarity(img1_path, img2_path):
    img1 = preprocess(Image.open(img1_path)).unsqueeze(0)
    img2 = preprocess(Image.open(img2_path)).unsqueeze(0)
    with torch.no_grad():
        feat1 = model.encode_image(img1)
        feat2 = model.encode_image(img2)
    feat1 /= feat1.norm(dim=-1, keepdim=True)
    feat2 /= feat2.norm(dim=-1, keepdim=True)
    similarity = (feat1 @ feat2.T).item()
    return similarity

# Check for duplicates
threshold = 0.95
image_pairs = [("img1.jpg", "img2.jpg"), ("img1.jpg", "img3.jpg")]

for img1, img2 in image_pairs:
    sim = compute_similarity(img1, img2)
    if sim > threshold:
        print(f"{img1} and {img2} are duplicates (similarity: {sim:.3f})")
```
## Best practices
1. **Use descriptive labels** - "a photo of X" works better than just "X" (see the prompt-ensembling sketch below)
2. **Normalize embeddings** - Always normalize for cosine similarity
3. **Batch processing** - Process multiple images/texts together
4. **Cache embeddings** - Expensive to recompute
5. **Set appropriate thresholds** - Test on validation data
6. **Use GPU** - 10-50× faster than CPU
7. **Consider model size** - ViT-B/32 good default, ViT-L/14 for best quality
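On practice 1, averaging several templated prompts per class ("prompt ensembling", as reported in the CLIP paper) usually beats a single label. A minimal sketch reusing `model` from the snippets above; the template list is illustrative:
```python
import torch
import clip

templates = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a close-up photo of a {}.",
]
class_names = ["dog", "cat", "bird"]

# Build one averaged, normalized text embedding per class
class_embeddings = []
with torch.no_grad():
    for name in class_names:
        tokens = clip.tokenize([t.format(name) for t in templates])
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embeddings.append(emb.mean(dim=0))
class_embeddings = torch.stack(class_embeddings)
class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
# Compare image embeddings against class_embeddings exactly as in the search examples above
```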
## Resources
- **Paper**: https://arxiv.org/abs/2103.00020
- **GitHub**: https://github.com/openai/CLIP
- **Colab**: https://colab.research.google.com/github/openai/clip/


@ -0,0 +1,344 @@
---
name: modal-serverless-gpu
description: Serverless GPU cloud platform for running ML workloads. Use when you need on-demand GPU access without infrastructure management, deploying ML models as APIs, or running batch jobs with automatic scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [modal>=0.64.0]
metadata:
  hermes:
    tags: [Infrastructure, Serverless, GPU, Cloud, Deployment, Modal]
---
# Modal Serverless GPU
Comprehensive guide to running ML workloads on Modal's serverless GPU cloud platform.
## When to use Modal
**Use Modal when:**
- Running GPU-intensive ML workloads without managing infrastructure
- Deploying ML models as auto-scaling APIs
- Running batch processing jobs (training, inference, data processing)
- Need pay-per-second GPU pricing without idle costs
- Prototyping ML applications quickly
- Running scheduled jobs (cron-like workloads)
**Key features:**
- **Serverless GPUs**: T4, L4, A10G, L40S, A100, H100, H200, B200 on-demand
- **Python-native**: Define infrastructure in Python code, no YAML
- **Auto-scaling**: Scale to zero, scale to 100+ GPUs instantly
- **Sub-second cold starts**: Rust-based infrastructure for fast container launches
- **Container caching**: Image layers cached for rapid iteration
- **Web endpoints**: Deploy functions as REST APIs with zero-downtime updates
**Use alternatives instead:**
- **RunPod**: For longer-running pods with persistent state
- **Lambda Labs**: For reserved GPU instances
- **SkyPilot**: For multi-cloud orchestration and cost optimization
- **Kubernetes**: For complex multi-service architectures
## Quick start
### Installation
```bash
pip install modal
modal setup # Opens browser for authentication
```
### Hello World with GPU
```python
import modal

app = modal.App("hello-gpu")

@app.function(gpu="T4")
def gpu_info():
    import subprocess
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    print(gpu_info.remote())
```
Run: `modal run hello_gpu.py`
### Basic inference endpoint
```python
import modal

app = modal.App("text-generation")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.cls(gpu="A10G", image=image)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="gpt2", device=0)

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_length=100)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(TextGenerator().generate.remote("Hello, world"))
```
## Core concepts
### Key components
| Component | Purpose |
|-----------|---------|
| `App` | Container for functions and resources |
| `Function` | Serverless function with compute specs |
| `Cls` | Class-based functions with lifecycle hooks |
| `Image` | Container image definition |
| `Volume` | Persistent storage for models/data |
| `Secret` | Secure credential storage |
### Execution modes
| Command | Description |
|---------|-------------|
| `modal run script.py` | Execute and exit |
| `modal serve script.py` | Development with live reload |
| `modal deploy script.py` | Persistent cloud deployment |
## GPU configuration
### Available GPUs
| GPU | VRAM | Best For |
|-----|------|----------|
| `T4` | 16GB | Budget inference, small models |
| `L4` | 24GB | Inference, Ada Lovelace arch |
| `A10G` | 24GB | Training/inference, 3.3x faster than T4 |
| `L40S` | 48GB | Recommended for inference (best cost/perf) |
| `A100-40GB` | 40GB | Large model training |
| `A100-80GB` | 80GB | Very large models |
| `H100` | 80GB | Fastest, FP8 + Transformer Engine |
| `H200` | 141GB | Auto-upgrade from H100, 4.8TB/s bandwidth |
| `B200` | 192GB | Latest Blackwell architecture |
### GPU specification patterns
```python
# Single GPU
@app.function(gpu="A100")
# Specific memory variant
@app.function(gpu="A100-80GB")
# Multiple GPUs (up to 8)
@app.function(gpu="H100:4")
# GPU with fallbacks
@app.function(gpu=["H100", "A100", "L40S"])
# Any available GPU
@app.function(gpu="any")
```
## Container images
```python
# Basic image with pip
image = modal.Image.debian_slim(python_version="3.11").pip_install(
"torch==2.1.0", "transformers==4.36.0", "accelerate"
)
# From CUDA base
image = modal.Image.from_registry(
"nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
add_python="3.11"
).pip_install("torch", "transformers")
# With system packages
image = modal.Image.debian_slim().apt_install("git", "ffmpeg").pip_install("whisper")
```
## Persistent storage
```python
volume = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(gpu="A10G", volumes={"/models": volume})
def load_model():
    import os
    model_path = "/models/llama-7b"
    if not os.path.exists(model_path):
        model = download_model()
        model.save_pretrained(model_path)
        volume.commit()  # Persist changes
    return load_from_path(model_path)
```
## Web endpoints
### FastAPI endpoint decorator
```python
@app.function()
@modal.fastapi_endpoint(method="POST")
def predict(text: str) -> dict:
    return {"result": model.predict(text)}
```
### Full ASGI app
```python
from fastapi import FastAPI

web_app = FastAPI()

@web_app.post("/predict")
async def predict(text: str):
    return {"result": await model.predict.remote.aio(text)}

@app.function()
@modal.asgi_app()
def fastapi_app():
    return web_app
```
### Web endpoint types
| Decorator | Use Case |
|-----------|----------|
| `@modal.fastapi_endpoint()` | Simple function → API |
| `@modal.asgi_app()` | Full FastAPI/Starlette apps |
| `@modal.wsgi_app()` | Django/Flask apps |
| `@modal.web_server(port)` | Arbitrary HTTP servers |
## Dynamic batching
```python
@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(inputs: list[str]) -> list[dict]:
    # Inputs are automatically accumulated into batches
    return model.batch_predict(inputs)
```
## Secrets management
```bash
# Create secret
modal secret create huggingface HF_TOKEN=hf_xxx
```
```python
@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]
```
## Scheduling
```python
@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily at midnight UTC
def daily_job():
    pass

@app.function(schedule=modal.Period(hours=1))
def hourly_job():
    pass
```
## Performance optimization
### Cold start mitigation
```python
@app.function(
    container_idle_timeout=300,  # Keep warm 5 min
    allow_concurrent_inputs=10,  # Handle concurrent requests
)
def inference():
    pass
```
### Model loading best practices
```python
@app.cls(gpu="A100")
class Model:
@modal.enter() # Run once at container start
def load(self):
self.model = load_model() # Load during warm-up
@modal.method()
def predict(self, x):
return self.model(x)
```
## Parallel processing
```python
@app.function()
def process_item(item):
    return expensive_computation(item)

@app.function()
def run_parallel():
    items = list(range(1000))
    # Fan out to parallel containers
    results = list(process_item.map(items))
    return results
```
## Common configuration
```python
@app.function(
    gpu="A100",
    memory=32768,                # 32GB RAM
    cpu=4,                       # 4 CPU cores
    timeout=3600,                # 1 hour max
    container_idle_timeout=120,  # Keep warm 2 min
    retries=3,                   # Retry on failure
    concurrency_limit=10,        # Max concurrent containers
)
def my_function():
    pass
```
## Debugging
```python
# Test locally
if __name__ == "__main__":
    result = my_function.local()

# View logs:
#   modal app logs my-app
```
## Common issues
| Issue | Solution |
|-------|----------|
| Cold start latency | Increase `container_idle_timeout`, use `@modal.enter()` |
| GPU OOM | Use larger GPU (`A100-80GB`), enable gradient checkpointing |
| Image build fails | Pin dependency versions, check CUDA compatibility |
| Timeout errors | Increase `timeout`, add checkpointing |
## References
- **[Advanced Usage](references/advanced-usage.md)** - Multi-GPU, distributed training, cost optimization
- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions
## Resources
- **Documentation**: https://modal.com/docs
- **Examples**: https://github.com/modal-labs/modal-examples
- **Pricing**: https://modal.com/pricing
- **Discord**: https://discord.gg/modal


@ -0,0 +1,503 @@
# Modal Advanced Usage Guide
## Multi-GPU Training
### Single-node multi-GPU
```python
import modal

app = modal.App("multi-gpu-training")
image = modal.Image.debian_slim().pip_install("torch", "transformers", "accelerate")

@app.function(gpu="H100:4", image=image, timeout=7200)
def train_multi_gpu():
    from accelerate import Accelerator
    accelerator = Accelerator()
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    for batch in dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
```
### DeepSpeed integration
```python
image = modal.Image.debian_slim().pip_install(
    "torch", "transformers", "deepspeed", "accelerate"
)

@app.function(gpu="A100:8", image=image, timeout=14400)
def deepspeed_train(config: dict):
    from transformers import Trainer, TrainingArguments
    args = TrainingArguments(
        output_dir="/outputs",
        deepspeed="ds_config.json",
        fp16=True,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4
    )
    trainer = Trainer(model=model, args=args, train_dataset=dataset)
    trainer.train()
```
### Multi-GPU considerations
For frameworks that re-execute the Python entrypoint (like PyTorch Lightning), use:
- `ddp_spawn` or `ddp_notebook` strategy
- Run training as a subprocess to avoid issues
```python
@app.function(gpu="H100:4")
def train_with_subprocess():
import subprocess
subprocess.run(["python", "-m", "torch.distributed.launch", "train.py"])
```
## Advanced Container Configuration
### Multi-stage builds for caching
```python
# Stage 1: Base dependencies (cached)
base_image = modal.Image.debian_slim().pip_install("torch", "numpy", "scipy")
# Stage 2: ML libraries (cached separately)
ml_image = base_image.pip_install("transformers", "datasets", "accelerate")
# Stage 3: Custom code (rebuilt on changes)
final_image = ml_image.copy_local_dir("./src", "/app/src")
```
### Custom Dockerfiles
```python
image = modal.Image.from_dockerfile("./Dockerfile")
```
### Installing from Git
```python
image = modal.Image.debian_slim().pip_install(
"git+https://github.com/huggingface/transformers.git@main"
)
```
### Using uv for faster installs
```python
image = modal.Image.debian_slim().uv_pip_install(
"torch", "transformers", "accelerate"
)
```
## Advanced Class Patterns
### Lifecycle hooks
```python
@app.cls(gpu="A10G")
class InferenceService:
@modal.enter()
def startup(self):
"""Called once when container starts"""
self.model = load_model()
self.tokenizer = load_tokenizer()
@modal.exit()
def shutdown(self):
"""Called when container shuts down"""
cleanup_resources()
@modal.method()
def predict(self, text: str):
return self.model(self.tokenizer(text))
```
### Concurrent request handling
```python
@app.cls(
    gpu="A100",
    allow_concurrent_inputs=20,  # Handle 20 requests per container
    container_idle_timeout=300
)
class BatchInference:
    @modal.enter()
    def load(self):
        self.model = load_model()

    @modal.method()
    def predict(self, inputs: list):
        return self.model.batch_predict(inputs)
```
### Input concurrency vs batching
- **Input concurrency**: Multiple requests processed simultaneously (async I/O)
- **Dynamic batching**: Requests accumulated and processed together (GPU efficiency)
```python
import aiohttp

# Input concurrency - good for I/O-bound work
@app.function(allow_concurrent_inputs=10)
async def fetch_data(url: str):
    async with aiohttp.ClientSession() as session:
        return await session.get(url)

# Dynamic batching - good for GPU inference
@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_embed(texts: list[str]) -> list[list[float]]:
    return model.encode(texts)
```
## Advanced Volumes
### Volume operations
```python
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

@app.function(volumes={"/data": volume})
def volume_operations():
    # Write data
    with open("/data/output.txt", "w") as f:
        f.write("Results")

    # Commit changes (persist to volume)
    volume.commit()

    # Reload from remote (get latest)
    volume.reload()
```
### Shared volumes between functions
```python
shared_volume = modal.Volume.from_name("shared-data", create_if_missing=True)

@app.function(volumes={"/shared": shared_volume})
def writer():
    with open("/shared/data.txt", "w") as f:
        f.write("Hello from writer")
    shared_volume.commit()

@app.function(volumes={"/shared": shared_volume})
def reader():
    shared_volume.reload()  # Get latest
    with open("/shared/data.txt", "r") as f:
        return f.read()
```
### Cloud bucket mounts
```python
# Mount S3 bucket
bucket = modal.CloudBucketMount(
    bucket_name="my-bucket",
    secret=modal.Secret.from_name("aws-credentials")
)

@app.function(volumes={"/s3": bucket})
def process_s3_data():
    # Access S3 files like a local filesystem
    data = open("/s3/data.parquet").read()
```
## Function Composition
### Chaining functions
```python
@app.function()
def preprocess(data):
    return cleaned_data

@app.function(gpu="T4")
def inference(data):
    return predictions

@app.function()
def postprocess(predictions):
    return formatted_results

@app.function()
def pipeline(raw_data):
    cleaned = preprocess.remote(raw_data)
    predictions = inference.remote(cleaned)
    results = postprocess.remote(predictions)
    return results
```
### Parallel fan-out
```python
@app.function()
def process_item(item):
    return expensive_computation(item)

@app.function()
def parallel_pipeline(items):
    # Fan out: process all items in parallel
    results = list(process_item.map(items))
    return results
```
### Starmap for multiple arguments
```python
@app.function()
def process(x, y, z):
    return x + y + z

@app.function()
def orchestrate():
    args = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
    results = list(process.starmap(args))
    return results
```
## Advanced Web Endpoints
### WebSocket support
```python
import modal
from fastapi import FastAPI, WebSocket

app = modal.App("websocket-app")
web_app = FastAPI()

@web_app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    while True:
        data = await websocket.receive_text()
        await websocket.send_text(f"Processed: {data}")

@app.function()
@modal.asgi_app()
def ws_app():
    return web_app
```
### Streaming responses
```python
from fastapi.responses import StreamingResponse

@app.function(gpu="A100")
def generate_stream(prompt: str):
    for token in model.generate_stream(prompt):
        yield token

@web_app.get("/stream")
async def stream_response(prompt: str):
    return StreamingResponse(
        generate_stream.remote_gen(prompt),
        media_type="text/event-stream"
    )
```
### Authentication
```python
from fastapi import Depends, HTTPException, Header

async def verify_token(authorization: str = Header(None)):
    if not authorization or not authorization.startswith("Bearer "):
        raise HTTPException(status_code=401)
    token = authorization.split(" ")[1]
    if not verify_jwt(token):
        raise HTTPException(status_code=403)
    return token

@web_app.post("/predict")
async def predict(data: dict, token: str = Depends(verify_token)):
    return model.predict(data)
```
## Cost Optimization
### Right-sizing GPUs
```python
# For inference: smaller GPUs are often sufficient
@app.function(gpu="L40S")  # 48GB, best cost/perf for inference
def inference():
    pass

# For training: larger GPUs for throughput
@app.function(gpu="A100-80GB")
def training():
    pass
```
### GPU fallbacks for availability
```python
@app.function(gpu=["H100", "A100", "L40S"]) # Try in order
def flexible_compute():
pass
```
### Scale to zero
```python
# Default behavior: scale to zero when idle
@app.function(gpu="A100")
def on_demand():
    pass

# Keep containers warm for low latency (costs more)
@app.function(gpu="A100", keep_warm=1)
def always_ready():
    pass
```
### Batch processing for efficiency
```python
# Process in batches to reduce cold starts
@app.function(gpu="A100")
def batch_process(items: list):
    return [process(item) for item in items]

# Better than issuing individual calls
results = batch_process.remote(all_items)
```
## Monitoring and Observability
### Structured logging
```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.function()
def structured_logging(request_id: str, data: dict):
    logger.info(json.dumps({
        "event": "inference_start",
        "request_id": request_id,
        "input_size": len(data)
    }))
    result = process(data)
    logger.info(json.dumps({
        "event": "inference_complete",
        "request_id": request_id,
        "output_size": len(result)
    }))
    return result
```
### Custom metrics
```python
@app.function(gpu="A100")
def monitored_inference(inputs):
import time
start = time.time()
results = model.predict(inputs)
latency = time.time() - start
# Log metrics (visible in Modal dashboard)
print(f"METRIC latency={latency:.3f}s batch_size={len(inputs)}")
return results
```
## Production Deployment
### Environment separation
```python
import os

env = os.environ.get("MODAL_ENV", "dev")
app = modal.App(f"my-service-{env}")

# Environment-specific config
if env == "prod":
    gpu_config = "A100"
    timeout = 3600
else:
    gpu_config = "T4"
    timeout = 300
```
### Zero-downtime deployments
Modal automatically handles zero-downtime deployments:
1. New containers are built and started
2. Traffic gradually shifts to new version
3. Old containers drain existing requests
4. Old containers are terminated
### Health checks
```python
@app.function()
@modal.web_endpoint()
def health():
    return {
        "status": "healthy",
        "model_loaded": hasattr(Model, "_model"),
        "gpu_available": torch.cuda.is_available()
    }
```
## Sandboxes
### Interactive execution environments
```python
@app.function()
def run_sandbox():
    sandbox = modal.Sandbox.create(
        app=app,
        image=image,
        gpu="T4"
    )
    # Execute code in the sandbox
    result = sandbox.exec("python", "-c", "print('Hello from sandbox')")
    sandbox.terminate()
    return result
```
## Invoking Deployed Functions
### From external code
```python
# Call deployed function from any Python script
import modal
f = modal.Function.lookup("my-app", "my_function")
result = f.remote(arg1, arg2)
```
### REST API invocation
```bash
# Deployed endpoints accessible via HTTPS
curl -X POST https://your-workspace--my-app-predict.modal.run \
-H "Content-Type: application/json" \
-d '{"text": "Hello world"}'
```


@ -0,0 +1,494 @@
# Modal Troubleshooting Guide
## Installation Issues
### Authentication fails
**Error**: `modal setup` doesn't complete or token is invalid
**Solutions**:
```bash
# Re-authenticate
modal token new
# Check current token
modal config show
# Set token via environment
export MODAL_TOKEN_ID=ak-...
export MODAL_TOKEN_SECRET=as-...
```
### Package installation issues
**Error**: `pip install modal` fails
**Solutions**:
```bash
# Upgrade pip
pip install --upgrade pip
# Install with specific Python version
python3.11 -m pip install modal
# Install from wheel
pip install modal --prefer-binary
```
## Container Image Issues
### Image build fails
**Error**: `ImageBuilderError: Failed to build image`
**Solutions**:
```python
# Pin package versions to avoid conflicts
image = modal.Image.debian_slim().pip_install(
"torch==2.1.0",
"transformers==4.36.0", # Pin versions
"accelerate==0.25.0"
)
# Use compatible CUDA versions
image = modal.Image.from_registry(
"nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04", # Match PyTorch CUDA
add_python="3.11"
)
```
### Dependency conflicts
**Error**: `ERROR: Cannot install package due to conflicting dependencies`
**Solutions**:
```python
# Layer dependencies separately
base = modal.Image.debian_slim().pip_install("torch")
ml = base.pip_install("transformers") # Install after torch
# Use uv for better resolution
image = modal.Image.debian_slim().uv_pip_install(
"torch", "transformers"
)
```
### Large image builds timeout
**Error**: Image build exceeds time limit
**Solutions**:
```python
# Split into multiple layers (better caching)
base = modal.Image.debian_slim().pip_install("torch") # Cached
ml = base.pip_install("transformers", "datasets") # Cached
app = ml.copy_local_dir("./src", "/app") # Rebuilds on code change
# Download models during build, not runtime
image = modal.Image.debian_slim().pip_install("transformers").run_commands(
"python -c 'from transformers import AutoModel; AutoModel.from_pretrained(\"bert-base\")'"
)
```
## GPU Issues
### GPU not available
**Error**: `RuntimeError: CUDA not available`
**Solutions**:
```python
# Ensure a GPU is specified
@app.function(gpu="T4")  # Must specify GPU
def my_function():
    import torch
    assert torch.cuda.is_available()

# Check CUDA compatibility in the image
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
    add_python="3.11"
).pip_install(
    "torch",
    index_url="https://download.pytorch.org/whl/cu121"  # Match CUDA version
)
```
### GPU out of memory
**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
**Solutions**:
```python
# Use a larger GPU
@app.function(gpu="A100-80GB")  # More VRAM
def train():
    pass

# Enable memory optimizations
@app.function(gpu="A100")
def memory_optimized():
    import torch
    torch.backends.cuda.enable_flash_sdp(True)

    # Use gradient checkpointing
    model.gradient_checkpointing_enable()

    # Mixed precision
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(**inputs)
```
### Wrong GPU allocated
**Error**: Got different GPU than requested
**Solutions**:
```python
# Use strict GPU selection
@app.function(gpu="H100!")  # "H100!" prevents auto-upgrade to H200

# Specify the exact memory variant
@app.function(gpu="A100-80GB")  # Not just "A100"

# Check the GPU at runtime
@app.function(gpu="A100")
def check_gpu():
    import subprocess
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    print(result.stdout)
```
## Cold Start Issues
### Slow cold starts
**Problem**: First request takes too long
**Solutions**:
```python
# Keep containers warm
@app.function(
    container_idle_timeout=600,  # Keep warm 10 min
    keep_warm=1                  # Always keep 1 container ready
)
def low_latency():
    pass

# Load the model during container start
@app.cls(gpu="A100")
class Model:
    @modal.enter()
    def load(self):
        # This runs once at container start, not per request
        self.model = load_heavy_model()

# Cache the model in a volume
volume = modal.Volume.from_name("models", create_if_missing=True)

@app.function(volumes={"/cache": volume})
def cached_model():
    if os.path.exists("/cache/model"):
        model = load_from_disk("/cache/model")
    else:
        model = download_model()
        save_to_disk(model, "/cache/model")
        volume.commit()
```
### Container keeps restarting
**Problem**: Containers are killed and restarted frequently
**Solutions**:
```python
# Increase memory
@app.function(memory=32768)  # 32GB RAM
def memory_heavy():
    pass

# Increase timeout
@app.function(timeout=3600)  # 1 hour
def long_running():
    pass

# Handle signals gracefully
import signal

def handler(signum, frame):
    cleanup()
    exit(0)

signal.signal(signal.SIGTERM, handler)
```
## Volume Issues
### Volume changes not persisting
**Error**: Data written to volume disappears
**Solutions**:
```python
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

@app.function(volumes={"/data": volume})
def write_data():
    with open("/data/file.txt", "w") as f:
        f.write("data")

    # CRITICAL: Commit changes!
    volume.commit()
```
### Volume read shows stale data
**Error**: Reading outdated data from volume
**Solutions**:
```python
@app.function(volumes={"/data": volume})
def read_data():
# Reload to get latest
volume.reload()
with open("/data/file.txt", "r") as f:
return f.read()
```
### Volume mount fails
**Error**: `VolumeError: Failed to mount volume`
**Solutions**:
```python
# Ensure the volume exists
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

# Use an absolute mount path
@app.function(volumes={"/data": volume})  # Not "./data"
def my_function():
    pass

# Check volumes in the dashboard or via:
#   modal volume list
```
## Web Endpoint Issues
### Endpoint returns 502
**Error**: Gateway timeout or bad gateway
**Solutions**:
```python
# Increase timeout
@app.function(timeout=300)  # 5 min
@modal.web_endpoint()
def slow_endpoint():
    pass

# Return a streaming response for long operations
from fastapi.responses import StreamingResponse

@app.function()
@modal.asgi_app()
def streaming_app():
    async def generate():
        for i in range(100):
            yield f"data: {i}\n\n"
            await process_chunk(i)
    return StreamingResponse(generate(), media_type="text/event-stream")
```
### Endpoint not accessible
**Error**: 404 or cannot reach endpoint
**Solutions**:
```bash
# Check deployment status
modal app list
# Redeploy
modal deploy my_app.py
# Check logs
modal app logs my-app
```
### CORS errors
**Error**: Cross-origin request blocked
**Solutions**:
```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

web_app = FastAPI()
web_app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.function()
@modal.asgi_app()
def cors_enabled():
    return web_app
```
## Secret Issues
### Secret not found
**Error**: `SecretNotFound: Secret 'my-secret' not found`
**Solutions**:
```bash
# Create secret via CLI
modal secret create my-secret KEY=value
# List secrets
modal secret list
# Check secret name matches exactly
```
### Secret value not accessible
**Error**: Environment variable is empty
**Solutions**:
```python
# Ensure the secret is attached
@app.function(secrets=[modal.Secret.from_name("my-secret")])
def use_secret():
    import os
    value = os.environ.get("KEY")  # Use get() to handle missing keys
    if not value:
        raise ValueError("KEY not set in secret")
```
## Scheduling Issues
### Scheduled job not running
**Error**: Cron job doesn't execute
**Solutions**:
```python
# Verify cron syntax
@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily at midnight UTC
def daily_job():
    pass

# Check the timezone (Modal uses UTC)
# "0 8 * * *" = 8am UTC, not local time

# Ensure the app is deployed:
#   modal deploy my_app.py
```
### Job runs multiple times
**Problem**: Scheduled job executes more than expected
**Solutions**:
```python
# Implement idempotency
@app.function(schedule=modal.Cron("0 * * * *"))
def hourly_job():
    job_id = get_current_hour_id()
    if already_processed(job_id):
        return
    process()
    mark_processed(job_id)
```
## Debugging Tips
### Enable debug logging
```python
import logging

logging.basicConfig(level=logging.DEBUG)

@app.function()
def debug_function():
    logging.debug("Debug message")
    logging.info("Info message")
```
### View container logs
```bash
# Stream logs
modal app logs my-app
# View specific function
modal app logs my-app --function my_function
# View historical logs
modal app logs my-app --since 1h
```
### Test locally
```python
# Run the function locally without Modal
if __name__ == "__main__":
    result = my_function.local()  # Runs on your machine
    print(result)
```
### Inspect container
```python
@app.function(gpu="T4")
def debug_environment():
import subprocess
import sys
# System info
print(f"Python: {sys.version}")
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
print(subprocess.run(["pip", "list"], capture_output=True, text=True).stdout)
# CUDA info
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
```
## Common Error Messages
| Error | Cause | Solution |
|-------|-------|----------|
| `FunctionTimeoutError` | Function exceeded timeout | Increase `timeout` parameter |
| `ContainerMemoryExceeded` | OOM killed | Increase `memory` parameter |
| `ImageBuilderError` | Build failed | Check dependencies, pin versions |
| `ResourceExhausted` | No GPUs available | Use GPU fallbacks, try later |
| `AuthenticationError` | Invalid token | Run `modal token new` |
| `VolumeNotFound` | Volume doesn't exist | Use `create_if_missing=True` |
| `SecretNotFound` | Secret doesn't exist | Create secret via CLI |
## Getting Help
1. **Documentation**: https://modal.com/docs
2. **Examples**: https://github.com/modal-labs/modal-examples
3. **Discord**: https://discord.gg/modal
4. **Status**: https://status.modal.com
### Reporting Issues
Include:
- Modal client version: `modal --version`
- Python version: `python --version`
- Full error traceback
- Minimal reproducible code
- GPU type if relevant


@ -0,0 +1,434 @@
---
name: peft-fine-tuning
description: Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [peft>=0.13.0, transformers>=4.45.0, torch>=2.0.0, bitsandbytes>=0.43.0]
metadata:
  hermes:
    tags: [Fine-Tuning, PEFT, LoRA, QLoRA, Parameter-Efficient, Adapters, Low-Rank, Memory Optimization, Multi-Adapter]
---
# PEFT (Parameter-Efficient Fine-Tuning)
Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.
## When to use PEFT
**Use PEFT/LoRA when:**
- Fine-tuning 7B-70B models on consumer GPUs (RTX 4090, A100)
- Need to train <1% parameters (6MB adapters vs 14GB full model)
- Want fast iteration with multiple task-specific adapters
- Deploying multiple fine-tuned variants from one base model
**Use QLoRA (PEFT + quantization) when:**
- Fine-tuning 70B models on single 24GB GPU
- Memory is the primary constraint
- Can accept ~5% quality trade-off vs full fine-tuning
**Use full fine-tuning instead when:**
- Training small models (<1B parameters)
- Need maximum quality and have compute budget
- Significant domain shift requires updating all weights
## Quick start
### Installation
```bash
# Basic installation
pip install peft
# With quantization support (recommended)
pip install peft bitsandbytes
# Full stack
pip install peft transformers accelerate bitsandbytes datasets
```
### LoRA fine-tuning (standard)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # Rank (8-64, higher = more capacity)
    lora_alpha=32,      # Scaling factor (typically 2*r)
    lora_dropout=0.05,  # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none"         # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
tokenized.set_format("torch")  # Return tensors so the collator can stack them

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=lambda data: {
        "input_ids": torch.stack([f["input_ids"] for f in data]),
        "attention_mask": torch.stack([f["attention_mask"] for f in data]),
        "labels": torch.stack([f["input_ids"] for f in data]),
    }
)
trainer.train()

# Save adapter only (6MB vs 16GB)
model.save_pretrained("./lora-llama-adapter")
```
### QLoRA fine-tuning (memory-efficient)
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",  # Compute in bf16
    bnb_4bit_use_double_quant=True      # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,  # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# 70B model now fits on single 24GB GPU!
```
## LoRA parameter selection
### Rank (r) - capacity vs efficiency
| Rank | Trainable Params | Memory | Quality | Use Case |
|------|-----------------|--------|---------|----------|
| 4 | ~3M | Minimal | Lower | Simple tasks, prototyping |
| **8** | ~7M | Low | Good | **Recommended starting point** |
| **16** | ~14M | Medium | Better | **General fine-tuning** |
| 32 | ~27M | Higher | High | Complex tasks |
| 64 | ~54M | High | Highest | Domain adaptation, 70B models |
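The trainable-parameter column follows directly from the LoRA update shape: each adapted d_in × d_out linear layer adds r·(d_in + d_out) parameters. A back-of-the-envelope sketch for the q/k/v/o projections of a Llama-3.1-8B-style model (32 layers, hidden size 4096, 1024-dim K/V projections under grouped-query attention); exact numbers vary by architecture:
```python
# Approximate LoRA parameter counts for q/k/v/o projections (Llama-3.1-8B-style dimensions)
hidden = 4096   # hidden size (q_proj/o_proj are hidden x hidden)
kv_out = 1024   # k_proj/v_proj output dim with grouped-query attention
layers = 32

def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)  # A: r x d_in, B: d_out x r

for r in (4, 8, 16, 32, 64):
    per_layer = (
        lora_params(hidden, hidden, r)    # q_proj
        + lora_params(hidden, kv_out, r)  # k_proj
        + lora_params(hidden, kv_out, r)  # v_proj
        + lora_params(hidden, hidden, r)  # o_proj
    )
    print(f"r={r:>2}: ~{per_layer * layers / 1e6:.1f}M trainable parameters")
# r=16 gives ~13.6M, matching print_trainable_parameters() in the quick start
```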
### Alpha (lora_alpha) - scaling factor
```python
# Rule of thumb: alpha = 2 * rank
LoraConfig(r=16, lora_alpha=32) # Standard
LoraConfig(r=16, lora_alpha=16) # Conservative (lower learning rate effect)
LoraConfig(r=16, lora_alpha=64) # Aggressive (higher learning rate effect)
```
### Target modules by architecture
```python
# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
# GPT-2 / GPT-Neo
target_modules = ["c_attn", "c_proj", "c_fc"]
# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
# BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]
# Auto-detect all linear layers
target_modules = "all-linear" # PEFT 0.6.0+
```
## Loading and merging adapters
### Load trained adapter
```python
from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM
# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")
# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained(
"./lora-llama-adapter",
device_map="auto"
)
```
### Merge adapter into base model
```python
# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")
# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")
```
### Multi-adapter serving
```python
from peft import PeftModel
# Load base with first adapter
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1")
# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")
# Switch between adapters at runtime
model.set_adapter("task1") # Use task1 adapter
output1 = model.generate(**inputs)
model.set_adapter("task2") # Switch to task2
output2 = model.generate(**inputs)
# Disable adapters (use base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)
```
## PEFT methods comparison
| Method | Trainable % | Memory | Speed | Best For |
|--------|------------|--------|-------|----------|
| **LoRA** | 0.1-1% | Low | Fast | General fine-tuning |
| **QLoRA** | 0.1-1% | Very Low | Medium | Memory-constrained |
| AdaLoRA | 0.1-1% | Low | Medium | Automatic rank selection |
| IA3 | 0.01% | Minimal | Fastest | Few-shot adaptation |
| Prefix Tuning | 0.1% | Low | Medium | Generation control |
| Prompt Tuning | 0.001% | Minimal | Fast | Simple task adaptation |
| P-Tuning v2 | 0.1% | Low | Medium | NLU tasks |
### IA3 (minimal parameters)
```python
from peft import IA3Config
ia3_config = IA3Config(
target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
feedforward_modules=["down_proj"]
)
model = get_peft_model(model, ia3_config)
# Trains only 0.01% of parameters!
```
### Prefix Tuning
```python
from peft import PrefixTuningConfig
prefix_config = PrefixTuningConfig(
task_type="CAUSAL_LM",
num_virtual_tokens=20, # Prepended tokens
prefix_projection=True # Use MLP projection
)
model = get_peft_model(model, prefix_config)
```
## Integration patterns
### With TRL (SFTTrainer)
```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")
trainer = SFTTrainer(
model=model,
args=SFTConfig(output_dir="./output", max_seq_length=512),
train_dataset=dataset,
peft_config=lora_config, # Pass LoRA config directly
)
trainer.train()
```
### With Axolotl (YAML config)
```yaml
# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
lora_target_linear: true # Target all linear layers
```
### With vLLM (inference)
```python
from vllm import LLM
from vllm.lora.request import LoRARequest
# Load base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)
# Serve with adapter
outputs = llm.generate(
prompts,
lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
)
```
## Performance benchmarks
### Memory usage (Llama 3.1 8B)
| Method | GPU Memory | Trainable Params |
|--------|-----------|------------------|
| Full fine-tuning | 60+ GB | 8B (100%) |
| LoRA r=16 | 18 GB | 14M (0.17%) |
| QLoRA r=16 | 6 GB | 14M (0.17%) |
| IA3 | 16 GB | 800K (0.01%) |
### Training speed (A100 80GB)
| Method | Tokens/sec | vs Full FT |
|--------|-----------|------------|
| Full FT | 2,500 | 1x |
| LoRA | 3,200 | 1.3x |
| QLoRA | 2,100 | 0.84x |
### Quality (MMLU benchmark)
| Model | Full FT | LoRA | QLoRA |
|-------|---------|------|-------|
| Llama 2-7B | 45.3 | 44.8 | 44.1 |
| Llama 2-13B | 54.8 | 54.2 | 53.5 |
## Common issues
### CUDA OOM during training
```python
# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()
# Solution 2: Reduce batch size + increase accumulation
TrainingArguments(
per_device_train_batch_size=1,
gradient_accumulation_steps=16
)
# Solution 3: Use QLoRA
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
```
### Adapter not applying
```python
# Verify adapter is active
print(model.active_adapters) # Should show adapter name
# Check trainable parameters
model.print_trainable_parameters()
# Ensure model in training mode
model.train()
```
### Quality degradation
```python
# Increase rank
LoraConfig(r=32, lora_alpha=64)
# Target more modules
target_modules = "all-linear"
# Use more training data and epochs
TrainingArguments(num_train_epochs=5)
# Lower learning rate
TrainingArguments(learning_rate=1e-4)
```
## Best practices
1. **Start with r=8-16**, increase if quality insufficient
2. **Use alpha = 2 * rank** as starting point
3. **Target attention + MLP layers** for best quality/efficiency
4. **Enable gradient checkpointing** for memory savings
5. **Save adapters frequently** (small files, easy rollback)
6. **Evaluate on held-out data** before merging (see the sketch below)
7. **Use QLoRA for 70B+ models** on consumer hardware
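For practice 6, a quick held-out check before `merge_and_unload()` can be as simple as comparing average loss with the adapter enabled and disabled. A minimal sketch, assuming `model`, `tokenizer`, and a small list of held-out `texts` are already defined:
```python
import torch

def held_out_loss(model, tokenizer, texts, max_length=512):
    """Average causal-LM loss over a small held-out set."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            batch = tokenizer(text, return_tensors="pt", truncation=True,
                              max_length=max_length).to(model.device)
            out = model(**batch, labels=batch["input_ids"])
            losses.append(out.loss.item())
    return sum(losses) / len(losses)

adapter_loss = held_out_loss(model, tokenizer, texts)
with model.disable_adapter():  # compare against the frozen base model
    base_loss = held_out_loss(model, tokenizer, texts)
print(f"adapter: {adapter_loss:.3f} | base: {base_loss:.3f}")
```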
## References
- **[Advanced Usage](references/advanced-usage.md)** - DoRA, LoftQ, rank stabilization, custom modules
- **[Troubleshooting](references/troubleshooting.md)** - Common errors, debugging, optimization
## Resources
- **GitHub**: https://github.com/huggingface/peft
- **Docs**: https://huggingface.co/docs/peft
- **LoRA Paper**: arXiv:2106.09685
- **QLoRA Paper**: arXiv:2305.14314
- **Models**: https://huggingface.co/models?library=peft


@ -0,0 +1,514 @@
# PEFT Advanced Usage Guide
## Advanced LoRA Variants
### DoRA (Weight-Decomposed Low-Rank Adaptation)
DoRA decomposes weights into magnitude and direction components, often achieving better results than standard LoRA:
```python
from peft import LoraConfig
dora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
use_dora=True, # Enable DoRA
task_type="CAUSAL_LM"
)
model = get_peft_model(model, dora_config)
```
**When to use DoRA**:
- Consistently outperforms LoRA on instruction-following tasks
- Slightly higher memory (~10%) due to magnitude vectors
- Best for quality-critical fine-tuning
### AdaLoRA (Adaptive Rank)
Automatically adjusts rank per layer based on importance:
```python
from peft import AdaLoraConfig
adalora_config = AdaLoraConfig(
init_r=64, # Initial rank
target_r=16, # Target average rank
tinit=200, # Warmup steps
tfinal=1000, # Final pruning step
deltaT=10, # Rank update frequency
beta1=0.85,
beta2=0.85,
orth_reg_weight=0.5, # Orthogonality regularization
target_modules=["q_proj", "v_proj"],
task_type="CAUSAL_LM"
)
```
**Benefits**:
- Allocates more rank to important layers
- Can reduce total parameters while maintaining quality
- Good for exploring optimal rank distribution
### LoRA+ (Asymmetric Learning Rates)
Different learning rates for A and B matrices:
```python
from peft import LoraConfig
# LoRA+ uses higher LR for B matrix
lora_plus_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules="all-linear",
use_rslora=True, # Rank-stabilized LoRA (related technique)
)
# Manual implementation of LoRA+
from torch.optim import AdamW
# Group parameters
lora_A_params = [p for n, p in model.named_parameters() if "lora_A" in n]
lora_B_params = [p for n, p in model.named_parameters() if "lora_B" in n]
optimizer = AdamW([
{"params": lora_A_params, "lr": 1e-4},
{"params": lora_B_params, "lr": 1e-3}, # 10x higher for B
])
```
### rsLoRA (Rank-Stabilized LoRA)
Scales LoRA outputs to stabilize training with different ranks:
```python
lora_config = LoraConfig(
r=64,
lora_alpha=64,
use_rslora=True, # Enables rank-stabilized scaling
target_modules="all-linear"
)
```
**When to use**:
- When experimenting with different ranks
- Helps maintain consistent behavior across rank values
- Recommended for r > 32
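The difference is only the scaling applied to the low-rank update: assuming the rsLoRA formulation, standard LoRA scales `BA` by `alpha / r` while rsLoRA uses `alpha / sqrt(r)`, so the update does not shrink as the rank grows:
```python
import math

r, alpha = 64, 64
print("standard LoRA scale:", alpha / r)               # 1.0 - shrinks as r grows
print("rank-stabilized scale:", alpha / math.sqrt(r))  # 8.0 - decays more slowly
```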
## LoftQ (LoRA-Fine-Tuning-aware Quantization)
Initializes LoRA weights to compensate for quantization error:
```python
from peft import LoftQConfig, LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# LoftQ configuration
loftq_config = LoftQConfig(
loftq_bits=4, # Quantization bits
loftq_iter=5, # Alternating optimization iterations
)
# LoRA config with LoftQ initialization
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules="all-linear",
init_lora_weights="loftq",
loftq_config=loftq_config,
task_type="CAUSAL_LM"
)
# Load quantized model
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=bnb_config
)
model = get_peft_model(model, lora_config)
```
**Benefits over standard QLoRA**:
- Better initial quality after quantization
- Faster convergence
- ~1-2% better final accuracy on benchmarks
## Custom Module Targeting
### Target specific layers
```python
# Target only first and last transformer layers
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["model.layers.0.self_attn.q_proj",
"model.layers.0.self_attn.v_proj",
"model.layers.31.self_attn.q_proj",
"model.layers.31.self_attn.v_proj"],
layers_to_transform=[0, 31] # Alternative approach
)
```
### Layer pattern matching
```python
# Target layers 0-10 only
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules="all-linear",
layers_to_transform=list(range(11)), # Layers 0-10
layers_pattern="model.layers"
)
```
### Exclude specific layers
```python
lora_config = LoraConfig(
r=16,
target_modules="all-linear",
modules_to_save=["lm_head"], # Train these fully (not LoRA)
)
```
## Embedding and LM Head Training
### Train embeddings with LoRA
```python
from peft import LoraConfig
# Include embeddings
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "embed_tokens"], # Include embeddings
modules_to_save=["lm_head"], # Train lm_head fully
)
```
### Extending vocabulary with LoRA
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig
# Add new tokens
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
new_tokens = ["<custom_token_1>", "<custom_token_2>"]
tokenizer.add_tokens(new_tokens)
# Resize model embeddings
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model.resize_token_embeddings(len(tokenizer))
# Configure LoRA to train new embeddings
lora_config = LoraConfig(
r=16,
target_modules="all-linear",
modules_to_save=["embed_tokens", "lm_head"], # Train these fully
)
model = get_peft_model(model, lora_config)
```
## Multi-Adapter Patterns
### Adapter composition
```python
from peft import PeftModel
# Load model with multiple adapters
model = AutoPeftModelForCausalLM.from_pretrained("./base-adapter")
model.load_adapter("./style-adapter", adapter_name="style")
model.load_adapter("./task-adapter", adapter_name="task")
# Combine adapters (weighted sum)
model.add_weighted_adapter(
adapters=["style", "task"],
weights=[0.7, 0.3],
adapter_name="combined",
combination_type="linear" # or "cat", "svd"
)
model.set_adapter("combined")
```
### Adapter stacking
```python
# Stack adapters (apply sequentially)
model.add_weighted_adapter(
adapters=["base", "domain", "task"],
weights=[1.0, 1.0, 1.0],
adapter_name="stacked",
combination_type="cat" # Concatenate adapter outputs
)
```
### Dynamic adapter switching
```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

class MultiAdapterModel:
    def __init__(self, default_adapter_path, extra_adapters):
        # Load the base model with its default adapter, then attach the
        # remaining adapters from a {name: path} mapping
        self.model = AutoPeftModelForCausalLM.from_pretrained(default_adapter_path)
        # Assumes the adapter directory also contains a tokenizer;
        # otherwise load it from the base model instead
        self.tokenizer = AutoTokenizer.from_pretrained(default_adapter_path)
        for name, path in extra_adapters.items():
            self.model.load_adapter(path, adapter_name=name)

    def tokenize(self, prompt):
        return self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

    def generate(self, prompt, adapter_name="default"):
        self.model.set_adapter(adapter_name)
        return self.model.generate(**self.tokenize(prompt))

    def generate_ensemble(self, prompt, adapters, weights):
        """Average logits over several adapters with the given weights"""
        outputs = []
        for adapter, weight in zip(adapters, weights):
            self.model.set_adapter(adapter)
            logits = self.model(**self.tokenize(prompt)).logits
            outputs.append(weight * logits)
        return torch.stack(outputs).sum(dim=0)
```
## Memory Optimization
### Gradient checkpointing with LoRA
```python
from peft import prepare_model_for_kbit_training
# Enable gradient checkpointing
model = prepare_model_for_kbit_training(
model,
use_gradient_checkpointing=True,
gradient_checkpointing_kwargs={"use_reentrant": False}
)
```
### CPU offloading for training
```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Accelerator has no cpu_offload flag; optimizer-state offload is configured
# through a plugin (DeepSpeed ZeRO-2 shown here)
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, offload_optimizer_device="cpu")
accelerator = Accelerator(
    mixed_precision="bf16",
    gradient_accumulation_steps=8,
    deepspeed_plugin=deepspeed_plugin
)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```
### Memory-efficient attention with LoRA
```python
from transformers import AutoModelForCausalLM
# Combine Flash Attention 2 with LoRA
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16
)
# Apply LoRA
model = get_peft_model(model, lora_config)
```
## Inference Optimization
### Merge for deployment
```python
# Merge adapter weights into base model
merged_model = model.merge_and_unload()
# Quantize merged model for inference
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
quantized_model = AutoModelForCausalLM.from_pretrained(
"./merged-model",
quantization_config=bnb_config
)
```
### Export to different formats
```python
# Export to GGUF (llama.cpp)
# First merge, then convert
merged_model.save_pretrained("./merged-model")
# Use llama.cpp converter
# python convert-hf-to-gguf.py ./merged-model --outfile model.gguf
# Export to ONNX
from optimum.onnxruntime import ORTModelForCausalLM
ort_model = ORTModelForCausalLM.from_pretrained(
"./merged-model",
export=True
)
ort_model.save_pretrained("./onnx-model")
```
### Batch adapter inference
```python
from vllm import LLM
from vllm.lora.request import LoRARequest
# Initialize with LoRA support
llm = LLM(
model="meta-llama/Llama-3.1-8B",
enable_lora=True,
max_lora_rank=64,
max_loras=4 # Max concurrent adapters
)
# Batch with different adapters
requests = [
("prompt1", LoRARequest("adapter1", 1, "./adapter1")),
("prompt2", LoRARequest("adapter2", 2, "./adapter2")),
("prompt3", LoRARequest("adapter1", 1, "./adapter1")),
]
outputs = llm.generate(
[r[0] for r in requests],
lora_request=[r[1] for r in requests]
)
```
## Training Recipes
### Instruction tuning recipe
```python
lora_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules="all-linear",
bias="none",
task_type="CAUSAL_LM"
)
training_args = TrainingArguments(
output_dir="./output",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.03,
bf16=True,
logging_steps=10,
save_strategy="steps",
save_steps=100,
eval_strategy="steps",
eval_steps=100,
)
```
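To make the recipe runnable end to end, the config and arguments above can be handed to TRL's `SFTTrainer`; a base `model` and `train_dataset`/`eval_dataset` with a `"text"` column are assumed here:
```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    peft_config=lora_config,        # LoRA is applied inside the trainer
    train_dataset=train_dataset,    # assumed to exist with a "text" column
    eval_dataset=eval_dataset,
    dataset_text_field="text",
)
trainer.train()
trainer.save_model("./output/adapter")  # saves only the adapter weights
```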
### Code generation recipe
```python
lora_config = LoraConfig(
r=32, # Higher rank for code
lora_alpha=64,
lora_dropout=0.1,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
bias="none",
task_type="CAUSAL_LM"
)
training_args = TrainingArguments(
    output_dir="./code-lora",
    learning_rate=1e-4,  # Lower LR for code
    num_train_epochs=2,
)
# Longer sequences: pass max_seq_length=2048 to SFTTrainer/SFTConfig;
# it is not a TrainingArguments field
```
### Conversational/Chat recipe
```python
from trl import SFTTrainer
lora_config = LoraConfig(
r=16,
lora_alpha=16, # alpha = r for chat
lora_dropout=0.05,
target_modules="all-linear"
)
# Use chat template (Dataset.map functions must return a dict, so wrap the
# formatted string in a "text" column)
def format_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]}
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
trainer = SFTTrainer(
    model=model,
    peft_config=lora_config,
    train_dataset=dataset.map(format_chat),
    dataset_text_field="text",
    max_seq_length=1024,
)
```
## Debugging and Validation
### Verify adapter application
```python
# Check which modules have LoRA
for name, module in model.named_modules():
if hasattr(module, "lora_A"):
print(f"LoRA applied to: {name}")
# Print detailed config
print(model.peft_config)
# Check adapter state
print(f"Active adapters: {model.active_adapters}")
print(f"Trainable: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
```
### Compare with base model
```python
# Generate with adapter
model.set_adapter("default")
adapter_output = model.generate(**inputs)
# Generate without adapter
with model.disable_adapter():
base_output = model.generate(**inputs)
print(f"Adapter: {tokenizer.decode(adapter_output[0])}")
print(f"Base: {tokenizer.decode(base_output[0])}")
```
### Monitor training metrics
```python
from transformers import TrainerCallback
class LoRACallback(TrainerCallback):
def on_log(self, args, state, control, logs=None, **kwargs):
if "loss" in logs:
# Log adapter-specific metrics
model = kwargs["model"]
lora_params = sum(p.numel() for n, p in model.named_parameters()
if "lora" in n and p.requires_grad)
print(f"Step {state.global_step}: loss={logs['loss']:.4f}, lora_params={lora_params}")
```

View file

@ -0,0 +1,480 @@
# PEFT Troubleshooting Guide
## Installation Issues
### bitsandbytes CUDA Error
**Error**: `CUDA Setup failed despite GPU being available`
**Fix**:
```bash
# Check CUDA version
nvcc --version
# Install matching bitsandbytes
pip uninstall bitsandbytes
pip install bitsandbytes --no-cache-dir
# Or compile from source for specific CUDA
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=118 make cuda11x # Adjust for your CUDA
pip install .
```
### Triton Import Error
**Error**: `ModuleNotFoundError: No module named 'triton'`
**Fix**:
```bash
# Install triton (Linux only)
pip install triton
# Windows: Triton is not officially supported; run under WSL2,
# or use code paths that do not require Triton kernels
```
### PEFT Version Conflicts
**Error**: `AttributeError: 'LoraConfig' object has no attribute 'use_dora'`
**Fix**:
```bash
# Upgrade to latest PEFT
pip install "peft>=0.13.0" --upgrade  # quote the spec so the shell does not treat > as redirection
# Check version
python -c "import peft; print(peft.__version__)"
```
## Training Issues
### CUDA Out of Memory
**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
**Solutions**:
1. **Enable gradient checkpointing**:
```python
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
```
2. **Reduce batch size**:
```python
TrainingArguments(
per_device_train_batch_size=1,
gradient_accumulation_steps=16 # Maintain effective batch size
)
```
3. **Use QLoRA**:
```python
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
```
4. **Lower LoRA rank**:
```python
LoraConfig(r=8) # Instead of r=16 or higher
```
5. **Target fewer modules**:
```python
target_modules=["q_proj", "v_proj"] # Instead of all-linear
```
### Loss Not Decreasing
**Problem**: Training loss stays flat or increases.
**Solutions**:
1. **Check learning rate**:
```python
# Start lower
TrainingArguments(learning_rate=1e-4) # Not 2e-4 or higher
```
2. **Verify adapter is active**:
```python
model.print_trainable_parameters()
# Should show >0 trainable params
# Check adapter applied
print(model.peft_config)
```
3. **Check data formatting**:
```python
# Verify tokenization
sample = dataset[0]
decoded = tokenizer.decode(sample["input_ids"])
print(decoded) # Should look correct
```
4. **Increase rank**:
```python
LoraConfig(r=32, lora_alpha=64) # More capacity
```
### NaN Loss
**Error**: `Loss is NaN`
**Fix**:
```python
# Use bf16 instead of fp16
TrainingArguments(bf16=True, fp16=False)
# Or keep fp16 and rely on the Trainer's automatic dynamic loss scaling
TrainingArguments(fp16=True)
# Lower learning rate
TrainingArguments(learning_rate=5e-5)
# Check for data issues
for batch in dataloader:
if torch.isnan(batch["input_ids"].float()).any():
print("NaN in input!")
```
### Adapter Not Training
**Problem**: `trainable params: 0` or model not updating.
**Fix**:
```python
# Verify LoRA applied to correct modules
for name, module in model.named_modules():
if "lora" in name.lower():
print(f"Found LoRA: {name}")
# Check target_modules match model architecture
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING
print(TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING.get(model.config.model_type))
# Ensure model in training mode
model.train()
# Check requires_grad
for name, param in model.named_parameters():
if param.requires_grad:
print(f"Trainable: {name}")
```
## Loading Issues
### Adapter Loading Fails
**Error**: `ValueError: Can't find adapter weights`
**Fix**:
```python
# Check adapter files exist
import os
print(os.listdir("./adapter-path"))
# Should contain: adapter_config.json, adapter_model.safetensors
# Load with correct structure
from peft import PeftModel, PeftConfig
# Check config
config = PeftConfig.from_pretrained("./adapter-path")
print(config)
# Load base model first
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "./adapter-path")
```
### Base Model Mismatch
**Error**: `RuntimeError: size mismatch`
**Fix**:
```python
# Ensure base model matches adapter
from peft import PeftConfig
config = PeftConfig.from_pretrained("./adapter-path")
print(f"Base model: {config.base_model_name_or_path}")
# Load exact same base model
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
```
### Safetensors vs PyTorch Format
**Error**: `ValueError: We couldn't connect to 'https://huggingface.co'`
**Fix**:
```python
# Force local loading
model = PeftModel.from_pretrained(
base_model,
"./adapter-path",
local_files_only=True
)
# Or specify format
model.save_pretrained("./adapter", safe_serialization=True) # safetensors
model.save_pretrained("./adapter", safe_serialization=False) # pytorch
```
## Inference Issues
### Slow Generation
**Problem**: Inference much slower than expected.
**Solutions**:
1. **Merge adapter for deployment**:
```python
merged_model = model.merge_and_unload()
# No adapter overhead during inference
```
2. **Use optimized inference engine**:
```python
from vllm import LLM
llm = LLM(model="./merged-model", dtype="half")
```
3. **Enable Flash Attention**:
```python
model = AutoModelForCausalLM.from_pretrained(
model_name,
attn_implementation="flash_attention_2"
)
```
### Output Quality Issues
**Problem**: Fine-tuned model produces worse outputs.
**Solutions**:
1. **Check evaluation without adapter**:
```python
with model.disable_adapter():
base_output = model.generate(**inputs)
# Compare with adapter output
```
2. **Lower temperature during eval**:
```python
model.generate(**inputs, temperature=0.1, do_sample=True)  # or do_sample=False for greedy decoding
```
3. **Retrain with more data**:
```python
# Increase training samples
# Use higher quality data
# Train for more epochs
```
### Wrong Adapter Active
**Problem**: Model using wrong adapter or no adapter.
**Fix**:
```python
# Check active adapters
print(model.active_adapters)
# Explicitly set adapter
model.set_adapter("your-adapter-name")
# List all adapters
print(model.peft_config.keys())
```
## QLoRA Specific Issues
### Quantization Errors
**Error**: `RuntimeError: mat1 and mat2 shapes cannot be multiplied`
**Fix**:
```python
# Ensure compute dtype matches
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16, # Match model dtype
bnb_4bit_quant_type="nf4"
)
# Load with correct dtype
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
torch_dtype=torch.bfloat16
)
```
### QLoRA OOM
**Error**: OOM even with 4-bit quantization.
**Fix**:
```python
# Enable double quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True # Further memory reduction
)
# Use offloading
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
max_memory={0: "20GB", "cpu": "100GB"}
)
```
### QLoRA Merge Fails
**Error**: `RuntimeError: expected scalar type BFloat16 but found Float`
**Fix**:
```python
# Dequantize before merging
from peft import PeftModel
# Load in higher precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16, # Not quantized
device_map="auto"
)
# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
# Now merge
merged = model.merge_and_unload()
```
## Multi-Adapter Issues
### Adapter Conflict
**Error**: `ValueError: Adapter with name 'default' already exists`
**Fix**:
```python
# Use unique names
model.load_adapter("./adapter1", adapter_name="task1")
model.load_adapter("./adapter2", adapter_name="task2")
# Or delete existing
model.delete_adapter("default")
```
### Mixed Precision Adapters
**Error**: Adapters trained with different dtypes.
**Fix**:
```python
# Convert adapter precision
model = PeftModel.from_pretrained(base_model, "./adapter")
model = model.to(torch.bfloat16)
# Or load with specific dtype
model = PeftModel.from_pretrained(
base_model,
"./adapter",
torch_dtype=torch.bfloat16
)
```
## Performance Optimization
### Memory Profiling
```python
import torch
def print_memory():
if torch.cuda.is_available():
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f"Allocated: {allocated:.2f}GB, Reserved: {reserved:.2f}GB")
# Profile during training
print_memory() # Before
model.train()
loss = model(**batch).loss
loss.backward()
print_memory() # After
```
### Speed Profiling
```python
import time
import torch
def benchmark_generation(model, tokenizer, prompt, n_runs=5):
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Warmup
model.generate(**inputs, max_new_tokens=10)
torch.cuda.synchronize()
# Benchmark
times = []
for _ in range(n_runs):
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()
times.append(time.perf_counter() - start)
tokens = outputs.shape[1] - inputs.input_ids.shape[1]
avg_time = sum(times) / len(times)
print(f"Speed: {tokens/avg_time:.2f} tokens/sec")
# Compare adapter vs merged
benchmark_generation(adapter_model, tokenizer, "Hello")
benchmark_generation(merged_model, tokenizer, "Hello")
```
## Getting Help
1. **Check PEFT GitHub Issues**: https://github.com/huggingface/peft/issues
2. **HuggingFace Forums**: https://discuss.huggingface.co/
3. **PEFT Documentation**: https://huggingface.co/docs/peft
### Debugging Template
When reporting issues, include:
```python
# System info
import peft
import transformers
import torch
print(f"PEFT: {peft.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.version.cuda}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")
# Config
print(model.peft_config)
model.print_trainable_parameters()
```

File diff suppressed because one or more lines are too long

View file

@ -0,0 +1,7 @@
# Pytorch-Fsdp Documentation Index
## Categories
### Other
**File:** `other.md`
**Pages:** 15

File diff suppressed because it is too large

View file

@ -0,0 +1,522 @@
---
name: stable-diffusion-image-generation
description: State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use when generating images from text prompts, performing image-to-image translation, inpainting, or building custom diffusion pipelines.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [diffusers>=0.30.0, transformers>=4.41.0, accelerate>=0.31.0, torch>=2.0.0]
metadata:
hermes:
tags: [Image Generation, Stable Diffusion, Diffusers, Text-to-Image, Multimodal, Computer Vision]
---
# Stable Diffusion Image Generation
Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.
## When to use Stable Diffusion
**Use Stable Diffusion when:**
- Generating images from text descriptions
- Performing image-to-image translation (style transfer, enhancement)
- Inpainting (filling in masked regions)
- Outpainting (extending images beyond boundaries)
- Creating variations of existing images
- Building custom image generation workflows
**Key features:**
- **Text-to-Image**: Generate images from natural language prompts
- **Image-to-Image**: Transform existing images with text guidance
- **Inpainting**: Fill masked regions with context-aware content
- **ControlNet**: Add spatial conditioning (edges, poses, depth)
- **LoRA Support**: Efficient fine-tuning and style adaptation
- **Multiple Models**: SD 1.5, SDXL, SD 3.0, Flux support
**Use alternatives instead:**
- **DALL-E 3**: For API-based generation without GPU
- **Midjourney**: For artistic, stylized outputs
- **Imagen**: For Google Cloud integration
- **Leonardo.ai**: For web-based creative workflows
## Quick start
### Installation
```bash
pip install diffusers transformers accelerate torch
pip install xformers # Optional: memory-efficient attention
```
### Basic text-to-image
```python
from diffusers import DiffusionPipeline
import torch
# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
torch_dtype=torch.float16
)
pipe.to("cuda")
# Generate image
image = pipe(
"A serene mountain landscape at sunset, highly detailed",
num_inference_steps=50,
guidance_scale=7.5
).images[0]
image.save("output.png")
```
### Using SDXL (higher quality)
```python
from diffusers import AutoPipelineForText2Image
import torch
pipe = AutoPipelineForText2Image.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16"
)
pipe.to("cuda")
# Enable memory optimization
pipe.enable_model_cpu_offload()
image = pipe(
prompt="A futuristic city with flying cars, cinematic lighting",
height=1024,
width=1024,
num_inference_steps=30
).images[0]
```
## Architecture overview
### Three-pillar design
Diffusers is built around three core components:
```
Pipeline (orchestration)
├── Model (neural networks)
│ ├── UNet / Transformer (noise prediction)
│ ├── VAE (latent encoding/decoding)
│ └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)
```
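The same three pillars are exposed as attributes on any loaded pipeline; a quick way to inspect them (model ID as used in the quick start):
```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
print(type(pipe.unet).__name__)          # UNet2DConditionModel - noise prediction
print(type(pipe.vae).__name__)           # AutoencoderKL - latent encode/decode
print(type(pipe.text_encoder).__name__)  # CLIPTextModel - prompt encoding
print(type(pipe.scheduler).__name__)     # e.g. PNDMScheduler - denoising algorithm
```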
### Pipeline inference flow
```
Text Prompt → Text Encoder → Text Embeddings
                                   ↓
Random Noise → [Denoising Loop] ← Scheduler
                     ↓
              Predicted Noise
                     ↓
          VAE Decoder → Final Image
```
## Core concepts
### Pipelines
Pipelines orchestrate complete workflows:
| Pipeline | Purpose |
|----------|---------|
| `StableDiffusionPipeline` | Text-to-image (SD 1.x/2.x) |
| `StableDiffusionXLPipeline` | Text-to-image (SDXL) |
| `StableDiffusion3Pipeline` | Text-to-image (SD 3.0) |
| `FluxPipeline` | Text-to-image (Flux models) |
| `StableDiffusionImg2ImgPipeline` | Image-to-image |
| `StableDiffusionInpaintPipeline` | Inpainting |
### Schedulers
Schedulers control the denoising process:
| Scheduler | Steps | Quality | Use Case |
|-----------|-------|---------|----------|
| `EulerDiscreteScheduler` | 20-50 | Good | Default choice |
| `EulerAncestralDiscreteScheduler` | 20-50 | Good | More variation |
| `DPMSolverMultistepScheduler` | 15-25 | Excellent | Fast, high quality |
| `DDIMScheduler` | 50-100 | Good | Deterministic |
| `LCMScheduler` | 4-8 | Good | Very fast |
| `UniPCMultistepScheduler` | 15-25 | Excellent | Fast convergence |
### Swapping schedulers
```python
from diffusers import DPMSolverMultistepScheduler
# Swap for faster generation
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
pipe.scheduler.config
)
# Now generate with fewer steps
image = pipe(prompt, num_inference_steps=20).images[0]
```
## Generation parameters
### Key parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `prompt` | Required | Text description of desired image |
| `negative_prompt` | None | What to avoid in the image |
| `num_inference_steps` | 50 | Denoising steps (more = better quality) |
| `guidance_scale` | 7.5 | Prompt adherence (7-12 typical) |
| `height`, `width` | 512/1024 | Output dimensions (multiples of 8) |
| `generator` | None | Torch generator for reproducibility |
| `num_images_per_prompt` | 1 | Batch size |
### Reproducible generation
```python
import torch
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
prompt="A cat wearing a top hat",
generator=generator,
num_inference_steps=50
).images[0]
```
### Negative prompts
```python
image = pipe(
prompt="Professional photo of a dog in a garden",
negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
guidance_scale=7.5
).images[0]
```
## Image-to-image
Transform existing images with text guidance:
```python
from diffusers import AutoPipelineForImage2Image
from PIL import Image
pipe = AutoPipelineForImage2Image.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
torch_dtype=torch.float16
).to("cuda")
init_image = Image.open("input.jpg").resize((512, 512))
image = pipe(
prompt="A watercolor painting of the scene",
image=init_image,
strength=0.75, # How much to transform (0-1)
num_inference_steps=50
).images[0]
```
## Inpainting
Fill masked regions:
```python
from diffusers import AutoPipelineForInpainting
from PIL import Image
pipe = AutoPipelineForInpainting.from_pretrained(
"runwayml/stable-diffusion-inpainting",
torch_dtype=torch.float16
).to("cuda")
image = Image.open("photo.jpg")
mask = Image.open("mask.png") # White = inpaint region
result = pipe(
prompt="A red car parked on the street",
image=image,
mask_image=mask,
num_inference_steps=50
).images[0]
```
## ControlNet
Add spatial conditioning for precise control:
```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
"lllyasviel/control_v11p_sd15_canny",
torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
controlnet=controlnet,
torch_dtype=torch.float16
).to("cuda")
# Use Canny edge image as control
control_image = get_canny_image(input_image)
image = pipe(
prompt="A beautiful house in the style of Van Gogh",
image=control_image,
num_inference_steps=30
).images[0]
```
### Available ControlNets
| ControlNet | Input Type | Use Case |
|------------|------------|----------|
| `canny` | Edge maps | Preserve structure |
| `openpose` | Pose skeletons | Human poses |
| `depth` | Depth maps | 3D-aware generation |
| `normal` | Normal maps | Surface details |
| `mlsd` | Line segments | Architectural lines |
| `scribble` | Rough sketches | Sketch-to-image |
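`get_canny_image` in the example above is a user-side helper rather than a Diffusers API; one possible sketch using the `controlnet_aux` package (a separate install, also used in the troubleshooting guide):
```python
from controlnet_aux import CannyDetector
from PIL import Image

def get_canny_image(input_image: Image.Image) -> Image.Image:
    # Run Canny edge detection and match the generation resolution
    canny = CannyDetector()
    edges = canny(input_image, low_threshold=100, high_threshold=200)
    return edges.convert("RGB").resize((512, 512))
```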
## LoRA adapters
Load fine-tuned style adapters:
```python
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
torch_dtype=torch.float16
).to("cuda")
# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")
# Generate with LoRA style
image = pipe("A portrait in the trained style").images[0]
# Adjust LoRA strength
pipe.fuse_lora(lora_scale=0.8)
# Unload LoRA
pipe.unload_lora_weights()
```
### Multiple LoRAs
```python
# Load multiple LoRAs
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")
# Set weights for each
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])
image = pipe("A portrait").images[0]
```
## Memory optimization
### Enable CPU offloading
```python
# Model CPU offload - moves models to CPU when not in use
pipe.enable_model_cpu_offload()
# Sequential CPU offload - more aggressive, slower
pipe.enable_sequential_cpu_offload()
```
### Attention slicing
```python
# Reduce memory by computing attention in chunks
pipe.enable_attention_slicing()
# Or specific chunk size
pipe.enable_attention_slicing("max")
```
### xFormers memory-efficient attention
```python
# Requires xformers package
pipe.enable_xformers_memory_efficient_attention()
```
### VAE slicing for large images
```python
# Decode latents in tiles for large images
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
```
## Model variants
### Loading different precisions
```python
# FP16 (recommended for GPU)
pipe = DiffusionPipeline.from_pretrained(
"model-id",
torch_dtype=torch.float16,
variant="fp16"
)
# BF16 (wider dynamic range than FP16; requires Ampere or newer GPUs)
pipe = DiffusionPipeline.from_pretrained(
"model-id",
torch_dtype=torch.bfloat16
)
```
### Loading specific components
```python
from diffusers import UNet2DConditionModel, AutoencoderKL
# Load custom VAE
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
# Use with pipeline
pipe = DiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
vae=vae,
torch_dtype=torch.float16
)
```
## Batch generation
Generate multiple images efficiently:
```python
# Multiple prompts
prompts = [
"A cat playing piano",
"A dog reading a book",
"A bird painting a picture"
]
images = pipe(prompts, num_inference_steps=30).images
# Multiple images per prompt
images = pipe(
"A beautiful sunset",
num_images_per_prompt=4,
num_inference_steps=30
).images
```
## Common workflows
### Workflow 1: High-quality generation
```python
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch
# 1. Load SDXL with optimizations
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16"
)
pipe.to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
# 2. Generate with quality settings
image = pipe(
prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
negative_prompt="blurry, low quality, cartoon, anime, sketch",
num_inference_steps=30,
guidance_scale=7.5,
height=1024,
width=1024
).images[0]
```
### Workflow 2: Fast prototyping
```python
from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch
# Use LCM for 4-8 step generation
pipe = AutoPipelineForText2Image.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16
).to("cuda")
# Load LCM LoRA for fast generation
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()
# Generate in ~1 second
image = pipe(
"A beautiful landscape",
num_inference_steps=4,
guidance_scale=1.0
).images[0]
```
## Common issues
**CUDA out of memory:**
```python
# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
# Or use lower precision
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
```
**Black/noise images:**
```python
# Check VAE configuration
# Use safety checker bypass if needed
pipe.safety_checker = None
# Ensure proper dtype consistency
pipe = pipe.to(dtype=torch.float16)
```
**Slow generation:**
```python
# Use faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]
```
## References
- **[Advanced Usage](references/advanced-usage.md)** - Custom pipelines, fine-tuning, deployment
- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions
## Resources
- **Documentation**: https://huggingface.co/docs/diffusers
- **Repository**: https://github.com/huggingface/diffusers
- **Model Hub**: https://huggingface.co/models?library=diffusers
- **Discord**: https://discord.gg/diffusers

View file

@ -0,0 +1,716 @@
# Stable Diffusion Advanced Usage Guide
## Custom Pipelines
### Building from components
```python
from diffusers import (
UNet2DConditionModel,
AutoencoderKL,
DDPMScheduler,
StableDiffusionPipeline
)
from transformers import CLIPTextModel, CLIPTokenizer
import torch
# Load components individually
unet = UNet2DConditionModel.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
subfolder="unet"
)
vae = AutoencoderKL.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
subfolder="vae"
)
text_encoder = CLIPTextModel.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
subfolder="text_encoder"
)
tokenizer = CLIPTokenizer.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
subfolder="tokenizer"
)
scheduler = DDPMScheduler.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
subfolder="scheduler"
)
# Assemble pipeline
pipe = StableDiffusionPipeline(
unet=unet,
vae=vae,
text_encoder=text_encoder,
tokenizer=tokenizer,
scheduler=scheduler,
safety_checker=None,
feature_extractor=None,
requires_safety_checker=False
)
```
### Custom denoising loop
```python
from diffusers import DDIMScheduler, AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer
from PIL import Image
import torch
def custom_generate(
prompt: str,
num_steps: int = 50,
guidance_scale: float = 7.5,
height: int = 512,
width: int = 512
):
# Load components
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
unet = UNet2DConditionModel.from_pretrained("sd-model", subfolder="unet")
vae = AutoencoderKL.from_pretrained("sd-model", subfolder="vae")
scheduler = DDIMScheduler.from_pretrained("sd-model", subfolder="scheduler")
device = "cuda"
text_encoder.to(device)
unet.to(device)
vae.to(device)
# Encode prompt
text_input = tokenizer(
prompt,
padding="max_length",
max_length=77,
truncation=True,
return_tensors="pt"
)
text_embeddings = text_encoder(text_input.input_ids.to(device))[0]
# Unconditional embeddings for classifier-free guidance
uncond_input = tokenizer(
"",
padding="max_length",
max_length=77,
return_tensors="pt"
)
uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]
# Concatenate for batch processing
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
# Initialize latents
latents = torch.randn(
(1, 4, height // 8, width // 8),
device=device
)
latents = latents * scheduler.init_noise_sigma
# Denoising loop
scheduler.set_timesteps(num_steps)
for t in scheduler.timesteps:
latent_model_input = torch.cat([latents] * 2)
latent_model_input = scheduler.scale_model_input(latent_model_input, t)
# Predict noise
with torch.no_grad():
noise_pred = unet(
latent_model_input,
t,
encoder_hidden_states=text_embeddings
).sample
# Classifier-free guidance
noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (
noise_pred_cond - noise_pred_uncond
)
# Update latents
latents = scheduler.step(noise_pred, t, latents).prev_sample
# Decode latents
latents = latents / vae.config.scaling_factor
with torch.no_grad():
image = vae.decode(latents).sample
# Convert to PIL
image = (image / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()
image = (image * 255).round().astype("uint8")[0]
return Image.fromarray(image)
```
## IP-Adapter
Use image prompts alongside text:
```python
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image
import torch
pipe = StableDiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
torch_dtype=torch.float16
).to("cuda")
# Load IP-Adapter
pipe.load_ip_adapter(
"h94/IP-Adapter",
subfolder="models",
weight_name="ip-adapter_sd15.bin"
)
# Set IP-Adapter scale
pipe.set_ip_adapter_scale(0.6)
# Load reference image
ip_image = load_image("reference_style.jpg")
# Generate with image + text prompt
image = pipe(
prompt="A portrait in a garden",
ip_adapter_image=ip_image,
num_inference_steps=50
).images[0]
```
### Multiple IP-Adapter images
```python
# Use multiple reference images
pipe.set_ip_adapter_scale([0.5, 0.7])
images = [
load_image("style_reference.jpg"),
load_image("composition_reference.jpg")
]
result = pipe(
prompt="A landscape painting",
ip_adapter_image=images,
num_inference_steps=50
).images[0]
```
## SDXL Refiner
Two-stage generation for higher quality:
```python
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch
# Load base model
base = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16"
).to("cuda")
# Load refiner
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-refiner-1.0",
torch_dtype=torch.float16,
variant="fp16"
).to("cuda")
# Generate with base (partial denoising)
image = base(
prompt="A majestic eagle soaring over mountains",
num_inference_steps=40,
denoising_end=0.8,
output_type="latent"
).images
# Refine with refiner
refined = refiner(
prompt="A majestic eagle soaring over mountains",
image=image,
num_inference_steps=40,
denoising_start=0.8
).images[0]
```
## T2I-Adapter
Lightweight conditioning without full ControlNet:
```python
from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter
import torch
# Load adapter
adapter = T2IAdapter.from_pretrained(
"TencentARC/t2i-adapter-canny-sdxl-1.0",
torch_dtype=torch.float16
)
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
adapter=adapter,
torch_dtype=torch.float16
).to("cuda")
# Get canny edges
canny_image = get_canny_image(input_image)
image = pipe(
prompt="A colorful anime character",
image=canny_image,
num_inference_steps=30,
adapter_conditioning_scale=0.8
).images[0]
```
## Fine-tuning with DreamBooth
Train on custom subjects:
```python
from diffusers import StableDiffusionPipeline, DDPMScheduler
from diffusers.optimization import get_scheduler
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import numpy as np
import os
class DreamBoothDataset(Dataset):
def __init__(self, instance_images_path, instance_prompt, tokenizer, size=512):
self.instance_images_path = instance_images_path
self.instance_prompt = instance_prompt
self.tokenizer = tokenizer
self.size = size
self.instance_images = [
os.path.join(instance_images_path, f)
for f in os.listdir(instance_images_path)
if f.endswith(('.png', '.jpg', '.jpeg'))
]
def __len__(self):
return len(self.instance_images)
def __getitem__(self, idx):
image = Image.open(self.instance_images[idx]).convert("RGB")
image = image.resize((self.size, self.size))
image = torch.tensor(np.array(image)).permute(2, 0, 1) / 127.5 - 1.0
tokens = self.tokenizer(
self.instance_prompt,
padding="max_length",
max_length=77,
truncation=True,
return_tensors="pt"
)
return {"image": image, "input_ids": tokens.input_ids.squeeze()}
def train_dreambooth(
pretrained_model: str,
instance_data_dir: str,
instance_prompt: str,
output_dir: str,
learning_rate: float = 5e-6,
max_train_steps: int = 800,
train_batch_size: int = 1
):
# Load pipeline
pipe = StableDiffusionPipeline.from_pretrained(pretrained_model)
unet = pipe.unet
vae = pipe.vae
text_encoder = pipe.text_encoder
tokenizer = pipe.tokenizer
noise_scheduler = DDPMScheduler.from_pretrained(pretrained_model, subfolder="scheduler")
# Freeze VAE and text encoder
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
# Create dataset
dataset = DreamBoothDataset(
instance_data_dir, instance_prompt, tokenizer
)
dataloader = DataLoader(dataset, batch_size=train_batch_size, shuffle=True)
# Setup optimizer
optimizer = torch.optim.AdamW(unet.parameters(), lr=learning_rate)
lr_scheduler = get_scheduler(
"constant",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=max_train_steps
)
# Training loop
unet.train()
device = "cuda"
unet.to(device)
vae.to(device)
text_encoder.to(device)
global_step = 0
for epoch in range(max_train_steps // len(dataloader) + 1):
for batch in dataloader:
if global_step >= max_train_steps:
break
# Encode images to latents
latents = vae.encode(batch["image"].to(device)).latent_dist.sample()
latents = latents * vae.config.scaling_factor
# Sample noise
noise = torch.randn_like(latents)
timesteps = torch.randint(0, noise_scheduler.num_train_timesteps, (latents.shape[0],))
timesteps = timesteps.to(device)
# Add noise
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
# Get text embeddings
encoder_hidden_states = text_encoder(batch["input_ids"].to(device))[0]
# Predict noise
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
# Compute loss
loss = torch.nn.functional.mse_loss(noise_pred, noise)
# Backprop
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
global_step += 1
if global_step % 100 == 0:
print(f"Step {global_step}, Loss: {loss.item():.4f}")
# Save model
pipe.unet = unet
pipe.save_pretrained(output_dir)
```
## LoRA Training
Efficient fine-tuning with Low-Rank Adaptation:
```python
from peft import LoraConfig, get_peft_model
from diffusers import StableDiffusionPipeline
import torch
def train_lora(
base_model: str,
train_dataset,
output_dir: str,
lora_rank: int = 4,
learning_rate: float = 1e-4,
max_train_steps: int = 1000
):
pipe = StableDiffusionPipeline.from_pretrained(base_model)
unet = pipe.unet
# Configure LoRA
lora_config = LoraConfig(
r=lora_rank,
lora_alpha=lora_rank,
target_modules=["to_q", "to_v", "to_k", "to_out.0"],
lora_dropout=0.1
)
# Apply LoRA to UNet
unet = get_peft_model(unet, lora_config)
unet.print_trainable_parameters() # Shows ~0.1% trainable
# Train (similar to DreamBooth but only LoRA params)
optimizer = torch.optim.AdamW(
unet.parameters(),
lr=learning_rate
)
# ... training loop ...
# Save LoRA weights only
unet.save_pretrained(output_dir)
```
## Textual Inversion
Learn new concepts through embeddings:
```python
from diffusers import StableDiffusionPipeline
import torch
# Load with textual inversion
pipe = StableDiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
torch_dtype=torch.float16
).to("cuda")
# Load learned embedding
pipe.load_textual_inversion(
"sd-concepts-library/cat-toy",
token="<cat-toy>"
)
# Use in prompts
image = pipe("A photo of <cat-toy> on a beach").images[0]
```
## Quantization
Reduce memory with quantization:
```python
from diffusers import BitsAndBytesConfig, StableDiffusionXLPipeline
import torch
# 8-bit quantization
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
quantization_config=quantization_config,
torch_dtype=torch.float16
)
```
### NF4 quantization (4-bit)
```python
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
quantization_config=quantization_config
)
```
## Production Deployment
### FastAPI server
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from diffusers import DiffusionPipeline
import torch
import base64
from io import BytesIO
app = FastAPI()
# Load model at startup
pipe = DiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
torch_dtype=torch.float16
).to("cuda")
pipe.enable_model_cpu_offload()
class GenerationRequest(BaseModel):
prompt: str
negative_prompt: str = ""
num_inference_steps: int = 30
guidance_scale: float = 7.5
width: int = 512
height: int = 512
seed: int = None
class GenerationResponse(BaseModel):
image_base64: str
seed: int
@app.post("/generate", response_model=GenerationResponse)
async def generate(request: GenerationRequest):
try:
        seed = request.seed if request.seed is not None else torch.randint(0, 2**32, (1,)).item()
        generator = torch.Generator("cuda").manual_seed(seed)
image = pipe(
prompt=request.prompt,
negative_prompt=request.negative_prompt,
num_inference_steps=request.num_inference_steps,
guidance_scale=request.guidance_scale,
width=request.width,
height=request.height,
generator=generator
).images[0]
# Convert to base64
buffer = BytesIO()
image.save(buffer, format="PNG")
image_base64 = base64.b64encode(buffer.getvalue()).decode()
return GenerationResponse(image_base64=image_base64, seed=seed)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.get("/health")
async def health():
return {"status": "healthy"}
```
### Docker deployment
```dockerfile
FROM nvidia/cuda:12.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip
WORKDIR /app
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . .
# Pre-download model
RUN python3 -c "from diffusers import DiffusionPipeline; DiffusionPipeline.from_pretrained('stable-diffusion-v1-5/stable-diffusion-v1-5')"
EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Kubernetes deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: stable-diffusion
spec:
replicas: 2
selector:
matchLabels:
app: stable-diffusion
template:
metadata:
labels:
app: stable-diffusion
spec:
containers:
- name: sd
image: your-registry/stable-diffusion:latest
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 1
memory: "16Gi"
requests:
nvidia.com/gpu: 1
memory: "8Gi"
env:
- name: TRANSFORMERS_CACHE
value: "/cache/huggingface"
volumeMounts:
- name: model-cache
mountPath: /cache
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
name: stable-diffusion
spec:
selector:
app: stable-diffusion
ports:
- port: 80
targetPort: 8000
type: LoadBalancer
```
## Callback System
Monitor and modify generation:
```python
from diffusers import StableDiffusionPipeline
from diffusers.callbacks import PipelineCallback
import torch
class ProgressCallback(PipelineCallback):
def __init__(self):
self.progress = []
def callback_fn(self, pipe, step_index, timestep, callback_kwargs):
self.progress.append({
"step": step_index,
"timestep": timestep.item()
})
# Optionally modify latents
latents = callback_kwargs["latents"]
return callback_kwargs
# Use callback
callback = ProgressCallback()
image = pipe(
prompt="A sunset",
callback_on_step_end=callback.callback_fn,
callback_on_step_end_tensor_inputs=["latents"]
).images[0]
print(f"Generation completed in {len(callback.progress)} steps")
```
### Early stopping
```python
def early_stop_callback(pipe, step_index, timestep, callback_kwargs):
# Stop after 20 steps
if step_index >= 20:
pipe._interrupt = True
return callback_kwargs
image = pipe(
prompt="A landscape",
num_inference_steps=50,
callback_on_step_end=early_stop_callback
).images[0]
```
## Multi-GPU Inference
### Automatic device placement
```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    device_map="balanced",  # pipeline-level device maps currently support the "balanced" strategy
    torch_dtype=torch.float16
)
```
### Manual distribution
```python
from accelerate import infer_auto_device_map, dispatch_model
# Create device map
device_map = infer_auto_device_map(
pipe.unet,
max_memory={0: "10GiB", 1: "10GiB"}
)
# Dispatch model
pipe.unet = dispatch_model(pipe.unet, device_map=device_map)
```

View file

@ -0,0 +1,555 @@
# Stable Diffusion Troubleshooting Guide
## Installation Issues
### Package conflicts
**Error**: `ImportError: cannot import name 'cached_download' from 'huggingface_hub'`
**Fix**:
```bash
# Update huggingface_hub
pip install --upgrade huggingface_hub
# Reinstall diffusers
pip install --upgrade diffusers
```
### xFormers installation fails
**Error**: `RuntimeError: CUDA error: no kernel image is available for execution`
**Fix**:
```bash
# Check CUDA version
nvcc --version
# Install matching xformers
pip install xformers --index-url https://download.pytorch.org/whl/cu121 # For CUDA 12.1
# Or build from source
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
```
### Torch/CUDA mismatch
**Error**: `RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED`
**Fix**:
```bash
# Check versions
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Reinstall PyTorch with correct CUDA
pip uninstall torch torchvision
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
```
## Memory Issues
### CUDA out of memory
**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
**Solutions**:
```python
# Solution 1: Enable CPU offloading
pipe.enable_model_cpu_offload()
# Solution 2: Sequential CPU offload (more aggressive)
pipe.enable_sequential_cpu_offload()
# Solution 3: Attention slicing
pipe.enable_attention_slicing()
# Solution 4: VAE slicing for large images
pipe.enable_vae_slicing()
# Solution 5: Use lower precision
pipe = DiffusionPipeline.from_pretrained(
"model-id",
torch_dtype=torch.float16 # or torch.bfloat16
)
# Solution 6: Reduce batch size
image = pipe(prompt, num_images_per_prompt=1).images[0]
# Solution 7: Generate smaller images
image = pipe(prompt, height=512, width=512).images[0]
# Solution 8: Clear cache between generations
import gc
torch.cuda.empty_cache()
gc.collect()
```
### Memory grows over time
**Problem**: Memory usage increases with each generation
**Fix**:
```python
import gc
import torch
def generate_with_cleanup(pipe, prompt, **kwargs):
try:
image = pipe(prompt, **kwargs).images[0]
return image
finally:
# Clear cache after generation
if torch.cuda.is_available():
torch.cuda.empty_cache()
gc.collect()
```
### Large model loading fails
**Error**: `RuntimeError: Unable to load model weights`
**Fix**:
```python
# Use low CPU memory mode
pipe = DiffusionPipeline.from_pretrained(
"large-model-id",
low_cpu_mem_usage=True,
torch_dtype=torch.float16
)
```
## Generation Issues
### Black images
**Problem**: Output images are completely black
**Solutions**:
```python
# Solution 1: Disable safety checker
pipe.safety_checker = None
# Solution 2: Check VAE scaling
# The issue might be with VAE encoding/decoding
latents = latents / pipe.vae.config.scaling_factor # Before decode
# Solution 3: Ensure proper dtype
pipe = pipe.to(dtype=torch.float16)
pipe.vae = pipe.vae.to(dtype=torch.float32) # VAE often needs fp32
# Solution 4: Check guidance scale
# Too high can cause issues
image = pipe(prompt, guidance_scale=7.5).images[0] # Not 20+
```
### Noise/static images
**Problem**: Output looks like random noise
**Solutions**:
```python
# Solution 1: Increase inference steps
image = pipe(prompt, num_inference_steps=50).images[0]
# Solution 2: Check scheduler configuration
pipe.scheduler = pipe.scheduler.from_config(pipe.scheduler.config)
# Solution 3: Verify model was loaded correctly
print(pipe.unet) # Should show model architecture
```
### Blurry images
**Problem**: Output images are low quality or blurry
**Solutions**:
```python
# Solution 1: Use more steps
image = pipe(prompt, num_inference_steps=50).images[0]
# Solution 2: Use better VAE
from diffusers import AutoencoderKL
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
pipe.vae = vae
# Solution 3: Use SDXL or refiner
pipe = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0"
)
# Solution 4: Upscale with img2img
upscale_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(...)
upscaled = upscale_pipe(
prompt=prompt,
image=image.resize((1024, 1024)),
strength=0.3
).images[0]
```
### Prompt not being followed
**Problem**: Generated image doesn't match the prompt
**Solutions**:
```python
# Solution 1: Increase guidance scale
image = pipe(prompt, guidance_scale=10.0).images[0]
# Solution 2: Use negative prompts
image = pipe(
prompt="A red car",
negative_prompt="blue, green, yellow, wrong color",
guidance_scale=7.5
).images[0]
# Solution 3: Use prompt weighting to emphasize important words
# (the (word:weight) syntax requires a helper library such as compel;
# plain diffusers does not parse it)
prompt = "A (red:1.5) car on a street"
# Solution 4: Use longer, more detailed prompts
prompt = """
A bright red sports car, ferrari style, parked on a city street,
photorealistic, high detail, 8k, professional photography
"""
```
### Distorted faces/hands
**Problem**: Faces and hands look deformed
**Solutions**:
```python
# Solution 1: Use negative prompts
negative_prompt = """
bad hands, bad anatomy, deformed, ugly, blurry,
extra fingers, mutated hands, poorly drawn hands,
poorly drawn face, mutation, deformed face
"""
# Solution 2: Use face-specific models
# ADetailer or similar post-processing
# Solution 3: Use ControlNet for poses
# Load pose estimation and condition generation
# Solution 4: Inpaint problematic areas
mask = create_face_mask(image)
fixed = inpaint_pipe(
prompt="beautiful detailed face",
image=image,
mask_image=mask
).images[0]
```
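`create_face_mask` and `inpaint_pipe` above are placeholders. A rough sketch of the mask helper using OpenCV's bundled Haar cascade (an illustrative assumption, not part of Diffusers):
```python
import cv2
import numpy as np
from PIL import Image

def create_face_mask(image: Image.Image) -> Image.Image:
    rgb = np.array(image.convert("RGB"))
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    mask = np.zeros(gray.shape, dtype=np.uint8)
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4):
        mask[y:y + h, x:x + w] = 255  # white = region the inpainting pipeline will repaint
    return Image.fromarray(mask)
```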
## Scheduler Issues
### Scheduler not compatible
**Error**: `ValueError: Scheduler ... is not compatible with pipeline`
**Fix**:
```python
from diffusers import EulerDiscreteScheduler
# Create scheduler from config
pipe.scheduler = EulerDiscreteScheduler.from_config(
pipe.scheduler.config
)
# Check compatible schedulers
print(pipe.scheduler.compatibles)
```
### Wrong number of steps
**Problem**: Model generates different quality with same steps
**Fix**:
```python
# Reset timesteps explicitly
pipe.scheduler.set_timesteps(num_inference_steps)
# Check scheduler's step count
print(len(pipe.scheduler.timesteps))
```
## LoRA Issues
### LoRA weights not loading
**Error**: `RuntimeError: Error(s) in loading state_dict for UNet2DConditionModel`
**Fix**:
```python
# Check weight file format
# Should be .safetensors or .bin
# Load with correct key prefix
pipe.load_lora_weights(
"path/to/lora",
weight_name="lora.safetensors"
)
# Try loading into specific component
pipe.unet.load_attn_procs("path/to/lora")
```
### LoRA not affecting output
**Problem**: Generated images look the same with/without LoRA
**Fix**:
```python
# Fuse LoRA weights
pipe.fuse_lora(lora_scale=1.0)
# Or set scale explicitly
pipe.set_adapters(["lora_name"], adapter_weights=[1.0])
# Verify LoRA is loaded
print(list(pipe.unet.attn_processors.keys()))
```
### Multiple LoRAs conflict
**Problem**: Multiple LoRAs produce artifacts
**Fix**:
```python
# Load with different adapter names
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="subject")
# Balance weights
pipe.set_adapters(
["style", "subject"],
adapter_weights=[0.5, 0.5] # Lower weights
)
# Or use LoRA merge before loading
# Merge LoRAs offline with appropriate ratios
```
## ControlNet Issues
### ControlNet not conditioning
**Problem**: ControlNet has no effect on output
**Fix**:
```python
# Check control image format
# Should be RGB, matching generation size
control_image = control_image.resize((512, 512))
# Increase conditioning scale
image = pipe(
prompt=prompt,
image=control_image,
controlnet_conditioning_scale=1.0, # Try 0.5-1.5
num_inference_steps=30
).images[0]
# Verify ControlNet is loaded
print(pipe.controlnet)
```
### Control image preprocessing
**Fix**:
```python
from controlnet_aux import CannyDetector
# Proper preprocessing
canny = CannyDetector()
control_image = canny(input_image)
# Ensure correct format
control_image = control_image.convert("RGB")
control_image = control_image.resize((512, 512))
```
## Hub/Download Issues
### Model download fails
**Error**: `requests.exceptions.ConnectionError`
**Fix**:
```bash
# Set longer timeout
export HF_HUB_DOWNLOAD_TIMEOUT=600
# Use mirror if available
export HF_ENDPOINT=https://hf-mirror.com
# Or download manually
huggingface-cli download stable-diffusion-v1-5/stable-diffusion-v1-5
```
### Cache issues
**Error**: `OSError: Can't load model from cache`
**Fix**:
```bash
# Clear cache
rm -rf ~/.cache/huggingface/hub
# Or set different cache location
export HF_HOME=/path/to/cache
# Force re-download
pipe = DiffusionPipeline.from_pretrained(
"model-id",
force_download=True
)
```
### Access denied for gated models
**Error**: `401 Client Error: Unauthorized`
**Fix**:
```bash
# Login to Hugging Face
huggingface-cli login
# Or use token
pipe = DiffusionPipeline.from_pretrained(
"model-id",
token="hf_xxxxx"
)
# Accept model license on Hub website first
```
## Performance Issues
### Slow generation
**Problem**: Generation takes too long
**Solutions**:
```python
# Solution 1: Use faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
pipe.scheduler.config
)
# Solution 2: Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]
# Solution 3: Use LCM
from diffusers import LCMScheduler
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
image = pipe(prompt, num_inference_steps=4, guidance_scale=1.0).images[0]
# Solution 4: Enable xFormers
pipe.enable_xformers_memory_efficient_attention()
# Solution 5: Compile model
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```
### First generation is slow
**Problem**: First image takes much longer
**Fix**:
```python
# Warm up the model
_ = pipe("warmup", num_inference_steps=1)
# Then run actual generation
image = pipe(prompt, num_inference_steps=50).images[0]
# Compile for faster subsequent runs
pipe.unet = torch.compile(pipe.unet)
```
## Debugging Tips
### Enable debug logging
```python
import logging
logging.basicConfig(level=logging.DEBUG)
# Or for specific modules
logging.getLogger("diffusers").setLevel(logging.DEBUG)
logging.getLogger("transformers").setLevel(logging.DEBUG)
```
### Check model components
```python
# Print pipeline components
print(pipe.components)
# Check model config
print(pipe.unet.config)
print(pipe.vae.config)
print(pipe.scheduler.config)
# Verify device placement
print(pipe.device)
for name, module in pipe.components.items():
if hasattr(module, 'device'):
print(f"{name}: {module.device}")
```
### Validate inputs
```python
# Check image dimensions
print(f"Height: {height}, Width: {width}")
assert height % 8 == 0, "Height must be divisible by 8"
assert width % 8 == 0, "Width must be divisible by 8"
# Check prompt tokenization
tokens = pipe.tokenizer(prompt, return_tensors="pt")
print(f"Token count: {tokens.input_ids.shape[1]}") # Max 77 for SD
```
### Save intermediate results
```python
import torch
from PIL import Image

def save_latents_callback(pipe, step_index, timestep, callback_kwargs):
latents = callback_kwargs["latents"]
# Decode and save intermediate
with torch.no_grad():
image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
image = (image / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
Image.fromarray((image * 255).astype("uint8")).save(f"step_{step_index}.png")
return callback_kwargs
image = pipe(
prompt,
callback_on_step_end=save_latents_callback,
callback_on_step_end_tensor_inputs=["latents"]
).images[0]
```
## Getting Help
1. **Documentation**: https://huggingface.co/docs/diffusers
2. **GitHub Issues**: https://github.com/huggingface/diffusers/issues
3. **Discord**: https://discord.gg/diffusers
4. **Forum**: https://discuss.huggingface.co
### Reporting Issues
Include (the sketch below collects the version info):
- Diffusers version: `pip show diffusers`
- PyTorch version: `python -c "import torch; print(torch.__version__)"`
- CUDA version: `nvcc --version`
- GPU model: `nvidia-smi`
- Full error traceback
- Minimal reproducible code
- Model name/ID used
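A small script along these lines can gather most of the environment details (assumes `torch` and `diffusers` are importable; the attributes used are standard):

```python
import platform

import diffusers
import torch

# Print the version/hardware details requested in bug reports
print("diffusers:", diffusers.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
print("python:", platform.python_version())
```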

View file

@ -0,0 +1,320 @@
---
name: whisper
description: OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [openai-whisper, transformers, torch]
metadata:
hermes:
tags: [Whisper, Speech Recognition, ASR, Multimodal, Multilingual, OpenAI, Speech-To-Text, Transcription, Translation, Audio Processing]
---
# Whisper - Robust Speech Recognition
OpenAI's multilingual speech recognition model.
## When to use Whisper
**Use when:**
- Speech-to-text transcription (99 languages)
- Podcast/video transcription
- Meeting notes automation
- Translation to English
- Noisy audio transcription
- Multilingual audio processing
**Metrics**:
- **72,900+ GitHub stars**
- 99 languages supported
- Trained on 680,000 hours of audio
- MIT License
**Use alternatives instead**:
- **AssemblyAI**: Managed API, speaker diarization
- **Deepgram**: Real-time streaming ASR
- **Google Speech-to-Text**: Cloud-based
## Quick start
### Installation
```bash
# Requires Python 3.8-3.11
pip install -U openai-whisper
# Requires ffmpeg
# macOS: brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
# Windows: choco install ffmpeg
```
### Basic transcription
```python
import whisper
# Load model
model = whisper.load_model("base")
# Transcribe
result = model.transcribe("audio.mp3")
# Print text
print(result["text"])
# Access segments
for segment in result["segments"]:
print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
```
## Model sizes
```python
# Available models
models = ["tiny", "base", "small", "medium", "large", "turbo"]
# Load specific model
model = whisper.load_model("turbo") # Fastest, good quality
```
| Model | Parameters | English-only | Multilingual | Speed | VRAM |
|-------|------------|--------------|--------------|-------|------|
| tiny | 39M | ✓ | ✓ | ~32x | ~1 GB |
| base | 74M | ✓ | ✓ | ~16x | ~1 GB |
| small | 244M | ✓ | ✓ | ~6x | ~2 GB |
| medium | 769M | ✓ | ✓ | ~2x | ~5 GB |
| large | 1550M | ✗ | ✓ | 1x | ~10 GB |
| turbo | 809M | ✗ | ✓ | ~8x | ~6 GB |
**Recommendation**: Use `turbo` for best speed/quality, `base` for prototyping
## Transcription options
### Language specification
```python
# Auto-detect language
result = model.transcribe("audio.mp3")
# Specify language (faster)
result = model.transcribe("audio.mp3", language="en")
# Supported: en, es, fr, de, it, pt, ru, ja, ko, zh, and 89 more
```
### Task selection
```python
# Transcription (default)
result = model.transcribe("audio.mp3", task="transcribe")
# Translation to English
result = model.transcribe("spanish.mp3", task="translate")
# Input: Spanish audio → Output: English text
```
### Initial prompt
```python
# Improve accuracy with context
result = model.transcribe(
"audio.mp3",
initial_prompt="This is a technical podcast about machine learning and AI."
)
# Helps with:
# - Technical terms
# - Proper nouns
# - Domain-specific vocabulary
```
### Timestamps
```python
# Word-level timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
for word in segment["words"]:
print(f"{word['word']} ({word['start']:.2f}s - {word['end']:.2f}s)")
```
### Temperature fallback
```python
# Retry with different temperatures if confidence low
result = model.transcribe(
"audio.mp3",
temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
)
```
## Command line usage
```bash
# Basic transcription
whisper audio.mp3
# Specify model
whisper audio.mp3 --model turbo
# Output formats
whisper audio.mp3 --output_format txt # Plain text
whisper audio.mp3 --output_format srt # Subtitles
whisper audio.mp3 --output_format vtt # WebVTT
whisper audio.mp3 --output_format json # JSON with timestamps
# Language
whisper audio.mp3 --language Spanish
# Translation
whisper spanish.mp3 --task translate
```
## Batch processing
```python
import os
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]
for audio_file in audio_files:
print(f"Transcribing {audio_file}...")
result = model.transcribe(audio_file)
    # Save alongside the audio, swapping the extension for .txt
    output_file = os.path.splitext(audio_file)[0] + ".txt"
with open(output_file, "w") as f:
f.write(result["text"])
```
## Real-time transcription
```python
# For near-real-time use, faster-whisper is a common faster backend;
# true live streaming still needs your own audio chunking on top
# pip install faster-whisper
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cuda", compute_type="float16")
# Transcribe with streaming
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
## GPU acceleration
```python
import whisper
# Automatically uses GPU if available
model = whisper.load_model("turbo")
# Force CPU
model = whisper.load_model("turbo", device="cpu")
# Force GPU
model = whisper.load_model("turbo", device="cuda")
# 10-20× faster on GPU
```
## Integration with other tools
### Subtitle generation
```bash
# Generate SRT subtitles
whisper video.mp4 --output_format srt --language English
# Output: video.srt
```
### With LangChain
```python
from langchain_core.documents import Document
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Wrap the Whisper transcript in a LangChain Document for RAG
result = model.transcribe("audio.mp3")
docs = [Document(page_content=result["text"], metadata={"source": "audio.mp3"})]
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
```
### Extract audio from video
```bash
# Use ffmpeg to extract audio
ffmpeg -i video.mp4 -vn -acodec pcm_s16le audio.wav
# Then transcribe
whisper audio.wav
```
## Best practices
1. **Use turbo model** - Best speed/quality for English
2. **Specify language** - Faster than auto-detect
3. **Add initial prompt** - Improves technical terms
4. **Use GPU** - 10-20× faster
5. **Batch process** - More efficient
6. **Convert to WAV** - Better compatibility
7. **Split long audio** - <30 min chunks (see the sketch after this list)
8. **Check language support** - Quality varies by language
9. **Use faster-whisper** - 4× faster than openai-whisper
10. **Monitor VRAM** - Scale model size to hardware
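A minimal sketch of practice #7, splitting a long recording before transcription (the ffmpeg segment length and file names are illustrative):

```python
import glob
import subprocess

import whisper

# Split a long recording into ~10-minute chunks with ffmpeg, then transcribe each
subprocess.run([
    "ffmpeg", "-i", "long_audio.mp3",
    "-f", "segment", "-segment_time", "600",  # 600 s per chunk
    "-c", "copy", "chunk_%03d.mp3",
], check=True)

model = whisper.load_model("turbo")
texts = []
for chunk in sorted(glob.glob("chunk_*.mp3")):
    texts.append(model.transcribe(chunk)["text"].strip())

full_transcript = "\n".join(texts)
```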
## Performance
| Model | Real-time factor (CPU) | Real-time factor (GPU) |
|-------|------------------------|------------------------|
| tiny | ~0.32 | ~0.01 |
| base | ~0.16 | ~0.01 |
| turbo | ~0.08 | ~0.01 |
| large | ~1.0 | ~0.05 |
*Real-time factor: 0.1 = 10× faster than real-time*
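To measure the real-time factor on your own hardware, a quick sketch (the clip path and its 60-second length are assumptions):

```python
import time

import whisper

model = whisper.load_model("turbo")
audio_seconds = 60.0  # known length of the test clip

start = time.time()
model.transcribe("one_minute_clip.mp3")
elapsed = time.time() - start
print(f"Real-time factor: {elapsed / audio_seconds:.2f}")  # lower is faster
```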
## Language support
Top-supported languages:
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Japanese (ja)
- Korean (ko)
- Chinese (zh)
Full list: 99 languages total
## Limitations
1. **Hallucinations** - May repeat or invent text (partially mitigated with the decoding thresholds sketched after this list)
2. **Long-form accuracy** - Degrades on >30 min audio
3. **Speaker identification** - No diarization
4. **Accents** - Quality varies
5. **Background noise** - Can affect accuracy
6. **Real-time latency** - Not suitable for live captioning
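A sketch of `transcribe()` options that often reduce repetition and hallucination on difficult audio (the thresholds shown are the library defaults; `condition_on_previous_text=False` is the main change, and everything can be tuned per dataset):

```python
import whisper

model = whisper.load_model("turbo")
result = model.transcribe(
    "noisy_audio.mp3",
    condition_on_previous_text=False,  # stops runaway repetition across windows
    compression_ratio_threshold=2.4,   # reject overly repetitive decodes
    logprob_threshold=-1.0,            # reject low-confidence decodes
    no_speech_threshold=0.6,           # treat near-silent windows as no speech
)
```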
## Resources
- **GitHub**: https://github.com/openai/whisper ⭐ 72,900+
- **Paper**: https://arxiv.org/abs/2212.04356
- **Model Card**: https://github.com/openai/whisper/blob/main/model-card.md
- **Colab**: Available in repo
- **License**: MIT

View file

@ -0,0 +1,189 @@
# Whisper Language Support Guide
Complete guide to Whisper's multilingual capabilities.
## Supported languages (99 total)
### Top-tier support (WER < 10%)
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Dutch (nl)
- Polish (pl)
- Russian (ru)
- Japanese (ja)
- Korean (ko)
- Chinese (zh)
### Good support (WER 10-20%)
- Arabic (ar)
- Turkish (tr)
- Vietnamese (vi)
- Swedish (sv)
- Finnish (fi)
- Czech (cs)
- Romanian (ro)
- Hungarian (hu)
- Danish (da)
- Norwegian (no)
- Thai (th)
- Hebrew (he)
- Greek (el)
- Indonesian (id)
- Malay (ms)
### Full list (99 languages)
Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Burmese, Cantonese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Moldavian, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Pashto, Persian, Polish, Portuguese, Punjabi, Pushto, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yiddish, Yoruba
## Usage examples
### Auto-detect language
```python
import whisper
model = whisper.load_model("turbo")
# Auto-detect language
result = model.transcribe("audio.mp3")
print(f"Detected language: {result['language']}")
print(f"Text: {result['text']}")
```
### Specify language (faster)
```python
# Specify language for faster transcription
result = model.transcribe("audio.mp3", language="es") # Spanish
result = model.transcribe("audio.mp3", language="fr") # French
result = model.transcribe("audio.mp3", language="ja") # Japanese
```
### Translation to English
```python
# Translate any language to English
result = model.transcribe(
"spanish_audio.mp3",
task="translate" # Translates to English
)
print(f"Original language: {result['language']}")
print(f"English translation: {result['text']}")
```
## Language-specific tips
### Chinese
```python
# Chinese works well with larger models
model = whisper.load_model("large")
result = model.transcribe(
"chinese_audio.mp3",
language="zh",
initial_prompt="这是一段关于技术的讨论" # Context helps
)
```
### Japanese
```python
# Japanese benefits from initial prompt
result = model.transcribe(
"japanese_audio.mp3",
language="ja",
initial_prompt="これは技術的な会議の録音です"
)
```
### Arabic
```python
# Arabic: Use large model for best results
model = whisper.load_model("large")
result = model.transcribe(
"arabic_audio.mp3",
language="ar"
)
```
## Model size recommendations
| Language Tier | Recommended Model | WER |
|---------------|-------------------|-----|
| Top-tier (en, es, fr, de) | base/turbo | < 10% |
| Good (ar, tr, vi) | medium/large | 10-20% |
| Lower-resource | large | 20-30% |
## Performance by language
### English
- **tiny**: WER ~15%
- **base**: WER ~8%
- **small**: WER ~5%
- **medium**: WER ~4%
- **large**: WER ~3%
- **turbo**: WER ~3.5%
### Spanish
- **tiny**: WER ~20%
- **base**: WER ~12%
- **medium**: WER ~6%
- **large**: WER ~4%
### Chinese
- **small**: WER ~15%
- **medium**: WER ~8%
- **large**: WER ~5%
## Best practices
1. **Use English-only models** (`tiny.en`, `base.en`) - More accurate for English at small sizes (see the sketch after this list)
2. **Specify language** - Faster than auto-detect
3. **Add initial prompt** - Improves accuracy for technical terms
4. **Use larger models** - For low-resource languages
5. **Test on sample** - Quality varies by accent/dialect
6. **Consider audio quality** - Clear audio = better results
7. **Check language codes** - Use ISO 639-1 codes (2 letters)
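A minimal sketch of practice #1, loading an English-only checkpoint (the audio file name is illustrative):

```python
import whisper

# English-only variants (tiny.en, base.en, small.en, medium.en) beat the
# multilingual model of the same size on English audio
model = whisper.load_model("base.en")
result = model.transcribe("english_audio.mp3")
print(result["text"])
```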
## Language detection
```python
# Detect language only (no transcription)
import whisper
model = whisper.load_model("base")
# Load audio
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
# Make log-Mel spectrogram
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# Detect language
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")
print(f"Confidence: {probs[detected_language]:.2%}")
```
## Resources
- **Paper**: https://arxiv.org/abs/2212.04356
- **GitHub**: https://github.com/openai/whisper
- **Model Card**: https://github.com/openai/whisper/blob/main/model-card.md