---
name: peft-fine-tuning
description: Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [peft>=0.13.0, transformers>=4.45.0, torch>=2.0.0, bitsandbytes>=0.43.0]
metadata:
  hermes:
    tags: [Fine-Tuning, PEFT, LoRA, QLoRA, Parameter-Efficient, Adapters, Low-Rank, Memory Optimization, Multi-Adapter]

---

# PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

## When to use PEFT

**Use PEFT/LoRA when:**
- Fine-tuning 7B-70B models on one or a few GPUs (e.g., RTX 4090, A100)
- You need to train <1% of parameters (adapters of a few MB to tens of MB vs a ~14-16GB full model)
- You want fast iteration with multiple task-specific adapters
- Deploying multiple fine-tuned variants from one base model

**Use QLoRA (PEFT + quantization) when:**
- Fine-tuning 70B models on a single 48GB GPU, or 7B-13B models on a 24GB GPU
- Memory is the primary constraint
- You can accept a ~5% quality trade-off vs full fine-tuning

**Use full fine-tuning instead when:**
- Training small models (<1B parameters)
- You need maximum quality and have the compute budget
- Significant domain shift requires updating all weights

## Quick start

### Installation

```bash
# Basic installation
pip install peft

# With quantization support (recommended)
pip install peft bitsandbytes

# Full stack
pip install peft transformers accelerate bitsandbytes datasets
```

### LoRA fine-tuning (standard)

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                  # Rank (8-64, higher = more capacity)
    lora_alpha=32,         # Scaling factor (typically 2*r)
    lora_dropout=0.05,     # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none"            # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints roughly: trainable params: 13,631,488 || all params: ~8.04B || trainable%: ~0.17

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    # No padding here: the data collator pads each batch dynamically
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,  # Matches the bf16 weights loaded via torch_dtype="auto"
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # Pads each batch and copies input_ids to labels, masking pad tokens with -100
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

# Save adapter only (tens of MB vs ~16GB for the full model)
model.save_pretrained("./lora-llama-adapter")
```

### QLoRA fine-tuning (memory-efficient)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",   # Compute in bf16
    bnb_4bit_use_double_quant=True       # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training (freezes base weights, upcasts non-quantized params to fp32,
# and enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,  # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
# The 4-bit weights of a 70B model take roughly 35GB: plan for a single 48GB GPU,
# or let device_map="auto" shard the model across several smaller GPUs
```

## LoRA parameter selection

### Rank (r) - capacity vs efficiency

Trainable-parameter counts assume Llama 3.1 8B with LoRA on the four attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`).

| Rank | Trainable Params | Memory | Quality | Use Case |
|------|-----------------|--------|---------|----------|
| 4 | ~3M | Minimal | Lower | Simple tasks, prototyping |
| **8** | ~7M | Low | Good | **Recommended starting point** |
| **16** | ~14M | Medium | Better | **General fine-tuning** |
| 32 | ~27M | Higher | High | Complex tasks |
| 64 | ~54M | High | Highest | Domain adaptation, 70B models |
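The parameter counts in the rank table can be reproduced with a little arithmetic. The sketch below assumes Llama 3.1 8B's attention shapes (hidden size 4096, grouped-query key/value width 1024, 32 layers); the helper `lora_param_count` is a hypothetical name for illustration, not part of PEFT.

```python
def lora_param_count(shapes, r, num_layers):
    """Each adapted (d_in, d_out) linear layer adds A (r x d_in) and B (d_out x r)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers

# Assumed Llama 3.1 8B attention shapes as (d_in, d_out)
attention_shapes = [
    (4096, 4096),  # q_proj
    (4096, 1024),  # k_proj (grouped-query attention)
    (4096, 1024),  # v_proj (grouped-query attention)
    (4096, 4096),  # o_proj
]

for r in (4, 8, 16, 32, 64):
    print(r, f"{lora_param_count(attention_shapes, r, num_layers=32):,}")
```

Because the count is linear in `r`, doubling the rank doubles the adapter size; adding the MLP projections to `shapes` would raise every row accordingly.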

### Alpha (lora_alpha) - scaling factor

```python
# Rule of thumb: alpha = 2 * rank
LoraConfig(r=16, lora_alpha=32)  # Standard
LoraConfig(r=16, lora_alpha=16)  # Conservative (lower learning rate effect)
LoraConfig(r=16, lora_alpha=64)  # Aggressive (higher learning rate effect)
```
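In standard LoRA the adapter update is multiplied by `lora_alpha / r`, which is why alpha behaves like a learning-rate multiplier on the adapter. A minimal sketch of that ratio for the configurations above (`lora_scaling` is an illustrative helper, not a PEFT function):

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r

for r, alpha in [(16, 16), (16, 32), (16, 64), (64, 128)]:
    print(f"r={r:<3} lora_alpha={alpha:<4} scaling={lora_scaling(r, alpha)}")
```

Keeping `lora_alpha = 2 * r` holds the scaling at 2.0 as the rank changes, so rank can be tuned without implicitly changing the update magnitude.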

### Target modules by architecture

```python
# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2
target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers
target_modules = "all-linear"  # PEFT 0.8.0+
```
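When `target_modules` is a list, PEFT selects modules whose dotted name matches an entry exactly or ends with it. The sketch below is a simplified illustration of that suffix rule, not PEFT's actual code; the module names are hypothetical examples in the style of a Llama checkpoint.

```python
def matches_target(module_name: str, target_modules: list[str]) -> bool:
    """True if the dotted module name equals a key or ends with '.<key>'."""
    return any(
        module_name == key or module_name.endswith("." + key)
        for key in target_modules
    )

# Hypothetical module names for illustration
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.input_layernorm",
    "lm_head",
]

selected = [n for n in module_names if matches_target(n, ["q_proj", "gate_proj"])]
print(selected)
```

To find the real names for a given model, iterating over `model.named_modules()` and printing the names of the linear layers is usually enough.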

## Loading and merging adapters

### Load trained adapter

```python
from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-llama-adapter",
    device_map="auto"
)
```

### Merge adapter into base model

```python
# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")

# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")
```
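Merging works because a LoRA layer computes `W x + (lora_alpha / r) * B A x`, which equals `(W + (lora_alpha / r) * B A) x`. The toy sketch below checks that identity in plain Python; all matrix values and sizes are made up for illustration, and the helpers are not PEFT functions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy sizes: d_out=2, d_in=3, rank r=1, lora_alpha=2
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight (d_out x d_in)
B = [[0.5], [-1.0]]                       # LoRA B (d_out x r)
A = [[1.0, 2.0, 0.0]]                     # LoRA A (r x d_in)
scaling = 2 / 1                           # lora_alpha / r
x = [[1.0], [1.0], [1.0]]                 # input column vector

# Unmerged: base path plus scaled adapter path
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)

# Merged: fold the adapter into the weight once, then do a single matmul
W_merged = add(W, matmul(B, A), scale=scaling)
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

After merging, inference costs the same as the base model, but the adapter can no longer be swapped out, so keep the unmerged adapter files if you plan to serve several tasks.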

### Multi-adapter serving

```python
from peft import AutoPeftModelForCausalLM

# Load base with first adapter, registered as "task1" (the default name is "default")
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1", adapter_name="task1")

# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime
# (inputs is a tokenized prompt, e.g. tokenizer(prompt, return_tensors="pt"))
model.set_adapter("task1")  # Use task1 adapter
output1 = model.generate(**inputs)

model.set_adapter("task2")  # Switch to task2
output2 = model.generate(**inputs)

# Disable adapters (use base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)
```

## PEFT methods comparison

| Method | Trainable % | Memory | Speed | Best For |
|--------|------------|--------|-------|----------|
| **LoRA** | 0.1-1% | Low | Fast | General fine-tuning |
| **QLoRA** | 0.1-1% | Very Low | Medium | Memory-constrained |
| AdaLoRA | 0.1-1% | Low | Medium | Automatic rank selection |
| IA3 | 0.01% | Minimal | Fastest | Few-shot adaptation |
| Prefix Tuning | 0.1% | Low | Medium | Generation control |
| Prompt Tuning | 0.001% | Minimal | Fast | Simple task adaptation |
| P-Tuning v2 | 0.1% | Low | Medium | NLU tasks |

### IA3 (minimal parameters)

```python
from peft import IA3Config, get_peft_model

ia3_config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
    feedforward_modules=["down_proj"]  # Must be a subset of target_modules
)
model = get_peft_model(model, ia3_config)
# Trains only ~0.01% of parameters
```

### Prefix Tuning

```python
from peft import PrefixTuningConfig, get_peft_model

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,   # Prepended tokens
    prefix_projection=True   # Use MLP projection
)
model = get_peft_model(model, prefix_config)
```

## Integration patterns

### With TRL (SFTTrainer)

```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,  # Base model, not yet wrapped with get_peft_model
    # max_seq_length is named max_length in newer TRL releases
    args=SFTConfig(output_dir="./output", max_seq_length=512),
    train_dataset=dataset,
    peft_config=lora_config,  # Pass LoRA config directly
)
trainer.train()
```

### With Axolotl (YAML config)

```yaml
# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_target_linear: true  # Target all linear layers
```

### With vLLM (inference)

```python
from vllm import LLM
from vllm.lora.request import LoRARequest

# Load base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

prompts = ["Explain LoRA in one sentence."]  # Example prompt

# Serve with adapter: LoRARequest(name, unique integer ID, adapter path)
outputs = llm.generate(
    prompts,
    lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
)
```

## Performance benchmarks

### Memory usage (Llama 3.1 8B)

| Method | GPU Memory | Trainable Params |
|--------|-----------|------------------|
| Full fine-tuning | 60+ GB | 8B (100%) |
| LoRA r=16 | 18 GB | 14M (0.17%) |
| QLoRA r=16 | 6 GB | 14M (0.17%) |
| IA3 | 16 GB | 800K (0.01%) |
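As a rough cross-check on the memory table, the weights alone take `parameters × bytes per parameter`. The sketch below is a back-of-the-envelope estimate under that assumption only: it ignores quantization constants, activations, gradients, and optimizer state, and the helper name is illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "nf4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the model weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for name, n in [("8B", 8e9), ("70B", 70e9)]:
    for dtype in ("bf16", "nf4"):
        print(f"{name} {dtype}: {weight_memory_gb(n, dtype):.0f} GB")
```

For the 8B model this gives about 16 GB in bf16 and 4 GB in 4-bit, in line with the LoRA and QLoRA rows once training overhead is added on top. The same arithmetic puts a 70B model at roughly 35 GB in 4-bit before any overhead.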

### Training speed (A100 80GB)

| Method | Tokens/sec | vs Full FT |
|--------|-----------|------------|
| Full FT | 2,500 | 1x |
| LoRA | 3,200 | 1.3x |
| QLoRA | 2,100 | 0.84x |

### Quality (MMLU benchmark)

| Model | Full FT | LoRA | QLoRA |
|-------|---------|------|-------|
| Llama 2-7B | 45.3 | 44.8 | 44.1 |
| Llama 2-13B | 54.8 | 54.2 | 53.5 |

## Common issues

### CUDA OOM during training

```python
# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size + increase accumulation
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16
)

# Solution 3: Use QLoRA
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
```
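Solution 2 trades per-step memory for more accumulation steps while keeping the optimizer's effective batch size unchanged. A small sketch of that arithmetic, using an illustrative helper and assuming a single GPU by default:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices

print(effective_batch_size(4, 4))   # quick-start settings
print(effective_batch_size(1, 16))  # OOM-friendly settings
```

Both configurations reach the same effective batch size, so the learning rate from the quick start can usually be kept as a starting point.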

### Adapter not applying

```python
# Verify adapter is active
print(model.active_adapters)  # Should show adapter name

# Check trainable parameters
model.print_trainable_parameters()

# Ensure model in training mode
model.train()
```

### Quality degradation

```python
# Increase rank
LoraConfig(r=32, lora_alpha=64)

# Target more modules
target_modules = "all-linear"

# Use more training data and epochs
TrainingArguments(num_train_epochs=5)

# Lower learning rate
TrainingArguments(learning_rate=1e-4)
```

## Best practices

1. **Start with r=8-16**, and increase it if quality is insufficient
2. **Use alpha = 2 * rank** as a starting point
3. **Target attention + MLP layers** for the best quality/efficiency
4. **Enable gradient checkpointing** for memory savings
5. **Save adapters frequently** (small files, easy rollback)
6. **Evaluate on held-out data** before merging
7. **Use QLoRA for 70B+ models** when GPU memory is the constraint

## References

- **[Advanced Usage](references/advanced-usage.md)** - DoRA, LoftQ, rank stabilization, custom modules
- **[Troubleshooting](references/troubleshooting.md)** - Common errors, debugging, optimization

## Resources

- **GitHub**: https://github.com/huggingface/peft
- **Docs**: https://huggingface.co/docs/peft
- **LoRA Paper**: arXiv:2106.09685
- **QLoRA Paper**: arXiv:2305.14314
- **Models**: https://huggingface.co/models?library=peft