Hyperparameters
Complete guide to SimPO hyperparameter selection and tuning.
Overview
Key hyperparameters in SimPO:
- Learning Rate - Most critical
- Beta (β) - Reward scaling
- Gamma-Beta Ratio (γ/β) - Target margin
- SFT Weight - Regularization strength
Learning Rate
Recommended Ranges
By model size:
| Model Size | Learning Rate | Notes |
|---|---|---|
| 1B-3B | 5e-7 to 1e-6 | Higher end of range is safe |
| 7B-8B | 3e-7 to 5e-7 | Standard |
| 13B-30B | 1e-7 to 3e-7 | Lower for stability |
| 70B+ | 5e-8 to 1e-7 | Very conservative |
By task type:
| Task | Learning Rate | Reason |
|---|---|---|
| General chat | 5e-7 | Standard |
| Code generation | 3e-7 | Precise reasoning |
| Math reasoning | 3e-7 | Careful optimization |
| Creative writing | 1e-6 | More aggressive OK |
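The size table above can be turned into a simple lookup. This is an illustrative sketch only: `LR_RANGES` and `suggest_lr_range` are hypothetical names, and the handling of sizes that fall between the table's rows (for example 4B or 40B) is an assumption that rounds up to the next, more conservative bucket.

```python
# Ranges copied from the "By model size" table; bucket edges are an assumption.
LR_RANGES = [
    (3.0, (5e-7, 1e-6)),           # 1B-3B
    (8.0, (3e-7, 5e-7)),           # 7B-8B
    (30.0, (1e-7, 3e-7)),          # 13B-30B
    (float("inf"), (5e-8, 1e-7)),  # 70B+
]


def suggest_lr_range(params_billion: float) -> tuple[float, float]:
    """Return the (low, high) learning-rate range for a model size in billions."""
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")
    for upper_bound, lr_range in LR_RANGES:
        if params_billion <= upper_bound:
            return lr_range
    raise ValueError("no matching bucket")  # not reached for finite inputs


low, high = suggest_lr_range(7)
print(f"7B model: try learning rates between {low:g} and {high:g}")
```

Task type can then pick a point inside the range, for example the lower end for code or math and the upper end for creative writing.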
Why Learning Rate Matters
Too high (> 1e-6 for 7B):
- Loss divergence
- Catastrophic forgetting
- Unstable training
Too low (< 1e-7 for 7B):
- Very slow convergence
- May not converge within the planned training budget
- Undertraining
Optimal (3e-7 to 5e-7 for 7B):
- Stable convergence
- Good final performance
- Efficient training
Config Examples
Mistral 7B (general):
learning_rate: 5e-7
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: cosine
Llama 3 8B (reasoning):
learning_rate: 3e-7
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: cosine
Gemma 2 9B (creative):
learning_rate: 1e-6
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: linear
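To make `warmup_ratio` and the cosine scheduler concrete, here is a minimal sketch of how the learning rate could evolve over training. `lr_at_step` is a hypothetical helper, and the exact warmup rounding and decay-to-zero behavior are assumptions; the training library you use may differ in detail.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-7, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay toward zero (assumed shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: progress runs from 0.0 (end of warmup) to 1.0 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

With `warmup_ratio: 0.1` and 1000 total steps, the first 100 steps ramp up and the remaining 900 decay.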
Beta (β)
Recommended Values
Typical range: 2.0 to 10.0, with values down to 1.0 for subtle preferences (much higher than DPO's typical 0.01-0.1)
By preference strength:
| Beta | Preference Strength | Use Case |
|---|---|---|
| 1.0-2.0 | Weak | Subtle preferences |
| 2.0-5.0 | Standard | General alignment |
| 5.0-10.0 | Strong | Clear preferences |
Default: 2.0 to 2.5
Why Beta Matters
Low beta (< 2.0):
- Weak reward signal
- Slow preference learning
- May underfit
High beta (> 10.0):
- Very strong reward signal
- Risk of overfitting
- May ignore weak preferences
Optimal (2.0-5.0):
- Balanced reward scaling
- Stable training
- Good generalization
Interaction with Gamma
Beta and gamma together:
Target margin in reward space = gamma
Target margin in average log-probability space = gamma / beta (the gamma_beta_ratio setting)
Example:
beta: 2.0
gamma_beta_ratio: 0.5
# Effective gamma = 2.0 * 0.5 = 1.0
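The two views of the margin are algebraically the same quantity. The sketch below checks this numerically; `margin_views` and `avg_logp_diff` (the chosen-minus-rejected average log-probability) are hypothetical names introduced for illustration.

```python
def margin_views(beta, gamma_beta_ratio, avg_logp_diff):
    """Return gamma plus the sigmoid argument computed in both spaces."""
    gamma = beta * gamma_beta_ratio
    # Reward space: scale the log-prob difference first, then subtract gamma.
    reward_space = beta * avg_logp_diff - gamma
    # Log-prob space: subtract the ratio first, then scale by beta.
    logprob_space = beta * (avg_logp_diff - gamma_beta_ratio)
    return gamma, reward_space, logprob_space


gamma, reward_space, logprob_space = margin_views(
    beta=2.0, gamma_beta_ratio=0.5, avg_logp_diff=0.8
)
print(gamma, reward_space, logprob_space)
```

Because both forms agree, raising `beta` while holding `gamma_beta_ratio` fixed also raises the effective gamma.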
Config Examples
Weak preferences:
beta: 2.0
gamma_beta_ratio: 0.3 # Small margin
Standard:
beta: 2.5
gamma_beta_ratio: 0.5 # Default
Strong preferences:
beta: 5.0
gamma_beta_ratio: 0.7 # Larger margin
Gamma-Beta Ratio (γ/β)
Recommended Values
Range: 0.0 to 1.0
By scenario:
| Ratio | Margin | Use Case |
|---|---|---|
| 0.0-0.3 | Small | Weak preference data |
| 0.4-0.6 | Standard | General use |
| 0.7-1.0 | Large | Very clear preferences |
Default: 0.5
Why Gamma Matters
Low ratio (< 0.3):
- Small target margin
- Less aggressive alignment
- More conservative
High ratio (> 0.7):
- Large target margin
- Stronger alignment
- More aggressive
Optimal (0.4-0.6):
- Balanced margin
- Stable training
- Good alignment
Mathematical Meaning
In loss function:
logits = pi_logratios - gamma_beta_ratio
loss = -log(sigmoid(beta * logits))
Interpretation:
- gamma_beta_ratio shifts the decision boundary
- Higher ratio = requires larger log prob difference
- Controls how "clear" preferences must be
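As a runnable illustration of the two lines above, here is a minimal scalar sketch in plain Python. It assumes the inputs are length-normalized (average per-token) log-probabilities for one chosen/rejected pair; `simpo_sigmoid_loss` is a hypothetical name, not the trainer's actual function.

```python
import math


def simpo_sigmoid_loss(chosen_avg_logp, rejected_avg_logp,
                       beta=2.0, gamma_beta_ratio=0.5):
    """Sigmoid SimPO loss for a single preference pair (illustrative)."""
    pi_logratios = chosen_avg_logp - rejected_avg_logp
    logits = pi_logratios - gamma_beta_ratio
    z = beta * logits
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Same pair, two margins: the stricter ratio yields the larger loss.
print(simpo_sigmoid_loss(-1.0, -2.0, gamma_beta_ratio=0.3))
print(simpo_sigmoid_loss(-1.0, -2.0, gamma_beta_ratio=0.8))
```

When the log-probability difference exactly equals `gamma_beta_ratio`, the loss sits at log(2), which is the decision boundary the bullets describe.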
Config Examples
Noisy preferences:
gamma_beta_ratio: 0.3 # Smaller margin, more tolerant
Standard:
gamma_beta_ratio: 0.5 # Default
High-quality preferences:
gamma_beta_ratio: 0.8 # Larger margin, stricter
SFT Weight
Recommended Values
Range: 0.0 to 1.0
By model type:
| Model Type | SFT Weight | Reason |
|---|---|---|
| Base model | 0.0 | No instruction-tuned behavior to preserve |
| Instruct model | 0.05-0.1 | Preserve instruction following |
| Chat model | 0.1-0.2 | Preserve conversational skills |
Default: 0.0 (no SFT regularization)
Why SFT Weight Matters
Zero SFT (0.0):
- Pure preference optimization
- May forget capabilities
- Standard for base models
Low SFT (0.05-0.1):
- Balanced approach
- Recommended for instruct models
- Slight capability preservation
High SFT (> 0.2):
- Strong capability preservation
- Weaker preference alignment
- May reduce alignment gains
Trade-off
Total Loss = SimPO Loss + (sft_weight * SFT Loss)
Example:
sft_weight: 0.1
# SFT loss is added at 0.1x weight on top of the full SimPO loss
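A minimal numeric sketch of the formula above. The loss values are made up for illustration, and `sft_share` is a hypothetical helper showing how much of the total comes from the SFT term, which depends on the raw loss magnitudes as well as on `sft_weight`.

```python
def total_loss(simpo_loss, sft_loss, sft_weight=0.1):
    """Total Loss = SimPO Loss + (sft_weight * SFT Loss)."""
    return simpo_loss + sft_weight * sft_loss


def sft_share(simpo_loss, sft_loss, sft_weight=0.1):
    """Fraction of the total loss contributed by the weighted SFT term."""
    weighted_sft = sft_weight * sft_loss
    return weighted_sft / (simpo_loss + weighted_sft)


# Illustrative values only: SimPO loss 0.6, SFT loss 1.2.
print(total_loss(0.6, 1.2, sft_weight=0.1))
print(sft_share(0.6, 1.2, sft_weight=0.1))
```

If the SFT loss is much larger than the SimPO loss, even a small `sft_weight` can dominate the gradient, so it is worth logging both terms separately.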
Config Examples
Base model (no SFT):
model_name_or_path: mistralai/Mistral-7B-v0.1
sft_weight: 0.0
Instruct model (light SFT):
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
sft_weight: 0.1
Chat model (moderate SFT):
model_name_or_path: HuggingFaceH4/zephyr-7b-beta
sft_weight: 0.2
Model-Size-Specific Recommendations
7B Models (Mistral, Llama 3)
Standard config:
learning_rate: 5e-7
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.0 # 0.1 if instruct model
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
8B-13B Models
Standard config:
learning_rate: 3e-7
beta: 2.5
gamma_beta_ratio: 0.5
sft_weight: 0.1 # If instruct
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
70B Models
Standard config:
learning_rate: 1e-7
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.05
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
Batch Size & Gradient Accumulation
Effective Batch Size
Effective Batch Size = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
Recommended effective batch sizes:
- 7B: 128-256
- 13B: 64-128
- 70B: 32-64
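The formula can be inverted to find the accumulation steps needed for a target effective batch size. A small sketch follows; both function names are hypothetical, and requiring an exact divisor is a design choice made here for clarity.

```python
def effective_batch_size(per_device_train_batch_size, num_gpus,
                         gradient_accumulation_steps):
    """Number of preference pairs consumed per optimizer step."""
    return per_device_train_batch_size * num_gpus * gradient_accumulation_steps


def grad_accum_for_target(target_batch_size, per_device_train_batch_size, num_gpus):
    """Accumulation steps needed to hit target_batch_size exactly."""
    per_forward = per_device_train_batch_size * num_gpus
    if target_batch_size % per_forward != 0:
        raise ValueError(
            f"target {target_batch_size} is not a multiple of {per_forward}"
        )
    return target_batch_size // per_forward


# 4 GPUs with 2 pairs each, aiming for an effective batch of 128.
steps = grad_accum_for_target(128, per_device_train_batch_size=2, num_gpus=4)
print(steps, effective_batch_size(2, 4, steps))
```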
Config Examples
Single GPU (A100 40GB):
per_device_train_batch_size: 1
gradient_accumulation_steps: 128 # Effective batch = 1*1*128 = 128
4 GPUs (A100 40GB):
per_device_train_batch_size: 2
gradient_accumulation_steps: 16 # Effective batch = 2*4*16 = 128
8 GPUs (A100 80GB):
per_device_train_batch_size: 2
gradient_accumulation_steps: 8 # Effective batch = 2*8*8 = 128
Loss Type
Sigmoid vs Hinge
Sigmoid (default, recommended):
loss_type: sigmoid
label_smoothing: 0.0
Hinge (experimental):
loss_type: hinge
# label_smoothing is not used with hinge loss
When to use hinge:
- Margin-based tasks
- SVM-style optimization
- Experimental purposes
Generally: Stick with sigmoid
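To show how the two loss types differ in shape, here is a scalar sketch. `scaled_logits` stands for `beta * logits` from the Mathematical Meaning section. The unit-margin hinge form and the label-smoothing mix are assumptions based on how DPO-style trainers commonly define these losses; this guide does not spell them out.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_loss(scaled_logits, label_smoothing=0.0):
    # -log(sigmoid(z)) == softplus(-z); smoothing mixes in the flipped-label term.
    return ((1.0 - label_smoothing) * softplus(-scaled_logits)
            + label_smoothing * softplus(scaled_logits))


def hinge_loss(scaled_logits):
    # Zero loss once the scaled margin reaches 1, linear penalty below that.
    return max(0.0, 1.0 - scaled_logits)


for z in (-1.0, 0.0, 1.0, 3.0):
    print(z, sigmoid_loss(z), hinge_loss(z))
```

The sigmoid loss keeps a small gradient even for well-separated pairs, while the hinge loss goes exactly to zero past its margin, which is one reason sigmoid tends to be the smoother default.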
Tuning Guide
Step 1: Start with Defaults
learning_rate: 5e-7 # For 7B
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.0 # 0.1 if instruct
loss_type: sigmoid
Step 2: Monitor Training
Check every 100 steps:
- Loss curve (should decrease smoothly)
- Reward margin (should increase)
- Chosen/rejected logps (should separate)
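These checks can be automated against logged metrics. The sketch below is one possible approach, not part of any trainer: `check_training_health` is a hypothetical helper that compares the mean of the most recent window of steps with the window before it.

```python
from statistics import mean


def check_training_health(losses, reward_margins, window=100):
    """Flag whether loss is falling and reward margin is rising (illustrative)."""
    if len(losses) < 2 * window or len(reward_margins) < 2 * window:
        raise ValueError("need at least two full windows of logged values")

    def trend(values):
        # Positive when the latest window averages higher than the previous one.
        return mean(values[-window:]) - mean(values[-2 * window:-window])

    return {
        "loss_decreasing": trend(losses) < 0,
        "margin_increasing": trend(reward_margins) > 0,
    }


# Synthetic logs for demonstration: steadily falling loss, rising margin.
fake_losses = [1.0 - 0.001 * i for i in range(200)]
fake_margins = [0.001 * i for i in range(200)]
print(check_training_health(fake_losses, fake_margins))
```

Averaging over a window smooths out step-to-step noise, so a single bad batch does not trigger a false alarm.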
Step 3: Adjust if Needed
If loss diverges:
learning_rate: 3e-7 # Reduce from 5e-7
beta: 1.0 # Reduce from 2.0
If loss plateaus early:
learning_rate: 1e-6 # Increase from 5e-7
beta: 5.0 # Increase from 2.0
If model forgets:
sft_weight: 0.2 # Increase from 0.0
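The adjustments above can be expressed as a small rule table. This is a sketch for a 7B starting point only: the symptom names and `adjust_config` are hypothetical, and the values are copied from the three cases listed.

```python
# Overrides per symptom, taken from the adjustment cases above.
ADJUSTMENTS = {
    "loss_diverges": {"learning_rate": 3e-7, "beta": 1.0},
    "loss_plateaus": {"learning_rate": 1e-6, "beta": 5.0},
    "model_forgets": {"sft_weight": 0.2},
}


def adjust_config(config, symptom):
    """Return a copy of config with the overrides for the given symptom."""
    if symptom not in ADJUSTMENTS:
        raise ValueError(f"unknown symptom: {symptom!r}")
    return {**config, **ADJUSTMENTS[symptom]}


defaults = {
    "learning_rate": 5e-7,
    "beta": 2.0,
    "gamma_beta_ratio": 0.5,
    "sft_weight": 0.0,
    "loss_type": "sigmoid",
}
print(adjust_config(defaults, "loss_diverges"))
```

In practice it is usually safer to change one hyperparameter at a time and re-check the training curves before applying the next override.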
Complete Example Configs
Mistral 7B Base (Standard)
model_name_or_path: mistralai/Mistral-7B-v0.1
dataset_mixer:
HuggingFaceH4/ultrafeedback_binarized: 1.0
learning_rate: 5e-7
beta: 2.0
gamma_beta_ratio: 0.5
loss_type: sigmoid
sft_weight: 0.0
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
warmup_ratio: 0.1
lr_scheduler_type: cosine
bf16: true
gradient_checkpointing: true
Llama 3 8B Instruct (Reasoning)
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
dataset_mixer:
argilla/distilabel-math-preference-dpo: 1.0
learning_rate: 3e-7
beta: 5.0
gamma_beta_ratio: 0.7
loss_type: sigmoid
sft_weight: 0.1
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
warmup_ratio: 0.1
lr_scheduler_type: cosine
References
- SimPO paper: https://arxiv.org/abs/2405.14734
- Alignment Handbook: https://github.com/huggingface/alignment-handbook