# DPO Variants
Complete guide to Direct Preference Optimization loss variants in TRL.
## Overview
DPO optimizes models using preference data (chosen/rejected pairs). TRL supports 10+ loss variants for different scenarios.
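Every loss formula below is stated in terms of `logits`. In DPO this is the policy-vs-reference log-ratio difference for one chosen/rejected pair. A minimal pure-Python sketch (illustrative names, not TRL internals):

```python
# Illustrative sketch: the "logits" that appear in each loss formula are the
# difference between how much the policy prefers the chosen completion over
# the rejected one, relative to the reference model.
def dpo_logits(policy_chosen_logp, policy_rejected_logp,
               ref_chosen_logp, ref_rejected_logp):
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return chosen_logratio - rejected_logratio

# The policy prefers the chosen completion more than the reference does:
print(dpo_logits(-10.0, -14.0, -11.0, -13.0))  # 2.0
```

Positive `logits` means the policy already ranks the pair correctly relative to the reference; each loss variant differs only in how it penalizes this quantity.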
## Loss Types
### 1. Sigmoid (Standard DPO)

**Formula:** `-log(sigmoid(β * logits))`

**When to use:** Default choice for general preference alignment

**Config:**

```python
DPOConfig(
    loss_type="sigmoid",
    beta=0.1,  # KL penalty
    per_device_train_batch_size=64,
    learning_rate=1e-6,
)
```
### 2. IPO (Identity Preference Optimization)

**Formula:** `(logits - 1/(2β))²`

**When to use:** Stronger theoretical grounding; reduces overfitting to preference data

**Config:**

```python
DPOConfig(
    loss_type="ipo",
    beta=0.1,
    per_device_train_batch_size=90,
    learning_rate=1e-2,
)
```
### 3. Hinge (SLiC)

**Formula:** `ReLU(1 - β * logits)`

**When to use:** Margin-based objective

**Config:**

```python
DPOConfig(
    loss_type="hinge",
    beta=0.1,
    per_device_train_batch_size=512,
    learning_rate=1e-4,
)
```
### 4. Robust DPO

**Formula:** Sigmoid loss with label smoothing for noise robustness

**When to use:** Noisy preference labels

**Config:**

```python
DPOConfig(
    loss_type="robust",
    beta=0.01,
    label_smoothing=0.1,  # Noise probability
    per_device_train_batch_size=16,
    learning_rate=1e-3,
    max_prompt_length=128,
    max_length=512,
)
```
### 5. BCO Pair (Binary Classification)

**Formula:** Train a binary classifier (chosen=1, rejected=0)

**When to use:** Pairwise preference data

**Config:**

```python
DPOConfig(
    loss_type="bco_pair",
    beta=0.01,
    per_device_train_batch_size=128,
    learning_rate=5e-7,
    max_prompt_length=1536,
    max_completion_length=512,
)
```
### 6. SPPO Hard

**Formula:** Push chosen rewards toward 0.5 and rejected rewards toward -0.5

**When to use:** Nash-equilibrium-style alignment; sparse data

**Config:**

```python
DPOConfig(
    loss_type="sppo_hard",
    beta=0.1,
)
```
### 7. DiscoPOP

**Formula:** Log-ratio modulated loss

**When to use:** Loss function found by automated loss discovery

**Config:**

```python
DPOConfig(
    loss_type="discopop",
    beta=0.05,
    discopop_tau=0.05,
    per_device_train_batch_size=64,
    learning_rate=5e-7,
)
```
### 8. APO Zero

**Formula:** Increase chosen likelihood, decrease rejected likelihood

**When to use:** When the model is worse than the winning (chosen) outputs

**Config:**

```python
DPOConfig(
    loss_type="apo_zero",
    beta=0.1,
    per_device_train_batch_size=64,
    learning_rate=2e-7,
    max_prompt_length=512,
    max_completion_length=512,
)
```
### 9. APO Down

**Formula:** Decrease both likelihoods, with emphasis on reducing rejected

**When to use:** When the model is better than the winning (chosen) outputs

**Config:**

```python
DPOConfig(
    loss_type="apo_down",
    beta=0.1,
    # Same hyperparameters as apo_zero
)
```
### 10. AOT & AOT Pair

**Formula:** Distributional alignment via stochastic dominance

**When to use:**

- `aot_pair`: Paired preference data
- `aot`: Unpaired data

**Config:**

```python
DPOConfig(
    loss_type="aot_pair",  # or "aot"
    beta=0.1,
    label_smoothing=0.0,
)
```
## Multi-Loss Training

Combine multiple losses with per-loss weights:

```python
DPOConfig(
    loss_type=["sigmoid", "ipo"],
    loss_weights=[0.7, 0.3],  # Weighted combination
    beta=0.1,
)
```
## Key Parameters

### Beta (β)

Controls deviation from the reference model:

- Higher (0.5): More conservative; stays close to the reference
- Lower (0.01): More aggressive alignment
- Default: 0.1
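The effect is visible on a single pair: for the same raw margin, a higher β saturates the sigmoid loss sooner (small residual gradient, conservative updates), while a lower β leaves the loss near log 2 and keeps pushing (a sketch of the trend, not a tuning recommendation):

```python
import math

def sigmoid_dpo_loss(logits, beta):
    return math.log1p(math.exp(-beta * logits))  # -log sigmoid(beta * logits)

for beta in (0.01, 0.1, 0.5):
    print(beta, round(sigmoid_dpo_loss(2.0, beta), 4))
# 0.01 0.6832
# 0.1 0.5981
# 0.5 0.3133
```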
### Label Smoothing

For robust DPO:

- 0.0: No smoothing (default)
- 0.1-0.3: Moderate noise robustness
- 0.5: Maximum noise tolerance
### Max Lengths

- `max_prompt_length`: 128-1536
- `max_completion_length`: 128-512
- `max_length`: Total sequence (1024-2048)
## Comparison Table
| Loss | Speed | Stability | Best For |
|---|---|---|---|
| Sigmoid | Fast | Good | General use |
| IPO | Fast | Better | Overfitting issues |
| Hinge | Fast | Good | Margin objectives |
| Robust | Fast | Best | Noisy data |
| BCO | Medium | Good | Binary classification |
| DiscoPOP | Fast | Good | New architectures |
| APO | Fast | Good | Model quality matching |
## References
- DPO paper: https://arxiv.org/abs/2305.18290
- IPO paper: https://arxiv.org/abs/2310.12036
- TRL docs: https://huggingface.co/docs/trl/dpo_trainer