refactor: reorganize skills into sub-categories

The skills directory was getting disorganized — mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each. Code change: - prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure. Split mlops (40 skills) into 7 sub-categories: - mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth - mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm - mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper - mlops/vector-databases (4): chroma, faiss, pinecone, qdrant - mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases - mlops/cloud (2): lambda-labs, modal - mlops/research (1): dspy Merged singleton categories: - gifs → media (gif-search joins youtube-content) - music-creation → media (heartmula, songsee) - diagramming → creative (excalidraw joins ascii-art) - ocr-and-documents → productivity - domain → research (domain-intel) - feeds → research (blogwatcher) - market-data → research (polymarket) Fixed misplaced skills: - mlops/code-review → software-development (not ML-specific) - mlops/ml-paper-writing → research (academic writing) Added DESCRIPTION.md files for all new/updated categories.
2026-04-25 00:51:20 +00:00 · 2026-03-09 03:35:53 -07:00 · 2026-03-09 03:35:53 -07:00 · 732c66b0f3
commit 732c66b0f3
parent d6c710706f
217 changed files with 39 additions and 4 deletions
--- a/skills/mlops/training/simpo/references/hyperparameters.md
+++ b/skills/mlops/training/simpo/references/hyperparameters.md
@ -0,0 +1,452 @@
+# Hyperparameters
+
+Complete guide to SimPO hyperparameter selection and tuning.
+
+## Overview
+
+Key hyperparameters in SimPO:
+1. **Learning Rate** - Most critical
+2. **Beta (β)** - Reward scaling
+3. **Gamma-Beta Ratio (γ/β)** - Target margin
+4. **SFT Weight** - Regularization strength
+
+## Learning Rate
+
+### Recommended Ranges
+
+**By model size**:
+| Model Size | Learning Rate | Notes |
+|------------|---------------|-------|
+| 1B-3B | 5e-7 to 1e-6 | Higher end safe |
+| 7B-8B | 3e-7 to 5e-7 | **Standard** |
+| 13B-30B | 1e-7 to 3e-7 | Lower for stability |
+| 70B+ | 5e-8 to 1e-7 | Very conservative |
+
+**By task type**:
+| Task | Learning Rate | Reason |
+|------|---------------|--------|
+| General chat | 5e-7 | Standard |
+| Code generation | 3e-7 | **Precise reasoning** |
+| Math reasoning | 3e-7 | **Careful optimization** |
+| Creative writing | 1e-6 | More aggressive OK |
+
+### Why Learning Rate Matters
+
+**Too high** (> 1e-6 for 7B):
+- Loss divergence
+- Catastrophic forgetting
+- Unstable training
+
+**Too low** (< 1e-7 for 7B):
+- Very slow convergence
+- May not finish in time
+- Undertraining
+
+**Optimal** (3e-7 to 5e-7 for 7B):
+- Stable convergence
+- Good final performance
+- Efficient training
+
+### Config Examples
+
+**Mistral 7B (general)**:
+```yaml
+learning_rate: 5e-7
+num_train_epochs: 1
+warmup_ratio: 0.1
+lr_scheduler_type: cosine
+```
+
+**Llama 3 8B (reasoning)**:
+```yaml
+learning_rate: 3e-7
+num_train_epochs: 1
+warmup_ratio: 0.1
+lr_scheduler_type: cosine
+```
+
+**Gemma 2 9B (creative)**:
+```yaml
+learning_rate: 1e-6
+num_train_epochs: 1
+warmup_ratio: 0.1
+lr_scheduler_type: linear
+```
+
+## Beta (β)
+
+### Recommended Values
+
+**Range**: 2.0 to 10.0 (much higher than DPO's 0.01-0.1)
+
+**By preference strength**:
+| Beta | Preference Strength | Use Case |
+|------|-------------------|----------|
+| 1.0-2.0 | Weak | Subtle preferences |
+| 2.0-5.0 | **Standard** | General alignment |
+| 5.0-10.0 | Strong | Clear preferences |
+
+**Default**: 2.0 to 2.5
+
+### Why Beta Matters
+
+**Low beta** (< 2.0):
+- Weak reward signal
+- Slow preference learning
+- May underfit
+
+**High beta** (> 10.0):
+- Very strong reward signal
+- Risk of overfitting
+- May ignore weak preferences
+
+**Optimal** (2.0-5.0):
+- Balanced reward scaling
+- Stable training
+- Good generalization
+
+### Interaction with Gamma
+
+**Beta and gamma together**:
+```
+Target margin in reward space = gamma
+Target margin in logit space = gamma / beta
+```
+
+**Example**:
+```yaml
+beta: 2.0
+gamma_beta_ratio: 0.5
+# Effective gamma = 2.0 * 0.5 = 1.0
+```
+
+### Config Examples
+
+**Weak preferences**:
+```yaml
+beta: 2.0
+gamma_beta_ratio: 0.3  # Small margin
+```
+
+**Standard**:
+```yaml
+beta: 2.5
+gamma_beta_ratio: 0.5  # Default
+```
+
+**Strong preferences**:
+```yaml
+beta: 5.0
+gamma_beta_ratio: 0.7  # Larger margin
+```
+
+## Gamma-Beta Ratio (γ/β)
+
+### Recommended Values
+
+**Range**: 0.0 to 1.0
+
+**By scenario**:
+| Ratio | Margin | Use Case |
+|-------|--------|----------|
+| 0.0-0.3 | Small | Weak preference data |
+| 0.4-0.6 | **Standard** | General use |
+| 0.7-1.0 | Large | Very clear preferences |
+
+**Default**: 0.5
+
+### Why Gamma Matters
+
+**Low gamma** (< 0.3):
+- Small target margin
+- Less aggressive alignment
+- More conservative
+
+**High gamma** (> 0.7):
+- Large target margin
+- Stronger alignment
+- More aggressive
+
+**Optimal** (0.4-0.6):
+- Balanced margin
+- Stable training
+- Good alignment
+
+### Mathematical Meaning
+
+**In loss function**:
+```python
+logits = pi_logratios - gamma_beta_ratio
+loss = -log(sigmoid(beta * logits))
+```
+
+**Interpretation**:
+- gamma_beta_ratio shifts the decision boundary
+- Higher ratio = requires larger log prob difference
+- Controls how "clear" preferences must be
+
+### Config Examples
+
+**Noisy preferences**:
+```yaml
+gamma_beta_ratio: 0.3  # Smaller margin, more tolerant
+```
+
+**Standard**:
+```yaml
+gamma_beta_ratio: 0.5  # Default
+```
+
+**High-quality preferences**:
+```yaml
+gamma_beta_ratio: 0.8  # Larger margin, stricter
+```
+
+## SFT Weight
+
+### Recommended Values
+
+**Range**: 0.0 to 1.0
+
+**By model type**:
+| Model Type | SFT Weight | Reason |
+|------------|-----------|--------|
+| Base model | 0.0 | No prior capabilities |
+| **Instruct model** | 0.05-0.1 | Preserve instruction following |
+| Chat model | 0.1-0.2 | Preserve conversational skills |
+
+**Default**: 0.0 (no SFT regularization)
+
+### Why SFT Weight Matters
+
+**Zero SFT** (0.0):
+- Pure preference optimization
+- May forget capabilities
+- Standard for base models
+
+**Low SFT** (0.05-0.1):
+- Balanced approach
+- **Recommended for instruct models**
+- Slight capability preservation
+
+**High SFT** (> 0.2):
+- Strong capability preservation
+- Weaker preference alignment
+- May reduce alignment gains
+
+### Trade-off
+
+```
+Total Loss = SimPO Loss + (sft_weight * SFT Loss)
+```
+
+**Example**:
+```yaml
+sft_weight: 0.1
+# 90% preference optimization + 10% capability preservation
+```
+
+### Config Examples
+
+**Base model (no SFT)**:
+```yaml
+model_name_or_path: mistralai/Mistral-7B-v0.1
+sft_weight: 0.0
+```
+
+**Instruct model (light SFT)**:
+```yaml
+model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
+sft_weight: 0.1
+```
+
+**Chat model (moderate SFT)**:
+```yaml
+model_name_or_path: HuggingFaceH4/zephyr-7b-beta
+sft_weight: 0.2
+```
+
+## Model-Size-Specific Recommendations
+
+### 7B Models (Mistral, Llama 3)
+
+**Standard config**:
+```yaml
+learning_rate: 5e-7
+beta: 2.0
+gamma_beta_ratio: 0.5
+sft_weight: 0.0  # 0.1 if instruct model
+num_train_epochs: 1
+per_device_train_batch_size: 2
+gradient_accumulation_steps: 4
+```
+
+### 8B-13B Models
+
+**Standard config**:
+```yaml
+learning_rate: 3e-7
+beta: 2.5
+gamma_beta_ratio: 0.5
+sft_weight: 0.1  # If instruct
+num_train_epochs: 1
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 8
+```
+
+### 70B Models
+
+**Standard config**:
+```yaml
+learning_rate: 1e-7
+beta: 2.0
+gamma_beta_ratio: 0.5
+sft_weight: 0.05
+num_train_epochs: 1
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 16
+```
+
+## Batch Size & Gradient Accumulation
+
+### Effective Batch Size
+
+```
+Effective Batch Size = per_device_batch_size * num_gpus * grad_accum_steps
+```
+
+**Recommended effective batch sizes**:
+- 7B: 128-256
+- 13B: 64-128
+- 70B: 32-64
+
+### Config Examples
+
+**Single GPU (A100 40GB)**:
+```yaml
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 128  # Effective batch = 128
+```
+
+**4 GPUs (A100 40GB)**:
+```yaml
+per_device_train_batch_size: 2
+gradient_accumulation_steps: 16  # Effective batch = 2*4*16 = 128
+```
+
+**8 GPUs (A100 80GB)**:
+```yaml
+per_device_train_batch_size: 2
+gradient_accumulation_steps: 8  # Effective batch = 2*8*8 = 128
+```
+
+## Loss Type
+
+### Sigmoid vs Hinge
+
+**Sigmoid** (default, recommended):
+```yaml
+loss_type: sigmoid
+label_smoothing: 0.0
+```
+
+**Hinge** (experimental):
+```yaml
+loss_type: hinge
+# No label smoothing for hinge
+```
+
+**When to use hinge**:
+- Margin-based tasks
+- SVM-style optimization
+- Experimental purposes
+
+**Generally**: Stick with sigmoid
+
+## Tuning Guide
+
+### Step 1: Start with Defaults
+
+```yaml
+learning_rate: 5e-7  # For 7B
+beta: 2.0
+gamma_beta_ratio: 0.5
+sft_weight: 0.0  # 0.1 if instruct
+loss_type: sigmoid
+```
+
+### Step 2: Monitor Training
+
+**Check every 100 steps**:
+- Loss curve (should decrease smoothly)
+- Reward margin (should increase)
+- Chosen/rejected logps (should separate)
+
+### Step 3: Adjust if Needed
+
+**If loss diverges**:
+```yaml
+learning_rate: 3e-7  # Reduce from 5e-7
+beta: 1.0           # Reduce from 2.0
+```
+
+**If loss plateaus early**:
+```yaml
+learning_rate: 1e-6  # Increase from 5e-7
+beta: 5.0           # Increase from 2.0
+```
+
+**If model forgets**:
+```yaml
+sft_weight: 0.2  # Increase from 0.0
+```
+
+## Complete Example Configs
+
+### Mistral 7B Base (Standard)
+
+```yaml
+model_name_or_path: mistralai/Mistral-7B-v0.1
+dataset_mixer:
+  HuggingFaceH4/ultrafeedback_binarized: 1.0
+
+learning_rate: 5e-7
+beta: 2.0
+gamma_beta_ratio: 0.5
+loss_type: sigmoid
+sft_weight: 0.0
+
+num_train_epochs: 1
+per_device_train_batch_size: 2
+gradient_accumulation_steps: 4
+warmup_ratio: 0.1
+lr_scheduler_type: cosine
+
+bf16: true
+gradient_checkpointing: true
+```
+
+### Llama 3 8B Instruct (Reasoning)
+
+```yaml
+model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
+dataset_mixer:
+  argilla/distilabel-math-preference-dpo: 1.0
+
+learning_rate: 3e-7
+beta: 5.0
+gamma_beta_ratio: 0.7
+loss_type: sigmoid
+sft_weight: 0.1
+
+num_train_epochs: 1
+per_device_train_batch_size: 1
+gradient_accumulation_steps: 16
+warmup_ratio: 0.1
+lr_scheduler_type: cosine
+```
+
+## References
+
+- SimPO paper: https://arxiv.org/abs/2405.14734
+- Alignment Handbook: https://github.com/huggingface/alignment-handbook