mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
fix: restore all removed bundled skills + fix skills sync system
- Restored 21 skills removed in commits757d012and740dd92: accelerate, audiocraft, code-review, faiss, flash-attention, gguf, grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion, tensorrt-llm, torchtitan, trl-fine-tuning, whisper - Rewrote sync_skills() with proper update semantics: * New skills (not in manifest): copied to user dir * Existing skills (in manifest + on disk): updated via hash comparison * User-deleted skills (in manifest, not on disk): respected, not re-added * Stale manifest entries (removed from bundled): cleaned from manifest - Added sync_skills() to CLI startup (cmd_chat) and gateway startup (start_gateway) — previously only ran during 'hermes update' - Updated cmd_update output to show new/updated/cleaned counts - Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh install, user deletion respect, update detection, stale cleanup, and name collision handling 75 bundled skills total. 2002 tests pass.
This commit is contained in:
parent
68fbae5692
commit
ab0f4126cf
74 changed files with 27881 additions and 44 deletions
452
skills/mlops/simpo/references/hyperparameters.md
Normal file
452
skills/mlops/simpo/references/hyperparameters.md
Normal file
|
|
@ -0,0 +1,452 @@
|
|||
# Hyperparameters
|
||||
|
||||
Complete guide to SimPO hyperparameter selection and tuning.
|
||||
|
||||
## Overview
|
||||
|
||||
Key hyperparameters in SimPO:
|
||||
1. **Learning Rate** - Most critical
|
||||
2. **Beta (β)** - Reward scaling
|
||||
3. **Gamma-Beta Ratio (γ/β)** - Target margin
|
||||
4. **SFT Weight** - Regularization strength
|
||||
|
||||
## Learning Rate
|
||||
|
||||
### Recommended Ranges
|
||||
|
||||
**By model size**:
|
||||
| Model Size | Learning Rate | Notes |
|
||||
|------------|---------------|-------|
|
||||
| 1B-3B | 5e-7 to 1e-6 | Higher end safe |
|
||||
| 7B-8B | 3e-7 to 5e-7 | **Standard** |
|
||||
| 13B-30B | 1e-7 to 3e-7 | Lower for stability |
|
||||
| 70B+ | 5e-8 to 1e-7 | Very conservative |
|
||||
|
||||
**By task type**:
|
||||
| Task | Learning Rate | Reason |
|
||||
|------|---------------|--------|
|
||||
| General chat | 5e-7 | Standard |
|
||||
| Code generation | 3e-7 | **Precise reasoning** |
|
||||
| Math reasoning | 3e-7 | **Careful optimization** |
|
||||
| Creative writing | 1e-6 | More aggressive OK |
|
||||
|
||||
### Why Learning Rate Matters
|
||||
|
||||
**Too high** (> 1e-6 for 7B):
|
||||
- Loss divergence
|
||||
- Catastrophic forgetting
|
||||
- Unstable training
|
||||
|
||||
**Too low** (< 1e-7 for 7B):
|
||||
- Very slow convergence
|
||||
- May not finish in time
|
||||
- Undertraining
|
||||
|
||||
**Optimal** (3e-7 to 5e-7 for 7B):
|
||||
- Stable convergence
|
||||
- Good final performance
|
||||
- Efficient training
|
||||
|
||||
### Config Examples
|
||||
|
||||
**Mistral 7B (general)**:
|
||||
```yaml
|
||||
learning_rate: 5e-7
|
||||
num_train_epochs: 1
|
||||
warmup_ratio: 0.1
|
||||
lr_scheduler_type: cosine
|
||||
```
|
||||
|
||||
**Llama 3 8B (reasoning)**:
|
||||
```yaml
|
||||
learning_rate: 3e-7
|
||||
num_train_epochs: 1
|
||||
warmup_ratio: 0.1
|
||||
lr_scheduler_type: cosine
|
||||
```
|
||||
|
||||
**Gemma 2 9B (creative)**:
|
||||
```yaml
|
||||
learning_rate: 1e-6
|
||||
num_train_epochs: 1
|
||||
warmup_ratio: 0.1
|
||||
lr_scheduler_type: linear
|
||||
```
|
||||
|
||||
## Beta (β)
|
||||
|
||||
### Recommended Values
|
||||
|
||||
**Range**: 2.0 to 10.0 (much higher than DPO's 0.01-0.1)
|
||||
|
||||
**By preference strength**:
|
||||
| Beta | Preference Strength | Use Case |
|
||||
|------|-------------------|----------|
|
||||
| 1.0-2.0 | Weak | Subtle preferences |
|
||||
| 2.0-5.0 | **Standard** | General alignment |
|
||||
| 5.0-10.0 | Strong | Clear preferences |
|
||||
|
||||
**Default**: 2.0 to 2.5
|
||||
|
||||
### Why Beta Matters
|
||||
|
||||
**Low beta** (< 2.0):
|
||||
- Weak reward signal
|
||||
- Slow preference learning
|
||||
- May underfit
|
||||
|
||||
**High beta** (> 10.0):
|
||||
- Very strong reward signal
|
||||
- Risk of overfitting
|
||||
- May ignore weak preferences
|
||||
|
||||
**Optimal** (2.0-5.0):
|
||||
- Balanced reward scaling
|
||||
- Stable training
|
||||
- Good generalization
|
||||
|
||||
### Interaction with Gamma
|
||||
|
||||
**Beta and gamma together**:
|
||||
```
|
||||
Target margin in reward space = gamma
|
||||
Target margin in logit space = gamma / beta
|
||||
```
|
||||
|
||||
**Example**:
|
||||
```yaml
|
||||
beta: 2.0
|
||||
gamma_beta_ratio: 0.5
|
||||
# Effective gamma = 2.0 * 0.5 = 1.0
|
||||
```
|
||||
|
||||
### Config Examples
|
||||
|
||||
**Weak preferences**:
|
||||
```yaml
|
||||
beta: 2.0
|
||||
gamma_beta_ratio: 0.3 # Small margin
|
||||
```
|
||||
|
||||
**Standard**:
|
||||
```yaml
|
||||
beta: 2.5
|
||||
gamma_beta_ratio: 0.5 # Default
|
||||
```
|
||||
|
||||
**Strong preferences**:
|
||||
```yaml
|
||||
beta: 5.0
|
||||
gamma_beta_ratio: 0.7 # Larger margin
|
||||
```
|
||||
|
||||
## Gamma-Beta Ratio (γ/β)
|
||||
|
||||
### Recommended Values
|
||||
|
||||
**Range**: 0.0 to 1.0
|
||||
|
||||
**By scenario**:
|
||||
| Ratio | Margin | Use Case |
|
||||
|-------|--------|----------|
|
||||
| 0.0-0.3 | Small | Weak preference data |
|
||||
| 0.4-0.6 | **Standard** | General use |
|
||||
| 0.7-1.0 | Large | Very clear preferences |
|
||||
|
||||
**Default**: 0.5
|
||||
|
||||
### Why Gamma Matters
|
||||
|
||||
**Low gamma** (< 0.3):
|
||||
- Small target margin
|
||||
- Less aggressive alignment
|
||||
- More conservative
|
||||
|
||||
**High gamma** (> 0.7):
|
||||
- Large target margin
|
||||
- Stronger alignment
|
||||
- More aggressive
|
||||
|
||||
**Optimal** (0.4-0.6):
|
||||
- Balanced margin
|
||||
- Stable training
|
||||
- Good alignment
|
||||
|
||||
### Mathematical Meaning
|
||||
|
||||
**In loss function**:
|
||||
```python
|
||||
logits = pi_logratios - gamma_beta_ratio
|
||||
loss = -log(sigmoid(beta * logits))
|
||||
```
|
||||
|
||||
**Interpretation**:
|
||||
- gamma_beta_ratio shifts the decision boundary
|
||||
- Higher ratio = requires larger log prob difference
|
||||
- Controls how "clear" preferences must be
|
||||
|
||||
### Config Examples
|
||||
|
||||
**Noisy preferences**:
|
||||
```yaml
|
||||
gamma_beta_ratio: 0.3 # Smaller margin, more tolerant
|
||||
```
|
||||
|
||||
**Standard**:
|
||||
```yaml
|
||||
gamma_beta_ratio: 0.5 # Default
|
||||
```
|
||||
|
||||
**High-quality preferences**:
|
||||
```yaml
|
||||
gamma_beta_ratio: 0.8 # Larger margin, stricter
|
||||
```
|
||||
|
||||
## SFT Weight
|
||||
|
||||
### Recommended Values
|
||||
|
||||
**Range**: 0.0 to 1.0
|
||||
|
||||
**By model type**:
|
||||
| Model Type | SFT Weight | Reason |
|
||||
|------------|-----------|--------|
|
||||
| Base model | 0.0 | No prior capabilities |
|
||||
| **Instruct model** | 0.05-0.1 | Preserve instruction following |
|
||||
| Chat model | 0.1-0.2 | Preserve conversational skills |
|
||||
|
||||
**Default**: 0.0 (no SFT regularization)
|
||||
|
||||
### Why SFT Weight Matters
|
||||
|
||||
**Zero SFT** (0.0):
|
||||
- Pure preference optimization
|
||||
- May forget capabilities
|
||||
- Standard for base models
|
||||
|
||||
**Low SFT** (0.05-0.1):
|
||||
- Balanced approach
|
||||
- **Recommended for instruct models**
|
||||
- Slight capability preservation
|
||||
|
||||
**High SFT** (> 0.2):
|
||||
- Strong capability preservation
|
||||
- Weaker preference alignment
|
||||
- May reduce alignment gains
|
||||
|
||||
### Trade-off
|
||||
|
||||
```
|
||||
Total Loss = SimPO Loss + (sft_weight * SFT Loss)
|
||||
```
|
||||
|
||||
**Example**:
|
||||
```yaml
|
||||
sft_weight: 0.1
|
||||
# 90% preference optimization + 10% capability preservation
|
||||
```
|
||||
|
||||
### Config Examples
|
||||
|
||||
**Base model (no SFT)**:
|
||||
```yaml
|
||||
model_name_or_path: mistralai/Mistral-7B-v0.1
|
||||
sft_weight: 0.0
|
||||
```
|
||||
|
||||
**Instruct model (light SFT)**:
|
||||
```yaml
|
||||
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
|
||||
sft_weight: 0.1
|
||||
```
|
||||
|
||||
**Chat model (moderate SFT)**:
|
||||
```yaml
|
||||
model_name_or_path: HuggingFaceH4/zephyr-7b-beta
|
||||
sft_weight: 0.2
|
||||
```
|
||||
|
||||
## Model-Size-Specific Recommendations
|
||||
|
||||
### 7B Models (Mistral, Llama 3)
|
||||
|
||||
**Standard config**:
|
||||
```yaml
|
||||
learning_rate: 5e-7
|
||||
beta: 2.0
|
||||
gamma_beta_ratio: 0.5
|
||||
sft_weight: 0.0 # 0.1 if instruct model
|
||||
num_train_epochs: 1
|
||||
per_device_train_batch_size: 2
|
||||
gradient_accumulation_steps: 4
|
||||
```
|
||||
|
||||
### 8B-13B Models
|
||||
|
||||
**Standard config**:
|
||||
```yaml
|
||||
learning_rate: 3e-7
|
||||
beta: 2.5
|
||||
gamma_beta_ratio: 0.5
|
||||
sft_weight: 0.1 # If instruct
|
||||
num_train_epochs: 1
|
||||
per_device_train_batch_size: 1
|
||||
gradient_accumulation_steps: 8
|
||||
```
|
||||
|
||||
### 70B Models
|
||||
|
||||
**Standard config**:
|
||||
```yaml
|
||||
learning_rate: 1e-7
|
||||
beta: 2.0
|
||||
gamma_beta_ratio: 0.5
|
||||
sft_weight: 0.05
|
||||
num_train_epochs: 1
|
||||
per_device_train_batch_size: 1
|
||||
gradient_accumulation_steps: 16
|
||||
```
|
||||
|
||||
## Batch Size & Gradient Accumulation
|
||||
|
||||
### Effective Batch Size
|
||||
|
||||
```
|
||||
Effective Batch Size = per_device_batch_size * num_gpus * grad_accum_steps
|
||||
```
|
||||
|
||||
**Recommended effective batch sizes**:
|
||||
- 7B: 128-256
|
||||
- 13B: 64-128
|
||||
- 70B: 32-64
|
||||
|
||||
### Config Examples
|
||||
|
||||
**Single GPU (A100 40GB)**:
|
||||
```yaml
|
||||
per_device_train_batch_size: 1
|
||||
gradient_accumulation_steps: 128 # Effective batch = 128
|
||||
```
|
||||
|
||||
**4 GPUs (A100 40GB)**:
|
||||
```yaml
|
||||
per_device_train_batch_size: 2
|
||||
gradient_accumulation_steps: 16 # Effective batch = 2*4*16 = 128
|
||||
```
|
||||
|
||||
**8 GPUs (A100 80GB)**:
|
||||
```yaml
|
||||
per_device_train_batch_size: 2
|
||||
gradient_accumulation_steps: 8 # Effective batch = 2*8*8 = 128
|
||||
```
|
||||
|
||||
## Loss Type
|
||||
|
||||
### Sigmoid vs Hinge
|
||||
|
||||
**Sigmoid** (default, recommended):
|
||||
```yaml
|
||||
loss_type: sigmoid
|
||||
label_smoothing: 0.0
|
||||
```
|
||||
|
||||
**Hinge** (experimental):
|
||||
```yaml
|
||||
loss_type: hinge
|
||||
# No label smoothing for hinge
|
||||
```
|
||||
|
||||
**When to use hinge**:
|
||||
- Margin-based tasks
|
||||
- SVM-style optimization
|
||||
- Experimental purposes
|
||||
|
||||
**Generally**: Stick with sigmoid
|
||||
|
||||
## Tuning Guide
|
||||
|
||||
### Step 1: Start with Defaults
|
||||
|
||||
```yaml
|
||||
learning_rate: 5e-7 # For 7B
|
||||
beta: 2.0
|
||||
gamma_beta_ratio: 0.5
|
||||
sft_weight: 0.0 # 0.1 if instruct
|
||||
loss_type: sigmoid
|
||||
```
|
||||
|
||||
### Step 2: Monitor Training
|
||||
|
||||
**Check every 100 steps**:
|
||||
- Loss curve (should decrease smoothly)
|
||||
- Reward margin (should increase)
|
||||
- Chosen/rejected logps (should separate)
|
||||
|
||||
### Step 3: Adjust if Needed
|
||||
|
||||
**If loss diverges**:
|
||||
```yaml
|
||||
learning_rate: 3e-7 # Reduce from 5e-7
|
||||
beta: 1.0 # Reduce from 2.0
|
||||
```
|
||||
|
||||
**If loss plateaus early**:
|
||||
```yaml
|
||||
learning_rate: 1e-6 # Increase from 5e-7
|
||||
beta: 5.0 # Increase from 2.0
|
||||
```
|
||||
|
||||
**If model forgets**:
|
||||
```yaml
|
||||
sft_weight: 0.2 # Increase from 0.0
|
||||
```
|
||||
|
||||
## Complete Example Configs
|
||||
|
||||
### Mistral 7B Base (Standard)
|
||||
|
||||
```yaml
|
||||
model_name_or_path: mistralai/Mistral-7B-v0.1
|
||||
dataset_mixer:
|
||||
HuggingFaceH4/ultrafeedback_binarized: 1.0
|
||||
|
||||
learning_rate: 5e-7
|
||||
beta: 2.0
|
||||
gamma_beta_ratio: 0.5
|
||||
loss_type: sigmoid
|
||||
sft_weight: 0.0
|
||||
|
||||
num_train_epochs: 1
|
||||
per_device_train_batch_size: 2
|
||||
gradient_accumulation_steps: 4
|
||||
warmup_ratio: 0.1
|
||||
lr_scheduler_type: cosine
|
||||
|
||||
bf16: true
|
||||
gradient_checkpointing: true
|
||||
```
|
||||
|
||||
### Llama 3 8B Instruct (Reasoning)
|
||||
|
||||
```yaml
|
||||
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
|
||||
dataset_mixer:
|
||||
argilla/distilabel-math-preference-dpo: 1.0
|
||||
|
||||
learning_rate: 3e-7
|
||||
beta: 5.0
|
||||
gamma_beta_ratio: 0.7
|
||||
loss_type: sigmoid
|
||||
sft_weight: 0.1
|
||||
|
||||
num_train_epochs: 1
|
||||
per_device_train_batch_size: 1
|
||||
gradient_accumulation_steps: 16
|
||||
warmup_ratio: 0.1
|
||||
lr_scheduler_type: cosine
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
- SimPO paper: https://arxiv.org/abs/2405.14734
|
||||
- Alignment Handbook: https://github.com/huggingface/alignment-handbook
|
||||
Loading…
Add table
Add a link
Reference in a new issue