Hyperparameters
Complete guide to SimPO hyperparameter selection and tuning.
Overview
Key hyperparameters in SimPO:
- Learning Rate - Most critical
- Beta (β) - Reward scaling
- Gamma-Beta Ratio (γ/β) - Target margin
- SFT Weight - Regularization strength
Learning Rate
Recommended Ranges
By model size:
| Model Size | Learning Rate | Notes |
|---|---|---|
| 1B-3B | 5e-7 to 1e-6 | Higher end safe |
| 7B-8B | 3e-7 to 5e-7 | Standard |
| 13B-30B | 1e-7 to 3e-7 | Lower for stability |
| 70B+ | 5e-8 to 1e-7 | Very conservative |
By task type:
| Task | Learning Rate | Reason |
|---|---|---|
| General chat | 5e-7 | Standard |
| Code generation | 3e-7 | Precise reasoning |
| Math reasoning | 3e-7 | Careful optimization |
| Creative writing | 1e-6 | More aggressive OK |
Why Learning Rate Matters
Too high (> 1e-6 for 7B):
- Loss divergence
- Catastrophic forgetting
- Unstable training
Too low (< 1e-7 for 7B):
- Very slow convergence
- May not converge within the training budget
- Undertrained final model
Optimal (3e-7 to 5e-7 for 7B):
- Stable convergence
- Good final performance
- Efficient training
Config Examples
Mistral 7B (general):
learning_rate: 5e-7
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: cosine
Llama 3 8B (reasoning):
learning_rate: 3e-7
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: cosine
Gemma 2 9B (creative):
learning_rate: 1e-6
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: linear
Beta (β)
Recommended Values
Range: 2.0 to 10.0 (much higher than DPO's typical 0.01-0.1); values as low as 1.0 can work for very weak preferences
By preference strength:
| Beta | Preference Strength | Use Case |
|---|---|---|
| 1.0-2.0 | Weak | Subtle preferences |
| 2.0-5.0 | Standard | General alignment |
| 5.0-10.0 | Strong | Clear preferences |
Default: 2.0 to 2.5
Why Beta Matters
Low beta (< 2.0):
- Weak reward signal
- Slow preference learning
- May underfit
High beta (> 10.0):
- Very strong reward signal
- Risk of overfitting
- May ignore weak preferences
Optimal (2.0-5.0):
- Balanced reward scaling
- Stable training
- Good generalization
Interaction with Gamma
Beta and gamma together:
Target margin in reward space = gamma
Target margin in logit space = gamma / beta
Example:
beta: 2.0
gamma_beta_ratio: 0.5
# Effective gamma = 2.0 * 0.5 = 1.0
Config Examples
Weak preferences:
beta: 2.0
gamma_beta_ratio: 0.3 # Small margin
Standard:
beta: 2.5
gamma_beta_ratio: 0.5 # Default
Strong preferences:
beta: 5.0
gamma_beta_ratio: 0.7 # Larger margin
Gamma-Beta Ratio (γ/β)
Recommended Values
Range: 0.0 to 1.0
By scenario:
| Ratio | Margin | Use Case |
|---|---|---|
| 0.0-0.3 | Small | Weak preference data |
| 0.4-0.6 | Standard | General use |
| 0.7-1.0 | Large | Very clear preferences |
Default: 0.5
Why Gamma Matters
Low gamma (< 0.3):
- Small target margin
- Less aggressive alignment
- More conservative
High gamma (> 0.7):
- Large target margin
- Stronger alignment
- More aggressive
Optimal (0.4-0.6):
- Balanced margin
- Stable training
- Good alignment
Mathematical Meaning
In loss function:
logits = pi_logratios - gamma_beta_ratio
loss = -log(sigmoid(beta * logits))
Interpretation:
- gamma_beta_ratio shifts the decision boundary
- Higher ratio = requires larger log prob difference
- Controls how "clear" preferences must be
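The loss and its interpretation above can be checked numerically. Below is a minimal Python sketch of the sigmoid SimPO loss for a single preference pair (a sketch, not the reference implementation), assuming `chosen_logp`/`rejected_logp` are the length-normalized average log-probabilities:

```python
import math

def simpo_sigmoid_loss(chosen_logp, rejected_logp,
                       beta=2.0, gamma_beta_ratio=0.5):
    """Sigmoid SimPO loss for one preference pair.

    chosen_logp / rejected_logp: length-normalized average
    log-probabilities of the chosen and rejected responses.
    """
    pi_logratios = chosen_logp - rejected_logp
    logits = pi_logratios - gamma_beta_ratio  # shift by the target margin
    return -math.log(1.0 / (1.0 + math.exp(-beta * logits)))

# A pair that clears the margin (logratio 1.0 > 0.5) is penalized
# much less than one that does not (logratio 0.2 < 0.5):
assert simpo_sigmoid_loss(-1.0, -2.0) < simpo_sigmoid_loss(-1.0, -1.2)
```

Raising `gamma_beta_ratio` shifts `logits` down for every pair, so a larger log-probability gap is needed before the loss becomes small.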
Config Examples
Noisy preferences:
gamma_beta_ratio: 0.3 # Smaller margin, more tolerant
Standard:
gamma_beta_ratio: 0.5 # Default
High-quality preferences:
gamma_beta_ratio: 0.8 # Larger margin, stricter
SFT Weight
Recommended Values
Range: 0.0 to 1.0
By model type:
| Model Type | SFT Weight | Reason |
|---|---|---|
| Base model | 0.0 | No prior capabilities |
| Instruct model | 0.05-0.1 | Preserve instruction following |
| Chat model | 0.1-0.2 | Preserve conversational skills |
Default: 0.0 (no SFT regularization)
Why SFT Weight Matters
Zero SFT (0.0):
- Pure preference optimization
- May forget capabilities
- Standard for base models
Low SFT (0.05-0.1):
- Balanced approach
- Recommended for instruct models
- Slight capability preservation
High SFT (> 0.2):
- Strong capability preservation
- Weaker preference alignment
- May reduce alignment gains
Trade-off
Total Loss = SimPO Loss + (sft_weight * SFT Loss)
Example:
sft_weight: 0.1
# 90% preference optimization + 10% capability preservation
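The trade-off can be written as a one-line weighted sum. This is an illustrative sketch only; in practice the SFT term is the NLL of the chosen response, computed inside the trainer:

```python
def combined_loss(simpo_loss, chosen_nll, sft_weight=0.1):
    """Total objective: preference loss plus a weighted SFT (NLL) term
    on the chosen response. sft_weight=0.0 recovers pure SimPO."""
    return simpo_loss + sft_weight * chosen_nll

# With sft_weight=0.1, a chosen-response NLL of 2.0 adds 0.2 to the objective:
assert abs(combined_loss(0.3, 2.0, sft_weight=0.1) - 0.5) < 1e-12
```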
Config Examples
Base model (no SFT):
model_name_or_path: mistralai/Mistral-7B-v0.1
sft_weight: 0.0
Instruct model (light SFT):
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
sft_weight: 0.1
Chat model (moderate SFT):
model_name_or_path: HuggingFaceH4/zephyr-7b-beta
sft_weight: 0.2
Model-Size-Specific Recommendations
7B Models (Mistral, Llama 3)
Standard config:
learning_rate: 5e-7
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.0 # 0.1 if instruct model
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
8B-13B Models
Standard config:
learning_rate: 3e-7
beta: 2.5
gamma_beta_ratio: 0.5
sft_weight: 0.1 # If instruct
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
70B Models
Standard config:
learning_rate: 1e-7
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.05
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
Batch Size & Gradient Accumulation
Effective Batch Size
Effective Batch Size = per_device_batch_size * num_gpus * grad_accum_steps
Recommended effective batch sizes:
- 7B: 128-256
- 13B: 64-128
- 70B: 32-64
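The formula above can be inverted to choose `gradient_accumulation_steps` for a target effective batch size. A small helper sketch (function name is illustrative, not from any library):

```python
def grad_accum_steps(target_effective_batch, per_device_batch, num_gpus):
    """Solve effective = per_device * num_gpus * accum_steps for accum_steps."""
    steps, remainder = divmod(target_effective_batch, per_device_batch * num_gpus)
    if remainder:
        raise ValueError("target batch is not divisible by per_device * num_gpus")
    return steps

# 4 GPUs with per-device batch 2 and a target of 128 -> 16 accumulation steps
assert grad_accum_steps(128, 2, 4) == 16
```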
Config Examples
Single GPU (A100 40GB):
per_device_train_batch_size: 1
gradient_accumulation_steps: 128 # Effective batch = 128
4 GPUs (A100 40GB):
per_device_train_batch_size: 2
gradient_accumulation_steps: 16 # Effective batch = 2*4*16 = 128
8 GPUs (A100 80GB):
per_device_train_batch_size: 2
gradient_accumulation_steps: 8 # Effective batch = 2*8*8 = 128
Loss Type
Sigmoid vs Hinge
Sigmoid (default, recommended):
loss_type: sigmoid
label_smoothing: 0.0
Hinge (experimental):
loss_type: hinge
# No label smoothing for hinge
When to use hinge:
- Margin-based tasks
- SVM-style optimization
- Experimental purposes
Generally: Stick with sigmoid
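For reference, the hinge variant replaces the log-sigmoid with a margin penalty. The form below mirrors the standard hinge used in preference trainers (an assumption; verify against the implementation you are running):

```python
def simpo_hinge_loss(chosen_logp, rejected_logp, beta=2.0, gamma_beta_ratio=0.5):
    """Hinge variant: linear penalty until the scaled margin is cleared,
    then exactly zero (the sigmoid loss only asymptotes to zero)."""
    logits = (chosen_logp - rejected_logp) - gamma_beta_ratio
    return max(0.0, 1.0 - beta * logits)

# A well-separated pair contributes exactly zero loss:
assert simpo_hinge_loss(-1.0, -2.5) == 0.0
```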
Tuning Guide
Step 1: Start with Defaults
learning_rate: 5e-7 # For 7B
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.0 # 0.1 if instruct
loss_type: sigmoid
Step 2: Monitor Training
Check every 100 steps:
- Loss curve (should decrease smoothly)
- Reward margin (should increase)
- Chosen/rejected logps (should separate)
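The reward margin mentioned above can be derived from logged log-probabilities: in SimPO the implicit reward is beta times the length-normalized log-probability. A sketch with illustrative names:

```python
def reward_margin(chosen_logp_sum, chosen_len,
                  rejected_logp_sum, rejected_len, beta=2.0):
    """SimPO implicit reward: beta * average per-token log-probability.
    The margin (chosen minus rejected) should increase during training."""
    chosen_reward = beta * chosen_logp_sum / chosen_len
    rejected_reward = beta * rejected_logp_sum / rejected_len
    return chosen_reward - rejected_reward

# Chosen avg logp -0.5 vs rejected -0.8 gives a margin of 0.6 at beta=2.0:
assert abs(reward_margin(-50.0, 100, -80.0, 100) - 0.6) < 1e-9
```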
Step 3: Adjust if Needed
If loss diverges:
learning_rate: 3e-7 # Reduce from 5e-7
beta: 1.0 # Reduce from 2.0
If loss plateaus early:
learning_rate: 1e-6 # Increase from 5e-7
beta: 5.0 # Increase from 2.0
If model forgets:
sft_weight: 0.2 # Increase from 0.0
Complete Example Configs
Mistral 7B Base (Standard)
model_name_or_path: mistralai/Mistral-7B-v0.1
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0
learning_rate: 5e-7
beta: 2.0
gamma_beta_ratio: 0.5
loss_type: sigmoid
sft_weight: 0.0
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
warmup_ratio: 0.1
lr_scheduler_type: cosine
bf16: true
gradient_checkpointing: true
Llama 3 8B Instruct (Reasoning)
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
dataset_mixer:
  argilla/distilabel-math-preference-dpo: 1.0
learning_rate: 3e-7
beta: 5.0
gamma_beta_ratio: 0.7
loss_type: sigmoid
sft_weight: 0.1
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
warmup_ratio: 0.1
lr_scheduler_type: cosine
References
- SimPO paper: https://arxiv.org/abs/2405.14734
- Alignment Handbook: https://github.com/huggingface/alignment-handbook