hermes-agent/skills/mlops/simpo/references/hyperparameters.md
teknium f172f7d4aa Add skills tools and enhance model integration
- Introduced new skills tools: `skills_categories`, `skills_list`, and `skill_view` in `model_tools.py`, allowing for better organization and access to skill-related functionalities.
- Updated `toolsets.py` to include a new `skills` toolset, providing a dedicated space for skill tools.
- Enhanced `batch_runner.py` to recognize and validate skills tools during batch processing.
- Added comprehensive tool definitions for skills tools, ensuring compatibility with OpenAI's expected format.
- Created new shell script `test_skills_kimi.sh` for testing skills tool functionality with Kimi K2.5.
- Added example skill files demonstrating the structure and usage of skills within the Hermes-Agent framework, including `SKILL.md` for example and audiocraft skills.
- Improved documentation for skills tools and their integration into the existing tool framework, ensuring clarity for future development and usage.
2026-01-30 07:39:55 +00:00

8.3 KiB
Raw Blame History

Hyperparameters

Complete guide to SimPO hyperparameter selection and tuning.

Overview

Key hyperparameters in SimPO:

  1. Learning Rate - Most critical
  2. Beta (β) - Reward scaling
  3. Gamma-Beta Ratio (γ/β) - Target margin
  4. SFT Weight - Regularization strength

Learning Rate

By model size:

Model Size Learning Rate Notes
1B-3B 5e-7 to 1e-6 Higher end safe
7B-8B 3e-7 to 5e-7 Standard
13B-30B 1e-7 to 3e-7 Lower for stability
70B+ 5e-8 to 1e-7 Very conservative

By task type:

Task Learning Rate Reason
General chat 5e-7 Standard
Code generation 3e-7 Precise reasoning
Math reasoning 3e-7 Careful optimization
Creative writing 1e-6 More aggressive OK

Why Learning Rate Matters

Too high (> 1e-6 for 7B):

  • Loss divergence
  • Catastrophic forgetting
  • Unstable training

Too low (< 1e-7 for 7B):

  • Very slow convergence
  • May not finish in time
  • Undertraining

Optimal (3e-7 to 5e-7 for 7B):

  • Stable convergence
  • Good final performance
  • Efficient training

Config Examples

Mistral 7B (general):

learning_rate: 5e-7
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: cosine

Llama 3 8B (reasoning):

learning_rate: 3e-7
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: cosine

Gemma 2 9B (creative):

learning_rate: 1e-6
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: linear

Beta (β)

Range: 2.0 to 10.0 (much higher than DPO's 0.01-0.1)

By preference strength:

Beta Preference Strength Use Case
1.0-2.0 Weak Subtle preferences
2.0-5.0 Standard General alignment
5.0-10.0 Strong Clear preferences

Default: 2.0 to 2.5

Why Beta Matters

Low beta (< 2.0):

  • Weak reward signal
  • Slow preference learning
  • May underfit

High beta (> 10.0):

  • Very strong reward signal
  • Risk of overfitting
  • May ignore weak preferences

Optimal (2.0-5.0):

  • Balanced reward scaling
  • Stable training
  • Good generalization

Interaction with Gamma

Beta and gamma together:

Target margin in reward space = gamma
Target margin in logit space = gamma / beta

Example:

beta: 2.0
gamma_beta_ratio: 0.5
# Effective gamma = 2.0 * 0.5 = 1.0

Config Examples

Weak preferences:

beta: 2.0
gamma_beta_ratio: 0.3  # Small margin

Standard:

beta: 2.5
gamma_beta_ratio: 0.5  # Default

Strong preferences:

beta: 5.0
gamma_beta_ratio: 0.7  # Larger margin

Gamma-Beta Ratio (γ/β)

Range: 0.0 to 1.0

By scenario:

Ratio Margin Use Case
0.0-0.3 Small Weak preference data
0.4-0.6 Standard General use
0.7-1.0 Large Very clear preferences

Default: 0.5

Why Gamma Matters

Low gamma (< 0.3):

  • Small target margin
  • Less aggressive alignment
  • More conservative

High gamma (> 0.7):

  • Large target margin
  • Stronger alignment
  • More aggressive

Optimal (0.4-0.6):

  • Balanced margin
  • Stable training
  • Good alignment

Mathematical Meaning

In loss function:

logits = pi_logratios - gamma_beta_ratio
loss = -log(sigmoid(beta * logits))

Interpretation:

  • gamma_beta_ratio shifts the decision boundary
  • Higher ratio = requires larger log prob difference
  • Controls how "clear" preferences must be

Config Examples

Noisy preferences:

gamma_beta_ratio: 0.3  # Smaller margin, more tolerant

Standard:

gamma_beta_ratio: 0.5  # Default

High-quality preferences:

gamma_beta_ratio: 0.8  # Larger margin, stricter

SFT Weight

Range: 0.0 to 1.0

By model type:

Model Type SFT Weight Reason
Base model 0.0 No prior capabilities
Instruct model 0.05-0.1 Preserve instruction following
Chat model 0.1-0.2 Preserve conversational skills

Default: 0.0 (no SFT regularization)

Why SFT Weight Matters

Zero SFT (0.0):

  • Pure preference optimization
  • May forget capabilities
  • Standard for base models

Low SFT (0.05-0.1):

  • Balanced approach
  • Recommended for instruct models
  • Slight capability preservation

High SFT (> 0.2):

  • Strong capability preservation
  • Weaker preference alignment
  • May reduce alignment gains

Trade-off

Total Loss = SimPO Loss + (sft_weight * SFT Loss)

Example:

sft_weight: 0.1
# 90% preference optimization + 10% capability preservation

Config Examples

Base model (no SFT):

model_name_or_path: mistralai/Mistral-7B-v0.1
sft_weight: 0.0

Instruct model (light SFT):

model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
sft_weight: 0.1

Chat model (moderate SFT):

model_name_or_path: HuggingFaceH4/zephyr-7b-beta
sft_weight: 0.2

Model-Size-Specific Recommendations

7B Models (Mistral, Llama 3)

Standard config:

learning_rate: 5e-7
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.0  # 0.1 if instruct model
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4

8B-13B Models

Standard config:

learning_rate: 3e-7
beta: 2.5
gamma_beta_ratio: 0.5
sft_weight: 0.1  # If instruct
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8

70B Models

Standard config:

learning_rate: 1e-7
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.05
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16

Batch Size & Gradient Accumulation

Effective Batch Size

Effective Batch Size = per_device_batch_size * num_gpus * grad_accum_steps

Recommended effective batch sizes:

  • 7B: 128-256
  • 13B: 64-128
  • 70B: 32-64

Config Examples

Single GPU (A100 40GB):

per_device_train_batch_size: 1
gradient_accumulation_steps: 128  # Effective batch = 128

4 GPUs (A100 40GB):

per_device_train_batch_size: 2
gradient_accumulation_steps: 16  # Effective batch = 2*4*16 = 128

8 GPUs (A100 80GB):

per_device_train_batch_size: 2
gradient_accumulation_steps: 8  # Effective batch = 2*8*8 = 128

Loss Type

Sigmoid vs Hinge

Sigmoid (default, recommended):

loss_type: sigmoid
label_smoothing: 0.0

Hinge (experimental):

loss_type: hinge
# No label smoothing for hinge

When to use hinge:

  • Margin-based tasks
  • SVM-style optimization
  • Experimental purposes

Generally: Stick with sigmoid

Tuning Guide

Step 1: Start with Defaults

learning_rate: 5e-7  # For 7B
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.0  # 0.1 if instruct
loss_type: sigmoid

Step 2: Monitor Training

Check every 100 steps:

  • Loss curve (should decrease smoothly)
  • Reward margin (should increase)
  • Chosen/rejected logps (should separate)

Step 3: Adjust if Needed

If loss diverges:

learning_rate: 3e-7  # Reduce from 5e-7
beta: 1.0           # Reduce from 2.0

If loss plateaus early:

learning_rate: 1e-6  # Increase from 5e-7
beta: 5.0           # Increase from 2.0

If model forgets:

sft_weight: 0.2  # Increase from 0.0

Complete Example Configs

Mistral 7B Base (Standard)

model_name_or_path: mistralai/Mistral-7B-v0.1
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0

learning_rate: 5e-7
beta: 2.0
gamma_beta_ratio: 0.5
loss_type: sigmoid
sft_weight: 0.0

num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
warmup_ratio: 0.1
lr_scheduler_type: cosine

bf16: true
gradient_checkpointing: true

Llama 3 8B Instruct (Reasoning)

model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
dataset_mixer:
  argilla/distilabel-math-preference-dpo: 1.0

learning_rate: 3e-7
beta: 5.0
gamma_beta_ratio: 0.7
loss_type: sigmoid
sft_weight: 0.1

num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
warmup_ratio: 0.1
lr_scheduler_type: cosine

References