hermes-agent/optional-skills/mlops/simpo/references/hyperparameters.md
Teknium 5ceed021dc
feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap (#3934)
* feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap

Map active skills to Telegram's slash command menu so users can
discover and invoke skills directly. Three changes:

1. Telegram menu now includes active skill commands alongside built-in
   commands, capped at 100 entries (Telegram Bot API limit). Overflow
   commands remain callable but hidden from the picker. Logged at
   startup when cap is hit.

2. New /commands [page] gateway command for paginated browsing of all
   commands + skills. /help now shows first 10 skill commands and
   points to /commands for the full list.

3. When a user types a slash command that matches a disabled or
   uninstalled skill, they get actionable guidance:
   - Disabled: 'Enable it with: hermes skills config'
   - Optional (not installed): 'Install with: hermes skills install official/<path>'

Built on ideas from PR #3921 by @kshitijk4poor.

* chore: move 21 niche skills to optional-skills

Move specialized/niche skills from built-in (skills/) to optional
(optional-skills/) to reduce the default skill count. Users can
install them with: hermes skills install official/<category>/<name>

Moved skills (21):
- mlops: accelerate, chroma, faiss, flash-attention,
  hermes-atropos-environments, huggingface-tokenizers, instructor,
  lambda-labs, llava, nemo-curator, pinecone, pytorch-lightning,
  qdrant, saelens, simpo, slime, tensorrt-llm, torchtitan
- research: domain-intel, duckduckgo-search
- devops: inference-sh cli

Built-in skills: 96 → 75
Optional skills: 22 → 43

* fix: only include repo built-in skills in Telegram menu, not user-installed

User-installed skills (from hub or manually added) stay accessible via
/skills and by typing the command directly, but don't get registered
in the Telegram slash command picker. Only skills whose SKILL.md is
under the repo's skills/ directory are included in the menu.

This keeps the Telegram menu focused on the curated built-in set while
user-installed skills remain discoverable through /skills and /commands.
2026-03-30 10:57:30 -07:00

452 lines
8.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Hyperparameters
Complete guide to SimPO hyperparameter selection and tuning.
## Overview
Key hyperparameters in SimPO:
1. **Learning Rate** - Most critical
2. **Beta (β)** - Reward scaling
3. **Gamma-Beta Ratio (γ/β)** - Target margin
4. **SFT Weight** - Regularization strength
## Learning Rate
### Recommended Ranges
**By model size**:
| Model Size | Learning Rate | Notes |
|------------|---------------|-------|
| 1B-3B | 5e-7 to 1e-6 | Higher end safe |
| 7B-8B | 3e-7 to 5e-7 | **Standard** |
| 13B-30B | 1e-7 to 3e-7 | Lower for stability |
| 70B+ | 5e-8 to 1e-7 | Very conservative |
**By task type**:
| Task | Learning Rate | Reason |
|------|---------------|--------|
| General chat | 5e-7 | Standard |
| Code generation | 3e-7 | **Precise reasoning** |
| Math reasoning | 3e-7 | **Careful optimization** |
| Creative writing | 1e-6 | More aggressive OK |
### Why Learning Rate Matters
**Too high** (> 1e-6 for 7B):
- Loss divergence
- Catastrophic forgetting
- Unstable training
**Too low** (< 1e-7 for 7B):
- Very slow convergence
- May not finish in time
- Undertraining
**Optimal** (3e-7 to 5e-7 for 7B):
- Stable convergence
- Good final performance
- Efficient training
### Config Examples
**Mistral 7B (general)**:
```yaml
learning_rate: 5e-7
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: cosine
```
**Llama 3 8B (reasoning)**:
```yaml
learning_rate: 3e-7
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: cosine
```
**Gemma 2 9B (creative)**:
```yaml
learning_rate: 1e-6
num_train_epochs: 1
warmup_ratio: 0.1
lr_scheduler_type: linear
```
## Beta (β)
### Recommended Values
**Range**: 2.0 to 10.0 (much higher than DPO's 0.01-0.1)
**By preference strength**:
| Beta | Preference Strength | Use Case |
|------|-------------------|----------|
| 1.0-2.0 | Weak | Subtle preferences |
| 2.0-5.0 | **Standard** | General alignment |
| 5.0-10.0 | Strong | Clear preferences |
**Default**: 2.0 to 2.5
### Why Beta Matters
**Low beta** (< 2.0):
- Weak reward signal
- Slow preference learning
- May underfit
**High beta** (> 10.0):
- Very strong reward signal
- Risk of overfitting
- May ignore weak preferences
**Optimal** (2.0-5.0):
- Balanced reward scaling
- Stable training
- Good generalization
### Interaction with Gamma
**Beta and gamma together**:
```
Target margin in reward space = gamma
Target margin in logit space = gamma / beta
```
**Example**:
```yaml
beta: 2.0
gamma_beta_ratio: 0.5
# Effective gamma = 2.0 * 0.5 = 1.0
```
### Config Examples
**Weak preferences**:
```yaml
beta: 2.0
gamma_beta_ratio: 0.3 # Small margin
```
**Standard**:
```yaml
beta: 2.5
gamma_beta_ratio: 0.5 # Default
```
**Strong preferences**:
```yaml
beta: 5.0
gamma_beta_ratio: 0.7 # Larger margin
```
## Gamma-Beta Ratio (γ/β)
### Recommended Values
**Range**: 0.0 to 1.0
**By scenario**:
| Ratio | Margin | Use Case |
|-------|--------|----------|
| 0.0-0.3 | Small | Weak preference data |
| 0.4-0.6 | **Standard** | General use |
| 0.7-1.0 | Large | Very clear preferences |
**Default**: 0.5
### Why Gamma Matters
**Low gamma** (< 0.3):
- Small target margin
- Less aggressive alignment
- More conservative
**High gamma** (> 0.7):
- Large target margin
- Stronger alignment
- More aggressive
**Optimal** (0.4-0.6):
- Balanced margin
- Stable training
- Good alignment
### Mathematical Meaning
**In loss function**:
```python
logits = pi_logratios - gamma_beta_ratio
loss = -log(sigmoid(beta * logits))
```
**Interpretation**:
- gamma_beta_ratio shifts the decision boundary
- Higher ratio = requires larger log prob difference
- Controls how "clear" preferences must be
### Config Examples
**Noisy preferences**:
```yaml
gamma_beta_ratio: 0.3 # Smaller margin, more tolerant
```
**Standard**:
```yaml
gamma_beta_ratio: 0.5 # Default
```
**High-quality preferences**:
```yaml
gamma_beta_ratio: 0.8 # Larger margin, stricter
```
## SFT Weight
### Recommended Values
**Range**: 0.0 to 1.0
**By model type**:
| Model Type | SFT Weight | Reason |
|------------|-----------|--------|
| Base model | 0.0 | No prior capabilities |
| **Instruct model** | 0.05-0.1 | Preserve instruction following |
| Chat model | 0.1-0.2 | Preserve conversational skills |
**Default**: 0.0 (no SFT regularization)
### Why SFT Weight Matters
**Zero SFT** (0.0):
- Pure preference optimization
- May forget capabilities
- Standard for base models
**Low SFT** (0.05-0.1):
- Balanced approach
- **Recommended for instruct models**
- Slight capability preservation
**High SFT** (> 0.2):
- Strong capability preservation
- Weaker preference alignment
- May reduce alignment gains
### Trade-off
```
Total Loss = SimPO Loss + (sft_weight * SFT Loss)
```
**Example**:
```yaml
sft_weight: 0.1
# 90% preference optimization + 10% capability preservation
```
### Config Examples
**Base model (no SFT)**:
```yaml
model_name_or_path: mistralai/Mistral-7B-v0.1
sft_weight: 0.0
```
**Instruct model (light SFT)**:
```yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
sft_weight: 0.1
```
**Chat model (moderate SFT)**:
```yaml
model_name_or_path: HuggingFaceH4/zephyr-7b-beta
sft_weight: 0.2
```
## Model-Size-Specific Recommendations
### 7B Models (Mistral, Llama 3)
**Standard config**:
```yaml
learning_rate: 5e-7
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.0 # 0.1 if instruct model
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
```
### 8B-13B Models
**Standard config**:
```yaml
learning_rate: 3e-7
beta: 2.5
gamma_beta_ratio: 0.5
sft_weight: 0.1 # If instruct
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
```
### 70B Models
**Standard config**:
```yaml
learning_rate: 1e-7
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.05
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
```
## Batch Size & Gradient Accumulation
### Effective Batch Size
```
Effective Batch Size = per_device_batch_size * num_gpus * grad_accum_steps
```
**Recommended effective batch sizes**:
- 7B: 128-256
- 13B: 64-128
- 70B: 32-64
### Config Examples
**Single GPU (A100 40GB)**:
```yaml
per_device_train_batch_size: 1
gradient_accumulation_steps: 128 # Effective batch = 128
```
**4 GPUs (A100 40GB)**:
```yaml
per_device_train_batch_size: 2
gradient_accumulation_steps: 16 # Effective batch = 2*4*16 = 128
```
**8 GPUs (A100 80GB)**:
```yaml
per_device_train_batch_size: 2
gradient_accumulation_steps: 8 # Effective batch = 2*8*8 = 128
```
## Loss Type
### Sigmoid vs Hinge
**Sigmoid** (default, recommended):
```yaml
loss_type: sigmoid
label_smoothing: 0.0
```
**Hinge** (experimental):
```yaml
loss_type: hinge
# No label smoothing for hinge
```
**When to use hinge**:
- Margin-based tasks
- SVM-style optimization
- Experimental purposes
**Generally**: Stick with sigmoid
## Tuning Guide
### Step 1: Start with Defaults
```yaml
learning_rate: 5e-7 # For 7B
beta: 2.0
gamma_beta_ratio: 0.5
sft_weight: 0.0 # 0.1 if instruct
loss_type: sigmoid
```
### Step 2: Monitor Training
**Check every 100 steps**:
- Loss curve (should decrease smoothly)
- Reward margin (should increase)
- Chosen/rejected logps (should separate)
### Step 3: Adjust if Needed
**If loss diverges**:
```yaml
learning_rate: 3e-7 # Reduce from 5e-7
beta: 1.0 # Reduce from 2.0
```
**If loss plateaus early**:
```yaml
learning_rate: 1e-6 # Increase from 5e-7
beta: 5.0 # Increase from 2.0
```
**If model forgets**:
```yaml
sft_weight: 0.2 # Increase from 0.0
```
## Complete Example Configs
### Mistral 7B Base (Standard)
```yaml
model_name_or_path: mistralai/Mistral-7B-v0.1
dataset_mixer:
HuggingFaceH4/ultrafeedback_binarized: 1.0
learning_rate: 5e-7
beta: 2.0
gamma_beta_ratio: 0.5
loss_type: sigmoid
sft_weight: 0.0
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
warmup_ratio: 0.1
lr_scheduler_type: cosine
bf16: true
gradient_checkpointing: true
```
### Llama 3 8B Instruct (Reasoning)
```yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
dataset_mixer:
argilla/distilabel-math-preference-dpo: 1.0
learning_rate: 3e-7
beta: 5.0
gamma_beta_ratio: 0.7
loss_type: sigmoid
sft_weight: 0.1
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
warmup_ratio: 0.1
lr_scheduler_type: cosine
```
## References
- SimPO paper: https://arxiv.org/abs/2405.14734
- Alignment Handbook: https://github.com/huggingface/alignment-handbook