hermes-agent/optional-skills/mlops/simpo/references/loss-functions.md

Loss Functions

Complete guide to SimPO loss functions and mathematical formulations.

Overview

SimPO supports two loss types:

  • Sigmoid (default) - Smooth, differentiable loss
  • Hinge - Margin-based, sparse loss

Both are reference-free (no reference model needed).

SimPO Loss Formula

Core Calculation

Step 1: Log probability ratio:

pi_logratios = log P_θ(y_chosen|x) - log P_θ(y_rejected|x)

Step 2: Apply target margin:

logits = pi_logratios - γ/β

Where:

  • γ/β = gamma_beta_ratio (target margin)

Step 3: Compute loss (depends on loss type)
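
The three steps above can be sketched numerically in plain Python (the log-probability values are illustrative, not from a real model):

```python
# Hypothetical average log probabilities for one preference pair
chosen_logp = -1.2    # log P_theta(y_chosen | x)
rejected_logp = -2.5  # log P_theta(y_rejected | x)
gamma_beta_ratio = 0.5  # target margin gamma / beta

# Step 1: log probability ratio
pi_logratios = chosen_logp - rejected_logp

# Step 2: apply target margin
logits = pi_logratios - gamma_beta_ratio
```

Step 3 then feeds `logits` into either the sigmoid or hinge loss below.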

Sigmoid Loss (Default)

Formula:

L = -log σ(β * logits) * (1 - ε) - log σ(-β * logits) * ε

Where:

  • β = beta (reward scaling)
  • σ = sigmoid function
  • ε = label_smoothing (default 0.0)

Implementation:

losses = (
    -F.logsigmoid(self.beta * logits) * (1 - self.label_smoothing)
    - F.logsigmoid(-self.beta * logits) * self.label_smoothing
)

Characteristics:

  • Smooth, continuous gradients
  • Probabilistic interpretation
  • Standard choice for most tasks
  • Works well with higher beta values

Hinge Loss

Formula:

L = max(0, 1 - β * logits)

Implementation:

losses = torch.relu(1 - self.beta * logits)

Characteristics:

  • Non-smooth (has kink at logits = 1/β)
  • Margin-based (SVM-style)
  • Can lead to sparser solutions
  • Less commonly used
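
The two loss types can be put side by side in a few lines of dependency-free Python (a sketch with an illustrative function name, not the trainer's actual code; label smoothing omitted for clarity):

```python
import math

def simpo_loss(logits: float, beta: float, loss_type: str = "sigmoid") -> float:
    """Per-example SimPO loss for a margin-adjusted log ratio `logits`."""
    if loss_type == "sigmoid":
        # -log sigmoid(beta * logits), written via log1p for numerical stability
        return math.log1p(math.exp(-beta * logits))
    elif loss_type == "hinge":
        # max(0, 1 - beta * logits)
        return max(0.0, 1.0 - beta * logits)
    raise ValueError(f"unknown loss_type: {loss_type}")
```

With beta = 2.0 and logits = 0.8, the sigmoid loss is about 0.184, while the hinge loss is already zero because the margin beta * logits = 1.6 exceeds 1.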

Comparison to DPO

DPO Loss (Reference Model Required)

Formula:

L_DPO = -E[log σ(β * log(π_θ(y_w|x)/π_ref(y_w|x)) - β * log(π_θ(y_l|x)/π_ref(y_l|x)))]

Key features:

  • Requires reference model π_ref
  • Normalizes by reference log probabilities
  • More conservative (stays close to reference)

SimPO Loss (Reference-Free)

Formula:

L_SimPO = -log σ(β * (log π_θ(y_w|x) - log π_θ(y_l|x) - γ/β))

Key features:

  • No reference model needed
  • Uses length-normalized (average per-token) log probabilities
  • Target margin γ/β controls preference strength
  • More efficient (fewer model forward passes)

Visual comparison:

DPO:    [Policy] - [Reference] → Loss
SimPO:  [Policy]               → Loss
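
The difference in inputs can be made concrete: DPO's logits need four log probabilities (policy and reference, for both responses), SimPO's need only two plus a fixed margin. A sketch with illustrative names:

```python
def dpo_logits(pol_w: float, pol_l: float, ref_w: float, ref_l: float) -> float:
    # Requires reference-model log probs: two extra forward passes per pair
    return (pol_w - ref_w) - (pol_l - ref_l)

def simpo_logits(pol_w: float, pol_l: float, gamma_beta_ratio: float) -> float:
    # Reference-free: only policy log probs, plus a constant target margin
    return (pol_w - pol_l) - gamma_beta_ratio
```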

Average Log Probability Reward

Calculation

Per-token log probabilities:

# Get log probs for each token (gather needs a trailing index dimension)
per_token_logps = torch.gather(
    logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)
).squeeze(2)

# Create mask to ignore padding
loss_mask = labels != label_pad_token_id

Average log probability (if average_log_prob=True):

avg_logp = (per_token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)

Sum log probability (if average_log_prob=False):

sum_logp = (per_token_logps * loss_mask).sum(-1)

Why average?

  • Normalizes for sequence length
  • Prevents bias toward shorter/longer responses
  • Standard practice in SimPO
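
For a single sequence, the masked average can be sketched without any tensor library (the pad id follows the common -100 ignore-index convention; values are illustrative):

```python
LABEL_PAD = -100  # conventional ignore index for padding positions

def avg_log_prob(per_token_logps: list[float], labels: list[int]) -> float:
    """Average log probability over non-padding positions of one sequence."""
    kept = [lp for lp, lab in zip(per_token_logps, labels) if lab != LABEL_PAD]
    return sum(kept) / len(kept)
```

Dividing by the number of real tokens, not the padded length, is what removes the length bias described above.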

Reward Metrics

Chosen reward:

chosen_rewards = beta * policy_chosen_logps.detach()

Rejected reward:

rejected_rewards = beta * policy_rejected_logps.detach()

Reward margin:

reward_margin = chosen_rewards.mean() - rejected_rewards.mean()

Label Smoothing

Formula with Smoothing

Sigmoid loss:

L = -log σ(β * logits) * (1 - ε) - log σ(-β * logits) * ε

Effect:

  • ε = 0.0: No smoothing (default)
  • ε = 0.1: 10% smoothing (soft labels)
  • ε = 0.5: Maximum smoothing

When to use:

  • Noisy preference labels
  • Uncertain preferences
  • Prevent overconfidence

Config:

label_smoothing: 0.1  # 10% smoothing
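
The effect is easy to see numerically. A minimal sketch of the smoothed sigmoid loss (function name is illustrative):

```python
import math

def smoothed_sigmoid_loss(logits: float, beta: float, eps: float) -> float:
    pos = math.log1p(math.exp(-beta * logits))  # -log sigmoid(beta * logits)
    neg = math.log1p(math.exp(beta * logits))   # -log sigmoid(-beta * logits)
    return (1 - eps) * pos + eps * neg
```

For a correctly ordered pair (positive logits), raising eps raises the loss floor: the model is never pushed to be fully confident in the label.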

SFT Regularization

Combined Loss

With SFT component:

L_total = L_SimPO + λ * L_SFT

Where:

  • L_SFT = cross-entropy loss on chosen responses
  • λ = sft_weight (0.0 to 1.0)

Implementation:

if self.sft_weight > 0:
    sft_loss = -policy_chosen_logps
    total_loss = simpo_loss + self.sft_weight * sft_loss

When to use:

  • Preserve model capabilities
  • Prevent catastrophic forgetting
  • Fine-tuning instruct models

Trade-off:

  • Higher sft_weight: Preserve capabilities, less alignment
  • Lower sft_weight: Stronger alignment, may forget capabilities

Config:

sft_weight: 0.1  # 10% SFT regularization
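
The combination is a single weighted sum; a sketch with illustrative names (the SFT term is the negative average log probability of the chosen response, as in the implementation above):

```python
def combined_loss(simpo_loss: float, chosen_logp: float, sft_weight: float) -> float:
    # SFT term: negative (average) log likelihood of the chosen response
    sft_loss = -chosen_logp
    return simpo_loss + sft_weight * sft_loss
```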

Loss Type Selection

Sigmoid vs Hinge

| Aspect           | Sigmoid          | Hinge                    |
|------------------|------------------|--------------------------|
| Smoothness       | Smooth           | Non-smooth               |
| Gradients        | Continuous       | Discontinuous at margin  |
| Sparsity         | Dense solutions  | Sparse solutions         |
| Interpretability | Probabilistic    | Geometric margin         |
| Use case         | General purpose  | Margin-based tasks       |
| Recommendation   | Default choice   | Experimental             |

Config:

# Sigmoid (default)
loss_type: sigmoid

# Hinge (alternative)
loss_type: hinge

Mathematical Properties

Gradient Analysis

Sigmoid loss gradient:

∂L/∂logits = -β * σ(-β * logits) * (1 - ε) + β * σ(β * logits) * ε

Hinge loss gradient:

∂L/∂logits = -β   if logits < 1/β
             0     otherwise

Implications:

  • Sigmoid: Always provides gradient signal
  • Hinge: No gradient when margin satisfied
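
The sigmoid gradient formula can be checked against a finite-difference estimate (ε = 0 case; a quick sketch, not library code):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_loss(z: float, beta: float) -> float:
    # -log sigmoid(beta * z)
    return math.log1p(math.exp(-beta * z))

beta, z, h = 2.0, 0.8, 1e-6
numeric = (sigmoid_loss(z + h, beta) - sigmoid_loss(z - h, beta)) / (2 * h)
analytic = -beta * sigmoid(-beta * z)  # gradient formula above with eps = 0
```

The two values agree to several decimal places, and `analytic` is nonzero for any finite logits, illustrating that the sigmoid loss always provides a gradient signal.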

Convergence Behavior

Sigmoid:

  • Asymptotically approaches zero loss
  • Continues optimizing even with large margins
  • Smoother training curves

Hinge:

  • Reaches zero loss at margin
  • Stops optimizing once margin satisfied
  • May have training plateaus

Complete Loss Examples

Example 1: Basic SimPO (Sigmoid)

Config:

beta: 2.0
gamma_beta_ratio: 0.5
loss_type: sigmoid
label_smoothing: 0.0
sft_weight: 0.0

Loss calculation:

# Step 1: Compute log probs
chosen_logps = avg_log_prob(policy(chosen))    # e.g., -1.2
rejected_logps = avg_log_prob(policy(rejected)) # e.g., -2.5

# Step 2: Log ratio and margin
pi_logratios = -1.2 - (-2.5) = 1.3
logits = 1.3 - 0.5 = 0.8

# Step 3: Sigmoid loss
loss = -log(sigmoid(2.0 * 0.8))
     = -log(sigmoid(1.6))
     = -log(0.832)
     = 0.184
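
The hand calculation above can be reproduced in two lines:

```python
import math

beta, logits = 2.0, 0.8
# -log sigmoid(beta * logits) = -log sigmoid(1.6)
loss = -math.log(1.0 / (1.0 + math.exp(-beta * logits)))
```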

Example 2: SimPO with SFT

Config:

beta: 2.5
gamma_beta_ratio: 0.5
loss_type: sigmoid
sft_weight: 0.1

Loss calculation:

# SimPO loss (β = 2.5, logits = 0.8 as in Example 1)
simpo_loss = -log(sigmoid(2.5 * 0.8)) = -log(sigmoid(2.0)) = 0.127

# SFT loss
sft_loss = -chosen_logps = -(-1.2) = 1.2

# Total loss
total_loss = simpo_loss + 0.1 * sft_loss
           = 0.127 + 0.12
           = 0.247
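
As a sanity check, the same arithmetic can be run end to end (β = 2.5 from this example's config, log probabilities from Example 1):

```python
import math

beta, logits = 2.5, 0.8            # this example's beta, Example 1's logits
sft_weight, chosen_logp = 0.1, -1.2

simpo_loss = math.log1p(math.exp(-beta * logits))  # -log sigmoid(beta * logits)
total_loss = simpo_loss + sft_weight * (-chosen_logp)
```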

Debugging

Check Reward Margins

Low margin (< 0.5):

  • Preferences not being learned
  • Increase beta or gamma_beta_ratio

High margin (> 5.0):

  • May be overfitting
  • Reduce beta or learning rate

Monitor:

reward_margin = chosen_rewards.mean() - rejected_rewards.mean()
print(f"Reward margin: {reward_margin:.2f}")

Check Log Probabilities

Typical values:

  • Chosen: -1.0 to -2.0 (higher is better)
  • Rejected: -2.0 to -4.0 (lower is worse)

Warning signs:

  • Both very negative (< -10): Model not learning
  • Both very positive (> 0): Numerical instability
