# DPO Variants

A complete guide to Direct Preference Optimization (DPO) loss variants in TRL.

## Overview

DPO optimizes models using preference data (chosen/rejected pairs). TRL supports 10+ loss variants for different scenarios.

## Loss Types
### 1. Sigmoid (Standard DPO)

**Formula:** `-log(sigmoid(β * logits))`

**When to use:** default choice for general preference alignment.

**Config:**

```python
from trl import DPOConfig

DPOConfig(
    loss_type="sigmoid",
    beta=0.1,  # KL penalty
    per_device_train_batch_size=64,
    learning_rate=1e-6,
)
```
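To make the formula concrete, here is a minimal pure-Python sketch of the per-pair sigmoid loss (an illustration, not TRL's implementation); `logits` here denotes the chosen-minus-rejected difference of policy-vs-reference log-ratios:

```python
import math

def dpo_sigmoid_loss(logits: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    `logits` is the (policy - reference) log-ratio margin between
    the chosen and rejected completions.
    """
    return -math.log(1.0 / (1.0 + math.exp(-beta * logits)))

# At zero margin the loss is log(2) ~= 0.693; it shrinks toward zero as
# the chosen completion becomes increasingly preferred, and grows without
# bound as the model prefers the rejected one.
print(dpo_sigmoid_loss(0.0))   # log(2) ~= 0.693
```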
### 2. IPO (Identity Policy Optimization)

**Formula:** `(logits - 1/(2β))²`

**When to use:** stronger theoretical grounding; helps reduce overfitting to preference data.

**Config:**

```python
DPOConfig(
    loss_type="ipo",
    beta=0.1,
    per_device_train_batch_size=90,
    learning_rate=1e-2,
)
```
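From the formula, IPO is a squared loss whose minimum sits at a fixed target margin of 1/(2β); a small sketch (illustrative, not TRL's code):

```python
def ipo_loss(logits: float, beta: float = 0.1) -> float:
    """IPO loss: squared distance of the margin from the target 1/(2*beta)."""
    return (logits - 1.0 / (2.0 * beta)) ** 2

# With beta=0.1 the target margin is 5.0: the loss is zero exactly there
# and grows on both sides, penalizing overshoot as well as undershoot --
# this bounded target is what curbs the overfitting seen with the
# unbounded sigmoid loss.
```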
### 3. Hinge (SLiC)

**Formula:** `ReLU(1 - β * logits)`

**When to use:** margin-based objective.

**Config:**

```python
DPOConfig(
    loss_type="hinge",
    beta=0.1,
    per_device_train_batch_size=512,
    learning_rate=1e-4,
)
```
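The hinge variant goes flat once the scaled margin clears 1; a sketch under the same conventions as the previous examples (illustrative only):

```python
def hinge_loss(logits: float, beta: float = 0.1) -> float:
    """SLiC-style hinge: linear penalty until beta * logits reaches 1, then zero."""
    return max(0.0, 1.0 - beta * logits)

# Pairs already separated by the margin (beta * logits >= 1) contribute
# nothing, so training focuses on pairs the model still ranks incorrectly
# or too closely.
```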
### 4. Robust DPO

**Formula:** sigmoid loss with label smoothing for noise robustness

**When to use:** noisy preference labels.

**Config:**

```python
DPOConfig(
    loss_type="robust",
    beta=0.01,
    label_smoothing=0.1,  # Noise probability
    per_device_train_batch_size=16,
    learning_rate=1e-3,
    max_prompt_length=128,
    max_length=512,
)
```
### 5. BCO Pair (Binary Classification)

**Formula:** trains a binary classifier (chosen=1, rejected=0)

**When to use:** pairwise preference data.

**Config:**

```python
DPOConfig(
    loss_type="bco_pair",
    beta=0.01,
    per_device_train_batch_size=128,
    learning_rate=5e-7,
    max_prompt_length=1536,
    max_completion_length=512,
)
```
### 6. SPPO Hard

**Formula:** pushes chosen rewards toward 0.5 and rejected toward -0.5

**When to use:** Nash-equilibrium-style self-play; sparse data.

**Config:**

```python
DPOConfig(
    loss_type="sppo_hard",
    beta=0.1,
)
```
### 7. DiscoPOP

**Formula:** Log-Ratio Modulated Loss

**When to use:** a loss discovered via automated loss-function search (DiscoPOP).

**Config:**

```python
DPOConfig(
    loss_type="discopop",
    beta=0.05,
    discopop_tau=0.05,
    per_device_train_batch_size=64,
    learning_rate=5e-7,
)
```
### 8. APO Zero

**Formula:** increases chosen likelihood, decreases rejected likelihood

**When to use:** when the model is worse than the winning (chosen) outputs.

**Config:**

```python
DPOConfig(
    loss_type="apo_zero",
    beta=0.1,
    per_device_train_batch_size=64,
    learning_rate=2e-7,
    max_prompt_length=512,
    max_completion_length=512,
)
```
### 9. APO Down

**Formula:** decreases both likelihoods, emphasizing the rejected reduction

**When to use:** when the model is already better than the winning outputs.

**Config:**

```python
DPOConfig(
    loss_type="apo_down",
    beta=0.1,
    # Same hyperparameters as apo_zero
)
```
### 10. AOT & AOT Pair

**Formula:** distributional alignment via stochastic dominance

**When to use:**

- `aot_pair`: paired preference data
- `aot`: unpaired data

**Config:**

```python
DPOConfig(
    loss_type="aot_pair",  # or "aot"
    beta=0.1,
    label_smoothing=0.0,
)
```
## Multi-Loss Training

Combine multiple losses with per-loss weights:

```python
DPOConfig(
    loss_type=["sigmoid", "ipo"],
    loss_weights=[0.7, 0.3],  # Weighted combination
    beta=0.1,
)
```
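Conceptually, the combined objective is just a weighted sum of the individual per-pair losses. A hypothetical sketch using the sigmoid and IPO formulas above (TRL's internals may differ):

```python
import math

def combined_loss(logits: float, beta: float = 0.1,
                  weights: tuple = (0.7, 0.3)) -> float:
    """Weighted sum of sigmoid-DPO and IPO losses for one preference pair."""
    sigmoid_part = -math.log(1.0 / (1.0 + math.exp(-beta * logits)))
    ipo_part = (logits - 1.0 / (2.0 * beta)) ** 2
    return weights[0] * sigmoid_part + weights[1] * ipo_part
```

Setting one weight to zero recovers the corresponding single-loss behavior, which is a useful sanity check when tuning the mixture.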
## Key Parameters

### Beta (β)

Controls deviation from the reference model:

- Higher (0.5): more conservative, stays close to the reference
- Lower (0.01): more aggressive alignment
- Default: 0.1
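A quick numeric check of this trade-off using the standard sigmoid loss (a sketch, not TRL code): for the same preference margin, a larger β scales the implicit reward and changes how strongly agreement is credited:

```python
import math

def sigmoid_dpo_loss(margin: float, beta: float) -> float:
    # Standard DPO loss: -log(sigmoid(beta * margin))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Same positive margin of 2.0 under three betas: with a tiny beta the loss
# barely moves off log(2) ~= 0.693, while a large beta credits the margin
# strongly and drives the loss down.
for beta in (0.01, 0.1, 0.5):
    print(beta, round(sigmoid_dpo_loss(2.0, beta), 4))
```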
### Label Smoothing

For robust DPO:

- 0.0: no smoothing (default)
- 0.1-0.3: moderate noise robustness
- 0.5: maximum noise tolerance
### Max Lengths

- `max_prompt_length`: 128-1536
- `max_completion_length`: 128-512
- `max_length`: total sequence length (1024-2048)
## Comparison Table
| Loss | Speed | Stability | Best For |
|---|---|---|---|
| Sigmoid | Fast | Good | General use |
| IPO | Fast | Better | Overfitting issues |
| Hinge | Fast | Good | Margin objectives |
| Robust | Fast | Best | Noisy data |
| BCO | Medium | Good | Binary classification |
| DiscoPOP | Fast | Good | New architectures |
| APO | Fast | Good | Model quality matching |
## References
- DPO paper: https://arxiv.org/abs/2305.18290
- IPO paper: https://arxiv.org/abs/2310.12036
- TRL docs: https://huggingface.co/docs/trl/dpo_trainer