# Online RL Methods

Guide to online reinforcement learning with PPO, GRPO, RLOO, and OnlineDPO.

## Overview

Online RL samples completions from the current policy during training, scores them with a reward model or reward function, and updates the policy to increase the reward.
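
The generate, score, update cycle can be illustrated with a toy example. This is a minimal sketch and not how TRL implements any of its trainers: the "policy" is a softmax over three fixed strings, the reward is a made-up distinct-word count, and the update is a plain REINFORCE step with the batch mean as baseline. All names here (`COMPLETIONS`, `reward`, `train`) are hypothetical.

```python
import math
import random

# Hypothetical toy policy: a softmax over three fixed candidate completions.
COMPLETIONS = ["ok", "a short answer", "a longer and more detailed answer"]


def reward(completion):
    """Toy reward (assumption): number of distinct words in the completion."""
    return float(len(set(completion.split())))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=300, batch_size=8, lr=0.05, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(COMPLETIONS)
    for _ in range(steps):
        probs = softmax(logits)
        # 1. Generate: sample completions from the current policy.
        picks = rng.choices(range(len(COMPLETIONS)), weights=probs, k=batch_size)
        # 2. Score: compute a reward for every sampled completion.
        rewards = [reward(COMPLETIONS[i]) for i in picks]
        baseline = sum(rewards) / len(rewards)
        # 3. Update: raise the log-probability of above-average samples.
        for i, r in zip(picks, rewards):
            advantage = r - baseline
            for j in range(len(logits)):
                grad_logp = (1.0 if j == i else 0.0) - probs[j]
                logits[j] += lr * advantage * grad_logp
    return softmax(logits)


final_probs = train()
print(dict(zip(COMPLETIONS, final_probs)))
```

Because the samples come from the policy being trained, probability mass shifts toward the highest-reward completion as training proceeds. That dependence on fresh samples is what distinguishes online methods from training on a fixed preference dataset.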

## PPO (Proximal Policy Optimization)

Classic policy-gradient algorithm for LLM alignment. It optimizes scores from a reward model and trains a value model alongside the policy.

### Basic Usage

```bash
# Run from a checkout of the TRL repository.
# The script location may differ between TRL versions.
python examples/scripts/ppo/ppo.py \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --reward_model_path reward-model \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --dataset_train_split descriptiveness \
    --output_dir model-ppo \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 64 \
    --total_episodes 10000 \
    --num_ppo_epochs 4 \
    --kl_coef 0.05
```

### Key Parameters

- `kl_coef`: Coefficient of the KL penalty against the reference model (0.05-0.2)
- `num_ppo_epochs`: Optimization epochs per sampled batch (2-4)
- `cliprange`: Clipping range for the policy probability ratio (0.1-0.3)
- `vf_coef`: Weight of the value-function loss (0.1)
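
To make the roles of these parameters concrete, here is a simplified scalar sketch of the PPO objective from the PPO paper. It is an illustration under stated assumptions, not TRL's implementation: real trainers work on per-token tensors, and the function names (`ppo_policy_loss`, `kl_penalized_reward`, `total_loss`) are hypothetical.

```python
import math


def ppo_policy_loss(logprob_new, logprob_old, advantage, cliprange=0.2):
    """Clipped surrogate loss for one sample (lower is better)."""
    ratio = math.exp(logprob_new - logprob_old)
    clipped_ratio = max(1.0 - cliprange, min(1.0 + cliprange, ratio))
    # Take the more pessimistic of the unclipped and clipped losses.
    return max(-advantage * ratio, -advantage * clipped_ratio)


def kl_penalized_reward(score, logprob_policy, logprob_ref, kl_coef=0.05):
    """Reward-model score minus a KL penalty estimated from log-probabilities."""
    kl_estimate = logprob_policy - logprob_ref
    return score - kl_coef * kl_estimate


def total_loss(policy_loss, value_loss, vf_coef=0.1):
    """Combine the policy loss with the weighted value-function loss."""
    return policy_loss + vf_coef * value_loss


# A ratio of 1.5 with a positive advantage is clipped to 1 + cliprange = 1.2.
print(ppo_policy_loss(math.log(1.5), 0.0, advantage=1.0, cliprange=0.2))
```

A larger `cliprange` allows bigger policy steps per batch, while a larger `kl_coef` keeps the policy closer to the reference model at the cost of slower reward improvement.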

## GRPO (Group Relative Policy Optimization)

Memory-efficient online RL. Advantages are computed relative to a group of completions sampled for the same prompt, so no separate value model is needed.

### Basic Usage

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


# Reward function: score each completion by its number of unique words.
# It returns one float per completion.
def reward_func(completions, **kwargs):
    return [float(len(set(c.split()))) for c in completions]


config = GRPOConfig(
    output_dir="model-grpo",
    num_generations=4,  # Completions sampled per prompt
    max_completion_length=128,  # Maximum tokens generated per completion
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_func,
    args=config,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
```
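
Before launching a training run, it can help to call the reward function directly on a few strings and check that the scores look sensible. The sketch below repeats the unique-word reward so that it runs on its own, without TRL installed; the sample strings are made up for illustration.

```python
def reward_func(completions, **kwargs):
    """Score each completion by its number of distinct whitespace-separated words."""
    return [float(len(set(c.split()))) for c in completions]


samples = ["the cat sat on the mat", "yes yes yes", ""]
scores = reward_func(samples)
print(scores)  # [5.0, 1.0, 0.0]
```

This assumes plain-text completions, as in a standard-format dataset such as `trl-lib/tldr`. With a conversational dataset the completions would likely arrive as lists of message dictionaries instead, and the function would need to read the message content before splitting.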

### Key Parameters

- `num_generations`: Completions sampled per prompt (2-8)
- `max_completion_length`: Maximum number of tokens generated per completion (64-256)
- `learning_rate`: 1e-5 to 1e-4
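
The role of `num_generations` is easier to see in the group-relative advantage itself. The following is a minimal sketch of the normalization described in the GRPO paper, assuming rewards are laid out group by group and using the population standard deviation; the helper name `group_relative_advantages` and the `eps` value are hypothetical, and TRL's exact scaling may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, num_generations, eps=1e-4):
    """Normalize each reward against the other completions for the same prompt."""
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        sigma = pstdev(group)
        # eps avoids division by zero when all rewards in a group are equal.
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompts, four completions each.
rewards = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0]
advantages = group_relative_advantages(rewards, num_generations=4)
print(advantages)
```

A group whose completions all receive the same reward produces zero advantages and therefore no learning signal. This is one reason to sample more than a couple of completions per prompt and to use a reward that actually varies across them.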

## Memory Comparison

| Method | Memory (7B model) | Speed | Use Case |
|--------|-------------------|-------|----------|
| PPO | 40GB | Medium | Maximum control |
| GRPO | 24GB | Fast | **Memory-constrained** |
| OnlineDPO | 28GB | Fast | No reward model |

## References

- PPO paper: https://arxiv.org/abs/1707.06347
- GRPO paper: https://arxiv.org/abs/2402.03300
- TRL docs: https://huggingface.co/docs/trl/