# Reward Modeling

Guide to training reward models with TRL for RLHF pipelines.

## Overview

Reward models score completions according to human preferences. They are used in:

- PPO training (RL feedback)
- GRPO online RL (see the sketch after the PPO example)
- Completion ranking (see the best-of-n sketch under Inference)

## Basic Training

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import load_dataset

# Load model (num_labels=1 for single reward score)
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Sequence-classification models need a pad token for batched inputs
model.config.pad_token_id = tokenizer.pad_token_id

# Load preference dataset (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Configure
config = RewardConfig(
    output_dir="Qwen2.5-Reward",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5
)

# Train
trainer = RewardTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=dataset
)
trainer.train()
```
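
After training, save the checkpoint so the sections below can load it by path. `save_model` is the standard `Trainer` method and writes to the `output_dir` configured above; saving the tokenizer alongside is an extra safety step, not required by every version:

```python
# Write model weights to output_dir ("Qwen2.5-Reward")
trainer.save_model()

# Save tokenizer files next to the weights so pipelines can load by path
tokenizer.save_pretrained(config.output_dir)
```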

## Dataset Format

Required fields:

```json
{
  "prompt": "Question or instruction",
  "chosen": "Better response",
  "rejected": "Worse response"
}
```
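
For quick experiments, a toy dataset in this format can be built in memory. A minimal sketch using `datasets.Dataset.from_list`; the rows are made up, only the field names matter:

```python
from datasets import Dataset

# Made-up preference pairs in the prompt/chosen/rejected schema above
pairs = [
    {
        "prompt": "What is 2 + 2?",
        "chosen": "2 + 2 equals 4.",
        "rejected": "I'm not sure, maybe 5?",
    },
]

dataset = Dataset.from_list(pairs)
print(dataset[0]["chosen"])
```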

## Bradley-Terry Loss

Default loss function:

```
loss = -log(sigmoid(reward_chosen - reward_rejected))
```

The model learns to assign higher scores to chosen completions than to rejected ones.
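
The same loss is one line of PyTorch; a minimal sketch with made-up reward values, using `logsigmoid` for numerical stability:

```python
import torch
import torch.nn.functional as F

# Made-up rewards for two chosen/rejected pairs
reward_chosen = torch.tensor([1.2, 0.3])
reward_rejected = torch.tensor([0.4, 0.9])

# -log(sigmoid(delta)), averaged over the batch
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())  # shrinks as chosen rewards pull ahead of rejected
```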

## Using Reward Models

### Inference

```python
from transformers import pipeline

# Load trained reward model
reward_pipe = pipeline("text-classification", model="Qwen2.5-Reward")

# Score completions
texts = ["Good answer", "Bad answer"]
scores = reward_pipe(texts)
print(scores)  # Higher score = better
```
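
The same pipeline covers the completion-ranking use case from the Overview. A minimal best-of-n sketch; the candidate strings are illustrative:

```python
# Made-up candidate completions for one prompt
candidates = [
    "Paris is the capital of France.",
    "I think it might be Lyon.",
    "France's capital city is Paris, on the Seine.",
]

# Score each candidate and keep the highest-reward one
scores = reward_pipe(candidates)
best, _ = max(zip(candidates, scores), key=lambda cs: cs[1]["score"])
print(best)
```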

### In PPO

```python
from transformers import AutoModelForSequenceClassification
from trl import PPOTrainer, PPOConfig

config = PPOConfig(
    reward_model_path="Qwen2.5-Reward"  # Use trained reward model
)

# In recent TRL versions the reward model is passed explicitly
reward_model = AutoModelForSequenceClassification.from_pretrained(
    config.reward_model_path, num_labels=1
)

trainer = PPOTrainer(
    args=config,
    model=policy_model,
    reward_model=reward_model,
    # also required: ref_model, value_model, processing_class, train_dataset
)
```
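
The same reward model also plugs into GRPO (listed in the Overview). A minimal sketch, assuming a TRL version whose `GRPOTrainer` accepts a reward-model path via `reward_funcs`; `prompt_dataset` is a placeholder for a prompts-only dataset:

```python
from trl import GRPOTrainer, GRPOConfig

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs="Qwen2.5-Reward",  # trained reward model from above
    args=GRPOConfig(output_dir="Qwen2.5-GRPO"),
    train_dataset=prompt_dataset,  # placeholder: dataset with a "prompt" column
)
trainer.train()
```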

## Hyperparameters

| Model Size | Learning Rate | Batch Size | Epochs |
|------------|---------------|------------|--------|
| <1B        | 2e-5          | 4-8        | 1-2    |
| 1-7B       | 1e-5          | 2-4        | 1      |
| 7-13B      | 5e-6          | 1-2        | 1      |
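
Read as a concrete config, the 7-13B row might translate to something like this; `output_dir` and the gradient-accumulation choice are illustrative:

```python
from trl import RewardConfig

config = RewardConfig(
    output_dir="reward-7b",          # illustrative name
    learning_rate=5e-6,              # 7-13B row
    per_device_train_batch_size=1,   # low end of the 1-2 range
    gradient_accumulation_steps=8,   # illustrative: raises effective batch size
    num_train_epochs=1,
)
```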

## Evaluation

Check reward separation:

```python
# Tokenize held-out pairs (assumes `model` and `tokenizer` from training;
# `chosen_texts` / `rejected_texts` are placeholder lists of strings)
chosen_inputs = tokenizer(chosen_texts, return_tensors="pt", padding=True)
rejected_inputs = tokenizer(rejected_texts, return_tensors="pt", padding=True)

# Chosen should score higher than rejected
chosen_rewards = model(**chosen_inputs).logits
rejected_rewards = model(**rejected_inputs).logits

accuracy = (chosen_rewards > rejected_rewards).float().mean()
print(f"Accuracy: {accuracy:.2%}")  # Target: >80%
```

## References

- InstructGPT paper: https://arxiv.org/abs/2203.02155
- TRL docs: https://huggingface.co/docs/trl/reward_trainer