mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-26 01:01:40 +00:00
The skills directory was getting disorganized — mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each. Code change: - prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure. Split mlops (40 skills) into 7 sub-categories: - mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth - mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm - mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper - mlops/vector-databases (4): chroma, faiss, pinecone, qdrant - mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases - mlops/cloud (2): lambda-labs, modal - mlops/research (1): dspy Merged singleton categories: - gifs → media (gif-search joins youtube-content) - music-creation → media (heartmula, songsee) - diagramming → creative (excalidraw joins ascii-art) - ocr-and-documents → productivity - domain → research (domain-intel) - feeds → research (blogwatcher) - market-data → research (polymarket) Fixed misplaced skills: - mlops/code-review → software-development (not ML-specific) - mlops/ml-paper-writing → research (academic writing) Added DESCRIPTION.md files for all new/updated categories.
122 lines
2.5 KiB
Markdown
122 lines
2.5 KiB
Markdown
# Reward Modeling
|
|
|
|
Guide to training reward models with TRL for RLHF pipelines.
|
|
|
|
## Overview
|
|
|
|
Reward models score completions based on human preferences. Used in:
|
|
- PPO training (RL feedback)
|
|
- GRPO online RL
|
|
- Completion ranking
|
|
|
|
## Basic Training
|
|
|
|
```python
|
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
|
from trl import RewardTrainer, RewardConfig
|
|
from datasets import load_dataset
|
|
|
|
# Load model (num_labels=1 for single reward score)
|
|
model = AutoModelForSequenceClassification.from_pretrained(
|
|
"Qwen/Qwen2.5-0.5B-Instruct",
|
|
num_labels=1
|
|
)
|
|
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
|
|
|
|
# Load preference dataset (chosen/rejected pairs)
|
|
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
|
|
|
|
# Configure
|
|
config = RewardConfig(
|
|
output_dir="Qwen2.5-Reward",
|
|
per_device_train_batch_size=2,
|
|
num_train_epochs=1,
|
|
learning_rate=1e-5
|
|
)
|
|
|
|
# Train
|
|
trainer = RewardTrainer(
|
|
model=model,
|
|
args=config,
|
|
processing_class=tokenizer,
|
|
train_dataset=dataset
|
|
)
|
|
trainer.train()
|
|
```
|
|
|
|
## Dataset Format
|
|
|
|
Required fields:
|
|
```json
|
|
{
|
|
"prompt": "Question or instruction",
|
|
"chosen": "Better response",
|
|
"rejected": "Worse response"
|
|
}
|
|
```
|
|
|
|
## Bradley-Terry Loss
|
|
|
|
Default loss function:
|
|
```
|
|
loss = -log(sigmoid(reward_chosen - reward_rejected))
|
|
```
|
|
|
|
Learns to score chosen > rejected.
|
|
|
|
## Using Reward Models
|
|
|
|
### Inference
|
|
|
|
```python
|
|
from transformers import pipeline
|
|
|
|
# Load trained reward model
|
|
reward_pipe = pipeline("text-classification", model="Qwen2.5-Reward")
|
|
|
|
# Score completions
|
|
texts = ["Good answer", "Bad answer"]
|
|
scores = reward_pipe(texts)
|
|
print(scores) # Higher score = better
|
|
```
|
|
|
|
### In PPO
|
|
|
|
```python
|
|
from trl import PPOTrainer, PPOConfig
|
|
|
|
config = PPOConfig(
|
|
reward_model_path="Qwen2.5-Reward" # Use trained reward model
|
|
)
|
|
|
|
trainer = PPOTrainer(
|
|
model=policy_model,
|
|
config=config,
|
|
# Reward model loaded automatically
|
|
)
|
|
```
|
|
|
|
## Hyperparameters
|
|
|
|
| Model Size | Learning Rate | Batch Size | Epochs |
|
|
|------------|---------------|------------|--------|
|
|
| <1B | 2e-5 | 4-8 | 1-2 |
|
|
| 1-7B | 1e-5 | 2-4 | 1 |
|
|
| 7-13B | 5e-6 | 1-2 | 1 |
|
|
|
|
## Evaluation
|
|
|
|
Check reward separation:
|
|
```python
|
|
# Chosen should score higher than rejected
|
|
chosen_rewards = model(**chosen_inputs).logits
|
|
rejected_rewards = model(**rejected_inputs).logits
|
|
|
|
accuracy = (chosen_rewards > rejected_rewards).float().mean()
|
|
print(f"Accuracy: {accuracy:.2%}") # Target: >80%
|
|
```
|
|
|
|
## References
|
|
|
|
- InstructGPT paper: https://arxiv.org/abs/2203.02155
|
|
- TRL docs: https://huggingface.co/docs/trl/reward_trainer
|