# GRPO/RL Training Skill

**Expert-level guidance for Group Relative Policy Optimization with TRL**

## 📁 Skill Structure

```
grpo-rl-training/
├── SKILL.md                         # Main skill documentation (READ THIS FIRST)
├── README.md                        # This file
├── templates/
│   └── basic_grpo_training.py       # Production-ready training template
└── examples/
    └── reward_functions_library.py  # 20+ reward function examples
```

## 🚀 Quick Start

1. **Read SKILL.md** - Comprehensive guide with all concepts and patterns
2. **Copy `templates/basic_grpo_training.py`** - Start with working code
3. **Browse `examples/reward_functions_library.py`** - Pick reward functions for your task
4. **Modify for your use case** - Adapt dataset, rewards, and config

## 💡 What's Inside

### SKILL.md (Main Documentation)

- Core GRPO concepts and algorithm fundamentals
- Complete implementation workflow (dataset → rewards → training → deployment)
- 10+ reward function examples with code
- Hyperparameter tuning guide
- Training insights (loss behavior, metrics, debugging)
- Troubleshooting guide
- Production best practices

### Templates

- **basic_grpo_training.py**: Minimal, production-ready training script (see the sketch below)
  - Uses Qwen 2.5 1.5B Instruct
  - 3 reward functions (format + correctness)
  - LoRA for efficient training
  - Fully documented and ready to run

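The template itself lives in `templates/basic_grpo_training.py`; the sketch below is only a rough outline of the same shape of setup (TRL `GRPOTrainer` + `GRPOConfig`, LoRA via `peft`, a few reward functions). The dataset, reward logic, answer tags, and hyperparameters here are illustrative choices, not the template's actual contents.

```python
# Minimal GRPO training sketch - illustrative, not a copy of the bundled template.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Assumption: a dataset with a "prompt" column; extra columns (here "answer")
# are passed to reward functions as keyword arguments.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")

def format_reward(completions, **kwargs):
    # Reward completions that wrap their final answer in <answer> tags.
    return [1.0 if "<answer>" in c and "</answer>" in c else 0.0 for c in completions]

def length_reward(completions, **kwargs):
    # Gentle penalty for very long generations.
    return [-0.001 * len(c) for c in completions]

def correctness_reward(completions, answer, **kwargs):
    # Exact-match check of the dataset's final answer inside the completion.
    return [2.0 if a.split("####")[-1].strip() in c else 0.0
            for c, a in zip(completions, answer)]

config = GRPOConfig(
    output_dir="grpo-qwen2.5-1.5b",
    num_generations=8,             # completions sampled per prompt (the "group")
    max_completion_length=256,
    per_device_train_batch_size=8,
    learning_rate=1e-6,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=[format_reward, length_reward, correctness_reward],
    args=config,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # LoRA for efficiency
)
trainer.train()
```
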
### Examples

- **reward_functions_library.py**: 20+ battle-tested reward functions (see the sketch below for the general pattern)
  - Correctness rewards (exact match, fuzzy match, numeric, code execution)
  - Format rewards (XML, JSON, strict/soft)
  - Length rewards (ideal length, min/max)
  - Style rewards (reasoning quality, citations, repetition penalty)
  - Combined rewards (multi-objective optimization)
  - Preset collections for common tasks

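As a rough illustration of the patterns such a library contains (not its actual contents), a soft format reward, a repetition penalty, a weighted multi-objective combiner, and a preset-style collection might look like this:

```python
# Illustrative sketch of reward-library patterns; names and weights are made up.

def xml_format_reward(completions, **kwargs):
    # Soft format reward: partial credit for each expected tag.
    tags = ["<reasoning>", "</reasoning>", "<answer>", "</answer>"]
    return [sum(0.25 for t in tags if t in c) for c in completions]

def repetition_penalty(completions, **kwargs):
    # Penalize completions whose 4-grams repeat heavily (0 = no repetition).
    scores = []
    for c in completions:
        words = c.split()
        ngrams = [tuple(words[i:i + 4]) for i in range(len(words) - 3)]
        if not ngrams:
            scores.append(0.0)
            continue
        scores.append(len(set(ngrams)) / len(ngrams) - 1.0)
    return scores

def combined_reward(weights):
    # Multi-objective reward: weighted sum of individual reward functions.
    def _reward(completions, **kwargs):
        totals = [0.0] * len(completions)
        for fn, w in weights.items():
            for i, s in enumerate(fn(completions, **kwargs)):
                totals[i] += w * s
        return totals
    return _reward

# Preset-style collection, in the spirit of MATH_REASONING_REWARDS and friends.
MATH_REASONING_REWARDS = [xml_format_reward, repetition_penalty]
```
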
## 📖 Usage for Agents

When this skill is loaded in your agent's context:

1. **Always read SKILL.md first** before implementing
2. **Start simple** - Use length-based reward to validate setup (see the sanity check below)
3. **Build incrementally** - Add one reward function at a time
4. **Reference examples** - Copy patterns from reward_functions_library.py
5. **Monitor training** - Watch reward metrics (not loss!)

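A concrete habit behind step 2: call each reward function on a couple of hand-written completions before training, so bugs surface in seconds rather than after a GPU run. The completions below are made up.

```python
# Illustrative smoke test for a reward function, run before any training.
def length_reward(completions, **kwargs):
    # Simple length-based reward used only to validate the training setup.
    return [-0.001 * len(c) for c in completions]

fake_completions = [
    "<answer>42</answer>",
    "I am not sure, but the answer might be 42. " * 10,  # long, rambling output
]
scores = length_reward(fake_completions)
print(scores)  # the shorter completion should score higher (closer to 0)
assert scores[0] > scores[1]
```
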
## 🎯 Common Use Cases

| Task Type | Recommended Rewards | Template |
|-----------|---------------------|----------|
| Math reasoning | `MATH_REASONING_REWARDS` preset | basic_grpo_training.py |
| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template |
| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards |
| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations |

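Assuming the presets are plain lists of reward callables and importable from the examples module (both assumptions), using one is a one-line change to the trainer from the earlier sketch (which defined `config` and `dataset`):

```python
# Hypothetical: swap a preset collection in as the trainer's reward functions.
from trl import GRPOTrainer
from examples.reward_functions_library import MATH_REASONING_REWARDS  # assumed import path

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=MATH_REASONING_REWARDS,  # a list of reward callables
    args=config,                          # GRPOConfig from the earlier sketch
    train_dataset=dataset,                # dataset from the earlier sketch
)
```
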
## ⚠️ Critical Reminders

- **Loss goes UP during training** - This is normal (it is dominated by the KL divergence term)
- **Use 3-5 reward functions** - Single rewards often fail
- **Test rewards before training** - Debug each function independently
- **Monitor reward_std** - Should stay > 0.1 (avoid mode collapse)
- **Start with num_generations=4-8** - Scale up if GPU memory allows (see the config sketch below)

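A sketch of how the last two reminders translate into code, assuming TRL's `GRPOConfig` and that the trainer logs a `reward_std` metric in its log history (the helper function and threshold below are illustrative):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-run",
    num_generations=8,   # start at 4-8; GRPO scores a group of samples per prompt
    logging_steps=10,
)

# During or after training, check that rewards still vary within each group.
# Assumption: the trainer writes a "reward_std" entry into its log history.
def check_reward_std(trainer, threshold=0.1):
    stds = [log["reward_std"] for log in trainer.state.log_history if "reward_std" in log]
    if stds and stds[-1] < threshold:
        print(f"Warning: reward_std={stds[-1]:.3f}; rewards are collapsing. "
              "Revisit reward functions or sampling temperature.")
```
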
## 🔗 External Resources

- [TRL Documentation](https://huggingface.co/docs/trl)
- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948)
- [Open R1 Implementation](https://github.com/huggingface/open-r1)
- [Unsloth (2-3x faster)](https://docs.unsloth.ai/)

## 📝 Version

**v1.0.0** - Initial release (January 2025)

## 👨‍💻 Maintained By

Orchestra Research

For questions or improvements, see https://orchestra.com

---

**License:** MIT

**Last Updated:** January 2025