# GRPO/RL Training Skill

**Expert-level guidance for Group Relative Policy Optimization with TRL**

## 📁 Skill Structure

```
grpo-rl-training/
├── SKILL.md                         # Main skill documentation (READ THIS FIRST)
├── README.md                        # This file
├── templates/
│   └── basic_grpo_training.py       # Production-ready training template
└── examples/
    └── reward_functions_library.py  # 20+ reward function examples
```

## 🚀 Quick Start

1. **Read SKILL.md** - Comprehensive guide with all concepts and patterns
2. **Copy `templates/basic_grpo_training.py`** - Start with working code
3. **Browse `examples/reward_functions_library.py`** - Pick reward functions for your task
4. **Modify for your use case** - Adapt dataset, rewards, and config

## 💡 What's Inside

### SKILL.md (Main Documentation)

- Core GRPO concepts and algorithm fundamentals
- Complete implementation workflow (dataset → rewards → training → deployment)
- 10+ reward function examples with code
- Hyperparameter tuning guide
- Training insights (loss behavior, metrics, debugging)
- Troubleshooting guide
- Production best practices

### Templates

- **basic_grpo_training.py**: Minimal, production-ready training script (see the sketch below)
  - Uses Qwen 2.5 1.5B Instruct
  - 3 reward functions (format + correctness)
  - LoRA for efficient training
  - Fully documented and ready to run

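The template itself lives in `templates/basic_grpo_training.py`; the sketch below is only a rough outline of the same shape of setup (TRL `GRPOTrainer` + `GRPOConfig`, LoRA via `peft`, a few reward functions). The dataset, reward logic, answer tags, and hyperparameters here are illustrative choices, not the template's actual contents.

```python
# Minimal GRPO training sketch - illustrative, not a copy of the bundled template.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Assumption: a dataset with a "prompt" column; extra columns (here "answer")
# are passed to reward functions as keyword arguments.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")

def format_reward(completions, **kwargs):
    # Reward completions that wrap their final answer in <answer> tags.
    return [1.0 if "<answer>" in c and "</answer>" in c else 0.0 for c in completions]

def length_reward(completions, **kwargs):
    # Gentle penalty for very long generations.
    return [-0.001 * len(c) for c in completions]

def correctness_reward(completions, answer, **kwargs):
    # Exact-match check of the dataset's final answer inside the completion.
    return [2.0 if a.split("####")[-1].strip() in c else 0.0
            for c, a in zip(completions, answer)]

config = GRPOConfig(
    output_dir="grpo-qwen2.5-1.5b",
    num_generations=8,             # completions sampled per prompt (the "group")
    max_completion_length=256,
    per_device_train_batch_size=8,
    learning_rate=1e-6,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=[format_reward, length_reward, correctness_reward],
    args=config,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # LoRA for efficiency
)
trainer.train()
```
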
### Examples

- **reward_functions_library.py**: 20+ battle-tested reward functions (see the sketch below for the general pattern)
  - Correctness rewards (exact match, fuzzy match, numeric, code execution)
  - Format rewards (XML, JSON, strict/soft)
  - Length rewards (ideal length, min/max)
  - Style rewards (reasoning quality, citations, repetition penalty)
  - Combined rewards (multi-objective optimization)
  - Preset collections for common tasks

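As a rough illustration of the patterns such a library contains (not its actual contents), a soft format reward, a repetition penalty, a weighted multi-objective combiner, and a preset-style collection might look like this:

```python
# Illustrative sketch of reward-library patterns; names and weights are made up.

def xml_format_reward(completions, **kwargs):
    # Soft format reward: partial credit for each expected tag.
    tags = ["<reasoning>", "</reasoning>", "<answer>", "</answer>"]
    return [sum(0.25 for t in tags if t in c) for c in completions]

def repetition_penalty(completions, **kwargs):
    # Penalize completions whose 4-grams repeat heavily (0 = no repetition).
    scores = []
    for c in completions:
        words = c.split()
        ngrams = [tuple(words[i:i + 4]) for i in range(len(words) - 3)]
        if not ngrams:
            scores.append(0.0)
            continue
        scores.append(len(set(ngrams)) / len(ngrams) - 1.0)
    return scores

def combined_reward(weights):
    # Multi-objective reward: weighted sum of individual reward functions.
    def _reward(completions, **kwargs):
        totals = [0.0] * len(completions)
        for fn, w in weights.items():
            for i, s in enumerate(fn(completions, **kwargs)):
                totals[i] += w * s
        return totals
    return _reward

# Preset-style collection, in the spirit of MATH_REASONING_REWARDS and friends.
MATH_REASONING_REWARDS = [xml_format_reward, repetition_penalty]
```
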
## 📖 Usage for Agents

When this skill is loaded in your agent's context:

1. **Always read SKILL.md first** before implementing
2. **Start simple** - Use length-based reward to validate setup (see the sanity check below)
3. **Build incrementally** - Add one reward function at a time
4. **Reference examples** - Copy patterns from reward_functions_library.py
5. **Monitor training** - Watch reward metrics (not loss!)

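A concrete habit behind step 2: call each reward function on a couple of hand-written completions before training, so bugs surface in seconds rather than after a GPU run. The completions below are made up.

```python
# Illustrative smoke test for a reward function, run before any training.
def length_reward(completions, **kwargs):
    # Simple length-based reward used only to validate the training setup.
    return [-0.001 * len(c) for c in completions]

fake_completions = [
    "<answer>42</answer>",
    "I am not sure, but the answer might be 42. " * 10,  # long, rambling output
]
scores = length_reward(fake_completions)
print(scores)  # the shorter completion should score higher (closer to 0)
assert scores[0] > scores[1]
```
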
## 🎯 Common Use Cases

| Task Type | Recommended Rewards | Template |
|-----------|---------------------|----------|
| Math reasoning | `MATH_REASONING_REWARDS` preset | basic_grpo_training.py |
| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template |
| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards |
| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations |

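Assuming the presets are plain lists of reward callables and importable from the examples module (both assumptions), using one is a one-line change to the trainer from the earlier sketch (which defined `config` and `dataset`):

```python
# Hypothetical: swap a preset collection in as the trainer's reward functions.
from trl import GRPOTrainer
from examples.reward_functions_library import MATH_REASONING_REWARDS  # assumed import path

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=MATH_REASONING_REWARDS,  # a list of reward callables
    args=config,                          # GRPOConfig from the earlier sketch
    train_dataset=dataset,                # dataset from the earlier sketch
)
```
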
## ⚠️ Critical Reminders

- **Loss goes UP during training** - This is normal (it is dominated by the KL divergence term)
- **Use 3-5 reward functions** - Single rewards often fail
- **Test rewards before training** - Debug each function independently
- **Monitor reward_std** - Should stay > 0.1 (avoid mode collapse)
- **Start with num_generations=4-8** - Scale up if GPU memory allows (see the config sketch below)

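A sketch of how the last two reminders translate into code, assuming TRL's `GRPOConfig` and that the trainer logs a `reward_std` metric in its log history (the helper function and threshold below are illustrative):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-run",
    num_generations=8,   # start at 4-8; GRPO scores a group of samples per prompt
    logging_steps=10,
)

# During or after training, check that rewards still vary within each group.
# Assumption: the trainer writes a "reward_std" entry into its log history.
def check_reward_std(trainer, threshold=0.1):
    stds = [log["reward_std"] for log in trainer.state.log_history if "reward_std" in log]
    if stds and stds[-1] < threshold:
        print(f"Warning: reward_std={stds[-1]:.3f}; rewards are collapsing. "
              "Revisit reward functions or sampling temperature.")
```
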
## 🔗 External Resources

- [TRL Documentation](https://huggingface.co/docs/trl)
- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948)
- [Open R1 Implementation](https://github.com/huggingface/open-r1)
- [Unsloth (2-3x faster)](https://docs.unsloth.ai/)

## 📝 Version

**v1.0.0** - Initial release (January 2025)

## 👨‍💻 Maintained By

Orchestra Research

For questions or improvements, see https://orchestra.com

---

**License:** MIT

**Last Updated:** January 2025