Mirror of https://github.com/NousResearch/hermes-agent.git (synced 2026-04-28 01:21:43 +00:00)
GRPO/RL Training Skill
Expert-level guidance for Group Relative Policy Optimization with TRL
📁 Skill Structure
grpo-rl-training/
├── SKILL.md # Main skill documentation (READ THIS FIRST)
├── README.md # This file
├── templates/
│   └── basic_grpo_training.py # Production-ready training template
└── examples/
    └── reward_functions_library.py # 20+ reward function examples
🚀 Quick Start
- Read SKILL.md - Comprehensive guide with all concepts and patterns
- Copy templates/basic_grpo_training.py - Start with working code
- Browse examples/reward_functions_library.py - Pick reward functions for your task
- Modify for your use case - Adapt dataset, rewards, and config
💡 What's Inside
SKILL.md (Main Documentation)
- Core GRPO concepts and algorithm fundamentals
- Complete implementation workflow (dataset → rewards → training → deployment)
- 10+ reward function examples with code
- Hyperparameter tuning guide
- Training insights (loss behavior, metrics, debugging)
- Troubleshooting guide
- Production best practices
Templates
- basic_grpo_training.py: Minimal, production-ready training script
  - Uses Qwen 2.5 1.5B Instruct
  - 3 reward functions (format + correctness)
  - LoRA for efficient training
  - Fully documented and ready to run
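The ingredients listed above (a Qwen 2.5 1.5B Instruct base model, a list of reward functions, LoRA) might be wired together with TRL roughly as follows. This is a minimal sketch, not the contents of basic_grpo_training.py: the Hub ID, the hyperparameter values, and the `build_trainer` helper are assumptions for illustration. `trl` and `peft` are imported lazily so the settings can be inspected without either package installed.

```python
# Hypothetical settings; the values are illustrative, not taken from the template.
MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"  # assumed Hub ID for "Qwen 2.5 1.5B Instruct"

GRPO_KWARGS = {
    "output_dir": "grpo-output",
    "learning_rate": 5e-6,
    "per_device_train_batch_size": 4,  # kept divisible by num_generations
    "num_generations": 4,              # completions sampled per prompt
    "max_completion_length": 256,
    "logging_steps": 10,
}

LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj"],
    "task_type": "CAUSAL_LM",
}


def build_trainer(train_dataset, reward_funcs):
    """Build a GRPOTrainer with a LoRA adapter (requires trl and peft).

    train_dataset is expected to expose a "prompt" column; reward_funcs is a
    list of callables that each return one float per completion.
    """
    from peft import LoraConfig
    from trl import GRPOConfig, GRPOTrainer

    return GRPOTrainer(
        model=MODEL_ID,
        reward_funcs=reward_funcs,
        args=GRPOConfig(**GRPO_KWARGS),
        train_dataset=train_dataset,
        peft_config=LoraConfig(**LORA_KWARGS),
    )
```

Calling `build_trainer(dataset, [some_reward]).train()` would start a run; keeping the settings in plain dictionaries makes them easy to log or override per experiment.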
Examples
- reward_functions_library.py: 20+ battle-tested reward functions
  - Correctness rewards (exact match, fuzzy match, numeric, code execution)
  - Format rewards (XML, JSON, strict/soft)
  - Length rewards (ideal length, min/max)
  - Style rewards (reasoning quality, citations, repetition penalty)
  - Combined rewards (multi-objective optimization)
  - Preset collections for common tasks
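To make the categories above concrete, here is a hedged sketch of one format reward and one correctness reward. It is not copied from reward_functions_library.py. It assumes completions arrive as plain strings, that the dataset has a hypothetical `answer` column, and that the expected output uses a `<reasoning>`/`<answer>` tag layout chosen only for illustration. The call shape (a `completions` argument plus keyword arguments, returning one float per completion) follows the convention TRL uses for custom reward functions.

```python
import re

# Hypothetical tag layout: <reasoning>...</reasoning><answer>...</answer>
_FORMAT_RE = re.compile(
    r"^\s*<reasoning>.*?</reasoning>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(completions, **kwargs):
    """Return 1.0 for completions that follow the tag layout, else 0.0."""
    return [1.0 if _FORMAT_RE.match(c) else 0.0 for c in completions]


def exact_match_reward(completions, answer, **kwargs):
    """Return 1.0 when the text inside <answer> equals the reference string."""
    rewards = []
    for completion, reference in zip(completions, answer):
        match = _FORMAT_RE.match(completion)
        correct = bool(match) and match.group(1).strip() == str(reference).strip()
        rewards.append(1.0 if correct else 0.0)
    return rewards
```

Keeping format and correctness as separate functions lets their contributions be inspected independently during training.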
📖 Usage for Agents
When this skill is loaded in your agent's context:
- Always read SKILL.md first before implementing
- Start simple - Use a length-based reward to validate the setup
- Build incrementally - Add one reward function at a time
- Reference examples - Copy patterns from reward_functions_library.py
- Monitor training - Watch reward metrics (not loss!)
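A length-based reward of the kind suggested for validating the setup could look like the sketch below. The function name, the character-based measure, and the default `ideal_chars` and `tolerance` values are assumptions for illustration, not values from SKILL.md.

```python
def length_reward(completions, ideal_chars=200, tolerance=200, **kwargs):
    """Score 1.0 at the ideal length, decaying linearly to 0.0.

    completions is assumed to be a list of plain strings.
    """
    rewards = []
    for completion in completions:
        distance = abs(len(completion) - ideal_chars)
        rewards.append(max(0.0, 1.0 - distance / tolerance))
    return rewards
```

Because this reward needs no labels, a rising average on it is a cheap check that generation, reward plumbing, and optimization are connected before task-specific rewards are added one at a time.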
🎯 Common Use Cases
| Task Type | Recommended Rewards | Template |
|---|---|---|
| Math reasoning | MATH_REASONING_REWARDS preset | basic_grpo_training.py |
| Code generation | CODE_GENERATION_REWARDS preset | Modify dataset in template |
| Summarization | SUMMARIZATION_REWARDS preset | Adjust prompts + rewards |
| Q&A | QA_REWARDS preset | Use fuzzy match + citations |
⚠️ Critical Reminders
- Loss goes UP during training - This is normal (the loss reflects the KL divergence from the reference model, not task performance)
- Use 3-5 reward functions - Single rewards often fail
- Test rewards before training - Debug each function independently
- Monitor reward_std - Should stay > 0.1 (avoid mode collapse)
- Start with num_generations=4-8 - Scale up if GPU allows
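The reward_std reminder follows from how GRPO scores completions relative to their group: each reward is normalized by the group's mean and standard deviation, so a group whose rewards are all equal yields zero advantage and no learning signal. The sketch below illustrates that arithmetic for one prompt. The helper names are hypothetical, and it uses the population standard deviation, which may differ slightly from what a given library computes.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize one prompt's rewards: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def has_learning_signal(rewards, min_std=0.1):
    """Check the 'reward_std should stay > 0.1' rule of thumb for one group."""
    return statistics.pstdev(rewards) > min_std
```

Running a reward function on a handful of hand-written completions and passing the result to `has_learning_signal` is one way to debug it independently before starting a training run.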
🔗 External Resources
📝 Version
v1.0.0 - Initial release (January 2025)
👨‍💻 Maintained By
Orchestra Research
For questions or improvements, see https://orchestra.com
License: MIT
Last Updated: January 2025