Online RL Methods
Guide to online reinforcement learning with PPO, GRPO, RLOO, and OnlineDPO.
Overview
Online RL methods generate completions from the current policy during training and update the policy using rewards, either from a trained reward model or from a programmatic reward function.
PPO (Proximal Policy Optimization)
The classic policy-gradient algorithm for RLHF-style LLM alignment; it requires a separately trained reward model and a value model.
Basic Usage
```bash
python -m trl.scripts.ppo \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --reward_model_path reward-model \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --output_dir model-ppo \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 64 \
    --total_episodes 10000 \
    --num_ppo_epochs 4 \
    --kl_coef 0.05
```
Key Parameters
- `kl_coef`: KL penalty coefficient (0.05-0.2)
- `num_ppo_epochs`: optimization epochs per batch (2-4)
- `cliprange`: PPO clipping range (0.1-0.3)
- `vf_coef`: value function loss coefficient (0.1)
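The CLI flags above map onto fields of TRL's PPOConfig. The sketch below shows a programmatic equivalent; field names assume a recent TRL release, and the `cliprange`/`vf_coef` values are illustrative defaults rather than values from the command above.

```python
from trl import PPOConfig

# Programmatic equivalent of the CLI invocation above (a sketch; verify the
# field names against your installed TRL version).
config = PPOConfig(
    output_dir="model-ppo",
    learning_rate=3e-6,
    per_device_train_batch_size=64,
    total_episodes=10_000,
    num_ppo_epochs=4,
    kl_coef=0.05,
    cliprange=0.2,  # PPO clipping range (illustrative, within the 0.1-0.3 band)
    vf_coef=0.1,    # value function loss coefficient
)
```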
GRPO (Group Relative Policy Optimization)
Memory-efficient online RL: instead of training a value model, it computes advantages relative to a group of completions sampled for the same prompt, which is why it needs less memory than PPO.
Basic Usage
```python
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset

# Reward function: takes the generated completions for a batch and returns one
# scalar reward per completion (here, the number of unique words).
def reward_func(completions, **kwargs):
    return [len(set(c.split())) for c in completions]

config = GRPOConfig(
    output_dir="model-grpo",
    num_generations=4,          # completions sampled per prompt
    max_completion_length=128,  # max tokens generated per completion
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_func,
    args=config,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
```
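After training, the policy can be saved with the standard Trainer utilities, since GRPOTrainer builds on transformers' Trainer; the output path below is just an example.

```python
trainer.save_model("model-grpo")  # writes the fine-tuned model weights and config
```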
Key Parameters
- `num_generations`: 2-8 completions per prompt
- `max_completion_length`: 64-256 tokens
- Learning rate: 1e-5 to 1e-4
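Reward functions are plain Python callables, so task-specific signals are easy to plug in. The sketch below assumes a dataset with an `answer` column; GRPOTrainer forwards extra dataset columns to reward functions as keyword arguments (check the docs for your TRL version). The function and column names here are illustrative.

```python
# Exact-match reward against a reference answer (illustrative).
# `answer` is assumed to be a dataset column that GRPOTrainer passes to the
# reward function alongside the generated completions.
def exact_match_reward(completions, answer, **kwargs):
    return [1.0 if c.strip() == a.strip() else 0.0 for c, a in zip(completions, answer)]
```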
Memory Comparison
| Method | Memory (7B) | Speed | Use Case |
|---|---|---|---|
| PPO | 40GB | Medium | Maximum control |
| GRPO | 24GB | Fast | Memory-constrained |
| OnlineDPO | 28GB | Fast | No reward model |
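OnlineDPO appears in the table but has no example above. The sketch below follows the pattern from the TRL Online DPO documentation, where a pairwise judge (here PairRMJudge, which requires the llm-blender package) replaces an explicit reward model; verify the class and argument names against your installed TRL version. The dataset and output path are illustrative.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge

# Online DPO with a pairwise judge instead of a reward model (a sketch).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()  # ranks pairs of completions; needs `pip install llm-blender`

config = OnlineDPOConfig(output_dir="model-online-dpo")
trainer = OnlineDPOTrainer(
    model=model,
    judge=judge,
    args=config,
    processing_class=tokenizer,
    train_dataset=load_dataset("trl-lib/ultrafeedback-prompt", split="train"),
)
trainer.train()
```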
References
- PPO paper: https://arxiv.org/abs/1707.06347
- GRPO paper: https://arxiv.org/abs/2402.03300
- TRL docs: https://huggingface.co/docs/trl/