skills: consolidate mlops redundancies (gguf+llama-cpp, grpo+trl, guidance→optional) (#11965)

Three tightly-scoped built-in skill consolidations to reduce redundancy in
the available_skills listing injected into every system prompt:

1. gguf-quantization → llama-cpp (merged)
   GGUF is llama.cpp's format; two skills covered the same toolchain. The
   merged llama-cpp skill keeps the full K-quant table + imatrix workflow
   from gguf and the ROCm/benchmarks/supported-models sections from the
   original llama-cpp. All 5 reference files preserved.

2. grpo-rl-training → fine-tuning-with-trl (folded in)
   GRPO isn't a framework; it's a trainer inside TRL. Moved the 17KB
   deep-dive SKILL.md to references/grpo-training.md and the working
   template to templates/basic_grpo_training.py. TRL's GRPO workflow
   section now points to both. Atropos skill's related_skills updated.

3. guidance → optional-skills/mlops/
   Dropped from built-in. Outlines (still built-in) covers the same
   structured-generation ground with wider adoption. Listed in the
   optional catalog for users who specifically want Guidance.

Net: 3 fewer built-in skill lines in every system prompt, zero content
loss. Contributor authorship preserved via git rename detection.
This commit is contained in:
Teknium 2026-04-17 21:36:40 -07:00 committed by GitHub
parent 598cba62ad
commit 73bccc94c7
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
15 changed files with 470 additions and 889 deletions


@@ -7,7 +7,7 @@ license: MIT
 metadata:
   hermes:
     tags: [atropos, rl, environments, training, reinforcement-learning, reward-functions]
-    related_skills: [axolotl, grpo-rl-training, trl-fine-tuning, lm-evaluation-harness]
+    related_skills: [axolotl, fine-tuning-with-trl, lm-evaluation-harness]
 ---
 # Hermes Agent Atropos Environments


@@ -1,430 +0,0 @@
---
name: gguf-quantization
description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [llama-cpp-python>=0.2.0]
metadata:
hermes:
tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
---
# GGUF - Quantization Format for llama.cpp
GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
## When to use GGUF
**Use GGUF when:**
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Need CPU inference without GPU requirements
- Want flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
**Key advantages:**
- **Universal hardware**: CPU, Apple Silicon, NVIDIA, AMD support
- **No Python runtime**: Pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: Importance matrix for better low-bit quality
**Use alternatives instead:**
- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: Fast calibration-free quantization for HuggingFace
- **bitsandbytes**: Simple integration with transformers library
- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed
## Quick start
### Installation
```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build (CPU)
make
# Build with CUDA (NVIDIA)
make GGML_CUDA=1
# Build with Metal (Apple Silicon)
make GGML_METAL=1
# Install Python bindings (optional)
pip install llama-cpp-python
```
### Convert model to GGUF
```bash
# Install requirements
pip install -r requirements.txt
# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
--outfile model-f16.gguf \
--outtype f16
```
### Quantize model
```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
### Run inference
```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"
# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive
# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```
## Quantization types
### K-quant methods (recommended)
| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
### Legacy methods
| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |
**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
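The size column follows a simple rule of thumb: file size ≈ parameters × bits-per-weight / 8 bytes, plus a few percent of metadata overhead. A quick sketch (the helper name and the 5% overhead factor are illustrative estimates, not part of llama.cpp):

```python
def est_size_gb(n_params: float, bits_per_weight: float, overhead: float = 1.05) -> float:
    """Rough GGUF file size: params * bpw / 8 bytes, plus ~5% metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead

# 7B model at Q4_K_M (~4.5 bpw) -> roughly the table's ~4.1 GB
print(round(est_size_gb(7e9, 4.5), 1))  # -> 4.1
```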
## Conversion workflows
### Workflow 1: HuggingFace to GGUF
```bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
--outfile llama-3.1-8b-f16.gguf \
--outtype f16
# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```
### Workflow 2: With importance matrix (better quality)
```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF
# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
-f calibration.txt \
--chunk 512 \
-o model.imatrix \
-ngl 35 # GPU layers if available
# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
model-f16.gguf \
model-q4_k_m.gguf \
Q4_K_M
```
### Workflow 3: Multiple quantizations
```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
## Python usage
### llama-cpp-python
```python
from llama_cpp import Llama
# Load model
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096, # Context window
n_gpu_layers=35, # GPU offload (0 for CPU only)
n_threads=8 # CPU threads
)
# Generate
output = llm(
"What is machine learning?",
max_tokens=256,
temperature=0.7,
stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```
### Chat completion
```python
from llama_cpp import Llama
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
chat_format="llama-3" # Or "chatml", "mistral", etc.
)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"}
]
response = llm.create_chat_completion(
messages=messages,
max_tokens=256,
temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```
### Streaming
```python
from llama_cpp import Llama
llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)
# Stream tokens
for chunk in llm(
"Explain quantum computing:",
max_tokens=256,
stream=True
):
print(chunk["choices"][0]["text"], end="", flush=True)
```
## Server mode
### Start OpenAI-compatible server
```bash
# Start server
./llama-server -m model-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 35 \
-c 4096
# Or with Python bindings
python -m llama_cpp.server \
--model model-q4_k_m.gguf \
--n_gpu_layers 35 \
--host 0.0.0.0 \
--port 8080
```
### Use with OpenAI client
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256
)
print(response.choices[0].message.content)
```
## Hardware optimization
### Apple Silicon (Metal)
```bash
# Build with Metal
make clean && make GGML_METAL=1
# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
# Python with Metal
llm = Llama(
model_path="model.gguf",
n_gpu_layers=99, # Offload all layers
n_threads=1 # Metal handles parallelism
)
```
### NVIDIA CUDA
```bash
# Build with CUDA
make clean && make GGML_CUDA=1
# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"
# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```
### CPU optimization
```bash
# Build with AVX2/AVX512
make clean && make
# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
# Python CPU config
llm = Llama(
model_path="model.gguf",
n_gpu_layers=0, # CPU only
n_threads=8, # Match physical cores
n_batch=512 # Batch size for prompt processing
)
```
## Integration with tools
### Ollama
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
# Create Ollama model
ollama create mymodel -f Modelfile
# Run
ollama run mymodel "Hello!"
```
### LM Studio
1. Place GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload
4. Start inference
### text-generation-webui
```bash
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/
# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
## Best practices
1. **Use K-quants**: Q4_K_M offers best quality/size balance
2. **Use imatrix**: Always use importance matrix for Q4 and below
3. **GPU offload**: Offload as many layers as VRAM allows
4. **Context length**: Start with 4096, increase if needed
5. **Thread count**: Match physical CPU cores, not logical
6. **Batch size**: Increase n_batch for faster prompt processing
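Practices 3 and 5 can be applied programmatically before launching a run. A sketch with illustrative helpers (the per-layer size figure and 1 GB headroom are rough assumptions, not llama.cpp APIs):

```python
import os

def pick_gpu_layers(vram_gb: float, n_layers: int, layer_gb: float) -> int:
    """Offload as many layers as fit, leaving ~1 GB headroom for the KV cache."""
    usable = max(vram_gb - 1.0, 0.0)
    return min(n_layers, int(usable / layer_gb))

def pick_threads() -> int:
    """Match physical cores; assume SMT-2, so halve the logical count."""
    return max(1, (os.cpu_count() or 2) // 2)

# 7B Q4_K_M: 32 layers at ~0.13 GB each -> a 12 GB card offloads all 32
print(pick_gpu_layers(12.0, 32, 0.13))  # -> 32
```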
## Common issues
**Model loads slowly:**
```bash
# Use mmap for faster loading
./llama-cli -m model.gguf --mmap
```
**Out of memory:**
```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20 # Reduce from 35
# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```
**Poor quality at low bits:**
```bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
## References
- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks
## Resources
- **Repository**: https://github.com/ggml-org/llama.cpp
- **Python Bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized Models**: https://huggingface.co/TheBloke
- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT


@@ -1,138 +1,271 @@
 ---
 name: llama-cpp
-description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization.
-version: 1.0.0
version: 2.0.0
 author: Orchestra Research
 license: MIT
-dependencies: [llama-cpp-python]
dependencies: [llama-cpp-python>=0.2.0]
 metadata:
   hermes:
-    tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
    tags: [llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression]
 ---
-# llama.cpp
# llama.cpp + GGUF
-Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.
-## When to use llama.cpp
## When to use
-**Use llama.cpp when:**
**Use llama.cpp + GGUF when:**
-- Running on CPU-only machines
-- Deploying on Apple Silicon (M1/M2/M3/M4)
-- Using AMD or Intel GPUs (no CUDA)
-- Edge deployment (Raspberry Pi, embedded systems)
-- Need simple deployment without Docker/Python
- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
- Using AMD (ROCm) or Intel GPUs where CUDA isn't available
- Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
- Need flexible quantization (2–8 bit with K-quants)
- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
- Want a single binary deploy without Docker/Python
-**Use TensorRT-LLM instead when:**
-- Have NVIDIA GPUs (A100/H100)
-- Need maximum throughput (100K+ tok/s)
-- Running in datacenter with CUDA
**Key advantages:**
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
- No Python runtime required (pure C/C++)
- K-quants + imatrix for better low-bit quality
- OpenAI-compatible server built in
- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)
-**Use vLLM instead when:**
-- Have NVIDIA GPUs
-- Need Python-first API
-- Want PagedAttention
**Use alternatives instead:**
- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput
- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed
- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments
- **bitsandbytes** — Simple HuggingFace transformers integration
- **HQQ** — Fast calibration-free quantization
## Quick start
-### Installation
### Install
```bash
-# macOS/Linux
# macOS / Linux (simplest)
brew install llama.cpp
# Or build from source
-git clone https://github.com/ggerganov/llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
-make
-# With Metal (Apple Silicon)
-make LLAMA_METAL=1
-# With CUDA (NVIDIA)
-make LLAMA_CUDA=1
-# With ROCm (AMD)
-make LLAMA_HIP=1
make              # CPU
make GGML_METAL=1 # Apple Silicon
make GGML_CUDA=1  # NVIDIA CUDA
make LLAMA_HIP=1  # AMD ROCm
# Python bindings (optional)
pip install llama-cpp-python
# With CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# With Metal: CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
-### Download model
### Download a pre-quantized GGUF
```bash
-# Download from HuggingFace (GGUF format)
# TheBloke hosts most popular models pre-quantized
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/
-# Or convert from HuggingFace
-python convert_hf_to_gguf.py models/llama-2-7b-chat/
```
### Or convert a HuggingFace model to GGUF
```bash
# 1. Download HF model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./llama-3.1-8b \
  --outfile llama-3.1-8b-f16.gguf \
  --outtype f16
# 3. Quantize to Q4_K_M
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
```
### Run inference
```bash
-# Simple chat
-./llama-cli \
-  -m models/llama-2-7b-chat.Q4_K_M.gguf \
-  -p "Explain quantum computing" \
-  -n 256 # Max tokens
# One-shot prompt
./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256
# Interactive chat
-./llama-cli \
-  -m models/llama-2-7b-chat.Q4_K_M.gguf \
-  --interactive
./llama-cli -m model.Q4_K_M.gguf --interactive
# With GPU offload
./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
```
-### Server mode
### Serve an OpenAI-compatible API
```bash
-# Start OpenAI-compatible server
./llama-server \
-  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -m model.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
-  -ngl 32 # Offload 32 layers to GPU
  -ngl 35 \
  -c 4096 \
  --parallel 4 \
  --cont-batching
```
-# Client request
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-    "model": "llama-2-7b-chat",
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
-## Quantization formats
## Quantization formats (GGUF)
-### GGUF format overview
### K-quant methods (recommended)
-| Format | Bits | Size (7B) | Speed | Quality | Use Case |
-|--------|------|-----------|-------|---------|----------|
-| **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** |
-| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
-| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
-| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
-| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
-| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |
-### Choosing quantization
-```bash
-# General use (balanced)
-Q4_K_M # 4-bit, medium quality
-# Maximum speed (more degradation)
-Q2_K or Q3_K_M
-# Maximum quality (slower)
-Q6_K or Q8_0
-# Very large models (70B, 405B)
-Q3_K_M or Q4_K_S # Lower bits to fit in memory
-```
**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality).
**Legacy quants** (Q4_0/Q4_1/Q5_0/Q5_1) exist, but prefer K-quants for a better quality/size ratio.
**IQ quantization** — ultra-low-bit importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. These require `--imatrix`.
**Task-specific defaults:**
- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
- Code generation: Q5_K_M or Q6_K (higher precision helps)
- Technical / medical: Q6_K or Q8_0
- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
- Raspberry Pi / edge: Q2_K or Q3_K_S
## Conversion workflows
### Basic: HF → GGUF → quantized
```bash
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50
```
### With importance matrix (imatrix) — better low-bit quality
`imatrix` gives a 10–20% perplexity improvement at Q4, essential at Q3 and below.
```bash
# 1. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
# 2. Prepare calibration data (diverse text, ~100MB is ideal)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
# Add more diverse text samples...
EOF
# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
-f calibration.txt \
--chunk 512 \
-o model.imatrix \
-ngl 35
# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
### Multi-quant batch
```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
### Quality testing (perplexity)
```bash
./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
# Baseline FP16: ~5.96 | Q4_K_M: ~6.06 (+1.7%) | Q2_K: ~6.87 (+15.3%)
```
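The percentage deltas quoted above are plain relative increases over the FP16 baseline; for instance:

```python
def ppl_increase(base: float, quant: float) -> float:
    """Relative perplexity increase (%) of a quantized model over its FP16 baseline."""
    return (quant - base) / base * 100

print(round(ppl_increase(5.96, 6.06), 1))  # Q4_K_M -> 1.7
print(round(ppl_increase(5.96, 6.87), 1))  # Q2_K   -> 15.3
```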
## Python bindings (llama-cpp-python)
### Basic generation
```python
from llama_cpp import Llama
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35, # 0 for CPU only, 99 to offload everything
n_threads=8,
)
output = llm(
"What is machine learning?",
max_tokens=256,
temperature=0.7,
stop=["</s>", "\n\n"],
)
print(output["choices"][0]["text"])
```
### Chat completion + streaming
```python
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
chat_format="llama-3", # Or "chatml", "mistral", etc.
)
# Non-streaming
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"},
],
max_tokens=256,
temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
# Streaming
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
print(chunk["choices"][0]["text"], end="", flush=True)
```
### Embeddings
```python
llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
vec = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(vec)}")
``` ```
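Embedding vectors are typically consumed by ranking passages with cosine similarity. A minimal sketch, with toy vectors standing in for real `llm.embed()` output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Higher score = more similar; use it to rank passages against a query embedding
print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 0.0]), 3))  # -> 0.707
```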
## Hardware acceleration
@@ -140,122 +273,166 @@ Q3_K_M or Q4_K_S # Lower bits to fit in memory
### Apple Silicon (Metal)
```bash
-# Build with Metal
-make LLAMA_METAL=1
-# Run with GPU acceleration (automatic)
-./llama-cli -m model.gguf -ngl 999 # Offload all layers
-# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
make clean && make GGML_METAL=1
./llama-cli -m model.gguf -ngl 99 -p "Hello" # offload all layers
```
```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99, # Offload everything
    n_threads=1,     # Metal handles parallelism
)
```
Performance: M3 Max ~40–60 tok/s on Llama 2-7B Q4_K_M.
-### NVIDIA GPUs (CUDA)
### NVIDIA (CUDA)
```bash
-# Build with CUDA
-make LLAMA_CUDA=1
-# Offload layers to GPU
-./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
make clean && make GGML_CUDA=1
./llama-cli -m model.gguf -ngl 35 -p "Hello"
-# Hybrid CPU+GPU for large models
# Hybrid for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
# Multi-GPU split
./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
```
-### AMD GPUs (ROCm)
### AMD (ROCm)
```bash
-# Build with ROCm
make LLAMA_HIP=1
-# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
```
-## Common patterns
-### Batch processing
### CPU
```bash
-# Process multiple prompts from file
-cat prompts.txt | ./llama-cli \
-  -m model.gguf \
-  --batch-size 512 \
-  -n 100
# Match PHYSICAL cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"
# BLAS acceleration (2–3× speedup)
make LLAMA_OPENBLAS=1
```
-### Constrained generation
-```bash
-# JSON output with grammar
-./llama-cli \
-  -m model.gguf \
-  -p "Generate a person: " \
-  --grammar-file grammars/json.gbnf
-# Outputs valid JSON only
-```
-### Context size
-```bash
-# Increase context (default 512)
-./llama-cli \
-  -m model.gguf \
-  -c 4096 # 4K context window
-# Very long context (if model supports)
-./llama-cli -m model.gguf -c 32768 # 32K context
-```
```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,
    n_threads=8,
    n_batch=512, # Larger batch = faster prompt processing
)
```
## Performance benchmarks
-### CPU performance (Llama 2-7B Q4_K_M)
### CPU (Llama 2-7B Q4_K_M)
-| CPU | Threads | Speed | Cost |
-|-----|---------|-------|------|
-| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
-| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
-| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
-| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
| CPU | Threads | Speed |
|-----|---------|-------|
| Apple M3 Max (Metal) | 16 | 50 tok/s |
| AMD Ryzen 9 7950X | 32 | 35 tok/s |
| Intel i9-13900K | 32 | 30 tok/s |
-### GPU acceleration (Llama 2-7B Q4_K_M)
### GPU offloading on RTX 4090
-| GPU | Speed | vs CPU | Cost |
-|-----|-------|--------|------|
-| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
-| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
-| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
-| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
| Layers GPU | Speed | VRAM |
|------------|-------|------|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |
## Supported models
-**LLaMA family**:
-- Llama 2 (7B, 13B, 70B)
-- Llama 3 (8B, 70B, 405B)
-- Code Llama
-**Mistral family**:
-- Mistral 7B
-- Mixtral 8x7B, 8x22B
-**Other**:
-- Falcon, BLOOM, GPT-J
-- Phi-3, Gemma, Qwen
-- LLaVA (vision), Whisper (audio)
-**Find models**: https://huggingface.co/models?library=gguf
- **LLaMA family**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama
- **Mistral family**: Mistral 7B, Mixtral 8x7B/8x22B
- **Other**: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)
Find GGUF models: https://huggingface.co/models?library=gguf
## Ecosystem integrations
### Ollama
```bash
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create mymodel -f Modelfile
ollama run mymodel "Hello!"
```
### LM Studio
1. Place GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload, start inference
### text-generation-webui
```bash
cp model-q4_k_m.gguf text-generation-webui/models/
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
### OpenAI client → llama-server
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256,
)
print(response.choices[0].message.content)
```
## Best practices
1. **Use K-quants** — Q4_K_M is the recommended default
2. **Use imatrix** for Q4 and below (calibration improves quality substantially)
3. **Offload as many layers as VRAM allows** — start high, reduce by 5 on OOM
4. **Thread count** — match physical cores, not logical
5. **Batch size** — increase `n_batch` (e.g. 512) for faster prompt processing
6. **Context** — start at 4096, grow only as needed (memory scales with ctx)
7. **Flash Attention** — add `--flash-attn` if your build supports it
## Common issues (quick fixes)
**Model loads slowly** — use `--mmap` for memory-mapped loading.
**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:
```python
Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35) # Q4_0 KV cache
```
**Garbage output** — wrong `chat_format`, temperature too high, or model file corrupted. Test with `temperature=0.1` and verify FP16 baseline works.
**Connection refused (server)** — bind to `--host 0.0.0.0`, check `lsof -i :8080`.
See `references/troubleshooting.md` for the full playbook.
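The OOM advice follows from KV-cache arithmetic: cache size grows linearly with context length and layer count, and shrinks proportionally when the cache is quantized. A simplified estimate (my own helper, not llama.cpp's API; ignores runtime overhead):

```python
def kv_cache_gb(n_layers: int, n_ctx: int, n_kv_heads: int, head_dim: int,
                bytes_per_elt: float = 2.0) -> float:
    """FP16 KV cache size: 2 (K and V) * layers * ctx * kv_heads * head_dim * bytes."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt / 1e9

# Llama 2-7B (32 layers, 32 KV heads, head_dim 128) at 4096 ctx -> ~2.1 GB FP16;
# a Q4_0 KV cache (~0.56 bytes/elt) cuts this to well under 1 GB
print(round(kv_cache_gb(32, 4096, 32, 128), 1))  # -> 2.1
```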
## References
-- **[Quantization Guide](references/quantization.md)** - GGUF formats, conversion, quality comparison
-- **[Server Deployment](references/server.md)** - API endpoints, Docker, monitoring
-- **[Optimization](references/optimization.md)** - Performance tuning, hybrid CPU+GPU
- **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
- **[quantization.md](references/quantization.md)** — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
- **[server.md](references/server.md)** — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
- **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
- **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging
## Resources
-- **GitHub**: https://github.com/ggerganov/llama.cpp
-- **Models**: https://huggingface.co/models?library=gguf
-- **Discord**: https://discord.gg/llama-cpp
- **GitHub**: https://github.com/ggml-org/llama.cpp
- **Python bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized models**: https://huggingface.co/TheBloke
- **GGUF converter Space**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT


@@ -1,97 +0,0 @@
# GRPO/RL Training Skill
**Expert-level guidance for Group Relative Policy Optimization with TRL**
## 📁 Skill Structure
```
grpo-rl-training/
├── SKILL.md # Main skill documentation (READ THIS FIRST)
├── README.md # This file
├── templates/
│ └── basic_grpo_training.py # Production-ready training template
└── examples/
└── reward_functions_library.py # 20+ reward function examples
```
## 🚀 Quick Start
1. **Read SKILL.md** - Comprehensive guide with all concepts and patterns
2. **Copy `templates/basic_grpo_training.py`** - Start with working code
3. **Browse `examples/reward_functions_library.py`** - Pick reward functions for your task
4. **Modify for your use case** - Adapt dataset, rewards, and config
## 💡 What's Inside
### SKILL.md (Main Documentation)
- Core GRPO concepts and algorithm fundamentals
- Complete implementation workflow (dataset → rewards → training → deployment)
- 10+ reward function examples with code
- Hyperparameter tuning guide
- Training insights (loss behavior, metrics, debugging)
- Troubleshooting guide
- Production best practices
### Templates
- **basic_grpo_training.py**: Minimal, production-ready training script
- Uses Qwen 2.5 1.5B Instruct
- 3 reward functions (format + correctness)
- LoRA for efficient training
- Fully documented and ready to run
### Examples
- **reward_functions_library.py**: 20+ battle-tested reward functions
- Correctness rewards (exact match, fuzzy match, numeric, code execution)
- Format rewards (XML, JSON, strict/soft)
- Length rewards (ideal length, min/max)
- Style rewards (reasoning quality, citations, repetition penalty)
- Combined rewards (multi-objective optimization)
- Preset collections for common tasks
## 📖 Usage for Agents
When this skill is loaded in your agent's context:
1. **Always read SKILL.md first** before implementing
2. **Start simple** - Use length-based reward to validate setup
3. **Build incrementally** - Add one reward function at a time
4. **Reference examples** - Copy patterns from reward_functions_library.py
5. **Monitor training** - Watch reward metrics (not loss!)
## 🎯 Common Use Cases
| Task Type | Recommended Rewards | Template |
|-----------|---------------------|----------|
| Math reasoning | `MATH_REASONING_REWARDS` preset | basic_grpo_training.py |
| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template |
| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards |
| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations |
## ⚠️ Critical Reminders
- **Loss goes UP during training** - This is normal (it's KL divergence)
- **Use 3-5 reward functions** - Single rewards often fail
- **Test rewards before training** - Debug each function independently
- **Monitor reward_std** - Should stay > 0.1 (avoid mode collapse)
- **Start with num_generations=4-8** - Scale up if GPU allows
## 🔗 External Resources
- [TRL Documentation](https://huggingface.co/docs/trl)
- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948)
- [Open R1 Implementation](https://github.com/huggingface/open-r1)
- [Unsloth (2-3x faster)](https://docs.unsloth.ai/)
## 📝 Version
**v1.0.0** - Initial release (January 2025)
## 👨‍💻 Maintained By
Orchestra Research
For questions or improvements, see https://orchestra.com
---
**License:** MIT
**Last Updated:** January 2025

View file

@ -252,6 +252,8 @@ trl dpo \
Train with reinforcement learning using minimal memory. Train with reinforcement learning using minimal memory.
For in-depth GRPO guidance — reward function design, critical training insights (loss behavior, mode collapse, tuning), and advanced multi-stage patterns — see **[references/grpo-training.md](references/grpo-training.md)**. A production-ready training script is in **[templates/basic_grpo_training.py](templates/basic_grpo_training.py)**.
Copy this checklist: Copy this checklist:
``` ```
@ -428,6 +430,8 @@ config = PPOConfig(
**Online RL methods**: See [references/online-rl.md](references/online-rl.md) for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations. **Online RL methods**: See [references/online-rl.md](references/online-rl.md) for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.
**GRPO deep dive**: See [references/grpo-training.md](references/grpo-training.md) for expert-level GRPO patterns — reward function design philosophy, training insights (why loss increases, mode collapse detection), hyperparameter tuning, multi-stage training, and troubleshooting. Production-ready template in [templates/basic_grpo_training.py](templates/basic_grpo_training.py).
## Hardware requirements ## Hardware requirements
- **GPU**: NVIDIA (CUDA required) - **GPU**: NVIDIA (CUDA required)

View file

@ -1,51 +1,36 @@
--- # GRPO (Group Relative Policy Optimization) — Deep Guide
name: grpo-rl-training
description: Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [transformers>=4.47.0, trl>=0.14.0, datasets>=3.2.0, peft>=0.14.0, torch]
metadata:
hermes:
tags: [Post-Training, Reinforcement Learning, GRPO, TRL, RLHF, Reward Modeling, Reasoning, DPO, PPO, Structured Output]
--- Expert-level patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions using TRL's `GRPOTrainer`. This is the deep reference for the GRPO workflow summarized in the main skill.
# GRPO/RL Training with TRL ## When to use GRPO
Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions. Use GRPO when you need to:
- **Enforce specific output formats** (XML tags, JSON, structured reasoning)
## When to Use This Skill
Use GRPO training when you need to:
- **Enforce specific output formats** (e.g., XML tags, JSON, structured reasoning)
- **Teach verifiable tasks** with objective correctness metrics (math, coding, fact-checking) - **Teach verifiable tasks** with objective correctness metrics (math, coding, fact-checking)
- **Improve reasoning capabilities** by rewarding chain-of-thought patterns - **Improve reasoning capabilities** by rewarding chain-of-thought patterns
- **Align models to domain-specific behaviors** without labeled preference data - **Align models to domain-specific behaviors** without labeled preference data
- **Optimize for multiple objectives** simultaneously (format + correctness + style) - **Optimize for multiple objectives** simultaneously (format + correctness + style)
**Do NOT use GRPO for:** **Do NOT use GRPO for:**
- Simple supervised fine-tuning tasks (use SFT instead) - Simple supervised fine-tuning tasks → use SFT
- Tasks without clear reward signals - Tasks without clear reward signals
- When you already have high-quality preference pairs (use DPO/PPO instead) - When you already have high-quality preference pairs → use DPO/PPO
--- ## Core concepts
## Core Concepts ### 1. GRPO algorithm fundamentals
### 1. GRPO Algorithm Fundamentals **Key mechanism:**
- Generates **multiple completions** per prompt (group size: 416)
**Key Mechanism:**
- Generates **multiple completions** for each prompt (group size: 4-16)
- Compares completions within each group using reward functions - Compares completions within each group using reward functions
- Updates policy to favor higher-rewarded responses relative to the group - Updates policy to favor higher-rewarded responses relative to the group
**Critical Difference from PPO:** **Critical differences from PPO:**
- No separate reward model needed - No separate reward model needed
- More sample-efficient (learns from within-group comparisons) - More sample-efficient (learns from within-group comparisons)
- Simpler to implement and debug - Simpler to implement and debug
**Mathematical Intuition:** **Mathematical intuition:**
``` ```
For each prompt p: For each prompt p:
1. Generate N completions: {c₁, c₂, ..., cₙ} 1. Generate N completions: {c₁, c₂, ..., cₙ}
@ -54,35 +39,32 @@ For each prompt p:
relative to low-reward ones in the same group relative to low-reward ones in the same group
``` ```
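The pseudocode above can be made concrete: a group-relative advantage normalizes each completion's reward against its own group's mean and spread, so good completions in a group get positive updates and bad ones negative. A minimal sketch (TRL's exact normalization may differ):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Advantage of each completion relative to its group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # Identical rewards within a group carry no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Two good and two bad completions in one group of N=4:
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

This is exactly why `reward_std` matters during training: a group whose rewards are all equal contributes zero advantage everywhere.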
### 2. Reward Function Design Philosophy ### 2. Reward function design philosophy
**Golden Rules:** **Golden rules:**
1. **Compose multiple reward functions** - Each handles one aspect (format, correctness, style) 1. **Compose multiple reward functions** — each handles one aspect (format, correctness, style)
2. **Scale rewards appropriately** - Higher weight = stronger signal 2. **Scale rewards appropriately** — higher weight = stronger signal
3. **Use incremental rewards** - Partial credit for partial compliance 3. **Use incremental rewards** — partial credit for partial compliance
4. **Test rewards independently** - Debug each reward function in isolation 4. **Test rewards independently** — debug each reward function in isolation
**Reward Function Types:** **Reward function types:**
| Type | Use Case | Example Weight | | Type | Use Case | Example Weight |
|------|----------|----------------| |------|----------|----------------|
| **Correctness** | Verifiable tasks (math, code) | 2.0 (highest) | | **Correctness** | Verifiable tasks (math, code) | 2.0 (highest) |
| **Format** | Strict structure enforcement | 0.5-1.0 | | **Format** | Strict structure enforcement | 0.5–1.0 |
| **Length** | Encourage verbosity/conciseness | 0.1-0.5 | | **Length** | Encourage verbosity/conciseness | 0.1–0.5 |
| **Style** | Penalize unwanted patterns | -0.5 to 0.5 | | **Style** | Penalize unwanted patterns | −0.5 to 0.5 |
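The table's weights can be applied by composing per-aspect functions into one signal. A sketch with hypothetical stand-in rewards and completions simplified to plain strings (in TRL, completions are lists of chat messages, and recent versions also accept a list of reward functions with `reward_weights` directly):

```python
def correctness_reward(completions, **kwargs):
    # Stand-in: a real version compares an extracted answer to ground truth.
    return [1.0 if "42" in c else 0.0 for c in completions]

def format_reward(completions, **kwargs):
    # Stand-in: a real version checks the full tag structure.
    return [1.0 if c.startswith("<answer>") else 0.0 for c in completions]

def combine_rewards(funcs_and_weights, completions, **kwargs):
    """Weighted sum of several reward signals, one total per completion."""
    totals = [0.0] * len(completions)
    for func, weight in funcs_and_weights:
        for i, r in enumerate(func(completions, **kwargs)):
            totals[i] += weight * r
    return totals

scores = combine_rewards(
    [(correctness_reward, 2.0), (format_reward, 0.5)],
    ["<answer>42</answer>", "maybe 7?"],
)
```

Keeping each component independent makes the "test rewards independently" rule above practical: each stand-in can be debugged in isolation before being weighted into the total.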
--- ## Implementation workflow
## Implementation Workflow ### Step 1: Dataset preparation
### Step 1: Dataset Preparation **Critical requirements:**
- Prompts in chat format (list of dicts with `role` and `content`)
**Critical Requirements:**
- Prompts in chat format (list of dicts with 'role' and 'content')
- Include system prompts to set expectations - Include system prompts to set expectations
- For verifiable tasks, include ground truth answers as additional columns - For verifiable tasks, include ground truth answers as additional columns
**Example Structure:**
```python ```python
from datasets import load_dataset, Dataset from datasets import load_dataset, Dataset
@ -97,8 +79,7 @@ Respond in the following format:
""" """
def prepare_dataset(raw_data): def prepare_dataset(raw_data):
""" """Transform raw data into GRPO-compatible format.
Transform raw data into GRPO-compatible format.
Returns: Dataset with columns: Returns: Dataset with columns:
- 'prompt': List[Dict] with role/content (system + user messages) - 'prompt': List[Dict] with role/content (system + user messages)
@ -113,14 +94,14 @@ def prepare_dataset(raw_data):
}) })
``` ```
**Pro Tips:** **Pro tips:**
- Use one-shot or few-shot examples in system prompt for complex formats - Use one-shot or few-shot examples in the system prompt for complex formats
- Keep prompts concise (max_prompt_length: 256-512 tokens) - Keep prompts concise (max_prompt_length: 256–512 tokens)
- Validate data quality before training (garbage in = garbage out) - Validate data quality before training (garbage in = garbage out)
### Step 2: Reward Function Implementation ### Step 2: Reward function implementation
**Template Structure:** **Template structure:**
```python ```python
def reward_function_name( def reward_function_name(
prompts, # List[List[Dict]]: Original prompts prompts, # List[List[Dict]]: Original prompts
@ -128,24 +109,16 @@ def reward_function_name(
answer=None, # Optional: Ground truth from dataset answer=None, # Optional: Ground truth from dataset
**kwargs # Additional dataset columns **kwargs # Additional dataset columns
) -> list[float]: ) -> list[float]:
""" """Evaluate completions and return rewards (one per completion)."""
Evaluate completions and return rewards.
Returns: List of floats (one per completion)
"""
# Extract completion text
responses = [comp[0]['content'] for comp in completions] responses = [comp[0]['content'] for comp in completions]
# Compute rewards
rewards = [] rewards = []
for response in responses: for response in responses:
score = compute_score(response) score = compute_score(response)
rewards.append(score) rewards.append(score)
return rewards return rewards
``` ```
**Example 1: Correctness Reward (Math/Coding)** **Example 1: correctness reward (math/coding)**
```python ```python
def correctness_reward(prompts, completions, answer, **kwargs): def correctness_reward(prompts, completions, answer, **kwargs):
"""Reward correct answers with high score.""" """Reward correct answers with high score."""
@ -155,7 +128,7 @@ def correctness_reward(prompts, completions, answer, **kwargs):
for ans, gt in zip(extracted, answer)] for ans, gt in zip(extracted, answer)]
``` ```
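The snippet relies on extracting a candidate answer before comparing it to ground truth, but the helper itself is not shown. A minimal sketch, assuming the `<answer>` tag format used throughout this guide:

```python
import re

def extract_answer(text):
    """Return the contents of the last <answer>...</answer> block, or None."""
    matches = re.findall(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return matches[-1] if matches else None

parsed = extract_answer("<reasoning>20 + 22</reasoning>\n<answer> 42 </answer>")
```

Returning `None` instead of raising lets the reward function score malformed completions as simply incorrect, which keeps training robust to bad generations.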
**Example 2: Format Reward (Structured Output)** **Example 2: format reward (structured output)**
```python ```python
import re import re
@ -167,7 +140,7 @@ def format_reward(completions, **kwargs):
for r in responses] for r in responses]
``` ```
**Example 3: Incremental Format Reward (Partial Credit)** **Example 3: incremental format reward (partial credit)**
```python ```python
def incremental_format_reward(completions, **kwargs): def incremental_format_reward(completions, **kwargs):
"""Award partial credit for format compliance.""" """Award partial credit for format compliance."""
@ -176,14 +149,10 @@ def incremental_format_reward(completions, **kwargs):
for r in responses: for r in responses:
score = 0.0 score = 0.0
if '<reasoning>' in r: if '<reasoning>' in r: score += 0.25
score += 0.25 if '</reasoning>' in r: score += 0.25
if '</reasoning>' in r: if '<answer>' in r: score += 0.25
score += 0.25 if '</answer>' in r: score += 0.25
if '<answer>' in r:
score += 0.25
if '</answer>' in r:
score += 0.25
# Penalize extra text after closing tag # Penalize extra text after closing tag
if r.count('</answer>') == 1: if r.count('</answer>') == 1:
extra_text = r.split('</answer>')[-1].strip() extra_text = r.split('</answer>')[-1].strip()
@ -193,12 +162,11 @@ def incremental_format_reward(completions, **kwargs):
return rewards return rewards
``` ```
**Critical Insight:** **Critical insight:** Combine 3–5 reward functions for robust training. Order matters less than diversity of signals.
Combine 3-5 reward functions for robust training. Order matters less than diversity of signals.
### Step 3: Training Configuration ### Step 3: Training configuration
**Memory-Optimized Config (Small GPU)** **Memory-optimized config (small GPU)**
```python ```python
from trl import GRPOConfig from trl import GRPOConfig
@ -218,13 +186,13 @@ training_args = GRPOConfig(
gradient_accumulation_steps=4, # Effective batch = 4 gradient_accumulation_steps=4, # Effective batch = 4
# GRPO-specific # GRPO-specific
num_generations=8, # Group size: 8-16 recommended num_generations=8, # Group size: 8–16 recommended
max_prompt_length=256, max_prompt_length=256,
max_completion_length=512, max_completion_length=512,
# Training duration # Training duration
num_train_epochs=1, num_train_epochs=1,
max_steps=None, # Or set fixed steps (e.g., 500) max_steps=None,
# Optimization # Optimization
bf16=True, # Faster on A100/H100 bf16=True, # Faster on A100/H100
@ -234,11 +202,11 @@ training_args = GRPOConfig(
# Logging # Logging
logging_steps=1, logging_steps=1,
save_steps=100, save_steps=100,
report_to="wandb", # Or "none" for no logging report_to="wandb",
) )
``` ```
**High-Performance Config (Large GPU)** **High-performance config (large GPU)**
```python ```python
training_args = GRPOConfig( training_args = GRPOConfig(
output_dir="outputs/grpo-model", output_dir="outputs/grpo-model",
@ -255,31 +223,30 @@ training_args = GRPOConfig(
) )
``` ```
**Critical Hyperparameters:** **Critical hyperparameters:**
| Parameter | Impact | Tuning Advice | | Parameter | Impact | Tuning Advice |
|-----------|--------|---------------| |-----------|--------|---------------|
| `num_generations` | Group size for comparison | Start with 8, increase to 16 if GPU allows | | `num_generations` | Group size for comparison | Start at 8, increase to 16 if GPU allows |
| `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) | | `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
| `max_completion_length` | Output verbosity | Match your task (512 for reasoning, 256 for short answers) | | `max_completion_length` | Output verbosity | Match your task (512 reasoning, 256 short answers) |
| `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory limited | | `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory limited |
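These parameters interact: recent TRL versions enforce that the effective batch (per-device batch × gradient accumulation × number of GPUs) splits evenly into groups of `num_generations`, so it is worth checking the arithmetic before launch (verify against your TRL version). A small pre-flight sketch:

```python
def check_grpo_batching(per_device_batch, grad_accum, num_gpus, num_generations):
    """Verify the global batch splits into whole completion groups."""
    global_batch = per_device_batch * grad_accum * num_gpus
    if global_batch % num_generations != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by "
            f"num_generations={num_generations}"
        )
    return global_batch // num_generations  # distinct prompts per optimizer step

prompts_per_step = check_grpo_batching(2, 4, 1, num_generations=8)
```

If the check fails, raise `gradient_accumulation_steps` or lower `num_generations` until the groups fit.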
### Step 4: Model Setup and Training ### Step 4: Model setup and training
**Standard Setup (Transformers)** **Standard setup (Transformers + TRL)**
```python ```python
import torch import torch
from transformers import AutoModelForCausalLM, AutoTokenizer from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig from peft import LoraConfig
from trl import GRPOTrainer from trl import GRPOTrainer
# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct" model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained( model = AutoModelForCausalLM.from_pretrained(
model_name, model_name,
torch_dtype=torch.bfloat16, torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2", # 2-3x faster attn_implementation="flash_attention_2", # 2–3× faster
device_map="auto" device_map="auto",
) )
tokenizer = AutoTokenizer.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name)
@ -287,17 +254,16 @@ tokenizer.pad_token = tokenizer.eos_token
# Optional: LoRA for parameter-efficient training # Optional: LoRA for parameter-efficient training
peft_config = LoraConfig( peft_config = LoraConfig(
r=16, # Rank (higher = more capacity) r=16,
lora_alpha=32, # Scaling factor (typically 2*r) lora_alpha=32,
target_modules=[ target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj", "q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj" "gate_proj", "up_proj", "down_proj",
], ],
task_type="CAUSAL_LM", task_type="CAUSAL_LM",
lora_dropout=0.05, lora_dropout=0.05,
) )
# Initialize trainer
trainer = GRPOTrainer( trainer = GRPOTrainer(
model=model, model=model,
processing_class=tokenizer, processing_class=tokenizer,
@ -308,17 +274,14 @@ trainer = GRPOTrainer(
], ],
args=training_args, args=training_args,
train_dataset=dataset, train_dataset=dataset,
peft_config=peft_config, # Remove for full fine-tuning peft_config=peft_config, # Remove for full fine-tuning
) )
# Train
trainer.train() trainer.train()
# Save
trainer.save_model("final_model") trainer.save_model("final_model")
``` ```
**Unsloth Setup (2-3x Faster)** **Unsloth setup (2–3× faster)**
```python ```python
from unsloth import FastLanguageModel from unsloth import FastLanguageModel
@ -339,28 +302,26 @@ model = FastLanguageModel.get_peft_model(
use_gradient_checkpointing="unsloth", use_gradient_checkpointing="unsloth",
) )
# Rest is identical to standard setup # Rest is identical to the standard setup
trainer = GRPOTrainer(model=model, ...) trainer = GRPOTrainer(model=model, ...)
trainer.train() trainer.train()
``` ```
--- ## Critical training insights
## Critical Training Insights ### 1. Loss behavior (EXPECTED pattern)
- **Loss starts near 0 and INCREASES during training** — this is CORRECT
- Loss measures KL divergence from initial policy; the model is learning (diverging from original behavior to optimize rewards)
- **Monitor reward metrics, not loss, for progress**
### 1. Loss Behavior (EXPECTED PATTERN) ### 2. Reward tracking
- **Loss starts near 0 and INCREASES during training**
- This is CORRECT - loss measures KL divergence from initial policy
- Model is learning (diverging from original behavior to optimize rewards)
- Monitor reward metrics instead of loss for progress
### 2. Reward Tracking
Key metrics to watch: Key metrics to watch:
- `reward`: Average across all completions - `reward` — average across all completions
- `reward_std`: Diversity within groups (should remain > 0) - `reward_std` — diversity within groups (should remain > 0)
- `kl`: KL divergence from reference (should grow moderately) - `kl` — KL divergence from reference (should grow moderately)
**Healthy Training Pattern:** **Healthy pattern:**
``` ```
Step Reward Reward_Std KL Step Reward Reward_Std KL
100 0.5 0.3 0.02 100 0.5 0.3 0.02
@ -369,12 +330,12 @@ Step Reward Reward_Std KL
400 1.5 0.15 0.12 400 1.5 0.15 0.12
``` ```
**Warning Signs:** **Warning signs:**
- Reward std → 0 (model collapsing to single response) - `reward_std` → 0 (model collapsing to a single response)
- KL exploding (> 0.5) (diverging too much, reduce LR) - `kl` exploding (> 0.5) — diverging too much, reduce LR
- Reward stuck (reward functions too harsh or model capacity issue) - Reward stuck — reward functions too harsh or model capacity issue
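These warning signs translate directly into a per-step check against the logged metrics; the thresholds follow the text and should be tuned per task:

```python
def health_flags(metrics):
    """Flag the GRPO warning signs above given one step's logged metrics."""
    flags = []
    if metrics["reward_std"] < 0.1:
        flags.append("mode collapse: reward_std near zero")
    if metrics["kl"] > 0.5:
        flags.append("kl exploding: reduce learning rate")
    return flags

healthy = health_flags({"reward_std": 0.3, "kl": 0.02})   # step 100 from the table
unhealthy = health_flags({"reward_std": 0.05, "kl": 0.6})
```

Wiring this into a logging callback turns "watch the dashboard" into an alert that fires before a run is wasted.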
### 3. Common Pitfalls and Solutions ### 3. Common pitfalls and solutions
| Problem | Symptom | Solution | | Problem | Symptom | Solution |
|---------|---------|----------| |---------|---------|----------|
@ -384,15 +345,14 @@ Step Reward Reward_Std KL
| **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length | | **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
| **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards | | **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
--- ## Advanced patterns
## Advanced Patterns ### 1. Multi-stage training
### 1. Multi-Stage Training
For complex tasks, train in stages: For complex tasks, train in stages:
```python ```python
# Stage 1: Format compliance (epochs=1) # Stage 1: Format compliance
trainer_stage1 = GRPOTrainer( trainer_stage1 = GRPOTrainer(
model=model, model=model,
reward_funcs=[incremental_format_reward, format_reward], reward_funcs=[incremental_format_reward, format_reward],
@ -400,7 +360,7 @@ trainer_stage1 = GRPOTrainer(
) )
trainer_stage1.train() trainer_stage1.train()
# Stage 2: Correctness (epochs=1) # Stage 2: Correctness
trainer_stage2 = GRPOTrainer( trainer_stage2 = GRPOTrainer(
model=model, model=model,
reward_funcs=[format_reward, correctness_reward], reward_funcs=[format_reward, correctness_reward],
@ -409,7 +369,8 @@ trainer_stage2 = GRPOTrainer(
trainer_stage2.train() trainer_stage2.train()
``` ```
### 2. Adaptive Reward Scaling ### 2. Adaptive reward scaling
```python ```python
class AdaptiveReward: class AdaptiveReward:
def __init__(self, base_reward_func, initial_weight=1.0): def __init__(self, base_reward_func, initial_weight=1.0):
@ -428,148 +389,116 @@ class AdaptiveReward:
self.weight *= 0.9 self.weight *= 0.9
``` ```
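A complete, runnable sketch of the adaptive-scaling idea, decaying a reward's weight once the model reliably earns it, could look like this (the decay trigger and constants are illustrative, not the original implementation):

```python
class AdaptiveRewardSketch:
    """Decay a reward's weight as the model saturates that objective."""

    def __init__(self, base_reward_func, initial_weight=1.0, target=0.8):
        self.base_reward_func = base_reward_func
        self.weight = initial_weight
        self.target = target  # mean raw reward that counts as "mostly solved"

    def __call__(self, completions, **kwargs):
        raw = self.base_reward_func(completions, **kwargs)
        if raw and sum(raw) / len(raw) > self.target:
            self.weight *= 0.9  # soften a mostly-solved objective
        return [self.weight * r for r in raw]

adaptive = AdaptiveRewardSketch(lambda comps, **kw: [1.0 for _ in comps])
first = adaptive(["a", "b"])   # mean 1.0 > target, so weight decays to 0.9
second = adaptive(["a"])       # decays again, to 0.81
```

Shrinking a saturated reward's weight keeps the group-relative comparison dominated by the objectives the model has not yet mastered.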
### 3. Custom Dataset Integration ### 3. Custom dataset integration
```python ```python
def load_custom_knowledge_base(csv_path): def load_custom_knowledge_base(csv_path):
"""Example: School communication platform docs."""
import pandas as pd import pandas as pd
df = pd.read_csv(csv_path) df = pd.read_csv(csv_path)
return Dataset.from_pandas(df).map(lambda x: {
dataset = Dataset.from_pandas(df).map(lambda x: {
'prompt': [ 'prompt': [
{'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT}, {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
{'role': 'user', 'content': x['question']} {'role': 'user', 'content': x['question']}
], ],
'answer': x['expert_answer'] 'answer': x['expert_answer']
}) })
return dataset
``` ```
--- ## Deployment and inference
## Deployment and Inference ### Save and merge LoRA
### Save and Merge LoRA
```python ```python
# Merge LoRA adapters into base model
if hasattr(trainer.model, 'merge_and_unload'): if hasattr(trainer.model, 'merge_and_unload'):
merged_model = trainer.model.merge_and_unload() merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained("production_model") merged_model.save_pretrained("production_model")
tokenizer.save_pretrained("production_model") tokenizer.save_pretrained("production_model")
``` ```
### Inference Example ### Inference
```python ```python
from transformers import pipeline from transformers import pipeline
generator = pipeline( generator = pipeline("text-generation", model="production_model", tokenizer=tokenizer)
"text-generation",
model="production_model",
tokenizer=tokenizer
)
result = generator( result = generator(
[ [
{'role': 'system', 'content': SYSTEM_PROMPT}, {'role': 'system', 'content': SYSTEM_PROMPT},
{'role': 'user', 'content': "What is 15 + 27?"} {'role': 'user', 'content': "What is 15 + 27?"},
], ],
max_new_tokens=256, max_new_tokens=256,
do_sample=True, do_sample=True,
temperature=0.7, temperature=0.7,
top_p=0.9 top_p=0.9,
) )
print(result[0]['generated_text']) print(result[0]['generated_text'])
``` ```
--- ## Best practices checklist
## Best Practices Checklist **Before training:**
**Before Training:**
- [ ] Validate dataset format (prompts as List[Dict]) - [ ] Validate dataset format (prompts as List[Dict])
- [ ] Test reward functions on sample data - [ ] Test reward functions on sample data
- [ ] Calculate expected max_prompt_length from data - [ ] Calculate expected `max_prompt_length` from data
- [ ] Choose appropriate num_generations based on GPU memory - [ ] Choose `num_generations` based on GPU memory
- [ ] Set up logging (wandb recommended) - [ ] Set up logging (wandb recommended)
**During Training:** **During training:**
- [ ] Monitor reward progression (should increase) - [ ] Monitor reward progression (should increase)
- [ ] Check reward_std (should stay > 0.1) - [ ] Check `reward_std` (should stay > 0.1)
- [ ] Watch for OOM errors (reduce batch size if needed) - [ ] Watch for OOM errors (reduce batch size if needed)
- [ ] Sample generations every 50-100 steps - [ ] Sample generations every 50–100 steps
- [ ] Validate format compliance on holdout set - [ ] Validate format compliance on holdout set
**After Training:** **After training:**
- [ ] Merge LoRA weights if using PEFT - [ ] Merge LoRA weights if using PEFT
- [ ] Test on diverse prompts - [ ] Test on diverse prompts
- [ ] Compare to baseline model - [ ] Compare to baseline model
- [ ] Document reward weights and hyperparameters - [ ] Document reward weights and hyperparameters
- [ ] Save reproducibility config - [ ] Save reproducibility config
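The first checklist item, validating that prompts are chat-format message lists, is easy to automate before a run. A small validator sketch:

```python
def validate_grpo_dataset(rows):
    """Check that each row's 'prompt' is a list of role/content messages."""
    for i, row in enumerate(rows):
        prompt = row.get("prompt")
        if not isinstance(prompt, list):
            raise TypeError(f"row {i}: prompt must be a list of messages")
        for msg in prompt:
            if not isinstance(msg, dict) or not {"role", "content"} <= msg.keys():
                raise ValueError(f"row {i}: each message needs 'role' and 'content'")
    return True

ok = validate_grpo_dataset([
    {"prompt": [{"role": "system", "content": "Answer in <answer> tags."},
                {"role": "user", "content": "What is 15 + 27?"}],
     "answer": "42"},
])
```

Running this over the full dataset at load time catches format mistakes that would otherwise surface as cryptic trainer errors mid-run.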
--- ## Troubleshooting
## Troubleshooting Guide ### Debugging workflow
1. **Isolate reward functions** — test each independently
2. **Check data distribution** — ensure diversity in prompts
3. **Reduce complexity** — start with single reward, add gradually
4. **Monitor generations** — print samples every N steps
5. **Validate extraction logic** — ensure answer parsing works
### Debugging Workflow ### Quick debug reward
1. **Isolate reward functions** - Test each independently
2. **Check data distribution** - Ensure diversity in prompts
3. **Reduce complexity** - Start with single reward, add gradually
4. **Monitor generations** - Print samples every N steps
5. **Validate extraction logic** - Ensure answer parsing works
### Quick Fixes
```python ```python
# Debug reward function
def debug_reward(completions, **kwargs): def debug_reward(completions, **kwargs):
responses = [comp[0]['content'] for comp in completions] responses = [comp[0]['content'] for comp in completions]
for i, r in enumerate(responses[:2]): # Print first 2 for i, r in enumerate(responses[:2]):
print(f"Response {i}: {r[:200]}...") print(f"Response {i}: {r[:200]}...")
return [1.0] * len(responses) # Dummy rewards return [1.0] * len(responses)
# Test without training # Test without training
trainer = GRPOTrainer(..., reward_funcs=[debug_reward]) trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
trainer.generate_completions(dataset[:1]) # Generate without updating trainer.generate_completions(dataset[:1])
``` ```
--- ## Template
## References and Resources A production-ready training script lives at **`../templates/basic_grpo_training.py`**. It uses Qwen 2.5-1.5B-Instruct with LoRA and three reward functions (incremental format, strict format, correctness) on GSM8K. Copy and adapt:
1. `get_dataset()` — swap in your data loader
2. Reward functions — tune to your task
3. `SYSTEM_PROMPT` — match your output format
4. `GRPOConfig` — adjust hyperparameters for your GPU
## References and resources
**Official Documentation:**
- TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer - TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
- DeepSeek R1 Paper: https://arxiv.org/abs/2501.12948 - GRPO paper (DeepSeek): https://arxiv.org/abs/2402.03300
- Unsloth Docs: https://docs.unsloth.ai/ - DeepSeek R1 paper: https://arxiv.org/abs/2501.12948
- Open R1 implementation: https://github.com/huggingface/open-r1
**Example Repositories:** - TRL examples: https://github.com/huggingface/trl/tree/main/examples
- Open R1 Implementation: https://github.com/huggingface/open-r1 - Unsloth (faster training): https://docs.unsloth.ai/
- TRL Examples: https://github.com/huggingface/trl/tree/main/examples
**Recommended Reading:**
- Progressive Disclosure Pattern for agent instructions
- Reward shaping in RL (Ng et al.)
- LoRA paper (Hu et al., 2021)
---
## Usage Instructions for Agents
When this skill is loaded:
1. **Read this entire file** before implementing GRPO training
2. **Start with the simplest reward function** (e.g., length-based) to validate setup
3. **Use the templates** in `templates/` directory as starting points
4. **Reference examples** in `examples/` for task-specific implementations
5. **Follow the workflow** sequentially (don't skip steps)
6. **Debug incrementally** - add one reward function at a time
**Critical Reminders:**
- Always use multiple reward functions (3-5 is optimal)
- Monitor reward metrics, not loss
- Test reward functions before training
- Start small (num_generations=4), scale up gradually
- Save checkpoints frequently (every 100 steps)
This skill is designed for **expert-level implementation**. Beginners should start with supervised fine-tuning before attempting GRPO.
## Critical reminders
- **Loss goes UP during training** — this is normal (it's KL divergence)
- **Use 3–5 reward functions** — single rewards often fail
- **Test rewards before training** — debug each function independently
- **Monitor `reward_std`** — should stay > 0.1 (avoid mode collapse)
- **Start with `num_generations=4-8`** — scale up if GPU allows

View file

@ -98,6 +98,7 @@ The largest optional category — covers the full ML pipeline from data curation
| **chroma** | Open-source embedding database. Store embeddings and metadata, perform vector and full-text search. Simple 4-function API for RAG and semantic search. | | **chroma** | Open-source embedding database. Store embeddings and metadata, perform vector and full-text search. Simple 4-function API for RAG and semantic search. |
| **faiss** | Facebook's library for efficient similarity search and clustering of dense vectors. Supports billions of vectors, GPU acceleration, and various index types (Flat, IVF, HNSW). | | **faiss** | Facebook's library for efficient similarity search and clustering of dense vectors. Supports billions of vectors, GPU acceleration, and various index types (Flat, IVF, HNSW). |
| **flash-attention** | Optimize transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Supports PyTorch SDPA, flash-attn library, H100 FP8, and sliding window. | | **flash-attention** | Optimize transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Supports PyTorch SDPA, flash-attn library, H100 FP8, and sliding window. |
| **guidance** | Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance — Microsoft Research's constrained generation framework. |
| **hermes-atropos-environments** | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, and evaluation. | | **hermes-atropos-environments** | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, and evaluation. |
| **huggingface-tokenizers** | Fast Rust-based tokenizers for research and production. Tokenizes 1GB in under 20 seconds. Supports BPE, WordPiece, and Unigram algorithms. | | **huggingface-tokenizers** | Fast Rust-based tokenizers for research and production. Tokenizes 1GB in under 20 seconds. Supports BPE, WordPiece, and Unigram algorithms. |
| **instructor** | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, and stream partial results. | | **instructor** | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, and stream partial results. |

View file

@ -163,10 +163,8 @@ Model serving, quantization (GGUF/GPTQ), structured output, inference optimizati
| Skill | Description | Path | | Skill | Description | Path |
|-------|-------------|------| |-------|-------------|------|
| `gguf-quantization` | GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements. | `mlops/inference/gguf` |
| `guidance` | Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework | `mlops/inference/guidance` |
| `instructor` | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, parse complex JSON with type safety, and stream partial results with Instructor - battle-tested structured output library | `mlops/inference/instructor` |
| `llama-cpp` | Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU. | `mlops/inference/llama-cpp` |
| `llama-cpp` | Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2-8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization. | `mlops/inference/llama-cpp` |
| `obliteratus` | Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, LEACE, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods, 28 analysis modules, 116 model presets ac… | `mlops/inference/obliteratus` |
| `outlines` | Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library | `mlops/inference/outlines` |
| `serving-llms-vllm` | Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), an… | `mlops/inference/vllm` |
@@ -202,7 +200,6 @@ Fine-tuning, RLHF/DPO/GRPO training, distributed training frameworks, and optimi
| `axolotl` | Expert guidance for fine-tuning LLMs with Axolotl - YAML configs, 100+ models, LoRA/QLoRA, DPO/KTO/ORPO/GRPO, multimodal support | `mlops/training/axolotl` |
| `distributed-llm-pretraining-torchtitan` | Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing. | `mlops/training/torchtitan` |
| `fine-tuning-with-trl` | Fine-tune LLMs using reinforcement learning with TRL - SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when need RLHF, align model with preferences, or train from human feedback. Works with HuggingFace Tr… | `mlops/training/trl-fine-tuning` |
| `grpo-rl-training` | Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training | `mlops/training/grpo-rl-training` |
| `hermes-atropos-environments` | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/evaluate). Use when creating, reviewing, or f… | `mlops/training/hermes-atropos-environments` |
| `huggingface-accelerate` | Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard. | `mlops/training/accelerate` |
| `optimizing-attention-flash` | Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when training/running transformers with long sequences (>512 tokens), encountering GPU memory issues with attention, or need faster inference. Supports PyTorch native SDPA,… | `mlops/training/flash-attention` |