From 73bccc94c7af3a07b4002c2a14a4b54f844bd561 Mon Sep 17 00:00:00 2001
From: Teknium <127238744+teknium1@users.noreply.github.com>
Date: Fri, 17 Apr 2026 21:36:40 -0700
Subject: [PATCH] =?UTF-8?q?skills:=20consolidate=20mlops=20redundancies=20?=
=?UTF-8?q?(gguf+llama-cpp,=20grpo+trl,=20guidance=E2=86=92optional)=20(#1?=
=?UTF-8?q?1965)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Three tightly-scoped built-in skill consolidations to reduce redundancy in
the available_skills listing injected into every system prompt:
1. gguf-quantization → llama-cpp (merged)
GGUF is llama.cpp's format; two skills covered the same toolchain. The
merged llama-cpp skill keeps the full K-quant table + imatrix workflow
from gguf and the ROCm/benchmarks/supported-models sections from the
original llama-cpp. All 5 reference files preserved.
2. grpo-rl-training → fine-tuning-with-trl (folded in)
GRPO isn't a framework; it's a trainer inside TRL. Moved the 17KB
deep-dive SKILL.md to references/grpo-training.md and the working
template to templates/basic_grpo_training.py. TRL's GRPO workflow
section now points to both. Atropos skill's related_skills updated.
3. guidance → optional-skills/mlops/
Dropped from built-in. Outlines (still built-in) covers the same
structured-generation ground with wider adoption. Listed in the
optional catalog for users who specifically want Guidance.
Net: 3 fewer built-in skill lines in every system prompt, zero content
loss. Contributor authorship preserved via git rename detection.
---
.../mlops}/guidance/SKILL.md | 0
.../mlops}/guidance/references/backends.md | 0
.../mlops}/guidance/references/constraints.md | 0
.../mlops}/guidance/references/examples.md | 0
.../hermes-atropos-environments/SKILL.md | 2 +-
skills/mlops/inference/gguf/SKILL.md | 430 ---------------
skills/mlops/inference/llama-cpp/SKILL.md | 491 ++++++++++++------
.../references/advanced-usage.md | 0
.../references/troubleshooting.md | 0
.../mlops/training/grpo-rl-training/README.md | 97 ----
.../mlops/training/trl-fine-tuning/SKILL.md | 4 +
.../references/grpo-training.md} | 329 +++++-------
.../templates/basic_grpo_training.py | 0
.../docs/reference/optional-skills-catalog.md | 1 +
website/docs/reference/skills-catalog.md | 5 +-
15 files changed, 470 insertions(+), 889 deletions(-)
rename {skills/mlops/inference => optional-skills/mlops}/guidance/SKILL.md (100%)
rename {skills/mlops/inference => optional-skills/mlops}/guidance/references/backends.md (100%)
rename {skills/mlops/inference => optional-skills/mlops}/guidance/references/constraints.md (100%)
rename {skills/mlops/inference => optional-skills/mlops}/guidance/references/examples.md (100%)
delete mode 100644 skills/mlops/inference/gguf/SKILL.md
rename skills/mlops/inference/{gguf => llama-cpp}/references/advanced-usage.md (100%)
rename skills/mlops/inference/{gguf => llama-cpp}/references/troubleshooting.md (100%)
delete mode 100644 skills/mlops/training/grpo-rl-training/README.md
rename skills/mlops/training/{grpo-rl-training/SKILL.md => trl-fine-tuning/references/grpo-training.md} (56%)
rename skills/mlops/training/{grpo-rl-training => trl-fine-tuning}/templates/basic_grpo_training.py (100%)
diff --git a/skills/mlops/inference/guidance/SKILL.md b/optional-skills/mlops/guidance/SKILL.md
similarity index 100%
rename from skills/mlops/inference/guidance/SKILL.md
rename to optional-skills/mlops/guidance/SKILL.md
diff --git a/skills/mlops/inference/guidance/references/backends.md b/optional-skills/mlops/guidance/references/backends.md
similarity index 100%
rename from skills/mlops/inference/guidance/references/backends.md
rename to optional-skills/mlops/guidance/references/backends.md
diff --git a/skills/mlops/inference/guidance/references/constraints.md b/optional-skills/mlops/guidance/references/constraints.md
similarity index 100%
rename from skills/mlops/inference/guidance/references/constraints.md
rename to optional-skills/mlops/guidance/references/constraints.md
diff --git a/skills/mlops/inference/guidance/references/examples.md b/optional-skills/mlops/guidance/references/examples.md
similarity index 100%
rename from skills/mlops/inference/guidance/references/examples.md
rename to optional-skills/mlops/guidance/references/examples.md
diff --git a/optional-skills/mlops/hermes-atropos-environments/SKILL.md b/optional-skills/mlops/hermes-atropos-environments/SKILL.md
index 9dff46687..5101886b4 100644
--- a/optional-skills/mlops/hermes-atropos-environments/SKILL.md
+++ b/optional-skills/mlops/hermes-atropos-environments/SKILL.md
@@ -7,7 +7,7 @@ license: MIT
metadata:
hermes:
tags: [atropos, rl, environments, training, reinforcement-learning, reward-functions]
- related_skills: [axolotl, grpo-rl-training, trl-fine-tuning, lm-evaluation-harness]
+ related_skills: [axolotl, fine-tuning-with-trl, lm-evaluation-harness]
---
# Hermes Agent Atropos Environments
diff --git a/skills/mlops/inference/gguf/SKILL.md b/skills/mlops/inference/gguf/SKILL.md
deleted file mode 100644
index 21bb176c8..000000000
--- a/skills/mlops/inference/gguf/SKILL.md
+++ /dev/null
@@ -1,430 +0,0 @@
----
-name: gguf-quantization
-description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [llama-cpp-python>=0.2.0]
-metadata:
- hermes:
- tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
-
----
-
-# GGUF - Quantization Format for llama.cpp
-
-The GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
-
-## When to use GGUF
-
-**Use GGUF when:**
-- Deploying on consumer hardware (laptops, desktops)
-- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
-- Need CPU inference without GPU requirements
-- Want flexible quantization (Q2_K to Q8_0)
-- Using local AI tools (LM Studio, Ollama, text-generation-webui)
-
-**Key advantages:**
-- **Universal hardware**: CPU, Apple Silicon, NVIDIA, AMD support
-- **No Python runtime**: Pure C/C++ inference
-- **Flexible quantization**: 2-8 bit with various methods (K-quants)
-- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
-- **imatrix**: Importance matrix for better low-bit quality
-
-**Use alternatives instead:**
-- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
-- **HQQ**: Fast calibration-free quantization for HuggingFace
-- **bitsandbytes**: Simple integration with transformers library
-- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed
-
-## Quick start
-
-### Installation
-
-```bash
-# Clone llama.cpp
-git clone https://github.com/ggml-org/llama.cpp
-cd llama.cpp
-
-# Build (CPU)
-make
-
-# Build with CUDA (NVIDIA)
-make GGML_CUDA=1
-
-# Build with Metal (Apple Silicon)
-make GGML_METAL=1
-
-# Install Python bindings (optional)
-pip install llama-cpp-python
-```
-
-### Convert model to GGUF
-
-```bash
-# Install requirements
-pip install -r requirements.txt
-
-# Convert HuggingFace model to GGUF (FP16)
-python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
-
-# Or specify output type
-python convert_hf_to_gguf.py ./path/to/model \
- --outfile model-f16.gguf \
- --outtype f16
-```
-
-### Quantize model
-
-```bash
-# Basic quantization to Q4_K_M
-./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
-
-# Quantize with importance matrix (better quality)
-./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
-./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
-```
-
-### Run inference
-
-```bash
-# CLI inference
-./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"
-
-# Interactive mode
-./llama-cli -m model-q4_k_m.gguf --interactive
-
-# With GPU offload
-./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
-```
-
-## Quantization types
-
-### K-quant methods (recommended)
-
-| Type | Bits | Size (7B) | Quality | Use Case |
-|------|------|-----------|---------|----------|
-| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
-| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
-| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
-| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
-| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** |
-| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
-| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
-| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
-| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
-
-### Legacy methods
-
-| Type | Description |
-|------|-------------|
-| Q4_0 | 4-bit, basic |
-| Q4_1 | 4-bit with delta |
-| Q5_0 | 5-bit, basic |
-| Q5_1 | 5-bit with delta |
-
-**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
-
-## Conversion workflows
-
-### Workflow 1: HuggingFace to GGUF
-
-```bash
-# 1. Download model
-huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
-
-# 2. Convert to GGUF (FP16)
-python convert_hf_to_gguf.py ./llama-3.1-8b \
- --outfile llama-3.1-8b-f16.gguf \
- --outtype f16
-
-# 3. Quantize
-./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
-
-# 4. Test
-./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
-```
-
-### Workflow 2: With importance matrix (better quality)
-
-```bash
-# 1. Convert to GGUF
-python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
-
-# 2. Create calibration text (diverse samples)
-cat > calibration.txt << 'EOF'
-The quick brown fox jumps over the lazy dog.
-Machine learning is a subset of artificial intelligence.
-Python is a popular programming language.
-# Add more diverse text samples...
-EOF
-
-# 3. Generate importance matrix
-./llama-imatrix -m model-f16.gguf \
- -f calibration.txt \
- --chunk 512 \
- -o model.imatrix \
- -ngl 35 # GPU layers if available
-
-# 4. Quantize with imatrix
-./llama-quantize --imatrix model.imatrix \
- model-f16.gguf \
- model-q4_k_m.gguf \
- Q4_K_M
-```
-
-### Workflow 3: Multiple quantizations
-
-```bash
-#!/bin/bash
-MODEL="llama-3.1-8b-f16.gguf"
-IMATRIX="llama-3.1-8b.imatrix"
-
-# Generate imatrix once
-./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
-
-# Create multiple quantizations
-for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
- OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
- ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
- echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
-done
-```
-
-## Python usage
-
-### llama-cpp-python
-
-```python
-from llama_cpp import Llama
-
-# Load model
-llm = Llama(
- model_path="./model-q4_k_m.gguf",
- n_ctx=4096, # Context window
- n_gpu_layers=35, # GPU offload (0 for CPU only)
- n_threads=8 # CPU threads
-)
-
-# Generate
-output = llm(
- "What is machine learning?",
- max_tokens=256,
- temperature=0.7,
- stop=["", "\n\n"]
-)
-print(output["choices"][0]["text"])
-```
-
-### Chat completion
-
-```python
-from llama_cpp import Llama
-
-llm = Llama(
- model_path="./model-q4_k_m.gguf",
- n_ctx=4096,
- n_gpu_layers=35,
- chat_format="llama-3" # Or "chatml", "mistral", etc.
-)
-
-messages = [
- {"role": "system", "content": "You are a helpful assistant."},
- {"role": "user", "content": "What is Python?"}
-]
-
-response = llm.create_chat_completion(
- messages=messages,
- max_tokens=256,
- temperature=0.7
-)
-print(response["choices"][0]["message"]["content"])
-```
-
-### Streaming
-
-```python
-from llama_cpp import Llama
-
-llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)
-
-# Stream tokens
-for chunk in llm(
- "Explain quantum computing:",
- max_tokens=256,
- stream=True
-):
- print(chunk["choices"][0]["text"], end="", flush=True)
-```
-
-## Server mode
-
-### Start OpenAI-compatible server
-
-```bash
-# Start server
-./llama-server -m model-q4_k_m.gguf \
- --host 0.0.0.0 \
- --port 8080 \
- -ngl 35 \
- -c 4096
-
-# Or with Python bindings
-python -m llama_cpp.server \
- --model model-q4_k_m.gguf \
- --n_gpu_layers 35 \
- --host 0.0.0.0 \
- --port 8080
-```
-
-### Use with OpenAI client
-
-```python
-from openai import OpenAI
-
-client = OpenAI(
- base_url="http://localhost:8080/v1",
- api_key="not-needed"
-)
-
-response = client.chat.completions.create(
- model="local-model",
- messages=[{"role": "user", "content": "Hello!"}],
- max_tokens=256
-)
-print(response.choices[0].message.content)
-```
-
-## Hardware optimization
-
-### Apple Silicon (Metal)
-
-```bash
-# Build with Metal
-make clean && make GGML_METAL=1
-
-# Run with Metal acceleration
-./llama-cli -m model.gguf -ngl 99 -p "Hello"
-
-# Python with Metal
-llm = Llama(
- model_path="model.gguf",
- n_gpu_layers=99, # Offload all layers
- n_threads=1 # Metal handles parallelism
-)
-```
-
-### NVIDIA CUDA
-
-```bash
-# Build with CUDA
-make clean && make GGML_CUDA=1
-
-# Run with CUDA
-./llama-cli -m model.gguf -ngl 35 -p "Hello"
-
-# Specify GPU
-CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
-```
-
-### CPU optimization
-
-```bash
-# Build with AVX2/AVX512
-make clean && make
-
-# Run with optimal threads
-./llama-cli -m model.gguf -t 8 -p "Hello"
-
-# Python CPU config
-llm = Llama(
- model_path="model.gguf",
- n_gpu_layers=0, # CPU only
- n_threads=8, # Match physical cores
- n_batch=512 # Batch size for prompt processing
-)
-```
-
-## Integration with tools
-
-### Ollama
-
-```bash
-# Create Modelfile
-cat > Modelfile << 'EOF'
-FROM ./model-q4_k_m.gguf
-TEMPLATE """{{ .System }}
-{{ .Prompt }}"""
-PARAMETER temperature 0.7
-PARAMETER num_ctx 4096
-EOF
-
-# Create Ollama model
-ollama create mymodel -f Modelfile
-
-# Run
-ollama run mymodel "Hello!"
-```
-
-### LM Studio
-
-1. Place GGUF file in `~/.cache/lm-studio/models/`
-2. Open LM Studio and select the model
-3. Configure context length and GPU offload
-4. Start inference
-
-### text-generation-webui
-
-```bash
-# Place in models folder
-cp model-q4_k_m.gguf text-generation-webui/models/
-
-# Start with llama.cpp loader
-python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
-```
-
-## Best practices
-
-1. **Use K-quants**: Q4_K_M offers best quality/size balance
-2. **Use imatrix**: Always use importance matrix for Q4 and below
-3. **GPU offload**: Offload as many layers as VRAM allows
-4. **Context length**: Start with 4096, increase if needed
-5. **Thread count**: Match physical CPU cores, not logical
-6. **Batch size**: Increase n_batch for faster prompt processing
-
-## Common issues
-
-**Model loads slowly:**
-```bash
-# Use mmap for faster loading
-./llama-cli -m model.gguf --mmap
-```
-
-**Out of memory:**
-```bash
-# Reduce GPU layers
-./llama-cli -m model.gguf -ngl 20 # Reduce from 35
-
-# Or use smaller quantization
-./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
-```
-
-**Poor quality at low bits:**
-```bash
-# Always use imatrix for Q4 and below
-./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
-./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
-```
-
-## References
-
-- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds
-- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks
-
-## Resources
-
-- **Repository**: https://github.com/ggml-org/llama.cpp
-- **Python Bindings**: https://github.com/abetlen/llama-cpp-python
-- **Pre-quantized Models**: https://huggingface.co/TheBloke
-- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
-- **License**: MIT
diff --git a/skills/mlops/inference/llama-cpp/SKILL.md b/skills/mlops/inference/llama-cpp/SKILL.md
index 57016c920..33fc37adb 100644
--- a/skills/mlops/inference/llama-cpp/SKILL.md
+++ b/skills/mlops/inference/llama-cpp/SKILL.md
@@ -1,138 +1,271 @@
---
name: llama-cpp
-description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
-version: 1.0.0
+description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization.
+version: 2.0.0
author: Orchestra Research
license: MIT
-dependencies: [llama-cpp-python]
+dependencies: [llama-cpp-python>=0.2.0]
metadata:
hermes:
- tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
-
+ tags: [llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression]
---
-# llama.cpp
+# llama.cpp + GGUF
-Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
+Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.
-## When to use llama.cpp
+## When to use
-**Use llama.cpp when:**
-- Running on CPU-only machines
-- Deploying on Apple Silicon (M1/M2/M3/M4)
-- Using AMD or Intel GPUs (no CUDA)
-- Edge deployment (Raspberry Pi, embedded systems)
-- Need simple deployment without Docker/Python
+**Use llama.cpp + GGUF when:**
+- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
+- Using AMD (ROCm) or Intel GPUs where CUDA isn't available
+- Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
+- Need flexible quantization (2–8 bit with K-quants)
+- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
+- Want a single binary deploy without Docker/Python
-**Use TensorRT-LLM instead when:**
-- Have NVIDIA GPUs (A100/H100)
-- Need maximum throughput (100K+ tok/s)
-- Running in datacenter with CUDA
+**Key advantages:**
+- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
+- No Python runtime required (pure C/C++)
+- K-quants + imatrix for better low-bit quality
+- OpenAI-compatible server built in
+- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)
-**Use vLLM instead when:**
-- Have NVIDIA GPUs
-- Need Python-first API
-- Want PagedAttention
+**Use alternatives instead:**
+- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput
+- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed
+- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments
+- **bitsandbytes** — Simple HuggingFace transformers integration
+- **HQQ** — Fast calibration-free quantization
## Quick start
-### Installation
+### Install
```bash
-# macOS/Linux
+# macOS / Linux (simplest)
brew install llama.cpp
# Or build from source
-git clone https://github.com/ggerganov/llama.cpp
+git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
-make
+make # CPU
+make GGML_METAL=1 # Apple Silicon
+make GGML_CUDA=1 # NVIDIA CUDA
+make LLAMA_HIP=1 # AMD ROCm
-# With Metal (Apple Silicon)
-make LLAMA_METAL=1
-
-# With CUDA (NVIDIA)
-make LLAMA_CUDA=1
-
-# With ROCm (AMD)
-make LLAMA_HIP=1
+# Python bindings (optional)
+pip install llama-cpp-python
+# With CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
+# With Metal: CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
-### Download model
+### Download a pre-quantized GGUF
```bash
-# Download from HuggingFace (GGUF format)
+# TheBloke hosts most popular models pre-quantized
huggingface-cli download \
TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf \
--local-dir models/
+```
-# Or convert from HuggingFace
-python convert_hf_to_gguf.py models/llama-2-7b-chat/
+### Or convert a HuggingFace model to GGUF
+
+```bash
+# 1. Download HF model
+huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
+
+# 2. Convert to FP16 GGUF
+python convert_hf_to_gguf.py ./llama-3.1-8b \
+ --outfile llama-3.1-8b-f16.gguf \
+ --outtype f16
+
+# 3. Quantize to Q4_K_M
+./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
```
### Run inference
```bash
-# Simple chat
-./llama-cli \
- -m models/llama-2-7b-chat.Q4_K_M.gguf \
- -p "Explain quantum computing" \
- -n 256 # Max tokens
+# One-shot prompt
+./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256
# Interactive chat
-./llama-cli \
- -m models/llama-2-7b-chat.Q4_K_M.gguf \
- --interactive
+./llama-cli -m model.Q4_K_M.gguf --interactive
+
+# With GPU offload
+./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
```
-### Server mode
+### Serve an OpenAI-compatible API
```bash
-# Start OpenAI-compatible server
./llama-server \
- -m models/llama-2-7b-chat.Q4_K_M.gguf \
+ -m model.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
- -ngl 32 # Offload 32 layers to GPU
+ -ngl 35 \
+ -c 4096 \
+ --parallel 4 \
+ --cont-batching
+```
-# Client request
+```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
- "model": "llama-2-7b-chat",
+ "model": "local",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7,
"max_tokens": 100
}'
```
-## Quantization formats
+## Quantization formats (GGUF)
-### GGUF format overview
+### K-quant methods (recommended)
-| Format | Bits | Size (7B) | Speed | Quality | Use Case |
-|--------|------|-----------|-------|---------|----------|
-| **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** |
-| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
-| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
-| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
-| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
-| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
+| Type | Bits | Size (7B) | Quality | Use Case |
+|------|------|-----------|---------|----------|
+| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
+| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
+| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
+| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
+| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** |
+| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
+| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
+| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
+| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |
-### Choosing quantization
+**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality).
+
+**Legacy formats (Q4_0/Q4_1/Q5_0/Q5_1) still exist**, but always prefer K-quants for their better quality/size ratio.
+
+**IQ quantization** — ultra-low-bit, importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. These require `--imatrix`.
+
+**Task-specific defaults:**
+- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
+- Code generation: Q5_K_M or Q6_K (higher precision helps)
+- Technical / medical: Q6_K or Q8_0
+- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
+- Raspberry Pi / edge: Q2_K or Q3_K_S
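As a rough sanity check on the size column above, GGUF file size scales approximately linearly with bits per weight. A minimal sketch using the table's bits-per-weight figures (actual files run slightly larger because some tensors stay at higher precision):

```python
# Rough GGUF size estimate: n_params * bits_per_weight / 8 bytes,
# ignoring metadata and the higher-precision tensors some quants keep.
BITS_PER_WEIGHT = {"Q2_K": 2.5, "Q4_K_M": 4.5, "Q6_K": 6.0, "Q8_0": 8.0}

def estimate_gguf_gb(n_params: float, quant: str) -> float:
    """Approximate quantized file size in GB (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{estimate_gguf_gb(7e9, quant):.1f} GB")
```

For a 7B model this lands near the table's numbers (e.g. ~3.9 GB estimated vs ~4.1 GB listed for Q4_K_M).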
+
+## Conversion workflows
+
+### Basic: HF → GGUF → quantized
```bash
-# General use (balanced)
-Q4_K_M # 4-bit, medium quality
+python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
+./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
+./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50
+```
-# Maximum speed (more degradation)
-Q2_K or Q3_K_M
+### With importance matrix (imatrix) — better low-bit quality
-# Maximum quality (slower)
-Q6_K or Q8_0
+Using an `imatrix` typically yields a 10–20% perplexity improvement at Q4 and is essential at Q3 and below.
-# Very large models (70B, 405B)
-Q3_K_M or Q4_K_S # Lower bits to fit in memory
+```bash
+# 1. Convert to FP16 GGUF
+python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
+
+# 2. Prepare calibration data (diverse text, ~100MB is ideal)
+cat > calibration.txt << 'EOF'
+The quick brown fox jumps over the lazy dog.
+Machine learning is a subset of artificial intelligence.
+# Add more diverse text samples...
+EOF
+
+# 3. Generate importance matrix
+./llama-imatrix -m model-f16.gguf \
+ -f calibration.txt \
+ --chunk 512 \
+ -o model.imatrix \
+ -ngl 35
+
+# 4. Quantize with imatrix
+./llama-quantize --imatrix model.imatrix \
+ model-f16.gguf model-q4_k_m.gguf Q4_K_M
+```
+
+### Multi-quant batch
+
+```bash
+#!/bin/bash
+MODEL="llama-3.1-8b-f16.gguf"
+IMATRIX="llama-3.1-8b.imatrix"
+
+./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
+
+for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
+ OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
+ ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
+ echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
+done
+```
+
+### Quality testing (perplexity)
+
+```bash
+./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
+# Baseline FP16: ~5.96 | Q4_K_M: ~6.06 (+1.7%) | Q2_K: ~6.87 (+15.3%)
+```
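The percentages in the comment above are relative increases over the FP16 baseline; a tiny helper reproducing them from the quoted perplexities:

```python
def ppl_increase_pct(baseline: float, quantized: float) -> float:
    """Relative perplexity increase over the FP16 baseline, in percent."""
    return (quantized - baseline) / baseline * 100

fp16 = 5.96  # baseline from the example above
for name, ppl in [("Q4_K_M", 6.06), ("Q2_K", 6.87)]:
    print(f"{name}: +{ppl_increase_pct(fp16, ppl):.1f}%")  # +1.7%, +15.3%
```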
+
+## Python bindings (llama-cpp-python)
+
+### Basic generation
+
+```python
+from llama_cpp import Llama
+
+llm = Llama(
+ model_path="./model-q4_k_m.gguf",
+ n_ctx=4096,
+ n_gpu_layers=35, # 0 for CPU only, 99 to offload everything
+ n_threads=8,
+)
+
+output = llm(
+ "What is machine learning?",
+ max_tokens=256,
+ temperature=0.7,
+    stop=["</s>", "\n\n"],  # use your model's EOS token as a stop sequence
+)
+print(output["choices"][0]["text"])
+```
+
+### Chat completion + streaming
+
+```python
+llm = Llama(
+ model_path="./model-q4_k_m.gguf",
+ n_ctx=4096,
+ n_gpu_layers=35,
+ chat_format="llama-3", # Or "chatml", "mistral", etc.
+)
+
+# Non-streaming
+response = llm.create_chat_completion(
+ messages=[
+ {"role": "system", "content": "You are a helpful assistant."},
+ {"role": "user", "content": "What is Python?"},
+ ],
+ max_tokens=256,
+ temperature=0.7,
+)
+print(response["choices"][0]["message"]["content"])
+
+# Streaming
+for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
+ print(chunk["choices"][0]["text"], end="", flush=True)
+```
+
+### Embeddings
+
+```python
+llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
+vec = llm.embed("This is a test sentence.")
+print(f"Embedding dimension: {len(vec)}")
```
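Similarity search over `embed()` output needs no extra dependencies; a minimal cosine-similarity sketch (the sample vectors below are made up, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain float lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# e.g. rank candidate texts against a query vector from llm.embed(...)
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))  # 0.5
```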
## Hardware acceleration
@@ -140,122 +273,166 @@ Q3_K_M or Q4_K_S # Lower bits to fit in memory
### Apple Silicon (Metal)
```bash
-# Build with Metal
-make LLAMA_METAL=1
-
-# Run with GPU acceleration (automatic)
-./llama-cli -m model.gguf -ngl 999 # Offload all layers
-
-# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
+make clean && make GGML_METAL=1
+./llama-cli -m model.gguf -ngl 99 -p "Hello" # offload all layers
```
-### NVIDIA GPUs (CUDA)
-
-```bash
-# Build with CUDA
-make LLAMA_CUDA=1
-
-# Offload layers to GPU
-./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
-
-# Hybrid CPU+GPU for large models
-./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
+```python
+llm = Llama(
+ model_path="model.gguf",
+ n_gpu_layers=99, # Offload everything
+ n_threads=1, # Metal handles parallelism
+)
```
-### AMD GPUs (ROCm)
+Performance: M3 Max ~40–60 tok/s on Llama 2-7B Q4_K_M.
+
+### NVIDIA (CUDA)
+
+```bash
+make clean && make GGML_CUDA=1
+./llama-cli -m model.gguf -ngl 35 -p "Hello"
+
+# Hybrid for large models
+./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
+
+# Multi-GPU split
+./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
+```
+
+### AMD (ROCm)
```bash
-# Build with ROCm
make LLAMA_HIP=1
-
-# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
```
-## Common patterns
-
-### Batch processing
+### CPU
```bash
-# Process multiple prompts from file
-cat prompts.txt | ./llama-cli \
- -m model.gguf \
- --batch-size 512 \
- -n 100
+# Match PHYSICAL cores, not logical
+./llama-cli -m model.gguf -t 8 -p "Hello"
+
+# BLAS acceleration (2–3× speedup)
+make LLAMA_OPENBLAS=1
```
-### Constrained generation
-
-```bash
-# JSON output with grammar
-./llama-cli \
- -m model.gguf \
- -p "Generate a person: " \
- --grammar-file grammars/json.gbnf
-
-# Outputs valid JSON only
-```
-
-### Context size
-
-```bash
-# Increase context (default 512)
-./llama-cli \
- -m model.gguf \
- -c 4096 # 4K context window
-
-# Very long context (if model supports)
-./llama-cli -m model.gguf -c 32768 # 32K context
+```python
+llm = Llama(
+ model_path="model.gguf",
+ n_gpu_layers=0,
+ n_threads=8,
+ n_batch=512, # Larger batch = faster prompt processing
+)
```
## Performance benchmarks
-### CPU performance (Llama 2-7B Q4_K_M)
+### CPU (Llama 2-7B Q4_K_M)
-| CPU | Threads | Speed | Cost |
-|-----|---------|-------|------|
-| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
-| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
-| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
-| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
+| CPU | Threads | Speed |
+|-----|---------|-------|
+| Apple M3 Max (Metal) | 16 | 50 tok/s |
+| AMD Ryzen 9 7950X | 32 | 35 tok/s |
+| Intel i9-13900K | 32 | 30 tok/s |
-### GPU acceleration (Llama 2-7B Q4_K_M)
+### GPU offloading on RTX 4090
-| GPU | Speed | vs CPU | Cost |
-|-----|-------|--------|------|
-| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
-| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
-| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
-| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
+| Layers GPU | Speed | VRAM |
+|------------|-------|------|
+| 0 (CPU only) | 30 tok/s | 0 GB |
+| 20 (hybrid) | 80 tok/s | 8 GB |
+| 35 (all) | 120 tok/s | 12 GB |
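A back-of-the-envelope way to pick `-ngl` from the table above: assume VRAM cost scales roughly linearly with offloaded layers (about 12 GB for all 35 here) and leave headroom. A hedged sketch — the per-layer figure comes from this single benchmark, not a general constant:

```python
def max_gpu_layers(vram_gb: float, total_layers: int = 35,
                   full_offload_gb: float = 12.0, headroom_gb: float = 1.0) -> int:
    """Estimate how many layers fit in VRAM, assuming linear scaling."""
    per_layer = full_offload_gb / total_layers  # ~0.34 GB/layer in the table
    usable = max(vram_gb - headroom_gb, 0.0)
    return min(total_layers, int(usable / per_layer))

print(max_gpu_layers(8.0))   # roughly the hybrid row: 20 layers
print(max_gpu_layers(24.0))  # everything fits: 35 layers
```

Start from the estimate and reduce on OOM, as the best-practices section below suggests.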
## Supported models
-**LLaMA family**:
-- Llama 2 (7B, 13B, 70B)
-- Llama 3 (8B, 70B, 405B)
-- Code Llama
+- **LLaMA family**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama
+- **Mistral family**: Mistral 7B, Mixtral 8x7B/8x22B
+- **Other**: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)
-**Mistral family**:
-- Mistral 7B
-- Mixtral 8x7B, 8x22B
+Find GGUF models: https://huggingface.co/models?library=gguf
-**Other**:
-- Falcon, BLOOM, GPT-J
-- Phi-3, Gemma, Qwen
-- LLaVA (vision), Whisper (audio)
+## Ecosystem integrations
-**Find models**: https://huggingface.co/models?library=gguf
+### Ollama
+
+```bash
+cat > Modelfile << 'EOF'
+FROM ./model-q4_k_m.gguf
+TEMPLATE """{{ .System }}
+{{ .Prompt }}"""
+PARAMETER temperature 0.7
+PARAMETER num_ctx 4096
+EOF
+
+ollama create mymodel -f Modelfile
+ollama run mymodel "Hello!"
+```
+
+### LM Studio
+
+1. Place GGUF file in `~/.cache/lm-studio/models/`
+2. Open LM Studio and select the model
+3. Configure context length and GPU offload, then start inference
+
+### text-generation-webui
+
+```bash
+cp model-q4_k_m.gguf text-generation-webui/models/
+python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
+```
+
+### OpenAI client → llama-server
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
+response = client.chat.completions.create(
+ model="local-model",
+ messages=[{"role": "user", "content": "Hello!"}],
+ max_tokens=256,
+)
+print(response.choices[0].message.content)
+```
+
+## Best practices
+
+1. **Use K-quants** — Q4_K_M is the recommended default
+2. **Use imatrix** for Q4 and below (calibration improves quality substantially)
+3. **Offload as many layers as VRAM allows** — start high, reduce by 5 on OOM
+4. **Thread count** — match physical cores, not logical
+5. **Batch size** — increase `n_batch` (e.g. 512) for faster prompt processing
+6. **Context** — start at 4096, grow only as needed (memory scales with ctx)
+7. **Flash Attention** — add `--flash-attn` if your build supports it
+
+## Common issues (quick fixes)
+
+**Model loads slowly** — memory-mapped loading is on by default; add `--mlock` to pin the model in RAM and avoid page-outs.
+
+**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:
+```python
+Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35) # Q4_0 KV cache
+```
+
+**Garbage output** — wrong `chat_format`, temperature too high, or model file corrupted. Test with `temperature=0.1` and verify FP16 baseline works.
+
+**Connection refused (server)** — bind to `--host 0.0.0.0`, check `lsof -i :8080`.
+
+See `references/troubleshooting.md` for the full playbook.
## References
-- **[Quantization Guide](references/quantization.md)** - GGUF formats, conversion, quality comparison
-- **[Server Deployment](references/server.md)** - API endpoints, Docker, monitoring
-- **[Optimization](references/optimization.md)** - Performance tuning, hybrid CPU+GPU
+- **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
+- **[quantization.md](references/quantization.md)** — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
+- **[server.md](references/server.md)** — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
+- **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
+- **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging
## Resources
-- **GitHub**: https://github.com/ggerganov/llama.cpp
-- **Models**: https://huggingface.co/models?library=gguf
-- **Discord**: https://discord.gg/llama-cpp
-
-
+- **GitHub**: https://github.com/ggml-org/llama.cpp
+- **Python bindings**: https://github.com/abetlen/llama-cpp-python
+- **Pre-quantized models**: https://huggingface.co/TheBloke
+- **GGUF converter Space**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
+- **License**: MIT
diff --git a/skills/mlops/inference/gguf/references/advanced-usage.md b/skills/mlops/inference/llama-cpp/references/advanced-usage.md
similarity index 100%
rename from skills/mlops/inference/gguf/references/advanced-usage.md
rename to skills/mlops/inference/llama-cpp/references/advanced-usage.md
diff --git a/skills/mlops/inference/gguf/references/troubleshooting.md b/skills/mlops/inference/llama-cpp/references/troubleshooting.md
similarity index 100%
rename from skills/mlops/inference/gguf/references/troubleshooting.md
rename to skills/mlops/inference/llama-cpp/references/troubleshooting.md
diff --git a/skills/mlops/training/grpo-rl-training/README.md b/skills/mlops/training/grpo-rl-training/README.md
deleted file mode 100644
index 99b60d664..000000000
--- a/skills/mlops/training/grpo-rl-training/README.md
+++ /dev/null
@@ -1,97 +0,0 @@
-# GRPO/RL Training Skill
-
-**Expert-level guidance for Group Relative Policy Optimization with TRL**
-
-## 📁 Skill Structure
-
-```
-grpo-rl-training/
-├── SKILL.md # Main skill documentation (READ THIS FIRST)
-├── README.md # This file
-├── templates/
-│ └── basic_grpo_training.py # Production-ready training template
-└── examples/
- └── reward_functions_library.py # 20+ reward function examples
-```
-
-## 🚀 Quick Start
-
-1. **Read SKILL.md** - Comprehensive guide with all concepts and patterns
-2. **Copy `templates/basic_grpo_training.py`** - Start with working code
-3. **Browse `examples/reward_functions_library.py`** - Pick reward functions for your task
-4. **Modify for your use case** - Adapt dataset, rewards, and config
-
-## 💡 What's Inside
-
-### SKILL.md (Main Documentation)
-- Core GRPO concepts and algorithm fundamentals
-- Complete implementation workflow (dataset → rewards → training → deployment)
-- 10+ reward function examples with code
-- Hyperparameter tuning guide
-- Training insights (loss behavior, metrics, debugging)
-- Troubleshooting guide
-- Production best practices
-
-### Templates
-- **basic_grpo_training.py**: Minimal, production-ready training script
- - Uses Qwen 2.5 1.5B Instruct
- - 3 reward functions (format + correctness)
- - LoRA for efficient training
- - Fully documented and ready to run
-
-### Examples
-- **reward_functions_library.py**: 20+ battle-tested reward functions
- - Correctness rewards (exact match, fuzzy match, numeric, code execution)
- - Format rewards (XML, JSON, strict/soft)
- - Length rewards (ideal length, min/max)
- - Style rewards (reasoning quality, citations, repetition penalty)
- - Combined rewards (multi-objective optimization)
- - Preset collections for common tasks
-
-## 📖 Usage for Agents
-
-When this skill is loaded in your agent's context:
-
-1. **Always read SKILL.md first** before implementing
-2. **Start simple** - Use length-based reward to validate setup
-3. **Build incrementally** - Add one reward function at a time
-4. **Reference examples** - Copy patterns from reward_functions_library.py
-5. **Monitor training** - Watch reward metrics (not loss!)
-
-## 🎯 Common Use Cases
-
-| Task Type | Recommended Rewards | Template |
-|-----------|---------------------|----------|
-| Math reasoning | `MATH_REASONING_REWARDS` preset | basic_grpo_training.py |
-| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template |
-| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards |
-| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations |
-
-## ⚠️ Critical Reminders
-
-- **Loss goes UP during training** - This is normal (it's KL divergence)
-- **Use 3-5 reward functions** - Single rewards often fail
-- **Test rewards before training** - Debug each function independently
-- **Monitor reward_std** - Should stay > 0.1 (avoid mode collapse)
-- **Start with num_generations=4-8** - Scale up if GPU allows
-
-## 🔗 External Resources
-
-- [TRL Documentation](https://huggingface.co/docs/trl)
-- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948)
-- [Open R1 Implementation](https://github.com/huggingface/open-r1)
-- [Unsloth (2-3x faster)](https://docs.unsloth.ai/)
-
-## 📝 Version
-
-**v1.0.0** - Initial release (January 2025)
-
-## 👨💻 Maintained By
-
-Orchestra Research
-For questions or improvements, see https://orchestra.com
-
----
-
-**License:** MIT
-**Last Updated:** January 2025
diff --git a/skills/mlops/training/trl-fine-tuning/SKILL.md b/skills/mlops/training/trl-fine-tuning/SKILL.md
index 3bf4f6e12..70023fc70 100644
--- a/skills/mlops/training/trl-fine-tuning/SKILL.md
+++ b/skills/mlops/training/trl-fine-tuning/SKILL.md
@@ -252,6 +252,8 @@ trl dpo \
Train with reinforcement learning using minimal memory.
+For in-depth GRPO guidance — reward function design, critical training insights (loss behavior, mode collapse, tuning), and advanced multi-stage patterns — see **[references/grpo-training.md](references/grpo-training.md)**. A production-ready training script is in **[templates/basic_grpo_training.py](templates/basic_grpo_training.py)**.
+
Copy this checklist:
```
@@ -428,6 +430,8 @@ config = PPOConfig(
**Online RL methods**: See [references/online-rl.md](references/online-rl.md) for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.
+**GRPO deep dive**: See [references/grpo-training.md](references/grpo-training.md) for expert-level GRPO patterns — reward function design philosophy, training insights (why loss increases, mode collapse detection), hyperparameter tuning, multi-stage training, and troubleshooting. Production-ready template in [templates/basic_grpo_training.py](templates/basic_grpo_training.py).
+
## Hardware requirements
- **GPU**: NVIDIA (CUDA required)
diff --git a/skills/mlops/training/grpo-rl-training/SKILL.md b/skills/mlops/training/trl-fine-tuning/references/grpo-training.md
similarity index 56%
rename from skills/mlops/training/grpo-rl-training/SKILL.md
rename to skills/mlops/training/trl-fine-tuning/references/grpo-training.md
index 1d7629ab6..a22bd4094 100644
--- a/skills/mlops/training/grpo-rl-training/SKILL.md
+++ b/skills/mlops/training/trl-fine-tuning/references/grpo-training.md
@@ -1,51 +1,36 @@
----
-name: grpo-rl-training
-description: Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [transformers>=4.47.0, trl>=0.14.0, datasets>=3.2.0, peft>=0.14.0, torch]
-metadata:
- hermes:
- tags: [Post-Training, Reinforcement Learning, GRPO, TRL, RLHF, Reward Modeling, Reasoning, DPO, PPO, Structured Output]
+# GRPO (Group Relative Policy Optimization) — Deep Guide
----
+Expert-level patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions using TRL's `GRPOTrainer`. This is the deep reference for the GRPO workflow summarized in the main skill.
-# GRPO/RL Training with TRL
+## When to use GRPO
-Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions.
-
-## When to Use This Skill
-
-Use GRPO training when you need to:
-- **Enforce specific output formats** (e.g., XML tags, JSON, structured reasoning)
+Use GRPO when you need to:
+- **Enforce specific output formats** (XML tags, JSON, structured reasoning)
- **Teach verifiable tasks** with objective correctness metrics (math, coding, fact-checking)
- **Improve reasoning capabilities** by rewarding chain-of-thought patterns
- **Align models to domain-specific behaviors** without labeled preference data
- **Optimize for multiple objectives** simultaneously (format + correctness + style)
**Do NOT use GRPO for:**
-- Simple supervised fine-tuning tasks (use SFT instead)
+- Simple supervised fine-tuning tasks → use SFT
- Tasks without clear reward signals
-- When you already have high-quality preference pairs (use DPO/PPO instead)
+- When you already have high-quality preference pairs → use DPO/PPO
----
+## Core concepts
-## Core Concepts
+### 1. GRPO algorithm fundamentals
-### 1. GRPO Algorithm Fundamentals
-
-**Key Mechanism:**
-- Generates **multiple completions** for each prompt (group size: 4-16)
+**Key mechanism:**
+- Generates **multiple completions** per prompt (group size: 4–16)
- Compares completions within each group using reward functions
- Updates policy to favor higher-rewarded responses relative to the group
-**Critical Difference from PPO:**
+**Critical differences from PPO:**
- No separate reward model needed
- More sample-efficient (learns from within-group comparisons)
- Simpler to implement and debug
-**Mathematical Intuition:**
+**Mathematical intuition:**
```
For each prompt p:
1. Generate N completions: {c₁, c₂, ..., cₙ}
@@ -54,35 +39,32 @@ For each prompt p:
relative to low-reward ones in the same group
```
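
The update signal in the sketch above can be made concrete with a small numeric example: advantages computed as group-standardized rewards. This normalization matches common GRPO implementations, though exact details vary by trainer version.

```python
# One prompt, N=4 completions: standardize rewards within the group.
# The completion with the highest reward gets the largest positive advantage.
import statistics

rewards = [2.0, 1.0, 0.0, 1.0]
mu = statistics.mean(rewards)        # 1.0
sigma = statistics.pstdev(rewards)   # population std of the group
advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
print(advantages)  # roughly [1.41, 0.0, -1.41, 0.0]
```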
-### 2. Reward Function Design Philosophy
+### 2. Reward function design philosophy
-**Golden Rules:**
-1. **Compose multiple reward functions** - Each handles one aspect (format, correctness, style)
-2. **Scale rewards appropriately** - Higher weight = stronger signal
-3. **Use incremental rewards** - Partial credit for partial compliance
-4. **Test rewards independently** - Debug each reward function in isolation
+**Golden rules:**
+1. **Compose multiple reward functions** — each handles one aspect (format, correctness, style)
+2. **Scale rewards appropriately** — higher weight = stronger signal
+3. **Use incremental rewards** — partial credit for partial compliance
+4. **Test rewards independently** — debug each reward function in isolation
-**Reward Function Types:**
+**Reward function types:**
| Type | Use Case | Example Weight |
|------|----------|----------------|
| **Correctness** | Verifiable tasks (math, code) | 2.0 (highest) |
-| **Format** | Strict structure enforcement | 0.5-1.0 |
-| **Length** | Encourage verbosity/conciseness | 0.1-0.5 |
-| **Style** | Penalize unwanted patterns | -0.5 to 0.5 |
+| **Format** | Strict structure enforcement | 0.5–1.0 |
+| **Length** | Encourage verbosity/conciseness | 0.1–0.5 |
+| **Style** | Penalize unwanted patterns | −0.5 to 0.5 |
----
+## Implementation workflow
-## Implementation Workflow
+### Step 1: Dataset preparation
-### Step 1: Dataset Preparation
-
-**Critical Requirements:**
-- Prompts in chat format (list of dicts with 'role' and 'content')
+**Critical requirements:**
+- Prompts in chat format (list of dicts with `role` and `content`)
- Include system prompts to set expectations
- For verifiable tasks, include ground truth answers as additional columns
-**Example Structure:**
```python
from datasets import load_dataset, Dataset
@@ -97,8 +79,7 @@ Respond in the following format:
"""
def prepare_dataset(raw_data):
- """
- Transform raw data into GRPO-compatible format.
+ """Transform raw data into GRPO-compatible format.
Returns: Dataset with columns:
- 'prompt': List[Dict] with role/content (system + user messages)
@@ -113,14 +94,14 @@ def prepare_dataset(raw_data):
})
```
-**Pro Tips:**
-- Use one-shot or few-shot examples in system prompt for complex formats
-- Keep prompts concise (max_prompt_length: 256-512 tokens)
+**Pro tips:**
+- Use one-shot or few-shot examples in the system prompt for complex formats
+- Keep prompts concise (max_prompt_length: 256–512 tokens)
- Validate data quality before training (garbage in = garbage out)
-### Step 2: Reward Function Implementation
+### Step 2: Reward function implementation
-**Template Structure:**
+**Template structure:**
```python
def reward_function_name(
prompts, # List[List[Dict]]: Original prompts
@@ -128,24 +109,16 @@ def reward_function_name(
answer=None, # Optional: Ground truth from dataset
**kwargs # Additional dataset columns
) -> list[float]:
- """
- Evaluate completions and return rewards.
-
- Returns: List of floats (one per completion)
- """
- # Extract completion text
+ """Evaluate completions and return rewards (one per completion)."""
responses = [comp[0]['content'] for comp in completions]
-
- # Compute rewards
rewards = []
for response in responses:
score = compute_score(response)
rewards.append(score)
-
return rewards
```
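
Before writing task-specific rewards, it helps to validate the plumbing with a trivial reward that follows the signature above. A hypothetical length-based example (the 200-character target is arbitrary):

```python
def length_reward(prompts, completions, **kwargs):
    """Reward completions close to a target length (200 chars, illustrative)."""
    responses = [comp[0]['content'] for comp in completions]
    return [max(0.0, 1.0 - abs(len(r) - 200) / 200) for r in responses]

# Completions use the same nested chat structure TRL passes in
demo = [[{'role': 'assistant', 'content': 'x' * 200}]]
print(length_reward(None, demo))  # [1.0]
```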
-**Example 1: Correctness Reward (Math/Coding)**
+**Example 1: correctness reward (math/coding)**
```python
def correctness_reward(prompts, completions, answer, **kwargs):
"""Reward correct answers with high score."""
@@ -155,7 +128,7 @@ def correctness_reward(prompts, completions, answer, **kwargs):
for ans, gt in zip(extracted, answer)]
```
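
The correctness reward above relies on an `extract_answer` helper that is not shown in this excerpt. A minimal sketch for tag-delimited answers (the regex and tag names are assumptions matching the structured format used in this guide):

```python
import re

def extract_answer(text):
    """Pull the text between <answer> tags; return '' when the tags are absent."""
    m = re.search(r'<answer>\s*(.*?)\s*</answer>', text, re.DOTALL)
    return m.group(1) if m else ""

print(extract_answer("<reasoning>15 + 27 = 42</reasoning><answer>42</answer>"))  # 42
```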
-**Example 2: Format Reward (Structured Output)**
+**Example 2: format reward (structured output)**
```python
import re
@@ -167,7 +140,7 @@ def format_reward(completions, **kwargs):
for r in responses]
```
-**Example 3: Incremental Format Reward (Partial Credit)**
+**Example 3: incremental format reward (partial credit)**
```python
def incremental_format_reward(completions, **kwargs):
"""Award partial credit for format compliance."""
@@ -176,14 +149,10 @@ def incremental_format_reward(completions, **kwargs):
for r in responses:
score = 0.0
-        if '<reasoning>' in r:
-            score += 0.25
-        if '</reasoning>' in r:
-            score += 0.25
-        if '<answer>' in r:
-            score += 0.25
-        if '</answer>' in r:
-            score += 0.25
+        if '<reasoning>' in r: score += 0.25
+        if '</reasoning>' in r: score += 0.25
+        if '<answer>' in r: score += 0.25
+        if '</answer>' in r: score += 0.25
         # Penalize extra text after closing tag
         if r.count('</answer>') == 1:
             extra_text = r.split('</answer>')[-1].strip()
@@ -193,12 +162,11 @@ def incremental_format_reward(completions, **kwargs):
return rewards
```
-**Critical Insight:**
-Combine 3-5 reward functions for robust training. Order matters less than diversity of signals.
+**Critical insight:** Combine 3–5 reward functions for robust training. Order matters less than diversity of signals.
-### Step 3: Training Configuration
+### Step 3: Training configuration
-**Memory-Optimized Config (Small GPU)**
+**Memory-optimized config (small GPU)**
```python
from trl import GRPOConfig
@@ -218,13 +186,13 @@ training_args = GRPOConfig(
gradient_accumulation_steps=4, # Effective batch = 4
# GRPO-specific
- num_generations=8, # Group size: 8-16 recommended
+ num_generations=8, # Group size: 8–16 recommended
max_prompt_length=256,
max_completion_length=512,
# Training duration
num_train_epochs=1,
- max_steps=None, # Or set fixed steps (e.g., 500)
+ max_steps=None,
# Optimization
bf16=True, # Faster on A100/H100
@@ -234,11 +202,11 @@ training_args = GRPOConfig(
# Logging
logging_steps=1,
save_steps=100,
- report_to="wandb", # Or "none" for no logging
+ report_to="wandb",
)
```
-**High-Performance Config (Large GPU)**
+**High-performance config (large GPU)**
```python
training_args = GRPOConfig(
output_dir="outputs/grpo-model",
@@ -255,31 +223,30 @@ training_args = GRPOConfig(
)
```
-**Critical Hyperparameters:**
+**Critical hyperparameters:**
| Parameter | Impact | Tuning Advice |
|-----------|--------|---------------|
-| `num_generations` | Group size for comparison | Start with 8, increase to 16 if GPU allows |
+| `num_generations` | Group size for comparison | Start 8, increase to 16 if GPU allows |
| `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
-| `max_completion_length` | Output verbosity | Match your task (512 for reasoning, 256 for short answers) |
+| `max_completion_length` | Output verbosity | Match your task (512 reasoning, 256 short answers) |
| `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory limited |
-### Step 4: Model Setup and Training
+### Step 4: Model setup and training
-**Standard Setup (Transformers)**
+**Standard setup (Transformers + TRL)**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer
-# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
- attn_implementation="flash_attention_2", # 2-3x faster
- device_map="auto"
+ attn_implementation="flash_attention_2", # 2–3× faster
+ device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
@@ -287,17 +254,16 @@ tokenizer.pad_token = tokenizer.eos_token
# Optional: LoRA for parameter-efficient training
peft_config = LoraConfig(
- r=16, # Rank (higher = more capacity)
- lora_alpha=32, # Scaling factor (typically 2*r)
+ r=16,
+ lora_alpha=32,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
- "gate_proj", "up_proj", "down_proj"
+ "gate_proj", "up_proj", "down_proj",
],
task_type="CAUSAL_LM",
lora_dropout=0.05,
)
-# Initialize trainer
trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
@@ -308,17 +274,14 @@ trainer = GRPOTrainer(
],
args=training_args,
train_dataset=dataset,
- peft_config=peft_config, # Remove for full fine-tuning
+ peft_config=peft_config, # Remove for full fine-tuning
)
-# Train
trainer.train()
-
-# Save
trainer.save_model("final_model")
```
-**Unsloth Setup (2-3x Faster)**
+**Unsloth setup (2–3× faster)**
```python
from unsloth import FastLanguageModel
@@ -339,28 +302,26 @@ model = FastLanguageModel.get_peft_model(
use_gradient_checkpointing="unsloth",
)
-# Rest is identical to standard setup
+# Rest is identical to the standard setup
trainer = GRPOTrainer(model=model, ...)
trainer.train()
```
----
+## Critical training insights
-## Critical Training Insights
+### 1. Loss behavior (EXPECTED pattern)
+- **Loss starts near 0 and INCREASES during training** — this is CORRECT
+- Loss measures KL divergence from initial policy; the model is learning (diverging from original behavior to optimize rewards)
+- **Monitor reward metrics, not loss, for progress**
-### 1. Loss Behavior (EXPECTED PATTERN)
-- **Loss starts near 0 and INCREASES during training**
-- This is CORRECT - loss measures KL divergence from initial policy
-- Model is learning (diverging from original behavior to optimize rewards)
-- Monitor reward metrics instead of loss for progress
+### 2. Reward tracking
-### 2. Reward Tracking
Key metrics to watch:
-- `reward`: Average across all completions
-- `reward_std`: Diversity within groups (should remain > 0)
-- `kl`: KL divergence from reference (should grow moderately)
+- `reward` — average across all completions
+- `reward_std` — diversity within groups (should remain > 0)
+- `kl` — KL divergence from reference (should grow moderately)
-**Healthy Training Pattern:**
+**Healthy pattern:**
```
Step Reward Reward_Std KL
100 0.5 0.3 0.02
@@ -369,12 +330,12 @@ Step Reward Reward_Std KL
400 1.5 0.15 0.12
```
-**Warning Signs:**
-- Reward std → 0 (model collapsing to single response)
-- KL exploding (> 0.5) (diverging too much, reduce LR)
-- Reward stuck (reward functions too harsh or model capacity issue)
+**Warning signs:**
+- `reward_std` → 0 (model collapsing to a single response)
+- `kl` exploding (> 0.5) — diverging too much, reduce LR
+- Reward stuck — reward functions too harsh or model capacity issue
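
These warning signs translate into a simple automated check on logged metrics; a hedged sketch using the thresholds quoted above:

```python
def training_health(reward_std, kl):
    """Flag the warning signs above (0.1 and 0.5 thresholds are from this guide)."""
    warnings = []
    if reward_std < 0.1:
        warnings.append("mode collapse risk: reward_std < 0.1")
    if kl > 0.5:
        warnings.append("KL too high: reduce learning rate")
    return warnings

print(training_health(reward_std=0.05, kl=0.6))  # both warnings fire
```

Wire this into your logging callback and sample a few generations whenever it returns anything.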
-### 3. Common Pitfalls and Solutions
+### 3. Common pitfalls and solutions
| Problem | Symptom | Solution |
|---------|---------|----------|
@@ -384,15 +345,14 @@ Step Reward Reward_Std KL
| **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
| **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
----
+## Advanced patterns
-## Advanced Patterns
+### 1. Multi-stage training
-### 1. Multi-Stage Training
For complex tasks, train in stages:
```python
-# Stage 1: Format compliance (epochs=1)
+# Stage 1: Format compliance
trainer_stage1 = GRPOTrainer(
model=model,
reward_funcs=[incremental_format_reward, format_reward],
@@ -400,7 +360,7 @@ trainer_stage1 = GRPOTrainer(
)
trainer_stage1.train()
-# Stage 2: Correctness (epochs=1)
+# Stage 2: Correctness
trainer_stage2 = GRPOTrainer(
model=model,
reward_funcs=[format_reward, correctness_reward],
@@ -409,7 +369,8 @@ trainer_stage2 = GRPOTrainer(
trainer_stage2.train()
```
-### 2. Adaptive Reward Scaling
+### 2. Adaptive reward scaling
+
```python
class AdaptiveReward:
def __init__(self, base_reward_func, initial_weight=1.0):
@@ -428,148 +389,116 @@ class AdaptiveReward:
self.weight *= 0.9
```
-### 3. Custom Dataset Integration
+### 3. Custom dataset integration
+
```python
def load_custom_knowledge_base(csv_path):
- """Example: School communication platform docs."""
import pandas as pd
df = pd.read_csv(csv_path)
-
- dataset = Dataset.from_pandas(df).map(lambda x: {
+ return Dataset.from_pandas(df).map(lambda x: {
'prompt': [
{'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
{'role': 'user', 'content': x['question']}
],
'answer': x['expert_answer']
})
- return dataset
```
----
+## Deployment and inference
-## Deployment and Inference
-
-### Save and Merge LoRA
+### Save and merge LoRA
```python
-# Merge LoRA adapters into base model
if hasattr(trainer.model, 'merge_and_unload'):
merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained("production_model")
tokenizer.save_pretrained("production_model")
```
-### Inference Example
+### Inference
```python
from transformers import pipeline
-generator = pipeline(
- "text-generation",
- model="production_model",
- tokenizer=tokenizer
-)
+generator = pipeline("text-generation", model="production_model", tokenizer=tokenizer)
result = generator(
[
{'role': 'system', 'content': SYSTEM_PROMPT},
- {'role': 'user', 'content': "What is 15 + 27?"}
+ {'role': 'user', 'content': "What is 15 + 27?"},
],
max_new_tokens=256,
do_sample=True,
temperature=0.7,
- top_p=0.9
+ top_p=0.9,
)
print(result[0]['generated_text'])
```
----
+## Best practices checklist
-## Best Practices Checklist
-
-**Before Training:**
+**Before training:**
- [ ] Validate dataset format (prompts as List[Dict])
- [ ] Test reward functions on sample data
-- [ ] Calculate expected max_prompt_length from data
-- [ ] Choose appropriate num_generations based on GPU memory
+- [ ] Calculate expected `max_prompt_length` from data
+- [ ] Choose `num_generations` based on GPU memory
- [ ] Set up logging (wandb recommended)
-**During Training:**
+**During training:**
- [ ] Monitor reward progression (should increase)
-- [ ] Check reward_std (should stay > 0.1)
+- [ ] Check `reward_std` (should stay > 0.1)
- [ ] Watch for OOM errors (reduce batch size if needed)
-- [ ] Sample generations every 50-100 steps
+- [ ] Sample generations every 50–100 steps
- [ ] Validate format compliance on holdout set
-**After Training:**
+**After training:**
- [ ] Merge LoRA weights if using PEFT
- [ ] Test on diverse prompts
- [ ] Compare to baseline model
- [ ] Document reward weights and hyperparameters
- [ ] Save reproducibility config
----
+## Troubleshooting
-## Troubleshooting Guide
+### Debugging workflow
+1. **Isolate reward functions** — test each independently
+2. **Check data distribution** — ensure diversity in prompts
+3. **Reduce complexity** — start with single reward, add gradually
+4. **Monitor generations** — print samples every N steps
+5. **Validate extraction logic** — ensure answer parsing works
-### Debugging Workflow
-1. **Isolate reward functions** - Test each independently
-2. **Check data distribution** - Ensure diversity in prompts
-3. **Reduce complexity** - Start with single reward, add gradually
-4. **Monitor generations** - Print samples every N steps
-5. **Validate extraction logic** - Ensure answer parsing works
-
-### Quick Fixes
+### Quick debug reward
```python
-# Debug reward function
def debug_reward(completions, **kwargs):
responses = [comp[0]['content'] for comp in completions]
- for i, r in enumerate(responses[:2]): # Print first 2
+ for i, r in enumerate(responses[:2]):
print(f"Response {i}: {r[:200]}...")
- return [1.0] * len(responses) # Dummy rewards
+ return [1.0] * len(responses)
# Test without training
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
-trainer.generate_completions(dataset[:1]) # Generate without updating
+debug_reward([[{'role': 'assistant', 'content': 'sample output'}]])  # call the reward directly; no training step needed
```
----
+## Template
-## References and Resources
+A production-ready training script lives at **`../templates/basic_grpo_training.py`**. It uses Qwen 2.5-1.5B-Instruct with LoRA and three reward functions (incremental format, strict format, correctness) on GSM8K. Copy and adapt:
+1. `get_dataset()` — swap in your data loader
+2. Reward functions — tune to your task
+3. `SYSTEM_PROMPT` — match your output format
+4. `GRPOConfig` — adjust hyperparameters for your GPU
+
+## References and resources
-**Official Documentation:**
- TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
-- DeepSeek R1 Paper: https://arxiv.org/abs/2501.12948
-- Unsloth Docs: https://docs.unsloth.ai/
-
-**Example Repositories:**
-- Open R1 Implementation: https://github.com/huggingface/open-r1
-- TRL Examples: https://github.com/huggingface/trl/tree/main/examples
-
-**Recommended Reading:**
-- Progressive Disclosure Pattern for agent instructions
-- Reward shaping in RL (Ng et al.)
-- LoRA paper (Hu et al., 2021)
-
----
-
-## Usage Instructions for Agents
-
-When this skill is loaded:
-
-1. **Read this entire file** before implementing GRPO training
-2. **Start with the simplest reward function** (e.g., length-based) to validate setup
-3. **Use the templates** in `templates/` directory as starting points
-4. **Reference examples** in `examples/` for task-specific implementations
-5. **Follow the workflow** sequentially (don't skip steps)
-6. **Debug incrementally** - add one reward function at a time
-
-**Critical Reminders:**
-- Always use multiple reward functions (3-5 is optimal)
-- Monitor reward metrics, not loss
-- Test reward functions before training
-- Start small (num_generations=4), scale up gradually
-- Save checkpoints frequently (every 100 steps)
-
-This skill is designed for **expert-level implementation**. Beginners should start with supervised fine-tuning before attempting GRPO.
-
+- GRPO paper (DeepSeek): https://arxiv.org/abs/2402.03300
+- DeepSeek R1 paper: https://arxiv.org/abs/2501.12948
+- Open R1 implementation: https://github.com/huggingface/open-r1
+- TRL examples: https://github.com/huggingface/trl/tree/main/examples
+- Unsloth (faster training): https://docs.unsloth.ai/
+## Critical reminders
+- **Loss goes UP during training** — this is normal (it's KL divergence)
+- **Use 3–5 reward functions** — single rewards often fail
+- **Test rewards before training** — debug each function independently
+- **Monitor `reward_std`** — should stay > 0.1 (avoid mode collapse)
+- **Start with `num_generations=4–8`** — scale up if GPU allows
diff --git a/skills/mlops/training/grpo-rl-training/templates/basic_grpo_training.py b/skills/mlops/training/trl-fine-tuning/templates/basic_grpo_training.py
similarity index 100%
rename from skills/mlops/training/grpo-rl-training/templates/basic_grpo_training.py
rename to skills/mlops/training/trl-fine-tuning/templates/basic_grpo_training.py
diff --git a/website/docs/reference/optional-skills-catalog.md b/website/docs/reference/optional-skills-catalog.md
index 6fde99b5e..bbb2c3b80 100644
--- a/website/docs/reference/optional-skills-catalog.md
+++ b/website/docs/reference/optional-skills-catalog.md
@@ -98,6 +98,7 @@ The largest optional category — covers the full ML pipeline from data curation
| **chroma** | Open-source embedding database. Store embeddings and metadata, perform vector and full-text search. Simple 4-function API for RAG and semantic search. |
| **faiss** | Facebook's library for efficient similarity search and clustering of dense vectors. Supports billions of vectors, GPU acceleration, and various index types (Flat, IVF, HNSW). |
| **flash-attention** | Optimize transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Supports PyTorch SDPA, flash-attn library, H100 FP8, and sliding window. |
+| **guidance** | Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance — Microsoft Research's constrained generation framework. |
| **hermes-atropos-environments** | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, and evaluation. |
| **huggingface-tokenizers** | Fast Rust-based tokenizers for research and production. Tokenizes 1GB in under 20 seconds. Supports BPE, WordPiece, and Unigram algorithms. |
| **instructor** | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, and stream partial results. |
diff --git a/website/docs/reference/skills-catalog.md b/website/docs/reference/skills-catalog.md
index 13ef2f7fc..ead50dbea 100644
--- a/website/docs/reference/skills-catalog.md
+++ b/website/docs/reference/skills-catalog.md
@@ -163,10 +163,8 @@ Model serving, quantization (GGUF/GPTQ), structured output, inference optimizati
| Skill | Description | Path |
|-------|-------------|------|
-| `gguf-quantization` | GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements. | `mlops/inference/gguf` |
-| `guidance` | Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework | `mlops/inference/guidance` |
| `instructor` | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, parse complex JSON with type safety, and stream partial results with Instructor - battle-tested structured output library | `mlops/inference/instructor` |
-| `llama-cpp` | Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU. | `mlops/inference/llama-cpp` |
+| `llama-cpp` | Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization. | `mlops/inference/llama-cpp` |
| `obliteratus` | Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, LEACE, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods, 28 analysis modules, 116 model presets ac… | `mlops/inference/obliteratus` |
| `outlines` | Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library | `mlops/inference/outlines` |
| `serving-llms-vllm` | Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), an… | `mlops/inference/vllm` |
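The merged `llama-cpp` entry above advertises 2–8 bit quantization; a quick way to reason about the resulting file sizes is effective bits-per-weight. A minimal back-of-envelope sketch (the bits-per-weight figures below are approximate, illustrative values, not exact llama.cpp output, and real GGUF files add small metadata overhead):

```python
def gguf_size_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GiB for a quantized model.

    n_params_billion: parameter count in billions.
    bits_per_weight: effective bits per weight for the quant type
    (approximate; actual GGUF files carry extra tokenizer/metadata bytes).
    """
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

# Approximate effective bits-per-weight for a few llama.cpp quant types
# (illustrative values; exact figures vary by model architecture).
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85, "Q2_K": 2.63}

for quant, bpw in BPW.items():
    print(f"7B @ {quant}: ~{gguf_size_gib(7, bpw):.1f} GiB")
```

This makes the catalog claim concrete: a 7B model drops from roughly 13 GiB at F16 to around 4 GiB at Q4_K_M, which is what puts it in reach of consumer laptops.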
@@ -202,7 +200,6 @@ Fine-tuning, RLHF/DPO/GRPO training, distributed training frameworks, and optimi
| `axolotl` | Expert guidance for fine-tuning LLMs with Axolotl - YAML configs, 100+ models, LoRA/QLoRA, DPO/KTO/ORPO/GRPO, multimodal support | `mlops/training/axolotl` |
| `distributed-llm-pretraining-torchtitan` | Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing. | `mlops/training/torchtitan` |
| `fine-tuning-with-trl` | Fine-tune LLMs using reinforcement learning with TRL - SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when you need RLHF, want to align a model with preferences, or train from human feedback. Works with HuggingFace Tr… | `mlops/training/trl-fine-tuning` |
-| `grpo-rl-training` | Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training | `mlops/training/grpo-rl-training` |
| `hermes-atropos-environments` | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/evaluate). Use when creating, reviewing, or f… | `mlops/training/hermes-atropos-environments` |
| `huggingface-accelerate` | Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard. | `mlops/training/accelerate` |
| `optimizing-attention-flash` | Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when training/running transformers with long sequences (>512 tokens), encountering GPU memory issues with attention, or need faster inference. Supports PyTorch native SDPA,… | `mlops/training/flash-attention` |
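The `fine-tuning-with-trl` row above now absorbs the GRPO workflow; the core of that workflow is that TRL's `GRPOTrainer` scores sampled completions with plain-Python reward callables (passed via `reward_funcs`). A toy sketch, assuming the standard non-conversational format where completions arrive as strings; the function name, regex, and scoring scheme are illustrative placeholders, not TRL API:

```python
import re

def format_reward(completions: list[str], **kwargs) -> list[float]:
    """Toy GRPO-style reward: 1.0 if the completion ends with a
    \\boxed{...} answer, else 0.0. Only the completions -> list[float]
    shape mirrors TRL's reward-function interface; the pattern and
    scoring here are hypothetical."""
    pattern = re.compile(r"\\boxed\{[^}]+\}\s*$")
    return [1.0 if pattern.search(c) else 0.0 for c in completions]

print(format_reward([r"The answer is \boxed{42}", "I am not sure."]))
# [1.0, 0.0]
```

Because the reward is just a callable, task-specific shaping (verifier calls, unit tests, length penalties) slots in without touching the trainer itself, which is why GRPO fits inside TRL rather than warranting a separate skill.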