From 73bccc94c7af3a07b4002c2a14a4b54f844bd561 Mon Sep 17 00:00:00 2001 From: Teknium <127238744+teknium1@users.noreply.github.com> Date: Fri, 17 Apr 2026 21:36:40 -0700 Subject: [PATCH] =?UTF-8?q?skills:=20consolidate=20mlops=20redundancies=20?= =?UTF-8?q?(gguf+llama-cpp,=20grpo+trl,=20guidance=E2=86=92optional)=20(#1?= =?UTF-8?q?1965)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three tightly-scoped built-in skill consolidations to reduce redundancy in the available_skills listing injected into every system prompt: 1. gguf-quantization → llama-cpp (merged) GGUF is llama.cpp's format; two skills covered the same toolchain. The merged llama-cpp skill keeps the full K-quant table + imatrix workflow from gguf and the ROCm/benchmarks/supported-models sections from the original llama-cpp. All 5 reference files preserved. 2. grpo-rl-training → fine-tuning-with-trl (folded in) GRPO isn't a framework, it's a trainer inside TRL. Moved the 17KB deep-dive SKILL.md to references/grpo-training.md and the working template to templates/basic_grpo_training.py. TRL's GRPO workflow section now points to both. Atropos skill's related_skills updated. 3. guidance → optional-skills/mlops/ Dropped from built-in. Outlines (still built-in) covers the same structured-generation ground with wider adoption. Listed in the optional catalog for users who specifically want Guidance. Net: 3 fewer built-in skill lines in every system prompt, zero content loss. Contributor authorship preserved via git rename detection. 
--- .../mlops}/guidance/SKILL.md | 0 .../mlops}/guidance/references/backends.md | 0 .../mlops}/guidance/references/constraints.md | 0 .../mlops}/guidance/references/examples.md | 0 .../hermes-atropos-environments/SKILL.md | 2 +- skills/mlops/inference/gguf/SKILL.md | 430 --------------- skills/mlops/inference/llama-cpp/SKILL.md | 491 ++++++++++++------ .../references/advanced-usage.md | 0 .../references/troubleshooting.md | 0 .../mlops/training/grpo-rl-training/README.md | 97 ---- .../mlops/training/trl-fine-tuning/SKILL.md | 4 + .../references/grpo-training.md} | 329 +++++------- .../templates/basic_grpo_training.py | 0 .../docs/reference/optional-skills-catalog.md | 1 + website/docs/reference/skills-catalog.md | 5 +- 15 files changed, 470 insertions(+), 889 deletions(-) rename {skills/mlops/inference => optional-skills/mlops}/guidance/SKILL.md (100%) rename {skills/mlops/inference => optional-skills/mlops}/guidance/references/backends.md (100%) rename {skills/mlops/inference => optional-skills/mlops}/guidance/references/constraints.md (100%) rename {skills/mlops/inference => optional-skills/mlops}/guidance/references/examples.md (100%) delete mode 100644 skills/mlops/inference/gguf/SKILL.md rename skills/mlops/inference/{gguf => llama-cpp}/references/advanced-usage.md (100%) rename skills/mlops/inference/{gguf => llama-cpp}/references/troubleshooting.md (100%) delete mode 100644 skills/mlops/training/grpo-rl-training/README.md rename skills/mlops/training/{grpo-rl-training/SKILL.md => trl-fine-tuning/references/grpo-training.md} (56%) rename skills/mlops/training/{grpo-rl-training => trl-fine-tuning}/templates/basic_grpo_training.py (100%) diff --git a/skills/mlops/inference/guidance/SKILL.md b/optional-skills/mlops/guidance/SKILL.md similarity index 100% rename from skills/mlops/inference/guidance/SKILL.md rename to optional-skills/mlops/guidance/SKILL.md diff --git a/skills/mlops/inference/guidance/references/backends.md 
b/optional-skills/mlops/guidance/references/backends.md similarity index 100% rename from skills/mlops/inference/guidance/references/backends.md rename to optional-skills/mlops/guidance/references/backends.md diff --git a/skills/mlops/inference/guidance/references/constraints.md b/optional-skills/mlops/guidance/references/constraints.md similarity index 100% rename from skills/mlops/inference/guidance/references/constraints.md rename to optional-skills/mlops/guidance/references/constraints.md diff --git a/skills/mlops/inference/guidance/references/examples.md b/optional-skills/mlops/guidance/references/examples.md similarity index 100% rename from skills/mlops/inference/guidance/references/examples.md rename to optional-skills/mlops/guidance/references/examples.md diff --git a/optional-skills/mlops/hermes-atropos-environments/SKILL.md b/optional-skills/mlops/hermes-atropos-environments/SKILL.md index 9dff46687..5101886b4 100644 --- a/optional-skills/mlops/hermes-atropos-environments/SKILL.md +++ b/optional-skills/mlops/hermes-atropos-environments/SKILL.md @@ -7,7 +7,7 @@ license: MIT metadata: hermes: tags: [atropos, rl, environments, training, reinforcement-learning, reward-functions] - related_skills: [axolotl, grpo-rl-training, trl-fine-tuning, lm-evaluation-harness] + related_skills: [axolotl, fine-tuning-with-trl, lm-evaluation-harness] --- # Hermes Agent Atropos Environments diff --git a/skills/mlops/inference/gguf/SKILL.md b/skills/mlops/inference/gguf/SKILL.md deleted file mode 100644 index 21bb176c8..000000000 --- a/skills/mlops/inference/gguf/SKILL.md +++ /dev/null @@ -1,430 +0,0 @@ ---- -name: gguf-quantization -description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements. 
-version: 1.0.0 -author: Orchestra Research -license: MIT -dependencies: [llama-cpp-python>=0.2.0] -metadata: - hermes: - tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization] - ---- - -# GGUF - Quantization Format for llama.cpp - -The GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options. - -## When to use GGUF - -**Use GGUF when:** -- Deploying on consumer hardware (laptops, desktops) -- Running on Apple Silicon (M1/M2/M3) with Metal acceleration -- Need CPU inference without GPU requirements -- Want flexible quantization (Q2_K to Q8_0) -- Using local AI tools (LM Studio, Ollama, text-generation-webui) - -**Key advantages:** -- **Universal hardware**: CPU, Apple Silicon, NVIDIA, AMD support -- **No Python runtime**: Pure C/C++ inference -- **Flexible quantization**: 2-8 bit with various methods (K-quants) -- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more -- **imatrix**: Importance matrix for better low-bit quality - -**Use alternatives instead:** -- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs -- **HQQ**: Fast calibration-free quantization for HuggingFace -- **bitsandbytes**: Simple integration with transformers library -- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed - -## Quick start - -### Installation - -```bash -# Clone llama.cpp -git clone https://github.com/ggml-org/llama.cpp -cd llama.cpp - -# Build (CPU) -make - -# Build with CUDA (NVIDIA) -make GGML_CUDA=1 - -# Build with Metal (Apple Silicon) -make GGML_METAL=1 - -# Install Python bindings (optional) -pip install llama-cpp-python -``` - -### Convert model to GGUF - -```bash -# Install requirements -pip install -r requirements.txt - -# Convert HuggingFace model to GGUF (FP16) -python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf - -# Or specify output type -python 
convert_hf_to_gguf.py ./path/to/model \ - --outfile model-f16.gguf \ - --outtype f16 -``` - -### Quantize model - -```bash -# Basic quantization to Q4_K_M -./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M - -# Quantize with importance matrix (better quality) -./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix -./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M -``` - -### Run inference - -```bash -# CLI inference -./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?" - -# Interactive mode -./llama-cli -m model-q4_k_m.gguf --interactive - -# With GPU offload -./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!" -``` - -## Quantization types - -### K-quant methods (recommended) - -| Type | Bits | Size (7B) | Quality | Use Case | -|------|------|-----------|---------|----------| -| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression | -| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained | -| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance | -| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance | -| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** | -| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused | -| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality | -| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original | -| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality | - -### Legacy methods - -| Type | Description | -|------|-------------| -| Q4_0 | 4-bit, basic | -| Q4_1 | 4-bit with delta | -| Q5_0 | 5-bit, basic | -| Q5_1 | 5-bit with delta | - -**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio. - -## Conversion workflows - -### Workflow 1: HuggingFace to GGUF - -```bash -# 1. Download model -huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b - -# 2. Convert to GGUF (FP16) -python convert_hf_to_gguf.py ./llama-3.1-8b \ - --outfile llama-3.1-8b-f16.gguf \ - --outtype f16 - -# 3. 
Quantize -./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M - -# 4. Test -./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50 -``` - -### Workflow 2: With importance matrix (better quality) - -```bash -# 1. Convert to GGUF -python convert_hf_to_gguf.py ./model --outfile model-f16.gguf - -# 2. Create calibration text (diverse samples) -cat > calibration.txt << 'EOF' -The quick brown fox jumps over the lazy dog. -Machine learning is a subset of artificial intelligence. -Python is a popular programming language. -# Add more diverse text samples... -EOF - -# 3. Generate importance matrix -./llama-imatrix -m model-f16.gguf \ - -f calibration.txt \ - --chunk 512 \ - -o model.imatrix \ - -ngl 35 # GPU layers if available - -# 4. Quantize with imatrix -./llama-quantize --imatrix model.imatrix \ - model-f16.gguf \ - model-q4_k_m.gguf \ - Q4_K_M -``` - -### Workflow 3: Multiple quantizations - -```bash -#!/bin/bash -MODEL="llama-3.1-8b-f16.gguf" -IMATRIX="llama-3.1-8b.imatrix" - -# Generate imatrix once -./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35 - -# Create multiple quantizations -for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do - OUTPUT="llama-3.1-8b-${QUANT,,}.gguf" - ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT - echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))" -done -``` - -## Python usage - -### llama-cpp-python - -```python -from llama_cpp import Llama - -# Load model -llm = Llama( - model_path="./model-q4_k_m.gguf", - n_ctx=4096, # Context window - n_gpu_layers=35, # GPU offload (0 for CPU only) - n_threads=8 # CPU threads -) - -# Generate -output = llm( - "What is machine learning?", - max_tokens=256, - temperature=0.7, - stop=["</s>", "\n\n"] -) -print(output["choices"][0]["text"]) -``` - -### Chat completion - -```python -from llama_cpp import Llama - -llm = Llama( - model_path="./model-q4_k_m.gguf", - n_ctx=4096, - n_gpu_layers=35, - chat_format="llama-3" # Or "chatml", "mistral", etc. 
-) - -messages = [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "What is Python?"} -] - -response = llm.create_chat_completion( - messages=messages, - max_tokens=256, - temperature=0.7 -) -print(response["choices"][0]["message"]["content"]) -``` - -### Streaming - -```python -from llama_cpp import Llama - -llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35) - -# Stream tokens -for chunk in llm( - "Explain quantum computing:", - max_tokens=256, - stream=True -): - print(chunk["choices"][0]["text"], end="", flush=True) -``` - -## Server mode - -### Start OpenAI-compatible server - -```bash -# Start server -./llama-server -m model-q4_k_m.gguf \ - --host 0.0.0.0 \ - --port 8080 \ - -ngl 35 \ - -c 4096 - -# Or with Python bindings -python -m llama_cpp.server \ - --model model-q4_k_m.gguf \ - --n_gpu_layers 35 \ - --host 0.0.0.0 \ - --port 8080 -``` - -### Use with OpenAI client - -```python -from openai import OpenAI - -client = OpenAI( - base_url="http://localhost:8080/v1", - api_key="not-needed" -) - -response = client.chat.completions.create( - model="local-model", - messages=[{"role": "user", "content": "Hello!"}], - max_tokens=256 -) -print(response.choices[0].message.content) -``` - -## Hardware optimization - -### Apple Silicon (Metal) - -```bash -# Build with Metal -make clean && make GGML_METAL=1 - -# Run with Metal acceleration -./llama-cli -m model.gguf -ngl 99 -p "Hello" - -# Python with Metal -llm = Llama( - model_path="model.gguf", - n_gpu_layers=99, # Offload all layers - n_threads=1 # Metal handles parallelism -) -``` - -### NVIDIA CUDA - -```bash -# Build with CUDA -make clean && make GGML_CUDA=1 - -# Run with CUDA -./llama-cli -m model.gguf -ngl 35 -p "Hello" - -# Specify GPU -CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35 -``` - -### CPU optimization - -```bash -# Build with AVX2/AVX512 -make clean && make - -# Run with optimal threads -./llama-cli -m model.gguf -t 8 -p "Hello" - 
-# Python CPU config -llm = Llama( - model_path="model.gguf", - n_gpu_layers=0, # CPU only - n_threads=8, # Match physical cores - n_batch=512 # Batch size for prompt processing -) -``` - -## Integration with tools - -### Ollama - -```bash -# Create Modelfile -cat > Modelfile << 'EOF' -FROM ./model-q4_k_m.gguf -TEMPLATE """{{ .System }} -{{ .Prompt }}""" -PARAMETER temperature 0.7 -PARAMETER num_ctx 4096 -EOF - -# Create Ollama model -ollama create mymodel -f Modelfile - -# Run -ollama run mymodel "Hello!" -``` - -### LM Studio - -1. Place GGUF file in `~/.cache/lm-studio/models/` -2. Open LM Studio and select the model -3. Configure context length and GPU offload -4. Start inference - -### text-generation-webui - -```bash -# Place in models folder -cp model-q4_k_m.gguf text-generation-webui/models/ - -# Start with llama.cpp loader -python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35 -``` - -## Best practices - -1. **Use K-quants**: Q4_K_M offers best quality/size balance -2. **Use imatrix**: Always use importance matrix for Q4 and below -3. **GPU offload**: Offload as many layers as VRAM allows -4. **Context length**: Start with 4096, increase if needed -5. **Thread count**: Match physical CPU cores, not logical -6. 
**Batch size**: Increase n_batch for faster prompt processing - -## Common issues - -**Model loads slowly:** -```bash -# Use mmap for faster loading -./llama-cli -m model.gguf --mmap -``` - -**Out of memory:** -```bash -# Reduce GPU layers -./llama-cli -m model.gguf -ngl 20 # Reduce from 35 - -# Or use smaller quantization -./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M -``` - -**Poor quality at low bits:** -```bash -# Always use imatrix for Q4 and below -./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix -./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M -``` - -## References - -- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds -- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks - -## Resources - -- **Repository**: https://github.com/ggml-org/llama.cpp -- **Python Bindings**: https://github.com/abetlen/llama-cpp-python -- **Pre-quantized Models**: https://huggingface.co/TheBloke -- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo -- **License**: MIT diff --git a/skills/mlops/inference/llama-cpp/SKILL.md b/skills/mlops/inference/llama-cpp/SKILL.md index 57016c920..33fc37adb 100644 --- a/skills/mlops/inference/llama-cpp/SKILL.md +++ b/skills/mlops/inference/llama-cpp/SKILL.md @@ -1,138 +1,271 @@ --- name: llama-cpp -description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU. -version: 1.0.0 +description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. 
Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization. +version: 2.0.0 author: Orchestra Research license: MIT -dependencies: [llama-cpp-python] +dependencies: [llama-cpp-python>=0.2.0] metadata: hermes: - tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded] - + tags: [llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression] --- -# llama.cpp +# llama.cpp + GGUF -Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware. +Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving. -## When to use llama.cpp +## When to use -**Use llama.cpp when:** -- Running on CPU-only machines -- Deploying on Apple Silicon (M1/M2/M3/M4) -- Using AMD or Intel GPUs (no CUDA) -- Edge deployment (Raspberry Pi, embedded systems) -- Need simple deployment without Docker/Python +**Use llama.cpp + GGUF when:** +- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration +- Using AMD (ROCm) or Intel GPUs where CUDA isn't available +- Edge deployment (Raspberry Pi, embedded systems, consumer laptops) +- Need flexible quantization (2–8 bit with K-quants) +- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp) +- Want a single binary deploy without Docker/Python -**Use TensorRT-LLM instead when:** -- Have NVIDIA GPUs (A100/H100) -- Need maximum throughput (100K+ tok/s) -- Running in datacenter with CUDA +**Key advantages:** +- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel +- No Python runtime required (pure C/C++) +- K-quants + imatrix for better low-bit quality +- OpenAI-compatible server built in +- Rich ecosystem (Ollama, LM Studio, 
llama-cpp-python) -**Use vLLM instead when:** -- Have NVIDIA GPUs -- Need Python-first API -- Want PagedAttention +**Use alternatives instead:** +- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput +- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed +- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments +- **bitsandbytes** — Simple HuggingFace transformers integration +- **HQQ** — Fast calibration-free quantization ## Quick start -### Installation +### Install ```bash -# macOS/Linux +# macOS / Linux (simplest) brew install llama.cpp # Or build from source -git clone https://github.com/ggerganov/llama.cpp +git clone https://github.com/ggml-org/llama.cpp cd llama.cpp -make +make # CPU +make GGML_METAL=1 # Apple Silicon +make GGML_CUDA=1 # NVIDIA CUDA +make LLAMA_HIP=1 # AMD ROCm -# With Metal (Apple Silicon) -make LLAMA_METAL=1 - -# With CUDA (NVIDIA) -make LLAMA_CUDA=1 - -# With ROCm (AMD) -make LLAMA_HIP=1 +# Python bindings (optional) +pip install llama-cpp-python +# With CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir +# With Metal: CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir ``` -### Download model +### Download a pre-quantized GGUF ```bash -# Download from HuggingFace (GGUF format) +# TheBloke hosts most popular models pre-quantized huggingface-cli download \ TheBloke/Llama-2-7B-Chat-GGUF \ llama-2-7b-chat.Q4_K_M.gguf \ --local-dir models/ +``` -# Or convert from HuggingFace -python convert_hf_to_gguf.py models/llama-2-7b-chat/ +### Or convert a HuggingFace model to GGUF + +```bash +# 1. Download HF model +huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b + +# 2. Convert to FP16 GGUF +python convert_hf_to_gguf.py ./llama-3.1-8b \ + --outfile llama-3.1-8b-f16.gguf \ + --outtype f16 + +# 3. 
Quantize to Q4_K_M +./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M ``` ### Run inference ```bash -# Simple chat -./llama-cli \ - -m models/llama-2-7b-chat.Q4_K_M.gguf \ - -p "Explain quantum computing" \ - -n 256 # Max tokens +# One-shot prompt +./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256 # Interactive chat -./llama-cli \ - -m models/llama-2-7b-chat.Q4_K_M.gguf \ - --interactive +./llama-cli -m model.Q4_K_M.gguf --interactive + +# With GPU offload +./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!" ``` -### Server mode +### Serve an OpenAI-compatible API ```bash -# Start OpenAI-compatible server ./llama-server \ - -m models/llama-2-7b-chat.Q4_K_M.gguf \ + -m model.Q4_K_M.gguf \ --host 0.0.0.0 \ --port 8080 \ - -ngl 32 # Offload 32 layers to GPU + -ngl 35 \ + -c 4096 \ + --parallel 4 \ + --cont-batching +``` -# Client request +```bash curl http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "llama-2-7b-chat", + "model": "local", "messages": [{"role": "user", "content": "Hello!"}], "temperature": 0.7, "max_tokens": 100 }' ``` -## Quantization formats +## Quantization formats (GGUF) -### GGUF format overview +### K-quant methods (recommended) -| Format | Bits | Size (7B) | Speed | Quality | Use Case | -|--------|------|-----------|-------|---------|----------| -| **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** | -| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical | -| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical | -| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality | -| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation | -| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only | +| Type | Bits | Size (7B) | Quality | Use Case | +|------|------|-----------|---------|----------| +| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) | +| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained | +| Q3_K_M | 3.3 
| ~3.3 GB | Medium | Fits small devices | +| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical | +| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** | +| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused | +| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality | +| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original | +| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation | -### Choosing quantization +**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality). + +**Legacy (Q4_0/Q4_1/Q5_0/Q5_1) exist** but always prefer K-quants for better quality/size ratio. + +**IQ quantization** — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. Require `--imatrix`. + +**Task-specific defaults:** +- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows +- Code generation: Q5_K_M or Q6_K (higher precision helps) +- Technical / medical: Q6_K or Q8_0 +- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S +- Raspberry Pi / edge: Q2_K or Q3_K_S + +## Conversion workflows + +### Basic: HF → GGUF → quantized ```bash -# General use (balanced) -Q4_K_M # 4-bit, medium quality +python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16 +./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M +./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50 +``` -# Maximum speed (more degradation) -Q2_K or Q3_K_M +### With importance matrix (imatrix) — better low-bit quality -# Maximum quality (slower) -Q6_K or Q8_0 +`imatrix` gives 10–20% perplexity improvement at Q4, essential at Q3 and below. -# Very large models (70B, 405B) -Q3_K_M or Q4_K_S # Lower bits to fit in memory +```bash +# 1. Convert to FP16 GGUF +python convert_hf_to_gguf.py ./model --outfile model-f16.gguf + +# 2. Prepare calibration data (diverse text, ~100MB is ideal) +cat > calibration.txt << 'EOF' +The quick brown fox jumps over the lazy dog. 
+Machine learning is a subset of artificial intelligence. +# Add more diverse text samples... +EOF + +# 3. Generate importance matrix +./llama-imatrix -m model-f16.gguf \ + -f calibration.txt \ + --chunk 512 \ + -o model.imatrix \ + -ngl 35 + +# 4. Quantize with imatrix +./llama-quantize --imatrix model.imatrix \ + model-f16.gguf model-q4_k_m.gguf Q4_K_M +``` + +### Multi-quant batch + +```bash +#!/bin/bash +MODEL="llama-3.1-8b-f16.gguf" +IMATRIX="llama-3.1-8b.imatrix" + +./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35 + +for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do + OUTPUT="llama-3.1-8b-${QUANT,,}.gguf" + ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT + echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))" +done +``` + +### Quality testing (perplexity) + +```bash +./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512 +# Baseline FP16: ~5.96 | Q4_K_M: ~6.06 (+1.7%) | Q2_K: ~6.87 (+15.3%) +``` + +## Python bindings (llama-cpp-python) + +### Basic generation + +```python +from llama_cpp import Llama + +llm = Llama( + model_path="./model-q4_k_m.gguf", + n_ctx=4096, + n_gpu_layers=35, # 0 for CPU only, 99 to offload everything + n_threads=8, +) + +output = llm( + "What is machine learning?", + max_tokens=256, + temperature=0.7, + stop=["</s>", "\n\n"], +) +print(output["choices"][0]["text"]) +``` + +### Chat completion + streaming + +```python +llm = Llama( + model_path="./model-q4_k_m.gguf", + n_ctx=4096, + n_gpu_layers=35, + chat_format="llama-3", # Or "chatml", "mistral", etc. 
+) + +# Non-streaming +response = llm.create_chat_completion( + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "What is Python?"}, + ], + max_tokens=256, + temperature=0.7, +) +print(response["choices"][0]["message"]["content"]) + +# Streaming +for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True): + print(chunk["choices"][0]["text"], end="", flush=True) +``` + +### Embeddings + +```python +llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35) +vec = llm.embed("This is a test sentence.") +print(f"Embedding dimension: {len(vec)}") ``` ## Hardware acceleration @@ -140,122 +273,166 @@ Q3_K_M or Q4_K_S # Lower bits to fit in memory ### Apple Silicon (Metal) ```bash -# Build with Metal -make LLAMA_METAL=1 - -# Run with GPU acceleration (automatic) -./llama-cli -m model.gguf -ngl 999 # Offload all layers - -# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M) +make clean && make GGML_METAL=1 +./llama-cli -m model.gguf -ngl 99 -p "Hello" # offload all layers ``` -### NVIDIA GPUs (CUDA) - -```bash -# Build with CUDA -make LLAMA_CUDA=1 - -# Offload layers to GPU -./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers - -# Hybrid CPU+GPU for large models -./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest +```python +llm = Llama( + model_path="model.gguf", + n_gpu_layers=99, # Offload everything + n_threads=1, # Metal handles parallelism +) ``` -### AMD GPUs (ROCm) +Performance: M3 Max ~40–60 tok/s on Llama 2-7B Q4_K_M. 
+ +### NVIDIA (CUDA) + +```bash +make clean && make GGML_CUDA=1 +./llama-cli -m model.gguf -ngl 35 -p "Hello" + +# Hybrid for large models +./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest + +# Multi-GPU split +./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60 +``` + +### AMD (ROCm) ```bash -# Build with ROCm make LLAMA_HIP=1 - -# Run with AMD GPU ./llama-cli -m model.gguf -ngl 999 ``` -## Common patterns - -### Batch processing +### CPU ```bash -# Process multiple prompts from file -cat prompts.txt | ./llama-cli \ - -m model.gguf \ - --batch-size 512 \ - -n 100 +# Match PHYSICAL cores, not logical +./llama-cli -m model.gguf -t 8 -p "Hello" + +# BLAS acceleration (2–3× speedup) +make LLAMA_OPENBLAS=1 ``` -### Constrained generation - -```bash -# JSON output with grammar -./llama-cli \ - -m model.gguf \ - -p "Generate a person: " \ - --grammar-file grammars/json.gbnf - -# Outputs valid JSON only -``` - -### Context size - -```bash -# Increase context (default 512) -./llama-cli \ - -m model.gguf \ - -c 4096 # 4K context window - -# Very long context (if model supports) -./llama-cli -m model.gguf -c 32768 # 32K context +```python +llm = Llama( + model_path="model.gguf", + n_gpu_layers=0, + n_threads=8, + n_batch=512, # Larger batch = faster prompt processing +) ``` ## Performance benchmarks -### CPU performance (Llama 2-7B Q4_K_M) +### CPU (Llama 2-7B Q4_K_M) -| CPU | Threads | Speed | Cost | -|-----|---------|-------|------| -| Apple M3 Max | 16 | 50 tok/s | $0 (local) | -| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour | -| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour | -| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour | +| CPU | Threads | Speed | +|-----|---------|-------| +| Apple M3 Max (Metal) | 16 | 50 tok/s | +| AMD Ryzen 9 7950X | 32 | 35 tok/s | +| Intel i9-13900K | 32 | 30 tok/s | -### GPU acceleration (Llama 2-7B Q4_K_M) +### GPU offloading on RTX 4090 -| GPU | Speed | vs CPU | Cost | -|-----|-------|--------|------| -| 
NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) | -| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour | -| AMD MI250 | 70 tok/s | 2× | $2.00/hour | -| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) | +| Layers GPU | Speed | VRAM | +|------------|-------|------| +| 0 (CPU only) | 30 tok/s | 0 GB | +| 20 (hybrid) | 80 tok/s | 8 GB | +| 35 (all) | 120 tok/s | 12 GB | ## Supported models -**LLaMA family**: -- Llama 2 (7B, 13B, 70B) -- Llama 3 (8B, 70B, 405B) -- Code Llama +- **LLaMA family**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama +- **Mistral family**: Mistral 7B, Mixtral 8x7B/8x22B +- **Other**: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio) -**Mistral family**: -- Mistral 7B -- Mixtral 8x7B, 8x22B +Find GGUF models: https://huggingface.co/models?library=gguf -**Other**: -- Falcon, BLOOM, GPT-J -- Phi-3, Gemma, Qwen -- LLaVA (vision), Whisper (audio) +## Ecosystem integrations -**Find models**: https://huggingface.co/models?library=gguf +### Ollama + +```bash +cat > Modelfile << 'EOF' +FROM ./model-q4_k_m.gguf +TEMPLATE """{{ .System }} +{{ .Prompt }}""" +PARAMETER temperature 0.7 +PARAMETER num_ctx 4096 +EOF + +ollama create mymodel -f Modelfile +ollama run mymodel "Hello!" +``` + +### LM Studio + +1. Place GGUF file in `~/.cache/lm-studio/models/` +2. Open LM Studio and select the model +3. Configure context length and GPU offload, start inference + +### text-generation-webui + +```bash +cp model-q4_k_m.gguf text-generation-webui/models/ +python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35 +``` + +### OpenAI client → llama-server + +```python +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed") +response = client.chat.completions.create( + model="local-model", + messages=[{"role": "user", "content": "Hello!"}], + max_tokens=256, +) +print(response.choices[0].message.content) +``` + +## Best practices + +1. 
**Use K-quants** — Q4_K_M is the recommended default +2. **Use imatrix** for Q4 and below (calibration improves quality substantially) +3. **Offload as many layers as VRAM allows** — start high, reduce by 5 on OOM +4. **Thread count** — match physical cores, not logical +5. **Batch size** — increase `n_batch` (e.g. 512) for faster prompt processing +6. **Context** — start at 4096, grow only as needed (memory scales with ctx) +7. **Flash Attention** — add `--flash-attn` if your build supports it + +## Common issues (quick fixes) + +**Model loads slowly** — use `--mmap` for memory-mapped loading. + +**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache: +```python +Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35) # Q4_0 KV cache +``` + +**Garbage output** — wrong `chat_format`, temperature too high, or model file corrupted. Test with `temperature=0.1` and verify FP16 baseline works. + +**Connection refused (server)** — bind to `--host 0.0.0.0`, check `lsof -i :8080`. + +See `references/troubleshooting.md` for the full playbook. 
## References -- **[Quantization Guide](references/quantization.md)** - GGUF formats, conversion, quality comparison -- **[Server Deployment](references/server.md)** - API endpoints, Docker, monitoring -- **[Optimization](references/optimization.md)** - Performance tuning, hybrid CPU+GPU +- **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts +- **[quantization.md](references/quantization.md)** — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive +- **[server.md](references/server.md)** — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring +- **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks +- **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging ## Resources -- **GitHub**: https://github.com/ggerganov/llama.cpp -- **Models**: https://huggingface.co/models?library=gguf -- **Discord**: https://discord.gg/llama-cpp - - +- **GitHub**: https://github.com/ggml-org/llama.cpp +- **Python bindings**: https://github.com/abetlen/llama-cpp-python +- **Pre-quantized models**: https://huggingface.co/TheBloke +- **GGUF converter Space**: https://huggingface.co/spaces/ggml-org/gguf-my-repo +- **License**: MIT diff --git a/skills/mlops/inference/gguf/references/advanced-usage.md b/skills/mlops/inference/llama-cpp/references/advanced-usage.md similarity index 100% rename from skills/mlops/inference/gguf/references/advanced-usage.md rename to skills/mlops/inference/llama-cpp/references/advanced-usage.md diff --git a/skills/mlops/inference/gguf/references/troubleshooting.md b/skills/mlops/inference/llama-cpp/references/troubleshooting.md similarity index 100% rename from skills/mlops/inference/gguf/references/troubleshooting.md rename to 
skills/mlops/inference/llama-cpp/references/troubleshooting.md diff --git a/skills/mlops/training/grpo-rl-training/README.md b/skills/mlops/training/grpo-rl-training/README.md deleted file mode 100644 index 99b60d664..000000000 --- a/skills/mlops/training/grpo-rl-training/README.md +++ /dev/null @@ -1,97 +0,0 @@ -# GRPO/RL Training Skill - -**Expert-level guidance for Group Relative Policy Optimization with TRL** - -## 📁 Skill Structure - -``` -grpo-rl-training/ -├── SKILL.md # Main skill documentation (READ THIS FIRST) -├── README.md # This file -├── templates/ -│ └── basic_grpo_training.py # Production-ready training template -└── examples/ - └── reward_functions_library.py # 20+ reward function examples -``` - -## 🚀 Quick Start - -1. **Read SKILL.md** - Comprehensive guide with all concepts and patterns -2. **Copy `templates/basic_grpo_training.py`** - Start with working code -3. **Browse `examples/reward_functions_library.py`** - Pick reward functions for your task -4. **Modify for your use case** - Adapt dataset, rewards, and config - -## 💡 What's Inside - -### SKILL.md (Main Documentation) -- Core GRPO concepts and algorithm fundamentals -- Complete implementation workflow (dataset → rewards → training → deployment) -- 10+ reward function examples with code -- Hyperparameter tuning guide -- Training insights (loss behavior, metrics, debugging) -- Troubleshooting guide -- Production best practices - -### Templates -- **basic_grpo_training.py**: Minimal, production-ready training script - - Uses Qwen 2.5 1.5B Instruct - - 3 reward functions (format + correctness) - - LoRA for efficient training - - Fully documented and ready to run - -### Examples -- **reward_functions_library.py**: 20+ battle-tested reward functions - - Correctness rewards (exact match, fuzzy match, numeric, code execution) - - Format rewards (XML, JSON, strict/soft) - - Length rewards (ideal length, min/max) - - Style rewards (reasoning quality, citations, repetition penalty) - - Combined 
rewards (multi-objective optimization) - - Preset collections for common tasks - -## 📖 Usage for Agents - -When this skill is loaded in your agent's context: - -1. **Always read SKILL.md first** before implementing -2. **Start simple** - Use length-based reward to validate setup -3. **Build incrementally** - Add one reward function at a time -4. **Reference examples** - Copy patterns from reward_functions_library.py -5. **Monitor training** - Watch reward metrics (not loss!) - -## 🎯 Common Use Cases - -| Task Type | Recommended Rewards | Template | -|-----------|---------------------|----------| -| Math reasoning | `MATH_REASONING_REWARDS` preset | basic_grpo_training.py | -| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template | -| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards | -| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations | - -## ⚠️ Critical Reminders - -- **Loss goes UP during training** - This is normal (it's KL divergence) -- **Use 3-5 reward functions** - Single rewards often fail -- **Test rewards before training** - Debug each function independently -- **Monitor reward_std** - Should stay > 0.1 (avoid mode collapse) -- **Start with num_generations=4-8** - Scale up if GPU allows - -## 🔗 External Resources - -- [TRL Documentation](https://huggingface.co/docs/trl) -- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948) -- [Open R1 Implementation](https://github.com/huggingface/open-r1) -- [Unsloth (2-3x faster)](https://docs.unsloth.ai/) - -## 📝 Version - -**v1.0.0** - Initial release (January 2025) - -## 👨‍💻 Maintained By - -Orchestra Research -For questions or improvements, see https://orchestra.com - ---- - -**License:** MIT -**Last Updated:** January 2025 diff --git a/skills/mlops/training/trl-fine-tuning/SKILL.md b/skills/mlops/training/trl-fine-tuning/SKILL.md index 3bf4f6e12..70023fc70 100644 --- a/skills/mlops/training/trl-fine-tuning/SKILL.md +++ 
b/skills/mlops/training/trl-fine-tuning/SKILL.md @@ -252,6 +252,8 @@ trl dpo \ Train with reinforcement learning using minimal memory. +For in-depth GRPO guidance — reward function design, critical training insights (loss behavior, mode collapse, tuning), and advanced multi-stage patterns — see **[references/grpo-training.md](references/grpo-training.md)**. A production-ready training script is in **[templates/basic_grpo_training.py](templates/basic_grpo_training.py)**. + Copy this checklist: ``` @@ -428,6 +430,8 @@ config = PPOConfig( **Online RL methods**: See [references/online-rl.md](references/online-rl.md) for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations. +**GRPO deep dive**: See [references/grpo-training.md](references/grpo-training.md) for expert-level GRPO patterns — reward function design philosophy, training insights (why loss increases, mode collapse detection), hyperparameter tuning, multi-stage training, and troubleshooting. Production-ready template in [templates/basic_grpo_training.py](templates/basic_grpo_training.py). 
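
To make the mode-collapse warning concrete, here is a simplified sketch of the group-relative normalization behind GRPO: each completion's reward is compared to the mean of its own group. The function name `group_advantages`, the `eps` value, and the use of the population standard deviation are assumptions for illustration; TRL's `GRPOTrainer` does this internally and its exact normalization may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions.

    Completions that scored above the group mean get a positive
    advantage, those below get a negative one. `eps` (hypothetical
    value) only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A diverse group yields a usable learning signal.
diverse = group_advantages([0.0, 2.0])

# A collapsed group (all rewards identical) yields all-zero advantages,
# so the policy receives no gradient signal from this prompt.
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])
```

This is why the guide tells you to watch `reward_std`: once every completion in a group earns the same reward, the within-group comparison carries no information.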
+ ## Hardware requirements - **GPU**: NVIDIA (CUDA required) diff --git a/skills/mlops/training/grpo-rl-training/SKILL.md b/skills/mlops/training/trl-fine-tuning/references/grpo-training.md similarity index 56% rename from skills/mlops/training/grpo-rl-training/SKILL.md rename to skills/mlops/training/trl-fine-tuning/references/grpo-training.md index 1d7629ab6..a22bd4094 100644 --- a/skills/mlops/training/grpo-rl-training/SKILL.md +++ b/skills/mlops/training/trl-fine-tuning/references/grpo-training.md @@ -1,51 +1,36 @@ ---- -name: grpo-rl-training -description: Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training -version: 1.0.0 -author: Orchestra Research -license: MIT -dependencies: [transformers>=4.47.0, trl>=0.14.0, datasets>=3.2.0, peft>=0.14.0, torch] -metadata: - hermes: - tags: [Post-Training, Reinforcement Learning, GRPO, TRL, RLHF, Reward Modeling, Reasoning, DPO, PPO, Structured Output] +# GRPO (Group Relative Policy Optimization) — Deep Guide ---- +Expert-level patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions using TRL's `GRPOTrainer`. This is the deep reference for the GRPO workflow summarized in the main skill. -# GRPO/RL Training with TRL +## When to use GRPO -Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions. 
- -## When to Use This Skill - -Use GRPO training when you need to: -- **Enforce specific output formats** (e.g., XML tags, JSON, structured reasoning) +Use GRPO when you need to: +- **Enforce specific output formats** (XML tags, JSON, structured reasoning) - **Teach verifiable tasks** with objective correctness metrics (math, coding, fact-checking) - **Improve reasoning capabilities** by rewarding chain-of-thought patterns - **Align models to domain-specific behaviors** without labeled preference data - **Optimize for multiple objectives** simultaneously (format + correctness + style) **Do NOT use GRPO for:** -- Simple supervised fine-tuning tasks (use SFT instead) +- Simple supervised fine-tuning tasks → use SFT - Tasks without clear reward signals -- When you already have high-quality preference pairs (use DPO/PPO instead) +- When you already have high-quality preference pairs → use DPO/PPO ---- +## Core concepts -## Core Concepts +### 1. GRPO algorithm fundamentals -### 1. GRPO Algorithm Fundamentals - -**Key Mechanism:** -- Generates **multiple completions** for each prompt (group size: 4-16) +**Key mechanism:** +- Generates **multiple completions** per prompt (group size: 4–16) - Compares completions within each group using reward functions - Updates policy to favor higher-rewarded responses relative to the group -**Critical Difference from PPO:** +**Critical differences from PPO:** - No separate reward model needed - More sample-efficient (learns from within-group comparisons) - Simpler to implement and debug -**Mathematical Intuition:** +**Mathematical intuition:** ``` For each prompt p: 1. Generate N completions: {c₁, c₂, ..., cₙ} @@ -54,35 +39,32 @@ For each prompt p: relative to low-reward ones in the same group ``` -### 2. Reward Function Design Philosophy +### 2. Reward function design philosophy -**Golden Rules:** -1. **Compose multiple reward functions** - Each handles one aspect (format, correctness, style) -2. 
**Scale rewards appropriately** - Higher weight = stronger signal -3. **Use incremental rewards** - Partial credit for partial compliance -4. **Test rewards independently** - Debug each reward function in isolation +**Golden rules:** +1. **Compose multiple reward functions** — each handles one aspect (format, correctness, style) +2. **Scale rewards appropriately** — higher weight = stronger signal +3. **Use incremental rewards** — partial credit for partial compliance +4. **Test rewards independently** — debug each reward function in isolation -**Reward Function Types:** +**Reward function types:** | Type | Use Case | Example Weight | |------|----------|----------------| | **Correctness** | Verifiable tasks (math, code) | 2.0 (highest) | -| **Format** | Strict structure enforcement | 0.5-1.0 | -| **Length** | Encourage verbosity/conciseness | 0.1-0.5 | -| **Style** | Penalize unwanted patterns | -0.5 to 0.5 | +| **Format** | Strict structure enforcement | 0.5–1.0 | +| **Length** | Encourage verbosity/conciseness | 0.1–0.5 | +| **Style** | Penalize unwanted patterns | −0.5 to 0.5 | ---- +## Implementation workflow -## Implementation Workflow +### Step 1: Dataset preparation -### Step 1: Dataset Preparation - -**Critical Requirements:** -- Prompts in chat format (list of dicts with 'role' and 'content') +**Critical requirements:** +- Prompts in chat format (list of dicts with `role` and `content`) - Include system prompts to set expectations - For verifiable tasks, include ground truth answers as additional columns -**Example Structure:** ```python from datasets import load_dataset, Dataset @@ -97,8 +79,7 @@ Respond in the following format: """ def prepare_dataset(raw_data): - """ - Transform raw data into GRPO-compatible format. + """Transform raw data into GRPO-compatible format. 
Returns: Dataset with columns: - 'prompt': List[Dict] with role/content (system + user messages) @@ -113,14 +94,14 @@ def prepare_dataset(raw_data): }) ``` -**Pro Tips:** -- Use one-shot or few-shot examples in system prompt for complex formats -- Keep prompts concise (max_prompt_length: 256-512 tokens) +**Pro tips:** +- Use one-shot or few-shot examples in the system prompt for complex formats +- Keep prompts concise (max_prompt_length: 256–512 tokens) - Validate data quality before training (garbage in = garbage out) -### Step 2: Reward Function Implementation +### Step 2: Reward function implementation -**Template Structure:** +**Template structure:** ```python def reward_function_name( prompts, # List[List[Dict]]: Original prompts @@ -128,24 +109,16 @@ def reward_function_name( answer=None, # Optional: Ground truth from dataset **kwargs # Additional dataset columns ) -> list[float]: - """ - Evaluate completions and return rewards. - - Returns: List of floats (one per completion) - """ - # Extract completion text + """Evaluate completions and return rewards (one per completion).""" responses = [comp[0]['content'] for comp in completions] - - # Compute rewards rewards = [] for response in responses: score = compute_score(response) rewards.append(score) - return rewards ``` -**Example 1: Correctness Reward (Math/Coding)** +**Example 1: correctness reward (math/coding)** ```python def correctness_reward(prompts, completions, answer, **kwargs): """Reward correct answers with high score.""" @@ -155,7 +128,7 @@ def correctness_reward(prompts, completions, answer, **kwargs): for ans, gt in zip(extracted, answer)] ``` -**Example 2: Format Reward (Structured Output)** +**Example 2: format reward (structured output)** ```python import re @@ -167,7 +140,7 @@ def format_reward(completions, **kwargs): for r in responses] ``` -**Example 3: Incremental Format Reward (Partial Credit)** +**Example 3: incremental format reward (partial credit)** ```python def 
incremental_format_reward(completions, **kwargs): """Award partial credit for format compliance.""" @@ -176,14 +149,10 @@ def incremental_format_reward(completions, **kwargs): for r in responses: score = 0.0 - if '' in r: - score += 0.25 - if '' in r: - score += 0.25 - if '' in r: - score += 0.25 - if '' in r: - score += 0.25 + if '' in r: score += 0.25 + if '' in r: score += 0.25 + if '' in r: score += 0.25 + if '' in r: score += 0.25 # Penalize extra text after closing tag if r.count('') == 1: extra_text = r.split('')[-1].strip() @@ -193,12 +162,11 @@ def incremental_format_reward(completions, **kwargs): return rewards ``` -**Critical Insight:** -Combine 3-5 reward functions for robust training. Order matters less than diversity of signals. +**Critical insight:** Combine 3–5 reward functions for robust training. Order matters less than diversity of signals. -### Step 3: Training Configuration +### Step 3: Training configuration -**Memory-Optimized Config (Small GPU)** +**Memory-optimized config (small GPU)** ```python from trl import GRPOConfig @@ -218,13 +186,13 @@ training_args = GRPOConfig( gradient_accumulation_steps=4, # Effective batch = 4 # GRPO-specific - num_generations=8, # Group size: 8-16 recommended + num_generations=8, # Group size: 8–16 recommended max_prompt_length=256, max_completion_length=512, # Training duration num_train_epochs=1, - max_steps=None, # Or set fixed steps (e.g., 500) + max_steps=None, # Optimization bf16=True, # Faster on A100/H100 @@ -234,11 +202,11 @@ training_args = GRPOConfig( # Logging logging_steps=1, save_steps=100, - report_to="wandb", # Or "none" for no logging + report_to="wandb", ) ``` -**High-Performance Config (Large GPU)** +**High-performance config (large GPU)** ```python training_args = GRPOConfig( output_dir="outputs/grpo-model", @@ -255,31 +223,30 @@ training_args = GRPOConfig( ) ``` -**Critical Hyperparameters:** +**Critical hyperparameters:** | Parameter | Impact | Tuning Advice | 
|-----------|--------|---------------| -| `num_generations` | Group size for comparison | Start with 8, increase to 16 if GPU allows | +| `num_generations` | Group size for comparison | Start 8, increase to 16 if GPU allows | | `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) | -| `max_completion_length` | Output verbosity | Match your task (512 for reasoning, 256 for short answers) | +| `max_completion_length` | Output verbosity | Match your task (512 reasoning, 256 short answers) | | `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory limited | -### Step 4: Model Setup and Training +### Step 4: Model setup and training -**Standard Setup (Transformers)** +**Standard setup (Transformers + TRL)** ```python import torch from transformers import AutoModelForCausalLM, AutoTokenizer from peft import LoraConfig from trl import GRPOTrainer -# Load model model_name = "Qwen/Qwen2.5-1.5B-Instruct" model = AutoModelForCausalLM.from_pretrained( model_name, torch_dtype=torch.bfloat16, - attn_implementation="flash_attention_2", # 2-3x faster - device_map="auto" + attn_implementation="flash_attention_2", # 2–3× faster + device_map="auto", ) tokenizer = AutoTokenizer.from_pretrained(model_name) @@ -287,17 +254,16 @@ tokenizer.pad_token = tokenizer.eos_token # Optional: LoRA for parameter-efficient training peft_config = LoraConfig( - r=16, # Rank (higher = more capacity) - lora_alpha=32, # Scaling factor (typically 2*r) + r=16, + lora_alpha=32, target_modules=[ "q_proj", "k_proj", "v_proj", "o_proj", - "gate_proj", "up_proj", "down_proj" + "gate_proj", "up_proj", "down_proj", ], task_type="CAUSAL_LM", lora_dropout=0.05, ) -# Initialize trainer trainer = GRPOTrainer( model=model, processing_class=tokenizer, @@ -308,17 +274,14 @@ trainer = GRPOTrainer( ], args=training_args, train_dataset=dataset, - peft_config=peft_config, # Remove for full fine-tuning + peft_config=peft_config, # Remove for full fine-tuning ) -# Train 
trainer.train() - -# Save trainer.save_model("final_model") ``` -**Unsloth Setup (2-3x Faster)** +**Unsloth setup (2–3× faster)** ```python from unsloth import FastLanguageModel @@ -339,28 +302,26 @@ model = FastLanguageModel.get_peft_model( use_gradient_checkpointing="unsloth", ) -# Rest is identical to standard setup +# Rest is identical to the standard setup trainer = GRPOTrainer(model=model, ...) trainer.train() ``` ---- +## Critical training insights -## Critical Training Insights +### 1. Loss behavior (EXPECTED pattern) +- **Loss starts near 0 and INCREASES during training** — this is CORRECT +- Loss measures KL divergence from initial policy; the model is learning (diverging from original behavior to optimize rewards) +- **Monitor reward metrics, not loss, for progress** -### 1. Loss Behavior (EXPECTED PATTERN) -- **Loss starts near 0 and INCREASES during training** -- This is CORRECT - loss measures KL divergence from initial policy -- Model is learning (diverging from original behavior to optimize rewards) -- Monitor reward metrics instead of loss for progress +### 2. Reward tracking -### 2. 
Reward Tracking Key metrics to watch: -- `reward`: Average across all completions -- `reward_std`: Diversity within groups (should remain > 0) -- `kl`: KL divergence from reference (should grow moderately) +- `reward` — average across all completions +- `reward_std` — diversity within groups (should remain > 0) +- `kl` — KL divergence from reference (should grow moderately) -**Healthy Training Pattern:** +**Healthy pattern:** ``` Step Reward Reward_Std KL 100 0.5 0.3 0.02 @@ -369,12 +330,12 @@ Step Reward Reward_Std KL 400 1.5 0.15 0.12 ``` -**Warning Signs:** -- Reward std → 0 (model collapsing to single response) -- KL exploding (> 0.5) (diverging too much, reduce LR) -- Reward stuck (reward functions too harsh or model capacity issue) +**Warning signs:** +- `reward_std` → 0 (model collapsing to a single response) +- `kl` exploding (> 0.5) — diverging too much, reduce LR +- Reward stuck — reward functions too harsh or model capacity issue -### 3. Common Pitfalls and Solutions +### 3. Common pitfalls and solutions | Problem | Symptom | Solution | |---------|---------|----------| @@ -384,15 +345,14 @@ Step Reward Reward_Std KL | **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length | | **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards | ---- +## Advanced patterns -## Advanced Patterns +### 1. Multi-stage training -### 1. Multi-Stage Training For complex tasks, train in stages: ```python -# Stage 1: Format compliance (epochs=1) +# Stage 1: Format compliance trainer_stage1 = GRPOTrainer( model=model, reward_funcs=[incremental_format_reward, format_reward], @@ -400,7 +360,7 @@ trainer_stage1 = GRPOTrainer( ) trainer_stage1.train() -# Stage 2: Correctness (epochs=1) +# Stage 2: Correctness trainer_stage2 = GRPOTrainer( model=model, reward_funcs=[format_reward, correctness_reward], @@ -409,7 +369,8 @@ trainer_stage2 = GRPOTrainer( trainer_stage2.train() ``` -### 2. 
Adaptive Reward Scaling +### 2. Adaptive reward scaling + ```python class AdaptiveReward: def __init__(self, base_reward_func, initial_weight=1.0): @@ -428,148 +389,116 @@ class AdaptiveReward: self.weight *= 0.9 ``` -### 3. Custom Dataset Integration +### 3. Custom dataset integration + ```python def load_custom_knowledge_base(csv_path): - """Example: School communication platform docs.""" import pandas as pd df = pd.read_csv(csv_path) - - dataset = Dataset.from_pandas(df).map(lambda x: { + return Dataset.from_pandas(df).map(lambda x: { 'prompt': [ {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT}, {'role': 'user', 'content': x['question']} ], 'answer': x['expert_answer'] }) - return dataset ``` ---- +## Deployment and inference -## Deployment and Inference - -### Save and Merge LoRA +### Save and merge LoRA ```python -# Merge LoRA adapters into base model if hasattr(trainer.model, 'merge_and_unload'): merged_model = trainer.model.merge_and_unload() merged_model.save_pretrained("production_model") tokenizer.save_pretrained("production_model") ``` -### Inference Example +### Inference ```python from transformers import pipeline -generator = pipeline( - "text-generation", - model="production_model", - tokenizer=tokenizer -) +generator = pipeline("text-generation", model="production_model", tokenizer=tokenizer) result = generator( [ {'role': 'system', 'content': SYSTEM_PROMPT}, - {'role': 'user', 'content': "What is 15 + 27?"} + {'role': 'user', 'content': "What is 15 + 27?"}, ], max_new_tokens=256, do_sample=True, temperature=0.7, - top_p=0.9 + top_p=0.9, ) print(result[0]['generated_text']) ``` ---- +## Best practices checklist -## Best Practices Checklist - -**Before Training:** +**Before training:** - [ ] Validate dataset format (prompts as List[Dict]) - [ ] Test reward functions on sample data -- [ ] Calculate expected max_prompt_length from data -- [ ] Choose appropriate num_generations based on GPU memory +- [ ] Calculate expected `max_prompt_length` from 
data +- [ ] Choose `num_generations` based on GPU memory - [ ] Set up logging (wandb recommended) -**During Training:** +**During training:** - [ ] Monitor reward progression (should increase) -- [ ] Check reward_std (should stay > 0.1) +- [ ] Check `reward_std` (should stay > 0.1) - [ ] Watch for OOM errors (reduce batch size if needed) -- [ ] Sample generations every 50-100 steps +- [ ] Sample generations every 50–100 steps - [ ] Validate format compliance on holdout set -**After Training:** +**After training:** - [ ] Merge LoRA weights if using PEFT - [ ] Test on diverse prompts - [ ] Compare to baseline model - [ ] Document reward weights and hyperparameters - [ ] Save reproducibility config ---- +## Troubleshooting -## Troubleshooting Guide +### Debugging workflow +1. **Isolate reward functions** — test each independently +2. **Check data distribution** — ensure diversity in prompts +3. **Reduce complexity** — start with single reward, add gradually +4. **Monitor generations** — print samples every N steps +5. **Validate extraction logic** — ensure answer parsing works -### Debugging Workflow -1. **Isolate reward functions** - Test each independently -2. **Check data distribution** - Ensure diversity in prompts -3. **Reduce complexity** - Start with single reward, add gradually -4. **Monitor generations** - Print samples every N steps -5. 
**Validate extraction logic** - Ensure answer parsing works - -### Quick Fixes +### Quick debug reward ```python -# Debug reward function def debug_reward(completions, **kwargs): responses = [comp[0]['content'] for comp in completions] - for i, r in enumerate(responses[:2]): # Print first 2 + for i, r in enumerate(responses[:2]): print(f"Response {i}: {r[:200]}...") - return [1.0] * len(responses) # Dummy rewards + return [1.0] * len(responses) # Test without training trainer = GRPOTrainer(..., reward_funcs=[debug_reward]) -trainer.generate_completions(dataset[:1]) # Generate without updating +trainer.generate_completions(dataset[:1]) ``` ---- +## Template -## References and Resources +A production-ready training script lives at **`../templates/basic_grpo_training.py`**. It uses Qwen 2.5-1.5B-Instruct with LoRA and three reward functions (incremental format, strict format, correctness) on GSM8K. Copy and adapt: +1. `get_dataset()` — swap in your data loader +2. Reward functions — tune to your task +3. `SYSTEM_PROMPT` — match your output format +4. `GRPOConfig` — adjust hyperparameters for your GPU + +## References and resources -**Official Documentation:** - TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer -- DeepSeek R1 Paper: https://arxiv.org/abs/2501.12948 -- Unsloth Docs: https://docs.unsloth.ai/ - -**Example Repositories:** -- Open R1 Implementation: https://github.com/huggingface/open-r1 -- TRL Examples: https://github.com/huggingface/trl/tree/main/examples - -**Recommended Reading:** -- Progressive Disclosure Pattern for agent instructions -- Reward shaping in RL (Ng et al.) -- LoRA paper (Hu et al., 2021) - ---- - -## Usage Instructions for Agents - -When this skill is loaded: - -1. **Read this entire file** before implementing GRPO training -2. **Start with the simplest reward function** (e.g., length-based) to validate setup -3. **Use the templates** in `templates/` directory as starting points -4. 
**Reference examples** in `examples/` for task-specific implementations -5. **Follow the workflow** sequentially (don't skip steps) -6. **Debug incrementally** - add one reward function at a time - -**Critical Reminders:** -- Always use multiple reward functions (3-5 is optimal) -- Monitor reward metrics, not loss -- Test reward functions before training -- Start small (num_generations=4), scale up gradually -- Save checkpoints frequently (every 100 steps) - -This skill is designed for **expert-level implementation**. Beginners should start with supervised fine-tuning before attempting GRPO. - +- GRPO paper (DeepSeek): https://arxiv.org/abs/2402.03300 +- DeepSeek R1 paper: https://arxiv.org/abs/2501.12948 +- Open R1 implementation: https://github.com/huggingface/open-r1 +- TRL examples: https://github.com/huggingface/trl/tree/main/examples +- Unsloth (faster training): https://docs.unsloth.ai/ +## Critical reminders +- **Loss goes UP during training** — this is normal (it's KL divergence) +- **Use 3–5 reward functions** — single rewards often fail +- **Test rewards before training** — debug each function independently +- **Monitor `reward_std`** — should stay > 0.1 (avoid mode collapse) +- **Start with `num_generations=4–8`** — scale up if GPU allows diff --git a/skills/mlops/training/grpo-rl-training/templates/basic_grpo_training.py b/skills/mlops/training/trl-fine-tuning/templates/basic_grpo_training.py similarity index 100% rename from skills/mlops/training/grpo-rl-training/templates/basic_grpo_training.py rename to skills/mlops/training/trl-fine-tuning/templates/basic_grpo_training.py diff --git a/website/docs/reference/optional-skills-catalog.md b/website/docs/reference/optional-skills-catalog.md index 6fde99b5e..bbb2c3b80 100644 --- a/website/docs/reference/optional-skills-catalog.md +++ b/website/docs/reference/optional-skills-catalog.md @@ -98,6 +98,7 @@ The largest optional category — covers the full ML pipeline from data curation | **chroma** | 
Open-source embedding database. Store embeddings and metadata, perform vector and full-text search. Simple 4-function API for RAG and semantic search. |
 | **faiss** | Facebook's library for efficient similarity search and clustering of dense vectors. Supports billions of vectors, GPU acceleration, and various index types (Flat, IVF, HNSW). |
 | **flash-attention** | Optimize transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Supports PyTorch SDPA, flash-attn library, H100 FP8, and sliding window. |
+| **guidance** | Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance — Microsoft Research's constrained generation framework. |
 | **hermes-atropos-environments** | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, and evaluation. |
 | **huggingface-tokenizers** | Fast Rust-based tokenizers for research and production. Tokenizes 1GB in under 20 seconds. Supports BPE, WordPiece, and Unigram algorithms. |
 | **instructor** | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, and stream partial results. |
diff --git a/website/docs/reference/skills-catalog.md b/website/docs/reference/skills-catalog.md
index 13ef2f7fc..ead50dbea 100644
--- a/website/docs/reference/skills-catalog.md
+++ b/website/docs/reference/skills-catalog.md
@@ -163,10 +163,8 @@ Model serving, quantization (GGUF/GPTQ), structured output, inference optimizati
 
 | Skill | Description | Path |
 |-------|-------------|------|
-| `gguf-quantization` | GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements. | `mlops/inference/gguf` |
-| `guidance` | Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework | `mlops/inference/guidance` |
 | `instructor` | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, parse complex JSON with type safety, and stream partial results with Instructor - battle-tested structured output library | `mlops/inference/instructor` |
-| `llama-cpp` | Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU. | `mlops/inference/llama-cpp` |
+| `llama-cpp` | Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization. | `mlops/inference/llama-cpp` |
 | `obliteratus` | Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, LEACE, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods, 28 analysis modules, 116 model presets ac… | `mlops/inference/obliteratus` |
 | `outlines` | Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library | `mlops/inference/outlines` |
 | `serving-llms-vllm` | Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), an… | `mlops/inference/vllm` |
@@ -202,7 +200,6 @@ Fine-tuning, RLHF/DPO/GRPO training, distributed training frameworks, and optimi
 | `axolotl` | Expert guidance for fine-tuning LLMs with Axolotl - YAML configs, 100+ models, LoRA/QLoRA, DPO/KTO/ORPO/GRPO, multimodal support | `mlops/training/axolotl` |
 | `distributed-llm-pretraining-torchtitan` | Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing. | `mlops/training/torchtitan` |
 | `fine-tuning-with-trl` | Fine-tune LLMs using reinforcement learning with TRL - SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when need RLHF, align model with preferences, or train from human feedback. Works with HuggingFace Tr… | `mlops/training/trl-fine-tuning` |
-| `grpo-rl-training` | Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training | `mlops/training/grpo-rl-training` |
 | `hermes-atropos-environments` | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/evaluate). Use when creating, reviewing, or f… | `mlops/training/hermes-atropos-environments` |
 | `huggingface-accelerate` | Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard. | `mlops/training/accelerate` |
 | `optimizing-attention-flash` | Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when training/running transformers with long sequences (>512 tokens), encountering GPU memory issues with attention, or need faster inference. Supports PyTorch native SDPA,… | `mlops/training/flash-attention` |