mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
skills: consolidate mlops redundancies (gguf+llama-cpp, grpo+trl, guidance→optional) (#11965)
Three tightly-scoped built-in skill consolidations to reduce redundancy in the available_skills listing injected into every system prompt:

1. gguf-quantization → llama-cpp (merged). GGUF is llama.cpp's format; the two skills covered the same toolchain. The merged llama-cpp skill keeps the full K-quant table + imatrix workflow from gguf and the ROCm/benchmarks/supported-models sections from the original llama-cpp. All 5 reference files preserved.
2. grpo-rl-training → fine-tuning-with-trl (folded in). GRPO isn't a framework, it's a trainer inside TRL. Moved the 17KB deep-dive SKILL.md to references/grpo-training.md and the working template to templates/basic_grpo_training.py. TRL's GRPO workflow section now points to both. The Atropos skill's related_skills updated.
3. guidance → optional-skills/mlops/. Dropped from built-in. Outlines (still built-in) covers the same structured-generation ground with wider adoption. Guidance is listed in the optional catalog for users who specifically want it.

Net: 3 fewer built-in skill lines in every system prompt, zero content loss. Contributor authorship preserved via git rename detection.
parent 598cba62ad
commit 73bccc94c7
15 changed files with 470 additions and 889 deletions
@@ -7,7 +7,7 @@ license: MIT
 metadata:
   hermes:
     tags: [atropos, rl, environments, training, reinforcement-learning, reward-functions]
-    related_skills: [axolotl, grpo-rl-training, trl-fine-tuning, lm-evaluation-harness]
+    related_skills: [axolotl, fine-tuning-with-trl, lm-evaluation-harness]
 ---
 
 # Hermes Agent Atropos Environments
@@ -1,430 +0,0 @@
---
name: gguf-quantization
description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [llama-cpp-python>=0.2.0]
metadata:
  hermes:
    tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
---

# GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.

## When to use GGUF

**Use GGUF when:**
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Need CPU inference without GPU requirements
- Want flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)

**Key advantages:**
- **Universal hardware**: CPU, Apple Silicon, NVIDIA, AMD support
- **No Python runtime**: Pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: Importance matrix for better low-bit quality

**Use alternatives instead:**
- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: Fast calibration-free quantization for HuggingFace
- **bitsandbytes**: Simple integration with transformers library
- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed
## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
```

### Convert model to GGUF

```bash
# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
  --outfile model-f16.gguf \
  --outtype f16
```

### Quantize model

```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

### Run inference

```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```
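The convert, quantize, test sequence above is easy to script from Python. A minimal sketch that only builds the command lists without running them; the script name, binary location, and output naming are assumptions taken from the examples above:

```python
import subprocess

def quantize_pipeline(model_dir: str, out_prefix: str, quant: str = "Q4_K_M"):
    """Build the convert + quantize command lists for the steps above."""
    f16 = f"{out_prefix}-f16.gguf"
    out = f"{out_prefix}-{quant.lower()}.gguf"
    convert = ["python", "convert_hf_to_gguf.py", model_dir,
               "--outfile", f16, "--outtype", "f16"]
    quantize = ["./llama-quantize", f16, out, quant]
    return convert, quantize

convert_cmd, quant_cmd = quantize_pipeline("./path/to/model", "model")
print(quant_cmd)
# To actually run (requires llama.cpp built locally):
# subprocess.run(convert_cmd, check=True)
# subprocess.run(quant_cmd, check=True)
```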
## Quantization types

### K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |

### Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |

**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
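The Size (7B) column is roughly parameters times bits-per-weight. A back-of-envelope estimator; it ignores GGUF metadata and non-quantized tensors, so real files run slightly larger than the estimate:

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough quantized file size: params * bits-per-weight / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9

# 7B model at Q4_K_M (~4.5 effective bits/weight)
print(round(gguf_size_gb(7e9, 4.5), 2))  # 3.94 (GB) -- close to the ~4.1 GB in the table
```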
## Conversion workflows

### Workflow 1: HuggingFace to GGUF

```bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
  --outfile llama-3.1-8b-f16.gguf \
  --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```

### Workflow 2: With importance matrix (better quality)

```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
  -f calibration.txt \
  --chunk 512 \
  -o model.imatrix \
  -ngl 35  # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
  model-f16.gguf \
  model-q4_k_m.gguf \
  Q4_K_M
```

### Workflow 3: Multiple quantizations

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
  ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
  echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
## Python usage

### llama-cpp-python

```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU offload (0 for CPU only)
    n_threads=8       # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```

### Chat completion

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```

### Streaming

```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```
## Server mode

### Start OpenAI-compatible server

```bash
# Start server
./llama-server -m model-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 4096

# Or with Python bindings
python -m llama_cpp.server \
  --model model-q4_k_m.gguf \
  --n_gpu_layers 35 \
  --host 0.0.0.0 \
  --port 8080
```

### Use with OpenAI client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```
## Hardware optimization

### Apple Silicon (Metal)

```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```

### NVIDIA CUDA

```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```

### CPU optimization

```bash
# Build with AVX2/AVX512
make clean && make

# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
# Python CPU config
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```
## Integration with tools

### Ollama

```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"
```

### LM Studio

1. Place GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload
4. Start inference

### text-generation-webui

```bash
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
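Since a Modelfile is plain text, it can also be generated programmatically when packaging many quantizations. A small sketch using the same template and parameters as the Ollama example above; the helper name is hypothetical:

```python
def make_modelfile(gguf_path: str, temperature: float = 0.7, num_ctx: int = 4096) -> str:
    """Render an Ollama Modelfile string matching the example above."""
    return (
        f"FROM {gguf_path}\n"
        'TEMPLATE """{{ .System }}\n'
        '{{ .Prompt }}"""\n'
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
    )

print(make_modelfile("./model-q4_k_m.gguf"))
```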
## Best practices

1. **Use K-quants**: Q4_K_M offers best quality/size balance
2. **Use imatrix**: Always use importance matrix for Q4 and below
3. **GPU offload**: Offload as many layers as VRAM allows
4. **Context length**: Start with 4096, increase if needed
5. **Thread count**: Match physical CPU cores, not logical
6. **Batch size**: Increase n_batch for faster prompt processing

## Common issues

**Model loads slowly:**
```bash
# Use mmap for faster loading
./llama-cli -m model.gguf --mmap
```

**Out of memory:**
```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduce from 35

# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

**Poor quality at low bits:**
```bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

## References

- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks

## Resources

- **Repository**: https://github.com/ggml-org/llama.cpp
- **Python Bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized Models**: https://huggingface.co/TheBloke
- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT
@@ -1,138 +1,271 @@
 ---
 name: llama-cpp
-description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
+description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization.
-version: 1.0.0
+version: 2.0.0
 author: Orchestra Research
 license: MIT
-dependencies: [llama-cpp-python]
+dependencies: [llama-cpp-python>=0.2.0]
 metadata:
   hermes:
-    tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
+    tags: [llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression]
 ---
 
-# llama.cpp
+# llama.cpp + GGUF
 
-Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
+Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.
 
-## When to use llama.cpp
+## When to use
 
-**Use llama.cpp when:**
-- Running on CPU-only machines
-- Deploying on Apple Silicon (M1/M2/M3/M4)
-- Using AMD or Intel GPUs (no CUDA)
-- Edge deployment (Raspberry Pi, embedded systems)
-- Need simple deployment without Docker/Python
+**Use llama.cpp + GGUF when:**
+- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
+- Using AMD (ROCm) or Intel GPUs where CUDA isn't available
+- Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
+- Need flexible quantization (2–8 bit with K-quants)
+- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
+- Want a single binary deploy without Docker/Python
 
-**Use TensorRT-LLM instead when:**
-- Have NVIDIA GPUs (A100/H100)
-- Need maximum throughput (100K+ tok/s)
-- Running in datacenter with CUDA
+**Key advantages:**
+- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
+- No Python runtime required (pure C/C++)
+- K-quants + imatrix for better low-bit quality
+- OpenAI-compatible server built in
+- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)
 
-**Use vLLM instead when:**
-- Have NVIDIA GPUs
-- Need Python-first API
-- Want PagedAttention
+**Use alternatives instead:**
+- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput
+- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed
+- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments
+- **bitsandbytes** — Simple HuggingFace transformers integration
+- **HQQ** — Fast calibration-free quantization
 ## Quick start
 
-### Installation
+### Install
 
 ```bash
-# macOS/Linux
+# macOS / Linux (simplest)
 brew install llama.cpp
 
 # Or build from source
-git clone https://github.com/ggerganov/llama.cpp
+git clone https://github.com/ggml-org/llama.cpp
 cd llama.cpp
-make
-
-# With Metal (Apple Silicon)
-make LLAMA_METAL=1
-
-# With CUDA (NVIDIA)
-make LLAMA_CUDA=1
-
-# With ROCm (AMD)
-make LLAMA_HIP=1
+make               # CPU
+make GGML_METAL=1  # Apple Silicon
+make GGML_CUDA=1   # NVIDIA CUDA
+make LLAMA_HIP=1   # AMD ROCm
+
+# Python bindings (optional)
+pip install llama-cpp-python
+# With CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
+# With Metal: CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
 ```
 
-### Download model
+### Download a pre-quantized GGUF
 
 ```bash
-# Download from HuggingFace (GGUF format)
+# TheBloke hosts most popular models pre-quantized
 huggingface-cli download \
   TheBloke/Llama-2-7B-Chat-GGUF \
   llama-2-7b-chat.Q4_K_M.gguf \
   --local-dir models/
-
-# Or convert from HuggingFace
-python convert_hf_to_gguf.py models/llama-2-7b-chat/
 ```
+
+### Or convert a HuggingFace model to GGUF
+
+```bash
+# 1. Download HF model
+huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
+
+# 2. Convert to FP16 GGUF
+python convert_hf_to_gguf.py ./llama-3.1-8b \
+  --outfile llama-3.1-8b-f16.gguf \
+  --outtype f16
+
+# 3. Quantize to Q4_K_M
+./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
+```
 
 ### Run inference
 
 ```bash
-# Simple chat
-./llama-cli \
-  -m models/llama-2-7b-chat.Q4_K_M.gguf \
-  -p "Explain quantum computing" \
-  -n 256  # Max tokens
+# One-shot prompt
+./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256
 
 # Interactive chat
-./llama-cli \
-  -m models/llama-2-7b-chat.Q4_K_M.gguf \
-  --interactive
+./llama-cli -m model.Q4_K_M.gguf --interactive
+
+# With GPU offload
+./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
 ```
 
-### Server mode
+### Serve an OpenAI-compatible API
 
 ```bash
-# Start OpenAI-compatible server
 ./llama-server \
-  -m models/llama-2-7b-chat.Q4_K_M.gguf \
+  -m model.Q4_K_M.gguf \
   --host 0.0.0.0 \
   --port 8080 \
-  -ngl 32  # Offload 32 layers to GPU
+  -ngl 35 \
+  -c 4096 \
+  --parallel 4 \
+  --cont-batching
+```
 
-# Client request
+```bash
 curl http://localhost:8080/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
-    "model": "llama-2-7b-chat",
+    "model": "local",
     "messages": [{"role": "user", "content": "Hello!"}],
     "temperature": 0.7,
     "max_tokens": 100
   }'
 ```
-## Quantization formats
+## Quantization formats (GGUF)
 
-### GGUF format overview
+### K-quant methods (recommended)
 
-| Format | Bits | Size (7B) | Speed | Quality | Use Case |
-|--------|------|-----------|-------|---------|----------|
-| **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** |
-| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
-| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
-| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
-| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
-| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
+| Type | Bits | Size (7B) | Quality | Use Case |
+|------|------|-----------|---------|----------|
+| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
+| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
+| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
+| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
+| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** |
+| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
+| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
+| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
+| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |
+
+**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality).
+
+**Legacy quants (Q4_0/Q4_1/Q5_0/Q5_1) exist**, but always prefer K-quants for better quality/size ratio.
+
+**IQ quantization** — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. These require `--imatrix`.
+
+**Task-specific defaults:**
+- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
+- Code generation: Q5_K_M or Q6_K (higher precision helps)
+- Technical / medical: Q6_K or Q8_0
+- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
+- Raspberry Pi / edge: Q2_K or Q3_K_S
 
-### Choosing quantization
-
-```bash
-# General use (balanced)
-Q4_K_M  # 4-bit, medium quality
-
-# Maximum speed (more degradation)
-Q2_K or Q3_K_M
-
-# Maximum quality (slower)
-Q6_K or Q8_0
-
-# Very large models (70B, 405B)
-Q3_K_M or Q4_K_S  # Lower bits to fit in memory
-```
+## Conversion workflows
+
+### Basic: HF → GGUF → quantized
+
+```bash
+python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
+./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
+./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50
+```
+
+### With importance matrix (imatrix) — better low-bit quality
+
+`imatrix` gives 10–20% perplexity improvement at Q4, essential at Q3 and below.
+
+```bash
+# 1. Convert to FP16 GGUF
+python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
+
+# 2. Prepare calibration data (diverse text, ~100MB is ideal)
+cat > calibration.txt << 'EOF'
+The quick brown fox jumps over the lazy dog.
+Machine learning is a subset of artificial intelligence.
+# Add more diverse text samples...
+EOF
+
+# 3. Generate importance matrix
+./llama-imatrix -m model-f16.gguf \
+  -f calibration.txt \
+  --chunk 512 \
+  -o model.imatrix \
+  -ngl 35
+
+# 4. Quantize with imatrix
+./llama-quantize --imatrix model.imatrix \
+  model-f16.gguf model-q4_k_m.gguf Q4_K_M
+```
+
+### Multi-quant batch
+
+```bash
+#!/bin/bash
+MODEL="llama-3.1-8b-f16.gguf"
+IMATRIX="llama-3.1-8b.imatrix"
+
+./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
+
+for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
+  OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
+  ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
+  echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
+done
+```
+
+### Quality testing (perplexity)
+
+```bash
+./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
+# Baseline FP16: ~5.96 | Q4_K_M: ~6.06 (+1.7%) | Q2_K: ~6.87 (+15.3%)
+```
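The degradation percentages quoted in the perplexity comment follow directly from the measured values; checking the arithmetic:

```python
def ppl_degradation(baseline: float, quantized: float) -> float:
    """Relative perplexity increase vs the FP16 baseline, in percent."""
    return (quantized - baseline) / baseline * 100

print(round(ppl_degradation(5.96, 6.06), 1))  # Q4_K_M: 1.7
print(round(ppl_degradation(5.96, 6.87), 1))  # Q2_K: 15.3
```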
+## Python bindings (llama-cpp-python)
+
+### Basic generation
+
+```python
+from llama_cpp import Llama
+
+llm = Llama(
+    model_path="./model-q4_k_m.gguf",
+    n_ctx=4096,
+    n_gpu_layers=35,  # 0 for CPU only, 99 to offload everything
+    n_threads=8,
+)
+
+output = llm(
+    "What is machine learning?",
+    max_tokens=256,
+    temperature=0.7,
+    stop=["</s>", "\n\n"],
+)
+print(output["choices"][0]["text"])
+```
+
+### Chat completion + streaming
+
+```python
+llm = Llama(
+    model_path="./model-q4_k_m.gguf",
+    n_ctx=4096,
+    n_gpu_layers=35,
+    chat_format="llama-3",  # Or "chatml", "mistral", etc.
+)
+
+# Non-streaming
+response = llm.create_chat_completion(
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "What is Python?"},
+    ],
+    max_tokens=256,
+    temperature=0.7,
+)
+print(response["choices"][0]["message"]["content"])
+
+# Streaming
+for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
+    print(chunk["choices"][0]["text"], end="", flush=True)
+```
+
+### Embeddings
+
+```python
+llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
+vec = llm.embed("This is a test sentence.")
+print(f"Embedding dimension: {len(vec)}")
+```
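Vectors returned by `llm.embed` are plain lists of floats, so basic similarity search needs no extra library. A minimal cosine-similarity helper:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```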
 ## Hardware acceleration
@ -140,122 +273,166 @@ Q3_K_M or Q4_K_S # Lower bits to fit in memory
|
||||||
### Apple Silicon (Metal)
|
### Apple Silicon (Metal)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Build with Metal
|
make clean && make GGML_METAL=1
|
||||||
make LLAMA_METAL=1
|
./llama-cli -m model.gguf -ngl 99 -p "Hello" # offload all layers
|
||||||
|
|
||||||
# Run with GPU acceleration (automatic)
|
|
||||||
./llama-cli -m model.gguf -ngl 999 # Offload all layers
|
|
||||||
|
|
||||||
# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### NVIDIA GPUs (CUDA)
|
```python
|
||||||
|
llm = Llama(
|
||||||
```bash
|
model_path="model.gguf",
|
||||||
# Build with CUDA
|
n_gpu_layers=99, # Offload everything
|
||||||
make LLAMA_CUDA=1
|
n_threads=1, # Metal handles parallelism
|
||||||
|
)
|
||||||
# Offload layers to GPU
|
|
||||||
./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
|
|
||||||
|
|
||||||
# Hybrid CPU+GPU for large models
|
|
||||||
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### AMD GPUs (ROCm)
|
Performance: M3 Max ~40–60 tok/s on Llama 2-7B Q4_K_M.
|
||||||
|
|
||||||
|
### NVIDIA (CUDA)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
make clean && make GGML_CUDA=1
|
||||||
|
./llama-cli -m model.gguf -ngl 35 -p "Hello"
|
||||||
|
|
||||||
|
# Hybrid for large models
|
||||||
|
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
|
||||||
|
|
||||||
|
# Multi-GPU split
|
||||||
|
./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
|
||||||
|
```
|
||||||
|
|
||||||
|
### AMD (ROCm)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Build with ROCm
|
|
||||||
make LLAMA_HIP=1
|
make LLAMA_HIP=1
|
||||||
|
|
||||||
# Run with AMD GPU
|
|
||||||
./llama-cli -m model.gguf -ngl 999
|
./llama-cli -m model.gguf -ngl 999
|
||||||
```
|
```
|
||||||
|
|
||||||
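Choosing `-ngl` usually comes down to how many layer-sized slices of the GGUF file fit in VRAM after reserving room for the KV cache. A back-of-envelope helper (the 4.1 GB / 32-layer figures are ballpark numbers for a 7B Q4_K_M, and the 1.5 GB overhead is an assumption, not a measured value):

```python
def layers_that_fit(vram_gb, file_size_gb, n_layers, kv_overhead_gb=1.5):
    """Rough -ngl estimate: treat layers as equal slices of the GGUF file
    and reserve some VRAM for KV cache and compute buffers (assumed 1.5 GB)."""
    per_layer_gb = file_size_gb / n_layers
    usable = vram_gb - kv_overhead_gb
    return max(0, min(n_layers, int(usable / per_layer_gb)))

# 7B Q4_K_M is roughly a 4.1 GB file with 32 layers; on an 8 GB card:
print(layers_that_fit(8, 4.1, 32))  # all 32 layers fit
```

If the result is below `n_layers`, that is your hybrid CPU+GPU split; on OOM, drop the value by 5 and retry as the best-practices list below suggests.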
### CPU

```bash
# Match PHYSICAL cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"

# BLAS acceleration (2–3× speedup)
make LLAMA_OPENBLAS=1
```

```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,
    n_threads=8,
    n_batch=512,  # Larger batch = faster prompt processing
)
```
## Performance benchmarks

### CPU (Llama 2-7B Q4_K_M)

| CPU | Threads | Speed |
|-----|---------|-------|
| Apple M3 Max (Metal) | 16 | 50 tok/s |
| AMD Ryzen 9 7950X | 32 | 35 tok/s |
| Intel i9-13900K | 32 | 30 tok/s |

### GPU offloading on RTX 4090

| Layers GPU | Speed | VRAM |
|------------|-------|------|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |
## Supported models

- **LLaMA family**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama
- **Mistral family**: Mistral 7B, Mixtral 8x7B/8x22B
- **Other**: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)

Find GGUF models: https://huggingface.co/models?library=gguf

## Ecosystem integrations

### Ollama

```bash
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

ollama create mymodel -f Modelfile
ollama run mymodel "Hello!"
```
### LM Studio

1. Place GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload, start inference

### text-generation-webui

```bash
cp model-q4_k_m.gguf text-generation-webui/models/
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```

### OpenAI client → llama-server

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
## Best practices

1. **Use K-quants** — Q4_K_M is the recommended default
2. **Use imatrix** for Q4 and below (calibration improves quality substantially)
3. **Offload as many layers as VRAM allows** — start high, reduce by 5 on OOM
4. **Thread count** — match physical cores, not logical
5. **Batch size** — increase `n_batch` (e.g. 512) for faster prompt processing
6. **Context** — start at 4096, grow only as needed (memory scales with ctx)
7. **Flash Attention** — add `--flash-attn` if your build supports it
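Taken together, the checklist above maps onto a handful of llama-cli flags. A small helper that assembles them (flag spellings can vary across llama.cpp versions, so treat this as a sketch, not an authoritative CLI reference):

```python
def llama_cli_args(model, ngl=99, threads=8, ctx=4096, batch=512, flash_attn=True):
    """Assemble a llama-cli invocation following the checklist above."""
    args = [
        "./llama-cli", "-m", model,
        "-ngl", str(ngl),    # offload as many layers as VRAM allows
        "-t", str(threads),  # physical cores, not logical
        "-c", str(ctx),      # start at 4096, grow as needed
        "-b", str(batch),    # larger batch = faster prompt processing
    ]
    if flash_attn:
        args.append("--flash-attn")
    return args

print(" ".join(llama_cli_args("model-q4_k_m.gguf")))
```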
## Common issues (quick fixes)

**Model loads slowly** — use `--mmap` for memory-mapped loading.

**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:

```python
Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35)  # Q4_0 KV cache
```

**Garbage output** — wrong `chat_format`, temperature too high, or model file corrupted. Test with `temperature=0.1` and verify FP16 baseline works.

**Connection refused (server)** — bind to `--host 0.0.0.0`, check `lsof -i :8080`.

See `references/troubleshooting.md` for the full playbook.
## References

- **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
- **[quantization.md](references/quantization.md)** — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
- **[server.md](references/server.md)** — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
- **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
- **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging

## Resources

- **GitHub**: https://github.com/ggml-org/llama.cpp
- **Python bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized models**: https://huggingface.co/TheBloke
- **GGUF converter Space**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT
@@ -1,97 +0,0 @@
# GRPO/RL Training Skill

**Expert-level guidance for Group Relative Policy Optimization with TRL**

## 📁 Skill Structure

```
grpo-rl-training/
├── SKILL.md                        # Main skill documentation (READ THIS FIRST)
├── README.md                       # This file
├── templates/
│   └── basic_grpo_training.py      # Production-ready training template
└── examples/
    └── reward_functions_library.py # 20+ reward function examples
```

## 🚀 Quick Start

1. **Read SKILL.md** - Comprehensive guide with all concepts and patterns
2. **Copy `templates/basic_grpo_training.py`** - Start with working code
3. **Browse `examples/reward_functions_library.py`** - Pick reward functions for your task
4. **Modify for your use case** - Adapt dataset, rewards, and config

## 💡 What's Inside

### SKILL.md (Main Documentation)
- Core GRPO concepts and algorithm fundamentals
- Complete implementation workflow (dataset → rewards → training → deployment)
- 10+ reward function examples with code
- Hyperparameter tuning guide
- Training insights (loss behavior, metrics, debugging)
- Troubleshooting guide
- Production best practices

### Templates
- **basic_grpo_training.py**: Minimal, production-ready training script
  - Uses Qwen 2.5 1.5B Instruct
  - 3 reward functions (format + correctness)
  - LoRA for efficient training
  - Fully documented and ready to run

### Examples
- **reward_functions_library.py**: 20+ battle-tested reward functions
  - Correctness rewards (exact match, fuzzy match, numeric, code execution)
  - Format rewards (XML, JSON, strict/soft)
  - Length rewards (ideal length, min/max)
  - Style rewards (reasoning quality, citations, repetition penalty)
  - Combined rewards (multi-objective optimization)
  - Preset collections for common tasks

## 📖 Usage for Agents

When this skill is loaded in your agent's context:

1. **Always read SKILL.md first** before implementing
2. **Start simple** - Use length-based reward to validate setup
3. **Build incrementally** - Add one reward function at a time
4. **Reference examples** - Copy patterns from reward_functions_library.py
5. **Monitor training** - Watch reward metrics (not loss!)

## 🎯 Common Use Cases

| Task Type | Recommended Rewards | Template |
|-----------|---------------------|----------|
| Math reasoning | `MATH_REASONING_REWARDS` preset | basic_grpo_training.py |
| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template |
| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards |
| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations |

## ⚠️ Critical Reminders

- **Loss goes UP during training** - This is normal (it's KL divergence)
- **Use 3-5 reward functions** - Single rewards often fail
- **Test rewards before training** - Debug each function independently
- **Monitor reward_std** - Should stay > 0.1 (avoid mode collapse)
- **Start with num_generations=4-8** - Scale up if GPU allows

## 🔗 External Resources

- [TRL Documentation](https://huggingface.co/docs/trl)
- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948)
- [Open R1 Implementation](https://github.com/huggingface/open-r1)
- [Unsloth (2-3x faster)](https://docs.unsloth.ai/)

## 📝 Version

**v1.0.0** - Initial release (January 2025)

## 👨‍💻 Maintained By

Orchestra Research
For questions or improvements, see https://orchestra.com

---

**License:** MIT
**Last Updated:** January 2025
@@ -252,6 +252,8 @@ trl dpo \

Train with reinforcement learning using minimal memory.

For in-depth GRPO guidance — reward function design, critical training insights (loss behavior, mode collapse, tuning), and advanced multi-stage patterns — see **[references/grpo-training.md](references/grpo-training.md)**. A production-ready training script is in **[templates/basic_grpo_training.py](templates/basic_grpo_training.py)**.

Copy this checklist:
@@ -428,6 +430,8 @@ config = PPOConfig(

**Online RL methods**: See [references/online-rl.md](references/online-rl.md) for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.

**GRPO deep dive**: See [references/grpo-training.md](references/grpo-training.md) for expert-level GRPO patterns — reward function design philosophy, training insights (why loss increases, mode collapse detection), hyperparameter tuning, multi-stage training, and troubleshooting. Production-ready template in [templates/basic_grpo_training.py](templates/basic_grpo_training.py).

## Hardware requirements

- **GPU**: NVIDIA (CUDA required)
@@ -1,51 +1,36 @@
# GRPO (Group Relative Policy Optimization) — Deep Guide

Expert-level patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions using TRL's `GRPOTrainer`. This is the deep reference for the GRPO workflow summarized in the main skill.

## When to use GRPO

Use GRPO when you need to:
- **Enforce specific output formats** (XML tags, JSON, structured reasoning)
- **Teach verifiable tasks** with objective correctness metrics (math, coding, fact-checking)
- **Improve reasoning capabilities** by rewarding chain-of-thought patterns
- **Align models to domain-specific behaviors** without labeled preference data
- **Optimize for multiple objectives** simultaneously (format + correctness + style)

**Do NOT use GRPO for:**
- Simple supervised fine-tuning tasks → use SFT
- Tasks without clear reward signals
- When you already have high-quality preference pairs → use DPO/PPO

## Core concepts

### 1. GRPO algorithm fundamentals

**Key mechanism:**
- Generates **multiple completions** per prompt (group size: 4–16)
- Compares completions within each group using reward functions
- Updates policy to favor higher-rewarded responses relative to the group

**Critical differences from PPO:**
- No separate reward model needed
- More sample-efficient (learns from within-group comparisons)
- Simpler to implement and debug

**Mathematical intuition:**
```
For each prompt p:
  1. Generate N completions: {c₁, c₂, ..., cₙ}
```
@@ -54,35 +39,32 @@ For each prompt p:

```
     relative to low-reward ones in the same group
```
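The group comparison sketched above amounts to normalizing each completion's reward against its own group's mean and standard deviation. A toy calculation (simplified; the real trainer also folds in a KL penalty):

```python
def group_advantages(rewards):
    """Per-completion advantage: (r - group mean) / group std,
    the normalization GRPO applies within each prompt's group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0  # guard zero std
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled completions with their scalar rewards:
print(group_advantages([1.0, 0.0, 0.5, 0.5]))
```

Note that advantages always sum to zero within a group: the policy is pushed toward the best completions only relative to their siblings, which is why per-prompt reward scale matters less than within-group diversity.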
### 2. Reward function design philosophy

**Golden rules:**
1. **Compose multiple reward functions** — each handles one aspect (format, correctness, style)
2. **Scale rewards appropriately** — higher weight = stronger signal
3. **Use incremental rewards** — partial credit for partial compliance
4. **Test rewards independently** — debug each reward function in isolation

**Reward function types:**

| Type | Use Case | Example Weight |
|------|----------|----------------|
| **Correctness** | Verifiable tasks (math, code) | 2.0 (highest) |
| **Format** | Strict structure enforcement | 0.5–1.0 |
| **Length** | Encourage verbosity/conciseness | 0.1–0.5 |
| **Style** | Penalize unwanted patterns | −0.5 to 0.5 |
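The weights in the table are typically applied by summing weighted sub-rewards. A sketch with two toy sub-rewards (completions simplified to plain strings here; in TRL they are lists of message dicts, and both sub-reward functions are hypothetical stand-ins):

```python
def length_reward(completions, **kwargs):
    # Toy sub-reward: concise answers (under 200 chars) get credit
    return [1.0 if len(c) < 200 else 0.0 for c in completions]

def format_reward(completions, **kwargs):
    # Toy sub-reward: presence of an <answer> tag
    return [1.0 if "<answer>" in c else 0.0 for c in completions]

def combined_reward(completions, weights=(0.3, 1.0), **kwargs):
    """Weighted sum of sub-rewards, mirroring the weighting idea in the table."""
    length = length_reward(completions, **kwargs)
    fmt = format_reward(completions, **kwargs)
    return [weights[0] * l + weights[1] * f for l, f in zip(length, fmt)]

print(combined_reward(["<answer>42</answer>", "no tags here" * 30]))
```

In practice you can either combine sub-rewards into one function like this, or pass each as a separate entry in `GRPOTrainer`'s reward function list so their contributions are logged individually.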
## Implementation workflow

### Step 1: Dataset preparation

**Critical requirements:**
- Prompts in chat format (list of dicts with `role` and `content`)
- Include system prompts to set expectations
- For verifiable tasks, include ground truth answers as additional columns

```python
from datasets import load_dataset, Dataset
```
@@ -97,8 +79,7 @@ Respond in the following format:

```python
"""

def prepare_dataset(raw_data):
    """Transform raw data into GRPO-compatible format.

    Returns: Dataset with columns:
    - 'prompt': List[Dict] with role/content (system + user messages)
```
@@ -113,14 +94,14 @@ def prepare_dataset(raw_data):

```python
    })
```

**Pro tips:**
- Use one-shot or few-shot examples in the system prompt for complex formats
- Keep prompts concise (max_prompt_length: 256–512 tokens)
- Validate data quality before training (garbage in = garbage out)
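A minimal version of the structure described above, using plain dicts (wrap the result with `Dataset.from_list` for TRL; the system prompt is abbreviated and the field names follow the dataset columns discussed in this step):

```python
SYSTEM_PROMPT = "Respond with <reasoning>...</reasoning> then <answer>...</answer>"  # abbreviated

def to_grpo_rows(raw):
    """Build the chat-format 'prompt' column plus a ground-truth 'answer' column."""
    return [
        {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": ex["question"]},
            ],
            "answer": ex["answer"],
        }
        for ex in raw
    ]

rows = to_grpo_rows([{"question": "2+2?", "answer": "4"}])
print(rows[0]["prompt"][0]["role"])  # system message comes first
```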
### Step 2: Reward function implementation

**Template structure:**
```python
def reward_function_name(
    prompts,       # List[List[Dict]]: Original prompts
```
@@ -128,24 +109,16 @@ def reward_function_name(

```python
    answer=None,   # Optional: Ground truth from dataset
    **kwargs       # Additional dataset columns
) -> list[float]:
    """Evaluate completions and return rewards (one per completion)."""
    responses = [comp[0]['content'] for comp in completions]
    rewards = []
    for response in responses:
        score = compute_score(response)
        rewards.append(score)
    return rewards
```

**Example 1: correctness reward (math/coding)**
```python
def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward correct answers with high score."""
```
@@ -155,7 +128,7 @@ def correctness_reward(prompts, completions, answer, **kwargs):

```python
            for ans, gt in zip(extracted, answer)]
```

**Example 2: format reward (structured output)**
```python
import re
```
@@ -167,7 +140,7 @@ def format_reward(completions, **kwargs):

```python
            for r in responses]
```

**Example 3: incremental format reward (partial credit)**
```python
def incremental_format_reward(completions, **kwargs):
    """Award partial credit for format compliance."""
```
@@ -176,14 +149,10 @@ def incremental_format_reward(completions, **kwargs):

```python
    for r in responses:
        score = 0.0
        if '<reasoning>' in r: score += 0.25
        if '</reasoning>' in r: score += 0.25
        if '<answer>' in r: score += 0.25
        if '</answer>' in r: score += 0.25
        # Penalize extra text after closing tag
        if r.count('</answer>') == 1:
            extra_text = r.split('</answer>')[-1].strip()
```
@@ -193,12 +162,11 @@ def incremental_format_reward(completions, **kwargs):

```python
    return rewards
```

**Critical insight:** Combine 3–5 reward functions for robust training. Order matters less than diversity of signals.
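Golden rule 4 is cheap to follow: run each reward function on a few canned completions before launching training. A small harness using the calling convention shown in the template above (`tag_reward` is a hypothetical example reward, not part of the library):

```python
def smoke_test_reward(reward_fn, sample_texts, **extra_cols):
    """Run one reward function on canned outputs and print each score."""
    completions = [[{"role": "assistant", "content": t}] for t in sample_texts]
    scores = reward_fn(prompts=None, completions=completions, **extra_cols)
    for text, score in zip(sample_texts, scores):
        print(f"{score:5.2f}  {text[:40]!r}")
    return scores

def tag_reward(prompts, completions, **kwargs):
    # Hypothetical reward: full credit only for a closed <answer> block
    texts = [c[0]["content"] for c in completions]
    return [1.0 if "<answer>" in t and "</answer>" in t else 0.0 for t in texts]

scores = smoke_test_reward(tag_reward, ["<answer>4</answer>", "just text"])
```

If a reward returns the same score for obviously good and bad samples, fix it here rather than after a wasted training run.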
### Step 3: Training configuration

**Memory-optimized config (small GPU)**
```python
from trl import GRPOConfig
```
@@ -218,13 +186,13 @@ training_args = GRPOConfig(

```python
    gradient_accumulation_steps=4,   # Effective batch = 4

    # GRPO-specific
    num_generations=8,               # Group size: 8–16 recommended
    max_prompt_length=256,
    max_completion_length=512,

    # Training duration
    num_train_epochs=1,
    max_steps=None,

    # Optimization
    bf16=True,                       # Faster on A100/H100
```
@@ -234,11 +202,11 @@ training_args = GRPOConfig(

```python
    # Logging
    logging_steps=1,
    save_steps=100,
    report_to="wandb",
)
```

**High-performance config (large GPU)**
```python
training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
```
@@ -255,31 +223,30 @@ training_args = GRPOConfig(

```python
)
```

**Critical hyperparameters:**

| Parameter | Impact | Tuning Advice |
|-----------|--------|---------------|
| `num_generations` | Group size for comparison | Start 8, increase to 16 if GPU allows |
| `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
| `max_completion_length` | Output verbosity | Match your task (512 reasoning, 256 short answers) |
| `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory limited |
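A quick sanity check on how these knobs interact: the number of completions sampled per optimizer step grows multiplicatively. A sketch, under the assumption that the batch size counts prompts and each prompt gets one group of generations (the exact accounting varies across TRL versions, so verify against your version's docs):

```python
def generations_per_step(per_device_batch, grad_accum, num_generations, n_gpus=1):
    """Completions sampled per optimizer step, assuming the batch counts
    prompts and each prompt gets one group (assumption; TRL-version-dependent)."""
    prompts = per_device_batch * grad_accum * n_gpus
    return prompts * num_generations

# The memory-optimized config above: 4 prompts per step, 8 generations each
print(generations_per_step(1, 4, 8))  # → 32
```

This is why raising `num_generations` is the fastest way to exhaust GPU memory: every increment multiplies the generation workload of each step.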
### Step 4: Model setup and training

**Standard setup (Transformers + TRL)**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2–3× faster
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
```
|
||||||
|
|
||||||
# Optional: LoRA for parameter-efficient training
|
# Optional: LoRA for parameter-efficient training
|
||||||
peft_config = LoraConfig(
|
peft_config = LoraConfig(
|
||||||
r=16, # Rank (higher = more capacity)
|
r=16,
|
||||||
lora_alpha=32, # Scaling factor (typically 2*r)
|
lora_alpha=32,
|
||||||
target_modules=[
|
target_modules=[
|
||||||
"q_proj", "k_proj", "v_proj", "o_proj",
|
"q_proj", "k_proj", "v_proj", "o_proj",
|
||||||
"gate_proj", "up_proj", "down_proj"
|
"gate_proj", "up_proj", "down_proj",
|
||||||
],
|
],
|
||||||
task_type="CAUSAL_LM",
|
task_type="CAUSAL_LM",
|
||||||
lora_dropout=0.05,
|
lora_dropout=0.05,
|
||||||
)
|
)
|
||||||
|
|
||||||
# Initialize trainer
|
|
||||||
trainer = GRPOTrainer(
|
trainer = GRPOTrainer(
|
||||||
model=model,
|
model=model,
|
||||||
processing_class=tokenizer,
|
processing_class=tokenizer,
|
||||||
|
|
@ -308,17 +274,14 @@ trainer = GRPOTrainer(
|
||||||
],
|
],
|
||||||
args=training_args,
|
args=training_args,
|
||||||
train_dataset=dataset,
|
train_dataset=dataset,
|
||||||
peft_config=peft_config, # Remove for full fine-tuning
|
peft_config=peft_config, # Remove for full fine-tuning
|
||||||
)
|
)
|
||||||
|
|
||||||
# Train
|
|
||||||
trainer.train()
|
trainer.train()
|
||||||
|
|
||||||
# Save
|
|
||||||
trainer.save_model("final_model")
|
trainer.save_model("final_model")
|
||||||
```
|
```
|
||||||
|
|
||||||
**Unsloth Setup (2-3x Faster)**
|
**Unsloth setup (2–3× faster)**
|
||||||
```python
|
```python
|
||||||
from unsloth import FastLanguageModel
|
from unsloth import FastLanguageModel
|
||||||
|
|
||||||
|
|
@ -339,28 +302,26 @@ model = FastLanguageModel.get_peft_model(
|
||||||
use_gradient_checkpointing="unsloth",
|
use_gradient_checkpointing="unsloth",
|
||||||
)
|
)
|
||||||
|
|
||||||
# Rest is identical to standard setup
|
# Rest is identical to the standard setup
|
||||||
trainer = GRPOTrainer(model=model, ...)
|
trainer = GRPOTrainer(model=model, ...)
|
||||||
trainer.train()
|
trainer.train()
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
## Critical training insights
|
||||||
|
|
||||||
## Critical Training Insights
|
### 1. Loss behavior (EXPECTED pattern)
|
||||||
|
- **Loss starts near 0 and INCREASES during training** — this is CORRECT
|
||||||
|
- Loss measures KL divergence from initial policy; the model is learning (diverging from original behavior to optimize rewards)
|
||||||
|
- **Monitor reward metrics, not loss, for progress**
|
||||||
|
|
||||||
### 1. Loss Behavior (EXPECTED PATTERN)
|
### 2. Reward tracking
|
||||||
- **Loss starts near 0 and INCREASES during training**
|
|
||||||
- This is CORRECT - loss measures KL divergence from initial policy
|
|
||||||
- Model is learning (diverging from original behavior to optimize rewards)
|
|
||||||
- Monitor reward metrics instead of loss for progress
|
|
||||||
|
|
||||||
### 2. Reward Tracking
|
|
||||||
Key metrics to watch:
|
Key metrics to watch:
|
||||||
- `reward`: Average across all completions
|
- `reward` — average across all completions
|
||||||
- `reward_std`: Diversity within groups (should remain > 0)
|
- `reward_std` — diversity within groups (should remain > 0)
|
||||||
- `kl`: KL divergence from reference (should grow moderately)
|
- `kl` — KL divergence from reference (should grow moderately)
|
||||||
|
|
||||||
**Healthy Training Pattern:**
|
**Healthy pattern:**
|
||||||
```
|
```
|
||||||
Step Reward Reward_Std KL
|
Step Reward Reward_Std KL
|
||||||
100 0.5 0.3 0.02
|
100 0.5 0.3 0.02
|
||||||
|
|
@ -369,12 +330,12 @@ Step Reward Reward_Std KL
|
||||||
400 1.5 0.15 0.12
|
400 1.5 0.15 0.12
|
||||||
```
|
```
|
||||||
|
|
||||||
**Warning Signs:**
|
**Warning signs:**
|
||||||
- Reward std → 0 (model collapsing to single response)
|
- `reward_std` → 0 (model collapsing to a single response)
|
||||||
- KL exploding (> 0.5) (diverging too much, reduce LR)
|
- `kl` exploding (> 0.5) — diverging too much, reduce LR
|
||||||
- Reward stuck (reward functions too harsh or model capacity issue)
|
- Reward stuck — reward functions too harsh or model capacity issue
|
||||||
|
|
||||||
### 3. Common Pitfalls and Solutions
|
### 3. Common pitfalls and solutions
|
||||||
|
|
||||||
| Problem | Symptom | Solution |
|
| Problem | Symptom | Solution |
|
||||||
|---------|---------|----------|
|
|---------|---------|----------|
|
||||||
|
|
@ -384,15 +345,14 @@ Step Reward Reward_Std KL
|
||||||
| **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
|
| **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
|
||||||
| **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
|
| **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
|
||||||
|
|
||||||
---
|
## Advanced patterns
|
||||||
|
|
||||||
## Advanced Patterns
|
### 1. Multi-stage training
|
||||||
|
|
||||||
### 1. Multi-Stage Training
|
|
||||||
For complex tasks, train in stages:
|
For complex tasks, train in stages:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# Stage 1: Format compliance (epochs=1)
|
# Stage 1: Format compliance
|
||||||
trainer_stage1 = GRPOTrainer(
|
trainer_stage1 = GRPOTrainer(
|
||||||
model=model,
|
model=model,
|
||||||
reward_funcs=[incremental_format_reward, format_reward],
|
reward_funcs=[incremental_format_reward, format_reward],
|
||||||
|
|
@ -400,7 +360,7 @@ trainer_stage1 = GRPOTrainer(
|
||||||
)
|
)
|
||||||
trainer_stage1.train()
|
trainer_stage1.train()
|
||||||
|
|
||||||
# Stage 2: Correctness (epochs=1)
|
# Stage 2: Correctness
|
||||||
trainer_stage2 = GRPOTrainer(
|
trainer_stage2 = GRPOTrainer(
|
||||||
model=model,
|
model=model,
|
||||||
reward_funcs=[format_reward, correctness_reward],
|
reward_funcs=[format_reward, correctness_reward],
|
||||||
|
|
@ -409,7 +369,8 @@ trainer_stage2 = GRPOTrainer(
|
||||||
trainer_stage2.train()
|
trainer_stage2.train()
|
||||||
```
|
```
|
||||||
|
|
||||||
### 2. Adaptive Reward Scaling
|
### 2. Adaptive reward scaling
|
||||||
|
|
||||||
```python
|
```python
|
||||||
class AdaptiveReward:
|
class AdaptiveReward:
|
||||||
def __init__(self, base_reward_func, initial_weight=1.0):
|
def __init__(self, base_reward_func, initial_weight=1.0):
|
||||||
|
|
@ -428,148 +389,116 @@ class AdaptiveReward:
|
||||||
self.weight *= 0.9
|
self.weight *= 0.9
|
||||||
```
|
```
|
||||||
|
|
||||||
### 3. Custom Dataset Integration
|
### 3. Custom dataset integration
|
||||||
|
|
||||||
```python
|
```python
|
||||||
def load_custom_knowledge_base(csv_path):
|
def load_custom_knowledge_base(csv_path):
|
||||||
"""Example: School communication platform docs."""
|
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
df = pd.read_csv(csv_path)
|
df = pd.read_csv(csv_path)
|
||||||
|
return Dataset.from_pandas(df).map(lambda x: {
|
||||||
dataset = Dataset.from_pandas(df).map(lambda x: {
|
|
||||||
'prompt': [
|
'prompt': [
|
||||||
{'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
|
{'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
|
||||||
{'role': 'user', 'content': x['question']}
|
{'role': 'user', 'content': x['question']}
|
||||||
],
|
],
|
||||||
'answer': x['expert_answer']
|
'answer': x['expert_answer']
|
||||||
})
|
})
|
||||||
return dataset
|
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
## Deployment and inference
|
||||||
|
|
||||||
## Deployment and Inference
|
### Save and merge LoRA
|
||||||
|
|
||||||
### Save and Merge LoRA
|
|
||||||
```python
|
```python
|
||||||
# Merge LoRA adapters into base model
|
|
||||||
if hasattr(trainer.model, 'merge_and_unload'):
|
if hasattr(trainer.model, 'merge_and_unload'):
|
||||||
merged_model = trainer.model.merge_and_unload()
|
merged_model = trainer.model.merge_and_unload()
|
||||||
merged_model.save_pretrained("production_model")
|
merged_model.save_pretrained("production_model")
|
||||||
tokenizer.save_pretrained("production_model")
|
tokenizer.save_pretrained("production_model")
|
||||||
```
|
```
|
||||||
|
|
||||||
### Inference Example
|
### Inference
|
||||||
```python
|
```python
|
||||||
from transformers import pipeline
|
from transformers import pipeline
|
||||||
|
|
||||||
generator = pipeline(
|
generator = pipeline("text-generation", model="production_model", tokenizer=tokenizer)
|
||||||
"text-generation",
|
|
||||||
model="production_model",
|
|
||||||
tokenizer=tokenizer
|
|
||||||
)
|
|
||||||
|
|
||||||
result = generator(
|
result = generator(
|
||||||
[
|
[
|
||||||
{'role': 'system', 'content': SYSTEM_PROMPT},
|
{'role': 'system', 'content': SYSTEM_PROMPT},
|
||||||
{'role': 'user', 'content': "What is 15 + 27?"}
|
{'role': 'user', 'content': "What is 15 + 27?"},
|
||||||
],
|
],
|
||||||
max_new_tokens=256,
|
max_new_tokens=256,
|
||||||
do_sample=True,
|
do_sample=True,
|
||||||
temperature=0.7,
|
temperature=0.7,
|
||||||
top_p=0.9
|
top_p=0.9,
|
||||||
)
|
)
|
||||||
print(result[0]['generated_text'])
|
print(result[0]['generated_text'])
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
## Best practices checklist
|
||||||
|
|
||||||
## Best Practices Checklist
|
**Before training:**
|
||||||
|
|
||||||
**Before Training:**
|
|
||||||
- [ ] Validate dataset format (prompts as List[Dict])
|
- [ ] Validate dataset format (prompts as List[Dict])
|
||||||
- [ ] Test reward functions on sample data
|
- [ ] Test reward functions on sample data
|
||||||
- [ ] Calculate expected max_prompt_length from data
|
- [ ] Calculate expected `max_prompt_length` from data
|
||||||
- [ ] Choose appropriate num_generations based on GPU memory
|
- [ ] Choose `num_generations` based on GPU memory
|
||||||
- [ ] Set up logging (wandb recommended)
|
- [ ] Set up logging (wandb recommended)
|
||||||
|
|
||||||
**During Training:**
|
**During training:**
|
||||||
- [ ] Monitor reward progression (should increase)
|
- [ ] Monitor reward progression (should increase)
|
||||||
- [ ] Check reward_std (should stay > 0.1)
|
- [ ] Check `reward_std` (should stay > 0.1)
|
||||||
- [ ] Watch for OOM errors (reduce batch size if needed)
|
- [ ] Watch for OOM errors (reduce batch size if needed)
|
||||||
- [ ] Sample generations every 50-100 steps
|
- [ ] Sample generations every 50–100 steps
|
||||||
- [ ] Validate format compliance on holdout set
|
- [ ] Validate format compliance on holdout set
|
||||||
|
|
||||||
**After Training:**
|
**After training:**
|
||||||
- [ ] Merge LoRA weights if using PEFT
|
- [ ] Merge LoRA weights if using PEFT
|
||||||
- [ ] Test on diverse prompts
|
- [ ] Test on diverse prompts
|
||||||
- [ ] Compare to baseline model
|
- [ ] Compare to baseline model
|
||||||
- [ ] Document reward weights and hyperparameters
|
- [ ] Document reward weights and hyperparameters
|
||||||
- [ ] Save reproducibility config
|
- [ ] Save reproducibility config
|
||||||
|
|
||||||
---
|
## Troubleshooting
|
||||||
|
|
||||||
## Troubleshooting Guide
|
### Debugging workflow
|
||||||
|
1. **Isolate reward functions** — test each independently
|
||||||
|
2. **Check data distribution** — ensure diversity in prompts
|
||||||
|
3. **Reduce complexity** — start with single reward, add gradually
|
||||||
|
4. **Monitor generations** — print samples every N steps
|
||||||
|
5. **Validate extraction logic** — ensure answer parsing works
|
||||||
|
|
||||||
### Debugging Workflow
|
### Quick debug reward
|
||||||
1. **Isolate reward functions** - Test each independently
|
|
||||||
2. **Check data distribution** - Ensure diversity in prompts
|
|
||||||
3. **Reduce complexity** - Start with single reward, add gradually
|
|
||||||
4. **Monitor generations** - Print samples every N steps
|
|
||||||
5. **Validate extraction logic** - Ensure answer parsing works
|
|
||||||
|
|
||||||
### Quick Fixes
|
|
||||||
```python
|
```python
|
||||||
# Debug reward function
|
|
||||||
def debug_reward(completions, **kwargs):
|
def debug_reward(completions, **kwargs):
|
||||||
responses = [comp[0]['content'] for comp in completions]
|
responses = [comp[0]['content'] for comp in completions]
|
||||||
for i, r in enumerate(responses[:2]): # Print first 2
|
for i, r in enumerate(responses[:2]):
|
||||||
print(f"Response {i}: {r[:200]}...")
|
print(f"Response {i}: {r[:200]}...")
|
||||||
return [1.0] * len(responses) # Dummy rewards
|
return [1.0] * len(responses)
|
||||||
|
|
||||||
# Test without training
|
# Test without training
|
||||||
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
|
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
|
||||||
trainer.generate_completions(dataset[:1]) # Generate without updating
|
trainer.generate_completions(dataset[:1])
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
## Template
|
||||||
|
|
||||||
## References and Resources
|
A production-ready training script lives at **`../templates/basic_grpo_training.py`**. It uses Qwen 2.5-1.5B-Instruct with LoRA and three reward functions (incremental format, strict format, correctness) on GSM8K. Copy and adapt:
|
||||||
|
1. `get_dataset()` — swap in your data loader
|
||||||
|
2. Reward functions — tune to your task
|
||||||
|
3. `SYSTEM_PROMPT` — match your output format
|
||||||
|
4. `GRPOConfig` — adjust hyperparameters for your GPU
|
||||||
|
|
||||||
|
## References and resources
|
||||||
|
|
||||||
**Official Documentation:**
|
|
||||||
- TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
|
- TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
|
||||||
- DeepSeek R1 Paper: https://arxiv.org/abs/2501.12948
|
- GRPO paper (DeepSeek): https://arxiv.org/abs/2402.03300
|
||||||
- Unsloth Docs: https://docs.unsloth.ai/
|
- DeepSeek R1 paper: https://arxiv.org/abs/2501.12948
|
||||||
|
- Open R1 implementation: https://github.com/huggingface/open-r1
|
||||||
**Example Repositories:**
|
- TRL examples: https://github.com/huggingface/trl/tree/main/examples
|
||||||
- Open R1 Implementation: https://github.com/huggingface/open-r1
|
- Unsloth (faster training): https://docs.unsloth.ai/
|
||||||
- TRL Examples: https://github.com/huggingface/trl/tree/main/examples
|
|
||||||
|
|
||||||
**Recommended Reading:**
|
|
||||||
- Progressive Disclosure Pattern for agent instructions
|
|
||||||
- Reward shaping in RL (Ng et al.)
|
|
||||||
- LoRA paper (Hu et al., 2021)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Usage Instructions for Agents
|
|
||||||
|
|
||||||
When this skill is loaded:
|
|
||||||
|
|
||||||
1. **Read this entire file** before implementing GRPO training
|
|
||||||
2. **Start with the simplest reward function** (e.g., length-based) to validate setup
|
|
||||||
3. **Use the templates** in `templates/` directory as starting points
|
|
||||||
4. **Reference examples** in `examples/` for task-specific implementations
|
|
||||||
5. **Follow the workflow** sequentially (don't skip steps)
|
|
||||||
6. **Debug incrementally** - add one reward function at a time
|
|
||||||
|
|
||||||
**Critical Reminders:**
|
|
||||||
- Always use multiple reward functions (3-5 is optimal)
|
|
||||||
- Monitor reward metrics, not loss
|
|
||||||
- Test reward functions before training
|
|
||||||
- Start small (num_generations=4), scale up gradually
|
|
||||||
- Save checkpoints frequently (every 100 steps)
|
|
||||||
|
|
||||||
This skill is designed for **expert-level implementation**. Beginners should start with supervised fine-tuning before attempting GRPO.
|
|
||||||
|
|
||||||
|
|
||||||
|
## Critical reminders
|
||||||
|
|
||||||
|
- **Loss goes UP during training** — this is normal (it's KL divergence)
|
||||||
|
- **Use 3–5 reward functions** — single rewards often fail
|
||||||
|
- **Test rewards before training** — debug each function independently
|
||||||
|
- **Monitor `reward_std`** — should stay > 0.1 (avoid mode collapse)
|
||||||
|
- **Start with `num_generations=4–8`** — scale up if GPU allows
|
||||||
|
|
@@ -98,6 +98,7 @@ The largest optional category — covers the full ML pipeline from data curation
 | **chroma** | Open-source embedding database. Store embeddings and metadata, perform vector and full-text search. Simple 4-function API for RAG and semantic search. |
 | **faiss** | Facebook's library for efficient similarity search and clustering of dense vectors. Supports billions of vectors, GPU acceleration, and various index types (Flat, IVF, HNSW). |
 | **flash-attention** | Optimize transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Supports PyTorch SDPA, flash-attn library, H100 FP8, and sliding window. |
+| **guidance** | Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance — Microsoft Research's constrained generation framework. |
 | **hermes-atropos-environments** | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, and evaluation. |
 | **huggingface-tokenizers** | Fast Rust-based tokenizers for research and production. Tokenizes 1GB in under 20 seconds. Supports BPE, WordPiece, and Unigram algorithms. |
 | **instructor** | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, and stream partial results. |

@@ -163,10 +163,8 @@ Model serving, quantization (GGUF/GPTQ), structured output, inference optimizati

 | Skill | Description | Path |
 |-------|-------------|------|
-| `gguf-quantization` | GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements. | `mlops/inference/gguf` |
-| `guidance` | Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework | `mlops/inference/guidance` |
 | `instructor` | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, parse complex JSON with type safety, and stream partial results with Instructor - battle-tested structured output library | `mlops/inference/instructor` |
-| `llama-cpp` | Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU. | `mlops/inference/llama-cpp` |
+| `llama-cpp` | Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization. | `mlops/inference/llama-cpp` |
 | `obliteratus` | Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, LEACE, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods, 28 analysis modules, 116 model presets ac… | `mlops/inference/obliteratus` |
 | `outlines` | Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library | `mlops/inference/outlines` |
 | `serving-llms-vllm` | Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), an… | `mlops/inference/vllm` |

@@ -202,7 +200,6 @@ Fine-tuning, RLHF/DPO/GRPO training, distributed training frameworks, and optimi
 | `axolotl` | Expert guidance for fine-tuning LLMs with Axolotl - YAML configs, 100+ models, LoRA/QLoRA, DPO/KTO/ORPO/GRPO, multimodal support | `mlops/training/axolotl` |
 | `distributed-llm-pretraining-torchtitan` | Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing. | `mlops/training/torchtitan` |
 | `fine-tuning-with-trl` | Fine-tune LLMs using reinforcement learning with TRL - SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when need RLHF, align model with preferences, or train from human feedback. Works with HuggingFace Tr… | `mlops/training/trl-fine-tuning` |
-| `grpo-rl-training` | Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training | `mlops/training/grpo-rl-training` |
 | `hermes-atropos-environments` | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/evaluate). Use when creating, reviewing, or f… | `mlops/training/hermes-atropos-environments` |
 | `huggingface-accelerate` | Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard. | `mlops/training/accelerate` |
 | `optimizing-attention-flash` | Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when training/running transformers with long sequences (>512 tokens), encountering GPU memory issues with attention, or need faster inference. Supports PyTorch native SDPA,… | `mlops/training/flash-attention` |