skills: consolidate mlops redundancies (gguf+llama-cpp, grpo+trl, guidance→optional) (#11965)

Three tightly-scoped built-in skill consolidations to reduce redundancy in the available_skills listing injected into every system prompt:

1. gguf-quantization → llama-cpp (merged)
   GGUF is llama.cpp's format; two skills covered the same toolchain. The merged llama-cpp skill keeps the full K-quant table + imatrix workflow from gguf and the ROCm/benchmarks/supported-models sections from the original llama-cpp. All 5 reference files preserved.

2. grpo-rl-training → fine-tuning-with-trl (folded in)
   GRPO isn't a framework, it's a trainer inside TRL. Moved the 17KB deep-dive SKILL.md to references/grpo-training.md and the working template to templates/basic_grpo_training.py. TRL's GRPO workflow section now points to both. Atropos skill's related_skills updated.

3. guidance → optional-skills/mlops/
   Dropped from built-in. Outlines (still built-in) covers the same structured-generation ground with wider adoption. Listed in the optional catalog for users who specifically want Guidance.

Net: 3 fewer built-in skill lines in every system prompt, zero content loss. Contributor authorship preserved via git rename detection.
2026-04-25 00:51:20 +00:00 · 2026-04-17 21:36:40 -07:00 · 2026-04-17 21:36:40 -07:00 · 73bccc94c7
commit 73bccc94c7
parent 598cba62ad
15 changed files with 470 additions and 889 deletions
--- a/skills/mlops/inference/gguf/SKILL.md
+++ b/skills/mlops/inference/gguf/SKILL.md
@@ -1,430 +0,0 @@
---
-name: gguf-quantization
-description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [llama-cpp-python>=0.2.0]
-metadata:
-  hermes:
-    tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
-
---
-
-# GGUF - Quantization Format for llama.cpp
-
-The GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
-
-## When to use GGUF
-
-**Use GGUF when:**
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Need CPU inference without GPU requirements
- Want flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
-
-**Key advantages:**
- **Universal hardware**: CPU, Apple Silicon, NVIDIA, AMD support
- **No Python runtime**: Pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: Importance matrix for better low-bit quality
-
-**Use alternatives instead:**
- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: Fast calibration-free quantization for HuggingFace
- **bitsandbytes**: Simple integration with transformers library
- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed
-
-## Quick start
-
-### Installation
-
-```bash
-# Clone llama.cpp
-git clone https://github.com/ggml-org/llama.cpp
-cd llama.cpp
-
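-# Build (CPU). Recent llama.cpp releases build with CMake; binaries land in build/bin/
-cmake -B build
-cmake --build build --config Release
-
-# Build with CUDA (NVIDIA)
-cmake -B build -DGGML_CUDA=ON
-cmake --build build --config Release
-
-# Build with Metal (Apple Silicon)
-cmake -B build -DGGML_METAL=ON
-cmake --build build --config Release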
-
-# Install Python bindings (optional)
-pip install llama-cpp-python
-```
-
-### Convert model to GGUF
-
-```bash
-# Install requirements
-pip install -r requirements.txt
-
-# Convert HuggingFace model to GGUF (FP16)
-python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
-
-# Or specify output type
-python convert_hf_to_gguf.py ./path/to/model \
-    --outfile model-f16.gguf \
-    --outtype f16
-```
-
-### Quantize model
-
-```bash
-# Basic quantization to Q4_K_M
-./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
-
-# Quantize with importance matrix (better quality)
-./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
-./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
-```
-
-### Run inference
-
-```bash
-# CLI inference
-./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"
-
-# Interactive mode
-./llama-cli -m model-q4_k_m.gguf --interactive
-
-# With GPU offload
-./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
-```
-
-## Quantization types
-
-### K-quant methods (recommended)
-
-| Type | Bits | Size (7B) | Quality | Use Case |
-|------|------|-----------|---------|----------|
-| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
-| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
-| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
-| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
-| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** |
-| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
-| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
-| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
-| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
-
-### Legacy methods
-
-| Type | Description |
-|------|-------------|
-| Q4_0 | 4-bit, basic |
-| Q4_1 | 4-bit with delta |
-| Q5_0 | 5-bit, basic |
-| Q5_1 | 5-bit with delta |
-
-**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
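The sizes in the K-quant table can be roughly reproduced from the bits-per-weight column. The sketch below is not part of llama.cpp: `estimate_gguf_size_gb` is a hypothetical helper, and it assumes every weight is stored at the nominal bit width, which ignores file metadata and tensors kept at other precisions. Expect the table's sizes to differ somewhat from this estimate.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough file size in GB (10^9 bytes): parameters x bits per weight / 8."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal bits-per-weight values copied from the K-quant table above
BITS = {"Q2_K": 2.5, "Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.0}

if __name__ == "__main__":
    for name, bits in BITS.items():
        # 7e9 is an assumed parameter count for a "7B" model
        print(f"{name}: ~{estimate_gguf_size_gb(7e9, bits):.1f} GB")
```

The same arithmetic gives a quick check on whether a given quantization is likely to fit in available RAM or VRAM before any conversion is run.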
-
-## Conversion workflows
-
-### Workflow 1: HuggingFace to GGUF
-
-```bash
-# 1. Download model
-huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
-
-# 2. Convert to GGUF (FP16)
-python convert_hf_to_gguf.py ./llama-3.1-8b \
-    --outfile llama-3.1-8b-f16.gguf \
-    --outtype f16
-
-# 3. Quantize
-./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
-
-# 4. Test
-./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
-```
-
-### Workflow 2: With importance matrix (better quality)
-
-```bash
-# 1. Convert to GGUF
-python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
-
-# 2. Create calibration text (diverse samples)
-cat > calibration.txt << 'EOF'
-The quick brown fox jumps over the lazy dog.
-Machine learning is a subset of artificial intelligence.
-Python is a popular programming language.
-# Add more diverse text samples...
-EOF
-
-# 3. Generate importance matrix
-./llama-imatrix -m model-f16.gguf \
-    -f calibration.txt \
-    --chunk 512 \
-    -o model.imatrix \
-    -ngl 35  # GPU layers if available
-
-# 4. Quantize with imatrix
-./llama-quantize --imatrix model.imatrix \
-    model-f16.gguf \
-    model-q4_k_m.gguf \
-    Q4_K_M
-```
-
-### Workflow 3: Multiple quantizations
-
-```bash
-#!/bin/bash
-MODEL="llama-3.1-8b-f16.gguf"
-IMATRIX="llama-3.1-8b.imatrix"
-
-# Generate imatrix once
-./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
-
-# Create multiple quantizations
-for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
-    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
-    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
-    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
-done
-```
-
-## Python usage
-
-### llama-cpp-python
-
-```python
-from llama_cpp import Llama
-
-# Load model
-llm = Llama(
-    model_path="./model-q4_k_m.gguf",
-    n_ctx=4096,          # Context window
-    n_gpu_layers=35,     # GPU offload (0 for CPU only)
-    n_threads=8          # CPU threads
-)
-
-# Generate
-output = llm(
-    "What is machine learning?",
-    max_tokens=256,
-    temperature=0.7,
-    stop=["</s>", "\n\n"]
-)
-print(output["choices"][0]["text"])
-```
-
-### Chat completion
-
-```python
-from llama_cpp import Llama
-
-llm = Llama(
-    model_path="./model-q4_k_m.gguf",
-    n_ctx=4096,
-    n_gpu_layers=35,
-    chat_format="llama-3"  # Or "chatml", "mistral", etc.
-)
-
-messages = [
-    {"role": "system", "content": "You are a helpful assistant."},
-    {"role": "user", "content": "What is Python?"}
-]
-
-response = llm.create_chat_completion(
-    messages=messages,
-    max_tokens=256,
-    temperature=0.7
-)
-print(response["choices"][0]["message"]["content"])
-```
-
-### Streaming
-
-```python
-from llama_cpp import Llama
-
-llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)
-
-# Stream tokens
-for chunk in llm(
-    "Explain quantum computing:",
-    max_tokens=256,
-    stream=True
-):
-    print(chunk["choices"][0]["text"], end="", flush=True)
-```
-
-## Server mode
-
-### Start OpenAI-compatible server
-
-```bash
-# Start server
-./llama-server -m model-q4_k_m.gguf \
-    --host 0.0.0.0 \
-    --port 8080 \
-    -ngl 35 \
-    -c 4096
-
-# Or with Python bindings
-python -m llama_cpp.server \
-    --model model-q4_k_m.gguf \
-    --n_gpu_layers 35 \
-    --host 0.0.0.0 \
-    --port 8080
-```
-
-### Use with OpenAI client
-
-```python
-from openai import OpenAI
-
-client = OpenAI(
-    base_url="http://localhost:8080/v1",
-    api_key="not-needed"
-)
-
-response = client.chat.completions.create(
-    model="local-model",
-    messages=[{"role": "user", "content": "Hello!"}],
-    max_tokens=256
-)
-print(response.choices[0].message.content)
-```
-
-## Hardware optimization
-
-### Apple Silicon (Metal)
-
-```bash
-# Build with Metal
-cmake -B build -DGGML_METAL=ON && cmake --build build --config Release
-
-# Run with Metal acceleration
-./llama-cli -m model.gguf -ngl 99 -p "Hello"
-
-# Python with Metal (run this part in Python, not the shell)
-llm = Llama(
-    model_path="model.gguf",
-    n_gpu_layers=99,     # Offload all layers
-    n_threads=1          # Metal handles parallelism
-)
-```
-
-### NVIDIA CUDA
-
-```bash
-# Build with CUDA
-cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
-
-# Run with CUDA
-./llama-cli -m model.gguf -ngl 35 -p "Hello"
-
-# Specify GPU
-CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
-```
-
-### CPU optimization
-
-```bash
-# Build with AVX2/AVX512
-cmake -B build && cmake --build build --config Release
-
-# Run with optimal threads
-./llama-cli -m model.gguf -t 8 -p "Hello"
-
-# Python CPU config (run this part in Python, not the shell)
-llm = Llama(
-    model_path="model.gguf",
-    n_gpu_layers=0,      # CPU only
-    n_threads=8,         # Match physical cores
-    n_batch=512          # Batch size for prompt processing
-)
-```
-
-## Integration with tools
-
-### Ollama
-
-```bash
-# Create Modelfile
-cat > Modelfile << 'EOF'
-FROM ./model-q4_k_m.gguf
-TEMPLATE """{{ .System }}
-{{ .Prompt }}"""
-PARAMETER temperature 0.7
-PARAMETER num_ctx 4096
-EOF
-
-# Create Ollama model
-ollama create mymodel -f Modelfile
-
-# Run
-ollama run mymodel "Hello!"
-```
-
-### LM Studio
-
-1. Place GGUF file in `~/.cache/lm-studio/models/`
-2. Open LM Studio and select the model
-3. Configure context length and GPU offload
-4. Start inference
-
-### text-generation-webui
-
-```bash
-# Place in models folder
-cp model-q4_k_m.gguf text-generation-webui/models/
-
-# Start with llama.cpp loader
-python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
-```
-
-## Best practices
-
-1. **Use K-quants**: Q4_K_M offers best quality/size balance
-2. **Use imatrix**: Always use importance matrix for Q4 and below
-3. **GPU offload**: Offload as many layers as VRAM allows
-4. **Context length**: Start with 4096, increase if needed
-5. **Thread count**: Match physical CPU cores, not logical
-6. **Batch size**: Increase n_batch for faster prompt processing
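For the GPU offload practice, one way to pick a starting `-ngl` / `n_gpu_layers` value is to divide the usable VRAM by an approximate per-layer size. This is a heuristic sketch, not llama.cpp behavior: `estimate_gpu_layers` and its `reserve_gb` default are hypothetical, it assumes all layers are the same size, and the reserve for the KV cache and scratch buffers is a guess that should be tuned per model and context length.

```python
def estimate_gpu_layers(
    model_size_gb: float,
    n_layers: int,
    free_vram_gb: float,
    reserve_gb: float = 1.5,
) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers.

    reserve_gb is an assumed allowance for the KV cache and scratch buffers.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model has
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # Assumed example: a ~4.1 GB Q4_K_M file with 32 layers
    for vram in (4.0, 8.0):
        print(f"{vram} GB free -> try -ngl {estimate_gpu_layers(4.1, 32, vram)}")
```

If loading still runs out of memory at the estimated value, lower the layer count step by step, as in the out-of-memory tip under Common issues.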
-
-## Common issues
-
-**Model loads slowly:**
-```bash
-# Use mmap for faster loading
-./llama-cli -m model.gguf --mmap
-```
-
-**Out of memory:**
-```bash
-# Reduce GPU layers
-./llama-cli -m model.gguf -ngl 20  # Reduce from 35
-
-# Or use smaller quantization
-./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
-```
-
-**Poor quality at low bits:**
-```bash
-# Always use imatrix for Q4 and below
-./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
-./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
-```
-
-## References
-
- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks
-
-## Resources
-
- **Repository**: https://github.com/ggml-org/llama.cpp
- **Python Bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized Models**: https://huggingface.co/TheBloke
- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT
--- a/skills/mlops/inference/guidance/SKILL.md
+++ b/skills/mlops/inference/guidance/SKILL.md
@@ -1,575 +0,0 @@
---
-name: guidance
-description: Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [guidance, transformers]
-metadata:
-  hermes:
-    tags: [Prompt Engineering, Guidance, Constrained Generation, Structured Output, JSON Validation, Grammar, Microsoft Research, Format Enforcement, Multi-Step Workflows]
-
---
-
-# Guidance: Constrained LLM Generation
-
-## When to Use This Skill
-
-Use Guidance when you need to:
- **Control LLM output syntax** with regex or grammars
- **Guarantee valid JSON/XML/code** generation
- **Reduce latency** vs traditional prompting approaches
- **Enforce structured formats** (dates, emails, IDs, etc.)
- **Build multi-step workflows** with Pythonic control flow
- **Prevent invalid outputs** through grammatical constraints
-
-**GitHub Stars**: 18,000+ | **From**: Microsoft Research
-
-## Installation
-
-```bash
-# Base installation
-pip install guidance
-
-# With specific backends
-pip install guidance[transformers]  # Hugging Face models
-pip install guidance[llama_cpp]     # llama.cpp models
-```
-
-## Quick Start
-
-### Basic Example: Structured Generation
-
-```python
-from guidance import models, gen
-
-# Load model (supports OpenAI, Transformers, llama.cpp)
-lm = models.OpenAI("gpt-4")
-
-# Generate with constraints
-result = lm + "The capital of France is " + gen("capital", max_tokens=5)
-
-print(result["capital"])  # "Paris"
-```
-
-### With Anthropic Claude
-
-```python
-from guidance import models, gen, system, user, assistant
-
-# Configure Claude
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Use context managers for chat format
-with system():
-    lm += "You are a helpful assistant."
-
-with user():
-    lm += "What is the capital of France?"
-
-with assistant():
-    lm += gen(max_tokens=20)
-```
-
-## Core Concepts
-
-### 1. Context Managers
-
-Guidance uses Pythonic context managers for chat-style interactions.
-
-```python
-from guidance import models, system, user, assistant, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# System message
-with system():
-    lm += "You are a JSON generation expert."
-
-# User message
-with user():
-    lm += "Generate a person object with name and age."
-
-# Assistant response
-with assistant():
-    lm += gen("response", max_tokens=100)
-
-print(lm["response"])
-```
-
-**Benefits:**
- Natural chat flow
- Clear role separation
- Easy to read and maintain
-
-### 2. Constrained Generation
-
-Guidance ensures outputs match specified patterns using regex or grammars.
-
-#### Regex Constraints
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Constrain to valid email format
-lm += "Email: " + gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
-
-# Constrain to date format (YYYY-MM-DD)
-lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}")
-
-# Constrain to phone number
-lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}")
-
-print(lm["email"])  # Guaranteed valid email
-print(lm["date"])   # Guaranteed YYYY-MM-DD format
-```
-
-**How it works:**
- Regex converted to grammar at token level
- Invalid tokens filtered during generation
- Model can only produce matching outputs
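The filtering step described above can be illustrated with a toy example. This is not Guidance's implementation: it hard-codes the `YYYY-MM-DD` pattern as a template instead of compiling a regex, and `is_valid_prefix`, `allowed_tokens`, and the candidate vocabulary are all hypothetical. It only shows the idea of keeping tokens that leave the output a valid prefix of the pattern.

```python
DIGITS = "0123456789"
TEMPLATE = "dddd-dd-dd"  # 'd' marks a digit slot; other characters are literal


def is_valid_prefix(text: str) -> bool:
    """Return True if text could still be extended into a full YYYY-MM-DD match."""
    if len(text) > len(TEMPLATE):
        return False
    return all(
        (ch in DIGITS) if slot == "d" else (ch == slot)
        for ch, slot in zip(text, TEMPLATE)
    )


def allowed_tokens(generated: str, vocab: list[str]) -> list[str]:
    """Keep only the tokens that leave the output a valid prefix of the pattern."""
    return [tok for tok in vocab if tok and is_valid_prefix(generated + tok)]


if __name__ == "__main__":
    vocab = ["-", "5", "-0", "-09-", "abc"]
    # After "2024" only tokens starting with "-" can continue the date
    print(allowed_tokens("2024", vocab))
```

A real constrained decoder applies the same kind of check to the model's full vocabulary at every step and masks the rejected tokens before sampling.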
-
-#### Selection Constraints
-
-```python
-from guidance import models, gen, select
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Constrain to specific choices
-lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")
-
-# Multiple-choice selection
-lm += "Best answer: " + select(
-    ["A) Paris", "B) London", "C) Berlin", "D) Madrid"],
-    name="answer"
-)
-
-print(lm["sentiment"])  # One of: positive, negative, neutral
-print(lm["answer"])     # One of the four full option strings, e.g. "A) Paris"
-```
-
-### 3. Token Healing
-
-Guidance automatically "heals" token boundaries between prompt and generation.
-
-**Problem:** Tokenization creates unnatural boundaries.
-
-```python
-# Without token healing
-prompt = "The capital of France is "
-# Last token: " is "
-# First generated token might be " Par" (with leading space)
-# Result: "The capital of France is  Paris" (double space!)
-```
-
-**Solution:** Guidance backs up one token and regenerates.
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Token healing enabled by default
-lm += "The capital of France is " + gen("capital", max_tokens=5)
-# Result: "The capital of France is Paris" (correct spacing)
-```
-
-**Benefits:**
- Natural text boundaries
- No awkward spacing issues
- Better model performance (sees natural token sequences)
-
-### 4. Grammar-Based Generation
-
-Define complex structures using context-free grammars.
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# JSON grammar (simplified)
-json_grammar = """
-{
-    "name": <gen name regex="[A-Za-z ]+" max_tokens=20>,
-    "age": <gen age regex="[0-9]+" max_tokens=3>,
-    "email": <gen email regex="[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}" max_tokens=50>
-}
-"""
-
-# Generate valid JSON
-lm += gen("person", grammar=json_grammar)
-
-print(lm["person"])  # Guaranteed valid JSON structure
-```
-
-**Use cases:**
- Complex structured outputs
- Nested data structures
- Programming language syntax
- Domain-specific languages
-
-### 5. Guidance Functions
-
-Create reusable generation patterns with the `@guidance` decorator.
-
-```python
-from guidance import guidance, gen, models
-
-@guidance
-def generate_person(lm):
-    """Generate a person with name and age."""
-    lm += "Name: " + gen("name", max_tokens=20, stop="\n")
-    lm += "\nAge: " + gen("age", regex=r"[0-9]+", max_tokens=3)
-    return lm
-
-# Use the function
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = generate_person(lm)
-
-print(lm["name"])
-print(lm["age"])
-```
-
-**Stateful Functions:**
-
-```python
-@guidance(stateless=False)
-def react_agent(lm, question, tools, max_rounds=5):
-    """ReAct agent with tool use."""
-    lm += f"Question: {question}\n\n"
-
-    for i in range(max_rounds):
-        # Thought
-        lm += f"Thought {i+1}: " + gen("thought", stop="\n")
-
-        # Action
-        lm += "\nAction: " + select(list(tools.keys()), name="action")
-
-        # Execute tool
-        tool_result = tools[lm["action"]]()
-        lm += f"\nObservation: {tool_result}\n\n"
-
-        # Check if done
-        lm += "Done? " + select(["Yes", "No"], name="done")
-        if lm["done"] == "Yes":
-            break
-
-    # Final answer
-    lm += "\nFinal Answer: " + gen("answer", max_tokens=100)
-    return lm
-```
-
-## Backend Configuration
-
-### Anthropic Claude
-
-```python
-from guidance import models
-
-lm = models.Anthropic(
-    model="claude-sonnet-4-5-20250929",
-    api_key="your-api-key"  # Or set ANTHROPIC_API_KEY env var
-)
-```
-
-### OpenAI
-
-```python
-from guidance import models
-
-lm = models.OpenAI(
-    model="gpt-4o-mini",
-    api_key="your-api-key"  # Or set OPENAI_API_KEY env var
-)
-```
-
-### Local Models (Transformers)
-
-```python
-from guidance.models import Transformers
-
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cuda"  # Or "cpu"
-)
-```
-
-### Local Models (llama.cpp)
-
-```python
-from guidance.models import LlamaCpp
-
-lm = LlamaCpp(
-    model_path="/path/to/model.gguf",
-    n_ctx=4096,
-    n_gpu_layers=35
-)
-```
-
-## Common Patterns
-
-### Pattern 1: JSON Generation
-
-```python
-from guidance import models, gen, system, user, assistant
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-with system():
-    lm += "You generate valid JSON."
-
-with user():
-    lm += "Generate a user profile with name, age, and email."
-
-with assistant():
-    lm += """{
-    "name": """ + gen("name", regex=r'"[A-Za-z ]+"', max_tokens=30) + """,
-    "age": """ + gen("age", regex=r"[0-9]+", max_tokens=3) + """,
-    "email": """ + gen("email", regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"', max_tokens=50) + """
-}"""
-
-print(lm)  # Valid JSON guaranteed
-```
-
-### Pattern 2: Classification
-
-```python
-from guidance import models, gen, select
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-text = "This product is amazing! I love it."
-
-lm += f"Text: {text}\n"
-lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")
-lm += "\nConfidence: " + gen("confidence", regex=r"[0-9]+", max_tokens=3) + "%"
-
-print(f"Sentiment: {lm['sentiment']}")
-print(f"Confidence: {lm['confidence']}%")
-```
-
-### Pattern 3: Multi-Step Reasoning
-
-```python
-from guidance import models, gen, guidance
-
-@guidance
-def chain_of_thought(lm, question):
-    """Generate answer with step-by-step reasoning."""
-    lm += f"Question: {question}\n\n"
-
-    # Generate multiple reasoning steps
-    for i in range(3):
-        lm += f"Step {i+1}: " + gen(f"step_{i+1}", stop="\n", max_tokens=100) + "\n"
-
-    # Final answer
-    lm += "\nTherefore, the answer is: " + gen("answer", max_tokens=50)
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = chain_of_thought(lm, "What is 15% of 200?")
-
-print(lm["answer"])
-```
-
-### Pattern 4: ReAct Agent
-
-```python
-from guidance import models, gen, select, guidance
-
-@guidance(stateless=False)
-def react_agent(lm, question):
-    """ReAct agent with tool use."""
-    tools = {
-        # Demo only: eval() on model-generated text is unsafe outside a sandbox
-        "calculator": lambda expr: eval(expr),
-        "search": lambda query: f"Search results for: {query}",
-    }
-
-    lm += f"Question: {question}\n\n"
-
-    for _ in range(5):
-        # Thought
-        lm += "Thought: " + gen("thought", stop="\n") + "\n"
-
-        # Action selection
-        lm += "Action: " + select(["calculator", "search", "answer"], name="action")
-
-        if lm["action"] == "answer":
-            lm += "\nFinal Answer: " + gen("answer", max_tokens=100)
-            break
-
-        # Action input
-        lm += "\nAction Input: " + gen("action_input", stop="\n") + "\n"
-
-        # Execute tool
-        if lm["action"] in tools:
-            result = tools[lm["action"]](lm["action_input"])
-            lm += f"Observation: {result}\n\n"
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = react_agent(lm, "What is 25 * 4 + 10?")
-print(lm["answer"])
-```
-
-### Pattern 5: Data Extraction
-
-```python
-from guidance import models, gen, guidance
-
-@guidance
-def extract_entities(lm, text):
-    """Extract structured entities from text."""
-    lm += f"Text: {text}\n\n"
-
-    # Extract person
-    lm += "Person: " + gen("person", stop="\n", max_tokens=30) + "\n"
-
-    # Extract organization
-    lm += "Organization: " + gen("organization", stop="\n", max_tokens=30) + "\n"
-
-    # Extract date
-    lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}", max_tokens=10) + "\n"
-
-    # Extract location
-    lm += "Location: " + gen("location", stop="\n", max_tokens=30) + "\n"
-
-    return lm
-
-text = "Tim Cook announced at Apple Park on 2024-09-15 in Cupertino."
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = extract_entities(lm, text)
-
-print(f"Person: {lm['person']}")
-print(f"Organization: {lm['organization']}")
-print(f"Date: {lm['date']}")
-print(f"Location: {lm['location']}")
-```
-
-## Best Practices
-
-### 1. Use Regex for Format Validation
-
-```python
-# ✅ Good: Regex ensures valid format
-lm += "Email: " + gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
-
-# ❌ Bad: Free generation may produce invalid emails
-lm += "Email: " + gen("email", max_tokens=50)
-```
-
-### 2. Use select() for Fixed Categories
-
-```python
-# ✅ Good: Guaranteed valid category
-lm += "Status: " + select(["pending", "approved", "rejected"], name="status")
-
-# ❌ Bad: May generate typos or invalid values
-lm += "Status: " + gen("status", max_tokens=20)
-```
-
-### 3. Leverage Token Healing
-
-```python
-# Token healing is enabled by default
-# No special action needed - just concatenate naturally
-lm += "The capital is " + gen("capital")  # Automatic healing
-```
-
-### 4. Use stop Sequences
-
-```python
-# ✅ Good: Stop at newline for single-line outputs
-lm += "Name: " + gen("name", stop="\n")
-
-# ❌ Bad: May generate multiple lines
-lm += "Name: " + gen("name", max_tokens=50)
-```
-
-### 5. Create Reusable Functions
-
-```python
-# ✅ Good: Reusable pattern
-@guidance
-def generate_person(lm):
-    lm += "Name: " + gen("name", stop="\n")
-    lm += "\nAge: " + gen("age", regex=r"[0-9]+")
-    return lm
-
-# Use multiple times
-lm = generate_person(lm)
-lm += "\n\n"
-lm = generate_person(lm)
-```
-
-### 6. Balance Constraints
-
-```python
-# ✅ Good: Reasonable constraints
-lm += gen("name", regex=r"[A-Za-z ]+", max_tokens=30)
-
-# ❌ Too strict: May fail or be very slow
-lm += gen("name", regex=r"^(John|Jane)$", max_tokens=10)
-```
-
-## Comparison to Alternatives
-
-| Feature | Guidance | Instructor | Outlines | LMQL |
-|---------|----------|------------|----------|------|
-| Regex Constraints | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes |
-| Grammar Support | ✅ CFG | ❌ No | ✅ CFG | ✅ CFG |
-| Pydantic Validation | ❌ No | ✅ Yes | ✅ Yes | ❌ No |
-| Token Healing | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
-| Local Models | ✅ Yes | ⚠️ Limited | ✅ Yes | ✅ Yes |
-| API Models | ✅ Yes | ✅ Yes | ⚠️ Limited | ✅ Yes |
-| Pythonic Syntax | ✅ Yes | ✅ Yes | ✅ Yes | ❌ SQL-like |
-| Learning Curve | Low | Low | Medium | High |
-
-**When to choose Guidance:**
- Need regex/grammar constraints
- Want token healing
- Building complex workflows with control flow
- Using local models (Transformers, llama.cpp)
- Prefer Pythonic syntax
-
-**When to choose alternatives:**
- Instructor: Need Pydantic validation with automatic retrying
- Outlines: Need JSON schema validation
- LMQL: Prefer declarative query syntax
-
-## Performance Characteristics
-
-**Latency Reduction:**
- 30-50% faster than traditional prompting for constrained outputs
- Token healing reduces unnecessary regeneration
- Grammar constraints prevent invalid token generation
-
-**Memory Usage:**
- Minimal overhead vs unconstrained generation
- Grammar compilation cached after first use
- Efficient token filtering at inference time
-
-**Token Efficiency:**
- Prevents wasted tokens on invalid outputs
- No need for retry loops
- Direct path to valid outputs
-
-## Resources
-
- **Documentation**: https://guidance.readthedocs.io
- **GitHub**: https://github.com/guidance-ai/guidance (18k+ stars)
- **Notebooks**: https://github.com/guidance-ai/guidance/tree/main/notebooks
- **Discord**: Community support available
-
-## See Also
-
- `references/constraints.md` - Comprehensive regex and grammar patterns
- `references/backends.md` - Backend-specific configuration
- `references/examples.md` - Production-ready examples
-
-
--- a/skills/mlops/inference/guidance/references/backends.md
+++ b/skills/mlops/inference/guidance/references/backends.md
@@ -1,554 +0,0 @@
-# Backend Configuration Guide
-
-Complete guide to configuring Guidance with different LLM backends.
-
-## Table of Contents
- API-Based Models (Anthropic, OpenAI)
- Local Models (Transformers, llama.cpp)
- Backend Comparison
- Performance Tuning
- Advanced Configuration
-
-## API-Based Models
-
-### Anthropic Claude
-
-#### Basic Setup
-
-```python
-from guidance import models
-
-# Using environment variable
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-# Reads ANTHROPIC_API_KEY from environment
-
-# Explicit API key
-lm = models.Anthropic(
-    model="claude-sonnet-4-5-20250929",
-    api_key="your-api-key-here"
-)
-```
-
-#### Available Models
-
-```python
-# Claude Sonnet 4.5 (Latest, recommended)
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Claude 3.7 Sonnet (Fast, cost-effective)
-lm = models.Anthropic("claude-3-7-sonnet-20250219")
-
-# Claude 3 Opus (Most capable)
-lm = models.Anthropic("claude-3-opus-20240229")
-
-# Claude 3.5 Haiku (Fastest, cheapest)
-lm = models.Anthropic("claude-3-5-haiku-20241022")
-```
-
-#### Configuration Options
-
-```python
-lm = models.Anthropic(
-    model="claude-sonnet-4-5-20250929",
-    api_key="your-api-key",
-    max_tokens=4096,           # Max tokens to generate
-    temperature=0.7,            # Sampling temperature (0-1)
-    top_p=0.9,                  # Nucleus sampling
-    timeout=30,                 # Request timeout (seconds)
-    max_retries=3              # Retry failed requests
-)
-```
-
-#### With Context Managers
-
-```python
-from guidance import models, system, user, assistant, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-with system():
-    lm += "You are a helpful assistant."
-
-with user():
-    lm += "What is the capital of France?"
-
-with assistant():
-    lm += gen(max_tokens=50)
-
-print(lm)
-```
-
-### OpenAI
-
-#### Basic Setup
-
-```python
-from guidance import models
-
-# Using environment variable
-lm = models.OpenAI("gpt-4o")
-# Reads OPENAI_API_KEY from environment
-
-# Explicit API key
-lm = models.OpenAI(
-    model="gpt-4o",
-    api_key="your-api-key-here"
-)
-```
-
-#### Available Models
-
-```python
-# GPT-4o (Latest, multimodal)
-lm = models.OpenAI("gpt-4o")
-
-# GPT-4o Mini (Fast, cost-effective)
-lm = models.OpenAI("gpt-4o-mini")
-
-# GPT-4 Turbo
-lm = models.OpenAI("gpt-4-turbo")
-
-# GPT-3.5 Turbo (Cheapest)
-lm = models.OpenAI("gpt-3.5-turbo")
-```
-
-#### Configuration Options
-
-```python
-lm = models.OpenAI(
-    model="gpt-4o-mini",
-    api_key="your-api-key",
-    max_tokens=2048,
-    temperature=0.7,
-    top_p=1.0,
-    frequency_penalty=0.0,
-    presence_penalty=0.0,
-    timeout=30
-)
-```
-
-#### Chat Format
-
-```python
-from guidance import models, gen, system, user, assistant
-
-lm = models.OpenAI("gpt-4o-mini")
-
-# OpenAI chat models take role blocks rather than raw message dicts
-with system():
-    lm += "You are a helpful assistant."
-
-with user():
-    lm += "What is 2+2?"
-
-# Generate response
-with assistant():
-    lm += gen(max_tokens=50)
-```
-
-### Azure OpenAI
-
-```python
-from guidance import models
-
-lm = models.AzureOpenAI(
-    model="gpt-4o",
-    azure_endpoint="https://your-resource.openai.azure.com/",
-    api_key="your-azure-api-key",
-    api_version="2024-02-15-preview",
-    deployment_name="your-deployment-name"
-)
-```
-
-## Local Models
-
-### Transformers (Hugging Face)
-
-#### Basic Setup
-
-```python
-from guidance.models import Transformers
-
-# Load model from Hugging Face
-lm = Transformers("microsoft/Phi-4-mini-instruct")
-```
-
-#### GPU Configuration
-
-```python
-# Use GPU
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cuda"
-)
-
-# Use specific GPU
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cuda:0"  # GPU 0
-)
-
-# Use CPU
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cpu"
-)
-```
-
-#### Advanced Configuration
-
-```python
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cuda",
-    torch_dtype="float16",      # Use FP16 (faster, less memory)
-    load_in_8bit=True,          # 8-bit quantization
-    max_memory={0: "20GB"},     # GPU memory limit
-    offload_folder="./offload"  # Offload to disk if needed
-)
-```
-
-#### Popular Models
-
-```python
-# Phi-4 (Microsoft)
-lm = Transformers("microsoft/Phi-4-mini-instruct")
-lm = Transformers("microsoft/Phi-3-medium-4k-instruct")
-
-# Llama 3 (Meta)
-lm = Transformers("meta-llama/Llama-3.1-8B-Instruct")
-lm = Transformers("meta-llama/Llama-3.1-70B-Instruct")
-
-# Mistral (Mistral AI)
-lm = Transformers("mistralai/Mistral-7B-Instruct-v0.3")
-lm = Transformers("mistralai/Mixtral-8x7B-Instruct-v0.1")
-
-# Qwen (Alibaba)
-lm = Transformers("Qwen/Qwen2.5-7B-Instruct")
-
-# Gemma (Google)
-lm = Transformers("google/gemma-2-9b-it")
-```
-
-#### Generation Configuration
-
-```python
-lm = Transformers(
-    "microsoft/Phi-4-mini-instruct",
-    device="cuda"
-)
-
-# Configure generation
-from guidance import gen
-
-result = lm + gen(
-    max_tokens=100,
-    temperature=0.7,
-    top_p=0.9,
-    top_k=50,
-    repetition_penalty=1.1
-)
-```
-
-### llama.cpp
-
-#### Basic Setup
-
-```python
-from guidance.models import LlamaCpp
-
-# Load GGUF model
-lm = LlamaCpp(
-    model_path="/path/to/model.gguf",
-    n_ctx=4096  # Context window
-)
-```
-
-#### GPU Configuration
-
-```python
-# Use GPU acceleration
-lm = LlamaCpp(
-    model_path="/path/to/model.gguf",
-    n_ctx=4096,
-    n_gpu_layers=35,  # Offload 35 layers to GPU
-    n_threads=8       # CPU threads for remaining layers
-)
-
-# Full GPU offload
-lm = LlamaCpp(
-    model_path="/path/to/model.gguf",
-    n_ctx=4096,
-    n_gpu_layers=-1  # Offload all layers
-)
-```
-
-#### Advanced Configuration
-
-```python
-lm = LlamaCpp(
-    model_path="/path/to/llama-3.1-8b-instruct.Q4_K_M.gguf",
-    n_ctx=8192,          # Context window (tokens)
-    n_gpu_layers=35,     # GPU layers
-    n_threads=8,         # CPU threads
-    n_batch=512,         # Batch size for prompt processing
-    use_mmap=True,       # Memory-map the model file
-    use_mlock=False,     # Lock model in RAM
-    seed=42,             # Random seed
-    verbose=False        # Suppress verbose output
-)
-```
-
-#### Quantized Models
-
-```python
-# Q4_K_M (4-bit, recommended for most cases)
-lm = LlamaCpp("/path/to/model.Q4_K_M.gguf")
-
-# Q5_K_M (5-bit, better quality)
-lm = LlamaCpp("/path/to/model.Q5_K_M.gguf")
-
-# Q8_0 (8-bit, high quality)
-lm = LlamaCpp("/path/to/model.Q8_0.gguf")
-
-# F16 (16-bit float, highest quality)
-lm = LlamaCpp("/path/to/model.F16.gguf")
-```
-
-#### Popular GGUF Models
-
-```python
-# Llama 3.1
-lm = LlamaCpp("llama-3.1-8b-instruct.Q4_K_M.gguf")
-
-# Mistral
-lm = LlamaCpp("mistral-7b-instruct-v0.3.Q4_K_M.gguf")
-
-# Phi-4
-lm = LlamaCpp("phi-4-mini-instruct.Q4_K_M.gguf")
-```
-
-## Backend Comparison
-
-### Feature Matrix
-
-| Feature | Anthropic | OpenAI | Transformers | llama.cpp |
-|---------|-----------|--------|--------------|-----------|
-| Constrained Generation | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
-| Token Healing | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
-| Streaming | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
-| GPU Support | N/A | N/A | ✅ Yes | ✅ Yes |
-| Quantization | N/A | N/A | ✅ Yes | ✅ Yes |
-| Cost | $$$ | $$$ | Free | Free |
-| Latency | Low | Low | Medium | Low |
-| Setup Difficulty | Easy | Easy | Medium | Medium |
-
-### Performance Characteristics
-
-**Anthropic Claude:**
- **Latency**: 200-500ms (API call)
- **Throughput**: Limited by API rate limits
- **Cost**: $3-15 per 1M input tokens
- **Best for**: Production systems, high-quality outputs
-
-**OpenAI:**
- **Latency**: 200-400ms (API call)
- **Throughput**: Limited by API rate limits
- **Cost**: $0.15-30 per 1M input tokens
- **Best for**: Cost-sensitive production, gpt-4o-mini
-
-**Transformers:**
- **Latency**: 50-200ms (local inference)
- **Throughput**: GPU-dependent (10-100 tokens/sec)
- **Cost**: Hardware cost only
- **Best for**: Privacy-sensitive, high-volume, experimentation
-
-**llama.cpp:**
- **Latency**: 30-150ms (local inference)
- **Throughput**: Hardware-dependent (20-150 tokens/sec)
- **Cost**: Hardware cost only
- **Best for**: Edge deployment, Apple Silicon, CPU inference
-
-### Memory Requirements
-
-**Transformers (FP16):**
- 7B model: ~14GB GPU VRAM
- 13B model: ~26GB GPU VRAM
- 70B model: ~140GB GPU VRAM (multi-GPU)
-
-**llama.cpp (Q4_K_M):**
- 7B model: ~4.5GB RAM
- 13B model: ~8GB RAM
- 70B model: ~40GB RAM
-
-**Optimization Tips:**
- Use quantized models (Q4_K_M) for lower memory
- Use GPU offloading for faster inference
- Use CPU inference for smaller models (<7B)
-
-## Performance Tuning
-
-### API Models (Anthropic, OpenAI)
-
-#### Reduce Latency
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Use lower max_tokens (faster response)
-lm += gen(max_tokens=100)  # Instead of 1000
-
-# Use streaming (perceived latency reduction)
-for chunk in lm.stream(gen(max_tokens=500)):
-    print(chunk, end="", flush=True)
-```
-
-#### Reduce Cost
-
-```python
-# Use cheaper models
-lm = models.Anthropic("claude-3-5-haiku-20241022")  # vs Sonnet
-lm = models.OpenAI("gpt-4o-mini")  # vs gpt-4o
-
-# Reduce context size
-# - Keep prompts concise
-# - Avoid large few-shot examples
-# - Use max_tokens limits
-```
-
-### Local Models (Transformers, llama.cpp)
-
-#### Optimize GPU Usage
-
-```python
-from guidance.models import Transformers
-
-# Use FP16 (half the memory of FP32, usually faster on GPU)
-lm = Transformers(
-    "meta-llama/Llama-3.1-8B-Instruct",
-    device="cuda",
-    torch_dtype="float16"
-)
-
-# Use 8-bit quantization (about half the memory of FP16, a quarter of FP32)
-lm = Transformers(
-    "meta-llama/Llama-3.1-8B-Instruct",
-    device="cuda",
-    load_in_8bit=True
-)
-
-# Use FlashAttention 2 (requires flash-attn package;
-# older transformers releases used use_flash_attention_2=True)
-lm = Transformers(
-    "meta-llama/Llama-3.1-8B-Instruct",
-    device="cuda",
-    attn_implementation="flash_attention_2"
-)
-```
-
-#### Optimize llama.cpp
-
-```python
-from guidance.models import LlamaCpp
-
-# Maximize GPU layers
-lm = LlamaCpp(
-    model_path="/path/to/model.Q4_K_M.gguf",
-    n_gpu_layers=-1  # All layers on GPU
-)
-
-# Optimize batch size
-lm = LlamaCpp(
-    model_path="/path/to/model.Q4_K_M.gguf",
-    n_batch=512,     # Larger batch = faster prompt processing
-    n_gpu_layers=-1
-)
-
-# Use Metal (Apple Silicon)
-lm = LlamaCpp(
-    model_path="/path/to/model.Q4_K_M.gguf",
-    n_gpu_layers=-1,  # Use Metal GPU acceleration
-    use_mmap=True
-)
-```
-
-#### Batch Processing
-
-```python
-from guidance import gen
-from guidance.models import Transformers
-
-# Process multiple requests efficiently
-requests = [
-    "What is 2+2?",
-    "What is the capital of France?",
-    "What is photosynthesis?"
-]
-
-# Bad: Reloads the model for every request
-for req in requests:
-    lm = Transformers("microsoft/Phi-4-mini-instruct")
-    lm += req + gen(max_tokens=50)
-
-# Good: Load once; `lm + ...` returns a new state, so prompts don't accumulate
-lm = Transformers("microsoft/Phi-4-mini-instruct")
-for req in requests:
-    result = lm + req + gen(max_tokens=50)
-```
-
-## Advanced Configuration
-
-### Custom Model Configurations
-
-```python
-from transformers import AutoTokenizer, AutoModelForCausalLM
-from guidance.models import Transformers
-
-# Load custom model
-tokenizer = AutoTokenizer.from_pretrained("your-model")
-model = AutoModelForCausalLM.from_pretrained(
-    "your-model",
-    device_map="auto",
-    torch_dtype="float16"
-)
-
-# Use with Guidance
-lm = Transformers(model=model, tokenizer=tokenizer)
-```
-
-### Environment Variables
-
-```bash
-# API keys
-export ANTHROPIC_API_KEY="sk-ant-..."
-export OPENAI_API_KEY="sk-..."
-
-# Transformers cache
-export HF_HOME="/path/to/cache"
-export TRANSFORMERS_CACHE="/path/to/cache"  # deprecated in recent transformers; prefer HF_HOME
-
-# GPU selection
-export CUDA_VISIBLE_DEVICES=0,1  # Use GPU 0 and 1
-```
-
-### Debugging
-
-```python
-# Enable verbose logging
-import logging
-
-import torch
-from guidance import models
-from guidance.models import Transformers
-
-logging.basicConfig(level=logging.DEBUG)
-
-# Check backend info
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-print(f"Model: {lm.model_name}")
-print(f"Backend: {lm.backend}")
-
-# Check GPU usage (Transformers)
-lm = Transformers("microsoft/Phi-4-mini-instruct", device="cuda")
-print(f"Device: {lm.device}")
-print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
-```
-
-## Resources
-
- **Anthropic Docs**: https://docs.anthropic.com
- **OpenAI Docs**: https://platform.openai.com/docs
- **Hugging Face Models**: https://huggingface.co/models
- **llama.cpp**: https://github.com/ggerganov/llama.cpp
- **GGUF Models**: https://huggingface.co/models?library=gguf
--- a/skills/mlops/inference/guidance/references/constraints.md
+++ b/skills/mlops/inference/guidance/references/constraints.md
@ -1,674 +0,0 @@
-# Comprehensive Constraint Patterns
-
-Guide to regex constraints, grammar-based generation, and token healing in Guidance.
-
-## Table of Contents
- Regex Constraints
- Grammar-Based Generation
- Token Healing
- Selection Constraints
- Complex Patterns
- Performance Optimization
-
-## Regex Constraints
-
-### Basic Patterns
-
-#### Numeric Constraints
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Integer (positive)
-lm += "Age: " + gen("age", regex=r"[0-9]+")
-
-# Integer (with negatives)
-lm += "Temperature: " + gen("temp", regex=r"-?[0-9]+")
-
-# Float (positive)
-lm += "Price: $" + gen("price", regex=r"[0-9]+\.[0-9]{2}")
-
-# Float (with negatives and optional decimals)
-lm += "Value: " + gen("value", regex=r"-?[0-9]+(\.[0-9]+)?")
-
-# Percentage (0-100)
-lm += "Progress: " + gen("progress", regex=r"(100|[0-9]{1,2})")
-
-# Range (1-5 stars)
-lm += "Rating: " + gen("rating", regex=r"[1-5]") + " stars"
-```
-
-#### Text Constraints
-
-```python
-# Alphabetic only
-lm += "Name: " + gen("name", regex=r"[A-Za-z]+")
-
-# Alphabetic with spaces
-lm += "Full Name: " + gen("full_name", regex=r"[A-Za-z ]+")
-
-# Alphanumeric
-lm += "Username: " + gen("username", regex=r"[A-Za-z0-9_]+")
-
-# Capitalized words
-lm += "Title: " + gen("title", regex=r"[A-Z][a-z]+( [A-Z][a-z]+)*")
-
-# Lowercase slug (letters, digits, hyphens)
-lm += "Code: " + gen("code", regex=r"[a-z0-9-]+")
-
-# Specific length
-lm += "ID: " + gen("id", regex=r"[A-Z]{3}-[0-9]{6}")  # e.g., "ABC-123456"
-```
-
-#### Date and Time Constraints
-
-```python
-# Date (YYYY-MM-DD)
-lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}")
-
-# Date (MM/DD/YYYY)
-lm += "Date: " + gen("date_us", regex=r"\d{2}/\d{2}/\d{4}")
-
-# Time (HH:MM)
-lm += "Time: " + gen("time", regex=r"\d{2}:\d{2}")
-
-# Time (HH:MM:SS)
-lm += "Time: " + gen("time_full", regex=r"\d{2}:\d{2}:\d{2}")
-
-# ISO 8601 datetime
-lm += "Timestamp: " + gen(
-    "timestamp",
-    regex=r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"
-)
-
-# Year (YYYY)
-lm += "Year: " + gen("year", regex=r"(19|20)\d{2}")
-
-# Month name
-lm += "Month: " + gen(
-    "month",
-    regex=r"(January|February|March|April|May|June|July|August|September|October|November|December)"
-)
-```
-
-#### Contact Information
-
-```python
-# Email
-lm += "Email: " + gen(
-    "email",
-    regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
-)
-
-# Phone (US format)
-lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}")
-
-# Phone (international format)
-lm += "Phone: " + gen("phone_intl", regex=r"\+[0-9]{1,3}-[0-9]{1,14}")
-
-# ZIP code (US)
-lm += "ZIP: " + gen("zip", regex=r"\d{5}(-\d{4})?")
-
-# Postal code (Canada)
-lm += "Postal: " + gen("postal", regex=r"[A-Z]\d[A-Z] \d[A-Z]\d")
-
-# URL
-lm += "URL: " + gen(
-    "url",
-    regex=r"https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/[a-zA-Z0-9._~:/?#\[\]@!$&'()*+,;=-]*)?"
-)
-```
-
-### Advanced Patterns
-
-#### JSON Field Constraints
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# String field with quotes
-lm += '"name": ' + gen("name", regex=r'"[A-Za-z ]+"')
-
-# Numeric field (no quotes)
-lm += '"age": ' + gen("age", regex=r"[0-9]+")
-
-# Boolean field
-lm += '"active": ' + gen("active", regex=r"(true|false)")
-
-# Nullable field (null or integer)
-lm += '"optional": ' + gen("optional", regex=r"(null|[0-9]+)")
-
-# Array of strings
-lm += '"tags": [' + gen(
-    "tags",
-    regex=r'"[a-z]+"(, "[a-z]+")*'
-) + ']'
-
-# Complete JSON object
-lm += """{
-    "name": """ + gen("name", regex=r'"[A-Za-z ]+"') + """,
-    "age": """ + gen("age", regex=r"[0-9]+") + """,
-    "email": """ + gen(
-        "email",
-        regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
-    ) + """
-}"""
-```
-
-#### Code Patterns
-
-```python
-# Python variable name
-lm += "Variable: " + gen("var", regex=r"[a-z_][a-z0-9_]*")
-
-# Python function name
-lm += "Function: " + gen("func", regex=r"[a-z_][a-z0-9_]*")
-
-# Hex color code
-lm += "Color: #" + gen("color", regex=r"[0-9A-Fa-f]{6}")
-
-# UUID
-lm += "UUID: " + gen(
-    "uuid",
-    regex=r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
-)
-
-# Git commit hash (short)
-lm += "Commit: " + gen("commit", regex=r"[0-9a-f]{7}")
-
-# Semantic version
-lm += "Version: " + gen("version", regex=r"[0-9]+\.[0-9]+\.[0-9]+")
-
-# IP address (IPv4)
-lm += "IP: " + gen(
-    "ip",
-    regex=r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
-)
-```
-
-#### Domain-Specific Patterns
-
-```python
-# Credit card number
-lm += "Card: " + gen("card", regex=r"\d{4}-\d{4}-\d{4}-\d{4}")
-
-# Social Security Number (US)
-lm += "SSN: " + gen("ssn", regex=r"\d{3}-\d{2}-\d{4}")
-
-# ISBN-13
-lm += "ISBN: " + gen("isbn", regex=r"978-\d{1,5}-\d{1,7}-\d{1,7}-\d")
-
-# License plate (US)
-lm += "Plate: " + gen("plate", regex=r"[A-Z]{3}-\d{4}")
-
-# Currency amount
-lm += "Amount: $" + gen("amount", regex=r"[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}")
-
-# Percentage with decimal
-lm += "Rate: " + gen("rate", regex=r"[0-9]+\.[0-9]{1,2}%")
-```
-
-## Grammar-Based Generation
-
-### JSON Grammar
-
-```python
-from guidance import models, gen, guidance
-
-@guidance
-def json_object(lm):
-    """Generate valid JSON object."""
-    lm += "{\n"
-
-    # Name field (required)
-    lm += '    "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"
-
-    # Age field (required)
-    lm += '    "age": ' + gen("age", regex=r"[0-9]+") + ",\n"
-
-    # Email field (required)
-    lm += '    "email": ' + gen(
-        "email",
-        regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
-    ) + ",\n"
-
-    # Active field (required, boolean)
-    lm += '    "active": ' + gen("active", regex=r"(true|false)") + "\n"
-
-    lm += "}"
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm += json_object()  # @guidance functions receive lm implicitly when added
-print(lm)  # Layout is fixed by the template; each field matches its regex
-```
-
-### Nested JSON Grammar
-
-```python
-@guidance
-def nested_json(lm):
-    """Generate nested JSON structure."""
-    lm += "{\n"
-
-    # User object
-    lm += '    "user": {\n'
-    lm += '        "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"
-    lm += '        "age": ' + gen("age", regex=r"[0-9]+") + "\n"
-    lm += "    },\n"
-
-    # Address object
-    lm += '    "address": {\n'
-    lm += '        "street": ' + gen("street", regex=r'"[A-Za-z0-9 ]+"') + ",\n"
-    lm += '        "city": ' + gen("city", regex=r'"[A-Za-z ]+"') + ",\n"
-    lm += '        "zip": ' + gen("zip", regex=r'"\d{5}"') + "\n"
-    lm += "    }\n"
-
-    lm += "}"
-    return lm
-```
-
-### Array Grammar
-
-```python
-@guidance
-def json_array(lm, count=3):
-    """Generate JSON array with fixed count."""
-    lm += "[\n"
-
-    for i in range(count):
-        lm += "    {\n"
-        lm += '        "id": ' + gen(f"id_{i}", regex=r"[0-9]+") + ",\n"
-        lm += '        "name": ' + gen(f"name_{i}", regex=r'"[A-Za-z ]+"') + "\n"
-        lm += "    }"
-        if i < count - 1:
-            lm += ","
-        lm += "\n"
-
-    lm += "]"
-    return lm
-```
-
-### XML Grammar
-
-```python
-@guidance
-def xml_document(lm):
-    """Generate valid XML document."""
-    lm += '<?xml version="1.0"?>\n'
-    lm += "<person>\n"
-
-    # Name element
-    lm += "    <name>" + gen("name", regex=r"[A-Za-z ]+") + "</name>\n"
-
-    # Age element
-    lm += "    <age>" + gen("age", regex=r"[0-9]+") + "</age>\n"
-
-    # Email element
-    lm += "    <email>" + gen(
-        "email",
-        regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
-    ) + "</email>\n"
-
-    lm += "</person>"
-    return lm
-```
-
-### CSV Grammar
-
-```python
-@guidance
-def csv_row(lm):
-    """Generate CSV row."""
-    lm += gen("name", regex=r"[A-Za-z ]+") + ","
-    lm += gen("age", regex=r"[0-9]+") + ","
-    lm += gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
-    return lm
-
-@guidance
-def csv_document(lm, rows=5):
-    """Generate complete CSV."""
-    # Header
-    lm += "Name,Age,Email\n"
-
-    # Rows
-    for i in range(rows):
-        lm += csv_row()
-        if i < rows - 1:
-            lm += "\n"
-
-    return lm
-```
-
-## Token Healing
-
-### How Token Healing Works
-
-**Problem:** Tokenization creates unnatural boundaries.
-
-```python
-# Example without token healing
-prompt = "The capital of France is "
-# Tokenization: ["The", " capital", " of", " France", " is", " "]
-# Model sees last token: " "
-# First generated token might include leading space: " Paris"
-# Result: "The capital of France is  Paris" (double space)
-```
-
-**Solution:** Guidance backs up and regenerates the last token.
-
-```python
-from guidance import models, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Token healing enabled by default
-lm += "The capital of France is " + gen("capital", max_tokens=5)
-
-# Process:
-# 1. Back up over the last prompt token (the trailing " ")
-# 2. Constrain the next token to start with that text, e.g. " Paris"
-# 3. Result: "The capital of France is Paris" (correct)
-```
-
-### Token Healing Examples
-
-#### Natural Continuations
-
-```python
-# Without token healing
-lm += "The function name is get" + gen("rest")
-# Might generate: "The function name is get User" (space before User)
-
-# With token healing
-lm += "The function name is get" + gen("rest")
-# Generates: "The function name is getUser" (correct camelCase)
-```
-
-#### Code Generation
-
-```python
-# Function name completion
-lm += "def calculate_" + gen("rest", stop="(")
-# Token healing ensures smooth connection: "calculate_total"
-
-# Variable name completion
-lm += "my_" + gen("var_name", regex=r"[a-z_]+")
-# Token healing ensures: "my_variable_name" (not "my_ variable_name")
-```
-
-#### Domain-Specific Terms
-
-```python
-# Medical terms
-lm += "The patient has hyper" + gen("condition")
-# Token healing helps: "hypertension" (not "hyper tension")
-
-# Technical terms
-lm += "Using micro" + gen("tech")
-# Token healing helps: "microservices" (not "micro services")
-```
-
-### Disabling Token Healing
-
-```python
-# Disable token healing if needed (rare)
-lm += gen("text", token_healing=False)
-```
-
-## Selection Constraints
-
-### Basic Selection
-
-```python
-from guidance import models, select
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-# Simple selection
-lm += "Status: " + select(["active", "inactive", "pending"], name="status")
-
-# Boolean selection
-lm += "Approved: " + select(["Yes", "No"], name="approved")
-
-# Multiple choice
-lm += "Answer: " + select(
-    ["A) Paris", "B) London", "C) Berlin", "D) Madrid"],
-    name="answer"
-)
-```
-
-### Conditional Selection
-
-```python
-from guidance import models, select, gen, guidance
-
-@guidance
-def conditional_fields(lm):
-    """Generate fields conditionally based on type."""
-    lm += "Type: " + select(["person", "company"], name="type")
-
-    if lm["type"] == "person":
-        lm += "\nName: " + gen("name", regex=r"[A-Za-z ]+")
-        lm += "\nAge: " + gen("age", regex=r"[0-9]+")
-    else:
-        lm += "\nCompany Name: " + gen("company", regex=r"[A-Za-z ]+")
-        lm += "\nEmployees: " + gen("employees", regex=r"[0-9]+")
-
-    return lm
-```
-
-### Repeated Selection
-
-```python
-@guidance
-def multiple_selections(lm):
-    """Select multiple items."""
-    lm += "Select 3 colors:\n"
-
-    colors = ["red", "blue", "green", "yellow", "purple"]
-
-    for i in range(3):
-        lm += f"{i+1}. " + select(colors, name=f"color_{i}") + "\n"
-
-    return lm
-```
-
-## Complex Patterns
-
-### Pattern 1: Structured Forms
-
-```python
-@guidance
-def user_form(lm):
-    """Generate structured user form."""
-    lm += "=== User Registration ===\n\n"
-
-    # Name (alphabetic only)
-    lm += "Full Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-
-    # Age (numeric)
-    lm += "Age: " + gen("age", regex=r"[0-9]+", max_tokens=3) + "\n"
-
-    # Email (validated format)
-    lm += "Email: " + gen(
-        "email",
-        regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
-        stop="\n"
-    ) + "\n"
-
-    # Phone (US format)
-    lm += "Phone: " + gen("phone", regex=r"\d{3}-\d{3}-\d{4}") + "\n"
-
-    # Account type (selection)
-    lm += "Account Type: " + select(
-        ["Standard", "Premium", "Enterprise"],
-        name="account_type"
-    ) + "\n"
-
-    # Active status (boolean)
-    lm += "Active: " + select(["Yes", "No"], name="active") + "\n"
-
-    return lm
-```
-
-### Pattern 2: Multi-Entity Extraction
-
-```python
-@guidance
-def extract_entities(lm, text):
-    """Extract multiple entities with constraints."""
-    lm += f"Text: {text}\n\n"
-
-    # Person name (alphabetic)
-    lm += "Person: " + gen("person", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-
-    # Organization (alphanumeric with spaces)
-    lm += "Organization: " + gen(
-        "organization",
-        regex=r"[A-Za-z0-9 ]+",
-        stop="\n"
-    ) + "\n"
-
-    # Date (YYYY-MM-DD format)
-    lm += "Date: " + gen("date", regex=r"\d{4}-\d{2}-\d{2}") + "\n"
-
-    # Location (alphabetic with spaces)
-    lm += "Location: " + gen("location", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-
-    # Amount (currency)
-    lm += "Amount: $" + gen("amount", regex=r"[0-9,]+\.[0-9]{2}") + "\n"
-
-    return lm
-```
-
-### Pattern 3: Code Generation
-
-```python
-@guidance
-def generate_python_function(lm):
-    """Generate Python function with constraints."""
-    # Function name (valid Python identifier)
-    lm += "def " + gen("func_name", regex=r"[a-z_][a-z0-9_]*") + "("
-
-    # Parameter name
-    lm += gen("param", regex=r"[a-z_][a-z0-9_]*") + "):\n"
-
-    # Docstring
-    lm += '    """' + gen("docstring", stop='"""', max_tokens=50) + '"""\n'
-
-    # Return expression (free text up to the newline; not validated as Python)
-    lm += "    return " + gen("return_value", stop="\n") + "\n"
-
-    return lm
-```
-
-### Pattern 4: Hierarchical Data
-
-```python
-@guidance
-def org_chart(lm):
-    """Generate organizational chart."""
-    lm += "Company: " + gen("company", regex=r"[A-Za-z ]+") + "\n\n"
-
-    # CEO
-    lm += "CEO: " + gen("ceo", regex=r"[A-Za-z ]+") + "\n"
-
-    # Departments
-    for dept in ["Engineering", "Sales", "Marketing"]:
-        lm += f"\n{dept} Department:\n"
-        lm += "  Head: " + gen(f"{dept.lower()}_head", regex=r"[A-Za-z ]+") + "\n"
-        lm += "  Size: " + gen(f"{dept.lower()}_size", regex=r"[0-9]+") + " employees\n"
-
-    return lm
-```
-
-## Performance Optimization
-
-### Best Practices
-
-#### 1. Use Specific Patterns
-
-```python
-# ✅ Good: Specific pattern
-lm += gen("age", regex=r"[0-9]{1,3}")  # Bounded: at most 3 digits
-
-# ❌ Bad: Overly broad pattern
-lm += gen("age", regex=r"[0-9]+")  # Unbounded: may emit arbitrarily many digits
-```
-
-#### 2. Limit Max Tokens
-
-```python
-# ✅ Good: Reasonable limit
-lm += gen("name", max_tokens=30)
-
-# ❌ Bad: No limit
-lm += gen("name")  # Runs until EOS or the backend's default token limit
-```
-
-#### 3. Use stop Sequences
-
-```python
-# ✅ Good: Stop at newline
-lm += gen("line", stop="\n")
-
-# ❌ Bad: Rely on max_tokens
-lm += gen("line", max_tokens=100)
-```
-
-#### 4. Cache Compiled Grammars
-
-```python
-# Grammars are cached automatically after first use
-# No manual caching needed
-@guidance
-def reusable_pattern(lm):
-    """This grammar is compiled once and cached."""
-    lm += gen("email", regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
-    return lm
-
-# First call: compiles grammar
-lm += reusable_pattern()
-
-# Subsequent calls: uses cached grammar (fast)
-lm += reusable_pattern()
-```
-
-#### 5. Avoid Overlapping Constraints
-
-```python
-# ✅ Good: regex bounds the format, max_tokens bounds the length
-lm += gen("age", regex=r"[0-9]+", max_tokens=3)
-
-# ❌ Bad: Redundant constraint
-lm += gen("age", regex=r"[0-9]{2}", max_tokens=10)  # regex already fixes the length at 2 digits
-```
-
-### Performance Benchmarks
-
-**Regex vs Free Generation:**
- Simple regex (digits): ~1.2x slower than free gen
- Complex regex (email): ~1.5x slower than free gen
- Grammar-based: ~2x slower than free gen
-
-**But:**
- 100% valid outputs (vs ~70% with free gen + validation)
- No retry loops needed
- Overall faster end-to-end for structured outputs
-
-**Optimization Tips:**
- Use regex for critical fields only
- Use `select()` for small fixed sets (fastest)
- Use `stop` sequences when possible (faster than max_tokens)
- Cache compiled grammars by reusing functions
-
-## Resources
-
- **Token Healing Paper**: https://arxiv.org/abs/2306.17648
- **Guidance Docs**: https://guidance.readthedocs.io
- **GitHub**: https://github.com/guidance-ai/guidance
--- a/skills/mlops/inference/guidance/references/examples.md
+++ b/skills/mlops/inference/guidance/references/examples.md
@ -1,767 +0,0 @@
-# Production-Ready Examples
-
-Real-world examples of using Guidance for structured generation, agents, and workflows.
-
-## Table of Contents
- JSON Generation
- Data Extraction
- Classification Systems
- Agent Systems
- Multi-Step Workflows
- Code Generation
- Production Tips
-
-## JSON Generation
-
-### Basic JSON
-
-```python
-from guidance import models, gen, guidance
-
-@guidance
-def generate_user(lm):
-    """Generate valid user JSON."""
-    lm += "{\n"
-    lm += '  "name": ' + gen("name", regex=r'"[A-Za-z ]+"') + ",\n"
-    lm += '  "age": ' + gen("age", regex=r"[0-9]+") + ",\n"
-    lm += '  "email": ' + gen(
-        "email",
-        regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
-    ) + "\n"
-    lm += "}"
-    return lm
-
-# Use it
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm += "Generate a user profile:\n"
-lm += generate_user()
-
-print(lm)
-# Output: Valid JSON guaranteed
-```
-
-### Nested JSON
-
-```python
-@guidance
-def generate_order(lm):
-    """Generate nested order JSON."""
-    lm += "{\n"
-
-    # Customer info
-    lm += '  "customer": {\n'
-    lm += '    "name": ' + gen("customer_name", regex=r'"[A-Za-z ]+"') + ",\n"
-    lm += '    "email": ' + gen(
-        "customer_email",
-        regex=r'"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"'
-    ) + "\n"
-    lm += "  },\n"
-
-    # Order details
-    lm += '  "order": {\n'
-    lm += '    "id": ' + gen("order_id", regex=r'"ORD-[0-9]{6}"') + ",\n"
-    lm += '    "date": ' + gen("order_date", regex=r'"\d{4}-\d{2}-\d{2}"') + ",\n"
-    lm += '    "total": ' + gen("order_total", regex=r"[0-9]+\.[0-9]{2}") + "\n"
-    lm += "  },\n"
-
-    # Status
-    lm += '  "status": ' + gen(
-        "status",
-        regex=r'"(pending|processing|shipped|delivered)"'
-    ) + "\n"
-
-    lm += "}"
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm += generate_order()
-```
-
-### JSON Array
-
-```python
-@guidance
-def generate_user_list(lm, count=3):
-    """Generate JSON array of users."""
-    lm += "[\n"
-
-    for i in range(count):
-        lm += "  {\n"
-        lm += '    "id": ' + gen(f"id_{i}", regex=r"[0-9]+") + ",\n"
-        lm += '    "name": ' + gen(f"name_{i}", regex=r'"[A-Za-z ]+"') + ",\n"
-        lm += '    "active": ' + gen(f"active_{i}", regex=r"(true|false)") + "\n"
-        lm += "  }"
-        if i < count - 1:
-            lm += ","
-        lm += "\n"
-
-    lm += "]"
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm += generate_user_list(count=5)
-```
-
-### Dynamic JSON Schema
-
-```python
-import json
-from guidance import models, gen, guidance
-
-@guidance
-def json_from_schema(lm, schema):
-    """Generate JSON matching a schema."""
-    lm += "{\n"
-
-    fields = list(schema["properties"].items())
-    for i, (field_name, field_schema) in enumerate(fields):
-        lm += f'  "{field_name}": '
-
-        # Handle different types
-        if field_schema["type"] == "string":
-            if "pattern" in field_schema:
-                lm += gen(field_name, regex=f'"{field_schema["pattern"]}"')
-            else:
-                lm += gen(field_name, regex=r'"[^"]+"')
-        elif field_schema["type"] == "number":
-            lm += gen(field_name, regex=r"[0-9]+(\.[0-9]+)?")
-        elif field_schema["type"] == "integer":
-            lm += gen(field_name, regex=r"[0-9]+")
-        elif field_schema["type"] == "boolean":
-            lm += gen(field_name, regex=r"(true|false)")
-
-        if i < len(fields) - 1:
-            lm += ","
-        lm += "\n"
-
-    lm += "}"
-    return lm
-
-# Define schema
-schema = {
-    "type": "object",
-    "properties": {
-        "name": {"type": "string"},
-        "age": {"type": "integer"},
-        "score": {"type": "number"},
-        "active": {"type": "boolean"}
-    }
-}
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm += json_from_schema(schema)
-```
-
-## Data Extraction
-
-### Extract from Text
-
-```python
-from guidance import models, gen, guidance, system, user, assistant
-
-@guidance
-def extract_person_info(lm, text):
-    """Extract structured info from text."""
-    with user():
-        lm += f"Text: {text}\n\n"
-
-    with assistant():
-        lm += "Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-        lm += "Age: " + gen("age", regex=r"[0-9]+", max_tokens=3) + "\n"
-        lm += "Occupation: " + gen("occupation", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-        lm += "Email: " + gen(
-            "email",
-            regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
-            stop="\n"
-        ) + "\n"
-
-    return lm
-
-text = "John Smith is a 35-year-old software engineer. Contact: john@example.com"
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-with system():
-    lm += "You extract structured information from text."
-
-# The function opens its own user/assistant blocks, so call it at top level
-lm += extract_person_info(text)
-
-print(f"Name: {lm['name']}")
-print(f"Age: {lm['age']}")
-print(f"Occupation: {lm['occupation']}")
-print(f"Email: {lm['email']}")
-```
-
-### Multi-Entity Extraction
-
-```python
-@guidance
-def extract_entities(lm, text):
-    """Extract multiple entity types."""
-    lm += f"Analyze: {text}\n\n"
-
-    # Person entities
-    lm += "People:\n"
-    for i in range(3):  # Exactly 3 slots; the loop always emits 3 entries
-        lm += "- " + gen(f"person_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-
-    # Organization entities
-    lm += "\nOrganizations:\n"
-    for i in range(2):  # Exactly 2 slots
-        lm += "- " + gen(f"org_{i}", regex=r"[A-Za-z0-9 ]+", stop="\n") + "\n"
-
-    # Dates
-    lm += "\nDates:\n"
-    for i in range(2):  # Exactly 2 slots
-        lm += "- " + gen(f"date_{i}", regex=r"\d{4}-\d{2}-\d{2}", stop="\n") + "\n"
-
-    # Locations
-    lm += "\nLocations:\n"
-    for i in range(2):  # Exactly 2 slots
-        lm += "- " + gen(f"location_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-
-    return lm
-
-text = """
-Tim Cook and Satya Nadella met at Microsoft headquarters in Redmond on 2024-09-15
-to discuss the collaboration between Apple and Microsoft. The meeting continued
-in Cupertino on 2024-09-20.
-"""
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm += extract_entities(text)
-```
-
-### Batch Extraction
-
-```python
-@guidance
-def batch_extract(lm, texts):
-    """Extract from multiple texts."""
-    lm += "Batch Extraction Results:\n\n"
-
-    for i, text in enumerate(texts):
-        lm += f"=== Item {i+1} ===\n"
-        lm += f"Text: {text}\n"
-        lm += "Name: " + gen(f"name_{i}", regex=r"[A-Za-z ]+", stop="\n") + "\n"
-        lm += "Sentiment: " + gen(
-            f"sentiment_{i}",
-            regex=r"(positive|negative|neutral)",
-            stop="\n"
-        ) + "\n\n"
-
-    return lm
-
-texts = [
-    "Alice is happy with the product",
-    "Bob is disappointed with the service",
-    "Carol has no strong feelings either way"
-]
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm += batch_extract(texts)
-```
-
-## Classification Systems
-
-### Sentiment Analysis
-
-```python
-from guidance import models, select, gen
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-
-text = "This product is absolutely amazing! Best purchase ever."
-
-lm += f"Text: {text}\n\n"
-lm += "Sentiment: " + select(
-    ["positive", "negative", "neutral"],
-    name="sentiment"
-)
-lm += "\nConfidence: " + gen("confidence", regex=r"[0-9]{1,3}") + "%\n"
-lm += "Reasoning: " + gen("reasoning", stop="\n", max_tokens=50)
-
-print(f"Sentiment: {lm['sentiment']}")
-print(f"Confidence: {lm['confidence']}%")
-print(f"Reasoning: {lm['reasoning']}")
-```
-
-### Multi-Label Classification
-
-```python
-@guidance
-def classify_article(lm, text):
-    """Classify article with multiple labels."""
-    lm += f"Article: {text}\n\n"
-
-    # Primary category
-    lm += "Primary Category: " + select(
-        ["Technology", "Business", "Science", "Politics", "Entertainment"],
-        name="primary_category"
-    ) + "\n"
-
-    # Secondary categories (up to 3)
-    lm += "\nSecondary Categories:\n"
-    categories = ["Technology", "Business", "Science", "Politics", "Entertainment"]
-    for i in range(3):
-        lm += f"{i+1}. " + select(categories, name=f"secondary_{i}") + "\n"
-
-    # Tags
-    lm += "\nTags: " + gen("tags", stop="\n", max_tokens=50) + "\n"
-
-    # Target audience
-    lm += "Target Audience: " + select(
-        ["General", "Expert", "Beginner"],
-        name="audience"
-    )
-
-    return lm
-
-article = """
-Apple announced new AI features in iOS 18, leveraging machine learning to improve
-battery life and performance. The company's stock rose 5% following the announcement.
-"""
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = classify_article(lm, article)
-```
-
-### Intent Classification
-
-```python
-@guidance
-def classify_intent(lm, message):
-    """Classify user intent."""
-    lm += f"User Message: {message}\n\n"
-
-    # Intent
-    lm += "Intent: " + select(
-        ["question", "complaint", "request", "feedback", "other"],
-        name="intent"
-    ) + "\n"
-
-    # Urgency
-    lm += "Urgency: " + select(
-        ["low", "medium", "high", "critical"],
-        name="urgency"
-    ) + "\n"
-
-    # Department
-    lm += "Route To: " + select(
-        ["support", "sales", "billing", "technical"],
-        name="department"
-    ) + "\n"
-
-    # Sentiment
-    lm += "Sentiment: " + select(
-        ["positive", "neutral", "negative"],
-        name="sentiment"
-    )
-
-    return lm
-
-message = "My account was charged twice for the same order. Need help ASAP!"
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = classify_intent(lm, message)
-
-print(f"Intent: {lm['intent']}")
-print(f"Urgency: {lm['urgency']}")
-print(f"Department: {lm['department']}")
-```
-
-## Agent Systems
-
-### ReAct Agent
-
-```python
-from guidance import models, gen, select, guidance
-
-@guidance(stateless=False)
-def react_agent(lm, question, tools, max_rounds=5):
-    """ReAct agent with tool use."""
-    lm += f"Question: {question}\n\n"
-
-    for step in range(max_rounds):  # "step" avoids shadowing the built-in round()
-        # Thought
-        lm += f"Thought {step+1}: " + gen("thought", stop="\n", max_tokens=100) + "\n"
-
-        # Action selection
-        lm += "Action: " + select(
-            list(tools.keys()) + ["answer"],
-            name="action"
-        )
-
-        if lm["action"] == "answer":
-            lm += "\n\nFinal Answer: " + gen("answer", max_tokens=200)
-            break
-
-        # Action input
-        lm += "\nAction Input: " + gen("action_input", stop="\n", max_tokens=100) + "\n"
-
-        # Execute tool
-        if lm["action"] in tools:
-            try:
-                result = tools[lm["action"]](lm["action_input"])
-                lm += f"Observation: {result}\n\n"
-            except Exception as e:
-                lm += f"Observation: Error - {str(e)}\n\n"
-
-    return lm
-
-# Define tools
-tools = {
-    "calculator": lambda expr: eval(expr),  # Demo only: eval() on model-generated text is unsafe
-    "search": lambda query: f"Search results for '{query}': [Mock results]",
-    "weather": lambda city: f"Weather in {city}: Sunny, 72°F"
-}
-
-# Use agent
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = react_agent(lm, "What is (25 * 4) + 10?", tools)
-
-print(lm["answer"])
-```
-
-### Multi-Agent System
-
-```python
-@guidance
-def coordinator_agent(lm, task):
-    """Coordinator that delegates to specialists."""
-    lm += f"Task: {task}\n\n"
-
-    # Determine which specialist to use
-    lm += "Specialist: " + select(
-        ["researcher", "writer", "coder", "analyst"],
-        name="specialist"
-    ) + "\n"
-
-    lm += "Reasoning: " + gen("reasoning", stop="\n", max_tokens=100) + "\n"
-
-    return lm
-
-@guidance
-def researcher_agent(lm, query):
-    """Research specialist."""
-    lm += f"Research Query: {query}\n\n"
-    lm += "Findings:\n"
-    for i in range(3):
-        lm += f"{i+1}. " + gen(f"finding_{i}", stop="\n", max_tokens=100) + "\n"
-    return lm
-
-@guidance
-def writer_agent(lm, topic):
-    """Writing specialist."""
-    lm += f"Topic: {topic}\n\n"
-    lm += "Title: " + gen("title", stop="\n", max_tokens=50) + "\n"
-    lm += "Content:\n" + gen("content", max_tokens=500)
-    return lm
-
-# Coordination workflow
-task = "Write an article about AI safety"
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = coordinator_agent(lm, task)
-
-specialist = lm["specialist"]
-if specialist == "researcher":
-    lm = researcher_agent(lm, task)
-elif specialist == "writer":
-    lm = writer_agent(lm, task)
-```
-
-### Tool Use with Validation
-
-```python
-@guidance(stateless=False)
-def validated_tool_agent(lm, question):
-    """Agent with validated tool calls."""
-    tools = {
-        "add": lambda a, b: float(a) + float(b),
-        "multiply": lambda a, b: float(a) * float(b),
-        "divide": lambda a, b: float(a) / float(b) if float(b) != 0 else "Error: Division by zero"
-    }
-
-    lm += f"Question: {question}\n\n"
-
-    for i in range(5):
-        # Select tool
-        lm += "Tool: " + select(list(tools.keys()) + ["done"], name="tool")
-
-        if lm["tool"] == "done":
-            lm += "\nAnswer: " + gen("answer", max_tokens=100)
-            break
-
-        # Get validated numeric arguments
-        lm += "\nArg1: " + gen("arg1", regex=r"-?[0-9]+(\.[0-9]+)?") + "\n"
-        lm += "Arg2: " + gen("arg2", regex=r"-?[0-9]+(\.[0-9]+)?") + "\n"
-
-        # Execute
-        result = tools[lm["tool"]](lm["arg1"], lm["arg2"])
-        lm += f"Result: {result}\n\n"
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = validated_tool_agent(lm, "What is (10 + 5) * 3?")
-```
-
-## Multi-Step Workflows
-
-### Chain of Thought
-
-```python
-@guidance
-def chain_of_thought(lm, question):
-    """Multi-step reasoning with CoT."""
-    lm += f"Question: {question}\n\n"
-
-    # Generate reasoning steps
-    lm += "Let me think step by step:\n\n"
-    for i in range(4):
-        lm += f"Step {i+1}: " + gen(f"step_{i+1}", stop="\n", max_tokens=100) + "\n"
-
-    # Final answer
-    lm += "\nTherefore, the answer is: " + gen("answer", stop="\n", max_tokens=50)
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = chain_of_thought(lm, "If a train travels 60 mph for 2.5 hours, how far does it go?")
-
-print(lm["answer"])
-```
-
-### Self-Consistency
-
-```python
-@guidance
-def self_consistency(lm, question, num_samples=3):
-    """Generate multiple reasoning paths and aggregate."""
-    lm += f"Question: {question}\n\n"
-
-    answers = []
-    for i in range(num_samples):
-        lm += f"=== Attempt {i+1} ===\n"
-        lm += "Reasoning: " + gen(f"reasoning_{i}", stop="\n", max_tokens=100) + "\n"
-        lm += "Answer: " + gen(f"answer_{i}", stop="\n", max_tokens=50) + "\n\n"
-        answers.append(lm[f"answer_{i}"])
-
-    # Aggregate (simple majority vote)
-    from collections import Counter
-    most_common = Counter(answers).most_common(1)[0][0]
-
-    lm += f"Final Answer (by majority): {most_common}\n"
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = self_consistency(lm, "What is 15% of 200?")
-```
-
-### Planning and Execution
-
-```python
-@guidance
-def plan_and_execute(lm, goal):
-    """Plan tasks then execute them."""
-    lm += f"Goal: {goal}\n\n"
-
-    # Planning phase
-    lm += "Plan:\n"
-    num_steps = 4
-    for i in range(num_steps):
-        lm += f"{i+1}. " + gen(f"plan_step_{i}", stop="\n", max_tokens=100) + "\n"
-
-    # Execution phase
-    lm += "\nExecution:\n\n"
-    for i in range(num_steps):
-        lm += f"Step {i+1}: {lm[f'plan_step_{i}']}\n"
-        lm += "Status: " + select(["completed", "in-progress", "blocked"], name=f"status_{i}") + "\n"
-        lm += "Result: " + gen(f"result_{i}", stop="\n", max_tokens=150) + "\n\n"
-
-    # Summary
-    lm += "Summary: " + gen("summary", max_tokens=200)
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = plan_and_execute(lm, "Build a REST API for a blog platform")
-```
-
-## Code Generation
-
-### Python Function
-
-```python
-@guidance
-def generate_python_function(lm, description):
-    """Generate Python function from description."""
-    lm += f"Description: {description}\n\n"
-
-    # Function signature
-    lm += "def " + gen("func_name", regex=r"[a-z_][a-z0-9_]*") + "("
-    lm += gen("params", regex=r"[a-z_][a-z0-9_]*(, [a-z_][a-z0-9_]*)*") + "):\n"
-
-    # Docstring
-    lm += '    """' + gen("docstring", stop='"""', max_tokens=100) + '"""\n'
-
-    # Function body
-    lm += "    " + gen("body", stop="\n", max_tokens=200) + "\n"
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = generate_python_function(lm, "Check if a number is prime")
-
-print(lm)
-```
-
-### SQL Query
-
-```python
-@guidance
-def generate_sql(lm, description):
-    """Generate SQL query from description."""
-    lm += f"Description: {description}\n\n"
-    lm += "SQL Query:\n"
-
-    # SELECT clause
-    lm += "SELECT " + gen("select_clause", stop=" FROM", max_tokens=100)
-
-    # FROM clause
-    lm += " FROM " + gen("from_clause", stop=" WHERE", max_tokens=50)
-
-    # WHERE clause (optional)
-    lm += " WHERE " + gen("where_clause", stop=";", max_tokens=100) + ";"
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = generate_sql(lm, "Get all users who signed up in the last 30 days")
-```
-
-### API Endpoint
-
-```python
-@guidance
-def generate_api_endpoint(lm, description):
-    """Generate REST API endpoint."""
-    lm += f"Description: {description}\n\n"
-
-    # HTTP method
-    lm += "Method: " + select(["GET", "POST", "PUT", "DELETE"], name="method") + "\n"
-
-    # Path
-    lm += "Path: /" + gen("path", regex=r"[a-z0-9/-]+", stop="\n") + "\n"
-
-    # Request body (if POST/PUT)
-    if lm["method"] in ["POST", "PUT"]:
-        lm += "\nRequest Body:\n"
-        lm += "{\n"
-        lm += '  "field1": ' + gen("field1", regex=r'"[a-z_]+"') + ",\n"
-        lm += '  "field2": ' + gen("field2", regex=r'"[a-z_]+"') + "\n"
-        lm += "}\n"
-
-    # Response
-    lm += "\nResponse (200 OK):\n"
-    lm += "{\n"
-    lm += '  "status": "success",\n'
-    lm += '  "data": ' + gen("response_data", max_tokens=100) + "\n"
-    lm += "}\n"
-
-    return lm
-
-lm = models.Anthropic("claude-sonnet-4-5-20250929")
-lm = generate_api_endpoint(lm, "Create a new blog post")
-```
-
-## Production Tips
-
-### Error Handling
-
-```python
-@guidance
-def safe_extraction(lm, text):
-    """Extract with fallback handling."""
-    lm += f"Text: {text}\n"
-    try:
-        lm += "Name: " + gen("name", regex=r"[A-Za-z ]+", stop="\n", max_tokens=30)
-    except Exception:
-        # Fallback to less strict extraction (the prompt text is added only once)
-        lm += "Name: " + gen("name", stop="\n", max_tokens=30)
-    return lm
-```
-
-### Caching
-
-```python
-from functools import lru_cache
-
-@lru_cache(maxsize=100)
-def cached_generation(text):
-    """Cache LLM generations."""
-    lm = models.Anthropic("claude-sonnet-4-5-20250929")
-    lm += f"Analyze: {text}\n"
-    lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="sentiment")
-    return lm["sentiment"]
-
-# First call: hits LLM
-result1 = cached_generation("This is great!")
-
-# Second call: returns cached result
-result2 = cached_generation("This is great!")  # Instant!
-```
-
-### Monitoring
-
-```python
-import time
-
-@guidance
-def monitored_generation(lm, text):
-    """Track generation metrics."""
-    start_time = time.time()
-
-    lm += f"Text: {text}\n"
-    lm += "Analysis: " + gen("analysis", max_tokens=100)
-
-    elapsed = time.time() - start_time
-
-    # Log metrics
-    print(f"Generation time: {elapsed:.2f}s")
-    print(f"Output length: {len(lm['analysis'])} chars")
-
-    return lm
-```
-
-### Batch Processing
-
-```python
-def batch_process(texts, batch_size=10):
-    """Process texts in batches."""
-    lm = models.Anthropic("claude-sonnet-4-5-20250929")
-    results = []
-
-    for start in range(0, len(texts), batch_size):
-        batch = texts[start:start + batch_size]
-
-        for offset, text in enumerate(batch):
-            idx = start + offset  # Unique capture name for every text
-            lm += f"Text: {text}\n"
-            lm += "Sentiment: " + select(
-                ["positive", "negative", "neutral"],
-                name=f"sentiment_{idx}"
-            ) + "\n\n"
-            results.append(lm[f"sentiment_{idx}"])
-
-    return results
-```
-
-## Resources
-
-- **Guidance Notebooks**: https://github.com/guidance-ai/guidance/tree/main/notebooks
-- **Guidance Docs**: https://guidance.readthedocs.io
-- **Community Examples**: https://github.com/guidance-ai/guidance/discussions
--- a/skills/mlops/inference/llama-cpp/SKILL.md
+++ b/skills/mlops/inference/llama-cpp/SKILL.md
@@ -1,138 +1,271 @@
 ---
 name: llama-cpp
-description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
-version: 1.0.0
+description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization.
+version: 2.0.0
 author: Orchestra Research
 license: MIT
-dependencies: [llama-cpp-python]
+dependencies: [llama-cpp-python>=0.2.0]
 metadata:
  hermes:
-    tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
-
+    tags: [llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression]
 ---

-# llama.cpp
+# llama.cpp + GGUF

-Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
+Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.

-## When to use llama.cpp
+## When to use

-**Use llama.cpp when:**
-- Running on CPU-only machines
-- Deploying on Apple Silicon (M1/M2/M3/M4)
-- Using AMD or Intel GPUs (no CUDA)
-- Edge deployment (Raspberry Pi, embedded systems)
-- Need simple deployment without Docker/Python
+**Use llama.cpp + GGUF when:**
+- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
+- Using AMD (ROCm) or Intel GPUs where CUDA isn't available
+- Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
+- Need flexible quantization (2–8 bit with K-quants)
+- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
+- Want a single binary deploy without Docker/Python

-**Use TensorRT-LLM instead when:**
-- Have NVIDIA GPUs (A100/H100)
-- Need maximum throughput (100K+ tok/s)
-- Running in datacenter with CUDA
+**Key advantages:**
+- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
+- No Python runtime required (pure C/C++)
+- K-quants + imatrix for better low-bit quality
+- OpenAI-compatible server built in
+- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)

-**Use vLLM instead when:**
-- Have NVIDIA GPUs
-- Need Python-first API
-- Want PagedAttention
+**Use alternatives instead:**
+- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput
+- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed
+- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments
+- **bitsandbytes** — Simple HuggingFace transformers integration
+- **HQQ** — Fast calibration-free quantization

 ## Quick start

-### Installation
+### Install

 ```bash
-# macOS/Linux
+# macOS / Linux (simplest)
 brew install llama.cpp

 # Or build from source
-git clone https://github.com/ggerganov/llama.cpp
+git clone https://github.com/ggml-org/llama.cpp
 cd llama.cpp
-make
+make                        # CPU
+make GGML_METAL=1           # Apple Silicon
+make GGML_CUDA=1            # NVIDIA CUDA
+make GGML_HIP=1             # AMD ROCm

-# With Metal (Apple Silicon)
-make LLAMA_METAL=1
-
-# With CUDA (NVIDIA)
-make LLAMA_CUDA=1
-
-# With ROCm (AMD)
-make LLAMA_HIP=1
+# Python bindings (optional)
+pip install llama-cpp-python
+# With CUDA:   CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
+# With Metal:  CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
 ```

-### Download model
+### Download a pre-quantized GGUF

 ```bash
-# Download from HuggingFace (GGUF format)
+# TheBloke hosts most popular models pre-quantized
 huggingface-cli download \
    TheBloke/Llama-2-7B-Chat-GGUF \
    llama-2-7b-chat.Q4_K_M.gguf \
    --local-dir models/
+```

-# Or convert from HuggingFace
-python convert_hf_to_gguf.py models/llama-2-7b-chat/
+### Or convert a HuggingFace model to GGUF
+
+```bash
+# 1. Download HF model
+huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
+
+# 2. Convert to FP16 GGUF
+python convert_hf_to_gguf.py ./llama-3.1-8b \
+    --outfile llama-3.1-8b-f16.gguf \
+    --outtype f16
+
+# 3. Quantize to Q4_K_M
+./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
 ```

 ### Run inference

 ```bash
-# Simple chat
-./llama-cli \
-    -m models/llama-2-7b-chat.Q4_K_M.gguf \
-    -p "Explain quantum computing" \
-    -n 256  # Max tokens
+# One-shot prompt
+./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256

 # Interactive chat
-./llama-cli \
-    -m models/llama-2-7b-chat.Q4_K_M.gguf \
-    --interactive
+./llama-cli -m model.Q4_K_M.gguf --interactive
+
+# With GPU offload
+./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
 ```

-### Server mode
+### Serve an OpenAI-compatible API

 ```bash
-# Start OpenAI-compatible server
 ./llama-server \
-    -m models/llama-2-7b-chat.Q4_K_M.gguf \
+    -m model.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
-    -ngl 32  # Offload 32 layers to GPU
+    -ngl 35 \
+    -c 4096 \
+    --parallel 4 \
+    --cont-batching
+```

-# Client request
+```bash
 curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
-    "model": "llama-2-7b-chat",
+    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
 ```

-## Quantization formats
+## Quantization formats (GGUF)

-### GGUF format overview
+### K-quant methods (recommended)

-| Format | Bits | Size (7B) | Speed | Quality | Use Case |
-|--------|------|-----------|-------|---------|----------|
-| **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** |
-| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
-| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
-| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
-| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
-| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
+| Type | Bits | Size (7B) | Quality | Use Case |
+|------|------|-----------|---------|----------|
+| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
+| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
+| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
+| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
+| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** |
+| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
+| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
+| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
+| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |

-### Choosing quantization
+**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality).
+
+**Legacy (Q4_0/Q4_1/Q5_0/Q5_1) exist** but always prefer K-quants for better quality/size ratio.
+
+**IQ quantization** — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. Require `--imatrix`.
+
+**Task-specific defaults:**
+- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
+- Code generation: Q5_K_M or Q6_K (higher precision helps)
+- Technical / medical: Q6_K or Q8_0
+- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
+- Raspberry Pi / edge: Q2_K or Q3_K_S
+
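The sizes in the K-quant table follow roughly from parameter count times bits per weight. Below is a minimal Python sketch of that arithmetic, assuming a nominal 7e9 parameters and the table's bits-per-weight figures. `estimate_gguf_size_gb` is a hypothetical helper, not part of llama.cpp. Treat the result as a rough estimate that tends to land somewhat below the table's sizes, since real files keep some tensors at higher precision and also carry metadata.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough on-disk size in GB (10^9 bytes): parameters x bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9


# Bits-per-weight values taken from the K-quant table; 7e9 is a nominal "7B".
for name, bits in [("Q2_K", 2.5), ("Q4_K_M", 4.5), ("Q8_0", 8.0)]:
    print(f"{name}: ~{estimate_gguf_size_gb(7e9, bits):.1f} GB")
```

The same function can be used to check whether a 70B model at a given quant is likely to fit in available RAM before downloading or converting it.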
+## Conversion workflows
+
+### Basic: HF → GGUF → quantized

 ```bash
-# General use (balanced)
-Q4_K_M  # 4-bit, medium quality
+python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
+./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
+./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50
+```

-# Maximum speed (more degradation)
-Q2_K or Q3_K_M
+### With importance matrix (imatrix) — better low-bit quality

-# Maximum quality (slower)
-Q6_K or Q8_0
+`imatrix` gives 10–20% perplexity improvement at Q4, essential at Q3 and below.

-# Very large models (70B, 405B)
-Q3_K_M or Q4_K_S  # Lower bits to fit in memory
+```bash
+# 1. Convert to FP16 GGUF
+python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
+
+# 2. Prepare calibration data (diverse text, ~100MB is ideal)
+cat > calibration.txt << 'EOF'
+The quick brown fox jumps over the lazy dog.
+Machine learning is a subset of artificial intelligence.
+# Add more diverse text samples...
+EOF
+
+# 3. Generate importance matrix
+./llama-imatrix -m model-f16.gguf \
+    -f calibration.txt \
+    --chunk 512 \
+    -o model.imatrix \
+    -ngl 35
+
+# 4. Quantize with imatrix
+./llama-quantize --imatrix model.imatrix \
+    model-f16.gguf model-q4_k_m.gguf Q4_K_M
+```
+
+### Multi-quant batch
+
+```bash
+#!/bin/bash
+MODEL="llama-3.1-8b-f16.gguf"
+IMATRIX="llama-3.1-8b.imatrix"
+
+./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
+
+for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
+    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
+    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
+    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
+done
+```
+
+### Quality testing (perplexity)
+
+```bash
+./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
+# Baseline FP16: ~5.96  |  Q4_K_M: ~6.06 (+1.7%)  |  Q2_K: ~6.87 (+15.3%)
+```
+
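The percentages quoted in the perplexity comment are relative increases over the FP16 baseline. A small sketch of that calculation, using only the three perplexity values quoted above (`ppl_increase_pct` is a hypothetical helper name):

```python
def ppl_increase_pct(baseline: float, quantized: float) -> float:
    """Relative perplexity increase over the baseline, in percent."""
    return (quantized - baseline) / baseline * 100.0


baseline = 5.96  # FP16 figure quoted above
for name, ppl in [("Q4_K_M", 6.06), ("Q2_K", 6.87)]:
    print(f"{name}: +{ppl_increase_pct(baseline, ppl):.1f}%")
```

Comparing quants this way only makes sense when the baseline and quantized runs use the same test file and the same `-c` context size.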
+## Python bindings (llama-cpp-python)
+
+### Basic generation
+
+```python
+from llama_cpp import Llama
+
+llm = Llama(
+    model_path="./model-q4_k_m.gguf",
+    n_ctx=4096,
+    n_gpu_layers=35,     # 0 for CPU only, 99 to offload everything
+    n_threads=8,
+)
+
+output = llm(
+    "What is machine learning?",
+    max_tokens=256,
+    temperature=0.7,
+    stop=["</s>", "\n\n"],
+)
+print(output["choices"][0]["text"])
+```
+
+### Chat completion + streaming
+
+```python
+llm = Llama(
+    model_path="./model-q4_k_m.gguf",
+    n_ctx=4096,
+    n_gpu_layers=35,
+    chat_format="llama-3",    # Or "chatml", "mistral", etc.
+)
+
+# Non-streaming
+response = llm.create_chat_completion(
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "What is Python?"},
+    ],
+    max_tokens=256,
+    temperature=0.7,
+)
+print(response["choices"][0]["message"]["content"])
+
+# Streaming
+for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
+    print(chunk["choices"][0]["text"], end="", flush=True)
+```
+
+### Embeddings
+
+```python
+llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
+vec = llm.embed("This is a test sentence.")
+print(f"Embedding dimension: {len(vec)}")
 ```
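Embedding vectors are usually compared with cosine similarity. The sketch below is plain Python and does not call llama-cpp-python. It assumes each embedding is one flat list of floats, as the `len(vec)` call above implies, and `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two embeddings.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 0.5]))
```

In practice the two arguments would be the results of two `llm.embed(...)` calls made with the same model.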

 ## Hardware acceleration
@@ -140,122 +273,166 @@ Q3_K_M or Q4_K_S  # Lower bits to fit in memory
 ### Apple Silicon (Metal)

 ```bash
-# Build with Metal
-make LLAMA_METAL=1
-
-# Run with GPU acceleration (automatic)
-./llama-cli -m model.gguf -ngl 999  # Offload all layers
-
-# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
+make clean && make GGML_METAL=1
+./llama-cli -m model.gguf -ngl 99 -p "Hello"   # offload all layers
 ```

-### NVIDIA GPUs (CUDA)
-
-```bash
-# Build with CUDA
-make LLAMA_CUDA=1
-
-# Offload layers to GPU
-./llama-cli -m model.gguf -ngl 35  # Offload 35/40 layers
-
-# Hybrid CPU+GPU for large models
-./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20  # GPU: 20 layers, CPU: rest
+```python
+llm = Llama(
+    model_path="model.gguf",
+    n_gpu_layers=99,     # Offload everything
+    n_threads=1,         # Metal handles parallelism
+)
 ```

-### AMD GPUs (ROCm)
+Performance: M3 Max ~40–60 tok/s on Llama 2-7B Q4_K_M.
+
+### NVIDIA (CUDA)
+
+```bash
+make clean && make GGML_CUDA=1
+./llama-cli -m model.gguf -ngl 35 -p "Hello"
+
+# Hybrid for large models
+./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20   # GPU: 20 layers, CPU: rest
+
+# Multi-GPU split
+./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
+```
+
+### AMD (ROCm)

 ```bash
-# Build with ROCm
 make GGML_HIP=1
-
-# Run with AMD GPU
 ./llama-cli -m model.gguf -ngl 999
 ```

-## Common patterns
-
-### Batch processing
+### CPU

 ```bash
-# Process multiple prompts from file
-cat prompts.txt | ./llama-cli \
-    -m model.gguf \
-    --batch-size 512 \
-    -n 100
+# Match PHYSICAL cores, not logical
+./llama-cli -m model.gguf -t 8 -p "Hello"
+
+# BLAS acceleration (2–3× speedup)
+make GGML_OPENBLAS=1
 ```

-### Constrained generation
-
-```bash
-# JSON output with grammar
-./llama-cli \
-    -m model.gguf \
-    -p "Generate a person: " \
-    --grammar-file grammars/json.gbnf
-
-# Outputs valid JSON only
-```
-
-### Context size
-
-```bash
-# Increase context (default 512)
-./llama-cli \
-    -m model.gguf \
-    -c 4096  # 4K context window
-
-# Very long context (if model supports)
-./llama-cli -m model.gguf -c 32768  # 32K context
+```python
+llm = Llama(
+    model_path="model.gguf",
+    n_gpu_layers=0,
+    n_threads=8,
+    n_batch=512,         # Larger batch = faster prompt processing
+)
 ```

 ## Performance benchmarks

-### CPU performance (Llama 2-7B Q4_K_M)
+### CPU (Llama 2-7B Q4_K_M)

-| CPU | Threads | Speed | Cost |
-|-----|---------|-------|------|
-| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
-| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
-| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
-| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
+| CPU | Threads | Speed |
+|-----|---------|-------|
+| Apple M3 Max (Metal) | 16 | 50 tok/s |
+| AMD Ryzen 9 7950X | 32 | 35 tok/s |
+| Intel i9-13900K | 32 | 30 tok/s |

-### GPU acceleration (Llama 2-7B Q4_K_M)
+### GPU offloading on RTX 4090

-| GPU | Speed | vs CPU | Cost |
-|-----|-------|--------|------|
-| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
-| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
-| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
-| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
+| Layers GPU | Speed | VRAM |
+|------------|-------|------|
+| 0 (CPU only) | 30 tok/s | 0 GB |
+| 20 (hybrid) | 80 tok/s | 8 GB |
+| 35 (all) | 120 tok/s | 12 GB |

 ## Supported models

-**LLaMA family**:
-- Llama 2 (7B, 13B, 70B)
-- Llama 3 (8B, 70B, 405B)
-- Code Llama
+- **LLaMA family**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama
+- **Mistral family**: Mistral 7B, Mixtral 8x7B/8x22B
+- **Other**: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)

-**Mistral family**:
-- Mistral 7B
-- Mixtral 8x7B, 8x22B
+Find GGUF models: https://huggingface.co/models?library=gguf

-**Other**:
-- Falcon, BLOOM, GPT-J
-- Phi-3, Gemma, Qwen
-- LLaVA (vision), Whisper (audio)
+## Ecosystem integrations

-**Find models**: https://huggingface.co/models?library=gguf
+### Ollama
+
+```bash
+cat > Modelfile << 'EOF'
+FROM ./model-q4_k_m.gguf
+TEMPLATE """{{ .System }}
+{{ .Prompt }}"""
+PARAMETER temperature 0.7
+PARAMETER num_ctx 4096
+EOF
+
+ollama create mymodel -f Modelfile
+ollama run mymodel "Hello!"
+```
+
+### LM Studio
+
+1. Place GGUF file in `~/.cache/lm-studio/models/`
+2. Open LM Studio and select the model
+3. Configure context length and GPU offload, start inference
+
+### text-generation-webui
+
+```bash
+cp model-q4_k_m.gguf text-generation-webui/models/
+python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
+```
+
+### OpenAI client → llama-server
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
+response = client.chat.completions.create(
+    model="local-model",
+    messages=[{"role": "user", "content": "Hello!"}],
+    max_tokens=256,
+)
+print(response.choices[0].message.content)
+```
+
+## Best practices
+
+1. **Use K-quants** — Q4_K_M is the recommended default
+2. **Use imatrix** for Q4 and below (calibration improves quality substantially)
+3. **Offload as many layers as VRAM allows** — start high, reduce by 5 on OOM
+4. **Thread count** — match physical cores, not logical
+5. **Batch size** — increase `n_batch` (e.g. 512) for faster prompt processing
+6. **Context** — start at 4096, grow only as needed (memory scales with ctx)
+7. **Flash Attention** — add `--flash-attn` if your build supports it
+
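The "start high, reduce by 5 on OOM" advice in item 3 can be sketched as a small back-off loop. This is an illustration under stated assumptions: `try_load` is a hypothetical callback supplied by the caller (for example, one that builds a `Llama` object with `n_gpu_layers=n` and returns `False` if loading fails), and `find_gpu_layers` is not an API of llama.cpp or llama-cpp-python.

```python
from typing import Callable, Optional


def find_gpu_layers(
    try_load: Callable[[int], bool],
    start: int = 99,
    step: int = 5,
) -> Optional[int]:
    """Return the first layer count that loads, backing off by `step`.

    Falls back to 0 (CPU only) and returns None if even that fails.
    """
    n = start
    while n > 0:
        if try_load(n):
            return n
        n -= step
    return 0 if try_load(0) else None


# Stand-in loader: pretend anything above 20 offloaded layers runs out of VRAM.
print(find_gpu_layers(lambda n: n <= 20, start=35))
```

Passing the loader in as a callback keeps the search logic independent of how out-of-memory failures surface on a given backend.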
+## Common issues (quick fixes)
+
+**Model loads slowly**: memory-mapped loading (mmap) is the default, so check that `--no-mmap` is not being passed.
+
+**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:
+```python
+Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35)  # Q4_0 KV cache
+```
+
+**Garbage output** — wrong `chat_format`, temperature too high, or model file corrupted. Test with `temperature=0.1` and verify FP16 baseline works.
+
+**Connection refused (server)** — bind to `--host 0.0.0.0`, check `lsof -i :8080`.
+
+See `references/troubleshooting.md` for the full playbook.
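For the connection-refused case, a quick reachability check from Python can complement `lsof`. This is a minimal standard-library sketch. `port_is_open` is a hypothetical helper, and port 8080 is the port used in the server examples above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


print("llama-server reachable:", port_is_open("127.0.0.1", 8080))
```

A `False` result on `127.0.0.1` suggests the server is not running or is on another port. If it is `True` locally but fails from another machine, the cause is more likely the bind address or a firewall.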

 ## References

-- **[Quantization Guide](references/quantization.md)** - GGUF formats, conversion, quality comparison
-- **[Server Deployment](references/server.md)** - API endpoints, Docker, monitoring
-- **[Optimization](references/optimization.md)** - Performance tuning, hybrid CPU+GPU
+- **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
+- **[quantization.md](references/quantization.md)** — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
+- **[server.md](references/server.md)** — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
+- **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
+- **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging

 ## Resources

-- **GitHub**: https://github.com/ggerganov/llama.cpp
-- **Models**: https://huggingface.co/models?library=gguf
-- **Discord**: https://discord.gg/llama-cpp
-
-
+- **GitHub**: https://github.com/ggml-org/llama.cpp
+- **Python bindings**: https://github.com/abetlen/llama-cpp-python
+- **Pre-quantized models**: https://huggingface.co/TheBloke
+- **GGUF converter Space**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
+- **License**: MIT
--- a/skills/mlops/inference/llama-cpp/references/advanced-usage.md
+++ b/skills/mlops/inference/llama-cpp/references/advanced-usage.md
--- a/skills/mlops/inference/llama-cpp/references/troubleshooting.md
+++ b/skills/mlops/inference/llama-cpp/references/troubleshooting.md
--- a/skills/mlops/training/grpo-rl-training/README.md
+++ b/skills/mlops/training/grpo-rl-training/README.md
@ -1,97 +0,0 @@
-# GRPO/RL Training Skill
-
-**Expert-level guidance for Group Relative Policy Optimization with TRL**
-
-## 📁 Skill Structure
-
-```
-grpo-rl-training/
-├── SKILL.md                              # Main skill documentation (READ THIS FIRST)
-├── README.md                             # This file
-├── templates/
-│   └── basic_grpo_training.py            # Production-ready training template
-└── examples/
-    └── reward_functions_library.py       # 20+ reward function examples
-```
-
-## 🚀 Quick Start
-
-1. **Read SKILL.md** - Comprehensive guide with all concepts and patterns
-2. **Copy `templates/basic_grpo_training.py`** - Start with working code
-3. **Browse `examples/reward_functions_library.py`** - Pick reward functions for your task
-4. **Modify for your use case** - Adapt dataset, rewards, and config
-
-## 💡 What's Inside
-
-### SKILL.md (Main Documentation)
- Core GRPO concepts and algorithm fundamentals
- Complete implementation workflow (dataset → rewards → training → deployment)
- 10+ reward function examples with code
- Hyperparameter tuning guide
- Training insights (loss behavior, metrics, debugging)
- Troubleshooting guide
- Production best practices
-
-### Templates
- **basic_grpo_training.py**: Minimal, production-ready training script
-  - Uses Qwen 2.5 1.5B Instruct
-  - 3 reward functions (format + correctness)
-  - LoRA for efficient training
-  - Fully documented and ready to run
-
-### Examples
- **reward_functions_library.py**: 20+ battle-tested reward functions
-  - Correctness rewards (exact match, fuzzy match, numeric, code execution)
-  - Format rewards (XML, JSON, strict/soft)
-  - Length rewards (ideal length, min/max)
-  - Style rewards (reasoning quality, citations, repetition penalty)
-  - Combined rewards (multi-objective optimization)
-  - Preset collections for common tasks
-
-## 📖 Usage for Agents
-
-When this skill is loaded in your agent's context:
-
-1. **Always read SKILL.md first** before implementing
-2. **Start simple** - Use length-based reward to validate setup
-3. **Build incrementally** - Add one reward function at a time
-4. **Reference examples** - Copy patterns from reward_functions_library.py
-5. **Monitor training** - Watch reward metrics (not loss!)
-
-## 🎯 Common Use Cases
-
-| Task Type | Recommended Rewards | Template |
-|-----------|---------------------|----------|
-| Math reasoning | `MATH_REASONING_REWARDS` preset | basic_grpo_training.py |
-| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template |
-| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards |
-| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations |
-
-## ⚠️ Critical Reminders
-
- **Loss goes UP during training** - This is normal (it's KL divergence)
- **Use 3-5 reward functions** - Single rewards often fail
- **Test rewards before training** - Debug each function independently
- **Monitor reward_std** - Should stay > 0.1 (avoid mode collapse)
- **Start with num_generations=4-8** - Scale up if GPU allows
-
-## 🔗 External Resources
-
- [TRL Documentation](https://huggingface.co/docs/trl)
- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948)
- [Open R1 Implementation](https://github.com/huggingface/open-r1)
- [Unsloth (2-3x faster)](https://docs.unsloth.ai/)
-
-## 📝 Version
-
-**v1.0.0** - Initial release (January 2025)
-
-## 👨‍💻 Maintained By
-
-Orchestra Research
-For questions or improvements, see https://orchestra.com
-
---
-
-**License:** MIT
-**Last Updated:** January 2025
--- a/skills/mlops/training/trl-fine-tuning/SKILL.md
+++ b/skills/mlops/training/trl-fine-tuning/SKILL.md
@ -252,6 +252,8 @@ trl dpo \

 Train with reinforcement learning using minimal memory.

+For in-depth GRPO guidance — reward function design, critical training insights (loss behavior, mode collapse, tuning), and advanced multi-stage patterns — see **[references/grpo-training.md](references/grpo-training.md)**. A production-ready training script is in **[templates/basic_grpo_training.py](templates/basic_grpo_training.py)**.
+
 Copy this checklist:

 ```
@ -428,6 +430,8 @@ config = PPOConfig(

 **Online RL methods**: See [references/online-rl.md](references/online-rl.md) for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.

+**GRPO deep dive**: See [references/grpo-training.md](references/grpo-training.md) for expert-level GRPO patterns — reward function design philosophy, training insights (why loss increases, mode collapse detection), hyperparameter tuning, multi-stage training, and troubleshooting. Production-ready template in [templates/basic_grpo_training.py](templates/basic_grpo_training.py).
+
 ## Hardware requirements

 - **GPU**: NVIDIA (CUDA required)
--- a/skills/mlops/training/trl-fine-tuning/references/grpo-training.md
+++ b/skills/mlops/training/trl-fine-tuning/references/grpo-training.md
@ -1,51 +1,36 @@
---
-name: grpo-rl-training
-description: Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [transformers>=4.47.0, trl>=0.14.0, datasets>=3.2.0, peft>=0.14.0, torch]
-metadata:
-  hermes:
-    tags: [Post-Training, Reinforcement Learning, GRPO, TRL, RLHF, Reward Modeling, Reasoning, DPO, PPO, Structured Output]
+# GRPO (Group Relative Policy Optimization) — Deep Guide

---
+Expert-level patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions using TRL's `GRPOTrainer`. This is the deep reference for the GRPO workflow summarized in the main skill.

-# GRPO/RL Training with TRL
+## When to use GRPO

-Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions.
-
-## When to Use This Skill
-
-Use GRPO training when you need to:
- **Enforce specific output formats** (e.g., XML tags, JSON, structured reasoning)
+Use GRPO when you need to:
+- **Enforce specific output formats** (XML tags, JSON, structured reasoning)
 - **Teach verifiable tasks** with objective correctness metrics (math, coding, fact-checking)
 - **Improve reasoning capabilities** by rewarding chain-of-thought patterns
 - **Align models to domain-specific behaviors** without labeled preference data
 - **Optimize for multiple objectives** simultaneously (format + correctness + style)

 **Do NOT use GRPO for:**
- Simple supervised fine-tuning tasks (use SFT instead)
+- Simple supervised fine-tuning tasks → use SFT
 - Tasks without clear reward signals
- When you already have high-quality preference pairs (use DPO/PPO instead)
+- When you already have high-quality preference pairs → use DPO/PPO

---
+## Core concepts

-## Core Concepts
+### 1. GRPO algorithm fundamentals

-### 1. GRPO Algorithm Fundamentals
-
-**Key Mechanism:**
- Generates **multiple completions** for each prompt (group size: 4-16)
+**Key mechanism:**
+- Generates **multiple completions** per prompt (group size: 4–16)
 - Compares completions within each group using reward functions
 - Updates policy to favor higher-rewarded responses relative to the group

-**Critical Difference from PPO:**
+**Critical differences from PPO:**
 - No separate reward model needed
 - More sample-efficient (learns from within-group comparisons)
 - Simpler to implement and debug

-**Mathematical Intuition:**
+**Mathematical intuition:**
 ```
 For each prompt p:
  1. Generate N completions: {c₁, c₂, ..., cₙ}
@ -54,35 +39,32 @@ For each prompt p:
     relative to low-reward ones in the same group
 ```
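
The group-relative comparison in the steps above is commonly implemented by normalizing each completion's reward against the mean and standard deviation of its own group. A minimal sketch in plain Python, assuming that normalization scheme; `group_advantages`, `eps`, and the use of a population standard deviation are illustrative choices, not TRL API:

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize one group's rewards into zero-mean advantages.

    `rewards` holds the scores of all completions sampled for a single
    prompt. `eps` (an assumed small constant) avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored by some reward function
group_rewards = [2.0, 0.0, 0.5, 0.0]
advantages = group_advantages(group_rewards)
for reward, adv in zip(group_rewards, advantages):
    print(f"reward={reward:.1f} advantage={adv:+.2f}")
```

Completions scoring above the group mean get positive advantages and are reinforced. A group in which every reward is identical yields all-zero advantages and contributes no learning signal, which is why a collapsing `reward_std` matters.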

-### 2. Reward Function Design Philosophy
+### 2. Reward function design philosophy

-**Golden Rules:**
-1. **Compose multiple reward functions** - Each handles one aspect (format, correctness, style)
-2. **Scale rewards appropriately** - Higher weight = stronger signal
-3. **Use incremental rewards** - Partial credit for partial compliance
-4. **Test rewards independently** - Debug each reward function in isolation
+**Golden rules:**
+1. **Compose multiple reward functions** — each handles one aspect (format, correctness, style)
+2. **Scale rewards appropriately** — higher weight = stronger signal
+3. **Use incremental rewards** — partial credit for partial compliance
+4. **Test rewards independently** — debug each reward function in isolation

-**Reward Function Types:**
+**Reward function types:**

 | Type | Use Case | Example Weight |
 |------|----------|----------------|
 | **Correctness** | Verifiable tasks (math, code) | 2.0 (highest) |
-| **Format** | Strict structure enforcement | 0.5-1.0 |
-| **Length** | Encourage verbosity/conciseness | 0.1-0.5 |
-| **Style** | Penalize unwanted patterns | -0.5 to 0.5 |
+| **Format** | Strict structure enforcement | 0.5–1.0 |
+| **Length** | Encourage verbosity/conciseness | 0.1–0.5 |
+| **Style** | Penalize unwanted patterns | −0.5 to 0.5 |
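
One way to apply the example weights in the table is to wrap each reward function so its scores are scaled before they are combined. A minimal sketch; `weighted`, `has_answer_tag`, and `short_enough` are hypothetical helpers written for illustration, and the weights are example values rather than recommendations:

```python
def weighted(reward_func, weight):
    """Return a reward function whose scores are scaled by `weight`."""
    def scaled(completions, **kwargs):
        return [weight * score for score in reward_func(completions, **kwargs)]
    # Keep a readable name so logs can tell the wrapped functions apart
    scaled.__name__ = f"{reward_func.__name__}_x{weight}"
    return scaled


def has_answer_tag(completions, **kwargs):
    """Hypothetical format reward: 1.0 if an <answer> tag is present."""
    return [1.0 if "<answer>" in c[0]["content"] else 0.0 for c in completions]


def short_enough(completions, **kwargs):
    """Hypothetical length reward: 1.0 for completions under 200 characters."""
    return [1.0 if len(c[0]["content"]) < 200 else 0.0 for c in completions]


reward_funcs = [weighted(has_answer_tag, 0.5), weighted(short_enough, 0.25)]

completions = [
    [{"role": "assistant", "content": "<answer>42</answer>"}],
    [{"role": "assistant", "content": "forty-two"}],
]
# Sum the scaled scores per completion to see the combined signal
totals = [sum(scores) for scores in zip(*(f(completions) for f in reward_funcs))]
print(totals)
```

Printing the per-completion totals on a handful of hand-written samples is a quick way to confirm that the highest-weighted objective actually dominates the combined score.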

---
+## Implementation workflow

-## Implementation Workflow
+### Step 1: Dataset preparation

-### Step 1: Dataset Preparation
-
-**Critical Requirements:**
- Prompts in chat format (list of dicts with 'role' and 'content')
+**Critical requirements:**
+- Prompts in chat format (list of dicts with `role` and `content`)
 - Include system prompts to set expectations
 - For verifiable tasks, include ground truth answers as additional columns

-**Example Structure:**
 ```python
 from datasets import load_dataset, Dataset

@ -97,8 +79,7 @@ Respond in the following format:
 """

 def prepare_dataset(raw_data):
-    """
-    Transform raw data into GRPO-compatible format.
+    """Transform raw data into GRPO-compatible format.

    Returns: Dataset with columns:
    - 'prompt': List[Dict] with role/content (system + user messages)
@ -113,14 +94,14 @@ def prepare_dataset(raw_data):
    })
 ```
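
As an illustration of the target row layout, the sketch below converts one GSM8K-style record (where the final answer follows a `####` marker) into a `prompt`/`answer` pair. `SYSTEM_PROMPT_EXAMPLE`, `extract_hash_answer`, and `to_grpo_row` are assumed names for this example and may differ from the helpers in the template:

```python
# Placeholder system prompt for this example only
SYSTEM_PROMPT_EXAMPLE = "Respond with <reasoning>...</reasoning> then <answer>...</answer>."


def extract_hash_answer(text):
    """Return the text after the last '####' marker, or None if absent."""
    if "####" not in text:
        return None
    return text.split("####")[-1].strip()


def to_grpo_row(example):
    """Map a raw {'question', 'answer'} record to a GRPO-style row."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT_EXAMPLE},
            {"role": "user", "content": example["question"]},
        ],
        "answer": extract_hash_answer(example["answer"]),
    }


raw = {"question": "What is 15 + 27?", "answer": "15 + 27 = 42\n#### 42"}
row = to_grpo_row(raw)
print(row["prompt"][1]["content"], "->", row["answer"])
```

A function of this shape can be applied to a whole dataset with `Dataset.map`; rows whose `answer` comes back as `None` are worth filtering out before training.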

-**Pro Tips:**
- Use one-shot or few-shot examples in system prompt for complex formats
- Keep prompts concise (max_prompt_length: 256-512 tokens)
+**Pro tips:**
+- Use one-shot or few-shot examples in the system prompt for complex formats
+- Keep prompts concise (max_prompt_length: 256–512 tokens)
 - Validate data quality before training (garbage in = garbage out)

-### Step 2: Reward Function Implementation
+### Step 2: Reward function implementation

-**Template Structure:**
+**Template structure:**
 ```python
 def reward_function_name(
    prompts,        # List[List[Dict]]: Original prompts
@ -128,24 +109,16 @@ def reward_function_name(
    answer=None,    # Optional: Ground truth from dataset
    **kwargs        # Additional dataset columns
 ) -> list[float]:
-    """
-    Evaluate completions and return rewards.
-
-    Returns: List of floats (one per completion)
-    """
-    # Extract completion text
+    """Evaluate completions and return rewards (one per completion)."""
    responses = [comp[0]['content'] for comp in completions]
-
-    # Compute rewards
    rewards = []
    for response in responses:
        score = compute_score(response)
        rewards.append(score)
-
    return rewards
 ```

-**Example 1: Correctness Reward (Math/Coding)**
+**Example 1: correctness reward (math/coding)**
 ```python
 def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward correct answers with high score."""
@ -155,7 +128,7 @@ def correctness_reward(prompts, completions, answer, **kwargs):
            for ans, gt in zip(extracted, answer)]
 ```

-**Example 2: Format Reward (Structured Output)**
+**Example 2: format reward (structured output)**
 ```python
 import re

@ -167,7 +140,7 @@ def format_reward(completions, **kwargs):
            for r in responses]
 ```

-**Example 3: Incremental Format Reward (Partial Credit)**
+**Example 3: incremental format reward (partial credit)**
 ```python
 def incremental_format_reward(completions, **kwargs):
    """Award partial credit for format compliance."""
@ -176,14 +149,10 @@ def incremental_format_reward(completions, **kwargs):

    for r in responses:
        score = 0.0
-        if '<reasoning>' in r:
-            score += 0.25
-        if '</reasoning>' in r:
-            score += 0.25
-        if '<answer>' in r:
-            score += 0.25
-        if '</answer>' in r:
-            score += 0.25
+        if '<reasoning>' in r:  score += 0.25
+        if '</reasoning>' in r: score += 0.25
+        if '<answer>' in r:     score += 0.25
+        if '</answer>' in r:    score += 0.25
        # Penalize extra text after closing tag
        if r.count('</answer>') == 1:
            extra_text = r.split('</answer>')[-1].strip()
@ -193,12 +162,11 @@ def incremental_format_reward(completions, **kwargs):
    return rewards
 ```

-**Critical Insight:**
-Combine 3-5 reward functions for robust training. Order matters less than diversity of signals.
+**Critical insight:** Combine 3–5 reward functions for robust training. Order matters less than diversity of signals.

-### Step 3: Training Configuration
+### Step 3: Training configuration

-**Memory-Optimized Config (Small GPU)**
+**Memory-optimized config (small GPU)**
 ```python
 from trl import GRPOConfig

@ -218,13 +186,13 @@ training_args = GRPOConfig(
    gradient_accumulation_steps=4,  # Effective batch = 4

    # GRPO-specific
-    num_generations=8,            # Group size: 8-16 recommended
+    num_generations=8,            # Group size: 8–16 recommended
    max_prompt_length=256,
    max_completion_length=512,

    # Training duration
    num_train_epochs=1,
-    max_steps=None,               # Or set fixed steps (e.g., 500)
+    max_steps=None,

    # Optimization
    bf16=True,                    # Faster on A100/H100
@ -234,11 +202,11 @@ training_args = GRPOConfig(
    # Logging
    logging_steps=1,
    save_steps=100,
-    report_to="wandb",            # Or "none" for no logging
+    report_to="wandb",
 )
 ```

-**High-Performance Config (Large GPU)**
+**High-performance config (large GPU)**
 ```python
 training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
@ -255,31 +223,30 @@ training_args = GRPOConfig(
 )
 ```

-**Critical Hyperparameters:**
+**Critical hyperparameters:**

 | Parameter | Impact | Tuning Advice |
 |-----------|--------|---------------|
-| `num_generations` | Group size for comparison | Start with 8, increase to 16 if GPU allows |
+| `num_generations` | Group size for comparison | Start at 8, increase to 16 if GPU allows |
 | `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
-| `max_completion_length` | Output verbosity | Match your task (512 for reasoning, 256 for short answers) |
+| `max_completion_length` | Output verbosity | Match your task (512 for reasoning, 256 for short answers) |
 | `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory limited |

-### Step 4: Model Setup and Training
+### Step 4: Model setup and training

-**Standard Setup (Transformers)**
+**Standard setup (Transformers + TRL)**
 ```python
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 from peft import LoraConfig
 from trl import GRPOTrainer

-# Load model
 model_name = "Qwen/Qwen2.5-1.5B-Instruct"
 model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
-    attn_implementation="flash_attention_2",  # 2-3x faster
-    device_map="auto"
+    attn_implementation="flash_attention_2",  # 2–3× faster
+    device_map="auto",
 )

 tokenizer = AutoTokenizer.from_pretrained(model_name)
@ -287,17 +254,16 @@ tokenizer.pad_token = tokenizer.eos_token

 # Optional: LoRA for parameter-efficient training
 peft_config = LoraConfig(
-    r=16,                         # Rank (higher = more capacity)
-    lora_alpha=32,               # Scaling factor (typically 2*r)
+    r=16,
+    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
-        "gate_proj", "up_proj", "down_proj"
+        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
 )

-# Initialize trainer
 trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
@ -308,17 +274,14 @@ trainer = GRPOTrainer(
    ],
    args=training_args,
    train_dataset=dataset,
-    peft_config=peft_config,      # Remove for full fine-tuning
+    peft_config=peft_config,   # Remove for full fine-tuning
 )

-# Train
 trainer.train()
-
-# Save
 trainer.save_model("final_model")
 ```

-**Unsloth Setup (2-3x Faster)**
+**Unsloth setup (2–3× faster)**
 ```python
 from unsloth import FastLanguageModel

@ -339,28 +302,26 @@ model = FastLanguageModel.get_peft_model(
    use_gradient_checkpointing="unsloth",
 )

-# Rest is identical to standard setup
+# Rest is identical to the standard setup
 trainer = GRPOTrainer(model=model, ...)
 trainer.train()
 ```

---
+## Critical training insights

-## Critical Training Insights
+### 1. Loss behavior (EXPECTED pattern)
+- **Loss starts near 0 and INCREASES during training** — this is CORRECT
+- Loss measures KL divergence from the initial policy, so it rises as the model learns (it diverges from its original behavior to optimize rewards)
+- **Monitor reward metrics, not loss, for progress**

-### 1. Loss Behavior (EXPECTED PATTERN)
- **Loss starts near 0 and INCREASES during training**
- This is CORRECT - loss measures KL divergence from initial policy
- Model is learning (diverging from original behavior to optimize rewards)
- Monitor reward metrics instead of loss for progress
+### 2. Reward tracking

-### 2. Reward Tracking
 Key metrics to watch:
- `reward`: Average across all completions
- `reward_std`: Diversity within groups (should remain > 0)
- `kl`: KL divergence from reference (should grow moderately)
+- `reward` — average across all completions
+- `reward_std` — diversity within groups (should remain > 0)
+- `kl` — KL divergence from reference (should grow moderately)

-**Healthy Training Pattern:**
+**Healthy pattern:**
 ```
 Step   Reward    Reward_Std   KL
 100    0.5       0.3          0.02
@ -369,12 +330,12 @@ Step   Reward    Reward_Std   KL
 400    1.5       0.15         0.12
 ```

-**Warning Signs:**
- Reward std → 0 (model collapsing to single response)
- KL exploding (> 0.5) (diverging too much, reduce LR)
- Reward stuck (reward functions too harsh or model capacity issue)
+**Warning signs:**
+- `reward_std` → 0 (model collapsing to a single response)
+- `kl` exploding (> 0.5) — diverging too much, reduce LR
+- Reward stuck — reward functions too harsh or model capacity issue
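
The warning signs above can be turned into a simple automated check over logged metrics. A sketch assuming each logging step is available as a dict with `reward_std` and `kl` keys; the function name is illustrative, the 0.1 floor and 0.5 ceiling are the thresholds this guide mentions, and the last history entry is a made-up unhealthy step:

```python
def check_training_health(metrics, std_floor=0.1, kl_ceiling=0.5):
    """Return a list of warning strings for one logging step."""
    warnings = []
    if metrics.get("reward_std", 0.0) < std_floor:
        warnings.append("reward_std near 0: possible mode collapse")
    if metrics.get("kl", 0.0) > kl_ceiling:
        warnings.append("kl above ceiling: consider lowering the learning rate")
    return warnings


history = [
    {"step": 100, "reward": 0.5, "reward_std": 0.30, "kl": 0.02},
    {"step": 400, "reward": 1.5, "reward_std": 0.15, "kl": 0.12},
    # Invented example of an unhealthy step, for demonstration only
    {"step": 900, "reward": 1.6, "reward_std": 0.01, "kl": 0.70},
]
for entry in history:
    for message in check_training_health(entry):
        print(f"step {entry['step']}: {message}")
```

Thresholds like these are starting points; the useful signal is the trend across steps rather than any single value.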

-### 3. Common Pitfalls and Solutions
+### 3. Common pitfalls and solutions

 | Problem | Symptom | Solution |
 |---------|---------|----------|
@ -384,15 +345,14 @@ Step   Reward    Reward_Std   KL
 | **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
 | **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards |

---
+## Advanced patterns

-## Advanced Patterns
+### 1. Multi-stage training

-### 1. Multi-Stage Training
 For complex tasks, train in stages:

 ```python
-# Stage 1: Format compliance (epochs=1)
+# Stage 1: Format compliance
 trainer_stage1 = GRPOTrainer(
    model=model,
    reward_funcs=[incremental_format_reward, format_reward],
@ -400,7 +360,7 @@ trainer_stage1 = GRPOTrainer(
 )
 trainer_stage1.train()

-# Stage 2: Correctness (epochs=1)
+# Stage 2: Correctness
 trainer_stage2 = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, correctness_reward],
@ -409,7 +369,8 @@ trainer_stage2 = GRPOTrainer(
 trainer_stage2.train()
 ```

-### 2. Adaptive Reward Scaling
+### 2. Adaptive reward scaling
+
 ```python
 class AdaptiveReward:
    def __init__(self, base_reward_func, initial_weight=1.0):
@ -428,148 +389,116 @@ class AdaptiveReward:
            self.weight *= 0.9
 ```

-### 3. Custom Dataset Integration
+### 3. Custom dataset integration
+
 ```python
 def load_custom_knowledge_base(csv_path):
-    """Example: School communication platform docs."""
    import pandas as pd
    df = pd.read_csv(csv_path)
-
-    dataset = Dataset.from_pandas(df).map(lambda x: {
+    return Dataset.from_pandas(df).map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': x['expert_answer']
    })
-    return dataset
 ```

---
+## Deployment and inference

-## Deployment and Inference
-
-### Save and Merge LoRA
+### Save and merge LoRA
 ```python
-# Merge LoRA adapters into base model
 if hasattr(trainer.model, 'merge_and_unload'):
    merged_model = trainer.model.merge_and_unload()
    merged_model.save_pretrained("production_model")
    tokenizer.save_pretrained("production_model")
 ```

-### Inference Example
+### Inference
 ```python
 from transformers import pipeline

-generator = pipeline(
-    "text-generation",
-    model="production_model",
-    tokenizer=tokenizer
-)
+generator = pipeline("text-generation", model="production_model", tokenizer=tokenizer)

 result = generator(
    [
        {'role': 'system', 'content': SYSTEM_PROMPT},
-        {'role': 'user', 'content': "What is 15 + 27?"}
+        {'role': 'user', 'content': "What is 15 + 27?"},
    ],
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
-    top_p=0.9
+    top_p=0.9,
 )
 print(result[0]['generated_text'])
 ```

---
+## Best practices checklist

-## Best Practices Checklist
-
-**Before Training:**
+**Before training:**
 - [ ] Validate dataset format (prompts as List[Dict])
 - [ ] Test reward functions on sample data
- [ ] Calculate expected max_prompt_length from data
- [ ] Choose appropriate num_generations based on GPU memory
+- [ ] Calculate expected `max_prompt_length` from data
+- [ ] Choose `num_generations` based on GPU memory
 - [ ] Set up logging (wandb recommended)

-**During Training:**
+**During training:**
 - [ ] Monitor reward progression (should increase)
- [ ] Check reward_std (should stay > 0.1)
+- [ ] Check `reward_std` (should stay > 0.1)
 - [ ] Watch for OOM errors (reduce batch size if needed)
- [ ] Sample generations every 50-100 steps
+- [ ] Sample generations every 50–100 steps
 - [ ] Validate format compliance on holdout set

-**After Training:**
+**After training:**
 - [ ] Merge LoRA weights if using PEFT
 - [ ] Test on diverse prompts
 - [ ] Compare to baseline model
 - [ ] Document reward weights and hyperparameters
 - [ ] Save reproducibility config

---
+## Troubleshooting

-## Troubleshooting Guide
+### Debugging workflow
+1. **Isolate reward functions** — test each independently
+2. **Check data distribution** — ensure diversity in prompts
+3. **Reduce complexity** — start with a single reward, add gradually
+4. **Monitor generations** — print samples every N steps
+5. **Validate extraction logic** — ensure answer parsing works

-### Debugging Workflow
-1. **Isolate reward functions** - Test each independently
-2. **Check data distribution** - Ensure diversity in prompts
-3. **Reduce complexity** - Start with single reward, add gradually
-4. **Monitor generations** - Print samples every N steps
-5. **Validate extraction logic** - Ensure answer parsing works
-
-### Quick Fixes
+### Quick debug reward
 ```python
-# Debug reward function
 def debug_reward(completions, **kwargs):
    responses = [comp[0]['content'] for comp in completions]
-    for i, r in enumerate(responses[:2]):  # Print first 2
+    for i, r in enumerate(responses[:2]):
        print(f"Response {i}: {r[:200]}...")
-    return [1.0] * len(responses)  # Dummy rewards
+    return [1.0] * len(responses)

 # Test without training
 trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
-trainer.generate_completions(dataset[:1])  # Generate without updating
+trainer.generate_completions(dataset[:1])  # assumes GRPOTrainer exposes this method in your TRL version; verify before relying on it
 ```
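
Since reward functions are ordinary Python callables, they can also be exercised on hand-written completions with no trainer, model, or GPU involved. A minimal sketch; `strict_format_reward` is a stand-in written for this example (it is not the guide's own `format_reward`), and the regex is an assumption about the expected layout:

```python
import re

# Assumed layout: a reasoning block followed by an answer block, nothing after
PATTERN = re.compile(
    r"^<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$", re.DOTALL
)


def strict_format_reward(completions, **kwargs):
    """Stand-in reward: 0.5 when the whole response matches PATTERN."""
    responses = [comp[0]["content"] for comp in completions]
    return [0.5 if PATTERN.match(r) else 0.0 for r in responses]


def fake_completion(text):
    """Wrap raw text in the chat-style structure reward functions receive."""
    return [{"role": "assistant", "content": text}]


cases = {
    "well formed": "<reasoning>15 + 27 = 42</reasoning>\n<answer>42</answer>",
    "missing tags": "The answer is 42.",
    "trailing text": "<reasoning>x</reasoning><answer>42</answer> Hope this helps!",
}
scores = strict_format_reward([fake_completion(t) for t in cases.values()])
for name, score in zip(cases, scores):
    print(f"{name}: {score}")
```

Covering at least one passing case and a few distinct failure modes per reward function catches most extraction and regex mistakes before any GPU time is spent.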

---
+## Template

-## References and Resources
+A production-ready training script lives at **`../templates/basic_grpo_training.py`**. It uses Qwen 2.5-1.5B-Instruct with LoRA and three reward functions (incremental format, strict format, correctness) on GSM8K. Copy and adapt:
+1. `get_dataset()` — swap in your data loader
+2. Reward functions — tune to your task
+3. `SYSTEM_PROMPT` — match your output format
+4. `GRPOConfig` — adjust hyperparameters for your GPU
+
+## References and resources

-**Official Documentation:**
 - TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
- DeepSeek R1 Paper: https://arxiv.org/abs/2501.12948
- Unsloth Docs: https://docs.unsloth.ai/
-
-**Example Repositories:**
- Open R1 Implementation: https://github.com/huggingface/open-r1
- TRL Examples: https://github.com/huggingface/trl/tree/main/examples
-
-**Recommended Reading:**
- Progressive Disclosure Pattern for agent instructions
- Reward shaping in RL (Ng et al.)
- LoRA paper (Hu et al., 2021)
-
---
-
-## Usage Instructions for Agents
-
-When this skill is loaded:
-
-1. **Read this entire file** before implementing GRPO training
-2. **Start with the simplest reward function** (e.g., length-based) to validate setup
-3. **Use the templates** in `templates/` directory as starting points
-4. **Reference examples** in `examples/` for task-specific implementations
-5. **Follow the workflow** sequentially (don't skip steps)
-6. **Debug incrementally** - add one reward function at a time
-
-**Critical Reminders:**
- Always use multiple reward functions (3-5 is optimal)
- Monitor reward metrics, not loss
- Test reward functions before training
- Start small (num_generations=4), scale up gradually
- Save checkpoints frequently (every 100 steps)
-
-This skill is designed for **expert-level implementation**. Beginners should start with supervised fine-tuning before attempting GRPO.
-
+- GRPO paper (DeepSeek): https://arxiv.org/abs/2402.03300
+- DeepSeek R1 paper: https://arxiv.org/abs/2501.12948
+- Open R1 implementation: https://github.com/huggingface/open-r1
+- TRL examples: https://github.com/huggingface/trl/tree/main/examples
+- Unsloth (faster training): https://docs.unsloth.ai/

+## Critical reminders

+- **Loss goes UP during training** — this is normal (it's KL divergence)
+- **Use 3–5 reward functions** — single rewards often fail
+- **Test rewards before training** — debug each function independently
+- **Monitor `reward_std`** — should stay > 0.1 (avoid mode collapse)
+- **Start with `num_generations` between 4 and 8** — scale up if GPU allows
--- a/skills/mlops/training/grpo-rl-training/templates/basic_grpo_training.py
+++ b/skills/mlops/training/grpo-rl-training/templates/basic_grpo_training.py