mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
skills: consolidate mlops redundancies (gguf+llama-cpp, grpo+trl, guidance→optional) (#11965)
Three tightly-scoped built-in skill consolidations to reduce redundancy in the available_skills listing injected into every system prompt:

1. gguf-quantization → llama-cpp (merged). GGUF is llama.cpp's format; the two skills covered the same toolchain. The merged llama-cpp skill keeps the full K-quant table + imatrix workflow from gguf and the ROCm/benchmarks/supported-models sections from the original llama-cpp. All 5 reference files preserved.
2. grpo-rl-training → fine-tuning-with-trl (folded in). GRPO isn't a framework, it's a trainer inside TRL. Moved the 17KB deep-dive SKILL.md to references/grpo-training.md and the working template to templates/basic_grpo_training.py. TRL's GRPO workflow section now points to both. The Atropos skill's related_skills updated.
3. guidance → optional-skills/mlops/. Dropped from built-in. Outlines (still built-in) covers the same structured-generation ground with wider adoption. Guidance is listed in the optional catalog for users who specifically want it.

Net: 3 fewer built-in skill lines in every system prompt, zero content loss. Contributor authorship preserved via git rename detection.
parent 598cba62ad
commit 73bccc94c7
15 changed files with 470 additions and 889 deletions
@@ -7,7 +7,7 @@ license: MIT
 metadata:
   hermes:
     tags: [atropos, rl, environments, training, reinforcement-learning, reward-functions]
-    related_skills: [axolotl, grpo-rl-training, trl-fine-tuning, lm-evaluation-harness]
+    related_skills: [axolotl, fine-tuning-with-trl, lm-evaluation-harness]
 ---
 
 # Hermes Agent Atropos Environments
@@ -1,430 +0,0 @@
---
name: gguf-quantization
description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [llama-cpp-python>=0.2.0]
metadata:
  hermes:
    tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
---

# GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.

## When to use GGUF

**Use GGUF when:**
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Need CPU inference without GPU requirements
- Want flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)

**Key advantages:**
- **Universal hardware**: CPU, Apple Silicon, NVIDIA, AMD support
- **No Python runtime**: Pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: Importance matrix for better low-bit quality

**Use alternatives instead:**
- **AWQ/GPTQ**: Maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: Fast calibration-free quantization for HuggingFace
- **bitsandbytes**: Simple integration with transformers library
- **TensorRT-LLM**: Production NVIDIA deployment with maximum speed
## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
```

### Convert model to GGUF

```bash
# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
  --outfile model-f16.gguf \
  --outtype f16
```

### Quantize model

```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

### Run inference

```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```
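The convert, quantize, test sequence above is easy to script from Python. A minimal sketch that only builds the command lists without running them; the script name, binary location, and output naming are assumptions taken from the examples above:

```python
import subprocess

def quantize_pipeline(model_dir: str, out_prefix: str, quant: str = "Q4_K_M"):
    """Build the convert + quantize command lists for the steps above."""
    f16 = f"{out_prefix}-f16.gguf"
    out = f"{out_prefix}-{quant.lower()}.gguf"
    convert = ["python", "convert_hf_to_gguf.py", model_dir,
               "--outfile", f16, "--outtype", "f16"]
    quantize = ["./llama-quantize", f16, out, quant]
    return convert, quantize

convert_cmd, quant_cmd = quantize_pipeline("./path/to/model", "model")
print(quant_cmd)
# To actually run (requires llama.cpp built locally):
# subprocess.run(convert_cmd, check=True)
# subprocess.run(quant_cmd, check=True)
```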
## Quantization types

### K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |

### Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |

**Recommendation**: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
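The Size (7B) column is roughly parameters times bits-per-weight. A back-of-envelope estimator; it ignores GGUF metadata and non-quantized tensors, so real files run slightly larger than the estimate:

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough quantized file size: params * bits-per-weight / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9

# 7B model at Q4_K_M (~4.5 effective bits/weight)
print(round(gguf_size_gb(7e9, 4.5), 2))  # 3.94 (GB) -- close to the ~4.1 GB in the table
```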
## Conversion workflows

### Workflow 1: HuggingFace to GGUF

```bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
  --outfile llama-3.1-8b-f16.gguf \
  --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```

### Workflow 2: With importance matrix (better quality)

```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
  -f calibration.txt \
  --chunk 512 \
  -o model.imatrix \
  -ngl 35  # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
  model-f16.gguf \
  model-q4_k_m.gguf \
  Q4_K_M
```

### Workflow 3: Multiple quantizations

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
  ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
  echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
## Python usage

### llama-cpp-python

```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU offload (0 for CPU only)
    n_threads=8       # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```

### Chat completion

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```

### Streaming

```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```
## Server mode

### Start OpenAI-compatible server

```bash
# Start server
./llama-server -m model-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 4096

# Or with Python bindings
python -m llama_cpp.server \
  --model model-q4_k_m.gguf \
  --n_gpu_layers 35 \
  --host 0.0.0.0 \
  --port 8080
```

### Use with OpenAI client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```
## Hardware optimization

### Apple Silicon (Metal)

```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```

### NVIDIA CUDA

```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```

### CPU optimization

```bash
# Build with AVX2/AVX512
make clean && make

# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
# Python CPU config
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```
## Integration with tools

### Ollama

```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"
```

### LM Studio

1. Place GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload
4. Start inference

### text-generation-webui

```bash
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
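Since a Modelfile is plain text, it can also be generated programmatically when packaging many quantizations. A small sketch using the same template and parameters as the Ollama example above; the helper name is hypothetical:

```python
def make_modelfile(gguf_path: str, temperature: float = 0.7, num_ctx: int = 4096) -> str:
    """Render an Ollama Modelfile string matching the example above."""
    return (
        f"FROM {gguf_path}\n"
        'TEMPLATE """{{ .System }}\n'
        '{{ .Prompt }}"""\n'
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
    )

print(make_modelfile("./model-q4_k_m.gguf"))
```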
## Best practices

1. **Use K-quants**: Q4_K_M offers best quality/size balance
2. **Use imatrix**: Always use importance matrix for Q4 and below
3. **GPU offload**: Offload as many layers as VRAM allows
4. **Context length**: Start with 4096, increase if needed
5. **Thread count**: Match physical CPU cores, not logical
6. **Batch size**: Increase n_batch for faster prompt processing

## Common issues

**Model loads slowly:**
```bash
# Use mmap for faster loading
./llama-cli -m model.gguf --mmap
```

**Out of memory:**
```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduce from 35

# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

**Poor quality at low bits:**
```bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

## References

- **[Advanced Usage](references/advanced-usage.md)** - Batching, speculative decoding, custom builds
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, benchmarks

## Resources

- **Repository**: https://github.com/ggml-org/llama.cpp
- **Python Bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized Models**: https://huggingface.co/TheBloke
- **GGUF Converter**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT
@@ -1,138 +1,271 @@
 ---
 name: llama-cpp
-description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
+description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization.
-version: 1.0.0
+version: 2.0.0
 author: Orchestra Research
 license: MIT
-dependencies: [llama-cpp-python]
+dependencies: [llama-cpp-python>=0.2.0]
 metadata:
   hermes:
-    tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
+    tags: [llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression]
 ---
 
-# llama.cpp
+# llama.cpp + GGUF
 
-Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
+Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.
 
-## When to use llama.cpp
+## When to use
 
-**Use llama.cpp when:**
-- Running on CPU-only machines
-- Deploying on Apple Silicon (M1/M2/M3/M4)
-- Using AMD or Intel GPUs (no CUDA)
-- Edge deployment (Raspberry Pi, embedded systems)
-- Need simple deployment without Docker/Python
+**Use llama.cpp + GGUF when:**
+- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
+- Using AMD (ROCm) or Intel GPUs where CUDA isn't available
+- Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
+- Need flexible quantization (2–8 bit with K-quants)
+- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
+- Want a single binary deploy without Docker/Python
 
-**Use TensorRT-LLM instead when:**
-- Have NVIDIA GPUs (A100/H100)
-- Need maximum throughput (100K+ tok/s)
-- Running in datacenter with CUDA
+**Key advantages:**
+- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
+- No Python runtime required (pure C/C++)
+- K-quants + imatrix for better low-bit quality
+- OpenAI-compatible server built in
+- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)
 
-**Use vLLM instead when:**
-- Have NVIDIA GPUs
-- Need Python-first API
-- Want PagedAttention
+**Use alternatives instead:**
+- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput
+- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed
+- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments
+- **bitsandbytes** — Simple HuggingFace transformers integration
+- **HQQ** — Fast calibration-free quantization
 ## Quick start
 
-### Installation
+### Install
 
 ```bash
-# macOS/Linux
+# macOS / Linux (simplest)
 brew install llama.cpp
 
 # Or build from source
-git clone https://github.com/ggerganov/llama.cpp
+git clone https://github.com/ggml-org/llama.cpp
 cd llama.cpp
-make
-
-# With Metal (Apple Silicon)
-make LLAMA_METAL=1
-
-# With CUDA (NVIDIA)
-make LLAMA_CUDA=1
-
-# With ROCm (AMD)
-make LLAMA_HIP=1
+make               # CPU
+make GGML_METAL=1  # Apple Silicon
+make GGML_CUDA=1   # NVIDIA CUDA
+make LLAMA_HIP=1   # AMD ROCm
+
+# Python bindings (optional)
+pip install llama-cpp-python
+# With CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
+# With Metal: CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
 ```
 
-### Download model
+### Download a pre-quantized GGUF
 
 ```bash
-# Download from HuggingFace (GGUF format)
+# TheBloke hosts most popular models pre-quantized
 huggingface-cli download \
   TheBloke/Llama-2-7B-Chat-GGUF \
   llama-2-7b-chat.Q4_K_M.gguf \
   --local-dir models/
-
-# Or convert from HuggingFace
-python convert_hf_to_gguf.py models/llama-2-7b-chat/
 ```
+
+### Or convert a HuggingFace model to GGUF
+
+```bash
+# 1. Download HF model
+huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
+
+# 2. Convert to FP16 GGUF
+python convert_hf_to_gguf.py ./llama-3.1-8b \
+  --outfile llama-3.1-8b-f16.gguf \
+  --outtype f16
+
+# 3. Quantize to Q4_K_M
+./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
+```
 
 ### Run inference
 
 ```bash
-# Simple chat
-./llama-cli \
-  -m models/llama-2-7b-chat.Q4_K_M.gguf \
-  -p "Explain quantum computing" \
-  -n 256  # Max tokens
+# One-shot prompt
+./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256
 
 # Interactive chat
-./llama-cli \
-  -m models/llama-2-7b-chat.Q4_K_M.gguf \
-  --interactive
+./llama-cli -m model.Q4_K_M.gguf --interactive
+
+# With GPU offload
+./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
 ```
 
-### Server mode
+### Serve an OpenAI-compatible API
 
 ```bash
-# Start OpenAI-compatible server
 ./llama-server \
-  -m models/llama-2-7b-chat.Q4_K_M.gguf \
+  -m model.Q4_K_M.gguf \
   --host 0.0.0.0 \
   --port 8080 \
-  -ngl 32  # Offload 32 layers to GPU
+  -ngl 35 \
+  -c 4096 \
+  --parallel 4 \
+  --cont-batching
+```
 
-# Client request
+```bash
 curl http://localhost:8080/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
-    "model": "llama-2-7b-chat",
+    "model": "local",
     "messages": [{"role": "user", "content": "Hello!"}],
     "temperature": 0.7,
     "max_tokens": 100
   }'
 ```
-## Quantization formats
+## Quantization formats (GGUF)
 
-### GGUF format overview
+### K-quant methods (recommended)
 
-| Format | Bits | Size (7B) | Speed | Quality | Use Case |
-|--------|------|-----------|-------|---------|----------|
-| **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** |
-| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
-| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
-| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
-| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
-| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
+| Type | Bits | Size (7B) | Quality | Use Case |
+|------|------|-----------|---------|----------|
+| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
+| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
+| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
+| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
+| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** |
+| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
+| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
+| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
+| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |
+
+**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality).
+
+**Legacy quants (Q4_0/Q4_1/Q5_0/Q5_1) exist**, but always prefer K-quants for better quality/size ratio.
+
+**IQ quantization** — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. These require `--imatrix`.
+
+**Task-specific defaults:**
+- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
+- Code generation: Q5_K_M or Q6_K (higher precision helps)
+- Technical / medical: Q6_K or Q8_0
+- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
+- Raspberry Pi / edge: Q2_K or Q3_K_S
 
-### Choosing quantization
-
-```bash
-# General use (balanced)
-Q4_K_M  # 4-bit, medium quality
-
-# Maximum speed (more degradation)
-Q2_K or Q3_K_M
-
-# Maximum quality (slower)
-Q6_K or Q8_0
-
-# Very large models (70B, 405B)
-Q3_K_M or Q4_K_S  # Lower bits to fit in memory
-```
+## Conversion workflows
+
+### Basic: HF → GGUF → quantized
+
+```bash
+python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
+./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
+./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50
+```
+
+### With importance matrix (imatrix) — better low-bit quality
+
+`imatrix` gives 10–20% perplexity improvement at Q4, essential at Q3 and below.
+
+```bash
+# 1. Convert to FP16 GGUF
+python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
+
+# 2. Prepare calibration data (diverse text, ~100MB is ideal)
+cat > calibration.txt << 'EOF'
+The quick brown fox jumps over the lazy dog.
+Machine learning is a subset of artificial intelligence.
+# Add more diverse text samples...
+EOF
+
+# 3. Generate importance matrix
+./llama-imatrix -m model-f16.gguf \
+  -f calibration.txt \
+  --chunk 512 \
+  -o model.imatrix \
+  -ngl 35
+
+# 4. Quantize with imatrix
+./llama-quantize --imatrix model.imatrix \
+  model-f16.gguf model-q4_k_m.gguf Q4_K_M
+```
+
+### Multi-quant batch
+
+```bash
+#!/bin/bash
+MODEL="llama-3.1-8b-f16.gguf"
+IMATRIX="llama-3.1-8b.imatrix"
+
+./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
+
+for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
+  OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
+  ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
+  echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
+done
+```
+
+### Quality testing (perplexity)
+
+```bash
+./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
+# Baseline FP16: ~5.96 | Q4_K_M: ~6.06 (+1.7%) | Q2_K: ~6.87 (+15.3%)
+```
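The degradation percentages quoted in the perplexity comment follow directly from the measured values; checking the arithmetic:

```python
def ppl_degradation(baseline: float, quantized: float) -> float:
    """Relative perplexity increase vs the FP16 baseline, in percent."""
    return (quantized - baseline) / baseline * 100

print(round(ppl_degradation(5.96, 6.06), 1))  # Q4_K_M: 1.7
print(round(ppl_degradation(5.96, 6.87), 1))  # Q2_K: 15.3
```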
+## Python bindings (llama-cpp-python)
+
+### Basic generation
+
+```python
+from llama_cpp import Llama
+
+llm = Llama(
+    model_path="./model-q4_k_m.gguf",
+    n_ctx=4096,
+    n_gpu_layers=35,  # 0 for CPU only, 99 to offload everything
+    n_threads=8,
+)
+
+output = llm(
+    "What is machine learning?",
+    max_tokens=256,
+    temperature=0.7,
+    stop=["</s>", "\n\n"],
+)
+print(output["choices"][0]["text"])
+```
+
+### Chat completion + streaming
+
+```python
+llm = Llama(
+    model_path="./model-q4_k_m.gguf",
+    n_ctx=4096,
+    n_gpu_layers=35,
+    chat_format="llama-3",  # Or "chatml", "mistral", etc.
+)
+
+# Non-streaming
+response = llm.create_chat_completion(
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "What is Python?"},
+    ],
+    max_tokens=256,
+    temperature=0.7,
+)
+print(response["choices"][0]["message"]["content"])
+
+# Streaming
+for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
+    print(chunk["choices"][0]["text"], end="", flush=True)
+```
+
+### Embeddings
+
+```python
+llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
+vec = llm.embed("This is a test sentence.")
+print(f"Embedding dimension: {len(vec)}")
+```
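Vectors returned by `llm.embed` are plain lists of floats, so basic similarity search needs no extra library. A minimal cosine-similarity helper:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```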
 ## Hardware acceleration
@ -140,122 +273,166 @@ Q3_K_M or Q4_K_S # Lower bits to fit in memory
|
||||||
### Apple Silicon (Metal)
|
### Apple Silicon (Metal)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Build with Metal
|
make clean && make GGML_METAL=1
|
||||||
make LLAMA_METAL=1
|
./llama-cli -m model.gguf -ngl 99 -p "Hello" # offload all layers
|
||||||
|
|
||||||
# Run with GPU acceleration (automatic)
|
|
||||||
./llama-cli -m model.gguf -ngl 999 # Offload all layers
|
|
||||||
|
|
||||||
# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### NVIDIA GPUs (CUDA)
|
```python
|
||||||
|
llm = Llama(
|
||||||
```bash
|
model_path="model.gguf",
|
||||||
# Build with CUDA
|
n_gpu_layers=99, # Offload everything
|
||||||
make LLAMA_CUDA=1
|
n_threads=1, # Metal handles parallelism
|
||||||
|
)
|
||||||
# Offload layers to GPU
|
|
||||||
./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
|
|
||||||
|
|
||||||
# Hybrid CPU+GPU for large models
|
|
||||||
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### AMD GPUs (ROCm)
|
Performance: M3 Max ~40–60 tok/s on Llama 2-7B Q4_K_M.
|
||||||
|
|
||||||
|
### NVIDIA (CUDA)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
make clean && make GGML_CUDA=1
|
||||||
|
./llama-cli -m model.gguf -ngl 35 -p "Hello"
|
||||||
|
|
||||||
|
# Hybrid for large models
|
||||||
|
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
|
||||||
|
|
||||||
|
# Multi-GPU split
|
||||||
|
./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
|
||||||
|
```
|
||||||
|
|
||||||
|
### AMD (ROCm)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Build with ROCm
|
|
||||||
make LLAMA_HIP=1
|
make LLAMA_HIP=1
|
||||||
|
|
||||||
# Run with AMD GPU
|
|
||||||
./llama-cli -m model.gguf -ngl 999
|
./llama-cli -m model.gguf -ngl 999
|
||||||
```
|
```
|
||||||
|
|
||||||
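Choosing `-ngl` usually comes down to how many layer-sized slices of the GGUF file fit in VRAM after reserving room for the KV cache. A back-of-envelope helper (the 4.1 GB / 32-layer figures are ballpark numbers for a 7B Q4_K_M, and the 1.5 GB overhead is an assumption, not a measured value):

```python
def layers_that_fit(vram_gb, file_size_gb, n_layers, kv_overhead_gb=1.5):
    """Rough -ngl estimate: treat layers as equal slices of the GGUF file
    and reserve some VRAM for KV cache and compute buffers (assumed 1.5 GB)."""
    per_layer_gb = file_size_gb / n_layers
    usable = vram_gb - kv_overhead_gb
    return max(0, min(n_layers, int(usable / per_layer_gb)))

# 7B Q4_K_M is roughly a 4.1 GB file with 32 layers; on an 8 GB card:
print(layers_that_fit(8, 4.1, 32))  # all 32 layers fit
```

If the result is below `n_layers`, that is your hybrid CPU+GPU split; on OOM, drop the value by 5 and retry as the best-practices list below suggests.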
### CPU

```bash
# Match PHYSICAL cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"

# BLAS acceleration (2–3× speedup)
make LLAMA_OPENBLAS=1
```

```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,
    n_threads=8,
    n_batch=512,  # Larger batch = faster prompt processing
)
```
## Performance benchmarks

### CPU (Llama 2-7B Q4_K_M)

| CPU | Threads | Speed |
|-----|---------|-------|
| Apple M3 Max (Metal) | 16 | 50 tok/s |
| AMD Ryzen 9 7950X | 32 | 35 tok/s |
| Intel i9-13900K | 32 | 30 tok/s |

### GPU offloading on RTX 4090

| Layers GPU | Speed | VRAM |
|------------|-------|------|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |
## Supported models

- **LLaMA family**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama
- **Mistral family**: Mistral 7B, Mixtral 8x7B/8x22B
- **Other**: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)

Find GGUF models: https://huggingface.co/models?library=gguf

## Ecosystem integrations

### Ollama

```bash
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

ollama create mymodel -f Modelfile
ollama run mymodel "Hello!"
```
### LM Studio

1. Place GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload, start inference

### text-generation-webui

```bash
cp model-q4_k_m.gguf text-generation-webui/models/
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```

### OpenAI client → llama-server

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
## Best practices

1. **Use K-quants** — Q4_K_M is the recommended default
2. **Use imatrix** for Q4 and below (calibration improves quality substantially)
3. **Offload as many layers as VRAM allows** — start high, reduce by 5 on OOM
4. **Thread count** — match physical cores, not logical
5. **Batch size** — increase `n_batch` (e.g. 512) for faster prompt processing
6. **Context** — start at 4096, grow only as needed (memory scales with ctx)
7. **Flash Attention** — add `--flash-attn` if your build supports it
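Taken together, the checklist above maps onto a handful of llama-cli flags. A small helper that assembles them (flag spellings can vary across llama.cpp versions, so treat this as a sketch, not an authoritative CLI reference):

```python
def llama_cli_args(model, ngl=99, threads=8, ctx=4096, batch=512, flash_attn=True):
    """Assemble a llama-cli invocation following the checklist above."""
    args = [
        "./llama-cli", "-m", model,
        "-ngl", str(ngl),    # offload as many layers as VRAM allows
        "-t", str(threads),  # physical cores, not logical
        "-c", str(ctx),      # start at 4096, grow as needed
        "-b", str(batch),    # larger batch = faster prompt processing
    ]
    if flash_attn:
        args.append("--flash-attn")
    return args

print(" ".join(llama_cli_args("model-q4_k_m.gguf")))
```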
## Common issues (quick fixes)

**Model loads slowly** — use `--mmap` for memory-mapped loading.

**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:

```python
Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35)  # Q4_0 KV cache
```

**Garbage output** — wrong `chat_format`, temperature too high, or model file corrupted. Test with `temperature=0.1` and verify FP16 baseline works.

**Connection refused (server)** — bind to `--host 0.0.0.0`, check `lsof -i :8080`.

See `references/troubleshooting.md` for the full playbook.
## References

- **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
- **[quantization.md](references/quantization.md)** — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
- **[server.md](references/server.md)** — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
- **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
- **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging

## Resources

- **GitHub**: https://github.com/ggml-org/llama.cpp
- **Python bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized models**: https://huggingface.co/TheBloke
- **GGUF converter Space**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT
@@ -1,97 +0,0 @@
# GRPO/RL Training Skill

**Expert-level guidance for Group Relative Policy Optimization with TRL**

## 📁 Skill Structure

```
grpo-rl-training/
├── SKILL.md                        # Main skill documentation (READ THIS FIRST)
├── README.md                       # This file
├── templates/
│   └── basic_grpo_training.py      # Production-ready training template
└── examples/
    └── reward_functions_library.py # 20+ reward function examples
```

## 🚀 Quick Start

1. **Read SKILL.md** - Comprehensive guide with all concepts and patterns
2. **Copy `templates/basic_grpo_training.py`** - Start with working code
3. **Browse `examples/reward_functions_library.py`** - Pick reward functions for your task
4. **Modify for your use case** - Adapt dataset, rewards, and config

## 💡 What's Inside

### SKILL.md (Main Documentation)
- Core GRPO concepts and algorithm fundamentals
- Complete implementation workflow (dataset → rewards → training → deployment)
- 10+ reward function examples with code
- Hyperparameter tuning guide
- Training insights (loss behavior, metrics, debugging)
- Troubleshooting guide
- Production best practices

### Templates
- **basic_grpo_training.py**: Minimal, production-ready training script
  - Uses Qwen 2.5 1.5B Instruct
  - 3 reward functions (format + correctness)
  - LoRA for efficient training
  - Fully documented and ready to run

### Examples
- **reward_functions_library.py**: 20+ battle-tested reward functions
  - Correctness rewards (exact match, fuzzy match, numeric, code execution)
  - Format rewards (XML, JSON, strict/soft)
  - Length rewards (ideal length, min/max)
  - Style rewards (reasoning quality, citations, repetition penalty)
  - Combined rewards (multi-objective optimization)
  - Preset collections for common tasks

## 📖 Usage for Agents

When this skill is loaded in your agent's context:

1. **Always read SKILL.md first** before implementing
2. **Start simple** - Use length-based reward to validate setup
3. **Build incrementally** - Add one reward function at a time
4. **Reference examples** - Copy patterns from reward_functions_library.py
5. **Monitor training** - Watch reward metrics (not loss!)

## 🎯 Common Use Cases

| Task Type | Recommended Rewards | Template |
|-----------|---------------------|----------|
| Math reasoning | `MATH_REASONING_REWARDS` preset | basic_grpo_training.py |
| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template |
| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards |
| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations |

## ⚠️ Critical Reminders

- **Loss goes UP during training** - This is normal (it's KL divergence)
- **Use 3-5 reward functions** - Single rewards often fail
- **Test rewards before training** - Debug each function independently
- **Monitor reward_std** - Should stay > 0.1 (avoid mode collapse)
- **Start with num_generations=4-8** - Scale up if GPU allows

## 🔗 External Resources

- [TRL Documentation](https://huggingface.co/docs/trl)
- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948)
- [Open R1 Implementation](https://github.com/huggingface/open-r1)
- [Unsloth (2-3x faster)](https://docs.unsloth.ai/)

## 📝 Version

**v1.0.0** - Initial release (January 2025)

## 👨‍💻 Maintained By

Orchestra Research
For questions or improvements, see https://orchestra.com

---

**License:** MIT
**Last Updated:** January 2025
@@ -252,6 +252,8 @@ trl dpo \

Train with reinforcement learning using minimal memory.

For in-depth GRPO guidance — reward function design, critical training insights (loss behavior, mode collapse, tuning), and advanced multi-stage patterns — see **[references/grpo-training.md](references/grpo-training.md)**. A production-ready training script is in **[templates/basic_grpo_training.py](templates/basic_grpo_training.py)**.

Copy this checklist:
@@ -428,6 +430,8 @@ config = PPOConfig(

**Online RL methods**: See [references/online-rl.md](references/online-rl.md) for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.

**GRPO deep dive**: See [references/grpo-training.md](references/grpo-training.md) for expert-level GRPO patterns — reward function design philosophy, training insights (why loss increases, mode collapse detection), hyperparameter tuning, multi-stage training, and troubleshooting. Production-ready template in [templates/basic_grpo_training.py](templates/basic_grpo_training.py).

## Hardware requirements

- **GPU**: NVIDIA (CUDA required)
@@ -1,51 +1,36 @@
# GRPO (Group Relative Policy Optimization) — Deep Guide

Expert-level patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions using TRL's `GRPOTrainer`. This is the deep reference for the GRPO workflow summarized in the main skill.

## When to use GRPO

Use GRPO when you need to:
- **Enforce specific output formats** (XML tags, JSON, structured reasoning)
- **Teach verifiable tasks** with objective correctness metrics (math, coding, fact-checking)
- **Improve reasoning capabilities** by rewarding chain-of-thought patterns
- **Align models to domain-specific behaviors** without labeled preference data
- **Optimize for multiple objectives** simultaneously (format + correctness + style)

**Do NOT use GRPO for:**
- Simple supervised fine-tuning tasks → use SFT
- Tasks without clear reward signals
- When you already have high-quality preference pairs → use DPO/PPO

## Core concepts

### 1. GRPO algorithm fundamentals

**Key mechanism:**
- Generates **multiple completions** per prompt (group size: 4–16)
- Compares completions within each group using reward functions
- Updates policy to favor higher-rewarded responses relative to the group

**Critical differences from PPO:**
- No separate reward model needed
- More sample-efficient (learns from within-group comparisons)
- Simpler to implement and debug

**Mathematical intuition:**
```
For each prompt p:
  1. Generate N completions: {c₁, c₂, ..., cₙ}
```
@@ -54,35 +39,32 @@ For each prompt p:

```
     relative to low-reward ones in the same group
```
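The group comparison sketched above amounts to normalizing each completion's reward against its own group's mean and standard deviation. A toy calculation (simplified; the real trainer also folds in a KL penalty):

```python
def group_advantages(rewards):
    """Per-completion advantage: (r - group mean) / group std,
    the normalization GRPO applies within each prompt's group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0  # guard zero std
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled completions with their scalar rewards:
print(group_advantages([1.0, 0.0, 0.5, 0.5]))
```

Note that advantages always sum to zero within a group: the policy is pushed toward the best completions only relative to their siblings, which is why per-prompt reward scale matters less than within-group diversity.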
### 2. Reward function design philosophy

**Golden rules:**
1. **Compose multiple reward functions** — each handles one aspect (format, correctness, style)
2. **Scale rewards appropriately** — higher weight = stronger signal
3. **Use incremental rewards** — partial credit for partial compliance
4. **Test rewards independently** — debug each reward function in isolation

**Reward function types:**

| Type | Use Case | Example Weight |
|------|----------|----------------|
| **Correctness** | Verifiable tasks (math, code) | 2.0 (highest) |
| **Format** | Strict structure enforcement | 0.5–1.0 |
| **Length** | Encourage verbosity/conciseness | 0.1–0.5 |
| **Style** | Penalize unwanted patterns | −0.5 to 0.5 |
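The weights in the table are typically applied by summing weighted sub-rewards. A sketch with two toy sub-rewards (completions simplified to plain strings here; in TRL they are lists of message dicts, and both sub-reward functions are hypothetical stand-ins):

```python
def length_reward(completions, **kwargs):
    # Toy sub-reward: concise answers (under 200 chars) get credit
    return [1.0 if len(c) < 200 else 0.0 for c in completions]

def format_reward(completions, **kwargs):
    # Toy sub-reward: presence of an <answer> tag
    return [1.0 if "<answer>" in c else 0.0 for c in completions]

def combined_reward(completions, weights=(0.3, 1.0), **kwargs):
    """Weighted sum of sub-rewards, mirroring the weighting idea in the table."""
    length = length_reward(completions, **kwargs)
    fmt = format_reward(completions, **kwargs)
    return [weights[0] * l + weights[1] * f for l, f in zip(length, fmt)]

print(combined_reward(["<answer>42</answer>", "no tags here" * 30]))
```

In practice you can either combine sub-rewards into one function like this, or pass each as a separate entry in `GRPOTrainer`'s reward function list so their contributions are logged individually.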
## Implementation workflow

### Step 1: Dataset preparation

**Critical requirements:**
- Prompts in chat format (list of dicts with `role` and `content`)
- Include system prompts to set expectations
- For verifiable tasks, include ground truth answers as additional columns

```python
from datasets import load_dataset, Dataset
```
@@ -97,8 +79,7 @@ Respond in the following format:

```python
"""

def prepare_dataset(raw_data):
    """Transform raw data into GRPO-compatible format.

    Returns: Dataset with columns:
    - 'prompt': List[Dict] with role/content (system + user messages)
```
@@ -113,14 +94,14 @@ def prepare_dataset(raw_data):

```python
    })
```

**Pro tips:**
- Use one-shot or few-shot examples in the system prompt for complex formats
- Keep prompts concise (max_prompt_length: 256–512 tokens)
- Validate data quality before training (garbage in = garbage out)
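A minimal version of the structure described above, using plain dicts (wrap the result with `Dataset.from_list` for TRL; the system prompt is abbreviated and the field names follow the dataset columns discussed in this step):

```python
SYSTEM_PROMPT = "Respond with <reasoning>...</reasoning> then <answer>...</answer>"  # abbreviated

def to_grpo_rows(raw):
    """Build the chat-format 'prompt' column plus a ground-truth 'answer' column."""
    return [
        {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": ex["question"]},
            ],
            "answer": ex["answer"],
        }
        for ex in raw
    ]

rows = to_grpo_rows([{"question": "2+2?", "answer": "4"}])
print(rows[0]["prompt"][0]["role"])  # system message comes first
```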
### Step 2: Reward function implementation

**Template structure:**
```python
def reward_function_name(
    prompts,       # List[List[Dict]]: Original prompts
```
@@ -128,24 +109,16 @@ def reward_function_name(

```python
    answer=None,   # Optional: Ground truth from dataset
    **kwargs       # Additional dataset columns
) -> list[float]:
    """Evaluate completions and return rewards (one per completion)."""
    responses = [comp[0]['content'] for comp in completions]
    rewards = []
    for response in responses:
        score = compute_score(response)
        rewards.append(score)
    return rewards
```

**Example 1: correctness reward (math/coding)**
```python
def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward correct answers with high score."""
```
@@ -155,7 +128,7 @@ def correctness_reward(prompts, completions, answer, **kwargs):

```python
            for ans, gt in zip(extracted, answer)]
```

**Example 2: format reward (structured output)**
```python
import re
```
@@ -167,7 +140,7 @@ def format_reward(completions, **kwargs):

```python
            for r in responses]
```

**Example 3: incremental format reward (partial credit)**
```python
def incremental_format_reward(completions, **kwargs):
    """Award partial credit for format compliance."""
```
@@ -176,14 +149,10 @@ def incremental_format_reward(completions, **kwargs):

```python
    for r in responses:
        score = 0.0
        if '<reasoning>' in r: score += 0.25
        if '</reasoning>' in r: score += 0.25
        if '<answer>' in r: score += 0.25
        if '</answer>' in r: score += 0.25
        # Penalize extra text after closing tag
        if r.count('</answer>') == 1:
            extra_text = r.split('</answer>')[-1].strip()
```
@@ -193,12 +162,11 @@ def incremental_format_reward(completions, **kwargs):

```python
    return rewards
```

**Critical insight:** Combine 3–5 reward functions for robust training. Order matters less than diversity of signals.
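Golden rule 4 is cheap to follow: run each reward function on a few canned completions before launching training. A small harness using the calling convention shown in the template above (`tag_reward` is a hypothetical example reward, not part of the library):

```python
def smoke_test_reward(reward_fn, sample_texts, **extra_cols):
    """Run one reward function on canned outputs and print each score."""
    completions = [[{"role": "assistant", "content": t}] for t in sample_texts]
    scores = reward_fn(prompts=None, completions=completions, **extra_cols)
    for text, score in zip(sample_texts, scores):
        print(f"{score:5.2f}  {text[:40]!r}")
    return scores

def tag_reward(prompts, completions, **kwargs):
    # Hypothetical reward: full credit only for a closed <answer> block
    texts = [c[0]["content"] for c in completions]
    return [1.0 if "<answer>" in t and "</answer>" in t else 0.0 for t in texts]

scores = smoke_test_reward(tag_reward, ["<answer>4</answer>", "just text"])
```

If a reward returns the same score for obviously good and bad samples, fix it here rather than after a wasted training run.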
### Step 3: Training configuration

**Memory-optimized config (small GPU)**
```python
from trl import GRPOConfig
```
@@ -218,13 +186,13 @@ training_args = GRPOConfig(

```python
    gradient_accumulation_steps=4,   # Effective batch = 4

    # GRPO-specific
    num_generations=8,               # Group size: 8–16 recommended
    max_prompt_length=256,
    max_completion_length=512,

    # Training duration
    num_train_epochs=1,
    max_steps=None,

    # Optimization
    bf16=True,                       # Faster on A100/H100
```
@@ -234,11 +202,11 @@ training_args = GRPOConfig(

```python
    # Logging
    logging_steps=1,
    save_steps=100,
    report_to="wandb",
)
```

**High-performance config (large GPU)**
```python
training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
```
@@ -255,31 +223,30 @@ training_args = GRPOConfig(

```python
)
```

**Critical hyperparameters:**

| Parameter | Impact | Tuning Advice |
|-----------|--------|---------------|
| `num_generations` | Group size for comparison | Start 8, increase to 16 if GPU allows |
| `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
| `max_completion_length` | Output verbosity | Match your task (512 reasoning, 256 short answers) |
| `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory limited |
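A quick sanity check on how these knobs interact: the number of completions sampled per optimizer step grows multiplicatively. A sketch, under the assumption that the batch size counts prompts and each prompt gets one group of generations (the exact accounting varies across TRL versions, so verify against your version's docs):

```python
def generations_per_step(per_device_batch, grad_accum, num_generations, n_gpus=1):
    """Completions sampled per optimizer step, assuming the batch counts
    prompts and each prompt gets one group (assumption; TRL-version-dependent)."""
    prompts = per_device_batch * grad_accum * n_gpus
    return prompts * num_generations

# The memory-optimized config above: 4 prompts per step, 8 generations each
print(generations_per_step(1, 4, 8))  # → 32
```

This is why raising `num_generations` is the fastest way to exhaust GPU memory: every increment multiplies the generation workload of each step.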
### Step 4: Model setup and training

**Standard setup (Transformers + TRL)**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2–3× faster
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
```
|
||||||
|
|
||||||
# Optional: LoRA for parameter-efficient training
|
# Optional: LoRA for parameter-efficient training
|
||||||
peft_config = LoraConfig(
|
peft_config = LoraConfig(
|
||||||
r=16, # Rank (higher = more capacity)
|
r=16,
|
||||||
lora_alpha=32, # Scaling factor (typically 2*r)
|
lora_alpha=32,
|
||||||
target_modules=[
|
target_modules=[
|
||||||
"q_proj", "k_proj", "v_proj", "o_proj",
|
"q_proj", "k_proj", "v_proj", "o_proj",
|
||||||
"gate_proj", "up_proj", "down_proj"
|
"gate_proj", "up_proj", "down_proj",
|
||||||
],
|
],
|
||||||
task_type="CAUSAL_LM",
|
task_type="CAUSAL_LM",
|
||||||
lora_dropout=0.05,
|
lora_dropout=0.05,
|
||||||
)
|
)
|
||||||
|
|
||||||
# Initialize trainer
|
|
||||||
trainer = GRPOTrainer(
|
trainer = GRPOTrainer(
|
||||||
model=model,
|
model=model,
|
||||||
processing_class=tokenizer,
|
processing_class=tokenizer,
|
||||||
|
|
@ -308,17 +274,14 @@ trainer = GRPOTrainer(
|
||||||
],
|
],
|
||||||
args=training_args,
|
args=training_args,
|
||||||
train_dataset=dataset,
|
train_dataset=dataset,
|
||||||
peft_config=peft_config, # Remove for full fine-tuning
|
peft_config=peft_config, # Remove for full fine-tuning
|
||||||
)
|
)
|
||||||
|
|
||||||
# Train
|
|
||||||
trainer.train()
|
trainer.train()
|
||||||
|
|
||||||
# Save
|
|
||||||
trainer.save_model("final_model")
|
trainer.save_model("final_model")
|
||||||
```
|
```
|
||||||
|
|
||||||
**Unsloth Setup (2-3x Faster)**
|
**Unsloth setup (2–3× faster)**
|
||||||
```python
|
```python
|
||||||
from unsloth import FastLanguageModel
|
from unsloth import FastLanguageModel
|
||||||
|
|
||||||
|
|
@ -339,28 +302,26 @@ model = FastLanguageModel.get_peft_model(
|
||||||
use_gradient_checkpointing="unsloth",
|
use_gradient_checkpointing="unsloth",
|
||||||
)
|
)
|
||||||
|
|
||||||
# Rest is identical to standard setup
|
# Rest is identical to the standard setup
|
||||||
trainer = GRPOTrainer(model=model, ...)
|
trainer = GRPOTrainer(model=model, ...)
|
||||||
trainer.train()
|
trainer.train()
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
## Critical training insights
|
||||||
|
|
||||||
## Critical Training Insights
|
### 1. Loss behavior (EXPECTED pattern)
|
||||||
|
- **Loss starts near 0 and INCREASES during training** — this is CORRECT
|
||||||
|
- Loss measures KL divergence from initial policy; the model is learning (diverging from original behavior to optimize rewards)
|
||||||
|
- **Monitor reward metrics, not loss, for progress**
|
||||||
|
|
||||||
### 1. Loss Behavior (EXPECTED PATTERN)
|
### 2. Reward tracking
|
||||||
- **Loss starts near 0 and INCREASES during training**
|
|
||||||
- This is CORRECT - loss measures KL divergence from initial policy
|
|
||||||
- Model is learning (diverging from original behavior to optimize rewards)
|
|
||||||
- Monitor reward metrics instead of loss for progress
|
|
||||||
|
|
||||||
### 2. Reward Tracking
|
|
||||||
Key metrics to watch:
|
Key metrics to watch:
|
||||||
- `reward`: Average across all completions
|
- `reward` — average across all completions
|
||||||
- `reward_std`: Diversity within groups (should remain > 0)
|
- `reward_std` — diversity within groups (should remain > 0)
|
||||||
- `kl`: KL divergence from reference (should grow moderately)
|
- `kl` — KL divergence from reference (should grow moderately)
|
||||||
|
|
||||||
**Healthy Training Pattern:**
|
**Healthy pattern:**
|
||||||
```
|
```
|
||||||
Step Reward Reward_Std KL
|
Step Reward Reward_Std KL
|
||||||
100 0.5 0.3 0.02
|
100 0.5 0.3 0.02
|
||||||
|
|
@ -369,12 +330,12 @@ Step Reward Reward_Std KL
|
||||||
400 1.5 0.15 0.12
|
400 1.5 0.15 0.12
|
||||||
```
|
```
|
||||||
|
|
||||||
**Warning Signs:**
|
**Warning signs:**
|
||||||
- Reward std → 0 (model collapsing to single response)
|
- `reward_std` → 0 (model collapsing to a single response)
|
||||||
- KL exploding (> 0.5) (diverging too much, reduce LR)
|
- `kl` exploding (> 0.5) — diverging too much, reduce LR
|
||||||
- Reward stuck (reward functions too harsh or model capacity issue)
|
- Reward stuck — reward functions too harsh or model capacity issue
|
||||||
|
|
||||||
### 3. Common Pitfalls and Solutions
|
### 3. Common pitfalls and solutions
|
||||||
|
|
||||||
| Problem | Symptom | Solution |
|
| Problem | Symptom | Solution |
|
||||||
|---------|---------|----------|
|
|---------|---------|----------|
|
||||||
|
|
@ -384,15 +345,14 @@ Step Reward Reward_Std KL
|
||||||
| **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
|
| **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
|
||||||
| **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
|
| **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
|
||||||
|
|
||||||
---
|
## Advanced patterns
|
||||||
|
|
||||||
## Advanced Patterns
|
### 1. Multi-stage training
|
||||||
|
|
||||||
### 1. Multi-Stage Training
|
|
||||||
For complex tasks, train in stages:
|
For complex tasks, train in stages:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# Stage 1: Format compliance (epochs=1)
|
# Stage 1: Format compliance
|
||||||
trainer_stage1 = GRPOTrainer(
|
trainer_stage1 = GRPOTrainer(
|
||||||
model=model,
|
model=model,
|
||||||
reward_funcs=[incremental_format_reward, format_reward],
|
reward_funcs=[incremental_format_reward, format_reward],
|
||||||
|
|
@ -400,7 +360,7 @@ trainer_stage1 = GRPOTrainer(
|
||||||
)
|
)
|
||||||
trainer_stage1.train()
|
trainer_stage1.train()
|
||||||
|
|
||||||
# Stage 2: Correctness (epochs=1)
|
# Stage 2: Correctness
|
||||||
trainer_stage2 = GRPOTrainer(
|
trainer_stage2 = GRPOTrainer(
|
||||||
model=model,
|
model=model,
|
||||||
reward_funcs=[format_reward, correctness_reward],
|
reward_funcs=[format_reward, correctness_reward],
|
||||||
|
|
@ -409,7 +369,8 @@ trainer_stage2 = GRPOTrainer(
|
||||||
trainer_stage2.train()
|
trainer_stage2.train()
|
||||||
```
|
```
|
||||||
|
|
||||||
### 2. Adaptive Reward Scaling
|
### 2. Adaptive reward scaling
|
||||||
|
|
||||||
```python
|
```python
|
||||||
class AdaptiveReward:
|
class AdaptiveReward:
|
||||||
def __init__(self, base_reward_func, initial_weight=1.0):
|
def __init__(self, base_reward_func, initial_weight=1.0):
|
||||||
|
|
@ -428,148 +389,116 @@ class AdaptiveReward:
|
||||||
self.weight *= 0.9
|
self.weight *= 0.9
|
||||||
```
|
```
|
||||||
|
|
||||||
### 3. Custom Dataset Integration
|
### 3. Custom dataset integration
|
||||||
|
|
||||||
```python
|
```python
|
||||||
def load_custom_knowledge_base(csv_path):
|
def load_custom_knowledge_base(csv_path):
|
||||||
"""Example: School communication platform docs."""
|
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
df = pd.read_csv(csv_path)
|
df = pd.read_csv(csv_path)
|
||||||
|
return Dataset.from_pandas(df).map(lambda x: {
|
||||||
dataset = Dataset.from_pandas(df).map(lambda x: {
|
|
||||||
'prompt': [
|
'prompt': [
|
||||||
{'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
|
{'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
|
||||||
{'role': 'user', 'content': x['question']}
|
{'role': 'user', 'content': x['question']}
|
||||||
],
|
],
|
||||||
'answer': x['expert_answer']
|
'answer': x['expert_answer']
|
||||||
})
|
})
|
||||||
return dataset
|
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
## Deployment and inference
|
||||||
|
|
||||||
## Deployment and Inference
|
### Save and merge LoRA
|
||||||
|
|
||||||
### Save and Merge LoRA
|
|
||||||
```python
|
```python
|
||||||
# Merge LoRA adapters into base model
|
|
||||||
if hasattr(trainer.model, 'merge_and_unload'):
|
if hasattr(trainer.model, 'merge_and_unload'):
|
||||||
merged_model = trainer.model.merge_and_unload()
|
merged_model = trainer.model.merge_and_unload()
|
||||||
merged_model.save_pretrained("production_model")
|
merged_model.save_pretrained("production_model")
|
||||||
tokenizer.save_pretrained("production_model")
|
tokenizer.save_pretrained("production_model")
|
||||||
```
|
```
|
||||||
|
|
||||||
### Inference Example
|
### Inference
|
||||||
```python
|
```python
|
||||||
from transformers import pipeline
|
from transformers import pipeline
|
||||||
|
|
||||||
generator = pipeline(
|
generator = pipeline("text-generation", model="production_model", tokenizer=tokenizer)
|
||||||
"text-generation",
|
|
||||||
model="production_model",
|
|
||||||
tokenizer=tokenizer
|
|
||||||
)
|
|
||||||
|
|
||||||
result = generator(
|
result = generator(
|
||||||
[
|
[
|
||||||
{'role': 'system', 'content': SYSTEM_PROMPT},
|
{'role': 'system', 'content': SYSTEM_PROMPT},
|
||||||
{'role': 'user', 'content': "What is 15 + 27?"}
|
{'role': 'user', 'content': "What is 15 + 27?"},
|
||||||
],
|
],
|
||||||
max_new_tokens=256,
|
max_new_tokens=256,
|
||||||
do_sample=True,
|
do_sample=True,
|
||||||
temperature=0.7,
|
temperature=0.7,
|
||||||
top_p=0.9
|
top_p=0.9,
|
||||||
)
|
)
|
||||||
print(result[0]['generated_text'])
|
print(result[0]['generated_text'])
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
## Best practices checklist
|
||||||
|
|
||||||
## Best Practices Checklist
|
**Before training:**
|
||||||
|
|
||||||
**Before Training:**
|
|
||||||
- [ ] Validate dataset format (prompts as List[Dict])
|
- [ ] Validate dataset format (prompts as List[Dict])
|
||||||
- [ ] Test reward functions on sample data
|
- [ ] Test reward functions on sample data
|
||||||
- [ ] Calculate expected max_prompt_length from data
|
- [ ] Calculate expected `max_prompt_length` from data
|
||||||
- [ ] Choose appropriate num_generations based on GPU memory
|
- [ ] Choose `num_generations` based on GPU memory
|
||||||
- [ ] Set up logging (wandb recommended)
|
- [ ] Set up logging (wandb recommended)
|
||||||
|
|
||||||
**During Training:**
|
**During training:**
|
||||||
- [ ] Monitor reward progression (should increase)
|
- [ ] Monitor reward progression (should increase)
|
||||||
- [ ] Check reward_std (should stay > 0.1)
|
- [ ] Check `reward_std` (should stay > 0.1)
|
||||||
- [ ] Watch for OOM errors (reduce batch size if needed)
|
- [ ] Watch for OOM errors (reduce batch size if needed)
|
||||||
- [ ] Sample generations every 50-100 steps
|
- [ ] Sample generations every 50–100 steps
|
||||||
- [ ] Validate format compliance on holdout set
|
- [ ] Validate format compliance on holdout set
|
||||||
|
|
||||||
**After Training:**
|
**After training:**
|
||||||
- [ ] Merge LoRA weights if using PEFT
|
- [ ] Merge LoRA weights if using PEFT
|
||||||
- [ ] Test on diverse prompts
|
- [ ] Test on diverse prompts
|
||||||
- [ ] Compare to baseline model
|
- [ ] Compare to baseline model
|
||||||
- [ ] Document reward weights and hyperparameters
|
- [ ] Document reward weights and hyperparameters
|
||||||
- [ ] Save reproducibility config
|
- [ ] Save reproducibility config
|
||||||
|
|
||||||
---
|
## Troubleshooting
|
||||||
|
|
||||||
## Troubleshooting Guide
|
### Debugging workflow
|
||||||
|
1. **Isolate reward functions** — test each independently
|
||||||
|
2. **Check data distribution** — ensure diversity in prompts
|
||||||
|
3. **Reduce complexity** — start with single reward, add gradually
|
||||||
|
4. **Monitor generations** — print samples every N steps
|
||||||
|
5. **Validate extraction logic** — ensure answer parsing works
|
||||||
|
|
||||||
### Debugging Workflow
|
### Quick debug reward
|
||||||
1. **Isolate reward functions** - Test each independently
|
|
||||||
2. **Check data distribution** - Ensure diversity in prompts
|
|
||||||
3. **Reduce complexity** - Start with single reward, add gradually
|
|
||||||
4. **Monitor generations** - Print samples every N steps
|
|
||||||
5. **Validate extraction logic** - Ensure answer parsing works
|
|
||||||
|
|
||||||
### Quick Fixes
|
|
||||||
```python
|
```python
|
||||||
# Debug reward function
|
|
||||||
def debug_reward(completions, **kwargs):
|
def debug_reward(completions, **kwargs):
|
||||||
responses = [comp[0]['content'] for comp in completions]
|
responses = [comp[0]['content'] for comp in completions]
|
||||||
for i, r in enumerate(responses[:2]): # Print first 2
|
for i, r in enumerate(responses[:2]):
|
||||||
print(f"Response {i}: {r[:200]}...")
|
print(f"Response {i}: {r[:200]}...")
|
||||||
return [1.0] * len(responses) # Dummy rewards
|
return [1.0] * len(responses)
|
||||||
|
|
||||||
# Test without training
|
# Test without training
|
||||||
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
|
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
|
||||||
trainer.generate_completions(dataset[:1]) # Generate without updating
|
trainer.generate_completions(dataset[:1])
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
## Template
|
||||||
|
|
||||||
## References and Resources
|
A production-ready training script lives at **`../templates/basic_grpo_training.py`**. It uses Qwen 2.5-1.5B-Instruct with LoRA and three reward functions (incremental format, strict format, correctness) on GSM8K. Copy and adapt:
|
||||||
|
1. `get_dataset()` — swap in your data loader
|
||||||
|
2. Reward functions — tune to your task
|
||||||
|
3. `SYSTEM_PROMPT` — match your output format
|
||||||
|
4. `GRPOConfig` — adjust hyperparameters for your GPU
|
||||||
|
|
||||||
|
## References and resources
|
||||||
|
|
||||||
**Official Documentation:**
|
|
||||||
- TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
|
- TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
|
||||||
- DeepSeek R1 Paper: https://arxiv.org/abs/2501.12948
|
- GRPO paper (DeepSeek): https://arxiv.org/abs/2402.03300
|
||||||
- Unsloth Docs: https://docs.unsloth.ai/
|
- DeepSeek R1 paper: https://arxiv.org/abs/2501.12948
|
||||||
|
- Open R1 implementation: https://github.com/huggingface/open-r1
|
||||||
**Example Repositories:**
|
- TRL examples: https://github.com/huggingface/trl/tree/main/examples
|
||||||
- Open R1 Implementation: https://github.com/huggingface/open-r1
|
- Unsloth (faster training): https://docs.unsloth.ai/
|
||||||
- TRL Examples: https://github.com/huggingface/trl/tree/main/examples
|
|
||||||
|
|
||||||
**Recommended Reading:**
|
|
||||||
- Progressive Disclosure Pattern for agent instructions
|
|
||||||
- Reward shaping in RL (Ng et al.)
|
|
||||||
- LoRA paper (Hu et al., 2021)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Usage Instructions for Agents
|
|
||||||
|
|
||||||
When this skill is loaded:
|
|
||||||
|
|
||||||
1. **Read this entire file** before implementing GRPO training
|
|
||||||
2. **Start with the simplest reward function** (e.g., length-based) to validate setup
|
|
||||||
3. **Use the templates** in `templates/` directory as starting points
|
|
||||||
4. **Reference examples** in `examples/` for task-specific implementations
|
|
||||||
5. **Follow the workflow** sequentially (don't skip steps)
|
|
||||||
6. **Debug incrementally** - add one reward function at a time
|
|
||||||
|
|
||||||
**Critical Reminders:**
|
|
||||||
- Always use multiple reward functions (3-5 is optimal)
|
|
||||||
- Monitor reward metrics, not loss
|
|
||||||
- Test reward functions before training
|
|
||||||
- Start small (num_generations=4), scale up gradually
|
|
||||||
- Save checkpoints frequently (every 100 steps)
|
|
||||||
|
|
||||||
This skill is designed for **expert-level implementation**. Beginners should start with supervised fine-tuning before attempting GRPO.
|
|
||||||
|
|
||||||
|
|
||||||
|
## Critical reminders
|
||||||
|
|
||||||
|
- **Loss goes UP during training** — this is normal (it's KL divergence)
|
||||||
|
- **Use 3–5 reward functions** — single rewards often fail
|
||||||
|
- **Test rewards before training** — debug each function independently
|
||||||
|
- **Monitor `reward_std`** — should stay > 0.1 (avoid mode collapse)
|
||||||
|
- **Start with `num_generations=4–8`** — scale up if GPU allows
|
||||||
|
|
@@ -98,6 +98,7 @@ The largest optional category — covers the full ML pipeline from data curation
 | **chroma** | Open-source embedding database. Store embeddings and metadata, perform vector and full-text search. Simple 4-function API for RAG and semantic search. |
 | **faiss** | Facebook's library for efficient similarity search and clustering of dense vectors. Supports billions of vectors, GPU acceleration, and various index types (Flat, IVF, HNSW). |
 | **flash-attention** | Optimize transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Supports PyTorch SDPA, flash-attn library, H100 FP8, and sliding window. |
+| **guidance** | Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance — Microsoft Research's constrained generation framework. |
 | **hermes-atropos-environments** | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, and evaluation. |
 | **huggingface-tokenizers** | Fast Rust-based tokenizers for research and production. Tokenizes 1GB in under 20 seconds. Supports BPE, WordPiece, and Unigram algorithms. |
 | **instructor** | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, and stream partial results. |

@@ -163,10 +163,8 @@ Model serving, quantization (GGUF/GPTQ), structured output, inference optimizati

 | Skill | Description | Path |
 |-------|-------------|------|
-| `gguf-quantization` | GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements. | `mlops/inference/gguf` |
-| `guidance` | Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework | `mlops/inference/guidance` |
 | `instructor` | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, parse complex JSON with type safety, and stream partial results with Instructor - battle-tested structured output library | `mlops/inference/instructor` |
-| `llama-cpp` | Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU. | `mlops/inference/llama-cpp` |
+| `llama-cpp` | Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization. | `mlops/inference/llama-cpp` |
 | `obliteratus` | Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, LEACE, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods, 28 analysis modules, 116 model presets ac… | `mlops/inference/obliteratus` |
 | `outlines` | Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library | `mlops/inference/outlines` |
 | `serving-llms-vllm` | Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), an… | `mlops/inference/vllm` |

@@ -202,7 +200,6 @@ Fine-tuning, RLHF/DPO/GRPO training, distributed training frameworks, and optimi
 | `axolotl` | Expert guidance for fine-tuning LLMs with Axolotl - YAML configs, 100+ models, LoRA/QLoRA, DPO/KTO/ORPO/GRPO, multimodal support | `mlops/training/axolotl` |
 | `distributed-llm-pretraining-torchtitan` | Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing. | `mlops/training/torchtitan` |
 | `fine-tuning-with-trl` | Fine-tune LLMs using reinforcement learning with TRL - SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when need RLHF, align model with preferences, or train from human feedback. Works with HuggingFace Tr… | `mlops/training/trl-fine-tuning` |
-| `grpo-rl-training` | Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training | `mlops/training/grpo-rl-training` |
 | `hermes-atropos-environments` | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/evaluate). Use when creating, reviewing, or f… | `mlops/training/hermes-atropos-environments` |
 | `huggingface-accelerate` | Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard. | `mlops/training/accelerate` |
 | `optimizing-attention-flash` | Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when training/running transformers with long sequences (>512 tokens), encountering GPU memory issues with attention, or need faster inference. Supports PyTorch native SDPA,… | `mlops/training/flash-attention` |