skills: consolidate mlops redundancies (gguf+llama-cpp, grpo+trl, guidance→optional) (#11965)

Three tightly-scoped built-in skill consolidations to reduce redundancy in
the available_skills listing injected into every system prompt:

1. gguf-quantization → llama-cpp (merged)
   GGUF is llama.cpp's format; two skills covered the same toolchain. The
   merged llama-cpp skill keeps the full K-quant table + imatrix workflow
   from gguf and the ROCm/benchmarks/supported-models sections from the
   original llama-cpp. All 5 reference files preserved.

2. grpo-rl-training → fine-tuning-with-trl (folded in)
   GRPO isn't a framework, it's a trainer inside TRL. Moved the 17KB
   deep-dive SKILL.md to references/grpo-training.md and the working
   template to templates/basic_grpo_training.py. TRL's GRPO workflow
   section now points to both. Atropos skill's related_skills updated.

3. guidance → optional-skills/mlops/
   Dropped from built-in. Outlines (still built-in) covers the same
   structured-generation ground with wider adoption. Listed in the
   optional catalog for users who specifically want Guidance.

Net: 3 fewer built-in skill lines in every system prompt, zero content
loss. Contributor authorship preserved via git rename detection.
Teknium 2026-04-17 21:36:40 -07:00 committed by GitHub
parent 598cba62ad
commit 73bccc94c7
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
15 changed files with 470 additions and 889 deletions


@@ -1,138 +1,271 @@
---
name: llama-cpp
description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
version: 1.0.0
description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA, plus GGUF model conversion and quantization (2-8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization.
version: 2.0.0
author: Orchestra Research
license: MIT
dependencies: [llama-cpp-python]
dependencies: [llama-cpp-python>=0.2.0]
metadata:
  hermes:
    tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
    tags: [llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression]
---
# llama.cpp
# llama.cpp + GGUF
Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.
## When to use llama.cpp
## When to use
**Use llama.cpp when:**
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Need simple deployment without Docker/Python
**Use llama.cpp + GGUF when:**
- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
- Using AMD (ROCm) or Intel GPUs where CUDA isn't available
- Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
- Need flexible quantization (2-8 bit with K-quants)
- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
- Want a single binary deploy without Docker/Python
**Use TensorRT-LLM instead when:**
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA
**Key advantages:**
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
- No Python runtime required (pure C/C++)
- K-quants + imatrix for better low-bit quality
- OpenAI-compatible server built in
- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)
**Use vLLM instead when:**
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention
**Use alternatives instead:**
- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput
- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed
- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments
- **bitsandbytes** — Simple HuggingFace transformers integration
- **HQQ** — Fast calibration-free quantization
## Quick start
### Installation
### Install
```bash
# macOS/Linux
# macOS / Linux (simplest)
brew install llama.cpp
# Or build from source
git clone https://github.com/ggerganov/llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
make
make # CPU
make GGML_METAL=1 # Apple Silicon
make GGML_CUDA=1 # NVIDIA CUDA
make LLAMA_HIP=1 # AMD ROCm
# With Metal (Apple Silicon)
make LLAMA_METAL=1
# With CUDA (NVIDIA)
make LLAMA_CUDA=1
# With ROCm (AMD)
make LLAMA_HIP=1
# Python bindings (optional)
pip install llama-cpp-python
# With CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# With Metal: CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
### Download model
### Download a pre-quantized GGUF
```bash
# Download from HuggingFace (GGUF format)
# TheBloke hosts most popular models pre-quantized
huggingface-cli download \
TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf \
--local-dir models/
```
# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
### Or convert a HuggingFace model to GGUF
```bash
# 1. Download HF model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./llama-3.1-8b \
--outfile llama-3.1-8b-f16.gguf \
--outtype f16
# 3. Quantize to Q4_K_M
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
```
### Run inference
```bash
# Simple chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
-p "Explain quantum computing" \
-n 256 # Max tokens
# One-shot prompt
./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256
# Interactive chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--interactive
./llama-cli -m model.Q4_K_M.gguf --interactive
# With GPU offload
./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
```
### Server mode
### Serve an OpenAI-compatible API
```bash
# Start OpenAI-compatible server
./llama-server \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
-m model.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 32 # Offload 32 layers to GPU
-ngl 35 \
-c 4096 \
--parallel 4 \
--cont-batching
```
# Client request
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b-chat",
"model": "local",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7,
"max_tokens": 100
}'
```
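The same request can be issued from Python without extra dependencies. This is a minimal sketch using only the standard library; it assumes the server started above is listening on `localhost:8080`, and the helper names `build_chat_request` and `chat` are illustrative, not part of llama.cpp.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080",
                       max_tokens=100, temperature=0.7):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),  # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hello!")` requires a running server; `build_chat_request` can be inspected offline.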
## Quantization formats
## Quantization formats (GGUF)
### GGUF format overview
### K-quant methods (recommended)
| Format | Bits | Size (7B) | Speed | Quality | Use Case |
|--------|------|-----------|-------|---------|----------|
| **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** |
| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |
### Choosing quantization
**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality).
**Legacy formats** (Q4_0/Q4_1/Q5_0/Q5_1) still exist, but prefer K-quants for a better quality/size ratio.
**IQ quantization** — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. Require `--imatrix`.
**Task-specific defaults:**
- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
- Code generation: Q5_K_M or Q6_K (higher precision helps)
- Technical / medical: Q6_K or Q8_0
- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
- Raspberry Pi / edge: Q2_K or Q3_K_S
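As a rough aid for the choices above, the sketch below estimates file size as `parameters × bits / 8` and picks the highest-quality K-quant that fits a memory budget. The bits-per-weight values are copied from the K-quant table; the 1 GB `overhead_gb` allowance for context and runtime buffers is an assumption, and real files differ somewhat because tensors are quantized with mixed types.

```python
# (name, approximate bits per weight) from the K-quant table, best quality first
K_QUANTS = [
    ("Q8_0", 8.0), ("Q6_K", 6.0), ("Q5_K_M", 5.5), ("Q5_K_S", 5.0),
    ("Q4_K_M", 4.5), ("Q4_K_S", 4.0), ("Q3_K_M", 3.3), ("Q3_K_S", 3.0),
    ("Q2_K", 2.5),
]


def estimate_size_gb(n_params, bits_per_weight):
    """Approximate GGUF size in GB: parameters x bits / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9


def pick_quant(n_params, budget_gb, overhead_gb=1.0):
    """Return the first (highest-quality) quant that fits, or None."""
    for name, bits in K_QUANTS:
        if estimate_size_gb(n_params, bits) + overhead_gb <= budget_gb:
            return name
    return None


print(pick_quant(7e9, budget_gb=6.0))    # 7B model with 6 GB available
print(pick_quant(70e9, budget_gb=32.0))  # 70B model with 32 GB available
```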
## Conversion workflows
### Basic: HF → GGUF → quantized
```bash
# General use (balanced)
Q4_K_M # 4-bit, medium quality
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50
```
# Maximum speed (more degradation)
Q2_K or Q3_K_M
### With importance matrix (imatrix) — better low-bit quality
# Maximum quality (slower)
Q6_K or Q8_0
`imatrix` gives a 10-20% perplexity improvement at Q4 and is essential at Q3 and below.
# Very large models (70B, 405B)
Q3_K_M or Q4_K_S # Lower bits to fit in memory
```bash
# 1. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
# 2. Prepare calibration data (diverse text, ~100MB is ideal)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
# Add more diverse text samples...
EOF
# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
-f calibration.txt \
--chunk 512 \
-o model.imatrix \
-ngl 35
# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
### Multi-quant batch
```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
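The `${QUANT,,}` lowercase expansion in the script above needs bash 4 or newer, which the stock macOS bash (3.2) does not provide. A portable alternative is to build the same `llama-quantize` command lines in Python. This is a sketch under the same file-naming assumptions as the script; `quantize_commands` is an illustrative helper, and `str.removesuffix` requires Python 3.9+.

```python
import shlex


def quantize_commands(model, imatrix, quants, binary="./llama-quantize"):
    """Return one argv list per quant type, mirroring the bash loop."""
    stem = model.removesuffix("-f16.gguf").removesuffix(".gguf")
    commands = []
    for quant in quants:
        output = f"{stem}-{quant.lower()}.gguf"
        # Options such as --imatrix go before the input and output files
        commands.append([binary, "--imatrix", imatrix, model, output, quant])
    return commands


for cmd in quantize_commands("llama-3.1-8b-f16.gguf",
                             "llama-3.1-8b.imatrix",
                             ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]):
    print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```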
### Quality testing (perplexity)
```bash
./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
# Baseline FP16: ~5.96 | Q4_K_M: ~6.06 (+1.7%) | Q2_K: ~6.87 (+15.3%)
```
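The percentages in the comment above are relative increases over the FP16 baseline. A small sketch of that arithmetic, using the perplexity figures quoted above (`ppl_increase_pct` is an illustrative name):

```python
def ppl_increase_pct(baseline, quantized):
    """Relative perplexity increase over the baseline, in percent."""
    return (quantized - baseline) / baseline * 100


for name, ppl in [("Q4_K_M", 6.06), ("Q2_K", 6.87)]:
    print(f"{name}: +{ppl_increase_pct(5.96, ppl):.1f}%")
# Q4_K_M: +1.7%
# Q2_K: +15.3%
```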
## Python bindings (llama-cpp-python)
### Basic generation
```python
from llama_cpp import Llama
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35, # 0 for CPU only, 99 to offload everything
n_threads=8,
)
output = llm(
"What is machine learning?",
max_tokens=256,
temperature=0.7,
stop=["</s>", "\n\n"],
)
print(output["choices"][0]["text"])
```
### Chat completion + streaming
```python
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
chat_format="llama-3", # Or "chatml", "mistral", etc.
)
# Non-streaming
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"},
],
max_tokens=256,
temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
# Streaming
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
```
### Embeddings
```python
llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
vec = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(vec)}")
```
## Hardware acceleration
@@ -140,122 +273,166 @@ Q3_K_M or Q4_K_S # Lower bits to fit in memory
### Apple Silicon (Metal)
```bash
# Build with Metal
make LLAMA_METAL=1
# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999 # Offload all layers
# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
make clean && make GGML_METAL=1
./llama-cli -m model.gguf -ngl 99 -p "Hello" # offload all layers
```
### NVIDIA GPUs (CUDA)
```bash
# Build with CUDA
make LLAMA_CUDA=1
# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
```python
llm = Llama(
model_path="model.gguf",
n_gpu_layers=99, # Offload everything
n_threads=1, # Metal handles parallelism
)
```
### AMD GPUs (ROCm)
Performance: M3 Max ~40-60 tok/s on Llama 2-7B Q4_K_M.
### NVIDIA (CUDA)
```bash
make clean && make GGML_CUDA=1
./llama-cli -m model.gguf -ngl 35 -p "Hello"
# Hybrid for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
# Multi-GPU split
./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
```
### AMD (ROCm)
```bash
# Build with ROCm
make LLAMA_HIP=1
# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
```
## Common patterns
### Batch processing
### CPU
```bash
# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
-m model.gguf \
--batch-size 512 \
-n 100
# Match PHYSICAL cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"
# BLAS acceleration (2-3× speedup)
make LLAMA_OPENBLAS=1
```
### Constrained generation
```bash
# JSON output with grammar
./llama-cli \
-m model.gguf \
-p "Generate a person: " \
--grammar-file grammars/json.gbnf
# Outputs valid JSON only
```
### Context size
```bash
# Increase context (default 512)
./llama-cli \
-m model.gguf \
-c 4096 # 4K context window
# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768 # 32K context
```python
llm = Llama(
model_path="model.gguf",
n_gpu_layers=0,
n_threads=8,
n_batch=512, # Larger batch = faster prompt processing
)
```
## Performance benchmarks
### CPU performance (Llama 2-7B Q4_K_M)
### CPU (Llama 2-7B Q4_K_M)
| CPU | Threads | Speed | Cost |
|-----|---------|-------|------|
| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
| CPU | Threads | Speed |
|-----|---------|-------|
| Apple M3 Max (Metal) | 16 | 50 tok/s |
| AMD Ryzen 9 7950X | 32 | 35 tok/s |
| Intel i9-13900K | 32 | 30 tok/s |
### GPU acceleration (Llama 2-7B Q4_K_M)
### GPU offloading on RTX 4090
| GPU | Speed | vs CPU | Cost |
|-----|-------|--------|------|
| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
| Layers GPU | Speed | VRAM |
|------------|-------|------|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |
## Supported models
**LLaMA family**:
- Llama 2 (7B, 13B, 70B)
- Llama 3 (8B, 70B, 405B)
- Code Llama
- **LLaMA family**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama
- **Mistral family**: Mistral 7B, Mixtral 8x7B/8x22B
- **Other**: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)
**Mistral family**:
- Mistral 7B
- Mixtral 8x7B, 8x22B
Find GGUF models: https://huggingface.co/models?library=gguf
**Other**:
- Falcon, BLOOM, GPT-J
- Phi-3, Gemma, Qwen
- LLaVA (vision), Whisper (audio)
## Ecosystem integrations
**Find models**: https://huggingface.co/models?library=gguf
### Ollama
```bash
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create mymodel -f Modelfile
ollama run mymodel "Hello!"
```
### LM Studio
1. Place GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload, start inference
### text-generation-webui
```bash
cp model-q4_k_m.gguf text-generation-webui/models/
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
### OpenAI client → llama-server
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256,
)
print(response.choices[0].message.content)
```
## Best practices
1. **Use K-quants** — Q4_K_M is the recommended default
2. **Use imatrix** for Q4 and below (calibration improves quality substantially)
3. **Offload as many layers as VRAM allows** — start high, reduce by 5 on OOM
4. **Thread count** — match physical cores, not logical
5. **Batch size** — increase `n_batch` (e.g. 512) for faster prompt processing
6. **Context** — start at 4096, grow only as needed (memory scales with ctx)
7. **Flash Attention** — add `--flash-attn` if your build supports it
## Common issues (quick fixes)
**Model loads slowly** — use `--mmap` for memory-mapped loading.
**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:
```python
Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35) # Q4_0 KV cache
```
**Garbage output** — wrong `chat_format`, temperature too high, or model file corrupted. Test with `temperature=0.1` and verify FP16 baseline works.
**Connection refused (server)** — bind to `--host 0.0.0.0`, check `lsof -i :8080`.
See `references/troubleshooting.md` for the full playbook.
## References
- **[Quantization Guide](references/quantization.md)** - GGUF formats, conversion, quality comparison
- **[Server Deployment](references/server.md)** - API endpoints, Docker, monitoring
- **[Optimization](references/optimization.md)** - Performance tuning, hybrid CPU+GPU
- **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
- **[quantization.md](references/quantization.md)** — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
- **[server.md](references/server.md)** — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
- **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
- **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging
## Resources
- **GitHub**: https://github.com/ggerganov/llama.cpp
- **Models**: https://huggingface.co/models?library=gguf
- **Discord**: https://discord.gg/llama-cpp
- **GitHub**: https://github.com/ggml-org/llama.cpp
- **Python bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized models**: https://huggingface.co/TheBloke
- **GGUF converter Space**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT


@@ -0,0 +1,504 @@
# GGUF Advanced Usage Guide
## Speculative Decoding
### Draft Model Approach
```bash
# Use smaller model as draft for faster generation
./llama-speculative \
-m large-model-q4_k_m.gguf \
-md draft-model-q4_k_m.gguf \
-p "Write a story about AI" \
-n 500 \
--draft 8 # Draft tokens before verification
```
### Self-Speculative Decoding
```bash
# Use same model with different context for speculation
./llama-cli -m model-q4_k_m.gguf \
--lookup-cache-static lookup.bin \
--lookup-cache-dynamic lookup-dynamic.bin \
-p "Hello world"
```
## Batched Inference
### Process Multiple Prompts
```python
from llama_cpp import Llama
llm = Llama(
model_path="model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
n_batch=512 # Larger batch for parallel processing
)
prompts = [
"What is Python?",
"Explain machine learning.",
"Describe neural networks."
]
# Prompts run one after another; each prompt is handled independently
for prompt in prompts:
    output = llm(prompt, max_tokens=100)
    print(f"Q: {prompt}")
    print(f"A: {output['choices'][0]['text']}\n")
```
### Server Batching
```bash
# Start server with batching
#   --parallel 4     concurrent requests
#   --cont-batching  continuous batching
./llama-server -m model-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 4096 \
  --parallel 4 \
  --cont-batching
```
## Custom Model Conversion
### Convert with Vocabulary Modifications
```python
# custom_convert.py
import subprocess
import sys

from transformers import AutoTokenizer

# Custom conversion with modified vocab
def convert_with_custom_vocab(model_path, output_path):
    # Load and modify tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # Add special tokens if needed
    special_tokens = {"additional_special_tokens": ["<|custom|>"]}
    tokenizer.add_special_tokens(special_tokens)
    tokenizer.save_pretrained(model_path)
    # Then run the standard converter as a subprocess; the script reads its
    # arguments from the command line rather than from a Python list
    subprocess.run(
        [sys.executable, "./llama.cpp/convert_hf_to_gguf.py",
         model_path, "--outfile", output_path],
        check=True,
    )
```
### Convert Specific Architecture
```bash
# For Mistral-style models
python convert_hf_to_gguf.py ./mistral-model \
--outfile mistral-f16.gguf \
--outtype f16
# For Qwen models
python convert_hf_to_gguf.py ./qwen-model \
--outfile qwen-f16.gguf \
--outtype f16
# For Phi models
python convert_hf_to_gguf.py ./phi-model \
--outfile phi-f16.gguf \
--outtype f16
```
## Advanced Quantization
### Mixed Quantization
```bash
# Quantize different layer types differently
# (options must come before the input file)
./llama-quantize --allow-requantize --leave-output-tensor \
  model-f16.gguf model-mixed.gguf Q4_K_M
```
### Quantization with Token Embeddings
```bash
# Keep embeddings at higher precision
# (options must come before the input file)
./llama-quantize --token-embedding-type f16 \
  model-f16.gguf model-q4.gguf Q4_K_M
```
### IQ Quantization (Importance-aware)
```bash
# Ultra-low bit quantization with importance
./llama-quantize --imatrix model.imatrix \
model-f16.gguf model-iq2_xxs.gguf IQ2_XXS
# Available IQ types: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS
```
## Memory Optimization
### Memory Mapping
```python
from llama_cpp import Llama
# Use memory mapping for large models
llm = Llama(
model_path="model-q4_k_m.gguf",
use_mmap=True, # Memory map the model
use_mlock=False, # Don't lock in RAM
n_gpu_layers=35
)
```
### Partial GPU Offload
```python
# Calculate layers to offload based on VRAM
import subprocess

from llama_cpp import Llama

def get_free_vram_gb():
    """Return free VRAM in GB on the first GPU reported by nvidia-smi."""
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.free', '--format=csv,nounits,noheader'],
        capture_output=True, text=True, check=True
    )
    # nvidia-smi prints one line (in MiB) per GPU; use the first
    return int(result.stdout.strip().splitlines()[0]) / 1024

# Estimate layers based on VRAM (rough: 0.5GB per layer for 7B Q4)
free_vram = get_free_vram_gb()
layers_to_offload = int(free_vram / 0.5)
llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_gpu_layers=min(layers_to_offload, 35)  # Cap at total layers
)
```
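The layer estimate can be separated from the `nvidia-smi` call so it is testable and works with several GPUs. This is a sketch only: the 0.5 GB-per-layer figure is the rough 7B Q4 rule of thumb quoted above, the 1 GB `reserve_gb` headroom for the KV cache and scratch buffers is an assumption, and summing memory across GPUs presumes the model is split between them.

```python
def parse_free_vram_mib(nvidia_smi_output):
    """Parse `--query-gpu=memory.free --format=csv,nounits,noheader` output.

    Returns one integer (MiB) per GPU line.
    """
    return [int(line) for line in nvidia_smi_output.splitlines() if line.strip()]


def layers_to_offload(free_mib, total_layers=35, gb_per_layer=0.5, reserve_gb=1.0):
    """Estimate how many layers fit after keeping reserve_gb of headroom."""
    usable_gb = max(sum(free_mib) / 1024 - reserve_gb, 0.0)
    return min(int(usable_gb / gb_per_layer), total_layers)


free = parse_free_vram_mib("8192\n")
print(layers_to_offload(free))  # 8 GB free, 1 GB reserved -> 14 layers
```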
### KV Cache Optimization
```python
from llama_cpp import Llama
# Optimize KV cache for long contexts
llm = Llama(
model_path="model-q4_k_m.gguf",
n_ctx=8192, # Large context
n_gpu_layers=35,
type_k=8, # Q8_0 for K cache (ggml type id 8; id 1 is F16)
type_v=8, # Q8_0 for V cache (8)
# Or use Q4_0 (2) for more compression
)
```
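To see what cache quantization buys, the KV cache size can be estimated as `2 × layers × context × KV heads × head dim × bytes per element`. The sketch below assumes Llama 2 7B shapes (32 layers, 32 KV heads, head dimension 128) and ignores any padding the runtime adds; `kv_cache_bytes` is an illustrative helper, not a llama-cpp-python API.

```python
# Bytes per element. Q8_0 and Q4_0 store blocks of 32 values plus a
# 2-byte fp16 scale: 34 bytes and 18 bytes per block respectively.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, cache_type="f16"):
    """Approximate KV cache size in bytes for one sequence."""
    # Factor 2: one K tensor and one V tensor per layer
    elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return int(elems * BYTES_PER_ELEM[cache_type])


for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(32, 8192, 32, 128, cache_type) / 1024**3
    print(f"{cache_type}: {gib:.3f} GiB at n_ctx=8192")
```

Because the size is linear in `n_ctx`, halving the context halves the cache regardless of the cache type.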
## Context Management
### Context Shifting
```python
from llama_cpp import Llama
llm = Llama(
model_path="model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35
)
# Handle long conversations by keeping only recent history
conversation = []
max_history = 10
def chat(user_message):
    conversation.append({"role": "user", "content": user_message})
    # Keep only recent history. Delete in place: assigning to `conversation`
    # here would make it a local name and raise UnboundLocalError.
    if len(conversation) > max_history * 2:
        del conversation[:-max_history * 2]
    response = llm.create_chat_completion(
        messages=conversation,
        max_tokens=256
    )
    assistant_message = response["choices"][0]["message"]["content"]
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message
```
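Trimming by message count also drops a system prompt if one sits at the start of the history. A possible variant, sketched below, keeps all system messages and only the most recent turns; `trim_history` is a hypothetical helper, and it assumes the message dictionaries use the same `role`/`content` shape as above.

```python
def trim_history(messages, max_turns=10):
    """Keep system messages plus the last max_turns user/assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Guard max_turns == 0: rest[-0:] would return the whole list
    recent = rest[-max_turns * 2:] if max_turns > 0 else []
    return system + recent


history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(30):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=10)
print(len(trimmed))  # 1 system message + 20 recent messages = 21
```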
### Save and Load State
```bash
# Save state to file
./llama-cli -m model.gguf \
-p "Once upon a time" \
--save-session session.bin \
-n 100
# Load and continue
./llama-cli -m model.gguf \
--load-session session.bin \
-p " and they lived" \
-n 100
```
## Grammar Constrained Generation
### JSON Output
```python
from llama_cpp import Llama, LlamaGrammar
# Define JSON grammar
json_grammar = LlamaGrammar.from_string('''
root ::= object
object ::= "{" ws pair ("," ws pair)* "}" ws
pair ::= string ":" ws value
value ::= string | number | object | array | "true" | "false" | "null"
array ::= "[" ws value ("," ws value)* "]" ws
string ::= "\\"" [^"\\\\]* "\\""
number ::= [0-9]+
ws ::= [ \\t\\n]*
''')
llm = Llama(model_path="model-q4_k_m.gguf", n_gpu_layers=35)
output = llm(
"Output a JSON object with name and age:",
grammar=json_grammar,
max_tokens=100
)
print(output["choices"][0]["text"])
```
### Custom Grammar
```python
# Grammar for specific format
answer_grammar = LlamaGrammar.from_string('''
root ::= "Answer: " letter "\\n" "Explanation: " explanation
letter ::= [A-D]
explanation ::= [a-zA-Z0-9 .,!?]+
''')
output = llm(
"Q: What is 2+2? A) 3 B) 4 C) 5 D) 6",
grammar=answer_grammar,
max_tokens=100
)
```
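Grammar-constrained output is straightforward to parse afterwards, since its shape is fixed. The sketch below mirrors the answer grammar above with a regular expression; `parse_answer` is an illustrative helper, and the check is still worth keeping because generation can stop early when `max_tokens` is reached.

```python
import re

# Same shape as the grammar: "Answer: " letter "\n" "Explanation: " explanation
ANSWER_RE = re.compile(r"Answer: ([A-D])\nExplanation: ([a-zA-Z0-9 .,!?]+)")


def parse_answer(text):
    """Return the letter and explanation, or None if the text does not match."""
    match = ANSWER_RE.fullmatch(text)
    if match is None:
        return None
    return {"letter": match.group(1), "explanation": match.group(2)}


print(parse_answer("Answer: B\nExplanation: 2 plus 2 equals 4."))
```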
## LoRA Integration
### Load LoRA Adapter
```bash
# Apply LoRA at runtime (scale defaults to 1.0)
./llama-cli -m base-model-q4_k_m.gguf \
  --lora lora-adapter.gguf \
  -p "Hello!"
```
### Multiple LoRA Adapters
```bash
# Stack multiple adapters, each with its own scale
./llama-cli -m base-model.gguf \
  --lora-scaled adapter1.gguf 0.5 \
  --lora-scaled adapter2.gguf 0.5 \
  -p "Hello!"
```
### Python LoRA Usage
```python
from llama_cpp import Llama
llm = Llama(
model_path="base-model-q4_k_m.gguf",
lora_path="lora-adapter.gguf",
lora_scale=1.0,
n_gpu_layers=35
)
```
## Embedding Generation
### Extract Embeddings
```python
from llama_cpp import Llama
llm = Llama(
model_path="model-q4_k_m.gguf",
embedding=True, # Enable embedding mode
n_gpu_layers=35
)
# Get embeddings
embeddings = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(embeddings)}")
```
### Batch Embeddings
```python
texts = [
"Machine learning is fascinating.",
"Deep learning uses neural networks.",
"Python is a programming language."
]
embeddings = [llm.embed(text) for text in texts]
# Calculate similarity
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
sim = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity: {sim:.4f}")
```
## Performance Tuning
### Benchmark Script
```python
import time
from llama_cpp import Llama
def benchmark(model_path, prompt, n_tokens=100, n_runs=5):
    llm = Llama(
        model_path=model_path,
        n_gpu_layers=35,
        n_ctx=2048,
        verbose=False
    )
    # Warmup
    llm(prompt, max_tokens=10)
    # Benchmark
    times = []
    generated = []
    for _ in range(n_runs):
        start = time.perf_counter()
        output = llm(prompt, max_tokens=n_tokens)
        times.append(time.perf_counter() - start)
        # Count tokens actually generated; the model may stop before n_tokens
        generated.append(output["usage"]["completion_tokens"])
    avg_time = sum(times) / len(times)
    tokens_per_sec = sum(generated) / sum(times)
    print(f"Model: {model_path}")
    print(f"Avg time: {avg_time:.2f}s")
    print(f"Tokens/sec: {tokens_per_sec:.1f}")
    return tokens_per_sec
# Compare quantizations
for quant in ["q4_k_m", "q5_k_m", "q8_0"]:
    benchmark(f"model-{quant}.gguf", "Explain quantum computing:", 100)
```
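A single average hides run-to-run variance. The sketch below summarizes per-run throughput with mean, standard deviation, and range; it assumes each run is recorded as a `(completion_tokens, elapsed_seconds)` pair, and `summarize_runs` is an illustrative helper.

```python
import statistics


def summarize_runs(runs):
    """Summarize throughput from (completion_tokens, elapsed_seconds) pairs."""
    rates = [tokens / elapsed for tokens, elapsed in runs if elapsed > 0]
    return {
        "mean_tok_s": statistics.mean(rates),
        # stdev needs at least two samples
        "stdev_tok_s": statistics.stdev(rates) if len(rates) > 1 else 0.0,
        "min_tok_s": min(rates),
        "max_tok_s": max(rates),
    }


print(summarize_runs([(100, 2.0), (100, 4.0)]))
```

A large standard deviation relative to the mean usually means the measurement is noisy (thermal throttling, background load) and more runs are needed before comparing quantizations.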
### Optimal Configuration Finder
```python
def find_optimal_config(model_path, target_vram_gb=8):
    """Find optimal n_gpu_layers and n_batch for target VRAM."""
    import gc
    import time
    from llama_cpp import Llama
    best_config = None
    best_speed = 0
    for n_gpu_layers in range(0, 50, 5):
        for n_batch in [128, 256, 512, 1024]:
            try:
                gc.collect()
                llm = Llama(
                    model_path=model_path,
                    n_gpu_layers=n_gpu_layers,
                    n_batch=n_batch,
                    n_ctx=2048,
                    verbose=False
                )
                # Quick benchmark
                start = time.time()
                llm("Hello", max_tokens=50)
                speed = 50 / (time.time() - start)
                if speed > best_speed:
                    best_speed = speed
                    best_config = {
                        "n_gpu_layers": n_gpu_layers,
                        "n_batch": n_batch,
                        "speed": speed
                    }
                del llm
                gc.collect()
            except Exception as e:
                # Most likely out of memory; skip larger batches for this layer count
                print(f"OOM at layers={n_gpu_layers}, batch={n_batch}: {e}")
                break
    return best_config
```
## Multi-GPU Setup
### Distribute Across GPUs
```bash
# Split model across multiple GPUs
./llama-cli -m large-model.gguf \
--tensor-split 0.5,0.5 \
-ngl 60 \
-p "Hello!"
```
### Python Multi-GPU
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
from llama_cpp import Llama
llm = Llama(
model_path="large-model-q4_k_m.gguf",
n_gpu_layers=60,
tensor_split=[0.5, 0.5] # Split evenly across 2 GPUs
)
```
## Custom Builds
### Build with All Optimizations
```bash
# Clean build with all CPU optimizations
make clean
LLAMA_OPENBLAS=1 LLAMA_BLAS_VENDOR=OpenBLAS make -j
# With CUDA and cuBLAS
make clean
GGML_CUDA=1 LLAMA_CUBLAS=1 make -j
# With specific CUDA architecture
GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_86 make -j
```
### CMake Build
```bash
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j
```


@@ -0,0 +1,442 @@
# GGUF Troubleshooting Guide
## Installation Issues
### Build Fails
**Error**: `make: *** No targets specified and no makefile found`
**Fix**:
```bash
# Ensure you're in llama.cpp directory
cd llama.cpp
make
```
**Error**: `fatal error: cuda_runtime.h: No such file or directory`
**Fix**:
```bash
# Install CUDA toolkit
# Ubuntu
sudo apt install nvidia-cuda-toolkit
# Or set CUDA path
export CUDA_PATH=/usr/local/cuda
export PATH=$CUDA_PATH/bin:$PATH
make GGML_CUDA=1
```
### Python Bindings Issues
**Error**: `ERROR: Failed building wheel for llama-cpp-python`
**Fix**:
```bash
# Install build dependencies
pip install cmake scikit-build-core
# For CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# For Metal (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
**Error**: `ImportError: libcudart.so.XX: cannot open shared object file`
**Fix**:
```bash
# Add CUDA libraries to path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Or reinstall with correct CUDA version
pip uninstall llama-cpp-python
CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```
## Conversion Issues
### Model Not Supported
**Error**: `KeyError: 'model.embed_tokens.weight'`
**Fix**:
```bash
# Check model architecture
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model').architectures)"
# Use appropriate conversion script
# For most models:
python convert_hf_to_gguf.py ./model --outfile model.gguf
# For older models, check if legacy script needed
```
### Vocabulary Mismatch
**Error**: `RuntimeError: Vocabulary size mismatch`
**Fix**:
```python
# Ensure tokenizer matches model
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForCausalLM.from_pretrained("./model")
print(f"Tokenizer vocab size: {len(tokenizer)}")
print(f"Model vocab size: {model.config.vocab_size}")
# If mismatch, resize embeddings before conversion
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained("./model-fixed")
```
### Out of Memory During Conversion
**Error**: `torch.cuda.OutOfMemoryError` during conversion
**Fix**:
```bash
# Use CPU for conversion
CUDA_VISIBLE_DEVICES="" python convert_hf_to_gguf.py ./model --outfile model.gguf
# Or spill tensors to a temp file to reduce peak memory
python convert_hf_to_gguf.py ./model --outfile model.gguf --outtype f16 --use-temp-file
```
## Quantization Issues
### Wrong Output File Size
**Problem**: Quantized file is larger than expected
**Check**:
```bash
# Verify quantization type
./llama-cli -m model.gguf --verbose
# Expected sizes for 7B model:
# Q4_K_M: ~4.1 GB
# Q5_K_M: ~4.8 GB
# Q8_0: ~7.2 GB
# F16: ~13.5 GB
```
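Before comparing sizes, it can help to confirm that the file is a complete GGUF file rather than a truncated download. The sketch below reads only the fixed header; it assumes GGUF version 2 or later in little-endian layout (4-byte magic, `uint32` version, `uint64` tensor count, `uint64` metadata count), and `read_gguf_header` is an illustrative helper rather than part of the llama.cpp tooling.

```python
import struct


def read_gguf_header(path):
    """Read the fixed 24-byte GGUF header and return its fields."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} is not a GGUF file (bad magic or truncated)")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

For example, `read_gguf_header("model.gguf")` raises `ValueError` on an HTML error page saved under a `.gguf` name, a common cause of unexpected file sizes.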
### Quantization Crashes
**Error**: `Segmentation fault` during quantization
**Fix**:
```bash
# Increase stack size
ulimit -s unlimited
# Or use fewer threads (thread count is the last positional argument)
./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M 4
```
### Poor Quality After Quantization
**Problem**: Model outputs gibberish after quantization
**Solutions**:
1. **Use importance matrix**:
```bash
# Generate imatrix with good calibration data
./llama-imatrix -m model-f16.gguf \
-f wiki_sample.txt \
--chunk 512 \
-o model.imatrix
# Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
2. **Try higher precision**:
```bash
# Use Q5_K_M or Q6_K instead of Q4
./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
```
3. **Check original model**:
```bash
# Test FP16 version first
./llama-cli -m model-f16.gguf -p "Hello, how are you?" -n 50
```
## Inference Issues
### Slow Generation
**Problem**: Generation is slower than expected
**Solutions**:
1. **Enable GPU offload**:
```bash
./llama-cli -m model.gguf -ngl 35 -p "Hello"
```
2. **Optimize batch size**:
```python
from llama_cpp import Llama
llm = Llama(
    model_path="model.gguf",
    n_batch=512,  # Increase for faster prompt processing
    n_gpu_layers=35
)
```
3. **Use appropriate threads**:
```bash
# Match physical cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"
```
4. **Enable Flash Attention** (if supported):
```bash
# Recent builds take a value (on|off|auto); older builds accept a bare --flash-attn
./llama-cli -m model.gguf -ngl 35 --flash-attn on -p "Hello"
```
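To confirm whether any of the changes above actually help, measure throughput before and after each one. This is a minimal sketch, assuming the completion dict carries a `usage.completion_tokens` count as llama-cpp-python's OpenAI-style responses do. `measure_tokens_per_second` is a hypothetical helper name, and the figure includes prompt-processing time.

```python
import time


def measure_tokens_per_second(generate, prompt, max_tokens=64):
    """Time one completion call and return generated tokens per second.

    `generate` is any callable with the llama-cpp-python call shape
    (for example a `Llama` instance) that returns a dict with a
    `usage` entry containing `completion_tokens`.
    """
    start = time.perf_counter()
    result = generate(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    n_generated = result["usage"]["completion_tokens"]
    return n_generated / elapsed if elapsed > 0 else float("inf")
```

With a loaded model this would be called as `measure_tokens_per_second(llm, "Hello")`. Run it a few times and compare the numbers, since the first call often includes warm-up cost.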
### Out of Memory
**Error**: `CUDA out of memory` or system freeze
**Solutions**:
1. **Reduce GPU layers**:
```python
# Start low and increase
llm = Llama(model_path="model.gguf", n_gpu_layers=10)
```
2. **Use smaller quantization**:
```bash
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```
3. **Reduce context length**:
```python
llm = Llama(
    model_path="model.gguf",
    n_ctx=2048,  # Reduce from 4096
    n_gpu_layers=35
)
```
4. **Quantize KV cache**:
```python
llm = Llama(
    model_path="model.gguf",
    type_k=2,  # GGML type 2 = Q4_0 for K cache
    type_v=2,  # GGML type 2 = Q4_0 for V cache
    flash_attn=True,  # A quantized V cache requires Flash Attention
    n_gpu_layers=35
)
```
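Reducing the context length and quantizing the KV cache both help because the cache grows linearly with context size and bytes per element. The sketch below estimates that footprint. The model shape (32 layers, 32 KV heads, head dimension 128) is an assumed Llama-2-7B-like configuration, and `kv_cache_bytes` is a hypothetical helper, not part of llama.cpp.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_element=2.0):
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element


GIB = 1024 ** 3
# Assumed Llama-2-7B-like shape; substitute your model's values.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

f16_4k = kv_cache_bytes(n_ctx=4096, **shape)
f16_2k = kv_cache_bytes(n_ctx=2048, **shape)
# Q4_0 stores 32 values in 18 bytes (16 bytes of 4-bit values + an fp16 scale).
q4_4k = kv_cache_bytes(n_ctx=4096, bytes_per_element=18 / 32, **shape)

print(f"F16 cache, 4096 ctx:  {f16_4k / GIB:.2f} GiB")
print(f"F16 cache, 2048 ctx:  {f16_2k / GIB:.2f} GiB")
print(f"Q4_0 cache, 4096 ctx: {q4_4k / GIB:.2f} GiB")
```

Models with grouped-query attention have fewer KV heads than attention heads, so their cache is proportionally smaller than this example suggests.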
### Garbage Output
**Problem**: Model outputs random characters or nonsense
**Diagnose**:
```python
# Check model loading
llm = Llama(model_path="model.gguf", verbose=True)
# Test with simple prompt
output = llm("1+1=", max_tokens=5, temperature=0)
print(output)
```
**Solutions**:
1. **Check model integrity**:
```bash
# Verify GGUF file
./llama-cli -m model.gguf --verbose 2>&1 | head -50
```
2. **Use correct chat format**:
```python
llm = Llama(
    model_path="model.gguf",
    chat_format="llama-3"  # Match your model: chatml, mistral-instruct, etc.
)
```
3. **Check temperature**:
```python
# Use a lower temperature for more deterministic output (0 is greedy)
output = llm("Hello", max_tokens=50, temperature=0.1)
```
### Token Issues
**Error**: `RuntimeError: unknown token` or encoding errors
**Fix**:
```python
# Replace characters that cannot be encoded as UTF-8 (e.g. lone surrogates)
raw_prompt = "Hello, world!"
prompt = raw_prompt.encode("utf-8", errors="replace").decode("utf-8")
output = llm(prompt, max_tokens=50)
```
## Server Issues
### Connection Refused
**Error**: `Connection refused` when accessing server
**Fix**:
```bash
# Bind to all interfaces (only needed when connecting from another machine)
./llama-server -m model.gguf --host 0.0.0.0 --port 8080
# Check if port is in use
lsof -i :8080
```
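Before changing bind addresses, it can help to confirm whether anything is accepting connections on the port at all. The following is a small standard-library sketch. `port_is_open` is a hypothetical helper, and host `127.0.0.1` with port `8080` simply mirror the command above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("server reachable:", port_is_open("127.0.0.1", 8080))
```

If this returns `True` locally but remote clients are still refused, the server is likely bound to `127.0.0.1` only, or a firewall is blocking the port.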
### Server Crashes Under Load
**Problem**: Server crashes with multiple concurrent requests
**Solutions**:
1. **Limit parallelism**:
```bash
./llama-server -m model.gguf \
--parallel 2 \
-c 4096 \
--cont-batching
```
2. **Add request timeout**:
```bash
./llama-server -m model.gguf --timeout 300
```
3. **Monitor memory**:
```bash
watch -n 1 nvidia-smi # For GPU
watch -n 1 free -h # For RAM
```
### API Compatibility Issues
**Problem**: OpenAI client not working with server
**Fix**:
```python
from openai import OpenAI
# Use correct base URL format
client = OpenAI(
    base_url="http://localhost:8080/v1",  # Include /v1
    api_key="not-needed"
)
# Use correct model name
response = client.chat.completions.create(
    model="local",  # Or the actual model name
    messages=[{"role": "user", "content": "Hello"}]
)
```
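The two usual culprits, a missing `/v1` suffix and a wrong model name, can be checked without the OpenAI client. This sketch assumes the server exposes the OpenAI-style `/v1/models` listing. `openai_base_url` and `list_model_ids` are hypothetical helper names.

```python
import json
import urllib.request


def openai_base_url(server_url: str) -> str:
    """Return server_url with exactly one trailing /v1 path segment."""
    url = server_url.rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url


def list_model_ids(server_url: str, timeout: float = 5.0) -> list:
    """Return the model ids reported by the server's /v1/models endpoint."""
    endpoint = openai_base_url(server_url) + "/models"
    with urllib.request.urlopen(endpoint, timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]
```

Calling `list_model_ids("http://localhost:8080")` against a running server should return the names that are valid for the `model` argument.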
## Apple Silicon Issues
### Metal Not Working
**Problem**: Metal acceleration not enabled
**Check**:
```bash
# Verify Metal support
./llama-cli -m model.gguf --verbose 2>&1 | grep -i metal
```
**Fix**:
```bash
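# Rebuild with Metal (CMake build; Metal is enabled by default on macOS)
rm -rf build
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release
# Python bindings (--no-cache-dir forces a fresh build instead of a cached wheel)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir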
```
### Incorrect Memory Usage on M1/M2
**Problem**: Model uses too much unified memory
**Fix**:
```python
# Offload all layers for Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload everything
    n_threads=1  # Metal handles parallelism
)
```
## Debugging
### Enable Verbose Output
```bash
# CLI verbose mode
./llama-cli -m model.gguf --verbose -p "Hello" -n 50
# Python verbose (run from the shell)
python -c 'from llama_cpp import Llama; Llama(model_path="model.gguf", verbose=True)'
```
### Check Model Metadata
```bash
# View GGUF metadata
./llama-cli -m model.gguf --verbose 2>&1 | head -100
```
### Validate GGUF File
```python
import struct

def validate_gguf(filepath):
    with open(filepath, 'rb') as f:
        magic = f.read(4)
        if magic != b'GGUF':
            print(f"Invalid magic: {magic}")
            return False
        version = struct.unpack('<I', f.read(4))[0]
        print(f"GGUF version: {version}")
        tensor_count = struct.unpack('<Q', f.read(8))[0]
        metadata_count = struct.unpack('<Q', f.read(8))[0]
        print(f"Tensors: {tensor_count}, Metadata: {metadata_count}")
        return True

validate_gguf("model.gguf")
```
## Getting Help
1. **GitHub Issues**: https://github.com/ggml-org/llama.cpp/issues
2. **Discussions**: https://github.com/ggml-org/llama.cpp/discussions
3. **Reddit**: r/LocalLLaMA
### Reporting Issues
Include:
- llama.cpp version/commit hash
- Build command used
- Model name and quantization
- Full error message/stack trace
- Hardware: CPU/GPU model, RAM, VRAM
- OS version
- Minimal reproduction steps