hermes-agent/skills/mlops/inference/llama-cpp/SKILL.md

name: llama-cpp
description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2-8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization.
version: 2.0.0
author: Orchestra Research
license: MIT
dependencies: llama-cpp-python>=0.2.0
metadata:
  hermes:
    tags: llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression

llama.cpp + GGUF

Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.

When to use

Use llama.cpp + GGUF when:

  • Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
  • Using AMD (ROCm) or Intel GPUs where CUDA isn't available
  • Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
  • Need flexible quantization (2-8 bit with K-quants)
  • Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
  • Want a single binary deploy without Docker/Python

Key advantages:

  • Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
  • No Python runtime required (pure C/C++)
  • K-quants + imatrix for better low-bit quality
  • OpenAI-compatible server built in
  • Rich ecosystem (Ollama, LM Studio, llama-cpp-python)

Use alternatives instead:

  • vLLM — NVIDIA GPUs, PagedAttention, Python-first, max throughput
  • TensorRT-LLM — Production NVIDIA (A100/H100), maximum speed
  • AWQ/GPTQ — Calibrated quantization for NVIDIA-only deployments
  • bitsandbytes — Simple HuggingFace transformers integration
  • HQQ — Fast calibration-free quantization

Quick start

Install

# macOS / Linux (simplest)
brew install llama.cpp

# Or build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
make                        # CPU
make GGML_METAL=1           # Apple Silicon
make GGML_CUDA=1            # NVIDIA CUDA
make LLAMA_HIP=1            # AMD ROCm

# Python bindings (optional)
pip install llama-cpp-python
# With CUDA:   CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# With Metal:  CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Download a pre-quantized GGUF

# TheBloke hosts most popular models pre-quantized
huggingface-cli download \
    TheBloke/Llama-2-7B-Chat-GGUF \
    llama-2-7b-chat.Q4_K_M.gguf \
    --local-dir models/

Or convert a HuggingFace model to GGUF

# 1. Download HF model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize to Q4_K_M
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

Run inference

# One-shot prompt
./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256

# Interactive chat
./llama-cli -m model.Q4_K_M.gguf --interactive

# With GPU offload
./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"

Serve an OpenAI-compatible API

./llama-server \
    -m model.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096 \
    --parallel 4 \
    --cont-batching
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Quantization formats (GGUF)

| Type   | Bits | Size (7B) | Quality   | Use case                             |
|--------|------|-----------|-----------|--------------------------------------|
| Q2_K   | 2.5  | ~2.8 GB   | Low       | Extreme compression (testing only)   |
| Q3_K_S | 3.0  | ~3.0 GB   | Low-Med   | Memory constrained                   |
| Q3_K_M | 3.3  | ~3.3 GB   | Medium    | Fits small devices                   |
| Q4_K_S | 4.0  | ~3.8 GB   | Med-High  | Speed critical                       |
| Q4_K_M | 4.5  | ~4.1 GB   | High      | Recommended default                  |
| Q5_K_S | 5.0  | ~4.6 GB   | High      | Quality focused                      |
| Q5_K_M | 5.5  | ~4.8 GB   | Very High | High quality                         |
| Q6_K   | 6.0  | ~5.5 GB   | Excellent | Near-original                        |
| Q8_0   | 8.0  | ~7.2 GB   | Best      | Maximum quality, minimal degradation |

Variant suffixes: _S (small: faster, lower quality), _M (medium: balanced), _L (large: better quality).

Legacy formats (Q4_0/Q4_1/Q5_0/Q5_1) still exist, but prefer K-quants for their better quality-to-size ratio.

IQ quantization — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. These require an importance matrix (--imatrix) at quantization time.

Task-specific defaults:

  • General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
  • Code generation: Q5_K_M or Q6_K (higher precision helps)
  • Technical / medical: Q6_K or Q8_0
  • Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
  • Raspberry Pi / edge: Q2_K or Q3_K_S
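The sizes in the table follow directly from parameter count and bits per weight. A minimal sketch of the arithmetic (the `gguf_size_gb` helper and the 5% overhead factor are assumptions for illustration; the bits-per-weight values are the approximate effective averages from the table above):

```python
# Approximate effective bits per weight for common GGUF quant types.
BPW = {"Q2_K": 2.5, "Q3_K_M": 3.3, "Q4_K_M": 4.5, "Q5_K_M": 5.5,
       "Q6_K": 6.0, "Q8_0": 8.0, "F16": 16.0}

def gguf_size_gb(n_params: float, quant: str, overhead: float = 1.05) -> float:
    """Rough on-disk size in GB: params * bits-per-weight / 8, plus a small
    overhead for metadata and tensors kept at higher precision."""
    return n_params * BPW[quant] / 8 / 1e9 * overhead

for q in ("Q4_K_M", "Q8_0"):
    print(f"7B @ {q}: ~{gguf_size_gb(7e9, q):.1f} GB")
```

This reproduces the ~4.1 GB figure for a 7B Q4_K_M and scales linearly, so a 70B Q4_K_M lands around 41 GB.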

Conversion workflows

Basic: HF → GGUF → quantized

python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50

With importance matrix (imatrix) — better low-bit quality

imatrix gives a 10-20% perplexity improvement at Q4 and is essential at Q3 and below.

# 1. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Prepare calibration data (diverse text, ~100MB is ideal)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
# Add more diverse text samples...
EOF

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf model-q4_k_m.gguf Q4_K_M

Multi-quant batch

#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done

Quality testing (perplexity)

./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
# Baseline FP16: ~5.96  |  Q4_K_M: ~6.06 (+1.7%)  |  Q2_K: ~6.87 (+15.3%)
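The percentages in the comment are relative perplexity increases over the FP16 baseline; a quick sketch of that calculation (the helper name is mine):

```python
def ppl_degradation(quant_ppl: float, baseline_ppl: float) -> float:
    """Relative perplexity increase vs. the FP16 baseline, in percent."""
    return (quant_ppl - baseline_ppl) / baseline_ppl * 100

# Using the example numbers above:
print(f"Q4_K_M: +{ppl_degradation(6.06, 5.96):.1f}%")   # +1.7%
print(f"Q2_K:   +{ppl_degradation(6.87, 5.96):.1f}%")   # +15.3%
```

A degradation under ~2% is generally imperceptible in chat use; double digits, as at Q2_K, is noticeable.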

Python bindings (llama-cpp-python)

Basic generation

from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,     # 0 for CPU only, 99 to offload everything
    n_threads=8,
)

output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"],
)
print(output["choices"][0]["text"])

Chat completion + streaming

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3",    # Or "chatml", "mistral", etc.
)

# Non-streaming
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])

# Streaming
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)

Embeddings

llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
vec = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(vec)}")
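To compare embeddings, cosine similarity is the usual metric. A dependency-free sketch (the helper and the sample vectors are illustrative; in practice you would pass two `llm.embed(...)` results):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Dummy vectors standing in for llm.embed() outputs:
v1, v2 = [1.0, 0.0, 1.0], [1.0, 0.5, 1.0]
print(f"similarity: {cosine_similarity(v1, v2):.3f}")
```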

Hardware acceleration

Apple Silicon (Metal)

make clean && make GGML_METAL=1
./llama-cli -m model.gguf -ngl 99 -p "Hello"   # offload all layers
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,     # Offload everything
    n_threads=1,         # Metal handles parallelism
)

Performance: M3 Max ~40-60 tok/s on Llama 2-7B Q4_K_M.

NVIDIA (CUDA)

make clean && make GGML_CUDA=1
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Hybrid for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20   # GPU: 20 layers, CPU: rest

# Multi-GPU split
./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60

AMD (ROCm)

make LLAMA_HIP=1
./llama-cli -m model.gguf -ngl 999

CPU

# Match PHYSICAL cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"

# BLAS acceleration (2-3× faster prompt processing)
make LLAMA_OPENBLAS=1
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,
    n_threads=8,
    n_batch=512,         # Larger batch = faster prompt processing
)

Performance benchmarks

CPU (Llama 2-7B Q4_K_M)

| CPU                  | Threads | Speed     |
|----------------------|---------|-----------|
| Apple M3 Max (Metal) | 16      | 50 tok/s  |
| AMD Ryzen 9 7950X    | 32      | 35 tok/s  |
| Intel i9-13900K      | 32      | 30 tok/s  |

GPU offloading on RTX 4090

| Layers on GPU | Speed     | VRAM  |
|---------------|-----------|-------|
| 0 (CPU only)  | 30 tok/s  | 0 GB  |
| 20 (hybrid)   | 80 tok/s  | 8 GB  |
| 35 (all)      | 120 tok/s | 12 GB |
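A rough way to pick -ngl is to divide usable VRAM by the approximate per-layer size. This heuristic is a sketch (the helper name, the equal-layer-size assumption, and the 1.5 GB headroom for KV cache and scratch buffers are mine):

```python
def max_offload_layers(vram_gb: float, model_size_gb: float,
                       n_layers: int, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, assuming roughly equal layer
    sizes and reserving headroom for KV cache and compute buffers."""
    per_layer_gb = model_size_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable / per_layer_gb))

# A 4.1 GB Q4_K_M 7B (32 layers) fits fully on an 8 GB GPU:
print(max_offload_layers(vram_gb=8, model_size_gb=4.1, n_layers=32))   # 32
# On a 4 GB GPU, only a partial offload fits:
print(max_offload_layers(vram_gb=4, model_size_gb=4.1, n_layers=32))   # 19
```

This matches the best-practice rule below: start high and step -ngl down on OOM, since real per-layer sizes are not perfectly uniform.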

Supported models

  • LLaMA family: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama
  • Mistral family: Mistral 7B, Mixtral 8x7B/8x22B
  • Other: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)

Find GGUF models: https://huggingface.co/models?library=gguf

Ecosystem integrations

Ollama

cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

ollama create mymodel -f Modelfile
ollama run mymodel "Hello!"

LM Studio

  1. Place GGUF file in ~/.cache/lm-studio/models/
  2. Open LM Studio and select the model
  3. Configure context length and GPU offload, start inference

text-generation-webui

cp model-q4_k_m.gguf text-generation-webui/models/
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35

OpenAI client → llama-server

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Best practices

  1. Use K-quants — Q4_K_M is the recommended default
  2. Use imatrix for Q4 and below (calibration improves quality substantially)
  3. Offload as many layers as VRAM allows — start high, reduce by 5 on OOM
  4. Thread count — match physical cores, not logical
  5. Batch size — increase n_batch (e.g. 512) for faster prompt processing
  6. Context — start at 4096, grow only as needed (memory scales with ctx)
  7. Flash Attention — add --flash-attn if your build supports it
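On the context point: fp16 KV cache memory grows linearly with n_ctx, at 2 tensors (K and V) per layer, each n_ctx × n_kv_heads × head_dim elements. A sketch of the arithmetic, assuming Llama 3.1 8B's GQA shape (32 layers, 8 KV heads, head dim 128):

```python
def kv_cache_gib(n_layers: int, n_ctx: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: K and V tensors per layer, fp16 (2 bytes) by default."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 2**30

# Llama 3.1 8B at 4096 context:
print(f"{kv_cache_gib(32, 4096, 8, 128):.2f} GiB")   # 0.50 GiB
```

Doubling n_ctx doubles this, which is why jumping straight to long contexts wastes memory you could spend on layer offload.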

Common issues (quick fixes)

Model loads slowly — memory-mapped loading (mmap) is on by default; make sure --no-mmap isn't set, or add --mlock to keep the model resident in RAM.

Out of memory (GPU) — reduce -ngl, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:

Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35)  # Q4_0 KV cache

Garbage output — wrong chat_format, temperature too high, or model file corrupted. Test with temperature=0.1 and verify FP16 baseline works.

Connection refused (server) — bind to --host 0.0.0.0, check lsof -i :8080.

See references/troubleshooting.md for the full playbook.

References

  • advanced-usage.md — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
  • quantization.md — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
  • server.md — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
  • optimization.md — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
  • troubleshooting.md — install/convert/quantize/inference/server issues, Apple Silicon, debugging

Resources