improve llama.cpp skill

2026-04-26 01:01:40 +00:00 · 2026-04-21 20:37:07 +02:00 · 2026-04-21 20:37:07 +02:00 · d6cf2cc058
commit d6cf2cc058
parent ce98e1ef11
4 changed files with 351 additions and 380 deletions
--- a/skills/mlops/inference/llama-cpp/SKILL.md
+++ b/skills/mlops/inference/llama-cpp/SKILL.md
@@ -1,438 +1,186 @@
 ---
 name: llama-cpp
-description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization.
-version: 2.0.0
+description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA. Covers GGUF quant selection, Hugging Face Hub model search with `apps=llama.cpp`, hardware-aware quant recommendations from `?local-app=llama.cpp`, extracting available `.gguf` files from the Hugging Face tree API, and building the right `llama-cli` or `llama-server` command directly from Hub URLs.
+version: 2.1.1
 author: Orchestra Research
 license: MIT
-dependencies: [llama-cpp-python>=0.2.0]
 metadata:
  hermes:
-    tags: [llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression]
+    tags: [llama.cpp, GGUF, Quantization, Hugging Face Hub, CPU Inference, Apple Silicon, Edge Deployment, AMD GPUs, Intel GPUs, NVIDIA, URL-first]
 ---

 # llama.cpp + GGUF

-Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.
+Use this skill for local GGUF inference, quant selection, or Hugging Face repo discovery for llama.cpp.

 ## When to use

-**Use llama.cpp + GGUF when:**
-- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
-- Using AMD (ROCm) or Intel GPUs where CUDA isn't available
-- Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
-- Need flexible quantization (2–8 bit with K-quants)
-- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
-- Want a single binary deploy without Docker/Python
+- Run local models on CPU, Apple Silicon, CUDA, ROCm, or Intel GPUs
+- Find the right GGUF for a specific Hugging Face repo
+- Build a `llama-server` or `llama-cli` command from the Hub
+- Search the Hub for models that already support llama.cpp
+- Enumerate available `.gguf` files and sizes for a repo
+- Decide between Q4/Q5/Q6/IQ variants for the user's RAM or VRAM

-**Key advantages:**
-- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
-- No Python runtime required (pure C/C++)
-- K-quants + imatrix for better low-bit quality
-- OpenAI-compatible server built in
-- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)
+## Model Discovery workflow

-**Use alternatives instead:**
-- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput
-- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed
-- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments
-- **bitsandbytes** — Simple HuggingFace transformers integration
-- **HQQ** — Fast calibration-free quantization
+Prefer URL workflows before asking for `hf`, Python, or custom scripts.
+
+1. Search for candidate repos on the Hub:
+   - Base: `https://huggingface.co/models?apps=llama.cpp&sort=trending`
+   - Add `search=<term>` for a model family
+   - Add `num_parameters=min:0,max:24B` or similar when the user has size constraints
+2. Open the repo with the llama.cpp local-app view:
+   - `https://huggingface.co/<repo>?local-app=llama.cpp`
+3. Treat the local-app snippet as the source of truth when it is visible:
+   - copy the exact `llama-server` or `llama-cli` command
+   - report the recommended quant exactly as HF shows it
+4. Read the same `?local-app=llama.cpp` URL as page text or HTML and extract the section under `Hardware compatibility`:
+   - prefer its exact quant labels and sizes over generic tables
+   - keep repo-specific labels such as `UD-Q4_K_M` or `IQ4_NL_XL`
+   - if that section is not visible in the fetched page source, say so and fall back to the tree API plus generic quant guidance
+5. Query the tree API to confirm what actually exists:
+   - `https://huggingface.co/api/models/<repo>/tree/main?recursive=true`
+   - keep entries where `type` is `file` and `path` ends with `.gguf`
+   - use `path` and `size` as the source of truth for filenames and byte sizes
+   - separate quantized checkpoints from `mmproj-*.gguf` projector files and `BF16/` shard files
+   - use `https://huggingface.co/<repo>/tree/main` only as a human fallback
+6. If the local-app snippet is not text-visible, reconstruct the command from the repo plus the chosen quant:
+   - shorthand quant selection: `llama-server -hf <repo>:<QUANT>`
+   - exact-file fallback: `llama-server --hf-repo <repo> --hf-file <filename.gguf>`
+7. Only suggest conversion from Transformers weights if the repo does not already expose GGUF files.
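
Steps 5 and 6 above can be sketched in Python roughly as follows. This is a minimal illustration, not part of the commit: the function names, the `example-org/example-model-GGUF` repo, and the sample entries (including their sizes) are hypothetical, and the entries are assumed to have the `type`, `path`, and `size` fields the workflow describes.

```python
from dataclasses import dataclass, field


@dataclass
class GgufListing:
    models: list = field(default_factory=list)       # quantized checkpoints
    projectors: list = field(default_factory=list)   # mmproj-*.gguf files
    bf16_shards: list = field(default_factory=list)  # files under BF16/


def classify_tree_entries(entries):
    """Keep .gguf files only and split them into the three groups above."""
    listing = GgufListing()
    for entry in entries:
        path = entry.get("path", "")
        if entry.get("type") != "file" or not path.endswith(".gguf"):
            continue
        name = path.rsplit("/", 1)[-1]
        if name.startswith("mmproj-"):
            listing.projectors.append(entry)
        elif path.startswith("BF16/"):
            listing.bf16_shards.append(entry)
        else:
            listing.models.append(entry)
    return listing


def server_command(repo, quant=None, filename=None):
    """Rebuild a llama-server command from a repo plus a quant label or exact file."""
    if filename:
        return f"llama-server --hf-repo {repo} --hf-file {filename}"
    if quant:
        return f"llama-server -hf {repo}:{quant}"
    raise ValueError("pass either quant or filename")


# Hypothetical tree API response; sizes are placeholders, not real byte counts.
SAMPLE_TREE = [
    {"type": "file", "path": "README.md", "size": 1},
    {"type": "directory", "path": "BF16", "size": 0},
    {"type": "file", "path": "BF16/example-model-BF16-00001-of-00002.gguf", "size": 3},
    {"type": "file", "path": "example-model-Q4_K_M.gguf", "size": 2},
    {"type": "file", "path": "mmproj-F16.gguf", "size": 1},
]

listing = classify_tree_entries(SAMPLE_TREE)
print([e["path"] for e in listing.models])
print(server_command("example-org/example-model-GGUF", quant="Q4_K_M"))
```

The exact-file form is the safer choice when the repo uses custom file naming, since it does not depend on the shorthand quant label resolving to a file.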

 ## Quick start

-### Install
+### Install llama.cpp

 ```bash
 # macOS / Linux (simplest)
 brew install llama.cpp
+```

-# Or build from source
+```bash
+winget install llama.cpp
+```
+
+```bash
 git clone https://github.com/ggml-org/llama.cpp
 cd llama.cpp
-make                        # CPU
-make GGML_METAL=1           # Apple Silicon
-make GGML_CUDA=1            # NVIDIA CUDA
-make LLAMA_HIP=1            # AMD ROCm
-
-# Python bindings (optional)
-pip install llama-cpp-python
-# With CUDA:   CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
-# With Metal:  CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
+cmake -B build
+cmake --build build --config Release
 ```

-### Download a pre-quantized GGUF
+### Run directly from the Hugging Face Hub

 ```bash
-# TheBloke hosts most popular models pre-quantized
-huggingface-cli download \
-    TheBloke/Llama-2-7B-Chat-GGUF \
-    llama-2-7b-chat.Q4_K_M.gguf \
-    --local-dir models/
+llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
 ```

-### Or convert a HuggingFace model to GGUF
-
 ```bash
-# 1. Download HF model
-huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
-
-# 2. Convert to FP16 GGUF
-python convert_hf_to_gguf.py ./llama-3.1-8b \
-    --outfile llama-3.1-8b-f16.gguf \
-    --outtype f16
-
-# 3. Quantize to Q4_K_M
-./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
+llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
 ```

-### Run inference
+### Run an exact GGUF file from the Hub
+
+Use this when the tree API shows custom file naming or the exact HF snippet is missing.

 ```bash
-# One-shot prompt
-./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256
-
-# Interactive chat
-./llama-cli -m model.Q4_K_M.gguf --interactive
-
-# With GPU offload
-./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
+llama-server \
+    --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf \
+    --hf-file Phi-3-mini-4k-instruct-q4.gguf \
+    -c 4096
 ```

-### Serve an OpenAI-compatible API
-
-```bash
-./llama-server \
-    -m model.Q4_K_M.gguf \
-    --host 0.0.0.0 \
-    --port 8080 \
-    -ngl 35 \
-    -c 4096 \
-    --parallel 4 \
-    --cont-batching
-```
+### OpenAI-compatible server check

 ```bash
 curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer no-key" \
  -d '{
-    "model": "local",
-    "messages": [{"role": "user", "content": "Hello!"}],
-    "temperature": 0.7,
-    "max_tokens": 100
+    "messages": [
+      {"role": "user", "content": "Write a limerick about Python exceptions"}
+    ]
  }'
 ```

-## Quantization formats (GGUF)
+## Choosing a quant

-### K-quant methods (recommended)
+Use the Hub page first, generic heuristics second.

-| Type | Bits | Size (7B) | Quality | Use Case |
-|------|------|-----------|---------|----------|
-| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
-| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
-| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
-| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
-| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** |
-| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
-| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
-| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
-| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |
+- Prefer the exact quant that HF marks as compatible for the user's hardware profile.
+- For general chat, start with `Q4_K_M`.
+- For code or technical work, prefer `Q5_K_M` or `Q6_K` if memory allows.
+- For very tight RAM budgets, consider `Q3_K_M`, `IQ` variants, or `Q2` variants only if the user explicitly prioritizes fit over quality.
+- For multimodal repos, mention `mmproj-*.gguf` separately. The projector is not the main model file.
+- Do not normalize repo-native labels. If the page says `UD-Q4_K_M`, report `UD-Q4_K_M`.

-**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality).
+## Extracting available GGUFs from a repo

-**Legacy (Q4_0/Q4_1/Q5_0/Q5_1) exist** but always prefer K-quants for better quality/size ratio.
+When the user asks what GGUFs exist, return:

-**IQ quantization** — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. Require `--imatrix`.
+- filename
+- file size
+- quant label
+- whether it is a main model or an auxiliary projector

-**Task-specific defaults:**
-- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
-- Code generation: Q5_K_M or Q6_K (higher precision helps)
-- Technical / medical: Q6_K or Q8_0
-- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
-- Raspberry Pi / edge: Q2_K or Q3_K_S
+Ignore unless requested:

-## Conversion workflows
+- README
+- BF16 shard files
+- imatrix blobs or calibration artifacts

-### Basic: HF → GGUF → quantized
+Use the tree API for this step:

-```bash
-python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
-./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
-./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50
+- `https://huggingface.co/api/models/<repo>/tree/main?recursive=true`
+
+For a repo like `unsloth/Qwen3.6-35B-A3B-GGUF`, the local-app page can show quant chips such as `UD-Q4_K_M`, `UD-Q5_K_M`, `UD-Q6_K`, and `Q8_0`, while the tree API exposes exact file paths such as `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf` and `Qwen3.6-35B-A3B-Q8_0.gguf` with byte sizes. Use the tree API to turn a quant label into an exact filename.
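
One way to turn a quant label into an exact filename is a suffix match over the tree entries. The sketch below is an assumption about how that lookup could work, not something the skill prescribes: `files_for_quant` is a hypothetical helper, the two paths are the ones named above, and the sizes are placeholders rather than real byte counts.

```python
def files_for_quant(entries, label):
    """Return file entries whose name ends with '-<label>.gguf' or '.<label>.gguf'.

    Caveat: a bare label such as 'Q4_K_M' also matches a file ending in
    '-UD-Q4_K_M.gguf', so pass the repo-native label exactly as HF shows it.
    Sharded files ('-00001-of-00002.gguf') are not matched by this sketch.
    """
    matches = []
    for entry in entries:
        if entry.get("type") != "file":
            continue
        name = entry["path"].rsplit("/", 1)[-1]
        if any(name.endswith(f"{sep}{label}.gguf") for sep in ("-", ".")):
            matches.append(entry)
    return matches


# Paths taken from the text above; sizes are placeholders.
SAMPLE_ENTRIES = [
    {"type": "file", "path": "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf", "size": 2},
    {"type": "file", "path": "Qwen3.6-35B-A3B-Q8_0.gguf", "size": 3},
]

for label in ("UD-Q4_K_M", "Q8_0", "UD-Q5_K_M"):
    hits = files_for_quant(SAMPLE_ENTRIES, label)
    print(label, "->", [e["path"] for e in hits] or "not in sample")
```

Returning a list instead of a single entry leaves ambiguous matches for the caller to resolve.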
+
+## Search patterns
+
+Use these URL shapes directly:
+
+```text
+https://huggingface.co/models?apps=llama.cpp&sort=trending
+https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
+https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
+https://huggingface.co/<repo>?local-app=llama.cpp
+https://huggingface.co/api/models/<repo>/tree/main?recursive=true
+https://huggingface.co/<repo>/tree/main
 ```
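
If these URLs need to be built programmatically, a small helper along these lines might do. It is a sketch under assumptions: the function names are hypothetical, and `safe=":,"` is used so that `num_parameters=min:0,max:24B` stays in the literal form shown above instead of being percent-encoded.

```python
from urllib.parse import urlencode

HUB = "https://huggingface.co"


def search_url(term=None, max_params=None, sort="trending"):
    """Build a Hub model search URL restricted to llama.cpp-compatible repos."""
    query = {}
    if term:
        query["search"] = term
    query["apps"] = "llama.cpp"
    if max_params:
        query["num_parameters"] = f"min:0,max:{max_params}"
    query["sort"] = sort
    # Keep ':' and ',' unescaped to match the URL shapes listed above.
    return f"{HUB}/models?{urlencode(query, safe=':,')}"


def local_app_url(repo):
    """URL of the repo page with the llama.cpp local-app view open."""
    return f"{HUB}/{repo}?local-app=llama.cpp"


def tree_api_url(repo):
    """URL of the recursive tree API listing for the repo's main branch."""
    return f"{HUB}/api/models/{repo}/tree/main?recursive=true"


print(search_url("qwen", max_params="24B"))
print(local_app_url("unsloth/Qwen3.6-35B-A3B-GGUF"))
print(tree_api_url("unsloth/Qwen3.6-35B-A3B-GGUF"))
```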

-### With importance matrix (imatrix) — better low-bit quality
+## Output format

-`imatrix` gives 10–20% perplexity improvement at Q4, essential at Q3 and below.
+When answering discovery requests, prefer a compact structured result like:

-```bash
-# 1. Convert to FP16 GGUF
-python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
-
-# 2. Prepare calibration data (diverse text, ~100MB is ideal)
-cat > calibration.txt << 'EOF'
-The quick brown fox jumps over the lazy dog.
-Machine learning is a subset of artificial intelligence.
-# Add more diverse text samples...
-EOF
-
-# 3. Generate importance matrix
-./llama-imatrix -m model-f16.gguf \
-    -f calibration.txt \
-    --chunk 512 \
-    -o model.imatrix \
-    -ngl 35
-
-# 4. Quantize with imatrix
-./llama-quantize --imatrix model.imatrix \
-    model-f16.gguf model-q4_k_m.gguf Q4_K_M
+```text
+Repo: <repo>
+Recommended quant from HF: <label> (<size>)
+llama-server: <command>
+Other GGUFs:
+- <filename> - <size>
+- <filename> - <size>
+Source URLs:
+- <local-app URL>
+- <tree API URL>
 ```
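
The template above could be rendered from collected data with something like the following. This is only a sketch: `human_size` and `format_report` are hypothetical names, sizes are shown in decimal units (1 GB = 10^9 bytes) as an assumption, and the repo, filenames, and byte counts in the example call are placeholders.

```python
def human_size(num_bytes):
    """Format a byte count using decimal units (assumption: 1 GB = 10**9 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1000:
            return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} TB"


def format_report(repo, label, size, command, others, urls):
    """Render the compact discovery result; `others` is a list of (filename, bytes)."""
    lines = [
        f"Repo: {repo}",
        f"Recommended quant from HF: {label} ({human_size(size)})",
        f"llama-server: {command}",
        "Other GGUFs:",
    ]
    lines += [f"- {name} - {human_size(nbytes)}" for name, nbytes in others]
    lines.append("Source URLs:")
    lines += [f"- {url}" for url in urls]
    return "\n".join(lines)


# Placeholder values for illustration only.
report = format_report(
    repo="example-org/example-model-GGUF",
    label="Q4_K_M",
    size=4_100_000_000,
    command="llama-server -hf example-org/example-model-GGUF:Q4_K_M",
    others=[("example-model-Q8_0.gguf", 7_200_000_000)],
    urls=["https://huggingface.co/example-org/example-model-GGUF?local-app=llama.cpp"],
)
print(report)
```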

-### Multi-quant batch
-
-```bash
-#!/bin/bash
-MODEL="llama-3.1-8b-f16.gguf"
-IMATRIX="llama-3.1-8b.imatrix"
-
-./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
-
-for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
-    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
-    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
-    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
-done
-```
-
-### Quality testing (perplexity)
-
-```bash
-./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
-# Baseline FP16: ~5.96  |  Q4_K_M: ~6.06 (+1.7%)  |  Q2_K: ~6.87 (+15.3%)
-```
-
-## Python bindings (llama-cpp-python)
-
-### Basic generation
-
-```python
-from llama_cpp import Llama
-
-llm = Llama(
-    model_path="./model-q4_k_m.gguf",
-    n_ctx=4096,
-    n_gpu_layers=35,     # 0 for CPU only, 99 to offload everything
-    n_threads=8,
-)
-
-output = llm(
-    "What is machine learning?",
-    max_tokens=256,
-    temperature=0.7,
-    stop=["</s>", "\n\n"],
-)
-print(output["choices"][0]["text"])
-```
-
-### Chat completion + streaming
-
-```python
-llm = Llama(
-    model_path="./model-q4_k_m.gguf",
-    n_ctx=4096,
-    n_gpu_layers=35,
-    chat_format="llama-3",    # Or "chatml", "mistral", etc.
-)
-
-# Non-streaming
-response = llm.create_chat_completion(
-    messages=[
-        {"role": "system", "content": "You are a helpful assistant."},
-        {"role": "user", "content": "What is Python?"},
-    ],
-    max_tokens=256,
-    temperature=0.7,
-)
-print(response["choices"][0]["message"]["content"])
-
-# Streaming
-for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
-    print(chunk["choices"][0]["text"], end="", flush=True)
-```
-
-### Embeddings
-
-```python
-llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
-vec = llm.embed("This is a test sentence.")
-print(f"Embedding dimension: {len(vec)}")
-```
-
-## Hardware acceleration
-
-### Apple Silicon (Metal)
-
-```bash
-make clean && make GGML_METAL=1
-./llama-cli -m model.gguf -ngl 99 -p "Hello"   # offload all layers
-```
-
-```python
-llm = Llama(
-    model_path="model.gguf",
-    n_gpu_layers=99,     # Offload everything
-    n_threads=1,         # Metal handles parallelism
-)
-```
-
-Performance: M3 Max ~40–60 tok/s on Llama 2-7B Q4_K_M.
-
-### NVIDIA (CUDA)
-
-```bash
-make clean && make GGML_CUDA=1
-./llama-cli -m model.gguf -ngl 35 -p "Hello"
-
-# Hybrid for large models
-./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20   # GPU: 20 layers, CPU: rest
-
-# Multi-GPU split
-./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
-```
-
-### AMD (ROCm)
-
-```bash
-make LLAMA_HIP=1
-./llama-cli -m model.gguf -ngl 999
-```
-
-### CPU
-
-```bash
-# Match PHYSICAL cores, not logical
-./llama-cli -m model.gguf -t 8 -p "Hello"
-
-# BLAS acceleration (2–3× speedup)
-make LLAMA_OPENBLAS=1
-```
-
-```python
-llm = Llama(
-    model_path="model.gguf",
-    n_gpu_layers=0,
-    n_threads=8,
-    n_batch=512,         # Larger batch = faster prompt processing
-)
-```
-
-## Performance benchmarks
-
-### CPU (Llama 2-7B Q4_K_M)
-
-| CPU | Threads | Speed |
-|-----|---------|-------|
-| Apple M3 Max (Metal) | 16 | 50 tok/s |
-| AMD Ryzen 9 7950X | 32 | 35 tok/s |
-| Intel i9-13900K | 32 | 30 tok/s |
-
-### GPU offloading on RTX 4090
-
-| Layers GPU | Speed | VRAM |
-|------------|-------|------|
-| 0 (CPU only) | 30 tok/s | 0 GB |
-| 20 (hybrid) | 80 tok/s | 8 GB |
-| 35 (all) | 120 tok/s | 12 GB |
-
-## Supported models
-
-- **LLaMA family**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama
-- **Mistral family**: Mistral 7B, Mixtral 8x7B/8x22B
-- **Other**: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)
-
-Find GGUF models: https://huggingface.co/models?library=gguf
-
-## Ecosystem integrations
-
-### Ollama
-
-```bash
-cat > Modelfile << 'EOF'
-FROM ./model-q4_k_m.gguf
-TEMPLATE """{{ .System }}
-{{ .Prompt }}"""
-PARAMETER temperature 0.7
-PARAMETER num_ctx 4096
-EOF
-
-ollama create mymodel -f Modelfile
-ollama run mymodel "Hello!"
-```
-
-### LM Studio
-
-1. Place GGUF file in `~/.cache/lm-studio/models/`
-2. Open LM Studio and select the model
-3. Configure context length and GPU offload, start inference
-
-### text-generation-webui
-
-```bash
-cp model-q4_k_m.gguf text-generation-webui/models/
-python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
-```
-
-### OpenAI client → llama-server
-
-```python
-from openai import OpenAI
-
-client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
-response = client.chat.completions.create(
-    model="local-model",
-    messages=[{"role": "user", "content": "Hello!"}],
-    max_tokens=256,
-)
-print(response.choices[0].message.content)
-```
-
-## Best practices
-
-1. **Use K-quants** — Q4_K_M is the recommended default
-2. **Use imatrix** for Q4 and below (calibration improves quality substantially)
-3. **Offload as many layers as VRAM allows** — start high, reduce by 5 on OOM
-4. **Thread count** — match physical cores, not logical
-5. **Batch size** — increase `n_batch` (e.g. 512) for faster prompt processing
-6. **Context** — start at 4096, grow only as needed (memory scales with ctx)
-7. **Flash Attention** — add `--flash-attn` if your build supports it
-
-## Common issues (quick fixes)
-
-**Model loads slowly** — use `--mmap` for memory-mapped loading.
-
-**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:
-```python
-Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35)  # Q4_0 KV cache
-```
-
-**Garbage output** — wrong `chat_format`, temperature too high, or model file corrupted. Test with `temperature=0.1` and verify FP16 baseline works.
-
-**Connection refused (server)** — bind to `--host 0.0.0.0`, check `lsof -i :8080`.
-
-See `references/troubleshooting.md` for the full playbook.
-
 ## References

+- **[hub-discovery.md](references/hub-discovery.md)** - URL-only Hugging Face workflows, search patterns, GGUF extraction, and command reconstruction
 - **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
-- **[quantization.md](references/quantization.md)** — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
-- **[server.md](references/server.md)** — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
+- **[quantization.md](references/quantization.md)** — quant quality tradeoffs, when to use Q4/Q5/Q6/IQ, model size scaling, imatrix
+- **[server.md](references/server.md)** — direct-from-Hub server launch, OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
 - **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
 - **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging

 ## Resources

 - **GitHub**: https://github.com/ggml-org/llama.cpp
-- **Python bindings**: https://github.com/abetlen/llama-cpp-python
-- **Pre-quantized models**: https://huggingface.co/TheBloke
-- **GGUF converter Space**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
+- **Hugging Face GGUF + llama.cpp docs**: https://huggingface.co/docs/hub/gguf-llamacpp
+- **Hugging Face Local Apps docs**: https://huggingface.co/docs/hub/main/local-apps
+- **Hugging Face Local Agents docs**: https://huggingface.co/docs/hub/agents-local
+- **Example local-app page**: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF?local-app=llama.cpp
+- **Example tree API**: https://huggingface.co/api/models/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main?recursive=true
+- **Example llama.cpp search**: https://huggingface.co/models?num_parameters=min:0,max:24B&apps=llama.cpp&sort=trending
 - **License**: MIT