diff --git a/skills/mlops/inference/llama-cpp/SKILL.md b/skills/mlops/inference/llama-cpp/SKILL.md index 33fc37adb..26c9e1975 100644 --- a/skills/mlops/inference/llama-cpp/SKILL.md +++ b/skills/mlops/inference/llama-cpp/SKILL.md @@ -1,438 +1,186 @@ --- name: llama-cpp -description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization. -version: 2.0.0 +description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA. Covers GGUF quant selection, Hugging Face Hub model search with `apps=llama.cpp`, hardware-aware quant recommendations from `?local-app=llama.cpp`, extracting available `.gguf` files from the Hugging Face tree API, and building the right `llama-cli` or `llama-server` command directly from Hub URLs. +version: 2.1.1 author: Orchestra Research license: MIT -dependencies: [llama-cpp-python>=0.2.0] metadata: hermes: - tags: [llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression] + tags: [llama.cpp, GGUF, Quantization, Hugging Face Hub, CPU Inference, Apple Silicon, Edge Deployment, AMD GPUs, Intel GPUs, NVIDIA, URL-first] --- # llama.cpp + GGUF -Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving. +Use this skill for local GGUF inference, quant selection, or Hugging Face repo discovery for llama.cpp. 
## When to use -**Use llama.cpp + GGUF when:** -- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration -- Using AMD (ROCm) or Intel GPUs where CUDA isn't available -- Edge deployment (Raspberry Pi, embedded systems, consumer laptops) -- Need flexible quantization (2–8 bit with K-quants) -- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp) -- Want a single binary deploy without Docker/Python +- Run local models on CPU, Apple Silicon, CUDA, ROCm, or Intel GPUs +- Find the right GGUF for a specific Hugging Face repo +- Build a `llama-server` or `llama-cli` command from the Hub +- Search the Hub for models that already support llama.cpp +- Enumerate available `.gguf` files and sizes for a repo +- Decide between Q4/Q5/Q6/IQ variants for the user's RAM or VRAM -**Key advantages:** -- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel -- No Python runtime required (pure C/C++) -- K-quants + imatrix for better low-bit quality -- OpenAI-compatible server built in -- Rich ecosystem (Ollama, LM Studio, llama-cpp-python) +## Model discovery workflow -**Use alternatives instead:** -- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput -- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed -- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments -- **bitsandbytes** — Simple HuggingFace transformers integration -- **HQQ** — Fast calibration-free quantization +Prefer URL workflows before asking for `hf`, Python, or custom scripts. + +1. Search for candidate repos on the Hub: + - Base: `https://huggingface.co/models?apps=llama.cpp&sort=trending` + - Add `search=<query>` for a model family + - Add `num_parameters=min:0,max:24B` or similar when the user has size constraints +2. Open the repo with the llama.cpp local-app view: + - `https://huggingface.co/<repo>?local-app=llama.cpp` +3. 
Treat the local-app snippet as the source of truth when it is visible: + - copy the exact `llama-server` or `llama-cli` command + - report the recommended quant exactly as HF shows it +4. Read the same `?local-app=llama.cpp` URL as page text or HTML and extract the section under `Hardware compatibility`: + - prefer its exact quant labels and sizes over generic tables + - keep repo-specific labels such as `UD-Q4_K_M` or `IQ4_NL_XL` + - if that section is not visible in the fetched page source, say so and fall back to the tree API plus generic quant guidance +5. Query the tree API to confirm what actually exists: + - `https://huggingface.co/api/models/<repo>/tree/main?recursive=true` + - keep entries where `type` is `file` and `path` ends with `.gguf` + - use `path` and `size` as the source of truth for filenames and byte sizes + - separate quantized checkpoints from `mmproj-*.gguf` projector files and `BF16/` shard files + - use `https://huggingface.co/<repo>/tree/main` only as a human fallback +6. If the local-app snippet is not text-visible, reconstruct the command from the repo plus the chosen quant: + - shorthand quant selection: `llama-server -hf <repo>:<quant>` + - exact-file fallback: `llama-server --hf-repo <repo> --hf-file <file>` +7. Only suggest conversion from Transformers weights if the repo does not already expose GGUF files. 
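The tree-API filtering in step 5 can be sketched as follows. This is a minimal sketch over a hand-written sample of the JSON the API returns: the field names `type`, `path`, and `size` match the real response shape, but the filenames and sizes here are hypothetical.

```shell
# Hypothetical sample of a tree API response (real responses come from
# https://huggingface.co/api/models/<repo>/tree/main?recursive=true).
cat > /tmp/tree_sample.json << 'EOF'
[
  {"type": "file", "path": "Model-7B-Q4_K_M.gguf", "size": 4100000000},
  {"type": "file", "path": "mmproj-Model-7B-f16.gguf", "size": 800000000},
  {"type": "directory", "path": "BF16"},
  {"type": "file", "path": "README.md", "size": 1200}
]
EOF

# Keep only .gguf files; split main checkpoints from mmproj projectors.
python3 - << 'EOF' > /tmp/gguf_list.txt
import json

with open("/tmp/tree_sample.json") as f:
    entries = json.load(f)

for e in entries:
    # Directories and non-GGUF files (README, imatrix blobs, ...) are skipped.
    if e.get("type") == "file" and e["path"].endswith(".gguf"):
        kind = "projector" if e["path"].startswith("mmproj-") else "main"
        print(f"{e['path']}\t{e['size']}\t{kind}")
EOF
cat /tmp/gguf_list.txt
```

The same filter applies unchanged to a real response piped in from `curl`.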
## Quick start -### Install +### Install llama.cpp ```bash # macOS / Linux (simplest) brew install llama.cpp +``` -# Or build from source +```bash +# Windows +winget install llama.cpp +``` + +```bash +# Or build from source git clone https://github.com/ggml-org/llama.cpp cd llama.cpp -make # CPU -make GGML_METAL=1 # Apple Silicon -make GGML_CUDA=1 # NVIDIA CUDA -make LLAMA_HIP=1 # AMD ROCm - -# Python bindings (optional) -pip install llama-cpp-python -# With CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir -# With Metal: CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir +cmake -B build +cmake --build build --config Release ``` -### Download a pre-quantized GGUF +### Run directly from the Hugging Face Hub ```bash -# TheBloke hosts most popular models pre-quantized -huggingface-cli download \ - TheBloke/Llama-2-7B-Chat-GGUF \ - llama-2-7b-chat.Q4_K_M.gguf \ - --local-dir models/ +llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 ``` -### Or convert a HuggingFace model to GGUF - ```bash -# 1. Download HF model -huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b - -# 2. Convert to FP16 GGUF -python convert_hf_to_gguf.py ./llama-3.1-8b \ - --outfile llama-3.1-8b-f16.gguf \ - --outtype f16 - -# 3. Quantize to Q4_K_M -./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M +llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 ``` -### Run inference +### Run an exact GGUF file from the Hub + +Use this when the tree API shows custom file naming or the exact HF snippet is missing. ```bash -# One-shot prompt -./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256 - -# Interactive chat -./llama-cli -m model.Q4_K_M.gguf --interactive - -# With GPU offload -./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!" 
+llama-server \ + --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf \ + --hf-file Phi-3-mini-4k-instruct-q4.gguf \ + -c 4096 ``` -### Serve an OpenAI-compatible API - -```bash -./llama-server \ - -m model.Q4_K_M.gguf \ - --host 0.0.0.0 \ - --port 8080 \ - -ngl 35 \ - -c 4096 \ - --parallel 4 \ - --cont-batching -``` +### OpenAI-compatible server check ```bash curl http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ + -H "Authorization: Bearer no-key" \ -d '{ - "model": "local", - "messages": [{"role": "user", "content": "Hello!"}], - "temperature": 0.7, - "max_tokens": 100 + "messages": [ + {"role": "user", "content": "Write a limerick about Python exceptions"} + ] }' ``` -## Quantization formats (GGUF) +## Choosing a quant -### K-quant methods (recommended) +Use the Hub page first, generic heuristics second. -| Type | Bits | Size (7B) | Quality | Use Case | -|------|------|-----------|---------|----------| -| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) | -| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained | -| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices | -| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical | -| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** | -| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused | -| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality | -| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original | -| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation | +- Prefer the exact quant that HF marks as compatible for the user's hardware profile. +- For general chat, start with `Q4_K_M`. +- For code or technical work, prefer `Q5_K_M` or `Q6_K` if memory allows. +- For very tight RAM budgets, consider `Q3_K_M`, `IQ` variants, or `Q2` variants only if the user explicitly prioritizes fit over quality. +- For multimodal repos, mention `mmproj-*.gguf` separately. The projector is not the main model file. +- Do not normalize repo-native labels. 
If the page says `UD-Q4_K_M`, report `UD-Q4_K_M`. -**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality). +## Extracting available GGUFs from a repo -**Legacy (Q4_0/Q4_1/Q5_0/Q5_1) exist** but always prefer K-quants for better quality/size ratio. +When the user asks what GGUFs exist, return: -**IQ quantization** — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. Require `--imatrix`. +- filename +- file size +- quant label +- whether it is a main model or an auxiliary projector -**Task-specific defaults:** -- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows -- Code generation: Q5_K_M or Q6_K (higher precision helps) -- Technical / medical: Q6_K or Q8_0 -- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S -- Raspberry Pi / edge: Q2_K or Q3_K_S +Ignore unless requested: -## Conversion workflows +- README +- BF16 shard files +- imatrix blobs or calibration artifacts -### Basic: HF → GGUF → quantized +Use the tree API for this step: -```bash -python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16 -./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M -./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50 +- `https://huggingface.co/api/models/<repo>/tree/main?recursive=true` + +For a repo like `unsloth/Qwen3.6-35B-A3B-GGUF`, the local-app page can show quant chips such as `UD-Q4_K_M`, `UD-Q5_K_M`, `UD-Q6_K`, and `Q8_0`, while the tree API exposes exact file paths such as `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf` and `Qwen3.6-35B-A3B-Q8_0.gguf` with byte sizes. Use the tree API to turn a quant label into an exact filename. 
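The label-to-filename step can be sketched with plain shell over a saved path listing. The filenames below are hypothetical; note that a bare substring match for `Q4_K_M` would also hit `UD-Q4_K_M`, so the sketch anchors on the `-<label>.gguf` suffix at end of line.

```shell
# Hypothetical path listing, e.g. extracted from a tree API response.
cat > /tmp/gguf_paths.txt << 'EOF'
Model-7B-UD-Q4_K_M.gguf
Model-7B-Q8_0.gguf
EOF

quant="UD-Q4_K_M"
# Anchor on "-<label>.gguf$" so Q8_0 does not match the UD variant.
# Caveat: "-Q4_K_M.gguf$" would still match the UD file, so always
# search with the fullest label the page shows.
grep -E -- "-${quant}\.gguf$" /tmp/gguf_paths.txt
```

With `quant="Q8_0"` the same command returns only `Model-7B-Q8_0.gguf`.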
+ +## Search patterns + +Use these URL shapes directly: + +```text +https://huggingface.co/models?apps=llama.cpp&sort=trending +https://huggingface.co/models?search=<query>&apps=llama.cpp&sort=trending +https://huggingface.co/models?search=<query>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending +https://huggingface.co/<repo>?local-app=llama.cpp +https://huggingface.co/api/models/<repo>/tree/main?recursive=true +https://huggingface.co/<repo>/tree/main ``` -### With importance matrix (imatrix) — better low-bit quality +## Output format -`imatrix` gives 10–20% perplexity improvement at Q4, essential at Q3 and below. +When answering discovery requests, prefer a compact structured result like: -```bash -# 1. Convert to FP16 GGUF -python convert_hf_to_gguf.py ./model --outfile model-f16.gguf - -# 2. Prepare calibration data (diverse text, ~100MB is ideal) -cat > calibration.txt << 'EOF' -The quick brown fox jumps over the lazy dog. -Machine learning is a subset of artificial intelligence. -# Add more diverse text samples... -EOF - -# 3. Generate importance matrix -./llama-imatrix -m model-f16.gguf \ -f calibration.txt \ --chunk 512 \ -o model.imatrix \ -ngl 35 - -# 4. Quantize with imatrix -./llama-quantize --imatrix model.imatrix \ -model-f16.gguf model-q4_k_m.gguf Q4_K_M +```text +Repo: +Recommended quant from HF: