mirror of https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00

improve llama.cpp skill

parent ce98e1ef11
commit d6cf2cc058

4 changed files with 351 additions and 380 deletions

@ -1,438 +1,186 @@
---
name: llama-cpp
description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA. Covers GGUF quant selection, Hugging Face Hub model search with `apps=llama.cpp`, hardware-aware quant recommendations from `?local-app=llama.cpp`, extracting available `.gguf` files from the Hugging Face tree API, and building the right `llama-cli` or `llama-server` command directly from Hub URLs.
version: 2.1.1
author: Orchestra Research
license: MIT
dependencies: [llama-cpp-python>=0.2.0]
metadata:
  hermes:
    tags: [llama.cpp, GGUF, Quantization, Hugging Face Hub, CPU Inference, Apple Silicon, Edge Deployment, AMD GPUs, Intel GPUs, NVIDIA, URL-first]
---

# llama.cpp + GGUF

Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving. Use this skill for local GGUF inference, quant selection, or Hugging Face repo discovery for llama.cpp.

## When to use

**Use llama.cpp + GGUF to:**
- Run local models on CPU-only machines, Apple Silicon (M1/M2/M3/M4 with Metal), CUDA, AMD (ROCm), or Intel GPUs where CUDA isn't available
- Deploy at the edge (Raspberry Pi, embedded systems, consumer laptops)
- Quantize flexibly (2–8 bit with K-quants)
- Integrate with local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
- Ship a single-binary deploy without Docker or Python
- Find the right GGUF for a specific Hugging Face repo
- Build a `llama-server` or `llama-cli` command from the Hub
- Search the Hub for models that already support llama.cpp
- Enumerate available `.gguf` files and sizes for a repo
- Decide between Q4/Q5/Q6/IQ variants for the user's RAM or VRAM

**Key advantages:**
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
- No Python runtime required (pure C/C++)
- K-quants + imatrix for better low-bit quality
- OpenAI-compatible server built in
- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)

**Use alternatives instead:**
- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput
- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed
- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments
- **bitsandbytes** — Simple HuggingFace transformers integration
- **HQQ** — Fast calibration-free quantization

## Model Discovery workflow

Prefer URL workflows before asking for `hf`, Python, or custom scripts.

1. Search for candidate repos on the Hub:
   - Base: `https://huggingface.co/models?apps=llama.cpp&sort=trending`
   - Add `search=<term>` for a model family
   - Add `num_parameters=min:0,max:24B` or similar when the user has size constraints
2. Open the repo with the llama.cpp local-app view:
   - `https://huggingface.co/<repo>?local-app=llama.cpp`
3. Treat the local-app snippet as the source of truth when it is visible:
   - copy the exact `llama-server` or `llama-cli` command
   - report the recommended quant exactly as HF shows it
4. Read the same `?local-app=llama.cpp` URL as page text or HTML and extract the section under `Hardware compatibility`:
   - prefer its exact quant labels and sizes over generic tables
   - keep repo-specific labels such as `UD-Q4_K_M` or `IQ4_NL_XL`
   - if that section is not visible in the fetched page source, say so and fall back to the tree API plus generic quant guidance
5. Query the tree API to confirm what actually exists (see the sketch after this list):
   - `https://huggingface.co/api/models/<repo>/tree/main?recursive=true`
   - keep entries where `type` is `file` and `path` ends with `.gguf`
   - use `path` and `size` as the source of truth for filenames and byte sizes
   - separate quantized checkpoints from `mmproj-*.gguf` projector files and `BF16/` shard files
   - use `https://huggingface.co/<repo>/tree/main` only as a human fallback
6. If the local-app snippet is not text-visible, reconstruct the command from the repo plus the chosen quant:
   - shorthand quant selection: `llama-server -hf <repo>:<QUANT>`
   - exact-file fallback: `llama-server --hf-repo <repo> --hf-file <filename.gguf>`
7. Only suggest conversion from Transformers weights if the repo does not already expose GGUF files.

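A minimal Python sketch of step 5, assuming only the tree API above (the repo name is an illustrative placeholder):

```python
import json
import urllib.request

repo = "bartowski/Llama-3.2-3B-Instruct-GGUF"  # placeholder; use the shortlisted repo
url = f"https://huggingface.co/api/models/{repo}/tree/main?recursive=true"

with urllib.request.urlopen(url) as resp:
    entries = json.load(resp)

# Step 5: keep plain files whose path ends with .gguf, then report path + size.
ggufs = [e for e in entries if e.get("type") == "file" and e["path"].endswith(".gguf")]
for e in sorted(ggufs, key=lambda e: e["size"]):
    print(f"{e['path']}  {e['size'] / 1e9:.1f} GB")
```
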
## Quick start

### Install llama.cpp

```bash
# macOS / Linux (simplest)
brew install llama.cpp
```

```bash
# Windows
winget install llama.cpp
```

```bash
# Or build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build      # add -DGGML_METAL=ON (Apple Silicon), -DGGML_CUDA=ON (NVIDIA), or -DGGML_HIP=ON (AMD ROCm)
cmake --build build --config Release

# Python bindings (optional)
pip install llama-cpp-python
# With CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# With Metal: CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```

### Run directly from the Hugging Face Hub

```bash
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```

### Or download a pre-quantized GGUF manually

```bash
# TheBloke hosts most popular models pre-quantized
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/
```

### Or convert a HuggingFace model to GGUF

```bash
# 1. Download HF model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./llama-3.1-8b \
  --outfile llama-3.1-8b-f16.gguf \
  --outtype f16

# 3. Quantize to Q4_K_M
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
```

### Run inference

```bash
# One-shot prompt
./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256

# Interactive chat
./llama-cli -m model.Q4_K_M.gguf --interactive

# With GPU offload
./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
```

### Run an exact GGUF file from the Hub

Use this when the tree API shows custom file naming or the exact HF snippet is missing.

```bash
llama-server \
  --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf \
  --hf-file Phi-3-mini-4k-instruct-q4.gguf \
  -c 4096
```

### Serve an OpenAI-compatible API

```bash
./llama-server \
  -m model.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 4096 \
  --parallel 4 \
  --cont-batching
```

### OpenAI-compatible server check

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "local",
    "messages": [
      {"role": "user", "content": "Write a limerick about Python exceptions"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

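A quick readiness probe before wiring up clients; this sketch assumes llama-server's `/health` endpoint, which returns a small JSON status object once the model is loaded:

```python
import json
import urllib.request

# Probe the server started above; expect HTTP 200 once the model is ready.
with urllib.request.urlopen("http://localhost:8080/health") as resp:
    print(resp.status, json.load(resp))
```
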
## Quantization formats (GGUF)

### K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |

**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality).

**Legacy formats (Q4_0/Q4_1/Q5_0/Q5_1) exist**, but always prefer K-quants for their better quality/size ratio.

**IQ quantization** — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. These require `--imatrix`.

**Task-specific defaults:**
- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
- Code generation: Q5_K_M or Q6_K (higher precision helps)
- Technical / medical: Q6_K or Q8_0
- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
- Raspberry Pi / edge: Q2_K or Q3_K_S

## Choosing a quant

Use the Hub page first, generic heuristics second.

- Prefer the exact quant that HF marks as compatible for the user's hardware profile.
- For general chat, start with `Q4_K_M`.
- For code or technical work, prefer `Q5_K_M` or `Q6_K` if memory allows.
- For very tight RAM budgets, consider `Q3_K_M`, `IQ` variants, or `Q2` variants only if the user explicitly prioritizes fit over quality.
- For multimodal repos, mention `mmproj-*.gguf` separately. The projector is not the main model file.
- Do not normalize repo-native labels. If the page says `UD-Q4_K_M`, report `UD-Q4_K_M`.

## Extracting available GGUFs from a repo

When the user asks what GGUFs exist, return:

- filename
- file size
- quant label
- whether it is a main model or an auxiliary projector

Ignore unless requested:

- README
- BF16 shard files
- imatrix blobs or calibration artifacts

Use the tree API for this step:

- `https://huggingface.co/api/models/<repo>/tree/main?recursive=true`

For a repo like `unsloth/Qwen3.6-35B-A3B-GGUF`, the local-app page can show quant chips such as `UD-Q4_K_M`, `UD-Q5_K_M`, `UD-Q6_K`, and `Q8_0`, while the tree API exposes exact file paths such as `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf` and `Qwen3.6-35B-A3B-Q8_0.gguf` with byte sizes. Use the tree API to turn a quant label into an exact filename.
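
To turn a quant label into an exact filename mechanically, a small heuristic sketch (it assumes `ggufs` is the filtered tree-API list from the earlier sketch; the shortest-match rule is an assumption about common naming, not a Hub guarantee):

```python
def pick_file(ggufs: list[dict], quant: str) -> dict | None:
    """Match a quant label such as 'UD-Q4_K_M' to one tree entry.

    Heuristic: substring match on the filename, preferring the shortest hit
    so a bare 'Q4_K_M' does not accidentally select a 'UD-Q4_K_M' variant."""
    q = quant.lower()
    hits = [e for e in ggufs if q in e["path"].rsplit("/", 1)[-1].lower()]
    return min(hits, key=lambda e: len(e["path"]), default=None)
```
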
## Search patterns

Use these URL shapes directly:

```text
https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
https://huggingface.co/<repo>?local-app=llama.cpp
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
https://huggingface.co/<repo>/tree/main
```
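
A small sketch for composing these search URLs programmatically (the query parameter names come from the URLs above; the helper itself is illustrative):

```python
from urllib.parse import urlencode

def hub_search_url(term: str | None = None, max_size: str | None = None) -> str:
    """Build a Hub models-search URL with the llama.cpp app filter."""
    params: dict[str, str] = {"apps": "llama.cpp", "sort": "trending"}
    if term:
        params["search"] = term
    if max_size:
        params["num_parameters"] = f"min:0,max:{max_size}"
    return "https://huggingface.co/models?" + urlencode(params)

print(hub_search_url("Qwen", "24B"))
```
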
## Output format

When answering discovery requests, prefer a compact structured result like:

```text
Repo: <repo>
Recommended quant from HF: <label> (<size>)
llama-server: <command>
Other GGUFs:
- <filename> - <size>
- <filename> - <size>
Source URLs:
- <local-app URL>
- <tree API URL>
```

## Conversion workflows

### Basic: HF → GGUF → quantized

```bash
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50
```

### With importance matrix (imatrix) — better low-bit quality

`imatrix` gives a 10–20% perplexity improvement at Q4 and is essential at Q3 and below.

```bash
# 1. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Prepare calibration data (diverse text, ~100MB is ideal)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
# Add more diverse text samples...
EOF

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
  -f calibration.txt \
  --chunk 512 \
  -o model.imatrix \
  -ngl 35

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
  model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

### Multi-quant batch

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
  ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
  echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```

### Quality testing (perplexity)

```bash
./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
# Baseline FP16: ~5.96 | Q4_K_M: ~6.06 (+1.7%) | Q2_K: ~6.87 (+15.3%)
```
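
The relative degradation quoted in that comment can be recomputed with a trivial helper, using the numbers above:

```python
def ppl_increase(baseline: float, quantized: float) -> float:
    """Percent perplexity increase of a quant over the FP16 baseline."""
    return (quantized - baseline) / baseline * 100

print(f"Q4_K_M: +{ppl_increase(5.96, 6.06):.1f}%")  # +1.7%
print(f"Q2_K:   +{ppl_increase(5.96, 6.87):.1f}%")  # +15.3%
```
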
|
||||
|
||||
## Python bindings (llama-cpp-python)
|
||||
|
||||
### Basic generation
|
||||
|
||||
```python
|
||||
from llama_cpp import Llama
|
||||
|
||||
llm = Llama(
|
||||
model_path="./model-q4_k_m.gguf",
|
||||
n_ctx=4096,
|
||||
n_gpu_layers=35, # 0 for CPU only, 99 to offload everything
|
||||
n_threads=8,
|
||||
)
|
||||
|
||||
output = llm(
|
||||
"What is machine learning?",
|
||||
max_tokens=256,
|
||||
temperature=0.7,
|
||||
stop=["</s>", "\n\n"],
|
||||
)
|
||||
print(output["choices"][0]["text"])
|
||||
```
|
||||
|
||||
### Chat completion + streaming
|
||||
|
||||
```python
|
||||
llm = Llama(
|
||||
model_path="./model-q4_k_m.gguf",
|
||||
n_ctx=4096,
|
||||
n_gpu_layers=35,
|
||||
chat_format="llama-3", # Or "chatml", "mistral", etc.
|
||||
)
|
||||
|
||||
# Non-streaming
|
||||
response = llm.create_chat_completion(
|
||||
messages=[
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": "What is Python?"},
|
||||
],
|
||||
max_tokens=256,
|
||||
temperature=0.7,
|
||||
)
|
||||
print(response["choices"][0]["message"]["content"])
|
||||
|
||||
# Streaming
|
||||
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
|
||||
print(chunk["choices"][0]["text"], end="", flush=True)
|
||||
```
|
||||
|
||||
### Embeddings
|
||||
|
||||
```python
|
||||
llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
|
||||
vec = llm.embed("This is a test sentence.")
|
||||
print(f"Embedding dimension: {len(vec)}")
|
||||
```
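
A common follow-up is semantic comparison; a minimal sketch reusing the embedding-mode `llm` above (pure-Python cosine, no extra dependencies):

```python
import math

a = llm.embed("How do I quantize a GGUF model?")
b = llm.embed("Steps for quantizing a model")

dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
print(f"cosine similarity: {dot / norm:.3f}")
```
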
## Hardware acceleration

### Apple Silicon (Metal)

```bash
cmake -B build -DGGML_METAL=ON && cmake --build build --config Release
./llama-cli -m model.gguf -ngl 99 -p "Hello"  # offload all layers
```

```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload everything
    n_threads=1,      # Metal handles parallelism
)
```

Performance: M3 Max ~40–60 tok/s on Llama 2-7B Q4_K_M.

### NVIDIA (CUDA)

```bash
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Hybrid for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20  # GPU: 20 layers, CPU: rest

# Multi-GPU split
./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
```

### AMD (ROCm)

```bash
cmake -B build -DGGML_HIP=ON && cmake --build build --config Release
./llama-cli -m model.gguf -ngl 999
```

### CPU

```bash
# Match PHYSICAL cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"

# BLAS acceleration (2–3× speedup for prompt processing)
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
```

```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,
    n_threads=8,
    n_batch=512,  # Larger batch = faster prompt processing
)
```

## Performance benchmarks

### CPU (Llama 2-7B Q4_K_M)

| CPU | Threads | Speed |
|-----|---------|-------|
| Apple M3 Max (Metal) | 16 | 50 tok/s |
| AMD Ryzen 9 7950X | 32 | 35 tok/s |
| Intel i9-13900K | 32 | 30 tok/s |

### GPU offloading on RTX 4090

| Layers GPU | Speed | VRAM |
|------------|-------|------|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |

## Supported models

- **LLaMA family**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama
- **Mistral family**: Mistral 7B, Mixtral 8x7B/8x22B
- **Other**: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)

Find GGUF models: https://huggingface.co/models?library=gguf

## Ecosystem integrations

### Ollama

```bash
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

ollama create mymodel -f Modelfile
ollama run mymodel "Hello!"
```

### LM Studio

1. Place GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload, start inference

### text-generation-webui

```bash
cp model-q4_k_m.gguf text-generation-webui/models/
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```

### OpenAI client → llama-server

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

## Best practices

1. **Use K-quants** — Q4_K_M is the recommended default
2. **Use imatrix** for Q4 and below (calibration improves quality substantially)
3. **Offload as many layers as VRAM allows** — start high, reduce by 5 on OOM (see the sketch after this list)
4. **Thread count** — match physical cores, not logical
5. **Batch size** — increase `n_batch` (e.g. 512) for faster prompt processing
6. **Context** — start at 4096, grow only as needed (memory scales with ctx)
7. **Flash Attention** — add `--flash-attn` if your build supports it
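
For item 3, a rough first guess at `-ngl` can come from file size and layer count; a heuristic sketch, assuming roughly equal layer sizes and reserving headroom for the KV cache and compute buffers:

```python
def estimate_ngl(model_bytes: int, n_layers: int, free_vram_bytes: int,
                 headroom: float = 0.15) -> int:
    """Rough starting value for -ngl; reduce on OOM as item 3 advises."""
    per_layer = model_bytes / n_layers
    usable = free_vram_bytes * (1 - headroom)
    return max(0, min(n_layers, int(usable // per_layer)))

# e.g. a ~4.1 GB Q4_K_M 7B model (32 layers) with 6 GB of free VRAM
print(estimate_ngl(4_100_000_000, 32, 6_000_000_000))  # -> 32 (fully offloadable)
```
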
## Common issues (quick fixes)

**Model loads slowly** — use `--mmap` for memory-mapped loading.

**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:

```python
Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35)  # Q4_0 KV cache
```

**Garbage output** — wrong `chat_format`, temperature too high, or a corrupted model file. Test with `temperature=0.1` and verify that the FP16 baseline works.

**Connection refused (server)** — bind to `--host 0.0.0.0`, check `lsof -i :8080`.

See `references/troubleshooting.md` for the full playbook.

## References

- **[hub-discovery.md](references/hub-discovery.md)** — URL-only Hugging Face workflows, search patterns, GGUF extraction, and command reconstruction
- **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
- **[quantization.md](references/quantization.md)** — quant quality tradeoffs, when to use Q4/Q5/Q6/IQ, perplexity tables, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
- **[server.md](references/server.md)** — direct-from-Hub server launch, OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
- **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
- **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging

## Resources

- **GitHub**: https://github.com/ggml-org/llama.cpp
- **Python bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized models**: https://huggingface.co/TheBloke
- **GGUF converter Space**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **Hugging Face GGUF + llama.cpp docs**: https://huggingface.co/docs/hub/gguf-llamacpp
- **Hugging Face Local Apps docs**: https://huggingface.co/docs/hub/main/local-apps
- **Hugging Face Local Agents docs**: https://huggingface.co/docs/hub/agents-local
- **Example local-app page**: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF?local-app=llama.cpp
- **Example tree API**: https://huggingface.co/api/models/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main?recursive=true
- **Example llama.cpp search**: https://huggingface.co/models?num_parameters=min:0,max:24B&apps=llama.cpp&sort=trending
- **License**: MIT

skills/mlops/inference/llama-cpp/references/hub-discovery.md (new file, 168 lines)

@ -0,0 +1,168 @@
# Hugging Face URL Workflows for llama.cpp

Use URL-only workflows first. Do not require `hf` or API clients just to find GGUF files, choose a quant, or build a `llama-server` command.

## Core URLs

```text
Search:
https://huggingface.co/models?apps=llama.cpp&sort=trending

Search with text:
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending

Search with size bounds:
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending

Repo local-app view:
https://huggingface.co/<repo>?local-app=llama.cpp

Repo tree API:
https://huggingface.co/api/models/<repo>/tree/main?recursive=true

Repo file tree:
https://huggingface.co/<repo>/tree/main
```

## 1. Search for llama.cpp-compatible models

Start from the models page with `apps=llama.cpp`.

Use:

- `search=<term>` for model family names such as `Qwen`, `Gemma`, `Phi`, or `Mistral`
- `num_parameters=min:0,max:24B` or similar if the user has hardware limits
- `sort=trending` when the user wants popular repos right now

Do not start with random GGUF repos if the user has not chosen a model family yet. Search first, shortlist second.

Example: https://huggingface.co/models?search=Qwen&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending

## 2. Use the local-app page for the recommended quant

Open:

```text
https://huggingface.co/<repo>?local-app=llama.cpp
```

Extract, in order:

1. The exact `Use this model` snippet, if it is visible as text
2. The `Hardware compatibility` section from the fetched page text or HTML:
   - quant label
   - file size
   - bit-depth grouping
3. Any extra launch flags shown in the snippet, such as `--jinja`

Treat the HF local-app snippet as the source of truth when it is visible.

Do this by reading the URL itself, not by assuming the UI rendered in a browser. If the fetched page source does not expose `Hardware compatibility`, say that the section was not text-visible and fall back to the tree API plus generic guidance from `quantization.md`.
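
To check text-visibility programmatically, a minimal sketch (plain HTTP fetch; the example repo is the one used later in this file):

```python
import urllib.request

url = "https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF?local-app=llama.cpp"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")

if "Hardware compatibility" in html:
    print("section is text-visible: extract its quant labels and sizes")
else:
    print("not text-visible: fall back to the tree API + quantization.md guidance")
```
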
## 3. Confirm exact files from the tree API

Open:

```text
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
```

Treat the JSON response as the source of truth for repo inventory.

Keep entries where:

- `type` is `file`
- `path` ends with `.gguf`

Use these fields:

- `path` for the filename and subdirectory
- `size` for the byte size
- optionally `lfs.size` to confirm the LFS payload size

Separate files into:

- quantized single-file checkpoints, for example `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf`
- projector weights, usually `mmproj-*.gguf`
- BF16 shard files, usually under `BF16/`
- everything else

Ignore unless the user asks:

- `README.md`
- imatrix or calibration blobs

Use `https://huggingface.co/<repo>/tree/main` only as a human fallback if the API endpoint fails or the user wants the web view.
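
A sketch of this bucketing, assuming `entries` is the parsed JSON list from the tree API (string heuristics only):

```python
def bucket(entries: list[dict]) -> dict[str, list[dict]]:
    """Split tree-API entries into the four groups described above."""
    out: dict[str, list[dict]] = {
        "checkpoints": [], "projectors": [], "bf16_shards": [], "other": []
    }
    for e in entries:
        if e.get("type") != "file":
            continue
        name = e["path"].rsplit("/", 1)[-1]
        if not name.endswith(".gguf"):
            out["other"].append(e)
        elif name.startswith("mmproj"):
            out["projectors"].append(e)
        elif e["path"].startswith("BF16/"):
            out["bf16_shards"].append(e)
        else:
            out["checkpoints"].append(e)
    return out
```
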
## 4. Build the command

Preferred order:

1. Copy the exact HF snippet from the local-app page
2. If the page gives a clean quant label, use shorthand selection:

```bash
llama-server -hf <repo>:<QUANT>
```

3. If you need an exact file from the tree API, use the file-specific form:

```bash
llama-server --hf-repo <repo> --hf-file <filename.gguf>
```

4. For CLI usage instead of a server, use:

```bash
llama-cli -hf <repo>:<QUANT>
```

Use the exact-file form when the repo uses custom labels or nonstandard naming that could make `:<QUANT>` ambiguous.

## 5. Example: `unsloth/Qwen3.6-35B-A3B-GGUF`

Use these URLs:

```text
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF?local-app=llama.cpp
https://huggingface.co/api/models/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main?recursive=true
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main
```

On the local-app page, the hardware compatibility section can expose entries such as:

- `UD-IQ4_XS` - 17.7 GB
- `UD-Q4_K_S` - 20.9 GB
- `UD-Q4_K_M` - 22.1 GB
- `UD-Q5_K_M` - 26.5 GB
- `UD-Q6_K` - 29.3 GB
- `Q8_0` - 36.9 GB

On the tree API, you can confirm exact filenames such as:

- `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf`
- `Qwen3.6-35B-A3B-UD-Q5_K_M.gguf`
- `Qwen3.6-35B-A3B-UD-Q6_K.gguf`
- `Qwen3.6-35B-A3B-Q8_0.gguf`
- `mmproj-F16.gguf`

Good final output for this repo:

```text
Repo: unsloth/Qwen3.6-35B-A3B-GGUF
Recommended quant from HF: UD-Q4_K_M (22.1 GB)
llama-server: llama-server --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
Other GGUFs:
- Qwen3.6-35B-A3B-UD-Q5_K_M.gguf - 26.5 GB
- Qwen3.6-35B-A3B-UD-Q6_K.gguf - 29.3 GB
- Qwen3.6-35B-A3B-Q8_0.gguf - 36.9 GB
Projector:
- mmproj-F16.gguf - 899 MB
```

## Notes

- Repo-specific quant labels matter. Do not rewrite `UD-Q4_K_M` to `Q4_K_M` unless the page itself does.
- `mmproj` files are projector weights for multimodal models, not the main language model checkpoint.
- If the HF hardware compatibility panel is missing because the user has no hardware profile configured, or because the fetched page source did not expose it, still use the tree API plus generic quant guidance from `quantization.md`.
- If the repo already has GGUFs, do not jump straight to conversion workflows.

`references/quantization.md`:

@ -2,6 +2,22 @@

Complete guide to GGUF quantization formats and model conversion.

## Hub-first quant selection

Before using generic tables, open the model repo with:

```text
https://huggingface.co/<repo>?local-app=llama.cpp
```

Prefer the exact quant labels and sizes shown in the `Hardware compatibility` section of the fetched `?local-app=llama.cpp` page text or HTML. Then confirm the matching filenames in:

```text
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
```

Use the Hub page first, and only fall back to the generic heuristics below when the repo page does not expose a clear recommendation.

## Quantization Overview

**GGUF** (GPT-Generated Unified Format) - Standard format for llama.cpp models.

@ -23,11 +39,11 @@ Complete guide to GGUF quantization formats and model conversion.

## Converting Models

### Hugging Face to GGUF

```bash
# 1. Download Hugging Face model
hf download meta-llama/Llama-2-7b-chat-hf \
  --local-dir models/llama-2-7b-chat/

# 2. Convert to FP16 GGUF
```

@ -152,18 +168,32 @@ Q2_K or Q3_K_S - Fit in limited RAM

## Finding Pre-Quantized Models

Use the Hub search with the llama.cpp app filter:

```text
https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
```

For a specific repo, open:

```text
https://huggingface.co/<repo>?local-app=llama.cpp
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
```

Then launch directly from the Hub without extra Hub tooling:

```bash
llama-cli -hf <repo>:Q4_K_M
llama-server -hf <repo>:Q4_K_M
```

If you need the exact file name from the tree API:

```bash
llama-server --hf-repo <repo> --hf-file <filename.gguf>
```

**TheBloke** on HuggingFace (https://huggingface.co/TheBloke) also hosts most popular models pre-quantized in all GGUF formats, with no conversion needed:

```bash
# Download pre-quantized Llama 2-7B
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/
```

## Importance Matrices (imatrix)

`references/server.md`:

@ -2,6 +2,31 @@

Production deployment of llama.cpp server with OpenAI-compatible API.

## Direct from Hugging Face Hub

Prefer the model repo's local-app page first:

```text
https://huggingface.co/<repo>?local-app=llama.cpp
```

If the page shows an exact snippet, copy it. If not, use one of these forms:

```bash
# Choose a quant label directly from the Hub repo
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```

```bash
# Pin an exact GGUF file from the repo tree
llama-server \
  --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf \
  --hf-file Phi-3-mini-4k-instruct-q4.gguf \
  -c 4096
```

Use the file-specific form when the repo has custom naming or when you already extracted the exact filename from the tree API.

## Server Modes

### llama-server