mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-26 01:01:40 +00:00
skills: consolidate mlops redundancies (gguf+llama-cpp, grpo+trl, guidance→optional) (#11965)
Three tightly-scoped built-in skill consolidations to reduce redundancy in the available_skills listing injected into every system prompt: 1. gguf-quantization → llama-cpp (merged) GGUF is llama.cpp's format; two skills covered the same toolchain. The merged llama-cpp skill keeps the full K-quant table + imatrix workflow from gguf and the ROCm/benchmarks/supported-models sections from the original llama-cpp. All 5 reference files preserved. 2. grpo-rl-training → fine-tuning-with-trl (folded in) GRPO isn't a framework, it's a trainer inside TRL. Moved the 17KB deep-dive SKILL.md to references/grpo-training.md and the working template to templates/basic_grpo_training.py. TRL's GRPO workflow section now points to both. Atropos skill's related_skills updated. 3. guidance → optional-skills/mlops/ Dropped from built-in. Outlines (still built-in) covers the same structured-generation ground with wider adoption. Listed in the optional catalog for users who specifically want Guidance. Net: 3 fewer built-in skill lines in every system prompt, zero content loss. Contributor authorship preserved via git rename detection.
This commit is contained in:
parent
598cba62ad
commit
73bccc94c7
15 changed files with 470 additions and 889 deletions
|
|
@ -1,138 +1,271 @@
|
|||
---
|
||||
name: llama-cpp
|
||||
description: Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
|
||||
version: 1.0.0
|
||||
description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization.
|
||||
version: 2.0.0
|
||||
author: Orchestra Research
|
||||
license: MIT
|
||||
dependencies: [llama-cpp-python]
|
||||
dependencies: [llama-cpp-python>=0.2.0]
|
||||
metadata:
|
||||
hermes:
|
||||
tags: [Inference Serving, Llama.cpp, CPU Inference, Apple Silicon, Edge Deployment, GGUF, Quantization, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded]
|
||||
|
||||
tags: [llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression]
|
||||
---
|
||||
|
||||
# llama.cpp
|
||||
# llama.cpp + GGUF
|
||||
|
||||
Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
|
||||
Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.
|
||||
|
||||
## When to use llama.cpp
|
||||
## When to use
|
||||
|
||||
**Use llama.cpp when:**
|
||||
- Running on CPU-only machines
|
||||
- Deploying on Apple Silicon (M1/M2/M3/M4)
|
||||
- Using AMD or Intel GPUs (no CUDA)
|
||||
- Edge deployment (Raspberry Pi, embedded systems)
|
||||
- Need simple deployment without Docker/Python
|
||||
**Use llama.cpp + GGUF when:**
|
||||
- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
|
||||
- Using AMD (ROCm) or Intel GPUs where CUDA isn't available
|
||||
- Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
|
||||
- Need flexible quantization (2–8 bit with K-quants)
|
||||
- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
|
||||
- Want a single binary deploy without Docker/Python
|
||||
|
||||
**Use TensorRT-LLM instead when:**
|
||||
- Have NVIDIA GPUs (A100/H100)
|
||||
- Need maximum throughput (100K+ tok/s)
|
||||
- Running in datacenter with CUDA
|
||||
**Key advantages:**
|
||||
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
|
||||
- No Python runtime required (pure C/C++)
|
||||
- K-quants + imatrix for better low-bit quality
|
||||
- OpenAI-compatible server built in
|
||||
- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)
|
||||
|
||||
**Use vLLM instead when:**
|
||||
- Have NVIDIA GPUs
|
||||
- Need Python-first API
|
||||
- Want PagedAttention
|
||||
**Use alternatives instead:**
|
||||
- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput
|
||||
- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed
|
||||
- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments
|
||||
- **bitsandbytes** — Simple HuggingFace transformers integration
|
||||
- **HQQ** — Fast calibration-free quantization
|
||||
|
||||
## Quick start
|
||||
|
||||
### Installation
|
||||
### Install
|
||||
|
||||
```bash
|
||||
# macOS/Linux
|
||||
# macOS / Linux (simplest)
|
||||
brew install llama.cpp
|
||||
|
||||
# Or build from source
|
||||
git clone https://github.com/ggerganov/llama.cpp
|
||||
git clone https://github.com/ggml-org/llama.cpp
|
||||
cd llama.cpp
|
||||
make
|
||||
make # CPU
|
||||
make GGML_METAL=1 # Apple Silicon
|
||||
make GGML_CUDA=1 # NVIDIA CUDA
|
||||
make LLAMA_HIP=1 # AMD ROCm
|
||||
|
||||
# With Metal (Apple Silicon)
|
||||
make LLAMA_METAL=1
|
||||
|
||||
# With CUDA (NVIDIA)
|
||||
make LLAMA_CUDA=1
|
||||
|
||||
# With ROCm (AMD)
|
||||
make LLAMA_HIP=1
|
||||
# Python bindings (optional)
|
||||
pip install llama-cpp-python
|
||||
# With CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
|
||||
# With Metal: CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
|
||||
```
|
||||
|
||||
### Download model
|
||||
### Download a pre-quantized GGUF
|
||||
|
||||
```bash
|
||||
# Download from HuggingFace (GGUF format)
|
||||
# TheBloke hosts most popular models pre-quantized
|
||||
huggingface-cli download \
|
||||
TheBloke/Llama-2-7B-Chat-GGUF \
|
||||
llama-2-7b-chat.Q4_K_M.gguf \
|
||||
--local-dir models/
|
||||
```
|
||||
|
||||
# Or convert from HuggingFace
|
||||
python convert_hf_to_gguf.py models/llama-2-7b-chat/
|
||||
### Or convert a HuggingFace model to GGUF
|
||||
|
||||
```bash
|
||||
# 1. Download HF model
|
||||
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
|
||||
|
||||
# 2. Convert to FP16 GGUF
|
||||
python convert_hf_to_gguf.py ./llama-3.1-8b \
|
||||
--outfile llama-3.1-8b-f16.gguf \
|
||||
--outtype f16
|
||||
|
||||
# 3. Quantize to Q4_K_M
|
||||
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
|
||||
```
|
||||
|
||||
### Run inference
|
||||
|
||||
```bash
|
||||
# Simple chat
|
||||
./llama-cli \
|
||||
-m models/llama-2-7b-chat.Q4_K_M.gguf \
|
||||
-p "Explain quantum computing" \
|
||||
-n 256 # Max tokens
|
||||
# One-shot prompt
|
||||
./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256
|
||||
|
||||
# Interactive chat
|
||||
./llama-cli \
|
||||
-m models/llama-2-7b-chat.Q4_K_M.gguf \
|
||||
--interactive
|
||||
./llama-cli -m model.Q4_K_M.gguf --interactive
|
||||
|
||||
# With GPU offload
|
||||
./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
|
||||
```
|
||||
|
||||
### Server mode
|
||||
### Serve an OpenAI-compatible API
|
||||
|
||||
```bash
|
||||
# Start OpenAI-compatible server
|
||||
./llama-server \
|
||||
-m models/llama-2-7b-chat.Q4_K_M.gguf \
|
||||
-m model.Q4_K_M.gguf \
|
||||
--host 0.0.0.0 \
|
||||
--port 8080 \
|
||||
-ngl 32 # Offload 32 layers to GPU
|
||||
-ngl 35 \
|
||||
-c 4096 \
|
||||
--parallel 4 \
|
||||
--cont-batching
|
||||
```
|
||||
|
||||
# Client request
|
||||
```bash
|
||||
curl http://localhost:8080/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "llama-2-7b-chat",
|
||||
"model": "local",
|
||||
"messages": [{"role": "user", "content": "Hello!"}],
|
||||
"temperature": 0.7,
|
||||
"max_tokens": 100
|
||||
}'
|
||||
```
|
||||
|
||||
## Quantization formats
|
||||
## Quantization formats (GGUF)
|
||||
|
||||
### GGUF format overview
|
||||
### K-quant methods (recommended)
|
||||
|
||||
| Format | Bits | Size (7B) | Speed | Quality | Use Case |
|
||||
|--------|------|-----------|-------|---------|----------|
|
||||
| **Q4_K_M** | 4.5 | 4.1 GB | Fast | Good | **Recommended default** |
|
||||
| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
|
||||
| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
|
||||
| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
|
||||
| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
|
||||
| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
|
||||
| Type | Bits | Size (7B) | Quality | Use Case |
|
||||
|------|------|-----------|---------|----------|
|
||||
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
|
||||
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
|
||||
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
|
||||
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
|
||||
| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** |
|
||||
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
|
||||
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
|
||||
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
|
||||
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |
|
||||
|
||||
### Choosing quantization
|
||||
**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality).
|
||||
|
||||
**Legacy (Q4_0/Q4_1/Q5_0/Q5_1) exist** but always prefer K-quants for better quality/size ratio.
|
||||
|
||||
**IQ quantization** — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. Require `--imatrix`.
|
||||
|
||||
**Task-specific defaults:**
|
||||
- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
|
||||
- Code generation: Q5_K_M or Q6_K (higher precision helps)
|
||||
- Technical / medical: Q6_K or Q8_0
|
||||
- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
|
||||
- Raspberry Pi / edge: Q2_K or Q3_K_S
|
||||
|
||||
## Conversion workflows
|
||||
|
||||
### Basic: HF → GGUF → quantized
|
||||
|
||||
```bash
|
||||
# General use (balanced)
|
||||
Q4_K_M # 4-bit, medium quality
|
||||
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
|
||||
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
|
||||
./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50
|
||||
```
|
||||
|
||||
# Maximum speed (more degradation)
|
||||
Q2_K or Q3_K_M
|
||||
### With importance matrix (imatrix) — better low-bit quality
|
||||
|
||||
# Maximum quality (slower)
|
||||
Q6_K or Q8_0
|
||||
`imatrix` gives 10–20% perplexity improvement at Q4, essential at Q3 and below.
|
||||
|
||||
# Very large models (70B, 405B)
|
||||
Q3_K_M or Q4_K_S # Lower bits to fit in memory
|
||||
```bash
|
||||
# 1. Convert to FP16 GGUF
|
||||
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
|
||||
|
||||
# 2. Prepare calibration data (diverse text, ~100MB is ideal)
|
||||
cat > calibration.txt << 'EOF'
|
||||
The quick brown fox jumps over the lazy dog.
|
||||
Machine learning is a subset of artificial intelligence.
|
||||
# Add more diverse text samples...
|
||||
EOF
|
||||
|
||||
# 3. Generate importance matrix
|
||||
./llama-imatrix -m model-f16.gguf \
|
||||
-f calibration.txt \
|
||||
--chunk 512 \
|
||||
-o model.imatrix \
|
||||
-ngl 35
|
||||
|
||||
# 4. Quantize with imatrix
|
||||
./llama-quantize --imatrix model.imatrix \
|
||||
model-f16.gguf model-q4_k_m.gguf Q4_K_M
|
||||
```
|
||||
|
||||
### Multi-quant batch
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
MODEL="llama-3.1-8b-f16.gguf"
|
||||
IMATRIX="llama-3.1-8b.imatrix"
|
||||
|
||||
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
|
||||
|
||||
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
|
||||
OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
|
||||
./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
|
||||
echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
|
||||
done
|
||||
```
|
||||
|
||||
### Quality testing (perplexity)
|
||||
|
||||
```bash
|
||||
./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
|
||||
# Baseline FP16: ~5.96 | Q4_K_M: ~6.06 (+1.7%) | Q2_K: ~6.87 (+15.3%)
|
||||
```
|
||||
|
||||
## Python bindings (llama-cpp-python)
|
||||
|
||||
### Basic generation
|
||||
|
||||
```python
|
||||
from llama_cpp import Llama
|
||||
|
||||
llm = Llama(
|
||||
model_path="./model-q4_k_m.gguf",
|
||||
n_ctx=4096,
|
||||
n_gpu_layers=35, # 0 for CPU only, 99 to offload everything
|
||||
n_threads=8,
|
||||
)
|
||||
|
||||
output = llm(
|
||||
"What is machine learning?",
|
||||
max_tokens=256,
|
||||
temperature=0.7,
|
||||
stop=["</s>", "\n\n"],
|
||||
)
|
||||
print(output["choices"][0]["text"])
|
||||
```
|
||||
|
||||
### Chat completion + streaming
|
||||
|
||||
```python
|
||||
llm = Llama(
|
||||
model_path="./model-q4_k_m.gguf",
|
||||
n_ctx=4096,
|
||||
n_gpu_layers=35,
|
||||
chat_format="llama-3", # Or "chatml", "mistral", etc.
|
||||
)
|
||||
|
||||
# Non-streaming
|
||||
response = llm.create_chat_completion(
|
||||
messages=[
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": "What is Python?"},
|
||||
],
|
||||
max_tokens=256,
|
||||
temperature=0.7,
|
||||
)
|
||||
print(response["choices"][0]["message"]["content"])
|
||||
|
||||
# Streaming
|
||||
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
|
||||
print(chunk["choices"][0]["text"], end="", flush=True)
|
||||
```
|
||||
|
||||
### Embeddings
|
||||
|
||||
```python
|
||||
llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
|
||||
vec = llm.embed("This is a test sentence.")
|
||||
print(f"Embedding dimension: {len(vec)}")
|
||||
```
|
||||
|
||||
## Hardware acceleration
|
||||
|
|
@ -140,122 +273,166 @@ Q3_K_M or Q4_K_S # Lower bits to fit in memory
|
|||
### Apple Silicon (Metal)
|
||||
|
||||
```bash
|
||||
# Build with Metal
|
||||
make LLAMA_METAL=1
|
||||
|
||||
# Run with GPU acceleration (automatic)
|
||||
./llama-cli -m model.gguf -ngl 999 # Offload all layers
|
||||
|
||||
# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
|
||||
make clean && make GGML_METAL=1
|
||||
./llama-cli -m model.gguf -ngl 99 -p "Hello" # offload all layers
|
||||
```
|
||||
|
||||
### NVIDIA GPUs (CUDA)
|
||||
|
||||
```bash
|
||||
# Build with CUDA
|
||||
make LLAMA_CUDA=1
|
||||
|
||||
# Offload layers to GPU
|
||||
./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
|
||||
|
||||
# Hybrid CPU+GPU for large models
|
||||
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
|
||||
```python
|
||||
llm = Llama(
|
||||
model_path="model.gguf",
|
||||
n_gpu_layers=99, # Offload everything
|
||||
n_threads=1, # Metal handles parallelism
|
||||
)
|
||||
```
|
||||
|
||||
### AMD GPUs (ROCm)
|
||||
Performance: M3 Max ~40–60 tok/s on Llama 2-7B Q4_K_M.
|
||||
|
||||
### NVIDIA (CUDA)
|
||||
|
||||
```bash
|
||||
make clean && make GGML_CUDA=1
|
||||
./llama-cli -m model.gguf -ngl 35 -p "Hello"
|
||||
|
||||
# Hybrid for large models
|
||||
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
|
||||
|
||||
# Multi-GPU split
|
||||
./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
|
||||
```
|
||||
|
||||
### AMD (ROCm)
|
||||
|
||||
```bash
|
||||
# Build with ROCm
|
||||
make LLAMA_HIP=1
|
||||
|
||||
# Run with AMD GPU
|
||||
./llama-cli -m model.gguf -ngl 999
|
||||
```
|
||||
|
||||
## Common patterns
|
||||
|
||||
### Batch processing
|
||||
### CPU
|
||||
|
||||
```bash
|
||||
# Process multiple prompts from file
|
||||
cat prompts.txt | ./llama-cli \
|
||||
-m model.gguf \
|
||||
--batch-size 512 \
|
||||
-n 100
|
||||
# Match PHYSICAL cores, not logical
|
||||
./llama-cli -m model.gguf -t 8 -p "Hello"
|
||||
|
||||
# BLAS acceleration (2–3× speedup)
|
||||
make LLAMA_OPENBLAS=1
|
||||
```
|
||||
|
||||
### Constrained generation
|
||||
|
||||
```bash
|
||||
# JSON output with grammar
|
||||
./llama-cli \
|
||||
-m model.gguf \
|
||||
-p "Generate a person: " \
|
||||
--grammar-file grammars/json.gbnf
|
||||
|
||||
# Outputs valid JSON only
|
||||
```
|
||||
|
||||
### Context size
|
||||
|
||||
```bash
|
||||
# Increase context (default 512)
|
||||
./llama-cli \
|
||||
-m model.gguf \
|
||||
-c 4096 # 4K context window
|
||||
|
||||
# Very long context (if model supports)
|
||||
./llama-cli -m model.gguf -c 32768 # 32K context
|
||||
```python
|
||||
llm = Llama(
|
||||
model_path="model.gguf",
|
||||
n_gpu_layers=0,
|
||||
n_threads=8,
|
||||
n_batch=512, # Larger batch = faster prompt processing
|
||||
)
|
||||
```
|
||||
|
||||
## Performance benchmarks
|
||||
|
||||
### CPU performance (Llama 2-7B Q4_K_M)
|
||||
### CPU (Llama 2-7B Q4_K_M)
|
||||
|
||||
| CPU | Threads | Speed | Cost |
|
||||
|-----|---------|-------|------|
|
||||
| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
|
||||
| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
|
||||
| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
|
||||
| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
|
||||
| CPU | Threads | Speed |
|
||||
|-----|---------|-------|
|
||||
| Apple M3 Max (Metal) | 16 | 50 tok/s |
|
||||
| AMD Ryzen 9 7950X | 32 | 35 tok/s |
|
||||
| Intel i9-13900K | 32 | 30 tok/s |
|
||||
|
||||
### GPU acceleration (Llama 2-7B Q4_K_M)
|
||||
### GPU offloading on RTX 4090
|
||||
|
||||
| GPU | Speed | vs CPU | Cost |
|
||||
|-----|-------|--------|------|
|
||||
| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
|
||||
| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
|
||||
| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
|
||||
| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
|
||||
| Layers GPU | Speed | VRAM |
|
||||
|------------|-------|------|
|
||||
| 0 (CPU only) | 30 tok/s | 0 GB |
|
||||
| 20 (hybrid) | 80 tok/s | 8 GB |
|
||||
| 35 (all) | 120 tok/s | 12 GB |
|
||||
|
||||
## Supported models
|
||||
|
||||
**LLaMA family**:
|
||||
- Llama 2 (7B, 13B, 70B)
|
||||
- Llama 3 (8B, 70B, 405B)
|
||||
- Code Llama
|
||||
- **LLaMA family**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama
|
||||
- **Mistral family**: Mistral 7B, Mixtral 8x7B/8x22B
|
||||
- **Other**: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)
|
||||
|
||||
**Mistral family**:
|
||||
- Mistral 7B
|
||||
- Mixtral 8x7B, 8x22B
|
||||
Find GGUF models: https://huggingface.co/models?library=gguf
|
||||
|
||||
**Other**:
|
||||
- Falcon, BLOOM, GPT-J
|
||||
- Phi-3, Gemma, Qwen
|
||||
- LLaVA (vision), Whisper (audio)
|
||||
## Ecosystem integrations
|
||||
|
||||
**Find models**: https://huggingface.co/models?library=gguf
|
||||
### Ollama
|
||||
|
||||
```bash
|
||||
cat > Modelfile << 'EOF'
|
||||
FROM ./model-q4_k_m.gguf
|
||||
TEMPLATE """{{ .System }}
|
||||
{{ .Prompt }}"""
|
||||
PARAMETER temperature 0.7
|
||||
PARAMETER num_ctx 4096
|
||||
EOF
|
||||
|
||||
ollama create mymodel -f Modelfile
|
||||
ollama run mymodel "Hello!"
|
||||
```
|
||||
|
||||
### LM Studio
|
||||
|
||||
1. Place GGUF file in `~/.cache/lm-studio/models/`
|
||||
2. Open LM Studio and select the model
|
||||
3. Configure context length and GPU offload, start inference
|
||||
|
||||
### text-generation-webui
|
||||
|
||||
```bash
|
||||
cp model-q4_k_m.gguf text-generation-webui/models/
|
||||
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
|
||||
```
|
||||
|
||||
### OpenAI client → llama-server
|
||||
|
||||
```python
|
||||
from openai import OpenAI
|
||||
|
||||
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
|
||||
response = client.chat.completions.create(
|
||||
model="local-model",
|
||||
messages=[{"role": "user", "content": "Hello!"}],
|
||||
max_tokens=256,
|
||||
)
|
||||
print(response.choices[0].message.content)
|
||||
```
|
||||
|
||||
## Best practices
|
||||
|
||||
1. **Use K-quants** — Q4_K_M is the recommended default
|
||||
2. **Use imatrix** for Q4 and below (calibration improves quality substantially)
|
||||
3. **Offload as many layers as VRAM allows** — start high, reduce by 5 on OOM
|
||||
4. **Thread count** — match physical cores, not logical
|
||||
5. **Batch size** — increase `n_batch` (e.g. 512) for faster prompt processing
|
||||
6. **Context** — start at 4096, grow only as needed (memory scales with ctx)
|
||||
7. **Flash Attention** — add `--flash-attn` if your build supports it
|
||||
|
||||
## Common issues (quick fixes)
|
||||
|
||||
**Model loads slowly** — use `--mmap` for memory-mapped loading.
|
||||
|
||||
**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:
|
||||
```python
|
||||
Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35) # Q4_0 KV cache
|
||||
```
|
||||
|
||||
**Garbage output** — wrong `chat_format`, temperature too high, or model file corrupted. Test with `temperature=0.1` and verify FP16 baseline works.
|
||||
|
||||
**Connection refused (server)** — bind to `--host 0.0.0.0`, check `lsof -i :8080`.
|
||||
|
||||
See `references/troubleshooting.md` for the full playbook.
|
||||
|
||||
## References
|
||||
|
||||
- **[Quantization Guide](references/quantization.md)** - GGUF formats, conversion, quality comparison
|
||||
- **[Server Deployment](references/server.md)** - API endpoints, Docker, monitoring
|
||||
- **[Optimization](references/optimization.md)** - Performance tuning, hybrid CPU+GPU
|
||||
- **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
|
||||
- **[quantization.md](references/quantization.md)** — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
|
||||
- **[server.md](references/server.md)** — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
|
||||
- **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
|
||||
- **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging
|
||||
|
||||
## Resources
|
||||
|
||||
- **GitHub**: https://github.com/ggerganov/llama.cpp
|
||||
- **Models**: https://huggingface.co/models?library=gguf
|
||||
- **Discord**: https://discord.gg/llama-cpp
|
||||
|
||||
|
||||
- **GitHub**: https://github.com/ggml-org/llama.cpp
|
||||
- **Python bindings**: https://github.com/abetlen/llama-cpp-python
|
||||
- **Pre-quantized models**: https://huggingface.co/TheBloke
|
||||
- **GGUF converter Space**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
|
||||
- **License**: MIT
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue