GGUF Troubleshooting Guide
Installation Issues
Build Fails
Error: make: *** No targets specified and no makefile found
Fix:
# Ensure you're in llama.cpp directory
cd llama.cpp
make
Error: fatal error: cuda_runtime.h: No such file or directory
Fix:
# Install CUDA toolkit
# Ubuntu
sudo apt install nvidia-cuda-toolkit
# Or set CUDA path
export CUDA_PATH=/usr/local/cuda
export PATH=$CUDA_PATH/bin:$PATH
make GGML_CUDA=1
Python Bindings Issues
Error: ERROR: Failed building wheel for llama-cpp-python
Fix:
# Install build dependencies
pip install cmake scikit-build-core
# For CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# For Metal (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Error: ImportError: libcudart.so.XX: cannot open shared object file
Fix:
# Add CUDA libraries to path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Or reinstall with correct CUDA version
pip uninstall llama-cpp-python
CUDACXX=/usr/local/cuda/bin/nvcc CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
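After reinstalling, confirm the wheel was actually built with GPU support before debugging further. A minimal check, assuming your llama-cpp-python version exposes the low-level llama_supports_gpu_offload() binding:
# Check whether the installed wheel was built with GPU offload support
from llama_cpp import llama_cpp as llama_lib
print("GPU offload available:", llama_lib.llama_supports_gpu_offload())
If this prints False, the wheel was built CPU-only; reinstall with the CMAKE_ARGS shown above.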
Conversion Issues
Model Not Supported
Error: KeyError: 'model.embed_tokens.weight'
Fix:
# Check model architecture
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./model').architectures)"
# Use appropriate conversion script
# For most models:
python convert_hf_to_gguf.py ./model --outfile model.gguf
# For older models, check if legacy script needed
Vocabulary Mismatch
Error: RuntimeError: Vocabulary size mismatch
Fix:
# Ensure tokenizer matches model
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForCausalLM.from_pretrained("./model")
print(f"Tokenizer vocab size: {len(tokenizer)}")
print(f"Model vocab size: {model.config.vocab_size}")
# If mismatch, resize embeddings before conversion
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained("./model-fixed")
Out of Memory During Conversion
Error: torch.cuda.OutOfMemoryError during conversion
Fix:
# Use CPU for conversion
CUDA_VISIBLE_DEVICES="" python convert_hf_to_gguf.py ./model --outfile model.gguf
# Or write f16 output; --use-temp-file (if your convert script supports it) lowers peak memory
python convert_hf_to_gguf.py ./model --outfile model.gguf --outtype f16 --use-temp-file
Quantization Issues
Wrong Output File Size
Problem: Quantized file is larger than expected
Check:
# Verify quantization type
./llama-cli -m model.gguf --verbose
# Expected sizes for 7B model:
# Q4_K_M: ~4.1 GB
# Q5_K_M: ~4.8 GB
# Q8_0: ~7.2 GB
# F16: ~13.5 GB
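If the reported quantization type looks correct but the size still seems off, a rough bits-per-weight estimate shows what is actually stored in the file. A small sketch (the parameter count is an example you supply, not read from the file):
# Estimate bits per weight from file size and parameter count
import os
n_params = 7_000_000_000  # example: a 7B model
size_bits = os.path.getsize("model.gguf") * 8
print(f"~{size_bits / n_params:.2f} bits per weight")
# Q4_K_M is roughly 4.8 bpw, Q8_0 roughly 8.5 bpw, F16 exactly 16 bpw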
Quantization Crashes
Error: Segmentation fault during quantization
Fix:
# Increase stack size
ulimit -s unlimited
# Or use fewer threads (thread count is the last positional argument)
./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M 4
Poor Quality After Quantization
Problem: Model outputs gibberish after quantization
Solutions:
- Use an importance matrix:
# Generate imatrix with good calibration data
./llama-imatrix -m model-f16.gguf \
-f wiki_sample.txt \
--chunk 512 \
-o model.imatrix
# Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
model-f16.gguf model-q4_k_m.gguf Q4_K_M
- Try higher precision:
# Use Q5_K_M or Q6_K instead of Q4
./llama-quantize model-f16.gguf model-q5_k_m.gguf Q5_K_M
- Check original model:
# Test FP16 version first
./llama-cli -m model-f16.gguf -p "Hello, how are you?" -n 50
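If the FP16 model behaves but the quantized one does not, comparing both on the same prompt with greedy decoding makes the regression obvious. A minimal side-by-side sketch using llama-cpp-python (file names are placeholders):
# Same prompt, greedy decoding, both model files
from llama_cpp import Llama
prompt = "The capital of France is"
for path in ["model-f16.gguf", "model-q4_k_m.gguf"]:
    llm = Llama(model_path=path, verbose=False)
    out = llm(prompt, max_tokens=20, temperature=0)
    print(path, "->", out["choices"][0]["text"].strip())
If only the quantized output degrades, requantize with an imatrix or a higher-precision type as described above.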
Inference Issues
Slow Generation
Problem: Generation is slower than expected
Solutions:
- Enable GPU offload:
./llama-cli -m model.gguf -ngl 35 -p "Hello"
- Optimize batch size:
from llama_cpp import Llama
llm = Llama(
    model_path="model.gguf",
    n_batch=512,       # Increase for faster prompt processing
    n_gpu_layers=35
)
- Use an appropriate thread count:
# Match physical cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"
- Enable Flash Attention (if supported):
./llama-cli -m model.gguf -ngl 35 --flash-attn -p "Hello"
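To tell whether these settings actually help, measure throughput before and after each change. A minimal timing sketch with llama-cpp-python (model path and layer count are placeholders):
# Measure generation speed in tokens per second
import time
from llama_cpp import Llama
llm = Llama(model_path="model.gguf", n_gpu_layers=35, n_ctx=2048, verbose=False)
start = time.time()
out = llm("Explain GGUF in one sentence.", max_tokens=128, temperature=0)
elapsed = time.time() - start
generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s = {generated / elapsed:.1f} tok/s")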
Out of Memory
Error: CUDA out of memory or system freeze
Solutions:
- Reduce GPU layers:
# Start low and increase
llm = Llama(model_path="model.gguf", n_gpu_layers=10)
- Use smaller quantization:
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
- Reduce context length:
llm = Llama(
    model_path="model.gguf",
    n_ctx=2048,        # Reduce from 4096
    n_gpu_layers=35
)
- Quantize KV cache:
llm = Llama(
    model_path="model.gguf",
    type_k=2,          # Q4_0 for K cache
    type_v=2,          # Q4_0 for V cache
    flash_attn=True,   # Quantizing the V cache requires flash attention
    n_gpu_layers=35
)
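To size these settings sensibly, it helps to know roughly how much memory the KV cache itself needs. A back-of-the-envelope estimate, using example Llama-2-7B-style dimensions rather than values read from the model:
# Rough KV cache size: K and V, per layer, per context position
n_layers, n_kv_heads, head_dim = 32, 32, 128   # example 7B architecture
n_ctx = 4096
bytes_per_elem = 2                             # f16 cache; ~0.56 for a Q4_0 cache
kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
print(f"KV cache ~= {kv_bytes / 1024**3:.2f} GiB")
At these numbers the f16 cache alone is about 2 GiB, which is why halving n_ctx or quantizing the cache frees noticeable VRAM.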
Garbage Output
Problem: Model outputs random characters or nonsense
Diagnose:
# Check model loading
llm = Llama(model_path="model.gguf", verbose=True)
# Test with simple prompt
output = llm("1+1=", max_tokens=5, temperature=0)
print(output)
Solutions:
- Check model integrity:
# Verify GGUF file
./llama-cli -m model.gguf --verbose 2>&1 | head -50
- Use correct chat format:
llm = Llama(
    model_path="model.gguf",
    chat_format="llama-3"  # Match your model: chatml, mistral, etc.
)
- Check temperature:
# Use lower temperature for deterministic output
output = llm("Hello", max_tokens=50, temperature=0.1)
Token Issues
Error: RuntimeError: unknown token or encoding errors
Fix:
# Ensure the prompt is valid UTF-8; replace undecodable bytes when reading external text
with open("prompt.txt", "rb") as f:
    prompt = f.read().decode("utf-8", errors="replace")
output = llm(prompt, max_tokens=50)
Server Issues
Connection Refused
Error: Connection refused when accessing server
Fix:
# Bind to all interfaces
./llama-server -m model.gguf --host 0.0.0.0 --port 8080
# Check if port is in use
lsof -i :8080
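Once the server is bound correctly, a quick request from the client machine confirms it is reachable. Recent llama-server builds expose a /health endpoint (check your version); a minimal probe:
# Probe the server's health endpoint
import urllib.request
with urllib.request.urlopen("http://localhost:8080/health", timeout=5) as resp:
    print(resp.status, resp.read().decode())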
Server Crashes Under Load
Problem: Server crashes with multiple concurrent requests
Solutions:
- Limit parallelism:
./llama-server -m model.gguf \
--parallel 2 \
-c 4096 \
--cont-batching
- Add request timeout:
./llama-server -m model.gguf --timeout 300
- Monitor memory:
watch -n 1 nvidia-smi # For GPU
watch -n 1 free -h # For RAM
API Compatibility Issues
Problem: OpenAI client not working with server
Fix:
from openai import OpenAI
# Use correct base URL format
client = OpenAI(
    base_url="http://localhost:8080/v1",  # Include /v1
    api_key="not-needed"
)
# Use correct model name
response = client.chat.completions.create(
    model="local",  # Or the actual model name
    messages=[{"role": "user", "content": "Hello"}]
)
Apple Silicon Issues
Metal Not Working
Problem: Metal acceleration not enabled
Check:
# Verify Metal support
./llama-cli -m model.gguf --verbose 2>&1 | grep -i metal
Fix:
# Rebuild with Metal
make clean
make GGML_METAL=1
# Python bindings
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall
Incorrect Memory Usage on M1/M2
Problem: Model uses too much unified memory
Fix:
# Offload all layers for Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,   # Offload everything
    n_threads=1        # Metal handles parallelism
)
Debugging
Enable Verbose Output
# CLI verbose mode
./llama-cli -m model.gguf --verbose -p "Hello" -n 50
# Python verbose
llm = Llama(model_path="model.gguf", verbose=True)
Check Model Metadata
# View GGUF metadata
./llama-cli -m model.gguf --verbose 2>&1 | head -100
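For a more structured view than the verbose log, the gguf Python package that ships with llama.cpp (pip install gguf) can read the metadata directly. A small sketch:
# List metadata keys and count tensors with the gguf package
from gguf import GGUFReader
reader = GGUFReader("model.gguf")
for key in list(reader.fields)[:20]:   # first 20 metadata keys
    print(key)
print(f"{len(reader.tensors)} tensors")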
Validate GGUF File
import struct
def validate_gguf(filepath):
    # Read the GGUF header: magic, version, tensor count, metadata count
    with open(filepath, 'rb') as f:
        magic = f.read(4)
        if magic != b'GGUF':
            print(f"Invalid magic: {magic}")
            return False
        version = struct.unpack('<I', f.read(4))[0]
        print(f"GGUF version: {version}")
        tensor_count = struct.unpack('<Q', f.read(8))[0]
        metadata_count = struct.unpack('<Q', f.read(8))[0]
        print(f"Tensors: {tensor_count}, Metadata: {metadata_count}")
        return True
validate_gguf("model.gguf")
Getting Help
- GitHub Issues: https://github.com/ggml-org/llama.cpp/issues
- Discussions: https://github.com/ggml-org/llama.cpp/discussions
- Reddit: r/LocalLLaMA
Reporting Issues
Include:
- llama.cpp version/commit hash
- Build command used
- Model name and quantization
- Full error message/stack trace
- Hardware: CPU/GPU model, RAM, VRAM
- OS version
- Minimal reproduction steps