Mirror of https://github.com/NousResearch/hermes-agent.git
Synced 2026-04-25 00:51:20 +00:00
fix(skills/llama-cpp): concise description, restore python bindings, fix curl
- Description gets truncated to 60 chars in the system prompt (extract_skill_description), so the 500-char HF workflow description never reached the agent; shortened to 'llama.cpp local GGUF inference + HF Hub model discovery.' (56 chars).
- Restore llama-cpp-python section (basic, chat + stream, embeddings, Llama.from_pretrained) and the frontmatter dependencies entry.
- Fix broken 'Authorization: Bearer ***' curl line (missing closing quote; llama-server doesn't require auth by default).
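The truncation behavior the first bullet describes can be sketched as follows. This is a hypothetical reconstruction from the commit message alone: the 60-char limit and simple head-truncation are assumptions, and the real extract_skill_description in hermes-agent may differ.

```python
MAX_DESC_LEN = 60  # assumed limit, per the commit message


def extract_skill_description(frontmatter: dict) -> str:
    """Hypothetical sketch: pull the description and hard-truncate it."""
    desc = frontmatter.get("description", "")
    return desc if len(desc) <= MAX_DESC_LEN else desc[:MAX_DESC_LEN]


new_desc = "llama.cpp local GGUF inference + HF Hub model discovery."
assert len(new_desc) == 56  # fits under the limit, reaches the agent intact
print(extract_skill_description({"description": new_desc}))
```

Under this model, the old ~500-char description would be cut mid-sentence at 60 chars, which is why the shortened one-liner was chosen instead.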
This commit is contained in:
parent d6cf2cc058
commit 7ff7155cbd
2 changed files with 66 additions and 3 deletions
@@ -1,9 +1,10 @@
 ---
 name: llama-cpp
-description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA. Covers GGUF quant selection, Hugging Face Hub model search with `apps=llama.cpp`, hardware-aware quant recommendations from `?local-app=llama.cpp`, extracting available `.gguf` files from the Hugging Face tree API, and building the right `llama-cli` or `llama-server` command directly from Hub URLs.
-version: 2.1.1
+description: llama.cpp local GGUF inference + HF Hub model discovery.
+version: 2.1.2
 author: Orchestra Research
 license: MIT
+dependencies: [llama-cpp-python>=0.2.0]
 metadata:
   hermes:
     tags: [llama.cpp, GGUF, Quantization, Hugging Face Hub, CPU Inference, Apple Silicon, Edge Deployment, AMD GPUs, Intel GPUs, NVIDIA, URL-first]
@@ -96,7 +97,6 @@ llama-server \
 ```bash
 curl http://localhost:8080/v1/chat/completions \
   -H "Content-Type: application/json" \
-  -H "Authorization: Bearer no-key" \
   -d '{
     "messages": [
       {"role": "user", "content": "Write a limerick about Python exceptions"}
@@ -104,6 +104,68 @@ curl http://localhost:8080/v1/chat/completions \
 }'
 ```
 
+## Python bindings (llama-cpp-python)
+
+`pip install llama-cpp-python` (CUDA: `CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir`; Metal: `CMAKE_ARGS="-DGGML_METAL=on" ...`).
+
+### Basic generation
+
+```python
+from llama_cpp import Llama
+
+llm = Llama(
+    model_path="./model-q4_k_m.gguf",
+    n_ctx=4096,
+    n_gpu_layers=35,  # 0 for CPU, 99 to offload everything
+    n_threads=8,
+)
+
+out = llm("What is machine learning?", max_tokens=256, temperature=0.7)
+print(out["choices"][0]["text"])
+```
+
+### Chat + streaming
+
+```python
+llm = Llama(
+    model_path="./model-q4_k_m.gguf",
+    n_ctx=4096,
+    n_gpu_layers=35,
+    chat_format="llama-3",  # or "chatml", "mistral", etc.
+)
+
+resp = llm.create_chat_completion(
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "What is Python?"},
+    ],
+    max_tokens=256,
+)
+print(resp["choices"][0]["message"]["content"])
+
+# Streaming
+for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
+    print(chunk["choices"][0]["text"], end="", flush=True)
+```
+
+### Embeddings
+
+```python
+llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
+vec = llm.embed("This is a test sentence.")
+print(f"Embedding dimension: {len(vec)}")
+```
+
+You can also load a GGUF straight from the Hub:
+
+```python
+llm = Llama.from_pretrained(
+    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
+    filename="*Q4_K_M.gguf",
+    n_gpu_layers=35,
+)
+```
+
 ## Choosing a quant
 
 Use the Hub page first, generic heuristics second.