hermes-agent/website/docs/user-guide/skills/bundled/mlops/mlops-inference-llama-cpp.md
Teknium 252d68fd45
docs: deep audit — fix stale config keys, missing commands, and registry drift (#22784)
2026-05-09 13:19:51 -07:00


title: Llama Cpp — llama
sidebar_label: Llama Cpp
description: llama

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

Llama Cpp

llama.cpp local GGUF inference + HF Hub model discovery.

Skill metadata

Source Bundled (installed by default)
Path skills/mlops/inference/llama-cpp
Version 2.1.2
Author Orchestra Research
License MIT
Dependencies llama-cpp-python>=0.2.0
Platforms linux, macos, windows
Tags llama.cpp, GGUF, Quantization, Hugging Face Hub, CPU Inference, Apple Silicon, Edge Deployment, AMD GPUs, Intel GPUs, NVIDIA, URL-first

Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

llama.cpp + GGUF

Use this skill for local GGUF inference, quant selection, or Hugging Face repo discovery for llama.cpp.

When to use

  • Run local models on CPU, Apple Silicon, CUDA, ROCm, or Intel GPUs
  • Find the right GGUF for a specific Hugging Face repo
  • Build a llama-server or llama-cli command from the Hub
  • Search the Hub for models that already support llama.cpp
  • Enumerate available .gguf files and sizes for a repo
  • Decide between Q4/Q5/Q6/IQ variants for the user's RAM or VRAM

Model Discovery workflow

Prefer URL workflows before asking for hf, Python, or custom scripts.

  1. Search for candidate repos on the Hub:
    • Base: https://huggingface.co/models?apps=llama.cpp&sort=trending
    • Add search=<term> for a model family
    • Add num_parameters=min:0,max:24B or similar when the user has size constraints
  2. Open the repo with the llama.cpp local-app view:
    • https://huggingface.co/<repo>?local-app=llama.cpp
  3. Treat the local-app snippet as the source of truth when it is visible:
    • copy the exact llama-server or llama-cli command
    • report the recommended quant exactly as HF shows it
  4. Read the same ?local-app=llama.cpp URL as page text or HTML and extract the section under Hardware compatibility:
    • prefer its exact quant labels and sizes over generic tables
    • keep repo-specific labels such as UD-Q4_K_M or IQ4_NL_XL
    • if that section is not visible in the fetched page source, say so and fall back to the tree API plus generic quant guidance
  5. Query the tree API to confirm what actually exists:
    • https://huggingface.co/api/models/<repo>/tree/main?recursive=true
    • keep entries where type is file and path ends with .gguf
    • use path and size as the source of truth for filenames and byte sizes
    • separate quantized checkpoints from mmproj-*.gguf projector files and BF16/ shard files
    • use https://huggingface.co/<repo>/tree/main only as a human fallback
  6. If the local-app snippet is not text-visible, reconstruct the command from the repo plus the chosen quant:
    • shorthand quant selection: llama-server -hf <repo>:<QUANT>
    • exact-file fallback: llama-server --hf-repo <repo> --hf-file <filename.gguf>
  7. Only suggest conversion from Transformers weights if the repo does not already expose GGUF files.
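
Step 5 of the workflow above can be sketched as a small filter over the parsed tree API response. This is a minimal sketch, not the skill's own implementation: `classify_ggufs` and `SAMPLE_TREE` are hypothetical names, the sample entries and sizes are made up for illustration, and the only fields assumed are the `type`, `path`, and `size` keys named in the workflow.

```python
from typing import Dict, List

# Hypothetical sample shaped like the tree API response described above:
# a JSON list of entries carrying "type", "path", and "size".
SAMPLE_TREE = [
    {"type": "file", "path": "README.md", "size": 5120},
    {"type": "directory", "path": "BF16", "size": 0},
    {"type": "file", "path": "BF16/model-BF16-00001-of-00002.gguf", "size": 40_000_000_000},
    {"type": "file", "path": "model-Q4_K_M.gguf", "size": 2_000_000_000},
    {"type": "file", "path": "model-Q8_0.gguf", "size": 3_400_000_000},
    {"type": "file", "path": "mmproj-F16.gguf", "size": 800_000_000},
]


def classify_ggufs(entries: List[dict]) -> Dict[str, List[dict]]:
    """Keep .gguf files and split them into quantized, projector, and BF16 shard groups."""
    groups: Dict[str, List[dict]] = {"quantized": [], "projector": [], "bf16_shard": []}
    for entry in entries:
        path = entry.get("path", "")
        # Keep entries where type is "file" and the path ends with .gguf.
        if entry.get("type") != "file" or not path.endswith(".gguf"):
            continue
        name = path.rsplit("/", 1)[-1]
        if name.startswith("mmproj-"):
            groups["projector"].append(entry)
        elif path.startswith("BF16/"):
            groups["bf16_shard"].append(entry)
        else:
            groups["quantized"].append(entry)
    return groups


groups = classify_ggufs(SAMPLE_TREE)
for kind, items in groups.items():
    for item in items:
        print(f"{kind:10s} {item['path']}  {item['size']} bytes")
```

In real use the list would come from the JSON body of the tree API URL; the filtering logic stays the same.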

Quick start

Install llama.cpp

# macOS / Linux (simplest)
brew install llama.cpp

# Windows
winget install llama.cpp

# Build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

Run directly from the Hugging Face Hub

llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0

Run an exact GGUF file from the Hub

Use this when the tree API shows custom file naming or the exact HF snippet is missing.

llama-server \
    --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf \
    --hf-file Phi-3-mini-4k-instruct-q4.gguf \
    -c 4096
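
The two command shapes used so far (shorthand `-hf <repo>:<QUANT>` and the exact-file `--hf-repo` / `--hf-file` fallback) can be assembled programmatically. The helper below is a hypothetical sketch: `build_llama_server_cmd` is not part of llama.cpp, and it only uses the flags that already appear in this page.

```python
import shlex
from typing import List, Optional


def build_llama_server_cmd(
    repo: str,
    quant: Optional[str] = None,
    filename: Optional[str] = None,
    ctx_size: Optional[int] = None,
) -> List[str]:
    """Return an argv list for llama-server, preferring an exact file when one is given."""
    if filename:
        # Exact-file fallback: custom file naming or no usable HF snippet.
        cmd = ["llama-server", "--hf-repo", repo, "--hf-file", filename]
    elif quant:
        # Shorthand quant selection.
        cmd = ["llama-server", "-hf", f"{repo}:{quant}"]
    else:
        raise ValueError("need either a quant label or an exact filename")
    if ctx_size is not None:
        cmd += ["-c", str(ctx_size)]
    return cmd


shorthand = shlex.join(
    build_llama_server_cmd("bartowski/Llama-3.2-3B-Instruct-GGUF", quant="Q8_0")
)
exact = shlex.join(
    build_llama_server_cmd(
        "microsoft/Phi-3-mini-4k-instruct-gguf",
        filename="Phi-3-mini-4k-instruct-q4.gguf",
        ctx_size=4096,
    )
)
print(shorthand)
print(exact)
```

Returning an argv list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting; `shlex.join` is used only for display.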

OpenAI-compatible server check

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a limerick about Python exceptions"}
    ]
  }'
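
The same check can be made from Python with only the standard library. This is a sketch under assumptions: it targets the `http://localhost:8080` address from the curl example, `build_chat_request` and `send_chat` are hypothetical helper names, and the response is assumed to follow the OpenAI chat-completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build the POST request that mirrors the curl check above."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant text (assumes an OpenAI-style body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("Write a limerick about Python exceptions")
print(req.get_method(), req.full_url)
```

With a llama-server instance running, `print(send_chat(req))` would print the reply; the call is left out of the block so it can be read and run without a live server.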

Python bindings (llama-cpp-python)

Install the bindings with pip install llama-cpp-python. For a GPU build, set CMAKE_ARGS before installing. CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir. Metal: CMAKE_ARGS="-DGGML_METAL=on" ...

Basic generation

from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,     # 0 for CPU, 99 to offload everything
    n_threads=8,
)

out = llm("What is machine learning?", max_tokens=256, temperature=0.7)
print(out["choices"][0]["text"])

Chat + streaming

from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3",   # or "chatml", "mistral", etc.
)

# Chat completion: returns an OpenAI-style response dict
resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])

# Streaming: iterate over chunks and print each piece as it arrives
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)

Embeddings

from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
vec = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(vec)}")

You can also load a GGUF straight from the Hub:

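from llama_cpp import Llama

# Requires the huggingface-hub package; filename accepts a glob pattern
llm = Llama.from_pretrained(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="*Q4_K_M.gguf",
    n_gpu_layers=35,
)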

Choosing a quant

Use the Hub page first, generic heuristics second.

  • Prefer the exact quant that HF marks as compatible for the user's hardware profile.
  • For general chat, start with Q4_K_M.
  • For code or technical work, prefer Q5_K_M or Q6_K if memory allows.
  • For very tight RAM budgets, consider Q3_K_M, IQ variants, or Q2 variants only if the user explicitly prioritizes fit over quality.
  • For multimodal repos, mention mmproj-*.gguf separately. The projector is not the main model file.
  • Do not normalize repo-native labels. If the page says UD-Q4_K_M, report UD-Q4_K_M.
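
The generic heuristics above can be sketched as a small selection function for cases where the Hub page gives no hardware-specific recommendation. Everything here is an illustrative assumption rather than part of llama.cpp or the skill: `pick_quant` is a hypothetical helper, the file sizes are made up, and the `headroom` factor of 1.2 (extra room for context and runtime buffers) is a placeholder that should be tuned per setup.

```python
from typing import Dict, Optional

# Preference orders paraphrase the bullets above. These are generic labels;
# repo-native labels such as UD-Q4_K_M should be passed through unchanged.
PREFERENCES = {
    "chat": ["Q4_K_M"],
    "code": ["Q6_K", "Q5_K_M", "Q4_K_M"],
}
LOW_BIT_FALLBACKS = ["Q3_K_M"]  # only when the user prioritizes fit over quality


def pick_quant(
    sizes: Dict[str, int],
    budget_bytes: int,
    task: str = "chat",
    fit_over_quality: bool = False,
    headroom: float = 1.2,  # assumed safety margin, not a measured value
) -> Optional[str]:
    """Return the first preferred quant whose file fits the memory budget, else None."""
    order = list(PREFERENCES[task])
    if fit_over_quality:
        order += LOW_BIT_FALLBACKS
    for label in order:
        size = sizes.get(label)
        if size is not None and size * headroom <= budget_bytes:
            return label
    return None


# Hypothetical file sizes in bytes for a small model.
SIZES = {
    "Q3_K_M": 1_600_000_000,
    "Q4_K_M": 2_000_000_000,
    "Q5_K_M": 2_300_000_000,
    "Q6_K": 2_700_000_000,
}

print(pick_quant(SIZES, budget_bytes=3_000_000_000, task="code"))
print(pick_quant(SIZES, budget_bytes=2_000_000_000, task="chat"))
print(pick_quant(SIZES, budget_bytes=2_000_000_000, task="chat", fit_over_quality=True))
```

Returning `None` instead of silently dropping to a lower-bit quant keeps the "fit over quality" decision explicit, which matches the guidance that Q3 and Q2 variants need the user's say-so.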

Extracting available GGUFs from a repo

When the user asks what GGUFs exist, return:

  • filename
  • file size
  • quant label
  • whether it is a main model or an auxiliary projector

Ignore unless requested:

  • README
  • BF16 shard files
  • imatrix blobs or calibration artifacts

Use the tree API for this step:

  • https://huggingface.co/api/models/<repo>/tree/main?recursive=true

For a repo like unsloth/Qwen3.6-35B-A3B-GGUF, the local-app page can show quant chips such as UD-Q4_K_M, UD-Q5_K_M, UD-Q6_K, and Q8_0, while the tree API exposes exact file paths such as Qwen3.6-35B-A3B-UD-Q4_K_M.gguf and Qwen3.6-35B-A3B-Q8_0.gguf with byte sizes. Use the tree API to turn a quant label into an exact filename.
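
Turning quant labels into exact filenames can be sketched as a suffix match against the tree entries. This is a hypothetical helper, not an official API: `map_labels_to_files` assumes single-file quants named `<model>-<LABEL>.gguf`, as in the example filenames above, and picks the longest matching label so that a bare `Q4_K_M` never claims a `UD-Q4_K_M` file. Byte sizes are omitted from the sample because the real values are not given here.

```python
from typing import Dict, List

# File paths taken from the example above, plus a made-up projector file.
SAMPLE_TREE = [
    {"type": "file", "path": "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"},
    {"type": "file", "path": "Qwen3.6-35B-A3B-Q8_0.gguf"},
    {"type": "file", "path": "mmproj-F16.gguf"},  # hypothetical entry
    {"type": "file", "path": "README.md"},
]


def map_labels_to_files(entries: List[dict], labels: List[str]) -> Dict[str, str]:
    """Map each quant label to the .gguf path whose stem ends with '-<label>'."""
    mapping: Dict[str, str] = {}
    for entry in entries:
        path = entry.get("path", "")
        name = path.rsplit("/", 1)[-1]
        if entry.get("type") != "file" or not name.endswith(".gguf"):
            continue
        if name.startswith("mmproj-"):
            continue  # projector files are not main model files
        stem = name[: -len(".gguf")]
        matches = [label for label in labels if stem.endswith("-" + label)]
        if matches:
            # The longest label wins, so repo-native prefixes are preserved.
            mapping[max(matches, key=len)] = path
    return mapping


chips = ["UD-Q4_K_M", "UD-Q5_K_M", "UD-Q6_K", "Q8_0"]
mapping = map_labels_to_files(SAMPLE_TREE, chips)
for label, path in mapping.items():
    print(f"{label} -> {path}")
```

Labels with no matching file are simply absent from the result, which is a cue to say that the quant was shown on the page but not found in the tree.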

Search patterns

Use these URL shapes directly:

https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
https://huggingface.co/<repo>?local-app=llama.cpp
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
https://huggingface.co/<repo>/tree/main
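
These URL shapes can be generated with the standard library. The helper names below (`search_url`, `local_app_url`, `tree_api_url`) are hypothetical; the query parameters are exactly the ones listed above, and `safe=":,"` keeps the `min:0,max:24B` value readable instead of percent-encoding it.

```python
from typing import Optional
from urllib.parse import urlencode

HUB = "https://huggingface.co"


def search_url(term: Optional[str] = None, max_params: Optional[str] = None) -> str:
    """Build a Hub search URL restricted to llama.cpp-compatible models."""
    query = {}
    if term:
        query["search"] = term
    query["apps"] = "llama.cpp"
    if max_params:
        query["num_parameters"] = f"min:0,max:{max_params}"
    query["sort"] = "trending"
    return f"{HUB}/models?" + urlencode(query, safe=":,")


def local_app_url(repo: str) -> str:
    """Repo page with the llama.cpp local-app view."""
    return f"{HUB}/{repo}?local-app=llama.cpp"


def tree_api_url(repo: str) -> str:
    """Recursive tree API listing for the repo's main branch."""
    return f"{HUB}/api/models/{repo}/tree/main?recursive=true"


print(search_url())
print(search_url("qwen", max_params="24B"))
print(local_app_url("bartowski/Llama-3.2-3B-Instruct-GGUF"))
print(tree_api_url("bartowski/Llama-3.2-3B-Instruct-GGUF"))
```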

Output format

When answering discovery requests, prefer a compact structured result like:

Repo: <repo>
Recommended quant from HF: <label> (<size>)
llama-server: <command>
Other GGUFs:
- <filename> - <size>
- <filename> - <size>
Source URLs:
- <local-app URL>
- <tree API URL>
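
A result in this shape can be rendered from already-collected data. The sketch below is illustrative only: `format_discovery_result` and `human_size` are hypothetical helpers, the repo name and sizes are made up, and decimal units (1 GB = 10^9 bytes) are an assumption about how sizes should be displayed.

```python
from typing import List, Tuple


def human_size(num_bytes: int) -> str:
    """Format a byte count with decimal units (an assumed display convention)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1000:
            return f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} TB"


def format_discovery_result(
    repo: str,
    quant_label: str,
    quant_size: int,
    command: str,
    other_files: List[Tuple[str, int]],
) -> str:
    """Render the compact structured result described above."""
    lines = [
        f"Repo: {repo}",
        f"Recommended quant from HF: {quant_label} ({human_size(quant_size)})",
        f"llama-server: {command}",
        "Other GGUFs:",
    ]
    lines += [f"- {name} - {human_size(size)}" for name, size in other_files]
    lines += [
        "Source URLs:",
        f"- https://huggingface.co/{repo}?local-app=llama.cpp",
        f"- https://huggingface.co/api/models/{repo}/tree/main?recursive=true",
    ]
    return "\n".join(lines)


# Hypothetical repo and sizes, for illustration only.
text = format_discovery_result(
    repo="example-org/example-model-GGUF",
    quant_label="Q4_K_M",
    quant_size=2_000_000_000,
    command="llama-server -hf example-org/example-model-GGUF:Q4_K_M",
    other_files=[("example-model-Q8_0.gguf", 3_400_000_000)],
)
print(text)
```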

References

  • hub-discovery.md - URL-only Hugging Face workflows, search patterns, GGUF extraction, and command reconstruction
  • advanced-usage.md - speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
  • quantization.md - quant quality tradeoffs, when to use Q4/Q5/Q6/IQ, model size scaling, imatrix
  • server.md - direct-from-Hub server launch, OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
  • optimization.md - CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
  • troubleshooting.md - install/convert/quantize/inference/server issues, Apple Silicon, debugging

Resources