mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-21 05:11:26 +00:00

docs: deep audit — fix stale config keys, missing commands, and registry drift (#22784 )

* docs: deep audit — fix stale config keys, missing commands, and registry drift

Cross-checked ~80 high-impact docs pages (getting-started, reference, top-level
user-guide, user-guide/features) against the live registries:

  hermes_cli/commands.py    COMMAND_REGISTRY (slash commands)
  hermes_cli/auth.py        PROVIDER_REGISTRY (providers)
  hermes_cli/config.py      DEFAULT_CONFIG (config keys)
  toolsets.py               TOOLSETS (toolsets)
  tools/registry.py         get_all_tool_names() (tools)
  python -m hermes_cli.main <subcmd> --help (CLI args)

reference/
- cli-commands.md: drop duplicate hermes fallback row + duplicate section,
  add stepfun/lmstudio to --provider enum, expand auth/mcp/curator subcommand
  lists to match --help output (status/logout/spotify, login, archive/prune/
  list-archived).
- slash-commands.md: add missing /sessions and /reload-skills entries +
  correct the cross-platform Notes line.
- tools-reference.md: drop bogus '68 tools' headline, drop fictional
  'browser-cdp toolset' (these tools live in 'browser' and are runtime-gated),
  add missing 'kanban' and 'video' toolset sections, fix MCP example to use
  the real mcp_<server>_<tool> prefix.
- toolsets-reference.md: list browser_cdp/browser_dialog inside the 'browser'
  row, add missing 'kanban' and 'video' toolset rows, drop the stale
  '38 tools' count for hermes-cli.
- profile-commands.md: add missing install/update/info subcommands, document
  fish completion.
- environment-variables.md: dedupe GMI_API_KEY/GMI_BASE_URL rows (kept the
  one with the correct gmi-serving.com default).
- faq.md: Anthropic/Google/OpenAI examples — direct providers exist (not just
  via OpenRouter), refresh the OpenAI model list.

getting-started/
- installation.md: PortableGit (not MinGit) is what the Windows installer
  fetches; document the 32-bit MinGit fallback.
- installation.md / termux.md: installer prefers .[termux-all] then falls
  back to .[termux].
- nix-setup.md: Python 3.12 (not 3.11), Node.js 22 (not 20); fix invalid
  'nix flake update --flake' invocation.
- updating.md: 'hermes backup restore --state pre-update' doesn't exist —
  point at the snapshot/quick-snapshot flow; correct config key
  'updates.pre_update_backup' (was 'update.backup').

user-guide/
- configuration.md: api_max_retries default 3 (not 2); display.runtime_footer
  is the real key (not display.runtime_metadata_footer); checkpoints defaults
  enabled=false / max_snapshots=20 (not true / 50).
- configuring-models.md: 'hermes model list' / 'hermes model set ...' don't
  exist — hermes model is interactive only.
- tui.md: busy_indicator -> tui_status_indicator with values
  kaomoji|emoji|unicode|ascii (not kawaii|minimal|dots|wings|none).
- security.md: SSH backend keys (TERMINAL_SSH_HOST/USER/KEY) live in .env,
  not config.yaml.
- windows-wsl-quickstart.md: there is no 'hermes api' subcommand — the
  OpenAI-compatible API server runs inside hermes gateway.

user-guide/features/
- computer-use.md: approvals.mode (not security.approval_level); fix broken
  ./browser-use.md link to ./browser.md.
- fallback-providers.md: top-level fallback_providers (not
  model.fallback_providers); the picker is subcommand-based, not modal.
- api-server.md: API_SERVER_* are env vars — write to per-profile .env,
  not 'hermes config set' which targets YAML.
- web-search.md: drop web_crawl as a registered tool (it isn't); deep-crawl
  modes are exposed through web_extract.
- kanban.md: failure_limit default is 2, not '~5'.
- plugins.md: drop hard-coded '33 providers' count.
- honcho.md: fix unclosed quote in echo HONCHO_API_KEY snippet; document
  that 'hermes honcho' subcommand is gated on memory.provider=honcho;
  reconcile subcommand list with actual --help output.
- memory-providers.md: legacy 'hermes honcho setup' redirect documented.

Verified via 'npm run build' — site builds cleanly; broken-link count went
from 149 to 146 (no regressions, fixed a few in passing).

* docs: round 2 audit fixes + regenerate skill catalogs

Follow-up to the previous commit on this branch:

Round 2 manual fixes:
- quickstart.md: KIMI_CODING_API_KEY mentioned alongside KIMI_API_KEY;
  voice-mode and ACP install commands rewritten — bare 'pip install ...'
  doesn't work for curl-installed setups (no pip on PATH, not in repo
  dir); replaced with 'cd ~/.hermes/hermes-agent && uv pip install -e
  ".[voice]"'. ACP already ships in [all] so the curl install includes it.
- cli.md / configuration.md: 'auxiliary.compression.model' shown as
  'google/gemini-3-flash-preview' (the doc's own claimed default);
  actual default is empty (= use main model). Reworded as 'leave empty
  (default) or pin a cheap model'.
- built-in-plugins.md: added the bundled 'kanban/dashboard' plugin row
  that was missing from the table.

Regenerated skill catalogs:
- ran website/scripts/generate-skill-docs.py to refresh all 163 per-skill
  pages and both reference catalogs (skills-catalog.md,
  optional-skills-catalog.md). This adds the entries that were genuinely
  missing — productivity/teams-meeting-pipeline (bundled),
  optional/finance/* (entire category — 7 skills:
  3-statement-model, comps-analysis, dcf-model, excel-author, lbo-model,
  merger-model, pptx-author), creative/hyperframes,
  creative/kanban-video-orchestrator, devops/watchers,
  productivity/shop-app, research/searxng-search,
  apple/macos-computer-use — and rewrites every other per-skill page from
  the current SKILL.md. Most diffs are tiny (one line of refreshed
  metadata).

Validation:
- 'npm run build' succeeded.
- Broken-link count moved 146 -> 155 — the +9 are zh-Hans translation
  shells that lag every newly-added skill page (pre-existing pattern).
  No regressions on any en/ page.

2026-05-09 13:19:51 -07:00

9.9 KiB

Raw Blame History

title	sidebar_label	description
Serving Llms Vllm — vLLM: high-throughput LLM serving, OpenAI API, quantization	Serving Llms Vllm	vLLM: high-throughput LLM serving, OpenAI API, quantization

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

Serving Llms Vllm

vLLM: high-throughput LLM serving, OpenAI API, quantization.

Skill metadata


Source	Bundled (installed by default)
Path	`skills/mlops/inference/vllm`
Version	`1.0.0`
Author	Orchestra Research
License	MIT
Dependencies	`vllm`, `torch`, `transformers`
Platforms	linux, macos
Tags	`vLLM`, `Inference Serving`, `PagedAttention`, `Continuous Batching`, `High Throughput`, `Production`, `OpenAI API`, `Quantization`, `Tensor Parallelism`

Reference: full SKILL.md

:::info The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active. :::

vLLM - High-Performance LLM Serving

When to use

Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Quick start

vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).

Installation:

pip install vllm

Basic offline inference:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)

OpenAI-compatible server:

vllm serve meta-llama/Llama-3-8B-Instruct

# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"

Common workflows

Workflow 1: Production API deployment

Copy this checklist and track progress:

Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics

Step 1: Configure server settings

Choose configuration based on your model size:

# For 7B-13B models on single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --port 8000

# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --quantization awq \
  --port 8000

# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --enable-metrics \
  --metrics-port 9090 \
  --port 8000 \
  --host 0.0.0.0

Step 2: Test with limited traffic

Run load test before production:

# Install load testing tool
pip install locust

# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000

Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.

Step 3: Enable monitoring

vLLM exposes Prometheus metrics on port 9090:

curl http://localhost:9090/metrics | grep vllm

Key metrics to monitor:

vllm:time_to_first_token_seconds - Latency
vllm:num_requests_running - Active requests
vllm:gpu_cache_usage_perc - KV cache utilization

Step 4: Deploy to production

Use Docker for consistent deployment:

# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching

Step 5: Verify performance metrics

Check that deployment meets targets:

TTFT < 500ms (for short prompts)
Throughput > target req/sec
GPU utilization > 80%
No OOM errors in logs

Workflow 2: Offline batch inference

For processing large datasets without server overhead.

Copy this checklist:

Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results

Step 1: Prepare input data

# Load prompts from file
prompts = []
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f]

print(f"Loaded {len(prompts)} prompts")

Step 2: Configure LLM engine

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9,
    max_model_len=4096
)

sampling = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "\n\n"]
)

Step 3: Run batch inference

vLLM automatically batches requests for efficiency:

# Process all prompts in one call
outputs = llm.generate(prompts, sampling)

# vLLM handles batching internally
# No need to manually chunk prompts

Step 4: Process results

# Extract generated text
results = []
for output in outputs:
    prompt = output.prompt
    generated = output.outputs[0].text
    results.append({
        "prompt": prompt,
        "generated": generated,
        "tokens": len(output.outputs[0].token_ids)
    })

# Save to file
import json
with open("results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")

print(f"Processed {len(results)} prompts")

Workflow 3: Quantized model serving

Fit large models in limited GPU memory.

Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy

Step 1: Choose quantization method

AWQ: Best for 70B models, minimal accuracy loss
GPTQ: Wide model support, good compression
FP8: Fastest on H100 GPUs

Step 2: Find or create quantized model

Use pre-quantized models from HuggingFace:

# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ

Step 3: Launch with quantization flag

# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95

# Results: 70B model in ~40GB VRAM

Step 4: Verify accuracy

Test outputs match expected quality:

# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged

When to use vs alternatives

Use vLLM when:

Deploying production LLM APIs (100+ req/sec)
Serving OpenAI-compatible endpoints
Limited GPU memory but need large models
Multi-user applications (chatbots, assistants)
Need low latency with high throughput

Use alternatives instead:

llama.cpp: CPU/edge inference, single-user
HuggingFace transformers: Research, prototyping, one-off generation
TensorRT-LLM: NVIDIA-only, need absolute maximum performance
Text-Generation-Inference: Already in HuggingFace ecosystem

Common issues

Issue: Out of memory during model loading

Reduce memory usage:

vllm serve MODEL \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096

Or use quantization:

vllm serve MODEL --quantization awq

Issue: Slow first token (TTFT > 1 second)

Enable prefix caching for repeated prompts:

vllm serve MODEL --enable-prefix-caching

For long prompts, enable chunked prefill:

vllm serve MODEL --enable-chunked-prefill

Issue: Model not found error

Use --trust-remote-code for custom models:

vllm serve MODEL --trust-remote-code

Issue: Low throughput (<50 req/sec)

Increase concurrent sequences:

vllm serve MODEL --max-num-seqs 512

Check GPU utilization with nvidia-smi - should be >80%.

Issue: Inference slower than expected

Verify tensor parallelism uses power of 2 GPUs:

vllm serve MODEL --tensor-parallel-size 4  # Not 3

Enable speculative decoding for faster generation:

vllm serve MODEL --speculative-model DRAFT_MODEL

Advanced topics

Server deployment patterns: See references/server-deployment.md for Docker, Kubernetes, and load balancing configurations.

Performance optimization: See references/optimization.md for PagedAttention tuning, continuous batching details, and benchmark results.

Quantization guide: See references/quantization.md for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.

Troubleshooting: See references/troubleshooting.md for detailed error messages, debugging steps, and performance diagnostics.

Hardware requirements

Small models (7B-13B): 1x A10 (24GB) or A100 (40GB)
Medium models (30B-40B): 2x A100 (40GB) with tensor parallelism
Large models (70B+): 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ

Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs

Resources

Official docs: https://docs.vllm.ai
GitHub: https://github.com/vllm-project/vllm
Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
Community: https://discuss.vllm.ai

9.9 KiB Raw Blame History