mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
docs(website): dedicated page per bundled + optional skill (#14929)
Generates a full dedicated Docusaurus page for every one of the 132 skills
(73 bundled + 59 optional) under website/docs/user-guide/skills/{bundled,optional}/<category>/.
Each page carries the skill's description, metadata (version, author, license,
dependencies, platform gating, tags, related skills cross-linked to their own
pages), and the complete SKILL.md body that Hermes loads at runtime.
Previously the two catalog pages just listed skills with a one-line blurb and
no way to see what the skill actually did — users had to go read the source
repo. Now every skill has a browsable, searchable, cross-linked reference in
the docs.
- website/scripts/generate-skill-docs.py — generator that reads skills/ and
optional-skills/, writes per-skill pages, regenerates both catalog indexes,
and rewrites the Skills section of sidebars.ts. Handles MDX escaping
(outside fenced code blocks: curly braces, unsafe HTML-ish tags) and
rewrites relative references/*.md links to point at the GitHub source.
- website/docs/reference/skills-catalog.md — regenerated; each row links to
the new dedicated page.
- website/docs/reference/optional-skills-catalog.md — same.
- website/sidebars.ts — Skills section now has Bundled / Optional subtrees
with one nested category per skill folder.
- .github/workflows/{docs-site-checks,deploy-site}.yml — run the generator
before docusaurus build so CI stays in sync with the source SKILL.md files.
Build verified locally with `npx docusaurus build`. Only remaining warnings
are pre-existing broken link/anchor issues in unrelated pages.
This commit is contained in:
parent
eb93f88e1d
commit
0f6eabb890
139 changed files with 43523 additions and 306 deletions
|
|
@ -0,0 +1,381 @@
|
|||
---
|
||||
title: "Serving Llms Vllm — Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching"
|
||||
sidebar_label: "Serving Llms Vllm"
|
||||
description: "Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching"
|
||||
---
|
||||
|
||||
{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}
|
||||
|
||||
# Serving Llms Vllm
|
||||
|
||||
Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
|
||||
|
||||
## Skill metadata
|
||||
|
||||
| | |
|
||||
|---|---|
|
||||
| Source | Bundled (installed by default) |
|
||||
| Path | `skills/mlops/inference/vllm` |
|
||||
| Version | `1.0.0` |
|
||||
| Author | Orchestra Research |
|
||||
| License | MIT |
|
||||
| Dependencies | `vllm`, `torch`, `transformers` |
|
||||
| Tags | `vLLM`, `Inference Serving`, `PagedAttention`, `Continuous Batching`, `High Throughput`, `Production`, `OpenAI API`, `Quantization`, `Tensor Parallelism` |
|
||||
|
||||
## Reference: full SKILL.md
|
||||
|
||||
:::info
|
||||
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
|
||||
:::
|
||||
|
||||
# vLLM - High-Performance LLM Serving
|
||||
|
||||
## Quick start
|
||||
|
||||
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install vllm
|
||||
```
|
||||
|
||||
**Basic offline inference**:
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
|
||||
sampling = SamplingParams(temperature=0.7, max_tokens=256)
|
||||
|
||||
outputs = llm.generate(["Explain quantum computing"], sampling)
|
||||
print(outputs[0].outputs[0].text)
|
||||
```
|
||||
|
||||
**OpenAI-compatible server**:
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-3-8B-Instruct
|
||||
|
||||
# Query with OpenAI SDK
|
||||
python -c "
|
||||
from openai import OpenAI
|
||||
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
|
||||
print(client.chat.completions.create(
|
||||
model='meta-llama/Llama-3-8B-Instruct',
|
||||
messages=[{'role': 'user', 'content': 'Hello!'}]
|
||||
).choices[0].message.content)
|
||||
"
|
||||
```
|
||||
|
||||
## Common workflows
|
||||
|
||||
### Workflow 1: Production API deployment
|
||||
|
||||
Copy this checklist and track progress:
|
||||
|
||||
```
|
||||
Deployment Progress:
|
||||
- [ ] Step 1: Configure server settings
|
||||
- [ ] Step 2: Test with limited traffic
|
||||
- [ ] Step 3: Enable monitoring
|
||||
- [ ] Step 4: Deploy to production
|
||||
- [ ] Step 5: Verify performance metrics
|
||||
```
|
||||
|
||||
**Step 1: Configure server settings**
|
||||
|
||||
Choose configuration based on your model size:
|
||||
|
||||
```bash
|
||||
# For 7B-13B models on single GPU
|
||||
vllm serve meta-llama/Llama-3-8B-Instruct \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--max-model-len 8192 \
|
||||
--port 8000
|
||||
|
||||
# For 30B-70B models with tensor parallelism
|
||||
vllm serve meta-llama/Llama-2-70b-hf \
|
||||
--tensor-parallel-size 4 \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--quantization awq \
|
||||
--port 8000
|
||||
|
||||
# For production with caching and metrics
|
||||
vllm serve meta-llama/Llama-3-8B-Instruct \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--enable-prefix-caching \
|
||||
--enable-metrics \
|
||||
--metrics-port 9090 \
|
||||
--port 8000 \
|
||||
--host 0.0.0.0
|
||||
```
|
||||
|
||||
**Step 2: Test with limited traffic**
|
||||
|
||||
Run load test before production:
|
||||
|
||||
```bash
|
||||
# Install load testing tool
|
||||
pip install locust
|
||||
|
||||
# Create test_load.py with sample requests
|
||||
# Run: locust -f test_load.py --host http://localhost:8000
|
||||
```
|
||||
|
||||
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
|
||||
|
||||
**Step 3: Enable monitoring**
|
||||
|
||||
vLLM exposes Prometheus metrics on port 9090:
|
||||
|
||||
```bash
|
||||
curl http://localhost:9090/metrics | grep vllm
|
||||
```
|
||||
|
||||
Key metrics to monitor:
|
||||
- `vllm:time_to_first_token_seconds` - Latency
|
||||
- `vllm:num_requests_running` - Active requests
|
||||
- `vllm:gpu_cache_usage_perc` - KV cache utilization
|
||||
|
||||
**Step 4: Deploy to production**
|
||||
|
||||
Use Docker for consistent deployment:
|
||||
|
||||
```bash
|
||||
# Run vLLM in Docker
|
||||
docker run --gpus all -p 8000:8000 \
|
||||
vllm/vllm-openai:latest \
|
||||
--model meta-llama/Llama-3-8B-Instruct \
|
||||
--gpu-memory-utilization 0.9 \
|
||||
--enable-prefix-caching
|
||||
```
|
||||
|
||||
**Step 5: Verify performance metrics**
|
||||
|
||||
Check that deployment meets targets:
|
||||
- TTFT < 500ms (for short prompts)
|
||||
- Throughput > target req/sec
|
||||
- GPU utilization > 80%
|
||||
- No OOM errors in logs
|
||||
|
||||
### Workflow 2: Offline batch inference
|
||||
|
||||
For processing large datasets without server overhead.
|
||||
|
||||
Copy this checklist:
|
||||
|
||||
```
|
||||
Batch Processing:
|
||||
- [ ] Step 1: Prepare input data
|
||||
- [ ] Step 2: Configure LLM engine
|
||||
- [ ] Step 3: Run batch inference
|
||||
- [ ] Step 4: Process results
|
||||
```
|
||||
|
||||
**Step 1: Prepare input data**
|
||||
|
||||
```python
|
||||
# Load prompts from file
|
||||
prompts = []
|
||||
with open("prompts.txt") as f:
|
||||
prompts = [line.strip() for line in f]
|
||||
|
||||
print(f"Loaded {len(prompts)} prompts")
|
||||
```
|
||||
|
||||
**Step 2: Configure LLM engine**
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
llm = LLM(
|
||||
model="meta-llama/Llama-3-8B-Instruct",
|
||||
tensor_parallel_size=2, # Use 2 GPUs
|
||||
gpu_memory_utilization=0.9,
|
||||
max_model_len=4096
|
||||
)
|
||||
|
||||
sampling = SamplingParams(
|
||||
temperature=0.7,
|
||||
top_p=0.95,
|
||||
max_tokens=512,
|
||||
stop=["</s>", "\n\n"]
|
||||
)
|
||||
```
|
||||
|
||||
**Step 3: Run batch inference**
|
||||
|
||||
vLLM automatically batches requests for efficiency:
|
||||
|
||||
```python
|
||||
# Process all prompts in one call
|
||||
outputs = llm.generate(prompts, sampling)
|
||||
|
||||
# vLLM handles batching internally
|
||||
# No need to manually chunk prompts
|
||||
```
|
||||
|
||||
**Step 4: Process results**
|
||||
|
||||
```python
|
||||
# Extract generated text
|
||||
results = []
|
||||
for output in outputs:
|
||||
prompt = output.prompt
|
||||
generated = output.outputs[0].text
|
||||
results.append({
|
||||
"prompt": prompt,
|
||||
"generated": generated,
|
||||
"tokens": len(output.outputs[0].token_ids)
|
||||
})
|
||||
|
||||
# Save to file
|
||||
import json
|
||||
with open("results.jsonl", "w") as f:
|
||||
for result in results:
|
||||
f.write(json.dumps(result) + "\n")
|
||||
|
||||
print(f"Processed {len(results)} prompts")
|
||||
```
|
||||
|
||||
### Workflow 3: Quantized model serving
|
||||
|
||||
Fit large models in limited GPU memory.
|
||||
|
||||
```
|
||||
Quantization Setup:
|
||||
- [ ] Step 1: Choose quantization method
|
||||
- [ ] Step 2: Find or create quantized model
|
||||
- [ ] Step 3: Launch with quantization flag
|
||||
- [ ] Step 4: Verify accuracy
|
||||
```
|
||||
|
||||
**Step 1: Choose quantization method**
|
||||
|
||||
- **AWQ**: Best for 70B models, minimal accuracy loss
|
||||
- **GPTQ**: Wide model support, good compression
|
||||
- **FP8**: Fastest on H100 GPUs
|
||||
|
||||
**Step 2: Find or create quantized model**
|
||||
|
||||
Use pre-quantized models from HuggingFace:
|
||||
|
||||
```bash
|
||||
# Search for AWQ models
|
||||
# Example: TheBloke/Llama-2-70B-AWQ
|
||||
```
|
||||
|
||||
**Step 3: Launch with quantization flag**
|
||||
|
||||
```bash
|
||||
# Using pre-quantized model
|
||||
vllm serve TheBloke/Llama-2-70B-AWQ \
|
||||
--quantization awq \
|
||||
--tensor-parallel-size 1 \
|
||||
--gpu-memory-utilization 0.95
|
||||
|
||||
# Results: 70B model in ~40GB VRAM
|
||||
```
|
||||
|
||||
**Step 4: Verify accuracy**
|
||||
|
||||
Test outputs match expected quality:
|
||||
|
||||
```python
|
||||
# Compare quantized vs non-quantized responses
|
||||
# Verify task-specific performance unchanged
|
||||
```
|
||||
|
||||
## When to use vs alternatives
|
||||
|
||||
**Use vLLM when:**
|
||||
- Deploying production LLM APIs (100+ req/sec)
|
||||
- Serving OpenAI-compatible endpoints
|
||||
- Limited GPU memory but need large models
|
||||
- Multi-user applications (chatbots, assistants)
|
||||
- Need low latency with high throughput
|
||||
|
||||
**Use alternatives instead:**
|
||||
- **llama.cpp**: CPU/edge inference, single-user
|
||||
- **HuggingFace transformers**: Research, prototyping, one-off generation
|
||||
- **TensorRT-LLM**: NVIDIA-only, need absolute maximum performance
|
||||
- **Text-Generation-Inference**: Already in HuggingFace ecosystem
|
||||
|
||||
## Common issues
|
||||
|
||||
**Issue: Out of memory during model loading**
|
||||
|
||||
Reduce memory usage:
|
||||
```bash
|
||||
vllm serve MODEL \
|
||||
--gpu-memory-utilization 0.7 \
|
||||
--max-model-len 4096
|
||||
```
|
||||
|
||||
Or use quantization:
|
||||
```bash
|
||||
vllm serve MODEL --quantization awq
|
||||
```
|
||||
|
||||
**Issue: Slow first token (TTFT > 1 second)**
|
||||
|
||||
Enable prefix caching for repeated prompts:
|
||||
```bash
|
||||
vllm serve MODEL --enable-prefix-caching
|
||||
```
|
||||
|
||||
For long prompts, enable chunked prefill:
|
||||
```bash
|
||||
vllm serve MODEL --enable-chunked-prefill
|
||||
```
|
||||
|
||||
**Issue: Model not found error**
|
||||
|
||||
Use `--trust-remote-code` for custom models:
|
||||
```bash
|
||||
vllm serve MODEL --trust-remote-code
|
||||
```
|
||||
|
||||
**Issue: Low throughput (<50 req/sec)**
|
||||
|
||||
Increase concurrent sequences:
|
||||
```bash
|
||||
vllm serve MODEL --max-num-seqs 512
|
||||
```
|
||||
|
||||
Check GPU utilization with `nvidia-smi` - should be >80%.
|
||||
|
||||
**Issue: Inference slower than expected**
|
||||
|
||||
Verify tensor parallelism uses power of 2 GPUs:
|
||||
```bash
|
||||
vllm serve MODEL --tensor-parallel-size 4 # Not 3
|
||||
```
|
||||
|
||||
Enable speculative decoding for faster generation:
|
||||
```bash
|
||||
vllm serve MODEL --speculative-model DRAFT_MODEL
|
||||
```
|
||||
|
||||
## Advanced topics
|
||||
|
||||
**Server deployment patterns**: See [references/server-deployment.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/inference/vllm/references/server-deployment.md) for Docker, Kubernetes, and load balancing configurations.
|
||||
|
||||
**Performance optimization**: See [references/optimization.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/inference/vllm/references/optimization.md) for PagedAttention tuning, continuous batching details, and benchmark results.
|
||||
|
||||
**Quantization guide**: See [references/quantization.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/inference/vllm/references/quantization.md) for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
|
||||
|
||||
**Troubleshooting**: See [references/troubleshooting.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/inference/vllm/references/troubleshooting.md) for detailed error messages, debugging steps, and performance diagnostics.
|
||||
|
||||
## Hardware requirements
|
||||
|
||||
- **Small models (7B-13B)**: 1x A10 (24GB) or A100 (40GB)
|
||||
- **Medium models (30B-40B)**: 2x A100 (40GB) with tensor parallelism
|
||||
- **Large models (70B+)**: 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
|
||||
|
||||
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
|
||||
|
||||
## Resources
|
||||
|
||||
- Official docs: https://docs.vllm.ai
|
||||
- GitHub: https://github.com/vllm-project/vllm
|
||||
- Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
|
||||
- Community: https://discuss.vllm.ai
|
||||
Loading…
Add table
Add a link
Reference in a new issue