mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
docs(website): dedicated page per bundled + optional skill (#14929)
Generates a full dedicated Docusaurus page for every one of the 132 skills
(73 bundled + 59 optional) under website/docs/user-guide/skills/{bundled,optional}/<category>/.
Each page carries the skill's description, metadata (version, author, license,
dependencies, platform gating, tags, related skills cross-linked to their own
pages), and the complete SKILL.md body that Hermes loads at runtime.
Previously the two catalog pages just listed skills with a one-line blurb and
no way to see what the skill actually did — users had to go read the source
repo. Now every skill has a browsable, searchable, cross-linked reference in
the docs.
- website/scripts/generate-skill-docs.py — generator that reads skills/ and
optional-skills/, writes per-skill pages, regenerates both catalog indexes,
and rewrites the Skills section of sidebars.ts. Handles MDX escaping
(outside fenced code blocks: curly braces, unsafe HTML-ish tags) and
rewrites relative references/*.md links to point at the GitHub source.
- website/docs/reference/skills-catalog.md — regenerated; each row links to
the new dedicated page.
- website/docs/reference/optional-skills-catalog.md — same.
- website/sidebars.ts — Skills section now has Bundled / Optional subtrees
with one nested category per skill folder.
- .github/workflows/{docs-site-checks,deploy-site}.yml — run the generator
before docusaurus build so CI stays in sync with the source SKILL.md files.
Build verified locally with `npx docusaurus build`. Only remaining warnings
are pre-existing broken link/anchor issues in unrelated pages.
This commit is contained in:
parent
eb93f88e1d
commit
0f6eabb890
139 changed files with 43523 additions and 306 deletions
|
|
@ -0,0 +1,205 @@
|
|||
---
|
||||
title: "Tensorrt Llm — Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency"
|
||||
sidebar_label: "Tensorrt Llm"
|
||||
description: "Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency"
|
||||
---
|
||||
|
||||
{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}
|
||||
|
||||
# Tensorrt Llm
|
||||
|
||||
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
|
||||
|
||||
## Skill metadata
|
||||
|
||||
| | |
|
||||
|---|---|
|
||||
| Source | Optional — install with `hermes skills install official/mlops/tensorrt-llm` |
|
||||
| Path | `optional-skills/mlops/tensorrt-llm` |
|
||||
| Version | `1.0.0` |
|
||||
| Author | Orchestra Research |
|
||||
| License | MIT |
|
||||
| Dependencies | `tensorrt-llm`, `torch` |
|
||||
| Tags | `Inference Serving`, `TensorRT-LLM`, `NVIDIA`, `Inference Optimization`, `High Throughput`, `Low Latency`, `Production`, `FP8`, `INT4`, `In-Flight Batching`, `Multi-GPU` |
|
||||
|
||||
## Reference: full SKILL.md
|
||||
|
||||
:::info
|
||||
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
|
||||
:::
|
||||
|
||||
# TensorRT-LLM
|
||||
|
||||
NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.
|
||||
|
||||
## When to use TensorRT-LLM
|
||||
|
||||
**Use TensorRT-LLM when:**
|
||||
- Deploying on NVIDIA GPUs (A100, H100, GB200)
|
||||
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
|
||||
- Require low latency for real-time applications
|
||||
- Working with quantized models (FP8, INT4, FP4)
|
||||
- Scaling across multiple GPUs or nodes
|
||||
|
||||
**Use vLLM instead when:**
|
||||
- Need simpler setup and Python-first API
|
||||
- Want PagedAttention without TensorRT compilation
|
||||
- Working with AMD GPUs or non-NVIDIA hardware
|
||||
|
||||
**Use llama.cpp instead when:**
|
||||
- Deploying on CPU or Apple Silicon
|
||||
- Need edge deployment without NVIDIA GPUs
|
||||
- Want simpler GGUF quantization format
|
||||
|
||||
## Quick start
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
# Docker (recommended)
|
||||
docker pull nvidia/tensorrt_llm:latest
|
||||
|
||||
# pip install
|
||||
pip install tensorrt_llm==1.2.0rc3
|
||||
|
||||
# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
|
||||
```
|
||||
|
||||
### Basic inference
|
||||
|
||||
```python
|
||||
from tensorrt_llm import LLM, SamplingParams
|
||||
|
||||
# Initialize model
|
||||
llm = LLM(model="meta-llama/Meta-Llama-3-8B")
|
||||
|
||||
# Configure sampling
|
||||
sampling_params = SamplingParams(
|
||||
max_tokens=100,
|
||||
temperature=0.7,
|
||||
top_p=0.9
|
||||
)
|
||||
|
||||
# Generate
|
||||
prompts = ["Explain quantum computing"]
|
||||
outputs = llm.generate(prompts, sampling_params)
|
||||
|
||||
for output in outputs:
|
||||
print(output.text)
|
||||
```
|
||||
|
||||
### Serving with trtllm-serve
|
||||
|
||||
```bash
|
||||
# Start server (automatic model download and compilation)
|
||||
trtllm-serve meta-llama/Meta-Llama-3-8B \
|
||||
--tp_size 4 \ # Tensor parallelism (4 GPUs)
|
||||
--max_batch_size 256 \
|
||||
--max_num_tokens 4096
|
||||
|
||||
# Client request
|
||||
curl -X POST http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "meta-llama/Meta-Llama-3-8B",
|
||||
"messages": [{"role": "user", "content": "Hello!"}],
|
||||
"temperature": 0.7,
|
||||
"max_tokens": 100
|
||||
}'
|
||||
```
|
||||
|
||||
## Key features
|
||||
|
||||
### Performance optimizations
|
||||
- **In-flight batching**: Dynamic batching during generation
|
||||
- **Paged KV cache**: Efficient memory management
|
||||
- **Flash Attention**: Optimized attention kernels
|
||||
- **Quantization**: FP8, INT4, FP4 for 2-4× faster inference
|
||||
- **CUDA graphs**: Reduced kernel launch overhead
|
||||
|
||||
### Parallelism
|
||||
- **Tensor parallelism (TP)**: Split model across GPUs
|
||||
- **Pipeline parallelism (PP)**: Layer-wise distribution
|
||||
- **Expert parallelism**: For Mixture-of-Experts models
|
||||
- **Multi-node**: Scale beyond single machine
|
||||
|
||||
### Advanced features
|
||||
- **Speculative decoding**: Faster generation with draft models
|
||||
- **LoRA serving**: Efficient multi-adapter deployment
|
||||
- **Disaggregated serving**: Separate prefill and generation
|
||||
|
||||
## Common patterns
|
||||
|
||||
### Quantized model (FP8)
|
||||
|
||||
```python
|
||||
from tensorrt_llm import LLM
|
||||
|
||||
# Load FP8 quantized model (2× faster, 50% memory)
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-70B",
|
||||
dtype="fp8",
|
||||
max_num_tokens=8192
|
||||
)
|
||||
|
||||
# Inference same as before
|
||||
outputs = llm.generate(["Summarize this article..."])
|
||||
```
|
||||
|
||||
### Multi-GPU deployment
|
||||
|
||||
```python
|
||||
# Tensor parallelism across 8 GPUs
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-405B",
|
||||
tensor_parallel_size=8,
|
||||
dtype="fp8"
|
||||
)
|
||||
```
|
||||
|
||||
### Batch inference
|
||||
|
||||
```python
|
||||
# Process 100 prompts efficiently
|
||||
prompts = [f"Question {i}: ..." for i in range(100)]
|
||||
|
||||
outputs = llm.generate(
|
||||
prompts,
|
||||
sampling_params=SamplingParams(max_tokens=200)
|
||||
)
|
||||
|
||||
# Automatic in-flight batching for maximum throughput
|
||||
```
|
||||
|
||||
## Performance benchmarks
|
||||
|
||||
**Meta Llama 3-8B** (H100 GPU):
|
||||
- Throughput: 24,000 tokens/sec
|
||||
- Latency: ~10ms per token
|
||||
- vs PyTorch: **100× faster**
|
||||
|
||||
**Llama 3-70B** (8× A100 80GB):
|
||||
- FP8 quantization: 2× faster than FP16
|
||||
- Memory: 50% reduction with FP8
|
||||
|
||||
## Supported models
|
||||
|
||||
- **LLaMA family**: Llama 2, Llama 3, CodeLlama
|
||||
- **GPT family**: GPT-2, GPT-J, GPT-NeoX
|
||||
- **Qwen**: Qwen, Qwen2, QwQ
|
||||
- **DeepSeek**: DeepSeek-V2, DeepSeek-V3
|
||||
- **Mixtral**: Mixtral-8x7B, Mixtral-8x22B
|
||||
- **Vision**: LLaVA, Phi-3-vision
|
||||
- **100+ models** on HuggingFace
|
||||
|
||||
## References
|
||||
|
||||
- **[Optimization Guide](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/tensorrt-llm/references/optimization.md)** - Quantization, batching, KV cache tuning
|
||||
- **[Multi-GPU Setup](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/tensorrt-llm/references/multi-gpu.md)** - Tensor/pipeline parallelism, multi-node
|
||||
- **[Serving Guide](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/tensorrt-llm/references/serving.md)** - Production deployment, monitoring, autoscaling
|
||||
|
||||
## Resources
|
||||
|
||||
- **Docs**: https://nvidia.github.io/TensorRT-LLM/
|
||||
- **GitHub**: https://github.com/NVIDIA/TensorRT-LLM
|
||||
- **Models**: https://huggingface.co/models?library=tensorrt_llm
|
||||
Loading…
Add table
Add a link
Reference in a new issue