---
name: tensorrt-llm
description: Optimizes LLM inference with NVIDIA TensorRT-LLM for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [tensorrt-llm, torch]
metadata:
  hermes:
    tags: [Inference Serving, TensorRT-LLM, NVIDIA, Inference Optimization, High Throughput, Low Latency, Production, FP8, INT4, In-Flight Batching, Multi-GPU]
---

# TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

## When to use TensorRT-LLM

**Use TensorRT-LLM when:**
- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes

**Use vLLM instead when:**
- Need simpler setup and a Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or other non-NVIDIA hardware

**Use llama.cpp instead when:**
- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want the simpler GGUF quantization format

## Quick start

### Installation

```bash
# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
```
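
To sanity-check the install before building engines, a minimal check (a sketch; assumes `torch` was installed alongside `tensorrt_llm`, as in the dependencies above):

```python
# Verify the package imports and a CUDA GPU is visible.
import torch
import tensorrt_llm

print("TensorRT-LLM version:", tensorrt_llm.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```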

### Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

# Each result carries its completions under .outputs
for output in outputs:
    print(output.outputs[0].text)
```

### Serving with trtllm-serve

```bash
# Start server (automatic model download and compilation)
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --tp_size 4 \
  --max_batch_size 256 \
  --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
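
Because the server exposes an OpenAI-compatible endpoint, the same request works from Python with the standard `openai` client (a sketch; assumes `pip install openai` and the server above running locally):

```python
from openai import OpenAI

# Point the stock OpenAI client at the local trtllm-serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```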

## Key features

### Performance optimizations
- **In-flight batching**: Dynamic batching during generation
- **Paged KV cache**: Efficient memory management (see the tuning sketch after this list)
- **Flash Attention**: Optimized attention kernels
- **Quantization**: FP8, INT4, FP4 for 2-4× faster inference
- **CUDA graphs**: Reduced kernel launch overhead
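
The paged KV cache can be tuned when constructing the `LLM`. A minimal sketch, assuming the `KvCacheConfig` import path and field names of the current LLM API (verify against your installed version):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig  # assumed import path

# Cap the paged KV cache at 90% of free GPU memory and reuse cached
# blocks across requests that share a common prefix (assumed fields).
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.9,
        enable_block_reuse=True,
    ),
)
```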

### Parallelism
- **Tensor parallelism (TP)**: Split model across GPUs (combined TP/PP sketch after this list)
- **Pipeline parallelism (PP)**: Layer-wise distribution
- **Expert parallelism**: For Mixture-of-Experts models
- **Multi-node**: Scale beyond a single machine
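
Tensor and pipeline parallelism compose multiplicatively. A sketch of a combined layout (the `pipeline_parallel_size` argument is assumed to mirror `tensor_parallel_size` in the LLM API; check your version):

```python
from tensorrt_llm import LLM

# Split each layer across 4 GPUs (TP) and the layer stack into 2 stages (PP),
# using 4 x 2 = 8 GPUs in total. Argument names assumed; verify per version.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)
```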

### Advanced features
- **Speculative decoding**: Faster generation with draft models
- **LoRA serving**: Efficient multi-adapter deployment
- **Disaggregated serving**: Separate prefill and generation

## Common patterns

### Quantized model (FP8)

```python
from tensorrt_llm import LLM

# Load FP8 quantized model (2× faster, 50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])
```

### Multi-GPU deployment

```python
# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8"
)
```

### Batch inference

```python
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)

# Automatic in-flight batching for maximum throughput
```

## Performance benchmarks

**Meta Llama 3-8B** (H100 GPU):
- Throughput: 24,000 tokens/sec
- Latency: ~10ms per token
- vs PyTorch: **100× faster**
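
Those two figures are consistent if they describe batched serving: at ~10 ms per token a single stream emits ~100 tokens/sec, so 24,000 tokens/sec aggregate implies roughly 240 concurrent streams (a back-of-envelope check, not a published figure):

```python
# Back-of-envelope: reconcile aggregate throughput with per-token latency.
per_token_latency_s = 0.010                 # ~10 ms per token, per stream
per_stream_tps = 1 / per_token_latency_s    # ≈ 100 tokens/sec per stream
aggregate_tps = 24_000                      # reported aggregate throughput
print(f"~{aggregate_tps / per_stream_tps:.0f} concurrent streams implied")  # ~240
```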

**Llama 3-70B** (8× A100 80GB):
- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8

## Supported models

- **LLaMA family**: Llama 2, Llama 3, CodeLlama
- **GPT family**: GPT-2, GPT-J, GPT-NeoX
- **Qwen**: Qwen, Qwen2, QwQ
- **DeepSeek**: DeepSeek-V2, DeepSeek-V3
- **Mixtral**: Mixtral-8x7B, Mixtral-8x22B
- **Vision**: LLaVA, Phi-3-vision
- **100+ models** on HuggingFace

## References

- **[Optimization Guide](references/optimization.md)** - Quantization, batching, KV cache tuning
- **[Multi-GPU Setup](references/multi-gpu.md)** - Tensor/pipeline parallelism, multi-node
- **[Serving Guide](references/serving.md)** - Production deployment, monitoring, autoscaling

## Resources

- **Docs**: https://nvidia.github.io/TensorRT-LLM/
- **GitHub**: https://github.com/NVIDIA/TensorRT-LLM
- **Models**: https://huggingface.co/models?library=tensorrt_llm