mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap (#3934)
* feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap

  Map active skills to Telegram's slash command menu so users can discover and invoke skills directly. Three changes:

  1. The Telegram menu now includes active skill commands alongside built-in commands, capped at 100 entries (Telegram Bot API limit). Overflow commands remain callable but hidden from the picker. Logged at startup when the cap is hit.
  2. New /commands [page] gateway command for paginated browsing of all commands + skills. /help now shows the first 10 skill commands and points to /commands for the full list.
  3. When a user types a slash command that matches a disabled or uninstalled skill, they get actionable guidance:
     - Disabled: 'Enable it with: hermes skills config'
     - Optional (not installed): 'Install with: hermes skills install official/<path>'

  Built on ideas from PR #3921 by @kshitijk4poor.

* chore: move 21 niche skills to optional-skills

  Move specialized/niche skills from built-in (skills/) to optional (optional-skills/) to reduce the default skill count. Users can install them with: hermes skills install official/<category>/<name>

  Moved skills (21):
  - mlops: accelerate, chroma, faiss, flash-attention, hermes-atropos-environments, huggingface-tokenizers, instructor, lambda-labs, llava, nemo-curator, pinecone, pytorch-lightning, qdrant, saelens, simpo, slime, tensorrt-llm, torchtitan
  - research: domain-intel, duckduckgo-search
  - devops: inference-sh cli

  Built-in skills: 96 → 75
  Optional skills: 22 → 43

* fix: only include repo built-in skills in Telegram menu, not user-installed

  User-installed skills (from hub or manually added) stay accessible via /skills and by typing the command directly, but don't get registered in the Telegram slash command picker. Only skills whose SKILL.md is under the repo's skills/ directory are included in the menu.

  This keeps the Telegram menu focused on the curated built-in set while user-installed skills remain discoverable through /skills and /commands.
This commit is contained in:
parent 97d6813f51
commit 5ceed021dc
73 changed files with 163 additions and 4 deletions
optional-skills/mlops/tensorrt-llm/references/multi-gpu.md (new file)
@@ -0,0 +1,298 @@

# Multi-GPU Deployment Guide

Comprehensive guide to scaling TensorRT-LLM across multiple GPUs and nodes.

## Parallelism Strategies

### Tensor Parallelism (TP)

**What it does**: Splits model layers across GPUs horizontally.

**Use case**:
- Model fits in total GPU memory but not on a single GPU
- Need low latency (single forward pass)
- GPUs on the same node (NVLink required for best performance)

**Example** (Llama 3-70B on 4× A100):
```python
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,  # Split across 4 GPUs
    dtype="fp16"
)

# Model automatically sharded across GPUs
# Single forward pass, low latency
```

**Performance**:
- Latency: ~same as a single GPU
- Throughput: 4× higher (4 GPUs)
- Communication: high (activations synced every layer)

### Pipeline Parallelism (PP)

**What it does**: Splits model layers across GPUs vertically (layer-wise).

**Use case**:
- Very large models (175B+)
- Can tolerate higher latency
- GPUs across multiple nodes

**Example** (Llama 3-405B on 8× H100):
```python
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=4,   # TP=4 within nodes
    pipeline_parallel_size=2, # PP=2 across nodes
    dtype="fp8"
)

# Total: 8 GPUs (4×2)
# Layers 0-40: Node 1 (4 GPUs with TP)
# Layers 41-80: Node 2 (4 GPUs with TP)
```

**Performance**:
- Latency: higher (sequential through pipeline)
- Throughput: high with micro-batching
- Communication: lower than TP

### Expert Parallelism (EP)

**What it does**: Distributes MoE experts across GPUs.

**Use case**: Mixture-of-Experts models (Mixtral, DeepSeek-V2)

**Example** (Mixtral-8x22B on 8× A100):
```python
llm = LLM(
    model="mistralai/Mixtral-8x22B",
    tensor_parallel_size=4,
    expert_parallel_size=2,  # Distribute 8 experts across 2 groups
    dtype="fp8"
)
```

## Configuration Examples

### Small model (7-13B) - Single GPU

```python
# Llama 3-8B on 1× A100 80GB
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    dtype="fp16"  # or fp8 for H100
)
```

**Resources**:
- GPU: 1× A100 80GB
- Memory: ~16GB model + 30GB KV cache
- Throughput: 3,000-5,000 tokens/sec

### Medium model (70B) - Multi-GPU same node

```python
# Llama 3-70B on 4× A100 80GB (NVLink)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,
    dtype="fp8"  # 70GB → 35GB per GPU
)
```

**Resources**:
- GPU: 4× A100 80GB with NVLink
- Memory: ~35GB per GPU (FP8)
- Throughput: 10,000-15,000 tokens/sec
- Latency: 15-20ms per token

### Large model (405B) - Multi-node

```python
# Llama 3-405B on 2 nodes × 8 H100 = 16 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,   # TP within each node
    pipeline_parallel_size=2, # PP across 2 nodes
    dtype="fp8"
)
```

**Resources**:
- GPU: 2 nodes × 8 H100 80GB
- Memory: ~25GB per GPU (FP8)
- Throughput: 20,000-30,000 tokens/sec
- Network: InfiniBand recommended

## Server Deployment

### Single-node multi-GPU

```bash
# Llama 3-70B on 4 GPUs (automatic TP)
trtllm-serve meta-llama/Meta-Llama-3-70B \
    --tp_size 4 \
    --max_batch_size 256 \
    --dtype fp8

# Listens on http://localhost:8000
```

### Multi-node with Ray

```bash
# Node 1 (head node)
ray start --head --port=6379

# Node 2 (worker)
ray start --address='node1:6379'

# Deploy across the cluster (num_workers = 2 nodes)
trtllm-serve meta-llama/Meta-Llama-3-405B \
    --tp_size 8 \
    --pp_size 2 \
    --num_workers 2 \
    --dtype fp8
```

### Kubernetes deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorrt-llm-llama3-70b
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: trtllm
        image: nvidia/tensorrt_llm:latest
        command:
        - trtllm-serve
        - meta-llama/Meta-Llama-3-70B
        - --tp_size=4
        - --max_batch_size=256
        resources:
          limits:
            nvidia.com/gpu: 4  # Request 4 GPUs
```

## Parallelism Decision Tree

```
Model size < 20GB?
├─ YES: Single GPU (no parallelism)
└─ NO: Model size < 80GB?
   ├─ YES: TP=2 or TP=4 (same node)
   └─ NO: Model size < 320GB?
      ├─ YES: TP=4 or TP=8 (same node, NVLink required)
      └─ NO: TP=8 + PP=2 (multi-node)
```

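The same heuristic as a small helper, convenient for scripting deployments. A minimal sketch: the thresholds mirror the tree above, and `choose_parallelism` is an illustrative name, not a TensorRT-LLM API.

```python
def choose_parallelism(model_size_gb: float) -> dict:
    """Map model weight size (GB) to a TP/PP layout, following the tree above."""
    if model_size_gb < 20:
        return {"tensor_parallel_size": 1, "pipeline_parallel_size": 1}
    if model_size_gb < 80:
        return {"tensor_parallel_size": 4, "pipeline_parallel_size": 1}  # TP=2 also viable
    if model_size_gb < 320:
        return {"tensor_parallel_size": 8, "pipeline_parallel_size": 1}  # NVLink required
    return {"tensor_parallel_size": 8, "pipeline_parallel_size": 2}      # multi-node

# Llama 3-70B in FP8 is ~70GB of weights:
print(choose_parallelism(70))  # {'tensor_parallel_size': 4, 'pipeline_parallel_size': 1}
```
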
## Communication Optimization

### NVLink vs PCIe

**NVLink** (DGX A100, HGX H100):
- Bandwidth: 600 GB/s (A100), 900 GB/s (H100)
- Ideal for TP (high communication)
- **Recommended for all multi-GPU setups**

**PCIe**:
- Bandwidth: 64 GB/s (PCIe 4.0 x16)
- ~10× slower than NVLink
- Avoid TP, use PP instead

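To see why the interconnect dominates TP performance, here is a back-of-the-envelope estimate using the standard ring all-reduce cost model (each GPU moves ~2(n-1)/n of the buffer per all-reduce). The activation size is an illustrative assumption, not a measurement:

```python
def allreduce_seconds(buffer_bytes: float, n_gpus: int, bw_gb_s: float) -> float:
    """Ring all-reduce: each GPU sends/receives ~2*(n-1)/n of the buffer."""
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes / (bw_gb_s * 1e9)

# FP16 activations, batch 8 × seq 1024 × hidden 8192 ≈ 134 MB per layer
act_bytes = 8 * 1024 * 8192 * 2
for name, bw in [("NVLink (A100)", 600), ("PCIe 4.0 x16", 64)]:
    print(f"{name}: {allreduce_seconds(act_bytes, 4, bw) * 1e3:.2f} ms per all-reduce")
# NVLink ~0.34 ms vs PCIe ~3.1 ms per layer — roughly the 10× gap noted above
```
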
### InfiniBand for multi-node

**HDR InfiniBand** (200 Gb/s):
- Required for multi-node TP or PP
- Latency: <1μs
- **Essential for 405B+ models**

## Monitoring Multi-GPU

```bash
# Monitor GPU utilization
nvidia-smi dmon -s u

# Monitor memory
nvidia-smi dmon -s m

# Monitor NVLink utilization
nvidia-smi nvlink --status

# TensorRT-LLM built-in metrics
curl http://localhost:8000/metrics
```

**Key metrics**:
- GPU utilization: target 80-95%
- Memory usage: should be balanced across GPUs
- NVLink traffic: high for TP, low for PP
- Throughput: tokens/sec across all GPUs

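The "balanced memory" check is easy to script via NVML. A minimal sketch assuming the `pynvml` bindings (`pip install nvidia-ml-py`):

```python
import pynvml

pynvml.nvmlInit()
used = []
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    used.append(mem.used / mem.total)
    print(f"GPU {i}: {100 * used[-1]:.0f}% mem, {util.gpu}% util")

# With TP the shards should be nearly equal; a big spread suggests a config problem
if max(used) - min(used) > 0.15:
    print("WARNING: GPU memory imbalance - check TP/PP configuration")
pynvml.nvmlShutdown()
```
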
## Common Issues

### Imbalanced GPU memory

**Symptom**: GPU 0 has 90% memory used, GPU 3 has 40%

**Solutions**:
- Verify TP/PP configuration
- Check model sharding (should be equal)
- Restart server to reset state

### Low NVLink utilization

**Symptom**: NVLink bandwidth <100 GB/s with TP=4

**Solutions**:
- Verify NVLink topology: `nvidia-smi topo -m`
- Check for PCIe fallback
- Ensure GPUs are on the same NVSwitch

### OOM with multi-GPU

**Solutions**:
- Increase TP size (more GPUs)
- Reduce batch size
- Enable FP8 quantization
- Use pipeline parallelism

## Performance Scaling

### TP Scaling (Llama 3-70B, FP8)

| GPUs | TP Size | Throughput | Latency | Efficiency |
|------|---------|------------|---------|------------|
| 1 | 1 | OOM | - | - |
| 2 | 2 | 6,000 tok/s | 18ms | 85% |
| 4 | 4 | 11,000 tok/s | 16ms | 78% |
| 8 | 8 | 18,000 tok/s | 15ms | 64% |

**Note**: Efficiency drops with more GPUs due to communication overhead.

### PP Scaling (Llama 3-405B, FP8)

| Nodes | TP | PP | Total GPUs | Throughput |
|-------|----|----|------------|------------|
| 1 | 8 | 1 | 8 | OOM |
| 2 | 8 | 2 | 16 | 25,000 tok/s |
| 4 | 8 | 4 | 32 | 45,000 tok/s |

## Best Practices

1. **Prefer TP over PP** when possible (lower latency)
2. **Use NVLink** for all TP deployments
3. **Use InfiniBand** for multi-node deployments
4. **Start with the smallest TP** that fits the model in memory
5. **Monitor GPU balance** - all GPUs should have similar utilization
6. **Test with a benchmark** before production
7. **Use FP8** on H100 for 2× speedup

optional-skills/mlops/tensorrt-llm/references/optimization.md (new file)
@@ -0,0 +1,242 @@

# TensorRT-LLM Optimization Guide

Comprehensive guide to optimizing LLM inference with TensorRT-LLM.

## Quantization

### FP8 Quantization (Recommended for H100)

**Benefits**:
- 2× faster inference
- 50% memory reduction
- Minimal accuracy loss (<1% perplexity degradation)

**Usage**:
```python
from tensorrt_llm import LLM

# Automatic FP8 quantization
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    quantization="fp8"
)
```

**Performance** (Llama 3-70B on 8× H100):
- FP16: 5,000 tokens/sec
- FP8: **10,000 tokens/sec** (2× speedup)
- Memory: 140GB → 70GB

### INT4 Quantization (Maximum compression)

**Benefits**:
- 4× memory reduction
- 3-4× faster inference
- Fits larger models on the same hardware

**Usage**:
```python
# INT4 with AWQ calibration
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    dtype="int4_awq",
    quantization="awq"
)

# INT4 with GPTQ calibration
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    dtype="int4_gptq",
    quantization="gptq"
)
```

**Trade-offs**:
- Accuracy: 1-3% perplexity increase
- Speed: 3-4× faster than FP16
- Use case: when memory is critical

## In-Flight Batching

**What it does**: Dynamically batches requests during generation instead of waiting for all sequences to finish.

**Configuration**:
```bash
# Server configuration:
#   --max_batch_size 256      maximum concurrent sequences
#   --max_num_tokens 4096     total tokens in batch
#   --enable_chunked_context  split long prompts
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --max_batch_size 256 \
    --max_num_tokens 4096 \
    --enable_chunked_context \
    --scheduler_policy max_utilization
```

**Performance**:
- Throughput: **4-8× higher** vs static batching
- Latency: lower P50/P99 for mixed workloads
- GPU utilization: 80-95% vs 40-60%

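The intuition behind those numbers: with static batching the whole batch waits for its longest sequence, while in-flight batching refills a slot the moment its sequence finishes. A toy simulation of the scheduling effect (illustrative only, not TensorRT-LLM internals):

```python
import random

random.seed(0)
lengths = [random.randint(10, 200) for _ in range(256)]  # output length per request
BATCH = 32

# Static batching: each batch of 32 runs for as long as its longest member
static_steps = sum(max(lengths[i:i + BATCH]) for i in range(0, len(lengths), BATCH))

# In-flight batching: 32 slots, each refilled as soon as its sequence ends
slots, queue, inflight_steps = [], list(lengths), 0
while slots or queue:
    while queue and len(slots) < BATCH:
        slots.append(queue.pop())
    step = min(slots)               # decode until the shortest active sequence ends
    inflight_steps += step
    slots = [s - step for s in slots if s > step]

print(f"static: {static_steps} decode steps, in-flight: {inflight_steps}")
```
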
## Paged KV Cache

**What it does**: Manages KV cache memory the way an OS manages virtual memory (paging).

**Benefits**:
- 40-60% higher throughput
- No memory fragmentation
- Supports longer sequences

**Configuration**:
```python
# Automatic paged KV cache (default)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_free_gpu_mem_fraction=0.9,  # Use 90% of free GPU mem for cache
    enable_prefix_caching=True           # Cache common prefixes
)
```

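The paging analogy in miniature: sequences take fixed-size blocks from a shared pool on demand, so freed blocks are immediately reusable and no contiguous region is ever wasted. A toy sketch of the idea (not TensorRT-LLM internals):

```python
class PagedKVPool:
    """Fixed-size KV blocks handed out on demand; freed blocks go back to the pool."""

    def __init__(self, num_blocks: int, block_tokens: int = 64):
        self.free = list(range(num_blocks))
        self.block_tokens = block_tokens
        self.tables = {}  # seq_id -> list of block ids (the "page table")

    def append_token(self, seq_id: str, pos: int):
        table = self.tables.setdefault(seq_id, [])
        if pos >= len(table) * self.block_tokens:  # current blocks are full
            table.append(self.free.pop())          # grab any free block

    def release(self, seq_id: str):
        self.free.extend(self.tables.pop(seq_id, []))

pool = PagedKVPool(num_blocks=4)
for t in range(100):
    pool.append_token("req-1", t)  # grows in 64-token blocks, lazily
pool.release("req-1")
print(len(pool.free))  # 4 - everything returned, no fragmentation
```
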
## Speculative Decoding

**What it does**: Uses a small draft model to predict multiple tokens, which the target model verifies in parallel.

**Speedup**: 2-3× faster for long generations

**Usage**:
```python
from tensorrt_llm import LLM

# Target model (Llama 3-70B)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    speculative_model="meta-llama/Meta-Llama-3-8B",  # Draft model
    num_speculative_tokens=5  # Tokens to predict ahead
)

# Same API, 2-3× faster
outputs = llm.generate(prompts)
```

**Best models for drafting**:
- Target: Llama 3-70B → Draft: Llama 3-8B
- Target: Qwen2-72B → Draft: Qwen2-7B
- Same family, 8-10× smaller

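Where the 2-3× comes from: if the target model accepts each draft token with probability `a`, one verification pass emits on average (1 - a^(k+1)) / (1 - a) tokens for `k` drafted tokens (the standard speculative-sampling result). The acceptance rates below are illustrative assumptions:

```python
def expected_tokens_per_pass(accept_prob: float, k: int) -> float:
    """Expected tokens emitted per target-model pass with k draft tokens."""
    return (1 - accept_prob ** (k + 1)) / (1 - accept_prob)

for a in (0.6, 0.7, 0.8):
    print(f"accept={a}: ~{expected_tokens_per_pass(a, k=5):.2f} tokens/pass")
# accept=0.8 yields ~3.69 tokens per pass - before subtracting the draft model's cost
```
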
## CUDA Graphs

**What it does**: Reduces kernel launch overhead by recording GPU operations.

**Benefits**:
- 10-20% lower latency
- More stable P99 latency
- Better for small batch sizes

**Configuration** (automatic by default):
```python
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    enable_cuda_graph=True,   # Default: True
    cuda_graph_cache_size=2   # Cache 2 graph variants
)
```

## Chunked Context

**What it does**: Splits long prompts into chunks to reduce memory spikes.

**Use case**: Prompts >8K tokens with limited GPU memory

**Configuration**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --max_num_tokens 4096 \
    --enable_chunked_context \
    --max_chunked_prefill_length 2048  # Process 2K tokens at a time
```

## Overlap Scheduling

**What it does**: Overlaps compute and memory operations.

**Benefits**:
- 15-25% higher throughput
- Better GPU utilization
- Default in v1.2.0+

**No configuration needed** - enabled automatically.

## Quantization Comparison Table

| Method | Memory | Speed | Accuracy | Use Case |
|--------|--------|-------|----------|----------|
| FP16 | 1× (baseline) | 1× | Best | High accuracy needed |
| FP8 | 0.5× | 2× | -0.5% ppl | **H100 default** |
| INT4 AWQ | 0.25× | 3-4× | -1.5% ppl | Memory critical |
| INT4 GPTQ | 0.25× | 3-4× | -2% ppl | Maximum speed |

## Tuning Workflow

1. **Start with defaults**:
   ```python
   llm = LLM(model="meta-llama/Meta-Llama-3-70B")
   ```

2. **Enable FP8** (if H100):
   ```python
   llm = LLM(model="...", dtype="fp8")
   ```

3. **Tune batch size**:
   ```bash
   # Increase until OOM, then reduce 20%
   trtllm-serve ... --max_batch_size 256
   ```

4. **Enable chunked context** (if long prompts):
   ```bash
   --enable_chunked_context --max_chunked_prefill_length 2048
   ```

5. **Try speculative decoding** (if latency critical):
   ```python
   llm = LLM(model="...", speculative_model="...")
   ```

## Benchmarking

```bash
# Install benchmark tool
pip install tensorrt_llm[benchmark]

# Run benchmark
python benchmarks/python/benchmark.py \
    --model meta-llama/Meta-Llama-3-8B \
    --batch_size 64 \
    --input_len 128 \
    --output_len 256 \
    --dtype fp8
```

**Metrics to track**:
- Throughput (tokens/sec)
- Latency P50/P90/P99 (ms)
- GPU memory usage (GB)
- GPU utilization (%)

## Common Issues

**OOM errors**:
- Reduce `max_batch_size`
- Reduce `max_num_tokens`
- Enable INT4 quantization
- Increase `tensor_parallel_size`

**Low throughput**:
- Increase `max_batch_size`
- Enable in-flight batching
- Verify CUDA graphs are enabled
- Check GPU utilization

**High latency**:
- Try speculative decoding
- Reduce `max_batch_size` (less queueing)
- Use FP8 instead of FP16

optional-skills/mlops/tensorrt-llm/references/serving.md (new file)
@@ -0,0 +1,470 @@

# Production Serving Guide

Comprehensive guide to deploying TensorRT-LLM in production environments.

## Server Modes

### trtllm-serve (Recommended)

**Features**:
- OpenAI-compatible API
- Automatic model download and compilation
- Built-in load balancing
- Prometheus metrics
- Health checks

**Basic usage**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 1 \
    --max_batch_size 256 \
    --port 8000
```

**Advanced configuration**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-70B \
    --tp_size 4 \
    --dtype fp8 \
    --max_batch_size 256 \
    --max_num_tokens 4096 \
    --enable_chunked_context \
    --scheduler_policy max_utilization \
    --port 8000 \
    --api_key $API_KEY  # Optional authentication
```

### Python LLM API (For embedding)

```python
from tensorrt_llm import LLM, SamplingParams


class LLMService:
    def __init__(self):
        self.llm = LLM(
            model="meta-llama/Meta-Llama-3-8B",
            dtype="fp8"
        )

    def generate(self, prompt, max_tokens=100):
        params = SamplingParams(
            max_tokens=max_tokens,
            temperature=0.7
        )
        outputs = self.llm.generate([prompt], params)
        return outputs[0].text


# Use in FastAPI, Flask, etc.
from fastapi import FastAPI

app = FastAPI()
service = LLMService()


@app.post("/generate")
def generate(prompt: str):
    return {"response": service.generate(prompt)}
```

## OpenAI-Compatible API

### Chat Completions

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing"}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": false
  }'
```

**Response**:
```json
{
  "id": "chat-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "meta-llama/Meta-Llama-3-8B",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Quantum computing is..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
```

### Streaming

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'
```

**Response** (SSE stream):
```
data: {"choices":[{"delta":{"content":"1"}}]}

data: {"choices":[{"delta":{"content":", 2"}}]}

data: {"choices":[{"delta":{"content":", 3"}}]}

data: [DONE]
```

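Because the API is OpenAI-compatible, the standard `openai` Python client works against it as well. A sketch assuming `pip install openai`; the `api_key` value is a placeholder unless the server was started with `--api_key`:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Count to 10"}],
)
print(resp.choices[0].message.content)

# Streaming: print deltas as they arrive
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
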
### Completions

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "The capital of France is",
    "max_tokens": 10,
    "temperature": 0.0
  }'
```

## Monitoring

### Prometheus Metrics

**Enable metrics**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --enable_metrics \
    --metrics_port 9090
```

**Key metrics**:
```bash
# Scrape metrics
curl http://localhost:9090/metrics

# Important metrics:
# - trtllm_request_success_total   - Total successful requests
# - trtllm_request_latency_seconds - Request latency histogram
# - trtllm_tokens_generated_total  - Total tokens generated
# - trtllm_active_requests         - Current active requests
# - trtllm_queue_size              - Requests waiting in queue
# - trtllm_gpu_memory_usage_bytes  - GPU memory usage
# - trtllm_kv_cache_usage_ratio    - KV cache utilization
```

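Once Prometheus scrapes that endpoint, rates and percentiles can be pulled from its HTTP query API. A sketch assuming a Prometheus server at `localhost:9091`; the queries use the metric names listed above, and the `_bucket` suffix assumes the latency metric is exposed as a standard histogram:

```python
import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9091"  # assumed Prometheus server address

queries = {
    "tokens/sec":  "rate(trtllm_tokens_generated_total[1m])",
    "p99 latency": "histogram_quantile(0.99, rate(trtllm_request_latency_seconds_bucket[5m]))",
    "queue depth": "trtllm_queue_size",
}

for label, q in queries.items():
    url = f"{PROM}/api/v1/query?query={urllib.parse.quote(q)}"
    with urllib.request.urlopen(url) as r:
        result = json.load(r)["data"]["result"]
    print(f"{label}: {result[0]['value'][1] if result else 'n/a'}")
```
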
### Health Checks

```bash
# Readiness probe
curl http://localhost:8000/health/ready

# Liveness probe
curl http://localhost:8000/health/live

# Model info
curl http://localhost:8000/v1/models
```

**Kubernetes probes**:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 5
```

## Production Deployment

### Docker Deployment

**Dockerfile**:
```dockerfile
FROM nvidia/tensorrt_llm:latest

# Copy any custom configs
COPY config.yaml /app/config.yaml

# Expose ports
EXPOSE 8000 9090

# Start server
CMD ["trtllm-serve", "meta-llama/Meta-Llama-3-8B", \
     "--tp_size", "4", \
     "--dtype", "fp8", \
     "--max_batch_size", "256", \
     "--enable_metrics", \
     "--metrics_port", "9090"]
```

**Run container**:
```bash
# Build first: docker build -t tensorrt-llm:latest .
docker run --gpus all -p 8000:8000 -p 9090:9090 \
    tensorrt-llm:latest
```

### Kubernetes Deployment

**Complete deployment**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorrt-llm
spec:
  replicas: 2  # Multiple replicas for HA
  selector:
    matchLabels:
      app: tensorrt-llm
  template:
    metadata:
      labels:
        app: tensorrt-llm
    spec:
      containers:
      - name: trtllm
        image: nvidia/tensorrt_llm:latest
        command:
        - trtllm-serve
        - meta-llama/Meta-Llama-3-70B
        - --tp_size=4
        - --dtype=fp8
        - --max_batch_size=256
        - --enable_metrics
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 9090
          name: metrics
        resources:
          limits:
            nvidia.com/gpu: 4
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8000
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: tensorrt-llm
spec:
  selector:
    app: tensorrt-llm
  ports:
  - name: http
    port: 80
    targetPort: 8000
  - name: metrics
    port: 9090
    targetPort: 9090
  type: LoadBalancer
```

### Load Balancing

**NGINX configuration**:
```nginx
upstream tensorrt_llm {
    least_conn;  # Route to least busy server
    server trtllm-1:8000 max_fails=3 fail_timeout=30s;
    server trtllm-2:8000 max_fails=3 fail_timeout=30s;
    server trtllm-3:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://tensorrt_llm;
        proxy_read_timeout 300s;  # Long timeout for slow generations
        proxy_connect_timeout 10s;
    }
}
```

## Autoscaling

### Horizontal Pod Autoscaler (HPA)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tensorrt-llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tensorrt-llm
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: trtllm_active_requests
      target:
        type: AverageValue
        averageValue: "50"  # Scale when avg >50 active requests
```

### Custom Metrics

```yaml
# Scale based on queue size
- type: Pods
  pods:
    metric:
      name: trtllm_queue_size
    target:
      type: AverageValue
      averageValue: "10"
```

## Cost Optimization

### GPU Selection

**A100 80GB** ($3-4/hour):
- Use for: 70B models with FP8
- Throughput: 10,000-15,000 tok/s (TP=4)
- Cost per 1M tokens: $0.20-0.30

**H100 80GB** ($6-8/hour):
- Use for: 70B models with FP8, 405B models
- Throughput: 20,000-30,000 tok/s (TP=4)
- Cost per 1M tokens: $0.15-0.25 (2× faster = lower cost)

**L4** ($0.50-1/hour):
- Use for: 7-8B models
- Throughput: 1,000-2,000 tok/s
- Cost per 1M tokens: $0.25-0.50

### Batch Size Tuning

**Impact on cost** (at $3/hour for the GPU):
- Batch size 1: 1,000 tok/s → 3.6M tokens/hour → ~$0.83/M tokens
- Batch size 64: 5,000 tok/s → 18M tokens/hour → ~$0.17/M tokens
- **5× cost reduction** with batching

**Recommendation**: Target batch size 32-128 for cost efficiency.

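The same arithmetic as a reusable helper:

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6

print(f"${cost_per_million_tokens(3.0, 1_000):.2f}/M")  # batch 1  -> $0.83/M
print(f"${cost_per_million_tokens(3.0, 5_000):.2f}/M")  # batch 64 -> $0.17/M
```
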
## Security

### API Authentication

```bash
# Generate API key
export API_KEY=$(openssl rand -hex 32)

# Start server with authentication
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --api_key $API_KEY

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "...", "messages": [...]}'
```

### Network Policies

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tensorrt-llm-policy
spec:
  podSelector:
    matchLabels:
      app: tensorrt-llm
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway  # Only allow from gateway
    ports:
    - protocol: TCP
      port: 8000
```

## Troubleshooting

### High latency

**Diagnosis**:
```bash
# Check queue size
curl http://localhost:9090/metrics | grep queue_size

# Check active requests
curl http://localhost:9090/metrics | grep active_requests
```

**Solutions**:
- Scale horizontally (more replicas)
- Increase batch size (if GPU underutilized)
- Enable chunked context (if long prompts)
- Use FP8 quantization

### OOM crashes

**Solutions**:
- Reduce `max_batch_size`
- Reduce `max_num_tokens`
- Enable FP8 or INT4 quantization
- Increase `tensor_parallel_size`

### Timeout errors

**NGINX config**:
```nginx
proxy_read_timeout 600s;  # 10 minutes for very long generations
proxy_send_timeout 600s;
```

## Best Practices

1. **Use FP8 on H100** for 2× speedup and 50% cost reduction
2. **Monitor metrics** - Set up Prometheus + Grafana
3. **Set readiness probes** - Prevent routing to unhealthy pods
4. **Use load balancing** - Distribute load across replicas
5. **Tune batch size** - Balance latency and throughput
6. **Enable streaming** - Better UX for chat applications
7. **Set up autoscaling** - Handle traffic spikes
8. **Use persistent volumes** - Cache compiled models
9. **Implement retries** - Handle transient failures (see the sketch below)
10. **Monitor costs** - Track cost per token

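For item 9, a minimal retry-with-backoff sketch; the bare `except Exception` is a placeholder to narrow to your client's transient error types:

```python
import time

def with_retries(call, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry a callable with exponential backoff on transient failures."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:  # narrow to e.g. connection/timeout errors
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Usage with the OpenAI-compatible client shown earlier:
# resp = with_retries(lambda: client.chat.completions.create(model=..., messages=...))
```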