hermes-agent/optional-skills/mlops/tensorrt-llm/references/multi-gpu.md
# Multi-GPU Deployment Guide
Comprehensive guide to scaling TensorRT-LLM across multiple GPUs and nodes.
## Parallelism Strategies
### Tensor Parallelism (TP)
**What it does**: Splits model layers across GPUs horizontally.
**Use case**:
- Model fits in total GPU memory but not single GPU
- Need low latency (single forward pass)
- GPUs on same node (NVLink required for best performance)
**Example** (Llama 3-70B on 4× A100):
```python
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,  # Split across 4 GPUs
    dtype="fp16",
)
# Model automatically sharded across GPUs
# Single forward pass, low latency
```
**Performance**:
- Latency: ~Same as single GPU
- Throughput: 4× higher (4 GPUs)
- Communication: High (activations synced every layer)
### Pipeline Parallelism (PP)
**What it does**: Splits model layers across GPUs vertically (layer-wise).
**Use case**:
- Very large models (175B+)
- Can tolerate higher latency
- GPUs across multiple nodes
**Example** (Llama 3-405B on 8× H100):
```python
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=4,    # TP=4 within each node
    pipeline_parallel_size=2,  # PP=2 across nodes
    dtype="fp8",
)
# Total: 8 GPUs (4×2)
# First half of layers: Node 1 (4 GPUs with TP)
# Second half of layers: Node 2 (4 GPUs with TP)
```
**Performance**:
- Latency: Higher (sequential through pipeline)
- Throughput: High with micro-batching
- Communication: Lower than TP
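The micro-batching trade-off can be estimated with the standard pipeline-bubble formula: with `p` stages and `m` micro-batches, the fraction of idle stage time is `(p-1)/(m+p-1)`. A minimal sketch (the formula is the generic pipeline-parallelism result, not TensorRT-LLM-specific):

```python
def pipeline_bubble_fraction(stages: int, micro_batches: int) -> float:
    """Fraction of time pipeline stages sit idle (the 'bubble')."""
    return (stages - 1) / (micro_batches + stages - 1)

# PP=2 with a single batch: half the hardware idles
print(pipeline_bubble_fraction(2, 1))   # 0.5
# PP=2 with 16 micro-batches: bubble shrinks to ~0.059
print(pipeline_bubble_fraction(2, 16))
```

This is why PP throughput stays high only with enough micro-batches in flight.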
### Expert Parallelism (EP)
**What it does**: Distributes MoE experts across GPUs.
**Use case**: Mixture-of-Experts models (Mixtral, DeepSeek-V2)
**Example** (Mixtral-8x22B on 8× A100):
```python
llm = LLM(
    model="mistralai/Mixtral-8x22B",
    tensor_parallel_size=4,
    expert_parallel_size=2,  # Distribute 8 experts across 2 groups
    dtype="fp16",  # A100 lacks FP8 tensor cores; use fp8 on H100
)
```
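The GPU arithmetic behind the example above, assuming experts divide evenly across EP groups:

```python
num_experts = 8  # Mixtral-8x22B has 8 experts per MoE layer
ep_size = 2      # expert_parallel_size
tp_size = 4      # tensor_parallel_size

experts_per_group = num_experts // ep_size  # 4 experts per EP group
total_gpus = tp_size * ep_size              # 8 GPUs, matching the example
print(experts_per_group, total_gpus)        # 4 8
```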
## Configuration Examples
### Small model (7-13B) - Single GPU
```python
# Llama 3-8B on 1× A100 80GB
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    dtype="fp16",  # or fp8 on H100
)
```
**Resources**:
- GPU: 1× A100 80GB
- Memory: ~16GB model + 30GB KV cache
- Throughput: 3,000-5,000 tokens/sec
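The memory figures above can be sanity-checked with a back-of-envelope calculation. A rough sketch (the KV-cache math assumes Llama 3-8B's published config: 32 layers, 8 KV heads via GQA, head dim 128, FP16 cache):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # 1e9 params × bytes per param ≈ GB of weights
    return params_billion * bytes_per_param

def kv_cache_gb_per_token(layers, kv_heads, head_dim, bytes_per_el=2):
    # 2× for keys and values
    return 2 * layers * kv_heads * head_dim * bytes_per_el / 1e9

print(weight_memory_gb(8, 2))                # ≈16 GB of FP16 weights
per_tok = kv_cache_gb_per_token(32, 8, 128)  # ≈0.000131 GB per cached token
print(30 / per_tok)                          # ~229k tokens fit in 30 GB of KV cache
```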
### Medium model (70B) - Multi-GPU same node
```python
# Llama 3-70B on 4× A100 80GB (NVLink)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,
    dtype="fp16",  # A100 lacks FP8; ~140GB of weights → ~35GB per GPU
)
```
**Resources**:
- GPU: 4× A100 80GB with NVLink
- Memory: ~35GB weights per GPU (FP16) + KV cache
- Throughput: 10,000-15,000 tokens/sec
- Latency: 15-20ms per token
### Large model (405B) - Multi-node
```python
# Llama 3-405B on 2 nodes × 8× H100 = 16 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,    # TP within each node
    pipeline_parallel_size=2,  # PP across 2 nodes
    dtype="fp8",
)
```
**Resources**:
- GPU: 2 nodes × 8 H100 80GB
- Memory: ~25GB per GPU (FP8)
- Throughput: 20,000-30,000 tokens/sec
- Network: InfiniBand recommended
## Server Deployment
### Single-node multi-GPU
```bash
# Llama 3-70B on 4 GPUs (automatic TP)
trtllm-serve meta-llama/Meta-Llama-3-70B \
  --tp_size 4 \
  --max_batch_size 256 \
  --dtype fp8

# Listens on http://localhost:8000
```
### Multi-node with Ray
```bash
# Node 1 (head node)
ray start --head --port=6379

# Node 2 (worker)
ray start --address='node1:6379'

# Deploy across the cluster (2 nodes)
trtllm-serve meta-llama/Meta-Llama-3-405B \
  --tp_size 8 \
  --pp_size 2 \
  --num_workers 2 \
  --dtype fp8
```
### Kubernetes deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorrt-llm-llama3-70b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorrt-llm-llama3-70b
  template:
    metadata:
      labels:
        app: tensorrt-llm-llama3-70b
    spec:
      containers:
        - name: trtllm
          image: nvidia/tensorrt_llm:latest
          command:
            - trtllm-serve
            - meta-llama/Meta-Llama-3-70B
            - --tp_size=4
            - --max_batch_size=256
          resources:
            limits:
              nvidia.com/gpu: 4  # Request 4 GPUs
```
## Parallelism Decision Tree
```
Model size < 20GB?
├─ YES: Single GPU (no parallelism)
└─ NO: Model size < 80GB?
   ├─ YES: TP=2 or TP=4 (same node)
   └─ NO: Model size < 320GB?
      ├─ YES: TP=4 or TP=8 (same node, NVLink required)
      └─ NO: TP=8 + PP=2 (multi-node)
```
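The decision tree above can be expressed as a small helper (the thresholds come straight from the tree; the function name is illustrative):

```python
def choose_parallelism(model_size_gb: float) -> dict:
    """Map model weight size to a TP/PP layout per the decision tree."""
    if model_size_gb < 20:
        return {"tp": 1, "pp": 1}  # single GPU, no parallelism
    if model_size_gb < 80:
        return {"tp": 4, "pp": 1}  # TP=2 or TP=4, same node
    if model_size_gb < 320:
        return {"tp": 8, "pp": 1}  # TP=4 or TP=8, NVLink required
    return {"tp": 8, "pp": 2}      # multi-node

print(choose_parallelism(70))   # {'tp': 4, 'pp': 1}
print(choose_parallelism(405))  # {'tp': 8, 'pp': 2}
```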
## Communication Optimization
### NVLink vs PCIe
**NVLink** (DGX A100, HGX H100):
- Bandwidth: 600 GB/s (A100), 900 GB/s (H100)
- Ideal for TP (high communication)
- **Recommended for all multi-GPU setups**
**PCIe**:
- Bandwidth: 64 GB/s (PCIe 4.0 x16)
- 10× slower than NVLink
- Avoid TP, use PP instead
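A quick way to see why this matters: estimate the time of one tensor-parallel all-reduce at each bandwidth. A sketch with illustrative numbers (8192 hidden size, 256 tokens in flight, FP16 activations; a ring all-reduce moves roughly 2× the payload over the links):

```python
def allreduce_ms(payload_bytes: float, bandwidth_gb_s: float) -> float:
    # Ring all-reduce transfers ~2x the payload across the interconnect
    return 2 * payload_bytes / (bandwidth_gb_s * 1e9) * 1e3

tokens, hidden, fp16_bytes = 256, 8192, 2
payload = tokens * hidden * fp16_bytes  # ~4.2 MB of activations per layer

print(allreduce_ms(payload, 600))  # NVLink (A100): ~0.014 ms
print(allreduce_ms(payload, 64))   # PCIe 4.0 x16:  ~0.13 ms
```

Multiplied across every layer of every forward pass, the ~10× gap is what makes TP over PCIe impractical.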
### InfiniBand for multi-node
**HDR InfiniBand** (200 Gb/s):
- Required for multi-node TP or PP
- Latency: <1μs
- **Essential for 405B+ models**
## Monitoring Multi-GPU
```bash
# Monitor GPU utilization
nvidia-smi dmon -s u

# Monitor memory
nvidia-smi dmon -s m

# Monitor NVLink utilization
nvidia-smi nvlink --status

# TensorRT-LLM built-in metrics
curl http://localhost:8000/metrics
```
**Key metrics**:
- GPU utilization: Target 80-95%
- Memory usage: Should be balanced across GPUs
- NVLink traffic: High for TP, low for PP
- Throughput: Tokens/sec across all GPUs
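A balanced shard is easy to check programmatically. A sketch assuming you already have per-GPU memory readings (e.g. from `nvidia-smi` or the metrics endpoint; the helper name and 10% tolerance are illustrative):

```python
def memory_balanced(per_gpu_gb: list[float], tolerance: float = 0.10) -> bool:
    """True if every GPU is within `tolerance` of the mean usage."""
    mean = sum(per_gpu_gb) / len(per_gpu_gb)
    return all(abs(g - mean) / mean <= tolerance for g in per_gpu_gb)

print(memory_balanced([35.1, 34.8, 35.3, 34.9]))  # True - healthy TP=4 shard
print(memory_balanced([72.0, 33.0, 34.0, 32.0]))  # False - investigate GPU 0
```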
## Common Issues
### Imbalanced GPU memory
**Symptom**: GPU 0 has 90% memory, GPU 3 has 40%
**Solutions**:
- Verify TP/PP configuration
- Check model sharding (should be equal)
- Restart server to reset state
### Low NVLink utilization
**Symptom**: NVLink bandwidth <100 GB/s with TP=4
**Solutions**:
- Verify NVLink topology: `nvidia-smi topo -m`
- Check for PCIe fallback
- Ensure GPUs are on same NVSwitch
### OOM with multi-GPU
**Solutions**:
- Increase TP size (more GPUs)
- Reduce batch size
- Enable FP8 quantization
- Use pipeline parallelism
## Performance Scaling
### TP Scaling (Llama 3-70B, FP8)
| GPUs | TP Size | Throughput | Latency | Efficiency |
|------|---------|------------|---------|------------|
| 1 | 1 | OOM | - | - |
| 2 | 2 | 6,000 tok/s | 18ms | 85% |
| 4 | 4 | 11,000 tok/s | 16ms | 78% |
| 8 | 8 | 18,000 tok/s | 15ms | 64% |
**Note**: Efficiency drops with more GPUs due to communication overhead.
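The efficiency column follows from extrapolating a single-GPU baseline: the model OOMs on one GPU, so take the 2-GPU row at 85% to back out T1 ≈ 6,000 / (2 × 0.85) ≈ 3,529 tok/s, then efficiency = throughput / (N × T1). A sketch reproducing the table:

```python
# Back out a hypothetical single-GPU baseline from the 2-GPU row
t1 = 6000 / (2 * 0.85)  # ~3,529 tok/s

def scaling_efficiency(throughput: float, n_gpus: int) -> float:
    """Measured throughput vs. ideal linear scaling from the baseline."""
    return throughput / (n_gpus * t1)

print(f"{scaling_efficiency(11000, 4):.0%}")  # 78% (matches the table)
print(f"{scaling_efficiency(18000, 8):.0%}")  # 64%
```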
### PP Scaling (Llama 3-405B, FP8)
| Nodes | TP | PP | Total GPUs | Throughput |
|-------|----|----|------------|------------|
| 1 | 8 | 1 | 8 | OOM |
| 2 | 8 | 2 | 16 | 25,000 tok/s |
| 4 | 8 | 4 | 32 | 45,000 tok/s |
## Best Practices
1. **Prefer TP over PP** when possible (lower latency)
2. **Use NVLink** for all TP deployments
3. **Use InfiniBand** for multi-node deployments
4. **Start with smallest TP** that fits model in memory
5. **Monitor GPU balance** - all GPUs should have similar utilization
6. **Test with benchmark** before production
7. **Use FP8** on H100 for 2× speedup