mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
fix: restore all removed bundled skills + fix skills sync system
- Restored 21 skills removed in commits757d012and740dd92: accelerate, audiocraft, code-review, faiss, flash-attention, gguf, grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion, tensorrt-llm, torchtitan, trl-fine-tuning, whisper - Rewrote sync_skills() with proper update semantics: * New skills (not in manifest): copied to user dir * Existing skills (in manifest + on disk): updated via hash comparison * User-deleted skills (in manifest, not on disk): respected, not re-added * Stale manifest entries (removed from bundled): cleaned from manifest - Added sync_skills() to CLI startup (cmd_chat) and gateway startup (start_gateway) — previously only ran during 'hermes update' - Updated cmd_update output to show new/updated/cleaned counts - Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh install, user deletion respect, update detection, stale cleanup, and name collision handling 75 bundled skills total. 2002 tests pass.
This commit is contained in:
parent
68fbae5692
commit
ab0f4126cf
74 changed files with 27881 additions and 44 deletions
298
skills/mlops/tensorrt-llm/references/multi-gpu.md
Normal file
298
skills/mlops/tensorrt-llm/references/multi-gpu.md
Normal file
|
|
@ -0,0 +1,298 @@
|
|||
# Multi-GPU Deployment Guide
|
||||
|
||||
Comprehensive guide to scaling TensorRT-LLM across multiple GPUs and nodes.
|
||||
|
||||
## Parallelism Strategies
|
||||
|
||||
### Tensor Parallelism (TP)
|
||||
|
||||
**What it does**: Splits model layers across GPUs horizontally.
|
||||
|
||||
**Use case**:
|
||||
- Model fits in total GPU memory but not single GPU
|
||||
- Need low latency (single forward pass)
|
||||
- GPUs on same node (NVLink required for best performance)
|
||||
|
||||
**Example** (Llama 3-70B on 4× A100):
|
||||
```python
|
||||
from tensorrt_llm import LLM
|
||||
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-70B",
|
||||
tensor_parallel_size=4, # Split across 4 GPUs
|
||||
dtype="fp16"
|
||||
)
|
||||
|
||||
# Model automatically sharded across GPUs
|
||||
# Single forward pass, low latency
|
||||
```
|
||||
|
||||
**Performance**:
|
||||
- Latency: ~Same as single GPU
|
||||
- Throughput: 4× higher (4 GPUs)
|
||||
- Communication: High (activations synced every layer)
|
||||
|
||||
### Pipeline Parallelism (PP)
|
||||
|
||||
**What it does**: Splits model layers across GPUs vertically (layer-wise).
|
||||
|
||||
**Use case**:
|
||||
- Very large models (175B+)
|
||||
- Can tolerate higher latency
|
||||
- GPUs across multiple nodes
|
||||
|
||||
**Example** (Llama 3-405B on 8× H100):
|
||||
```python
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-405B",
|
||||
tensor_parallel_size=4, # TP=4 within nodes
|
||||
pipeline_parallel_size=2, # PP=2 across nodes
|
||||
dtype="fp8"
|
||||
)
|
||||
|
||||
# Total: 8 GPUs (4×2)
|
||||
# Layers 0-40: Node 1 (4 GPUs with TP)
|
||||
# Layers 41-80: Node 2 (4 GPUs with TP)
|
||||
```
|
||||
|
||||
**Performance**:
|
||||
- Latency: Higher (sequential through pipeline)
|
||||
- Throughput: High with micro-batching
|
||||
- Communication: Lower than TP
|
||||
|
||||
### Expert Parallelism (EP)
|
||||
|
||||
**What it does**: Distributes MoE experts across GPUs.
|
||||
|
||||
**Use case**: Mixture-of-Experts models (Mixtral, DeepSeek-V2)
|
||||
|
||||
**Example** (Mixtral-8x22B on 8× A100):
|
||||
```python
|
||||
llm = LLM(
|
||||
model="mistralai/Mixtral-8x22B",
|
||||
tensor_parallel_size=4,
|
||||
expert_parallel_size=2, # Distribute 8 experts across 2 groups
|
||||
dtype="fp8"
|
||||
)
|
||||
```
|
||||
|
||||
## Configuration Examples
|
||||
|
||||
### Small model (7-13B) - Single GPU
|
||||
|
||||
```python
|
||||
# Llama 3-8B on 1× A100 80GB
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-8B",
|
||||
dtype="fp16" # or fp8 for H100
|
||||
)
|
||||
```
|
||||
|
||||
**Resources**:
|
||||
- GPU: 1× A100 80GB
|
||||
- Memory: ~16GB model + 30GB KV cache
|
||||
- Throughput: 3,000-5,000 tokens/sec
|
||||
|
||||
### Medium model (70B) - Multi-GPU same node
|
||||
|
||||
```python
|
||||
# Llama 3-70B on 4× A100 80GB (NVLink)
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-70B",
|
||||
tensor_parallel_size=4,
|
||||
dtype="fp8" # 70GB → 35GB per GPU
|
||||
)
|
||||
```
|
||||
|
||||
**Resources**:
|
||||
- GPU: 4× A100 80GB with NVLink
|
||||
- Memory: ~35GB per GPU (FP8)
|
||||
- Throughput: 10,000-15,000 tokens/sec
|
||||
- Latency: 15-20ms per token
|
||||
|
||||
### Large model (405B) - Multi-node
|
||||
|
||||
```python
|
||||
# Llama 3-405B on 2 nodes × 8 H100 = 16 GPUs
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-405B",
|
||||
tensor_parallel_size=8, # TP within each node
|
||||
pipeline_parallel_size=2, # PP across 2 nodes
|
||||
dtype="fp8"
|
||||
)
|
||||
```
|
||||
|
||||
**Resources**:
|
||||
- GPU: 2 nodes × 8 H100 80GB
|
||||
- Memory: ~25GB per GPU (FP8)
|
||||
- Throughput: 20,000-30,000 tokens/sec
|
||||
- Network: InfiniBand recommended
|
||||
|
||||
## Server Deployment
|
||||
|
||||
### Single-node multi-GPU
|
||||
|
||||
```bash
|
||||
# Llama 3-70B on 4 GPUs (automatic TP)
|
||||
trtllm-serve meta-llama/Meta-Llama-3-70B \
|
||||
--tp_size 4 \
|
||||
--max_batch_size 256 \
|
||||
--dtype fp8
|
||||
|
||||
# Listens on http://localhost:8000
|
||||
```
|
||||
|
||||
### Multi-node with Ray
|
||||
|
||||
```bash
|
||||
# Node 1 (head node)
|
||||
ray start --head --port=6379
|
||||
|
||||
# Node 2 (worker)
|
||||
ray start --address='node1:6379'
|
||||
|
||||
# Deploy across cluster
|
||||
trtllm-serve meta-llama/Meta-Llama-3-405B \
|
||||
--tp_size 8 \
|
||||
--pp_size 2 \
|
||||
--num_workers 2 \ # 2 nodes
|
||||
--dtype fp8
|
||||
```
|
||||
|
||||
### Kubernetes deployment
|
||||
|
||||
```yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: tensorrt-llm-llama3-70b
|
||||
spec:
|
||||
replicas: 1
|
||||
template:
|
||||
spec:
|
||||
containers:
|
||||
- name: trtllm
|
||||
image: nvidia/tensorrt_llm:latest
|
||||
command:
|
||||
- trtllm-serve
|
||||
- meta-llama/Meta-Llama-3-70B
|
||||
- --tp_size=4
|
||||
- --max_batch_size=256
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 4 # Request 4 GPUs
|
||||
```
|
||||
|
||||
## Parallelism Decision Tree
|
||||
|
||||
```
|
||||
Model size < 20GB?
|
||||
├─ YES: Single GPU (no parallelism)
|
||||
└─ NO: Model size < 80GB?
|
||||
├─ YES: TP=2 or TP=4 (same node)
|
||||
└─ NO: Model size < 320GB?
|
||||
├─ YES: TP=4 or TP=8 (same node, NVLink required)
|
||||
└─ NO: TP=8 + PP=2 (multi-node)
|
||||
```
|
||||
|
||||
## Communication Optimization
|
||||
|
||||
### NVLink vs PCIe
|
||||
|
||||
**NVLink** (DGX A100, HGX H100):
|
||||
- Bandwidth: 600 GB/s (A100), 900 GB/s (H100)
|
||||
- Ideal for TP (high communication)
|
||||
- **Recommended for all multi-GPU setups**
|
||||
|
||||
**PCIe**:
|
||||
- Bandwidth: 64 GB/s (PCIe 4.0 x16)
|
||||
- 10× slower than NVLink
|
||||
- Avoid TP, use PP instead
|
||||
|
||||
### InfiniBand for multi-node
|
||||
|
||||
**HDR InfiniBand** (200 Gb/s):
|
||||
- Required for multi-node TP or PP
|
||||
- Latency: <1μs
|
||||
- **Essential for 405B+ models**
|
||||
|
||||
## Monitoring Multi-GPU
|
||||
|
||||
```python
|
||||
# Monitor GPU utilization
|
||||
nvidia-smi dmon -s u
|
||||
|
||||
# Monitor memory
|
||||
nvidia-smi dmon -s m
|
||||
|
||||
# Monitor NVLink utilization
|
||||
nvidia-smi nvlink --status
|
||||
|
||||
# TensorRT-LLM built-in metrics
|
||||
curl http://localhost:8000/metrics
|
||||
```
|
||||
|
||||
**Key metrics**:
|
||||
- GPU utilization: Target 80-95%
|
||||
- Memory usage: Should be balanced across GPUs
|
||||
- NVLink traffic: High for TP, low for PP
|
||||
- Throughput: Tokens/sec across all GPUs
|
||||
|
||||
## Common Issues
|
||||
|
||||
### Imbalanced GPU memory
|
||||
|
||||
**Symptom**: GPU 0 has 90% memory, GPU 3 has 40%
|
||||
|
||||
**Solutions**:
|
||||
- Verify TP/PP configuration
|
||||
- Check model sharding (should be equal)
|
||||
- Restart server to reset state
|
||||
|
||||
### Low NVLink utilization
|
||||
|
||||
**Symptom**: NVLink bandwidth <100 GB/s with TP=4
|
||||
|
||||
**Solutions**:
|
||||
- Verify NVLink topology: `nvidia-smi topo -m`
|
||||
- Check for PCIe fallback
|
||||
- Ensure GPUs are on same NVSwitch
|
||||
|
||||
### OOM with multi-GPU
|
||||
|
||||
**Solutions**:
|
||||
- Increase TP size (more GPUs)
|
||||
- Reduce batch size
|
||||
- Enable FP8 quantization
|
||||
- Use pipeline parallelism
|
||||
|
||||
## Performance Scaling
|
||||
|
||||
### TP Scaling (Llama 3-70B, FP8)
|
||||
|
||||
| GPUs | TP Size | Throughput | Latency | Efficiency |
|
||||
|------|---------|------------|---------|------------|
|
||||
| 1 | 1 | OOM | - | - |
|
||||
| 2 | 2 | 6,000 tok/s | 18ms | 85% |
|
||||
| 4 | 4 | 11,000 tok/s | 16ms | 78% |
|
||||
| 8 | 8 | 18,000 tok/s | 15ms | 64% |
|
||||
|
||||
**Note**: Efficiency drops with more GPUs due to communication overhead.
|
||||
|
||||
### PP Scaling (Llama 3-405B, FP8)
|
||||
|
||||
| Nodes | TP | PP | Total GPUs | Throughput |
|
||||
|-------|----|----|------------|------------|
|
||||
| 1 | 8 | 1 | 8 | OOM |
|
||||
| 2 | 8 | 2 | 16 | 25,000 tok/s |
|
||||
| 4 | 8 | 4 | 32 | 45,000 tok/s |
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Prefer TP over PP** when possible (lower latency)
|
||||
2. **Use NVLink** for all TP deployments
|
||||
3. **Use InfiniBand** for multi-node deployments
|
||||
4. **Start with smallest TP** that fits model in memory
|
||||
5. **Monitor GPU balance** - all GPUs should have similar utilization
|
||||
6. **Test with benchmark** before production
|
||||
7. **Use FP8** on H100 for 2× speedup
|
||||
Loading…
Add table
Add a link
Reference in a new issue