mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
fix: restore all removed bundled skills + fix skills sync system
- Restored 21 skills removed in commits757d012and740dd92: accelerate, audiocraft, code-review, faiss, flash-attention, gguf, grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion, tensorrt-llm, torchtitan, trl-fine-tuning, whisper - Rewrote sync_skills() with proper update semantics: * New skills (not in manifest): copied to user dir * Existing skills (in manifest + on disk): updated via hash comparison * User-deleted skills (in manifest, not on disk): respected, not re-added * Stale manifest entries (removed from bundled): cleaned from manifest - Added sync_skills() to CLI startup (cmd_chat) and gateway startup (start_gateway) — previously only ran during 'hermes update' - Updated cmd_update output to show new/updated/cleaned counts - Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh install, user deletion respect, update detection, stale cleanup, and name collision handling 75 bundled skills total. 2002 tests pass.
This commit is contained in:
parent
68fbae5692
commit
ab0f4126cf
74 changed files with 27881 additions and 44 deletions
242
skills/mlops/tensorrt-llm/references/optimization.md
Normal file
242
skills/mlops/tensorrt-llm/references/optimization.md
Normal file
|
|
@ -0,0 +1,242 @@
|
|||
# TensorRT-LLM Optimization Guide
|
||||
|
||||
Comprehensive guide to optimizing LLM inference with TensorRT-LLM.
|
||||
|
||||
## Quantization
|
||||
|
||||
### FP8 Quantization (Recommended for H100)
|
||||
|
||||
**Benefits**:
|
||||
- 2× faster inference
|
||||
- 50% memory reduction
|
||||
- Minimal accuracy loss (<1% perplexity degradation)
|
||||
|
||||
**Usage**:
|
||||
```python
|
||||
from tensorrt_llm import LLM
|
||||
|
||||
# Automatic FP8 quantization
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-70B",
|
||||
dtype="fp8",
|
||||
quantization="fp8"
|
||||
)
|
||||
```
|
||||
|
||||
**Performance** (Llama 3-70B on 8× H100):
|
||||
- FP16: 5,000 tokens/sec
|
||||
- FP8: **10,000 tokens/sec** (2× speedup)
|
||||
- Memory: 140GB → 70GB
|
||||
|
||||
### INT4 Quantization (Maximum compression)
|
||||
|
||||
**Benefits**:
|
||||
- 4× memory reduction
|
||||
- 3-4× faster inference
|
||||
- Fits larger models on same hardware
|
||||
|
||||
**Usage**:
|
||||
```python
|
||||
# INT4 with AWQ calibration
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-405B",
|
||||
dtype="int4_awq",
|
||||
quantization="awq"
|
||||
)
|
||||
|
||||
# INT4 with GPTQ calibration
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-405B",
|
||||
dtype="int4_gptq",
|
||||
quantization="gptq"
|
||||
)
|
||||
```
|
||||
|
||||
**Trade-offs**:
|
||||
- Accuracy: 1-3% perplexity increase
|
||||
- Speed: 3-4× faster than FP16
|
||||
- Use case: When memory is critical
|
||||
|
||||
## In-Flight Batching
|
||||
|
||||
**What it does**: Dynamically batches requests during generation instead of waiting for all sequences to finish.
|
||||
|
||||
**Configuration**:
|
||||
```python
|
||||
# Server configuration
|
||||
trtllm-serve meta-llama/Meta-Llama-3-8B \
|
||||
--max_batch_size 256 \ # Maximum concurrent sequences
|
||||
--max_num_tokens 4096 \ # Total tokens in batch
|
||||
--enable_chunked_context \ # Split long prompts
|
||||
--scheduler_policy max_utilization
|
||||
```
|
||||
|
||||
**Performance**:
|
||||
- Throughput: **4-8× higher** vs static batching
|
||||
- Latency: Lower P50/P99 for mixed workloads
|
||||
- GPU utilization: 80-95% vs 40-60%
|
||||
|
||||
## Paged KV Cache
|
||||
|
||||
**What it does**: Manages KV cache memory like OS manages virtual memory (paging).
|
||||
|
||||
**Benefits**:
|
||||
- 40-60% higher throughput
|
||||
- No memory fragmentation
|
||||
- Supports longer sequences
|
||||
|
||||
**Configuration**:
|
||||
```python
|
||||
# Automatic paged KV cache (default)
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-8B",
|
||||
kv_cache_free_gpu_mem_fraction=0.9, # Use 90% GPU mem for cache
|
||||
enable_prefix_caching=True # Cache common prefixes
|
||||
)
|
||||
```
|
||||
|
||||
## Speculative Decoding
|
||||
|
||||
**What it does**: Uses small draft model to predict multiple tokens, verified by target model in parallel.
|
||||
|
||||
**Speedup**: 2-3× faster for long generations
|
||||
|
||||
**Usage**:
|
||||
```python
|
||||
from tensorrt_llm import LLM
|
||||
|
||||
# Target model (Llama 3-70B)
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-70B",
|
||||
speculative_model="meta-llama/Meta-Llama-3-8B", # Draft model
|
||||
num_speculative_tokens=5 # Tokens to predict ahead
|
||||
)
|
||||
|
||||
# Same API, 2-3× faster
|
||||
outputs = llm.generate(prompts)
|
||||
```
|
||||
|
||||
**Best models for drafting**:
|
||||
- Target: Llama 3-70B → Draft: Llama 3-8B
|
||||
- Target: Qwen2-72B → Draft: Qwen2-7B
|
||||
- Same family, 8-10× smaller
|
||||
|
||||
## CUDA Graphs
|
||||
|
||||
**What it does**: Reduces kernel launch overhead by recording GPU operations.
|
||||
|
||||
**Benefits**:
|
||||
- 10-20% lower latency
|
||||
- More stable P99 latency
|
||||
- Better for small batch sizes
|
||||
|
||||
**Configuration** (automatic by default):
|
||||
```python
|
||||
llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-8B",
|
||||
enable_cuda_graph=True, # Default: True
|
||||
cuda_graph_cache_size=2 # Cache 2 graph variants
|
||||
)
|
||||
```
|
||||
|
||||
## Chunked Context
|
||||
|
||||
**What it does**: Splits long prompts into chunks to reduce memory spikes.
|
||||
|
||||
**Use case**: Prompts >8K tokens with limited GPU memory
|
||||
|
||||
**Configuration**:
|
||||
```bash
|
||||
trtllm-serve meta-llama/Meta-Llama-3-8B \
|
||||
--max_num_tokens 4096 \
|
||||
--enable_chunked_context \
|
||||
--max_chunked_prefill_length 2048 # Process 2K tokens at a time
|
||||
```
|
||||
|
||||
## Overlap Scheduling
|
||||
|
||||
**What it does**: Overlaps compute and memory operations.
|
||||
|
||||
**Benefits**:
|
||||
- 15-25% higher throughput
|
||||
- Better GPU utilization
|
||||
- Default in v1.2.0+
|
||||
|
||||
**No configuration needed** - enabled automatically.
|
||||
|
||||
## Quantization Comparison Table
|
||||
|
||||
| Method | Memory | Speed | Accuracy | Use Case |
|
||||
|--------|--------|-------|----------|----------|
|
||||
| FP16 | 1× (baseline) | 1× | Best | High accuracy needed |
|
||||
| FP8 | 0.5× | 2× | -0.5% ppl | **H100 default** |
|
||||
| INT4 AWQ | 0.25× | 3-4× | -1.5% ppl | Memory critical |
|
||||
| INT4 GPTQ | 0.25× | 3-4× | -2% ppl | Maximum speed |
|
||||
|
||||
## Tuning Workflow
|
||||
|
||||
1. **Start with defaults**:
|
||||
```python
|
||||
llm = LLM(model="meta-llama/Meta-Llama-3-70B")
|
||||
```
|
||||
|
||||
2. **Enable FP8** (if H100):
|
||||
```python
|
||||
llm = LLM(model="...", dtype="fp8")
|
||||
```
|
||||
|
||||
3. **Tune batch size**:
|
||||
```python
|
||||
# Increase until OOM, then reduce 20%
|
||||
trtllm-serve ... --max_batch_size 256
|
||||
```
|
||||
|
||||
4. **Enable chunked context** (if long prompts):
|
||||
```bash
|
||||
--enable_chunked_context --max_chunked_prefill_length 2048
|
||||
```
|
||||
|
||||
5. **Try speculative decoding** (if latency critical):
|
||||
```python
|
||||
llm = LLM(model="...", speculative_model="...")
|
||||
```
|
||||
|
||||
## Benchmarking
|
||||
|
||||
```bash
|
||||
# Install benchmark tool
|
||||
pip install tensorrt_llm[benchmark]
|
||||
|
||||
# Run benchmark
|
||||
python benchmarks/python/benchmark.py \
|
||||
--model meta-llama/Meta-Llama-3-8B \
|
||||
--batch_size 64 \
|
||||
--input_len 128 \
|
||||
--output_len 256 \
|
||||
--dtype fp8
|
||||
```
|
||||
|
||||
**Metrics to track**:
|
||||
- Throughput (tokens/sec)
|
||||
- Latency P50/P90/P99 (ms)
|
||||
- GPU memory usage (GB)
|
||||
- GPU utilization (%)
|
||||
|
||||
## Common Issues
|
||||
|
||||
**OOM errors**:
|
||||
- Reduce `max_batch_size`
|
||||
- Reduce `max_num_tokens`
|
||||
- Enable INT4 quantization
|
||||
- Increase `tensor_parallel_size`
|
||||
|
||||
**Low throughput**:
|
||||
- Increase `max_batch_size`
|
||||
- Enable in-flight batching
|
||||
- Verify CUDA graphs enabled
|
||||
- Check GPU utilization
|
||||
|
||||
**High latency**:
|
||||
- Try speculative decoding
|
||||
- Reduce `max_batch_size` (less queueing)
|
||||
- Use FP8 instead of FP16
|
||||
Loading…
Add table
Add a link
Reference in a new issue