fix: restore all removed bundled skills + fix skills sync system

- Restored 21 skills removed in commits 757d012 and 740dd92:
  accelerate, audiocraft, code-review, faiss, flash-attention, gguf,
  grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft,
  pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion,
  tensorrt-llm, torchtitan, trl-fine-tuning, whisper

- Rewrote sync_skills() with proper update semantics:
  * New skills (not in manifest): copied to user dir
  * Existing skills (in manifest + on disk): updated via hash comparison
  * User-deleted skills (in manifest, not on disk): respected, not re-added
  * Stale manifest entries (removed from bundled): cleaned from manifest

- Added sync_skills() to CLI startup (cmd_chat) and gateway startup
  (start_gateway) — previously only ran during 'hermes update'

- Updated cmd_update output to show new/updated/cleaned counts

- Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh
  install, user deletion respect, update detection, stale cleanup, and
  name collision handling

75 bundled skills total. 2002 tests pass.
# TensorRT-LLM Optimization Guide
Comprehensive guide to optimizing LLM inference with TensorRT-LLM.
## Quantization
### FP8 Quantization (Recommended for H100)
**Benefits**:
- 2× faster inference
- 50% memory reduction
- Minimal accuracy loss (<1% perplexity degradation)
**Usage**:
```python
from tensorrt_llm import LLM
# Automatic FP8 quantization
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    quantization="fp8"
)
```
**Performance** (Llama 3-70B on 8× H100):
- FP16: 5,000 tokens/sec
- FP8: **10,000 tokens/sec** (2× speedup)
- Memory: 140GB → 70GB
### INT4 Quantization (Maximum compression)
**Benefits**:
- 4× memory reduction
- 3-4× faster inference
- Fits larger models on same hardware
**Usage**:
```python
# INT4 with AWQ calibration
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    dtype="int4_awq",
    quantization="awq"
)

# INT4 with GPTQ calibration
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    dtype="int4_gptq",
    quantization="gptq"
)
```
**Trade-offs**:
- Accuracy: 1-3% perplexity increase
- Speed: 3-4× faster than FP16
- Use case: When memory is critical
## In-Flight Batching
**What it does**: Dynamically batches requests during generation instead of waiting for all sequences to finish.
**Configuration**:
```bash
# Server configuration:
#   --max_batch_size: maximum concurrent sequences
#   --max_num_tokens: total tokens in the batch
#   --enable_chunked_context: split long prompts into chunks
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --max_batch_size 256 \
    --max_num_tokens 4096 \
    --enable_chunked_context \
    --scheduler_policy max_utilization
```
**Performance**:
- Throughput: **4-8× higher** vs static batching
- Latency: Lower P50/P99 for mixed workloads
- GPU utilization: 80-95% vs 40-60%
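To see in-flight batching from the client side, send many concurrent requests and let the scheduler interleave them. A minimal sketch, assuming `trtllm-serve` exposes its OpenAI-compatible completions endpoint at `localhost:8000` (adjust the URL and model name for your deployment):
```python
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # assumed server address

def complete(prompt: str) -> str:
    # Each request joins the in-flight batch as soon as it arrives;
    # finished sequences free their slots without draining the batch.
    resp = requests.post(URL, json={
        "model": "meta-llama/Meta-Llama-3-8B",
        "prompt": prompt,
        "max_tokens": 64,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

# 32 concurrent requests of varying length exercise the scheduler
prompts = [f"Explain optimization trick #{i} in one sentence." for i in range(32)]
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(complete, prompts))
print(results[0])
```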
## Paged KV Cache
**What it does**: Manages KV cache memory in fixed-size pages, the way an OS manages virtual memory.
**Benefits**:
- 40-60% higher throughput
- No memory fragmentation
- Supports longer sequences
**Configuration**:
```python
# Automatic paged KV cache (default)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_free_gpu_mem_fraction=0.9,  # Use 90% of GPU mem for cache
    enable_prefix_caching=True           # Cache common prefixes
)
```
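Prefix caching pays off when many requests share a leading span, such as a fixed system prompt. A minimal sketch of that pattern, reusing the `llm` instance from above:
```python
# Requests sharing this prefix reuse its cached KV blocks, so only
# each question's suffix is prefilled per request.
system_prefix = (
    "You are a support assistant for an internal API. "
    "Answer concisely and cite the relevant endpoint.\n\n"
)
questions = [
    "How do I rotate an API key?",
    "What is the rate limit on search?",
    "How do I paginate results?",
]
outputs = llm.generate([system_prefix + q for q in questions])
for out in outputs:
    print(out.outputs[0].text)
```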
## Speculative Decoding
**What it does**: Uses a small draft model to propose several tokens ahead, which the target model then verifies in a single parallel pass.
**Speedup**: 2-3× faster for long generations
**Usage**:
```python
from tensorrt_llm import LLM
# Target model (Llama 3-70B)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    speculative_model="meta-llama/Meta-Llama-3-8B",  # Draft model
    num_speculative_tokens=5  # Tokens to predict ahead
)
# Same API, 2-3× faster
outputs = llm.generate(prompts)
```
**Best models for drafting**:
- Target: Llama 3-70B → Draft: Llama 3-8B
- Target: Qwen2-72B → Draft: Qwen2-7B
- Rule of thumb: same model family, 8-10× smaller than the target
## CUDA Graphs
**What it does**: Records sequences of GPU kernel launches and replays them as a single unit, eliminating per-launch CPU overhead.
**Benefits**:
- 10-20% lower latency
- More stable P99 latency
- Better for small batch sizes
**Configuration** (automatic by default):
```python
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    enable_cuda_graph=True,  # Default: True
    cuda_graph_cache_size=2  # Cache 2 graph variants
)
```
## Chunked Context
**What it does**: Splits long prompts into chunks to reduce memory spikes.
**Use case**: Prompts >8K tokens with limited GPU memory
**Configuration**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --max_num_tokens 4096 \
    --enable_chunked_context \
    --max_chunked_prefill_length 2048  # Process 2K tokens at a time
```
## Overlap Scheduling
**What it does**: Overlaps scheduling and memory operations for the next batch with compute for the current one.
**Benefits**:
- 15-25% higher throughput
- Better GPU utilization
- Default in v1.2.0+
**No configuration needed** - enabled automatically.
## Quantization Comparison Table
| Method | Memory | Speed | Accuracy | Use Case |
|--------|--------|-------|----------|----------|
| FP16 | 1× (baseline) | 1× | Best | High accuracy needed |
| FP8 | 0.5× | 2× | -0.5% ppl | **H100 default** |
| INT4 AWQ | 0.25× | 3-4× | -1.5% ppl | Memory critical |
| INT4 GPTQ | 0.25× | 3-4× | -2% ppl | Maximum speed |
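The table's decision rule is small enough to encode directly. A sketch with a hypothetical helper, `choose_quantization`, returning the dtype strings used in the examples above:
```python
def choose_quantization(gpu: str, memory_critical: bool, accuracy_critical: bool) -> str:
    """Pick a dtype per the comparison table (hypothetical helper)."""
    if accuracy_critical:
        return "fp16"      # baseline: best accuracy, most memory
    if memory_critical:
        return "int4_awq"  # 4x memory reduction, ~1.5% perplexity cost
    if gpu.upper().startswith("H100"):
        return "fp8"       # H100 default: 2x speed, half the memory
    return "fp16"

print(choose_quantization("H100", memory_critical=False, accuracy_critical=False))  # fp8
```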
## Tuning Workflow
1. **Start with defaults**:
```python
llm = LLM(model="meta-llama/Meta-Llama-3-70B")
```
2. **Enable FP8** (if H100):
```python
llm = LLM(model="...", dtype="fp8")
```
3. **Tune batch size**:
```bash
# Increase until OOM, then reduce by 20%
trtllm-serve ... --max_batch_size 256
```
4. **Enable chunked context** (if long prompts):
```bash
--enable_chunked_context --max_chunked_prefill_length 2048
```
5. **Try speculative decoding** (if latency critical):
```python
llm = LLM(model="...", speculative_model="...")
```
## Benchmarking
```bash
# Install benchmark tool
pip install "tensorrt_llm[benchmark]"

# Run benchmark
python benchmarks/python/benchmark.py \
    --model meta-llama/Meta-Llama-3-8B \
    --batch_size 64 \
    --input_len 128 \
    --output_len 256 \
    --dtype fp8
```
**Metrics to track**:
- Throughput (tokens/sec)
- Latency P50/P90/P99 (ms)
- GPU memory usage (GB)
- GPU utilization (%)
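If you want these numbers against a live endpoint rather than a standalone engine, a short script suffices. A minimal sketch, assuming the server from the in-flight batching section at `localhost:8000` and using whitespace-split words as a rough token proxy:
```python
import time
import requests

URL = "http://localhost:8000/v1/completions"  # assumed server address
latencies_ms, words = [], 0

for i in range(50):
    t0 = time.perf_counter()
    resp = requests.post(URL, json={
        "model": "meta-llama/Meta-Llama-3-8B",
        "prompt": f"Summarize request {i} in one sentence.",
        "max_tokens": 128,
    })
    latencies_ms.append((time.perf_counter() - t0) * 1000)
    # Rough proxy: whitespace-split words, not real tokenizer counts
    words += len(resp.json()["choices"][0]["text"].split())

latencies_ms.sort()
total_s = sum(latencies_ms) / 1000
print(f"~{words / total_s:.1f} words/sec (sequential client)")
print(f"P50 {latencies_ms[len(latencies_ms) // 2]:.0f} ms, "
      f"P99 {latencies_ms[int(len(latencies_ms) * 0.99)]:.0f} ms")
```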
## Common Issues
**OOM errors**:
- Reduce `max_batch_size`
- Reduce `max_num_tokens`
- Enable INT4 quantization
- Increase `tensor_parallel_size`
**Low throughput**:
- Increase `max_batch_size`
- Enable in-flight batching
- Verify CUDA graphs are enabled
- Check GPU utilization
**High latency**:
- Try speculative decoding
- Reduce `max_batch_size` (less queueing)
- Use FP8 instead of FP16
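For the OOM case specifically, sharding the model across GPUs often avoids quantization entirely. A sketch using the LLM API's tensor parallelism (other arguments follow the earlier examples):
```python
from tensorrt_llm import LLM

# Shard weights and KV cache across 4 GPUs; per-GPU memory drops
# roughly 4x, at the cost of inter-GPU communication per layer.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,
)
outputs = llm.generate(["What does tensor parallelism shard?"])
print(outputs[0].outputs[0].text)
```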