hermes-agent/optional-skills/mlops/tensorrt-llm/SKILL.md
Teknium b18b17f9c9 feat(skills): gate 7 Linux/macOS-only skills from Windows via platforms frontmatter
Hermes's skill loader (agent/skill_utils.skill_matches_platform) already honors
the 'platforms:' frontmatter field and skips loading skills whose declared
platform list doesn't include sys.platform. Seven bundled skills are in fact
Linux/macOS-only but never declared it, so they leak into Windows skill
listings and sometimes load with broken instructions.
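
A hedged sketch of the gating check described above (the real implementation lives in agent/skill_utils; the alias mapping below is an illustrative assumption):

```python
# Sketch of the platform gate: map sys.platform onto the frontmatter
# vocabulary and skip skills whose declared list excludes the host.
# The alias table is an assumption, not the shipped implementation.
import sys

_PLATFORM_ALIASES = {"linux": "linux", "darwin": "macos", "win32": "windows"}

def skill_matches_platform(frontmatter: dict) -> bool:
    declared = frontmatter.get("platforms")
    if not declared:  # no declaration: treat as cross-platform
        return True
    host = _PLATFORM_ALIASES.get(sys.platform, sys.platform)
    return host in declared
```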

Audited all 160 SKILL.md files (skills/ + optional-skills/) for Windows-
hostile signals: apt-get/brew/systemd/chmod+x install flows, ptrace/proc
runtime dependencies, bash-only launcher scripts, and package dependencies
with no Windows build. The 7 below fail one or more of those tests in a way
that fundamentally can't be papered over by docs edits:

  minecraft-modpack-server      bash start.sh + chmod +x + apt openjdk
  evaluating-llms-harness       lm-eval-harness bash launcher scripts
  distributed-llm-pretraining-
  torchtitan                    bash multi-node torchrun launcher
  python-debugpy                remote attach relies on /proc ptrace_scope
  pytorch-fsdp                  NCCL backend; Windows path is WSL only
  tensorrt-llm                  NVIDIA TensorRT-LLM has no Windows build
  searxng-search                Docker volume flow assumes POSIX $(pwd)

All seven get 'platforms: [linux, macos]'. On Windows the loader now skips
them silently — no more phantom skill listings, no more mid-task failures
because an Apple-only path was surfaced as a suggestion.

Cross-platform skills that merely CONTAIN signals in examples or
install-instructions (brew install as one of several paths, /tmp/ in a code
snippet, etc.) are NOT touched by this commit. A broader audit that
declares the ~140 cross-platform skills as 'platforms: [linux, macos,
windows]' can follow as a separate change once each has been verified
working on Windows.

The installed user copies under ~/AppData/Local/hermes/skills/ (when they
exist) are also patched so the running session reflects the gating
immediately, but only the in-repo files are committed here.
2026-05-08 14:27:40 -07:00

---
name: tensorrt-llm
description: Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [tensorrt-llm, torch]
platforms: [linux, macos]
metadata:
  hermes:
    tags: [Inference Serving, TensorRT-LLM, NVIDIA, Inference Optimization, High Throughput, Low Latency, Production, FP8, INT4, In-Flight Batching, Multi-GPU]
---
# TensorRT-LLM
NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.
## When to use TensorRT-LLM
**Use TensorRT-LLM when:**
- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes
**Use vLLM instead when:**
- Need simpler setup and Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or non-NVIDIA hardware
**Use llama.cpp instead when:**
- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want simpler GGUF quantization format
## Quick start
### Installation
```bash
# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest
# pip install
pip install tensorrt_llm==1.2.0rc3
# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
```
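To confirm the wheel installed correctly before compiling anything, a quick import check (assuming the package exposes `__version__`, as recent wheels do):
```python
# Sanity check: import the package and print its version.
# Assumes tensorrt_llm exposes __version__ (true for recent releases).
import tensorrt_llm

print(tensorrt_llm.__version__)
```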
### Basic inference
```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    # Completions live on output.outputs; take the first candidate
    print(output.outputs[0].text)
```
### Serving with trtllm-serve
```bash
# Start server (automatic model download and compilation)
# --tp_size 4 runs tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --tp_size 4 \
  --max_batch_size 256 \
  --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
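Since trtllm-serve exposes an OpenAI-compatible API (as the curl example shows), any OpenAI client can talk to it. A minimal sketch with the `openai` Python package, assuming the server above is running on localhost:8000:
```python
# Minimal sketch: call the trtllm-serve endpoint via the OpenAI client.
# Assumes `pip install openai` and the server started above; the API key
# is a placeholder for servers that don't enforce auth.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```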
## Key features
### Performance optimizations
- **In-flight batching**: Dynamic batching during generation
- **Paged KV cache**: Efficient memory management
- **Flash Attention**: Optimized attention kernels
- **Quantization**: FP8, INT4, FP4 for 2-4× faster inference
- **CUDA graphs**: Reduced kernel launch overhead
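A hedged sketch of tuning the paged KV cache from the LLM API, assuming the `KvCacheConfig` helper in `tensorrt_llm.llmapi` (field names can shift between releases, so verify against your version's docs):
```python
# Sketch: cap how much free GPU memory the paged KV cache may claim.
# KvCacheConfig and its field name are assumptions; check your release.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.9),
)
```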
### Parallelism
- **Tensor parallelism (TP)**: Split model across GPUs
- **Pipeline parallelism (PP)**: Layer-wise distribution
- **Expert parallelism**: For Mixture-of-Experts models
- **Multi-node**: Scale beyond single machine
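A hedged sketch combining the first two for a two-node run; `pipeline_parallel_size` is an assumption here (only `tensor_parallel_size` appears in the multi-GPU example below), so confirm the argument name for your release:
```python
# Sketch: TP within each node, PP across nodes (16 GPUs total).
# pipeline_parallel_size is assumed; verify against your TensorRT-LLM docs.
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=8,    # split each layer across 8 GPUs per node
    pipeline_parallel_size=2,  # distribute layers across 2 nodes
)
```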
### Advanced features
- **Speculative decoding**: Faster generation with draft models
- **LoRA serving**: Efficient multi-adapter deployment
- **Disaggregated serving**: Separate prefill and generation
## Common patterns
### Quantized model (FP8)
```python
from tensorrt_llm import LLM

# Load FP8 quantized model (2× faster, 50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])
```
### Multi-GPU deployment
```python
# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B",  # the 405B model is Llama 3.1
    tensor_parallel_size=8,
    dtype="fp8"
)
```
### Batch inference
```python
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]
outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)
# Automatic in-flight batching for maximum throughput
```
## Performance benchmarks
**Meta Llama 3-8B** (H100 GPU):
- Throughput: 24,000 tokens/sec
- Latency: ~10ms per output token
- vs PyTorch: **100× faster**

At ~10 ms per output token a single stream generates ~100 tokens/sec, so the throughput figure is the batched aggregate: in-flight batching keeps on the order of 240 concurrent streams in flight.

**Llama 3-70B** (8× A100 80GB):
- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8
## Supported models
- **LLaMA family**: Llama 2, Llama 3, CodeLlama
- **GPT family**: GPT-2, GPT-J, GPT-NeoX
- **Qwen**: Qwen, Qwen2, QwQ
- **DeepSeek**: DeepSeek-V2, DeepSeek-V3
- **Mixtral**: Mixtral-8x7B, Mixtral-8x22B
- **Vision**: LLaVA, Phi-3-vision
- **100+ models** on HuggingFace
## References
- **[Optimization Guide](references/optimization.md)** - Quantization, batching, KV cache tuning
- **[Multi-GPU Setup](references/multi-gpu.md)** - Tensor/pipeline parallelism, multi-node
- **[Serving Guide](references/serving.md)** - Production deployment, monitoring, autoscaling
## Resources
- **Docs**: https://nvidia.github.io/TensorRT-LLM/
- **GitHub**: https://github.com/NVIDIA/TensorRT-LLM
- **Models**: https://huggingface.co/models?library=tensorrt_llm