hermes-agent/optional-skills/mlops/tensorrt-llm/SKILL.md
Teknium 5ceed021dc
feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap (#3934)
* feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap

Map active skills to Telegram's slash command menu so users can
discover and invoke skills directly. Three changes:

1. Telegram menu now includes active skill commands alongside built-in
   commands, capped at 100 entries (Telegram Bot API limit). Overflow
   commands remain callable but hidden from the picker. Logged at
   startup when cap is hit.

2. New /commands [page] gateway command for paginated browsing of all
   commands + skills. /help now shows first 10 skill commands and
   points to /commands for the full list.

3. When a user types a slash command that matches a disabled or
   uninstalled skill, they get actionable guidance:
   - Disabled: 'Enable it with: hermes skills config'
   - Optional (not installed): 'Install with: hermes skills install official/<path>'

Built on ideas from PR #3921 by @kshitijk4poor.

* chore: move 21 niche skills to optional-skills

Move specialized/niche skills from built-in (skills/) to optional
(optional-skills/) to reduce the default skill count. Users can
install them with: hermes skills install official/<category>/<name>

Moved skills (21):
- mlops: accelerate, chroma, faiss, flash-attention,
  hermes-atropos-environments, huggingface-tokenizers, instructor,
  lambda-labs, llava, nemo-curator, pinecone, pytorch-lightning,
  qdrant, saelens, simpo, slime, tensorrt-llm, torchtitan
- research: domain-intel, duckduckgo-search
- devops: inference-sh cli

Built-in skills: 96 → 75
Optional skills: 22 → 43

* fix: only include repo built-in skills in Telegram menu, not user-installed

User-installed skills (from hub or manually added) stay accessible via
/skills and by typing the command directly, but don't get registered
in the Telegram slash command picker. Only skills whose SKILL.md is
under the repo's skills/ directory are included in the menu.

This keeps the Telegram menu focused on the curated built-in set while
user-installed skills remain discoverable through /skills and /commands.
2026-03-30 10:57:30 -07:00

4.9 KiB
Raw Blame History

name description version author license dependencies metadata
tensorrt-llm Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling. 1.0.0 Orchestra Research MIT
tensorrt-llm
torch
hermes
tags
Inference Serving
TensorRT-LLM
NVIDIA
Inference Optimization
High Throughput
Low Latency
Production
FP8
INT4
In-Flight Batching
Multi-GPU

TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

When to use TensorRT-LLM

Use TensorRT-LLM when:

  • Deploying on NVIDIA GPUs (A100, H100, GB200)
  • Need maximum throughput (24,000+ tokens/sec on Llama 3)
  • Require low latency for real-time applications
  • Working with quantized models (FP8, INT4, FP4)
  • Scaling across multiple GPUs or nodes

Use vLLM instead when:

  • Need simpler setup and Python-first API
  • Want PagedAttention without TensorRT compilation
  • Working with AMD GPUs or non-NVIDIA hardware

Use llama.cpp instead when:

  • Deploying on CPU or Apple Silicon
  • Need edge deployment without NVIDIA GPUs
  • Want simpler GGUF quantization format

Quick start

Installation

# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12

Basic inference

from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.text)

Serving with trtllm-serve

# Start server (automatic model download and compilation)
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \              # Tensor parallelism (4 GPUs)
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Key features

Performance optimizations

  • In-flight batching: Dynamic batching during generation
  • Paged KV cache: Efficient memory management
  • Flash Attention: Optimized attention kernels
  • Quantization: FP8, INT4, FP4 for 2-4× faster inference
  • CUDA graphs: Reduced kernel launch overhead

Parallelism

  • Tensor parallelism (TP): Split model across GPUs
  • Pipeline parallelism (PP): Layer-wise distribution
  • Expert parallelism: For Mixture-of-Experts models
  • Multi-node: Scale beyond single machine

Advanced features

  • Speculative decoding: Faster generation with draft models
  • LoRA serving: Efficient multi-adapter deployment
  • Disaggregated serving: Separate prefill and generation

Common patterns

Quantized model (FP8)

from tensorrt_llm import LLM

# Load FP8 quantized model (2× faster, 50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])

Multi-GPU deployment

# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8"
)

Batch inference

# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)

# Automatic in-flight batching for maximum throughput

Performance benchmarks

Meta Llama 3-8B (H100 GPU):

  • Throughput: 24,000 tokens/sec
  • Latency: ~10ms per token
  • vs PyTorch: 100× faster

Llama 3-70B (8× A100 80GB):

  • FP8 quantization: 2× faster than FP16
  • Memory: 50% reduction with FP8

Supported models

  • LLaMA family: Llama 2, Llama 3, CodeLlama
  • GPT family: GPT-2, GPT-J, GPT-NeoX
  • Qwen: Qwen, Qwen2, QwQ
  • DeepSeek: DeepSeek-V2, DeepSeek-V3
  • Mixtral: Mixtral-8x7B, Mixtral-8x22B
  • Vision: LLaVA, Phi-3-vision
  • 100+ models on HuggingFace

References

Resources