Multi-GPU Deployment Guide

Comprehensive guide to scaling TensorRT-LLM across multiple GPUs and nodes.

Parallelism Strategies

Tensor Parallelism (TP)

What it does: Splits each layer's weight matrices across GPUs (intra-layer sharding).

Use case:

  • Model fits in total GPU memory but not single GPU
  • Need low latency (single forward pass)
  • GPUs on same node (NVLink required for best performance)

Example (Llama 3-70B on 4× A100):

from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,  # Split across 4 GPUs
    dtype="fp16"
)

# Model automatically sharded across GPUs
# Single forward pass, low latency

Performance:

  • Latency: comparable to a single GPU that could hold the model (one forward pass)
  • Throughput: scales with GPU count, minus communication overhead
  • Communication: High (activations all-reduced after every layer)
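As a rough rule of thumb, the per-GPU weight footprint under TP is the total weight bytes divided by the TP degree. A minimal sketch (the helper name is ours, not a TensorRT-LLM API; it ignores activations, KV cache, and small unsharded tensors):

```python
def tp_weight_gb(params_billion: float, bytes_per_param: float, tp: int) -> float:
    """Approximate per-GPU weight memory (GB) under tensor parallelism."""
    return params_billion * bytes_per_param / tp

# Llama 3-70B in FP16 (2 bytes/param) split over TP=4:
print(tp_weight_gb(70, 2, 4))  # 35.0 GB of weights per GPU
```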

Pipeline Parallelism (PP)

What it does: Assigns contiguous groups of layers to different GPUs (inter-layer sharding).

Use case:

  • Very large models (175B+)
  • Can tolerate higher latency
  • GPUs across multiple nodes

Example (Llama 3-405B on 8× H100):

llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=4,   # TP=4 within nodes
    pipeline_parallel_size=2, # PP=2 across nodes
    dtype="fp8"
)

# Total: 8 GPUs (4×2)
# First half of layers: Node 1 (4 GPUs with TP)
# Second half of layers: Node 2 (4 GPUs with TP)

Performance:

  • Latency: Higher (sequential through pipeline)
  • Throughput: High with micro-batching
  • Communication: Lower than TP
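The layer-wise split above amounts to partitioning layer indices across pipeline stages. An illustrative sketch (TensorRT-LLM handles this internally; the function is ours):

```python
def pp_layer_ranges(num_layers: int, pp: int) -> list[tuple[int, int]]:
    """Assign contiguous (first, last) layer ranges to each pipeline stage."""
    base, extra = divmod(num_layers, pp)
    ranges, start = [], 0
    for stage in range(pp):
        size = base + (1 if stage < extra else 0)  # early stages absorb the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# An 80-layer model on PP=2: stage 0 gets layers 0-39, stage 1 gets 40-79
print(pp_layer_ranges(80, 2))  # [(0, 39), (40, 79)]
```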

Expert Parallelism (EP)

What it does: Distributes MoE experts across GPUs.

Use case: Mixture-of-Experts models (Mixtral, DeepSeek-V2)

Example (Mixtral-8x22B on 8× A100):

llm = LLM(
    model="mistralai/Mixtral-8x22B",
    tensor_parallel_size=4,
    expert_parallel_size=2,  # Distribute 8 experts across 2 groups
    dtype="fp8"
)
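Conceptually, `expert_parallel_size=2` splits the experts into two GPU groups, each hosting half of them. A toy sketch of a contiguous assignment (names and scheme are illustrative, not TensorRT-LLM internals):

```python
def expert_groups(num_experts: int, ep: int) -> dict[int, list[int]]:
    """Map each EP group to the experts it hosts (contiguous assignment)."""
    return {g: list(range(g * num_experts // ep, (g + 1) * num_experts // ep))
            for g in range(ep)}

# Mixtral's 8 experts across EP=2 groups:
print(expert_groups(8, 2))  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```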

Configuration Examples

Small model (7-13B) - Single GPU

# Llama 3-8B on 1× A100 80GB
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    dtype="fp16"  # or fp8 for H100
)

Resources:

  • GPU: 1× A100 80GB
  • Memory: ~16GB model + 30GB KV cache
  • Throughput: 3,000-5,000 tokens/sec

Medium model (70B) - Multi-GPU same node

# Llama 3-70B on 4× A100 80GB (NVLink)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,
    dtype="fp8"  # FP8 halves weights: ~140GB FP16 → ~70GB total, ~17.5GB per GPU
)

Resources:

  • GPU: 4× A100 80GB with NVLink
  • Memory: ~35GB per GPU (FP8 weights + KV cache)
  • Throughput: 10,000-15,000 tokens/sec
  • Latency: 15-20ms per token

Large model (405B) - Multi-node

# Llama 3-405B on 2 nodes × 8 H100 = 16 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,    # TP within each node
    pipeline_parallel_size=2,  # PP across 2 nodes
    dtype="fp8"
)

Resources:

  • GPU: 2 nodes × 8 H100 80GB
  • Memory: ~25GB per GPU (FP8)
  • Throughput: 20,000-30,000 tokens/sec
  • Network: InfiniBand recommended
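The ~25GB figure follows from dividing the FP8 weights across all 16 GPUs. A quick sanity check (helper name is ours; KV cache and activations come on top):

```python
def per_gpu_weight_gb(params_billion: float, bytes_per_param: float,
                      tp: int, pp: int) -> float:
    """Approximate weight memory per GPU when sharding over tp * pp GPUs."""
    return params_billion * bytes_per_param / (tp * pp)

# Llama 3-405B in FP8 (1 byte/param) on TP=8 x PP=2 = 16 GPUs:
print(round(per_gpu_weight_gb(405, 1, 8, 2), 1))  # 25.3 GB per GPU
```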

Server Deployment

Single-node multi-GPU

# Llama 3-70B on 4 GPUs (automatic TP)
trtllm-serve meta-llama/Meta-Llama-3-70B \
    --tp_size 4 \
    --max_batch_size 256 \
    --dtype fp8

# Listens on http://localhost:8000

Multi-node with Ray

# Node 1 (head node)
ray start --head --port=6379

# Node 2 (worker)
ray start --address='node1:6379'

# Deploy across cluster
trtllm-serve meta-llama/Meta-Llama-3-405B \
    --tp_size 8 \
    --pp_size 2 \
    --num_workers 2 \
    --dtype fp8  # num_workers 2 = one worker per node

Kubernetes deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorrt-llm-llama3-70b
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: trtllm
        image: nvidia/tensorrt_llm:latest
        command:
          - trtllm-serve
          - meta-llama/Meta-Llama-3-70B
          - --tp_size=4
          - --max_batch_size=256
        resources:
          limits:
            nvidia.com/gpu: 4  # Request 4 GPUs

Parallelism Decision Tree

Model size < 20GB?
├─ YES: Single GPU (no parallelism)
└─ NO: Model size < 80GB?
    ├─ YES: TP=2 or TP=4 (same node)
    └─ NO: Model size < 320GB?
        ├─ YES: TP=4 or TP=8 (same node, NVLink required)
        └─ NO: TP=8 + PP=2 (multi-node)
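The tree above can be encoded as a small helper. The thresholds are copied from the tree; treat them as starting points, not hard rules:

```python
def choose_parallelism(model_gb: float) -> dict[str, int]:
    """Pick a starting TP/PP layout from the model's weight footprint in GB."""
    if model_gb < 20:
        return {"tp": 1, "pp": 1}   # single GPU, no parallelism
    if model_gb < 80:
        return {"tp": 4, "pp": 1}   # TP within one node (TP=2 may also fit)
    if model_gb < 320:
        return {"tp": 8, "pp": 1}   # full node, NVLink required
    return {"tp": 8, "pp": 2}       # multi-node: TP within, PP across

print(choose_parallelism(70))   # {'tp': 4, 'pp': 1}
print(choose_parallelism(405))  # {'tp': 8, 'pp': 2}
```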

Communication Optimization

NVLink (DGX A100, HGX H100):

  • Bandwidth: 600 GB/s (A100), 900 GB/s (H100)
  • Ideal for TP (high communication)
  • Recommended for all multi-GPU setups

PCIe:

  • Bandwidth: 64 GB/s (PCIe 4.0 x16)
  • 10× slower than NVLink
  • Avoid TP, use PP instead

InfiniBand for multi-node

HDR InfiniBand (200 Gb/s):

  • Required for multi-node TP or PP
  • Latency: <1μs
  • Essential for 405B+ models

Monitoring Multi-GPU

# Monitor GPU utilization
nvidia-smi dmon -s u

# Monitor memory
nvidia-smi dmon -s m

# Monitor NVLink utilization
nvidia-smi nvlink --status

# TensorRT-LLM built-in metrics
curl http://localhost:8000/metrics

Key metrics:

  • GPU utilization: Target 80-95%
  • Memory usage: Should be balanced across GPUs
  • NVLink traffic: High for TP, low for PP
  • Throughput: Tokens/sec across all GPUs
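If the /metrics endpoint returns Prometheus-style text (the metric names below are illustrative, not guaranteed), a few lines of Python turn it into a dict for dashboards or alerting:

```python
def parse_prometheus(text: str) -> dict[str, float]:
    """Parse simple `name value` lines from Prometheus text exposition."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore malformed lines
    return metrics

sample = """# HELP tokens_per_second generation throughput
tokens_per_second 11234.5
gpu_utilization 0.91"""
print(parse_prometheus(sample)["tokens_per_second"])  # 11234.5
```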

Common Issues

Imbalanced GPU memory

Symptom: GPU 0 has 90% memory, GPU 3 has 40%

Solutions:

  • Verify TP/PP configuration
  • Check model sharding (should be equal)
  • Restart server to reset state

Low NVLink bandwidth

Symptom: NVLink bandwidth <100 GB/s with TP=4

Solutions:

  • Verify NVLink topology: nvidia-smi topo -m
  • Check for PCIe fallback
  • Ensure GPUs are on same NVSwitch

OOM with multi-GPU

Solutions:

  • Increase TP size (more GPUs)
  • Reduce batch size
  • Enable FP8 quantization
  • Use pipeline parallelism
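Before resizing, a back-of-envelope check helps decide whether a layout can fit at all. All names and the 10% headroom factor below are our assumptions:

```python
def fits_per_gpu(weights_gb: float, kv_cache_gb: float,
                 tp: int, pp: int, gpu_gb: float = 80.0) -> bool:
    """True if weights + KV cache sharded over tp*pp GPUs fit with ~10% headroom."""
    per_gpu = (weights_gb + kv_cache_gb) / (tp * pp)
    return per_gpu <= gpu_gb * 0.9

# Llama 3-70B FP8 (~70GB weights) with 40GB of KV cache:
print(fits_per_gpu(70, 40, tp=2, pp=1))  # 55.0 GB/GPU -> True
print(fits_per_gpu(70, 40, tp=1, pp=1))  # 110 GB/GPU -> False
```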

Performance Scaling

TP Scaling (Llama 3-70B, FP8)

| GPUs | TP Size | Throughput   | Latency | Efficiency |
|------|---------|--------------|---------|------------|
| 1    | 1       | OOM          | -       | -          |
| 2    | 2       | 6,000 tok/s  | 18ms    | 85%        |
| 4    | 4       | 11,000 tok/s | 16ms    | 78%        |
| 8    | 8       | 18,000 tok/s | 15ms    | 64%        |

Note: Efficiency drops with more GPUs due to communication overhead.
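The efficiency column is measured throughput divided by ideal linear scaling. Using the single-GPU baseline implied by the TP=2 row (~3,529 tok/s, our inference from the table), the other rows follow:

```python
def scaling_efficiency(single_gpu_tput: float, tput: float, gpus: int) -> float:
    """Measured throughput as a fraction of perfect linear scaling."""
    return tput / (single_gpu_tput * gpus)

baseline = 6000 / (2 * 0.85)  # single-GPU throughput implied by the TP=2 row
print(round(scaling_efficiency(baseline, 11000, 4), 2))  # 0.78
print(round(scaling_efficiency(baseline, 18000, 8), 2))  # 0.64
```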

PP Scaling (Llama 3-405B, FP8)

| Nodes | TP | PP | Total GPUs | Throughput   |
|-------|----|----|------------|--------------|
| 1     | 8  | 1  | 8          | OOM          |
| 2     | 8  | 2  | 16         | 25,000 tok/s |
| 4     | 8  | 4  | 32         | 45,000 tok/s |

Best Practices

  1. Prefer TP over PP when possible (lower latency)
  2. Use NVLink for all TP deployments
  3. Use InfiniBand for multi-node deployments
  4. Start with smallest TP that fits model in memory
  5. Monitor GPU balance - all GPUs should have similar utilization
  6. Benchmark representative workloads before production
  7. Use FP8 on H100 for 2× speedup