mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap (#3934)
* feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap

  Map active skills to Telegram's slash command menu so users can discover and invoke skills directly. Three changes:

  1. The Telegram menu now includes active skill commands alongside built-in commands, capped at 100 entries (the Telegram Bot API limit). Overflow commands remain callable but are hidden from the picker. A warning is logged at startup when the cap is hit.
  2. A new /commands [page] gateway command enables paginated browsing of all commands and skills. /help now shows the first 10 skill commands and points to /commands for the full list.
  3. When a user types a slash command that matches a disabled or uninstalled skill, they get actionable guidance:
     - Disabled: 'Enable it with: hermes skills config'
     - Optional (not installed): 'Install with: hermes skills install official/<path>'

  Built on ideas from PR #3921 by @kshitijk4poor.

* chore: move 21 niche skills to optional-skills

  Move specialized/niche skills from built-in (skills/) to optional (optional-skills/) to reduce the default skill count. Users can install them with: hermes skills install official/<category>/<name>

  Moved skills (21):
  - mlops: accelerate, chroma, faiss, flash-attention, hermes-atropos-environments, huggingface-tokenizers, instructor, lambda-labs, llava, nemo-curator, pinecone, pytorch-lightning, qdrant, saelens, simpo, slime, tensorrt-llm, torchtitan
  - research: domain-intel, duckduckgo-search
  - devops: inference-sh cli

  Built-in skills: 96 → 75
  Optional skills: 22 → 43

* fix: only include repo built-in skills in Telegram menu, not user-installed

  User-installed skills (from the hub or manually added) stay accessible via /skills and by typing the command directly, but are not registered in the Telegram slash command picker. Only skills whose SKILL.md is under the repo's skills/ directory are included in the menu. This keeps the Telegram menu focused on the curated built-in set while user-installed skills remain discoverable through /skills and /commands.
This commit is contained in:
parent
97d6813f51
commit
5ceed021dc
73 changed files with 163 additions and 4 deletions
@@ -1,470 +0,0 @@
# Production Serving Guide

Comprehensive guide to deploying TensorRT-LLM in production environments.

## Server Modes

### trtllm-serve (Recommended)

**Features**:
- OpenAI-compatible API
- Automatic model download and compilation
- Built-in load balancing
- Prometheus metrics
- Health checks

**Basic usage**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --tp_size 1 \
  --max_batch_size 256 \
  --port 8000
```

**Advanced configuration**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-70B \
  --tp_size 4 \
  --dtype fp8 \
  --max_batch_size 256 \
  --max_num_tokens 4096 \
  --enable_chunked_context \
  --scheduler_policy max_utilization \
  --port 8000 \
  --api_key $API_KEY  # Optional authentication
```
### Python LLM API (For embedding)

```python
from tensorrt_llm import LLM, SamplingParams


class LLMService:
    def __init__(self):
        self.llm = LLM(
            model="meta-llama/Meta-Llama-3-8B",
            dtype="fp8"
        )

    def generate(self, prompt, max_tokens=100):
        params = SamplingParams(
            max_tokens=max_tokens,
            temperature=0.7
        )
        # generate() returns one RequestOutput per prompt; the generated
        # text lives on its first completion.
        outputs = self.llm.generate([prompt], params)
        return outputs[0].outputs[0].text


# Use in FastAPI, Flask, etc.
from fastapi import FastAPI

app = FastAPI()
service = LLMService()


@app.post("/generate")
def generate(prompt: str):  # FastAPI reads `prompt` from the query string
    return {"response": service.generate(prompt)}
```
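
Once the app is running (for example via `uvicorn app:app`, assuming the module is named `app.py`), the endpoint can be exercised with a short client. A minimal sketch using `requests`; note that because the endpoint declares `prompt: str`, FastAPI expects it as a query parameter:

```python
import requests

# `prompt` is a query parameter here, not a JSON body field.
resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "Explain KV caching in one sentence."},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["response"])
```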
## OpenAI-Compatible API

### Chat Completions

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing"}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": false
  }'
```

**Response**:
```json
{
  "id": "chat-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "meta-llama/Meta-Llama-3-8B",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Quantum computing is..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
```

### Streaming

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'
```

**Response** (SSE stream):
```
data: {"choices":[{"delta":{"content":"1"}}]}

data: {"choices":[{"delta":{"content":", 2"}}]}

data: {"choices":[{"delta":{"content":", 3"}}]}

data: [DONE]
```
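
Because the endpoint is OpenAI-compatible, the official `openai` Python SDK can be pointed at the server instead of hand-parsing SSE. A minimal sketch, assuming the server above is on localhost:8000 (the `api_key` value is a placeholder unless authentication is enabled):

```python
from openai import OpenAI

# base_url points the SDK at the local trtllm-serve instance; the key
# is required by the SDK but ignored by the server unless auth is on.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Streaming chat completion: delta chunks arrive as they are generated.
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```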
### Completions

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "The capital of France is",
    "max_tokens": 10,
    "temperature": 0.0
  }'
```

## Monitoring

### Prometheus Metrics

**Enable metrics**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --enable_metrics \
  --metrics_port 9090
```

**Key metrics**:
```bash
# Scrape metrics
curl http://localhost:9090/metrics

# Important metrics:
# - trtllm_request_success_total   - Total successful requests
# - trtllm_request_latency_seconds - Request latency histogram
# - trtllm_tokens_generated_total  - Total tokens generated
# - trtllm_active_requests         - Current active requests
# - trtllm_queue_size              - Requests waiting in queue
# - trtllm_gpu_memory_usage_bytes  - GPU memory usage
# - trtllm_kv_cache_usage_ratio    - KV cache utilization
```
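
For quick ad-hoc checks outside of Prometheus, the exposition endpoint can be scraped directly. A minimal sketch, assuming the metric names listed above and the metrics port from the previous block:

```python
import requests

# Fetch the plaintext Prometheus exposition from the metrics port.
text = requests.get("http://localhost:9090/metrics", timeout=5).text

# Pull out a couple of gauges; sample lines look like "<name> <value>".
watched = ("trtllm_active_requests", "trtllm_queue_size")
for line in text.splitlines():
    if line.startswith(watched):
        name, _, value = line.partition(" ")
        print(f"{name}: {value}")
```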
### Health Checks

```bash
# Readiness probe
curl http://localhost:8000/health/ready

# Liveness probe
curl http://localhost:8000/health/live

# Model info
curl http://localhost:8000/v1/models
```

**Kubernetes probes**:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 5
```
## Production Deployment

### Docker Deployment

**Dockerfile**:
```dockerfile
FROM nvidia/tensorrt_llm:latest

# Copy any custom configs
COPY config.yaml /app/config.yaml

# Expose ports
EXPOSE 8000 9090

# Start server
CMD ["trtllm-serve", "meta-llama/Meta-Llama-3-8B", \
     "--tp_size", "4", \
     "--dtype", "fp8", \
     "--max_batch_size", "256", \
     "--enable_metrics", \
     "--metrics_port", "9090"]
```

**Run container**:
```bash
docker run --gpus all -p 8000:8000 -p 9090:9090 \
  tensorrt-llm:latest
```
### Kubernetes Deployment

**Complete deployment**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorrt-llm
spec:
  replicas: 2  # Multiple replicas for HA
  selector:
    matchLabels:
      app: tensorrt-llm
  template:
    metadata:
      labels:
        app: tensorrt-llm
    spec:
      containers:
      - name: trtllm
        image: nvidia/tensorrt_llm:latest
        command:
        - trtllm-serve
        - meta-llama/Meta-Llama-3-70B
        - --tp_size=4
        - --dtype=fp8
        - --max_batch_size=256
        - --enable_metrics
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 9090
          name: metrics
        resources:
          limits:
            nvidia.com/gpu: 4
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8000
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: tensorrt-llm
spec:
  selector:
    app: tensorrt-llm
  ports:
  - name: http
    port: 80
    targetPort: 8000
  - name: metrics
    port: 9090
    targetPort: 9090
  type: LoadBalancer
```
### Load Balancing

**NGINX configuration**:
```nginx
upstream tensorrt_llm {
    least_conn;  # Route to least busy server
    server trtllm-1:8000 max_fails=3 fail_timeout=30s;
    server trtllm-2:8000 max_fails=3 fail_timeout=30s;
    server trtllm-3:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://tensorrt_llm;
        proxy_read_timeout 300s;  # Long timeout for slow generations
        proxy_connect_timeout 10s;
    }
}
```
## Autoscaling

### Horizontal Pod Autoscaler (HPA)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tensorrt-llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tensorrt-llm
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: trtllm_active_requests
      target:
        type: AverageValue
        averageValue: "50"  # Scale when avg >50 active requests
```

### Custom Metrics

```yaml
# Scale based on queue size
- type: Pods
  pods:
    metric:
      name: trtllm_queue_size
    target:
      type: AverageValue
      averageValue: "10"
```
## Cost Optimization

### GPU Selection

**A100 80GB** ($3-4/hour):
- Use for: 70B models with FP8
- Throughput: 10,000-15,000 tok/s (TP=4)
- Cost per 1M tokens: $0.20-0.30

**H100 80GB** ($6-8/hour):
- Use for: 70B models with FP8, 405B models
- Throughput: 20,000-30,000 tok/s (TP=4)
- Cost per 1M tokens: $0.15-0.25 (2× faster = lower cost)

**L4** ($0.50-1/hour):
- Use for: 7-8B models
- Throughput: 1,000-2,000 tok/s
- Cost per 1M tokens: $0.25-0.50

### Batch Size Tuning

**Impact on cost** (illustrative, $3/hour GPU):
- Batch size 1: ~1,000 tok/s → ~3.6M tokens/hour → ~$0.83 per 1M tokens
- Batch size 64: ~5,000 tok/s → ~18M tokens/hour → ~$0.17 per 1M tokens
- **5× cost reduction** with batching

**Recommendation**: Target batch size 32-128 for cost efficiency.
## Security

### API Authentication

```bash
# Generate API key
export API_KEY=$(openssl rand -hex 32)

# Start server with authentication
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --api_key $API_KEY

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "...", "messages": [...]}'
```
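
The same Bearer scheme works through the OpenAI SDK, which sends the key as an `Authorization: Bearer` header automatically. A minimal sketch, assuming `API_KEY` is exported as above:

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=os.environ["API_KEY"],  # Sent as "Authorization: Bearer <key>"
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=10,
)
print(resp.choices[0].message.content)
```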
### Network Policies

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tensorrt-llm-policy
spec:
  podSelector:
    matchLabels:
      app: tensorrt-llm
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway  # Only allow from gateway
    ports:
    - protocol: TCP
      port: 8000
```
## Troubleshooting

### High latency

**Diagnosis**:
```bash
# Check queue size
curl http://localhost:9090/metrics | grep queue_size

# Check active requests
curl http://localhost:9090/metrics | grep active_requests
```

**Solutions**:
- Scale horizontally (more replicas)
- Increase batch size (if the GPU is underutilized)
- Enable chunked context (for long prompts)
- Use FP8 quantization

### OOM crashes

**Solutions**:
- Reduce `max_batch_size`
- Reduce `max_num_tokens`
- Enable FP8 or INT4 quantization
- Increase `tensor_parallel_size`

### Timeout errors

**NGINX config**:
```nginx
proxy_read_timeout 600s;  # 10 minutes for very long generations
proxy_send_timeout 600s;
```
## Best Practices

1. **Use FP8 on H100** for 2× speedup and 50% cost reduction
2. **Monitor metrics** - Set up Prometheus + Grafana
3. **Set readiness probes** - Prevent routing to unhealthy pods
4. **Use load balancing** - Distribute load across replicas
5. **Tune batch size** - Balance latency and throughput
6. **Enable streaming** - Better UX for chat applications
7. **Set up autoscaling** - Handle traffic spikes
8. **Use persistent volumes** - Cache compiled models
9. **Implement retries** - Handle transient failures (see the sketch after this list)
10. **Monitor costs** - Track cost per token
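
A minimal retry sketch for item 9, wrapping the OpenAI-compatible client in exponential backoff; the retryable status codes and backoff constants are illustrative choices, not part of trtllm-serve:

```python
import time

from openai import APIConnectionError, APIStatusError, OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")


def chat_with_retries(messages, attempts=4, base_delay=0.5):
    """Retry transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(
                model="meta-llama/Meta-Llama-3-8B",
                messages=messages,
            )
        except APIConnectionError:
            pass  # Network blip: retry.
        except APIStatusError as e:
            if e.status_code not in (429, 500, 502, 503):
                raise  # Not transient: surface immediately.
        time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError("LLM request failed after retries")
```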