refactor: reorganize skills into sub-categories

The skills directory was getting disorganized — mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each. Code change: - prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure. Split mlops (40 skills) into 7 sub-categories: - mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth - mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm - mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper - mlops/vector-databases (4): chroma, faiss, pinecone, qdrant - mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases - mlops/cloud (2): lambda-labs, modal - mlops/research (1): dspy Merged singleton categories: - gifs → media (gif-search joins youtube-content) - music-creation → media (heartmula, songsee) - diagramming → creative (excalidraw joins ascii-art) - ocr-and-documents → productivity - domain → research (domain-intel) - feeds → research (blogwatcher) - market-data → research (polymarket) Fixed misplaced skills: - mlops/code-review → software-development (not ML-specific) - mlops/ml-paper-writing → research (academic writing) Added DESCRIPTION.md files for all new/updated categories.
2026-04-25 00:51:20 +00:00 · 2026-03-09 03:35:53 -07:00 · 2026-03-09 03:35:53 -07:00 · 732c66b0f3
commit 732c66b0f3
parent d6c710706f
217 changed files with 39 additions and 4 deletions
--- a/skills/mlops/inference/tensorrt-llm/references/serving.md
+++ b/skills/mlops/inference/tensorrt-llm/references/serving.md
@ -0,0 +1,470 @@
+# Production Serving Guide
+
+Comprehensive guide to deploying TensorRT-LLM in production environments.
+
+## Server Modes
+
+### trtllm-serve (Recommended)
+
+**Features**:
+- OpenAI-compatible API
+- Automatic model download and compilation
+- Built-in load balancing
+- Prometheus metrics
+- Health checks
+
+**Basic usage**:
+```bash
+trtllm-serve meta-llama/Meta-Llama-3-8B \
+    --tp_size 1 \
+    --max_batch_size 256 \
+    --port 8000
+```
+
+**Advanced configuration**:
+```bash
+trtllm-serve meta-llama/Meta-Llama-3-70B \
+    --tp_size 4 \
+    --dtype fp8 \
+    --max_batch_size 256 \
+    --max_num_tokens 4096 \
+    --enable_chunked_context \
+    --scheduler_policy max_utilization \
+    --port 8000 \
+    --api_key $API_KEY  # Optional authentication
+```
+
+### Python LLM API (For embedding)
+
+```python
+from tensorrt_llm import LLM
+
+class LLMService:
+    def __init__(self):
+        self.llm = LLM(
+            model="meta-llama/Meta-Llama-3-8B",
+            dtype="fp8"
+        )
+
+    def generate(self, prompt, max_tokens=100):
+        from tensorrt_llm import SamplingParams
+
+        params = SamplingParams(
+            max_tokens=max_tokens,
+            temperature=0.7
+        )
+        outputs = self.llm.generate([prompt], params)
+        return outputs[0].text
+
+# Use in FastAPI, Flask, etc
+from fastapi import FastAPI
+app = FastAPI()
+service = LLMService()
+
+@app.post("/generate")
+def generate(prompt: str):
+    return {"response": service.generate(prompt)}
+```
+
+## OpenAI-Compatible API
+
+### Chat Completions
+
+```bash
+curl -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Meta-Llama-3-8B",
+    "messages": [
+      {"role": "system", "content": "You are a helpful assistant."},
+      {"role": "user", "content": "Explain quantum computing"}
+    ],
+    "temperature": 0.7,
+    "max_tokens": 500,
+    "stream": false
+  }'
+```
+
+**Response**:
+```json
+{
+  "id": "chat-abc123",
+  "object": "chat.completion",
+  "created": 1234567890,
+  "model": "meta-llama/Meta-Llama-3-8B",
+  "choices": [{
+    "index": 0,
+    "message": {
+      "role": "assistant",
+      "content": "Quantum computing is..."
+    },
+    "finish_reason": "stop"
+  }],
+  "usage": {
+    "prompt_tokens": 25,
+    "completion_tokens": 150,
+    "total_tokens": 175
+  }
+}
+```
+
+### Streaming
+
+```bash
+curl -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Meta-Llama-3-8B",
+    "messages": [{"role": "user", "content": "Count to 10"}],
+    "stream": true
+  }'
+```
+
+**Response** (SSE stream):
+```
+data: {"choices":[{"delta":{"content":"1"}}]}
+
+data: {"choices":[{"delta":{"content":", 2"}}]}
+
+data: {"choices":[{"delta":{"content":", 3"}}]}
+
+data: [DONE]
+```
+
+### Completions
+
+```bash
+curl -X POST http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Meta-Llama-3-8B",
+    "prompt": "The capital of France is",
+    "max_tokens": 10,
+    "temperature": 0.0
+  }'
+```
+
+## Monitoring
+
+### Prometheus Metrics
+
+**Enable metrics**:
+```bash
+trtllm-serve meta-llama/Meta-Llama-3-8B \
+    --enable_metrics \
+    --metrics_port 9090
+```
+
+**Key metrics**:
+```bash
+# Scrape metrics
+curl http://localhost:9090/metrics
+
+# Important metrics:
+# - trtllm_request_success_total - Total successful requests
+# - trtllm_request_latency_seconds - Request latency histogram
+# - trtllm_tokens_generated_total - Total tokens generated
+# - trtllm_active_requests - Current active requests
+# - trtllm_queue_size - Requests waiting in queue
+# - trtllm_gpu_memory_usage_bytes - GPU memory usage
+# - trtllm_kv_cache_usage_ratio - KV cache utilization
+```
+
+### Health Checks
+
+```bash
+# Readiness probe
+curl http://localhost:8000/health/ready
+
+# Liveness probe
+curl http://localhost:8000/health/live
+
+# Model info
+curl http://localhost:8000/v1/models
+```
+
+**Kubernetes probes**:
+```yaml
+livenessProbe:
+  httpGet:
+    path: /health/live
+    port: 8000
+  initialDelaySeconds: 60
+  periodSeconds: 10
+
+readinessProbe:
+  httpGet:
+    path: /health/ready
+    port: 8000
+  initialDelaySeconds: 30
+  periodSeconds: 5
+```
+
+## Production Deployment
+
+### Docker Deployment
+
+**Dockerfile**:
+```dockerfile
+FROM nvidia/tensorrt_llm:latest
+
+# Copy any custom configs
+COPY config.yaml /app/config.yaml
+
+# Expose ports
+EXPOSE 8000 9090
+
+# Start server
+CMD ["trtllm-serve", "meta-llama/Meta-Llama-3-8B", \
+     "--tp_size", "4", \
+     "--dtype", "fp8", \
+     "--max_batch_size", "256", \
+     "--enable_metrics", \
+     "--metrics_port", "9090"]
+```
+
+**Run container**:
+```bash
+docker run --gpus all -p 8000:8000 -p 9090:9090 \
+    tensorrt-llm:latest
+```
+
+### Kubernetes Deployment
+
+**Complete deployment**:
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: tensorrt-llm
+spec:
+  replicas: 2  # Multiple replicas for HA
+  selector:
+    matchLabels:
+      app: tensorrt-llm
+  template:
+    metadata:
+      labels:
+        app: tensorrt-llm
+    spec:
+      containers:
+      - name: trtllm
+        image: nvidia/tensorrt_llm:latest
+        command:
+          - trtllm-serve
+          - meta-llama/Meta-Llama-3-70B
+          - --tp_size=4
+          - --dtype=fp8
+          - --max_batch_size=256
+          - --enable_metrics
+        ports:
+        - containerPort: 8000
+          name: http
+        - containerPort: 9090
+          name: metrics
+        resources:
+          limits:
+            nvidia.com/gpu: 4
+        livenessProbe:
+          httpGet:
+            path: /health/live
+            port: 8000
+        readinessProbe:
+          httpGet:
+            path: /health/ready
+            port: 8000
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: tensorrt-llm
+spec:
+  selector:
+    app: tensorrt-llm
+  ports:
+  - name: http
+    port: 80
+    targetPort: 8000
+  - name: metrics
+    port: 9090
+    targetPort: 9090
+  type: LoadBalancer
+```
+
+### Load Balancing
+
+**NGINX configuration**:
+```nginx
+upstream tensorrt_llm {
+    least_conn;  # Route to least busy server
+    server trtllm-1:8000 max_fails=3 fail_timeout=30s;
+    server trtllm-2:8000 max_fails=3 fail_timeout=30s;
+    server trtllm-3:8000 max_fails=3 fail_timeout=30s;
+}
+
+server {
+    listen 80;
+    location / {
+        proxy_pass http://tensorrt_llm;
+        proxy_read_timeout 300s;  # Long timeout for slow generations
+        proxy_connect_timeout 10s;
+    }
+}
+```
+
+## Autoscaling
+
+### Horizontal Pod Autoscaler (HPA)
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: tensorrt-llm-hpa
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: tensorrt-llm
+  minReplicas: 2
+  maxReplicas: 10
+  metrics:
+  - type: Pods
+    pods:
+      metric:
+        name: trtllm_active_requests
+      target:
+        type: AverageValue
+        averageValue: "50"  # Scale when avg >50 active requests
+```
+
+### Custom Metrics
+
+```yaml
+# Scale based on queue size
+- type: Pods
+  pods:
+    metric:
+      name: trtllm_queue_size
+    target:
+      type: AverageValue
+      averageValue: "10"
+```
+
+## Cost Optimization
+
+### GPU Selection
+
+**A100 80GB** ($3-4/hour):
+- Use for: 70B models with FP8
+- Throughput: 10,000-15,000 tok/s (TP=4)
+- Cost per 1M tokens: $0.20-0.30
+
+**H100 80GB** ($6-8/hour):
+- Use for: 70B models with FP8, 405B models
+- Throughput: 20,000-30,000 tok/s (TP=4)
+- Cost per 1M tokens: $0.15-0.25 (2× faster = lower cost)
+
+**L4** ($0.50-1/hour):
+- Use for: 7-8B models
+- Throughput: 1,000-2,000 tok/s
+- Cost per 1M tokens: $0.25-0.50
+
+### Batch Size Tuning
+
+**Impact on cost**:
+- Batch size 1: 1,000 tok/s → $3/hour per 1M = $3/M tokens
+- Batch size 64: 5,000 tok/s → $3/hour per 5M = $0.60/M tokens
+- **5× cost reduction** with batching
+
+**Recommendation**: Target batch size 32-128 for cost efficiency.
+
+## Security
+
+### API Authentication
+
+```bash
+# Generate API key
+export API_KEY=$(openssl rand -hex 32)
+
+# Start server with authentication
+trtllm-serve meta-llama/Meta-Llama-3-8B \
+    --api_key $API_KEY
+
+# Client request
+curl -X POST http://localhost:8000/v1/chat/completions \
+  -H "Authorization: Bearer $API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"model": "...", "messages": [...]}'
+```
+
+### Network Policies
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: tensorrt-llm-policy
+spec:
+  podSelector:
+    matchLabels:
+      app: tensorrt-llm
+  policyTypes:
+  - Ingress
+  ingress:
+  - from:
+    - podSelector:
+        matchLabels:
+          app: api-gateway  # Only allow from gateway
+    ports:
+    - protocol: TCP
+      port: 8000
+```
+
+## Troubleshooting
+
+### High latency
+
+**Diagnosis**:
+```bash
+# Check queue size
+curl http://localhost:9090/metrics | grep queue_size
+
+# Check active requests
+curl http://localhost:9090/metrics | grep active_requests
+```
+
+**Solutions**:
+- Scale horizontally (more replicas)
+- Increase batch size (if GPU underutilized)
+- Enable chunked context (if long prompts)
+- Use FP8 quantization
+
+### OOM crashes
+
+**Solutions**:
+- Reduce `max_batch_size`
+- Reduce `max_num_tokens`
+- Enable FP8 or INT4 quantization
+- Increase `tensor_parallel_size`
+
+### Timeout errors
+
+**NGINX config**:
+```nginx
+proxy_read_timeout 600s;  # 10 minutes for very long generations
+proxy_send_timeout 600s;
+```
+
+## Best Practices
+
+1. **Use FP8 on H100** for 2× speedup and 50% cost reduction
+2. **Monitor metrics** - Set up Prometheus + Grafana
+3. **Set readiness probes** - Prevent routing to unhealthy pods
+4. **Use load balancing** - Distribute load across replicas
+5. **Tune batch size** - Balance latency and throughput
+6. **Enable streaming** - Better UX for chat applications
+7. **Set up autoscaling** - Handle traffic spikes
+8. **Use persistent volumes** - Cache compiled models
+9. **Implement retries** - Handle transient failures
+10. **Monitor costs** - Track cost per token