Mirror of https://github.com/NousResearch/hermes-agent.git, synced 2026-04-27 01:11:40 +00:00
refactor: remove outdated skills and references from MLOps
- Deleted the `huggingface-accelerate` skill documentation, which included details on distributed training and common workflows.
- Removed `custom-plugins.md`, `megatron-integration.md`, `performance.md`, and other related reference documents that were no longer relevant or necessary.
- This cleanup aims to streamline the MLOps skills repository and improve maintainability.
This commit is contained in:
parent f64a87209d
commit 757d012ab5
47 changed files with 170 additions and 21638 deletions
---
name: tensorrt-llm
description: Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [tensorrt-llm, torch]
metadata:
  hermes:
    tags: [Inference Serving, TensorRT-LLM, NVIDIA, Inference Optimization, High Throughput, Low Latency, Production, FP8, INT4, In-Flight Batching, Multi-GPU]
---

# TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

## When to use TensorRT-LLM

**Use TensorRT-LLM when:**
- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes

**Use vLLM instead when:**
- Need simpler setup and a Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or other non-NVIDIA hardware

**Use llama.cpp instead when:**
- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want the simpler GGUF quantization format

## Quick start

### Installation

```bash
# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
```
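
If you go the Docker route, a typical way to get an interactive, GPU-enabled shell in the container is sketched below; the mount path is illustrative, and the tag should match whatever you pulled above.

```bash
# Run the container with GPU access and a host directory mounted for model weights
# (paths are placeholders; adjust for your environment)
docker run --rm -it --gpus all \
    --ipc=host \
    -v "$HOME/models:/workspace/models" \
    nvidia/tensorrt_llm:latest \
    bash
```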

### Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

# Each result carries the prompt plus one or more completions
for output in outputs:
    print(output.outputs[0].text)
```

### Serving with trtllm-serve

```bash
# Start server (automatic model download and compilation)
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "max_tokens": 100
    }'
```
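
Because the server exposes an OpenAI-compatible API, you can also call it from Python with the `openai` client package instead of curl. A minimal sketch; the API key is a placeholder, since the local server does not normally validate it.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local trtllm-serve endpoint
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```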

## Key features

### Performance optimizations
- **In-flight batching**: Dynamic batching during generation
- **Paged KV cache**: Efficient memory management
- **Flash Attention**: Optimized attention kernels
- **Quantization**: FP8, INT4, FP4 for 2-4× faster inference
- **CUDA graphs**: Reduced kernel launch overhead
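
Several of these knobs are reachable from the Python API. The sketch below is illustrative only: it assumes `KvCacheConfig` and the option names shown exist in your installed version, so verify them against `tensorrt_llm.llmapi` before relying on it.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig  # assumed import path

# Give most free GPU memory to the paged KV cache and reuse blocks across
# requests that share a prefix (option names are assumptions, not confirmed)
kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.9,
    enable_block_reuse=True,
)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_config=kv_cache_config,
)
```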

### Parallelism
- **Tensor parallelism (TP)**: Split model across GPUs
- **Pipeline parallelism (PP)**: Layer-wise distribution
- **Expert parallelism**: For Mixture-of-Experts models
- **Multi-node**: Scale beyond a single machine
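
TP and PP can be combined from the same `LLM` entry point. A rough sketch, assuming `pipeline_parallel_size` is the keyword argument in your version (see the Multi-GPU Setup reference for details):

```python
from tensorrt_llm import LLM

# 4-way tensor parallelism x 2-way pipeline parallelism = 8 GPUs total
# (pipeline_parallel_size is assumed; confirm against the installed LLM API)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)
```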

### Advanced features
- **Speculative decoding**: Faster generation with draft models
- **LoRA serving**: Efficient multi-adapter deployment
- **Disaggregated serving**: Separate prefill and generation

## Common patterns

### Quantized model (FP8)

```python
from tensorrt_llm import LLM

# Load FP8 quantized model (2× faster, 50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])
```
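
If your build expects an explicit quantization config object rather than a dtype string, the same setup is typically spelled with `QuantConfig`. This is a sketch under that assumption; check that `QuantConfig` and `QuantAlgo` exist in `tensorrt_llm.llmapi` for your version.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo  # assumed import path

# Explicit FP8 quantization config (class and enum names are assumptions)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
    max_num_tokens=8192,
)
```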

### Multi-GPU deployment

```python
# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8"
)
```

### Batch inference

```python
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)

# Automatic in-flight batching for maximum throughput
```
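
To put a number on the batch above, you can time the call end to end. A minimal sketch using only the objects already constructed; the tokens/sec figure is an upper bound derived from `max_tokens`, not an exact count.

```python
import time

max_tokens = 200
prompts = [f"Question {i}: ..." for i in range(100)]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params=SamplingParams(max_tokens=max_tokens))
elapsed = time.perf_counter() - start

# Upper bound: each request generates at most max_tokens tokens
print(f"{len(outputs)} requests in {elapsed:.1f}s "
      f"(~{len(outputs) * max_tokens / elapsed:.0f} tokens/sec upper bound)")
```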

## Performance benchmarks

**Meta Llama 3-8B** (H100 GPU):
- Throughput: 24,000 tokens/sec
- Latency: ~10ms per token
- vs PyTorch: **100× faster**

**Llama 3-70B** (8× A100 80GB):
- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8
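
A quick back-of-the-envelope check on the FP8 memory claim, counting weights only (KV cache and activations excluded):

```python
# Weight memory for a 70B-parameter model
params = 70e9
fp16_gib = params * 2 / 2**30   # 2 bytes per parameter -> ~130 GiB
fp8_gib = params * 1 / 2**30    # 1 byte per parameter  -> ~65 GiB

print(f"FP16: ~{fp16_gib:.0f} GiB, FP8: ~{fp8_gib:.0f} GiB "
      f"({1 - fp8_gib / fp16_gib:.0%} reduction)")
```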

## Supported models

- **LLaMA family**: Llama 2, Llama 3, CodeLlama
- **GPT family**: GPT-2, GPT-J, GPT-NeoX
- **Qwen**: Qwen, Qwen2, QwQ
- **DeepSeek**: DeepSeek-V2, DeepSeek-V3
- **Mixtral**: Mixtral-8x7B, Mixtral-8x22B
- **Vision**: LLaVA, Phi-3-vision
- **100+ models** on HuggingFace

## References

- **[Optimization Guide](references/optimization.md)** - Quantization, batching, KV cache tuning
- **[Multi-GPU Setup](references/multi-gpu.md)** - Tensor/pipeline parallelism, multi-node
- **[Serving Guide](references/serving.md)** - Production deployment, monitoring, autoscaling

## Resources

- **Docs**: https://nvidia.github.io/TensorRT-LLM/
- **GitHub**: https://github.com/NVIDIA/TensorRT-LLM
- **Models**: https://huggingface.co/models?library=tensorrt_llm