mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-26 01:01:40 +00:00
fix: restore all removed bundled skills + fix skills sync system
- Restored 21 skills removed in commits757d012and740dd92: accelerate, audiocraft, code-review, faiss, flash-attention, gguf, grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion, tensorrt-llm, torchtitan, trl-fine-tuning, whisper - Rewrote sync_skills() with proper update semantics: * New skills (not in manifest): copied to user dir * Existing skills (in manifest + on disk): updated via hash comparison * User-deleted skills (in manifest, not on disk): respected, not re-added * Stale manifest entries (removed from bundled): cleaned from manifest - Added sync_skills() to CLI startup (cmd_chat) and gateway startup (start_gateway) — previously only ran during 'hermes update' - Updated cmd_update output to show new/updated/cleaned counts - Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh install, user deletion respect, update detection, stale cleanup, and name collision handling 75 bundled skills total. 2002 tests pass.
This commit is contained in:
parent
68fbae5692
commit
ab0f4126cf
74 changed files with 27881 additions and 44 deletions
470
skills/mlops/tensorrt-llm/references/serving.md
Normal file
470
skills/mlops/tensorrt-llm/references/serving.md
Normal file
|
|
@ -0,0 +1,470 @@
|
|||
# Production Serving Guide
|
||||
|
||||
Comprehensive guide to deploying TensorRT-LLM in production environments.
|
||||
|
||||
## Server Modes
|
||||
|
||||
### trtllm-serve (Recommended)
|
||||
|
||||
**Features**:
|
||||
- OpenAI-compatible API
|
||||
- Automatic model download and compilation
|
||||
- Built-in load balancing
|
||||
- Prometheus metrics
|
||||
- Health checks
|
||||
|
||||
**Basic usage**:
|
||||
```bash
|
||||
trtllm-serve meta-llama/Meta-Llama-3-8B \
|
||||
--tp_size 1 \
|
||||
--max_batch_size 256 \
|
||||
--port 8000
|
||||
```
|
||||
|
||||
**Advanced configuration**:
|
||||
```bash
|
||||
trtllm-serve meta-llama/Meta-Llama-3-70B \
|
||||
--tp_size 4 \
|
||||
--dtype fp8 \
|
||||
--max_batch_size 256 \
|
||||
--max_num_tokens 4096 \
|
||||
--enable_chunked_context \
|
||||
--scheduler_policy max_utilization \
|
||||
--port 8000 \
|
||||
--api_key $API_KEY # Optional authentication
|
||||
```
|
||||
|
||||
### Python LLM API (For embedding)
|
||||
|
||||
```python
|
||||
from tensorrt_llm import LLM
|
||||
|
||||
class LLMService:
|
||||
def __init__(self):
|
||||
self.llm = LLM(
|
||||
model="meta-llama/Meta-Llama-3-8B",
|
||||
dtype="fp8"
|
||||
)
|
||||
|
||||
def generate(self, prompt, max_tokens=100):
|
||||
from tensorrt_llm import SamplingParams
|
||||
|
||||
params = SamplingParams(
|
||||
max_tokens=max_tokens,
|
||||
temperature=0.7
|
||||
)
|
||||
outputs = self.llm.generate([prompt], params)
|
||||
return outputs[0].text
|
||||
|
||||
# Use in FastAPI, Flask, etc
|
||||
from fastapi import FastAPI
|
||||
app = FastAPI()
|
||||
service = LLMService()
|
||||
|
||||
@app.post("/generate")
|
||||
def generate(prompt: str):
|
||||
return {"response": service.generate(prompt)}
|
||||
```
|
||||
|
||||
## OpenAI-Compatible API
|
||||
|
||||
### Chat Completions
|
||||
|
||||
```bash
|
||||
curl -X POST http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "meta-llama/Meta-Llama-3-8B",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": "Explain quantum computing"}
|
||||
],
|
||||
"temperature": 0.7,
|
||||
"max_tokens": 500,
|
||||
"stream": false
|
||||
}'
|
||||
```
|
||||
|
||||
**Response**:
|
||||
```json
|
||||
{
|
||||
"id": "chat-abc123",
|
||||
"object": "chat.completion",
|
||||
"created": 1234567890,
|
||||
"model": "meta-llama/Meta-Llama-3-8B",
|
||||
"choices": [{
|
||||
"index": 0,
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": "Quantum computing is..."
|
||||
},
|
||||
"finish_reason": "stop"
|
||||
}],
|
||||
"usage": {
|
||||
"prompt_tokens": 25,
|
||||
"completion_tokens": 150,
|
||||
"total_tokens": 175
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Streaming
|
||||
|
||||
```bash
|
||||
curl -X POST http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "meta-llama/Meta-Llama-3-8B",
|
||||
"messages": [{"role": "user", "content": "Count to 10"}],
|
||||
"stream": true
|
||||
}'
|
||||
```
|
||||
|
||||
**Response** (SSE stream):
|
||||
```
|
||||
data: {"choices":[{"delta":{"content":"1"}}]}
|
||||
|
||||
data: {"choices":[{"delta":{"content":", 2"}}]}
|
||||
|
||||
data: {"choices":[{"delta":{"content":", 3"}}]}
|
||||
|
||||
data: [DONE]
|
||||
```
|
||||
|
||||
### Completions
|
||||
|
||||
```bash
|
||||
curl -X POST http://localhost:8000/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "meta-llama/Meta-Llama-3-8B",
|
||||
"prompt": "The capital of France is",
|
||||
"max_tokens": 10,
|
||||
"temperature": 0.0
|
||||
}'
|
||||
```
|
||||
|
||||
## Monitoring
|
||||
|
||||
### Prometheus Metrics
|
||||
|
||||
**Enable metrics**:
|
||||
```bash
|
||||
trtllm-serve meta-llama/Meta-Llama-3-8B \
|
||||
--enable_metrics \
|
||||
--metrics_port 9090
|
||||
```
|
||||
|
||||
**Key metrics**:
|
||||
```bash
|
||||
# Scrape metrics
|
||||
curl http://localhost:9090/metrics
|
||||
|
||||
# Important metrics:
|
||||
# - trtllm_request_success_total - Total successful requests
|
||||
# - trtllm_request_latency_seconds - Request latency histogram
|
||||
# - trtllm_tokens_generated_total - Total tokens generated
|
||||
# - trtllm_active_requests - Current active requests
|
||||
# - trtllm_queue_size - Requests waiting in queue
|
||||
# - trtllm_gpu_memory_usage_bytes - GPU memory usage
|
||||
# - trtllm_kv_cache_usage_ratio - KV cache utilization
|
||||
```
|
||||
|
||||
### Health Checks
|
||||
|
||||
```bash
|
||||
# Readiness probe
|
||||
curl http://localhost:8000/health/ready
|
||||
|
||||
# Liveness probe
|
||||
curl http://localhost:8000/health/live
|
||||
|
||||
# Model info
|
||||
curl http://localhost:8000/v1/models
|
||||
```
|
||||
|
||||
**Kubernetes probes**:
|
||||
```yaml
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /health/live
|
||||
port: 8000
|
||||
initialDelaySeconds: 60
|
||||
periodSeconds: 10
|
||||
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /health/ready
|
||||
port: 8000
|
||||
initialDelaySeconds: 30
|
||||
periodSeconds: 5
|
||||
```
|
||||
|
||||
## Production Deployment
|
||||
|
||||
### Docker Deployment
|
||||
|
||||
**Dockerfile**:
|
||||
```dockerfile
|
||||
FROM nvidia/tensorrt_llm:latest
|
||||
|
||||
# Copy any custom configs
|
||||
COPY config.yaml /app/config.yaml
|
||||
|
||||
# Expose ports
|
||||
EXPOSE 8000 9090
|
||||
|
||||
# Start server
|
||||
CMD ["trtllm-serve", "meta-llama/Meta-Llama-3-8B", \
|
||||
"--tp_size", "4", \
|
||||
"--dtype", "fp8", \
|
||||
"--max_batch_size", "256", \
|
||||
"--enable_metrics", \
|
||||
"--metrics_port", "9090"]
|
||||
```
|
||||
|
||||
**Run container**:
|
||||
```bash
|
||||
docker run --gpus all -p 8000:8000 -p 9090:9090 \
|
||||
tensorrt-llm:latest
|
||||
```
|
||||
|
||||
### Kubernetes Deployment
|
||||
|
||||
**Complete deployment**:
|
||||
```yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: tensorrt-llm
|
||||
spec:
|
||||
replicas: 2 # Multiple replicas for HA
|
||||
selector:
|
||||
matchLabels:
|
||||
app: tensorrt-llm
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: tensorrt-llm
|
||||
spec:
|
||||
containers:
|
||||
- name: trtllm
|
||||
image: nvidia/tensorrt_llm:latest
|
||||
command:
|
||||
- trtllm-serve
|
||||
- meta-llama/Meta-Llama-3-70B
|
||||
- --tp_size=4
|
||||
- --dtype=fp8
|
||||
- --max_batch_size=256
|
||||
- --enable_metrics
|
||||
ports:
|
||||
- containerPort: 8000
|
||||
name: http
|
||||
- containerPort: 9090
|
||||
name: metrics
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 4
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /health/live
|
||||
port: 8000
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /health/ready
|
||||
port: 8000
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: tensorrt-llm
|
||||
spec:
|
||||
selector:
|
||||
app: tensorrt-llm
|
||||
ports:
|
||||
- name: http
|
||||
port: 80
|
||||
targetPort: 8000
|
||||
- name: metrics
|
||||
port: 9090
|
||||
targetPort: 9090
|
||||
type: LoadBalancer
|
||||
```
|
||||
|
||||
### Load Balancing
|
||||
|
||||
**NGINX configuration**:
|
||||
```nginx
|
||||
upstream tensorrt_llm {
|
||||
least_conn; # Route to least busy server
|
||||
server trtllm-1:8000 max_fails=3 fail_timeout=30s;
|
||||
server trtllm-2:8000 max_fails=3 fail_timeout=30s;
|
||||
server trtllm-3:8000 max_fails=3 fail_timeout=30s;
|
||||
}
|
||||
|
||||
server {
|
||||
listen 80;
|
||||
location / {
|
||||
proxy_pass http://tensorrt_llm;
|
||||
proxy_read_timeout 300s; # Long timeout for slow generations
|
||||
proxy_connect_timeout 10s;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Autoscaling
|
||||
|
||||
### Horizontal Pod Autoscaler (HPA)
|
||||
|
||||
```yaml
|
||||
apiVersion: autoscaling/v2
|
||||
kind: HorizontalPodAutoscaler
|
||||
metadata:
|
||||
name: tensorrt-llm-hpa
|
||||
spec:
|
||||
scaleTargetRef:
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
name: tensorrt-llm
|
||||
minReplicas: 2
|
||||
maxReplicas: 10
|
||||
metrics:
|
||||
- type: Pods
|
||||
pods:
|
||||
metric:
|
||||
name: trtllm_active_requests
|
||||
target:
|
||||
type: AverageValue
|
||||
averageValue: "50" # Scale when avg >50 active requests
|
||||
```
|
||||
|
||||
### Custom Metrics
|
||||
|
||||
```yaml
|
||||
# Scale based on queue size
|
||||
- type: Pods
|
||||
pods:
|
||||
metric:
|
||||
name: trtllm_queue_size
|
||||
target:
|
||||
type: AverageValue
|
||||
averageValue: "10"
|
||||
```
|
||||
|
||||
## Cost Optimization
|
||||
|
||||
### GPU Selection
|
||||
|
||||
**A100 80GB** ($3-4/hour):
|
||||
- Use for: 70B models with FP8
|
||||
- Throughput: 10,000-15,000 tok/s (TP=4)
|
||||
- Cost per 1M tokens: $0.20-0.30
|
||||
|
||||
**H100 80GB** ($6-8/hour):
|
||||
- Use for: 70B models with FP8, 405B models
|
||||
- Throughput: 20,000-30,000 tok/s (TP=4)
|
||||
- Cost per 1M tokens: $0.15-0.25 (2× faster = lower cost)
|
||||
|
||||
**L4** ($0.50-1/hour):
|
||||
- Use for: 7-8B models
|
||||
- Throughput: 1,000-2,000 tok/s
|
||||
- Cost per 1M tokens: $0.25-0.50
|
||||
|
||||
### Batch Size Tuning
|
||||
|
||||
**Impact on cost**:
|
||||
- Batch size 1: 1,000 tok/s → $3/hour per 1M = $3/M tokens
|
||||
- Batch size 64: 5,000 tok/s → $3/hour per 5M = $0.60/M tokens
|
||||
- **5× cost reduction** with batching
|
||||
|
||||
**Recommendation**: Target batch size 32-128 for cost efficiency.
|
||||
|
||||
## Security
|
||||
|
||||
### API Authentication
|
||||
|
||||
```bash
|
||||
# Generate API key
|
||||
export API_KEY=$(openssl rand -hex 32)
|
||||
|
||||
# Start server with authentication
|
||||
trtllm-serve meta-llama/Meta-Llama-3-8B \
|
||||
--api_key $API_KEY
|
||||
|
||||
# Client request
|
||||
curl -X POST http://localhost:8000/v1/chat/completions \
|
||||
-H "Authorization: Bearer $API_KEY" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"model": "...", "messages": [...]}'
|
||||
```
|
||||
|
||||
### Network Policies
|
||||
|
||||
```yaml
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: tensorrt-llm-policy
|
||||
spec:
|
||||
podSelector:
|
||||
matchLabels:
|
||||
app: tensorrt-llm
|
||||
policyTypes:
|
||||
- Ingress
|
||||
ingress:
|
||||
- from:
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
app: api-gateway # Only allow from gateway
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 8000
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### High latency
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# Check queue size
|
||||
curl http://localhost:9090/metrics | grep queue_size
|
||||
|
||||
# Check active requests
|
||||
curl http://localhost:9090/metrics | grep active_requests
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
- Scale horizontally (more replicas)
|
||||
- Increase batch size (if GPU underutilized)
|
||||
- Enable chunked context (if long prompts)
|
||||
- Use FP8 quantization
|
||||
|
||||
### OOM crashes
|
||||
|
||||
**Solutions**:
|
||||
- Reduce `max_batch_size`
|
||||
- Reduce `max_num_tokens`
|
||||
- Enable FP8 or INT4 quantization
|
||||
- Increase `tensor_parallel_size`
|
||||
|
||||
### Timeout errors
|
||||
|
||||
**NGINX config**:
|
||||
```nginx
|
||||
proxy_read_timeout 600s; # 10 minutes for very long generations
|
||||
proxy_send_timeout 600s;
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use FP8 on H100** for 2× speedup and 50% cost reduction
|
||||
2. **Monitor metrics** - Set up Prometheus + Grafana
|
||||
3. **Set readiness probes** - Prevent routing to unhealthy pods
|
||||
4. **Use load balancing** - Distribute load across replicas
|
||||
5. **Tune batch size** - Balance latency and throughput
|
||||
6. **Enable streaming** - Better UX for chat applications
|
||||
7. **Set up autoscaling** - Handle traffic spikes
|
||||
8. **Use persistent volumes** - Cache compiled models
|
||||
9. **Implement retries** - Handle transient failures
|
||||
10. **Monitor costs** - Track cost per token
|
||||
Loading…
Add table
Add a link
Reference in a new issue