refactor: reorganize skills into sub-categories

The skills directory was getting disorganized: mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each.

Code change:
- prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
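
For illustration, the sub-category behavior described for the skill scanner could look roughly like the sketch below. This is not the code in prompt_builder.py: the helper name `skill_category`, the `root` parameter, and the assumed layout `skills/<category>[/<sub-category>]/<skill>/SKILL.md` are all hypothetical.

```python
from pathlib import PurePosixPath


def skill_category(skill_md: str, root: str = "skills") -> str:
    """Hypothetical helper: derive a category label from a SKILL.md path.

    Assumes the last two path components are <skill-name>/SKILL.md and that
    everything between the root and the skill directory is the category.
    """
    parts = PurePosixPath(skill_md).relative_to(root).parts
    # Drop "<skill-name>" and "SKILL.md"; what remains is one component for
    # a flat category or two for a category with a sub-category.
    return "/".join(parts[:-2])
```

Under these assumptions a nested path such as skills/mlops/training/axolotl/SKILL.md maps to 'mlops/training', while a flat path such as skills/media/gif-search/SKILL.md still maps to 'media', which is the backwards-compatible case.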
2026-04-25 00:51:20 +00:00 · 2026-03-09 03:35:53 -07:00 · 2026-03-09 03:35:53 -07:00 · 732c66b0f3
commit 732c66b0f3
parent d6c710706f
217 changed files with 39 additions and 4 deletions
--- a/skills/mlops/inference/llama-cpp/references/server.md
+++ b/skills/mlops/inference/llama-cpp/references/server.md
@@ -0,0 +1,125 @@
+# Server Deployment Guide
+
+Production deployment of llama.cpp server with OpenAI-compatible API.
+
+## Server Modes
+
+### llama-server
+
+```bash
+# Basic server
+./llama-server \
+    -m models/llama-2-7b-chat.Q4_K_M.gguf \
+    --host 0.0.0.0 \
+    --port 8080 \
+    -c 4096  # Context size
+
+# With GPU acceleration
+./llama-server \
+    -m models/llama-2-70b.Q4_K_M.gguf \
+    -ngl 40  # Offload 40 layers to GPU
+```
+
+## OpenAI-Compatible API
+
+### Chat completions
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama-2",
+    "messages": [
+      {"role": "system", "content": "You are helpful"},
+      {"role": "user", "content": "Hello"}
+    ],
+    "temperature": 0.7,
+    "max_tokens": 100
+  }'
+```
+
+### Streaming
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "llama-2",
+    "messages": [{"role": "user", "content": "Count to 10"}],
+    "stream": true
+  }'
+```
+
+## Docker Deployment
+
+**Dockerfile**:
+```dockerfile
+FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
+RUN apt-get update && apt-get install -y git build-essential cmake libcurl4-openssl-dev
+RUN git clone https://github.com/ggerganov/llama.cpp
+WORKDIR /llama.cpp
+RUN cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
+COPY models/ /models/
+EXPOSE 8080
+CMD ["./build/bin/llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]
+```
+
+**Build and run**:
+```bash
+docker build -t llama-cpp:latest . && docker run --gpus all -p 8080:8080 llama-cpp:latest
+```
+
+## Monitoring
+
+```bash
+# Server metrics endpoint (requires starting llama-server with --metrics)
+curl http://localhost:8080/metrics
+
+# Health check
+curl http://localhost:8080/health
+```
+
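+**Metrics** (Prometheus format; names are prefixed with `llamacpp:` and vary by version):
+- llamacpp:prompt_tokens_total
+- llamacpp:tokens_predicted_total
+- llamacpp:requests_processing
+- llamacpp:requests_deferred
+- llamacpp:kv_cache_tokens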
+
+## Load Balancing
+
+**NGINX**:
+```nginx
+upstream llama_cpp {
+    server llama1:8080;
+    server llama2:8080;
+}
+
+server {
+    location / {
+        proxy_pass http://llama_cpp;
+        proxy_read_timeout 300s;
+    }
+}
+```
+
+## Performance Tuning
+
+**Parallel requests**:
+```bash
+./llama-server \
+    -m model.gguf \
+    -np 4  # 4 parallel slots
+```
+
+**Continuous batching**:
+```bash
+./llama-server \
+    -m model.gguf \
+    --cont-batching  # Enable continuous batching
+```
+
+**Context caching**:
+```bash
+./llama-server \
+    -m model.gguf \
+    --cache-prompt  # Cache processed prompts
+```
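
The chat completion and streaming requests shown with curl in the OpenAI-Compatible API section can also be issued from Python. The following is a minimal sketch using only the standard library, not part of the committed file. It assumes the server is reachable at `http://localhost:8080` and that streamed responses arrive as OpenAI-style `data: {...}` lines ending with `data: [DONE]`. The names `BASE_URL`, `build_chat_request`, `parse_sse_line`, and `stream_chat` are hypothetical.

```python
import json
import urllib.request

# Assumption: llama-server is listening here, as in the curl examples.
BASE_URL = "http://localhost:8080"


def build_chat_request(messages, stream=False, base_url=BASE_URL):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {"model": "llama-2", "messages": messages, "stream": stream}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sse_line(raw: bytes):
    """Return the text delta carried by one streamed line, or None.

    Blank lines, non-data lines, and the final "[DONE]" marker yield None.
    """
    line = raw.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None
    choices = json.loads(data).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(messages, base_url=BASE_URL):
    """Yield text pieces from a streaming chat completion (needs a live server)."""
    req = build_chat_request(messages, stream=True, base_url=base_url)
    with urllib.request.urlopen(req, timeout=300) as resp:
        for raw in resp:  # the response object iterates line by line
            piece = parse_sse_line(raw)
            if piece:
                yield piece
```

With a server running, iterating over `stream_chat([{"role": "user", "content": "Count to 10"}])` and printing each piece would display tokens as they arrive. The 300-second timeout mirrors the `proxy_read_timeout` used in the NGINX example and is likewise only an assumption about how long a generation may take.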