refactor: reorganize skills into sub-categories

The skills directory was getting disorganized: mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each.

Code change:
- prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
2026-04-25 00:51:20 +00:00 · 2026-03-09 03:35:53 -07:00 · 2026-03-09 03:35:53 -07:00 · 732c66b0f3
commit 732c66b0f3
parent d6c710706f
217 changed files with 39 additions and 4 deletions
--- a/skills/mlops/evaluation/nemo-curator/references/deduplication.md
+++ b/skills/mlops/evaluation/nemo-curator/references/deduplication.md
@@ -0,0 +1,87 @@
+# Deduplication Guide
+
+Complete guide to exact, fuzzy, and semantic deduplication.
+
+## Exact deduplication
+
+Remove documents with identical content.
+
+```python
+from nemo_curator.modules import ExactDuplicates
+
+# Exact deduplication
+exact_dedup = ExactDuplicates(
+    id_field="id",
+    text_field="text",
+    hash_method="md5"  # or "sha256"
+)
+
+deduped = exact_dedup(dataset)
+```
+
+**Performance**: ~16× faster on GPU vs CPU
+
+## Fuzzy deduplication
+
+Remove near-duplicate documents using MinHash + LSH.
+
+```python
+from nemo_curator.modules import FuzzyDuplicates
+
+fuzzy_dedup = FuzzyDuplicates(
+    id_field="id",
+    text_field="text",
+    num_hashes=260,        # MinHash permutations (more = more accurate, slower)
+    num_buckets=20,        # LSH buckets (at fixed num_hashes: more = higher recall, more candidate pairs)
+    hash_method="md5",
+    jaccard_threshold=0.8  # Similarity threshold
+)
+
+deduped = fuzzy_dedup(dataset)
+```
+
+**Parameters**:
+- `num_hashes`: 128-512 (default 260)
+- `num_buckets`: 10-50 (default 20)
+- `jaccard_threshold`: 0.7-0.9 (default 0.8)
+
+**Performance**: 16× faster on 8TB dataset (120h → 7.5h)
+
+## Semantic deduplication
+
+Remove semantically similar documents using embeddings.
+
+```python
+from nemo_curator.modules import SemanticDuplicates
+
+semantic_dedup = SemanticDuplicates(
+    id_field="id",
+    text_field="text",
+    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
+    embedding_batch_size=256,
+    threshold=0.85,  # Cosine similarity threshold
+    device="cuda"
+)
+
+deduped = semantic_dedup(dataset)
+```
+
+**Models**:
+- `all-MiniLM-L6-v2`: Fast, 384 dims
+- `all-mpnet-base-v2`: Better quality, 768 dims
+- Custom models supported
+
+## Comparison
+
+| Method | Speed | Recall | Use Case |
+|--------|-------|--------|----------|
+| Exact | Fastest | 100% | Exact matches only |
+| Fuzzy | Fast | ~95% | Near-duplicates (recommended) |
+| Semantic | Slow | ~90% | Paraphrases, rewrites |
+
+## Best practices
+
+1. **Start with exact dedup** - Remove obvious duplicates
+2. **Use fuzzy for large datasets** - Best speed/quality trade-off
+3. **Semantic for high-value data** - Expensive but thorough
+4. **Run on GPU where available** - 10-16× speedup over CPU
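Review note: the staged approach in the guide (exact first, then fuzzy) can be illustrated without NeMo Curator. The following is a minimal standard-library sketch, not the library's implementation. Every function name here is hypothetical, and `hashes_per_bucket=13` is an assumption derived from the guide's defaults (260 hashes / 20 buckets). It shows MD5-based exact dedup, a MinHash estimate of Jaccard similarity over word shingles, and the standard LSH banding formula for how likely a pair at a given similarity is to become a candidate.

```python
import hashlib
import re


def exact_dedup(docs):
    """Keep the first document seen for each distinct MD5 of its text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, k=3):
    """Set of k-word shingles; Jaccard similarity is computed over these."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_signature(shingle_set, num_hashes=260):
    """One minimum per salted hash; each salt stands in for a random permutation."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the true Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_probability(similarity, num_buckets=20, hashes_per_bucket=13):
    """P(pair shares at least one bucket) = 1 - (1 - s^r)^b for b buckets of r hashes."""
    return 1.0 - (1.0 - similarity ** hashes_per_bucket) ** num_buckets


def fuzzy_dedup(docs, threshold=0.8, num_hashes=260):
    """Greedy all-pairs filter on estimated Jaccard; real pipelines use LSH to avoid O(n^2)."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc["text"]), num_hashes)
        if all(estimate_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


BASE = "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn"
DOCS = [
    {"id": "a", "text": BASE},
    {"id": "a-copy", "text": BASE},                # exact duplicate of "a"
    {"id": "b", "text": BASE + " today"},          # near-duplicate of "a"
    {"id": "c", "text": "vector databases store embeddings for fast nearest neighbour search"},
]

if __name__ == "__main__":
    after_exact = exact_dedup(DOCS)
    after_fuzzy = fuzzy_dedup(after_exact)
    print([d["id"] for d in after_exact])
    print([d["id"] for d in after_fuzzy])
```

Running exact dedup first keeps the more expensive signature step off byte-identical copies, which is the ordering the best practices list recommends. The banding formula also shows why the bucket count matters: with the total number of hashes held fixed, more buckets means fewer hashes per bucket, so more pairs near the threshold become candidates.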