refactor: remove outdated skills and references from MLOps

- Deleted the `huggingface-accelerate` skill documentation, which included details on distributed training and common workflows.
- Removed `custom-plugins.md`, `megatron-integration.md`, `performance.md`, and other related reference documents that were no longer relevant or necessary.
- This cleanup aims to streamline the MLOps skills repository and improve maintainability.
2026-04-28 01:21:43 +00:00 · 2026-02-25 04:22:48 -08:00 · 757d012ab5
commit 757d012ab5
parent f64a87209d
47 changed files with 170 additions and 21638 deletions
--- a/skills/mlops/nemo-curator/references/deduplication.md
+++ b/skills/mlops/nemo-curator/references/deduplication.md
@@ -1,87 +0,0 @@
-# Deduplication Guide
-
-Complete guide to exact, fuzzy, and semantic deduplication.
-
-## Exact deduplication
-
-Remove documents with identical content.
-
-```python
-from nemo_curator.modules import ExactDuplicates
-
-# Exact deduplication
-exact_dedup = ExactDuplicates(
-    id_field="id",
-    text_field="text",
-    hash_method="md5"  # or "sha256"
-)
-
-deduped = exact_dedup(dataset)
-```
-
-**Performance**: ~16× faster on GPU vs CPU
-
-## Fuzzy deduplication
-
-Remove near-duplicate documents using MinHash + LSH.
-
-```python
-from nemo_curator.modules import FuzzyDuplicates
-
-fuzzy_dedup = FuzzyDuplicates(
-    id_field="id",
-    text_field="text",
-    num_hashes=260,        # MinHash permutations (more = accurate)
-    num_buckets=20,        # LSH buckets (more = faster, less recall)
-    hash_method="md5",
-    jaccard_threshold=0.8  # Similarity threshold
-)
-
-deduped = fuzzy_dedup(dataset)
-```
-
-**Parameters**:
- `num_hashes`: 128-512 (default 260)
- `num_buckets`: 10-50 (default 20)
- `jaccard_threshold`: 0.7-0.9 (default 0.8)
-
-**Performance**: 16× faster on 8TB dataset (120h → 7.5h)
-
-## Semantic deduplication
-
-Remove semantically similar documents using embeddings.
-
-```python
-from nemo_curator.modules import SemanticDuplicates
-
-semantic_dedup = SemanticDuplicates(
-    id_field="id",
-    text_field="text",
-    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
-    embedding_batch_size=256,
-    threshold=0.85,  # Cosine similarity threshold
-    device="cuda"
-)
-
-deduped = semantic_dedup(dataset)
-```
-
-**Models**:
- `all-MiniLM-L6-v2`: Fast, 384 dims
- `all-mpnet-base-v2`: Better quality, 768 dims
- Custom models supported
-
-## Comparison
-
-| Method | Speed | Recall | Use Case |
-|--------|-------|--------|----------|
-| Exact | Fastest | 100% | Exact matches only |
-| Fuzzy | Fast | ~95% | Near-duplicates (recommended) |
-| Semantic | Slow | ~90% | Paraphrases, rewrites |
-
-## Best practices
-
-1. **Start with exact dedup** - Remove obvious duplicates
-2. **Use fuzzy for large datasets** - Best speed/quality trade-off
-3. **Semantic for high-value data** - Expensive but thorough
-4. **GPU acceleration required** - 10-16× speedup
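
The exact and fuzzy strategies described in the removed deduplication guide can be illustrated without NeMo Curator. The following is a minimal standard-library sketch, not the library's implementation: the function names (`exact_dedup`, `shingles`, `jaccard`, `minhash_signature`, `estimated_jaccard`), the 3-word shingling, and the salted-MD5 hash family are all assumptions made for illustration.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first document seen for each distinct MD5 digest of its text."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Return the set of lowercase n-word shingles (assumed tokenization: split on whitespace)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """One member of a hypothetical hash family: MD5 of the seed-salted item, truncated to 64 bits."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, record the minimum hash over the set."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

A fuzzy pass in this style would compare `estimated_jaccard` against a threshold such as the guide's `jaccard_threshold=0.8`. A production system would additionally band the signatures with LSH so that only candidate pairs are compared, which this sketch omits.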
--- a/skills/mlops/nemo-curator/references/filtering.md
+++ b/skills/mlops/nemo-curator/references/filtering.md
@@ -1,102 +0,0 @@
-# Quality Filtering Guide
-
-Complete guide to NeMo Curator's 30+ quality filters.
-
-## Text-based filters
-
-### Word count
-
-```python
-from nemo_curator.filters import WordCountFilter
-
-# Filter by word count
-dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))
-```
-
-### Repeated content
-
-```python
-from nemo_curator.filters import RepeatedLinesFilter
-
-# Remove documents with >30% repeated lines
-dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))
-```
-
-### Symbol ratio
-
-```python
-from nemo_curator.filters import SymbolToWordRatioFilter
-
-# Remove documents with too many symbols
-dataset = dataset.filter(SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3))
-```
-
-### URL ratio
-
-```python
-from nemo_curator.filters import UrlRatioFilter
-
-# Remove documents with many URLs
-dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
-```
-
-## Language filtering
-
-```python
-from nemo_curator.filters import LanguageIdentificationFilter
-
-# Keep only English documents
-dataset = dataset.filter(LanguageIdentificationFilter(target_languages=["en"]))
-
-# Multiple languages
-dataset = dataset.filter(LanguageIdentificationFilter(target_languages=["en", "es", "fr"]))
-```
-
-## Classifier-based filtering
-
-### Quality classifier
-
-```python
-from nemo_curator.classifiers import QualityClassifier
-
-quality_clf = QualityClassifier(
-    model_path="nvidia/quality-classifier-deberta",
-    batch_size=256,
-    device="cuda"
-)
-
-# Filter low-quality (threshold > 0.5 = high quality)
-dataset = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
-```
-
-### NSFW classifier
-
-```python
-from nemo_curator.classifiers import NSFWClassifier
-
-nsfw_clf = NSFWClassifier(threshold=0.9, device="cuda")
-
-# Remove NSFW content
-dataset = dataset.filter(lambda doc: nsfw_clf(doc["text"]) < 0.9)
-```
-
-## Heuristic filters
-
-Full list of 30+ filters:
- WordCountFilter
- RepeatedLinesFilter
- UrlRatioFilter
- SymbolToWordRatioFilter
- NonAlphaNumericFilter
- BulletsFilter
- WhiteSpaceFilter
- ParenthesesFilter
- LongWordFilter
- And 20+ more...
-
-## Best practices
-
-1. **Apply cheap filters first** - Word count before GPU classifiers
-2. **Tune thresholds on sample** - Test on 10k docs before full run
-3. **Use GPU classifiers sparingly** - Expensive but effective
-4. **Chain filters efficiently** - Order by cost (cheap → expensive)
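
The "cheap before expensive" ordering in the removed best-practices list can be sketched in plain Python. This is an illustrative sketch only and does not use NeMo Curator: `word_count_ok`, `repeated_lines_ok`, and `apply_filters` are hypothetical helpers, the numeric cost attached to each filter is an assumed hand-assigned rank, and the repeated-line fraction is defined here as duplicate non-empty lines divided by all non-empty lines.

```python
def word_count_ok(doc, min_words=50, max_words=100_000):
    """Cheap heuristic: keep documents whose word count is within bounds."""
    n = len(doc["text"].split())
    return min_words <= n <= max_words


def repeated_lines_ok(doc, max_fraction=0.3):
    """Cheap heuristic: reject documents where too many non-empty lines are repeats."""
    lines = [ln for ln in doc["text"].splitlines() if ln.strip()]
    if not lines:
        return False
    repeated = len(lines) - len(set(lines))
    return repeated / len(lines) <= max_fraction


def apply_filters(docs, filters):
    """Run (cost, predicate) pairs cheapest first; stop at the first failure per document.

    Because all() short-circuits on the generator, an expensive predicate
    (for example a GPU classifier) only runs on documents that already
    passed every cheaper check.
    """
    ordered = [pred for _, pred in sorted(filters, key=lambda pair: pair[0])]
    return [doc for doc in docs if all(pred(doc) for pred in ordered)]
```

A caller would pass something like `apply_filters(docs, [(100, classifier_fn), (1, word_count_ok), (2, repeated_lines_ok)])`, where `classifier_fn` stands in for whatever model-based scorer is in use. The list order does not matter because the function sorts by cost.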