feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap (#3934)

* feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap

Map active skills to Telegram's slash command menu so users can discover and invoke skills directly.

Three changes:

1. Telegram menu now includes active skill commands alongside built-in commands, capped at 100 entries (Telegram Bot API limit). Overflow commands remain callable but hidden from the picker. Logged at startup when cap is hit.

2. New /commands [page] gateway command for paginated browsing of all commands + skills. /help now shows first 10 skill commands and points to /commands for the full list.

3. When a user types a slash command that matches a disabled or uninstalled skill, they get actionable guidance:
   - Disabled: 'Enable it with: hermes skills config'
   - Optional (not installed): 'Install with: hermes skills install official/<path>'

Built on ideas from PR #3921 by @kshitijk4poor.

* chore: move 21 niche skills to optional-skills

Move specialized/niche skills from built-in (skills/) to optional (optional-skills/) to reduce the default skill count. Users can install them with:

  hermes skills install official/<category>/<name>

Moved skills (21):
- mlops: accelerate, chroma, faiss, flash-attention, hermes-atropos-environments, huggingface-tokenizers, instructor, lambda-labs, llava, nemo-curator, pinecone, pytorch-lightning, qdrant, saelens, simpo, slime, tensorrt-llm, torchtitan
- research: domain-intel, duckduckgo-search
- devops: inference-sh cli

Built-in skills: 96 → 75
Optional skills: 22 → 43

* fix: only include repo built-in skills in Telegram menu, not user-installed

User-installed skills (from hub or manually added) stay accessible via /skills and by typing the command directly, but don't get registered in the Telegram slash command picker. Only skills whose SKILL.md is under the repo's skills/ directory are included in the menu.

This keeps the Telegram menu focused on the curated built-in set while user-installed skills remain discoverable through /skills and /commands.
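
The menu cap and the `/commands [page]` pagination described in the first commit can be illustrated with a minimal sketch. This is not the code from this commit: `build_menu`, `paginate`, and the page size of 10 are hypothetical names and values chosen for illustration. Only the 100-entry cap comes from the commit message.

```python
from math import ceil

TELEGRAM_MENU_CAP = 100  # cap quoted in the commit message


def build_menu(builtin_commands, skill_commands, cap=TELEGRAM_MENU_CAP):
    """Split commands into (visible, hidden) for the slash-command picker.

    Built-ins come first so skills can never push them out of the menu.
    Hidden commands stay callable; they are just not registered in the picker.
    """
    seen = set()
    ordered = []
    for name in list(builtin_commands) + list(skill_commands):
        if name not in seen:  # drop duplicates, keep the first occurrence
            seen.add(name)
            ordered.append(name)
    return ordered[:cap], ordered[cap:]


def paginate(commands, page=1, per_page=10):
    """Return (items, page, total_pages), clamping page into a valid range.

    per_page=10 is an assumption; the commit does not state the page size.
    """
    total_pages = max(1, ceil(len(commands) / per_page))
    page = min(max(page, 1), total_pages)
    start = (page - 1) * per_page
    return commands[start:start + per_page], page, total_pages
```

Clamping the page number rather than raising keeps a request such as `/commands 99` useful: the caller gets the last page instead of an error.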
2026-04-26 01:01:40 +00:00 · 2026-03-30 10:57:30 -07:00 · 2026-03-30 10:57:30 -07:00 · 5ceed021dc
commit 5ceed021dc
parent 97d6813f51
73 changed files with 163 additions and 4 deletions
--- a/optional-skills/mlops/nemo-curator/references/deduplication.md
+++ b/optional-skills/mlops/nemo-curator/references/deduplication.md
@@ -0,0 +1,87 @@
+# Deduplication Guide
+
+Complete guide to exact, fuzzy, and semantic deduplication.
+
+## Exact deduplication
+
+Remove documents with identical content.
+
+```python
+from nemo_curator.modules import ExactDuplicates
+
+# Exact deduplication
+exact_dedup = ExactDuplicates(
+    id_field="id",
+    text_field="text",
+    hash_method="md5"  # or "sha256"
+)
+
+deduped = exact_dedup(dataset)
+```
+
+**Performance**: ~16× faster on GPU vs CPU
+
+## Fuzzy deduplication
+
+Remove near-duplicate documents using MinHash + LSH.
+
+```python
+from nemo_curator.modules import FuzzyDuplicates
+
+fuzzy_dedup = FuzzyDuplicates(
+    id_field="id",
+    text_field="text",
+    num_hashes=260,        # MinHash permutations (more = more accurate, slower)
+    num_buckets=20,        # LSH bands (more = higher recall, more candidate pairs)
+    hash_method="md5",
+    jaccard_threshold=0.8  # Similarity threshold
+)
+
+deduped = fuzzy_dedup(dataset)
+```
+
+**Parameters**:
+- `num_hashes`: 128-512 (default 260)
+- `num_buckets`: 10-50 (default 20)
+- `jaccard_threshold`: 0.7-0.9 (default 0.8)
+
+**Performance**: 16× faster on 8TB dataset (120h → 7.5h)
+
+## Semantic deduplication
+
+Remove semantically similar documents using embeddings.
+
+```python
+from nemo_curator.modules import SemanticDuplicates
+
+semantic_dedup = SemanticDuplicates(
+    id_field="id",
+    text_field="text",
+    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
+    embedding_batch_size=256,
+    threshold=0.85,  # Cosine similarity threshold
+    device="cuda"
+)
+
+deduped = semantic_dedup(dataset)
+```
+
+**Models**:
+- `all-MiniLM-L6-v2`: Fast, 384 dims
+- `all-mpnet-base-v2`: Better quality, 768 dims
+- Custom models supported
+
+## Comparison
+
+| Method | Speed | Recall | Use Case |
+|--------|-------|--------|----------|
+| Exact | Fastest | 100% | Exact matches only |
+| Fuzzy | Fast | ~95% | Near-duplicates (recommended) |
+| Semantic | Slow | ~90% | Paraphrases, rewrites |
+
+## Best practices
+
+1. **Start with exact dedup** - Remove obvious duplicates
+2. **Use fuzzy for large datasets** - Best speed/quality trade-off
+3. **Semantic for high-value data** - Expensive but thorough
+4. **GPU acceleration required** - 10-16× speedup
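
For readers unfamiliar with the MinHash + LSH step that `deduplication.md` names, the following pure-Python sketch shows the idea on a handful of documents. It is not part of this commit and is not NeMo Curator's implementation. All function names are hypothetical, `num_buckets` is assumed to be the number of LSH bands, and word 3-grams are an arbitrary shingling choice.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Set of word n-grams; an empty text yields a single empty shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash(text, num_hashes=260):
    """One minimum per seeded hash function over the document's shingles."""
    items = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(min(
            hashlib.blake2b(item, digest_size=8, salt=salt).digest()
            for item in items
        ))
    return signature


def candidate_pairs(signatures, num_buckets=20):
    """Pairs of ids whose signatures agree on every row of at least one band."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // num_buckets
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(num_buckets):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(docs, num_hashes=260, num_buckets=20, jaccard_threshold=0.8):
    """Return id pairs that LSH proposes and whose estimate clears the threshold."""
    signatures = {doc_id: minhash(text, num_hashes) for doc_id, text in docs.items()}
    return {
        pair for pair in candidate_pairs(signatures, num_buckets)
        if estimated_jaccard(signatures[pair[0]], signatures[pair[1]]) >= jaccard_threshold
    }
```

With 260 hashes and 20 bands, each band covers 13 signature rows. Only documents that collide in at least one band are compared, which is what avoids the all-pairs comparison.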
--- a/optional-skills/mlops/nemo-curator/references/filtering.md
+++ b/optional-skills/mlops/nemo-curator/references/filtering.md
@@ -0,0 +1,102 @@
+# Quality Filtering Guide
+
+Complete guide to NeMo Curator's 30+ quality filters.
+
+## Text-based filters
+
+### Word count
+
+```python
+from nemo_curator.filters import WordCountFilter
+
+# Filter by word count
+dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))
+```
+
+### Repeated content
+
+```python
+from nemo_curator.filters import RepeatedLinesFilter
+
+# Remove documents with >30% repeated lines
+dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))
+```
+
+### Symbol ratio
+
+```python
+from nemo_curator.filters import SymbolToWordRatioFilter
+
+# Remove documents with too many symbols
+dataset = dataset.filter(SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3))
+```
+
+### URL ratio
+
+```python
+from nemo_curator.filters import UrlRatioFilter
+
+# Remove documents with many URLs
+dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
+```
+
+## Language filtering
+
+```python
+from nemo_curator.filters import LanguageIdentificationFilter
+
+# Keep only English documents
+dataset = dataset.filter(LanguageIdentificationFilter(target_languages=["en"]))
+
+# Multiple languages
+dataset = dataset.filter(LanguageIdentificationFilter(target_languages=["en", "es", "fr"]))
+```
+
+## Classifier-based filtering
+
+### Quality classifier
+
+```python
+from nemo_curator.classifiers import QualityClassifier
+
+quality_clf = QualityClassifier(
+    model_path="nvidia/quality-classifier-deberta",
+    batch_size=256,
+    device="cuda"
+)
+
+# Keep high-quality documents (score > 0.5)
+dataset = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
+```
+
+### NSFW classifier
+
+```python
+from nemo_curator.classifiers import NSFWClassifier
+
+nsfw_clf = NSFWClassifier(threshold=0.9, device="cuda")
+
+# Remove NSFW content
+dataset = dataset.filter(lambda doc: nsfw_clf(doc["text"]) < 0.9)
+```
+
+## Heuristic filters
+
+A selection of the 30+ filters:
+- WordCountFilter
+- RepeatedLinesFilter
+- UrlRatioFilter
+- SymbolToWordRatioFilter
+- NonAlphaNumericFilter
+- BulletsFilter
+- WhiteSpaceFilter
+- ParenthesesFilter
+- LongWordFilter
+- And 20+ more...
+
+## Best practices
+
+1. **Apply cheap filters first** - Word count before GPU classifiers
+2. **Tune thresholds on sample** - Test on 10k docs before full run
+3. **Use GPU classifiers sparingly** - Expensive but effective
+4. **Chain filters efficiently** - Order by cost (cheap → expensive)
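
The "cheap filters first" advice in `filtering.md` can be illustrated without NeMo Curator. The sketch below is not part of this commit and does not use the library's API. The predicates are simplified stand-ins, `CountingScorer` is a hypothetical placeholder for an expensive classifier, and the repeated-line definition (share of lines that duplicate an earlier line) is an assumption.

```python
def word_count_ok(text, min_words=50, max_words=100000):
    """Cheap check: word count within [min_words, max_words]."""
    return min_words <= len(text.split()) <= max_words


def repeated_lines_ok(text, max_repeated_line_fraction=0.3):
    """Cheap check: share of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    repeated = len(lines) - len(set(lines))
    return repeated / len(lines) <= max_repeated_line_fraction


class CountingScorer:
    """Stand-in for an expensive classifier; records how often it runs."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        words = text.split()
        # Toy score (unique-word ratio), only here to have something to threshold.
        score = len(set(words)) / len(words) if words else 0.0
        return score > self.threshold


def apply_filters(docs, filters):
    """Keep docs that pass every filter, evaluated in the given order.

    all() stops at the first failing filter, so filters listed later
    (the costly ones) only run on documents that survived the cheap ones.
    """
    return [doc for doc in docs if all(f(doc) for f in filters)]
```

Passing `[word_count_ok, repeated_lines_ok, scorer]` in that order means the scorer's `calls` counter only grows for documents that passed both cheap checks. Running that on a sample is a quick way to confirm the ordering pays off before a full run.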