# Deduplication Guide

Complete guide to exact, fuzzy, and semantic deduplication.

## Exact deduplication

Remove documents with identical content.

```python
from nemo_curator.modules import ExactDuplicates

# Exact deduplication
exact_dedup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5"  # or "sha256"
)

deduped = exact_dedup(dataset)
```

**Performance**: ~16× faster on GPU vs CPU

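To make the idea concrete, here is a minimal sketch of hash-based exact deduplication in plain Python. It does not use NeMo Curator; the function name `exact_dedup_by_hash` and the list-of-dicts input are assumptions chosen for illustration, and it simply keeps the first document seen for each distinct MD5 digest.

```python
import hashlib

def exact_dedup_by_hash(docs, text_field="text"):
    """Keep the first document seen for each distinct MD5 digest of its text.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc[text_field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = [
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "hello world"},   # byte-for-byte copy of id 1
    {"id": 3, "text": "Hello world"},   # differs in case, so it is not an exact duplicate
]
kept_ids = [d["id"] for d in exact_dedup_by_hash(docs)]
```

Because the match is on the raw text, any difference in case, whitespace, or punctuation yields a different digest. That is why near-duplicates need the fuzzy method.
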
## Fuzzy deduplication

Remove near-duplicate documents using MinHash + LSH.

```python
from nemo_curator.modules import FuzzyDuplicates

fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,        # MinHash permutations (more = more accurate, slower)
    num_buckets=20,        # LSH buckets (more = higher recall, more candidate pairs to check)
    hash_method="md5",
    jaccard_threshold=0.8  # Jaccard similarity threshold
)

deduped = fuzzy_dedup(dataset)
```

**Parameters**:
- `num_hashes`: 128-512 (default 260)
- `num_buckets`: 10-50 (default 20)
- `jaccard_threshold`: 0.7-0.9 (default 0.8)

**Performance**: 16× faster on GPU than on CPU for an 8 TB dataset (120 h → 7.5 h)

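The interaction between `num_hashes`, `num_buckets`, and the similarity threshold is easier to see in a small standalone sketch. The code below is plain Python, not NeMo Curator. The helper names, the word-trigram shingling, and the even split of hashes across buckets are all assumptions made for illustration. It builds MinHash signatures, estimates Jaccard similarity from them, and applies the standard LSH banding formula `1 - (1 - s^r)^b` for the chance that two documents become a candidate pair.

```python
import hashlib

def word_shingles(text, n=3):
    """Split text into overlapping n-word shingles (an assumed tokenization)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(shingle_set, num_hashes=260):
    """One minimum per seeded hash function; seeds stand in for permutations."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def candidate_probability(similarity, num_hashes=260, num_buckets=20):
    """Chance that two documents collide in at least one LSH bucket."""
    rows = num_hashes // num_buckets  # hashes per bucket, assuming an even split
    return 1.0 - (1.0 - similarity ** rows) ** num_buckets

a = word_shingles("the cat sat on the mat and looked out of the window at the rain")
b = word_shingles("the cat sat on the mat and looked out of the window at the snow")
true_jaccard = len(a & b) / len(a | b)
estimate = estimated_jaccard(minhash_signature(a), minhash_signature(b))

for s in (0.6, 0.7, 0.8, 0.9):
    print(f"similarity {s:.1f}: candidate probability {candidate_probability(s):.3f}")
```

With the number of hashes held fixed, raising `num_buckets` leaves fewer hashes per bucket, so more pairs collide: recall goes up, and so does the number of candidates that must be verified against `jaccard_threshold`.
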
## Semantic deduplication

Remove semantically similar documents using embeddings.

```python
from nemo_curator.modules import SemanticDuplicates

semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_batch_size=256,
    threshold=0.85,  # Cosine similarity threshold
    device="cuda"
)

deduped = semantic_dedup(dataset)
```

**Models**:
- `all-MiniLM-L6-v2`: Fast, 384 dims
- `all-mpnet-base-v2`: Better quality, 768 dims
- Custom models supported

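As a rough illustration of what the cosine similarity threshold does, the sketch below greedily drops any embedding that is too close to one already kept. It is plain Python with hand-written toy vectors in place of a real embedding model. The function names are hypothetical, and the pairwise loop is an assumption made for clarity rather than a description of how NeMo Curator implements it.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_dedup_indices(embeddings, threshold=0.85):
    """Return indices to keep, dropping vectors too similar to an earlier kept one."""
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine_similarity(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy 3-dimensional "embeddings"; a real model would produce 384 or 768 dimensions.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],  # nearly parallel to the first vector
    [0.0, 1.0, 0.0],  # orthogonal to the first vector
]
kept = semantic_dedup_indices(embeddings)
```

This pairwise comparison is quadratic in the number of documents, so it only suits small samples. At scale, embeddings are usually grouped first (for example by clustering) so that comparisons happen within each group.
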
## Comparison

| Method | Speed | Recall | Use Case |
|--------|-------|--------|----------|
| Exact | Fastest | 100% | Exact matches only |
| Fuzzy | Fast | ~95% | Near-duplicates (recommended) |
| Semantic | Slow | ~90% | Paraphrases, rewrites |

## Best practices

1. **Start with exact dedup** - Remove obvious duplicates
2. **Use fuzzy for large datasets** - Best speed/quality trade-off
3. **Semantic for high-value data** - Expensive but thorough
4. **Use GPU acceleration** - 10-16× speedup over CPU
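
The ordering in the first two practices can be sketched as a two-stage pass: a cheap exact-hash filter first, then a more expensive near-duplicate check on what remains. The code below is a toy in plain Python under stated assumptions. `staged_dedup` is a hypothetical helper, the fuzzy stage uses brute-force Jaccard similarity on word bigrams rather than MinHash + LSH, and none of it calls NeMo Curator.

```python
import hashlib

def word_bigrams(text):
    """Lowercased overlapping word pairs (an assumed shingling scheme)."""
    words = text.lower().split()
    return {" ".join(words[i:i + 2]) for i in range(max(1, len(words) - 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def staged_dedup(texts, jaccard_threshold=0.8):
    # Stage 1: exact dedup by MD5 digest (cheap, removes identical copies).
    seen, unique = set(), []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    # Stage 2: brute-force near-duplicate check on the smaller remaining set.
    kept, kept_shingles = [], []
    for text in unique:
        shingles = word_bigrams(text)
        if all(jaccard(shingles, other) < jaccard_threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(shingles)
    return kept

texts = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "the quick brown fox jumps over the lazy dog near the river bank yesterday",
    "completely different text about data curation",
]
result = staged_dedup(texts)
```

Running the exact stage first shrinks the input to the costlier stage, which is the reason for the ordering. The same reasoning applies to putting semantic deduplication last.
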