mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-26 01:01:40 +00:00
- Restored 21 skills removed in commits 757d012 and 740dd92: accelerate, audiocraft, code-review, faiss, flash-attention, gguf, grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion, tensorrt-llm, torchtitan, trl-fine-tuning, whisper
- Rewrote sync_skills() with proper update semantics:
  * New skills (not in manifest): copied to user dir
  * Existing skills (in manifest + on disk): updated via hash comparison
  * User-deleted skills (in manifest, not on disk): respected, not re-added
  * Stale manifest entries (removed from bundled): cleaned from manifest
- Added sync_skills() to CLI startup (cmd_chat) and gateway startup (start_gateway); previously it only ran during 'hermes update'
- Updated cmd_update output to show new/updated/cleaned counts
- Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh install, user deletion respect, update detection, stale cleanup, and name collision handling

75 bundled skills total. 2002 tests pass.
87 lines
2.1 KiB
Markdown
# Deduplication Guide

Complete guide to exact, fuzzy, and semantic deduplication.

## Exact deduplication

Remove documents with identical content.

```python
from nemo_curator.modules import ExactDuplicates

# Exact deduplication: documents are matched on a hash of their text
exact_dedup = ExactDuplicates(
    id_field="id",      # column holding the unique document ID
    text_field="text",  # column holding the document text
    hash_method="md5"   # or "sha256"
)

# `dataset` is the previously loaded dataset to deduplicate
deduped = exact_dedup(dataset)
```

**Performance**: ~16× faster on GPU vs CPU
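
For intuition, the hashing idea behind exact deduplication can be sketched in plain Python. This is an illustrative sketch using only the standard library, not NeMo Curator's implementation; the function name `drop_exact_duplicates` and the sample documents are hypothetical.

```python
import hashlib


def drop_exact_duplicates(docs, text_field="text", hash_method="md5"):
    """Keep the first document seen for each distinct content hash."""
    seen = set()
    kept = []
    for doc in docs:
        # Hash the raw text; byte-identical texts produce identical digests.
        digest = hashlib.new(hash_method, doc[text_field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


docs = [
    {"id": 1, "text": "the cat sat on the mat"},
    {"id": 2, "text": "the cat sat on the mat"},  # byte-identical to id 1
    {"id": 3, "text": "The cat sat on the mat"},  # differs only in case
]

unique_docs = drop_exact_duplicates(docs)
print([d["id"] for d in unique_docs])  # [1, 3]
```

Document 3 survives because hashing is sensitive to every byte. Catching such near-matches is the job of fuzzy deduplication.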

## Fuzzy deduplication

Remove near-duplicate documents using MinHash + LSH.

```python
from nemo_curator.modules import FuzzyDuplicates

fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,        # MinHash permutations (more = more accurate, slower)
    num_buckets=20,        # LSH buckets (more = higher recall, more compute)
    hash_method="md5",
    jaccard_threshold=0.8  # Similarity threshold
)

# `dataset` is the previously loaded dataset to deduplicate
deduped = fuzzy_dedup(dataset)
```
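
To make the MinHash + LSH idea concrete, here is a minimal pure-Python sketch. It is not NeMo Curator's implementation. The helper names (`shingles`, `minhash_signature`, `candidate_pairs`, `find_near_duplicates`), the 5-character shingles, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Character k-grams of a lowercased, whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_hashes=260):
    """For each salted hash function, keep the minimum digest over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
            for item in items
        ))
    return signature


def candidate_pairs(signatures, num_buckets=20):
    """Split signatures into bands; ids that agree on any band become candidates."""
    pairs = set()
    for band in range(num_buckets):
        table = defaultdict(list)
        for doc_id, sig in signatures.items():
            rows = len(sig) // num_buckets
            table[tuple(sig[band * rows:(band + 1) * rows])].append(doc_id)
        for ids in table.values():
            pairs.update(combinations(ids, 2))
    return pairs


def jaccard(a, b):
    return len(a & b) / len(a | b)


def find_near_duplicates(docs, num_hashes=260, num_buckets=20, jaccard_threshold=0.8):
    """Return candidate pairs whose true Jaccard similarity meets the threshold."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    signatures = {doc_id: minhash_signature(s, num_hashes) for doc_id, s in sets.items()}
    return {
        (a, b)
        for a, b in candidate_pairs(signatures, num_buckets)
        if jaccard(sets[a], sets[b]) >= jaccard_threshold
    }


docs = {
    "a": "The quick brown fox jumps over the lazy dog near the river bank.",
    "b": "The quick brown fox jumps over the lazy dog near the river bank!",
    "c": "Completely unrelated text about GPU kernels and memory bandwidth.",
}
duplicates = find_near_duplicates(docs)
print(duplicates)
```

The banding step only proposes candidates. The final Jaccard check on the shingle sets decides which pairs count as near-duplicates, so the comparison cost is paid only for pairs that collided in at least one band.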

**Parameters**:
- `num_hashes`: 128-512 (default 260)
- `num_buckets`: 10-50 (default 20)
- `jaccard_threshold`: 0.7-0.9 (default 0.8)
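
How these parameters interact can be estimated with the standard LSH banding analysis. The sketch below assumes the `num_hashes` values are split evenly across `num_buckets` bands (13 hashes per band with the defaults), which is an assumption about the guide's parameters rather than something it states. Under that assumption, two documents with Jaccard similarity `s` become candidates with probability `1 - (1 - s**r)**b`, where `r` is hashes per band and `b` is the number of bands. The function names are hypothetical.

```python
def candidate_probability(similarity, num_hashes=260, num_buckets=20):
    """Probability that two documents collide in at least one LSH band."""
    rows = num_hashes // num_buckets  # hashes per band (assumed even split)
    return 1.0 - (1.0 - similarity ** rows) ** num_buckets


def approx_threshold(num_hashes=260, num_buckets=20):
    """Similarity at which the S-curve rises most steeply: (1/b) ** (1/r)."""
    rows = num_hashes // num_buckets
    return (1.0 / num_buckets) ** (1.0 / rows)


for s in (0.5, 0.7, 0.8, 0.9):
    print(f"similarity {s:.1f} -> candidate probability {candidate_probability(s):.3f}")

print(f"approximate threshold: {approx_threshold():.3f}")
```

With 260 hashes and 20 buckets, the approximate threshold lands close to the default `jaccard_threshold` of 0.8. Raising `num_buckets` while holding `num_hashes` fixed shortens each band and shifts the curve toward lower similarities.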

**Performance**: 16× faster on GPU than on CPU for an 8 TB dataset (120 h → 7.5 h)

## Semantic deduplication

Remove semantically similar documents using embeddings.

```python
from nemo_curator.modules import SemanticDuplicates

semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_batch_size=256,  # documents embedded per batch
    threshold=0.85,            # Cosine similarity threshold
    device="cuda"
)

# `dataset` is the previously loaded dataset to deduplicate
deduped = semantic_dedup(dataset)
```

**Models**:
- `all-MiniLM-L6-v2`: Fast, 384 dims
- `all-mpnet-base-v2`: Better quality, 768 dims
- Custom models supported
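
The role of the cosine similarity threshold can be illustrated with a small, dependency-free sketch. The 3-dimensional vectors below are hand-written stand-ins for real model embeddings (which would have 384 or 768 dimensions), and the greedy all-pairs loop and the name `drop_semantic_duplicates` are assumptions for illustration, not how the library performs the search.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def drop_semantic_duplicates(embeddings, threshold=0.85):
    """Greedy pass: keep a document only if no kept document is too similar."""
    kept = []
    for doc_id, vec in embeddings.items():
        if all(cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(doc_id)
    return kept


# Toy embeddings: "a" and "b" point in nearly the same direction.
embeddings = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}

print(drop_semantic_duplicates(embeddings))  # ['a', 'c']
```

Comparing every document against every kept document scales quadratically, which is one reason semantic deduplication is the slowest of the three methods. Lowering `threshold` removes more aggressively and raises the risk of discarding documents that are merely related.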

## Comparison

| Method | Speed | Recall | Use Case |
|--------|-------|--------|----------|
| Exact | Fastest | 100% | Exact matches only |
| Fuzzy | Fast | ~95% | Near-duplicates (recommended) |
| Semantic | Slow | ~90% | Paraphrases, rewrites |

## Best practices

1. **Start with exact dedup** - Remove obvious duplicates
2. **Use fuzzy for large datasets** - Best speed/quality trade-off
3. **Semantic for high-value data** - Expensive but thorough
4. **GPU acceleration required** - 10-16× speedup
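
The ordering in points 1 and 2 (cheap exact pass first, then a near-duplicate pass on what remains) can be sketched as a staged pipeline. This is a toy illustration with hypothetical helper names. The second stage uses a direct word-set Jaccard comparison as a simple stand-in for MinHash + LSH, which only works at this tiny scale.

```python
import hashlib


def exact_stage(docs):
    """Stage 1: drop byte-identical texts using a content hash."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def near_duplicate_stage(docs, threshold=0.8):
    """Stage 2: drop documents whose word-set Jaccard with a kept doc >= threshold."""
    kept, kept_words = [], []
    for doc in docs:
        words = set(doc["text"].lower().split())
        if all(len(words & w) / len(words | w) < threshold for w in kept_words):
            kept.append(doc)
            kept_words.append(words)
    return kept


docs = [
    {"id": 1, "text": "GPU deduplication removes repeated documents from large text corpora quickly"},
    {"id": 2, "text": "GPU deduplication removes repeated documents from large text corpora quickly"},
    {"id": 3, "text": "GPU deduplication removes repeated documents from large text corpora quickly today"},
    {"id": 4, "text": "Semantic methods compare embeddings instead of surface tokens"},
]

after_exact = exact_stage(docs)
after_near = near_duplicate_stage(after_exact)
print(len(docs), "->", len(after_exact), "->", len(after_near))  # 4 -> 3 -> 2
```

Running the exact pass first means the more expensive stage never sees document 2, so its pairwise comparisons are spent only on documents that a hash could not eliminate.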