mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
Deduplication Guide
Complete guide to exact, fuzzy, and semantic deduplication.
Exact deduplication
Remove documents with identical content.
from nemo_curator.modules import ExactDuplicates
# Exact deduplication
exact_dedup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5",  # or "sha256"
)
deduped = exact_dedup(dataset)
Performance: ~16× faster on GPU vs CPU
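To make the idea concrete, here is a minimal CPU-only sketch of hash-based exact deduplication in plain Python. It is not how NeMo Curator implements it; the function name `exact_dedup_sketch` and the list-of-dicts input are assumptions made for illustration. It hashes each document's text and keeps the first record seen for each digest.

```python
import hashlib


def exact_dedup_sketch(records, text_field="text", hash_method="md5"):
    """Keep the first record for each distinct text digest (hypothetical helper)."""
    seen = set()
    kept = []
    for record in records:
        payload = record[text_field].encode("utf-8")
        digest = hashlib.new(hash_method, payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(record)
    return kept


docs = [
    {"id": 1, "text": "the quick brown fox"},
    {"id": 2, "text": "the quick brown fox"},  # byte-for-byte copy of id 1
    {"id": 3, "text": "The quick brown fox"},  # differs only in case
]
print([d["id"] for d in exact_dedup_sketch(docs)])  # [1, 3]
```

Record 3 survives because exact deduplication compares raw content: any difference in case, whitespace, or punctuation produces a different hash.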
Fuzzy deduplication
Remove near-duplicate documents using MinHash + LSH.
from nemo_curator.modules import FuzzyDuplicates
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,         # MinHash permutations (more = more accurate similarity estimate, slower)
    num_buckets=20,         # LSH bands (more = higher recall, more candidate pairs to check)
    hash_method="md5",
    jaccard_threshold=0.8,  # Jaccard similarity threshold
)
deduped = fuzzy_dedup(dataset)
Parameters:
- num_hashes: 128-512 (default 260)
- num_buckets: 10-50 (default 20)
- jaccard_threshold: 0.7-0.9 (default 0.8)
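How these parameters interact can be sketched with the standard LSH banding model. This assumes the `num_hashes` signature values are split evenly into `num_buckets` bands, and that two documents become a candidate pair when they agree on every value in at least one band. Whether NeMo Curator follows this model exactly is an assumption; the helper name `candidate_probability` is hypothetical.

```python
def candidate_probability(jaccard, num_hashes=260, num_buckets=20):
    """P(two docs share at least one LSH band) under the textbook banding model."""
    rows_per_bucket = num_hashes // num_buckets
    # One band matches only if all of its rows match: jaccard ** rows_per_bucket.
    # A pair is a candidate if at least one of the bands matches.
    return 1.0 - (1.0 - jaccard ** rows_per_bucket) ** num_buckets


for s in (0.5, 0.7, 0.8, 0.9):
    print(f"jaccard={s:.1f}  P(candidate)={candidate_probability(s):.3f}")
```

Under this model, splitting the same signature into more bands raises the chance that a similar pair is caught, at the cost of more candidate pairs to verify.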
Performance: 16× faster on GPU than on CPU for an 8 TB dataset (120 h → 7.5 h)
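The MinHash + LSH technique named in this section can be illustrated with a small pure-Python sketch. It is an illustration under stated assumptions, not NeMo Curator's GPU implementation. Character 5-gram shingles, seeded BLAKE2b hashes standing in for random permutations, and the final exact Jaccard check on candidate pairs are all choices made here, and every function name is hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Character k-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """One minimum hash value per seed; each seed plays the role of a permutation."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def candidate_pairs(signatures, num_buckets=16):
    """Pairs of ids whose signatures agree on every row of at least one band."""
    rows = len(next(iter(signatures.values()))) // num_buckets
    pairs = set()
    for band in range(num_buckets):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def fuzzy_duplicate_pairs(docs, num_hashes=64, num_buckets=16, jaccard_threshold=0.8):
    """LSH proposes candidates; an exact Jaccard check confirms them."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    sigs = {doc_id: minhash_signature(s, num_hashes) for doc_id, s in sets.items()}
    confirmed = set()
    for a, b in candidate_pairs(sigs, num_buckets):
        if len(sets[a] & sets[b]) / len(sets[a] | sets[b]) >= jaccard_threshold:
            confirmed.add((a, b))
    return confirmed


docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "the quick  brown fox jumps over the lazy dog.",  # case and spacing differ
    "c": "Completely unrelated sentence about GPUs.",
}
print(fuzzy_duplicate_pairs(docs))  # {('a', 'b')}
```

Documents "a" and "b" normalize to the same shingle set, so they match in every band. Document "c" fails the Jaccard check even if it were to collide in some band.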
Semantic deduplication
Remove semantically similar documents using embeddings.
from nemo_curator.modules import SemanticDuplicates
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_batch_size=256,
    threshold=0.85,  # Cosine similarity threshold
    device="cuda",
)
deduped = semantic_dedup(dataset)
Models:
- all-MiniLM-L6-v2: Fast, 384 dims
- all-mpnet-base-v2: Better quality, 768 dims
- Custom models supported
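The role of the cosine similarity threshold can be shown with a small sketch that assumes embeddings have already been computed. The 3-dimensional toy vectors stand in for real 384- or 768-dimensional model output, and the greedy keep-first strategy and the name `semantic_dedup_sketch` are assumptions for illustration. Production systems typically cluster embeddings first to avoid comparing every pair.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_dedup_sketch(embeddings, threshold=0.85):
    """Greedily keep a document only if it is below the threshold against all kept ones."""
    kept = []
    for doc_id, vector in embeddings.items():
        if all(cosine_similarity(vector, embeddings[k]) < threshold for k in kept):
            kept.append(doc_id)
    return kept


toy_embeddings = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.9, 0.1, 0.0],  # points almost the same way as doc1
    "doc3": [0.0, 1.0, 0.0],  # orthogonal to doc1
}
print(semantic_dedup_sketch(toy_embeddings))  # ['doc1', 'doc3']
```

Raising the threshold toward 1.0 removes only near-identical paraphrases. Lowering it removes more documents but risks dropping ones that are merely on the same topic.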
Comparison
| Method | Speed | Recall | Use Case |
|---|---|---|---|
| Exact | Fastest | 100% | Exact matches only |
| Fuzzy | Fast | ~95% | Near-duplicates (recommended) |
| Semantic | Slow | ~90% | Paraphrases, rewrites |
Best practices
- Start with exact dedup - Remove obvious duplicates
- Use fuzzy for large datasets - Best speed/quality trade-off
- Semantic for high-value data - Expensive but thorough
- GPU acceleration required - 10-16× speedup
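The ordering suggested above, cheapest method first, can be sketched as a small cascade that records how many documents survive each stage. Everything here is hypothetical scaffolding: `run_cascade` is not a NeMo Curator function, and the second stage is a simple case-and-whitespace normalization standing in for a real fuzzy pass.

```python
def run_cascade(docs, stages):
    """Apply (name, function) stages in order; report (name, before, after) counts."""
    report = []
    for name, stage in stages:
        before = len(docs)
        docs = stage(docs)
        report.append((name, before, len(docs)))
    return docs, report


def dedup_by_key(key):
    """Build a stage that keeps the first document for each distinct key."""
    def stage(docs):
        seen, kept = set(), []
        for doc in docs:
            k = key(doc)
            if k not in seen:
                seen.add(k)
                kept.append(doc)
        return kept
    return stage


stages = [
    ("exact", dedup_by_key(lambda d: d)),
    # Stand-in for a fuzzy stage: ignores case and whitespace differences only.
    ("normalized", dedup_by_key(lambda d: " ".join(d.lower().split()))),
]
texts = ["Hello world", "Hello world", "hello  world", "Goodbye"]
remaining, report = run_cascade(texts, stages)
print(report)  # [('exact', 4, 3), ('normalized', 3, 2)]
```

Running the cheap stage first shrinks the input that the more expensive stages have to process.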