Add skills tools and enhance model integration

- Introduced new skills tools: `skills_categories`, `skills_list`, and `skill_view` in `model_tools.py`, allowing for better organization and access to skill-related functionalities.
- Updated `toolsets.py` to include a new `skills` toolset, providing a dedicated space for skill tools.
- Enhanced `batch_runner.py` to recognize and validate skills tools during batch processing.
- Added comprehensive tool definitions for skills tools, ensuring compatibility with OpenAI's expected format.
- Created new shell script `test_skills_kimi.sh` for testing skills tool functionality with Kimi K2.5.
- Added example skill files demonstrating the structure and usage of skills within the Hermes-Agent framework, including `SKILL.md` for example and audiocraft skills.
- Improved documentation for skills tools and their integration into the existing tool framework, ensuring clarity for future development and usage.
2026-04-26 01:01:40 +00:00 · 2026-01-30 07:39:55 +00:00 · 2026-01-30 07:39:55 +00:00 · f172f7d4aa
commit f172f7d4aa
parent 8e8b6be690
189 changed files with 116214 additions and 2 deletions
--- a/skills/mlops/nemo-curator/references/deduplication.md
+++ b/skills/mlops/nemo-curator/references/deduplication.md
@@ -0,0 +1,87 @@
+# Deduplication Guide
+
+Complete guide to exact, fuzzy, and semantic deduplication.
+
+## Exact deduplication
+
+Remove documents with identical content.
+
+```python
+from nemo_curator.modules import ExactDuplicates
+
+# Exact deduplication
+exact_dedup = ExactDuplicates(
+    id_field="id",
+    text_field="text",
+    hash_method="md5"  # or "sha256"
+)
+
+deduped = exact_dedup(dataset)
+```
+
+**Performance**: ~16× faster on GPU vs CPU
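
As a rough illustration of the idea behind exact deduplication (not NeMo Curator's implementation), the sketch below hashes each document's text and keeps the first document seen for each digest. The helper name `exact_dedup` and the list-of-dicts input are assumptions made for this example; only the `id`/`text` field names and the `md5`/`sha256` choices come from the snippet above.

```python
import hashlib


def exact_dedup(docs, text_field="text", hash_method="md5"):
    """Keep the first document seen for each distinct content hash.

    `docs` is a plain list of dicts here (an assumption for this sketch),
    not a NeMo Curator dataset.
    """
    seen = set()
    kept = []
    for doc in docs:
        # hashlib.new accepts algorithm names such as "md5" or "sha256".
        digest = hashlib.new(hash_method, doc[text_field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


docs = [
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "hello world"},   # byte-identical to id 1
    {"id": 3, "text": "hello world!"},  # differs by one character, so it is kept
]
unique_docs = exact_dedup(docs)
```

Because only byte-identical text collides, a single changed character is enough for a document to survive this stage, which is why the fuzzy and semantic methods exist.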
+
+## Fuzzy deduplication
+
+Remove near-duplicate documents using MinHash + LSH.
+
+```python
+from nemo_curator.modules import FuzzyDuplicates
+
+fuzzy_dedup = FuzzyDuplicates(
+    id_field="id",
+    text_field="text",
+    num_hashes=260,        # MinHash permutations (more = more accurate, but slower)
+    num_buckets=20,        # LSH buckets (with num_hashes fixed, more = higher recall, more candidate pairs)
+    hash_method="md5",
+    jaccard_threshold=0.8  # Similarity threshold
+)
+
+deduped = fuzzy_dedup(dataset)
+```
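
To make "MinHash + LSH" concrete, here is a minimal pure-Python sketch of the technique. It is not how NeMo Curator implements it. It assumes character 5-gram shingles, an MD5-based hash family, and the standard banding scheme in which the signature is split into `num_buckets` bands of `num_hashes // num_buckets` values each. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, n=5):
    """Character n-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def seeded_hash(seed, token):
    """One member of a hash family, selected by `seed`."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes):
    """For each hash function, keep the minimum hash over all shingles."""
    return tuple(min(seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def fuzzy_duplicate_pairs(docs, num_hashes=260, num_buckets=20, jaccard_threshold=0.8):
    """Return sorted (id, id) pairs whose shingle sets meet the threshold."""
    rows = num_hashes // num_buckets
    token_sets = {doc_id: shingles(text) for doc_id, text in docs.items()}

    # LSH step: documents sharing an identical band land in the same bucket.
    buckets = defaultdict(set)
    for doc_id, tokens in token_sets.items():
        signature = minhash_signature(tokens, num_hashes)
        for band in range(num_buckets):
            key = (band, signature[band * rows:(band + 1) * rows])
            buckets[key].add(doc_id)

    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))

    # Verification step: discard candidates below the exact Jaccard threshold.
    return sorted(
        pair for pair in candidates
        if jaccard(token_sets[pair[0]], token_sets[pair[1]]) >= jaccard_threshold
    )


docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumps over the lazy dog.",
    "c": "Completely unrelated sentence about databases",
}
pairs = fuzzy_duplicate_pairs(docs)
```

Only documents that collide in at least one band are compared exactly, which is what keeps the number of pairwise comparisons manageable on large corpora.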
+
+**Parameters**:
+- `num_hashes`: 128-512 (default 260)
+- `num_buckets`: 10-50 (default 20)
+- `jaccard_threshold`: 0.7-0.9 (default 0.8)
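
The interaction between these parameters can be estimated with the standard LSH banding formula. The sketch assumes the MinHash signature is split into `num_buckets` bands of `num_hashes // num_buckets` values each, which is an assumption about the scheme rather than something this guide states. Under that assumption, two documents with Jaccard similarity `s` become a candidate pair with probability `1 - (1 - s**rows)**num_buckets`. The function name is hypothetical.

```python
def candidate_probability(similarity, num_hashes=260, num_buckets=20):
    """Probability that two documents share at least one identical LSH band."""
    rows = num_hashes // num_buckets  # hash values per band
    return 1.0 - (1.0 - similarity ** rows) ** num_buckets


# Compare bucket counts at a fixed signature length and similarity.
for buckets in (10, 20, 52):
    p = candidate_probability(0.8, num_hashes=260, num_buckets=buckets)
    print(f"num_buckets={buckets:>2}: P(candidate | s=0.8) = {p:.3f}")
```

With `num_hashes` held fixed, raising `num_buckets` shortens each band, so more pairs collide: recall goes up, and so does the number of candidates that need verification.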
+
+**Performance**: 16× faster on 8TB dataset (120h → 7.5h)
+
+## Semantic deduplication
+
+Remove semantically similar documents using embeddings.
+
+```python
+from nemo_curator.modules import SemanticDuplicates
+
+semantic_dedup = SemanticDuplicates(
+    id_field="id",
+    text_field="text",
+    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
+    embedding_batch_size=256,
+    threshold=0.85,  # Cosine similarity threshold
+    device="cuda"
+)
+
+deduped = semantic_dedup(dataset)
+```
+
+**Models**:
+- `all-MiniLM-L6-v2`: Fast, 384 dims
+- `all-mpnet-base-v2`: Better quality, 768 dims
+- Custom models supported
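
The role of the cosine `threshold` can be shown with a naive greedy pass over precomputed embeddings. This is a sketch only: the vectors below are hand-written toy values rather than output from any of the models listed, the function names are hypothetical, and a production system would likely cluster embeddings first instead of comparing every document against every kept one.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_dedup(embeddings, threshold=0.85):
    """Keep a document only if it is below `threshold` against every kept one."""
    kept = []
    for doc_id, vector in embeddings.items():
        if all(cosine_similarity(vector, embeddings[k]) < threshold for k in kept):
            kept.append(doc_id)
    return kept


toy_embeddings = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.9, 0.1, 0.0],  # nearly parallel to doc1, so it is dropped
    "doc3": [0.0, 1.0, 0.0],  # orthogonal to doc1, so it is kept
}
kept_ids = semantic_dedup(toy_embeddings)
```

Lowering `threshold` removes looser paraphrases at the cost of more false positives. The quadratic comparison cost of this naive version is one reason semantic deduplication is the slowest of the three methods.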
+
+## Comparison
+
+| Method | Speed | Recall | Use Case |
+|--------|-------|--------|----------|
+| Exact | Fastest | 100% | Exact matches only |
+| Fuzzy | Fast | ~95% | Near-duplicates (recommended) |
+| Semantic | Slow | ~90% | Paraphrases, rewrites |
+
+## Best practices
+
+1. **Start with exact dedup** - Remove obvious duplicates
+2. **Use fuzzy for large datasets** - Best speed/quality trade-off
+3. **Semantic for high-value data** - Expensive but thorough
+4. **Use GPU acceleration** - 10-16× speedup over CPU
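
The ordering advice above (cheapest stage first) can be sketched as a small staged pipeline that records how many documents each stage removes. Everything here is hypothetical scaffolding: `run_stages` and both stage functions are names invented for this example, and `normalized_stage` is a deliberately simple stand-in for a real fuzzy stage that only ignores case and whitespace.

```python
import hashlib


def dedup_by_key(docs, key_fn):
    """Keep the first document for each distinct key derived from its text."""
    seen, kept = set(), []
    for doc in docs:
        key = key_fn(doc["text"])
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def exact_stage(docs):
    # Cheapest stage: byte-identical text only.
    return dedup_by_key(docs, lambda t: hashlib.md5(t.encode("utf-8")).hexdigest())


def normalized_stage(docs):
    # Stand-in for a fuzzy stage: ignores case and whitespace differences only.
    return dedup_by_key(docs, lambda t: " ".join(t.lower().split()))


def run_stages(docs, stages):
    """Apply stages in order and report (name, count_before, count_after)."""
    report = []
    for name, stage in stages:
        before = len(docs)
        docs = stage(docs)
        report.append((name, before, len(docs)))
    return docs, report


docs = [
    {"id": 1, "text": "Hello world"},
    {"id": 2, "text": "Hello world"},
    {"id": 3, "text": "hello   WORLD"},
    {"id": 4, "text": "Something else"},
]
final_docs, report = run_stages(
    docs, [("exact", exact_stage), ("normalized", normalized_stage)]
)
```

Running the cheap stage first shrinks the input to the more expensive stages, and the per-stage report makes it easy to see whether a later stage is still removing enough to justify its cost.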