mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-26 01:01:40 +00:00

teknium1 ab0f4126cf fix: restore all removed bundled skills + fix skills sync system

- Restored 21 skills removed in commits 757d012 and 740dd92:
  accelerate, audiocraft, code-review, faiss, flash-attention, gguf,
  grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft,
  pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion,
  tensorrt-llm, torchtitan, trl-fine-tuning, whisper

- Rewrote sync_skills() with proper update semantics:
  * New skills (not in manifest): copied to user dir
  * Existing skills (in manifest + on disk): updated via hash comparison
  * User-deleted skills (in manifest, not on disk): respected, not re-added
  * Stale manifest entries (removed from bundled): cleaned from manifest

- Added sync_skills() to CLI startup (cmd_chat) and gateway startup
  (start_gateway) — previously only ran during 'hermes update'

- Updated cmd_update output to show new/updated/cleaned counts

- Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh
  install, user deletion respect, update detection, stale cleanup, and
  name collision handling

75 bundled skills total. 2002 tests pass.

2026-03-06 15:57:30 -08:00

2.1 KiB


Deduplication Guide

Complete guide to exact, fuzzy, and semantic deduplication.

Exact deduplication

Remove documents with identical content.

from nemo_curator.modules import ExactDuplicates

# Exact deduplication
exact_dedup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5"  # MD5 fingerprint of each document's text; check your installed version before using another algorithm
)

deduped = exact_dedup(dataset)

Performance: ~16× faster on GPU vs CPU
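
To make the idea concrete, the following is a minimal pure-Python sketch of hash-based exact deduplication. It is not NeMo Curator's implementation: the function name `exact_dedup_sketch` and the list-of-dicts input are assumptions chosen for illustration, and it runs on a single CPU core.

```python
import hashlib


def exact_dedup_sketch(docs, text_field="text"):
    """Keep the first document seen for each distinct MD5 digest of its text.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc[text_field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


docs = [
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "hello world"},  # byte-identical to id 1
    {"id": 3, "text": "Hello world"},  # differs by case, so it is not an exact match
]
kept_ids = [d["id"] for d in exact_dedup_sketch(docs)]
```

Because the comparison is on raw bytes, any difference in case, whitespace, or punctuation produces a different digest. That is why exact deduplication only catches identical copies.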

Fuzzy deduplication

Remove near-duplicate documents using MinHash + LSH.

from nemo_curator.modules import FuzzyDuplicates

fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,        # MinHash permutations (more = more accurate, but slower)
    num_buckets=20,        # LSH buckets (for a fixed num_hashes, more = higher recall but more candidate pairs)
    hash_method="md5",
    jaccard_threshold=0.8  # Similarity threshold
)

deduped = fuzzy_dedup(dataset)
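
As a rough illustration of what the MinHash step computes, the sketch below builds signatures in pure Python and uses them to estimate Jaccard similarity. The helper names (`shingles`, `minhash_signature`, `estimated_jaccard`), the character 5-gram shingling, and the use of salted BLAKE2b as the hash family are all assumptions made for this example, not the library's internals.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of character n-grams after whitespace/case normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(items, num_hashes=260):
    """For each salted hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def true_jaccard(a, b):
    return len(a & b) / len(a | b)


set_a = shingles("The quick brown fox jumps over the lazy dog near the river bank")
set_b = shingles("The quick brown fox jumps over the lazy cat near the river bank")
sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)

estimate = estimated_jaccard(sig_a, sig_b)
exact = true_jaccard(set_a, set_b)
```

The estimate converges on the exact Jaccard value as `num_hashes` grows, which is the accuracy/cost trade-off behind that parameter.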

Parameters:

num_hashes: 128-512 (default 260)
num_buckets: 10-50 (default 20)
jaccard_threshold: 0.7-0.9 (default 0.8)
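
How these parameters interact can be sketched with the standard LSH banding formula. Assuming the signature is split evenly so that each bucket holds r = num_hashes / num_buckets hashes, two documents with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b, where b is the number of buckets. The function name below is hypothetical, and the even split is an assumption about the configuration.

```python
def candidate_probability(similarity, num_hashes=260, num_buckets=20):
    """Probability that two documents share at least one LSH bucket.

    Standard banding analysis, assuming num_hashes is split evenly across buckets.
    """
    hashes_per_bucket = num_hashes // num_buckets
    return 1.0 - (1.0 - similarity ** hashes_per_bucket) ** num_buckets


p_at_threshold = candidate_probability(0.8)        # pairs sitting exactly on the 0.8 threshold
p_near_identical = candidate_probability(0.9)      # clearly similar pairs
p_dissimilar = candidate_probability(0.5)          # clearly different pairs
p_more_buckets = candidate_probability(0.8, num_hashes=260, num_buckets=26)
```

With the defaults (13 hashes per bucket), pairs at similarity 0.9 are almost always caught and pairs at 0.5 almost never are. Raising `num_buckets` while keeping `num_hashes` fixed shortens each bucket, which raises the catch rate at the threshold at the cost of more candidate pairs to verify.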

Performance: 16× faster on GPU vs CPU for an 8 TB dataset (120 h → 7.5 h)

Semantic deduplication

Remove semantically similar documents using embeddings.

from nemo_curator.modules import SemanticDuplicates

semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_batch_size=256,
    threshold=0.85,  # Cosine similarity threshold
    device="cuda"
)

deduped = semantic_dedup(dataset)

Models:

all-MiniLM-L6-v2: Fast, 384 dims
all-mpnet-base-v2: Better quality, 768 dims
Custom models supported
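
The role of the cosine similarity threshold can be shown with a small greedy sketch over precomputed embeddings. This is a simplified stand-in, not the library's algorithm: the function names are hypothetical, the three-dimensional vectors are toy values in place of real model embeddings, and the pairwise loop is quadratic, so it would not scale to large datasets.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_dedup_sketch(ids, embeddings, threshold=0.85):
    """Keep a document only if it is below the threshold against every kept document."""
    kept_ids, kept_vecs = [], []
    for doc_id, vec in zip(ids, embeddings):
        if all(cosine_similarity(vec, k) < threshold for k in kept_vecs):
            kept_ids.append(doc_id)
            kept_vecs.append(vec)
    return kept_ids


ids = ["a", "b", "c"]
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],  # nearly parallel to "a", so it counts as a semantic duplicate
    [0.0, 1.0, 0.0],  # orthogonal to "a", so it is kept
]
kept = semantic_dedup_sketch(ids, embeddings, threshold=0.85)
```

Lowering the threshold removes more loosely related documents; raising it keeps everything except close paraphrases.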

Comparison

Method	Speed	Recall	Use Case
Exact	Fastest	100%	Exact matches only
Fuzzy	Fast	~95%	Near-duplicates (recommended)
Semantic	Slow	~90%	Paraphrases, rewrites

Best practices

Start with exact dedup - Remove obvious duplicates
Use fuzzy for large datasets - Best speed/quality trade-off
Semantic for high-value data - Expensive but thorough
Use GPU acceleration where available - 10-16× speedup over CPU
