---
name: nemo-curator
description: GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [nemo-curator, cudf, dask, rapids]
metadata:
  hermes:
    tags: [Data Processing, NeMo Curator, Data Curation, GPU Acceleration, Deduplication, Quality Filtering, NVIDIA, RAPIDS, PII Redaction, Multimodal, LLM Training Data]
---

# NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

## When to use NeMo Curator

**Use NeMo Curator when:**
- Preparing LLM training data from web scrapes (Common Crawl)
- Deduplicating large corpora quickly (fuzzy deduplication runs 16× faster than on CPU)
- Curating multi-modal datasets (text, images, video, audio)
- Filtering low-quality or toxic content
- Scaling data processing across a GPU cluster

**Performance**:
- **16× faster** fuzzy deduplication (8TB RedPajama v2)
- **40% lower TCO** vs CPU alternatives
- **Near-linear scaling** across GPU nodes

**Use alternatives instead**:
- **datatrove**: CPU-based, open-source data processing
- **dolma**: Allen AI's data toolkit
- **Ray Data**: General ML data processing (no curation focus)

## Quick start

### Installation

```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```

### Basic text curation pipeline

```python
import pandas as pd

from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)

# Quality filtering
def quality_score(doc):
    return len(doc["text"].split()) > 5  # Filter short docs

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")
```

## Data curation pipeline

### Stage 1: Quality filtering

```python
from nemo_curator import ScoreFilter
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
)

# Apply heuristic filters (30+ available), each wrapped in ScoreFilter

# Word count filter
dataset = ScoreFilter(WordCountFilter(min_words=50, max_words=100000))(dataset)

# Remove repetitive content
dataset = ScoreFilter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))(dataset)

# URL ratio filter
dataset = ScoreFilter(UrlRatioFilter(max_url_ratio=0.2))(dataset)
```
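
To make the thresholds above concrete, here is a minimal plain-Python sketch of what word-count, repeated-line, and URL-ratio heuristics measure. It does not call NeMo Curator; the helper names (`word_count`, `repeated_line_fraction`, `url_ratio`, `keep_document`) and the exact definition of each ratio are assumptions made for illustration, and the library's own filters may compute them differently.

```python
import re

# Assumption: a URL is anything starting with http:// or https:// up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def word_count(text: str) -> int:
    """Number of whitespace-separated tokens."""
    return len(text.split())


def repeated_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


def url_ratio(text: str) -> float:
    """Fraction of characters that belong to URLs."""
    if not text:
        return 0.0
    url_chars = sum(len(match) for match in URL_PATTERN.findall(text))
    return url_chars / len(text)


def keep_document(text, min_words=50, max_words=100_000,
                  max_repeated=0.3, max_url=0.2) -> bool:
    """Apply the three heuristics with the thresholds used in Stage 1."""
    return (
        min_words <= word_count(text) <= max_words
        and repeated_line_fraction(text) <= max_repeated
        and url_ratio(text) <= max_url
    )
```

A document passes only if it clears every heuristic, which mirrors applying the filters one after another.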

### Stage 2: Deduplication

**Exact deduplication**:
```python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
```
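
Conceptually, exact deduplication hashes each document's text and keeps one document per distinct hash. The sketch below illustrates that idea in plain Python on an in-memory list; `exact_dedup` is a hypothetical helper, and the choice of SHA-256 and of keeping the first occurrence are assumptions, not a description of how `ExactDuplicates` is implemented.

```python
import hashlib


def exact_dedup(docs, id_field="id", text_field="text"):
    """Keep the first document for each distinct text; report dropped IDs."""
    seen_hashes = set()
    kept, dropped_ids = [], []
    for doc in docs:
        digest = hashlib.sha256(doc[text_field].encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            dropped_ids.append(doc[id_field])
        else:
            seen_hashes.add(digest)
            kept.append(doc)
    return kept, dropped_ids


docs = [
    {"id": 1, "text": "GPU data curation"},
    {"id": 2, "text": "A different document"},
    {"id": 3, "text": "GPU data curation"},
]
kept, dropped_ids = exact_dedup(docs)
print([d["id"] for d in kept], dropped_ids)
```

Only byte-identical texts collapse here; documents that differ by a single character survive, which is the gap fuzzy deduplication is meant to close.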

**Fuzzy deduplication** (16× faster on GPU):
```python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,  # MinHash parameters
    num_buckets=20,
    hash_method="md5",
)

deduped = fuzzy_dedup(dataset)
```
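
The MinHash + LSH idea behind these parameters can be sketched in plain Python. This is an illustration only, not NeMo Curator's GPU implementation: it assumes the 260 hashes are split evenly into 20 bands of 13 hashes each, uses character 5-gram shingles, and derives hash functions by seeding MD5. All function names here are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 260
NUM_BUCKETS = 20
ROWS_PER_BAND = NUM_HASHES // NUM_BUCKETS  # 13, assuming an even split


def shingles(text, n=5):
    """Set of character n-grams over whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(seed, shingle):
    """One member of a hash family: MD5 of 'seed:shingle', truncated to 64 bits."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(text):
    """For each seed, keep the minimum hash value over the document's shingles."""
    doc_shingles = shingles(text)
    return [min(seeded_hash(seed, s) for s in doc_shingles)
            for seed in range(NUM_HASHES)]


def lsh_candidate_pairs(signatures):
    """Pairs of IDs whose signatures agree on every row of at least one band."""
    pairs = set()
    for band in range(NUM_BUCKETS):
        start = band * ROWS_PER_BAND
        buckets = defaultdict(list)
        for doc_id, signature in signatures.items():
            buckets[tuple(signature[start:start + ROWS_PER_BAND])].append(doc_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    pairs.add((ids[i], ids[j]))
    return pairs
```

Documents that land in the same bucket for any band become candidate duplicates; a real pipeline would then verify candidates (for example by Jaccard similarity) before removing anything.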

**Semantic deduplication**:
```python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,  # Cosine similarity threshold
)

deduped = semantic_dedup(dataset)
```
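
The role of the cosine similarity threshold can be shown with a small plain-Python sketch. It assumes embeddings are already computed and uses a greedy keep-first strategy, which is one simple way to apply a threshold and not necessarily how the library groups duplicates; `cosine_similarity` and `semantic_dedup` are hypothetical helpers and the vectors are toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings, threshold=0.8):
    """Greedy pass: keep a document only if it is below the threshold
    against every document already kept."""
    kept_ids = []
    for doc_id, vector in embeddings.items():
        if all(cosine_similarity(vector, embeddings[k]) < threshold
               for k in kept_ids):
            kept_ids.append(doc_id)
    return kept_ids


# Toy 2-D embeddings: "a" and "b" point in nearly the same direction.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(semantic_dedup(embeddings, threshold=0.8))
```

Raising the threshold removes only near-identical meanings; lowering it removes more aggressively at the risk of dropping distinct documents.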

### Stage 3: PII redaction

```python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace",  # or "redact"
)

redacted = Modify(pii_redactor)(dataset)
```
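
As a rough illustration of the two actions, the sketch below handles only emails and phone numbers with regular expressions. It is not how the redactor above works internally (entities such as `PERSON` and `LOCATION` need a trained recognizer), and it assumes that "replace" substitutes an entity tag while "redact" removes the match entirely. The patterns and the `redact_pii` helper are hypothetical and deliberately simple.

```python
import re

# Simplified patterns for illustration; real PII detection needs far more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text, action="replace"):
    """Replace matches with an entity tag ("replace") or remove them ("redact")."""
    if action not in ("replace", "redact"):
        raise ValueError(f"unknown action: {action}")

    def substitute(label):
        return f"<{label}>" if action == "replace" else ""

    text = EMAIL_RE.sub(substitute("EMAIL_ADDRESS"), text)
    text = PHONE_RE.sub(substitute("PHONE_NUMBER"), text)
    return text


sample = "Contact jane.doe@example.com or +1 (555) 123-4567."
print(redact_pii(sample))
```

Replacing with a tag keeps sentence structure intact for downstream training, whereas removing the span leaves no trace of the entity type.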

### Stage 4: Classifier filtering

```python
from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda",
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
```
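
Classifier models are usually run on batches of documents. The sketch below shows batched scoring followed by a threshold filter in plain Python; `filter_by_score` and `batched` are hypothetical helpers, and `toy_quality_scores` is a stand-in scorer (share of letters and spaces), not a real quality model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def filter_by_score(docs, score_batch, threshold=0.5, batch_size=256):
    """Score documents batch by batch and keep those above the threshold."""
    kept = []
    for batch in batched(docs, batch_size):
        scores = score_batch([doc["text"] for doc in batch])
        kept.extend(doc for doc, score in zip(batch, scores) if score > threshold)
    return kept


def toy_quality_scores(texts):
    """Stand-in scorer: fraction of characters that are letters or whitespace."""
    return [
        sum(ch.isalpha() or ch.isspace() for ch in text) / len(text) if text else 0.0
        for text in texts
    ]


docs = [
    {"text": "Clean readable sentence"},
    {"text": "$$$ 1234 !!!"},
    {"text": ""},
]
print(filter_by_score(docs, toy_quality_scores, threshold=0.5, batch_size=2))
```

Swapping `toy_quality_scores` for a function that calls a real model keeps the filtering logic unchanged.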

## GPU acceleration

### GPU vs CPU performance

| Operation | CPU | GPU | Speedup |
|-----------|-----|-----|---------|
| Fuzzy dedup (8TB) | 120 hours (256 cores) | 7.5 hours (8× A100) | 16× |
| Exact dedup (1TB) | 8 hours (64 cores) | 0.5 hours (4× A100) | 16× |
| Quality filtering (100GB) | 2 hours (32 cores) | 0.2 hours (2× A100) | 10× |

### Multi-GPU scaling

```python
from nemo_curator import get_client
from nemo_curator.modules import FuzzyDuplicates

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)
```

## Multi-modal curation

### Image curation

```python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder,
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
```

### Video curation

```python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder,
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
```
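
One plausible reading of the `min_duration` and `max_duration` parameters is sketched below in plain Python: scenes shorter than the minimum are dropped, and longer scenes are cut into consecutive clips no longer than the maximum. This interpretation and the `extract_clips` helper are assumptions for illustration; the extractor above may handle boundaries differently.

```python
def extract_clips(scenes, min_duration=2.0, max_duration=10.0):
    """Split (start, end) scenes, in seconds, into clips within the duration bounds.

    A trailing remainder shorter than min_duration is discarded.
    """
    clips = []
    for start, end in scenes:
        cursor = start
        while end - cursor >= min_duration:
            clip_end = min(cursor + max_duration, end)
            clips.append((cursor, clip_end))
            cursor = clip_end
    return clips


# Three detected scenes: too short, medium, and long.
scenes = [(0.0, 1.5), (1.5, 14.0), (14.0, 39.0)]
print(extract_clips(scenes))
```

Bounding clip length this way keeps embedding inputs uniform while still respecting scene boundaries.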

### Audio curation

```python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter,
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
```
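
Word error rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The plain-Python sketch below computes it and applies the `max_wer=0.3` cutoff; it assumes each sample carries both a reference and an ASR transcript, and the field names `reference` and `transcript` as well as the helper names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        # Convention chosen for this sketch: empty reference is perfect only
        # if the hypothesis is empty too.
        return 0.0 if not hyp_words else 1.0

    # Levenshtein distance over words, keeping one row at a time.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref_words)


def filter_by_wer(samples, max_wer=0.3):
    """Keep samples whose ASR transcript is close enough to the reference."""
    return [
        s for s in samples
        if word_error_rate(s["reference"], s["transcript"]) <= max_wer
    ]
```

Because the denominator is the reference length, WER can exceed 1.0 when the ASR output contains many extra words.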

## Common patterns

### Web scrape curation (Common Crawl)

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.modifiers import PIIRedactor
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering
    ScoreFilter(WordCountFilter(min_words=100, max_words=50000)),
    ScoreFilter(RepeatedLinesFilter(max_repeated_line_fraction=0.2)),
    ScoreFilter(SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3)),
    ScoreFilter(UrlRatioFilter(max_url_ratio=0.3)),

    # 2. Language filtering
    ScoreFilter(LanguageIdentificationFilter(target_languages=["en"])),

    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),

    # 4. PII redaction
    Modify(PIIRedactor()),

    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8),
]

# Execute
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")
```
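
When stages run in sequence like this, it helps to record how many documents each one removes. The following plain-Python sketch shows that pattern with toy stages on a list of strings; `run_pipeline`, `min_words`, and `drop_exact_duplicates` are hypothetical helpers and do not depend on NeMo Curator.

```python
def run_pipeline(stages, docs):
    """Apply (name, stage) pairs in order and record document counts per stage."""
    report = []
    for name, stage in stages:
        before = len(docs)
        docs = stage(docs)
        report.append((name, before, len(docs)))
    return docs, report


def min_words(n):
    """Stage factory: keep documents with at least n words."""
    return lambda docs: [doc for doc in docs if len(doc.split()) >= n]


def drop_exact_duplicates(docs):
    """Keep the first occurrence of each document, preserving order."""
    return list(dict.fromkeys(docs))


stages = [
    ("word_count", min_words(3)),
    ("exact_dedup", drop_exact_duplicates),
]
docs = [
    "too short",
    "a long enough document",
    "a long enough document",
    "another long enough one",
]
curated, report = run_pipeline(stages, docs)
for name, before, after in report:
    print(f"{name}: {before} -> {after}")
```

Running cheap filters before deduplication, as in the pipeline above, means the more expensive stages see fewer documents.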

### Distributed processing

```python
from dask_cuda import LocalCUDACluster

from nemo_curator import get_client
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import FuzzyDuplicates

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()
```

## Performance benchmarks

### Fuzzy deduplication (8TB RedPajama v2)

- **CPU (256 cores)**: 120 hours
- **GPU (8× A100)**: 7.5 hours
- **Speedup**: 16×

### Exact deduplication (1TB)

- **CPU (64 cores)**: 8 hours
- **GPU (4× A100)**: 0.5 hours
- **Speedup**: 16×

### Quality filtering (100GB)

- **CPU (32 cores)**: 2 hours
- **GPU (2× A100)**: 0.2 hours
- **Speedup**: 10×

## Cost comparison

**CPU-based curation** (AWS c5.18xlarge × 10):
- Cost: $3.60/hour × 10 = $36/hour
- Time for 8TB: 120 hours
- **Total**: $4,320

**GPU-based curation** (AWS p4d.24xlarge × 2):
- Cost: $32.77/hour × 2 = $65.54/hour
- Time for 8TB: 7.5 hours
- **Total**: $491.55

**Savings**: 89% reduction ($3,828 saved)
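
The arithmetic behind these totals can be reproduced with a few lines of Python, using only the hourly rates, instance counts, and run times quoted above. `total_cost` is a hypothetical helper; actual cloud prices vary by region and over time.

```python
def total_cost(hourly_rate, instances, hours):
    """Cost of running a fixed number of instances for a given duration."""
    return hourly_rate * instances * hours


cpu_total = total_cost(hourly_rate=3.60, instances=10, hours=120)
gpu_total = total_cost(hourly_rate=32.77, instances=2, hours=7.5)

savings = cpu_total - gpu_total
savings_pct = savings / cpu_total * 100

print(f"CPU total: ${cpu_total:,.2f}")
print(f"GPU total: ${gpu_total:,.2f}")
print(f"Savings:   ${savings:,.2f} ({savings_pct:.0f}%)")
```

The GPU cluster costs more per hour, but the 16× shorter run time more than offsets the higher rate.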

## Supported data formats

- **Input**: Parquet, JSONL, CSV
- **Output**: Parquet (recommended), JSONL
- **WebDataset**: TAR archives for multi-modal

## Use cases

**Production deployments**:
- NVIDIA used NeMo Curator to prepare Nemotron-4 training data
- Open-source datasets curated: RedPajama v2, The Pile

## References

- **[Filtering Guide](references/filtering.md)** - 30+ quality filters, heuristics
- **[Deduplication Guide](references/deduplication.md)** - Exact, fuzzy, semantic methods

## Resources

- **GitHub**: https://github.com/NVIDIA/NeMo-Curator ⭐ 500+
- **Docs**: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/
- **Version**: 0.4.0+
- **License**: Apache 2.0