docs(website): dedicated page per bundled + optional skill (#14929)

Generates a full dedicated Docusaurus page for every one of the 132 skills (73 bundled + 59 optional) under website/docs/user-guide/skills/{bundled,optional}/<category>/. Each page carries the skill's description, metadata (version, author, license, dependencies, platform gating, tags, related skills cross-linked to their own pages), and the complete SKILL.md body that Hermes loads at runtime.

Previously the two catalog pages just listed skills with a one-line blurb and no way to see what the skill actually did; users had to go read the source repo. Now every skill has a browsable, searchable, cross-linked reference in the docs.

- website/scripts/generate-skill-docs.py: generator that reads skills/ and optional-skills/, writes per-skill pages, regenerates both catalog indexes, and rewrites the Skills section of sidebars.ts. Handles MDX escaping (outside fenced code blocks: curly braces, unsafe HTML-ish tags) and rewrites relative references/*.md links to point at the GitHub source.
- website/docs/reference/skills-catalog.md: regenerated; each row links to the new dedicated page.
- website/docs/reference/optional-skills-catalog.md: same.
- website/sidebars.ts: Skills section now has Bundled / Optional subtrees with one nested category per skill folder.
- .github/workflows/{docs-site-checks,deploy-site}.yml: run the generator before docusaurus build so CI stays in sync with the source SKILL.md files.

Build verified locally with `npx docusaurus build`. Only remaining warnings are pre-existing broken link/anchor issues in unrelated pages.
2026-04-26 01:01:40 +00:00 · 2026-04-23 22:22:11 -07:00 · 2026-04-23 22:22:11 -07:00 · 0f6eabb890
commit 0f6eabb890
parent eb93f88e1d
139 changed files with 43523 additions and 306 deletions
--- a/website/docs/user-guide/skills/optional/mlops/mlops-huggingface-tokenizers.md
+++ b/website/docs/user-guide/skills/optional/mlops/mlops-huggingface-tokenizers.md
@@ -0,0 +1,534 @@
+---
+title: "Huggingface Tokenizers — Fast tokenizers optimized for research and production"
+sidebar_label: "Huggingface Tokenizers"
+description: "Fast tokenizers optimized for research and production"
+---
+
+{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}
+
+# Huggingface Tokenizers
+
+Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in &lt;20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.
+
+## Skill metadata
+
+| | |
+|---|---|
+| Source | Optional — install with `hermes skills install official/mlops/huggingface-tokenizers` |
+| Path | `optional-skills/mlops/huggingface-tokenizers` |
+| Version | `1.0.0` |
+| Author | Orchestra Research |
+| License | MIT |
+| Dependencies | `tokenizers`, `transformers`, `datasets` |
+| Tags | `Tokenization`, `HuggingFace`, `BPE`, `WordPiece`, `Unigram`, `Fast Tokenization`, `Rust`, `Custom Tokenizer`, `Alignment Tracking`, `Production` |
+
+## Reference: full SKILL.md
+
+:::info
+The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
+:::
+
+# HuggingFace Tokenizers - Fast Tokenization for NLP
+
+Fast, production-ready tokenizers with Rust performance and Python ease-of-use.
+
+## When to use HuggingFace Tokenizers
+
+**Use HuggingFace Tokenizers when:**
+- Need extremely fast tokenization (&lt;20s per GB of text)
+- Training custom tokenizers from scratch
+- Want alignment tracking (token → original text position)
+- Building production NLP pipelines
+- Need to tokenize large corpora efficiently
+
+**Performance**:
+- **Speed**: &lt;20 seconds to tokenize 1GB on CPU
+- **Implementation**: Rust core with Python/Node.js bindings
+- **Efficiency**: 10-100× faster than pure Python implementations
+
+**Use alternatives instead**:
+- **SentencePiece**: Language-independent, used by T5/ALBERT
+- **tiktoken**: OpenAI's BPE tokenizer for GPT models
+- **transformers AutoTokenizer**: Enough when you only need to load a pretrained tokenizer (it uses this library internally)
+
+## Quick start
+
+### Installation
+
+```bash
+# Install tokenizers
+pip install tokenizers
+
+# With transformers integration
+pip install tokenizers transformers
+```
+
+### Load pretrained tokenizer
+
+```python
+from tokenizers import Tokenizer
+
+# Load from HuggingFace Hub
+tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
+
+# Encode text (the BERT post-processor adds [CLS] and [SEP])
+output = tokenizer.encode("Hello, how are you?")
+print(output.tokens)  # ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
+print(output.ids)     # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
+
+# Decode back (special tokens are skipped by default)
+text = tokenizer.decode(output.ids)
+print(text)  # "hello, how are you?"
+```
+
+### Train custom BPE tokenizer
+
+```python
+from tokenizers import Tokenizer
+from tokenizers.models import BPE
+from tokenizers.trainers import BpeTrainer
+from tokenizers.pre_tokenizers import Whitespace
+
+# Initialize tokenizer with BPE model
+tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
+tokenizer.pre_tokenizer = Whitespace()
+
+# Configure trainer
+trainer = BpeTrainer(
+    vocab_size=30000,
+    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
+    min_frequency=2
+)
+
+# Train on files
+files = ["train.txt", "validation.txt"]
+tokenizer.train(files, trainer)
+
+# Save
+tokenizer.save("my-tokenizer.json")
+```
+
+**Training time**: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB
+
+### Batch encoding with padding
+
+```python
+# Enable padding, looking up the [PAD] id instead of hard-coding it
+tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
+
+# Encode batch
+texts = ["Hello world", "This is a longer sentence"]
+encodings = tokenizer.encode_batch(texts)
+
+for encoding in encodings:
+    print(encoding.ids)
+# Example output with bert-base-uncased, where [PAD] has id 0:
+# [101, 7592, 2088, 102, 0, 0, 0]
+# [101, 2023, 2003, 1037, 2936, 6251, 102]
+```
+
+## Tokenization algorithms
+
+### BPE (Byte-Pair Encoding)
+
+**How it works**:
+1. Start with character-level vocabulary
+2. Find the most frequent adjacent pair of symbols (single characters at first, merged tokens later)
+3. Merge into new token, add to vocabulary
+4. Repeat until vocabulary size reached
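+
+The merge loop above can be sketched in plain Python. This is a toy illustration, not the library's Rust implementation: the function name `train_bpe`, the five-word corpus, and the stop-after-N-merges criterion (instead of a target vocabulary size) are all assumptions made for the example.
+
+```python
+from collections import Counter
+
+def train_bpe(words, num_merges):
+    """Toy BPE trainer (hypothetical helper): returns the learned merges in order."""
+    # Each distinct word is a tuple of symbols, weighted by its frequency.
+    corpus = Counter(tuple(word) for word in words)
+    merges = []
+    for _ in range(num_merges):
+        # Count adjacent symbol pairs across the corpus.
+        pairs = Counter()
+        for symbols, freq in corpus.items():
+            for pair in zip(symbols, symbols[1:]):
+                pairs[pair] += freq
+        if not pairs:
+            break
+        # Pick the most frequent pair (ties go to the pair seen first).
+        best = max(pairs, key=pairs.get)
+        merges.append(best)
+        # Rewrite every word with the chosen pair fused into one symbol.
+        new_corpus = Counter()
+        for symbols, freq in corpus.items():
+            merged, i = [], 0
+            while i < len(symbols):
+                if symbols[i:i + 2] == best:
+                    merged.append(symbols[i] + symbols[i + 1])
+                    i += 2
+                else:
+                    merged.append(symbols[i])
+                    i += 1
+            new_corpus[tuple(merged)] += freq
+        corpus = new_corpus
+    return merges
+
+merges = train_bpe(["hug", "hug", "pug", "hugs", "bun"], num_merges=2)
+print(merges)  # [('u', 'g'), ('h', 'ug')]
+```
+
+The order of `merges` matters: encoding new text replays the merges in the order they were learned, which is why the trade-offs below mention merge order.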
+
+**Used by**: GPT-2, GPT-3, RoBERTa, BART, DeBERTa
+
+```python
+from tokenizers import Tokenizer
+from tokenizers.models import BPE
+from tokenizers.trainers import BpeTrainer
+from tokenizers.pre_tokenizers import ByteLevel
+
+tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
+tokenizer.pre_tokenizer = ByteLevel()
+
+trainer = BpeTrainer(
+    vocab_size=50257,
+    special_tokens=["<|endoftext|>"],
+    min_frequency=2
+)
+
+tokenizer.train(files=["data.txt"], trainer=trainer)
+```
+
+**Advantages**:
+- Handles OOV words well (breaks into subwords)
+- Flexible vocabulary size
+- Good for morphologically rich languages
+
+**Trade-offs**:
+- Tokenization depends on merge order
+- May split common words unexpectedly
+
+### WordPiece
+
+**How it works**:
+1. Start with character vocabulary
+2. Score merge pairs: `frequency(pair) / (frequency(first) × frequency(second))`
+3. Merge highest scoring pair
+4. Repeat until vocabulary size reached
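+
+The scoring rule in step 2 can be checked by hand on a small corpus. The sketch below is a plain-Python illustration under assumed inputs: `wordpiece_scores` is a hypothetical helper, and the word frequencies are made up for the example.
+
+```python
+from collections import Counter
+
+def wordpiece_scores(corpus):
+    """Score each adjacent pair as freq(pair) / (freq(first) * freq(second))."""
+    symbol_freq, pair_freq = Counter(), Counter()
+    for symbols, freq in corpus.items():
+        for symbol in symbols:
+            symbol_freq[symbol] += freq
+        for pair in zip(symbols, symbols[1:]):
+            pair_freq[pair] += freq
+    return {
+        pair: count / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
+        for pair, count in pair_freq.items()
+    }
+
+# Toy corpus (assumed): word already split into symbols -> frequency.
+corpus = {
+    ("h", "##u", "##g"): 10,
+    ("p", "##u", "##g"): 5,
+    ("p", "##u", "##n"): 12,
+    ("b", "##u", "##n"): 4,
+    ("h", "##u", "##g", "##s"): 5,
+}
+scores = wordpiece_scores(corpus)
+best = max(scores, key=scores.get)
+print(best, scores[best])  # ('##g', '##s') 0.05
+```
+
+Here the most frequent pair is `("##u", "##g")` with 20 occurrences, yet `("##g", "##s")` wins because `##s` never appears without `##g`. Plain BPE would have merged `("##u", "##g")` first.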
+
+**Used by**: BERT, DistilBERT, MobileBERT
+
+```python
+from tokenizers import Tokenizer
+from tokenizers.models import WordPiece
+from tokenizers.trainers import WordPieceTrainer
+from tokenizers.pre_tokenizers import Whitespace
+from tokenizers.normalizers import BertNormalizer
+
+tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
+tokenizer.normalizer = BertNormalizer(lowercase=True)
+tokenizer.pre_tokenizer = Whitespace()
+
+trainer = WordPieceTrainer(
+    vocab_size=30522,
+    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
+    continuing_subword_prefix="##"
+)
+
+tokenizer.train(files=["corpus.txt"], trainer=trainer)
+```
+
+**Advantages**:
+- Favors pairs whose parts rarely occur apart (the score divides pair frequency by the frequency of each part)
+- Used successfully in BERT (state-of-the-art results)
+
+**Trade-offs**:
+- A word becomes `[UNK]` as a whole if any part of it has no matching subword
+- Saves only the final vocabulary, not merge rules, so encoding relies on greedy longest-match-first lookup
+
+### Unigram
+
+**How it works**:
+1. Start with a large seed vocabulary (for example, all frequent substrings)
+2. Compute loss for corpus with current vocabulary
+3. Remove tokens with minimal impact on loss
+4. Repeat until vocabulary size reached
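+
+Steps 2 and 3 can be illustrated with a small plain-Python sketch. It is a simplification under stated assumptions: the token counts are invented, probabilities are taken directly from those counts (a real trainer re-estimates them with EM), and the loss uses only the single best segmentation found by Viterbi search. The helper names `best_segmentation` and `corpus_loss` are hypothetical.
+
+```python
+import math
+
+def best_segmentation(word, log_probs):
+    """Viterbi search for the highest-probability split of `word`."""
+    # best[i] = (score of the best split of word[:i], start of its last token)
+    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
+    for end in range(1, len(word) + 1):
+        for start in range(end):
+            piece = word[start:end]
+            if piece in log_probs and best[start][0] > -math.inf:
+                score = best[start][0] + log_probs[piece]
+                if score > best[end][0]:
+                    best[end] = (score, start)
+    # Walk the back-pointers to recover the tokens.
+    tokens, end = [], len(word)
+    while end > 0:
+        start = best[end][1]
+        tokens.append(word[start:end])
+        end = start
+    return tokens[::-1], best[len(word)][0]
+
+def corpus_loss(word_freqs, counts):
+    """Negative log-likelihood of the corpus under the unigram model."""
+    total = sum(counts.values())
+    log_probs = {tok: math.log(c / total) for tok, c in counts.items()}
+    return -sum(
+        freq * best_segmentation(word, log_probs)[1]
+        for word, freq in word_freqs.items()
+    )
+
+# Assumed toy data: word frequencies and token counts.
+word_freqs = {"hug": 10, "pug": 5, "hugs": 5}
+counts = {"h": 15, "u": 20, "g": 20, "p": 5, "s": 5, "ug": 20, "hug": 15}
+
+base = corpus_loss(word_freqs, counts)
+impact = {}
+for token in ("ug", "hug"):
+    # Loss increase if this token were removed from the vocabulary.
+    reduced = {t: c for t, c in counts.items() if t != token}
+    impact[token] = corpus_loss(word_freqs, reduced) - base
+print(impact)
+```
+
+In this toy setup, dropping `ug` barely changes the loss while dropping `hug` raises it sharply, so a pruning step would discard `ug` first.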
+
+**Used by**: ALBERT, T5, mBART, XLNet (via SentencePiece)
+
+```python
+from tokenizers import Tokenizer
+from tokenizers.models import Unigram
+from tokenizers.trainers import UnigramTrainer
+
+tokenizer = Tokenizer(Unigram())
+
+trainer = UnigramTrainer(
+    vocab_size=8000,
+    special_tokens=["<unk>", "<s>", "</s>"],
+    unk_token="<unk>"
+)
+
+tokenizer.train(files=["data.txt"], trainer=trainer)
+```
+
+**Advantages**:
+- Probabilistic (finds most likely tokenization)
+- Works well for languages without word boundaries
+- Handles diverse linguistic contexts
+
+**Trade-offs**:
+- Computationally expensive to train
+- More hyperparameters to tune
+
+## Tokenization pipeline
+
+Complete pipeline: **Normalization → Pre-tokenization → Model → Post-processing**
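+
+To make the four stages concrete, here is a standard-library-only sketch that mimics each one. It does not use the `tokenizers` API: the tiny `VOCAB`, the helper names, and the BERT-style choices (accent stripping, `##` continuation prefix, `[CLS]`/`[SEP]` wrapping) are all assumptions made for illustration.
+
+```python
+import re
+import unicodedata
+
+# Hypothetical toy vocabulary.
+VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "wor": 4, "##ld": 5, ",": 6, "!": 7}
+
+def normalize(text):
+    """Normalization: NFD-decompose, drop combining marks, lowercase."""
+    decomposed = unicodedata.normalize("NFD", text)
+    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()
+
+def pre_tokenize(text):
+    """Pre-tokenization: runs of word characters or runs of punctuation."""
+    return re.findall(r"\w+|[^\w\s]+", text)
+
+def wordpiece(word, vocab):
+    """Model: greedy longest-match-first lookup; [UNK] if any part fails."""
+    tokens, start = [], 0
+    while start < len(word):
+        end = len(word)
+        while end > start:
+            piece = word[start:end] if start == 0 else "##" + word[start:end]
+            if piece in vocab:
+                break
+            end -= 1
+        if end == start:
+            return ["[UNK]"]
+        tokens.append(piece)
+        start = end
+    return tokens
+
+def post_process(tokens):
+    """Post-processing: wrap the sequence in special tokens."""
+    return ["[CLS]", *tokens, "[SEP]"]
+
+def encode(text):
+    words = pre_tokenize(normalize(text))
+    tokens = [tok for word in words for tok in wordpiece(word, VOCAB)]
+    return post_process(tokens)
+
+print(encode("Héllo, WORLD!"))
+# ['[CLS]', 'hello', ',', 'wor', '##ld', '!', '[SEP]']
+```
+
+Each function corresponds to one slot on the real `Tokenizer` object (`normalizer`, `pre_tokenizer`, the model, `post_processor`), which the sections below configure one at a time.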
+
+### Normalization
+
+Clean and standardize text:
+
+```python
+from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence
+
+tokenizer.normalizer = Sequence([
+    NFD(),           # Unicode normalization (decompose)
+    Lowercase(),     # Convert to lowercase
+    StripAccents()   # Remove accents
+])
+
+# Input: "Héllo WORLD"
+# After normalization: "hello world"
+```
+
+**Common normalizers**:
+- `NFD`, `NFC`, `NFKD`, `NFKC` - Unicode normalization forms
+- `Lowercase()` - Convert to lowercase
+- `StripAccents()` - Remove accents (é → e)
+- `Strip()` - Remove leading and trailing whitespace
+- `Replace(pattern, content)` - Replace a literal string or `Regex` pattern
+
+### Pre-tokenization
+
+Split text into word-like units:
+
+```python
+from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel
+
+# Split on whitespace and punctuation
+tokenizer.pre_tokenizer = Sequence([
+    Whitespace(),
+    Punctuation()
+])
+
+# Input: "Hello, world!"
+# After pre-tokenization: ["Hello", ",", "world", "!"]
+```
+
+**Common pre-tokenizers**:
+- `Whitespace()` - Split into words and punctuation runs (regex `\w+|[^\w\s]+`); use `WhitespaceSplit()` to split on whitespace only
+- `ByteLevel()` - GPT-2 style byte-level splitting
+- `Punctuation()` - Isolate punctuation
+- `Digits(individual_digits=True)` - Split digits individually
+- `Metaspace()` - Replace spaces with ▁ (SentencePiece style)
+
+### Post-processing
+
+Add special tokens for model input:
+
+```python
+from tokenizers.processors import TemplateProcessing
+
+# BERT-style: [CLS] sentence [SEP]
+tokenizer.post_processor = TemplateProcessing(
+    single="[CLS] $A [SEP]",
+    pair="[CLS] $A [SEP] $B [SEP]",
+    special_tokens=[
+        ("[CLS]", 1),
+        ("[SEP]", 2),
+    ],
+)
+```
+
+**Common patterns**:
+```python
+# GPT-2: sentence <|endoftext|>
+TemplateProcessing(
+    single="$A <|endoftext|>",
+    special_tokens=[("<|endoftext|>", 50256)]
+)
+
+# RoBERTa: <s> sentence </s>
+TemplateProcessing(
+    single="<s> $A </s>",
+    pair="<s> $A </s> </s> $B </s>",
+    special_tokens=[("<s>", 0), ("</s>", 2)]
+)
+```
+
+## Alignment tracking
+
+Track token positions in original text:
+
+```python
+# Assumes a lowercasing tokenizer such as bert-base-uncased
+text = "Hello, world!"
+output = tokenizer.encode(text, add_special_tokens=False)
+
+# Get token offsets
+for token, offset in zip(output.tokens, output.offsets):
+    start, end = offset
+    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")
+
+# Output:
+# hello      → [ 0,  5): 'Hello'
+# ,          → [ 5,  6): ','
+# world      → [ 7, 12): 'world'
+# !          → [12, 13): '!'
+```
+
+**Use cases**:
+- Named entity recognition (map predictions back to text)
+- Question answering (extract answer spans)
+- Token classification (align labels to original positions)
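+
+For the span-based use cases above, the usual step is mapping a character range back to token indices. A minimal sketch, assuming the offsets printed in the example output; `tokens_for_span` is a hypothetical helper, not part of the library:
+
+```python
+def tokens_for_span(offsets, span_start, span_end):
+    """Return indices of tokens whose [start, end) range overlaps the span."""
+    return [
+        i for i, (start, end) in enumerate(offsets)
+        if start < span_end and end > span_start
+    ]
+
+text = "Hello, world!"
+tokens = ["hello", ",", "world", "!"]
+offsets = [(0, 5), (5, 6), (7, 12), (12, 13)]  # copied from the output above
+
+# Character span of the entity "world" in the original text.
+span_start = text.index("world")
+span_end = span_start + len("world")
+
+hits = tokens_for_span(offsets, span_start, span_end)
+print([tokens[i] for i in hits])  # ['world']
+```
+
+Special tokens added by a post-processor carry the empty offset `(0, 0)`, so the overlap test skips them without extra handling.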
+
+## Integration with transformers
+
+### Load with AutoTokenizer
+
+```python
+from transformers import AutoTokenizer
+
+# AutoTokenizer automatically uses fast tokenizers
+tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+
+# Check if using fast tokenizer
+print(tokenizer.is_fast)  # True
+
+# Access underlying tokenizers.Tokenizer
+fast_tokenizer = tokenizer.backend_tokenizer
+print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
+```
+
+### Convert custom tokenizer to transformers
+
+```python
+from tokenizers import Tokenizer
+from tokenizers.models import BPE
+from transformers import PreTrainedTokenizerFast
+
+# Train custom tokenizer
+tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
+# ... train tokenizer ...
+tokenizer.save("my-tokenizer.json")
+
+# Wrap for transformers
+transformers_tokenizer = PreTrainedTokenizerFast(
+    tokenizer_file="my-tokenizer.json",
+    unk_token="[UNK]",
+    pad_token="[PAD]",
+    cls_token="[CLS]",
+    sep_token="[SEP]",
+    mask_token="[MASK]"
+)
+
+# Use like any transformers tokenizer
+outputs = transformers_tokenizer(
+    "Hello world",
+    padding=True,
+    truncation=True,
+    max_length=512,
+    return_tensors="pt"  # requires PyTorch
+)
+```
+
+## Common patterns
+
+### Train from iterator (large datasets)
+
+```python
+from datasets import load_dataset
+
+# Load dataset
+dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
+
+# Create batch iterator
+def batch_iterator(batch_size=1000):
+    for i in range(0, len(dataset), batch_size):
+        yield dataset[i:i + batch_size]["text"]
+
+# Train tokenizer
+tokenizer.train_from_iterator(
+    batch_iterator(),
+    trainer=trainer,
+    length=len(dataset)  # For progress bar
+)
+```
+
+**Performance**: Trains on a 1GB corpus in ~10-20 minutes
+
+### Enable truncation and padding
+
+```python
+# Enable truncation
+tokenizer.enable_truncation(max_length=512)
+
+# Enable padding
+tokenizer.enable_padding(
+    pad_id=tokenizer.token_to_id("[PAD]"),
+    pad_token="[PAD]",
+    length=512  # Fixed length, or None for batch max
+)
+
+# Encode with both: long inputs are truncated, short ones padded to 512
+output = tokenizer.encode("This is a long sentence that will be truncated...")
+print(len(output.ids))  # 512
+```
+
+### Multi-processing
+
+```python
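+import os
+from multiprocessing import Pool
+
+# Turn off the library's own thread pool before forking worker processes
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+from tokenizers import Tokenizer
+
+# Load tokenizer
+tokenizer = Tokenizer.from_file("tokenizer.json")
+
+def encode_batch(texts):
+    # Return plain id lists, which are cheap to send back to the parent
+    return [encoding.ids for encoding in tokenizer.encode_batch(texts)]
+
+if __name__ == "__main__":
+    # Placeholder corpus: replace with your own list of strings
+    corpus = ["Hello world"] * 10_000
+
+    # Split corpus into chunks
+    chunk_size = 1000
+    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]
+
+    # Encode in parallel
+    with Pool(8) as pool:
+        results = pool.map(encode_batch, chunks)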
+```
+
+**Speedup**: 5-8× with 8 cores
+
+## Performance benchmarks
+
+### Training speed
+
+| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
+|-------------|-----------------|-----------------|--------------|
+| 10 MB       | 15 sec          | 18 sec          | 25 sec       |
+| 100 MB      | 1.5 min         | 2 min           | 4 min        |
+| 1 GB        | 15 min          | 20 min          | 40 min       |
+
+**Hardware**: 16-core CPU, tested on English Wikipedia
+
+### Tokenization speed
+
+| Implementation | 1 GB corpus | Throughput    |
+|----------------|-------------|---------------|
+| Pure Python    | ~20 minutes | ~50 MB/min    |
+| HF Tokenizers  | ~15 seconds | ~4 GB/min     |
+| **Speedup**    | **80×**     | **80×**       |
+
+**Test**: English text, average sentence length 20 words
+
+### Memory usage
+
+| Task                    | Memory  |
+|-------------------------|---------|
+| Load tokenizer          | ~10 MB  |
+| Train BPE (30k vocab)   | ~200 MB |
+| Encode 1M sentences     | ~500 MB |
+
+## Supported models
+
+Pre-trained tokenizers available via `from_pretrained()`:
+
+**BERT family**:
+- `bert-base-uncased`, `bert-large-cased`
+- `distilbert-base-uncased`
+- `roberta-base`, `roberta-large`
+
+**GPT family**:
+- `gpt2`, `gpt2-medium`, `gpt2-large`
+- `distilgpt2`
+
+**T5 family**:
+- `t5-small`, `t5-base`, `t5-large`
+- `google/flan-t5-xxl`
+
+**Other**:
+- `facebook/bart-base`, `facebook/mbart-large-cc25`
+- `albert-base-v2`, `albert-xlarge-v2`
+- `xlm-roberta-base`, `xlm-roberta-large`
+
+Browse all: https://huggingface.co/models?library=tokenizers
+
+## References
+
+- **[Training Guide](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/huggingface-tokenizers/references/training.md)** - Train custom tokenizers, configure trainers, handle large datasets
+- **[Algorithms Deep Dive](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/huggingface-tokenizers/references/algorithms.md)** - BPE, WordPiece, Unigram explained in detail
+- **[Pipeline Components](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/huggingface-tokenizers/references/pipeline.md)** - Normalizers, pre-tokenizers, post-processors, decoders
+- **[Transformers Integration](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/huggingface-tokenizers/references/integration.md)** - AutoTokenizer, PreTrainedTokenizerFast, special tokens
+
+## Resources
+
+- **Docs**: https://huggingface.co/docs/tokenizers
+- **GitHub**: https://github.com/huggingface/tokenizers ⭐ 9,000+
+- **Version**: 0.20.0+
+- **Course**: https://huggingface.co/learn/nlp-course/chapter6/1
+- **Papers**: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012), Unigram (Kudo, 2018)