Tokenization Algorithms Deep Dive
Comprehensive explanation of BPE, WordPiece, and Unigram algorithms.
Byte-Pair Encoding (BPE)
Algorithm overview
BPE iteratively merges the most frequent pair of tokens in a corpus.
Training process:
- Initialize vocabulary with all characters
- Count frequency of all adjacent token pairs
- Merge most frequent pair into new token
- Add new token to vocabulary
- Update corpus with new token
- Repeat until vocabulary size reached
Step-by-step example
Corpus:
low: 5
lower: 2
newest: 6
widest: 3
Iteration 1:
Count pairs:
'e' + 's': 9 (newest: 6, widest: 3) ← most frequent
'l' + 'o': 7
'o' + 'w': 7
...
Merge: 'e' + 's' → 'es'
Updated corpus:
low: 5
lower: 2
newest: 6 → n|e|w|es|t: 6
widest: 3 → w|i|d|es|t: 3
Vocabulary: [a-z] + ['es']
Iteration 2:
Count pairs:
'es' + 't': 9 ← most frequent
'l' + 'o': 7
...
Merge: 'es' + 't' → 'est'
Updated corpus:
low: 5
lower: 2
newest: 6 → n|e|w|est: 6
widest: 3 → w|i|d|est: 3
Vocabulary: [a-z] + ['es', 'est']
Continue until desired vocabulary size...
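The merge loop above can be reproduced with a minimal from-scratch sketch (not the library implementation) on the toy corpus:

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent token pairs, weighted by word frequency."""
    pairs = Counter()
    for tokens, freq in corpus:
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = []
    for tokens, freq in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged.append((out, freq))
    return merged

# Toy corpus from the example, pre-split into characters
corpus = [(list("low"), 5), (list("lower"), 2),
          (list("newest"), 6), (list("widest"), 3)]

merges = []
for _ in range(2):
    pair = count_pairs(corpus).most_common(1)[0][0]
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)  # [('e', 's'), ('es', 't')]
```

The first two learned merges match the worked example: 'e' + 's' (count 9), then 'es' + 't' (count 9).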
Tokenization with trained BPE
Given vocabulary: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'es', 'est', 'lo', 'low', 'ne', 'new', 'newest', 'wi', 'wid', 'widest']
Tokenize "lowest":
Step 1: Split into characters
['l', 'o', 'w', 'e', 's', 't']
Step 2: Apply merges in order learned during training
- Merge 'l' + 'o' → 'lo' (if this merge was learned)
- Merge 'lo' + 'w' → 'low' (if learned)
- Merge 'e' + 's' → 'es' (learned)
- Merge 'es' + 't' → 'est' (learned)
Final: ['low', 'est']
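Inference can be sketched the same way: replay the merge list, in the order learned during training, over the character sequence. The merge list below is assumed for illustration:

```python
def bpe_encode(word, merges):
    """Apply learned BPE merges, in training order, to a single word."""
    tokens = list(word)
    for pair in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]
print(bpe_encode("lowest", merges))  # ['low', 'est']
```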
Implementation
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Initialize
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# Configure trainer
trainer = BpeTrainer(
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

# Train
corpus = [
    "This is a sample corpus for BPE training.",
    "BPE learns subword units from the training data.",
    # ... more sentences
]
tokenizer.train_from_iterator(corpus, trainer=trainer)
# Use
output = tokenizer.encode("This is tokenization")
print(output.tokens) # ['This', 'is', 'token', 'ization']
Byte-level BPE (GPT-2 variant)
Problem: Standard BPE must seed its vocabulary with every base character it might see, and Unicode defines well over 100,000 characters.
Solution: Operate on bytes instead; there are only 256 possible byte values.
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
tokenizer = Tokenizer(BPE())
# Byte-level pre-tokenization
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()
# This handles ALL possible characters, including emojis
text = "Hello 🌍 世界"
tokens = tokenizer.encode(text).tokens
Advantages:
- Handles any Unicode character with a base vocabulary of only 256 bytes
- No unknown tokens (worst case: bytes)
- Used by GPT-2, GPT-3, BART
Trade-offs:
- Slightly worse compression (bytes vs characters)
- More tokens for non-ASCII text
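The trade-off is easy to see by counting bytes directly: every non-ASCII character expands to multiple byte-level symbols before any merges apply.

```python
text = "Hello 🌍"
data = text.encode("utf-8")

# 6 ASCII characters map to 1 byte each; the emoji needs 4 bytes
print(len(text), len(data))  # 7 characters become 10 bytes
```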
BPE variants
SentencePiece BPE:
- Language-independent (no pre-tokenization)
- Treats input as raw byte stream
- Used by T5, ALBERT, XLNet
BPE-dropout:
- Randomly skips merges while encoding training data
- Exposes the model to multiple segmentations of the same word, making tokenization more robust at inference
- Reduces overfitting to a single canonical tokenization
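The dropout idea can be sketched by extending plain merge replay with a skip probability (a minimal sketch, not the original implementation; the merge list is assumed):

```python
import random

def bpe_dropout_encode(word, merges, p=0.1, seed=None):
    """Replay BPE merges, skipping each eligible merge with probability p."""
    rng = random.Random(seed)
    tokens = list(word)
    for pair in merges:
        out, i = [], 0
        while i < len(tokens):
            if (i + 1 < len(tokens)
                    and (tokens[i], tokens[i + 1]) == pair
                    and rng.random() >= p):
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]
for seed in range(3):
    # Different seeds yield different segmentations of the same word
    print(bpe_dropout_encode("lowest", merges, p=0.3, seed=seed))
```

With p=0.0 this degenerates to ordinary BPE encoding.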
WordPiece
Algorithm overview
WordPiece is similar to BPE but uses a different merge selection criterion.
Training process:
- Initialize vocabulary with all characters
- Count frequency of all token pairs
- Score each pair: score = freq(pair) / (freq(first) × freq(second))
- Merge pair with highest score
- Repeat until vocabulary size reached
Why different scoring?
BPE: Merges most frequent pairs
- "aa" appears 100 times → high priority
- Even if 'a' appears 1000 times alone
WordPiece: Merges pairs that are semantically related
- "aa" appears 100 times, 'a' appears 1000 times → low score (100 / (1000 × 1000))
- "th" appears 50 times, 't' appears 60 times, 'h' appears 55 times → high score (50 / (60 × 55))
- Prioritizes pairs that appear together more than expected
Step-by-step example
Corpus:
low: 5
lower: 2
newest: 6
widest: 3
Iteration 1:
Count token frequencies:
'e': 17 (lower: 2, newest: 6 × 2 occurrences, widest: 3)
'l': 7, 'o': 7, 's': 9, 't': 9, 'i': 3, 'd': 3
...
Count pairs:
'e' + 's': 9 (newest: 6, widest: 3)
'l' + 'o': 7 (low: 5, lower: 2)
'i' + 'd': 3 (widest: 3)
...
Compute scores:
score('e' + 's') = 9 / (17 × 9) ≈ 0.059
score('l' + 'o') = 7 / (7 × 7) ≈ 0.143
score('i' + 'd') = 3 / (3 × 3) ≈ 0.333 ← highest score
Merge: 'i' + 'd' → 'id'
Key difference: WordPiece favors pairs whose parts rarely appear apart. The frequent pair 'e' + 's' loses because 'e' is common everywhere, while the rarer 'i' + 'd' wins because 'i' and 'd' occur only together.
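The scoring can be checked directly on the character-level toy corpus (a minimal sketch, not the library implementation):

```python
from collections import Counter

def wordpiece_scores(corpus):
    """score(a, b) = freq(pair a,b) / (freq(a) * freq(b))."""
    token_freq, pair_freq = Counter(), Counter()
    for tokens, freq in corpus:
        for t in tokens:
            token_freq[t] += freq
        for a, b in zip(tokens, tokens[1:]):
            pair_freq[(a, b)] += freq
    return {pair: f / (token_freq[pair[0]] * token_freq[pair[1]])
            for pair, f in pair_freq.items()}

# Toy corpus, pre-split into characters
corpus = [(list("low"), 5), (list("lower"), 2),
          (list("newest"), 6), (list("widest"), 3)]

scores = wordpiece_scores(corpus)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))  # ('i', 'd') 0.333
```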
Tokenization with WordPiece
Given vocabulary: ['##e', '##s', '##t', '##est', 'l', 'o', 'w', 'new', 'low']
Tokenize "lowest":
Step 1: Find longest matching prefix
'lowest' → 'low' (matches)
Step 2: Find longest match for remainder (continuations carry the ## prefix)
'est' → '##est' (matches)
Final: ['low', '##est']
If no match:
Tokenize "unknownword":
'unknownword' → no match
'unknown' → no match
'unkn' → no match
'un' → no match
'u' → no match
→ [UNK]
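This greedy longest-match-first procedure can be sketched as follows (the toy vocabulary is assumed for illustration):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of one word."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate from the right
        if match is None:
            return [unk]  # any unmatchable span makes the whole word [UNK]
        tokens.append(match)
        start = end
    return tokens

vocab = {"##e", "##s", "##t", "##est", "l", "o", "w", "new", "low"}
print(wordpiece_tokenize("lowest", vocab))       # ['low', '##est']
print(wordpiece_tokenize("unknownword", vocab))  # ['[UNK]']
```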
Implementation
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer
# Initialize BERT-style tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# Normalization (lowercase, accent stripping)
tokenizer.normalizer = BertNormalizer(lowercase=True)
# Pre-tokenization (whitespace + punctuation)
tokenizer.pre_tokenizer = BertPreTokenizer()
# Configure trainer
trainer = WordPieceTrainer(
    vocab_size=30522,  # BERT vocab size
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"  # BERT uses ##
)
# Train
tokenizer.train_from_iterator(corpus, trainer=trainer)
# Use
output = tokenizer.encode("Tokenization works great!")
print(output.tokens) # ['token', '##ization', 'works', 'great', '!']
Subword prefix
BERT uses ## prefix:
"unbelievable" → ['un', '##believ', '##able']
Why?
- Indicates token is a continuation
- Allows reconstruction: remove ##, concatenate
- Helps model distinguish word boundaries
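Reconstruction really is that simple, as a short sketch shows:

```python
def detokenize(tokens):
    """Rejoin WordPiece tokens: '##' marks continuation of the previous piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip '##', glue onto previous piece
        else:
            words.append(tok)
    return " ".join(words)

print(detokenize(["un", "##believ", "##able"]))  # unbelievable
```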
WordPiece advantages
Semantic merges:
- Prioritizes meaningful combinations
- "qu" has high score (always together)
- "qx" has low score (rare combination)
Better for morphology:
- Captures affixes: un-, -ing, -ed
- Preserves word stems
Trade-offs:
- Slower training than BPE
- More memory (stores vocabulary, not merges)
- Original implementation not open-source (HF reimplementation)
Unigram
Algorithm overview
Unigram works backward: start with large vocabulary, remove tokens.
Training process:
- Initialize with large vocabulary (all substrings)
- Estimate probability of each token (frequency-based)
- For each token, compute loss increase if removed
- Remove 10-20% of tokens with lowest loss impact
- Re-estimate probabilities
- Repeat until desired vocabulary size
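The pruning criterion in steps 3-4 can be illustrated with a toy sketch (the corpus, probabilities, and helper names here are invented for illustration, not the real EM procedure):

```python
from math import log, inf

def best_logprob(word, probs):
    """Viterbi: log-probability of the best segmentation of `word`."""
    n = len(word)
    dp = [-inf] * (n + 1)
    dp[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in probs and dp[j] > -inf:
                dp[i] = max(dp[i], dp[j] + log(probs[piece]))
    return dp[n]

def corpus_loss(corpus, probs):
    """Negative log-likelihood of the corpus under the unigram model."""
    return -sum(freq * best_logprob(word, probs)
                for word, freq in corpus.items())

corpus = {"low": 5, "lowest": 3}
probs = {"low": 0.4, "est": 0.2, "l": 0.1, "o": 0.1, "w": 0.1,
         "e": 0.04, "s": 0.03, "t": 0.03}

base = corpus_loss(corpus, probs)
for tok in ["e", "est"]:
    without = {t: p for t, p in probs.items() if t != tok}
    print(tok, round(corpus_loss(corpus, without) - base, 3))
# Removing 'e' costs nothing (no best segmentation uses it);
# removing 'est' raises the loss sharply, so 'e' is pruned first.
```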
Probabilistic tokenization
Unigram assumption: Each token is independent.
Given vocabulary with probabilities:
P('low') = 0.02
P('l') = 0.01
P('o') = 0.015
P('w') = 0.01
P('est') = 0.03
P('e') = 0.02
P('s') = 0.015
P('t') = 0.015
Tokenize "lowest":
Option 1: ['low', 'est']
P = P('low') × P('est') = 0.02 × 0.03 = 0.0006
Option 2: ['l', 'o', 'w', 'est']
P = 0.01 × 0.015 × 0.01 × 0.03 = 0.000000045
Option 3: ['low', 'e', 's', 't']
P = 0.02 × 0.02 × 0.015 × 0.015 = 0.00000009
Choose option 1 (highest probability)
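The comparison above is just a product of per-token probabilities, which a few lines verify:

```python
import math

# Token probabilities from the example above
probs = {"low": 0.02, "l": 0.01, "o": 0.015, "w": 0.01,
         "est": 0.03, "e": 0.02, "s": 0.015, "t": 0.015}

def seg_prob(tokens):
    """Probability of a segmentation under the independence assumption."""
    return math.prod(probs[t] for t in tokens)

options = [["low", "est"], ["l", "o", "w", "est"], ["low", "e", "s", "t"]]
best = max(options, key=seg_prob)
print(best)  # ['low', 'est']
```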
Viterbi algorithm
Finding best tokenization is expensive (exponential possibilities).
Viterbi algorithm (dynamic programming):
from math import log

def tokenize_viterbi(word, vocab, probs):
    n = len(word)
    # dp[i] = (best_log_prob, best_tokens) for word[:i]; None if unreachable
    dp = [None] * (n + 1)
    dp[0] = (0.0, [])  # empty prefix: log probability 0
    for i in range(1, n + 1):
        best = None
        # Try all possible last tokens word[j:i]
        for j in range(i):
            token = word[j:i]
            if dp[j] is not None and token in vocab:
                prob = dp[j][0] + log(probs[token])
                if best is None or prob > best[0]:
                    best = (prob, dp[j][1] + [token])
        dp[i] = best
    return dp[n][1] if dp[n] is not None else None
Time complexity: O(n²) vocabulary lookups, versus the O(2^(n-1)) possible segmentations a brute-force search would enumerate
Implementation
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
# Initialize
tokenizer = Tokenizer(Unigram())
# Configure trainer
trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>",
    max_piece_length=16,   # Max token length
    n_sub_iterations=2,    # EM iterations
    shrinking_factor=0.75  # Remove 25% each iteration
)
# Train
tokenizer.train_from_iterator(corpus, trainer=trainer)
# Use
output = tokenizer.encode("Tokenization with Unigram")
print(output.tokens) # ['▁Token', 'ization', '▁with', '▁Un', 'igram']
Unigram advantages
Probabilistic:
- Multiple valid tokenizations
- Can sample different tokenizations (data augmentation)
Subword regularization:
# The tokenizers library always returns the single best segmentation;
# sampling is exposed through SentencePiece:
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="unigram.model")
for _ in range(3):
    print(sp.encode("tokenization", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
# Output (different each time), e.g.:
# ['▁token', 'ization']
# ['▁to', 'ken', 'ization']
# ['▁token', 'iz', 'ation']
Language-independent:
- No word boundaries needed
- Works for CJK languages (Chinese, Japanese, Korean)
- Treats input as character stream
Trade-offs:
- Slower training (EM algorithm)
- More hyperparameters
- Larger model (stores probabilities)
Algorithm comparison
Training speed
| Algorithm | Small (10MB) | Medium (100MB) | Large (1GB) |
|---|---|---|---|
| BPE | 10-15 sec | 1-2 min | 10-20 min |
| WordPiece | 15-20 sec | 2-3 min | 15-30 min |
| Unigram | 20-30 sec | 3-5 min | 30-60 min |
Tested on: 16-core CPU, 30k vocab
Tokenization quality
Measured on English Wikipedia text:
| Algorithm | Vocab Size | Tokens/Word | Unknown Rate |
|---|---|---|---|
| BPE | 30k | 1.3 | 0.5% |
| WordPiece | 30k | 1.2 | 1.2% |
| Unigram | 8k | 1.5 | 0.3% |
Key observations:
- WordPiece: Slightly better compression (fewer tokens per word)
- BPE: Lower unknown rate than WordPiece
- Unigram: Smallest vocabulary with the best coverage
Compression ratio
Characters per token (higher = better compression):
| Language | BPE (30k) | WordPiece (30k) | Unigram (8k) |
|---|---|---|---|
| English | 4.2 | 4.5 | 3.8 |
| Chinese | 2.1 | 2.3 | 2.5 |
| Arabic | 3.5 | 3.8 | 3.2 |
Best for each:
- English: WordPiece
- Chinese: Unigram (language-independent)
- Arabic: WordPiece
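The characters-per-token metric is simple to compute for any tokenizer; here is a sketch using whitespace splitting as a stand-in tokenize function:

```python
def chars_per_token(texts, tokenize):
    """Average characters per token: higher = better compression."""
    chars = sum(len(t) for t in texts)
    toks = sum(len(tokenize(t)) for t in texts)
    return chars / toks

texts = ["the quick brown fox"]
print(chars_per_token(texts, str.split))  # 19 chars / 4 tokens = 4.75
```

Swap in any trained tokenizer's encode function to compare algorithms on your own corpus.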
Use case recommendations
BPE - Best for:
- English language models
- Code (handles symbols well)
- Fast training needed
- Models: GPT-2, GPT-3, RoBERTa, BART
WordPiece - Best for:
- Masked language modeling (BERT-style)
- Morphologically rich languages
- Semantic understanding tasks
- Models: BERT, DistilBERT, ELECTRA
Unigram - Best for:
- Multilingual models
- Languages without word boundaries (CJK)
- Data augmentation via subword regularization
- Models: T5, ALBERT, XLNet (via SentencePiece)
Advanced topics
Handling rare words
BPE approach:
"antidisestablishmentarianism"
→ ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
WordPiece approach:
"antidisestablishmentarianism"
→ ['anti', '##dis', '##establish', '##ment', '##arian', '##ism']
Unigram approach:
"antidisestablishmentarianism"
→ ['▁anti', 'dis', 'establish', 'ment', 'arian', 'ism']
Handling numbers
Challenge: Infinite number combinations
BPE solution: Byte-level (handles any digit sequence)
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()
# Handles any number
"123456789" → byte-level tokens
WordPiece solution: Digit pre-tokenization
from tokenizers.pre_tokenizers import Digits
# Split digits individually or as groups
tokenizer.pre_tokenizer = Digits(individual_digits=True)
"123" → ['1', '2', '3']
Unigram solution: Learns common number patterns
# Learns patterns during training
"2023" → ['202', '3'] or ['20', '23']
Handling case sensitivity
Lowercase (BERT):
from tokenizers.normalizers import Lowercase
tokenizer.normalizer = Lowercase()
"Hello WORLD" → "hello world" → ['hello', 'world']
Preserve case (GPT-2):
# No case normalization
tokenizer.normalizer = None
"Hello WORLD" → ['Hello', 'WORLD']
Cased tokens (RoBERTa):
# Learns separate tokens for different cases
Vocabulary: ['Hello', 'hello', 'HELLO', 'world', 'WORLD']
Handling emojis and special characters
Byte-level (GPT-2):
tokenizer.pre_tokenizer = ByteLevel()
"Hello 🌍 👋" → byte-level representation (always works)
Unicode normalization:
from tokenizers.normalizers import NFKC
tokenizer.normalizer = NFKC()
"é" (composed) ↔ "é" (decomposed) → normalized to one form
Troubleshooting
Issue: Poor subword splitting
Symptom:
"running" → ['r', 'u', 'n', 'n', 'i', 'n', 'g'] (too granular)
Solutions:
- Increase vocabulary size
- Train longer (more merge iterations)
- Lower the min_frequency threshold
Issue: Too many unknown tokens
Symptom:
5% of tokens are [UNK]
Solutions:
- Increase vocabulary size
- Use byte-level BPE (no UNK possible)
- Verify training corpus is representative
Issue: Inconsistent tokenization
Symptom:
"running" → ['run', 'ning']
"runner" → ['r', 'u', 'n', 'n', 'e', 'r']
Solutions:
- Check normalization consistency
- Ensure pre-tokenization is deterministic
- Use Unigram for probabilistic variance
Best practices
Match algorithm to model architecture:
- BERT-style → WordPiece
- GPT-style → BPE
- T5-style → Unigram
Use byte-level for multilingual:
- Handles any Unicode
- No unknown tokens
Test on representative data:
- Measure compression ratio
- Check unknown token rate
- Inspect sample tokenizations
Version control tokenizers:
- Save with model
- Document special tokens
- Track vocabulary changes