feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap (#3934)

* feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap

Map active skills to Telegram's slash command menu so users can discover and invoke skills directly. Three changes:

1. Telegram menu now includes active skill commands alongside built-in commands, capped at 100 entries (Telegram Bot API limit). Overflow commands remain callable but hidden from the picker. Logged at startup when cap is hit.
2. New /commands [page] gateway command for paginated browsing of all commands + skills. /help now shows first 10 skill commands and points to /commands for the full list.
3. When a user types a slash command that matches a disabled or uninstalled skill, they get actionable guidance:
   - Disabled: 'Enable it with: hermes skills config'
   - Optional (not installed): 'Install with: hermes skills install official/<path>'

Built on ideas from PR #3921 by @kshitijk4poor.

* chore: move 21 niche skills to optional-skills

Move specialized/niche skills from built-in (skills/) to optional (optional-skills/) to reduce the default skill count. Users can install them with:

    hermes skills install official/<category>/<name>

Moved skills (21):
- mlops: accelerate, chroma, faiss, flash-attention, hermes-atropos-environments, huggingface-tokenizers, instructor, lambda-labs, llava, nemo-curator, pinecone, pytorch-lightning, qdrant, saelens, simpo, slime, tensorrt-llm, torchtitan
- research: domain-intel, duckduckgo-search
- devops: inference-sh cli

Built-in skills: 96 → 75
Optional skills: 22 → 43

* fix: only include repo built-in skills in Telegram menu, not user-installed

User-installed skills (from hub or manually added) stay accessible via /skills and by typing the command directly, but don't get registered in the Telegram slash command picker. Only skills whose SKILL.md is under the repo's skills/ directory are included in the menu.

This keeps the Telegram menu focused on the curated built-in set while user-installed skills remain discoverable through /skills and /commands.
2026-04-25 00:51:20 +00:00 · 2026-03-30 10:57:30 -07:00 · 2026-03-30 10:57:30 -07:00 · 5ceed021dc
commit 5ceed021dc
parent 97d6813f51
73 changed files with 163 additions and 4 deletions
--- a/optional-skills/mlops/huggingface-tokenizers/references/algorithms.md
+++ b/optional-skills/mlops/huggingface-tokenizers/references/algorithms.md
@@ -0,0 +1,653 @@
+# Tokenization Algorithms Deep Dive
+
+Comprehensive explanation of BPE, WordPiece, and Unigram algorithms.
+
+## Byte-Pair Encoding (BPE)
+
+### Algorithm overview
+
+BPE iteratively merges the most frequent pair of tokens in a corpus.
+
+**Training process**:
+1. Initialize vocabulary with all characters
+2. Count frequency of all adjacent token pairs
+3. Merge most frequent pair into new token
+4. Add new token to vocabulary
+5. Update corpus with new token
+6. Repeat until vocabulary size reached
+
+### Step-by-step example
+
+**Corpus**:
+```
+low: 5
+lower: 2
+newest: 6
+widest: 3
+```
+
+**Iteration 1**:
+```
+Count pairs:
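+'e' + 's': 9 (newest: 6, widest: 3)  ← most frequent (tied with 's' + 't'; first pair wins)
+'s' + 't': 9 (newest: 6, widest: 3)
+'w' + 'e': 8 (lower: 2, newest: 6)
+'l' + 'o': 7
+'o' + 'w': 7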
+...
+
+Merge: 'e' + 's' → 'es'
+
+Updated corpus:
+low: 5
+lower: 2
+newest: 6 → n e w es t: 6
+widest: 3 → w i d es t: 3
+
+Vocabulary: [a-z] + ['es']
+```
+
+**Iteration 2**:
+```
+Count pairs:
+'es' + 't': 9  ← most frequent
+'l' + 'o': 7
+...
+
+Merge: 'es' + 't' → 'est'
+
+Updated corpus:
+low: 5
+lower: 2
+newest: 6 → n e w est: 6
+widest: 3 → w i d est: 3
+
+Vocabulary: [a-z] + ['es', 'est']
+```
+
+**Continue until desired vocabulary size...**
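+
+The training loop above can be sketched in plain Python. This is a minimal illustration, not the `tokenizers` implementation: `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical helper names, and ties are assumed to go to the pair counted first.
+
+```python
+from collections import Counter
+
+def count_pairs(corpus):
+    """Count adjacent symbol pairs, weighted by word frequency."""
+    pairs = Counter()
+    for symbols, freq in corpus.items():
+        for a, b in zip(symbols, symbols[1:]):
+            pairs[(a, b)] += freq
+    return pairs
+
+def merge_pair(corpus, pair):
+    """Replace every occurrence of `pair` with the concatenated symbol."""
+    merged = {}
+    for symbols, freq in corpus.items():
+        out, i = [], 0
+        while i < len(symbols):
+            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
+                out.append(symbols[i] + symbols[i + 1])
+                i += 2
+            else:
+                out.append(symbols[i])
+                i += 1
+        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
+    return merged
+
+def train_bpe(word_freqs, num_merges):
+    """Learn `num_merges` merges from a {word: frequency} dictionary."""
+    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
+    merges = []
+    for _ in range(num_merges):
+        pairs = count_pairs(corpus)
+        if not pairs:
+            break
+        best = max(pairs, key=pairs.get)  # first pair wins on ties
+        corpus = merge_pair(corpus, best)
+        merges.append(best)
+    return merges, corpus
+
+merges, corpus = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 2)
+print(merges)  # [('e', 's'), ('es', 't')]
+```
+
+On the toy corpus this reproduces the first two merges of the walkthrough. Production trainers add pre-tokenization, `min_frequency` filtering, and incremental pair bookkeeping.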
+
+### Tokenization with trained BPE
+
+Given vocabulary: `['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'es', 'est', 'lo', 'low', 'ne', 'new', 'newest', 'wi', 'wid', 'widest']`
+
+Tokenize "lowest":
+```
+Step 1: Split into characters
+['l', 'o', 'w', 'e', 's', 't']
+
+Step 2: Apply merges in order learned during training
+- Merge 'l' + 'o' → 'lo' (if this merge was learned)
+- Merge 'lo' + 'w' → 'low' (if learned)
+- Merge 'e' + 's' → 'es' (learned)
+- Merge 'es' + 't' → 'est' (learned)
+
+Final: ['low', 'est']
+```
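+
+Applying learned merges to a new word can be sketched as follows. The merge list here is an assumption chosen to match the steps above (the walkthrough only confirms the first two merges), and `bpe_encode` is a hypothetical helper name.
+
+```python
+def bpe_encode(word, merges):
+    """Split `word` into characters, then replay merges in learned order."""
+    symbols = list(word)
+    for a, b in merges:
+        out, i = [], 0
+        while i < len(symbols):
+            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
+                out.append(a + b)
+                i += 2
+            else:
+                out.append(symbols[i])
+                i += 1
+        symbols = out
+    return symbols
+
+# Assumed merge order for illustration
+merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
+print(bpe_encode("lowest", merges))  # ['low', 'est']
+```
+
+Order matters: a merge such as `('es', 't')` can only fire after `('e', 's')` has produced the `'es'` symbol.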
+
+### Implementation
+
+```python
+from tokenizers import Tokenizer
+from tokenizers.models import BPE
+from tokenizers.trainers import BpeTrainer
+from tokenizers.pre_tokenizers import Whitespace
+
+# Initialize
+tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
+tokenizer.pre_tokenizer = Whitespace()
+
+# Configure trainer
+trainer = BpeTrainer(
+    vocab_size=1000,
+    min_frequency=2,
+    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
+)
+
+# Train
+corpus = [
+    "This is a sample corpus for BPE training.",
+    "BPE learns subword units from the training data.",
+    # ... more sentences
+]
+
+tokenizer.train_from_iterator(corpus, trainer=trainer)
+
+# Use
+output = tokenizer.encode("This is tokenization")
+print(output.tokens)  # e.g. ['This', 'is', 'token', 'ization'] (exact splits depend on the training corpus)
+```
+
+### Byte-level BPE (GPT-2 variant)
+
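+**Problem**: Character-level BPE needs every character in its base vocabulary. Unicode defines well over 100,000 characters, so any character unseen during training becomes an unknown token.
+
+**Solution**: Operate on UTF-8 bytes, so the base vocabulary is exactly 256 symbols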
+
+```python
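+from tokenizers import Tokenizer
+from tokenizers.models import BPE
+from tokenizers.trainers import BpeTrainer
+from tokenizers.pre_tokenizers import ByteLevel
+from tokenizers.decoders import ByteLevel as ByteLevelDecoder
+
+tokenizer = Tokenizer(BPE())
+
+# Byte-level pre-tokenization and decoding
+tokenizer.pre_tokenizer = ByteLevel()
+tokenizer.decoder = ByteLevelDecoder()
+
+# Seed the vocabulary with all 256 byte symbols so no input is out of vocabulary
+trainer = BpeTrainer(vocab_size=1000, initial_alphabet=ByteLevel.alphabet())
+tokenizer.train_from_iterator(corpus, trainer=trainer)  # corpus: list of strings, as above
+
+# This handles ALL possible characters, including emojis
+text = "Hello 🌍 世界"
+tokens = tokenizer.encode(text).tokens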
+
+**Advantages**:
+- Handles any Unicode character (256 byte coverage)
+- No unknown tokens (worst case: bytes)
+- Used by GPT-2, GPT-3, BART
+
+**Trade-offs**:
+- Slightly worse compression (bytes vs characters)
+- More tokens for non-ASCII text
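+
+The trade-off for non-ASCII text follows from UTF-8 itself, as this standard-library sketch shows (the variable names are illustrative only):
+
+```python
+text = "Hello 🌍 世界"
+encoded = text.encode("utf-8")
+
+# Number of UTF-8 bytes each character expands to
+bytes_per_char = {ch: len(ch.encode("utf-8")) for ch in text}
+
+print(len(text))             # 10 characters
+print(len(encoded))          # 17 bytes
+print(bytes_per_char["H"])   # 1
+print(bytes_per_char["世"])  # 3
+print(bytes_per_char["🌍"])  # 4
+```
+
+Before any merges are learned, a byte-level tokenizer sees 17 symbols where a character-level one sees 10, which is why non-ASCII text tends to cost more tokens.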
+
+### BPE variants
+
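+**SentencePiece BPE**:
+- Language-independent (no pre-tokenization)
+- Treats input as a raw character stream, with spaces encoded as `▁`
+- Used by Llama 2 and Mistral 7B (T5, ALBERT, and XLNet use SentencePiece's Unigram model)
+
+**BPE-dropout**:
+- Randomly skips merges while tokenizing training text (`dropout` parameter of `BPE`)
+- Exposes the model to several segmentations of the same word
+- Reduces overfitting to a single segmentation of the training data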
+
+## WordPiece
+
+### Algorithm overview
+
+WordPiece is similar to BPE but uses a different merge selection criterion.
+
+**Training process**:
+1. Initialize vocabulary with all characters
+2. Count frequency of all token pairs
+3. Score each pair: `score = freq(pair) / (freq(first) × freq(second))`
+4. Merge pair with highest score
+5. Repeat until vocabulary size reached
+
+### Why different scoring?
+
+**BPE**: Merges most frequent pairs
+- "aa" appears 100 times → high priority
+- Even if 'a' appears 1000 times alone
+
+**WordPiece**: Merges pairs whose parts occur together more often than their individual frequencies predict
+- "aa" appears 100 times, 'a' appears 1000 times → low score (100 / (1000 × 1000))
+- "th" appears 50 times, 't' appears 60 times, 'h' appears 55 times → high score (50 / (60 × 55))
+- Prioritizes pairs that appear together more than expected
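+
+The two scores quoted above can be checked directly. `wordpiece_score` is a hypothetical helper that implements only the scoring formula from the training process, not a full trainer.
+
+```python
+def wordpiece_score(pair_freq, first_freq, second_freq):
+    """score = freq(pair) / (freq(first) * freq(second))"""
+    return pair_freq / (first_freq * second_freq)
+
+score_aa = wordpiece_score(100, 1000, 1000)
+score_th = wordpiece_score(50, 60, 55)
+
+print(f"{score_aa:.6f}")  # 0.000100
+print(f"{score_th:.6f}")  # 0.015152
+```
+
+BPE would merge "aa" first (100 occurrences versus 50), while WordPiece ranks "th" roughly 150 times higher.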
+
+### Step-by-step example
+
+**Corpus**:
+```
+low: 5
+lower: 2
+newest: 6
+widest: 3
+```
+
+**Iteration 1**:
+```
+Count frequencies:
+'e': 17 (lower: 2, newest: 2 × 6, widest: 3)
+'s': 9
+'t': 9
+'l': 7
+'o': 7
+'i': 3
+'d': 3
+...
+
+Count pairs:
+'e' + 's': 9 (newest: 6, widest: 3)
+'s' + 't': 9 (newest: 6, widest: 3)
+'l' + 'o': 7 (low: 5, lower: 2)
+'i' + 'd': 3 (widest: 3)
+...
+
+Compute scores:
+score('e' + 's') = 9 / (17 × 9) = 0.059
+score('s' + 't') = 9 / (9 × 9) = 0.111
+score('l' + 'o') = 7 / (7 × 7) = 0.143
+score('i' + 'd') = 3 / (3 × 3) = 0.333  ← highest score
+
+Choose: 'i' + 'd' → 'id'
+```
+
+**Key difference**: BPE would merge the most frequent pair first ('e' + 's', 9 occurrences). WordPiece prefers 'i' + 'd' (3 occurrences) because 'i' and 'd' never occur apart in this corpus.
+
+### Tokenization with WordPiece
+
+Given vocabulary: `['l', 'o', 'w', '##e', '##s', '##t', 'new', 'low', '##est']`
+
+Tokenize "lowest":
+```
+Step 1: Find longest matching prefix
+'lowest' → 'low' (matches)
+
+Step 2: Find longest match for remainder (continuation pieces carry the ## prefix)
+'##est' → '##est' (matches)
+
+Final: ['low', '##est']
+```
+
+**If no match**:
+```
+Tokenize "unknownword":
+'unknownword' → no match
+'unknown' → no match
+'unkn' → no match
+'un' → no match
+'u' → no match
+→ [UNK]
+```
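+
+Greedy longest-match-first tokenization can be sketched as below. This is an illustration under stated assumptions, not the library's code: `wordpiece_tokenize` is a hypothetical helper, continuation pieces are assumed to be stored with a `##` prefix, and a word with any unmatched remainder is mapped to `[UNK]` as a whole.
+
+```python
+def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
+    """Repeatedly take the longest vocabulary piece that matches the remainder."""
+    tokens, start = [], 0
+    while start < len(word):
+        end = len(word)
+        match = None
+        while end > start:
+            piece = word[start:end]
+            if start > 0:
+                piece = prefix + piece  # non-initial pieces are continuations
+            if piece in vocab:
+                match = piece
+                break
+            end -= 1
+        if match is None:
+            return [unk]  # no piece fits: the whole word is unknown
+        tokens.append(match)
+        start = end
+    return tokens
+
+# Assumed toy vocabulary for illustration
+vocab = {"l", "o", "w", "##e", "##s", "##t", "new", "low", "##est"}
+print(wordpiece_tokenize("lowest", vocab))       # ['low', '##est']
+print(wordpiece_tokenize("unknownword", vocab))  # ['[UNK]']
+```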
+
+### Implementation
+
+```python
+from tokenizers import Tokenizer
+from tokenizers.models import WordPiece
+from tokenizers.trainers import WordPieceTrainer
+from tokenizers.normalizers import BertNormalizer
+from tokenizers.pre_tokenizers import BertPreTokenizer
+
+# Initialize BERT-style tokenizer
+tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
+
+# Normalization (lowercase, accent stripping)
+tokenizer.normalizer = BertNormalizer(lowercase=True)
+
+# Pre-tokenization (whitespace + punctuation)
+tokenizer.pre_tokenizer = BertPreTokenizer()
+
+# Configure trainer
+trainer = WordPieceTrainer(
+    vocab_size=30522,  # BERT vocab size
+    min_frequency=2,
+    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
+    continuing_subword_prefix="##"  # BERT uses ##
+)
+
+# Train (corpus: an iterable of strings, as in the BPE example)
+tokenizer.train_from_iterator(corpus, trainer=trainer)
+
+# Use
+output = tokenizer.encode("Tokenization works great!")
+print(output.tokens)  # e.g. ['token', '##ization', 'works', 'great', '!'] (depends on the corpus)
+```
+
+### Subword prefix
+
+**BERT uses `##` prefix**:
+```
+"unbelievable" → ['un', '##believ', '##able']
+```
+
+**Why?**
+- Indicates token is a continuation
+- Allows reconstruction: remove ##, concatenate
+- Helps model distinguish word boundaries
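+
+The reconstruction rule (remove `##`, concatenate) fits in a few lines. `join_wordpieces` is a hypothetical helper for illustration; in practice a tokenizer's configured decoder performs this step.
+
+```python
+def join_wordpieces(tokens, prefix="##"):
+    """Glue continuation pieces onto the preceding piece; others start a new word."""
+    words = []
+    for token in tokens:
+        if token.startswith(prefix) and words:
+            words[-1] += token[len(prefix):]
+        else:
+            words.append(token)
+    return " ".join(words)
+
+print(join_wordpieces(["un", "##believ", "##able"]))     # unbelievable
+print(join_wordpieces(["token", "##ization", "works"]))  # tokenization works
+```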
+
+### WordPiece advantages
+
+**Association-based merges**:
+- Prioritizes pairs that almost always occur together
+- "qu" has high score (always together)
+- "qx" has low score (rare combination)
+
+**Better for morphology**:
+- Captures affixes: un-, -ing, -ed
+- Preserves word stems
+
+**Trade-offs**:
+- Slower training than BPE
+- Stores only the final vocabulary (no merge list), so encoding relies on greedy longest-match instead of replaying merges
+- Original implementation not open-source (HF reimplementation)
+
+## Unigram
+
+### Algorithm overview
+
+Unigram works backward: start with large vocabulary, remove tokens.
+
+**Training process**:
+1. Initialize with large vocabulary (all substrings)
+2. Estimate probability of each token (frequency-based)
+3. For each token, compute loss increase if removed
+4. Remove 10-20% of tokens with lowest loss impact
+5. Re-estimate probabilities
+6. Repeat until desired vocabulary size
+
+### Probabilistic tokenization
+
+**Unigram assumption**: Each token is independent.
+
+Given vocabulary with probabilities:
+```
+P('low') = 0.02
+P('l') = 0.01
+P('o') = 0.015
+P('w') = 0.01
+P('est') = 0.03
+P('e') = 0.02
+P('s') = 0.015
+P('t') = 0.015
+```
+
+Tokenize "lowest":
+```
+Option 1: ['low', 'est']
+P = P('low') × P('est') = 0.02 × 0.03 = 0.0006
+
+Option 2: ['l', 'o', 'w', 'est']
+P = 0.01 × 0.015 × 0.01 × 0.03 = 0.000000045
+
+Option 3: ['low', 'e', 's', 't']
+P = 0.02 × 0.02 × 0.015 × 0.015 = 0.00000009
+
+Choose option 1 (highest probability)
+```
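+
+For a word this short, the comparison can be verified by brute force. The sketch below enumerates every segmentation that the example vocabulary allows; `segmentations` and `probability` are hypothetical helper names.
+
+```python
+from math import prod
+
+probs = {
+    "low": 0.02, "l": 0.01, "o": 0.015, "w": 0.01,
+    "est": 0.03, "e": 0.02, "s": 0.015, "t": 0.015,
+}
+
+def segmentations(word, probs):
+    """Yield every way to split `word` into pieces found in `probs`."""
+    if not word:
+        yield []
+        return
+    for end in range(1, len(word) + 1):
+        piece = word[:end]
+        if piece in probs:
+            for rest in segmentations(word[end:], probs):
+                yield [piece] + rest
+
+def probability(tokens, probs):
+    """Unigram assumption: multiply independent piece probabilities."""
+    return prod(probs[t] for t in tokens)
+
+candidates = list(segmentations("lowest", probs))
+best = max(candidates, key=lambda toks: probability(toks, probs))
+
+print(len(candidates))  # 4
+print(best)             # ['low', 'est']
+```
+
+Enumeration grows exponentially with word length, so it is only useful for checking small examples.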
+
+### Viterbi algorithm
+
+Finding best tokenization is expensive (exponential possibilities).
+
+**Viterbi algorithm** (dynamic programming):
+```python
+from math import log
+
+def tokenize_viterbi(word, vocab, probs):
+    """Return the most probable segmentation of `word` (empty list if none exists)."""
+    n = len(word)
+    # dp[i] = (best log-probability, best tokens) for word[:i]
+    dp = [(float('-inf'), []) for _ in range(n + 1)]
+    dp[0] = (0.0, [])  # empty prefix: log(1) = 0
+
+    for i in range(1, n + 1):
+        # Try every possible last token word[j:i]
+        for j in range(i):
+            token = word[j:i]
+            # Skip prefixes that cannot be segmented at all
+            if token in vocab and dp[j][0] > float('-inf'):
+                score = dp[j][0] + log(probs[token])
+                if score > dp[i][0]:
+                    dp[i] = (score, dp[j][1] + [token])
+
+    return dp[n][1]
+```
+
+**Time complexity**: O(n²) vocabulary lookups (each start/end pair is checked once) vs O(2^n) brute force
+
+### Implementation
+
+```python
+from tokenizers import Tokenizer
+from tokenizers.models import Unigram
+from tokenizers.trainers import UnigramTrainer
+from tokenizers.pre_tokenizers import Metaspace
+
+# Initialize
+tokenizer = Tokenizer(Unigram())
+
+# Mark word boundaries with '▁' (SentencePiece convention)
+tokenizer.pre_tokenizer = Metaspace()
+
+# Configure trainer
+trainer = UnigramTrainer(
+    vocab_size=8000,
+    special_tokens=["<unk>", "<s>", "</s>"],
+    unk_token="<unk>",
+    max_piece_length=16,      # Max token length
+    n_sub_iterations=2,       # EM iterations
+    shrinking_factor=0.75     # Remove 25% each iteration
+)
+
+# Train (corpus: an iterable of strings, as in the BPE example)
+tokenizer.train_from_iterator(corpus, trainer=trainer)
+
+# Use
+output = tokenizer.encode("Tokenization with Unigram")
+print(output.tokens)  # e.g. ['▁Token', 'ization', '▁with', '▁Un', 'igram'] (depends on the corpus)
+```
+
+### Unigram advantages
+
+**Probabilistic**:
+- Multiple valid tokenizations
+- Can sample different tokenizations (data augmentation)
+
+**Subword regularization**:
+```python
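+# tokenizer.encode() is deterministic and returns the same split on every call.
+# Sampling is available in the sentencepiece library:
+import sentencepiece as spm
+
+sp = spm.SentencePieceProcessor(model_file="unigram.model")  # hypothetical model path
+for _ in range(3):
+    tokens = sp.encode("tokenization", out_type=str,
+                       enable_sampling=True, alpha=0.1, nbest_size=-1)
+    print(tokens)
+
+# Possible output (varies between calls):
+# ['▁token', 'ization']
+# ['▁tok', 'en', 'ization']
+# ['▁token', 'iz', 'ation']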
+```
+
+**Language-independent**:
+- No word boundaries needed
+- Works for CJK languages (Chinese, Japanese, Korean)
+- Treats input as character stream
+
+**Trade-offs**:
+- Slower training (EM algorithm)
+- More hyperparameters
+- Larger model (stores probabilities)
+
+## Algorithm comparison
+
+### Training speed
+
+| Algorithm  | Small (10MB) | Medium (100MB) | Large (1GB) |
+|------------|--------------|----------------|-------------|
+| BPE        | 10-15 sec    | 1-2 min        | 10-20 min   |
+| WordPiece  | 15-20 sec    | 2-3 min        | 15-30 min   |
+| Unigram    | 20-30 sec    | 3-5 min        | 30-60 min   |
+
+**Tested on**: 16-core CPU, 30k vocab
+
+### Tokenization quality
+
+Tested on English Wikipedia:
+
+| Algorithm  | Vocab Size | Tokens/Word | Unknown Rate |
+|------------|------------|-------------|--------------|
+| BPE        | 30k        | 1.3         | 0.5%         |
+| WordPiece  | 30k        | 1.2         | 1.2%         |
+| Unigram    | 8k         | 1.5         | 0.3%         |
+
+**Key observations**:
+- WordPiece: Slightly better compression (fewest tokens per word)
+- BPE: Lower unknown rate than WordPiece
+- Unigram: Smallest vocab, lowest unknown rate
+
+### Compression ratio
+
+Characters per token (higher = better compression):
+
+| Language | BPE (30k) | WordPiece (30k) | Unigram (8k) |
+|----------|-----------|-----------------|--------------|
+| English  | 4.2       | 4.5             | 3.8          |
+| Chinese  | 2.1       | 2.3             | 2.5          |
+| Arabic   | 3.5       | 3.8             | 3.2          |
+
+**Best for each**:
+- English: WordPiece
+- Chinese: Unigram (language-independent)
+- Arabic: WordPiece
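+
+Characters per token is straightforward to measure on your own data. The sketch below assumes a `tokenize` callable that maps a string to a list of tokens; `chars_per_token` is a hypothetical helper and the whitespace tokenizer is only a stand-in.
+
+```python
+def chars_per_token(texts, tokenize):
+    """Average number of characters represented by one token."""
+    total_chars = sum(len(text) for text in texts)
+    total_tokens = sum(len(tokenize(text)) for text in texts)
+    return total_chars / total_tokens if total_tokens else 0.0
+
+# Stand-in tokenizer for illustration: split on whitespace
+texts = ["low lower", "newest widest"]
+ratio = chars_per_token(texts, str.split)
+print(ratio)  # 5.5
+```
+
+With a trained `tokenizers` object, something like `lambda t: tokenizer.encode(t).tokens` can be passed as `tokenize`. Whitespace is counted in `total_chars` here, so keep the convention fixed when comparing tokenizers.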
+
+### Use case recommendations
+
+**BPE** - Best for:
+- English language models
+- Code (handles symbols well)
+- Fast training needed
+- **Models**: GPT-2, GPT-3, RoBERTa, BART
+
+**WordPiece** - Best for:
+- Masked language modeling (BERT-style)
+- Morphologically rich languages
+- Semantic understanding tasks
+- **Models**: BERT, DistilBERT, ELECTRA
+
+**Unigram** - Best for:
+- Multilingual models
+- Languages without word boundaries (CJK)
+- Data augmentation via subword regularization
+- **Models**: T5, ALBERT, XLNet (via SentencePiece)
+
+## Advanced topics
+
+### Handling rare words
+
+**BPE approach**:
+```
+"antidisestablishmentarianism"
+→ ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
+```
+
+**WordPiece approach**:
+```
+"antidisestablishmentarianism"
+→ ['anti', '##dis', '##establish', '##ment', '##arian', '##ism']
+```
+
+**Unigram approach**:
+```
+"antidisestablishmentarianism"
+→ ['▁anti', 'dis', 'establish', 'ment', 'arian', 'ism']
+```
+
+### Handling numbers
+
+**Challenge**: Infinite number combinations
+
+**BPE solution**: Byte-level (handles any digit sequence)
+```python
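+from tokenizers import Tokenizer
+from tokenizers.models import BPE
+from tokenizers.pre_tokenizers import ByteLevel
+
+tokenizer = Tokenizer(BPE())
+tokenizer.pre_tokenizer = ByteLevel()
+
+# Handles any number
+# "123456789" → byte-level tokens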
+```
+
+**WordPiece solution**: Digit pre-tokenization
+```python
+from tokenizers.pre_tokenizers import Digits
+
+# Split digits individually or as groups
+tokenizer.pre_tokenizer = Digits(individual_digits=True)
+
+# "123" → ['1', '2', '3']
+```
+
+**Unigram solution**: Learns common number patterns
+```python
+# Learns patterns during training
+# "2023" → ['202', '3'] or ['20', '23']
+```
+
+### Handling case sensitivity
+
+**Lowercase (BERT)**:
+```python
+from tokenizers.normalizers import Lowercase
+
+tokenizer.normalizer = Lowercase()
+
+# "Hello WORLD" → "hello world" → ['hello', 'world']
+```
+
+**Preserve case (GPT-2)**:
+```python
+# No case normalization
+tokenizer.normalizer = None
+
+# "Hello WORLD" → ['Hello', 'WORLD']
+```
+
+**Cased tokens (RoBERTa)**:
+```python
+# Learns separate tokens for different cases
+# Vocabulary: ['Hello', 'hello', 'HELLO', 'world', 'WORLD']
+```
+
+### Handling emojis and special characters
+
+**Byte-level (GPT-2)**:
+```python
+tokenizer.pre_tokenizer = ByteLevel()
+
+# "Hello 🌍 👋" → byte-level representation (always works)
+```
+
+**Unicode normalization**:
+```python
+from tokenizers.normalizers import NFKC
+
+tokenizer.normalizer = NFKC()
+
+# "é" as one code point (U+00E9) and "e" + combining accent (U+0301) → both normalized to U+00E9
+```
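+
+The effect of NFKC can be checked with Python's standard `unicodedata` module, independent of any tokenizer. The variable names are illustrative.
+
+```python
+import unicodedata
+
+composed = "\u00e9"     # é as a single code point
+decomposed = "e\u0301"  # e followed by a combining acute accent
+
+norm_composed = unicodedata.normalize("NFKC", composed)
+norm_decomposed = unicodedata.normalize("NFKC", decomposed)
+
+print(composed == decomposed)            # False
+print(norm_composed == norm_decomposed)  # True
+print(len(decomposed), len(norm_decomposed))  # 2 1
+
+# NFKC also folds compatibility characters such as the "fi" ligature
+ligature = unicodedata.normalize("NFKC", "\ufb01")
+print(ligature)  # fi
+```
+
+Without normalization, the two spellings of "é" would be tokenized differently even though they render identically.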
+
+## Troubleshooting
+
+### Issue: Poor subword splitting
+
+**Symptom**:
+```
+"running" → ['r', 'u', 'n', 'n', 'i', 'n', 'g']  (too granular)
+```
+
+**Solutions**:
+1. Increase vocabulary size
+2. Train on a larger corpus (more evidence for frequent merges)
+3. Lower `min_frequency` threshold
+
+### Issue: Too many unknown tokens
+
+**Symptom**:
+```
+5% of tokens are [UNK]
+```
+
+**Solutions**:
+1. Increase vocabulary size
+2. Use byte-level BPE (no UNK possible)
+3. Verify training corpus is representative
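+
+To quantify the symptom before and after a change, the unknown-token rate can be computed over a sample of tokenized text. `unk_rate` is a hypothetical helper, and the sample token lists are made up for illustration.
+
+```python
+def unk_rate(token_lists, unk_token="[UNK]"):
+    """Fraction of all tokens that are the unknown token."""
+    total = sum(len(tokens) for tokens in token_lists)
+    unknown = sum(tokens.count(unk_token) for tokens in token_lists)
+    return unknown / total if total else 0.0
+
+# Made-up tokenizer output for three inputs
+sample = [["low", "##est"], ["[UNK]"], ["new", "##est", "[UNK]", "wid"]]
+rate = unk_rate(sample)
+print(f"{rate:.1%}")  # 28.6%
+```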
+
+### Issue: Inconsistent tokenization
+
+**Symptom**:
+```
+"running" → ['run', 'ning']
+"runner" → ['r', 'u', 'n', 'n', 'e', 'r']
+```
+
+**Solutions**:
+1. Check normalization consistency
+2. Ensure pre-tokenization is deterministic
+3. Use Unigram for probabilistic variance
+
+## Best practices
+
+1. **Match algorithm to model architecture**:
+   - BERT-style → WordPiece
+   - GPT-style → BPE
+   - T5-style → Unigram
+
+2. **Use byte-level for multilingual**:
+   - Handles any Unicode
+   - No unknown tokens
+
+3. **Test on representative data**:
+   - Measure compression ratio
+   - Check unknown token rate
+   - Inspect sample tokenizations
+
+4. **Version control tokenizers**:
+   - Save with model
+   - Document special tokens
+   - Track vocabulary changes