# Transformers Integration

Complete guide to using HuggingFace Tokenizers with the Transformers library.

## AutoTokenizer

The easiest way to load tokenizers.

### Loading pretrained tokenizers

```python
from transformers import AutoTokenizer

# Load from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using fast tokenizer (Rust-based)
print(tokenizer.is_fast)  # True

# Access underlying tokenizers.Tokenizer
if tokenizer.is_fast:
    fast_tokenizer = tokenizer.backend_tokenizer
    print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
```

### Fast vs slow tokenizers

| Feature            | Fast (Rust)     | Slow (Python) |
|--------------------|-----------------|---------------|
| Speed              | 5-10× faster    | Baseline      |
| Alignment tracking | ✅ Full support | ❌ Limited    |
| Batch processing   | ✅ Optimized    | ⚠️ Slower     |
| Offset mapping     | ✅ Yes          | ❌ No         |
| Installation       | `tokenizers`    | Built-in      |

**Always use fast tokenizers when available.**

### Check available tokenizers

```python
from transformers import TOKENIZER_MAPPING

# List all fast tokenizers
for config_class, (slow, fast) in TOKENIZER_MAPPING.items():
    if fast is not None:
        print(f"{config_class.__name__}: {fast.__name__}")
```

## PreTrainedTokenizerFast

Wrap custom tokenizers for transformers.

### Convert custom tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
# unk_token lets the model map unknown symbols to [UNK]; the pre-tokenizer
# splits text into words so merges are not learned across whole lines
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Save tokenizer
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    mask_token="[MASK]"
)

# Save in transformers format
transformers_tokenizer.save_pretrained("my-tokenizer")
```

**Result**: Directory with `tokenizer.json` + `tokenizer_config.json` + `special_tokens_map.json`

### Use like any transformers tokenizer

```python
# Load
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my-tokenizer")

# Encode with all transformers features
outputs = tokenizer(
    "Hello world",
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

print(outputs.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
```

## Special tokens

### Default special tokens

| Model Family | CLS/BOS           | SEP/EOS           | PAD     | UNK               | MASK     |
|--------------|-------------------|-------------------|---------|-------------------|----------|
| BERT         | `[CLS]`           | `[SEP]`           | `[PAD]` | `[UNK]`           | `[MASK]` |
| GPT-2        | `<\|endoftext\|>` | `<\|endoftext\|>` | -       | `<\|endoftext\|>` | -        |
| RoBERTa      | `<s>`             | `</s>`            | `<pad>` | `<unk>`           | `<mask>` |
| T5           | -                 | `</s>`            | `<pad>` | `<unk>`           | -        |

GPT-2 defines no padding token by default; a common workaround is `tokenizer.pad_token = tokenizer.eos_token`.

### Adding special tokens

```python
# Add new special tokens
special_tokens_dict = {
    "additional_special_tokens": ["<|image|>", "<|video|>", "<|audio|>"]
}

num_added_tokens = tokenizer.add_special_tokens(special_tokens_dict)
print(f"Added {num_added_tokens} tokens")

# Resize model embeddings
# (assumes `model` is the already-loaded model that uses this tokenizer)
model.resize_token_embeddings(len(tokenizer))

# Use new tokens
text = "This is an image: <|image|>"
tokens = tokenizer.encode(text)
```

### Adding regular tokens

```python
# Add domain-specific tokens
new_tokens = ["COVID-19", "mRNA", "vaccine"]

# Regular added tokens (special_tokens=False is the default):
# they are matched as whole units but still go through normalization
num_added = tokenizer.add_tokens(new_tokens, special_tokens=False)

# Alternative: register them as special tokens instead, which mainly
# changes normalization (for example, they are not lower-cased)
# num_added = tokenizer.add_tokens(new_tokens, special_tokens=True)

# Either way, resize the model embeddings afterwards
# model.resize_token_embeddings(len(tokenizer))
```

## Encoding and decoding

### Basic encoding

```python
# Single sentence
text = "Hello, how are you?"
encoded = tokenizer(text)

print(encoded)
# {'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
#  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```

### Batch encoding

```python
# Multiple sentences
texts = ["Hello world", "How are you?", "I am fine"]
encoded = tokenizer(texts, padding=True, truncation=True, max_length=10)

print(encoded['input_ids'])
# padding=True pads to the longest sequence in the batch (6 tokens here),
# not to max_length; max_length only caps truncation
# [[101, 7592, 2088, 102, 0, 0],
#  [101, 2129, 2024, 2017, 1029, 102],
#  [101, 1045, 2572, 2986, 102, 0]]
```

### Return tensors

```python
# Return PyTorch tensors
outputs = tokenizer("Hello world", return_tensors="pt")
# 4 tokens with bert-base-uncased: [CLS] hello world [SEP]
print(outputs['input_ids'].shape)  # torch.Size([1, 4])

# Return TensorFlow tensors (requires TensorFlow)
outputs = tokenizer("Hello world", return_tensors="tf")

# Return NumPy arrays
outputs = tokenizer("Hello world", return_tensors="np")

# Return lists (default)
outputs = tokenizer("Hello world", return_tensors=None)
```

### Decoding

```python
# Decode token IDs
ids = [101, 7592, 2088, 102]
text = tokenizer.decode(ids)
print(text)  # "[CLS] hello world [SEP]"

# Skip special tokens
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)  # "hello world"

# Batch decode
batch_ids = [[101, 7592, 102], [101, 2088, 102]]
texts = tokenizer.batch_decode(batch_ids, skip_special_tokens=True)
print(texts)  # ['hello', 'world']
```

## Padding and truncation

### Padding strategies

```python
# Pad to the longest sequence in the batch (same as padding=True)
tokenizer(texts, padding="longest")

# Pad every sequence to a fixed length given by max_length
tokenizer(texts, padding="max_length", max_length=128)

# No padding (default)
tokenizer(texts, padding=False)

# Pad to multiple of value (for efficient computation)
tokenizer(texts, padding="max_length", max_length=128, pad_to_multiple_of=8)
# Result: length will be 128 (already multiple of 8)
```

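The padding strategies differ only in how the target length is chosen. The sketch below reproduces that arithmetic in plain Python, assuming right-side padding and a pad ID of 0. `pad_batch` is a hypothetical helper written for illustration; it is not part of the Transformers API and it does not truncate.

```python
def pad_batch(batch, pad_id=0, max_length=None, pad_to_multiple_of=None):
    """Right-pad lists of token IDs and build the matching attention masks.

    max_length=None mimics padding="longest"; a number mimics padding="max_length".
    """
    target = max_length if max_length is not None else max(len(seq) for seq in batch)
    if pad_to_multiple_of:
        # Round the target length up to the next multiple
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in batch]
    return input_ids, attention_mask


batch = [[101, 7592, 2088, 102], [101, 2129, 2024, 2017, 1029, 102]]
ids_longest, mask_longest = pad_batch(batch)
ids_mult8, mask_mult8 = pad_batch(batch, pad_to_multiple_of=8)
print(len(ids_longest[0]), len(ids_mult8[0]))  # 6 8
```

The attention mask marks real tokens with 1 and padding with 0, which is what lets the model ignore the padded positions.
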
### Truncation strategies

```python
# Truncate to max length
tokenizer(text, truncation=True, max_length=10)

# Only truncate first sequence (for pairs)
tokenizer(text1, text2, truncation="only_first", max_length=20)

# Only truncate second sequence
tokenizer(text1, text2, truncation="only_second", max_length=20)

# Truncate longest first (what truncation=True means for pairs)
tokenizer(text1, text2, truncation="longest_first", max_length=20)

# No truncation (default): over-long inputs produce a warning here
# and may fail later inside the model
tokenizer(text, truncation=False)
```

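The three pair strategies can be compared on plain lists. The sketch below is an approximation for illustration only: `truncate_pair` is a hypothetical helper, special tokens are not counted, and the tie-breaking rule for `longest_first` (trim the second sequence when both are equally long) is an assumption rather than documented library behaviour.

```python
def truncate_pair(first, second, max_tokens, strategy="longest_first"):
    """Trim two token lists so that together they hold at most max_tokens tokens."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    first, second = list(first), list(second)
    excess = len(first) + len(second) - max_tokens
    if excess <= 0:
        return first, second
    if strategy == "only_first":
        if excess > len(first):
            raise ValueError("first sequence is too short to absorb the truncation")
        return first[:len(first) - excess], second
    if strategy == "only_second":
        if excess > len(second):
            raise ValueError("second sequence is too short to absorb the truncation")
        return first, second[:len(second) - excess]
    if strategy == "longest_first":
        for _ in range(excess):
            # Drop one token at a time from whichever sequence is currently longer
            if len(first) > len(second):
                first.pop()
            else:
                second.pop()
        return first, second
    raise ValueError(f"unknown strategy: {strategy}")


question = list(range(10))       # 10 tokens
context = list(range(100, 106))  # 6 tokens
balanced = truncate_pair(question, context, max_tokens=8)
first_only = truncate_pair(question, context, max_tokens=8, strategy="only_first")
print([len(seq) for seq in balanced], [len(seq) for seq in first_only])  # [4, 4] [2, 6]
```

`longest_first` balances the two sequences, while `only_first` and `only_second` protect one side completely, which is why question answering pipelines typically truncate only the context.
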
### Stride for long documents

```python
# For documents longer than max_length
text = "Very long document " * 1000

# Encode with overlap
encodings = tokenizer(
    text,
    max_length=512,
    stride=128,  # Overlap between chunks
    truncation=True,
    return_overflowing_tokens=True,
    return_offsets_mapping=True
)

# Get all chunks
num_chunks = len(encodings['input_ids'])
print(f"Split into {num_chunks} chunks")

# Each chunk overlaps by stride tokens
for i, chunk in enumerate(encodings['input_ids']):
    print(f"Chunk {i}: {len(chunk)} tokens")
```

**Use case**: Long document QA, sliding window inference

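To make the overlap concrete, the following sketch reproduces the windowing arithmetic in plain Python on a list of token IDs. It is an illustration only, not the library's implementation: `sliding_windows` is a hypothetical helper, and it ignores the special tokens that a real tokenizer adds to every chunk.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into chunks of at most max_length tokens, where each
    chunk starts with the last `stride` tokens of the previous chunk."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        start += step
    return chunks


ids = list(range(10))
chunks = sliding_windows(ids, max_length=4, stride=2)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk}")
# Chunk 0: [0, 1, 2, 3]
# Chunk 1: [2, 3, 4, 5]
# Chunk 2: [4, 5, 6, 7]
# Chunk 3: [6, 7, 8, 9]
```

A larger stride means more chunks and more repeated computation, but it lowers the risk that an answer span is cut at a chunk boundary.
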
## Alignment and offsets

### Offset mapping

```python
# Get character offsets for each token
encoded = tokenizer("Hello, world!", return_offsets_mapping=True)

# For a single input, offset_mapping is already a flat list of (start, end) pairs
for token, (start, end) in zip(
    encoded.tokens(),
    encoded['offset_mapping']
):
    print(f"{token:10s} → [{start:2d}, {end:2d})")

# Output (bert-base-uncased lowercases the tokens):
# [CLS]      → [ 0,  0)
# hello      → [ 0,  5)
# ,          → [ 5,  6)
# world      → [ 7, 12)
# !          → [12, 13)
# [SEP]      → [ 0,  0)
```

### Word IDs

```python
# Get word index for each token
encoded = tokenizer("Hello world", return_offsets_mapping=True)
word_ids = encoded.word_ids()

print(word_ids)
# [None, 0, 1, None]
# None = special token, 0 = first word, 1 = second word
```

**Use case**: Token classification (NER, POS tagging)

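For token classification, word-level labels have to be expanded to token level. The sketch below shows one common convention, assuming the `word_ids` list comes from `encoded.word_ids()`: special tokens and non-initial subwords receive `-100`, the value that PyTorch's `CrossEntropyLoss` ignores by default. `align_labels` is a hypothetical helper, and the sample `word_ids` are made up to include a word split into two subwords.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Expand one label per word into one label per token."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous or label_all_subwords:
            # First subword of a word (or every subword, if requested)
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword: excluded from the loss
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


word_ids = [None, 0, 1, 1, 2, None]  # the second word was split into two subwords
word_labels = [0, 3, 0]              # one label ID per word
token_labels = align_labels(word_ids, word_labels)
print(token_labels)  # [-100, 0, 3, -100, 0, -100]
```
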
### Character to token mapping

```python
text = "Machine learning is awesome"
encoded = tokenizer(text, return_offsets_mapping=True)

# Find token for character position
char_pos = 8  # "l" in "learning"
token_idx = encoded.char_to_token(char_pos)

print(f"Character {char_pos} is in token {token_idx}: {encoded.tokens()[token_idx]}")
# Character 8 is in token 2: learning
```

**Use case**: Question answering (map answer character span to tokens)

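When a whole answer span has to be mapped rather than a single character, the offset mapping can be scanned directly. The sketch below assumes offsets shaped like those a fast tokenizer returns, with `(0, 0)` for special tokens; the offsets are written out by hand here so the example runs without downloading a model, and `char_span_to_token_span` is a hypothetical helper.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Return (first_token, last_token) covering characters [start_char, end_char),
    or None if no token overlaps the span."""
    covered = [
        index
        for index, (start, end) in enumerate(offsets)
        # Skip empty offsets (special tokens) and keep tokens overlapping the span
        if start != end and start < end_char and end > start_char
    ]
    if not covered:
        return None
    return covered[0], covered[-1]


text = "Machine learning is awesome"
# Hand-written offsets for: [CLS] machine learning is awesome [SEP]
offsets = [(0, 0), (0, 7), (8, 16), (17, 19), (20, 27), (0, 0)]

answer = "learning is"
start_char = text.index(answer)
end_char = start_char + len(answer)
span = char_span_to_token_span(offsets, start_char, end_char)
print(span)  # (2, 3)
```
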
### Sequence pairs

```python
# Encode sentence pair
encoded = tokenizer("Question here", "Answer here", return_offsets_mapping=True)

# Get sequence IDs (which sequence each token belongs to)
sequence_ids = encoded.sequence_ids()
print(sequence_ids)
# With bert-base-uncased: [CLS] question here [SEP] answer here [SEP]
# [None, 0, 0, None, 1, 1, None]
# None = special token, 0 = question, 1 = answer
```

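Sequence IDs are typically used to restrict processing to one side of the pair, for example to search for an answer only inside the second sequence. A minimal sketch, assuming a `sequence_ids` list in the format shown above; `tokens_in_sequence` is a hypothetical helper name.

```python
def tokens_in_sequence(sequence_ids, which=1):
    """Return the token positions that belong to the given sequence (0 or 1)."""
    return [index for index, seq_id in enumerate(sequence_ids) if seq_id == which]


# Layout: [CLS] question here [SEP] answer here [SEP]
sequence_ids = [None, 0, 0, None, 1, 1, None]
first_positions = tokens_in_sequence(sequence_ids, which=0)
second_positions = tokens_in_sequence(sequence_ids, which=1)
print(first_positions, second_positions)  # [1, 2] [4, 5]
```
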
## Model integration

### Use with transformers models

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize
text = "Hello world"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Get embeddings
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state.shape)  # [1, seq_len, hidden_size]
```

### Custom model with custom tokenizer

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import BertConfig, BertModel, PreTrainedTokenizerFast

# Train custom tokenizer. The special tokens are part of the trained
# vocabulary so that their IDs stay inside the model's embedding matrix
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["data.txt"], trainer=trainer)

# Wrap for transformers
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]"
)

# Create model whose vocab size matches the tokenizer
config = BertConfig(
    vocab_size=len(fast_tokenizer),
    pad_token_id=fast_tokenizer.pad_token_id
)
model = BertModel(config)

# Use together
inputs = fast_tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
```

### Save and load together

```python
# Save both (`tokenizer` must be a transformers tokenizer,
# not a raw tokenizers.Tokenizer)
model.save_pretrained("my-model")
tokenizer.save_pretrained("my-model")

# Directory structure:
# my-model/
# ├── config.json
# ├── model.safetensors   (pytorch_model.bin in older versions)
# ├── tokenizer.json
# ├── tokenizer_config.json
# └── special_tokens_map.json

# Load both
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("my-model")
tokenizer = AutoTokenizer.from_pretrained("my-model")
```

## Advanced features

### Multimodal tokenization

```python
from transformers import AutoTokenizer

# LLaVA-style (image + text)
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Add image placeholder token
# (adds nothing to the vocabulary if the checkpoint already defines it)
tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})

# Use in prompt (text side only; the image itself is handled separately)
text = "Describe this image: <image>"
inputs = tokenizer(text, return_tensors="pt")
```

### Template formatting

```python
# Chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "What's the weather?"}
]

# Apply chat template (if tokenizer has one)
if getattr(tokenizer, "chat_template", None):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True  # end with the start of the assistant turn
    )
    # Templates usually insert special tokens themselves, so do not add them twice
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
```

### Custom template

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

# Define chat template (Jinja). Text is emitted through {{ ... }} expressions
# so that '\n' becomes a real newline and no stray whitespace leaks into the prompt
tokenizer.chat_template = (
    "{%- for message in messages %}"
    "{%- if message['role'] == 'system' %}"
    "{{ 'System: ' + message['content'] + '\\n' }}"
    "{%- elif message['role'] == 'user' %}"
    "{{ 'User: ' + message['content'] + '\\n' }}"
    "{%- elif message['role'] == 'assistant' %}"
    "{{ 'Assistant: ' + message['content'] + '\\n' }}"
    "{%- endif %}"
    "{%- endfor %}"
    "{%- if add_generation_prompt %}"
    "{{ 'Assistant:' }}"
    "{%- endif %}"
)

# Use template (`messages` is a list of role/content dicts)
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```

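When writing a custom template it helps to pin down the prompt string it is meant to produce. The sketch below is a plain-Python rendering of the intended `Role: content` format, one line per message followed by an open `Assistant:` turn. It does not call Jinja or Transformers; `render_chat` and `ROLE_LABELS` are hypothetical names, useful mainly as a reference value in unit tests.

```python
ROLE_LABELS = {"system": "System", "user": "User", "assistant": "Assistant"}


def render_chat(messages, add_generation_prompt=True):
    """Render messages as 'Role: content' lines, optionally ending with 'Assistant:'."""
    parts = []
    for message in messages:
        label = ROLE_LABELS.get(message["role"])
        if label is None:
            continue  # unknown roles are skipped, like an unmatched if/elif chain
        parts.append(f"{label}: {message['content']}\n")
    if add_generation_prompt:
        parts.append("Assistant:")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = render_chat(messages)
print(prompt)
# System: You are a helpful assistant.
# User: Hello!
# Assistant:
```
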
## Performance optimization

### Batch processing

```python
# Process large datasets efficiently
from datasets import load_dataset

dataset = load_dataset("imdb", split="train[:1000]")

# Tokenize in batches
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

# Map over dataset (batched)
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    num_proc=4  # Parallel processing
)
```

### Caching

```python
from functools import lru_cache

from transformers import AutoTokenizer

# cache_dir controls where the downloaded tokenizer files are stored;
# it does not cache tokenization results
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased",
    use_fast=True,
    cache_dir="./cache"  # Cache tokenizer files
)

# Tokenize with caching (results are memoized per input string)
@lru_cache(maxsize=10000)
def cached_tokenize(text):
    return tuple(tokenizer.encode(text))

# Reuses cached results for repeated inputs
```

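Whether the memoization pays off can be checked with `cache_info()`. The sketch below swaps the real tokenizer for a stand-in function, `toy_encode` (an assumption made so the example runs without downloading a model); the caching pattern itself is the same.

```python
from functools import lru_cache


def toy_encode(text):
    # Stand-in for tokenizer.encode: one fake ID per whitespace-separated word
    return [len(word) for word in text.split()]


@lru_cache(maxsize=10000)
def cached_tokenize(text):
    # Return a tuple so callers cannot mutate the cached value
    return tuple(toy_encode(text))


for query in ["hello world", "how are you", "hello world"]:
    cached_tokenize(query)

info = cached_tokenize.cache_info()
print(info.hits, info.misses)  # 1 2
```

A low hit count relative to misses suggests the inputs rarely repeat, in which case the cache only costs memory.
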
### Memory efficiency

```python
# For very large datasets, use streaming
from datasets import load_dataset

# "pile" is an illustrative dataset ID; substitute one you have access to
dataset = load_dataset("pile", split="train", streaming=True)

def process_batch(batch):
    # Tokenize
    tokens = tokenizer(batch["text"], truncation=True, max_length=512)

    # Process tokens...

    return tokens

# Process in chunks (memory efficient)
for batch in dataset.batch(batch_size=1000):
    processed = process_batch(batch)
```

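If the data source has no `.batch()` method, for example a plain generator over a file, the same chunked processing can be sketched with the standard library. `iter_batches` is a hypothetical helper; it holds only one batch in memory at a time.

```python
from itertools import islice


def iter_batches(examples, batch_size):
    """Yield lists of up to batch_size items from any iterable, lazily."""
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_stream():
    # Stand-in for a streamed dataset: yields one record at a time
    for index in range(5):
        yield {"text": f"document {index}"}


batch_sizes = []
for batch in iter_batches(fake_stream(), batch_size=2):
    texts = [example["text"] for example in batch]  # what would go to the tokenizer
    batch_sizes.append(len(texts))

print(batch_sizes)  # [2, 2, 1]
```
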
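## Troubleshooting

### Issue: Tokenizer not fast

**Symptom**:
```python
tokenizer.is_fast  # False
```

**Solution**: Make sure the `tokenizers` library is installed and up to date, and check that the tokenizer was not loaded with `use_fast=False`. Some models ship only a slow (Python) tokenizer, in which case no fast version is available.
```bash
pip install --upgrade tokenizers
```
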
### Issue: Special tokens not working

**Symptom**: Special tokens are split into subwords

**Solution**: Add as special tokens, not regular tokens
```python
# Wrong
tokenizer.add_tokens(["<|image|>"])

# Correct
tokenizer.add_special_tokens({"additional_special_tokens": ["<|image|>"]})
```

### Issue: Offset mapping not available

**Symptom**:
```python
tokenizer("text", return_offsets_mapping=True)
# NotImplementedError: offset mapping is not available with slow (Python) tokenizers
```

**Solution**: Use fast tokenizer
```python
from transformers import AutoTokenizer

# Load fast version
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
```

### Issue: Padding inconsistent

**Symptom**: Some sequences padded, others not

**Solution**: Specify padding strategy
```python
# Explicit padding
tokenizer(
    texts,
    padding="max_length",  # or "longest"
    max_length=128
)
```

## Best practices

1. **Always use fast tokenizers**:
   - 5-10× faster
   - Full alignment tracking
   - Better batch processing

2. **Save tokenizer with model**:
   - Ensures reproducibility
   - Prevents version mismatches

3. **Use batch processing for datasets**:
   - Tokenize with `.map(batched=True)`
   - Set `num_proc` for parallelism

4. **Enable caching for repeated inputs**:
   - Use `lru_cache` for inference
   - Cache tokenizer files with `cache_dir`

5. **Handle special tokens properly**:
   - Use `add_special_tokens()` for never-split tokens
   - Resize embeddings after adding tokens

6. **Test alignment for downstream tasks**:
   - Verify `offset_mapping` is correct
   - Test `char_to_token()` on samples

7. **Version control tokenizer config**:
   - Save `tokenizer_config.json`
   - Document custom templates
   - Track vocabulary changes