# Transformers Integration

Complete guide to using HuggingFace Tokenizers with the Transformers library.

## AutoTokenizer

The easiest way to load tokenizers.

### Loading pretrained tokenizers

```python
from transformers import AutoTokenizer

# Load from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using fast tokenizer (Rust-based)
print(tokenizer.is_fast)  # True

# Access underlying tokenizers.Tokenizer
if tokenizer.is_fast:
    fast_tokenizer = tokenizer.backend_tokenizer
    print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
```

### Fast vs slow tokenizers

| Feature            | Fast (Rust)     | Slow (Python) |
|--------------------|-----------------|---------------|
| Speed              | 5-10× faster    | Baseline      |
| Alignment tracking | ✅ Full support | ❌ Limited    |
| Batch processing   | ✅ Optimized    | ⚠️ Slower     |
| Offset mapping     | ✅ Yes          | ❌ No         |
| Installation       | `tokenizers`    | Built-in      |

**Always use fast tokenizers when available.**

### Check available tokenizers

```python
from transformers import TOKENIZER_MAPPING

# List all fast tokenizers
for config_class, (slow, fast) in TOKENIZER_MAPPING.items():
    if fast is not None:
        print(f"{config_class.__name__}: {fast.__name__}")
```

## PreTrainedTokenizerFast

Wrap custom tokenizers for transformers.

### Convert custom tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Save tokenizer
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    mask_token="[MASK]"
)

# Save in transformers format
transformers_tokenizer.save_pretrained("my-tokenizer")
```

**Result**: Directory with `tokenizer.json` + `tokenizer_config.json` + `special_tokens_map.json`

### Use like any transformers tokenizer

```python
# Load
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("my-tokenizer")

# Encode with all transformers features
outputs = tokenizer(
    "Hello world",
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

print(outputs.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
```

## Special tokens

### Default special tokens

| Model Family | CLS/BOS | SEP/EOS         | PAD                      | UNK             | MASK     |
|--------------|---------|-----------------|--------------------------|-----------------|----------|
| BERT         | [CLS]   | [SEP]           | [PAD]                    | [UNK]           | [MASK]   |
| GPT-2        | -       | <\|endoftext\|> | None (often set to EOS)  | <\|endoftext\|> | -        |
| RoBERTa      | `<s>`   | `</s>`          | `<pad>`                  | `<unk>`         | `<mask>` |
| T5           | -       | `</s>`          | `<pad>`                  | `<unk>`         | -        |

### Adding special tokens

```python
# Add new special tokens
special_tokens_dict = {
    "additional_special_tokens": ["<|image|>", "<|video|>", "<|audio|>"]
}
num_added_tokens = tokenizer.add_special_tokens(special_tokens_dict)
print(f"Added {num_added_tokens} tokens")

# Resize model embeddings to cover the new tokens
model.resize_token_embeddings(len(tokenizer))

# Use new tokens
text = "This is an image: <|image|>"
tokens = tokenizer.encode(text)
```

### Adding regular tokens

```python
# Add domain-specific tokens
new_tokens = ["COVID-19", "mRNA", "vaccine"]
num_added = tokenizer.add_tokens(new_tokens)

# These are NOT special tokens (can be split if needed)
tokenizer.add_tokens(new_tokens, special_tokens=False)

# These ARE special tokens (never split)
tokenizer.add_tokens(new_tokens, special_tokens=True)
```
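To confirm the behavior, a quick sanity check along these lines can help. This is a sketch; the exact subword pieces depend on the base vocabulary and normalizer:

```python
# Sanity check (illustrative; exact pieces depend on the vocabulary):
# added tokens should come back as single pieces, not subwords.
print(tokenizer.tokenize("mRNA vaccine data: <|image|>"))
# e.g. ['mRNA', 'vaccine', 'data', ':', '<|image|>']

# Special tokens (and tokens added with special_tokens=True) are
# dropped when decoding with skip_special_tokens=True.
ids = tokenizer.encode("data <|image|>")
print(tokenizer.decode(ids, skip_special_tokens=True))  # e.g. 'data'
```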
## Encoding and decoding

### Basic encoding

```python
# Single sentence
text = "Hello, how are you?"
encoded = tokenizer(text)

print(encoded)
# {'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
#  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```

### Batch encoding

```python
# Multiple sentences
texts = ["Hello world", "How are you?", "I am fine"]
encoded = tokenizer(texts, padding="max_length", truncation=True, max_length=10)

print(encoded['input_ids'])
# [[101, 7592, 2088, 102, 0, 0, 0, 0, 0, 0],
#  [101, 2129, 2024, 2017, 1029, 102, 0, 0, 0, 0],
#  [101, 1045, 2572, 2986, 102, 0, 0, 0, 0, 0]]
```

### Return tensors

```python
# Return PyTorch tensors
outputs = tokenizer("Hello world", return_tensors="pt")
print(outputs['input_ids'].shape)  # torch.Size([1, 4])

# Return TensorFlow tensors
outputs = tokenizer("Hello world", return_tensors="tf")

# Return NumPy arrays
outputs = tokenizer("Hello world", return_tensors="np")

# Return lists (default)
outputs = tokenizer("Hello world", return_tensors=None)
```

### Decoding

```python
# Decode token IDs
ids = [101, 7592, 2088, 102]
text = tokenizer.decode(ids)
print(text)  # "[CLS] hello world [SEP]"

# Skip special tokens
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)  # "hello world"

# Batch decode
batch_ids = [[101, 7592, 102], [101, 2088, 102]]
texts = tokenizer.batch_decode(batch_ids, skip_special_tokens=True)
print(texts)  # ["hello", "world"]
```

## Padding and truncation

### Padding strategies

```python
# Pad to the longest sequence in the batch (same as padding=True)
tokenizer(texts, padding="longest")

# Pad to a fixed length
tokenizer(texts, padding="max_length", max_length=128)

# No padding
tokenizer(texts, padding=False)

# Pad to a multiple of a value (for efficient computation)
tokenizer(texts, padding="max_length", max_length=128, pad_to_multiple_of=8)
# Result: length will be 128 (already a multiple of 8)
```

### Truncation strategies

```python
# Truncate to max length
tokenizer(text, truncation=True, max_length=10)

# Only truncate the first sequence (for pairs)
tokenizer(text1, text2, truncation="only_first", max_length=20)

# Only truncate the second sequence
tokenizer(text1, text2, truncation="only_second", max_length=20)

# Truncate the longer sequence first (default for pairs)
tokenizer(text1, text2, truncation="longest_first", max_length=20)

# No truncation (over-long inputs will fail in the model)
tokenizer(text, truncation=False)
```

### Stride for long documents

```python
# For documents longer than max_length
text = "Very long document " * 1000

# Encode with overlap
encodings = tokenizer(
    text,
    max_length=512,
    stride=128,  # Overlap between chunks
    truncation=True,
    return_overflowing_tokens=True,
    return_offsets_mapping=True
)

# Get all chunks
num_chunks = len(encodings['input_ids'])
print(f"Split into {num_chunks} chunks")

# Consecutive chunks overlap by `stride` tokens
for i, chunk in enumerate(encodings['input_ids']):
    print(f"Chunk {i}: {len(chunk)} tokens")
```

**Use case**: Long document QA, sliding window inference
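When a batch of long documents is tokenized this way, the output also carries `overflow_to_sample_mapping`, which ties each chunk back to the example that produced it. A minimal sketch (the document strings are placeholders):

```python
# Sketch: map overflow chunks back to their source documents
# (fast tokenizers only; the document texts here are placeholders).
docs = ["first long document ...", "second long document ..."]
enc = tokenizer(
    docs,
    max_length=512,
    stride=128,
    truncation=True,
    return_overflowing_tokens=True,
)

# overflow_to_sample_mapping[i] is the index of the document that
# produced chunk i, so per-chunk predictions can be aggregated
# per document afterwards.
for chunk_idx, doc_idx in enumerate(enc["overflow_to_sample_mapping"]):
    print(f"chunk {chunk_idx} <- document {doc_idx}")
```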
## Alignment and offsets

### Offset mapping

```python
# Get character offsets for each token
encoded = tokenizer("Hello, world!", return_offsets_mapping=True)

for token, (start, end) in zip(
    encoded.tokens(), encoded['offset_mapping']
):
    print(f"{token:10s} → [{start:2d}, {end:2d})")

# Output:
# [CLS]      → [ 0,  0)
# hello      → [ 0,  5)
# ,          → [ 5,  6)
# world      → [ 7, 12)
# !          → [12, 13)
# [SEP]      → [ 0,  0)
```

### Word IDs

```python
# Get word index for each token
encoded = tokenizer("Hello world")
word_ids = encoded.word_ids()

print(word_ids)
# [None, 0, 1, None]
# None = special token, 0 = first word, 1 = second word
```

**Use case**: Token classification (NER, POS tagging); see the sketch below
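A common pattern built on `word_ids()` is aligning one label per word with the subword tokens. A minimal sketch, with made-up words and tag ids; `-100` is the index that PyTorch's cross-entropy loss ignores:

```python
# Sketch: align word-level NER labels to subword tokens via word_ids().
# `words` and `labels` are made-up inputs with one tag id per word.
words = ["Hugging", "Face", "is", "great"]
labels = [1, 2, 0, 0]  # hypothetical tag scheme

encoded = tokenizer(words, is_split_into_words=True)

aligned, previous = [], None
for word_id in encoded.word_ids():
    if word_id is None:
        aligned.append(-100)             # special token
    elif word_id != previous:
        aligned.append(labels[word_id])  # first subword keeps the tag
    else:
        aligned.append(-100)             # later subwords are masked out
    previous = word_id

print(aligned)  # same length as encoded['input_ids']
```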
### Character to token mapping

```python
text = "Machine learning is awesome"
encoded = tokenizer(text)

# Find the token for a character position
char_pos = 8  # "l" in "learning"
token_idx = encoded.char_to_token(char_pos)

print(f"Character {char_pos} is in token {token_idx}: {encoded.tokens()[token_idx]}")
# Character 8 is in token 2: learning
```

**Use case**: Question answering (map answer character span to tokens)

### Sequence pairs

```python
# Encode a sentence pair
encoded = tokenizer("Question here", "Answer here")

# Get sequence IDs (which sequence each token belongs to)
sequence_ids = encoded.sequence_ids()

print(sequence_ids)
# [None, 0, 0, None, 1, 1, None]
# None = special token, 0 = question, 1 = answer
```
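Combining `char_to_token()` and `sequence_ids()` gives the standard extractive-QA preprocessing step: mapping an answer's character span in the context to token indices. A sketch, with an assumed answer span:

```python
# Sketch: convert an answer's character span in the context into a
# token span (the usual extractive-QA preprocessing step).
question = "What is awesome?"
context = "Machine learning is awesome"
answer_start, answer_end = 0, 16  # chars of "Machine learning" (assumed span)

enc = tokenizer(question, context)

# sequence_index=1 makes char_to_token interpret the character
# positions relative to the context, not the question.
start_token = enc.char_to_token(answer_start, sequence_index=1)
end_token = enc.char_to_token(answer_end - 1, sequence_index=1)
print(start_token, end_token)  # token span covering "Machine learning"
```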
## Model integration

### Use with transformers models

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize
text = "Hello world"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Get embeddings
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state.shape)  # [1, seq_len, hidden_size]
```

### Custom model with custom tokenizer

```python
from transformers import BertConfig, BertModel

# Train custom tokenizer
from tokenizers import Tokenizer, models, trainers

tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[PAD]"]  # include the tokens the wrapper declares
)
tokenizer.train(files=["data.txt"], trainer=trainer)

# Wrap for transformers
from transformers import PreTrainedTokenizerFast
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]"
)

# Create model with matching vocab size
config = BertConfig(vocab_size=30000)
model = BertModel(config)

# Use together
inputs = fast_tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
```

### Save and load together

```python
# Save both
model.save_pretrained("my-model")
tokenizer.save_pretrained("my-model")

# Directory structure:
# my-model/
# ├── config.json
# ├── pytorch_model.bin
# ├── tokenizer.json
# ├── tokenizer_config.json
# └── special_tokens_map.json

# Load both
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("my-model")
tokenizer = AutoTokenizer.from_pretrained("my-model")
```

## Advanced features

### Multimodal tokenization

```python
from transformers import AutoTokenizer

# LLaVA-style (image + text)
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Add image placeholder token
tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})

# Use in prompt
text = "Describe this image: <image>"
inputs = tokenizer(text, return_tensors="pt")
```

### Template formatting

```python
# Chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "What's the weather?"}
]

# Apply chat template (if the tokenizer defines one)
if tokenizer.chat_template is not None:
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt")
```

### Custom template

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

# Define a chat template (Jinja). Text between tags is emitted verbatim,
# so each branch ends with an explicit newline.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}"
    "System: {{ message['content'] }}\n"
    "{% elif message['role'] == 'user' %}"
    "User: {{ message['content'] }}\n"
    "{% elif message['role'] == 'assistant' %}"
    "Assistant: {{ message['content'] }}\n"
    "{% endif %}"
    "{% endfor %}"
    "Assistant: "
)

# Use template
text = tokenizer.apply_chat_template(messages, tokenize=False)
```

## Performance optimization

### Batch processing

```python
# Process large datasets efficiently
from datasets import load_dataset

dataset = load_dataset("imdb", split="train[:1000]")

# Tokenize in batches
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

# Map over dataset (batched)
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    num_proc=4  # Parallel processing
)
```

### Caching

```python
# Enable caching for repeated tokenization
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased",
    use_fast=True,
    cache_dir="./cache"  # Where downloaded tokenizer files are cached
)

# Tokenize with caching
from functools import lru_cache

@lru_cache(maxsize=10000)
def cached_tokenize(text):
    return tuple(tokenizer.encode(text))

# Reuses cached results for repeated inputs
```

### Memory efficiency

```python
# For very large datasets, use streaming
from datasets import load_dataset

dataset = load_dataset("pile", split="train", streaming=True)

def process_batch(batch):
    # Tokenize
    tokens = tokenizer(batch["text"], truncation=True, max_length=512)
    # Process tokens...
    return tokens

# Process in chunks (memory efficient)
for batch in dataset.batch(batch_size=1000):
    processed = process_batch(batch)
```

## Troubleshooting

### Issue: Tokenizer not fast

**Symptom**:
```python
tokenizer.is_fast  # False
```

**Solution**: Install the tokenizers library
```bash
pip install tokenizers
```

### Issue: Special tokens not working

**Symptom**: Special tokens are split into subwords

**Solution**: Add as special tokens, not regular tokens
```python
# Wrong
tokenizer.add_tokens(["<|image|>"])

# Correct
tokenizer.add_special_tokens({"additional_special_tokens": ["<|image|>"]})
```

### Issue: Offset mapping not available

**Symptom**:
```python
tokenizer("text", return_offsets_mapping=True)
# Error: return_offsets_mapping not supported
```

**Solution**: Use a fast tokenizer
```python
from transformers import AutoTokenizer

# Load fast version
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
```

### Issue: Padding inconsistent

**Symptom**: Some sequences padded, others not

**Solution**: Specify a padding strategy
```python
# Explicit padding
tokenizer(
    texts,
    padding="max_length",  # or "longest"
    max_length=128
)
```

## Best practices

1. **Always use fast tokenizers**:
   - 5-10× faster
   - Full alignment tracking
   - Better batch processing

2. **Save tokenizer with model**:
   - Ensures reproducibility
   - Prevents version mismatches

3. **Use batch processing for datasets**:
   - Tokenize with `.map(batched=True)`
   - Set `num_proc` for parallelism

4. **Enable caching for repeated inputs**:
   - Use `lru_cache` for inference
   - Cache tokenizer files with `cache_dir`

5. **Handle special tokens properly**:
   - Use `add_special_tokens()` for never-split tokens
   - Resize embeddings after adding tokens

6. **Test alignment for downstream tasks** (see the sketch after this list):
   - Verify `offset_mapping` is correct
   - Test `char_to_token()` on samples

7. **Version control tokenizer config**:
   - Save `tokenizer_config.json`
   - Document custom templates
   - Track vocabulary changes
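For point 6, a minimal sketch of such a check, assuming an uncased WordPiece tokenizer like BERT's (byte-level BPE models would need a different comparison rule):

```python
# Alignment sanity check: each token's offsets should point at the
# span of text it was produced from. Assumes an uncased WordPiece
# tokenizer; adapt the comparison for other normalizers.
text = "Machine learning is awesome"
enc = tokenizer(text, return_offsets_mapping=True)

for token, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    if start == end:
        continue  # special tokens map to the empty span (0, 0)
    # Strip the '##' continuation prefix before comparing.
    piece = token[2:] if token.startswith("##") else token
    assert piece == text[start:end].lower(), (token, text[start:end])
```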