HuggingFace Transformers Integration

Contents

  • Enabling Flash Attention in Transformers
  • Supported model architectures
  • Configuration examples
  • Performance comparisons
  • Troubleshooting model-specific issues
  • Best practices

Enabling Flash Attention in Transformers

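HuggingFace Transformers (v4.36+) supports Flash Attention 2 natively through the attn_implementation argument of from_pretrained(). The kernels themselves come from the separate flash-attn package, which needs a supported GPU and half-precision weights (float16 or bfloat16).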

Simple enable for any supported model:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)

Install requirements:

pip install "transformers>=4.36"  # quoted so the shell does not treat >= as a redirect
pip install flash-attn --no-build-isolation
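
To confirm that the installed packages satisfy the requirements above, a small check along these lines can help. This is a minimal sketch: version_tuple, meets_minimum and check_requirements are hypothetical helper names, and the version parsing is deliberately simple (it keeps the leading digits of each part and treats pre-releases like the release they precede).

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '4.41.0.dev0' into (4, 41, 0).

    Keeps the leading digits of each dot-separated part and stops at the
    first part that has none.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_requirements() -> dict:
    """Map each required package to its installed version, or None if missing."""
    report = {}
    for package in ("transformers", "flash-attn"):
        try:
            report[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = None
    return report


versions = check_requirements()
print(versions)
if versions["transformers"] is not None:
    print("transformers >= 4.36:", meets_minimum(versions["transformers"], "4.36"))
```

A None entry means the package is not installed in the current environment.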

Supported model architectures

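As of Transformers 4.40 (coverage changes between releases, so confirm against the documentation for the installed version):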

Fully supported:

  • Llama / Llama 2 / Llama 3
  • Mistral / Mixtral
  • Falcon
  • GPT-NeoX
  • Phi / Phi-2 / Phi-3
  • Qwen / Qwen2
  • Gemma
  • Starcoder2
  • GPT-J
  • OPT
  • BLOOM

Partially supported (encoder-decoder):

  • BART
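  • T5 / Flan-T5 (verify before relying on it: T5 adds a learned relative position bias to the attention scores, which the Flash Attention 2 kernels do not accept, so this architecture may be rejected at load time)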
  • Whisper

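Check support by loading the model with Flash Attention 2 requested. A config loaded on its own does not report whether the architecture supports it:

import torch
from transformers import AutoModelForCausalLM

# Loading raises ValueError (unsupported architecture) or ImportError (flash-attn missing)
model = AutoModelForCausalLM.from_pretrained(
    "model-name", attn_implementation="flash_attention_2", torch_dtype=torch.float16
)
print(model.config._attn_implementation)  # 'flash_attention_2' when active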

Configuration examples

Llama 2 with Flash Attention

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate (send inputs to the device that holds the model's first layers)
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Mistral with Flash Attention for long context

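from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-7B-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Wider exponent range than float16; safer for long sequences
    device_map="auto",
    max_position_embeddings=32768,  # Already this checkpoint's configured value; shown for explicitness
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder long document; replace with real text and check the token count
long_text = "..." * 10000
inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to(model.device)
print(f"Prompt length: {inputs['input_ids'].shape[1]} tokens")
outputs = model.generate(**inputs, max_new_tokens=512)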

Fine-tuning with Flash Attention

import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # float16 weights with fp16=True fail when the gradient scaler unscales fp16 gradients
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    bf16=True,  # Must match model dtype
    optim="adamw_torch_fused",  # Fused AdamW kernel
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # A tokenized dataset prepared beforehand
)

trainer.train()
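
The batch settings above combine into one effective batch per optimizer step. The arithmetic can be sketched as follows; effective_batch_size is a hypothetical helper, and num_devices counts data-parallel replicas only (a model sharded across GPUs with device_map="auto" still counts as one).

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Samples that contribute to a single optimizer step."""
    return per_device * accumulation_steps * num_devices


# Settings from the fine-tuning example: batch 4, accumulate 4 steps, one GPU
print(effective_batch_size(4, 4))                 # 16
# Same settings replicated on two data-parallel GPUs
print(effective_batch_size(4, 4, num_devices=2))  # 32
```

Halving per_device_train_batch_size while doubling gradient_accumulation_steps keeps this number constant and lowers peak activation memory.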

Multi-GPU training

from transformers import AutoModelForCausalLM
import torch

# Shard layers across GPUs (naive model parallelism): the GPUs run one after
# another, so this fits larger models rather than speeding up a small one
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",  # Automatic multi-GPU placement
    max_memory={0: "20GB", 1: "20GB"},  # Limit per GPU
)
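
The max_memory limits above are usually chosen by subtracting some headroom from each GPU's capacity. A minimal sketch of that calculation follows; build_max_memory and the 4 GiB default reserve are assumptions for illustration, not values recommended by the library.

```python
def build_max_memory(total_gib_per_gpu, reserve_gib=4.0):
    """Build a max_memory mapping that leaves reserve_gib free on every GPU.

    total_gib_per_gpu: capacity of each GPU in GiB, indexed by device id.
    Returns a dict such as {0: "20GiB", 1: "20GiB"}.
    """
    max_memory = {}
    for index, total in enumerate(total_gib_per_gpu):
        usable = int(total - reserve_gib)  # round down to whole GiB
        if usable <= 0:
            raise ValueError(f"GPU {index} has no memory left after the reserve")
        max_memory[index] = f"{usable}GiB"
    return max_memory


# Two 24 GiB cards with 4 GiB held back on each for activations and KV cache
print(build_max_memory([24.0, 24.0]))  # {0: '20GiB', 1: '20GiB'}
```

On a real machine the capacities could come from torch.cuda.mem_get_info(), which reports free and total bytes per device.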

Performance comparisons

Memory usage (Llama 2 7B, batch=1)

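| Sequence length | Standard attention | Flash Attention 2 | Reduction                     |
|-----------------|--------------------|-------------------|-------------------------------|
| 512             | 1.2 GB             | 0.9 GB            | 25%                           |
| 2048            | 3.8 GB             | 1.4 GB            | 63%                           |
| 8192            | 14.2 GB            | 3.2 GB            | 77%                           |
| 32768           | OOM (>24 GB)       | 10.8 GB           | n/a (standard OOMs, FA2 fits) |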
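
The growing gap in the table follows from how standard attention stores its score matrix: one seq_len x seq_len matrix per head, while Flash Attention works in tiles and never stores the full matrix. The sketch below estimates only that matrix, so it illustrates the quadratic trend and is not expected to reproduce the measured figures. attention_scores_bytes is a hypothetical helper; 32 heads matches Llama 2 7B.

```python
def attention_scores_bytes(batch: int, num_heads: int, seq_len: int,
                           bytes_per_element: int = 2) -> int:
    """Bytes needed to hold one layer's full attention score matrix."""
    return batch * num_heads * seq_len * seq_len * bytes_per_element


# Llama 2 7B has 32 attention heads; 2 bytes per element for float16/bfloat16
for seq_len in (512, 2048, 8192, 32768):
    gib = attention_scores_bytes(1, 32, seq_len) / 1024**3
    print(f"seq_len={seq_len:>6}: {gib:8.3f} GiB per layer")
```

Quadrupling the sequence length multiplies this term by sixteen, which is why standard attention runs out of memory first.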

Speed (tokens/sec, A100 80GB)

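| Model                  | Standard | Flash Attn 2 | Speedup |
|------------------------|----------|--------------|---------|
| Llama 2 7B (seq=2048)  | 42       | 118          | 2.8x    |
| Llama 2 13B (seq=4096) | 18       | 52           | 2.9x    |
| Llama 2 70B (seq=2048) | 4        | 11           | 2.75x   |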

Training throughput (samples/sec)

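| Model       | Batch size | Standard | Flash Attn 2 | Speedup |
|-------------|------------|----------|--------------|---------|
| Llama 2 7B  | 4          | 1.2      | 3.1          | 2.6x    |
| Llama 2 7B  | 8          | 2.1      | 5.8          | 2.8x    |
| Llama 2 13B | 2          | 0.6      | 1.7          | 2.8x    |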
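
Numbers like these depend on hardware, driver and library versions, so it is worth measuring locally. A minimal timing sketch follows; measure_throughput and speedup are hypothetical helpers. The synchronize argument exists because CUDA calls return before the GPU finishes, so GPU timings should pass torch.cuda.synchronize.

```python
import time


def measure_throughput(run_once, units_per_run, warmup=1, iterations=3, synchronize=None):
    """Return units (tokens or samples) processed per second by run_once()."""
    def sync():
        if synchronize is not None:
            synchronize()

    for _ in range(warmup):  # warm-up runs are not timed
        run_once()
    sync()

    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    sync()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    return units_per_run * iterations / elapsed


def speedup(baseline: float, candidate: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    return candidate / baseline


# Stand-in workload so the sketch runs anywhere; with a real model this would be
# lambda: model.generate(**inputs, max_new_tokens=128), with units_per_run=128
rate = measure_throughput(lambda: time.sleep(0.01), units_per_run=128)
print(f"{rate:.0f} units/sec")
print(f"Speedup from the first table row: {speedup(42, 118):.1f}x")
```

Run the same callable once with attn_implementation="eager" and once with "flash_attention_2", then pass the two rates to speedup.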

Troubleshooting model-specific issues

Issue: Model doesn't support Flash Attention

Check the support list above. If the architecture is not supported, fall back to PyTorch's scaled_dot_product_attention (SDPA):

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="sdpa",  # PyTorch native; usually still faster than eager
    torch_dtype=torch.float16
)

Issue: CUDA out of memory during loading

Reduce memory footprint:

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "18GB"},  # Reserve memory for KV cache
    low_cpu_mem_usage=True
)
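
How much to hold back for the KV cache can be estimated from the model's shape. The sketch below uses the usual formula (keys plus values, for every layer, KV head and position); kv_cache_bytes is a hypothetical helper, and the layer and head counts in the comments are the published configurations of the two models.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_element: int = 2) -> int:
    """Approximate KV cache size in bytes for a decoder-only model."""
    # The leading 2 accounts for one key tensor and one value tensor per layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_element


GIB = 1024**3

# Llama 2 7B: 32 layers, 32 KV heads, head_dim 128, 4096-token context
print(kv_cache_bytes(32, 32, 128, 4096) / GIB, "GiB")   # 2.0 GiB

# Mistral 7B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128
print(kv_cache_bytes(32, 8, 128, 32768) / GIB, "GiB")   # 4.0 GiB
```

The cache scales linearly with batch size and sequence length, so the reserve should be sized for the longest generation expected.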

Issue: Slower inference than expected

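Ensure the model weights are in half precision. Token IDs and attention masks stay integer tensors; only floating-point inputs (for example image or audio features) need casting:

# Model weights must be float16/bfloat16 for Flash Attention 2
model = model.to(torch.float16)
inputs = tokenizer(..., return_tensors="pt").to("cuda")
# Cast float32 tensors only; integer token IDs pass through unchanged
inputs = {k: v.to(torch.float16) if v.dtype == torch.float32 else v
          for k, v in inputs.items()}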

Issue: Different outputs vs standard attention

Flash Attention computes exact attention, not an approximation, but it accumulates in a different order, so results are not bitwise identical. Small differences (on the order of 1e-3 in half precision) are normal:

# Compare outputs (both models on the GPU; the eager path is the reference)
model_standard = AutoModelForCausalLM.from_pretrained(
    "model-name", attn_implementation="eager", torch_dtype=torch.float16
).to("cuda")
model_flash = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Test", return_tensors="pt").to("cuda")

with torch.no_grad():
    out_standard = model_standard(**inputs).logits
    out_flash = model_flash(**inputs).logits

diff = (out_standard.float() - out_flash.float()).abs().max().item()
print(f"Max diff: {diff:.6f}")  # Typically around 1e-3 to 1e-4

Issue: ImportError during model loading

Install flash-attn:

pip install flash-attn --no-build-isolation

Or disable Flash Attention:

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="eager",  # Standard PyTorch
    torch_dtype=torch.float16
)
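
Rather than editing the argument by hand, the choice between the three implementations can be made at startup. The sketch below is one possible policy, not the library's own selection logic: choose_attn_implementation and detect_environment are hypothetical helpers, and the (8, 0) threshold reflects flash-attn's stated requirement of Ampere or newer NVIDIA GPUs.

```python
import importlib.util


def choose_attn_implementation(flash_attn_installed, compute_capability, sdpa_available=True):
    """Pick the fastest attention backend the environment can support.

    compute_capability is a (major, minor) tuple, or None when no CUDA GPU is present.
    """
    if flash_attn_installed and compute_capability is not None and compute_capability >= (8, 0):
        return "flash_attention_2"
    if sdpa_available:
        return "sdpa"
    return "eager"


def detect_environment():
    """Return (flash_attn_installed, compute_capability) for this machine."""
    flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    try:
        import torch
    except ImportError:
        return flash_attn_installed, None
    if not torch.cuda.is_available():
        return flash_attn_installed, None
    return flash_attn_installed, torch.cuda.get_device_capability(0)


print(choose_attn_implementation(*detect_environment()))
```

The returned string can be passed directly as attn_implementation to from_pretrained(). An architecture that lacks Flash Attention 2 support will still raise at load time, so this covers the environment side only.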

Best practices

  1. Always use float16/bfloat16 with Flash Attention (not float32)
  2. Set device_map="auto" for automatic placement of weights across the available GPUs
  3. Use bfloat16 for long context (better numerical stability)
  4. Enable gradient checkpointing for training large models
  5. Monitor memory with torch.cuda.max_memory_allocated()
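
For the memory monitoring in item 5, a small context manager keeps the reset and read-out together. This is a sketch under assumptions: track_peak_memory and format_gib are hypothetical names, and the helper prints a notice instead of failing when PyTorch or CUDA is not available.

```python
from contextlib import contextmanager

try:
    import torch
except ImportError:  # lets the helper be imported on machines without PyTorch
    torch = None


def format_gib(num_bytes: int) -> str:
    """Format a byte count as GiB with two decimals."""
    return f"{num_bytes / 1024**3:.2f} GiB"


@contextmanager
def track_peak_memory(label: str = "block"):
    """Print the peak CUDA memory allocated while the with-block runs."""
    cuda_ok = torch is not None and torch.cuda.is_available()
    if cuda_ok:
        torch.cuda.reset_peak_memory_stats()  # start from a clean peak counter
    try:
        yield
    finally:
        if cuda_ok:
            torch.cuda.synchronize()  # make sure queued kernels have finished
            print(f"{label}: peak {format_gib(torch.cuda.max_memory_allocated())}")
        else:
            print(f"{label}: CUDA not available, nothing measured")


# Usage: wrap the call to measure, e.g. model.generate(...) or trainer.train()
with track_peak_memory("example"):
    pass
```

Comparing the reported peak for "eager" and "flash_attention_2" on the same input gives a local version of the memory table.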

Example with all best practices:

import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Better for training
    device_map="auto",
    low_cpu_mem_usage=True
)

# Enable gradient checkpointing for memory
model.gradient_checkpointing_enable()

# Training with optimizations
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    bf16=True,  # Match model dtype
    optim="adamw_torch_fused",
    gradient_checkpointing=True
)