# SAELens Tutorials

## Tutorial 1: Loading and Analyzing Pre-trained SAEs

### Goal
Load a pre-trained SAE and analyze which features activate on specific inputs.

### Step-by-Step

```python
from transformer_lens import HookedTransformer
from sae_lens import SAE
import torch

# 1. Load model and SAE
model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda"
)

print(f"SAE input dim: {sae.cfg.d_in}")
print(f"SAE hidden dim: {sae.cfg.d_sae}")
print(f"Expansion factor: {sae.cfg.d_sae / sae.cfg.d_in:.1f}x")

# 2. Get model activations
prompt = "The capital of France is Paris"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)
activations = cache["resid_pre", 8]  # [1, seq_len, 768]

# 3. Encode to SAE features
features = sae.encode(activations)  # [1, seq_len, d_sae]

# 4. Analyze sparsity
active_per_token = (features > 0).sum(dim=-1)
print(f"Average active features per token: {active_per_token.float().mean():.1f}")

# 5. Find top features for each token
str_tokens = model.to_str_tokens(prompt)
for pos in range(len(str_tokens)):
    top_features = features[0, pos].topk(5)
    print(f"\nToken '{str_tokens[pos]}':")
    for feat_idx, feat_val in zip(top_features.indices, top_features.values):
        print(f"  Feature {feat_idx.item()}: {feat_val.item():.3f}")

# 6. Check reconstruction quality
reconstructed = sae.decode(features)
mse = ((activations - reconstructed) ** 2).mean()
print(f"\nReconstruction MSE: {mse.item():.6f}")
```
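Raw MSE depends on the scale of the residual stream, so it is hard to judge on its own. A scale-free alternative is the fraction of variance the reconstruction explains; a minimal follow-on sketch in plain PyTorch, reusing `activations` and `reconstructed` from the block above:

```python
# Fraction of variance explained by the reconstruction (1.0 = perfect)
residual_ss = ((activations - reconstructed) ** 2).sum()
total_ss = ((activations - activations.mean(dim=(0, 1))) ** 2).sum()
var_explained = 1 - residual_ss / total_ss
print(f"Variance explained: {var_explained.item():.1%}")
```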

---

## Tutorial 2: Training a Custom SAE

### Goal
Train a Sparse Autoencoder on GPT-2 activations.

### Step-by-Step

```python
from sae_lens import LanguageModelSAERunnerConfig, SAETrainingRunner

# 1. Configure training
cfg = LanguageModelSAERunnerConfig(
    # Model
    model_name="gpt2-small",
    hook_name="blocks.6.hook_resid_pre",
    hook_layer=6,
    d_in=768,

    # SAE architecture
    architecture="standard",
    d_sae=768 * 8,  # 8x expansion
    activation_fn="relu",

    # Training
    lr=4e-4,
    l1_coefficient=8e-5,
    l1_warm_up_steps=1000,
    train_batch_size_tokens=4096,
    training_tokens=10_000_000,  # Small run for demo

    # Data
    dataset_path="monology/pile-uncopyrighted",
    streaming=True,
    context_size=128,

    # Dead feature prevention
    use_ghost_grads=True,
    dead_feature_window=5000,

    # Logging
    log_to_wandb=True,
    wandb_project="sae-training-demo",

    # Hardware
    device="cuda",
    dtype="float32",
)

# 2. Train
runner = SAETrainingRunner(cfg)
sae = runner.run()

# 3. Save
sae.save_model("./my_trained_sae")
```
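To reuse the trained SAE later, load it back from the save directory. A minimal sketch; the loader name has shifted across sae_lens releases, so treat `SAE.load_from_pretrained` as an assumption to check against your installed version:

```python
from sae_lens import SAE

# Assumes SAE.load_from_pretrained reads the directory written by save_model
# (check your sae_lens version -- the loader name has varied across releases)
sae = SAE.load_from_pretrained("./my_trained_sae", device="cuda")
print(f"Loaded SAE: d_in={sae.cfg.d_in}, d_sae={sae.cfg.d_sae}")
```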

### Hyperparameter Tuning Guide

| If you see... | Try... |
|---------------|--------|
| High L0 (>200) | Increase `l1_coefficient` |
| Low CE recovery (<80%) | Decrease `l1_coefficient`, increase `d_sae` |
| Many dead features (>5%) | Enable `use_ghost_grads`, increase `l1_warm_up_steps` |
| Training instability | Lower `lr`, increase `lr_warm_up_steps` |
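Both headline metrics in the table are cheap to measure directly. A minimal sketch of L0 (mean active features per token) and cross-entropy recovery follows; the zero-ablation baseline and the encode-then-decode splice are common conventions rather than SAELens's exact evaluation code, and the pretrained SAE here stands in for whichever SAE you trained:

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda"
)
hook_name = "blocks.8.hook_resid_pre"
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")

with torch.no_grad():
    # L0: mean number of active features per token
    _, cache = model.run_with_cache(tokens)
    features = sae.encode(cache[hook_name])
    l0 = (features > 0).float().sum(-1).mean()

    # CE recovery: how much of the loss gap between zero-ablating the
    # activation and the clean run is closed by the SAE reconstruction
    clean_loss = model(tokens, return_type="loss")
    recon_loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(hook_name, lambda act, hook: sae.decode(sae.encode(act)))],
    )
    zero_loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(hook_name, lambda act, hook: torch.zeros_like(act))],
    )

ce_recovery = (zero_loss - recon_loss) / (zero_loss - clean_loss)
print(f"L0: {l0:.1f}  |  CE recovery: {ce_recovery.item():.1%}")
```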
---

## Tutorial 3: Feature Attribution and Steering

### Goal
Identify which SAE features contribute to specific predictions and use them for steering.

### Step-by-Step

```python
from transformer_lens import HookedTransformer
from sae_lens import SAE
import torch

model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda"
)

# 1. Feature attribution for a specific prediction
prompt = "The capital of France is"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)
activations = cache["resid_pre", 8]
features = sae.encode(activations)

# Target token
target_token = model.to_single_token(" Paris")

# Compute feature contributions to the target logit:
# contribution = feature_activation * (decoder_direction @ unembedding)
W_dec = sae.W_dec  # [d_sae, d_model]
W_U = model.W_U    # [d_model, d_vocab]

# Project each feature's decoder direction onto the target token's
# unembedding column only; materializing the full [d_sae, d_vocab]
# matrix would cost several GB of memory
feature_to_target = W_dec @ W_U[:, target_token]  # [d_sae]

# Contribution of each feature to " Paris" at the final position
feature_acts = features[0, -1]  # [d_sae]
contributions = feature_acts * feature_to_target

# Top contributing features
top_features = contributions.topk(10)
print("Top features contributing to ' Paris':")
for idx, val in zip(top_features.indices, top_features.values):
    print(f"  Feature {idx.item()}: {val.item():.3f}")

# 2. Feature steering
def steer_with_feature(feature_idx, strength=5.0):
    """Add a feature's decoder direction to the residual stream during generation."""
    feature_direction = sae.W_dec[feature_idx]  # [d_model]

    # TransformerLens passes the hook point as the keyword argument `hook`,
    # so the hook function must accept a parameter with that name
    def steering_hook(activation, hook):
        activation[:, -1, :] += strength * feature_direction
        return activation

    # HookedTransformer.generate does not accept fwd_hooks directly;
    # attach the hook with the model.hooks() context manager instead
    with model.hooks(fwd_hooks=[("blocks.8.hook_resid_pre", steering_hook)]):
        output = model.generate(tokens, max_new_tokens=10)
    return model.to_string(output[0])

# Try steering with the top feature
top_feature_idx = top_features.indices[0].item()
print(f"\nSteering with feature {top_feature_idx}:")
print(steer_with_feature(top_feature_idx, strength=10.0))
```
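A quick usage check of the helper: sweep the strength, including a negative value, which subtracts the feature direction and tends to suppress whatever the feature represents. Generation samples by default, so treat the outputs as illustrative:

```python
for strength in (-10.0, 0.0, 10.0):
    print(f"strength={strength}: {steer_with_feature(top_feature_idx, strength)}")
```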

---

## Tutorial 4: Feature Ablation

### Goal
Test the causal importance of features by ablating them.

### Step-by-Step

```python
from transformer_lens import HookedTransformer
from sae_lens import SAE
import torch

model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda"
)

prompt = "The capital of France is"
tokens = model.to_tokens(prompt)

# Baseline prediction
baseline_logits = model(tokens)
target_token = model.to_single_token(" Paris")
baseline_prob = torch.softmax(baseline_logits[0, -1], dim=-1)[target_token].item()
print(f"Baseline P(Paris): {baseline_prob:.4f}")

# Get features to ablate
_, cache = model.run_with_cache(tokens)
activations = cache["resid_pre", 8]
features = sae.encode(activations)
top_features = features[0, -1].topk(10).indices

# Ablate top features one by one
for feat_idx in top_features:
    def ablation_hook(activation, hook, feat_idx=feat_idx):
        # Encode → zero feature → decode
        feats = sae.encode(activation)
        feats[:, :, feat_idx] = 0
        return sae.decode(feats)

    ablated_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[("blocks.8.hook_resid_pre", ablation_hook)]
    )
    ablated_prob = torch.softmax(ablated_logits[0, -1], dim=-1)[target_token].item()
    change = (ablated_prob - baseline_prob) / baseline_prob * 100
    print(f"Ablate feature {feat_idx.item()}: P(Paris)={ablated_prob:.4f} ({change:+.1f}%)")
```
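One caveat: the encode → zero → decode splice replaces the activation with the SAE's lossy reconstruction, so part of the measured probability change comes from reconstruction error rather than from the ablated feature. A fairer comparison measures against a reconstruction-only baseline; a minimal sketch reusing the variables above:

```python
# Splice in the SAE reconstruction without zeroing any feature
recon_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.8.hook_resid_pre",
                lambda act, hook: sae.decode(sae.encode(act)))],
)
recon_prob = torch.softmax(recon_logits[0, -1], dim=-1)[target_token].item()
print(f"Reconstruction-only P(Paris): {recon_prob:.4f}")
# Measure each ablation against recon_prob rather than baseline_prob
```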

---

## Tutorial 5: Comparing Features Across Prompts

### Goal
Find which features activate consistently for a concept.

### Step-by-Step

```python
from transformer_lens import HookedTransformer
from sae_lens import SAE
import torch

model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda"
)

# Test prompts about the same concept
prompts = [
    "The Eiffel Tower is located in",
    "Paris is the capital of",
    "France's largest city is",
    "The Louvre museum is in",
]

# Collect feature activations
all_features = []
for prompt in prompts:
    tokens = model.to_tokens(prompt)
    _, cache = model.run_with_cache(tokens)
    activations = cache["resid_pre", 8]
    features = sae.encode(activations)
    # Take max activation across positions
    max_features = features[0].max(dim=0).values
    all_features.append(max_features)

all_features = torch.stack(all_features)  # [n_prompts, d_sae]

# Find features that activate consistently
mean_activation = all_features.mean(dim=0)
min_activation = all_features.min(dim=0).values

# Features whose activation exceeds 0.5 in every prompt
consistent_features = (min_activation > 0.5).nonzero().squeeze(-1)
print(f"Features active in all prompts: {len(consistent_features)}")

# Top consistent features
top_consistent = mean_activation[consistent_features].topk(min(10, len(consistent_features)))
print("\nTop consistent features (possibly 'France/Paris' related):")
for idx, val in zip(top_consistent.indices, top_consistent.values):
    feat_idx = consistent_features[idx].item()
    print(f"  Feature {feat_idx}: mean activation {val.item():.3f}")
```
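To sanity-check that the consistent features are actually related, rather than generic always-on features, one cheap test is whether their decoder directions point the same way. A minimal sketch, reusing `sae` and `consistent_features` from above:

```python
import torch.nn.functional as F

# Pairwise cosine similarity between decoder rows of the consistent features
dirs = sae.W_dec[consistent_features]  # [n, d_model]
sims = F.cosine_similarity(dirs.unsqueeze(1), dirs.unsqueeze(0), dim=-1)  # [n, n]
mask = ~torch.eye(len(dirs), dtype=torch.bool, device=sims.device)
print(f"Mean pairwise decoder cosine similarity: {sims[mask].mean().item():.3f}")
```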

---

## External Resources

### Official Tutorials
- [Basic Loading & Analysis](https://github.com/jbloomAus/SAELens/blob/main/tutorials/basic_loading_and_analysing.ipynb)
- [Training SAEs](https://github.com/jbloomAus/SAELens/blob/main/tutorials/training_a_sparse_autoencoder.ipynb)
- [Logits Lens with Features](https://github.com/jbloomAus/SAELens/blob/main/tutorials/logits_lens_with_features.ipynb)

### ARENA Curriculum
Comprehensive SAE course: https://www.lesswrong.com/posts/LnHowHgmrMbWtpkxx/intro-to-superposition-and-sparse-autoencoders-colab

### Key Papers
- [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features) - Anthropic (2023)
- [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/) - Anthropic (2024)
- [Sparse Autoencoders Find Interpretable Features](https://arxiv.org/abs/2309.08600) - ICLR 2024