- Restored 21 skills removed in commits757d012and740dd92: accelerate, audiocraft, code-review, faiss, flash-attention, gguf, grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion, tensorrt-llm, torchtitan, trl-fine-tuning, whisper - Rewrote sync_skills() with proper update semantics: * New skills (not in manifest): copied to user dir * Existing skills (in manifest + on disk): updated via hash comparison * User-deleted skills (in manifest, not on disk): respected, not re-added * Stale manifest entries (removed from bundled): cleaned from manifest - Added sync_skills() to CLI startup (cmd_chat) and gateway startup (start_gateway) — previously only ran during 'hermes update' - Updated cmd_update output to show new/updated/cleaned counts - Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh install, user deletion respect, update detection, stale cleanup, and name collision handling 75 bundled skills total. 2002 tests pass.
6.3 KiB
OBLITERATUS Methods — Detailed Guide
Important: The CLI (
obliteratus obliterate --method) accepts 9 methods: basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized, inverted, nuclear. Four additional methods (failspy, gabliteration, heretic, rdo) are available only via the Python API and will be rejected by argparse if used on CLI.
How Abliteration Works (Theory)
When a model is trained with RLHF/DPO/CAI, it learns to represent "should I refuse?" as a direction in its internal activation space. When processing a "harmful" prompt, activations shift in this direction, causing the model to generate refusal text.
Abliteration works by:
- Measuring this direction (the difference between harmful and harmless activations)
- Removing it from the model's weight matrices via orthogonal projection
- The model can no longer "point toward" refusal, so it responds normally
Mathematically: W_new = W_old - (W_old @ d @ d.T) where d is the refusal direction.
Method Details
basic
Technique: Single refusal direction via diff-in-means Based on: Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction") Speed: Fast (~5-10 min for 8B) Quality: Moderate — works for simple refusal patterns Best for: Quick tests, models with clean single-direction refusal Limitation: Misses complex multi-direction refusal patterns
advanced (DEFAULT)
Technique: Multiple SVD directions with norm-preserving projection Speed: Medium (~10-20 min for 8B) Quality: Good — handles multi-direction refusal Best for: Dense models (Llama, Qwen, Mistral) as a reliable default Key improvement: Norm preservation prevents weight magnitude drift
informed (RECOMMENDED)
Technique: Analysis-guided auto-configuration Speed: Slow (~20-40 min for 8B, runs 4 analysis modules first) Quality: Best — adapts to each model's specific refusal implementation Best for: Any model when quality matters more than speed
The informed pipeline runs these analysis modules during abliteration:
- AlignmentImprintDetector — Detects DPO/RLHF/CAI/SFT → sets regularization
- ConceptConeAnalyzer — Polyhedral vs linear refusal → sets n_directions
- CrossLayerAlignmentAnalyzer — Cluster-aware → selects target layers
- DefenseRobustnessEvaluator — Self-repair risk → sets refinement passes
- Ouroboros loop — Re-probes after excision, re-excises if refusal persists
aggressive
Technique: Whitened SVD + jailbreak-contrastive activations + attention head surgery Speed: Slow (~30-60 min for 8B) Quality: High but higher risk of coherence damage Best for: Models that resist gentler methods Key feature: Whitened SVD separates refusal signal from natural activation variance
surgical
Technique: SAE features + neuron masking + head surgery + per-expert directions Speed: Very slow (~1-2 hrs for 8B, needs SAE) Quality: Highest precision Best for: Reasoning models (R1 distills) where you must preserve CoT Key feature: CoT-Aware — explicitly protects reasoning-critical directions
nuclear
Technique: Everything combined — expert transplant + steering + per-expert directions Speed: Very slow Quality: Most thorough removal, highest risk of side effects Best for: Stubborn MoE models (DeepSeek, Mixtral, DBRX) that resist other methods Key feature: Expert-granular abliteration decomposes signals per MoE expert
optimized
Technique: Bayesian hyperparameter search via Optuna TPE Speed: Very slow (runs many trials) Quality: Finds optimal configuration automatically Best for: Research, when you want the mathematically best parameters Requires: optuna package
spectral_cascade
Technique: DCT frequency-domain decomposition of refusal signal Speed: Medium-slow Quality: Novel approach, less battle-tested Best for: Research, exploring alternative decomposition strategies
inverted
Technique: Reflects (inverts) the refusal direction instead of removing it Speed: Fast (same as basic) Quality: Aggressive — model becomes actively willing, not just neutral Best for: When you want the model to be maximally helpful Warning: Can make the model too eager; may reduce safety-adjacent reasoning
failspy / gabliteration / heretic / rdo (PYTHON API ONLY)
Technique: Faithful reproductions of prior community/academic work
Speed: Varies
Quality: Known baselines
Best for: Reproducing published results, comparing methods
⚠️ NOT available via CLI — these methods are only accessible via the Python API.
Do not use --method failspy etc. in CLI commands; argparse will reject them.
Method Selection Flowchart
Is this a quick test?
├─ YES → basic
└─ NO → Is the model MoE (DeepSeek, Mixtral)?
├─ YES → nuclear
└─ NO → Is it a reasoning model (R1 distill)?
├─ YES → surgical
└─ NO → Do you care about speed?
├─ YES → advanced
└─ NO → informed
Key Parameters
| Parameter | Range | Default | Effect |
|---|---|---|---|
| n_directions | 1-32 | auto | More = more thorough but riskier |
| regularization | 0.0-1.0 | 0.0 | Higher preserves more original behavior |
| refinement_passes | 1-5 | 1 | More catches self-repair (Ouroboros effect) |
| quantization | 4/8 bit | none | Saves VRAM, slight quality tradeoff |
Troubleshooting
| Problem | Solution |
|---|---|
| Refusal rate still > 10% | Try aggressive/nuclear, add refinement passes |
| Perplexity up > 20% | Reduce n_directions, increase regularization |
| Model generates nonsense | Regularization too low, try 0.2-0.3 |
| OOM on GPU | Use 4-bit quantization, or try smaller model |
| MoE model barely changes | Use nuclear method (expert-granular) |
| CoT reasoning broken | Use surgical method (CoT-aware) |