mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-27 01:11:40 +00:00

teknium1 ab0f4126cf fix: restore all removed bundled skills + fix skills sync system

- Restored 21 skills removed in commits 757d012 and 740dd92:
  accelerate, audiocraft, code-review, faiss, flash-attention, gguf,
  grpo-rl-training, guidance, llava, nemo-curator, obliteratus, peft,
  pytorch-fsdp, pytorch-lightning, simpo, slime, stable-diffusion,
  tensorrt-llm, torchtitan, trl-fine-tuning, whisper

- Rewrote sync_skills() with proper update semantics:
  * New skills (not in manifest): copied to user dir
  * Existing skills (in manifest + on disk): updated via hash comparison
  * User-deleted skills (in manifest, not on disk): respected, not re-added
  * Stale manifest entries (removed from bundled): cleaned from manifest

- Added sync_skills() to CLI startup (cmd_chat) and gateway startup
  (start_gateway) — previously only ran during 'hermes update'

- Updated cmd_update output to show new/updated/cleaned counts

- Rewrote tests: 20 tests covering manifest CRUD, dir hashing, fresh
  install, user deletion respect, update detection, stale cleanup, and
  name collision handling

75 bundled skills total. 2002 tests pass.

2026-03-06 15:57:30 -08:00

OBLITERATUS Methods — Detailed Guide

Important: The CLI (obliteratus obliterate --method) accepts 9 methods: basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized, inverted, nuclear. Four additional methods (failspy, gabliteration, heretic, rdo) are available only through the Python API; argparse rejects them if they are passed on the CLI.
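The rejection behavior described above is what argparse does for any value outside a declared choices list. A minimal sketch follows, assuming the CLI declares its methods through choices; the parser below is a hypothetical stand-in and not the real OBLITERATUS argument parser, and only the method names and the advanced default come from this guide.

```python
import argparse

# Method names are taken from this guide. The parser is a hypothetical
# stand-in used for illustration, not the actual OBLITERATUS CLI definition.
CLI_METHODS = [
    "basic", "advanced", "aggressive", "spectral_cascade", "informed",
    "surgical", "optimized", "inverted", "nuclear",
]
API_ONLY_METHODS = ["failspy", "gabliteration", "heretic", "rdo"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose --method flag only accepts the CLI methods."""
    parser = argparse.ArgumentParser(prog="obliteratus-sketch")
    parser.add_argument("--method", choices=CLI_METHODS, default="advanced")
    return parser


def is_cli_method(name: str) -> bool:
    """Return True if argparse accepts `name`, False if it rejects it."""
    try:
        build_parser().parse_args(["--method", name])
        return True
    except SystemExit:
        # argparse reports an invalid choice by printing usage and exiting.
        return False
```

Calling is_cli_method("heretic") returns False here because the name is not in the choices list, which mirrors the failure a user would see when passing an API-only method on the command line.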

How Abliteration Works (Theory)

When a model is trained with RLHF/DPO/CAI, it learns to represent "should I refuse?" as a direction in its internal activation space. When processing a "harmful" prompt, activations shift in this direction, causing the model to generate refusal text.

Abliteration works by:

1. Measuring this direction (the difference between harmful and harmless activations)
2. Removing it from the model's weight matrices via orthogonal projection

After these two steps the model can no longer "point toward" refusal, so it responds normally.

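Mathematically: W_new = W_old - (W_old @ d @ d.T), where d is the refusal direction written as a unit-norm column vector. The subtracted term is the component of each row of W_old along d, so W_new @ d = 0.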
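The formula can be checked on toy numbers. The sketch below is pure Python and assumes the row-vector convention implied by W_old @ d, with d normalized to unit length; the activation values and helper names are invented for illustration and are not part of OBLITERATUS.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful, harmless):
    """Unit-norm difference of means between the two activation sets."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project_out(W, d):
    """Return W - (W @ d) @ d.T for a unit vector d."""
    out = []
    for row in W:
        coeff = sum(w * x for w, x in zip(row, d))  # component of row along d
        out.append([w - coeff * x for w, x in zip(row, d)])
    return out


# Toy 3-dimensional "activations", invented for illustration only.
harmful = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
harmless = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
d = refusal_direction(harmful, harmless)

W_old = [[1.0, 0.0, 2.0], [4.0, 5.0, 6.0]]
W_new = project_out(W_old, d)
```

With these inputs d is the first coordinate axis, so every row of W_new has a zero first entry while the other entries are unchanged.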

Method Details

basic

Technique: Single refusal direction via diff-in-means
Based on: Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction")
Speed: Fast (~5-10 min for 8B)
Quality: Moderate; works for simple refusal patterns
Best for: Quick tests, models with clean single-direction refusal
Limitation: Misses complex multi-direction refusal patterns

advanced (DEFAULT)

Technique: Multiple SVD directions with norm-preserving projection
Speed: Medium (~10-20 min for 8B)
Quality: Good; handles multi-direction refusal
Best for: Dense models (Llama, Qwen, Mistral) as a reliable default
Key improvement: Norm preservation prevents weight magnitude drift

informed (RECOMMENDED)

Technique: Analysis-guided auto-configuration
Speed: Slow (~20-40 min for 8B, runs 4 analysis modules first)
Quality: Best; adapts to each model's specific refusal implementation
Best for: Any model when quality matters more than speed

The informed pipeline runs these analysis modules during abliteration:

AlignmentImprintDetector — Detects DPO/RLHF/CAI/SFT → sets regularization
ConceptConeAnalyzer — Polyhedral vs linear refusal → sets n_directions
CrossLayerAlignmentAnalyzer — Cluster-aware → selects target layers
DefenseRobustnessEvaluator — Self-repair risk → sets refinement passes
Ouroboros loop — Re-probes after excision, re-excises if refusal persists

aggressive

Technique: Whitened SVD + jailbreak-contrastive activations + attention head surgery
Speed: Slow (~30-60 min for 8B)
Quality: High, but with a higher risk of coherence damage
Best for: Models that resist gentler methods
Key feature: Whitened SVD separates refusal signal from natural activation variance

surgical

Technique: SAE features + neuron masking + head surgery + per-expert directions
Speed: Very slow (~1-2 hrs for 8B, needs SAE)
Quality: Highest precision
Best for: Reasoning models (R1 distills) where you must preserve CoT
Key feature: CoT-aware; explicitly protects reasoning-critical directions

nuclear

Technique: Everything combined (expert transplant + steering + per-expert directions)
Speed: Very slow
Quality: Most thorough removal, highest risk of side effects
Best for: Stubborn MoE models (DeepSeek, Mixtral, DBRX) that resist other methods
Key feature: Expert-granular abliteration decomposes signals per MoE expert

optimized

Technique: Bayesian hyperparameter search via Optuna TPE
Speed: Very slow (runs many trials)
Quality: Searches for the best-scoring configuration automatically, within the trial budget
Best for: Research, when you want parameters tuned by search rather than by hand
Requires: optuna package

spectral_cascade

Technique: DCT frequency-domain decomposition of refusal signal
Speed: Medium-slow
Quality: Novel approach, less battle-tested
Best for: Research, exploring alternative decomposition strategies

inverted

Technique: Reflects (inverts) the refusal direction instead of removing it
Speed: Fast (same as basic)
Quality: Aggressive; model becomes actively willing, not just neutral
Best for: When you want the model to be maximally helpful
Warning: Can make the model too eager; may reduce safety-adjacent reasoning
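The difference between removing and reflecting a direction can be shown with one scale factor. This is a sketch under an assumption: it reads "reflects" as a Householder-style update that subtracts twice the component along d, which this guide does not confirm as the actual implementation. The helper names and numbers are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def remove_direction(W, d, scale):
    """Return W - scale * (W @ d) @ d.T for a unit vector d.

    scale=1.0 projects the direction out (component along d becomes 0);
    scale=2.0 reflects it (component along d changes sign).
    """
    return [[w - scale * dot(row, d) * x for w, x in zip(row, d)] for row in W]


# Toy values, invented for illustration only.
d = [1.0, 0.0]
W = [[3.0, 4.0]]
projected = remove_direction(W, d, 1.0)
reflected = remove_direction(W, d, 2.0)
```

Here the projected row has no component along d, while the reflected row keeps the same magnitude along d with the opposite sign.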

failspy / gabliteration / heretic / rdo (PYTHON API ONLY)

Technique: Faithful reproductions of prior community/academic work
Speed: Varies
Quality: Known baselines
Best for: Reproducing published results, comparing methods
⚠️ NOT available via CLI: these methods are only accessible via the Python API. Do not use --method failspy etc. in CLI commands; argparse will reject them.

Method Selection Flowchart

Is this a quick test?
├─ YES → basic
└─ NO → Is the model MoE (DeepSeek, Mixtral)?
         ├─ YES → nuclear
         └─ NO → Is it a reasoning model (R1 distill)?
                  ├─ YES → surgical
                  └─ NO → Do you care about speed?
                           ├─ YES → advanced
                           └─ NO → informed
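The flowchart above can be written as a small decision function. This is a sketch only; the function name and boolean parameters are hypothetical and are not part of the OBLITERATUS API.

```python
def select_method(quick_test: bool, is_moe: bool,
                  is_reasoning_model: bool, speed_matters: bool) -> str:
    """Walk the selection flowchart from top to bottom and return a method name."""
    if quick_test:
        return "basic"
    if is_moe:
        return "nuclear"
    if is_reasoning_model:
        return "surgical"
    return "advanced" if speed_matters else "informed"
```

The checks run in the same order as the flowchart, so an MoE reasoning model resolves to nuclear because the MoE question is asked first.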

Key Parameters

Parameter	Range	Default	Effect
n_directions	1-32	auto	More = more thorough but riskier
regularization	0.0-1.0	0.0	Higher preserves more original behavior
refinement_passes	1-5	1	More catches self-repair (Ouroboros effect)
quantization	4/8 bit	none	Saves VRAM, slight quality tradeoff
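The ranges and defaults in the table can be captured in a small container that checks values before a run. This is a sketch, not the library's configuration class: the class name, the use of None for "auto" and "none", and the error messages are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AbliterationParams:
    """Hypothetical holder for the parameters listed in the table."""
    n_directions: Optional[int] = None   # None stands in for "auto"
    regularization: float = 0.0          # 0.0-1.0
    refinement_passes: int = 1           # 1-5
    quantization: Optional[int] = None   # None, 4 or 8 (bits)

    def validate(self) -> List[str]:
        """Return a list of range violations; an empty list means valid."""
        errors = []
        if self.n_directions is not None and not 1 <= self.n_directions <= 32:
            errors.append("n_directions must be in 1-32 or None (auto)")
        if not 0.0 <= self.regularization <= 1.0:
            errors.append("regularization must be in 0.0-1.0")
        if not 1 <= self.refinement_passes <= 5:
            errors.append("refinement_passes must be in 1-5")
        if self.quantization not in (None, 4, 8):
            errors.append("quantization must be None, 4 or 8")
        return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every out-of-range value at once.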

Troubleshooting

Problem	Solution
Refusal rate still > 10%	Try aggressive/nuclear, add refinement passes
Perplexity up > 20%	Reduce n_directions, increase regularization
Model generates nonsense	Regularization too low, try 0.2-0.3
OOM on GPU	Use 4-bit quantization, or try smaller model
MoE model barely changes	Use nuclear method (expert-granular)
CoT reasoning broken	Use surgical method (CoT-aware)
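The two numeric rows of the troubleshooting table can be turned into a simple check on evaluation results. This sketch assumes both metrics are passed as fractions (0.10 for 10%); the function name and the idea of feeding it measured metrics are hypothetical, while the thresholds and suggested remedies come from the table.

```python
from typing import List


def suggest_fixes(refusal_rate: float, perplexity_increase: float) -> List[str]:
    """Map measured metrics to the remedies listed in the troubleshooting table.

    Both arguments are fractions, e.g. 0.10 means 10%.
    """
    suggestions = []
    if refusal_rate > 0.10:
        suggestions.append("try aggressive or nuclear, or add refinement passes")
    if perplexity_increase > 0.20:
        suggestions.append("reduce n_directions or increase regularization")
    return suggestions
```

An empty result means neither threshold was crossed; the qualitative rows (OOM, MoE, broken CoT) still need a manual look.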
