Most SAE research optimizes two numbers: sparsity (L0 — how many features fire) and reconstruction loss (how well the SAE recovers the original activation). The assumption is that better reconstruction + sparser features = more interpretable SAEs.
SAEBench tests this assumption directly. The authors built a suite of 8 diverse evaluations spanning unsupervised metrics and downstream tasks, then ran them on 200+ SAEs of varying architectures, widths, and sparsities. Their central finding:
This matters enormously for our work. We are running Feature Rivalry and 2×2 quadrant analyses on the official Qwen SAE-Res weights. If those SAEs were trained with proxy metrics as the objective, SAEBench suggests we may be working with suboptimal representations — and our results would improve with better-trained SAEs.
SAEBench is organized into two categories: unsupervised/proxy metrics (cheap, no labels) and downstream task evaluations (expensive, require labels or interventions).
| Evaluation | Type | What it measures |
|---|---|---|
| Core (L0 + Loss Recovered) | Unsupervised | Sparsity and reconstruction fidelity |
| Feature Absorption | Unsupervised | Whether features absorb correlated concepts to increase sparsity |
| Auto-Interp | Downstream | Can an LLM judge explain what each feature does? |
| RAVEL | Downstream | Can SAE features causally isolate and edit specific facts? |
| Spurious Correlation Removal (SCR) | Downstream | Can SAE features disentangle spurious from intended signals? |
| Targeted Probe Perturbation (TPP) | Downstream | Can ablating one feature set affect only one class probe? |
| Sparse Probing | Downstream | How well do top-K SAE features classify specific concepts? |
| Unlearning | Downstream | Can SAE features selectively remove knowledge without side effects? |
The standard SAE training objective: minimize reconstruction MSE while keeping L0 (number of non-zero features) low. These are the proxy metrics that every SAE paper reports. SAEBench's key insight is that these metrics alone are insufficient.
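Loss recovered measures how much of the original model's next-token performance survives when the true residual stream is replaced with the SAE reconstruction, expressed as a fraction of the cross-entropy gap between the clean model and an ablation baseline. A value of 1.0 means the reconstruction leaves the loss unchanged; 0.0 means the SAE output is no better than the ablation baseline (typically zero-ablating the activation, rather than random guessing). Typical values range from 0.85 to 0.99 for well-trained SAEs.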
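The two Core metrics can be sketched in a few lines. This is a minimal illustration, not SAEBench's implementation: the function names are hypothetical, and it assumes the 0.0 baseline is the loss measured with the activation ablated (commonly zero-ablation).

```python
def loss_recovered(ce_clean: float, ce_spliced: float, ce_ablated: float) -> float:
    """Fraction of the ablation-induced loss gap that the SAE reconstruction closes.

    ce_clean   : cross-entropy of the unmodified model
    ce_spliced : cross-entropy with the SAE reconstruction spliced in
    ce_ablated : cross-entropy with the activation ablated (assumed baseline)
    """
    gap = ce_ablated - ce_clean
    if gap <= 0:
        raise ValueError("ablated loss must exceed clean loss")
    return (ce_ablated - ce_spliced) / gap


def mean_l0(feature_acts: list[list[float]]) -> float:
    """Average number of non-zero SAE features per token."""
    if not feature_acts:
        raise ValueError("need at least one token")
    active_counts = [sum(1 for a in row if a != 0.0) for row in feature_acts]
    return sum(active_counts) / len(feature_acts)


# Illustrative numbers only, not measurements.
print(loss_recovered(ce_clean=2.0, ce_spliced=2.25, ce_ablated=4.5))  # 0.9
print(mean_l0([[0.0, 1.2, 0.0, 0.3], [0.0, 0.0, 0.7, 0.0]]))          # 1.5
```

Note that a reconstruction worse than the ablation baseline would give a negative value, so the score is not strictly bounded to [0, 1].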
Feature absorption is a subtle failure mode where an SAE "hides" one concept inside another to increase sparsity. Example: if "short" always starts with "S", the SAE might learn a single "short" feature that implicitly encodes "starts with S", rather than learning two separate features.
SAEBench quantifies this using a first-letter classification task. They train a probe on the true residual stream as ground truth, then check whether SAE features faithfully replicate this probe's behavior. If the probe correctly classifies a token but the SAE features do not, and a different SAE feature "absorbs" that classification direction, that's absorption. Higher scores mean less absorption.
Auto-Interp uses GPT-4o-mini as an LLM judge. For each SAE feature, the top-activating sequences are extracted, an LLM generates an explanation (e.g., "this feature fires on French city names"), and a second LLM tries to predict which sequences would activate based on the explanation alone. The score is the prediction accuracy.
This is the closest proxy to "human interpretability" at scale. A score of 0.5 means random guessing; 1.0 means perfect explanation. Most SAEs score in the 0.3–0.6 range.
RAVEL tests whether SAE features can causally isolate distinct facts about entities. The dataset contains entities (cities, Nobel laureates, occupations) with attributes (country, language, profession). The test: can we change "Paris is in France" to "Paris is in Japan" by intervening on SAE features, without also changing "the language of Paris is French"?
The metric has two components: cause (does the intervention change the target attribute?) and isolation (does it leave other attributes unchanged?). The final score averages both. High RAVEL scores mean the SAE has learned genuinely disentangled features.
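As a rough sketch of how the two components combine, the snippet below averages a cause rate and an isolation rate over a list of intervention outcomes. The `Trial` structure and the per-trial pairing are simplifying assumptions for illustration; the actual benchmark may measure the two components on different prompt sets.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    target_changed: bool    # cause: the target attribute took the new value
    others_preserved: bool  # isolation: the other attributes stayed the same


def ravel_score(trials: list[Trial]) -> tuple[float, float, float]:
    """Return (cause, isolation, final score), where final is their mean."""
    if not trials:
        raise ValueError("need at least one trial")
    cause = sum(t.target_changed for t in trials) / len(trials)
    isolation = sum(t.others_preserved for t in trials) / len(trials)
    return cause, isolation, (cause + isolation) / 2


# Illustrative outcomes only: the edit lands in 3 of 4 trials,
# but leaves other attributes intact in only 2 of 4.
trials = [
    Trial(True, True),
    Trial(True, False),
    Trial(True, False),
    Trial(False, True),
]
print(ravel_score(trials))  # (0.75, 0.5, 0.625)
```

Because the final score is a plain average, an intervention that always changes the target but also scrambles everything else still scores 0.5, which is why the two components are worth reporting separately.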
SCR tests whether SAE features can separate spurious signals from intended signals. Example: the Bias in Bios dataset has profession and gender labels. A biased classifier trained on male professors and female nurses picks up on gender. The test: can we identify and ablate the SAE features encoding gender, such that the classifier now correctly predicts profession on balanced test data?
The score is normalized: (accuracy_after_ablation − baseline) / (oracle − baseline), where "oracle" is a classifier trained directly on the intended concept. A score of 1.0 means perfect disentanglement; 0.0 means the SAE cannot separate the concepts at all.
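The normalization above is simple enough to write out directly. The function name and the example accuracies are hypothetical; only the formula comes from the text.

```python
def scr_score(acc_after_ablation: float, acc_baseline: float, acc_oracle: float) -> float:
    """Normalized SCR score: share of the baseline-to-oracle gap closed by ablation."""
    gap = acc_oracle - acc_baseline
    if gap <= 0:
        raise ValueError("oracle accuracy must exceed baseline accuracy")
    return (acc_after_ablation - acc_baseline) / gap


# Illustrative numbers: the biased classifier is at chance on balanced data,
# ablating the spurious features lifts it to 0.875, the oracle reaches 1.0.
print(scr_score(acc_after_ablation=0.875, acc_baseline=0.5, acc_oracle=1.0))  # 0.75
```

If the ablation makes the classifier worse than the baseline, the score goes negative, so values outside [0, 1] are possible in practice.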
TPP generalizes SCR to any multiclass classification task. For each class, the top SAE features are identified via probe attribution. Then, for each class in turn, that class's features are ablated and the accuracy of every class probe is re-measured. The score measures whether ablation hurts only the matching probe (good disentanglement) or also hurts unrelated probes (poor disentanglement).
Formally: TPP = mean(accuracy drop on matching probes) − mean(accuracy drop on non-matching probes). A high TPP means features are cleanly separated by concept.
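One way to read the formula is as a diagonal-versus-off-diagonal comparison on a matrix of probe accuracy changes. The sketch below assumes entry `[i][j]` holds the accuracy drop of probe `j` after ablating the features selected for class `i`; that layout and the function name are illustrative choices, not taken from the benchmark code.

```python
def tpp_score(acc_drop: list[list[float]]) -> float:
    """Mean drop on matching probes (diagonal) minus mean drop on the rest."""
    n = len(acc_drop)
    if n < 2 or any(len(row) != n for row in acc_drop):
        raise ValueError("need a square matrix with at least two classes")
    matching = [acc_drop[i][i] for i in range(n)]
    non_matching = [acc_drop[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(matching) / len(matching) - sum(non_matching) / len(non_matching)


# Illustrative two-class example: ablating class 0 features only hurts probe 0,
# while ablating class 1 features also leaks into probe 0.
drops = [
    [0.5, 0.0],
    [0.25, 0.5],
]
print(tpp_score(drops))  # 0.375
```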
Sparse Probing tests whether a small number of SAE features can classify specific concepts. For 35 binary classification tasks (language ID, profession, sentiment, code language, news topic), the top-K SAE features are selected by maximum mean difference, and a logistic regression probe is trained on just those K features.
This directly measures whether the SAE has learned semantically meaningful features that generalize to classification. Higher accuracy means the SAE's sparse representation captures the underlying concepts well.
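The feature-selection step can be sketched as follows. This covers only the top-K selection, and it assumes "mean difference" means the absolute difference between the class-conditional mean activations; the probe itself would then be any logistic regression fitted on the selected columns. All names here are hypothetical.

```python
def top_k_by_mean_diff(acts: list[list[float]], labels: list[int], k: int) -> list[int]:
    """Indices of the k features whose mean activation differs most between classes."""
    pos = [row for row, y in zip(acts, labels) if y == 1]
    neg = [row for row, y in zip(acts, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    n_features = len(acts[0])

    def col_mean(rows: list[list[float]], j: int) -> float:
        return sum(r[j] for r in rows) / len(rows)

    diffs = [abs(col_mean(pos, j) - col_mean(neg, j)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: diffs[j], reverse=True)
    return ranked[:k]


def select_columns(acts: list[list[float]], cols: list[int]) -> list[list[float]]:
    """Keep only the selected feature columns, ready to feed to a probe."""
    return [[row[j] for j in cols] for row in acts]


# Toy activations: feature 0 fires on positives, feature 1 on negatives,
# feature 2 is constant and carries no signal.
acts = [[3.0, 0.0, 1.0], [2.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 2.0, 1.0]]
labels = [1, 1, 0, 0]
top2 = top_k_by_mean_diff(acts, labels, k=2)
print(top2)                        # [0, 1]
print(select_columns(acts, top2))  # [[3.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
```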
The Unlearning evaluation uses the WMDP-bio dataset (dangerous biology knowledge). The goal: selectively remove biology knowledge by clamping SAE feature activations to negative values, while preserving performance on unrelated MMLU tasks (history, geography, CS).
Features are selected by activation frequency (which the benchmark calls "sparsity"): they should fire often on the "forget" set (WMDP-bio) and rarely on the "retain" set (WikiText). The score is the minimum WMDP-bio accuracy achievable while maintaining MMLU accuracy ≥ 0.99. Lower is better: it means the SAE can precisely target and remove specific knowledge.
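The scoring rule amounts to a constrained minimum over a hyperparameter sweep. In the sketch below, each sweep entry is a hypothetical `(wmdp_accuracy, mmlu_accuracy)` pair for one intervention setting, and the 0.99 floor is applied exactly as stated in the text; whether that floor is absolute or relative to the unmodified model is not specified here, so treat it as an assumption.

```python
from typing import Optional


def unlearning_score(
    sweep: list[tuple[float, float]], mmlu_floor: float = 0.99
) -> Optional[float]:
    """Lowest WMDP-bio accuracy among settings that keep MMLU at or above the floor.

    Returns None when no setting satisfies the MMLU constraint.
    """
    eligible = [wmdp for wmdp, mmlu in sweep if mmlu >= mmlu_floor]
    return min(eligible) if eligible else None


# Illustrative sweep: stronger clamping lowers WMDP-bio accuracy,
# but the strongest setting also damages MMLU and is excluded.
sweep = [(0.55, 1.0), (0.40, 0.995), (0.30, 0.97)]
print(unlearning_score(sweep))  # 0.4
```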
The most important result. The authors trained 200+ SAEs across 7 architectures, 3 widths (4k, 16k, 65k), and 6 sparsity levels. They then correlated the proxy metrics (L0, loss recovered) with the 6 downstream task metrics.
Why does this happen? Because SAEs can game the reconstruction objective by learning distributed representations that reconstruct well but are not interpretable. Feature absorption is one manifestation: the SAE learns gerrymandered features that happen to reconstruct the activation vector well, but no individual feature corresponds to a human-interpretable concept.
The authors tested 7 SAE architectures: ReLU, TopK, BatchTopK, JumpReLU, Gated, P-anneal, and Matryoshka BatchTopK. Matryoshka SAEs learn a hierarchy of features at different granularities (like Russian nesting dolls), allowing the model to use coarse features for common concepts and fine features for rare ones.
This is a significant practical finding. If you are training new SAEs, Matryoshka should be the default architecture. If you are using pretrained SAEs (like Qwen SAE-Res), you should be aware that they are likely using a simpler architecture (ReLU or TopK) and may underperform on disentanglement tasks.
There is no universal "best" SAE. The optimal width and sparsity vary dramatically across tasks:
This has implications for SAE selection. If your goal is feature editing (RAVEL/SCR/TPP), you want a Matryoshka SAE at moderate width and sparsity. If your goal is concept detection (Sparse Probing), you want the widest dictionary you can afford.
The authors trained SAEs for varying durations and found that while additional training improves proxy metrics (loss recovered), the effect on downstream tasks is modest after the first ~100M tokens. Architecture choice has a much larger impact than training duration.
ReLU SAEs (the most common architecture) show significant feature absorption. This means their features are less interpretable than they appear — a feature that seems to detect "short words" may actually be absorbing "starts with S" and other correlated properties. TopK and JumpReLU reduce but do not eliminate absorption. Matryoshka SAEs show the least absorption among architectures tested.
While analyzing SAEBench, we came across a new paper by McCann (May 2026) that identifies a fundamental problem with the Auto-Interp metric — one of SAEBench's 8 evaluations. This finding is important enough to flag before anyone uses SAEBench rankings as ground truth.
Auto-Interp uses an LLM judge (GPT-4o-mini) to generate natural-language explanations for each SAE feature, then scores how well the explanation predicts which tokens will activate. McCann identifies a failure mode in this setup, which he calls descriptive collision: many distinct SAE features admit the same explanation.
Reanalyzing the largest public dataset of human-annotated SAE features (Marks et al., 2025 — 722 features across Gemma 2 2B and Pythia 70M):
This means Auto-Interp scores are inflated. A feature that scores 0.6 might seem well-explained, but if 3 other features share the same explanation, the explanation is not actually identifying the feature — it's identifying a cluster of features.
Standard Auto-Interp uses detection scoring: given an explanation, predict which tokens activate. This is invariant to collision because it only tests whether the explanation matches the feature's behavior, not whether the explanation is unique to that feature.
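A toy example makes the blind spot concrete. Below, two distinct features share one explanation and receive the same detection score, because the score for a feature never looks at any other feature. The exact-string collision count is only an illustration (real collisions would need semantic matching), and nothing here is McCann's method; all names and data are made up.

```python
from collections import defaultdict


def detection_accuracy(predicted: list[bool], actual: list[bool]) -> float:
    """Share of sequences where the judge's prediction matches the feature's firing."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def collision_groups(explanations: dict[str, str]) -> dict[str, list[str]]:
    """Group feature ids by normalized explanation text; keep groups of size > 1."""
    groups: dict[str, list[str]] = defaultdict(list)
    for feature_id, text in explanations.items():
        groups[text.strip().lower()].append(feature_id)
    return {text: ids for text, ids in groups.items() if len(ids) > 1}


# Sequences: Paris, Lyon, Nice, Marseille, Berlin, banana.
# The judge, given "French city names", predicts the first four will fire.
judge_prediction = [True, True, True, True, False, False]
feature_a_fires = [True, True, True, False, False, False]
feature_b_fires = [True, True, False, True, False, False]

score_a = detection_accuracy(judge_prediction, feature_a_fires)
score_b = detection_accuracy(judge_prediction, feature_b_fires)
print(score_a == score_b)  # True: both features score 5/6 with one explanation

collisions = collision_groups(
    {"A": "French city names", "B": "french city names ", "C": "Python keywords"}
)
print(collisions)  # {'french city names': ['A', 'B']}
```

Features A and B fire on different sequences, yet detection scoring rates the shared explanation equally well for both, so the score alone cannot tell whether the explanation identifies one feature or a cluster.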
McCann proposes two fixes:
Our recommendation: treat SAEBench's Auto-Interp scores with caution. The downstream task evaluations (RAVEL, SCR, TPP, Sparse Probing, Unlearning) are less affected because they test causal behavior, not explanation quality. Core metrics and Feature Absorption are also unaffected. But if you are selecting an SAE primarily for interpretability, Auto-Interp alone is not sufficient — you need discrimination-aware evaluation.
We are running three concurrent SAE projects: Feature Rivalry, Uncertainty vs Correctness, and WriteSAE. All three depend on the quality of the SAE being used. SAEBench tells us some uncomfortable truths about the Qwen SAE-Res weights we are using.
Based on the SAEBench results, we should expect:
Running the full SAEBench suite on a single SAE takes roughly 4.5 hours on an RTX 3090 (24GB VRAM) when one-time setup is included. Here is the breakdown per evaluation:
| Evaluation | Per-SAE Time | Setup Time |
|---|---|---|
| Core (L0 + Loss) | 9 min | 0 min |
| Feature Absorption | 26 min | 33 min |
| Auto-Interp | 9 min | 0 min |
| RAVEL | 45 min | 45 min |
| SCR | 6 min | 22 min |
| TPP | 2 min | 5 min |
| Sparse Probing | 3 min | 15 min |
| Unlearning | 10 min | 33 min |
| Total | 110 min | 153 min |
For our purposes, the "light" suite (Core + Feature Absorption + Sparse Probing + Auto-Interp) would take ~50 minutes of GPU time and give us a solid baseline. The full suite is only needed if we want to compare intervention capabilities (RAVEL, SCR, TPP, Unlearning).
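The light-suite estimate can be checked directly from the per-evaluation rows of the table. The dictionary below just transcribes those rows; the variable names are arbitrary.

```python
# (per-SAE minutes, one-time setup minutes), transcribed from the table rows.
TIMES = {
    "Core": (9, 0),
    "Feature Absorption": (26, 33),
    "Auto-Interp": (9, 0),
    "RAVEL": (45, 45),
    "SCR": (6, 22),
    "TPP": (2, 5),
    "Sparse Probing": (3, 15),
    "Unlearning": (10, 33),
}

LIGHT_SUITE = ["Core", "Feature Absorption", "Sparse Probing", "Auto-Interp"]

light_per_sae = sum(TIMES[name][0] for name in LIGHT_SUITE)
light_setup = sum(TIMES[name][1] for name in LIGHT_SUITE)

print(light_per_sae)  # 47
print(light_setup)    # 48
```

So the ~50 minute figure corresponds to the per-SAE column; the first run would need roughly another 48 minutes if the one-time setup for Feature Absorption and Sparse Probing has not been done yet.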
| Dimension | SAEBench | Feature Rivalry / 2×2 Method |
|---|---|---|
| Goal | Evaluate SAE quality holistically | Detect specific failure modes (uncertainty, incorrectness) |
| Input | SAE + model + labeled datasets | SAE + model + sampled responses |
| Output | 8 numerical scores per SAE | Feature rankings / quadrant classifications |
| Supervision | Mostly supervised (downstream tasks) | Unsupervised (rivalry) or supervised (2×2) |
| Intervention | Tests causal editing (RAVEL, SCR, Unlearning) | Tests feature suppression / steering |
| GPU cost | ~5h per SAE (full suite) | ~2-4h per experiment |
| Best use | Selecting/training better SAEs | Using existing SAEs for monitoring |
These are complementary, not competing. SAEBench tells us which SAE to use; Feature Rivalry and the 2×2 method tell us what to do with it. The ideal workflow: run SAEBench on candidate SAEs, pick the best one, then run uncertainty detection on that SAE.
Essential infrastructure. SAEBench is the most important SAE evaluation framework released to date. The finding that proxy metrics do not predict practical performance is a watershed moment — it means much of the existing SAE literature may be optimizing the wrong thing. For our work specifically, SAEBench raises a critical question: are the Qwen SAE-Res weights good enough for the uncertainty detection tasks we care about?
Immediate action: We will run the SAEBench Core + Feature Absorption evaluations on Qwen SAE-Res as soon as GPU access is restored. This is a ~1-hour experiment that will tell us whether our SAE is solid or whether we need to invest in training a Matryoshka baseline before continuing with Feature Rivalry and 2×2 reproductions.
We cloned the SAEBench repository and explored what it would take to evaluate the official Qwen SAE-Res weights. Here is what we found.
SAEBench uses transformer_lens.HookedTransformer for all model loading
and activation extraction. We checked whether transformer-lens supports
Qwen/Qwen3.5-35B-A3B-Base:
The problem is not just MoE routing (which transformer-lens handles for OLMoE
via convert_olmoe_weights). Qwen 3.5 35B-A3B has three features that
transformer-lens does not support:
mlp.gate,
mlp.experts) differ from both dense MLPs and OLMoE's batched
expert format.
Adding support would require writing a new convert_qwen3_5_moe_weights
function, updating convert_hf_model_config with the Qwen3.5 MoE
config mapping, and potentially extending HookedTransformer to handle linear
attention. This is a multi-day engineering effort — not a quick integration.
The Qwen SAE-Res model card provides a demo that uses HuggingFace transformers
directly — AutoModelForCausalLM with register_forward_hook.
This works because transformers natively supports Qwen3_5MoeForConditionalGeneration.
This tells us the path forward: bypass transformer-lens entirely and write lightweight evaluations using transformers + the SAE state dicts directly.
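A minimal sketch of that path is shown below, using a toy `nn.Sequential` in place of the real model so it runs without downloading weights. Several things are assumptions made for illustration: the TopK SAE parameter names and the `x - b_dec` pre-encoder centering are hypothetical and may not match the Qwen SAE-Res state dict, and a real decoder layer may return a tuple rather than a bare tensor, in which case the hooks would need to unpack it. Only `register_forward_hook` and its `(module, args, output)` hook signature are standard PyTorch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D_MODEL, D_SAE, K = 16, 64, 4

# Toy stand-in for a transformer. The real code would load the model with
# transformers.AutoModelForCausalLM and hook one decoder layer instead.
model = nn.Sequential(nn.Linear(D_MODEL, D_MODEL), nn.Linear(D_MODEL, D_MODEL))

# Hypothetical TopK SAE parameters; real ones would come from the SAE state dict.
W_enc = torch.randn(D_MODEL, D_SAE) / D_MODEL**0.5
b_enc = torch.zeros(D_SAE)
W_dec = torch.randn(D_SAE, D_MODEL) / D_SAE**0.5
b_dec = torch.zeros(D_MODEL)


def sae_encode(x: torch.Tensor) -> torch.Tensor:
    """Keep the K largest pre-activations per row (after ReLU), zero the rest."""
    pre = (x - b_dec) @ W_enc + b_enc
    vals, idx = pre.topk(K, dim=-1)
    acts = torch.zeros_like(pre)
    acts.scatter_(-1, idx, torch.relu(vals))
    return acts


def sae_decode(acts: torch.Tensor) -> torch.Tensor:
    return acts @ W_dec + b_dec


captured = {}


def capture_hook(module, args, output):
    # Returning None leaves the layer output unchanged.
    captured["resid"] = output.detach()


def splice_hook(module, args, output):
    # Returning a tensor replaces the layer output with the SAE reconstruction.
    return sae_decode(sae_encode(output))


x = torch.randn(8, D_MODEL)

# Pass 1: record the activation at the hooked layer.
handle = model[0].register_forward_hook(capture_hook)
with torch.no_grad():
    clean_out = model(x)
handle.remove()

acts = sae_encode(captured["resid"])
l0 = (acts != 0).sum(dim=-1).float().mean().item()

# Pass 2: rerun with the reconstruction spliced in at the same layer.
handle = model[0].register_forward_hook(splice_hook)
with torch.no_grad():
    spliced_out = model(x)
handle.remove()

print(captured["resid"].shape, l0 <= K, spliced_out.shape == clean_out.shape)
```

With a language model in place of the toy network, the clean and spliced passes would each yield a cross-entropy loss, and a third pass with an ablation hook would supply the baseline needed for loss recovered.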
For our immediate needs (Core + Feature Absorption + Sparse Probing), we do not need the full SAEBench framework. We can write ~200-line scripts that:
AutoModelForCausalLM
We already have working code for (1), (2), (3), and (4) from our Feature Rivalry and 2×2 reproductions. Adding Core metrics and Feature Absorption is a matter of:
We wrote starter scripts for the lightweight approach:
- saebench_qwen_core.py: Computes L0 and loss recovered. Uses AutoModelForCausalLM + forward hooks + TopK SAE encode/decode. Ready to run when GPU access returns.
- saebench_qwen_absorption.py: Simplified Feature Absorption. Trains first-letter probes on true residuals and SAE features; finds false negatives where the probe is correct but the top-K features don't fire. Checks whether other SAE features align with the probe direction. Will refine with decoder-direction cosine similarity when GPU access returns.
- saebench_qwen_sparse_probing.py: Simplified Sparse Probing. Trains dense, SAE-full, and SAE-sparse probes on sentiment classification (SST-2/IMDB/Amazon). Top-K features are selected by mean activation difference. Tests whether SAE features provide a good sparse basis for downstream tasks.
- saebench_qwen_integration.md: Full integration plan documenting the transformer-lens blocker, the lightweight approach, and what an upstream PR would look like.
All files live in ~/scratch/ on this machine.