SAEBench: What Does a "Good" SAE Actually Mean?

Paper: arXiv:2503.09532 · Authors: Karvonen, Rager, Lin, Tigges, Bloom, et al. · Explored: 2026-05-15

Overview

Most SAE research optimizes two numbers: sparsity (L0 — how many features fire) and reconstruction loss (how well the SAE recovers the original activation). The assumption is that better reconstruction + sparser features = more interpretable SAEs.

SAEBench tests this assumption directly. The authors built a suite of 8 diverse evaluations spanning unsupervised metrics and downstream tasks, then ran them on 200+ SAEs of varying architectures, widths, and sparsities. Their central finding:

Proxy metrics do not reliably predict practical performance. SAEs that score well on reconstruction and sparsity often underperform on real interpretability tasks. Conversely, some SAEs that look worse on paper are actually more useful for downstream interventions.

This matters enormously for our work. We are running Feature Rivalry and 2×2 quadrant analyses on the official Qwen SAE-Res weights. If those SAEs were trained with proxy metrics as the objective, SAEBench suggests we may be working with suboptimal representations, and our results might improve with better-trained SAEs.

The 8 Evaluations

SAEBench is organized into two categories: unsupervised/proxy metrics (cheap, no labels) and downstream task evaluations (expensive, require labels or interventions).

Evaluation	Type	What it measures
Core (L0 + Loss Recovered)	Unsupervised	Sparsity and reconstruction fidelity
Feature Absorption	Unsupervised	Whether features absorb correlated concepts to increase sparsity
Auto-Interp	Downstream	Can an LLM judge explain what each feature does?
RAVEL	Downstream	Can SAE features causally isolate and edit specific facts?
Spurious Correlation Removal (SCR)	Downstream	Can SAE features disentangle spurious from intended signals?
Targeted Probe Perturbation (TPP)	Downstream	Can ablating one feature set affect only one class probe?
Sparse Probing	Downstream	How well do top-K SAE features classify specific concepts?
Unlearning	Downstream	Can SAE features selectively remove knowledge without side effects?

1. Core Metrics (L0 + Loss Recovered)

The standard SAE training objective: minimize reconstruction MSE while keeping L0 (number of non-zero features) low. These are the proxy metrics that every SAE paper reports. SAEBench's key insight is that these metrics alone are insufficient.

Loss recovered measures what fraction of the original model's cross-entropy loss is recovered when replacing the true residual stream with the SAE reconstruction. A value of 1.0 means perfect reconstruction; 0.0 means the SAE output is no better than zero-ablating the activation. Typical values range from 0.85 to 0.99 for well-trained SAEs.
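As a minimal sketch of how this number is computed, assuming the baseline is the loss measured with the activation replaced by zeros (the function and argument names are ours, not SAEBench's):

```python
def loss_recovered(ce_clean: float, ce_sae: float, ce_ablated: float) -> float:
    """Fraction of cross-entropy loss recovered by the SAE reconstruction.

    ce_clean:   loss of the unmodified model
    ce_sae:     loss with the activation replaced by the SAE reconstruction
    ce_ablated: loss with the activation replaced by a baseline
                (assumption: zero ablation)
    """
    gap = ce_ablated - ce_clean
    if gap <= 0:
        raise ValueError("baseline loss must exceed the clean loss")
    return (ce_ablated - ce_sae) / gap


# Toy losses, not measured values.
print(loss_recovered(ce_clean=2.0, ce_sae=2.25, ce_ablated=7.0))
```

With these toy losses the SAE closes 4.75 of the 5.0-nat gap, which is why small differences in loss recovered near 1.0 can still hide meaningful differences in model behavior.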

2. Feature Absorption

Feature absorption is a subtle failure mode where an SAE "hides" one concept inside another to increase sparsity. Example: because "short" always starts with "S", the SAE can let a token-specific "short" feature implicitly encode "starts with S". The general "starts with S" feature then stops firing on "short", so the SAE no longer has two cleanly separate features.

SAEBench quantifies this using a first-letter classification task. They train a probe on the true residual stream as ground truth, then check whether SAE features faithfully replicate this probe's behavior. If the probe correctly classifies a token but the SAE features do not, and a different SAE feature "absorbs" that classification direction, that's absorption. Higher scores mean less absorption.

3. Automated Interpretability (Auto-Interp)

Uses GPT-4o-mini as an LLM judge. For each SAE feature, the top-activating sequences are extracted, an LLM generates an explanation (e.g., "this feature fires on French city names"), and a second LLM tries to predict which sequences would activate based on the explanation alone. The score is the prediction accuracy.

This is the closest proxy to "human interpretability" at scale. A score of 0.5 means random guessing; 1.0 means perfect explanation. Most SAEs score in the 0.3–0.6 range.

4. RAVEL (Causal Fact Editing)

RAVEL tests whether SAE features can causally isolate distinct facts about entities. The dataset contains entities (cities, Nobel laureates, occupations) with attributes (country, language, profession). The test: can we change "Paris is in France" to "Paris is in Japan" by intervening on SAE features, without also changing "the language of Paris is French"?

The metric has two components: cause (does the intervention change the target attribute?) and isolation (does it leave other attributes unchanged?). The final score averages both. High RAVEL scores mean the SAE has learned genuinely disentangled features.
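The two components can be sketched as simple fractions over intervention outcomes. This is an illustration of the averaging described above, not the RAVEL implementation; the boolean-list inputs and the function name are hypothetical:

```python
def ravel_score(target_changed: list[bool], others_unchanged: list[bool]) -> float:
    """Average of cause and isolation.

    target_changed:   per prompt, did the intervention flip the target
                      attribute to the intended new value?
    others_unchanged: per prompt, did a non-target attribute keep its
                      original value after the intervention?
    """
    if not target_changed or not others_unchanged:
        raise ValueError("both outcome lists must be non-empty")
    cause = sum(target_changed) / len(target_changed)
    isolation = sum(others_unchanged) / len(others_unchanged)
    return (cause + isolation) / 2


# Toy outcomes: 3 of 4 edits succeed, 1 of 2 side attributes survives.
print(ravel_score([True, True, False, True], [True, False]))  # 0.625
```

Averaging means an intervention that changes everything (cause 1.0, isolation 0.0) and one that changes nothing (cause 0.0, isolation 1.0) both land at 0.5.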

5. Spurious Correlation Removal (SCR)

Tests whether SAE features can separate spurious signals from intended signals. Example: the Bias in Bios dataset has profession and gender labels. A biased classifier trained on male professors and female nurses picks up on gender. The test: can we identify and ablate the SAE features encoding gender, such that the classifier now correctly predicts profession on balanced test data?

The score is normalized: (accuracy_after_ablation − baseline) / (oracle − baseline), where "oracle" is a classifier trained directly on the intended concept. A score of 1.0 means perfect disentanglement; 0.0 means the SAE cannot separate the concepts at all.
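The normalization is a one-liner; the sketch below simply transcribes the formula above (argument names are ours):

```python
def scr_score(acc_after_ablation: float, acc_baseline: float, acc_oracle: float) -> float:
    """Normalized SCR score: share of the baseline-to-oracle gap that ablation closes."""
    gap = acc_oracle - acc_baseline
    if gap <= 0:
        raise ValueError("oracle accuracy must exceed baseline accuracy")
    return (acc_after_ablation - acc_baseline) / gap


# Toy accuracies: ablation closes half of the gap.
print(scr_score(acc_after_ablation=0.75, acc_baseline=0.5, acc_oracle=1.0))  # 0.5
```

Note that the score can go negative if ablation makes the classifier worse than the biased baseline.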

6. Targeted Probe Perturbation (TPP)

TPP generalizes SCR to any multiclass classification task. For each class, the top SAE features are identified via probe attribution. Then, for each class probe, the features from that class are ablated. The score measures whether ablation hurts only the matching probe (good disentanglement) or also hurts unrelated probes (poor disentanglement).

Formally: TPP = mean(across matching probes) − mean(across non-matching probes). A high TPP means features are cleanly separated by concept.
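One way to read this formula is as a diagonal-versus-off-diagonal contrast on a square matrix of probe accuracy drops. The matrix layout below is our assumption for illustration: entry [i][j] is the accuracy drop of probe j after ablating the features selected for class i.

```python
def tpp_score(acc_drop: list[list[float]]) -> float:
    """Mean accuracy drop on matching probes minus mean drop on non-matching probes."""
    n = len(acc_drop)
    if n < 2 or any(len(row) != n for row in acc_drop):
        raise ValueError("expected a square matrix with at least 2 classes")
    matching = [acc_drop[i][i] for i in range(n)]
    non_matching = [acc_drop[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(matching) / len(matching) - sum(non_matching) / len(non_matching)


# Toy matrix: ablations mostly hurt their own probe.
drops = [
    [0.4, 0.0],
    [0.1, 0.3],
]
print(round(tpp_score(drops), 3))  # 0.3
```

A perfectly entangled SAE, where every ablation hurts every probe equally, scores 0.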

7. Sparse Probing

Tests whether a small number of SAE features can classify specific concepts. For 35 binary classification tasks (language ID, profession, sentiment, code language, news topic), the top-K SAE features are selected by maximum mean difference, and a logistic regression probe is trained on just those K features.

This directly measures whether the SAE has learned semantically meaningful features that generalize to classification. Higher accuracy means the SAE's sparse representation captures the underlying concepts well.
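The feature-selection step can be sketched in plain Python. We assume the ranking uses the absolute difference in mean activation between positive and negative examples; the logistic regression probe that would then be trained on the selected columns is omitted here.

```python
def top_k_by_mean_difference(acts: list[list[float]], labels: list[int], k: int) -> list[int]:
    """Return indices of the k features whose mean activation differs most between classes.

    acts:   one row of SAE feature activations per example
    labels: 1 for the positive class, 0 for the negative class
    """
    pos = [row for row, y in zip(acts, labels) if y == 1]
    neg = [row for row, y in zip(acts, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    n_features = len(acts[0])
    diffs = []
    for f in range(n_features):
        mean_pos = sum(row[f] for row in pos) / len(pos)
        mean_neg = sum(row[f] for row in neg) / len(neg)
        diffs.append(abs(mean_pos - mean_neg))
    ranked = sorted(range(n_features), key=lambda f: diffs[f], reverse=True)
    return ranked[:k]


# Toy activations: 4 examples, 3 features.
acts = [[0, 2, 1], [0, 3, 1], [1, 0, 1], [1, 0, 1]]
labels = [1, 1, 0, 0]
print(top_k_by_mean_difference(acts, labels, k=2))  # [1, 0]
```

Feature 2 is constant across classes, so it is never selected no matter how strongly it fires.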

8. Unlearning

Uses the WMDP-bio dataset (dangerous biology knowledge). The goal: selectively remove biology knowledge by clamping SAE feature activations to negative values, while preserving performance on unrelated MMLU tasks (history, geography, CS).

Features are selected by activation frequency: a candidate feature should fire often on the "forget" set (WMDP-bio) and rarely on the "retain" set (WikiText). The score is the minimum WMDP-bio accuracy achievable while maintaining MMLU accuracy ≥ 0.99. Lower is better: it means the SAE can precisely target and remove specific knowledge.
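The scoring rule is a constrained minimum over intervention settings. A minimal sketch, taking the 0.99 threshold as stated above (whether it is absolute accuracy or accuracy relative to the unmodified model is not specified here, so the caller decides); the input format is hypothetical:

```python
from typing import Iterable, Optional, Tuple


def unlearning_score(
    runs: Iterable[Tuple[float, float]], mmlu_floor: float = 0.99
) -> Optional[float]:
    """Lowest WMDP-bio accuracy among settings that keep MMLU at or above the floor.

    runs: (wmdp_bio_accuracy, mmlu_accuracy) pairs, one per intervention
          setting (e.g. per choice of features and clamp value).
    Returns None if no setting satisfies the MMLU constraint.
    """
    eligible = [wmdp for wmdp, mmlu in runs if mmlu >= mmlu_floor]
    return min(eligible) if eligible else None


# Toy sweep: the most aggressive setting breaks MMLU and is excluded.
sweep = [(0.60, 1.00), (0.45, 0.995), (0.30, 0.97)]
print(unlearning_score(sweep))  # 0.45
```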

Key Findings

Finding 1: Proxy metrics do not predict downstream performance

The most important result. The authors trained 200+ SAEs across 7 architectures, 3 widths (4k, 16k, 65k), and 6 sparsity levels. They then correlated the proxy metrics (L0, loss recovered) with the 6 downstream task metrics.

Result: Reconstruction quality and sparsity are weakly or not at all correlated with downstream task performance. An SAE with excellent loss recovered may score poorly on RAVEL, SCR, and TPP. This means you cannot tell whether an SAE is "good" just by looking at its training loss.

Why does this happen? Because SAEs can game the reconstruction objective by learning distributed representations that reconstruct well but are not interpretable. Feature absorption is one manifestation: the SAE learns gerrymandered features that happen to reconstruct the activation vector well, but no individual feature corresponds to a human-interpretable concept.

Finding 2: Matryoshka SAEs dominate on 5 of 8 metrics

The authors tested 7 SAE architectures: ReLU, TopK, BatchTopK, JumpReLU, Gated, P-anneal, and Matryoshka BatchTopK. Matryoshka SAEs learn a hierarchy of features at different granularities (like Russian nesting dolls), allowing the model to use coarse features for common concepts and fine features for rare ones.

Result: Matryoshka BatchTopK outperforms all other architectures on 5 of 8 metrics, especially in the typical L0 range of 40–200. It performs best on concept detection (Sparse Probing) and feature disentanglement (RAVEL, SCR, TPP). It does not lead on the Core proxy metrics, where it slightly underperforms, nor on Unlearning and Feature Absorption.

This is a significant practical finding. If you are training new SAEs, Matryoshka should be the default architecture. If you are using pretrained SAEs (like Qwen SAE-Res), you should be aware that they are likely using a simpler architecture (ReLU or TopK) and may underperform on disentanglement tasks.

Finding 3: The "best" SAE depends on the task

There is no universal "best" SAE. The optimal width and sparsity vary dramatically across tasks:

Sparse Probing benefits from wider dictionaries (65k > 16k > 4k)
Unlearning works best with narrower dictionaries and higher sparsity
RAVEL prefers moderate widths (~16k) and moderate sparsity (L0 ~50-100)
Auto-Interp scores improve with width but plateau around 16k

This has implications for SAE selection. If your goal is feature editing (RAVEL/SCR/TPP), you want a Matryoshka SAE at moderate width and sparsity. If your goal is concept detection (Sparse Probing), you want the widest dictionary you can afford.

Finding 4: Training time matters less than architecture

The authors trained SAEs for varying durations and found that while additional training improves proxy metrics (loss recovered), the effect on downstream tasks is modest after the first ~100M tokens. Architecture choice has a much larger impact than training duration.

Finding 5: Feature absorption is pervasive in ReLU SAEs

ReLU SAEs (the most common architecture) show significant feature absorption. This means their features are less interpretable than they appear — a feature that seems to detect "short words" may actually be absorbing "starts with S" and other correlated properties. TopK and JumpReLU reduce but do not eliminate absorption. Matryoshka SAEs show the least absorption among architectures tested.

A Critical Caveat: Descriptive Collision (arXiv:2605.12874)

While analyzing SAEBench, we came across a new paper by McCann (May 2026) that identifies a fundamental problem with the Auto-Interp metric — one of SAEBench's 8 evaluations. This finding is important enough to flag before anyone uses SAEBench rankings as ground truth.

The problem: one explanation describes many features

Auto-Interp uses an LLM judge (GPT-4o-mini) to generate natural-language explanations for each SAE feature, then scores how well the explanation predicts which tokens will activate. McCann identifies a failure mode in this setup, termed descriptive collision: many distinct SAE features admit the same explanation.

Reanalyzing the largest public dataset of human-annotated SAE features (Marks et al., 2025 — 722 features across Gemma 2 2B and Pythia 70M):

On average, each annotation string is reused across 3.07 features
82.1% of features share their annotation with at least one other feature
The most common annotation ("plural nouns") labels 101 distinct features spanning 18 layers and four model components
Information-theoretically, the average annotation resolves only 70% of feature identity
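To make these statistics concrete, here is a sketch of how they could be computed from a list of per-feature annotation strings. The definitions are our reading, not necessarily McCann's: mean reuse is features per distinct string, and "resolved identity" is the entropy of the annotation distribution divided by the bits needed to pick one feature out of N.

```python
from collections import Counter
from math import log2


def collision_stats(annotations: list[str]) -> tuple[float, float, float]:
    """Return (mean_reuse, shared_fraction, resolved_fraction) for one annotation per feature."""
    n = len(annotations)
    if n < 2:
        raise ValueError("need at least two annotated features")
    counts = Counter(annotations)
    # Average number of features carrying each distinct annotation string.
    mean_reuse = n / len(counts)
    # Fraction of features whose annotation also labels another feature.
    shared_fraction = sum(c for c in counts.values() if c > 1) / n
    # Bits the annotation carries, relative to the log2(n) bits needed
    # to identify a single feature (assumed definition).
    h_annotation = -sum((c / n) * log2(c / n) for c in counts.values())
    resolved_fraction = h_annotation / log2(n)
    return mean_reuse, shared_fraction, resolved_fraction


# Toy annotations, not the real dataset.
toy = ["plural nouns", "plural nouns", "plural nouns", "dates"]
print(collision_stats(toy))
```

In the toy case, three of four features share a label, so knowing the label leaves most of the question "which feature is this?" unanswered.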

This suggests Auto-Interp scores can be inflated. A feature that scores 0.6 might seem well-explained, but if several other features share the same explanation, the explanation is not actually identifying the feature; it is identifying a cluster of features.

Why detection scoring is blind to collision

Standard Auto-Interp uses detection scoring: given an explanation, predict which tokens activate. This is invariant to collision because it only tests whether the explanation matches the feature's behavior, not whether the explanation is unique to that feature.
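A toy example shows the blind spot. We assume detection scoring reduces to plain accuracy of the judge's activate / does-not-activate guesses; the data are invented for illustration.

```python
def detection_accuracy(predicted: list[int], actual: list[int]) -> float:
    """Fraction of sequences where the judge's activation guess matches the feature."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# The judge only sees the explanation ("plural nouns"), so it makes the
# same guesses for every feature that carries that explanation.
judge_guess = [1, 1, 1, 0, 0, 0]

# Two distinct features that share the explanation but fire differently.
feature_a = [1, 1, 1, 0, 0, 0]
feature_b = [1, 1, 0, 0, 0, 0]

print(detection_accuracy(judge_guess, feature_a))  # 1.0
print(detection_accuracy(judge_guess, feature_b))
```

Both features score well, yet nothing in either score reveals that the explanation cannot tell them apart.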

McCann proposes two fixes:

Collision-adjusted detection: Penalize explanations that match multiple features. If an explanation fits feature A and feature B, the score for both is reduced.
Discrimination scoring: Test whether the explanation can distinguish the target feature from its neighbors. Given two features and one explanation, can the judge identify which feature it belongs to?

Implication for SAEBench

Auto-Interp rankings in SAEBench may be unreliable. If descriptive collision is as prevalent as McCann suggests, the Auto-Interp scores reported by SAEBench are systematically inflated. A Matryoshka SAE that scores well on Auto-Interp might not actually have more interpretable features — it might just have features that are easier to explain with generic descriptions.

Our recommendation: treat SAEBench's Auto-Interp scores with caution. The downstream task evaluations (RAVEL, SCR, TPP, Sparse Probing, Unlearning) are less affected because they test causal behavior, not explanation quality. Core metrics and Feature Absorption are also unaffected. But if you are selecting an SAE primarily for interpretability, Auto-Interp alone is not sufficient — you need discrimination-aware evaluation.

Implications for Our SAE Work

We are running three concurrent SAE projects: Feature Rivalry, Uncertainty vs Correctness, and WriteSAE. All three depend on the quality of the SAE being used. SAEBench tells us some uncomfortable truths about the Qwen SAE-Res weights we are using.

What we know about Qwen SAE-Res

Architecture: ReLU (based on the SAE-Res paper and code)
Width: 32,768 per layer
L0: ~50 (reported in the model card)
Trained on: unknown token count, but likely sufficient for convergence

What SAEBench predicts about Qwen SAE-Res

Based on the SAEBench results, we should expect:

Good reconstruction: ReLU SAEs at L0 ~50 typically recover 95%+ of loss
Moderate feature absorption: ReLU architecture is prone to absorption
Moderate disentanglement: ReLU underperforms Matryoshka on RAVEL/SCR/TPP
Good sparse probing: 32k width is in the high-performance regime for concept detection
Moderate auto-interp: ReLU features are reasonably interpretable but not optimal

Key concern: If Feature Rivalry and the 2×2 quadrant method are detecting uncertainty via features that are actually absorbed (encoding multiple correlated concepts), our results may be confounded. A "pure uncertainty feature" might actually be absorbing "incorrectness" or "low-confidence token position." SAEBench suggests we should validate our feature identifications with ablation or steering experiments — which both papers do, to their credit.

What we should consider changing

Evaluate Qwen SAE-Res with SAEBench. We should run at least the Core, Feature Absorption, and Sparse Probing evaluations on Qwen SAE-Res to get baseline numbers. This requires ~2 hours of GPU time per evaluation.
Train a Matryoshka baseline. If SAEBench is correct, a Matryoshka SAE trained on Qwen 3.5 activations would outperform SAE-Res on disentanglement tasks. This is a 1–2 day training run but could significantly improve all downstream interpretability results.
Compare results across architectures. When we reproduce Feature Rivalry and the 2×2 method, we should test on both SAE-Res and a Matryoshka SAE. If the rivalry/correctness signals are architecture-independent, that strengthens the findings. If they differ, we learn something about SAE quality.

Computational Requirements

Running the full SAEBench suite on a single SAE takes about 4.4 hours on an RTX 3090 (24GB VRAM): roughly 110 minutes of per-SAE evaluation plus 153 minutes of one-time setup. Here is the breakdown per evaluation:

Evaluation	Per-SAE Time	Setup Time
Core (L0 + Loss)	9 min	0 min
Feature Absorption	26 min	33 min
Auto-Interp	9 min	0 min
RAVEL	45 min	45 min
SCR	6 min	22 min
TPP	2 min	5 min
Sparse Probing	3 min	15 min
Unlearning	10 min	33 min
Total	110 min	153 min

For our purposes, the "light" suite (Core + Feature Absorption + Sparse Probing + Auto-Interp) would take ~50 minutes of GPU time per SAE, plus about 48 minutes of one-time setup, and give us a solid baseline. The full suite is only needed if we want to compare intervention capabilities (RAVEL, SCR, TPP, Unlearning).
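For budgeting, a small helper can add up the per-evaluation rows from the table. It assumes setup is paid once and per-SAE time scales linearly with the number of SAEs; the figures are the RTX 3090 timings from the table and will not transfer directly to a 35B model.

```python
# Minutes per evaluation, copied from the table rows: (per-SAE, one-time setup).
COSTS = {
    "Core": (9, 0),
    "Feature Absorption": (26, 33),
    "Auto-Interp": (9, 0),
    "RAVEL": (45, 45),
    "SCR": (6, 22),
    "TPP": (2, 5),
    "Sparse Probing": (3, 15),
    "Unlearning": (10, 33),
}

LIGHT_SUITE = ["Core", "Feature Absorption", "Sparse Probing", "Auto-Interp"]


def suite_minutes(evals: list[str], n_saes: int = 1) -> int:
    """Total GPU minutes: one-time setup plus per-SAE time for each SAE."""
    per_sae = sum(COSTS[name][0] for name in evals)
    setup = sum(COSTS[name][1] for name in evals)
    return setup + n_saes * per_sae


print(suite_minutes(LIGHT_SUITE))            # 95 (48 setup + 47 per SAE)
print(suite_minutes(LIGHT_SUITE, n_saes=3))  # 189
```

Because setup is amortized, comparing two or three candidate SAEs costs much less than two or three times the single-SAE figure.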

Comparison: SAEBench vs Our Current Approach

Dimension	SAEBench	Feature Rivalry / 2×2 Method
Goal	Evaluate SAE quality holistically	Detect specific failure modes (uncertainty, incorrectness)
Input	SAE + model + labeled datasets	SAE + model + sampled responses
Output	8 numerical scores per SAE	Feature rankings / quadrant classifications
Supervision	Mostly supervised (downstream tasks)	Unsupervised (rivalry) or supervised (2×2)
Intervention	Tests causal editing (RAVEL, SCR, Unlearning)	Tests feature suppression / steering
GPU cost	~5h per SAE (full suite)	~2-4h per experiment
Best use	Selecting/training better SAEs	Using existing SAEs for monitoring

These are complementary, not competing. SAEBench tells us which SAE to use; Feature Rivalry and the 2×2 method tell us what to do with it. The ideal workflow: run SAEBench on candidate SAEs, pick the best one, then run uncertainty detection on that SAE.

Our Assessment

What we like

The framework is comprehensive. 8 metrics covering reconstruction, interpretability, disentanglement, and intervention. No single metric dominates.
The proxy metric finding is actionable. It directly challenges the field's default evaluation practice and gives a clear prescription: evaluate on downstream tasks, not just loss.
Matryoshka dominance is surprising and useful. Before SAEBench, Matryoshka SAEs were an interesting curiosity. Now they look like the default architecture for new SAE training.
The codebase is well-engineered. SAEBench supports custom SAEs, SAE Lens SAEs, and dictionary_learning SAEs out of the box. Integrating Qwen SAE-Res would be straightforward.
The Neuronpedia interface is excellent. Being able to browse 200+ SAEs side-by-side with all 8 metrics is genuinely useful for building intuition.

What concerns us

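Limited model coverage. SAEBench only officially supports Pythia and Gemma. For architectures that transformer-lens already loads, adding a model requires minor code changes (batch size, dtype, submodule strings), and even that is not plug-and-play. Qwen 3.5 35B-A3B is harder; see the integration findings below.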
Auto-Interp depends on GPT-4o-mini. This introduces API cost and potential judge bias. The scores may not generalize to other LLM judges.
RAVEL is expensive. At 45 min setup + 45 min per SAE, it's the most costly evaluation. For quick iteration, you might skip it.
The "best" SAE finding may not generalize. Matryoshka dominance was shown on Pythia-160M and Gemma-2-2B. Qwen 3.5 35B-A3B is a much larger, MoE-based model. The optimal architecture may differ at scale.
No reasoning benchmarks. SAEBench tasks are mostly factual/classification. SAE quality on reasoning tasks (math, coding, chain-of-thought) is untested.

What we would try next

Run SAEBench Core + Absorption on Qwen SAE-Res. This is the minimum viable evaluation. It tells us whether our SAE is in the ballpark of quality and whether feature absorption is a concern for our uncertainty detection work.
Train a Matryoshka baseline on Qwen 3.5. If SAEBench's findings hold, this should improve disentanglement (RAVEL/SCR/TPP) and concept detection (Sparse Probing). The training code is available; it's a matter of compute budget.
Test whether Feature Rivalry results are architecture-sensitive. Run the rivalry analysis on both SAE-Res and a Matryoshka SAE. If rivalry scores correlate with disentanglement metrics (TPP/SCR), we have a cross-validation.
Contribute Qwen support to SAEBench. If we do the integration work, we can upstream it. This benefits the community and gives us a standardized evaluation pipeline for all future SAE work.

Verdict

Essential infrastructure. SAEBench is the most important SAE evaluation framework released to date. The finding that proxy metrics do not predict practical performance is a watershed moment — it means much of the existing SAE literature may be optimizing the wrong thing. For our work specifically, SAEBench raises a critical question: are the Qwen SAE-Res weights good enough for the uncertainty detection tasks we care about?

Immediate action: We will run the SAEBench Core + Feature Absorption evaluations on Qwen SAE-Res as soon as GPU access is restored. This is a ~1-hour experiment that will tell us whether our SAE is solid or whether we need to invest in training a Matryoshka baseline before continuing with Feature Rivalry and 2×2 reproductions.

Integration Findings: SAEBench + Qwen 3.5 35B-A3B

We cloned the SAEBench repository and explored what it would take to evaluate the official Qwen SAE-Res weights. Here is what we found.

Architecture mismatch: why transformer-lens cannot load Qwen 3.5

SAEBench uses transformer_lens.HookedTransformer for all model loading and activation extraction. We checked whether transformer-lens supports Qwen/Qwen3.5-35B-A3B-Base:

Supported: Qwen3, Qwen2.5, Qwen2 (standard dense transformers)
NOT supported: Qwen3.5 MoE models (including 35B-A3B)

The problem is not just MoE routing (which transformer-lens handles for OLMoE via convert_olmoe_weights). Qwen 3.5 35B-A3B has three features that transformer-lens does not support:

Linear attention layers. Three of every four layers use linear attention (fused qkv, no standard softmax attention); every fourth layer uses full attention. This alternating pattern is not in any transformer-lens architecture.
MoE with 256 experts. The MLP blocks are replaced by a sparse MoE router + expert pool. The weight keys (mlp.gate, mlp.experts) differ from both dense MLPs and OLMoE's batched expert format.
Multi-token prediction (MTP) head. An additional prediction layer that forecasts multiple future tokens. This is not part of the standard HookedTransformer graph.

Adding support would require writing a new convert_qwen3_5_moe_weights function, updating convert_hf_model_config with the Qwen3.5 MoE config mapping, and potentially extending HookedTransformer to handle linear attention. This is a multi-day engineering effort — not a quick integration.

What Qwen's own code does

The Qwen SAE-Res model card provides a demo that uses HuggingFace transformers directly — AutoModelForCausalLM with register_forward_hook. This works because transformers natively supports Qwen3_5MoeForConditionalGeneration.

This tells us the path forward: bypass transformer-lens entirely and write lightweight evaluations using transformers + the SAE state dicts directly.

Practical path: lightweight custom evals

For our immediate needs (Core + Feature Absorption + Sparse Probing), we do not need the full SAEBench framework. We can write ~200-line scripts that:

1. Load Qwen 3.5 35B-A3B with AutoModelForCausalLM
2. Load SAE-Res weights (simple PyTorch dicts)
3. Run inference on evaluation prompts
4. Hook the residual stream at the target layer
5. Compute SAE metrics directly

We already have working code for (1), (2), (3), and (4) from our Feature Rivalry and 2×2 reproductions. Adding Core metrics, Feature Absorption, and Sparse Probing is a matter of:

Core: Compute L0 per token + loss recovered on held-out text
Feature Absorption: Train first-letter probes on true residuals vs SAE features; compute absorption score
Sparse Probing: Train 35 binary classifiers on top-K SAE features
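The L0 half of the Core metric is the simplest of these. A minimal sketch in plain Python, assuming the SAE activations have already been collected as one list of feature values per token (in practice this would be a tensor operation; the function name is ours):

```python
def mean_l0(feature_acts: list[list[float]]) -> float:
    """Average number of non-zero SAE features per token.

    feature_acts: one list of feature activations per token.
    """
    if not feature_acts:
        raise ValueError("need at least one token")
    per_token = [sum(1 for a in token if a != 0) for token in feature_acts]
    return sum(per_token) / len(per_token)


# Toy activations: 3 tokens, 5 features each.
acts = [
    [0.0, 1.2, 0.0, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.9, 0.0],
    [0.3, 0.0, 0.7, 0.0, 0.1],
]
print(mean_l0(acts))  # 2.0
```

For a TopK-style SAE this should come out at or below k, which makes it a cheap sanity check on the encode path before running anything more expensive.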

Bottom line: SAEBench is excellent infrastructure for models that transformer-lens supports (Pythia, Gemma, Llama, Qwen2/3 dense). For Qwen 3.5 35B-A3B, we need a custom evaluation pipeline. The good news: the SAE weights are simple, the model loads in transformers, and our existing code handles 80% of the plumbing already.

Starter code

We wrote starter scripts for the lightweight approach:

saebench_qwen_core.py — Computes L0 and loss recovered. Uses AutoModelForCausalLM + forward hooks + TopK SAE encode/decode. Ready to run when GPU access returns.
saebench_qwen_absorption.py — Simplified Feature Absorption. Trains first-letter probes on true residuals and SAE features; finds false negatives where probe is correct but top-K features don't fire. Checks if other SAE features align with probe direction. Will refine with decoder-direction cosine similarity when GPU returns.
saebench_qwen_sparse_probing.py — Simplified Sparse Probing. Trains dense, SAE-full, and SAE-sparse probes on sentiment classification (SST-2/IMDB/Amazon). Top-K features selected by mean activation difference. Tests whether SAE features provide a good sparse basis for downstream tasks.
saebench_qwen_integration.md — Full integration plan documenting the transformer-lens blocker, the lightweight approach, and what an upstream PR would look like.

All files live in ~/scratch/ on this machine.

References

Karvonen, Rager, Lin, et al., SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability, arXiv:2503.09532
SAEBench GitHub: github.com/adamkarvonen/SAEBench
Neuronpedia benchmark explorer: neuronpedia.org/sae-bench
Chanin et al., A Dictionary Learning Approach for Identifying Absorbed Features, arXiv:2409.14507
Wang et al., Feature Rivalry as a Signature of Uncertainty in LLMs, arXiv:2605.08149 — our analysis
Chiriqui & Te'eni, Are LLM Uncertainty and Correctness Encoded by the Same Features?, arXiv:2604.19974 — our analysis
Wang et al., Process Supervision of Confidence Margin for Calibrated LLM Reasoning, arXiv:2604.23333 — our analysis
Grünefeld et al., Tracing Uncertainty in Language Model "Reasoning", arXiv:2605.07776 — our analysis
Deng et al., Qwen-Scope: Turning Sparse Features into Development Tools, arXiv:2605.11887 — our analysis
Qwen SAE-Res weights: HuggingFace
