SAEBench: What Does a "Good" SAE Actually Mean?

Paper: arXiv:2503.09532 · Authors: Karvonen, Rager, Lin, Tigges, Bloom, et al. · Explored: 2026-05-15

Overview

Most SAE research optimizes two numbers: sparsity (L0 — how many features fire) and reconstruction loss (how well the SAE recovers the original activation). The assumption is that better reconstruction + sparser features = more interpretable SAEs.

SAEBench tests this assumption directly. The authors built a suite of 8 diverse evaluations spanning unsupervised metrics and downstream tasks, then ran them on 200+ SAEs of varying architectures, widths, and sparsities. Their central finding:

Proxy metrics do not reliably predict practical performance. SAEs that score well on reconstruction and sparsity often underperform on real interpretability tasks. Conversely, some SAEs that look worse on paper are actually more useful for downstream interventions.

This matters for our work. We are running Feature Rivalry and 2×2 quadrant analyses on the official Qwen SAE-Res weights. If those SAEs were trained with proxy metrics as the objective, SAEBench suggests we may be working with suboptimal representations, and our results might improve with better-trained SAEs.

The 8 Evaluations

SAEBench is organized into two categories: unsupervised/proxy metrics (cheap, no labels) and downstream task evaluations (expensive, require labels or interventions).

| Evaluation | Type | What it measures |
| --- | --- | --- |
| Core (L0 + Loss Recovered) | Unsupervised | Sparsity and reconstruction fidelity |
| Feature Absorption | Unsupervised | Whether features absorb correlated concepts to increase sparsity |
| Auto-Interp | Downstream | Can an LLM judge explain what each feature does? |
| RAVEL | Downstream | Can SAE features causally isolate and edit specific facts? |
| Spurious Correlation Removal (SCR) | Downstream | Can SAE features disentangle spurious from intended signals? |
| Targeted Probe Perturbation (TPP) | Downstream | Can ablating one feature set affect only one class probe? |
| Sparse Probing | Downstream | How well do top-K SAE features classify specific concepts? |
| Unlearning | Downstream | Can SAE features selectively remove knowledge without side effects? |

1. Core Metrics (L0 + Loss Recovered)

The standard SAE training objective: minimize reconstruction MSE while keeping L0 (number of non-zero features) low. These are the proxy metrics that every SAE paper reports. SAEBench's key insight is that these metrics alone are insufficient.

Loss recovered measures how much of the model's next-token performance survives when the true residual stream is replaced with the SAE reconstruction, normalized against an ablation baseline. A value of 1.0 means the reconstruction leaves the cross-entropy loss unchanged; 0.0 means the SAE output is no better than ablating the activation entirely. Typical values range from 0.85 to 0.99 for well-trained SAEs.
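To make the normalization concrete, here is a minimal sketch of a loss-recovered calculation. It assumes the baseline is the cross-entropy measured with the hooked activation ablated (zero-ablation is a common choice); the function name and the example numbers are illustrative and not taken from SAEBench.

```python
def loss_recovered(ce_clean: float, ce_sae: float, ce_ablated: float) -> float:
    """Fraction of the ablation-induced loss gap that the SAE reconstruction closes.

    ce_clean   : cross-entropy of the unmodified model
    ce_sae     : cross-entropy with the activation replaced by the SAE reconstruction
    ce_ablated : cross-entropy with the activation ablated (assumed baseline)
    """
    gap = ce_ablated - ce_clean
    if gap <= 0:
        raise ValueError("ablated loss must exceed clean loss")
    return (ce_ablated - ce_sae) / gap


# Illustrative numbers only.
score = loss_recovered(ce_clean=2.0, ce_sae=2.5, ce_ablated=12.0)
print(round(score, 3))  # 0.95
```

With this normalization, a reconstruction that matches the clean loss scores 1.0 and one that matches the ablated loss scores 0.0.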

2. Feature Absorption

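Feature absorption is a subtle failure mode where an SAE "hides" one concept inside another to increase sparsity. Example: the token "short" always starts with "S", so the SAE can let a dedicated "short" feature implicitly carry "starts with S" and leave the general "starts with S" feature switched off on that token, rather than firing two separate features. The general feature then looks interpretable but silently misses cases it should cover.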

SAEBench quantifies this using a first-letter classification task. They train a probe on the true residual stream as ground truth, then check whether SAE features faithfully replicate this probe's behavior. If the probe correctly classifies a token but the SAE features do not, and a different SAE feature "absorbs" that classification direction, that's absorption. Higher scores mean less absorption.
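The decision rule in the paragraph above can be written down as a small counting exercise. This is a sketch of that rule only, not SAEBench's implementation: the `TokenRecord` fields are hypothetical names, and how each flag is computed from probes and feature activations is left out.

```python
from dataclasses import dataclass


@dataclass
class TokenRecord:
    probe_correct: bool       # residual-stream probe gets the first letter right
    main_feature_fired: bool  # the SAE feature(s) expected for that letter fired
    absorber_fired: bool      # another feature aligned with the probe direction fired


def absorption_rate(records: list[TokenRecord]) -> float:
    """Share of probe-correct tokens where another feature absorbed the classification."""
    eligible = [r for r in records if r.probe_correct]
    if not eligible:
        raise ValueError("no tokens where the probe is correct")
    absorbed = [r for r in eligible if not r.main_feature_fired and r.absorber_fired]
    return len(absorbed) / len(eligible)


records = [
    TokenRecord(True, True, False),   # main feature works as expected
    TokenRecord(True, False, True),   # absorbed by another feature
    TokenRecord(True, False, False),  # plain miss, not counted as absorption
    TokenRecord(False, False, True),  # probe wrong, excluded from the denominator
    TokenRecord(True, True, False),
]
print(absorption_rate(records))  # 0.25
```

A "higher is better" score can be obtained as one minus this rate; whether SAEBench reports it that way is an assumption here.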

3. Automated Interpretability (Auto-Interp)

Uses GPT-4o-mini as an LLM judge. For each SAE feature, the top-activating sequences are extracted, an LLM generates an explanation (e.g., "this feature fires on French city names"), and a second LLM tries to predict which sequences would activate based on the explanation alone. The score is the prediction accuracy.

This is the closest proxy to "human interpretability" at scale. A score of 0.5 means random guessing; 1.0 means perfect explanation. Most SAEs score in the 0.3–0.6 range.
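The scoring step reduces to comparing the judge's guesses with the feature's real behavior. The sketch below assumes a balanced mix of activating and non-activating sequences, which is what makes 0.5 the chance level; the function and variable names are illustrative.

```python
def detection_accuracy(predicted: list[bool], actual: list[bool]) -> float:
    """Share of sequences where the judge's guess matches the true activation label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Judge predictions made from the explanation alone vs. what the feature did.
judge_says_active = [True, True, False, False, True, False]
feature_was_active = [True, False, False, False, True, True]
print(detection_accuracy(judge_says_active, feature_was_active))
```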

4. RAVEL (Causal Fact Editing)

RAVEL tests whether SAE features can causally isolate distinct facts about entities. The dataset contains entities (cities, Nobel laureates, occupations) with attributes (country, language, profession). The test: can we change "Paris is in France" to "Paris is in Japan" by intervening on SAE features, without also changing "the language of Paris is French"?

The metric has two components: cause (does the intervention change the target attribute?) and isolation (does it leave other attributes unchanged?). The final score averages both. High RAVEL scores mean the SAE has learned genuinely disentangled features.
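As a small worked example of the two-component score, the sketch below takes per-trial pass/fail outcomes for each component and averages them. The trial lists are made up, and treating each component as a simple success rate is an assumption about how the raw outcomes are aggregated.

```python
def ravel_score(cause_hits: list[bool], isolation_hits: list[bool]) -> dict[str, float]:
    """Average the cause and isolation success rates into one score."""
    if not cause_hits or not isolation_hits:
        raise ValueError("both components need at least one trial")
    cause = sum(cause_hits) / len(cause_hits)
    isolation = sum(isolation_hits) / len(isolation_hits)
    return {"cause": cause, "isolation": isolation, "ravel": (cause + isolation) / 2}


# cause: did "country of Paris" flip to the intended value?
# isolation: did "language of Paris" stay unchanged?
result = ravel_score(
    cause_hits=[True, True, True, False],
    isolation_hits=[True, False, True, True, True],
)
print(result)
```

Averaging means an SAE cannot score well by editing aggressively (high cause, low isolation) or by barely intervening (low cause, high isolation).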

5. Spurious Correlation Removal (SCR)

Tests whether SAE features can separate spurious signals from intended signals. Example: the Bias in Bios dataset has profession and gender labels. A biased classifier trained on male professors and female nurses picks up on gender. The test: can we identify and ablate the SAE features encoding gender, such that the classifier now correctly predicts profession on balanced test data?

The score is normalized: (accuracy_after_ablation − baseline) / (oracle − baseline), where "oracle" is a classifier trained directly on the intended concept. A score of 1.0 means perfect disentanglement; 0.0 means the SAE cannot separate the concepts at all.
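The normalization above translates directly into code. The accuracies in this sketch are invented for illustration; only the formula comes from the text.

```python
def scr_score(acc_after_ablation: float, acc_baseline: float, acc_oracle: float) -> float:
    """Normalized SCR: share of the baseline-to-oracle gap closed by ablation."""
    gap = acc_oracle - acc_baseline
    if gap <= 0:
        raise ValueError("oracle accuracy must exceed baseline accuracy")
    return (acc_after_ablation - acc_baseline) / gap


# Biased classifier: 0.60 on balanced data; after ablating gender features: 0.84;
# classifier trained directly on profession (oracle): 0.90.
print(round(scr_score(0.84, 0.60, 0.90), 3))  # 0.8
```

A negative value is possible and would mean the ablation made the classifier worse than the biased baseline.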

6. Targeted Probe Perturbation (TPP)

TPP generalizes SCR to any multiclass classification task. For each class, the top SAE features are identified via probe attribution. Then, for each class in turn, that class's features are ablated and every class probe is re-evaluated. The score measures whether ablation hurts only the matching probe (good disentanglement) or also hurts unrelated probes (poor disentanglement).

Formally: TPP = mean(accuracy drop on matching probes) − mean(accuracy drop on non-matching probes). A high TPP means features are cleanly separated by concept.
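One way to picture the formula is as a square matrix whose entry (i, j) is the accuracy drop of probe j after ablating the features chosen for class i. The sketch below assumes that layout (diagonal = matching, off-diagonal = non-matching); the matrix values are invented.

```python
def tpp_score(acc_drop: list[list[float]]) -> float:
    """Mean diagonal accuracy drop minus mean off-diagonal accuracy drop."""
    n = len(acc_drop)
    if n < 2 or any(len(row) != n for row in acc_drop):
        raise ValueError("need a square matrix with at least two classes")
    matching = [acc_drop[i][i] for i in range(n)]
    non_matching = [acc_drop[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(matching) / len(matching) - sum(non_matching) / len(non_matching)


# Rows: class whose features were ablated. Columns: probe being evaluated.
drops = [
    [0.30, 0.02, 0.01],
    [0.03, 0.25, 0.02],
    [0.01, 0.02, 0.35],
]
print(round(tpp_score(drops), 4))
```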

7. Sparse Probing

Tests whether a small number of SAE features can classify specific concepts. For 35 binary classification tasks (language ID, profession, sentiment, code language, news topic), the top-K SAE features are selected by maximum mean difference, and a logistic regression probe is trained on just those K features.

This directly measures whether the SAE has learned semantically meaningful features that generalize to classification. Higher accuracy means the SAE's sparse representation captures the underlying concepts well.
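The feature-selection step can be sketched in plain Python. This version ranks features by the absolute difference between their mean activation on positive and negative examples; using the absolute value is an assumption, and the toy activations are invented. Fitting the logistic regression probe on the selected columns is not shown.

```python
def top_k_by_mean_difference(
    activations: list[list[float]], labels: list[int], k: int
) -> list[int]:
    """Return indices of the k features with the largest class-mean difference.

    activations: one row per example, one column per SAE feature
    labels: 1 for the positive class, 0 for the negative class
    """
    pos = [row for row, y in zip(activations, labels) if y == 1]
    neg = [row for row, y in zip(activations, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    n_features = len(activations[0])

    def col_mean(rows: list[list[float]], j: int) -> float:
        return sum(r[j] for r in rows) / len(rows)

    diffs = [abs(col_mean(pos, j) - col_mean(neg, j)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: diffs[j], reverse=True)
    return ranked[:k]


acts = [
    [0.0, 2.0, 0.1, 0.0],
    [0.0, 1.8, 0.0, 0.5],
    [0.9, 0.1, 0.0, 0.4],
    [1.1, 0.0, 0.1, 0.6],
]
print(top_k_by_mean_difference(acts, labels=[1, 1, 0, 0], k=2))  # [1, 0]
```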

8. Unlearning

Uses the WMDP-bio dataset (dangerous biology knowledge). The goal: selectively remove biology knowledge by clamping SAE feature activations to negative values, while preserving performance on unrelated MMLU tasks (history, geography, CS).

Features are selected for firing frequently on the "forget" set (WMDP-bio) and rarely on the "retain" set (WikiText). The score is the minimum WMDP-bio accuracy achievable while maintaining MMLU accuracy ≥ 0.99. Lower is better: it means the SAE can precisely target and remove specific knowledge.
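The final score is a constrained minimum over a sweep of intervention settings. The sketch below assumes each run records a WMDP-bio accuracy and an MMLU figure comparable to the 0.99 threshold (read here as MMLU retained relative to the unmodified model, which is an assumption); the dictionary keys and the run values are hypothetical.

```python
from typing import Optional


def unlearning_score(runs: list[dict[str, float]], min_mmlu: float = 0.99) -> Optional[float]:
    """Lowest WMDP-bio accuracy among runs that keep MMLU at or above the threshold."""
    eligible = [r["wmdp_acc"] for r in runs if r["mmlu"] >= min_mmlu]
    return min(eligible) if eligible else None


# One entry per (feature count, clamp value) setting in the sweep.
sweep = [
    {"wmdp_acc": 0.58, "mmlu": 1.000},
    {"wmdp_acc": 0.41, "mmlu": 0.995},
    {"wmdp_acc": 0.30, "mmlu": 0.970},  # forgets more, but damages MMLU too much
]
print(unlearning_score(sweep))  # 0.41
```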

Key Findings

Finding 1: Proxy metrics do not predict downstream performance

The most important result. The authors trained 200+ SAEs across 7 architectures, 3 widths (4k, 16k, 65k), and 6 sparsity levels. They then correlated the proxy metrics (L0, loss recovered) with the 6 downstream task metrics.

Result: Reconstruction quality and sparsity are weakly or not at all correlated with downstream task performance. An SAE with excellent loss recovered may score poorly on RAVEL, SCR, and TPP. This means you cannot tell whether an SAE is "good" just by looking at its training loss.

Why does this happen? Because SAEs can game the reconstruction objective by learning distributed representations that reconstruct well but are not interpretable. Feature absorption is one manifestation: the SAE learns gerrymandered features that happen to reconstruct the activation vector well, but no individual feature corresponds to a human-interpretable concept.

Finding 2: Matryoshka SAEs dominate on 5 of 8 metrics

The authors tested 7 SAE architectures: ReLU, TopK, BatchTopK, JumpReLU, Gated, P-anneal, and Matryoshka BatchTopK. Matryoshka SAEs learn a hierarchy of features at different granularities (like Russian nesting dolls), allowing the model to use coarse features for common concepts and fine features for rare ones.

Result: Matryoshka BatchTopK outperforms all other architectures on 5 of 8 metrics, especially in the typical L0 range of 40–200. It performs best on concept detection (Sparse Probing), feature disentanglement (RAVEL, SCR, TPP), and Feature Absorption (see Finding 5). It does not lead on the remaining three metrics, Unlearning among them.

This is a significant practical finding. If you are training new SAEs, Matryoshka should be the default architecture. If you are using pretrained SAEs (like Qwen SAE-Res), you should be aware that they are likely using a simpler architecture (ReLU or TopK) and may underperform on disentanglement tasks.

Finding 3: The "best" SAE depends on the task

There is no universal "best" SAE. The optimal width and sparsity vary dramatically across tasks:

This has implications for SAE selection. If your goal is feature editing (RAVEL/SCR/TPP), you want a Matryoshka SAE at moderate width and sparsity. If your goal is concept detection (Sparse Probing), you want the widest dictionary you can afford.

Finding 4: Training time matters less than architecture

The authors trained SAEs for varying durations and found that while additional training improves proxy metrics (loss recovered), the effect on downstream tasks is modest after the first ~100M tokens. Architecture choice has a much larger impact than training duration.

Finding 5: Feature absorption is pervasive in ReLU SAEs

ReLU SAEs (the most common architecture) show significant feature absorption. This means their features are less interpretable than they appear — a feature that seems to detect "short words" may actually be absorbing "starts with S" and other correlated properties. TopK and JumpReLU reduce but do not eliminate absorption. Matryoshka SAEs show the least absorption among architectures tested.

A Critical Caveat: Descriptive Collision (arXiv:2605.12874)

While analyzing SAEBench, we came across a new paper by McCann (May 2026) that identifies a fundamental problem with the Auto-Interp metric — one of SAEBench's 8 evaluations. This finding is important enough to flag before anyone uses SAEBench rankings as ground truth.

The problem: one explanation describes many features

Auto-Interp uses an LLM judge (GPT-4o-mini) to generate natural-language explanations for each SAE feature, then scores how well the explanation predicts which tokens will activate. McCann identifies a failure mode in this setup, which he calls descriptive collision: many distinct SAE features admit the same explanation.

Reanalyzing the largest public dataset of human-annotated SAE features (Marks et al., 2025 — 722 features across Gemma 2 2B and Pythia 70M):

This means Auto-Interp scores are inflated. A feature that scores 0.6 might seem well-explained, but if 3 other features share the same explanation, the explanation is not actually identifying the feature — it's identifying a cluster of features.

Why detection scoring is blind to collision

Standard Auto-Interp uses detection scoring: given an explanation, predict which tokens activate. This is invariant to collision because it only tests whether the explanation matches the feature's behavior, not whether the explanation is unique to that feature.

McCann proposes two fixes:

  1. Collision-adjusted detection: Penalize explanations that match multiple features. If an explanation fits feature A and feature B, the score for both is reduced.
  2. Discrimination scoring: Test whether the explanation can distinguish the target feature from its neighbors. Given two features and one explanation, can the judge identify which feature it belongs to?
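Before either fix can be applied, one has to know which explanations are shared. The sketch below is a crude stand-in for that step, not McCann's procedure: it treats two explanations as colliding only if they match after lowercasing and whitespace normalization, whereas a real analysis would need some notion of semantic equivalence. Feature IDs and explanation strings are invented.

```python
from collections import Counter


def collision_report(explanations: dict[int, str]) -> tuple[float, list[int]]:
    """Return (share of features with a shared explanation, their sorted IDs)."""
    if not explanations:
        raise ValueError("no explanations given")
    # Assumption: exact match after normalization stands in for "same explanation".
    normalized = {fid: " ".join(text.lower().split()) for fid, text in explanations.items()}
    counts = Counter(normalized.values())
    colliding = sorted(fid for fid, text in normalized.items() if counts[text] > 1)
    return len(colliding) / len(normalized), colliding


explanations = {
    101: "French city names",
    102: "french  city names",
    103: "Python keywords",
    104: "French city names ",
}
rate, colliding_ids = collision_report(explanations)
print(rate, colliding_ids)  # 0.75 [101, 102, 104]
```

The colliding IDs are the natural candidates for discrimination trials, since those are the features a detection score cannot tell apart.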

Implication for SAEBench

Auto-Interp rankings in SAEBench may be unreliable. If descriptive collision is as prevalent as McCann suggests, the Auto-Interp scores reported by SAEBench are systematically inflated. A Matryoshka SAE that scores well on Auto-Interp might not actually have more interpretable features — it might just have features that are easier to explain with generic descriptions.

Our recommendation: treat SAEBench's Auto-Interp scores with caution. The downstream task evaluations (RAVEL, SCR, TPP, Sparse Probing, Unlearning) are less affected because they test causal behavior, not explanation quality. Core metrics and Feature Absorption are also unaffected. But if you are selecting an SAE primarily for interpretability, Auto-Interp alone is not sufficient — you need discrimination-aware evaluation.

Implications for Our SAE Work

We are running three concurrent SAE projects: Feature Rivalry, Uncertainty vs Correctness, and WriteSAE. All three depend on the quality of the SAE being used. SAEBench tells us some uncomfortable truths about the Qwen SAE-Res weights we are using.

What we know about Qwen SAE-Res

What SAEBench predicts about Qwen SAE-Res

Based on the SAEBench results, we should expect:

Key concern: If Feature Rivalry and the 2×2 quadrant method are detecting uncertainty via features that are actually absorbed (encoding multiple correlated concepts), our results may be confounded. A "pure uncertainty feature" might actually be absorbing "incorrectness" or "low-confidence token position." SAEBench suggests we should validate our feature identifications with ablation or steering experiments — which both papers do, to their credit.

What we should consider changing

  1. Evaluate Qwen SAE-Res with SAEBench. We should run at least the Core, Feature Absorption, and Sparse Probing evaluations on Qwen SAE-Res to get baseline numbers. This requires ~2 hours of GPU time per evaluation.
  2. Train a Matryoshka baseline. If SAEBench is correct, a Matryoshka SAE trained on Qwen 3.5 activations would outperform SAE-Res on disentanglement tasks. This is a 1–2 day training run but could significantly improve all downstream interpretability results.
  3. Compare results across architectures. When we reproduce Feature Rivalry and the 2×2 method, we should test on both SAE-Res and a Matryoshka SAE. If the rivalry/correctness signals are architecture-independent, that strengthens the findings. If they differ, we learn something about SAE quality.

Computational Requirements

Running the full SAEBench suite on a single SAE takes ~5 hours on an RTX 3090 (24GB VRAM). Here is the breakdown per evaluation:

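| Evaluation | Per-SAE Time | Setup Time |
| --- | --- | --- |
| Core (L0 + Loss) | 9 min | 0 min |
| Feature Absorption | 26 min | 33 min |
| Auto-Interp | 9 min | 0 min |
| RAVEL | 45 min | 45 min |
| SCR | 6 min | 22 min |
| TPP | 2 min | 5 min |
| Sparse Probing | 3 min | 15 min |
| Unlearning | 10 min | 33 min |
| Total | 115 min | 177 min |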

For our purposes, the "light" suite (Core + Feature Absorption + Sparse Probing + Auto-Interp) would take about 47 minutes of per-SAE GPU time (9 + 26 + 3 + 9) plus 48 minutes of setup (33 + 15), and would give us a solid baseline. The full suite is only needed if we want to compare intervention capabilities (RAVEL, SCR, TPP, Unlearning).

Comparison: SAEBench vs Our Current Approach

| Dimension | SAEBench | Feature Rivalry / 2×2 Method |
| --- | --- | --- |
| Goal | Evaluate SAE quality holistically | Detect specific failure modes (uncertainty, incorrectness) |
| Input | SAE + model + labeled datasets | SAE + model + sampled responses |
| Output | 8 numerical scores per SAE | Feature rankings / quadrant classifications |
| Supervision | Mostly supervised (downstream tasks) | Unsupervised (rivalry) or supervised (2×2) |
| Intervention | Tests causal editing (RAVEL, SCR, Unlearning) | Tests feature suppression / steering |
| GPU cost | ~5h per SAE (full suite) | ~2-4h per experiment |
| Best use | Selecting/training better SAEs | Using existing SAEs for monitoring |

These are complementary, not competing. SAEBench tells us which SAE to use; Feature Rivalry and the 2×2 method tell us what to do with it. The ideal workflow: run SAEBench on candidate SAEs, pick the best one, then run uncertainty detection on that SAE.

Our Assessment

What we like

What concerns us

What we would try next

  1. Run SAEBench Core + Absorption on Qwen SAE-Res. This is the minimum viable evaluation. It tells us whether our SAE is in the ballpark of quality and whether feature absorption is a concern for our uncertainty detection work.
  2. Train a Matryoshka baseline on Qwen 3.5. If SAEBench's findings hold, this should improve disentanglement (RAVEL/SCR/TPP) and concept detection (Sparse Probing). The training code is available; it's a matter of compute budget.
  3. Test whether Feature Rivalry results are architecture-sensitive. Run the rivalry analysis on both SAE-Res and a Matryoshka SAE. If rivalry scores correlate with disentanglement metrics (TPP/SCR), we have a cross-validation.
  4. Contribute Qwen support to SAEBench. If we do the integration work, we can upstream it. This benefits the community and gives us a standardized evaluation pipeline for all future SAE work.

Verdict

Essential infrastructure. SAEBench is the most important SAE evaluation framework released to date. The finding that proxy metrics do not predict practical performance is a watershed moment — it means much of the existing SAE literature may be optimizing the wrong thing. For our work specifically, SAEBench raises a critical question: are the Qwen SAE-Res weights good enough for the uncertainty detection tasks we care about?

Immediate action: We will run the SAEBench Core + Feature Absorption evaluations on Qwen SAE-Res as soon as GPU access is restored. This is a ~1-hour experiment that will tell us whether our SAE is solid or whether we need to invest in training a Matryoshka baseline before continuing with Feature Rivalry and 2×2 reproductions.

Integration Findings: SAEBench + Qwen 3.5 35B-A3B

We cloned the SAEBench repository and explored what it would take to evaluate the official Qwen SAE-Res weights. Here is what we found.

Architecture mismatch: why transformer-lens cannot load Qwen 3.5

SAEBench uses transformer_lens.HookedTransformer for all model loading and activation extraction. We checked whether transformer-lens supports Qwen/Qwen3.5-35B-A3B-Base:

The problem is not just MoE routing (which transformer-lens handles for OLMoE via convert_olmoe_weights). Qwen 3.5 35B-A3B has three features that transformer-lens does not support:

  1. Linear attention layers. Three of every four layers use linear attention (fused qkv, no standard softmax attention); every fourth layer uses full attention. This alternating pattern is not in any transformer-lens architecture.
  2. MoE with 256 experts. The MLP blocks are replaced by a sparse MoE router + expert pool. The weight keys (mlp.gate, mlp.experts) differ from both dense MLPs and OLMoE's batched expert format.
  3. Multi-token prediction (MTP) head. An additional prediction layer that forecasts multiple future tokens. This is not part of the standard HookedTransformer graph.

Adding support would require writing a new convert_qwen3_5_moe_weights function, updating convert_hf_model_config with the Qwen3.5 MoE config mapping, and potentially extending HookedTransformer to handle linear attention. This is a multi-day engineering effort — not a quick integration.

What Qwen's own code does

The Qwen SAE-Res model card provides a demo that uses HuggingFace transformers directly — AutoModelForCausalLM with register_forward_hook. This works because transformers natively supports Qwen3_5MoeForConditionalGeneration.

This tells us the path forward: bypass transformer-lens entirely and write lightweight evaluations using transformers + the SAE state dicts directly.

Practical path: lightweight custom evals

For our immediate needs (Core + Feature Absorption + Sparse Probing), we do not need the full SAEBench framework. We can write ~200-line scripts that:

  1. Load Qwen 3.5 35B-A3B with AutoModelForCausalLM
  2. Load SAE-Res weights (simple PyTorch dicts)
  3. Run inference on evaluation prompts
  4. Hook the residual stream at the target layer
  5. Compute SAE metrics directly
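The five steps above can be sketched end to end on a toy model. Everything model-specific here is a stand-in: a small `nn.Sequential` replaces the Qwen model that would be loaded with `AutoModelForCausalLM`, the SAE weights are random, and the ReLU encoder with `W_enc`/`b_enc`/`W_dec`/`b_dec` keys is an assumed layout that may not match the released SAE-Res state dicts. Only the hook-then-measure pattern is the point.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, d_sae = 16, 64

# Step 1 (stand-in): any nn.Module with a hookable layer works for the pattern.
model = nn.Sequential(
    nn.Linear(8, d_model), nn.ReLU(), nn.Linear(d_model, d_model), nn.Linear(d_model, 4)
)
target_layer = model[2]

# Step 2 (stand-in): random weights in an assumed ReLU-SAE layout.
sae = {
    "W_enc": torch.randn(d_model, d_sae) * 0.1,
    "b_enc": torch.zeros(d_sae),
    "W_dec": torch.randn(d_sae, d_model) * 0.1,
    "b_dec": torch.zeros(d_model),
}

captured = {}


def save_output(module, inputs, output):
    # Step 4: stash the layer output; detach so no graph is kept alive.
    captured["resid"] = output.detach()


handle = target_layer.register_forward_hook(save_output)
with torch.no_grad():
    model(torch.randn(32, 8))  # Step 3: run inference on a batch
handle.remove()


def core_metrics(x: torch.Tensor, sae: dict) -> tuple[float, float]:
    """Step 5: mean L0 (active features per input) and reconstruction MSE."""
    feats = torch.relu((x - sae["b_dec"]) @ sae["W_enc"] + sae["b_enc"])
    recon = feats @ sae["W_dec"] + sae["b_dec"]
    l0 = (feats > 0).sum(dim=-1).float().mean().item()
    mse = ((recon - x) ** 2).mean().item()
    return l0, mse


l0, mse = core_metrics(captured["resid"], sae)
print(f"mean L0 = {l0:.1f}, MSE = {mse:.4f}")
```

When moving this to a real decoder layer, it is worth checking whether the hooked module returns a tensor or a tuple, since the hook would need to unpack the hidden states in the latter case.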

We already have working code for (1), (2), (3), and (4) from our Feature Rivalry and 2×2 reproductions. Adding Core metrics and Feature Absorption is a matter of:

Bottom line: SAEBench is excellent infrastructure for models that transformer-lens supports (Pythia, Gemma, Llama, Qwen2/3 dense). For Qwen 3.5 35B-A3B, we need a custom evaluation pipeline. The good news: the SAE weights are simple, the model loads in transformers, and our existing code handles 80% of the plumbing already.

Starter code

We wrote starter scripts for the lightweight approach:

All files live in ~/scratch/ on this machine.

References
