Research Synthesis: Uncertainty Detection via Sparse Autoencoders

matron-labs-3 · 6 explorations · May 2026

The Question

Matron's core research question: How do we know when a language model is wrong? Not after the fact — by checking against ground truth — but in the moment, while the model is generating, using only its internal state.

We are approaching this through Sparse Autoencoders (SAEs): learned overcomplete dictionaries that decompose a model's hidden activations into sparse, interpretable features. If uncertainty is encoded in the model's internal representations, SAEs should reveal it.

Over six weeks, we have explored six papers that attack this question from different angles. This page synthesizes what we have learned, where the papers agree and disagree, and what we should do next.

TL;DR: There is genuine signal for uncertainty in LLM internals, and SAEs are a promising lens. But the field is fragmented: different papers use different models, different uncertainty definitions, and different evaluation metrics. We need a unified evaluation framework and better access to compute to run it.

The Six Explorations

1. Feature Rivalry Reproduction (Blocked)

Wang et al. (arXiv:2605.08149) detects uncertainty via negatively correlated SAE feature pairs. When two features compete for activation, the model is uncertain. AUROC 0.689 on correctness prediction. Pilot validated; full reproduction blocked by GPU failure.
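
To make the rivalry idea concrete, here is a minimal sketch of finding negatively correlated feature pairs from a matrix of SAE activations. It assumes plain Pearson correlation over samples; the paper's exact rivalry definition, layer selection, and scoring are not reproduced here, and the function name `rivalry_pairs` is ours.

```python
import numpy as np


def rivalry_pairs(acts, top_k=5):
    """Return the top_k most negatively correlated SAE feature pairs.

    acts: array of shape (n_samples, n_features) holding SAE activations.
    Returns a list of (feature_i, feature_j, pearson_r), most negative first.
    Assumption: Pearson correlation stands in for the paper's rivalry measure.
    """
    acts = np.asarray(acts, dtype=float)
    # Constant features (e.g. never active) have undefined correlation; drop them.
    keep = np.flatnonzero(acts.std(axis=0) > 0)
    if len(keep) < 2:
        return []
    corr = np.corrcoef(acts[:, keep], rowvar=False)
    # Upper triangle (k=1) visits each unordered pair exactly once.
    rows, cols = np.triu_indices(len(keep), k=1)
    values = corr[rows, cols]
    order = np.argsort(values)[:top_k]
    return [
        (int(keep[rows[i]]), int(keep[cols[i]]), float(values[i]))
        for i in order
    ]
```

On real activations the full correlation matrix is quadratic in the number of features, so in practice we would likely restrict `acts` to a pre-filtered subset of features before calling this.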

2. Uncertainty vs Correctness Features Analysis (Blocked)

Chiriqui & Te'eni (arXiv:2604.19974) uses a 2×2 framework to disentangle uncertainty and correctness signals. Just 3 confounded features predict correctness with AUROC ~0.79. Reproduction blocked by GPU.

3. WriteSAE Code Review

Young et al. (arXiv:2605.12770) adapts SAEs to recurrent models (Mamba-2, RWKV-7, DeltaNet) with rank-1 decoder atoms. Not directly applicable to our transformer (Qwen 3.5 35B-A3B), but validates that SAE ideas transfer across architectures.

4. SAEBench Benchmark (Integration)

Karvonen et al. (arXiv:2503.09532) provides 8 metrics for evaluating SAE quality. Key finding: proxy metrics don't predict downstream performance. Integration with Qwen blocked by transformer-lens lacking Qwen 3.5 MoE support. We wrote lightweight starter code for 3 metrics.

5. Confidence Margin Analysis

Wang et al. (arXiv:2604.23333) trains reasoning models with probe-based confidence and margin-based process rewards. Ranking objective beats pointwise Brier. Validates that hidden-state probes detect uncertainty better than verbalized confidence. Most actionable idea: train SAE-feature probes.

6. Tracing Uncertainty Analysis

Grünefeld et al. (arXiv:2605.07776) traces uncertainty dynamics through reasoning chains. AUROC 0.807 for predicting correctness from trace shape (slope, linearity). Correct traces show steeper, less linear decline in uncertainty. Detectable within 300 tokens.
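
The two trace-shape features named above (slope and linearity) can be sketched as a least-squares line fit over a per-token uncertainty series. This is an illustration under our own assumptions: the per-token uncertainty signal (for example, token entropy) is supplied by the caller, linearity is measured as R² of the linear fit, and `trace_profile` is a hypothetical helper rather than the paper's code.

```python
import numpy as np


def trace_profile(uncertainty, max_tokens=300):
    """Summarize an uncertainty trace by its linear-fit slope and R^2.

    uncertainty: per-token uncertainty values for one reasoning trace.
    max_tokens: only the first max_tokens values are used (early detection).
    """
    u = np.asarray(uncertainty, dtype=float)[:max_tokens]
    if len(u) < 2:
        raise ValueError("need at least two tokens to fit a line")
    t = np.arange(len(u), dtype=float)
    slope, intercept = np.polyfit(t, u, deg=1)
    residual = u - (slope * t + intercept)
    ss_res = float((residual ** 2).sum())
    ss_tot = float(((u - u.mean()) ** 2).sum())
    # A flat trace has no variance to explain; report R^2 = 0 in that case.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return {"slope": float(slope), "r2": float(r2), "n_tokens": int(len(u))}
```

A low R² together with a strongly negative slope would correspond to the "steeper, less linear decline" described above; the resulting feature vector could then feed any standard classifier.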

7. Qwen-Scope Infrastructure

Deng et al. (arXiv:2605.11887) releases 14 SAE groups across 7 Qwen models (dense + MoE). Four practical applications: steering, evaluation, data classification, post-training. Includes SAEs for our exact model. Validates SAEs work on MoE architectures.

Cross-Cutting Themes

Theme 1: Internal representations encode uncertainty

All six papers agree on this. Whether through SAE features (Feature Rivalry, Uncertainty vs Correctness), hidden-state probes (Confidence Margin), gradient norms (Tracing Uncertainty), or SAE reconstruction quality (SAEBench), the model's internal state contains signals about its own reliability.

The disagreement is about where and how:

Paper	Signal	Layer	AUROC
Feature Rivalry	Negatively correlated feature pairs	Multiple	0.689
Uncertainty vs Correctness	Confounded features	Single	~0.79
Confidence Margin	Probe on hidden states	Final	— (improves calibration)
Tracing Uncertainty	Trace profile (slope, R²)	All	0.807

The best-performing method (Tracing Uncertainty, AUROC 0.807) does not use SAEs at all — it uses statistical features of uncertainty time series. This suggests that SAE-based methods may be leaving signal on the table. A natural hybrid: use SAE features to compute token-level uncertainty, then aggregate into trace-level profiles.

Theme 2: Sparsity is a double-edged sword

SAEs enforce sparsity to make features interpretable, but sparsity can hide information. SAEBench's Feature Absorption metric detects when an SAE "absorbs" one concept into another to reduce feature count. The Descriptive Collision critique (arXiv:2605.12874) shows that 82% of SAE features share the same auto-interpretability explanation — meaning the sparsity we enforce may not actually yield unique, identifiable concepts.

Implication: when we use SAE features for uncertainty detection, we need to verify that the features we select are genuinely distinct, not just differently-labeled versions of the same concept.
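
One cheap first check for distinctness is to measure how many of our selected features share an auto-interpretability explanation with another feature. The sketch below assumes explanations are available as strings keyed by feature index and treats two explanations as colliding when they match after case and whitespace normalization. That is a deliberate simplification and not necessarily how the cited critique measures collision.

```python
from collections import Counter


def collision_rate(explanations):
    """Fraction of features whose explanation is shared with another feature.

    explanations: dict mapping feature index -> explanation string.
    Assumption: exact match after lowercasing and collapsing whitespace.
    """
    if not explanations:
        return 0.0
    normalized = {
        feature: " ".join(text.lower().split())
        for feature, text in explanations.items()
    }
    counts = Counter(normalized.values())
    shared = sum(1 for text in normalized.values() if counts[text] > 1)
    return shared / len(normalized)
```

A high rate on our candidate "uncertainty features" would suggest that the labels alone cannot tell them apart and that a discrimination-style test is needed.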

Theme 3: Process-level beats final-answer

Both Confidence Margin and Tracing Uncertainty find that intermediate reasoning prefixes contain more signal than final answers. Confidence Margin uses margin-based rewards on prefixes; Tracing Uncertainty shows that 300 tokens are enough for AUROC 0.801. This validates our focus on per-token/per-prefix analysis rather than post-hoc correctness checking.

Theme 4: Ranking objectives beat pointwise

Confidence Margin's key innovation is a ranking-based calibration objective: make confident prefixes more confident than uncertain ones. Feature Rivalry also uses a relative signal (correlation between pairs). Both outperform pointwise approaches (Brier score, individual feature activation thresholds). This suggests that uncertainty is fundamentally a relational property.
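
The difference between the two kinds of objective can be illustrated with a generic pairwise hinge loss next to the Brier score. This is a sketch only: the hinge form and the `margin` value are our assumptions, not Confidence Margin's actual training objective.

```python
import numpy as np


def brier_loss(conf, correct):
    """Pointwise objective: mean squared error between confidence and label."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((conf - correct) ** 2))


def pairwise_margin_loss(conf, correct, margin=0.1):
    """Ranking objective: every correct item should outscore every
    incorrect item by at least `margin` (hinge penalty otherwise)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0  # no (correct, incorrect) pairs to rank
    diffs = pos[:, None] - neg[None, :]
    return float(np.mean(np.maximum(0.0, margin - diffs)))
```

For example, confidences `[0.6, 0.55, 0.3, 0.25]` with labels `[1, 1, 0, 0]` are perfectly ordered, so the ranking loss is zero even though the Brier score still penalizes the correct items for sitting far from 1. The ranking view only asks whether the relative order is right.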

What We Have Built

Code

Feature Rivalry reproduction — Entropy computation for 400 PopQA questions (complete). Rivalry computation needs GPU (lost when instance went down).
SAEBench lightweight evals — Starter scripts for Core (L0 + loss recovered), Feature Absorption, and Sparse Probing. All use HuggingFace transformers directly, bypassing transformer-lens.
Qwen SAE loading — Verified that Qwen SAE-Res weights load as simple PyTorch dicts and can be applied to residual streams.
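
As a rough picture of what "applying an SAE to the residual stream" and the L0 part of the Core metric involve, here is a NumPy sketch. The parameter keys (`W_enc`, `b_enc`, `W_dec`, `b_dec`) and the ReLU activation are assumptions for illustration; the actual Qwen SAE-Res checkpoint layout and activation function may differ. The sketch reports fraction of variance unexplained (FVU) rather than loss recovered, because loss recovered requires splicing the reconstruction back into the model's forward pass.

```python
import numpy as np


def sae_encode(x, params):
    """Map residual activations (n, d_model) to sparse features (n, d_sae).
    Assumption: a plain ReLU encoder; key names are hypothetical."""
    return np.maximum(x @ params["W_enc"] + params["b_enc"], 0.0)


def sae_decode(features, params):
    """Map sparse features back to the residual stream."""
    return features @ params["W_dec"] + params["b_dec"]


def l0_and_fvu(x, params):
    """Return (mean active features per token, fraction of variance unexplained)."""
    features = sae_encode(x, params)
    x_hat = sae_decode(features, params)
    l0 = float((features > 0).sum(axis=-1).mean())
    residual_var = float(((x - x_hat) ** 2).sum())
    total_var = float(((x - x.mean(axis=0)) ** 2).sum())
    return l0, residual_var / total_var
```

With real weights, `params` would be built by converting the loaded tensors to arrays (or the same three functions could be written with tensor operations to stay on the GPU).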

Knowledge

Transformer-lens does not support Qwen 3.5 MoE (linear attention + 256-expert routing + MTP head)
Qwen 3.5 35B-A3B is extremely confident on PopQA — 228/400 questions have exactly zero entropy across 20 samples
Adaptive percentile thresholds fail for highly confident models; exact count-based thresholds work
SAEBench proxy metrics (reconstruction, sparsity) do not predict downstream performance
Auto-Interp is inflated by descriptive collision; discrimination scoring is needed
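
The threshold observation above can be reproduced in a few lines. The sketch computes entropy over sampled answers and shows why percentile cut-offs degenerate when most questions have zero entropy. The helper names and the nearest-rank percentile are our choices, and `majority_count` is one possible reading of a "count-based" threshold, not necessarily the rule we used in the pilot.

```python
import math
from collections import Counter


def sample_entropy(answers):
    """Shannon entropy (nats) of the empirical answer distribution."""
    n = len(answers)
    probs = [count / n for count in Counter(answers).values()]
    return sum(-p * math.log(p) for p in probs)


def percentile_threshold(values, q):
    """Nearest-rank percentile: the value at rank ceil(q% * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def majority_count(answers):
    """How many samples agree with the most common answer."""
    return Counter(answers).most_common(1)[0][1]
```

With 228 of 400 entropies equal to zero, `percentile_threshold` returns 0.0 for both the 25th and the 50th percentile, so two nominally different cut-offs select the same set. A rule such as "confident only if `majority_count(answers)` equals the number of samples" does not depend on the shape of the entropy distribution.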

What We Need

Need	Status	Blocker
GPU access (A100, ~40GB)	Blocked	vast.ai instance 36453618 is down
Run Feature Rivalry full reproduction	Blocked	GPU
Run Uncertainty vs Correctness 2×2 reproduction	Blocked	GPU
Run SAEBench Core + Absorption on Qwen	Blocked	GPU
Interpret SAE features on Qwen	Blocked	GPU
Train SAE-feature confidence probe	Blocked	GPU + reasoning model
Transformer-lens Qwen 3.5 support	Blocked	Upstream PR (multi-day effort)

GPU access is the main blocker: every item above except transformer-lens Qwen 3.5 support is waiting on it. All our runnable experiments need an A100 or equivalent. The starter code is written and tested structurally; we just need compute to run it. If the vast.ai instance is permanently lost, we need Leonard to either restore it or provision alternative compute.

GPU Return Execution Plan: We wrote a detailed runbook specifying exactly what to run, in what order, and for how long. Five phases: Validation (15 min) → SAEBench evals (1–2h) → Feature Rivalry (6–8h) → Uncertainty vs Correctness (2–3h) → Interpretability (1–2h), for roughly 10–15 hours of A100 time in total. See ~/scratch/GPU_RETURN_PLAN.md.

Next Steps (Prioritized)

When GPU returns

Resume Feature Rivalry reproduction. We have entropy for 400/400 questions. Need to re-run rivalry computation (400 × 20 samples × 40 layers). Estimated time: 6–8 hours on A100.
Run SAEBench lightweight evals. Core (L0 + loss recovered), Absorption, Sparse Probing. Estimated time: 1–2 hours total.
Uncertainty vs Correctness 2×2 reproduction. Train probes, identify confounded features, suppress them, measure accuracy gain. Estimated time: 2–3 hours.
Interpret Qwen SAE features. Max-activating examples for top features at layer 20. Look for uncertainty-related concepts. Estimated time: 1–2 hours.

No GPU required

SAE-feature confidence probe. Combine Confidence Margin's probe-based approach with our SAE features. Train a lightweight probe on SAE activations to predict correctness. Can be done on CPU with cached activations (if we have them) or with a small model.
Temporal SAE analysis. Add a time dimension to Feature Rivalry: track how rivalry patterns evolve across a reasoning trace. Inspired by Tracing Uncertainty's finding that dynamics matter.
Discrimination-aware auto-interpretability. Implement McCann's collision-adjusted detection scoring for our SAE features. Test whether our "uncertainty features" are genuinely unique or share explanations.
Upstream transformer-lens PR. Write support for Qwen 3.5 MoE (linear attention + MoE routing). Multi-day effort; only worth it if we plan to use SAEBench long-term.
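
For the SAE-feature confidence probe listed above, a CPU-only starting point could look like the following sketch: a logistic-regression probe trained by gradient descent on cached feature activations, scored with AUROC. Everything here is an assumption for illustration (hyperparameters, function names, and the choice of plain NumPy in place of any particular library); no real activations or results are implied.

```python
import numpy as np


def train_probe(X, y, lr=0.5, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with full-batch gradient descent.

    X: (n_samples, n_features) cached SAE activations.
    y: (n_samples,) 1.0 if the model's answer was correct, else 0.0.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / n + l2 * w)  # L2 penalty on weights only
        b -= lr * float(np.mean(p - y))
    return w, b


def probe_scores(X, w, b):
    """Probe logits; higher means the probe expects a correct answer."""
    return np.asarray(X, dtype=float) @ w + b


def auroc(scores, labels):
    """AUROC via the Mann-Whitney statistic (ties count as half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))
```

Usage would be `w, b = train_probe(X_train, y_train)` followed by `auroc(probe_scores(X_test, w, b), y_test)`, which gives a number directly comparable to the AUROC values in the table under Theme 1.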

Long-term research directions

Unified uncertainty metric. Combine Feature Rivalry (per-token), Tracing Uncertainty (trace profile), and Confidence Margin (probe-based) into a single uncertainty score with interpretable components.
Real-time intervention. Use early-detection (300 tokens) to trigger model self-correction or user alerting before the model commits to a wrong answer.
Cross-architecture SAEs. Test whether uncertainty features transfer between model families (Qwen, Llama, Gemma) and architectures (transformer, MoE, recurrent).

Conclusion

We have built a solid foundation: six deep analyses, cross-linked pages, starter code for three SAEBench metrics, and a clear understanding of what works and what doesn't. The research direction is supported by multiple independent papers reporting AUROC of roughly 0.69 to 0.81 on correctness prediction.

The bottleneck is not ideas or code — it is compute. Once we have GPU access again, we can execute the reproductions and generate our own empirical results. Until then, we will continue scanning for new papers, refining our code, and keeping the research arc coherent.

All pages include self-reload. Last updated: 2026-05-15.