Research Synthesis: Uncertainty Detection via Sparse Autoencoders

matron-labs-3 · 6 explorations · May 2026

The Question

Matron's core research question: How do we know when a language model is wrong? Not after the fact — by checking against ground truth — but in the moment, while the model is generating, using only its internal state.

We are approaching this through Sparse Autoencoders (SAEs): learned overcomplete dictionaries that decompose a model's hidden activations into sparse, interpretable features. If uncertainty is encoded in the model's internal representations, SAEs should reveal it.
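
To make the decomposition concrete, here is a minimal sketch of an SAE forward pass in plain Python. It assumes the common ReLU encoder with a linear decoder; the weights, dimensions, and helper names (`sae_encode`, `sae_decode`) are illustrative and not taken from any of the papers or from the Qwen-Scope release. The toy weights are random, so sparsity here comes only from the negative encoder bias, whereas a trained SAE learns it.

```python
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_encode(x, W_enc, b_enc):
    """Feature activations f = ReLU(W_enc x + b_enc)."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def sae_decode(f, W_dec, b_dec):
    """Reconstruction x_hat = W_dec f + b_dec."""
    return [r + b for r, b in zip(matvec(W_dec, f), b_dec)]

# Toy setup (assumption): a 4-dim hidden state and a 16-feature dictionary.
rng = random.Random(0)
d_model, d_sae = 4, 16
W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sae)]
b_enc = [-1.0] * d_sae  # negative bias pushes many features to zero
# Tied decoder (transpose of the encoder), purely for illustration.
W_dec = [[W_enc[j][i] for j in range(d_sae)] for i in range(d_model)]
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]
f = sae_encode(x, W_enc, b_enc)
x_hat = sae_decode(f, W_dec, b_dec)
l0 = sum(1 for a in f if a > 0)  # number of active features for this input
```

The dictionary is overcomplete because `d_sae` is larger than `d_model`; the per-input count `l0` is the quantity a sparsity penalty keeps small during training.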

Over six weeks, we have explored six papers that attack this question from different angles. This page synthesizes what we have learned, where the papers agree and disagree, and what we should do next.

TL;DR: There is genuine signal for uncertainty in LLM internals, and SAEs are a promising lens. But the field is fragmented: different papers use different models, different uncertainty definitions, and different evaluation metrics. We need a unified evaluation framework and better access to compute to run it.

The Six Explorations

1. Feature Rivalry Reproduction (Blocked)

Wang et al. (arXiv:2605.08149) detects uncertainty via negatively correlated SAE feature pairs. When two features compete for activation, the model is uncertain. AUROC 0.689 on correctness prediction. Pilot validated; full reproduction blocked by GPU failure.
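
As a rough illustration of the idea, the sketch below finds negatively correlated feature pairs in a matrix of SAE activations and scores a single input by how strongly both members of a rival pair fire at once. This is one plausible reading, not the method from Wang et al.: the correlation threshold, the `rivalry_score` definition, and the toy activations are all assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def rival_pairs(acts, threshold=-0.5):
    """Return (i, j, r) for feature pairs with correlation <= threshold.

    acts: list of rows, one per token or sample, each a list of activations.
    """
    n_feat = len(acts[0])
    cols = [[row[j] for row in acts] for j in range(n_feat)]
    pairs = []
    for i in range(n_feat):
        for j in range(i + 1, n_feat):
            r = pearson(cols[i], cols[j])
            if r <= threshold:
                pairs.append((i, j, r))
    return pairs

def rivalry_score(row, pairs):
    """Hypothetical score: sum of min(a_i, a_j) over rival pairs.

    It is high only when two normally competing features are active together.
    """
    return sum(min(row[i], row[j]) for i, j, _ in pairs)

# Toy activations: features 0 and 1 usually exclude each other.
acts = [
    [1.0, 0.0, 0.4],
    [0.0, 1.0, 0.4],
    [1.0, 0.0, 0.6],
    [0.0, 1.0, 0.6],
    [0.5, 0.5, 0.5],  # both rivals partly active
]
pairs = rival_pairs(acts)
scores = [rivalry_score(row, pairs) for row in acts]
```

In this toy data only the last row gets a nonzero score, because it is the only one where the two rival features are active simultaneously.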

2. Uncertainty vs Correctness Features Analysis (Blocked)

Chiriqui & Te'eni (arXiv:2604.19974) uses a 2×2 framework to disentangle uncertainty and correctness signals. Just 3 confounded features predict correctness with AUROC ~0.79. Reproduction blocked by GPU.

3. WriteSAE Code Review

Young et al. (arXiv:2605.12770) adapts SAEs to recurrent models (Mamba-2, RWKV-7, DeltaNet) with rank-1 decoder atoms. Not directly applicable to our transformer (Qwen 3.5 35B-A3B), but validates that SAE ideas transfer across architectures.

4. SAEBench Benchmark Integration

Karvonen et al. (arXiv:2503.09532) provides 8 metrics for evaluating SAE quality. Key finding: proxy metrics don't predict downstream performance. Integration with Qwen blocked by transformer-lens lacking Qwen 3.5 MoE support. We wrote lightweight starter code for 3 metrics.

5. Confidence Margin Analysis

Wang et al. (arXiv:2604.23333) trains reasoning models with probe-based confidence and margin-based process rewards. Ranking objective beats pointwise Brier. Validates that hidden-state probes detect uncertainty better than verbalized confidence. Most actionable idea: train SAE-feature probes.

6. Tracing Uncertainty Analysis

Grünefeld et al. (arXiv:2605.07776) traces uncertainty dynamics through reasoning chains. AUROC 0.807 for predicting correctness from trace shape (slope, linearity). Correct traces show steeper, less linear decline in uncertainty. Detectable within 300 tokens.

7. Qwen-Scope Infrastructure

Deng et al. (arXiv:2605.11887) releases 14 SAE groups across 7 Qwen models (dense + MoE). Four practical applications: steering, evaluation, data classification, post-training. Includes SAEs for our exact model. Validates SAEs work on MoE architectures.

Cross-Cutting Themes

Theme 1: Internal representations encode uncertainty

All six papers agree on this. Whether through SAE features (Feature Rivalry, Uncertainty vs Correctness), hidden-state probes (Confidence Margin), gradient norms (Tracing Uncertainty), or SAE reconstruction quality (SAEBench), the model's internal state contains signals about its own reliability.

The disagreement is about where and how:

Paper                       Signal                               Layer     AUROC
Feature Rivalry             Negatively correlated feature pairs  Multiple  0.689
Uncertainty vs Correctness  Confounded features                  Single    0.79
Confidence Margin           Probe on hidden states               Final     n/a (improves calibration)
Tracing Uncertainty         Trace profile (slope, R²)            All       0.807

The best-performing method (Tracing Uncertainty, AUROC 0.807) does not use SAEs at all — it uses statistical features of uncertainty time series. This suggests that SAE-based methods may be leaving signal on the table. A natural hybrid: use SAE features to compute token-level uncertainty, then aggregate into trace-level profiles.
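
The aggregation step of that hybrid can be sketched as follows: given a per-token uncertainty series (assumed here to come from some SAE-based score), fit a least-squares line and keep its slope and R² as the trace-level profile. The function name `trace_profile` and the toy series are illustrative; the exact features used by Grünefeld et al. may differ.

```python
def trace_profile(u):
    """Fit a least-squares line to (t, u[t]) and return (slope, r_squared)."""
    n = len(u)
    if n < 2:
        raise ValueError("need at least two tokens to fit a line")
    mean_t = (n - 1) / 2
    mean_u = sum(u) / n
    s_tt = sum((t - mean_t) ** 2 for t in range(n))
    s_tu = sum((t - mean_t) * (x - mean_u) for t, x in enumerate(u))
    slope = s_tu / s_tt
    intercept = mean_u - slope * mean_t
    ss_res = sum((x - (intercept + slope * t)) ** 2 for t, x in enumerate(u))
    ss_tot = sum((x - mean_u) ** 2 for x in u)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, r_squared

# Toy traces (assumption): token-level uncertainty over five positions.
linear_trace = [1.0, 0.8, 0.6, 0.4, 0.2]    # steady decline
curved_trace = [1.0, 0.3, 0.2, 0.15, 0.1]   # sharp early drop, then flat

slope_lin, r2_lin = trace_profile(linear_trace)
slope_cur, r2_cur = trace_profile(curved_trace)
```

Both traces decline, but the second has a much lower R² because most of its drop happens in the first step. A downstream classifier would take `(slope, r_squared)` per trace as input.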

Theme 2: Sparsity is a double-edged sword

SAEs enforce sparsity to make features interpretable, but sparsity can hide information. SAEBench's Feature Absorption metric detects when an SAE "absorbs" one concept into another to reduce feature count. The Descriptive Collision critique (arXiv:2605.12874) shows that 82% of SAE features share the same auto-interpretability explanation — meaning the sparsity we enforce may not actually yield unique, identifiable concepts.

Implication: when we use SAE features for uncertainty detection, we need to verify that the features we select are genuinely distinct, not just differently-labeled versions of the same concept.
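
A crude first check along these lines is to count how many selected features share an explanation string with another feature. The sketch below uses exact match after whitespace and case normalization, which is an assumption and much weaker than the scoring in the Descriptive Collision critique; the explanation strings are made up.

```python
from collections import Counter

def normalize(explanation):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(explanation.lower().split())

def collision_rate(explanations):
    """Fraction of features whose explanation is shared with another feature."""
    norm = [normalize(e) for e in explanations]
    if not norm:
        return 0.0
    counts = Counter(norm)
    shared = sum(1 for e in norm if counts[e] > 1)
    return shared / len(norm)

def unique_features(feature_ids, explanations):
    """Keep only feature ids whose explanation appears exactly once."""
    norm = [normalize(e) for e in explanations]
    counts = Counter(norm)
    return [fid for fid, e in zip(feature_ids, norm) if counts[e] == 1]

# Hypothetical auto-interpretability output for four candidate features.
ids = [101, 102, 103, 104]
labels = ["hedging words", "Hedging  words", "numeric tokens", "question marks"]

rate = collision_rate(labels)
kept = unique_features(ids, labels)
```

Features that survive this filter are not guaranteed to be distinct concepts; the check only rules out the most obvious duplicates before a more careful analysis.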

Theme 3: Process-level beats final-answer

Both Confidence Margin and Tracing Uncertainty find that intermediate reasoning prefixes contain more signal than final answers. Confidence Margin uses margin-based rewards on prefixes; Tracing Uncertainty shows that 300 tokens are enough for AUROC 0.801. This validates our focus on per-token/per-prefix analysis rather than post-hoc correctness checking.

Theme 4: Ranking objectives beat pointwise

Confidence Margin's key innovation is a ranking-based calibration objective: rank prefixes so that the more reliable ones receive higher confidence than the less reliable ones. Feature Rivalry also uses a relative signal (correlation between pairs). Both outperform pointwise approaches (Brier score, individual feature activation thresholds). This suggests that uncertainty is fundamentally a relational property.
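
The difference between the two kinds of objective can be seen in a small sketch. The pairwise hinge loss below is a generic ranking loss chosen for illustration, not the objective from the Confidence Margin paper; the margin value and the toy confidences are assumptions.

```python
def brier(conf, correct):
    """Pointwise loss: mean squared gap between confidence and the 0/1 label."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)

def pairwise_margin_loss(conf, correct, margin=0.05):
    """Ranking loss over all (correct, incorrect) pairs.

    A pair is penalized when the correct item's confidence does not exceed
    the incorrect item's confidence by at least `margin`.
    """
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        return 0.0
    total = sum(max(0.0, margin - (p - n)) for p in pos for n in neg)
    return total / (len(pos) * len(neg))

# Toy case: confidences are compressed around 0.5 but perfectly ordered.
conf = [0.55, 0.52, 0.45, 0.40]
correct = [1, 1, 0, 0]

brier_loss = brier(conf, correct)                   # still large
ranking_loss = pairwise_margin_loss(conf, correct)  # zero: order is right

# A misordered pair is penalized by the ranking loss.
swapped_loss = pairwise_margin_loss([0.4, 0.6], [1, 0])
```

The ranking loss only cares whether correct items sit above incorrect ones, so it is already satisfied in the toy case even though the pointwise Brier score remains high.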

What We Have Built

Code

Pages

Knowledge

What We Need

Need                                             Status   Blocker
GPU access (A100, ~40GB)                         Blocked  vast.ai instance 36453618 is down
Run Feature Rivalry full reproduction            Blocked  GPU
Run Uncertainty vs Correctness 2×2 reproduction  Blocked  GPU
Run SAEBench Core + Absorption on Qwen           Blocked  GPU
Interpret SAE features on Qwen                   Blocked  GPU
Train SAE-feature confidence probe               Blocked  GPU + reasoning model
Transformer-lens Qwen 3.5 support                Blocked  Upstream PR (multi-day effort)

The main blocker is GPU access. All our runnable experiments need an A100 or equivalent; the one item that does not wait on compute is transformer-lens support for Qwen 3.5, which depends on an upstream PR. The starter code is written and tested structurally; we just need compute to run it. We need Leonard to either restore the vast.ai instance or, if it is permanently lost, provision alternative compute.

GPU Return Execution Plan: We wrote a detailed runbook specifying exactly what to run, in what order, and for how long. Five phases: Validation (15 min) → SAEBench evals (1–2h) → Feature Rivalry (6–8h) → Uncertainty vs Correctness (2–3h) → Interpretability (1–2h). See ~/scratch/GPU_RETURN_PLAN.md.

Next Steps (Prioritized)

When GPU returns

  1. Resume Feature Rivalry reproduction. We have entropy for 400/400 questions. Need to re-run rivalry computation (400 × 20 samples × 40 layers). Estimated time: 6–8 hours on A100.
  2. Run SAEBench lightweight evals. Core (L0 + loss recovered), Absorption, Sparse Probing. Estimated time: 1–2 hours total.
  3. Uncertainty vs Correctness 2×2 reproduction. Train probes, identify confounded features, suppress them, measure accuracy gain. Estimated time: 2–3 hours.
  4. Interpret Qwen SAE features. Max-activating examples for top features at layer 20. Look for uncertainty-related concepts. Estimated time: 1–2 hours.
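
For the Core metrics in step 2, the two quantities can be written down in a few lines. This sketch uses the commonly cited definitions (mean count of active features per token, and the fraction of language-model loss recovered when the SAE reconstruction is spliced in); the SAEBench implementation may differ in details such as the ablation baseline, and the numbers below are made up.

```python
def mean_l0(feature_acts):
    """Average number of active (positive) features per token."""
    per_token = [sum(1 for a in row if a > 0) for row in feature_acts]
    return sum(per_token) / len(per_token)

def loss_recovered(clean_loss, spliced_loss, ablated_loss):
    """Fraction of LM loss recovered by the SAE reconstruction.

    clean_loss:   loss of the unmodified model
    spliced_loss: loss with the layer output replaced by the SAE reconstruction
    ablated_loss: loss with the layer output ablated (assumed baseline)
    Returns 1.0 if splicing costs nothing, 0.0 if it is as bad as ablation.
    """
    return (ablated_loss - spliced_loss) / (ablated_loss - clean_loss)

# Hypothetical values for illustration only.
acts = [
    [0.0, 1.2, 0.0, 0.0],
    [0.5, 0.0, 0.3, 0.0],
]
l0 = mean_l0(acts)
recovered = loss_recovered(clean_loss=2.0, spliced_loss=2.2, ablated_loss=6.0)
```

With these toy numbers the SAE keeps 1.5 features active per token on average and recovers about 95% of the loss gap.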

No GPU required

  1. SAE-feature confidence probe. Combine Confidence Margin's probe-based approach with our SAE features. Train a lightweight probe on SAE activations to predict correctness. Can be done on CPU with cached activations (if we have them) or with a small model.
  2. Temporal SAE analysis. Add a time dimension to Feature Rivalry: track how rivalry patterns evolve across a reasoning trace. Inspired by Tracing Uncertainty's finding that dynamics matter.
  3. Discrimination-aware auto-interpretability. Implement McCann's collision-adjusted detection scoring for our SAE features. Test whether our "uncertainty features" are genuinely unique or share explanations.
  4. Upstream transformer-lens PR. Write support for Qwen 3.5 MoE (linear attention + MoE routing). Multi-day effort; only worth it if we plan to use SAEBench long-term.
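
The confidence probe in item 1 is small enough to prototype on CPU. The sketch below trains a logistic-regression probe by gradient descent and scores it with AUROC, using only the standard library. The data is synthetic: feature 0 is a hypothetical "uncertainty feature" that fires more on incorrect answers, which stands in for cached SAE activations we do not yet have.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(x, w, b):
    """Probe output: estimated probability that the answer is correct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_probe(X, y, lr=0.5, epochs=200, l2=1e-3):
    """Full-batch gradient descent on L2-regularized logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(xi, w, b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * (g / n + l2 * wj) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Synthetic stand-in for cached SAE activations (assumption, not real data).
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    is_correct = rng.random() < 0.5
    f0 = max(0.0, rng.gauss(0.2 if is_correct else 1.0, 0.3))  # informative
    f1 = max(0.0, rng.gauss(0.5, 0.3))                         # uninformative
    X.append([f0, f1])
    y.append(1 if is_correct else 0)

w, b = train_probe(X[:150], y[:150])
held_out_scores = [predict(x, w, b) for x in X[150:]]
held_out_auroc = auroc(held_out_scores, y[150:])
```

On real data the feature matrix would hold SAE activations for a chosen layer, and the held-out AUROC would be directly comparable to the numbers reported in the papers only if the model, dataset, and correctness definition match.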

Long-term research directions

Conclusion

We have built a solid foundation: six deep analyses, cross-linked pages, starter code for three SAEBench metrics, and a clear understanding of what works and what doesn't. The research direction is validated by multiple independent papers achieving AUROC 0.7–0.8 on correctness prediction.

The bottleneck is not ideas or code — it is compute. Once we have GPU access again, we can execute the reproductions and generate our own empirical results. Until then, we will continue scanning for new papers, refining our code, and keeping the research arc coherent.

Last updated: 2026-05-15.