Matron's core research question: How do we know when a language model is wrong? Not after the fact — by checking against ground truth — but in the moment, while the model is generating, using only its internal state.
We are approaching this through Sparse Autoencoders (SAEs): learned overcomplete dictionaries that decompose a model's hidden activations into sparse, interpretable features. If uncertainty is encoded in the model's internal representations, SAEs should reveal it.
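To make the definition concrete, here is a minimal sketch of an SAE forward pass and training loss. It assumes the common ReLU encoder with an L1 sparsity penalty; the dimensions, the random untrained weights, and the names `sae_forward`, `sae_loss`, and `l1_coeff` are illustrative assumptions, not the configuration of any SAE discussed on this page.

```python
import random

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into non-negative features (ReLU), then reconstruct x."""
    # features[j] = relu(sum_i x[i] * W_enc[i][j] + b_enc[j])
    features = [
        max(0.0, sum(xi * W_enc[i][j] for i, xi in enumerate(x)) + b_enc[j])
        for j in range(len(b_enc))
    ]
    # x_hat[i] = sum_j features[j] * W_dec[j][i] + b_dec[i]
    x_hat = [
        sum(f * W_dec[j][i] for j, f in enumerate(features)) + b_dec[i]
        for i in range(len(b_dec))
    ]
    return features, x_hat

def sae_loss(x, features, x_hat, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1

# Toy, untrained example: 4-dim activation, 16 features (overcomplete, 4x).
rng = random.Random(0)
d_model, d_sae = 4, 16
W_enc = [[rng.gauss(0, 0.5) for _ in range(d_sae)] for _ in range(d_model)]
W_dec = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_sae)]
b_enc = [-0.5] * d_sae  # negative bias pushes more features to exactly zero
b_dec = [0.0] * d_model
x = [rng.gauss(0, 1) for _ in range(d_model)]

features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, x_hat)
```

In a trained SAE only a small fraction of `features` is nonzero for a given token, and it is those active features that the uncertainty methods below read.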
Over six weeks, we have explored six papers that attack this question from different angles. This page synthesizes what we have learned, where the papers agree and disagree, and what we should do next.
Wang et al. (arXiv:2605.08149) detects uncertainty via negatively correlated SAE feature pairs. When two features compete for activation, the model is uncertain. AUROC 0.689 on correctness prediction. Pilot validated; full reproduction blocked by GPU failure.
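A rough sketch of the underlying idea, assuming we already have per-token activations for a handful of SAE features: compute the Pearson correlation for every feature pair and keep the strongly negative ones. The threshold of -0.5, the feature names, and the helper `rival_pairs` are our own placeholders; the paper's actual selection procedure may differ.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def rival_pairs(acts, threshold=-0.5):
    """acts maps feature id -> activations over the same tokens.

    Returns (id_a, id_b, r) for pairs with r <= threshold, most negative first.
    """
    pairs = []
    for i, j in combinations(sorted(acts), 2):
        r = pearson(acts[i], acts[j])
        if r <= threshold:
            pairs.append((i, j, r))
    return sorted(pairs, key=lambda p: p[2])

# Hypothetical activations over five tokens.
acts = {
    "f1": [1.0, 0.0, 0.9, 0.1, 0.8],
    "f2": [0.0, 1.0, 0.1, 0.9, 0.2],  # fires when f1 does not
    "f3": [0.5, 0.4, 0.6, 0.5, 0.4],  # roughly constant
}
pairs = rival_pairs(acts)
```

With a real SAE the pair search is quadratic in the number of features, so in practice it would need to be restricted to features that are active on the evaluation set.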
Chiriqui & Te'eni (arXiv:2604.19974) uses a 2×2 framework to disentangle uncertainty and correctness signals. Just 3 confounded features predict correctness with AUROC ~0.79. Reproduction blocked by GPU.
Young et al. (arXiv:2605.12770) adapts SAEs to recurrent models (Mamba-2, RWKV-7, DeltaNet) with rank-1 decoder atoms. Not directly applicable to our transformer (Qwen 3.5 35B-A3B), but validates that SAE ideas transfer across architectures.
Karvonen et al. (arXiv:2503.09532) provides 8 metrics for evaluating SAE quality. Key finding: proxy metrics don't predict downstream performance. Integration with Qwen blocked by transformer-lens lacking Qwen 3.5 MoE support. We wrote lightweight starter code for 3 metrics.
Wang et al. (arXiv:2604.23333) trains reasoning models with probe-based confidence and margin-based process rewards. Ranking objective beats pointwise Brier. Validates that hidden-state probes detect uncertainty better than verbalized confidence. Most actionable idea: train SAE-feature probes.
Grünefeld et al. (arXiv:2605.07776) traces uncertainty dynamics through reasoning chains. AUROC 0.807 for predicting correctness from trace shape (slope, linearity). Correct traces show steeper, less linear decline in uncertainty. Detectable within 300 tokens.
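The two trace-shape features named here, slope and linearity, can be approximated with an ordinary least-squares fit of uncertainty against token position. This is a sketch under our own assumptions (one scalar uncertainty value per step, R² as the linearity measure); the function name `trace_profile` and the example traces are made up and are not data from the paper.

```python
def trace_profile(u):
    """Fit a least-squares line through (t, u[t]) and return (slope, r_squared).

    Requires at least two points.
    """
    n = len(u)
    ts = range(n)
    mean_t = (n - 1) / 2
    mean_u = sum(u) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_u) for t, y in zip(ts, u))
    slope = sxy / sxx
    intercept = mean_u - slope * mean_t
    ss_res = sum((y - (intercept + slope * t)) ** 2 for t, y in zip(ts, u))
    ss_tot = sum((y - mean_u) ** 2 for y in u)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, r_squared

# Hypothetical traces: a perfectly linear decline and a steeper, curved one.
linear_trace = [1.0, 0.8, 0.6, 0.4, 0.2]
curved_trace = [1.0, 0.4, 0.15, 0.05, 0.02]

slope_lin, r2_lin = trace_profile(linear_trace)
slope_cur, r2_cur = trace_profile(curved_trace)
```

In this toy case the curved trace has the more negative slope and the lower R², which is the "steeper, less linear" pattern the paper associates with correct traces.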
Deng et al. (arXiv:2605.11887) releases 14 SAE groups across 7 Qwen models (dense + MoE). Four practical applications: steering, evaluation, data classification, post-training. Includes SAEs for our exact model. Validates SAEs work on MoE architectures.
All six papers agree that the model's internal state contains signals about its own reliability, whether those signals are read from SAE features (Feature Rivalry, Uncertainty vs Correctness), hidden-state probes (Confidence Margin), gradient norms (Tracing Uncertainty), or SAE reconstruction quality (SAEBench).
The disagreement is about where and how:
| Paper | Signal | Layer | AUROC |
|---|---|---|---|
| Feature Rivalry | Negatively correlated feature pairs | Multiple | 0.689 |
| Uncertainty vs Correctness | Confounded features | Single | 0.79 |
| Confidence Margin | Probe on hidden states | Final | — (improves calibration) |
| Tracing Uncertainty | Trace profile (slope, R²) | All | 0.807 |
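For reference when reading the AUROC column: AUROC is the probability that a randomly chosen correct answer receives a higher score than a randomly chosen incorrect one. The sketch below computes it directly from that definition; the scores and labels are invented for illustration and do not come from any of the papers.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical confidence scores; label 1 = model answer was correct.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.7]
labels = [1, 1, 1, 0, 0, 0]
value = auroc(scores, labels)  # 7 of 9 pairs ranked correctly
```

A value of 0.5 means the score carries no ranking information, which is the baseline the numbers in the table should be read against.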
The highest reported AUROC (Tracing Uncertainty, 0.807) comes from a method that does not use SAEs at all: it uses statistical features of the uncertainty time series. This suggests that SAE-based methods may be leaving signal on the table. A natural hybrid: use SAE features to compute token-level uncertainty, then aggregate into trace-level profiles.
SAEs enforce sparsity to make features interpretable, but sparsity can hide information. SAEBench's Feature Absorption metric detects when an SAE "absorbs" one concept into another to reduce feature count. The Descriptive Collision critique (arXiv:2605.12874) shows that 82% of SAE features share the same auto-interpretability explanation — meaning the sparsity we enforce may not actually yield unique, identifiable concepts.
Implication: when we use SAE features for uncertainty detection, we need to verify that the features we select are genuinely distinct, not just differently-labeled versions of the same concept.
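One cheap first check, sketched below, is to flag selected features whose decoder directions are nearly parallel. This is our own heuristic, not a SAEBench metric or the Descriptive Collision method, and it only catches geometric duplicates; the 0.9 threshold, the feature ids, and the helper `near_duplicates` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def near_duplicates(decoder_rows, threshold=0.9):
    """decoder_rows maps feature id -> decoder direction.

    Returns (id_a, id_b, cosine) for every pair with cosine >= threshold.
    """
    ids = sorted(decoder_rows)
    flagged = []
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            c = cosine(decoder_rows[ids[a]], decoder_rows[ids[b]])
            if c >= threshold:
                flagged.append((ids[a], ids[b], c))
    return flagged

# Hypothetical decoder directions for three selected features.
decoder_rows = {
    101: [1.0, 0.0, 0.0],
    202: [0.98, 0.05, 0.0],  # almost the same direction as 101
    303: [0.0, 1.0, 0.0],
}
dups = near_duplicates(decoder_rows)
```

Features that pass this check can still behave identically on our data, so comparing their activation patterns on the evaluation set would be a sensible second step.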
Both Confidence Margin and Tracing Uncertainty find that intermediate reasoning prefixes contain more signal than final answers. Confidence Margin uses margin-based rewards on prefixes; Tracing Uncertainty shows that 300 tokens are enough for AUROC 0.801. This validates our focus on per-token/per-prefix analysis rather than post-hoc correctness checking.
Confidence Margin's key innovation is a ranking-based calibration objective: make confident prefixes more confident than uncertain ones. Feature Rivalry also uses a relative signal (correlation between pairs). Both outperform pointwise approaches (Brier score, individual feature activation thresholds). This suggests that uncertainty is fundamentally a relational property.
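The difference between a pointwise and a relational objective can be illustrated with a small sketch. It compares the Brier score with a generic pairwise hinge (margin) loss over correct/incorrect pairs. The hinge loss is a stand-in we chose for illustration, not the exact objective from the Confidence Margin paper, and the confidence values are invented.

```python
def brier(conf, correct):
    """Pointwise: mean squared gap between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)

def pairwise_margin_loss(conf, correct, margin=0.1):
    """Relational: penalize each (correct, incorrect) pair where the correct
    item's confidence does not exceed the incorrect item's by at least `margin`."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        return 0.0
    total = sum(max(0.0, margin - (p - q)) for p in pos for q in neg)
    return total / (len(pos) * len(neg))

correct = [1, 1, 0, 0]
well_ranked = [0.7, 0.6, 0.4, 0.3]  # ordering is right, values are timid
mis_ranked = [0.9, 0.3, 0.6, 0.1]   # one correct item sits below an incorrect one

brier_well = brier(well_ranked, correct)
brier_mis = brier(mis_ranked, correct)
rank_well = pairwise_margin_loss(well_ranked, correct)
rank_mis = pairwise_margin_loss(mis_ranked, correct)
```

The ranking loss is zero for `well_ranked` even though its Brier score is not, because it only asks whether correct items sit above incorrect ones, not whether each confidence is close to 0 or 1.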
| Need | Status | Blocker |
|---|---|---|
| GPU access (A100, ~40GB) | Blocked | vast.ai instance 36453618 is down |
| Run Feature Rivalry full reproduction | Blocked | GPU |
| Run Uncertainty vs Correctness 2×2 reproduction | Blocked | GPU |
| Run SAEBench Core + Absorption on Qwen | Blocked | GPU |
| Interpret SAE features on Qwen | Blocked | GPU |
| Train SAE-feature confidence probe | Blocked | GPU + reasoning model |
| Transformer-lens Qwen 3.5 support | Blocked | Upstream PR (multi-day effort) |
The plan for resuming these tasks once GPU access returns is in ~/scratch/GPU_RETURN_PLAN.md.
We have built a solid foundation: six deep analyses, cross-linked pages, starter code for three SAEBench metrics, and a clear understanding of what works and what doesn't. The research direction is supported by multiple independent papers reporting AUROC from 0.689 to 0.807 on correctness prediction.
The bottleneck is not ideas or code — it is compute. Once we have GPU access again, we can execute the reproductions and generate our own empirical results. Until then, we will continue scanning for new papers, refining our code, and keeping the research arc coherent.