Most work on detecting LLM errors treats uncertainty (the model doesn't know) and incorrectness (the model is wrong) as the same thing. This paper asks a sharper question: are uncertainty and correctness encoded by the same SAE features, or different ones?
The authors run thousands of prompts through a model (Qwen2.5-Instruct, 14B), classify each response into one of four quadrants, and then use SAE analysis to find which features activate in which quadrant. Their framework is a simple but powerful 2×2 design that crosses uncertainty (low or high) with correctness (correct or incorrect).
By comparing feature activation patterns across these four conditions, they can disentangle three distinct feature populations: pure uncertainty features, pure incorrectness features, and confounded features.
We are simultaneously running a Feature Rivalry reproduction on Qwen 3.5 35B-A3B. Feature Rivalry detects uncertainty via negatively correlated feature pairs — when two features anti-correlate, the model is uncertain. The current paper takes a different angle: it looks at feature activation magnitudes in a structured 2×2 design.
These two methods are complementary. Feature Rivalry is pairwise and unsupervised (needs no ground-truth labels). The 2×2 method is population-level and requires labeled correctness data. A combined approach — using Feature Rivalry to flag uncertain prompts, then the 2×2 method to identify whether the uncertainty is "honest" (the model doesn't know) or "deceptive" (the model is confidently wrong) — could be significantly more powerful than either method alone.
The authors sample 20 responses per prompt using temperature = 1.0. Each response is classified as correct or incorrect against a gold answer. Uncertainty is measured by answer consistency — if all 20 responses agree, uncertainty is low; if they diverge, uncertainty is high. This is essentially the same entropy-based approach used in Feature Rivalry.
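A minimal sketch of a consistency-based uncertainty score, assuming Shannon entropy over the distinct sampled answers. The function name `answer_entropy` and the strip/lowercase normalisation are illustrative assumptions, not the paper's exact procedure.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (bits) of the empirical distribution over sampled answers."""
    # Normalising by strip/lower is an illustrative assumption; real answer
    # matching for open-domain QA is usually more lenient.
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    # Sum of p * log2(1/p) over distinct answers; 0.0 when all samples agree.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# 20 sampled responses for one prompt, as in the setup described above.
samples = ["Paris"] * 18 + ["Lyon", "Marseille"]
print(f"entropy = {answer_entropy(samples):.3f} bits")
```

Entropy is zero when all 20 samples agree and grows as the answers diverge, which matches the low/high uncertainty reading in the text.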
This yields four quadrants, with uncertainty assigned per prompt and correctness per response:
| Quadrant | Uncertainty | Correctness | Description |
|---|---|---|---|
| Q1 | Low | Correct | Model is confident and right |
| Q2 | Low | Incorrect | Model is confidently wrong (hallucination) |
| Q3 | High | Correct | Model guessed correctly despite uncertainty |
| Q4 | High | Incorrect | Model doesn't know and is wrong |
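The table can be read as a small lookup, sketched below. The cutoff `entropy_threshold` that separates low from high uncertainty is a hypothetical parameter; the notes do not state the value the authors use.

```python
def assign_quadrant(entropy, is_correct, entropy_threshold=0.5):
    """Map (prompt-level uncertainty, response-level correctness) to Q1..Q4."""
    # entropy_threshold is a hypothetical cutoff, not a value from the paper.
    high_uncertainty = entropy > entropy_threshold
    if not high_uncertainty:
        return "Q1" if is_correct else "Q2"
    return "Q3" if is_correct else "Q4"

for entropy, ok in [(0.0, True), (0.0, False), (1.2, True), (1.2, False)]:
    print(entropy, ok, assign_quadrant(entropy, ok))
```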
For each response, they extract SAE feature activations at a chosen layer (they test layers 0-39). Activations are max-pooled across the sequence, producing a single sparse vector per response. They then compute the mean activation for each feature within each quadrant.
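The pooling and averaging steps can be sketched with NumPy. The function names and the toy random activations are assumptions for illustration; real inputs would be SAE activations of shape (sequence length, number of features) for each response.

```python
import numpy as np

def pool_response(token_acts):
    """Max-pool SAE activations over the sequence axis: (seq_len, n_features) -> (n_features,)."""
    return np.asarray(token_acts).max(axis=0)

def quadrant_means(pooled, quadrants):
    """Mean pooled activation per feature within each quadrant label."""
    pooled = np.asarray(pooled, dtype=float)
    labels = np.asarray(quadrants)
    return {q: pooled[labels == q].mean(axis=0) for q in sorted(set(quadrants))}

# Toy data: 3 responses, 4 tokens each, 5 SAE features (random stand-ins).
rng = np.random.default_rng(0)
responses = [np.maximum(rng.normal(size=(4, 5)), 0.0) for _ in range(3)]
pooled = np.stack([pool_response(r) for r in responses])
means = quadrant_means(pooled, ["Q1", "Q1", "Q4"])
print({q: m.shape for q, m in means.items()})
```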
A feature is "informative" if its mean activation differs significantly between quadrants. Specifically, they compute a difference score for each feature:
For uncertainty signal: abs(mean(Q_high_uncertainty) - mean(Q_low_uncertainty))
For correctness signal: abs(mean(Q_correct) - mean(Q_incorrect))
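The two difference scores can be computed per feature as in the sketch below. It assumes the means are taken over all responses on each side of the split (for example, Q3 and Q4 together for high uncertainty); whether the authors pool responses this way or average the per-quadrant means is not specified in these notes.

```python
import numpy as np

def difference_scores(pooled, high_uncertainty, correct):
    """Per-feature |mean difference| for the uncertainty and correctness splits."""
    pooled = np.asarray(pooled, dtype=float)      # (n_responses, n_features)
    hu = np.asarray(high_uncertainty, dtype=bool)
    ok = np.asarray(correct, dtype=bool)
    # Assumption: each mean pools every response on that side of the split.
    unc = np.abs(pooled[hu].mean(axis=0) - pooled[~hu].mean(axis=0))
    cor = np.abs(pooled[ok].mean(axis=0) - pooled[~ok].mean(axis=0))
    return unc, cor

# Four toy responses, two features: feature 0 tracks uncertainty, feature 1 correctness.
toy = [[1.0, 0.0], [1.0, 1.0], [3.0, 0.0], [3.0, 1.0]]
unc, cor = difference_scores(toy, [False, False, True, True], [True, False, True, False])
print(unc, cor)
```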
Features are then classified into three populations: pure uncertainty, pure incorrectness, and confounded.
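A minimal sketch of one way such a classification could work, assuming a single hypothetical threshold `tau` applied to both difference scores. The threshold, the rule, and the reading of "confounded" as "informative on both axes" are illustrative assumptions rather than the paper's stated procedure; the population names follow the ones used for the suppression results.

```python
import numpy as np

def classify_features(unc_score, cor_score, tau):
    """Label each feature by which difference score(s) reach the threshold tau."""
    unc_hi = np.asarray(unc_score) >= tau
    cor_hi = np.asarray(cor_score) >= tau
    # Most features are expected to fall in the default bucket.
    labels = np.full(unc_hi.shape, "uninformative", dtype=object)
    labels[unc_hi & ~cor_hi] = "pure_uncertainty"
    labels[~unc_hi & cor_hi] = "pure_incorrectness"
    labels[unc_hi & cor_hi] = "confounded"   # assumed: informative on both axes
    return labels

print(classify_features([2.0, 0.0, 2.0, 0.0], [0.0, 2.0, 2.0, 0.0], tau=1.0))
```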
To validate that these features are causally relevant (not just correlates), the authors perform feature suppression: they zero out the top-N features from a given population during generation and measure the impact on accuracy. If suppressing "confounded" features improves accuracy, those features were genuinely contributing to incorrectness.
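Feature suppression can be sketched on a toy ReLU SAE: encode the residual stream, then subtract the decoded contribution of the chosen features so that everything else, including the SAE reconstruction error, is left untouched. The encoder form, the weight names `W_enc`, `b_enc`, `W_dec`, and this particular way of zeroing features are assumptions; the paper's hook implementation may differ.

```python
import numpy as np

def suppress_features(resid, W_enc, b_enc, W_dec, feature_ids):
    """Remove the decoded contribution of selected SAE features from the residual."""
    # Assumed ReLU encoder: acts has shape (seq_len, n_features).
    acts = np.maximum(resid @ W_enc + b_enc, 0.0)
    # Contribution of the suppressed features in residual space.
    contribution = acts[..., feature_ids] @ W_dec[feature_ids]
    return resid - contribution

# Toy SAE with identity weights so the effect is easy to read.
W_enc, b_enc, W_dec = np.eye(2), np.zeros(2), np.eye(2)
resid = np.array([[1.0, 2.0]])
print(suppress_features(resid, W_enc, b_enc, W_dec, [0]))
```

In a real run this function would be applied inside a forward hook at the chosen layer on every generation step, with `feature_ids` set to the top-N features of one population.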
They also test feature addition (amplifying features) and directional steering (adding/subtracting the mean activation vector of a population).
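Directional steering can be sketched as adding a scaled vector to the residual stream. Here the direction is assumed to be the population's mean feature activations decoded through the SAE decoder, and `alpha` is a hypothetical steering strength; neither detail is confirmed by these notes.

```python
import numpy as np

def population_direction(mean_acts, W_dec, feature_ids):
    """Decode a population's mean activations into a residual-stream direction."""
    return np.asarray(mean_acts, dtype=float)[feature_ids] @ W_dec[feature_ids]

def steer(resid, direction, alpha):
    """alpha > 0 adds the direction to every position, alpha < 0 subtracts it."""
    return resid + alpha * direction

W_dec = np.eye(3)                       # toy decoder: (n_features, d_model)
direction = population_direction([2.0, 0.0, 1.0], W_dec, [0, 2])
steered = steer(np.zeros((2, 3)), direction, alpha=-1.0)
print(steered)
```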
The paper reports results on Qwen2.5-Instruct (14B) with a custom-trained SAE (width 65k, L0 ~50). All experiments use the Natural Questions (NQ) dataset.
Across all 40 layers, the authors find that features separate cleanly into the same three populations: pure uncertainty, pure incorrectness, and confounded.
This is the central empirical claim. The authors show t-SNE plots of feature activations where the three populations form visually distinct clusters. Notably, most features are not informative — only a small subset (hundreds out of 65k) show significant quadrant-specific activation patterns.
For comparison, Feature Rivalry (arXiv:2605.08149) achieves AUROC = 0.689 on LLaMA2-7B using a pairwise rivalry score across many feature pairs. The 2×2 method here achieves a higher AUROC (0.79 with 3 features at layer 28), but it requires labeled correctness data and was evaluated on a different model, so the two numbers are not directly comparable.
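For readers less familiar with the metric, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pair counting; treating "incorrect response" as the positive class is an assumption about how the detectors are evaluated.

```python
def auroc(scores, labels):
    """AUROC by pair counting: P(score_pos > score_neg), ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# labels: 1 = incorrect response (assumed positive class), 0 = correct.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Pair counting is quadratic in the number of examples, which is fine for a sanity check; a rank-based implementation is preferable at scale.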
When the authors zero out the top 10 confounded features during generation, model accuracy on NQ improves by 1.1 percentage points (from a baseline that the paper does not explicitly state, but appears to be in the 30-40% range for NQ open-domain).
This is a causal validation: if these features were merely correlates of error, suppressing them would have no effect or might even hurt. The fact that accuracy improves suggests the model is actively "using" these confounded features to produce incorrect answers — perhaps as a kind of spurious heuristic.
Suppressing pure uncertainty features makes the model more confident but does not change correctness rates — these features act as a "confidence brake." Suppressing pure incorrectness features improves accuracy slightly but also reduces the model's ability to express uncertainty. The confounded features are the most impactful to suppress because they capture the dangerous intersection: the model being wrong and not knowing it.
| Dimension | 2×2 Quadrant Method | Feature Rivalry |
|---|---|---|
| Signal type | Feature activation magnitude | Feature pair anti-correlation |
| Supervision | Requires labeled correctness data | Unsupervised (needs only sampled responses) |
| Granularity | Population-level (mean activations) | Per-prompt (rivalry score) |
| What it detects | Uncertainty vs incorrectness as distinct concepts | Uncertainty as feature competition |
| Best AUROC | 0.79 (3 features, layer 28) | 0.689 (many pairs, all layers) |
| Intervention | Feature suppression/addition | Rivalry vector steering |
| Model tested | Qwen2.5-Instruct 14B | LLaMA2-7B |
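These methods are not competitors. They detect different things through different mechanisms: the 2×2 method separates uncertainty from incorrectness using activation magnitudes, while Feature Rivalry detects uncertainty as competition between anti-correlated feature pairs.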
A combined pipeline might look like:
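A purely illustrative sketch of such a two-stage triage follows. Every name and threshold in it (`triage`, `rivalry_threshold`, `activation_threshold`, the output labels) is hypothetical, the rivalry score is taken as a precomputed input, and using the mean activation of incorrectness-associated features as the second-stage signal is an assumption rather than something either paper specifies.

```python
import numpy as np

def triage(rivalry_score, pooled_acts, incorrectness_ids,
           rivalry_threshold=0.5, activation_threshold=1.0):
    """Stage 1: unsupervised rivalry flag. Stage 2: check 2x2-derived features."""
    # Stage 1 (Feature Rivalry): only flagged prompts go to stage 2.
    if rivalry_score < rivalry_threshold:
        return "not flagged"
    # Stage 2 (2x2 method): mean activation of incorrectness-associated features.
    signal = float(np.asarray(pooled_acts, dtype=float)[incorrectness_ids].mean())
    if signal >= activation_threshold:
        return "flagged: likely incorrect"
    return "flagged: uncertain only"

acts = np.array([0.0, 2.5, 0.1, 1.5])
print(triage(0.8, acts, [1, 3]))
```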
We read this paper in the context of our ongoing Feature Rivalry reproduction. The central claim, that uncertainty and incorrectness are separable signals in SAE space, is both intuitive and well supported by the evidence presented. The 2×2 framework is elegant, and the suppression experiments provide genuine causal evidence rather than mere correlation.
Worth exploring further. The 2×2 framework is a genuinely useful conceptual tool that goes beyond prior work by disentangling two conflated failure modes. The AUROC and suppression results are strong enough to take seriously. The main gap is replication on larger models with different SAEs — which we are positioned to attempt as soon as GPU access is restored.
Immediate action: We will attempt to reproduce the core suppression experiment on Qwen 3.5 35B-A3B using the official SAE-Res weights, comparing results to our Feature Rivalry baseline. This will be the next GPU experiment once the vast.ai instance is back online.