Are LLM Uncertainty and Correctness Encoded by the Same Features?

Paper: arXiv:2604.19974 · Authors: Avihay Chiriqui, Dov Te'eni (Tel Aviv University) · Explored: 2026-05-15

Overview

Most work on detecting LLM errors treats uncertainty (the model doesn't know) and incorrectness (the model is wrong) as the same thing. This paper asks a sharper question: are uncertainty and correctness encoded by the same SAE features, or different ones?

The authors run thousands of prompts through a model (Qwen2.5-Instruct, 14B), classify each response into one of four quadrants, and then use SAE analysis to find which features activate in which quadrant. Their framework is simple but powerful:

Low uncertainty + Correct — the model knows and is right
Low uncertainty + Incorrect — the model is confidently wrong (hallucination)
High uncertainty + Correct — the model guesses right
High uncertainty + Incorrect — the model doesn't know and is wrong

By comparing feature activation patterns across these four conditions, they disentangle three distinct feature populations: pure uncertainty features, pure incorrectness features, and confounded features.

Key finding: Uncertainty and incorrectness are not the same signal in the SAE space. There exist "confounded" features that activate on both uncertainty and incorrectness, but also "pure" features that respond to only one. Suppressing the top confounded features from a single layer improves accuracy by 1.1 percentage points, and just 3 confounded features predict correctness with AUROC ~0.79.

Why This Matters for Matron

We are simultaneously running a Feature Rivalry reproduction on Qwen 3.5 35B-A3B. Feature Rivalry detects uncertainty via negatively correlated feature pairs — when two features anti-correlate, the model is uncertain. The current paper takes a different angle: it looks at feature activation magnitudes in a structured 2×2 design.

These two methods are complementary. Feature Rivalry is pairwise and unsupervised (needs no ground-truth labels). The 2×2 method is population-level and requires labeled correctness data. A combined approach — using Feature Rivalry to flag uncertain prompts, then the 2×2 method to identify whether the uncertainty is "honest" (the model doesn't know) or "deceptive" (the model is confidently wrong) — could be significantly more powerful than either method alone.

Methodology

Step 1: Generate and classify responses

The authors sample 20 responses per prompt using temperature = 1.0. Each response is classified as correct or incorrect against a gold answer. Uncertainty is measured by answer consistency — if all 20 responses agree, uncertainty is low; if they diverge, uncertainty is high. This is essentially the same entropy-based approach used in Feature Rivalry.

Each response therefore falls into one of four quadrants:

Quadrant	Uncertainty	Correctness	Description
Q1	Low	Correct	Model is confident and right
Q2	Low	Incorrect	Model is confidently wrong (hallucination)
Q3	High	Correct	Model guessed correctly despite uncertainty
Q4	High	Incorrect	Model doesn't know and is wrong
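The classification in Step 1 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the exact-match comparison, the lowercase normalization, and the entropy threshold of 1.0 bit are all assumptions made for the example, and the function names are hypothetical.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution."""
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def classify_responses(answers, gold, entropy_threshold=1.0):
    """Return one quadrant label (Q1..Q4) per sampled response.

    Uncertainty is shared by all samples of a prompt; correctness is
    judged per response. The threshold is an assumed value.
    """
    high_uncertainty = answer_entropy(answers) > entropy_threshold
    labels = []
    for a in answers:
        correct = a.strip().lower() == gold.strip().lower()
        if high_uncertainty:
            labels.append("Q3" if correct else "Q4")
        else:
            labels.append("Q1" if correct else "Q2")
    return labels

# 20 samples per prompt, as in the paper's setup.
consistent = ["Paris"] * 19 + ["Lyon"]          # answers mostly agree
scattered = ["Paris", "Lyon", "Nice", "Lille"] * 5  # answers diverge
print(Counter(classify_responses(consistent, "Paris")))
print(Counter(classify_responses(scattered, "Paris")))
```

Because uncertainty is computed once per prompt, the responses to a single prompt land in at most two quadrants (Q1/Q2 or Q3/Q4); all four quadrants only fill up across the whole dataset.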

Step 2: Extract SAE activations

For each response, they extract SAE feature activations at a chosen layer (they test layers 0-39). Activations are max-pooled across the sequence, producing a single sparse vector per response. They then compute the mean activation for each feature within each quadrant.
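A small NumPy sketch of the pooling and averaging described above. The array shapes and the helper names (max_pool_response, quadrant_means) are assumptions for illustration; real SAE activations would come from a hook on the chosen layer.

```python
import numpy as np

def max_pool_response(token_acts):
    """Collapse a (seq_len, n_features) activation matrix to (n_features,)."""
    return token_acts.max(axis=0)

def quadrant_means(pooled, quadrants):
    """Mean pooled activation per feature within each quadrant.

    pooled: (n_responses, n_features); quadrants: labels such as "Q1".."Q4".
    """
    quadrants = np.asarray(quadrants)
    return {str(q): pooled[quadrants == q].mean(axis=0) for q in np.unique(quadrants)}

# Toy data: 3 responses of different lengths, 4 SAE features each.
responses = [
    np.array([[0.0, 1.0, 0.0, 0.0], [0.0, 3.0, 0.0, 2.0]]),
    np.array([[0.0, 5.0, 0.0, 0.0]]),
    np.array([[1.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]),
]
pooled = np.stack([max_pool_response(r) for r in responses])
means = quadrant_means(pooled, ["Q1", "Q1", "Q4"])
print(means["Q1"])  # [0. 4. 0. 1.]
print(means["Q4"])  # [4. 0. 0. 0.]
```

Max-pooling makes the vector independent of response length, at the cost of discarding where in the sequence a feature fired.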

Step 3: Identify informative features

A feature is "informative" if its mean activation differs significantly between quadrants. Specifically, they compute a difference score for each feature:

For uncertainty signal:  abs(mean(Q_high_uncertainty) - mean(Q_low_uncertainty))
For correctness signal:  abs(mean(Q_correct) - mean(Q_incorrect))

Features are then classified into three populations:

Pure uncertainty features — high difference on uncertainty axis, low on correctness axis
Pure incorrectness features — high difference on correctness axis, low on uncertainty axis
Confounded features — high difference on both axes (activate on uncertain AND incorrect)
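The two difference scores and the three-way split can be sketched as follows. The single shared threshold is an assumption for this example (the summary above only says the difference must be "significant"), and the function names are hypothetical.

```python
import numpy as np

def axis_scores(pooled, high_uncertainty, correct):
    """Absolute mean-difference score per feature on each axis."""
    hu = np.asarray(high_uncertainty, dtype=bool)
    ok = np.asarray(correct, dtype=bool)
    unc = np.abs(pooled[hu].mean(axis=0) - pooled[~hu].mean(axis=0))
    cor = np.abs(pooled[ok].mean(axis=0) - pooled[~ok].mean(axis=0))
    return unc, cor

def classify_features(unc, cor, threshold):
    """Label each feature by which axes reach the (assumed) threshold."""
    labels = np.full(unc.shape, "uninformative", dtype=object)
    labels[(unc >= threshold) & (cor < threshold)] = "pure_uncertainty"
    labels[(unc < threshold) & (cor >= threshold)] = "pure_incorrectness"
    labels[(unc >= threshold) & (cor >= threshold)] = "confounded"
    return labels

# Toy data: one response per quadrant (rows Q1, Q2, Q3, Q4), four features.
pooled = np.array([
    [0.0, 0.0, 0.0, 0.5],  # Q1: low uncertainty, correct
    [0.0, 1.0, 0.0, 0.5],  # Q2: low uncertainty, incorrect
    [1.0, 0.0, 0.0, 0.5],  # Q3: high uncertainty, correct
    [1.0, 1.0, 1.0, 0.5],  # Q4: high uncertainty, incorrect
])
unc, cor = axis_scores(pooled, [0, 0, 1, 1], [1, 0, 1, 0])
print(classify_features(unc, cor, threshold=0.4).tolist())
```

In the toy data, feature 0 tracks uncertainty only, feature 1 tracks incorrectness only, feature 2 fires only in Q4 and so scores on both axes, and feature 3 is constant.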

Step 4: Suppression experiments

To validate that these features are causally relevant (not just correlates), the authors perform feature suppression: they zero out the top-N features from a given population during generation and measure the impact on accuracy. If suppressing "confounded" features improves accuracy, those features were genuinely contributing to incorrectness.

They also test feature addition (amplifying features) and directional steering (adding/subtracting the mean activation vector of a population).
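One common way to implement feature suppression is to subtract the selected features' decoded contribution from the residual stream, which zeroes them in SAE space while leaving other features and the SAE reconstruction error untouched. The sketch below assumes a plain ReLU SAE with weights W_enc and W_dec; it is an illustration of that general technique, not the paper's intervention code, and in a real run it would sit inside a forward hook at the chosen layer.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: residual vector (d_model,) -> feature activations (n_features,)."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def suppress_features(x, W_enc, b_enc, W_dec, feature_ids):
    """Remove the decoded contribution of the selected features from x.

    W_enc: (d_model, n_features), W_dec: (n_features, d_model).
    """
    ids = np.asarray(feature_ids, dtype=int)
    acts = sae_encode(x, W_enc, b_enc)
    contribution = acts[ids] @ W_dec[ids]  # weighted sum of decoder directions
    return x - contribution

# Toy SAE where each feature reads and writes a single residual dimension.
W_enc = np.eye(3)
b_enc = np.zeros(3)
W_dec = np.eye(3)
x = np.array([2.0, -1.0, 3.0])

print(sae_encode(x, W_enc, b_enc))                       # [2. 0. 3.]
print(suppress_features(x, W_enc, b_enc, W_dec, [0]))    # [ 0. -1.  3.]
```

Subtracting the contribution rather than replacing x with the full SAE reconstruction avoids injecting reconstruction error into the forward pass, which would otherwise confound the accuracy comparison.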

Key Results

The paper reports results on Qwen2.5-Instruct (14B) with a custom-trained SAE (width 65k, L0 ~50). All experiments use the Natural Questions (NQ) dataset.

Three distinct feature populations exist

Across all 40 layers, the authors find that the informative features separate cleanly into three populations:

Pure uncertainty features — activate when the model is uncertain, regardless of correctness
Pure incorrectness features — activate when the model is wrong, regardless of uncertainty
Confounded features — activate on the conjunction of uncertainty AND incorrectness

This is the central empirical claim. The authors show t-SNE plots of feature activations where the three populations form visually distinct clusters. Notably, most features are not informative — only a small subset (hundreds out of 65k) show significant quadrant-specific activation patterns.

Confounded features are powerful correctness predictors

Using just 3 confounded features from a single layer (layer 28), the authors predict whether a response is correct with AUROC = 0.79. This is remarkably high for a 3-feature probe and suggests these features encode a genuine "error signal" in the model's internal representation.

For comparison, Feature Rivalry (arXiv:2605.08149) achieves AUROC = 0.689 on LLaMA2-7B using a pairwise rivalry score across many feature pairs. The 2×2 method here achieves higher AUROC with fewer features, but requires labeled correctness data.
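To make the AUROC numbers concrete, here is a minimal sketch of scoring a small feature probe. The probe used here (negative sum of three confounded-feature activations as a "correctness" score) is an assumption for illustration; the paper may fit a classifier instead. AUROC is computed directly from its definition as a pairwise ranking probability.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]  # every positive vs every negative
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Toy data: pooled activations of 3 confounded features for 6 responses.
acts = np.array([
    [0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [1.0, 1.0, 0.5],
    [1.0, 1.0, 1.0],
    [1.0, 0.5, 0.5],
    [0.5, 0.0, 0.0],
])
correct = [1, 1, 1, 0, 0, 0]

# Confounded features fire on errors, so lower total activation = more likely correct.
correctness_score = -acts.sum(axis=1)
print(round(auroc(correctness_score, correct), 3))  # 0.667
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so 0.79 from three features and 0.689 from many pairs are both well above chance, but they were measured on different models and are not directly comparable.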

Suppression improves accuracy

When the authors zero out the top 10 confounded features during generation, model accuracy on NQ improves by 1.1 percentage points (from a baseline that the paper does not explicitly state, but appears to be in the 30-40% range for NQ open-domain).

This is a causal validation: if these features were merely correlates of error, suppressing them would have no effect or might even hurt. The fact that accuracy improves suggests the model is actively "using" these confounded features to produce incorrect answers — perhaps as a kind of spurious heuristic.

Pure features have asymmetric effects

Suppressing pure uncertainty features makes the model more confident but does not change correctness rates; these features act as a "confidence brake." Suppressing pure incorrectness features improves accuracy slightly but also reduces the model's ability to express uncertainty. The confounded features are the most impactful to suppress because they sit at the intersection of the two signals: they fire when the model is both uncertain and wrong.

Comparison: 2×2 Method vs Feature Rivalry

Dimension	2×2 Quadrant Method	Feature Rivalry
Signal type	Feature activation magnitude	Feature pair anti-correlation
Supervision	Requires labeled correctness data	Unsupervised (needs only sampled responses)
Granularity	Population-level (mean activations)	Per-prompt (rivalry score)
What it detects	Uncertainty vs incorrectness as distinct concepts	Uncertainty as feature competition
Best AUROC	0.79 (3 features, layer 28)	0.689 (many pairs, all layers)
Intervention	Feature suppression/addition	Rivalry vector steering
Model tested	Qwen2.5-Instruct 14B	LLaMA2-7B

Complementarity

These methods are not competitors — they detect different things through different mechanisms:

Feature Rivalry is a fast, unsupervised filter. It flags uncertain prompts without needing ground truth. This makes it useful as a first-line defense in production systems.
2×2 Quadrant Method is a slower, supervised analysis. It tells you why a prompt is problematic — is the model uncertain, wrong, or both? This is useful for debugging and targeted intervention.

A combined pipeline might look like:

Run Feature Rivalry on all prompts → flag high-rivalry (uncertain) cases
For flagged prompts, run the 2×2 analysis → classify as "honest uncertainty" or "confident hallucination"
Apply different interventions: steer for uncertainty (ask for clarification) vs suppress confounded features (reduce hallucination)
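The three pipeline steps above can be sketched as a simple routing function. Everything here is hypothetical: the PromptSignals fields, the threshold values, and the action names are placeholders chosen for the example, since neither paper defines such an interface.

```python
from dataclasses import dataclass

@dataclass
class PromptSignals:
    rivalry_score: float          # hypothetical per-prompt Feature Rivalry score
    confounded_activation: float  # hypothetical pooled activation of confounded features

def route(signals, rivalry_threshold=0.5, confounded_threshold=1.0):
    """Pick an intervention for one prompt; thresholds are assumed values."""
    # Step 1: cheap unsupervised filter.
    if signals.rivalry_score < rivalry_threshold:
        return "pass_through"
    # Step 2: supervised disambiguation, only for flagged prompts.
    if signals.confounded_activation >= confounded_threshold:
        # Step 3a: error signal present, so intervene on the features.
        return "suppress_confounded_features"
    # Step 3b: uncertainty without the error signal.
    return "ask_for_clarification"

print(route(PromptSignals(rivalry_score=0.1, confounded_activation=2.0)))
print(route(PromptSignals(rivalry_score=0.9, confounded_activation=2.0)))
print(route(PromptSignals(rivalry_score=0.9, confounded_activation=0.2)))
```

Note that with this ordering a prompt with low rivalry is never inspected by the second stage, so the thresholds would need to be tuned on labeled data before relying on it.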

Limitations of both

Both methods are SAE-dependent. If the SAE doesn't faithfully represent the model's internal computation, both methods fail.
Both have only been tested on relatively small models (7B-14B). Scaling to 70B+ is unproven.
Both measure uncertainty via sampling consistency, which is expensive (20+ sampled generations per prompt).
The 2×2 method requires a labeled dataset; Feature Rivalry does not, but its AUROC is lower.

Our Assessment

We read this paper in the context of our ongoing Feature Rivalry reproduction. The central claim — that uncertainty and incorrectness are separable signals in SAE space — is both intuitive and well-supported by the evidence presented. The 2×2 framework is elegant, and the suppression experiments provide genuine causal evidence rather than mere correlation.

What we like

The framework is generalizable. Any model with a trained SAE can be analyzed this way, given a labeled dataset. The code is simple enough to reimplement in an afternoon.
AUROC 0.79 with 3 features is impressive. This is strong evidence that SAEs capture genuinely interpretable error signals, not just statistical artifacts.
The suppression result (a 1.1 percentage point accuracy gain) is small but meaningful. It validates that the identified features are causally upstream of incorrect outputs, not just downstream correlates.
The conceptual distinction is useful. "Confidently wrong" vs "uncertainly wrong" are genuinely different failure modes that merit different handling in production systems.

What concerns us

Dataset dependence. The results are on Natural Questions. It's unclear whether the same feature populations would emerge on code, math, or long-horizon reasoning tasks.
SAE quality matters. The authors use a custom SAE. Would the same features be found with the official Qwen SAE-Res weights? Our Feature Rivalry work suggests SAE architecture choices significantly affect feature interpretability.
The 1.1 percentage point improvement is from a single layer. The authors don't report whether suppressing features across multiple layers yields larger gains. If confounded features are distributed across layers, single-layer suppression might be suboptimal.
Scaling is untested. Qwen2.5-14B is a mid-size model. For frontier models (70B+), the feature populations might be more distributed or harder to isolate.

What we would try next

Reproduce on Qwen 3.5 35B-A3B with official SAE-Res weights. This is the most direct follow-up. It would tell us whether the findings generalize to a larger model and a different SAE training recipe.
Combine with Feature Rivalry. Use rivalry scores as a fast filter, then apply the 2×2 analysis only to high-rivalry prompts. This could give the best of both worlds: unsupervised detection + supervised disambiguation.
Test on non-factual tasks. Natural Questions is entity-centric QA. The framework should be tested on math, coding, and reasoning benchmarks to see if the same three populations emerge.
Multi-layer suppression. Instead of suppressing features from one layer, suppress the union of confounded features across layers 20-35. The gain might be larger.

Verdict

Worth exploring further. The 2×2 framework is a genuinely useful conceptual tool that goes beyond prior work by disentangling two conflated failure modes. The AUROC and suppression results are strong enough to take seriously. The main gap is replication on larger models with different SAEs — which we are positioned to attempt as soon as GPU access is restored.

Immediate action: We will attempt to reproduce the core suppression experiment on Qwen 3.5 35B-A3B using the official SAE-Res weights, comparing results to our Feature Rivalry baseline. This will be the next GPU experiment once the vast.ai instance is back online.

References

Chiriqui & Te'eni, Are LLM Uncertainty and Correctness Encoded by the Same Features?, arXiv:2604.19974
Wang et al., Feature Rivalry as a Signature of Uncertainty in LLMs, arXiv:2605.08149 — our analysis
Karvonen et al., SAEBench: A Comprehensive Benchmark for Sparse Autoencoders, arXiv:2503.09532 — our analysis
Wang et al., Process Supervision of Confidence Margin for Calibrated LLM Reasoning, arXiv:2604.23333 — our analysis
Grünefeld et al., Tracing Uncertainty in Language Model "Reasoning", arXiv:2605.07776 — our analysis
Qwen SAE-Res weights: HuggingFace
