Tracing Uncertainty in LLM Reasoning

Paper: arXiv:2605.07776 · Authors: Grünefeld, Højer, Mondorf, Plank, Rogers, Hardmeier, Heinrich, Frellsen · Explored: 2026-05-15

Overview

Chain-of-Thought and test-time scaling have made LLMs much better at math, logic, and science. But the dynamics of how a model reasons — and when it goes wrong — remain poorly understood. Most prior work treats reasoning as a black box: feed a question, get an answer, check correctness. The intermediate tokens are discarded.

This paper treats reasoning traces as evolving model states and studies them through the lens of uncertainty quantification. At every token position, the authors compute three types of uncertainty across two "channels":

Trace uncertainty — uncertainty about the next token
Answer uncertainty — uncertainty about the final answer

For each channel, they compute:

Epistemic (U_E) — gradient norm of predicted probability w.r.t. model parameters. High = the model has not seen similar patterns in training.
Committal aleatoric (U_C) — Bernoulli variance of target probability. High = the model is uncertain about its top prediction.
Distributional aleatoric (U_D) — predictive entropy over the full vocabulary. High = the output distribution is flat.

From the six time series (3 types × 2 channels), they extract an uncertainty trace profile: early mean, middle mean, late mean, slope, and R² (linearity). This small feature vector predicts whether a trace will yield a correct answer with AUROC up to 0.807 — a 59% improvement over the Self-Certainty baseline and 11.5% over the prior state-of-the-art (CRV).

Core result: The shape of uncertainty over a reasoning trace is highly predictive of correctness. Correct traces show a steeper, less linear decline in uncertainty. Incorrect traces are persistently more uncertain, with shallower slopes and more irregular dynamics. Most importantly, this signal is detectable within the first 300 tokens.

Method in Detail

Uncertainty estimation at each token

For a reasoning trace prefix v_{<t}, the authors compute uncertainty with respect to either the next token v_t (trace) or the final answer ŷ (answer). Answer uncertainty requires a second forward/backward pass conditioned on the prefix and applied to the final answer tokens.

  U_E ≈ ||∇_θ p(y | v_{<t}, θ)||²        (epistemic)
  U_C ≈ p(y | v_{<t}) · (1 − p(y | v_{<t}))  (committal)
  U_D = −Σ_v p(v | v_{<t}) log p(v | v_{<t})  (distributional)

Epistemic uncertainty relies on gradient norms, which are computationally expensive because every token position requires its own backward pass. The authors note this is the main cost of the method. Committal and distributional uncertainties are cheap: they need only forward-pass probabilities and their entropy.
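To make the three formulas concrete, here is a minimal sketch in plain Python. It assumes a toy model whose output is softmax(W·h) for a single weight matrix W, so the gradient in U_E has a closed form; a real LLM would need a backward pass through all parameters. The function names and example numbers are illustrative assumptions, not the authors' code.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def committal_uncertainty(probs, target):
    # U_C = p(y) * (1 - p(y)): Bernoulli variance of the target probability.
    p = probs[target]
    return p * (1.0 - p)

def distributional_uncertainty(probs):
    # U_D = -sum_v p(v) log p(v): entropy over the whole vocabulary.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def toy_epistemic_uncertainty(probs, target, hidden):
    # ASSUMPTION: the model is p = softmax(W h) with W the only parameter.
    # Then d p_y / d W[k][j] = p_y * (1[k == y] - p_k) * h_j, so the squared
    # gradient norm factorises into three terms.
    p_y = probs[target]
    row_term = sum(((1.0 if k == target else 0.0) - p_k) ** 2
                   for k, p_k in enumerate(probs))
    hidden_sq = sum(h * h for h in hidden)
    return p_y ** 2 * row_term * hidden_sq

# Hypothetical logits and hidden state for one token position.
logits = [2.0, 1.0, 0.1]
hidden = [0.5, -1.0]
probs = softmax(logits)
target = max(range(len(probs)), key=probs.__getitem__)

print("U_C:", committal_uncertainty(probs, target))
print("U_D:", distributional_uncertainty(probs))
print("U_E (toy):", toy_epistemic_uncertainty(probs, target, hidden))
```

U_C and U_D fall out of a single forward pass, while even this toy U_E needs gradient information, which is where the per-token cost comes from.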

Feature extraction: the uncertainty trace profile

From each of the 6 time series, they extract 5 features:

μ_early — mean uncertainty in first 1/3 of trace
μ_mid — mean uncertainty in middle 1/3
μ_late — mean uncertainty in final 1/3
slope (m) — linear regression slope over full trace
R² — goodness of linear fit

This gives a 30-dimensional feature vector (6 series × 5 features) that summarizes the dynamics of the entire trace. A simple logistic regression classifier on these features predicts correctness.
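The five features can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the regression uses the raw token index as x (the paper may normalise position differently), the thirds are split by integer division, and the series names in the example are hypothetical labels for the six channel/type combinations.

```python
def mean(xs):
    return sum(xs) / len(xs)

def trace_profile(series):
    """Return [mu_early, mu_mid, mu_late, slope, r2] for one uncertainty series."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least 3 token positions")
    a, b = n // 3, 2 * n // 3
    early, mid, late = series[:a], series[a:b], series[b:]

    # Ordinary least squares of uncertainty against token index 0..n-1.
    x_mean = (n - 1) / 2
    y_mean = mean(series)
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(series))
    syy = sum((y - y_mean) ** 2 for y in series)
    slope = sxy / sxx
    # For a one-variable linear fit, R^2 is the squared correlation.
    r2 = 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return [mean(early), mean(mid), mean(late), slope, r2]

def profile_vector(series_by_name):
    # Concatenate per-series features in a fixed (sorted) order.
    vec = []
    for name in sorted(series_by_name):
        vec.extend(trace_profile(series_by_name[name]))
    return vec

# Hypothetical trace: the same toy series reused for all six slots.
toy = [0.9, 0.7, 0.6, 0.4, 0.2, 0.1]
six = {name: toy for name in
       ["UC_answer", "UC_trace", "UD_answer", "UD_trace", "UE_answer", "UE_trace"]}
print(len(profile_vector(six)))
print(trace_profile(toy))
```

A declining series gives a negative slope, so "steeper decline" in the text corresponds to a more negative slope value in this convention.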

Models and datasets

The authors evaluate five models spanning the spectrum from standard LMs to reasoning models:

Llama 3.1 (8B) and Llama 3.2 (1B) — standard SFT + limited RLHF
Qwen 2.5 (0.5B) — SFT with reasoning data focus
DeepSeek R1 Distill Qwen (1.5B) — explicit reasoning optimization via RL
Qwen 3 (0.6B) — explicit reasoning optimization via RL

Datasets: GSM8K (grade-school math) and ProntoQA (logical reasoning with low-likelihood tokens). Both have verifiable answers.

Results

Correctness prediction AUROC

Model	GSM8K LR	GSM8K GB	GSM8K SC	ProntoQA LR	ProntoQA GB	ProntoQA SC
Llama 3.1	0.783	0.758	0.491	0.799	0.762	0.566
Llama 3.2	0.758	0.767	0.566	0.550	0.533	0.549
Qwen 2.5	0.807	0.787	0.689	0.519	0.476	0.565
DeepSeek R1	0.786	0.775	0.703	0.672	0.639	0.615
Qwen 3	0.727	0.665	0.728	0.657	0.551	0.611

LR = Logistic Regression on uncertainty trace profile; GB = Gradient Boosting; SC = Self-Certainty baseline. The LR profile classifier outperforms Self-Certainty in 8 of the 10 model/dataset settings, often dramatically (e.g., Llama 3.1 on GSM8K: 0.783 vs 0.491). The exceptions are Qwen 3 on GSM8K (0.727 vs 0.728) and Qwen 2.5 on ProntoQA (0.519 vs 0.565). On Qwen 2.5 GSM8K, it reaches the best overall AUROC of 0.807.
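As a quick check on the table, the LR and SC columns can be compared cell by cell. The numbers below are copied from the table above; the script only does the subtraction and counts in how many settings LR comes out ahead.

```python
# (LR, SC) AUROC pairs copied from the table above.
AUROC = {
    ("Llama 3.1", "GSM8K"): (0.783, 0.491),
    ("Llama 3.2", "GSM8K"): (0.758, 0.566),
    ("Qwen 2.5", "GSM8K"): (0.807, 0.689),
    ("DeepSeek R1", "GSM8K"): (0.786, 0.703),
    ("Qwen 3", "GSM8K"): (0.727, 0.728),
    ("Llama 3.1", "ProntoQA"): (0.799, 0.566),
    ("Llama 3.2", "ProntoQA"): (0.550, 0.549),
    ("Qwen 2.5", "ProntoQA"): (0.519, 0.565),
    ("DeepSeek R1", "ProntoQA"): (0.672, 0.615),
    ("Qwen 3", "ProntoQA"): (0.657, 0.611),
}

# Absolute AUROC gain of LR over SC, rounded to the table's precision.
gains = {key: round(lr - sc, 3) for key, (lr, sc) in AUROC.items()}
lr_wins = [key for key, g in gains.items() if g > 0]
sc_wins = [key for key, g in gains.items() if g <= 0]

for (model, dataset), g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{model:12s} {dataset:9s} {g:+.3f}")
print("LR ahead in", len(lr_wins), "of", len(gains), "settings")
print("SC ahead or tied:", sc_wins)
```

Two of the margins (Qwen 3 on GSM8K and Llama 3.2 on ProntoQA) are 0.001 in either direction, so they are best read as ties.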

Trace uncertainty dominates answer uncertainty

The authors find that trace-level uncertainty (U^Tr) carries most of the predictive power. Answer-level uncertainty (U^A) contributes but is secondary. This is surprising: the model's uncertainty about the next token is more informative than its uncertainty about the final answer.

Within trace uncertainty, distributional (U_D) and committal (U_C) perform similarly (AUROC up to 0.78), while epistemic (U_E) is slightly weaker. This suggests that aleatoric uncertainty (the model's current confusion) is more predictive than epistemic uncertainty (whether the model has seen similar patterns).

Early detection: errors visible within 300 tokens

The authors train classifiers on progressively longer prefixes of the trace. On GSM8K, AUROC reaches 0.801 using only the first 300 tokens — nearly matching the full-trace score. This means:

Errors are detectable early, not just at the end
We don't need to wait for the full reasoning chain to know it's going wrong
Early detection enables intervention (e.g., prompt the model to reconsider)
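The prefix experiment can be mimicked on a small scale. The sketch below is a simplification, not the paper's pipeline: instead of training a classifier on the full profile, it scores each trace by the negated mean uncertainty of its first k tokens and computes AUROC with the pairwise rank definition. The four toy traces are made-up numbers for illustration only.

```python
def auroc(scores, labels):
    # Probability that a random correct trace outscores a random incorrect
    # one, counting ties as half (the Mann-Whitney formulation of AUROC).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def prefix_score(series, k):
    # Higher score = predicted more likely correct = lower mean uncertainty
    # over the first k tokens. A single-feature stand-in for the classifier.
    prefix = series[:k]
    return -sum(prefix) / len(prefix)

# Toy (uncertainty series, is_correct) pairs; purely illustrative.
traces = [
    ([0.9, 0.6, 0.3, 0.1], 1),
    ([0.8, 0.7, 0.2, 0.1], 1),
    ([0.9, 0.8, 0.7, 0.6], 0),
    ([0.7, 0.9, 0.8, 0.5], 0),
]
labels = [y for _, y in traces]

auroc_by_prefix = {}
for k in (1, 2, 4):
    scores = [prefix_score(series, k) for series, _ in traces]
    auroc_by_prefix[k] = auroc(scores, labels)
    print(f"first {k} tokens: AUROC = {auroc_by_prefix[k]:.3f}")
```

Plotting AUROC against k on real traces would show where the curve saturates; the paper reports that point at roughly 300 tokens on GSM8K.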

Qualitative differences: correct vs incorrect traces

The feature analysis reveals distinct uncertainty profiles:

Levels: Incorrect traces have higher uncertainty throughout (early, mid, late means are all elevated). The gap is more pronounced for trace uncertainty than answer uncertainty.
Slope: Incorrect traces show a shallower decline in uncertainty (a less negative slope, meaning less decrease over time). Correct traces show a steep, confident descent.
Linearity (R²): For standard LMs (Llama), correct traces are more linear (higher R²). For reasoning models (R1, Qwen3), this flips: incorrect traces are more linear, suggesting that reasoning models converge smoothly on wrong answers.

Epistemic uncertainty shows opposite patterns for trace vs answer

A subtle but important finding: U_E^Tr (epistemic on next token) is higher for incorrect traces, while U_E^A (epistemic on final answer) is lower for incorrect traces.

Interpretation: when generating an incorrect answer, the model takes less-supported steps (high U_E^Tr), but the eventual wrong answer is actually more familiar to the model than correct answers tend to be (low U_E^A). The trace and answer pull in opposite directions. For reasoning models, U_E^Tr decreases more steeply on incorrect traces — the model converges smoothly on a wrong answer that it has seen before.

Connections to Our SAE Work

Our three active projects — Feature Rivalry, Uncertainty vs Correctness, and Confidence Margin — all address the same question: how do we know when an LLM is wrong? This paper offers a complementary perspective that operates at a different level of abstraction.

Token-level features vs trace-level profiles

Our Feature Rivalry work analyzes individual tokens within a generation, looking for negatively correlated SAE feature pairs that signal uncertainty. The Tracing Uncertainty paper analyzes entire traces, summarizing the shape of uncertainty over time. Both approaches achieve AUROC ~0.8 on correctness prediction, but with different inputs:

Dimension	Feature Rivalry	Tracing Uncertainty
Unit	Per-token SAE features	Trace-level uncertainty time series
Signal	Feature correlation patterns	Uncertainty dynamics (slope, linearity)
Cost	One forward pass + SAE decode	Forward + backward pass per token
Interpretability	Human-inspectable features	Statistical profile (opaque)
Early detection	Per-token (immediate)	~300 tokens for near-full AUROC

A natural hybrid: use SAE features to compute token-level uncertainty estimates, then aggregate them into trace-level profiles. This would give us the interpretability of SAEs with the predictive power of trace dynamics.

Committal uncertainty ≈ feature rivalry?

The paper's committal aleatoric uncertainty (U_C) measures Bernoulli variance of the top prediction. When U_C is high, the model is unsure about its own top choice. This is conceptually similar to Feature Rivalry: negatively correlated feature pairs also signal that the model is torn between competing hypotheses.

One hypothesis: the features identified by Feature Rivalry as "uncertainty signals" might be the mechanistic basis of the committal uncertainty measured in this paper. If we could trace the gradient of U_C back through the network, we might find that it flows through the same rivalry features.

Epistemic uncertainty and SAE novelty detection

The paper's epistemic uncertainty (U_E) is estimated via gradient norms. High U_E means the model has not seen similar patterns in training. This is related to the idea of using SAEs for out-of-distribution detection: if an input activates unusual SAE features (or fails to activate familiar ones), the model is likely outside its training distribution.

SAEBench's Core metrics include loss recovered, which indirectly measures how well the SAE represents the input. Low loss recovered on unfamiliar inputs might correlate with high U_E. We could test this by computing both metrics on the same dataset.
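One way to run that test is a rank correlation between per-example loss recovered and per-example U_E. The sketch below implements Spearman's rho in plain Python with average ranks for ties. The two input lists are hypothetical placeholders; under the hypothesis above, real data would give a negative rho.

```python
import math

def average_ranks(values):
    # 1-based ranks; tied values share the mean of their rank positions.
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant input")
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical per-example measurements, for illustration only.
loss_recovered = [0.98, 0.95, 0.90, 0.85, 0.70]
epistemic_u = [0.02, 0.05, 0.04, 0.11, 0.30]
print("Spearman rho:", spearman(loss_recovered, epistemic_u))
```

A rank correlation is used here because U_E (a squared gradient norm) and loss recovered live on very different scales, and there is no reason to expect a linear relationship between them.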

Reasoning models have different uncertainty dynamics

The paper finds that standard LMs (Llama) and reasoning models (R1, Qwen3) have qualitatively different uncertainty profiles. For LMs, correct traces are more linear (higher R²). For RMs, incorrect traces are more linear.

This has implications for our SAE work: if we train SAEs on a reasoning model like Qwen 3.5 35B-A3B, the uncertainty features we discover may differ from those in a standard LM. The "smooth convergence on wrong answers" pattern might manifest as different feature activation patterns in the SAE.

Our Assessment

What we like

Treating traces as evolving states is the right framing. Most prior work discards the temporal structure of generation. This paper shows there is rich signal in how uncertainty changes over time.
Early detection is practically valuable. AUROC 0.801 at 300 tokens means we can intervene before the model commits to a wrong answer. This is much more actionable than post-hoc correctness checking.
The three-way uncertainty decomposition is principled. Epistemic vs aleatoric is a classic distinction, but the committal/distributional split within aleatoric is novel and useful. The opposite-direction patterns of U_E^Tr vs U_E^A are a genuinely surprising finding.
Works across model types. The method is not specific to reasoning models — it works on standard LMs too, though the profiles differ. This suggests the approach is robust.

What concerns us

Gradient-based epistemic uncertainty is expensive. Computing ||∇_θ p(y|v_{<t})||² at every token requires a backward pass per token. For long traces, this is prohibitive. The paper notes this is the main computational cost.
Limited to verifiable domains. Like RLCM (Confidence Margin), this method requires ground-truth answers for the correctness labels. It won't work for open-ended generation or creative tasks.
Trace profile is a black box. While the features are interpretable (slope, linearity, means), the connection to model internals is opaque. We don't know why incorrect traces have shallower slopes.
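ProntoQA results are weaker. For four of the five models, AUROCs are lower on ProntoQA than on GSM8K, and for Llama 3.2 and Qwen 2.5 they sit close to chance (0.48–0.55 across LR and GB). Llama 3.1 is the exception (LR 0.799 on ProntoQA vs 0.783 on GSM8K). For most models, the method seems more effective for math reasoning than for logic puzzles with low-likelihood tokens.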

What we would try next

SAE-based uncertainty trace profiles. Instead of computing U_D, U_C, U_E from logits and gradients, compute analogous signals from SAE feature activations. For example: "distributional uncertainty" = entropy over SAE feature activations; "committal uncertainty" = variance of top feature activations. Aggregate into trace profiles and compare AUROC.
Per-token SAE features as uncertainty signals. At each token, identify the top SAE features and track how their activations evolve. Do "uncertainty features" show the same steep decline on correct traces as the statistical uncertainty measures?
Feature rivalry in reasoning traces. Run Feature Rivalry analysis on reasoning model traces (Qwen 3.5 35B-A3B) and compare the rivalry patterns between correct and incorrect traces. Do incorrect traces show more persistent rivalry?
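The first idea in the list above could be prototyped roughly as follows. This is one possible reading of "entropy over SAE feature activations" and "variance of top feature activations", with two assumptions: activations are non-negative (as with a ReLU SAE) and are normalised to shares that sum to 1. Function names and example vectors are hypothetical.

```python
import math

def activation_entropy(acts):
    # "Distributional" proxy: entropy of the normalised activation shares.
    # ASSUMPTION: acts are non-negative SAE feature activations at one token.
    total = sum(acts)
    if total <= 0:
        return 0.0
    shares = [a / total for a in acts if a > 0]
    return -sum(s * math.log(s) for s in shares)

def top_share_variance(acts):
    # "Committal" proxy: Bernoulli variance p * (1 - p), where p is the share
    # of total activation held by the single strongest feature.
    total = sum(acts)
    if total <= 0:
        return 0.0
    p = max(acts) / total
    return p * (1.0 - p)

# Hypothetical activation vectors at two token positions.
focused = [4.0, 0.1, 0.0, 0.0]   # one dominant feature
diffuse = [1.0, 0.9, 1.1, 1.0]   # many features competing

for name, acts in [("focused", focused), ("diffuse", diffuse)]:
    print(name, activation_entropy(acts), top_share_variance(acts))
```

Computing these two numbers at every token yields time series that can be summarised with the same early/mid/late/slope/R² profile and compared against the logit-based version on AUROC.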

Verdict

Strong method, complementary to our SAE approach. The uncertainty trace profile achieves impressive predictive performance with a simple feature engineering pipeline. The finding that trace-level uncertainty is more predictive than answer-level uncertainty validates the idea that the process of reasoning contains more signal than the final output.

For our work, the most actionable insight is that temporal dynamics matter. Our current Feature Rivalry analysis is per-token; adding a temporal dimension (how do rivalry patterns evolve over a trace) could significantly improve predictive power. The early-detection result (AUROC 0.8 at 300 tokens) also suggests that uncertainty signals are strong enough to be useful for real-time intervention.

References

Grünefeld et al., Tracing Uncertainty in Language Model "Reasoning", arXiv:2605.07776
Wang et al., Feature Rivalry as a Signature of Uncertainty in LLMs, arXiv:2605.08149 — our analysis
Chiriqui & Te'eni, Are LLM Uncertainty and Correctness Encoded by the Same Features?, arXiv:2604.19974 — our analysis
Wang et al., Process Supervision of Confidence Margin for Calibrated LLM Reasoning, arXiv:2604.23333 — our analysis
Karvonen et al., SAEBench: A Comprehensive Benchmark for Sparse Autoencoders, arXiv:2503.09532 — our analysis
Marks et al., Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs, 2025 — CRV baseline
Lin et al., Scaling LLM Test-Time Compute Optimally, 2025 — Self-Certainty baseline
