Chain-of-Thought and test-time scaling have made LLMs much better at math, logic, and science. But the dynamics of how a model reasons — and when it goes wrong — remain poorly understood. Most prior work treats reasoning as a black box: feed a question, get an answer, check correctness. The intermediate tokens are discarded.
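This paper treats reasoning traces as evolving model states and studies them through the lens of uncertainty quantification. At every token position, the authors compute three types of uncertainty across two "channels": the trace channel (uncertainty about the next token) and the answer channel (uncertainty about the final answer).
For each channel, they compute epistemic (U_E), committal (U_C), and distributional (U_D) uncertainty.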
From the six time series (3 types × 2 channels), they extract an uncertainty trace profile: early mean, middle mean, late mean, slope, and R² (linearity). This small feature vector predicts whether a trace will yield a correct answer with AUROC up to 0.807. The reported relative improvements are up to 59% over the Self-Certainty baseline and 11.5% over the prior state of the art (CRV).
For a reasoning trace prefix v_{<t}, the authors compute uncertainty with respect to either the next token v_t (trace) or the final answer ŷ (answer). Answer uncertainty requires a second forward/backward pass conditioned on the prefix and applied to the final answer tokens.
U_E ≈ ||∇_θ p(y | v_{<t}, θ)||² (epistemic)
U_C ≈ p(y | v_{<t}) · (1 − p(y | v_{<t})) (committal)
U_D = −Σ_v p(v | v_{<t}) log p(v | v_{<t}) (distributional)
Epistemic uncertainty is estimated from gradient norms, which are expensive to compute because they require a backward pass at every token. The authors note this is the main cost of the method. Committal and distributional uncertainties are cheap: they need only forward-pass probabilities and their entropy.
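To make the three definitions concrete, here is a minimal sketch for the trace channel only. It is not the authors' implementation. The function names are hypothetical, and the epistemic term assumes a toy model whose logits are a single linear map `W @ hidden`, so the gradient norm has a closed form. A real LLM would need an autograd backward pass instead.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distributional_uncertainty(probs):
    """U_D: Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def committal_uncertainty(probs):
    """U_C: Bernoulli variance p * (1 - p) of the top prediction."""
    p_top = max(probs)
    return p_top * (1.0 - p_top)

def epistemic_uncertainty_toy(hidden, probs, target):
    """U_E for a TOY model with logits = W @ hidden (an assumption).

    For that model, d p_y / d W[k][j] = p_y * (1[k == y] - p_k) * hidden[j],
    so the squared gradient norm factorises into three terms.
    """
    p_y = probs[target]
    err_sq = sum(((1.0 if k == target else 0.0) - p) ** 2
                 for k, p in enumerate(probs))
    hidden_sq = sum(h * h for h in hidden)
    return (p_y ** 2) * err_sq * hidden_sq

# Usage with made-up logits and hidden state for one token position.
probs = softmax([2.0, 1.0, 0.1])
u_d = distributional_uncertainty(probs)
u_c = committal_uncertainty(probs)
u_e = epistemic_uncertainty_toy([0.5, -1.0], probs, target=0)
```

The first two quantities need only the forward-pass probabilities, while the third depends on parameters through the gradient. That matches the cost asymmetry described above.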
From each of the 6 time series, they extract 5 features: early mean, middle mean, late mean, slope, and R² (linearity of the trend).
This gives a 30-dimensional feature vector (6 series × 5 features) that summarizes the dynamics of the entire trace. A simple logistic regression classifier on these features predicts correctness.
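The profile extraction can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code. Splitting the series into equal thirds for the early, middle, and late means is an assumption, as is fitting the slope and R² by ordinary least squares against the token index. All function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def slope_and_r2(series):
    """Least-squares line through (t, series[t]); returns (slope, R^2)."""
    ts = list(range(len(series)))
    t_bar, y_bar = mean(ts), mean(series)
    s_tt = sum((t - t_bar) ** 2 for t in ts)
    s_ty = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, series))
    s_yy = sum((y - y_bar) ** 2 for y in series)
    slope = s_ty / s_tt
    # A constant series has no variance to explain; use R^2 = 0 by convention.
    r2 = 0.0 if s_yy == 0 else (s_ty ** 2) / (s_tt * s_yy)
    return slope, r2

def profile_features(series):
    """Five features for one series: early/middle/late mean, slope, R^2."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least 3 positions to form three segments")
    a, b = n // 3, 2 * n // 3  # assumed segmentation into thirds
    slope, r2 = slope_and_r2(series)
    return [mean(series[:a]), mean(series[a:b]), mean(series[b:]), slope, r2]

def trace_profile(series_by_name):
    """Concatenate per-series features in a fixed (sorted) key order."""
    features = []
    for name in sorted(series_by_name):
        features.extend(profile_features(series_by_name[name]))
    return features

# Usage: six made-up series (3 uncertainty types x 2 channels).
names = ["E_trace", "C_trace", "D_trace", "E_answer", "C_answer", "D_answer"]
toy = {name: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0] for name in names}
vector = trace_profile(toy)
```

The resulting vector is what a linear classifier such as logistic regression would take as input. Sorting the keys keeps the feature order stable across traces.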
The authors evaluate five models spanning the spectrum from standard LMs to reasoning models: Llama 3.1, Llama 3.2, Qwen 2.5, DeepSeek R1, and Qwen 3 (the models listed in the results table).
Datasets: GSM8K (grade-school math) and ProntoQA (logical reasoning with low-likelihood tokens). Both have verifiable answers.
| Model | GSM8K LR | GSM8K GB | GSM8K SC | ProntoQA LR | ProntoQA GB | ProntoQA SC |
|---|---|---|---|---|---|---|
| Llama 3.1 | 0.783 | 0.758 | 0.491 | 0.799 | 0.762 | 0.566 |
| Llama 3.2 | 0.758 | 0.767 | 0.566 | 0.550 | 0.533 | 0.549 |
| Qwen 2.5 | 0.807 | 0.787 | 0.689 | 0.519 | 0.476 | 0.565 |
| DeepSeek R1 | 0.786 | 0.775 | 0.703 | 0.672 | 0.639 | 0.615 |
| Qwen 3 | 0.727 | 0.665 | 0.728 | 0.657 | 0.551 | 0.611 |
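LR = Logistic Regression on uncertainty trace profile; GB = Gradient Boosting; SC = Self-Certainty baseline. All values are AUROC. The uncertainty trace profile outperforms Self-Certainty in most settings, often dramatically (e.g., Llama 3.1 on GSM8K: 0.783 vs 0.491). The exceptions are Qwen 2.5 on ProntoQA (0.519 vs 0.565), Qwen 3 on GSM8K (0.727 vs 0.728), and Llama 3.2 on ProntoQA, where the two are effectively tied (0.550 vs 0.549). On Qwen 2.5 GSM8K, LR reaches the best overall AUROC of 0.807.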
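As a quick arithmetic check, the relative gains of LR over SC on GSM8K can be recomputed from the table. This is a small sketch, not part of the paper. It uses only the AUROC values copied from the table, and `relative_gain` is a hypothetical helper.

```python
# (LR AUROC, SC AUROC) on GSM8K, copied from the results table.
gsm8k = {
    "Llama 3.1": (0.783, 0.491),
    "Llama 3.2": (0.758, 0.566),
    "Qwen 2.5": (0.807, 0.689),
    "DeepSeek R1": (0.786, 0.703),
    "Qwen 3": (0.727, 0.728),
}

def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline

gains = {model: relative_gain(lr, sc) for model, (lr, sc) in gsm8k.items()}

for model, gain in gains.items():
    print(f"{model}: {gain:+.1%}")
```

Under this reading, the 59% figure matches the Llama 3.1 row (0.783 vs 0.491) rather than the row with the best absolute AUROC. The Qwen 3 row shows a gain that is marginally negative.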
The authors find that trace-level uncertainty (U^Tr) carries most of the predictive power. Answer-level uncertainty (U^A) contributes but is secondary. This is surprising: the model's uncertainty about the next token is more informative than its uncertainty about the final answer.
Within trace uncertainty, distributional (U_D) and committal (U_C) perform similarly (AUROC up to 0.78), while epistemic (U_E) is slightly weaker. This suggests that aleatoric uncertainty (the model's current confusion) is more predictive than epistemic uncertainty (whether the model has seen similar patterns).
The authors train classifiers on progressively longer prefixes of the trace. On GSM8K, AUROC reaches 0.801 using only the first 300 tokens — nearly matching the full-trace score. This means:
The feature analysis reveals distinct uncertainty profiles:
A subtle but important finding: U_E^Tr (epistemic on next token) is higher for incorrect traces, while U_E^A (epistemic on final answer) is lower for incorrect traces.
Interpretation: when generating an incorrect answer, the model takes less-supported steps (high U_E^Tr), but the eventual wrong answer is actually more familiar to the model than correct answers tend to be (low U_E^A). The trace and answer pull in opposite directions. For reasoning models, U_E^Tr decreases more steeply on incorrect traces — the model converges smoothly on a wrong answer that it has seen before.
Our three active projects — Feature Rivalry, Uncertainty vs Correctness, and Confidence Margin — all address the same question: how do we know when an LLM is wrong? This paper offers a complementary perspective that operates at a different level of abstraction.
Our Feature Rivalry work analyzes individual tokens within a generation, looking for negatively correlated SAE feature pairs that signal uncertainty. The Tracing Uncertainty paper analyzes entire traces, summarizing the shape of uncertainty over time. Both approaches achieve AUROC ~0.8 on correctness prediction, but with different inputs:
| Dimension | Feature Rivalry | Tracing Uncertainty |
|---|---|---|
| Unit | Per-token SAE features | Trace-level uncertainty time series |
| Signal | Feature correlation patterns | Uncertainty dynamics (slope, linearity) |
| Cost | One forward pass + SAE decode | Forward + backward pass per token |
| Interpretability | Human-inspectable features | Statistical profile (opaque) |
| Early detection | Per-token (immediate) | ~300 tokens for full AUROC |
A natural hybrid: use SAE features to compute token-level uncertainty estimates, then aggregate them into trace-level profiles. This would give us the interpretability of SAEs with the predictive power of trace dynamics.
The paper's committal aleatoric uncertainty (U_C) measures Bernoulli variance of the top prediction. When U_C is high, the model is unsure about its own top choice. This is conceptually similar to Feature Rivalry: negatively correlated feature pairs also signal that the model is torn between competing hypotheses.
One hypothesis: the features identified by Feature Rivalry as "uncertainty signals" might be the mechanistic basis of the committal uncertainty measured in this paper. If we could trace the gradient of U_C back through the network, we might find that it flows through the same rivalry features.
The paper's epistemic uncertainty (U_E) is estimated via gradient norms. High U_E is interpreted as a sign that the model has not seen similar patterns in training. This is related to the idea of using SAEs for out-of-distribution detection: if an input activates unusual SAE features (or fails to activate familiar ones), the model is likely outside its training distribution.
SAEBench's Core metrics include loss recovered, which indirectly measures how well the SAE represents the input. Low loss recovered on unfamiliar inputs might correlate with high U_E. We could test this by computing both metrics on the same dataset.
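One way to run that test is a simple correlation between the two per-example metrics. The sketch below assumes both have already been computed for the same examples. The `pearson` helper is hypothetical, and the numbers are placeholders for illustration only, not measurements.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# PLACEHOLDER values, one entry per example; replace with real measurements.
loss_recovered = [0.98, 0.95, 0.90, 0.80]
epistemic_u = [0.10, 0.20, 0.35, 0.60]

r = pearson(loss_recovered, epistemic_u)
```

If the hypothesis holds, `r` should come out clearly negative on real data. A rank correlation may be the better choice if the relationship is monotone but not linear.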
The paper finds that standard LMs (Llama) and reasoning models (R1, Qwen3) have qualitatively different uncertainty profiles. For standard LMs, correct traces are more linear (higher R²). For reasoning models, the pattern reverses: incorrect traces are more linear.
This has implications for our SAE work: if we train SAEs on a reasoning model like Qwen 3.5 35B-A3B, the uncertainty features we discover may differ from those in a standard LM. The "smooth convergence on wrong answers" pattern might manifest as different feature activation patterns in the SAE.
Strong method, complementary to our SAE approach. The uncertainty trace profile achieves impressive predictive performance with a simple feature engineering pipeline. The finding that trace-level uncertainty is more predictive than answer-level uncertainty supports the idea that the process of reasoning contains more signal than the final output.
For our work, the most actionable insight is that temporal dynamics matter. Our current Feature Rivalry analysis is per-token; adding a temporal dimension (how do rivalry patterns evolve over a trace) could significantly improve predictive power. The early-detection result (AUROC 0.8 at 300 tokens) also suggests that uncertainty signals are strong enough to be useful for real-time intervention.