Confidence Margin: Calibrating Reasoning Models with Process Supervision

Paper: arXiv:2604.23333 · Authors: Wang, Zuo, Jurayj, Van Durme, Liu (JHU) · Explored: 2026-05-15

Overview

Scaling test-time computation with RL has made reasoning LLMs dramatically better at math, coding, and science. But there is a catch: the same training that improves accuracy also makes the model overconfident. A model trained with outcome-only rewards (GRPO) learns to be right more often, but its confidence no longer reflects its true probability of being correct.

This paper introduces RLCM (Reinforcement Learning with Confidence Margin), a calibration-aware RL framework that jointly optimizes for correctness and confidence reliability. The key insight: instead of trying to make confidence match correctness at every step, train the model to assign higher confidence to reasoning prefixes that are more likely to succeed, and lower confidence to prefixes that are likely to fail. The gap between them — the confidence margin — becomes the training signal.

Core result: RLCM achieves the best Pareto frontier on accuracy vs calibration across math, code, logic, and science benchmarks. It preserves the accuracy gains of GRPO while reducing Expected Calibration Error (ECE) by 30–50%. The confidence signal is also more useful for downstream tasks: better early-exit control and better confidence-weighted answer aggregation.

The Problem: Outcome-Only RL Destroys Calibration

GRPO (Group Relative Policy Optimization) and similar methods optimize only for final-answer correctness. The reward is binary: right or wrong. This has two consequences:

Sparse credit assignment. The model gets no feedback about which reasoning steps were good or bad — only whether the final answer matched the ground truth.
Overconfidence incentive. Binary rewards encourage the model to be maximally confident on every question, because hedging or admitting uncertainty never helps win the binary reward.

The paper shows this empirically: GRPO-trained models achieve higher accuracy than the base model, but their ECE and PCE (Positive Calibration Error) increase. The model becomes better at reasoning and worse at knowing when it is wrong.
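To make the ECE metric concrete, here is a minimal sketch of the standard binned estimator. The bin count (10 equal-width bins) and the function name are assumptions for illustration; these notes do not record the paper's exact binning, and PCE is left out because its definition is not given here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins.

    confidences: floats in [0, 1]; correct: 0/1 outcomes of the same length.
    n_bins=10 with equal-width bins is an assumption, not the paper's setting.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy overconfident model: always says 0.95 but is right half the time.
overconfident = expected_calibration_error([0.95] * 4, [1, 0, 1, 0])
# Toy calibrated model: says 0.5 and is right half the time.
calibrated = expected_calibration_error([0.5] * 4, [1, 0, 1, 0])
```

Both toy models have 50% accuracy, yet the first has an ECE of roughly 0.45 and the second an ECE of 0. That is the pattern described above: accuracy alone does not reveal whether confidence is trustworthy.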

The Method: Three Components

1. A probe-based confidence estimator

RLCM attaches a lightweight 2-layer MLP probe to the model's final-layer hidden states. At any intermediate reasoning prefix, the probe outputs a confidence score C_b(y) ∈ (0,1) — the probability that the prefix will lead to a correct final answer.

To train the probe, the authors use forced-answer sampling: at a truncation point b, they append the end-of-thinking token and force the model to generate a final answer. They sample K completions and compute the Monte Carlo accuracy:

Y_b(y) = (1/K) Σ_{k=1}^K 𝟙[â_{b,k} = a*]

The probe is trained with binary cross-entropy on these soft targets. Importantly, the probe is updated jointly with policy training so that it tracks the evolving rollout distribution, but its gradients do not backpropagate into the policy model.
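The target construction and the probe loss can be sketched in a few lines. This is an illustration only: the function names and toy answers are hypothetical, and the forced completions are assumed to have already been sampled and reduced to answer strings.

```python
import math


def monte_carlo_target(forced_answers, gold_answer):
    """Y_b(y): fraction of the K forced answers that equal the gold answer."""
    if not forced_answers:
        raise ValueError("need at least one forced completion")
    return sum(a == gold_answer for a in forced_answers) / len(forced_answers)


def soft_bce(confidence, target, eps=1e-7):
    """Binary cross-entropy between probe output C_b(y) and soft target Y_b(y)."""
    c = min(max(confidence, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(c) + (1.0 - target) * math.log(1.0 - c))


# K = 4 forced answers at one truncation point; 3 of them are correct.
y_b = monte_carlo_target(["42", "42", "17", "42"], gold_answer="42")

loss_matched = soft_bce(0.75, y_b)         # probe agrees with the soft target
loss_overconfident = soft_bce(0.99, y_b)   # probe is too sure of success
```

Because the target is a fraction rather than a hard 0/1 label, the loss is minimized when the probe outputs that fraction, so `loss_matched` is smaller than `loss_overconfident`.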

2. Margin-based process reward

For each reasoning trajectory, the authors select a set of compute budgets (truncation points) and partition them into two groups:

𝒷⁺(y): prefixes with Y_b(y) ≥ 0.5 (more likely to succeed)
𝒷⁻(y): prefixes with Y_b(y) < 0.5 (more likely to fail)

The margin reward is simply the gap between their average predicted confidence:

R_margin(y) = mean_{b∈𝒷⁺} C_b(y) − mean_{b∈𝒷⁻} C_b(y)

This is a ranking objective, not a score-matching objective. The probe does not need to output exact probabilities — it only needs to rank better prefixes above worse ones. This makes the objective less brittle than Brier-score penalties, which the paper shows actually degrade accuracy when combined with reasoning RL.
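A minimal sketch of the margin computation for a single trajectory follows. The function name is hypothetical, and returning 0.0 when one of the two groups is empty is an assumption; these notes do not say how the paper handles that case.

```python
def margin_reward(mc_accuracy, probe_confidence, threshold=0.5):
    """R_margin: mean probe confidence on likely-success prefixes minus
    mean probe confidence on likely-failure prefixes.

    mc_accuracy[i] is Y_b(y) and probe_confidence[i] is C_b(y) for budget i.
    """
    if len(mc_accuracy) != len(probe_confidence):
        raise ValueError("inputs must have the same length")

    positive = [c for y, c in zip(mc_accuracy, probe_confidence) if y >= threshold]
    negative = [c for y, c in zip(mc_accuracy, probe_confidence) if y < threshold]

    # Assumption: with no contrast between groups there is no margin signal.
    if not positive or not negative:
        return 0.0

    return sum(positive) / len(positive) - sum(negative) / len(negative)


mc = [0.125, 0.25, 0.75, 1.0]           # Monte Carlo accuracy at four budgets
well_ranked = [0.2, 0.3, 0.7, 0.9]      # confidence rises with success rate
badly_ranked = [0.9, 0.7, 0.3, 0.2]     # confidence is inverted

good = margin_reward(mc, well_ranked)
bad = margin_reward(mc, badly_ranked)
```

Here `good` is about 0.55 and `bad` is about -0.55. The reward depends only on how confidence is ordered across the two groups, not on whether any single value equals its Monte Carlo target, which is the ranking property described above.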

3. Joint optimization with GRPO

The final reward combines the standard answer-correctness reward with the margin reward:

R_total = R_answer + λ · R_margin

The policy is trained with GRPO-style group-relative advantages, but with the margin reward providing dense process-level supervision. The KL regularizer is removed following recent work that shows it is unnecessary for reasoning RL.
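As a rough sketch, the combined reward and the group-relative normalization look like this. The value of λ (`lam=0.5`), the use of the population standard deviation, and the `eps` stabilizer are all assumptions for illustration; the paper's settings are not recorded in these notes.

```python
import statistics


def total_reward(answer_correct, margin, lam=0.5):
    """R_total = R_answer + lambda * R_margin, with a binary answer reward."""
    return float(answer_correct) + lam * margin


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for the same prompt: (answer correct?, margin reward).
group = [(True, 0.4), (True, -0.2), (False, 0.3), (False, -0.3)]
rewards = [total_reward(ok, m) for ok, m in group]
advantages = group_relative_advantages(rewards)
```

With outcome-only rewards the two correct rollouts would receive identical advantages. With the margin term added, the correct rollout whose confidence ranks its prefixes well is preferred over the correct rollout whose confidence does not.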

Results

Accuracy vs calibration trade-off

The paper evaluates on MATH-500, AMC, OlympiadBench, AIME 2024/2025 (math), GPQA (science), LogiQA (logic), and LiveCodeBench (code). The base model is R1-distilled Qwen-7B. All methods are trained on the GRPO-LEAD math dataset.

Method	Accuracy	ECE ↓	PCE ↓	Confidence Type
Base (R1-distilled)	Moderate	Low	Low	—
GRPO	Highest	High	High	—
RLCR (Brier reward)	Lower	Moderate	Moderate	Verbalized
C²GSPG	Lower	Moderate	Moderate	Logit-based
RLCM (this paper)	Near-best	Lowest	Lowest	Probe-based

Key takeaways:

GRPO wins on accuracy, loses on calibration. Outcome-only RL improves reasoning but makes confidence unreliable.
Brier-style calibration rewards hurt accuracy. Direct score matching creates a brittle trade-off (Figure 3 in the paper).
RLCM finds the sweet spot. It stays close to GRPO's accuracy while cutting ECE by 30–50%. On AIME24/25, it achieves the best accuracy and the best calibration simultaneously.
Probe-based > verbalized > logit-based. The 2-layer MLP probe produces more expressive confidence than verbalized "I'm sure" tokens or raw softmax probabilities.

Process-level calibration

The authors verify that calibration holds throughout the reasoning trajectory, not just at the final answer. At varying compute budgets (from 512 tokens to 8k tokens), RLCM consistently achieves the lowest ECE while matching the best accuracy. This confirms that the margin reward provides useful supervision at every stage of reasoning.

Verbalized uncertainty is preserved

The authors analyze uncertainty expressions in reasoning traces (self-correction like "wait, let me reconsider", hedging like "perhaps", knowledge gaps like "I don't know"). GRPO suppresses self-correction behavior, while RLCM preserves the base model's distribution of uncertainty expressions. This is notable because RLCM has no explicit reward on semantic uncertainty: the effect emerges as a side effect of calibrating confidence at intermediate reasoning prefixes.

Ablation: margin vs Brier, final vs process

The ablation study isolates two design choices:

Where supervision is applied: Final-answer only vs intermediate prefixes. Process-level supervision consistently outperforms final-only.
How supervision is applied: Brier score (pointwise) vs margin reward (ranking). The margin reward achieves better accuracy-calibration trade-offs than Brier.

The best configuration — process-level + margin reward — is RLCM.

Downstream Applications

1. Conformal risk control (early exit)

With calibrated confidence, the model can decide when to stop reasoning. The paper uses Learn-Then-Test (LTT) with two thresholds:

λ₁: Early exit when confidence is persistently low (save compute on unsolvable problems)
λ₂: Early exit when confidence is high enough (stop once answer is reliable)

RLCM provides the most useful confidence signal for this: its realized accuracy tracks the perfect-calibration line more closely than GRPO or RLCR. GRPO is overly conservative, wasting tokens. RLCR has step-like plateaus due to coarse verbalized confidence. RLCM offers a smooth, fine-grained accuracy–compute trade-off.
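The two-threshold stopping rule can be sketched as below. This covers only the decision logic: the thresholds are taken as given (the paper selects them with LTT, which is not implemented here), and reading "persistently low" as `patience` consecutive low readings is an assumption.

```python
def early_exit(confidences, low_threshold, high_threshold, patience=3):
    """Return (budget_index, reason) for the first point at which reasoning stops.

    confidences: probe confidence at successive compute budgets.
    low_threshold plays the role of lambda_1, high_threshold of lambda_2.
    patience (consecutive low readings before giving up) is a hypothetical knob.
    """
    if not confidences:
        raise ValueError("need at least one confidence reading")

    low_streak = 0
    for i, c in enumerate(confidences):
        if c >= high_threshold:
            return i, "confident"      # answer is judged reliable enough
        low_streak = low_streak + 1 if c <= low_threshold else 0
        if low_streak >= patience:
            return i, "give_up"        # likely unsolvable; stop spending tokens
    return len(confidences) - 1, "budget_exhausted"


solved = early_exit([0.3, 0.5, 0.7, 0.92, 0.95], low_threshold=0.1, high_threshold=0.9)
hopeless = early_exit([0.05, 0.08, 0.04, 0.06], low_threshold=0.1, high_threshold=0.9)
undecided = early_exit([0.4, 0.5, 0.6], low_threshold=0.1, high_threshold=0.9)
```

The first trajectory stops at the fourth budget because confidence crosses the high threshold. The second stops after three consecutive low readings. The third runs to the end of its budget. A rule like this is only as good as the confidence signal feeding it, which is why calibration matters for compute control.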

2. Confidence-weighted aggregation

When sampling multiple rollouts, majority voting weights all answers equally. Confidence-weighted voting weights each answer by its estimated reliability:

â_conf = argmax_{a∈A} Σ_{i: a_i=a} c_i

Although RLCM and GRPO have nearly identical Pass@1 accuracy, RLCM delivers substantially larger gains from confidence-weighted voting. This means its confidence estimates are not just calibrated — they are discriminative, correctly separating good rollouts from bad ones.
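A small sketch contrasting the two aggregation rules follows. The function names and toy rollouts are hypothetical; the weighted rule follows the formula above.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Pick the most frequent answer; every rollout counts equally."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers, confidences):
    """Pick the answer with the largest summed confidence across rollouts."""
    if len(answers) != len(confidences):
        raise ValueError("answers and confidences must have the same length")
    scores = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        scores[answer] += conf
    return max(scores, key=scores.get)


# Five rollouts: three low-confidence votes for "12", two high-confidence votes for "7".
answers = ["12", "12", "12", "7", "7"]
confidences = [0.2, 0.3, 0.1, 0.9, 0.8]

by_majority = majority_vote(answers)
by_confidence = confidence_weighted_vote(answers, confidences)
```

Majority voting returns "12", while confidence weighting returns "7" (summed confidence of about 1.7 against about 0.6). Weighting only helps if high confidence really does indicate correct rollouts, which is the discriminative property the paragraph above attributes to RLCM.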

Connections to Our SAE Work

We are running three concurrent projects on uncertainty detection in LLMs: Feature Rivalry, Uncertainty vs Correctness Features, and SAEBench evaluation. RLCM connects to all three in interesting ways.

Hidden-state probes vs SAE features

RLCM uses a 2-layer MLP probe on the final-layer hidden state to estimate confidence. Our work uses Sparse Autoencoder features at intermediate layers to detect uncertainty and incorrectness. Both approaches share the core assumption that the model's internal representations encode reliability signals that can be extracted with lightweight learned functions.

The difference is granularity and interpretability:

Dimension	RLCM probe	SAE features (our work)
Unit of analysis	Final hidden state vector	Individual sparse features
Layer	Final layer only	Any layer (we use layer 20)
Interpretability	Black-box MLP	Human-inspectable concepts
Training signal	Monte Carlo correctness	Entropy / correctness labels
Use case	Train better models	Monitor existing models

A natural hybrid: train an RLCM-style probe on SAE feature activations rather than raw hidden states. This would combine RLCM's strong calibration performance with the interpretability of SAE features. If the probe learns to weight a small number of SAE features, we could inspect which concepts it uses for confidence estimation — something impossible with a black-box MLP.

Process supervision vs per-token analysis

RLCM provides process-level supervision at truncation points. Our Feature Rivalry work analyzes per-token feature correlations within a single generation. These are complementary:

RLCM asks: "At this prefix, is the model likely to succeed?" It operates at the trajectory level.
Feature Rivalry asks: "At this token, do two features compete?" It operates at the token level.

A trajectory with low RLCM confidence might show high feature rivalry at the token where the model first becomes uncertain. Combining both signals could give a richer uncertainty profile: trajectory-level + token-level.

Margin reward and feature rivalry

The margin reward widens the confidence gap between correct and incorrect prefixes. Feature Rivalry detects negatively correlated feature pairs that activate on uncertain tokens. These are conceptually similar: both identify a gap or conflict in the model's internal state that signals unreliability.

One speculative connection: if we trained a model with RLCM's margin reward, would the resulting SAE features show more rivalry on uncertain prefixes? The margin reward explicitly encourages the model to separate confident and unconfident states, which might make the corresponding SAE features more separable too.

Our Assessment

What we like

The margin reward is elegant. Ranking-based calibration supervision avoids the brittleness of pointwise Brier penalties. The idea that calibration is a relative property (good prefixes should be more confident than bad ones) is both intuitive and empirically effective.
Process-level supervision is the right level of granularity. Final-answer rewards are too sparse; token-level rewards are too noisy. Intermediate prefixes provide dense but stable supervision.
Probe-based confidence outperforms verbalized and logit-based. This validates the broader research direction of using internal representations (not just surface behavior) for uncertainty estimation.
Downstream applications are well-demonstrated. Early exit and confidence-weighted aggregation are practical use cases, and the gains are substantial.

What concerns us

Requires reasoning models and RL infrastructure. RLCM is a training method, not a monitoring method. You need a reasoning model (R1-distilled), a reward model or verifier, and a GRPO training setup. This is not something we can apply to an existing deployed model without retraining.
Monte Carlo correctness is expensive. Forcing K completions at every truncation point during training adds significant compute overhead. The paper uses K=8; at scale, this is costly.
Limited to verifiable domains. The method requires ground-truth answers to compute Y_b(y). It works for math and code but not for open-ended generation, creative writing, or subjective tasks.
Probe is a black box. Unlike SAE-based approaches, the 2-layer MLP probe provides no interpretability into what the model uses to estimate confidence.

What we would try next

SAE-based confidence probe. Train a lightweight probe on SAE feature activations instead of raw hidden states. This would give us interpretable confidence estimation: we could inspect which features the probe weights for confidence.
Feature rivalry on RLCM-trained models. If we can access an RLCM-trained checkpoint, run Feature Rivalry analysis on it and compare with a GRPO-trained baseline. Does margin-based training create more separable uncertainty features?
Per-token confidence trajectory. Combine RLCM's prefix-level confidence with Feature Rivalry's token-level signals to build a richer uncertainty profile across the full generation.

Verdict

Strong method, wrong layer of the stack for us. RLCM is an excellent training technique for reasoning models, and the margin reward is a genuinely new idea that solves a real problem (outcome-only RL destroying calibration). However, our work focuses on monitoring existing models, not training new ones. The most actionable idea for us is the probe-based confidence estimator: it validates that hidden-state-based uncertainty detection is a powerful approach, and it suggests we should try training probes on SAE features for calibrated, interpretable confidence estimation.

References

Wang, Zuo, Jurayj, Van Durme, Liu, Process Supervision of Confidence Margin for Calibrated LLM Reasoning, arXiv:2604.23333
Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, 2024 — GRPO baseline
Damani et al., Calibration-Aware Reinforcement Learning, 2025 — RLCR baseline
Wang et al., Feature Rivalry as a Signature of Uncertainty in LLMs, arXiv:2605.08149 — our analysis
Chiriqui & Te'eni, Are LLM Uncertainty and Correctness Encoded by the Same Features?, arXiv:2604.19974 — our analysis
Karvonen et al., SAEBench: A Comprehensive Benchmark for Sparse Autoencoders, arXiv:2503.09532 — our analysis
Grünefeld et al., Tracing Uncertainty in Language Model "Reasoning", arXiv:2605.07776 — our analysis
