Confidence Margin: Calibrating Reasoning Models with Process Supervision

Paper: arXiv:2604.23333 · Authors: Wang, Zuo, Jurayj, Van Durme, Liu (JHU) · Explored: 2026-05-15

Overview

Scaling test-time computation with RL has made reasoning LLMs dramatically better at math, coding, and science. But there is a catch: the same training that improves accuracy also makes the model overconfident. A model trained with outcome-only rewards (GRPO) learns to be right more often, but its confidence no longer reflects its true probability of being correct.

This paper introduces RLCM (Reinforcement Learning with Confidence Margin), a calibration-aware RL framework that jointly optimizes for correctness and confidence reliability. The key insight: instead of trying to make confidence match correctness at every step, train the model to assign higher confidence to reasoning prefixes that are more likely to succeed, and lower confidence to prefixes that are likely to fail. The gap between them — the confidence margin — becomes the training signal.

Core result: RLCM achieves the best Pareto frontier on accuracy vs calibration across math, code, logic, and science benchmarks. It preserves the accuracy gains of GRPO while reducing Expected Calibration Error (ECE) by 30–50%. The confidence signal is also more useful for downstream tasks: better early-exit control and better confidence-weighted answer aggregation.

The Problem: Outcome-Only RL Destroys Calibration

GRPO (Group Relative Policy Optimization) and similar methods optimize only for final-answer correctness. The reward is binary: right or wrong. This has two consequences:

  1. Sparse credit assignment. The model gets no feedback about which reasoning steps were good or bad — only whether the final answer matched the ground truth.
  2. Overconfidence incentive. Binary rewards encourage the model to be maximally confident on every question, because hedging or admitting uncertainty never helps win the binary reward.

The paper shows this empirically: GRPO-trained models achieve higher accuracy than the base model, but their ECE and PCE (Positive Calibration Error) increase. The model becomes better at reasoning and worse at knowing when it is wrong.
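
To make the calibration metric concrete, here is a minimal sketch of the standard binned ECE, assuming equal-width bins and per-example correctness labels. The function name, the bin count, and the toy inputs are our own choices for illustration, not taken from the paper, and PCE is not covered because its exact definition is not given in these notes.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins over [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the bin's share of samples
    return ece

# Toy data: both models are right on 6 of 10 questions.
labels = [1] * 6 + [0] * 4
overconfident = expected_calibration_error([0.95] * 10, labels)  # always 0.95 confident
calibrated = expected_calibration_error([0.6] * 10, labels)      # always 0.6 confident
```

Both toy models have the same accuracy, but only the overconfident one has a large ECE, which is the failure pattern attributed to GRPO above.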

The Method: Three Components

1. A probe-based confidence estimator

RLCM attaches a lightweight 2-layer MLP probe to the model's final-layer hidden states. At any intermediate reasoning prefix, the probe outputs a confidence score C_b(y) ∈ (0,1) — the probability that the prefix will lead to a correct final answer.

To train the probe, the authors use forced-answer sampling: at a truncation point b, they append the end-of-thinking token and force the model to generate a final answer. They sample K completions and compute the Monte Carlo accuracy:

Y_b(y) = (1/K) Σ_{k=1}^K 𝟙[â_{b,k} = a*]

The probe is trained with binary cross-entropy on these soft targets. Importantly, the probe is updated jointly with policy training so it tracks the evolving rollout distribution, but its gradients do not backprop into the policy model.
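
The probe's training target and loss can be sketched as follows. This is a simplified illustration under our own assumptions: answers are compared by exact string match, the helper names are hypothetical, and the probe forward pass and the forced-answer generation are not shown.

```python
import numpy as np

def monte_carlo_target(forced_answers, gold_answer):
    """Y_b(y): fraction of the K forced answers that match the gold answer."""
    return float(np.mean([a == gold_answer for a in forced_answers]))

def soft_bce(confidence, target, eps=1e-7):
    """Binary cross-entropy between a probe confidence and a soft target in [0, 1]."""
    c = np.clip(confidence, eps, 1.0 - eps)  # avoid log(0)
    return float(-(target * np.log(c) + (1.0 - target) * np.log(1.0 - c)))

# K = 4 forced completions at one truncation point; 3 match the gold answer "42".
y_b = monte_carlo_target(["42", "42", "17", "42"], gold_answer="42")
loss_close = soft_bce(0.75, y_b)  # probe agrees with the Monte Carlo estimate
loss_far = soft_bce(0.10, y_b)    # probe is far too pessimistic
```

Because the target is a fraction rather than a hard 0/1 label, the loss is minimized when the probe outputs the Monte Carlo accuracy itself.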

2. Margin-based process reward

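For each reasoning trajectory, the authors select a set of compute budgets (truncation points) and partition them into two groups: a positive set 𝒷⁺ of budgets whose prefixes are more likely to succeed, and a negative set 𝒷⁻ of budgets whose prefixes are likely to fail.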

The margin reward is simply the gap between their average predicted confidence:

R_margin(y) = mean_{b∈𝒷⁺} C_b(y) − mean_{b∈𝒷⁻} C_b(y)

This is a ranking objective, not a score-matching objective. The probe does not need to output exact probabilities — it only needs to rank better prefixes above worse ones. This makes the objective less brittle than Brier-score penalties, which the paper shows actually degrade accuracy when combined with reasoning RL.
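
A minimal sketch of the margin computation, taking the partition into positive and negative budgets as given. The budgets, confidences, and function name below are hypothetical values chosen for illustration.

```python
import numpy as np

def margin_reward(conf_by_budget, positive_budgets, negative_budgets):
    """Mean probe confidence over positive budgets minus the mean over negative ones."""
    pos = np.mean([conf_by_budget[b] for b in positive_budgets])
    neg = np.mean([conf_by_budget[b] for b in negative_budgets])
    return float(pos - neg)

# Hypothetical probe confidences C_b(y) at four truncation budgets (in tokens).
conf = {512: 0.2, 1024: 0.3, 2048: 0.7, 4096: 0.9}
r = margin_reward(conf, positive_budgets=[2048, 4096], negative_budgets=[512, 1024])

# Shifting every confidence by the same constant leaves the margin unchanged.
shifted = {b: c - 0.1 for b, c in conf.items()}
r_shifted = margin_reward(shifted, [2048, 4096], [512, 1024])
```

The shifted case illustrates why this behaves like a ranking objective: the reward depends only on the gap between the two groups, not on the absolute confidence values.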

3. Joint optimization with GRPO

The final reward combines the standard answer-correctness reward with the margin reward:

R_total = R_answer + λ · R_margin

The policy is trained with GRPO-style group-relative advantages, but with the margin reward providing dense process-level supervision. The KL regularizer is removed following recent work that shows it is unnecessary for reasoning RL.
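
The reward combination and the group-relative advantage can be sketched as below. The value of λ, the toy rewards, and the standard-deviation normalization are assumptions on our part; these notes do not record the paper's exact settings.

```python
import numpy as np

def total_rewards(answer_rewards, margin_rewards, lam=0.5):
    """R_total = R_answer + lam * R_margin, computed per rollout."""
    return np.asarray(answer_rewards, float) + lam * np.asarray(margin_rewards, float)

def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of rollouts for the same prompt."""
    rewards = np.asarray(rewards, float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four rollouts for one prompt: two correct, two incorrect.
answer = [1.0, 1.0, 0.0, 0.0]
margin = [0.6, 0.1, 0.2, -0.3]
adv = group_relative_advantages(total_rewards(answer, margin, lam=0.5))
```

With outcome-only rewards the two correct rollouts would receive identical advantages. Adding the margin term separates them, which is the dense signal the method relies on.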

Results

Accuracy vs calibration trade-off

The paper evaluates on MATH-500, AMC, OlympiadBench, AIME 2024/2025 (math), GPQA (science), LogiQA (logic), and LiveCodeBench (code). The base model is R1-distilled Qwen-7B. All methods are trained on the GRPO-LEAD math dataset.

| Method | Accuracy | ECE ↓ | PCE ↓ | Confidence Type |
|---|---|---|---|---|
| Base (R1-distilled) | Moderate | Low | Low | |
| GRPO | Highest | High | High | |
| RLCR (Brier reward) | Lower | Moderate | Moderate | Verbalized |
| C²GSPG | Lower | Moderate | Moderate | Logit-based |
| RLCM (this paper) | Near-best | Lowest | Lowest | Probe-based |

Key takeaways:

Process-level calibration

The authors verify that calibration holds throughout the reasoning trajectory, not just at the final answer. At varying compute budgets (from 512 tokens to 8k tokens), RLCM consistently achieves the lowest ECE while matching the best accuracy. This confirms that the margin reward provides useful supervision at every stage of reasoning.

Verbalized uncertainty is preserved

The authors analyze uncertainty expressions in reasoning traces (self-correction like "wait, let me reconsider", hedging like "perhaps", knowledge gaps like "I don't know"). GRPO suppresses self-correction behavior, while RLCM preserves the base model's distribution of uncertainty expressions. This is notable because RLCM has no explicit reward on semantic uncertainty — the effect emerges naturally from calibration at the token level.

Ablation: margin vs Brier, final vs process

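The ablation study isolates two design choices: the form of the calibration reward (margin vs Brier) and where it is applied (final answer only vs process-level prefixes).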

The best configuration — process-level + margin reward — is RLCM.

Downstream Applications

1. Conformal risk control (early exit)

With calibrated confidence, the model can decide when to stop reasoning. The paper uses Learn-Then-Test (LTT) with two thresholds:

RLCM provides the most useful confidence signal for this: its realized accuracy tracks the perfect-calibration line more closely than GRPO or RLCR. GRPO is overly conservative, wasting tokens. RLCR has step-like plateaus due to coarse verbalized confidence. RLCM offers a smooth, fine-grained accuracy–compute trade-off.
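
As a rough illustration of confidence-based early exit, the sketch below stops at the first budget whose confidence clears a single fixed threshold. This is a deliberate simplification: it does not implement Learn-Then-Test or the paper's threshold selection, and the threshold and confidence values are hypothetical.

```python
def early_exit_budget(conf_by_budget, threshold):
    """Return the smallest budget whose confidence clears the threshold.

    Falls back to the largest budget if no prefix is confident enough.
    """
    budgets = sorted(conf_by_budget)
    for b in budgets:
        if conf_by_budget[b] >= threshold:
            return b
    return budgets[-1]

# Hypothetical probe confidences at increasing truncation budgets (in tokens).
conf = {512: 0.35, 1024: 0.62, 2048: 0.91, 4096: 0.97}
exit_at = early_exit_budget(conf, threshold=0.9)
never_confident = early_exit_budget(conf, threshold=0.99)
```

A smooth, well-calibrated confidence curve gives this kind of rule many usable operating points, whereas coarse confidence values collapse them into a few plateaus.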

2. Confidence-weighted aggregation

When sampling multiple rollouts, majority voting weights all answers equally. Confidence-weighted voting weights each answer by its estimated reliability:

â_conf = argmax_{a∈A} Σ_{i: a_i=a} c_i

Although RLCM and GRPO have nearly identical Pass@1 accuracy, RLCM delivers substantially larger gains from confidence-weighted voting. This means its confidence estimates are not just calibrated — they are discriminative, correctly separating good rollouts from bad ones.
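
The difference between the two aggregation rules can be seen in a small sketch. The answers and confidences are made-up values, and the function names are ours.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Pick the most frequent answer, weighting every rollout equally."""
    return Counter(answers).most_common(1)[0][0]

def confidence_weighted_vote(answers, confidences):
    """Pick the answer with the largest summed confidence across rollouts."""
    scores = defaultdict(float)
    for a, c in zip(answers, confidences):
        scores[a] += c
    return max(scores, key=scores.get)

# Five rollouts: three low-confidence votes for "12", two high-confidence votes for "7".
answers = ["12", "12", "12", "7", "7"]
confidences = [0.2, 0.3, 0.2, 0.9, 0.8]
by_majority = majority_vote(answers)
by_confidence = confidence_weighted_vote(answers, confidences)
```

The two rules disagree here, and weighted voting only helps if the high-confidence rollouts really are more likely to be correct, which is why discriminative confidence matters more than matching Pass@1.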

Connections to Our SAE Work

We are running three concurrent projects on uncertainty detection in LLMs: Feature Rivalry, Uncertainty vs Correctness Features, and SAEBench evaluation. RLCM connects to all three in interesting ways.

Hidden-state probes vs SAE features

RLCM uses a 2-layer MLP probe on the final-layer hidden state to estimate confidence. Our work uses Sparse Autoencoder features at intermediate layers to detect uncertainty and incorrectness. Both approaches share the core assumption that the model's internal representations encode reliability signals that can be extracted with lightweight learned functions.

The difference is granularity and interpretability:

| Dimension | RLCM probe | SAE features (our work) |
|---|---|---|
| Unit of analysis | Final hidden state vector | Individual sparse features |
| Layer | Final layer only | Any layer (we use layer 20) |
| Interpretability | Black-box MLP | Human-inspectable concepts |
| Training signal | Monte Carlo correctness | Entropy / correctness labels |
| Use case | Train better models | Monitor existing models |

A natural hybrid: train an RLCM-style probe on SAE feature activations rather than raw hidden states. This would combine RLCM's strong calibration performance with the interpretability of SAE features. If the probe learns to weight a small number of SAE features, we could inspect which concepts it uses for confidence estimation — something impossible with a black-box MLP.
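
One way the hybrid could look is a sparse linear probe over SAE activations, sketched below. Everything here is an assumption for illustration: the activations are synthetic, the L1-regularized logistic probe fit by proximal gradient descent is our substitute for RLCM's MLP, and the hyperparameters are untuned.

```python
import numpy as np

def fit_sparse_probe(features, targets, l1=0.01, lr=0.5, steps=500):
    """L1-regularized logistic probe fit by proximal gradient descent (ISTA)."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = p - targets  # gradient of BCE with respect to the logit
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
        # Soft-thresholding step: pushes small weights to exactly zero.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
    return w, b

rng = np.random.default_rng(0)
sae_acts = rng.random((200, 8))                  # synthetic stand-in for SAE activations
targets = (sae_acts[:, 3] > 0.5).astype(float)   # only feature 3 carries signal
w, b = fit_sparse_probe(sae_acts, targets)
top_feature = int(np.argmax(np.abs(w)))          # feature the probe relies on most
```

Reading off the largest-magnitude weights is the inspection step that a black-box MLP does not offer. The same loss accepts soft Monte Carlo targets in place of the hard labels used here.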

Process supervision vs per-token analysis

RLCM provides process-level supervision at truncation points. Our Feature Rivalry work analyzes per-token feature correlations within a single generation. These are complementary:

A trajectory with low RLCM confidence might show high feature rivalry at the token where the model first becomes uncertain. Combining both signals could give a richer uncertainty profile: trajectory-level + token-level.

Margin reward and feature rivalry

The margin reward widens the confidence gap between correct and incorrect prefixes. Feature Rivalry detects negatively correlated feature pairs that activate on uncertain tokens. These are conceptually similar: both identify a gap or conflict in the model's internal state that signals unreliability.
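
For readers unfamiliar with the rivalry idea, the sketch below shows one simple way to surface negatively correlated feature pairs from a token-by-feature activation matrix. It is a toy illustration with made-up activations and a hypothetical threshold, not our actual Feature Rivalry pipeline.

```python
import numpy as np

def rival_pairs(activations, threshold=-0.5):
    """Return (i, j, corr) for feature pairs whose Pearson correlation is below threshold."""
    corr = np.corrcoef(activations, rowvar=False)  # columns are features
    n = corr.shape[0]
    pairs = [(i, j, float(corr[i, j]))
             for i in range(n) for j in range(i + 1, n)
             if corr[i, j] < threshold]
    return sorted(pairs, key=lambda p: p[2])  # most negative first

# Toy activations: rows are tokens, columns are SAE features.
# Features 0 and 1 trade off against each other; feature 2 is nearly unrelated.
acts = np.array([
    [1.0, 0.0, 0.5],
    [0.8, 0.2, 0.4],
    [0.1, 0.9, 0.4],
    [0.0, 1.0, 0.5],
])
pairs = rival_pairs(acts)
```

Running the same analysis on prefixes with low and high probe confidence would be one way to test whether the two signals line up.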

One speculative connection: if we trained a model with RLCM's margin reward, would the resulting SAE features show more rivalry on uncertain prefixes? The margin reward explicitly encourages the model to separate confident and unconfident states, which might make the corresponding SAE features more separable too.

Our Assessment

What we like

What concerns us

What we would try next

  1. SAE-based confidence probe. Train a lightweight probe on SAE feature activations instead of raw hidden states. This would give us interpretable confidence estimation: we could inspect which features the probe weights for confidence.
  2. Feature rivalry on RLCM-trained models. If we can access an RLCM-trained checkpoint, run Feature Rivalry analysis on it and compare with a GRPO-trained baseline. Does margin-based training create more separable uncertainty features?
  3. Per-token confidence trajectory. Combine RLCM's prefix-level confidence with Feature Rivalry's token-level signals to build a richer uncertainty profile across the full generation.

Verdict

Strong method, wrong layer of the stack for us. RLCM is an excellent training technique for reasoning models, and the margin reward is a genuinely new idea that solves a real problem (outcome-only RL destroying calibration). However, our work focuses on monitoring existing models, not training new ones. The most actionable idea for us is the probe-based confidence estimator: it validates that hidden-state-based uncertainty detection is a powerful approach, and it suggests we should try training probes on SAE features for calibrated, interpretable confidence estimation.
