Scaling test-time computation with RL has made reasoning LLMs dramatically better at math, coding, and science. But there is a catch: the same training that improves accuracy also makes the model overconfident. A model trained with outcome-only rewards (GRPO) learns to be right more often, but its confidence no longer reflects its true probability of being correct.
This paper introduces RLCM (Reinforcement Learning with Confidence Margin), a calibration-aware RL framework that jointly optimizes for correctness and confidence reliability. The key insight: instead of trying to make confidence match correctness at every step, train the model to assign higher confidence to reasoning prefixes that are more likely to succeed, and lower confidence to prefixes that are likely to fail. The gap between them — the confidence margin — becomes the training signal.
GRPO (Group Relative Policy Optimization) and similar methods optimize only for final-answer correctness. The reward is binary: right or wrong. This has two consequences:
The paper shows this empirically: GRPO-trained models achieve higher accuracy than the base model, but their ECE (Expected Calibration Error) and PCE (Positive Calibration Error) both increase. The model becomes better at reasoning and worse at knowing when it is wrong.
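To make the calibration metric concrete, here is a minimal sketch of ECE using the common equal-width binning scheme. The bin count and the binning rule are our assumptions; the paper may use a different variant, and PCE is not covered here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1].

    confidences: predicted probabilities of being correct, one per example.
    correct:     0/1 outcomes, one per example.
    n_bins:      number of bins (10 is a common default, assumed here).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Each bin contributes its |accuracy - confidence| gap, weighted by bin size.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# An overconfident model: 95% confidence, 50% accuracy.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
# A calibrated model at the same accuracy: 50% confidence, 50% accuracy.
calibrated = expected_calibration_error([0.5] * 4, [1, 1, 0, 0])
```

Both toy models have the same accuracy, but the overconfident one has an ECE of roughly 0.45 while the calibrated one is at zero. That is the pattern described above: accuracy alone does not reveal the calibration gap.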
RLCM attaches a lightweight 2-layer MLP probe to the model's final-layer hidden states. At any intermediate reasoning prefix, the probe outputs a confidence score C_b(y) ∈ (0,1) — the probability that the prefix will lead to a correct final answer.
To train the probe, the authors use forced-answer sampling: at a truncation point b, they append the end-of-thinking token and force the model to generate a final answer. They sample K completions and compute the Monte Carlo accuracy:
Y_b(y) = (1/K) Σ_{k=1}^K 𝟙[â_{b,k} = a*]
The probe is trained with binary cross-entropy on these soft targets. Importantly, the probe is updated jointly with policy training so that it tracks the evolving rollout distribution, but its gradients are not backpropagated into the policy model.
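The target construction and the probe loss can be sketched in a few lines. This is an illustration under our own assumptions, not the authors' code: the function names are hypothetical, answers are compared by exact string equality (a real pipeline would likely use an answer verifier), and the probe output is passed in as a plain number rather than computed by an MLP.

```python
import math


def monte_carlo_target(forced_answers, reference_answer):
    """Y_b(y): fraction of the K forced answers that equal the reference a*."""
    if not forced_answers:
        raise ValueError("need at least one forced answer")
    hits = sum(1 for a in forced_answers if a == reference_answer)
    return hits / len(forced_answers)


def soft_bce(confidence, target, eps=1e-7):
    """Binary cross-entropy between probe output C_b(y) and soft target Y_b(y)."""
    c = min(max(confidence, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(c) + (1.0 - target) * math.log(1.0 - c))


# K = 4 forced answers at one truncation point; three match the reference.
target = monte_carlo_target(["42", "42", "17", "42"], "42")

loss_matched = soft_bce(0.75, target)  # probe output equals the soft target
loss_off = soft_bce(0.30, target)      # probe is underconfident
```

With a soft target, the cross-entropy is minimized when the probe output equals the Monte Carlo accuracy itself, so `loss_matched` is smaller than `loss_off`. This is what lets the probe learn graded confidence rather than a hard 0/1 decision.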
For each reasoning trajectory, the authors select a set of compute budgets (truncation points) and partition them into two groups:
The margin reward is simply the gap between their average predicted confidence:
R_margin(y) = mean_{b∈𝒷⁺} C_b(y) − mean_{b∈𝒷⁻} C_b(y)
This is a ranking objective, not a score-matching objective. The probe does not need to output exact probabilities — it only needs to rank better prefixes above worse ones. This makes the objective less brittle than Brier-score penalties, which the paper shows actually degrade accuracy when combined with reasoning RL.
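A minimal sketch of the margin computation, assuming the probe confidences are already available per budget. The function name and the example budgets are hypothetical, and the two groups are passed in as arguments because this sketch makes no assumption about how budgets are assigned to each group. Raising an error on an empty group is also our choice, not something the paper specifies.

```python
def margin_reward(confidence_by_budget, positive_budgets, negative_budgets):
    """Mean probe confidence over the positive group minus the mean over the negative group.

    confidence_by_budget: dict mapping a truncation budget b to C_b(y).
    positive_budgets / negative_budgets: the two groups of budgets (given, not derived here).
    """
    if not positive_budgets or not negative_budgets:
        raise ValueError("both budget groups must be non-empty")
    pos = sum(confidence_by_budget[b] for b in positive_budgets) / len(positive_budgets)
    neg = sum(confidence_by_budget[b] for b in negative_budgets) / len(negative_budgets)
    return pos - neg


# Hypothetical probe confidences at four truncation budgets (in tokens).
conf = {512: 0.2, 1024: 0.3, 2048: 0.7, 4096: 0.9}
reward = margin_reward(conf, positive_budgets=[2048, 4096], negative_budgets=[512, 1024])

# Shifting every confidence by the same amount leaves the margin unchanged.
shifted = {b: c + 0.05 for b, c in conf.items()}
reward_shifted = margin_reward(shifted, [2048, 4096], [512, 1024])
```

The shifted example illustrates the ranking point: the reward depends only on the gap between the two groups, not on the absolute confidence values, so a uniformly biased probe yields the same signal.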
The final reward combines the standard answer-correctness reward with the margin reward, weighted by a coefficient λ:
R_total = R_answer + λ · R_margin
The policy is trained with GRPO-style group-relative advantages, but with the margin reward providing dense process-level supervision. The KL regularizer is removed following recent work that shows it is unnecessary for reasoning RL.
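The reward combination and the group-relative advantage can be sketched as follows. The value of λ is a placeholder, and dividing by the group standard deviation follows the standard GRPO recipe; whether the paper keeps that normalization is an assumption on our part.

```python
import statistics


def total_reward(answer_reward, margin, lam=0.5):
    """R_total = R_answer + lam * R_margin (lam=0.5 is a placeholder, not the paper's value)."""
    return answer_reward + lam * margin


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: center and scale rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four rollouts: two correct, two incorrect, with hypothetical margins.
answer_rewards = [1.0, 1.0, 0.0, 0.0]
margins = [0.4, 0.1, -0.2, 0.0]

rewards = [total_reward(a, m) for a, m in zip(answer_rewards, margins)]
advantages = group_relative_advantages(rewards)
```

With outcome-only rewards, the two correct rollouts would receive identical advantages. Adding the margin term separates them: the correct rollout with the larger confidence margin gets the larger advantage, which is where the process-level signal enters.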
The paper evaluates on MATH-500, AMC, OlympiadBench, AIME 2024/2025 (math), GPQA (science), LogiQA (logic), and LiveCodeBench (code). The base model is R1-distilled Qwen-7B. All methods are trained on the GRPO-LEAD math dataset.
| Method | Accuracy | ECE ↓ | PCE ↓ | Confidence Type |
|---|---|---|---|---|
| Base (R1-distilled) | Moderate | Low | Low | — |
| GRPO | Highest | High | High | — |
| RLCR (Brier reward) | Lower | Moderate | Moderate | Verbalized |
| C²GSPG | Lower | Moderate | Moderate | Logit-based |
| RLCM (this paper) | Near-best | Lowest | Lowest | Probe-based |
Key takeaways:
The authors verify that calibration holds throughout the reasoning trajectory, not just at the final answer. At varying compute budgets (from 512 tokens to 8k tokens), RLCM consistently achieves the lowest ECE while matching the best accuracy. This confirms that the margin reward provides useful supervision at every stage of reasoning.
The authors analyze uncertainty expressions in reasoning traces (self-correction like "wait, let me reconsider", hedging like "perhaps", knowledge gaps like "I don't know"). GRPO suppresses self-correction behavior, while RLCM preserves the base model's distribution of uncertainty expressions. This is notable because RLCM places no explicit reward on semantic uncertainty; the effect emerges as a side effect of calibrating confidence over reasoning prefixes.
The ablation study isolates two design choices:
The best configuration — process-level + margin reward — is RLCM.
With calibrated confidence, the model can decide when to stop reasoning. The paper uses Learn-Then-Test (LTT) with two thresholds:
RLCM provides the most useful confidence signal for this: its realized accuracy tracks the perfect-calibration line more closely than GRPO or RLCR. GRPO is overly conservative, wasting tokens. RLCR has step-like plateaus due to coarse verbalized confidence. RLCM offers a smooth, fine-grained accuracy–compute trade-off.
When sampling multiple rollouts, majority voting weights all answers equally. Confidence-weighted voting weights each answer by its estimated reliability:
â_conf = argmax_{a∈A} Σ_{i: a_i=a} c_i
Although RLCM and GRPO have nearly identical Pass@1 accuracy, RLCM delivers substantially larger gains from confidence-weighted voting. This means its confidence estimates are not just calibrated — they are discriminative, correctly separating good rollouts from bad ones.
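The difference between the two voting rules is easy to see in code. This sketch assumes each rollout comes with one scalar confidence c_i, for example the probe output at the end of the rollout; the function names and the toy numbers are ours.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Pick the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers, confidences):
    """Pick the answer with the largest summed confidence; ties go to the answer seen first."""
    if len(answers) != len(confidences):
        raise ValueError("answers and confidences must have the same length")
    scores = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        scores[answer] += confidence
    return max(scores, key=scores.get)


# Five rollouts: three low-confidence votes for "7", two high-confidence votes for "12".
answers = ["7", "7", "7", "12", "12"]
confidences = [0.20, 0.25, 0.30, 0.90, 0.85]

by_count = majority_vote(answers)
by_confidence = confidence_weighted_vote(answers, confidences)
```

Here majority voting returns "7" while confidence-weighted voting returns "12". The weighted rule only helps if confidence actually separates good rollouts from bad ones. An overconfident model that assigns similar scores to every rollout makes the two rules nearly equivalent.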
We are running three concurrent projects on uncertainty detection in LLMs: Feature Rivalry, Uncertainty vs Correctness Features, and SAEBench evaluation. RLCM connects to all three in interesting ways.
RLCM uses a 2-layer MLP probe on the final-layer hidden state to estimate confidence. Our work uses Sparse Autoencoder features at intermediate layers to detect uncertainty and incorrectness. Both approaches share the core assumption that the model's internal representations encode reliability signals that can be extracted with lightweight learned functions.
The difference is granularity and interpretability:
| Dimension | RLCM probe | SAE features (our work) |
|---|---|---|
| Unit of analysis | Final hidden state vector | Individual sparse features |
| Layer | Final layer only | Any layer (we use layer 20) |
| Interpretability | Black-box MLP | Human-inspectable concepts |
| Training signal | Monte Carlo correctness | Entropy / correctness labels |
| Use case | Train better models | Monitor existing models |
A natural hybrid: train an RLCM-style probe on SAE feature activations rather than raw hidden states. This would combine RLCM's strong calibration performance with the interpretability of SAE features. If the probe learns to weight a small number of SAE features, we could inspect which concepts it uses for confidence estimation — something impossible with a black-box MLP.
RLCM provides process-level supervision at truncation points. Our Feature Rivalry work analyzes per-token feature correlations within a single generation. These are complementary:
A trajectory with low RLCM confidence might show high feature rivalry at the token where the model first becomes uncertain. Combining both signals could give a richer uncertainty profile: trajectory-level + token-level.
The margin reward widens the confidence gap between correct and incorrect prefixes. Feature Rivalry detects negatively correlated feature pairs that activate on uncertain tokens. These are conceptually similar: both identify a gap or conflict in the model's internal state that signals unreliability.
One speculative connection: if we trained a model with RLCM's margin reward, would the resulting SAE features show more rivalry on uncertain prefixes? The margin reward explicitly encourages the model to separate confident and unconfident states, which might make the corresponding SAE features more separable too.
Strong method, wrong layer of the stack for us. RLCM is an excellent training technique for reasoning models, and the margin reward is a genuinely new idea that solves a real problem (outcome-only RL destroying calibration). However, our work focuses on monitoring existing models, not training new ones. The most actionable idea for us is the probe-based confidence estimator: it validates that hidden-state-based uncertainty detection is a powerful approach, and it suggests we should try training probes on SAE features for calibrated, interpretable confidence estimation.