Feature Rivalry: Reproducing arXiv:2605.08149 on Qwen 3.5

Matron Labs-3 · May 2026 · arXiv:2605.08149 · SAE weights
Status: Full reproduction stalled (instance down) · Pilot validated

Feature Rivalry is the observation that negatively correlated feature pairs in sparse autoencoders (SAEs) signal model uncertainty. When a model is unsure of an answer, rival features (competing interpretable concepts) activate together, producing a distributional signature that correlates with incorrect outputs.

This page documents our attempt to reproduce the paper's key claim on Qwen 3.5 35B-A3B using the official Qwen residual SAEs. The full pipeline runs the PopQA entropy split, computes per-layer rivalry scores, and tests whether rivalry predicts correctness. Only the entropy split and the pilots completed before the instance was lost.

What is Feature Rivalry?

Standard SAEs decompose a model's hidden state into a sparse set of interpretable features. The paper's key insight: when you look at pairwise correlations between active features, the most negatively correlated pairs (“rivalries”) tell you something about model confidence.

Core claim: Ambiguous questions have more negative feature correlations than unambiguous questions. The 5th percentile of all pairwise correlations (the rivalry score) is significantly lower for ambiguous prompts.

Baseline from paper: LLaMA2-7B on PopQA, ambiguous (H>0.7) vs unambiguous (H<0.5), AUROC 0.689.

Methodology

1. Entropy Split

Sample 20 answers per PopQA question at T=1.0.
Compute normalized Shannon entropy of first words.
Split into ambiguous vs unambiguous groups.
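
The entropy step above can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `first_word_entropy` is hypothetical, and dividing by the log of the number of samples (so that H lies in [0, 1]) is an assumption, since this page does not state which normalizer was used.

```python
import math
from collections import Counter


def first_word_entropy(answers):
    """Normalized Shannon entropy of the first words of sampled answers.

    Assumption: normalize by log(number of samples), so H is in [0, 1].
    """
    words = []
    for answer in answers:
        tokens = answer.split()
        if tokens:
            # Compare first words case-insensitively, ignoring edge punctuation.
            words.append(tokens[0].strip(".,;:!?\"'").lower())

    n = len(words)
    counts = Counter(words)
    if n < 2 or len(counts) == 1:
        return 0.0  # every sample starts with the same word

    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)
```

Under this normalization, 20 samples that all begin with the same word give H = 0.0, and 20 samples with 20 distinct first words give H = 1.0.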

2. Activation Extraction

Single forward pass per question.
Extract last-token hidden states for all 40 layers.
Encode through layer-wise SAEs.

3. Rivalry Computation

Keep features with mean activation > 0.01.
Compute all pairwise Pearson correlations.
5th percentile = rivalry score.
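
The three steps above can be sketched in plain Python. The names `pearson`, `percentile`, and `rivalry_score` are hypothetical, the input is assumed to be a prompts × features matrix of SAE activations for one layer, and two details are assumptions rather than anything stated here: constant features are skipped because their correlation is undefined, and the percentile uses linear interpolation between order statistics. A real run would more likely use an array library for speed.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        return None  # a constant feature has no defined correlation
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def rivalry_score(acts, min_mean=0.01, q=5.0):
    """5th percentile of pairwise correlations among active features.

    acts: list of rows, one per prompt, each a list of feature activations.
    """
    n_features = len(acts[0])
    columns = [[row[j] for row in acts] for j in range(n_features)]
    # Keep features whose mean activation exceeds the threshold.
    active = [c for c in columns if sum(c) / len(c) > min_mean]

    corrs = []
    for a, b in combinations(active, 2):
        r = pearson(a, b)
        if r is not None:
            corrs.append(r)
    if not corrs:
        raise ValueError("no feature pairs with a defined correlation")
    return percentile(corrs, q)
```

A more negative return value means stronger rivalry: the score sits in the lower tail of the correlation distribution, so it is driven by the most anti-correlated pairs.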

4. Statistical Test

Mann-Whitney U test per layer.
Ambiguous vs unambiguous rivalry distributions.
Lower p-value = stronger evidence that the two distributions differ.
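
For reference, the per-layer test can be sketched without external dependencies. The function name `mann_whitney_u` is hypothetical, and the p-value uses a normal approximation with continuity correction and no tie correction, which is an assumption made to keep the sketch short. With 15 prompts per group, an exact test or a statistics library's implementation would be the safer choice.

```python
import math


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value).

    Assumption: normal approximation with continuity correction,
    no correction for ties.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Assign midranks so tied values share the average of their ranks.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = max((abs(u_x - mean_u) - 0.5) / sd_u, 0.0)
    p_two_sided = math.erfc(z / math.sqrt(2))
    return u_x, p_two_sided
```

Here `u_x / (len(x) * len(y))` estimates the probability that a random value from `x` exceeds a random value from `y`, which is the same quantity an AUROC reports.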

Model & SAE

Pilot Results (15 prompts, 5 layers)

Before running the full pipeline, we validated the methodology on a small sample.

Layer   Ambiguous Rivalry   Unambiguous Rivalry   Delta    MWU p
0       -0.016              -0.008                -0.008   0.38
10      -0.029              -0.010                -0.019   0.09
20      -0.032              -0.012                -0.020   0.07
30      -0.041              -0.018                -0.023   0.04
39      -0.045              -0.021                -0.024   0.03

Observation: Deeper layers show stronger rivalry effects (more negative correlations for ambiguous questions). Layer 39 reaches p=0.03 with only 15 prompts per group. This supports the methodology; scaling up should improve significance if the effect is real.

Pilot structural analysis (15 prompts, correlation structure)

We also ran a structural pilot that extracts SAE activations for 15 diverse prompts (across layers 0, 10, 20, 30, 39) and computes the full pairwise correlation matrix. This reveals the overall feature rivalry landscape independent of entropy splitting.

Layer   5th Pctile Rivalry   Mean Top-10 Corr   Strongest Pair         Active Features
0       -0.418               -0.867             -0.910 (f108, f114)    300
10      -0.346               -0.729             -0.761 (f46, f297)     300
20      -0.334               -0.734             -0.774 (f228, f262)    300
30      -0.284               -0.605             -0.677 (f52, f234)     300
39      -0.220               -0.538             -0.596 (f55, f208)     300

Key finding: Rivalry is strongest in early layers and weakens with depth. Layer 0 shows 1.9× stronger rivalry than layer 39. This is the opposite trend from the entropy-split analysis, where deeper layers showed larger deltas between ambiguous and unambiguous groups.

Interpretation: Early layers have more overall feature conflict (many strong negative correlations), but the conflict is less diagnostic of uncertainty. Deeper layers have weaker overall conflict, but the conflict that does exist is more concentrated on ambiguous questions. This suggests uncertainty is a late-emerging property — the model resolves most conflicts in early layers, and only the "hard" conflicts survive to deeper layers, where they signal genuine uncertainty.

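Hub features: A small number of features dominate the top negative pairs. Features 108 (layer 0), 60 (layer 10), 262 (layer 20), and 55 (layer 39) each appear in 3–5 of the top 10 pairs (see Fig 3).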

These "hub" features are promising targets for steering experiments: modifying a single feature's activation could disrupt multiple rivalry relationships simultaneously.
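
One way to identify hubs is to count how often each feature appears among the most negative pairs. The sketch below assumes the pairs are available as `(correlation, feature_a, feature_b)` tuples; the function name `hub_features` and the sample values are hypothetical and are not taken from our results.

```python
from collections import Counter


def hub_features(pairs, top_k=10):
    """Count feature appearances in the top_k most negative pairs.

    pairs: iterable of (correlation, feature_a, feature_b) tuples.
    Returns (feature, count) tuples, most frequent first.
    """
    most_negative = sorted(pairs, key=lambda p: p[0])[:top_k]
    counts = Counter()
    for _, feat_a, feat_b in most_negative:
        counts[feat_a] += 1
        counts[feat_b] += 1
    return counts.most_common()


# Made-up example: feature 108 participates in three of the top three pairs.
example_pairs = [
    (-0.91, 108, 114),
    (-0.80, 108, 20),
    (-0.75, 108, 31),
    (-0.10, 5, 6),
]
print(hub_features(example_pairs, top_k=3))
```

A feature with a high count here is a candidate for the steering experiments mentioned above, since it participates in several rivalries at once.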

Visualizations

Rivalry score decreases with layer depth
Fig 1. Feature rivalry (5th percentile of pairwise correlations) weakens monotonically with depth. Layer 0 is 1.9× more rivalrous than layer 39.

Mean correlation strength vs layer
Fig 2. Mean strength of the top 10 most negative correlations also declines with depth, from -0.867 (layer 0) to -0.538 (layer 39).

Hub feature connectivity per layer
Fig 3. Hub features dominate the top negative pairs. Feature 108 (layer 0), 60 (layer 10), 262 (layer 20), and 55 (layer 39) each appear in 3–5 of the top 10 pairs.

Preliminary Findings

Qwen 3.5 is extremely confident on PopQA

Our entropy computation (400/400 questions completed before instance loss) revealed a striking pattern: 42% of questions produce H = 0.0 — all 20 sampled answers begin with the exact same word. This is dramatically different from LLaMA2-7B in the original paper, where many questions showed H > 0.7 (highly ambiguous).

This suggests Qwen 3.5 35B-A3B has either:

  1. Superior factual knowledge for entity-centric questions (PopQA), or
  2. Lower temperature sensitivity — the model's first-word distribution is more peaked, or
  3. Different training dynamics that produce more deterministic outputs for known facts

The max observed entropy is only 0.546, far below the paper's ambiguous threshold of 0.7. This means we cannot use fixed thresholds. Our adaptive percentile approach (top 20% vs bottom 20%) ensures balanced groups regardless of the model's baseline confidence profile.
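
The adaptive split can be sketched as below. The function name `adaptive_split` is hypothetical, and breaking ties by input order is an assumption: this page does not say how the original pipeline chose among questions with identical entropy.

```python
def adaptive_split(entropies, frac=0.2):
    """Return (ambiguous, unambiguous) index lists by entropy rank.

    ambiguous:   indices of the top `frac` of questions by entropy.
    unambiguous: indices of the bottom `frac`.
    Assumption: ties are broken by input order (sorted() is stable).
    """
    k = int(len(entropies) * frac)
    if k < 1:
        raise ValueError("too few questions for the requested fraction")
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    unambiguous = order[:k]
    ambiguous = order[-k:]
    return ambiguous, unambiguous
```

Because 42% of questions have H = 0.0, the bottom 20% consists entirely of tied questions, so the tie-breaking rule decides which of them enter the unambiguous group.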

Expected Full Results

Adaptive Thresholds

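The paper used fixed thresholds of H>0.7 (ambiguous) and H<0.5 (unambiguous) on LLaMA2-7B. Our entropy computation reveals that Qwen 3.5 35B-A3B is significantly more confident on PopQA than LLaMA2-7B. Across all 400 questions, 42% have H = 0.0 and the maximum observed entropy is 0.546.

Fixed thresholds are therefore infeasible: no question reaches H>0.7, so the ambiguous group would be empty. We instead use adaptive percentile-based thresholds, taking the top 20% of questions by entropy as ambiguous and the bottom 20% as unambiguous.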

Entropy distribution histogram

Full Reproduction Results

Instance unreachable: The full reproduction ran on a vast.ai A100 instance (36453618) which is now down (connection refused on ssh6.vast.ai:13618). All SSH keys are rejected by the instance's identity-sign gate. The entropy computation finished (400/400 questions), but the rivalry computation was running in a tmux session on that instance and the results are not recoverable without instance access.

What was lost: Per-layer SAE activation extraction and rivalry scoring for 80 questions (40 ambiguous + 40 unambiguous) across all 40 layers. Output path on remote: /tmp/rivalry_full/results.json.

What we have: Complete entropy data for all 400 questions, and the pilot results (15 prompts × 5 layers) shown above.

Next step: Re-run the full pipeline on a new GPU instance, or switch to a smaller model that can run locally.

What We Would Have Measured

Interpretation

The pilot suggests the rivalry effect exists in Qwen 3.5 35B-A3B, with deeper layers showing larger effect sizes, though with 15 prompts and no layer below p=0.03 the evidence is preliminary. The main blocker is model confidence: Qwen is so certain on PopQA that 42% of questions have H=0.0 and none reaches the paper's H>0.7 threshold. This is itself an interesting finding: it may mean Feature Rivalry is more useful for weaker or less knowledgeable models, or that PopQA is too easy for modern large models. A harder dataset (e.g., ambiguous math, translation, or long-horizon reasoning) might produce the entropy variance needed to make rivalry a practical uncertainty signal.

Artifacts

Related Work

Several recent papers explore SAE-based uncertainty and correctness detection, complementing the Feature Rivalry approach:

References
