| Paper | arXiv | Key Metric | Our Verdict | Actionability | Status |
|---|---|---|---|---|---|
| Feature Rivalry as a Signature of Uncertainty in LLMs (Wang et al.) | 2605.08149 | AUROC 0.689; Mann-Whitney U | Method is sound; pilot validated on Qwen 3.5. Full reproduction blocked by GPU loss. Opposing trends in structural vs entropy-split analysis suggest uncertainty is a late-emerging property. | High | Blocked |
| Are LLM Uncertainty and Correctness Encoded by the Same Features? (Chiriqui & Te'eni) | 2604.19974 | AUROC 0.79; 2×2 framework | 3 confounded features from one layer predict correctness. Suppressing them should improve accuracy. Reproduction blocked by GPU loss. Directly relevant to Feature Rivalry. | High | Blocked |
| SAEBench: A Comprehensive Benchmark for Sparse Autoencoders (Karvonen et al.) | 2503.09532 | 8 metrics; 200+ SAEs | Excellent infrastructure for supported models. Qwen 3.5 blocked by transformer-lens lacking MoE support. Wrote lightweight custom eval code. Auto-Interp rankings unreliable per Descriptive Collision critique. | Medium | Pending GPU |
| Process Supervision of Confidence Margin for Calibrated LLM Reasoning (Wang et al.) | 2604.23333 | ECE ↓ 30-50%; probe AUROC | Strong training method; wrong layer for us (we monitor, not train). Most actionable idea: train SAE-feature probes for interpretable confidence estimation. | Medium | Done |
| Tracing Uncertainty in Language Model "Reasoning" (Grünefeld et al.) | 2605.07776 | AUROC 0.807; early detection @ 300 tokens | Complementary to our SAE approach. Trace-level dynamics matter. Most actionable idea: add temporal dimension to per-token SAE analysis. | High | Done |
| WriteSAE: SAEs for State-Space and Recurrent Models (JackYoung27) | 2605.12770 | Rank-1 decoder atoms | Clean codebase. Not applicable to transformers. Could test on Qwen 3.5 0.8B/4B Gated DeltaNet variants if needed. Low priority. | Low | Done |
| Qwen-Scope: 14 SAE Groups Across 7 Models (Qwen Team) | 2605.11887 | 14 groups; 7 models | Includes our exact model. Four practical applications demonstrated. Comparison with SAE-Res weights planned. Most actionable: cross-evaluate Qwen-Scope vs SAE-Res features. | High | Pending GPU |
| Descriptive Collision in SAE Auto-Interpretability (McCann) | 2605.12874 | 82.1% share annotations; 3.07 features/annotation | Fundamental critique: Auto-Interp explanations are not unique. SAEBench rankings inflated. Recommends discrimination scoring. Integrated into SAEBench analysis. | Medium | Done |

| Theme | Papers | Our Contribution | Status |
|---|---|---|---|
| SAE Uncertainty Signals | Feature Rivalry, Uncertainty vs Correctness | Pilot validated; full reproduction blocked | Blocked |
| SAE Benchmarking | SAEBench, Qwen-Scope | Integration analysis; starter eval code; Qwen-Scope comparison planned | Pending GPU |
| Uncertainty Detection (Non-SAE) | Confidence Margin, Tracing Uncertainty | Full analyses; actionable hybrid ideas proposed | Done |
| SAE Architecture | WriteSAE | Code review; not applicable to transformers | Done |
| Auto-Interpretability Critique | Descriptive Collision (arXiv:2605.12874) | Critique integrated into SAEBench page | Done |
| Experimental Setup | Model Card, Predictions | Standardized documentation and pre-registration of expected results | Done |
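
Several Key Metric entries above report AUROC, and the Feature Rivalry row pairs it with the Mann-Whitney U test. The two are directly related: AUROC equals the U statistic divided by the number of positive-negative pairs. The sketch below illustrates that identity in plain Python. The function name and the toy scores are hypothetical and are not taken from any of the reviewed papers or from our eval code.

```python
def auroc_from_scores(pos, neg):
    """AUROC computed as the normalized Mann-Whitney U statistic.

    pos: scores for the positive class (e.g. uncertain answers)
    neg: scores for the negative class (e.g. confident answers)
    Returns P(pos > neg) + 0.5 * P(pos == neg) over all pairs.
    """
    if not pos or not neg:
        raise ValueError("need at least one score per class")
    u = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                u += 1.0   # positive ranked above negative
            elif p == n:
                u += 0.5   # ties count as half a win
    return u / (len(pos) * len(neg))


# Hypothetical toy scores, for illustration only.
uncertain = [0.9, 0.8, 0.4]
confident = [0.7, 0.3, 0.2]
print(auroc_from_scores(uncertain, confident))
```

The pairwise loop is quadratic in the number of samples, which is fine for a sanity check; for larger evaluations a rank-based routine from an established statistics library would be the more practical choice.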