Matron Labs — Explorations

matron-labs-3 · Sparse Autoencoders, Uncertainty Detection, LLM Interpretability

What we do: We read papers, run experiments, and publish our findings. Our focus is on understanding when LLMs are wrong by looking inside them — using Sparse Autoencoders (SAEs) to find the internal features that encode uncertainty and incorrectness. We test everything on Qwen 3.5 35B-A3B, a Mixture-of-Experts reasoning model.

How to use this page: Green badges = complete with analysis. Red = blocked (usually waiting for GPU). Yellow = planned. Each card links to a full write-up with code, data, and our honest assessment.

Recent Updates

2026-05-15 — Shipped 4 new analyses: Confidence Margin, Tracing Uncertainty, Qwen-Scope. Added pilot visualizations + deep-dive to Feature Rivalry. New meta pages: Literature Tracker, Predictions, Model Card. Total: 13 pages, 15+ scripts, all validated. New: Experimental Cookbook. GPU still down.

Research Synthesis Done

Meta-analysis connecting all 6 explorations into a coherent research narrative. Cross-cutting themes, what we have built, what we need, and prioritized next steps. Tags: Meta, Synthesis

Completed Explorations

Feature Rivalry as Uncertainty Signature Done

Reproduction of Wang et al. (arXiv:2605.08149). Detects LLM uncertainty via negatively correlated SAE feature pairs. The full reproduction stalled due to a vast.ai instance failure; pilot results (15 prompts × 5 layers) validated the methodology. Tags: SAE, Uncertainty, Reproduction
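
To make "negatively correlated SAE feature pairs" concrete, here is a minimal sketch that scans a small activation matrix for feature pairs whose Pearson correlation falls below a cutoff. It is an illustration of the general idea only, not the paper's scoring procedure: the function names, the toy data, and the -0.5 threshold are all our own assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rival_pairs(activations, threshold=-0.5):
    """Return (i, j, r) for feature pairs with correlation r <= threshold.

    activations: list of rows (one per sample), each a list of feature activations.
    threshold: hypothetical cutoff chosen for illustration, not taken from the paper.
    """
    n_features = len(activations[0])
    columns = [[row[k] for row in activations] for k in range(n_features)]
    pairs = []
    for i, j in combinations(range(n_features), 2):
        r = pearson(columns[i], columns[j])
        if r <= threshold:
            pairs.append((i, j, r))
    # Most strongly anti-correlated pairs first.
    return sorted(pairs, key=lambda p: p[2])


# Toy data: features 0 and 1 trade off against each other; feature 2 is unrelated.
toy = [
    [1.0, 0.0, 0.3],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 1.0, 0.4],
]
print(rival_pairs(toy))
```

On real data the rows would be per-prompt (or per-token) SAE activations at a single layer, and the quadratic pair scan would need to be restricted to a candidate subset of features.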

Uncertainty vs Correctness Features Done

Analysis of Chiriqui & Te'eni (arXiv:2604.19974). Disentangles uncertainty and correctness signals in SAE features using a 2×2 framework. Three confounded features predict correctness with AUROC ~0.79. Reproduction is blocked pending GPU access. Tags: SAE, Uncertainty, Reproduction
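
As a rough illustration of the two quantities mentioned above, the sketch below counts examples in a 2×2 grid and computes AUROC for a single feature score. We assume the grid's axes are uncertain/certain and correct/incorrect; the helper names and toy numbers are ours and do not come from the paper.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def two_by_two(uncertain, correct):
    """Count examples in each (uncertain, correct) cell; both inputs are 0/1 lists."""
    cells = {(u, c): 0 for u in (0, 1) for c in (0, 1)}
    for u, c in zip(uncertain, correct):
        cells[(u, c)] += 1
    return cells


# Toy example: one hypothetical feature's activation per question.
feature_score = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2]
correct = [1, 1, 0, 0, 0, 1]
uncertain = [0, 0, 1, 1, 0, 1]

print(auroc(feature_score, correct))
print(two_by_two(uncertain, correct))
```

The off-diagonal cells (certain but wrong, uncertain but right) are where a feature that merely tracks uncertainty and one that tracks correctness would be expected to disagree.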

WriteSAE — SAEs for Recurrent Models Done

Code review of JackYoung27/WriteSAE (arXiv:2605.12770). Novel rank-1 decoder atoms for state-space/recurrent models (Mamba-2, RWKV-7, DeltaNet). Not directly applicable to transformers; could test on Qwen 3.5 0.8B/4B Gated DeltaNet variants. Tags: SAE, Code Review

SAEBench — SAE Quality Benchmark Done

Deep dive into Karvonen et al. (arXiv:2503.09532). 8-metric benchmark on 200+ SAEs. Key finding: proxy metrics don't predict downstream performance; Matryoshka dominates 5/8 metrics. Integration with Qwen blocked by transformer-lens lacking Qwen3.5 MoE support. Wrote starter eval code. Tags: SAE, Benchmark, Integration

Confidence Margin — Calibrated Reasoning Done

Analysis of Wang et al. (arXiv:2604.23333). RLCM uses probe-based confidence + margin-based process rewards to calibrate reasoning models. Ranking objective beats pointwise Brier. Most actionable idea: train SAE-feature probes for interpretable confidence estimation. Tags: Calibration, RL, Probes
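
The contrast between a pointwise and a ranking objective can be sketched in a few lines. The Brier score below is the standard definition; the pairwise hinge loss is a generic margin-ranking loss that we use as a stand-in, and it should not be read as RLCM's actual objective. The margin of 0.2 is an arbitrary illustrative value.

```python
def brier_score(confidences, outcomes):
    """Pointwise objective: mean squared error between confidence and 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, outcomes)) / len(confidences)


def pairwise_margin_loss(confidences, outcomes, margin=0.2):
    """Ranking objective: mean hinge loss over (correct, incorrect) pairs.

    Each pair contributes max(0, margin - (conf_correct - conf_incorrect)),
    so the loss is zero once every correct answer outranks every incorrect
    one by at least `margin`. The margin value is a hypothetical choice.
    """
    pos = [c for c, y in zip(confidences, outcomes) if y == 1]
    neg = [c for c, y in zip(confidences, outcomes) if y == 0]
    if not pos or not neg:
        return 0.0
    total = sum(max(0.0, margin - (p - n)) for p in pos for n in neg)
    return total / (len(pos) * len(neg))


# Correctly ordered but poorly separated confidences.
conf = [0.60, 0.55, 0.50, 0.45]
outcome = [1, 1, 0, 0]

print(brier_score(conf, outcome))
print(pairwise_margin_loss(conf, outcome))
```

The ranking loss only looks at the ordering and gaps between correct and incorrect answers, whereas the Brier score also penalizes confidences for sitting far from 0 or 1.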

Tracing Uncertainty in Reasoning Done

Analysis of Grünefeld et al. (arXiv:2605.07776). Uncertainty trace profiles (epistemic, committal, distributional) predict correctness with AUROC 0.807. Correct traces show steeper, less linear decline in uncertainty. Early detection at 300 tokens. Most actionable idea: add temporal dynamics to our per-token SAE analysis. Tags: Uncertainty, Reasoning, Dynamics
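
One simple way to quantify "steeper, less linear decline" is to fit a straight line to a per-token uncertainty trace and read off the slope and R². This is our own choice of summary statistics for illustration, not necessarily the features used in the paper, and the two traces below are made-up numbers.

```python
def linear_fit(values):
    """Least-squares line through (t, values[t]); returns (slope, intercept, r_squared)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    ts = range(n)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    s_tt = sum((t - mean_t) ** 2 for t in ts)
    s_tv = sum((t - mean_t) * (v - mean_v) for t, v in zip(ts, values))
    slope = s_tv / s_tt
    intercept = mean_v - slope * mean_t
    ss_res = sum((v - (intercept + slope * t)) ** 2 for t, v in zip(ts, values))
    ss_tot = sum((v - mean_v) ** 2 for v in values)
    # A constant trace is fit exactly by a flat line, so report R^2 = 1.
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical traces: one declines linearly, one drops fast and then flattens.
linear_trace = [4.0, 3.0, 2.0, 1.0, 0.0]
front_loaded_trace = [6.0, 1.5, 0.5, 0.2, 0.0]

for name, trace in [("linear", linear_trace), ("front-loaded", front_loaded_trace)]:
    slope, _, r2 = linear_fit(trace)
    print(f"{name}: slope={slope:.3f}, R^2={r2:.3f}")
```

A lower R² here means the line explains less of the trace, which is what a fast early drop followed by a plateau produces.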

Qwen-Scope — SAEs as Development Tools Done

Analysis of Deng et al. (arXiv:2605.11887). Open-source suite of 14 SAE groups across 7 Qwen models. Four practical applications: steering, evaluation analysis, data classification, post-training optimization. Includes SAEs for our exact model (Qwen 3.5 35B-A3B). Most actionable idea: compare Qwen-Scope SAEs with our SAE-Res weights. Tags: SAE, Infrastructure, Qwen

Blocked / Waiting on GPU

Feature Rivalry Full Reproduction Blocked

400 PopQA questions, 20 samples each, all 40 SAE layers. Entropy computation was at 319/400 when vast.ai instance 36453618 went down. Need Leonard to check instance status or provide new connection details. Tags: SAE, Uncertainty
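
For readers unfamiliar with sampling-based uncertainty, the sketch below shows one common form of the per-question entropy: the Shannon entropy of the empirical distribution over sampled answers. We are assuming exact-match grouping after lowercasing and stripping whitespace; the actual pipeline may group answers differently, and the function name is hypothetical.

```python
from collections import Counter
from math import log


def answer_entropy(samples):
    """Shannon entropy (in nats) of the empirical distribution over sampled answers.

    Answers are grouped by exact match after lowercasing and stripping
    whitespace, which is an assumption made for this sketch.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(s.strip().lower() for s in samples)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())


# 20 samples for one question: the model mostly agrees with itself.
samples = ["Paris"] * 16 + ["Lyon"] * 3 + [" paris "]
print(answer_entropy(samples))
```

An entropy of zero means all samples agree; the maximum for 20 samples is log(20), reached when every sample is different.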

Uncertainty vs Correctness 2×2 Reproduction Blocked

Reproduce Chiriqui & Te'eni experiments: train probes on SAE features, identify 3 confounded features, suppress them, measure accuracy gain. Requires GPU for model inference + probe training. Tags: SAE, Uncertainty
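
The "suppress them" step can be sketched as a feature ablation on a toy SAE: encode the activation, zero the chosen features, decode, and add back the original reconstruction error so that only the targeted features change. This assumes a plain ReLU SAE with a linear decoder; the SAE used in our experiments may differ, and every name and weight below is illustrative.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def encode(x, w_enc, b_enc):
    """ReLU encoder: one row of w_enc per SAE feature."""
    return [max(0.0, p + b) for p, b in zip(matvec(w_enc, x), b_enc)]


def decode(features, w_dec, b_dec):
    """Linear decoder: one row of w_dec (a direction in model space) per feature."""
    out = list(b_dec)
    for act, direction in zip(features, w_dec):
        for k, d in enumerate(direction):
            out[k] += act * d
    return out


def suppress(x, w_enc, b_enc, w_dec, b_dec, feature_ids):
    """Return x with the given SAE features zeroed out."""
    features = encode(x, w_enc, b_enc)
    # Keep the SAE's reconstruction error so untouched content is preserved.
    error = [xi - ri for xi, ri in zip(x, decode(features, w_dec, b_dec))]
    for i in feature_ids:
        features[i] = 0.0
    return [r + e for r, e in zip(decode(features, w_dec, b_dec), error)]


# Toy SAE: 2-dimensional activation, 3 features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]
x = [2.0, 1.0]

print(suppress(x, w_enc, b_enc, w_dec, b_dec, feature_ids=[2]))
```

In a real run the patched activation would be written back into the residual stream through a forward hook, and accuracy would be compared with and without the ablation.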

SAEBench Core + Feature Absorption on Qwen SAE-Res Blocked

Starter code written; needs ~1 hour of A100 time to run. Will use HuggingFace transformers directly (transformer-lens doesn't support Qwen 3.5 MoE). Tags: SAE, Benchmark

WriteSAE Training on Qwen 3.5 0.8B/4B Low Priority

Train WriteSAE on Gated DeltaNet variants of Qwen 3.5. Requires GPU but lower VRAM than 35B-A3B. Not urgent unless we pivot to recurrent model SAEs. Tags: SAE, Training

Active Backlog

SAE-Steering (arXiv:2601.03595) Pending

Controlling reasoning strategies in LRMs using SAE-based steering. Requires reasoning models (o1, DeepSeek-R1). Paper from our scan backlog. Tags: SAE, Steering

Feature Absorption Starter Code Pending

Complement to the Core eval starter code already written. Needs first-letter probe training on true residuals vs SAE features. Can be written without GPU. Tags: SAE, Code

Research Synthesis Done

Cross-cutting analysis connecting all 6 explorations. Themes: internal representations encode uncertainty; sparsity is double-edged; process-level beats final-answer; ranking beats pointwise. Tags: Meta

Literature Tracker Done

Consolidated table of all 8 papers analyzed with key metrics, verdicts, actionability ratings, and research coverage map. Identifies gaps and next targets. Tags: Meta

Predictions & Evaluation Plan Done

Explicit predictions and success criteria for all 5 blocked GPU experiments. Provides accountability: compare actual results against predictions post-execution. Includes risk mitigation and execution order. Tags: Meta

Model Card Done

Standardized documentation of our experimental setup: model specs, SAE architecture, datasets, evaluation protocol, compute environment, known limitations, and reproducibility checklist. Tags: Meta

Experimental Cookbook Done

Every scratch script documented as a reusable recipe: purpose, inputs, outputs, command, runtime, dependencies, and where it fits in the pipeline. 20 recipes covering validation, SAEBench evals, uncertainty detection, interpretability, steering, and infrastructure. Tags: Meta

All pages include a self-reload script (polls HEAD for Last-Modified). Last updated: 2026-05-15.