Predictions & Evaluation Plan

What we expect to find when GPU access returns, and how we will measure success.

Purpose: This document states our predictions before we run the experiments, so that we can be held to them. Once the experiments complete, we compare the actual results against these predictions, document any surprises, and turn them into new hypotheses.

1. Feature Rivalry Full Reproduction

Predictions

Significant rivalry difference: p < 0.01 (Mann-Whitney U) for layers 20–39. The pilot showed p = 0.03 at layer 39 with n = 15; if that effect size holds, n = 400 should comfortably clear the threshold.
Correctness prediction: AUROC 0.60–0.70 for per-prompt rivalry score. Paper reports 0.689; we may see lower values due to model differences (Qwen vs Llama).
Steering effect: Adding rivalry vectors should decrease confidence (higher entropy); subtracting them should increase confidence (lower entropy). Expected effect size: Cohen's d = 0.3–0.5.
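
The first two predictions rest on the same rank statistic, so they can be computed together. The following is a minimal stdlib-only sketch, not the pipeline's actual code: the function names and the two-group layout are assumptions, the p-value uses the large-sample normal approximation without a tie correction, and AUROC is derived from U as U / (n1 · n2).

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return (U for group_a, two-sided p-value, AUROC).

    Assumption: normal approximation with no tie or continuity correction,
    which is only reasonable for large samples such as the planned n = 400.
    """
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    # AUROC: probability that a random group_a score exceeds a random
    # group_b score, counting ties as one half.
    auroc = u1 / (n1 * n2)
    return u1, p_two_sided, auroc
```

For small samples like the n = 15 pilot, an exact test (for example `scipy.stats.mannwhitneyu`) would be the safer choice than this approximation.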

Success Criteria

Criterion	Threshold	Metric
Significant rivalry	p < 0.01	Mann-Whitney U, layers 20–39
Correctness prediction	AUROC > 0.60	ROC on per-prompt scores
Steering effect	p < 0.05	Paired t-test on entropy change
Reproducibility	Same direction	All 3 criteria across 2 seeds
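
The steering criterion compares entropy on the same prompts with and without the intervention, so the relevant quantities are computed on per-prompt differences. A small sketch under stated assumptions: `paired_effect` is a hypothetical helper, and Cohen's d is taken here in its paired form (mean difference divided by the standard deviation of the differences), which may not be the variant the plan intends.

```python
import math
import statistics

def paired_effect(before, after):
    """Return (mean difference, paired t statistic, paired Cohen's d).

    before/after: per-prompt entropies for the same prompts, in the same order.
    Assumption: Cohen's d is the paired variant d_z = mean(diff) / sd(diff).
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    d_z = mean_diff / sd_diff
    return mean_diff, t_stat, d_z
```

The p-value for the t statistic needs the Student t distribution with n − 1 degrees of freedom, which the standard library does not provide; `scipy.stats.ttest_rel` returns both the statistic and the p-value.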

Risk Mitigation

No significance: Qwen MoE may have a different uncertainty mechanism. Pivot to per-token dynamics instead of per-prompt aggregation.
Low AUROC: Test per-layer rivalry scores instead of the layer average, which may predict correctness better.
Steering fails: Use ablation (zeroing out features) instead of vector addition.

2. Uncertainty vs Correctness Features

Predictions

We expect to find 3 confounded features with AUROC ~0.79 for correctness prediction.
Suppressing them should improve accuracy by 2–5%, somewhat below the 6.3% the paper reports.
Uncertainty and correctness features should be partially separable, with an AUROC gap > 0.05.

Success Criteria

Criterion	Threshold	Metric
Confounded features	AUROC > 0.75	Top-3 on PopQA
Accuracy improvement	≥ +2%	Held-out test set
Feature separability	AUROC gap > 0.05	Uncertainty vs correctness

3. SAEBench Core Evaluation

Predictions

L0 ≈ 50 ± 5 — matches TopK=50 design.
Loss recovered > 70% — SAE-Res is trained for reconstruction.
Feature absorption 0.15–0.30 — moderate, not excessive.

Success Criteria

Criterion	Threshold	Metric
L0 sparsity	45–55	Mean non-zero features/token
Loss recovered	> 70%	Fraction of CE loss recovered
Feature absorption	0.10–0.40	Absorption fraction
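
Two of these metrics reduce to short formulas. The sketch below assumes the commonly used definition of loss recovered, (CE_ablated − CE_sae) / (CE_ablated − CE_clean), where CE_ablated is the cross-entropy with the SAE's input activation replaced by zeros; the evaluation code may define it differently. Function names and the list-of-lists input are illustrative.

```python
def mean_l0(feature_acts):
    """Mean number of non-zero SAE features per token.

    feature_acts: one activation vector (list of floats) per token.
    """
    if not feature_acts:
        raise ValueError("need at least one token")
    nonzero_counts = [sum(1 for a in token if a != 0) for token in feature_acts]
    return sum(nonzero_counts) / len(nonzero_counts)

def loss_recovered(ce_clean, ce_sae, ce_ablated):
    """Fraction of cross-entropy loss recovered by splicing in the SAE.

    Assumed definition: 1.0 means the SAE reconstruction is as good as the
    original activations, 0.0 means it is no better than zero-ablation.
    """
    return (ce_ablated - ce_sae) / (ce_ablated - ce_clean)
```

With a TopK = 50 SAE, `mean_l0` can fall below 50 only when some of the selected activations are exactly zero, which is why the criterion allows a band rather than a single value.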

4. Qwen-Scope vs SAE-Res Comparison

Predictions

Same d_model (3584) and similar sparsity type.
20–40% feature overlap (Jaccard of top-100 features).
Activation correlation r > 0.3 on mean activations.
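
The overlap and correlation predictions can be checked with two small functions. This is a sketch only: it assumes features from the two SAEs have already been mapped to a shared set of IDs (the plan does not say how features are matched), and that mean activations are available as a dict from feature ID to value.

```python
import math

def top_k_ids(mean_acts, k=100):
    """IDs of the k features with the highest mean activation."""
    return set(sorted(mean_acts, key=mean_acts.get, reverse=True)[:k])

def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A Jaccard index of 0.20–0.40 on two top-100 sets corresponds to roughly 33–57 shared features, since |A ∩ B| = 200·J / (1 + J) when both sets have 100 elements.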

5. Max-Activating Examples

Predictions

30–50% of features have clear interpretable patterns.
5–10 features strongly correlate with high-entropy tokens (uncertainty signals).
Hub features (108, 60, 262) activate on diverse token types.
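
Identifying high-entropy tokens requires a per-token entropy of the model's next-token distribution. A minimal sketch, assuming entropy is measured in nats over the full softmax of the logits; the helper names are illustrative and the plan does not specify the base or any temperature.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    # Terms with p == 0 contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Entropy ranges from 0 for a fully confident prediction up to the natural log of the vocabulary size for a uniform one, so any "high-entropy" cutoff should be stated relative to that scale.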

Execution Order

Validation (15 min): SSH, environment, model load
SAEBench Core + Absorption (1–2h): quick wins
Feature Rivalry Full (6–8h): main effort
2×2 Reproduction, uncertainty vs correctness (2–3h): builds on the Feature Rivalry infrastructure
Qwen-Scope Comparison (1–2h): structural
Interpretability (1–2h): max-activating examples

Total: ~12–18 hours of A100 time

Post-Execution Checklist:

All metrics logged to JSON
Visualizations generated and embedded
Results compared against predictions (document surprises)
Failed experiments analyzed for root cause
New hypotheses written up from unexpected results
Leonard notified with verdict + links
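
The first and third checklist items can be automated together. The sketch below encodes a subset of the thresholds from the success-criteria tables and flags any metric that misses its threshold as a surprise; the metric key names and the report layout are hypothetical, not an existing schema.

```python
import json

# Thresholds copied from the success-criteria tables; key names are hypothetical.
CRITERIA = {
    "rivalry_p_value": lambda v: v < 0.01,
    "rivalry_auroc": lambda v: v > 0.60,
    "steering_p_value": lambda v: v < 0.05,
    "confounded_auroc": lambda v: v > 0.75,
    "l0_sparsity": lambda v: 45 <= v <= 55,
    "loss_recovered": lambda v: v > 0.70,
    "feature_absorption": lambda v: 0.10 <= v <= 0.40,
}

def compare_to_predictions(results):
    """Label each criterion as pass, surprise, or missing."""
    report = {}
    for name, check in CRITERIA.items():
        if name not in results:
            report[name] = {"value": None, "status": "missing"}
            continue
        value = results[name]
        status = "pass" if check(value) else "surprise"
        report[name] = {"value": value, "status": status}
    return report

def save_report(report, path):
    """Write the comparison report to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2, sort_keys=True)
```

Marking absent metrics as "missing" rather than failing keeps a partially completed run distinguishable from a run that contradicted the predictions.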
