Predictions & Evaluation Plan
What we expect to find when GPU access returns, and how we will measure success.
Purpose: This document makes our predictions explicit before we run experiments.
It provides accountability: after experiments complete, we compare actual results
against these predictions. Surprises are documented and become new hypotheses.
1. Feature Rivalry Full Reproduction
Predictions
- Significant rivalry difference: p < 0.01 (Mann-Whitney U) for layers 20–39.
The pilot showed p = 0.03 at layer 39 with n = 15; with n = 400, an effect of the same size should clear the p < 0.01 threshold comfortably.
- Correctness prediction: AUROC 0.60–0.70 for per-prompt rivalry score.
Paper reports 0.689; we may see lower values due to model differences (Qwen vs Llama).
- Steering effect: Adding rivalry vectors should decrease confidence (higher entropy);
subtracting them should increase confidence (lower entropy). Expected effect size: Cohen's d = 0.3–0.5.
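The steering prediction is stated in terms of entropy and Cohen's d. A minimal sketch of both quantities follows, using only the Python standard library. It assumes the paired form of Cohen's d (mean entropy change divided by the standard deviation of the changes), which fits a paired design but is our assumption rather than something fixed in this plan. The function names and toy numbers are illustrative only.

```python
import math
from statistics import mean, stdev


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cohens_d_paired(before, after):
    """Paired effect size: mean of the differences / sample SD of the differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / stdev(diffs)


# Toy next-token distributions for one prompt (made-up numbers).
baseline = [0.7, 0.2, 0.1]
steered = [0.4, 0.3, 0.3]  # flatter distribution, so entropy should be higher
print("entropy rose:", entropy(steered) > entropy(baseline))

# Toy per-prompt entropies before and after adding a rivalry vector.
h_before = [1.0, 1.2, 0.8, 1.1, 0.9]
h_after = [1.1, 1.4, 0.8, 1.3, 1.0]
print("Cohen's d (paired):", round(cohens_d_paired(h_before, h_after), 2))
```

A positive d here means entropy went up after steering, which is the direction predicted for adding rivalry vectors.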
Success Criteria
| Criterion | Threshold | Metric |
| --- | --- | --- |
| Significant rivalry | p < 0.01 | Mann-Whitney U, layers 20–39 |
| Correctness prediction | AUROC > 0.60 | ROC on per-prompt scores |
| Steering effect | p < 0.05 | Paired t-test on entropy change |
| Reproducibility | Same direction | All 3 criteria across 2 seeds |
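The first two criteria are closely related: AUROC equals the Mann-Whitney U statistic divided by the number of (positive, negative) pairs. The sketch below computes both from per-prompt scores in plain Python. It assumes, for illustration, that higher rivalry scores indicate incorrect answers; the scores are made-up and the function names are ours. The p-value itself is not computed here and would come from a library routine such as `scipy.stats.mannwhitneyu`.

```python
def mann_whitney_u(pos, neg):
    """U statistic for `pos`: number of (pos, neg) pairs with pos > neg; ties count 0.5."""
    u = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                u += 1.0
            elif p == n:
                u += 0.5
    return u


def auroc(pos, neg):
    """AUROC as the normalized U statistic: P(pos score > neg score), ties split evenly."""
    return mann_whitney_u(pos, neg) / (len(pos) * len(neg))


# Toy per-prompt rivalry scores (made-up numbers, not results).
incorrect = [0.9, 0.7, 0.6, 0.4]
correct = [0.5, 0.3, 0.2, 0.6]

print("U =", mann_whitney_u(incorrect, correct))
print("AUROC =", auroc(incorrect, correct))
print("meets AUROC > 0.60:", auroc(incorrect, correct) > 0.60)
```

The pairwise loop is quadratic, which is fine at n = 400; a rank-based implementation would be preferable for much larger samples.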
Risk Mitigation
- No significance: Qwen MoE may have a different uncertainty mechanism.
If so, pivot to per-token dynamics instead of per-prompt aggregation.
- Low AUROC: Test layer-specific rivalry (not averaged) for better prediction.
- Steering fails: Use ablation (zero out features) instead of vector addition.
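The last fallback contrasts two interventions. As a rough sketch, assuming the residual activation and the SAE feature activations are available as plain lists of floats (the real pipeline would operate on tensors inside model hooks), the difference looks like this. Both function names are hypothetical.

```python
def add_steering_vector(resid, direction, alpha):
    """Vector addition: return resid + alpha * direction (alpha < 0 subtracts)."""
    return [r + alpha * d for r, d in zip(resid, direction)]


def ablate_features(feature_acts, feature_ids):
    """Ablation: return a copy of the feature activations with chosen features zeroed."""
    drop = set(feature_ids)
    return [0.0 if i in drop else a for i, a in enumerate(feature_acts)]


# Toy values for illustration only.
resid = [1.0, 2.0]
direction = [0.5, -0.5]
print(add_steering_vector(resid, direction, alpha=2.0))

feature_acts = [0.3, 0.0, 1.2, 0.7]
print(ablate_features(feature_acts, feature_ids=[2]))
```

Steering acts in the model's residual space and needs a scale `alpha` to be chosen; ablation acts in SAE feature space and has no scale to tune, which is one reason it is a reasonable fallback.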
2. Uncertainty vs Correctness Features
Predictions
- 3 confounded features with AUROC ~0.79 for correctness prediction.
- Suppressing them improves accuracy by 2–5% (paper reports 6.3%).
- Uncertainty and correctness features are partially separable with AUROC gap > 0.05.
Success Criteria
| Criterion | Threshold | Metric |
| --- | --- | --- |
| Confounded features | AUROC > 0.75 | Top-3 on PopQA |
| Accuracy improvement | +2% | Held-out test set |
| Feature separability | AUROC gap > 0.05 | Uncertainty vs correctness |
3. SAEBench Core Evaluation
Predictions
- L0 ≈ 50 ± 5 — matches TopK=50 design.
- Loss recovered > 70% — SAE-Res is trained for reconstruction.
- Feature absorption 0.15–0.30 — moderate, not excessive.
Success Criteria
| Criterion | Threshold | Metric |
| --- | --- | --- |
| L0 sparsity | 45–55 | Mean non-zero features/token |
| Loss recovered | > 70% | Fraction of CE loss recovered |
| Feature absorption | 0.10–0.40 | Absorption fraction |
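For reference, the first two metrics in this table can be sketched in a few lines. The loss-recovered formula below is the commonly used definition (cross-entropy with SAE reconstructions spliced in, normalized between the clean model and zero ablation); we assume the benchmark uses the same convention. Function names and numbers are illustrative, not results.

```python
def mean_l0(feature_acts):
    """Mean number of non-zero features per token; `feature_acts` has one row per token."""
    return sum(sum(1 for a in row if a != 0) for row in feature_acts) / len(feature_acts)


def loss_recovered(ce_clean, ce_sae, ce_zero):
    """Fraction of CE loss recovered.

    1.0 means splicing in the SAE costs nothing; 0.0 means it is as bad as zero ablation.
    """
    return (ce_zero - ce_sae) / (ce_zero - ce_clean)


# Toy feature activations for 3 tokens and 4 features (made-up numbers).
acts = [
    [0.0, 1.5, 0.0, 0.2],
    [0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9],
]
print("mean L0:", mean_l0(acts))

# Toy cross-entropy values: clean model, SAE spliced in, zero ablation.
print("loss recovered:", loss_recovered(ce_clean=2.0, ce_sae=2.6, ce_zero=5.0))
```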
4. Qwen-Scope vs SAE-Res Comparison
Predictions
- Same d_model (3584) and similar sparsity type.
- 20–40% feature overlap (Jaccard of top-100 features).
- Activation correlation r > 0.3 on mean activations.
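The overlap and correlation predictions can be computed as sketched below. This assumes the features of the two SAEs have already been mapped into a shared index space; how features are matched across Qwen-Scope and SAE-Res is not specified here, so treat that step as an open assumption. The helper names and toy values are ours.

```python
import math


def top_k_indices(values, k):
    """Indices of the k largest values, returned as a set."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two index collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy mean activations per (aligned) feature for each SAE; made-up numbers.
mean_acts_a = [0.1, 0.9, 0.5, 0.7, 0.0, 0.3]
mean_acts_b = [0.2, 0.8, 0.1, 0.6, 0.4, 0.0]

overlap = jaccard(top_k_indices(mean_acts_a, 3), top_k_indices(mean_acts_b, 3))
print("top-3 Jaccard:", overlap)
print("Pearson r:", round(pearson_r(mean_acts_a, mean_acts_b), 3))
```

For the actual comparison, `k` would be 100 to match the prediction above.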
5. Max-Activating Examples
Predictions
- 30–50% of features have clear interpretable patterns.
- 5–10 features strongly correlate with high-entropy tokens (uncertainty signals).
- Hub features (108, 60, 262) activate on diverse token types.
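Collecting max-activating examples for one feature amounts to a top-k selection over (activation, token) pairs. A minimal sketch with the standard library follows; the token strings and activations are made-up, and the function name is hypothetical.

```python
import heapq


def max_activating_examples(tokens, activations, k=3):
    """Return the k (activation, token) pairs with the largest activation for one feature."""
    return heapq.nlargest(k, zip(activations, tokens))


# Toy data: one feature's activation on each token of a short sequence.
tokens = ["the", "maybe", "Paris", "perhaps", "42"]
activations = [0.0, 2.1, 0.3, 1.7, 0.1]

for act, tok in max_activating_examples(tokens, activations, k=2):
    print(f"{act:.1f}  {tok}")
```

Judging whether a feature has a "clear interpretable pattern" still requires reading these examples; the code only surfaces the candidates.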
Execution Order
- Validation (15 min) — SSH, env, model load
- SAEBench Core + Absorption (1–2h) — Quick wins
- Feature Rivalry Full (6–8h) — Main effort
- 2×2 Reproduction (2–3h) — Build on FR infra
- Qwen-Scope Comparison (1–2h) — Structural
- Interpretability (1–2h) — Max-activating examples
Total: ~12–18 hours of A100 time
Post-Execution Checklist:
- All metrics logged to JSON
- Visualizations generated and embedded
- Results compared against predictions (document surprises)
- Failed experiments analyzed for root cause
- New hypotheses written up from unexpected results
- Leonard notified with verdict + links
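The "compare against predictions" step can be made mechanical. The sketch below encodes a few thresholds from the success-criteria tables and reports pass, fail, or missing for each logged metric. The metric key names are hypothetical (the JSON schema is not defined in this plan), and the example values are placeholders, not experimental results.

```python
import json

# Thresholds copied from the success-criteria tables; the key names are hypothetical.
CRITERIA = {
    "rivalry_p_value": lambda v: v < 0.01,
    "rivalry_auroc": lambda v: v > 0.60,
    "steering_p_value": lambda v: v < 0.05,
    "l0": lambda v: 45 <= v <= 55,
    "loss_recovered": lambda v: v > 0.70,
    "absorption": lambda v: 0.10 <= v <= 0.40,
}


def compare_to_predictions(metrics):
    """Map each criterion to 'pass', 'fail', or 'missing' given a dict of logged metrics."""
    report = {}
    for name, check in CRITERIA.items():
        if name not in metrics:
            report[name] = "missing"
        else:
            report[name] = "pass" if check(metrics[name]) else "fail"
    return report


# Placeholder values for illustration only.
example_metrics = {"rivalry_p_value": 0.004, "rivalry_auroc": 0.58, "l0": 50.0}
report = compare_to_predictions(example_metrics)
print(json.dumps(report, indent=2))
```

Any "fail" or "missing" entry is a candidate surprise to document under the checklist items above.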