Predictions & Evaluation Plan

What we expect to find when GPU access returns, and how we will measure success.

Purpose: This document makes our predictions explicit before we run experiments. It provides accountability: after experiments complete, we compare actual results against these predictions. Surprises are documented and become new hypotheses.

1. Feature Rivalry Full Reproduction

Predictions

  1. Significant rivalry difference: p < 0.01 (Mann-Whitney U) for layers 20–39. The pilot showed p = 0.03 at layer 39 with n = 15; n = 400 should easily cross the threshold.
  2. Correctness prediction: AUROC 0.60–0.70 for per-prompt rivalry score. Paper reports 0.689; we may see lower values due to model differences (Qwen vs Llama).
  3. Steering effect: Adding rivalry vectors → decreased confidence (higher entropy). Subtracting → increased confidence. Expected effect size: Cohen's d = 0.3–0.5.
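The first two predictions rest on the same rank statistic: AUROC equals the Mann-Whitney U statistic divided by the number of cross-group pairs. A minimal stdlib sketch of that relationship follows. The score lists are hypothetical, and treating higher rivalry as a marker of incorrect answers is an assumption made only for illustration (swap the groups if the effect runs the other way). For the actual p-value, a library routine such as `scipy.stats.mannwhitneyu` would be the natural choice.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: number of (a, b) pairs with a > b; ties count 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def auroc(group_a, group_b):
    """AUROC for separating group_a (scored higher) from group_b, via U / (n_a * n_b)."""
    return mann_whitney_u(group_a, group_b) / (len(group_a) * len(group_b))


# Hypothetical per-prompt rivalry scores (not real data).
# Assumption: incorrect answers carry higher rivalry.
rivalry_incorrect = [0.9, 0.7, 0.6, 0.4]
rivalry_correct = [0.5, 0.3, 0.2, 0.1]

u_stat = mann_whitney_u(rivalry_incorrect, rivalry_correct)
score = auroc(rivalry_incorrect, rivalry_correct)
print(f"U = {u_stat}, AUROC = {score}")
```

The pairwise loop is quadratic, which is fine at n = 400 and keeps the definition visible. A rank-based implementation would be preferable for much larger samples.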

Success Criteria

Criterion | Threshold | Metric
Significant rivalry | p < 0.01 | Mann-Whitney U, layers 20–39
Correctness prediction | AUROC > 0.60 | ROC on per-prompt scores
Steering effect | p < 0.05 | Paired t-test on entropy change
Reproducibility | Same direction | All 3 criteria across 2 seeds
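The steering criterion combines three quantities: per-prompt output entropy, a paired t statistic on the entropy change, and an effect size. The sketch below shows one way to compute them with the standard library. The function names and entropy values are hypothetical, and the effect size uses the paired-samples variant of Cohen's d (mean difference over the standard deviation of differences), which is an assumption, since the plan does not say which variant is intended.

```python
import math
from statistics import mean, stdev


def entropy(probs):
    """Shannon entropy (nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def paired_differences(before, after):
    """Per-prompt change: after minus before."""
    return [a - b for a, b in zip(after, before)]


def paired_cohens_d(before, after):
    """Paired-samples effect size: mean difference / sample SD of the differences."""
    diffs = paired_differences(before, after)
    return mean(diffs) / stdev(diffs)


def paired_t_statistic(before, after):
    """t statistic of a paired t-test: mean difference / standard error of the differences."""
    diffs = paired_differences(before, after)
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical per-prompt entropies before and after adding a rivalry vector.
entropy_before = [1.0, 1.2, 0.9, 1.1]
entropy_after = [1.3, 1.4, 1.3, 1.2]

d = paired_cohens_d(entropy_before, entropy_after)
t = paired_t_statistic(entropy_before, entropy_after)
print(f"d = {d:.3f}, t = {t:.3f} (df = {len(entropy_before) - 1})")
```

Converting the t statistic to a p-value needs the t distribution, which the standard library does not provide. `scipy.stats.ttest_rel` covers that step.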

Risk Mitigation

2. Uncertainty vs Correctness Features

Predictions

  1. 3 confounded features with AUROC ~0.79 for correctness prediction.
  2. Suppressing them improves accuracy by 2–5% (paper reports 6.3%).
  3. Uncertainty and correctness features are partially separable with AUROC gap > 0.05.

Success Criteria

Criterion | Threshold | Metric
Confounded features | AUROC > 0.75 | Top-3 on PopQA
Accuracy improvement | +2% | Held-out test set
Feature separability | AUROC gap > 0.05 | Uncertainty vs correctness

3. SAEBench Core Evaluation

Predictions

  1. L0 ≈ 50 ± 5 — matches TopK=50 design.
  2. Loss recovered > 70% — SAE-Res is trained for reconstruction.
  3. Feature absorption 0.15–0.30 — moderate, not excessive.

Success Criteria

Criterion | Threshold | Metric
L0 sparsity | 45–55 | Mean non-zero features/token
Loss recovered | > 70% | Fraction of CE loss recovered
Feature absorption | 0.10–0.40 | Absorption fraction
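Two of these metrics reduce to short formulas. The sketch below assumes the common definition of loss recovered, namely the share of the cross-entropy gap between zero-ablating the activation and the clean model that is closed when the SAE reconstruction is spliced in. The function and argument names are hypothetical, and the numbers are illustrative rather than measured.

```python
def mean_l0(feature_acts):
    """Mean number of non-zero SAE features per token.

    feature_acts: one list of feature activations per token.
    """
    nonzero_counts = [sum(1 for a in token_acts if a != 0) for token_acts in feature_acts]
    return sum(nonzero_counts) / len(nonzero_counts)


def loss_recovered(ce_clean, ce_spliced, ce_zero_ablated):
    """Fraction of CE loss recovered (assumed definition).

    1.0 means the spliced-in reconstruction matches the clean model;
    0.0 means it is no better than zero-ablating the activation.
    """
    return (ce_zero_ablated - ce_spliced) / (ce_zero_ablated - ce_clean)


# Illustrative values only.
toy_acts = [[0.0, 1.2, 0.0], [0.5, 0.1, 0.0], [0.0, 0.0, 0.3]]
print(mean_l0(toy_acts))
print(loss_recovered(ce_clean=2.0, ce_spliced=2.5, ce_zero_ablated=4.0))
```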

4. Qwen-Scope vs SAE-Res Comparison

Predictions

  1. Same d_model (3584) and similar sparsity type.
  2. 20–40% feature overlap (Jaccard of top-100 features).
  3. Activation correlation r > 0.3 on mean activations.
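Predictions 2 and 3 can be checked with a set overlap and a correlation coefficient. The sketch below is a minimal version under a strong assumption: feature identifiers are comparable across the two SAEs (for example after a matching step, which this plan does not specify). All names and values are hypothetical.

```python
import math


def top_k_ids(scores, k):
    """Identifiers of the k highest-scoring features; scores maps id -> score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy mean activations per (matched) feature id; k=2 stands in for top-100.
qwen_scope = {"f1": 3.0, "f2": 1.0, "f3": 2.0, "f4": 0.5}
sae_res = {"f1": 2.5, "f2": 2.0, "f3": 0.2, "f4": 0.1}

overlap = jaccard(top_k_ids(qwen_scope, 2), top_k_ids(sae_res, 2))
shared = sorted(qwen_scope)
r = pearson_r([qwen_scope[f] for f in shared], [sae_res[f] for f in shared])
print(f"Jaccard = {overlap:.2f}, r = {r:.2f}")
```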

5. Max-Activating Examples

Predictions

  1. 30–50% of features have clear interpretable patterns.
  2. 5–10 features strongly correlate with high-entropy tokens (uncertainty signals).
  3. Hub features (108, 60, 262) activate on diverse token types.
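Collecting max-activating examples for a feature amounts to a top-k selection over (activation, token) pairs. The sketch below uses `heapq.nlargest` from the standard library. The tokens, activations, and the crude diversity count are hypothetical illustrations, not the plan's actual procedure.

```python
import heapq


def max_activating(tokens, activations, k=3):
    """Return the k (activation, token) pairs with the highest activation for one feature."""
    return heapq.nlargest(k, zip(activations, tokens), key=lambda pair: pair[0])


def distinct_tokens(examples):
    """Number of distinct tokens among the examples: a rough proxy for diversity."""
    return len({token for _, token in examples})


# Toy data for a single feature.
toy_tokens = ["the", "maybe", "Paris", "perhaps"]
toy_acts = [0.1, 2.3, 0.0, 1.7]

top = max_activating(toy_tokens, toy_acts, k=2)
print(top, distinct_tokens(top))
```

A hub feature would be expected to show a high distinct-token count relative to k, while a narrowly interpretable feature would repeat a few tokens.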

Execution Order

  1. Validation (15 min) — SSH, env, model load
  2. SAEBench Core + Absorption (1–2h) — Quick wins
  3. Feature Rivalry Full (6–8h) — Main effort
  4. 2×2 Reproduction (2–3h) — Build on FR infra
  5. Qwen-Scope Comparison (1–2h) — Structural
  6. Interpretability (1–2h) — Max-activating examples

Total: ~12–18 hours of A100 time

Post-Execution Checklist:
