Model Card — Qwen 3.5 35B-A3B Experimental Setup

Standardized documentation for reproducibility · Last updated: 2026-05-15

Purpose: This card documents our complete experimental setup so that anyone — including future us — can understand exactly what model, SAE, dataset, and evaluation protocol we used. It also flags known limitations and assumptions.

Model

Property	Value
Name	`Qwen/Qwen3.5-35B-A3B-Base`
Parameters	35B total (3B active per token)
Architecture	Mixture-of-Experts (MoE) transformer
Experts	256 total, top-8 activated per token
Layers	40
Hidden dimension	3,584 (d_model)
Attention	Hybrid: linear attention in three of every four layers (layers 1, 2, 3, 5, 6, 7, ...) and full attention in every 4th layer
Multi-token prediction	Yes (MTP head)
Training	Pre-trained + SFT + RLHF
Context length	128K tokens
Precision	BF16 (inference)
Loading	`AutoModelForCausalLM` via transformers (transformer-lens does NOT support this model)
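The attention layout matters when choosing hook targets. The table's list (linear attention in layers 1, 2, 3, 5, 6, 7, ...) reads most naturally as 1-based numbering with full attention in layers 4, 8, ..., 40, whereas PyTorch module lists and forward hooks index from 0. The sketch below assumes that reading and a uniform pattern across all 40 layers; `attention_type` is a hypothetical helper, not part of transformers, and the model's own config should be checked before relying on it.

```python
N_LAYERS = 40
FULL_ATTENTION_EVERY = 4  # from the table: full attention in every 4th layer


def attention_type(layer_idx: int) -> str:
    """Attention type for a 0-based layer index.

    Assumes the table uses 1-based numbering, so 1-based layers 4, 8, ..., 40
    use full attention and every other layer uses linear attention.
    """
    if not 0 <= layer_idx < N_LAYERS:
        raise ValueError(f"layer index {layer_idx} out of range")
    return "full" if (layer_idx + 1) % FULL_ATTENTION_EVERY == 0 else "linear"


full_layers = [i for i in range(N_LAYERS) if attention_type(i) == "full"]
print(full_layers)  # [3, 7, 11, 15, 19, 23, 27, 31, 35, 39]
```

Under this assumption the model has 10 full-attention and 30 linear-attention layers. It also affects the "Layer 20 is representative" assumption: 1-based layer 20 is a full-attention layer, while 0-based index 20 (the 21st decoder layer) is a linear-attention layer.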

Sparse Autoencoder

Property	Value
Name	`Qwen/SAE-Res-Qwen3.5-35B-A3B-Base-W32K-L0_50`
Source	HuggingFace
Type	Residual-stream SAE (SAE-Res)
Sparsity	TopK = 50
Dictionary size	32,768 features per layer
Input dimension	3,584 (matches d_model)
Layers covered	All 40 layers (one `.sae.pt` file per layer)
Training data	Not specified in model card
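Loss function	Not specified; assumed MSE reconstruction loss (TopK SAEs enforce sparsity through the TopK activation, so they typically do not use an L1 penalty)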

Datasets

Dataset	Use	Size	Source
PopQA	Feature Rivalry, 2×2 reproduction	14,268 questions	https://github.com/AlexTMallen/popqa
OpenWebText-10k	SAEBench Core eval (loss recovered)	10,000 documents	HuggingFace `stas/openwebtext-10k`
Sentiment140	Sparse Probing starter code	1.6M tweets	HuggingFace `sentiment140`
GSM8K	Tracing Uncertainty (reference only)	8,792 problems (7,473 train + 1,319 test)	HuggingFace `gsm8k`

Evaluation Protocol

Feature Rivalry

Sample 20 completions per question (temperature=1.0, top_p=0.9)
Compute each question's first-token entropy from the first tokens of its 20 completions
Split questions into ambiguous (high-entropy) and unambiguous (low-entropy) groups
Extract SAE activations at the target layer for the last token
Compute pairwise Pearson correlations between SAE features within each group
Compare the most negative correlations (at or below the 5th percentile) between groups via a Mann-Whitney U test
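The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the scripts we ran. It assumes that entropy is estimated from the empirical distribution of first tokens across the 20 samples, that correlations are computed between SAE features across the questions in a group, and that the "5th percentile" comparison is a Mann-Whitney U test on the correlations at or below each group's 5th percentile. The helper names and the entropy `threshold` are hypothetical, since the card does not state how the high/low split is made. `SAMPLING_KWARGS` uses standard `model.generate()` arguments from transformers.

```python
import math
from collections import Counter

import numpy as np
from scipy.stats import mannwhitneyu

# Step 1: sampling settings, as keyword arguments for transformers' model.generate()
SAMPLING_KWARGS = dict(do_sample=True, temperature=1.0, top_p=0.9, num_return_sequences=20)


def first_token_entropy(first_tokens):
    """Step 2: empirical entropy (nats) of one question's sampled first tokens."""
    n = len(first_tokens)
    return -sum((c / n) * math.log(c / n) for c in Counter(first_tokens).values())


def split_by_entropy(entropies, threshold):
    """Step 3: question indices above (ambiguous) and at/below (unambiguous) a threshold."""
    entropies = np.asarray(entropies, dtype=np.float64)
    return np.flatnonzero(entropies > threshold), np.flatnonzero(entropies <= threshold)


def pairwise_feature_correlations(acts):
    """Step 5: upper-triangle Pearson correlations between SAE features.

    acts: (n_questions, n_features) last-token SAE activations for one group
    (step 4). Features that never vary within the group are dropped because
    their correlation is undefined.
    """
    acts = np.asarray(acts, dtype=np.float64)
    live = acts.std(axis=0) > 0
    if live.sum() < 2:
        raise ValueError("need at least two features that vary within the group")
    corr = np.corrcoef(acts[:, live], rowvar=False)
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    return corr[rows, cols]


def lower_tail(values, pct=5.0):
    """Correlations at or below the pct-th percentile (the most negative pairs)."""
    values = np.asarray(values)
    return values[values <= np.percentile(values, pct)]


def compare_groups(acts_ambiguous, acts_unambiguous, pct=5.0):
    """Step 6: Mann-Whitney U test on the two groups' lower-tail correlations."""
    tail_amb = lower_tail(pairwise_feature_correlations(acts_ambiguous), pct)
    tail_unamb = lower_tail(pairwise_feature_correlations(acts_unambiguous), pct)
    return mannwhitneyu(tail_amb, tail_unamb, alternative="two-sided")
```

With 32,768 features a full correlation matrix has about 1.07 × 10^9 entries, so in practice `acts` would be restricted to features that fire within the group; the `live` filter above only drops constant features. Because sampling uses `top_p=0.9`, the sample-based entropy describes the nucleus-truncated distribution rather than the full softmax.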

SAEBench Custom Evals

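Load the model with `AutoModelForCausalLM` and `trust_remote_code=True`
Register a forward hook on the target decoder layer
Run inference and capture the residual stream (the layer's output hidden states)
SAE encode: `pre_acts = residual @ W_enc.T + b_enc`
TopK select: keep the 50 largest pre-activations per token and zero the rest
SAE decode: `reconstructed = topk_acts @ W_dec.T + b_dec`
Compute metrics (e.g., reconstruction error, loss recovered) on the reconstructed vs. original activations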
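A minimal PyTorch sketch of the hook, encode, TopK, and decode steps. The weight names mirror the formulas above, and the assumed shapes (`W_enc`: d_sae × d_model, `W_dec`: d_model × d_sae) are what make those formulas type-check; the key names and tensor layouts inside the `.sae.pt` files are not documented in this card. The card also does not say whether a ReLU is applied to the kept values, so none is applied here. For the real model the hook would go on a decoder layer module (often `model.model.layers[i]` in transformers decoder models); that attribute path is an assumption for this model. `capture_output`, `sae_forward`, and `reconstruction_metrics` are hypothetical helper names.

```python
import torch


def capture_output(module):
    """Register a forward hook that stores a module's output (the residual
    stream when the module is a decoder layer). Returns (store, handle)."""
    store = {}

    def hook(mod, inputs, output):
        # transformers decoder layers often return a tuple; the hidden state is first
        store["hidden"] = (output[0] if isinstance(output, tuple) else output).detach()

    handle = module.register_forward_hook(hook)
    return store, handle


@torch.no_grad()
def sae_forward(residual, W_enc, b_enc, W_dec, b_dec, k=50):
    """Encode, keep the top-k pre-activations per token, and decode."""
    pre_acts = residual @ W_enc.T + b_enc  # (..., d_sae)
    top_vals, top_idx = pre_acts.topk(k, dim=-1)
    topk_acts = torch.zeros_like(pre_acts).scatter(-1, top_idx, top_vals)
    reconstructed = topk_acts @ W_dec.T + b_dec  # (..., d_model)
    return topk_acts, reconstructed


def reconstruction_metrics(original, reconstructed):
    """MSE and fraction of variance explained over flattened token vectors."""
    original = original.reshape(-1, original.shape[-1]).float()
    reconstructed = reconstructed.reshape(-1, reconstructed.shape[-1]).float()
    mse = torch.mean((original - reconstructed) ** 2)
    resid_var = (original - reconstructed).var(dim=0).sum()
    total_var = original.var(dim=0).sum()
    return {"mse": mse.item(), "explained_variance": (1 - resid_var / total_var).item()}
```

Loss recovered, the SAEBench Core metric listed under Datasets, additionally needs forward passes in which the hooked layer's output is replaced by the reconstruction (and by an ablation baseline); that splicing is not shown here.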

Compute Environment

Property	Value
Platform	vast.ai instance 36453618
GPU	NVIDIA A100 80GB
SSH	ssh6.vast.ai:13618
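Model loading	BF16, ~70 GB VRAM for weights (35B params × 2 bytes; all experts stay resident even though only 3B are active per token)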
Alternative	4-bit quantization (~20GB VRAM) — not yet tested
Status	DOWN — connection refused since 2026-05-15 05:00 UTC
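The VRAM figures follow from a back-of-the-envelope count of weight bytes. In an MoE model all experts must stay in memory even though only 3B parameters are active per token, so the 35B total is what counts. A minimal sketch; it ignores activations, the KV cache, and quantization metadata, which is why real usage sits somewhat above these numbers.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 35e9  # all experts (from the Model table), not the 3B active per token

print(weight_memory_gb(TOTAL_PARAMS, 16))  # BF16  -> 70.0
print(weight_memory_gb(TOTAL_PARAMS, 4))   # 4-bit -> 17.5
```

BF16 weights therefore take about 70 GB of the A100's 80 GB, leaving roughly 10 GB for activations and cache; the ~20 GB 4-bit estimate is consistent with 17.5 GB of weights plus overhead.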

Known Limitations

No gradient access. We use hidden states and SAE activations, not gradients, so gradient-based estimates of epistemic uncertainty are not computable with our setup.
Single SAE per layer. Qwen-Scope provides multiple SAE groups, but we only use SAE-Res weights. Different SAEs may reveal different features.
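TopK sparsity. The SAE uses hard top-50 selection, which is not differentiable. Because features compete for a fixed budget of 50 active slots per token, the selection may also create artifacts in feature correlations (e.g., spurious negative correlations).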
MoE routing opacity. We cannot inspect which experts are activated per token. Expert-specific SAEs might reveal different uncertainty signals.
Verifiable answers only. PopQA has ground-truth answers. Our methods may not generalize to open-ended or creative generation.
No reasoning traces. We analyze single-step Q&A, not chain-of-thought. Methods from Confidence Margin and Tracing Uncertainty require reasoning traces.
transformer-lens incompatible. Standard mechanistic interpretability tools (TransformerLens's HookedTransformer, SAELens) do not support Qwen 3.5 MoE, so we use raw transformers.

Assumptions

SAE features are meaningful. We assume that sparse features encode human-interpretable concepts. Descriptive Collision (arXiv:2605.12874) challenges this — we acknowledge the risk.
Last token position is representative. For most analyses, we use SAE activations at the last token. Per-token dynamics may differ.
Entropy correlates with uncertainty. We use first-token entropy as a proxy for model uncertainty. This is standard but imperfect.
20 samples are sufficient. For entropy estimation, we sample 20 completions. This is the paper's choice; more samples might reduce variance.
Layer 20 is representative. Many analyses focus on layer 20 (mid-network). Other layers may show different patterns.
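One way to check the "20 samples are sufficient" assumption is to bootstrap a question's entropy estimate: resample its 20 first tokens with replacement and look at the spread of the recomputed entropies. A minimal sketch with hypothetical helper names, using the plug-in entropy in nats; note that the plug-in estimator is also biased low for small samples, which more samples would reduce.

```python
import math
import random
from collections import Counter


def plugin_entropy(tokens):
    """Plug-in (empirical) entropy in nats of a list of sampled first tokens."""
    n = len(tokens)
    return -sum((c / n) * math.log(c / n) for c in Counter(tokens).values())


def bootstrap_entropy_sd(tokens, n_boot=1000, seed=0):
    """Standard deviation of the entropy estimate across bootstrap resamples."""
    rng = random.Random(seed)
    estimates = [plugin_entropy(rng.choices(tokens, k=len(tokens))) for _ in range(n_boot)]
    mean = sum(estimates) / n_boot
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n_boot - 1))


# Illustrative input: a question whose 20 sampled answers start with three different tokens
first_tokens = ["Paris"] * 12 + ["London"] * 5 + ["Rome"] * 3
print(round(plugin_entropy(first_tokens), 3), round(bootstrap_entropy_sd(first_tokens), 3))
```

A large bootstrap spread relative to the gap between the ambiguous and unambiguous groups would suggest that 20 samples are too few for a stable split.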

Reproducibility Checklist

✅ Model weights pinned to Qwen/Qwen3.5-35B-A3B-Base
✅ SAE weights pinned to Qwen/SAE-Res-Qwen3.5-35B-A3B-Base-W32K-L0_50
✅ All scripts saved to ~/scratch/ with docstrings
✅ Random seeds documented in scripts (where applicable)
⚠️ GPU environment not reproducible (vast.ai instance down)
⚠️ No Docker/containerization
✅ Dependency lock file: scratch/requirements.txt
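A repository name alone does not fix the weights, since a Hugging Face repo can receive new commits; the `from_pretrained` and `snapshot_download` loaders accept a `revision` argument that can be set to a commit hash. A minimal sketch, assuming the commit hashes get recorded alongside the scripts; the `*_REVISION` values below are hypothetical placeholders, not the hashes we used.

```python
import re

MODEL_ID = "Qwen/Qwen3.5-35B-A3B-Base"
SAE_ID = "Qwen/SAE-Res-Qwen3.5-35B-A3B-Base-W32K-L0_50"
# Hypothetical placeholders: replace with the exact commit hashes used in the runs
MODEL_REVISION = "<40-character commit hash>"
SAE_REVISION = "<40-character commit hash>"


def is_commit_hash(revision: str) -> bool:
    """True for a full 40-character hex commit hash (branch names like 'main' can move)."""
    return re.fullmatch(r"[0-9a-f]{40}", revision) is not None


def load_pinned():
    """Load the model and download the SAE files at fixed commits."""
    for rev in (MODEL_REVISION, SAE_REVISION):
        if not is_commit_hash(rev):
            raise ValueError(f"not a pinned commit hash: {rev!r}")
    # Heavy imports are deferred so the hash check can run without them installed
    import torch
    from huggingface_hub import snapshot_download
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, revision=MODEL_REVISION, torch_dtype=torch.bfloat16, trust_remote_code=True
    )
    sae_dir = snapshot_download(repo_id=SAE_ID, revision=SAE_REVISION)
    return model, sae_dir
```

Failing loudly on a non-hash revision keeps an unpinned run from silently picking up a newer upload.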

Disclaimer: This is a research exploration, not a production system. Results should be treated as preliminary. We report our honest assessments, including negative results and blocked experiments.
