Model Card — Qwen 3.5 35B-A3B Experimental Setup
Standardized documentation for reproducibility · Last updated: 2026-05-15
Purpose: This card documents our complete experimental setup so that
anyone — including future us — can understand exactly what model, SAE,
dataset, and evaluation protocol we used. It also flags known limitations
and assumptions.
Model
| Property | Value |
| Name | Qwen/Qwen3.5-35B-A3B-Base |
| Parameters | 35B total (3B active per token) |
| Architecture | Mixture-of-Experts (MoE) transformer |
| Experts | 256 total, top-8 activated per token |
| Layers | 40 |
| Hidden dimension | 3,584 (d_model) |
| Attention | Alternating: linear attention (layers 1,2,3,5,6,7...) + full attention (every 4th layer) |
| Multi-token prediction | Yes (MTP head) |
| Training | Pre-trained only, per the -Base suffix (SFT + RLHF would apply to a post-trained variant, not these weights) |
| Context length | 128K tokens |
| Precision | BF16 (inference) |
| Loading | AutoModelForCausalLM via transformers (transformer-lens does NOT support this model) |
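A minimal sketch of the loading path named in the table, plus a hook that captures one layer's output. It is an illustration under assumptions, not the exact script we ran: the helper names (`load_model`, `make_capture_hook`, `attach_residual_hook`) are hypothetical, `device_map="auto"` assumes `accelerate` is installed, and the module path `model.model.layers` is an assumption about how the remote code lays out its decoder blocks.

```python
MODEL_NAME = "Qwen/Qwen3.5-35B-A3B-Base"


def load_model(name=MODEL_NAME):
    """Load the checkpoint in BF16 with raw transformers.

    Imports are local so the hook helpers below stay usable
    on a machine without torch installed.
    """
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=torch.bfloat16,
        device_map="auto",        # ASSUMPTION: accelerate is available
        trust_remote_code=True,
    )


def make_capture_hook(store, key):
    """Return a forward hook that saves a block's hidden states in store[key]."""
    def hook(module, inputs, output):
        # Decoder blocks often return a tuple whose first item is the
        # hidden state; fall back to the raw output otherwise.
        hidden = output[0] if isinstance(output, tuple) else output
        # Stored as-is; call .detach().float().cpu() before analysis.
        store[key] = hidden
    return hook


def attach_residual_hook(model, layer_idx, store):
    """Register the capture hook on one decoder block and return its handle."""
    # ASSUMPTION: decoder blocks live at model.model.layers; check with
    # print(model) if the remote code uses a different layout.
    block = model.model.layers[layer_idx]
    return block.register_forward_hook(make_capture_hook(store, layer_idx))
```

Calling `handle.remove()` on the returned handle detaches the hook once the activations have been captured.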
Sparse Autoencoder
| Property | Value |
| Name | SAE-Res-Qwen3.5-35B-A3B-Base-W32K-L0_50 |
| Source | HuggingFace |
| Type | Residual-stream SAE (SAE-Res) |
| Sparsity | TopK = 50 |
| Dictionary size | 32,768 features per layer |
| Input dimension | 3,584 (matches d_model) |
| Layers covered | All 40 layers (individual .sae.pt files) |
| Training data | Not specified in model card |
| Loss function | Not specified; assumed MSE reconstruction loss (TopK selection already enforces sparsity, so an L1 penalty is unlikely) |
Datasets
| Dataset | Use | Size | Source |
| PopQA | Feature Rivalry, 2×2 reproduction | 14,268 questions | https://github.com/AlexTMallen/popqa |
| OpenWebText-10k | SAEBench Core eval (loss recovered) | 10,000 documents | HuggingFace stas/openwebtext-10k |
| Sentiment140 | Sparse Probing starter code | 1.6M tweets | HuggingFace sentiment140 |
| GSM8K | Tracing Uncertainty (reference only) | 8,792 problems (7,473 train + 1,319 test) | HuggingFace gsm8k |
Evaluation Protocol
Feature Rivalry
- Sample 20 completions per question (temperature=1.0, top_p=0.9)
- Compute first-token entropy for each completion
- Split into ambiguous (high entropy) vs unambiguous (low entropy) groups
- Extract SAE activations at target layer for last token
- Compute pairwise Pearson correlation within each group
- Compare the 5th-percentile (most negative) correlations of the two groups with a Mann-Whitney U test
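The statistical steps above can be sketched with NumPy as follows. This is one reading of the protocol, not the reference implementation: the median split, the choice to compare every correlation at or below each group's 5th percentile, and all function names are assumptions. The U test here uses a normal approximation without tie or continuity correction; `scipy.stats.mannwhitneyu` is the usual choice in practice.

```python
import math
import numpy as np


def first_token_entropy(probs):
    """Shannon entropy (nats) of a first-token probability vector."""
    p = np.asarray(probs, dtype=np.float64)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())


def split_by_entropy(entropies):
    """ASSUMPTION: median split into (ambiguous, unambiguous) masks."""
    e = np.asarray(entropies, dtype=np.float64)
    threshold = np.median(e)
    return e > threshold, e <= threshold


def pairwise_pearson(acts):
    """Pearson correlation for every feature pair.

    acts has shape (n_questions, n_features). Features that never vary
    are dropped, since their correlation is undefined.
    """
    acts = np.asarray(acts, dtype=np.float64)
    active = acts.std(axis=0) > 0
    if active.sum() < 2:
        return np.empty(0)
    corr = np.corrcoef(acts[:, active], rowvar=False)
    i, j = np.triu_indices(corr.shape[0], k=1)   # each pair once
    return corr[i, j]


def lower_tail(corrs, q=5.0):
    """Correlations at or below the q-th percentile (the most negative)."""
    corrs = np.asarray(corrs, dtype=np.float64)
    return corrs[corrs <= np.percentile(corrs, q)]


def mann_whitney_u(x, y):
    """U statistic for x, with a two-sided normal-approximation p-value.

    Illustration only: O(len(x) * len(y)) memory.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    u = greater + 0.5 * ties
    n1, n2 = len(x), len(y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return float(u), math.erfc(abs(z) / math.sqrt(2))
```

With 32,768 features the full correlation matrix is about 8.6 GB in float64, so in practice the feature set would need to be restricted (for example to features active in the group) before calling `pairwise_pearson`.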
SAEBench Custom Evals
- Load model with AutoModelForCausalLM + trust_remote_code=True
- Register forward hook at target layer
- Run inference, capture residual stream
- SAE encode: pre_acts = residual @ W_enc.T + b_enc
- TopK select: keep top 50 activations
- SAE decode: reconstructed = topk_acts @ W_dec.T + b_dec
- Compute metrics on reconstructed vs original
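The encode, TopK, and decode steps above can be sketched in NumPy. The weight shapes follow the formulas in the list (`W_enc` is `(n_features, d_model)`, `W_dec` is `(d_model, n_features)`). The function names and the two metrics (MSE and fraction of variance unexplained) are illustrative assumptions, since the card does not name the metrics. TopK is applied to raw pre-activations, as the list describes; some SAE implementations apply a ReLU first, which this sketch does not.

```python
import numpy as np


def sae_encode(residual, W_enc, b_enc):
    """pre_acts = residual @ W_enc.T + b_enc, shape (n_tokens, n_features)."""
    return residual @ W_enc.T + b_enc


def topk_select(pre_acts, k=50):
    """Keep the k largest activations per row and zero the rest."""
    idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre_acts)
    kept = np.take_along_axis(pre_acts, idx, axis=-1)
    np.put_along_axis(out, idx, kept, axis=-1)
    return out


def sae_decode(topk_acts, W_dec, b_dec):
    """reconstructed = topk_acts @ W_dec.T + b_dec, shape (n_tokens, d_model)."""
    return topk_acts @ W_dec.T + b_dec


def reconstruction_metrics(original, reconstructed):
    """ASSUMPTION: MSE and fraction of variance unexplained as example metrics."""
    err = original - reconstructed
    total_var = float(((original - original.mean(axis=0)) ** 2).sum())
    return {
        "mse": float((err ** 2).mean()),
        "fvu": float((err ** 2).sum()) / total_var,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, n_features = 16, 64           # toy sizes, not 3,584 / 32,768
    x = rng.normal(size=(8, d_model))
    W_enc = rng.normal(size=(n_features, d_model))
    W_dec = rng.normal(size=(d_model, n_features))
    acts = topk_select(sae_encode(x, W_enc, np.zeros(n_features)), k=5)
    x_hat = sae_decode(acts, W_dec, np.zeros(d_model))
    print(reconstruction_metrics(x, x_hat))
```

The demo uses random weights, so its metrics are meaningless; it only checks that the shapes line up.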
Compute Environment
| Property | Value |
| Platform | vast.ai instance 36453618 |
| GPU | NVIDIA A100 80GB |
| SSH | ssh6.vast.ai:13618 |
| Model loading | BF16, ~70GB VRAM for weights alone (35B parameters × 2 bytes) |
| Alternative | 4-bit quantization (~20GB VRAM) — not yet tested |
| Status | DOWN — connection refused since 2026-05-15 05:00 UTC |
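A back-of-envelope check of weight memory, as a sketch. It counts weights only and assumes all 35B parameters are resident on the GPU (every expert must be loaded even though only about 3B parameters are active per token). Activations, KV cache, and quantization overhead are not included, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 35e9  # total parameters, from the Model table

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)   # 2 bytes per parameter
int4_gb = weight_memory_gb(TOTAL_PARAMS, 4)    # 0.5 bytes per parameter

print(f"BF16:  {bf16_gb:.1f} GB")
print(f"4-bit: {int4_gb:.1f} GB")
```

On these numbers BF16 weights fit on a single A100 80GB with limited headroom, which is worth keeping in mind when choosing batch size and context length.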
Known Limitations
- No gradient access. We use hidden states and SAE activations, not gradients.
Epistemic uncertainty (gradient-based) is not computable with our setup.
- Single SAE per layer. Qwen-Scope provides multiple SAE groups, but we only
use SAE-Res weights. Different SAEs may reveal different features.
- TopK sparsity. The SAE uses hard top-50 selection. This is not differentiable
and may create artifacts in feature correlations.
- MoE routing opacity. We cannot inspect which experts are activated per token.
Expert-specific SAEs might reveal different uncertainty signals.
- Verifiable answers only. PopQA has ground-truth answers. Our methods may not
generalize to open-ended or creative generation.
- No reasoning traces. We analyze single-step Q&A, not chain-of-thought.
Methods from Confidence Margin and Tracing Uncertainty require reasoning traces.
- transformer-lens incompatible. Standard mechanistic interpretability tools
(HookedTransformer, SAE Lens) do not support Qwen 3.5 MoE. We use raw transformers.
Assumptions
- SAE features are meaningful. We assume that sparse features encode
human-interpretable concepts. Descriptive Collision (arXiv:2605.12874) challenges
this — we acknowledge the risk.
- Last token position is representative. For most analyses, we use SAE
activations at the last token. Per-token dynamics may differ.
- Entropy correlates with uncertainty. We use first-token entropy as a proxy
for model uncertainty. This is standard but imperfect.
- 20 samples are sufficient. For entropy estimation, we sample 20 completions.
This is the paper's choice; more samples might reduce variance.
- Layer 20 is representative. Many analyses focus on layer 20 (mid-network).
Other layers may show different patterns.
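To make the 20-sample assumption concrete, here is a sketch of a plug-in entropy estimate computed from the sampled first tokens. Whether our scripts estimate entropy this way or read it from the model's softmax is not stated above, so treat the estimator and the function name as assumptions. Either way, an estimate built from n samples can never exceed ln(n), which for n = 20 is about 3.0 nats, and it is biased low when the true distribution has many plausible first tokens.

```python
import math
from collections import Counter


def empirical_entropy(tokens):
    """Plug-in Shannon entropy (nats) of the observed token frequencies."""
    n = len(tokens)
    counts = Counter(tokens)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# All 20 completions start with the same token: no observed uncertainty.
unanimous = ["Paris"] * 20

# Every completion starts differently: the estimate hits its ceiling ln(20).
all_distinct = [f"tok{i}" for i in range(20)]

print(empirical_entropy(unanimous))
print(empirical_entropy(all_distinct), math.log(20))
```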
Reproducibility Checklist
- ✅ Model weights pinned to Qwen/Qwen3.5-35B-A3B-Base
- ✅ SAE weights pinned to Qwen/SAE-Res-Qwen3.5-35B-A3B-Base-W32K-L0_50
- ✅ All scripts saved to ~/scratch/ with docstrings
- ✅ Random seeds documented in scripts (where applicable)
- ⚠️ GPU environment not reproducible (vast.ai instance down)
- ⚠️ No Docker/containerization
- ✅ Dependency lock file: scratch/requirements.txt
Disclaimer: This is a research exploration, not a production system.
Results should be treated as preliminary. We report our honest assessments,
including negative results and blocked experiments.