Qwen-Scope: Turning Sparse Features into Development Tools

Paper: arXiv:2605.11887 · Authors: Deng et al. (Qwen Team) · Explored: 2026-05-15

Overview

Most SAE research treats sparse autoencoders as post-hoc analysis tools: train them, inspect the features, write descriptions, and stop. Qwen-Scope breaks from this pattern. It is an open-source suite of 14 SAE groups across 7 Qwen model variants (Qwen3 and Qwen3.5, both dense and MoE), and it demonstrates four practical applications in which SAEs serve as development interfaces rather than purely diagnostic mirrors.

Core claim: SAEs can serve as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving LLMs. The paper shows this empirically across steering, evaluation, data classification, and post-training optimization.

What Qwen-Scope Provides

Model	Architecture	SAE Groups
Qwen3-0.6B	Dense	2
Qwen3-4B	Dense	2
Qwen3-8B	Dense	2
Qwen3.5-0.5B	MoE	2
Qwen3.5-1.5B	MoE	2
Qwen3.5-4B	MoE	2
Qwen3.5-35B-A3B	MoE	2

Each "group" contains SAEs for multiple layers. All weights are released on HuggingFace. The training uses a standard TopK SAE architecture with JumpReLU gating, trained on large-scale activation datasets.
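To make the TopK part of that description concrete, here is a minimal NumPy sketch of a TopK SAE forward pass. It is an illustration under assumptions, not the Qwen-Scope implementation: the function name, weight shapes, bias handling, and random toy weights are all hypothetical, and the JumpReLU gating mentioned above is not modeled.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep only the k largest pre-activations, then decode."""
    pre = (x - b_dec) @ W_enc + b_enc        # (n_features,) pre-activations
    top = np.argpartition(pre, -k)[-k:]      # indices of the k largest entries
    z = np.zeros_like(pre)
    z[top] = np.maximum(pre[top], 0.0)       # ReLU on the surviving features
    x_hat = z @ W_dec + b_dec                # reconstruction of the activation
    return z, x_hat

# Toy dimensions and random weights (hypothetical, for illustration only).
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)
z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

The sparse code `z` has at most `k` nonzero entries, which is what makes individual features cheap to inspect and intervene on.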

Four Applications

1. Inference-time steering

SAE features are used as control vectors during inference. By adding or subtracting feature directions from the residual stream, the model's behavior changes without modifying any weights.

Case studies demonstrated:

Language control: Suppress Chinese-English code-switching by identifying and downweighting language-mixing features
Concept injection: Amplify features associated with specific concepts (e.g., "formal tone") to steer generation style
Preference alignment: Use SAE-derived preference signals to guide outputs toward helpful, harmless, or honest directions

The steering is feature-selective: rather than applying a blanket intervention to all activations, the authors identify the specific features responsible for the target behavior and intervene only on those. This is more precise than full-layer steering (e.g., Representation Engineering).
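The feature-selective intervention described above can be sketched as adding scaled decoder directions to the residual stream. This is a toy version under assumptions: `steer_residual`, the feature indices, the strength value, and the random decoder matrix are hypothetical, and the paper's actual steering rule may differ (for example, it may clamp feature activations rather than add fixed directions).

```python
import numpy as np

def steer_residual(resid, W_dec, feature_ids, strength):
    """Add scaled unit decoder directions of the chosen features to every token."""
    directions = W_dec[feature_ids]                                   # (n_sel, d_model)
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    delta = strength * directions.sum(axis=0)                         # (d_model,)
    return resid + delta                                              # broadcasts over tokens

# Toy residual stream and decoder (hypothetical values).
rng = np.random.default_rng(0)
resid = rng.normal(size=(5, 16))     # (tokens, d_model)
W_dec = rng.normal(size=(64, 16))    # (n_features, d_model)

# A negative strength suppresses the selected features; a positive one amplifies them.
steered = steer_residual(resid, W_dec, feature_ids=[3, 17], strength=-2.0)
```

Only the selected directions change, so components of the residual stream orthogonal to them are left untouched, and no model weights are modified.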

2. Evaluation analysis

SAE activations provide a representation-level proxy for what a benchmark is actually testing. The authors use this to:

Detect benchmark redundancy: If two benchmarks activate the same set of SAE features, they are measuring the same underlying capability. This reveals overlap that performance metrics alone hide.
Analyze capability coverage: Map which SAE features are engaged by different task types. Show that some capabilities (e.g., reasoning) recruit features from deeper layers while others (e.g., syntax) use shallower ones.
Inter-benchmark similarity: Compute feature-based similarity matrices across 20+ benchmarks, revealing clusters of related tasks.
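One plausible way to build such a similarity matrix is to summarize each benchmark by its mean SAE feature activation and compare those profiles with cosine similarity. The metric, the function name, and the three toy profiles below are assumptions for illustration; the paper may use a different measure.

```python
import numpy as np

def benchmark_similarity(profiles):
    """Cosine similarity between per-benchmark mean feature activation profiles.

    profiles: array of shape (n_benchmarks, n_features).
    """
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    unit = profiles / np.clip(norms, 1e-12, None)   # guard against all-zero rows
    return unit @ unit.T

# Hypothetical profiles: two math-like benchmarks and one translation-like benchmark.
profiles = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.7],
])
sim = benchmark_similarity(profiles)
```

In this toy case the first two rows are far more similar to each other than either is to the third, which is the kind of cluster structure a redundancy analysis would look for.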

3. Data-centric workflows

SAE features are used for data classification and synthesis:

Toxicity detection: Identify SAE features that activate on toxic content across 14 languages. Build a rule-based classifier using just these features, achieving competitive F1 without any labeled data.
Data efficiency: Show that feature-based selection identifies high-quality training examples more efficiently than random sampling or uncertainty-based selection.
Safety-oriented synthesis: Use toxic features as guidance signals to generate synthetic safety training data, covering both toxic examples (for red-teaming) and their safe counterparts (for SFT).
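The rule-based toxicity classifier from the first item above can be sketched as a simple threshold on a handful of feature activations. The feature indices, threshold, and `min_hits` rule here are hypothetical; the source does not specify the exact rule.

```python
import numpy as np

def flag_toxic(feature_acts, toxic_ids, threshold=1.0, min_hits=1):
    """Flag a text when at least min_hits toxic features exceed the threshold.

    feature_acts: (n_texts, n_features), e.g. max-pooled activations per text.
    """
    hits = (feature_acts[:, toxic_ids] > threshold).sum(axis=1)
    return hits >= min_hits

# Toy activations for three texts; features 1 and 3 stand in for "toxic" features.
acts = np.array([
    [0.0, 2.5, 0.1, 0.0],
    [0.2, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 1.7],
])
flags = flag_toxic(acts, toxic_ids=[1, 3])
```

No labeled data is needed to apply the rule, but choosing which features count as toxic and where to set the threshold still requires some inspection of examples.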

4. Post-training optimization

SAE features are incorporated into training objectives:

SFT: Suppress language-specific features to reduce code-switching. Add a regularization term that penalizes activation of features associated with the wrong language.
RL: Use SAE features as auxiliary rewards. For example, reward the model for activating "diversity" features and penalize "repetition" features during RLHF training.

Results: SAE-guided SFT reduces code-switching by 40% with minimal accuracy loss. SAE-guided RL reduces repetition by 25% without degrading helpfulness.
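The arithmetic behind the SFT regularizer and the RL auxiliary reward can be sketched as follows. The weights, feature indices, and the use of a plain mean are assumptions for illustration; a real training run would compute these terms inside an autograd framework so gradients flow through the SAE encoder into the model.

```python
import numpy as np

def sae_penalty(feature_acts, penalized_ids, weight=0.1):
    """SFT regularizer: weighted mean activation of unwanted features."""
    return weight * feature_acts[:, penalized_ids].mean()

def sae_aux_reward(feature_acts, reward_ids, penalty_ids, weight=0.1):
    """RL auxiliary reward: favor wanted features, penalize unwanted ones."""
    good = feature_acts[:, reward_ids].mean()
    bad = feature_acts[:, penalty_ids].mean()
    return weight * (good - bad)

# Toy activations for two token positions (hypothetical values).
acts = np.array([
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 3.0],
])

task_loss = 1.25                                           # placeholder SFT loss
penalty = sae_penalty(acts, penalized_ids=[1])             # feature 1: "wrong language"
total_loss = task_loss + penalty

aux = sae_aux_reward(acts, reward_ids=[3], penalty_ids=[1])  # feature 3: "diversity"
```

Both terms are scalars added to an existing objective, so the base training recipe does not need to change.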

Connections to Our Work

They have SAEs for our exact model

Qwen-Scope includes SAEs for Qwen3.5-35B-A3B — the exact model we are studying. Their SAEs may differ from the SAE-Res weights we are using (different training recipe, different hyperparameters). Comparing the two could reveal which features are robust across training runs and which are artifacts of the specific SAE configuration.
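One simple way to run that comparison is to match features across the two SAEs by the cosine similarity of their decoder directions. The sketch below assumes both SAEs were trained on the same layer (so their decoder rows live in the same space) and uses toy matrices; `match_features` is a hypothetical helper, not part of either release.

```python
import numpy as np

def match_features(W_dec_a, W_dec_b):
    """For each feature in SAE A, find the most similar decoder direction in SAE B."""
    a = W_dec_a / np.linalg.norm(W_dec_a, axis=1, keepdims=True)
    b = W_dec_b / np.linalg.norm(W_dec_b, axis=1, keepdims=True)
    cos = a @ b.T                                  # (n_a, n_b) cosine similarities
    best = cos.argmax(axis=1)                      # best match in B for each A feature
    return best, cos[np.arange(len(a)), best]

# Toy check: SAE B holds the same directions as SAE A, permuted and rescaled.
rng = np.random.default_rng(0)
W_a = rng.normal(size=(8, 32))
perm = rng.permutation(8)
W_b = W_a[perm] * 1.5

best, score = match_features(W_a, W_b)
```

Features with a best-match score near 1 are candidates for being robust across training runs; features with low scores everywhere are candidates for configuration-specific artifacts.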

Uncertainty detection is unexplored in Qwen-Scope

Despite covering steering, evaluation, data, and post-training, Qwen-Scope does not address uncertainty detection or correctness prediction. This is a gap we can fill: apply their SAEs (or ours) to the uncertainty detection tasks explored in Feature Rivalry, Confidence Margin, and Tracing Uncertainty.

Feature steering vs feature rivalry

Qwen-Scope steers by amplifying/suppressing specific features. Feature Rivalry detects uncertainty by finding negatively correlated feature pairs. These are complementary: steering tells us features are causal; rivalry tells us features are in conflict. A model with high rivalry might be harder to steer (conflicting features resist unidirectional intervention).
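The rivalry side of this comparison can be sketched as a search for strongly negatively correlated feature pairs. The correlation threshold and the toy activations are assumptions for illustration, and the actual Feature Rivalry procedure may differ.

```python
import numpy as np

def rival_pairs(feature_acts, max_corr=-0.5):
    """Return (i, j, corr) for feature pairs whose correlation is below max_corr.

    feature_acts: (n_samples, n_features); columns should not be constant.
    """
    corr = np.corrcoef(feature_acts, rowvar=False)         # (n_features, n_features)
    i_idx, j_idx = np.where(np.triu(corr < max_corr, k=1))  # upper triangle only
    return [(int(i), int(j), float(corr[i, j])) for i, j in zip(i_idx, j_idx)]

# Toy activations: features 0 and 1 trade off against each other, feature 2 does not.
acts = np.array([
    [1.0, 0.0, 0.4],
    [0.0, 1.0, 0.6],
    [0.9, 0.1, 0.6],
    [0.1, 0.8, 0.4],
])
pairs = rival_pairs(acts)
```

A pair found this way could then be tested with steering: amplify one feature and check whether the other is suppressed downstream.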

Evaluation analysis and SAEBench

Qwen-Scope's evaluation analysis uses SAE features to assess benchmark redundancy. SAEBench evaluates SAEs on downstream tasks. These are inverse operations: Qwen-Scope uses SAEs to evaluate benchmarks; SAEBench uses benchmarks to evaluate SAEs. Combining both could yield a bidirectional understanding of which SAE features correspond to which capabilities.

Our Assessment

What we like

Practical focus. Unlike most SAE papers, Qwen-Scope is explicitly about using SAEs, not just analyzing them. The four application directions are all real use cases.
Open source. 14 SAE groups across 7 models, all on HuggingFace. This lowers the barrier to experimentation significantly.
MoE support. They trained SAEs on MoE models, including our target (Qwen3.5-35B-A3B). This validates that SAEs work on MoE architectures despite the routing complexity.
Feature-selective steering. More precise than full-layer interventions. The code-switching and style transfer demos are convincing.

What concerns us

No uncertainty work. The one application we care most about (detecting when the model is wrong) is absent.
Limited evaluation rigor. The paper demonstrates proof-of-concept results but does not provide comprehensive benchmarks (e.g., no AUROCs, no comparison to baselines for many tasks).
Feature collision risk. Given McCann's Descriptive Collision critique, we should verify that Qwen-Scope's "toxic features" are genuinely unique and not just one of many features sharing the same explanation.

Verdict

Important infrastructure, not a research breakthrough. Qwen-Scope is primarily an engineering contribution: a well-engineered, open-sourced SAE suite with practical demos. It does not introduce new methods or surprising findings. But it is extremely valuable for our work because (1) it gives us SAEs for our exact model, (2) it validates that SAEs work on MoE architectures, and (3) its steering and evaluation applications suggest new experiments we can run.

Most actionable next step: Load the Qwen-Scope SAEs for Qwen3.5-35B-A3B and compare them with the SAE-Res weights we are already using. Do they identify the same features? Do their uncertainty-related features overlap? This could be done in a single GPU session.

References

Deng et al., Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models, arXiv:2605.11887
Qwen-Scope HuggingFace: huggingface.co/Qwen
Qwen SAE-Res weights: SAE-Res-Qwen3.5-35B-A3B-Base-W32K-L0_50
McCann, Descriptive Collision: The Hidden Structure of SAE Auto-Interpretability, arXiv:2605.12874 (our analysis)
Karvonen et al., SAEBench, arXiv:2503.09532 (our analysis)
