Qwen-Scope SAEs Unlock Actionable LLM Internals
Qwen-Scope's open SAEs on 7 Qwen models decompose activations into interpretable features that enable output steering, proxy benchmark analysis (ρ≈0.85 correlation with performance-based redundancy), toxicity classification (F1>0.90), and training fixes such as a >50% reduction in code-switching.
SAE Decomposition Reveals Interpretable LLM Features
Sparse autoencoders (SAEs) decompose high-dimensional LLM activations into sparse latent features, each corresponding to a concept such as a language or a behavior. For Qwen3 and Qwen3.5, Qwen-Scope releases 14 SAE groups across 7 model variants: dense models (1.7B, 8B, 2B, 9B, 27B) and MoE models (30B-A3B, 35B-A3B). SAEs are trained per layer on residual-stream activations with a top-k activation function (k=50 or 100); dense-model SAEs use a 16x expansion of the hidden size, while MoE SAEs use 32K (16x) or 128K (64x) dictionary widths. All are trained on base checkpoints except Qwen3.5-27B, which uses the instruct checkpoint. This layer-wise dictionary enables diagnosis of issues such as language mixing or repetition without changing model weights.
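As a rough illustration, a top-k SAE keeps only the k largest latent pre-activations per token and reconstructs the residual stream from them. The sketch below uses generic weight names (W_enc, b_enc, W_dec, b_dec), which are assumptions for illustration rather than the released checkpoints' exact format.

```python
import torch

def topk_sae_forward(h, W_enc, b_enc, W_dec, b_dec, k=50):
    """Minimal top-k SAE sketch.
    h: [batch, d_model] residual-stream activations.
    Returns (reconstruction, sparse latent codes)."""
    pre = h @ W_enc + b_enc                                   # [batch, d_sae], d_sae ~ 16x d_model
    vals, idx = torch.topk(pre, k, dim=-1)                    # keep only the k largest pre-activations
    codes = torch.zeros_like(pre).scatter_(-1, idx, torch.relu(vals))
    recon = codes @ W_dec + b_dec                             # project sparse codes back to d_model
    return recon, codes
```

Each nonzero entry of `codes` is one interpretable feature firing on that token; the feature IDs referenced later index into this latent dictionary.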
Steer Outputs and Classify via Feature Interventions
Apply steering with h' = h + αd, where d is a feature's decoder direction and α its coefficient, to amplify or suppress features: suppress a Chinese-language feature (ID 6159) to stop English prompts from mixing in Chinese, or activate a classical-Chinese feature (ID 36398) for a stylistic shift. For toxicity, build classifiers from features that fire more on toxic data; an OR-rule over these features yields F1>0.90 on English for the 1.7B/8B models. English-discovered features transfer cross-lingually (more strongly to Russian/French, more weakly to Arabic/Chinese) and retain 99% of performance when only 10% of the discovery data is used. These zero-shot methods cut compute relative to running full evaluations or training classification heads.
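A minimal sketch of the h' = h + αd intervention as a PyTorch forward hook. The layer index, decoder matrix name, and α values are assumptions for illustration, not Qwen-Scope's released tooling.

```python
import torch

def make_steering_hook(W_dec, feature_id, alpha):
    """W_dec: [d_sae, d_model] SAE decoder; d = W_dec[feature_id] is the feature direction.
    alpha < 0 suppresses the feature (e.g. a Chinese-language feature on English prompts);
    alpha > 0 amplifies it (e.g. classical-Chinese style)."""
    direction = W_dec[feature_id]
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype)   # h' = h + alpha * d
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a Hugging Face-style decoder layer:
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(W_dec, 6159, -8.0))
# ... generate ...
# handle.remove()
```

Because the intervention is a single added vector per token, it requires no weight updates and can be toggled per prompt.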
Proxy Benchmark Analysis Without Model Runs
SAE features act as micro-capabilities for evaluation: a redundancy metric computed from feature-activation overlap correlates ρ≈0.85 with performance-based redundancy across 17 benchmarks (MMLU, GSM8K, MATH, etc.); GSM8K shares 63% of its features with MATH, so it can be safely omitted. Pairwise feature overlap, with MMLU partialed out, correlates 75.5% with capability similarity: retain low-overlap benchmarks and consolidate high-overlap ones to streamline evaluation suites without any model forward passes.
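A sketch of the overlap proxy, assuming we already have, per benchmark, the set of SAE feature IDs that activate on its prompts; the function names and the scipy comparison step are illustrative, not the paper's exact pipeline.

```python
from scipy.stats import spearmanr

def feature_overlap(features_a: set, features_b: set) -> float:
    """Fraction of benchmark A's active SAE features that also fire on benchmark B
    (e.g. GSM8K vs. MATH ~ 0.63)."""
    return len(features_a & features_b) / len(features_a) if features_a else 0.0

def proxy_vs_performance(overlap_scores, perf_redundancy):
    """Rank correlation between the feature-overlap proxy and performance-based
    redundancy; the write-up reports rho ~ 0.85 across 17 benchmarks."""
    return spearmanr(overlap_scores, perf_redundancy)[0]
```

Feature sets are collected once per benchmark from cached activations, so ranking benchmark redundancy needs no further model inference.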
Augment Training with Feature-Driven Signals
For SFT, Sparse Autoencoder-guided SFT (SASFT) suppresses non-target-language features via an auxiliary loss, cutting code-switching by >50% across Gemma-2/Llama-3.1/Qwen3 on Chinese/Russian/Korean (with full elimination in cases such as Qwen3-1.7B Korean) while preserving multilingual benchmark scores. For RL, feature steering synthesizes repetitive outputs that serve as otherwise-rare negative examples in DAPO, sharply reducing repetition in the 1.7B/8B/30B-A3B models. Safety data synthesis targets features missing from existing data: 4k synthesized pairs cover 99.74% of features (versus lower coverage for random sampling) and boost accuracy to 77.75% when mixed 1:1 with real data, matching 120k real-only examples at a far smaller data budget.
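A hedged sketch of the SASFT-style auxiliary term: penalize activation of non-target-language SAE features at a hooked layer during SFT. The encoder form, feature ID list, and loss weight are assumptions for illustration.

```python
import torch

def sasft_loss(lm_loss, hidden_states, W_enc, b_enc, suppress_ids, weight=0.1):
    """lm_loss: standard next-token cross-entropy.
    hidden_states: [batch, seq, d_model] residual stream at the hooked layer.
    suppress_ids: SAE feature IDs for the language(s) to suppress (e.g. Chinese
    features when fine-tuning on Korean data)."""
    codes = torch.relu(hidden_states @ W_enc + b_enc)       # SAE latent activations
    penalty = codes[..., suppress_ids].mean()               # average activation of unwanted features
    return lm_loss + weight * penalty                       # total SFT objective
```

Because the penalty acts only on the targeted feature subspace, the base language-modeling objective and other multilingual capabilities are left largely untouched.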