The Efficiency of Specialization
VibeThinker-3B, developed by Sina Weibo Inc., challenges the assumption that better reasoning requires ever-larger models. By focusing exclusively on verifiable tasks (mathematics, coding, and STEM), this 3-billion-parameter model matches models hundreds of times its size, such as DeepSeek V3.2 and Kimi K2.5, on benchmarks like AIME26. It is designed as a specialist tool: for open-domain knowledge tasks, the authors recommend larger general-purpose models.
The Spectrum-to-Signal Pipeline
The model's performance is driven by a four-stage post-training pipeline built on the Qwen2.5-Coder-3B base, grouped below into three parts:
- Curriculum SFT: A two-stage supervised fine-tuning process that builds a broad 'spectrum' of reasoning paths, using diversity-exploring distillation to maintain multiple valid solution trajectories.
- Multi-domain Reasoning RL: Uses MaxEnt-Guided Policy Optimization (MGPO) to train across math, code, and STEM. Notably, this stage abandons progressive context expansion and trains with a consistent 64K context window throughout. A subsequent 'Long2Short' math stage rewards correct reasoning chains that are shorter, which cuts redundant tokens.
- Self-Distillation & Instruct RL: Offline distillation merges the RL checkpoints into a single student model. Instruction-tuning follows, so that the model stays controllable without sacrificing reasoning depth.
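The Long2Short idea of rewarding shorter, correct chains can be illustrated with a toy reward function. This is a minimal sketch under stated assumptions, not the authors' reward: the `Rollout` type, the `length_bonus` parameter, and the linear brevity scaling are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled solution (hypothetical container, not from the source)."""
    is_correct: bool
    num_tokens: int


def long2short_rewards(rollouts: List[Rollout], length_bonus: float = 0.5) -> List[float]:
    """Give wrong answers 0, correct answers 1 plus a bonus for brevity.

    Assumption: brevity is measured relative to the other correct rollouts
    for the same problem, scaled linearly between the longest and shortest.
    """
    correct_lengths = [r.num_tokens for r in rollouts if r.is_correct]
    if not correct_lengths:
        return [0.0] * len(rollouts)

    shortest, longest = min(correct_lengths), max(correct_lengths)
    span = longest - shortest

    rewards = []
    for r in rollouts:
        if not r.is_correct:
            rewards.append(0.0)  # length never rescues a wrong answer
        elif span == 0:
            rewards.append(1.0)  # all correct rollouts are equally long
        else:
            brevity = (longest - r.num_tokens) / span  # 1.0 = shortest
            rewards.append(1.0 + length_bonus * brevity)
    return rewards


group = [Rollout(True, 1000), Rollout(True, 3000), Rollout(False, 500), Rollout(True, 2000)]
print(long2short_rewards(group))
```

The key design point in this sketch is that length only matters among correct rollouts, so a short wrong answer is never preferred over a long correct one.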
Test-Time Scaling with CLR
To further boost performance without increasing parameter count, the researchers implemented Claim-Level Reliability Assessment (CLR). This test-time scaling method involves:
- Generating 32 trajectories per problem.
- Extracting decision-relevant claims from each.
- Using the model as its own verifier to assign binary verdicts to each claim.
- Weighting the final answer based on the reliability of the claims within the reasoning path.
The gain is substantial on benchmarks like AIME26, where the score rises from 94.3 to 97.1.
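The aggregation step of CLR can be sketched as a reliability-weighted vote. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trajectory` type and function names are hypothetical, the generation and self-verification calls are left out (verdicts are passed in as booleans), and scoring a trajectory by the fraction of its claims that pass is an assumed rule the source does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    """One sampled solution with its verified claims (hypothetical container)."""
    answer: str
    claim_verdicts: List[bool]  # binary verdict per extracted claim


def reliability(traj: Trajectory) -> float:
    """Assumed scoring rule: fraction of claims the verifier accepted."""
    if not traj.claim_verdicts:
        return 0.0
    return sum(traj.claim_verdicts) / len(traj.claim_verdicts)


def clr_select(trajectories: List[Trajectory]) -> str:
    """Pick the answer with the highest total reliability across trajectories."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    scores: Dict[str, float] = defaultdict(float)
    for traj in trajectories:
        scores[traj.answer] += reliability(traj)
    return max(scores, key=scores.get)


# Three weakly supported trajectories agree on "17"; two well-supported ones say "42".
samples = [
    Trajectory("17", [True, False, False, False]),
    Trajectory("17", [True, False, False, False]),
    Trajectory("17", [True, False, False, False]),
    Trajectory("42", [True, True, True, True]),
    Trajectory("42", [True, True, True, True]),
]
print(clr_select(samples))
```

In this toy input, plain majority voting would choose "17" (three votes to two), while the reliability-weighted vote chooses "42" (total weight 2.0 versus 0.75), which is the behavior the claim-level weighting is meant to produce.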