The Efficiency of Specialization

VibeThinker-3B, developed by Sina Weibo Inc., challenges the assumption that better reasoning requires ever-larger models. By focusing exclusively on verifiable tasks (mathematics, coding, and STEM), this 3-billion-parameter model matches the performance of models hundreds of times its size, such as DeepSeek V3.2 and Kimi K2.5, on benchmarks like AIME26. It is designed as a specialist tool: the authors recommend larger general-purpose models for open-domain knowledge tasks.

The Spectrum-to-Signal Pipeline

The model's performance is driven by a four-stage post-training pipeline built on the Qwen2.5-Coder-3B base:

  • Curriculum SFT: A two-stage supervised fine-tuning process that builds a broad 'spectrum' of reasoning paths, using diversity-exploring distillation to preserve multiple valid solution trajectories.
  • Multi-domain Reasoning RL: Uses MaxEnt-Guided Policy Optimization (MGPO) to train across math, code, and STEM. Instead of progressively expanding the context, it trains with a fixed 64K context window throughout. A 'Long2Short' math stage then rewards shorter correct reasoning chains to reduce token redundancy.
  • Self-Distillation & Instruct RL: The final two stages. Offline distillation first merges the RL checkpoints into a single student model; instruction-focused training then keeps the model controllable without sacrificing reasoning depth.
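The 'Long2Short' idea of rewarding shorter correct chains can be illustrated with a small reward function. This is a minimal sketch, not the authors' reward: the function name long2short_reward, the linear penalty, and the length_weight value are all assumptions made for illustration. Only the 64K window size comes from the description above.

```python
def long2short_reward(
    is_correct: bool,
    num_tokens: int,
    max_tokens: int = 65536,      # the 64K context window mentioned above
    length_weight: float = 0.2,   # hypothetical penalty strength
) -> float:
    """Illustrative length-aware reward (an assumption, not the published formula).

    Incorrect answers get 0 regardless of length. Correct answers get 1 minus
    a small penalty that grows linearly with the fraction of the window used,
    so a short correct chain always outranks a long correct one.
    """
    if not is_correct:
        return 0.0
    # Clamp the token count into [0, max_tokens] before normalizing.
    used_fraction = min(max(num_tokens, 0), max_tokens) / max_tokens
    return 1.0 - length_weight * used_fraction


if __name__ == "__main__":
    for tokens in (1_000, 30_000, 65_536):
        print(tokens, long2short_reward(True, tokens))
```

Keeping the penalty smaller than the correctness reward means a correct answer always scores above an incorrect one, so the policy is never pushed to trade accuracy for brevity.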

Test-Time Scaling with CLR

To further boost performance without increasing parameter count, the researchers implemented Claim-Level Reliability Assessment (CLR). This test-time scaling method involves:

  1. Generating 32 trajectories per problem.
  2. Extracting decision-relevant claims from each.
  3. Using the model as its own verifier to assign binary verdicts to each claim.
  4. Weighting the final answer based on the reliability of the claims within each reasoning path.

On AIME26, this method lifts the score from 94.3 to 97.1.
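The four steps above can be sketched as a reliability-weighted vote. This is a minimal sketch under stated assumptions, not the authors' implementation: extract_claims and verify_claim are hypothetical stand-ins for calls to the model, and scoring a trajectory by the fraction of its claims judged valid is one plausible aggregation rule that the description does not confirm.

```python
from collections import defaultdict
from typing import Callable, Sequence


def clr_vote(
    trajectories: Sequence[tuple[str, str]],       # (final_answer, reasoning_text)
    extract_claims: Callable[[str], list[str]],    # hypothetical: model-based extractor
    verify_claim: Callable[[str], bool],           # hypothetical: model as its own verifier
) -> str:
    """Pick the answer with the highest total claim-level reliability."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    scores: dict[str, float] = defaultdict(float)
    for answer, reasoning in trajectories:
        verdicts = [verify_claim(c) for c in extract_claims(reasoning)]
        # Assumed rule: reliability = share of claims with a positive verdict.
        reliability = sum(verdicts) / len(verdicts) if verdicts else 0.0
        scores[answer] += reliability
    return max(scores, key=scores.get)


# Toy stand-ins so the sketch runs without a model.
def toy_extract(reasoning: str) -> list[str]:
    return [c for c in reasoning.split(";") if c]


def toy_verify(claim: str) -> bool:
    return not claim.startswith("bad")


if __name__ == "__main__":
    samples = [
        ("41", "bad;bad"), ("41", "bad;x"), ("41", "bad;bad"),
        ("42", "a;b"), ("42", "a;c"),
    ]
    print(clr_vote(samples, toy_extract, toy_verify))  # prints 42
```

In the toy run, plain majority voting would pick "41" (three of five trajectories), while the reliability-weighted vote picks "42" because its supporting claims pass verification. In practice the list would hold the 32 sampled trajectories per problem.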