Strategy-Guided Policy Optimization for LLM Reasoning

The Problem with Trajectory Imitation

Standard distillation methods often rely on imitating the specific solution trajectories of stronger models. This approach encourages the student model to memorize instance-specific steps, which limits its ability to generalize to novel problems. Instead of learning the underlying logic of how to reason, the model merely learns what to answer for a given input.

The SGPO Framework

Strategy-Guided Policy Optimization (SGPO) shifts the focus from trajectory imitation to strategy distillation. The framework operates through two core mechanisms:

Strategy Extraction: It extracts structured strategy descriptions from strong-model responses. These strategies act as reusable guides for the student model.
Comparative Trajectory Construction: For each problem, the framework generates both autonomous trajectories (without guidance) and strategy-guided trajectories. This allows the model to directly compare its behavior against a strategic baseline.
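
The pairing of autonomous and strategy-guided trajectories can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the prompt wording, the function names (build_autonomous_prompt, build_guided_prompt, build_comparative_pair), and the generate callable that stands in for the student model are hypothetical and are not taken from the SGPO paper.

```python
from typing import Callable, Dict


def build_autonomous_prompt(problem: str) -> str:
    # Unguided prompt: the student sees only the problem (hypothetical wording).
    return f"Problem: {problem}\nSolve step by step."


def build_guided_prompt(problem: str, strategy: str) -> str:
    # Guided prompt: the extracted strategy is prepended as a reusable hint.
    return (
        f"Strategy: {strategy}\n"
        f"Problem: {problem}\n"
        "Solve step by step, following the strategy."
    )


def build_comparative_pair(
    problem: str,
    strategy: str,
    generate: Callable[[str], str],
) -> Dict[str, str]:
    # Sample one trajectory per condition from the same student model so the
    # two can be compared on the same problem.
    return {
        "autonomous": generate(build_autonomous_prompt(problem)),
        "guided": generate(build_guided_prompt(problem, strategy)),
    }


if __name__ == "__main__":
    # Stand-in for a real model call: it simply echoes the prompt it received.
    def echo_model(prompt: str) -> str:
        return prompt

    pair = build_comparative_pair(
        "Find all integers x with x^2 = 49.",
        "Take square roots and keep both signs.",
        echo_model,
    )
    print(pair["guided"])
```

In practice generate would sample from the student policy, and several trajectories would likely be drawn per condition; the sketch draws one of each to keep the comparison explicit.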

Selective Distillation and Adaptive Guidance

SGPO addresses how to distill (which signal to transfer) and when to distill (how strongly to guide) through two mechanisms:

Token-level Forward-KL Objective: This objective selectively transfers the distributional shift caused by strategy conditioning into the unguided policy. By using proximal constraints, it ensures the training remains stable while focusing on the most relevant reasoning signals.
Adaptive Instance-level Weighting: This mechanism controls the intensity of the guidance. It strengthens the influence of strategies when the model's autonomous exploration fails and automatically reduces that guidance as the model gains competence, preventing over-reliance on the teacher's specific path.
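
As a rough illustration of how these two pieces could fit together, the sketch below computes a token-level forward KL from the strategy-guided distribution to the unguided one and scales it by an instance-level weight. The weighting rule (one minus the autonomous success rate) and all function names are assumptions chosen for clarity, not the paper's exact formulation, and the proximal constraints mentioned above are not modeled.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    # Numerically stable log-softmax over one token's vocabulary logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def forward_kl(guided_logits: Sequence[float],
               unguided_logits: Sequence[float]) -> float:
    # KL(p_guided || p_unguided) at a single token position. The guided
    # distribution is treated as a fixed target for the unguided policy.
    log_p = log_softmax(guided_logits)
    log_q = log_softmax(unguided_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def instance_weight(autonomous_rewards: Sequence[float]) -> float:
    # Assumed rule: guide more when unguided rollouts fail more often.
    # Rewards are taken to be 1 for a correct rollout and 0 otherwise.
    if not autonomous_rewards:
        raise ValueError("need at least one autonomous rollout")
    success_rate = sum(autonomous_rewards) / len(autonomous_rewards)
    return 1.0 - success_rate


def weighted_distillation_loss(
    guided_logits_seq: Sequence[Sequence[float]],
    unguided_logits_seq: Sequence[Sequence[float]],
    autonomous_rewards: Sequence[float],
) -> float:
    # Mean per-token forward KL, scaled by the instance-level weight.
    per_token = [
        forward_kl(g, u)
        for g, u in zip(guided_logits_seq, unguided_logits_seq)
    ]
    weight = instance_weight(autonomous_rewards)
    return weight * sum(per_token) / len(per_token)


if __name__ == "__main__":
    guided = [[2.0, 0.0, 0.0], [0.0, 1.5, 0.0]]
    unguided = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.5]]
    # All autonomous rollouts failed, so guidance is applied at full strength.
    print(weighted_distillation_loss(guided, unguided, [0, 0, 0, 0]))
    # All autonomous rollouts succeeded, so the distillation term vanishes.
    print(weighted_distillation_loss(guided, unguided, [1, 1, 1, 1]))
```

Under this assumed rule the distillation term fades out on problems the student already solves unaided, which mirrors the stated goal of reducing guidance as competence grows.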

Performance Gains

Experiments across four mathematical benchmarks and two model families show that SGPO consistently outperforms supervised fine-tuning (SFT), on-policy RL, and hybrid-policy baselines. On Qwen2.5-7B-Instruct, the method improves the average score by 2.2 points over the strongest baseline. The results suggest that the forward-KL objective provides a more effective distillation signal than direct trajectory imitation, and that strategy distillation scales with the base model's capabilities.
