The Hybrid Training Challenge

Training small language-model agents for long-horizon tasks involves a fundamental trade-off between imitation learning and reinforcement learning (RL). On-policy distillation (OPD) provides dense, efficient guidance from a teacher model, which leads to rapid early gains. However, its performance plateaus once the student matches the teacher, because imitation alone gives the student no signal for going beyond the teacher's behavior. RL, by contrast, allows exploration and improvement beyond the teacher's capabilities, but its feedback is sparse and delayed, which makes early-stage training inefficient.

The ATOD Approach

ATOD (Annealed Turn-aware On-policy Distillation) addresses this by integrating both methods into a unified pipeline. The algorithm employs two primary mechanisms:

  • Annealed OPD-RL Schedule: Instead of committing to one method, ATOD shifts the training focus over time. It starts with OPD to quickly align the student with the teacher's baseline behavior, then gradually transitions to RL. Teacher guidance stabilizes early training, and reward-driven exploration later lets the student surpass the teacher's performance.
  • Turn-level Disagreement-Uncertainty Reweighting (T-DUR): Long-horizon tasks span many turns, and not all of them matter equally. T-DUR identifies and amplifies high-utility turns (moments where the student's actions are most critical or uncertain), so the model receives dense, meaningful supervision across the entire trajectory rather than only at the final outcome.
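
The two mechanisms above can be illustrated with a small sketch. It is not the authors' implementation: the summary does not give the exact schedule or weighting formula, so the linear anneal, the additive disagreement-plus-uncertainty score, the softmax normalization, and all function names (opd_weight, turn_weights, hybrid_loss) are assumptions chosen only to show how an annealed blend and a turn-level reweighting could fit together.

```python
import math


def opd_weight(step, total_steps, start=1.0, end=0.0):
    """Hypothetical linear anneal of the distillation weight from `start` to `end`."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def turn_weights(disagreement, uncertainty, temperature=1.0):
    """Assumed scoring: softmax over (disagreement + uncertainty), rescaled to mean 1."""
    scores = [d + u for d, u in zip(disagreement, uncertainty)]
    if not scores:
        return []
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


def hybrid_loss(opd_turn_losses, rl_loss, weights, lam):
    """Blend the reweighted per-turn OPD loss with a scalar RL loss."""
    weighted = sum(w * l for w, l in zip(weights, opd_turn_losses))
    opd = weighted / len(opd_turn_losses)
    return lam * opd + (1.0 - lam) * rl_loss


# Usage: three turns, where the second shows the most disagreement and uncertainty.
w = turn_weights([0.1, 0.9, 0.1], [0.2, 0.8, 0.2])
for step in (0, 5, 10):
    lam = opd_weight(step, total_steps=11)
    loss = hybrid_loss([0.5, 2.0, 0.4], rl_loss=1.2, weights=w, lam=lam)
    print(step, lam, loss)
```

In this sketch the blend coefficient starts at 1 (pure distillation) and decays to 0 (pure RL), while the turn weights average to 1 so that reweighting redistributes supervision across turns without changing the overall scale of the distillation term. Any monotone schedule or other scoring rule could be substituted; the per-turn losses and the RL loss here are placeholder numbers, not measured values.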

Performance Gains

Experimental results on benchmarks including ALFWorld, WebShop, and Search-QA show that ATOD consistently outperforms standard post-training baselines. Across student model sizes, ATOD improves the average success rate by 3.03 points over traditional OPD and by 23.62 points over GRPO (Group Relative Policy Optimization). Notably, student models trained with ATOD surpass their own teacher models by an average of 2.16 points, which supports the claim that the hybrid approach can break through the imitation ceiling.