The Hybrid Training Challenge

Training small language models for long-horizon agent tasks faces a fundamental trade-off between imitation and exploration. On-policy distillation (OPD) provides dense supervision from a teacher model, leading to rapid initial gains, but performance plateaus once the student mimics the teacher's limitations. Conversely, reinforcement learning (RL) allows for exploration and reward-based optimization, but suffers from sample inefficiency due to sparse, delayed feedback in complex environments.

The ATOD Approach

ATOD (Annealed Turn-aware On-policy Distillation) addresses this by integrating both paradigms through two primary mechanisms:

  • Annealed OPD-RL Schedule: The training process begins with a heavy reliance on OPD to quickly align the student with the teacher's behavior. As training progresses, the system gradually shifts weight toward RL, allowing the model to move beyond the teacher's performance ceiling through environment-driven exploration.
  • Turn-level Disagreement-Uncertainty Reweighting (T-DUR): This technique dynamically reweights individual turns within a trajectory. By amplifying high-utility turns where the model shows high uncertainty or disagreement, it provides finer-grained dense supervision, which is critical for maintaining performance across long, multi-step tasks.
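
The two mechanisms above can be illustrated with a minimal sketch. It is not the ATOD implementation: the summary does not specify the schedule shape or the reweighting formula, so the linear anneal, the softmax over summed disagreement and uncertainty scores, and every name here (opd_weight, turn_weights, hybrid_loss, temperature) are assumptions chosen only to make the idea concrete.

```python
import math


def opd_weight(step, total_steps, start=1.0, end=0.0):
    """Assumed linear anneal of the OPD loss weight from `start` to `end`.

    The source only says the weight shifts gradually from OPD toward RL;
    the linear shape is an illustrative choice.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (end - start) * frac


def turn_weights(disagreement, uncertainty, temperature=1.0):
    """Hypothetical turn-level reweighting.

    Scores each turn by disagreement + uncertainty, applies a softmax,
    and rescales so the weights average to 1 across the trajectory.
    """
    scores = [(d + u) / temperature for d, u in zip(disagreement, uncertainty)]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    n = len(scores)
    return [n * e / total for e in exps]


def hybrid_loss(opd_turn_losses, rl_loss, weights, lam):
    """Blend the reweighted per-turn OPD loss with an RL loss.

    `lam` is the current OPD weight; (1 - lam) goes to the RL term.
    """
    weighted = sum(w * l for w, l in zip(weights, opd_turn_losses))
    opd = weighted / len(opd_turn_losses)
    return lam * opd + (1.0 - lam) * rl_loss


# Example: a three-turn trajectory halfway through training.
lam = opd_weight(step=50, total_steps=100)
w = turn_weights(disagreement=[0.1, 2.0, 0.1], uncertainty=[0.1, 1.0, 0.1])
loss = hybrid_loss([0.4, 1.2, 0.3], rl_loss=0.8, weights=w, lam=lam)
```

In this sketch the second turn, which has the highest combined score, receives the largest weight, and the OPD term fades out as step approaches total_steps. Normalizing the weights to average 1 keeps the overall scale of the OPD loss comparable to the unweighted case, which is a design choice of the sketch rather than something stated in the source.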

Performance Gains

ATOD improves on standard post-training baselines across the ALFWorld, WebShop, and Search-QA benchmarks. On average, its success rate is 3.03 points higher than standard OPD and 23.62 points higher than GRPO (Group Relative Policy Optimization). Notably, ATOD-trained student models surpass their own teacher models by an average of 2.16 points, suggesting that combining guided distillation with reward-based refinement breaks the performance ceiling typically imposed by imitation-only training.