The Challenge of Multi-Turn Tool Use
Training agents to perform complex, multi-turn tasks remains difficult because high-quality, diverse, long-horizon interaction data is scarce. Most existing datasets focus on single-turn requests or simple tool calls and fail to capture the iterative reasoning, error recovery, and multi-step planning that real-world agentic workflows require.
The RODS Framework
RODS (Reward-Driven Online Data Synthesis) addresses this by automating the creation of synthetic training trajectories. Instead of relying on static datasets, the framework uses an iterative process to generate, evaluate, and refine tool-use interactions.
- Iterative Generation: The system prompts a base model to generate multi-turn tool-use trajectories based on complex task prompts.
- Reward-Driven Filtering: A reward model scores each trajectory against success criteria such as task completion, tool call accuracy, and logical flow. Only trajectories that meet a high reward threshold are retained for training.
- Online Refinement: The model is continuously updated on these high-reward synthetic samples, so the agent learns to navigate complex tool-use environments better, effectively 'learning from its own successes' and improving on subsequent, more difficult tasks.
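The three steps above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the `Trajectory` type, the `generate`, `score`, and `update` callables, and the default threshold of 0.8 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical container for one multi-turn tool-use interaction."""
    task: str
    turns: List[dict] = field(default_factory=list)
    reward: float = 0.0


def filter_by_reward(
    trajectories: List[Trajectory],
    score: Callable[[Trajectory], float],
    threshold: float,
) -> List[Trajectory]:
    """Score every trajectory and keep those at or above the threshold."""
    kept = []
    for traj in trajectories:
        traj.reward = score(traj)
        if traj.reward >= threshold:
            kept.append(traj)
    return kept


def run_loop(
    tasks: List[str],
    generate: Callable[[str], Trajectory],   # assumed: wraps the base model
    score: Callable[[Trajectory], float],    # assumed: wraps the reward model
    update: Callable[[List[Trajectory]], None],  # assumed: one training step
    threshold: float = 0.8,                  # illustrative value only
    rounds: int = 3,
) -> List[int]:
    """Generate, filter, and update for a fixed number of rounds.

    Returns how many trajectories were kept in each round.
    """
    kept_per_round = []
    for _ in range(rounds):
        candidates = [generate(task) for task in tasks]
        kept = filter_by_reward(candidates, score, threshold)
        update(kept)  # the model trains only on high-reward samples
        kept_per_round.append(len(kept))
    return kept_per_round
```

Passing the model, reward model, and training step in as callables keeps the sketch independent of any particular training stack; in practice `update` would also change what `generate` produces in the next round.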
This approach shifts the burden from manual data collection to algorithmic synthesis, allowing developers to scale training data for specific toolsets without massive human-annotated datasets. Reward-driven filtering is what keeps data quality high in this setup: raw synthetic data often contains hallucinations or invalid tool calls, which can degrade agent performance if they reach the training set.
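To make "invalid tool calls" concrete, the following sketch checks each call in a trajectory against a small tool registry. It is an illustrative assumption rather than part of RODS as described: the `TOOL_SPECS` registry, the tool names, and the `name`/`arguments` call layout are all hypothetical, and a learned reward model would look at far more than this kind of structural check.

```python
from typing import Dict, List, Set

# Hypothetical registry: tool name -> required argument names.
TOOL_SPECS: Dict[str, Set[str]] = {
    "search": {"query"},
    "get_weather": {"city", "date"},
}


def invalid_calls(
    calls: List[dict], specs: Dict[str, Set[str]] = TOOL_SPECS
) -> List[str]:
    """Return a description of every call that names an unknown tool
    or omits a required argument. An empty list means all calls passed."""
    problems = []
    for i, call in enumerate(calls):
        name = call.get("name")
        if name not in specs:
            # The model invented a tool that does not exist.
            problems.append(f"turn {i}: unknown tool {name!r}")
            continue
        missing = specs[name] - set(call.get("arguments", {}))
        if missing:
            problems.append(f"turn {i}: {name} missing {sorted(missing)}")
    return problems
```

A check like this could serve as a cheap pre-filter, with trajectories that return a non-empty list discarded before the reward model is ever called.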