Autodata: Agents Create Superior Synthetic Training Data
Meta's Autodata deploys AI agents as data scientists that iteratively generate high-quality QA pairs from CS papers, widening the weak-strong solver gap from 1.9 to 34 points over CoT Self-Instruct and improving downstream model training.
Agentic Pipeline Generates Challenging, Filtered Data
Autodata runs a closed-loop process in which an orchestrator LLM coordinates four subagents to produce training and evaluation data: a Challenger that generates input-response pairs grounded in source documents such as CS papers, a Weak Solver (a smaller model expected to fail), a Strong Solver (a capable model expected to succeed), and a Verifier (a rubric-based judge). An example is accepted only if every criterion holds: the quality verifier approves it; the weak solver averages ≤65%, never exceeds 75%, and scores no zeros; the strong solver averages ≥60% but below 95%; and the weak-strong gap is ≥20 points. This rejects both trivial and unsolvable questions, iterating a median of 3-5 times per paper until acceptance or budget exhaustion; the acceptance filter is sketched below. From over 10,000 CS papers in S2ORC (published 2022 or later), the pipeline yields 2,117 QA pairs that specifically reward stronger capabilities, trading inference-time compute for data quality.
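To make the gating concrete, here is a minimal sketch of such an acceptance filter in Python. The Candidate structure, the normalization of solver scores to [0, 1], and the reading of "no zeros" as a strictly positive minimum score are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the acceptance filter; structure and field names are
# illustrative assumptions, not Autodata's implementation. Solver scores are
# assumed normalized to [0, 1] across repeated runs on the same example.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    verifier_approved: bool                                    # rubric-based quality verdict
    weak_scores: list[float] = field(default_factory=list)     # repeated weak-solver runs
    strong_scores: list[float] = field(default_factory=list)   # repeated strong-solver runs

def accept(c: Candidate) -> bool:
    """Keep only examples that are hard for the weak solver, solvable but
    non-trivial for the strong solver, and discriminative between the two."""
    if not (c.verifier_approved and c.weak_scores and c.strong_scores):
        return False
    weak_avg = sum(c.weak_scores) / len(c.weak_scores)
    strong_avg = sum(c.strong_scores) / len(c.strong_scores)
    return (
        weak_avg <= 0.65                    # weak solver mostly fails
        and max(c.weak_scores) <= 0.75      # and never does too well
        and min(c.weak_scores) > 0.0        # "no zeros": question is not unsolvable
        and 0.60 <= strong_avg < 0.95       # strong solver succeeds, but not trivially
        and strong_avg - weak_avg >= 0.20   # weak-strong gap of at least 20 points
    )

# Example: a candidate the filter would accept.
print(accept(Candidate(True, weak_scores=[0.3, 0.5, 0.4], strong_scores=[0.8, 0.7, 0.9])))
```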
Prior single-pass methods like Self-Instruct, Grounded/CoT Self-Instruct, and Self-Challenging lack this feedback loop, producing data on which weak (71.4%) and strong (73.3%) solvers perform nearly identically (a 1.9-point gap). Autodata widens this to 43.7% for the weak solver vs. 77.8% for the strong one (a 34-point gap), creating harder, more discriminative examples without human annotation.
Training Gains from Agentic Data
Fine-tuning Qwen-3.5-4B on Autodata via GRPO (one epoch, batch size 32, learning rate 1e-6), with Kimi-K2.6 as the reward model, outperforms CoT Self-Instruct baselines on both in- and out-of-distribution tests. Rubrics written by the Challenger keep responses aligned with paper-specific insights and prevent generic-knowledge leakage: questions probe content unique to each paper that is verifiable only after reading it, and the supplied context covers the problem setup without the solution. A training setup along these lines is sketched below.
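As a rough illustration, the sketch below assumes the open-source TRL library's GRPOTrainer and a dataset with "prompt" and "rubric" columns. The paper does not name its training stack, so the checkpoint, the dataset schema, and the toy keyword-overlap reward standing in for the Kimi-K2.6 judge are all assumptions.

```python
# GRPO fine-tuning sketch; assumes TRL's GRPOTrainer API and an illustrative
# dataset schema ("prompt", "rubric"). The real pipeline scores completions
# with an LLM judge against Challenger-written rubrics; a crude keyword-overlap
# reward stands in here so the sketch runs end to end.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def rubric_reward(completions, rubric, **kwargs):
    """Fraction of rubric criteria mentioned in each completion (toy judge)."""
    rewards = []
    for completion, criteria in zip(completions, rubric):
        hits = sum(1 for criterion in criteria if criterion.lower() in completion.lower())
        rewards.append(hits / max(len(criteria), 1))
    return rewards

# Tiny illustrative dataset; real prompts come from the accepted QA pairs.
train_dataset = Dataset.from_dict({
    "prompt": [f"Question {i} about a specific paper's method and results." for i in range(32)],
    "rubric": [["ablation", "latency"]] * 32,  # positive-only criteria per example
})

args = GRPOConfig(
    output_dir="qwen-autodata-grpo",
    num_train_epochs=1,              # one epoch, as reported
    per_device_train_batch_size=32,  # batch size 32, as reported
    learning_rate=1e-6,              # LR 1e-6, as reported
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",           # stand-in checkpoint; the exact model is an assumption
    reward_funcs=rubric_reward,
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```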
Meta-Optimization Evolves the Data Agent
An outer evolution loop (233 iterations, 126 accepted) uses Kimi-K2.6 to analyze failures and edit the agent's harness (prompts and scaffolding), boosting validation pass rates from 12.8% to 42.4% across a 50-paper train / 25-paper validation split. Auto-discovered fixes include: enforcing paper-specific questions via self-tests; banning solution leaks in the provided context; using positive-only rubrics with weights capped at 7; and requiring a strict JSON rubric format (sketched below). This eliminates manual tuning, so the data agent's effectiveness scales with additional compute.
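To show what that last constraint could look like in practice, the following hypothetical validator checks a rubric against the two discovered rules; the field names (criteria, description, weight) and the overall schema are assumptions, only the positive-only and cap-at-7 constraints come from the paper.

```python
# Hypothetical rubric-format check; field names and schema are assumed. Only
# the "positive-only" and "weights capped at 7" constraints come from the paper.
import json

def validate_rubric(raw: str) -> list[dict]:
    """Parse a Challenger-emitted rubric and reject malformed or out-of-spec ones."""
    rubric = json.loads(raw)  # must be strict JSON, not prose
    criteria = rubric["criteria"]
    if not isinstance(criteria, list) or not criteria:
        raise ValueError("rubric must contain a non-empty 'criteria' list")
    for item in criteria:
        weight = item["weight"]
        if not isinstance(weight, (int, float)) or not 0 < weight <= 7:
            raise ValueError("each weight must be positive and capped at 7")
        if not str(item["description"]).strip():
            raise ValueError("each criterion needs a non-empty 'description'")
    return criteria

# Example rubric that would pass the check (content is illustrative only).
example = json.dumps({
    "criteria": [
        {"description": "Names the paper's proposed caching policy", "weight": 7},
        {"description": "Cites the reported hit-rate improvement", "weight": 4},
    ]
})
print(validate_rubric(example))
```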