Autodata: Agents Create Superior Synthetic Training Data
Meta's Autodata deploys AI agents as data scientists that iteratively generate high-quality QA pairs from CS papers, widening the weak-strong solver gap from 1.9 to 34 points over CoT Self-Instruct and improving downstream model training.
Agentic Pipeline Generates Challenging, Filtered Data
Autodata runs a closed-loop process in which an orchestrator LLM coordinates four subagents to produce training and evaluation data: a Challenger that generates input-response pairs grounded in source documents such as CS papers, a Weak Solver (a smaller model expected to fail), a Strong Solver (a capable model expected to succeed), and a Verifier (a rubric-based judge). An example is accepted only if every criterion holds: the quality verifier approves it; the weak solver averages ≤65%, never exceeds 75%, and scores no zeros; the strong solver averages ≥60% but below 95%; and the weak-strong gap is ≥20 points. This rejects both trivial and unsolvable questions, iterating a median of 3-5 times per paper until acceptance or budget exhaustion; the acceptance filter is sketched below. From over 10,000 CS papers in S2ORC (published 2022 or later), the pipeline yields 2,117 QA pairs that specifically reward stronger capabilities, trading inference-time compute for data quality.
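To make the gating concrete, here is a minimal sketch of such an acceptance filter in Python. The Candidate structure, the normalization of solver scores to [0, 1], and the reading of "no zeros" as a strictly positive minimum score are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the acceptance filter; structure and field names are
# illustrative assumptions, not Autodata's implementation. Solver scores are
# assumed normalized to [0, 1] across repeated runs on the same example.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    verifier_approved: bool                                    # rubric-based quality verdict
    weak_scores: list[float] = field(default_factory=list)     # repeated weak-solver runs
    strong_scores: list[float] = field(default_factory=list)   # repeated strong-solver runs

def accept(c: Candidate) -> bool:
    """Keep only examples that are hard for the weak solver, solvable but
    non-trivial for the strong solver, and discriminative between the two."""
    if not (c.verifier_approved and c.weak_scores and c.strong_scores):
        return False
    weak_avg = sum(c.weak_scores) / len(c.weak_scores)
    strong_avg = sum(c.strong_scores) / len(c.strong_scores)
    return (
        weak_avg <= 0.65                    # weak solver mostly fails
        and max(c.weak_scores) <= 0.75      # and never does too well
        and min(c.weak_scores) > 0.0        # "no zeros": question is not unsolvable
        and 0.60 <= strong_avg < 0.95       # strong solver succeeds, but not trivially
        and strong_avg - weak_avg >= 0.20   # weak-strong gap of at least 20 points
    )

# Example: a candidate the filter would accept.
print(accept(Candidate(True, weak_scores=[0.3, 0.5, 0.4], strong_scores=[0.8, 0.7, 0.9])))
```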
Prior single-pass methods like Self-Instruct, Grounded/CoT Self-Instruct, and Self-Challenging lack this feedback loop, producing data on which weak (71.4%) and strong (73.3%) solvers perform nearly identically (a 1.9-point gap). Autodata widens this to 43.7% for the weak solver vs. 77.8% for the strong one (a 34-point gap), creating harder, more discriminative examples without human annotation.
Training Gains from Agentic Data
Fine-tuning Qwen-3.5-4B on Autodata via GRPO (one epoch, batch size 32, learning rate 1e-6), with Kimi-K2.6 as the reward model, outperforms CoT Self-Instruct baselines on both in- and out-of-distribution tests. Rubrics written by the Challenger keep responses aligned with paper-specific insights and prevent generic-knowledge leakage: questions probe content unique to each paper that is verifiable only after reading it, and the supplied context covers the problem setup without the solution. A training setup along these lines is sketched below.
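As a rough illustration, the sketch below assumes the open-source TRL library's GRPOTrainer and a dataset with "prompt" and "rubric" columns. The paper does not name its training stack, so the checkpoint, the dataset schema, and the toy keyword-overlap reward standing in for the Kimi-K2.6 judge are all assumptions.

```python
# GRPO fine-tuning sketch; assumes TRL's GRPOTrainer API and an illustrative
# dataset schema ("prompt", "rubric"). The real pipeline scores completions
# with an LLM judge against Challenger-written rubrics; a crude keyword-overlap
# reward stands in here so the sketch runs end to end.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def rubric_reward(completions, rubric, **kwargs):
    """Fraction of rubric criteria mentioned in each completion (toy judge)."""
    rewards = []
    for completion, criteria in zip(completions, rubric):
        hits = sum(1 for criterion in criteria if criterion.lower() in completion.lower())
        rewards.append(hits / max(len(criteria), 1))
    return rewards

# Tiny illustrative dataset; real prompts come from the accepted QA pairs.
train_dataset = Dataset.from_dict({
    "prompt": [f"Question {i} about a specific paper's method and results." for i in range(32)],
    "rubric": [["ablation", "latency"]] * 32,  # positive-only criteria per example
})

args = GRPOConfig(
    output_dir="qwen-autodata-grpo",
    num_train_epochs=1,              # one epoch, as reported
    per_device_train_batch_size=32,  # batch size 32, as reported
    learning_rate=1e-6,              # LR 1e-6, as reported
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",           # stand-in checkpoint; the exact model is an assumption
    reward_funcs=rubric_reward,
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```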
Meta-Optimization Evolves the Data Agent
An outer evolution loop (233 iterations, 126 accepted) uses Kimi-K2.6 to analyze failures and edit the agent's harness (prompts and scaffolding), boosting validation pass rates from 12.8% to 42.4% across a 50-paper train / 25-paper validation split. Auto-discovered fixes include: enforcing paper-specific questions via self-tests; banning solution leaks in the provided context; using positive-only rubrics with weights capped at 7; and requiring a strict JSON rubric format (sketched below). This eliminates manual tuning, so the data agent's effectiveness scales with additional compute.
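To show what that last constraint could look like in practice, the following hypothetical validator checks a rubric against the two discovered rules; the field names (criteria, description, weight) and the overall schema are assumptions, only the positive-only and cap-at-7 constraints come from the paper.

```python
# Hypothetical rubric-format check; field names and schema are assumed. Only
# the "positive-only" and "weights capped at 7" constraints come from the paper.
import json

def validate_rubric(raw: str) -> list[dict]:
    """Parse a Challenger-emitted rubric and reject malformed or out-of-spec ones."""
    rubric = json.loads(raw)  # must be strict JSON, not prose
    criteria = rubric["criteria"]
    if not isinstance(criteria, list) or not criteria:
        raise ValueError("rubric must contain a non-empty 'criteria' list")
    for item in criteria:
        weight = item["weight"]
        if not isinstance(weight, (int, float)) or not 0 < weight <= 7:
            raise ValueError("each weight must be positive and capped at 7")
        if not str(item["description"]).strip():
            raise ValueError("each criterion needs a non-empty 'description'")
    return criteria

# Example rubric that would pass the check (content is illustrative only).
example = json.dumps({
    "criteria": [
        {"description": "Names the paper's proposed caching policy", "weight": 7},
        {"description": "Cites the reported hit-rate improvement", "weight": 4},
    ]
})
print(validate_rubric(example))
```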