AI Agents Beat Humans on Weak-to-Strong Research

Claude-powered autonomous agents achieve 0.97 PGR (performance gap recovered) on weak-to-strong supervision in 5 days (800 hours across 9 AARs, ~$18k total cost), outperforming human researchers' best of 0.23 PGR after 7 days of tuning.

Parallel AARs Scale Research via Diverse Seeding and Shared Collaboration

Launch teams of Claude Opus 4.6 agents in independent sandboxes with access to training helpers, baselines, and tools for submitting evals via a remote API, sharing findings on a forum, and exchanging codebases. Avoid prescribing workflows to preserve flexibility: AARs autonomously hypothesize, experiment, analyze, and iterate. On a chat preference dataset (HelpSteer2/3 for train/ID test; RM-Bench/RewardBench-2 for OOD test), with Qwen1.5-0.5B-Chat as the weak model and Qwen3-4B-Base as the strong model, AARs recover 97% of the performance gap (PGR=0.97) vs. humans' best of 0.23 from tuning baselines such as weak labels, confident weak labels, unsupervised elicitation, and critic training. Cost: $22 per AAR-hour. Unlimited eval submissions reveal hacks but enable real progress; practical submission caps just make agents more strategic.
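The PGR numbers above follow the standard performance-gap-recovered definition from the weak-to-strong generalization literature: the fraction of the gap between the weak supervisor and the ground-truth-trained strong model that the weak-to-strong model closes. A minimal sketch (the accuracy values below are illustrative, not from the paper):

```python
def pgr(weak_acc: float, w2s_acc: float, strong_ceiling_acc: float) -> float:
    """Performance Gap Recovered: fraction of the weak-to-ceiling accuracy
    gap closed by the weak-to-strong trained model."""
    gap = strong_ceiling_acc - weak_acc
    if gap == 0:
        raise ValueError("strong ceiling equals weak floor; PGR undefined")
    return (w2s_acc - weak_acc) / gap

# Illustrative numbers: weak supervisor 60%, ground-truth-trained strong
# model 80%, weak-to-strong model 79.4%.
print(round(pgr(0.60, 0.794, 0.80), 2))
```

A PGR of 1.0 means the weak-to-strong model fully matches the strong ceiling; 0.0 means it learned nothing beyond the weak supervisor.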

Directed seeding, assigning the 9 AARs ambiguous directions such as "combine weak-to-strong and unsupervised elicitation", prevents entropy collapse (AARs converging on a few ideas like self-training). Directed runs explore 11 method families (e.g., ensembling, distillation, curricula) with sustained Shannon entropy, reaching frontier PGR faster than undirected runs (all AARs given the same prompt). Idea complexity (Claude-scored code, measured in Python/pseudocode lines) rises early alongside PGR and then plateaus as ablations prune dead weight; the top ideas stay simple and replicable.
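The entropy-collapse diagnostic above can be measured with the usual Shannon formula over the distribution of AARs across method families. A sketch with hypothetical family assignments (the family names are illustrative):

```python
import math
from collections import Counter

def family_entropy(assignments: list[str]) -> float:
    """Shannon entropy (nats) of the distribution of AARs over method
    families. Near-zero entropy means the run has collapsed onto one idea."""
    counts = Counter(assignments)
    n = len(assignments)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical runs: a collapsed undirected run vs. a diverse directed run.
collapsed = ["self-training"] * 8 + ["ensemble"]
directed = ["ensemble", "distillation", "curriculum", "ccs", "em",
            "density", "resampling", "probing", "self-training"]
print(round(family_entropy(collapsed), 2), round(family_entropy(directed), 2))
```

With 9 agents, entropy ranges from 0 (total collapse) to ln 9 ≈ 2.20 nats (every agent in a distinct family).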

Discovered Methods Exploit Model Geometry for Label Correction

AARs succeed even in 'doomed' directions, via both intuitive tricks like confidence-weighted resampling and alien ones like embedding density scoring. Top method: CCS + Evolution Strategy (PGR=0.93). It probes strong-model hidden states for an unsupervised truth direction via Contrast-Consistent Search (CCS), aligns the probe's sign with the weak labels, then resamples examples by max(CCS confidence, weak confidence) on agreement or the min on disagreement; training uses an SGD warmup on augmented data, then an evolution strategy on an unlabeled swap-consistency fitness (rewarding confident, symmetric predictions), ensembling 16 seeds with trimmed-mean logits.
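The max-on-agreement / min-on-disagreement resampling rule can be sketched as follows. The parameterization of "confidence" as distance from 0.5 is an assumption; the paper gives only the max/min rule:

```python
def resample_weight(ccs_prob: float, weak_prob: float) -> float:
    """Sampling weight for one example under the rule described above.
    Confidence is taken as distance of the probability from 0.5 (assumed
    parameterization); agreement means both probes favor the same label.
    Agreement -> max confidence, disagreement -> min confidence."""
    ccs_conf = abs(ccs_prob - 0.5) * 2    # rescale to [0, 1]
    weak_conf = abs(weak_prob - 0.5) * 2
    agree = (ccs_prob >= 0.5) == (weak_prob >= 0.5)
    return max(ccs_conf, weak_conf) if agree else min(ccs_conf, weak_conf)

# Agreement keeps the example heavily sampled even if one probe is unsure;
# disagreement is down-weighted toward the less certain side.
print(resample_weight(0.9, 0.7))   # agree
print(resample_weight(0.9, 0.2))   # disagree
```

The effect is that examples where the CCS probe and the weak labels confidently disagree are sampled less, limiting how much weak-label noise the strong model imitates.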

EM Posterior (PGR=0.78): multi-template logit margins from the frozen base model yield features (weak confidence, mean/std margins, agreement); an instance-dependent noisy channel P(weak|true) is fit via MLE (TPR/FPR as sigmoids of the features); the Bayesian posterior = channel likelihood × margin prior, tempered; two EM rounds refine the channel using student predictions. Overlap Density (PGR=0.75): frozen embeddings are scored by cross-fit probe predictability, kNN label smoothness, local density, and a mid-entropy preference (peaking at H=0.35 nats); the z-score sum selects a top-50%, class-balanced subset for finetuning.
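The EM Posterior's core update, a tempered Bayesian posterior over the true label given the weak label and a margin-based prior, can be sketched with scalar TPR/FPR (in the full method these are instance-dependent sigmoids fit by MLE; the tempering form below is an assumption):

```python
def tempered_posterior(weak_label: int, p_prior: float,
                       tpr: float, fpr: float, temperature: float = 1.0) -> float:
    """P(true=1 | weak label, margin prior) via a noisy-channel decomposition.
    tpr = P(weak=1 | true=1), fpr = P(weak=1 | true=0); p_prior is the
    margin-derived prior P(true=1). Temperature > 1 flattens the posterior."""
    like1 = tpr if weak_label == 1 else 1 - tpr      # channel term, true = 1
    like0 = fpr if weak_label == 1 else 1 - fpr      # channel term, true = 0
    num = (like1 * p_prior) ** (1 / temperature)
    den = num + (like0 * (1 - p_prior)) ** (1 / temperature)
    return num / den

# A confident margin prior can overturn a weak label through a noisy channel:
print(round(tempered_posterior(weak_label=0, p_prior=0.9, tpr=0.7, fpr=0.3), 3))
```

Here a weak label of 0 is overridden (posterior ≈ 0.79 for the positive class) because the margin prior is confident and the channel is noisy; raising the temperature pulls the posterior back toward 0.5.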

Simpler often wins: MDL Curriculum (PGR=0.68) trains 2 epochs on the top half of weak-confident examples at full LR, then continues 3 epochs on the full data at half LR. These methods beat the baselines (PGR≈0.2) by leveraging the strong model's latent capabilities without overfitting to weak-label noise, though production transfer (Sonnet 4.0 on helpfulness preferences) gains only +0.5 points due to weak base signals.
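The two-phase MDL Curriculum schedule can be sketched directly; the dict-based example representation and `weak_conf` field are hypothetical stand-ins, since the paper does not specify the data format:

```python
def mdl_curriculum(examples: list[dict], base_lr: float = 1e-5) -> list[dict]:
    """Two-phase schedule from the MDL Curriculum recipe: phase 1 trains
    2 epochs on the most weak-confident half at full LR; phase 2 continues
    3 epochs on all data at half LR. Each example is a dict carrying a
    'weak_conf' key (hypothetical representation)."""
    ranked = sorted(examples, key=lambda e: e["weak_conf"], reverse=True)
    confident_half = ranked[: len(ranked) // 2]
    return [
        {"epochs": 2, "lr": base_lr, "data": confident_half},
        {"epochs": 3, "lr": base_lr / 2, "data": examples},
    ]

examples = [{"id": i, "weak_conf": c}
            for i, c in enumerate([0.95, 0.6, 0.85, 0.55, 0.7, 0.9])]
phases = mdl_curriculum(examples)
print(len(phases[0]["data"]), phases[1]["lr"])
```

The curriculum front-loads the cleanest weak labels, then anneals into the full (noisier) set at a reduced learning rate.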

Generalization Varies by Exploited Capabilities

Top chat ideas transfer unevenly: the SOTA CCS+ES method works on math verification (DAPO-Math-17K train; AIME 2024/25 OOD) and coding (TACO easy-medium train; medium-hard+ OOD), but the EM posterior succeeds on math (where strong zero-shot performance is good) and fails on code (where it is weak). On math/coding, AARs often ignore weak supervision entirely, solving tasks via pure strong-model elicitation; use chat for hill-climbing and reserve the other domains for cross-evaluation. Key lesson: design hack-resistant, outcome-gradable evals (balanced labels, no exploitable patterns) so AARs climb true progress rather than memorization. For non-gradable alignment work the bottleneck shifts to eval design; automation then unlocks bootstrapping.
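Hack-resistant eval design starts with cheap sanity checks of the kind mentioned above: class balance and the absence of trivial positional patterns. A minimal sketch (the specific checks and thresholds are illustrative, not from the paper):

```python
def eval_sanity_check(labels: list[int], tolerance: float = 0.05) -> dict:
    """Two cheap checks for a binary eval set: (1) class balance within
    `tolerance` of 50/50, and (2) whether a trivial position-parity rule
    (e.g. strictly alternating labels) predicts the answer well.
    Thresholds are illustrative."""
    n = len(labels)
    pos_rate = sum(labels) / n
    parity_acc = sum(1 for i, y in enumerate(labels) if y == i % 2) / n
    return {
        "balanced": abs(pos_rate - 0.5) <= tolerance,
        "no_parity_pattern": max(parity_acc, 1 - parity_acc) < 0.7,
    }

# Perfectly balanced but trivially patterned: an agent could score 100%
# without reading a single example.
print(eval_sanity_check([0, 1] * 50))
print(eval_sanity_check([1, 1, 0, 1, 0, 0, 1, 0, 0, 1]))
```

Real deployments would add more checks (length correlations, template leakage, duplicate detection), but even this level catches eval sets that reward memorizing a pattern rather than making true progress.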

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge