AI R&D Automation: 60% Chance by 2028
Benchmarks show AI saturating coding (SWE-Bench: 2%→94%), science reproduction (CORE-Bench: 22%→96%), and engineering tasks; if public trends hold, AI R&D with no humans in the loop is plausible by 2028.
Coding Capabilities Saturate Benchmarks, Automating Engineering
AI now excels at real-world software tasks: on SWE-Bench, success at resolving GitHub issues rose from Claude 2's 2% in 2023 to Claude Mythos Preview's 93.9% today, effectively saturating the benchmark (residual errors are dominated by label noise, much as ImageNet's ~6% label-error rate caps top-line accuracy there). This automates code writing, testing, and iteration without humans; frontier-lab workers now code entirely via AI. METR's time-horizon metric tracks how long a task an AI can complete independently: GPT-3.5 managed 30 seconds (2022), GPT-4 hit 4 minutes (2023), o1 reached 40 minutes (2024), GPT 5.2 ~6 hours (2025), and Opus 4.6 ~12 hours (2026), with forecasts of 100 hours by end-2026. These horizons cover routine AI R&D chores such as data cleaning and experiment launches, letting multi-hour work be delegated to AI reliably.
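The horizon figures above follow a roughly exponential trend. A minimal sketch of the extrapolation, using only the data points quoted in this section and an ordinary least-squares fit on the log of the horizon (the exact METR methodology differs; this is an illustration of the trend arithmetic, not their model):

```python
import math

# (year, autonomous-task horizon in hours) -- figures quoted above
points = [
    (2022, 30 / 3600),   # GPT-3.5: 30 seconds
    (2023, 4 / 60),      # GPT-4: 4 minutes
    (2024, 40 / 60),     # o1: 40 minutes
    (2025, 6.0),         # GPT 5.2: ~6 hours
    (2026, 12.0),        # Opus 4.6: ~12 hours
]

# Ordinary least squares on log10(horizon) vs. year.
xs = [x for x, _ in points]
ys = [math.log10(y) for _, y in points]
n = len(points)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs
)

growth_per_year = 10 ** slope                        # multiplicative growth factor
doubling_months = 12 * math.log10(2) / slope         # implied doubling time
horizon_2027 = 10 ** (ybar + slope * (2027 - xbar))  # extrapolated hours

print(f"~{growth_per_year:.1f}x per year, doubling every {doubling_months:.1f} months")
print(f"extrapolated horizon at start of 2027: ~{horizon_2027:.0f} hours")
```

On these points the fit implies roughly 7x growth per year (a doubling time near 4.4 months) and on the order of 150 autonomous hours by early 2027, broadly consistent with the 100-hours-by-end-2026 forecast.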
Agentic tools such as Claude Code and OpenCode chain tasks autonomously, forming 'synthetic teams' in which a manager AI oversees specialized sub-agents to scale up larger projects.
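A hedged sketch of the 'synthetic team' pattern described above (every name here is hypothetical and the sub-agents are stubs; a real system would wrap model calls behind `run`):

```python
from dataclasses import dataclass, field

@dataclass
class SubAgent:
    """Hypothetical specialized worker; run() would wrap a role-prompted model call."""
    role: str

    def run(self, task: str) -> str:
        # Stub result: a real sub-agent would execute the task and return output.
        return f"[{self.role}] done: {task}"

@dataclass
class ManagerAgent:
    """Hypothetical manager that decomposes a goal and delegates to its team."""
    team: dict[str, SubAgent] = field(default_factory=dict)

    def delegate(self, plan: list[tuple[str, str]]) -> list[str]:
        # plan: (role, task) pairs produced by the manager's own planning step.
        return [self.team[role].run(task) for role, task in plan]

manager = ManagerAgent(team={
    "coder": SubAgent("coder"),
    "tester": SubAgent("tester"),
})
results = manager.delegate([
    ("coder", "implement data-cleaning script"),
    ("tester", "write regression tests for it"),
])
print(results)
```

The design point is the division of labor: the manager holds the plan, each sub-agent holds one specialty, and results flow back for aggregation.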
Core AI R&D Skills Advance Rapidly
AI reproduces published papers on CORE-Bench, jumping from GPT-4o's 21.5% (Sep 2024) to Opus 4.5's 95.5% (Dec 2025), handling library installation, experiment runs, and result analysis. On MLE-Bench, AI wins 64.4% of 75 Kaggle competitions (Gemini3, Feb 2026) versus o1's 16.9% at launch. LLMs now write and optimize GPU kernels (CUDA, Triton, Ascend), with examples including DeepSeek models, automated PyTorch-to-CUDA translation, and Meta's infrastructure kernels.
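The kernel-optimization workflow described above is, at its core, a propose-verify-benchmark loop. A minimal sketch of that loop, with plain Python functions standing in for kernels (a real pipeline would have an LLM emit CUDA or Triton source, compile it, and time it on hardware; the candidate names here are illustrative):

```python
import time

def reference(xs):
    """Ground-truth implementation used to check candidate correctness."""
    return sum(x * x for x in xs)

# Candidate implementations; in the real workflow an LLM proposes these
# as GPU kernel source rather than Python functions.
def candidate_naive(xs):
    total = 0.0
    for x in xs:
        total += x * x
    return total

def candidate_builtin(xs):
    return sum(map(lambda x: x * x, xs))

def best_kernel(candidates, xs, trials=5):
    """Discard incorrect candidates, then keep the fastest measured one."""
    expected = reference(xs)
    timed = []
    for fn in candidates:
        if abs(fn(xs) - expected) > 1e-6:   # reject wrong proposals outright
            continue
        start = time.perf_counter()
        for _ in range(trials):
            fn(xs)
        timed.append((time.perf_counter() - start, fn))
    return min(timed, key=lambda t: t[0])[1]

data = [float(i) for i in range(10_000)]
winner = best_kernel([candidate_naive, candidate_builtin], data)
print(winner.__name__)
```

Correctness filtering before timing is the key step: speedup claims like those above only count if the optimized kernel still matches the reference output.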
PostTrainBench pits AI post-training against human-tuned models: human tuning sets a 51% uplift baseline across Qwen/Gemma/SmolLM (on AIME, Arena, GSM8K, etc.), while top AIs (Opus 4.6, GPT 5.4) achieve 25-28% uplift. Training optimization yields a 52× CPU speedup (Claude Mythos Preview, Apr 2026) versus a human's 4× in 4-8 hours. Alignment research agents beat Anthropic baselines on scalable oversight via primed teams.
Creativity Emerges, Fueling Self-Improvement
AI R&D is 99% 'perspiration' (scaling data and compute, fixing breakages, running parameter sweeps) rather than radical invention like the transformer. AI already handles this 'Lego' work via coding plus long time horizons, automating engineering end to end. Early signs of creativity: a solution to Erdős problem 1051 via Gemini (1 of 13 novel results out of 700 problems); centaur proofs with Gemini-based tools. Industry goals line up: OpenAI targets automated research interns (Sep 2026), Anthropic fields alignment agents, and Recursive Superintelligence has raised $500M for AI R&D automation.
The trade-off: frontier models are costlier and harder to automate, but non-frontier proofs of concept look imminent (1-2 years), and scaling should accelerate them further if trends hold.