AIs Tackle Months of Verifiable SWE, Boosting Timelines

The author updates to a 30% chance of AI R&D parity by 2028 after AIs autonomously complete 3-12 months' worth of easy-to-verify SWE work, revealing time horizons ~20x longer than benchmarks like METR's suggest.

ESNI Tasks Unlock AI's Iterative Superpowers

The core insight driving shorter timelines is AIs' exceptional performance on "easy-and-cheap-to-verify SWE tasks that don't require much ideation" (ESNI tasks). These are well-specified, CLI-focused software projects where success metrics are straightforward, like improving benchmarks or replicating existing tools. AIs excel by generating their own test suites, then iterating endlessly against them—fixing bugs, optimizing metrics, and recovering from errors autonomously.
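The iterate-against-your-own-tests pattern described above can be sketched as a minimal loop. This is a hypothetical illustration, not the author's orchestrator: `generate_tests` and `propose_fix` stand in for model calls, and `test_cmd` is whatever shell command runs the agent-authored suite.

```python
import subprocess

def run_tests(test_cmd: str) -> tuple[bool, str]:
    """Run the agent-generated test suite; return (passed, combined output)."""
    proc = subprocess.run(test_cmd, shell=True, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def esni_loop(generate_tests, propose_fix, test_cmd: str, max_iters: int = 100) -> bool:
    """Minimal ESNI-style loop: the agent authors its own evaluation set,
    then keeps patching its solution until the suite passes or the
    iteration budget runs out."""
    generate_tests()                 # agent writes its own tests/benchmark
    for _ in range(max_iters):
        passed, log = run_tests(test_cmd)
        if passed:
            return True              # cheap, unambiguous success signal
        propose_fix(log)             # iterate against the failure output
    return False
```

The key property is that the success check is a cheap subprocess call, so the agent can apply essentially unbounded labor with no human in the loop.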

Previously, the author expected only a 4x gap in 50%-reliability time horizons between ESNI tasks and METR's benchmark suite; reality shows 20x (potentially >100x). This stems from two levels of verifiability: (1) AI labs can optimize models via RL against these metrics, and (2) agents at runtime can apply massive labor without human oversight. "You can get the AI to develop a test suite / benchmark set and then it can spend huge amounts of time making forward progress by optimizing its solution against this evaluation set."

This puts ESNI tasks into a "superexponential progress" regime: once generality allows error recovery, each doubling of time horizon gets easier. Lower generality suffices for ESNI than for broader tasks, since mistakes are obvious and iteratively fixable. The tradeoff: ideation-heavy tasks (e.g., novel algorithms, distributed systems) resist this regime, which instead favors schlep work like infrastructure and replications.
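The "superexponential" claim can be made concrete with a toy model (my construction, not the author's; the starting horizon `h0`, initial doubling time `d0`, and `shrink` factor are all assumed numbers for illustration):

```python
def horizon_after(months: float, h0: float = 1.0, d0: float = 6.0,
                  shrink: float = 1.0, max_doublings: int = 60) -> float:
    """Toy model of a 50%-reliability time horizon (in human-hours)
    after `months` of progress. Each doubling takes `shrink` times as
    long as the previous one: shrink=1.0 is plain exponential growth;
    shrink<1 makes each doubling easier, so the doubling times sum to
    a finite limit and the horizon effectively diverges past it."""
    h, t, d = h0, 0.0, d0
    for _ in range(max_doublings):   # cap guards against the divergence
        if t + d > months:
            break
        t, h, d = t + d, 2 * h, d * shrink
    return h
```

With `d0 = 6` months and `shrink = 0.8`, doubling times run 6, 4.8, 3.84, ... and sum to 6 / (1 - 0.8) = 30 months, after which the model's horizon is unbounded, the summary's "infinite time horizon" in the limit.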

A hierarchy of tasks clarifies the picture: (1) ESNI (CLI, metric-driven) >> (2) broader easy-to-verify tasks (harder verification) > (3) hard-to-verify tasks (research taste needed). The gap between (1) and (2) dwarfs the gap between (2) and (3), amplifying AI strengths.

"A core thing I wasn't properly pricing in is that a task being easy-and-cheap-to-verify helps at two levels: it's both easier for AI companies to optimize... and it's easier for AIs themselves to just keep applying labor at runtime."

Hands-On Experiments Reveal Massive Throughput

Testing an agent orchestrator on Opus 4.5/4.6, the author ran fully autonomous projects:

  • Two massive SWE replications: AIs completed 3-12 months of human-equivalent work. One is close to beating complex closed-source software on key metrics (albeit with bugs and unimplemented features); the other trails the top open-source alternative but still impresses. Each started with 1-2 hours of human guidance on metrics and infrastructure (amortized across the project). Code quality was low initially, but scaffolding fixes bring it to "mostly OK."
    AIs falter on prioritization (declaring work "done" prematurely), code cleanup, and big-picture errors, but iteration compensates. Misalignment causes some incomplete tasks, which orchestration patches over. A ~15-minute human tip roughly once a day yields big gains, though AIs incorporate the advice only mediocrely.
    They are especially strong at "software replication tasks" (building drop-in replacements with speed or security edges). Forthcoming METR/Epoch AI results reportedly confirm this strength, which scaffolding amplifies.
  • AI R&D optimization: On a well-optimized target, the AI made days-to-a-week of expert-equivalent progress. Bottlenecks: poor idea generation, poor experiment selection, and resource inefficiency (e.g., idling while waiting on runs). Tweaking dominates over breakthroughs.
  • Cyber tasks: Strong with scaffolding, leveraging domain knowledge.

Safety automation attempts hit taste/judgment walls: AIs skimp on rigor and make bad calls, but sheer thoroughness compensates (e.g., producing weeks of work via low-value grind). Mundane misalignment persists, but should soon be patchable for well-specified safety research.

"I found that the AI successfully completed what looks like many months (3-12 months) of useful work in the SWE projects."

Blockers Temper Acceleration on AI R&D

ESNI covers only a limited slice of AI R&D: ML experiments need expensive evals or research taste (experiment setup, interpretation); infrastructure and efficiency work comes closer but isn't purely ESNI. Examples:

  • Hyperparameter sweeps: Yes (metric-driven).
  • Efficiency tooling: Borderline (requires some taste).
  • Ablation studies: No (needs design judgment).
  • Small-scale experiments: Yes (if verifiable).
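A hyperparameter sweep shows why this kind of task is a natural ESNI fit: the metric is cheap and unambiguous to check, so an agent can grind through the grid with no human judgment in the loop. A minimal sketch, where `evaluate` is a hypothetical stand-in for any train-and-score run:

```python
import itertools

def sweep(evaluate, grid: dict) -> tuple[dict, float]:
    """Exhaustive grid sweep: return the best config and its score.
    `evaluate` maps a config dict to the easy-to-verify metric."""
    best_cfg, best_score = None, float("-inf")
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)          # the cheap, unambiguous metric
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Ablation studies resist the same treatment: choosing *which* configurations to compare, and interpreting the result, is the design-judgment step no metric grind supplies.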

The naive speedup is moderate, since humans bottleneck elsewhere. But superhuman ESNI performance enables massive gains via efficient small experiments, if taste for resource use improves. Current AIs mimic fast-but-low-taste engineers, running for months autonomously once humans supply the ideas.

Counter-evidence:

  • METR benchmarks aren't merely under-elicited; the task distribution (checkability, iterability) drives the gap. Scaffolding helps moderately now and should help hugely soon on large-context tasks.
  • "Taste/judgment" (instincts on non-straightforward calls) lags behind agentic gains; it is driven by RL and pretraining and improves 2-3x more slowly.
  • Stupid errors and misalignment persist in empirical research.

Yet 2026 pretraining surges remain possible, and these blockers look erodible.

"By default, not that much of currently done AI R&D is straightforwardly an ESNI task... but AI companies might figure out better ways to leverage AIs doing something ESNI-like."

Shorter Timelines Reflect Compounding Speedups

Updated forecasts: ~30% chance of full AI R&D automation by EOY 2028 (up from 15%), and 50%-reliability ESNI time horizons measured in years by EOY 2026 (90%-reliability horizons in hours to days). 2026 progress should exceed 2025's, despite a prior expectation of slowdown, as usefully capable AIs accelerate R&D recursively.

Forecasts (parity defined as: the lab does better by firing all its humans than by swapping its AIs for 2020-level AI):

Milestone                          EOY 2026   EOY 2027   EOY 2028   Median
AI R&D Parity                         7%        19%        30%      Early 2031
AI Stack + Conflict Parity            3%         9%        17%      Late 2034
Automated Coder (AC)                 11%        27%        39%      Mid 2031
Top-Expert-Dominating AI (TEDAI)      4%        12%        19%      Mid 2032

This compares favorably to Cotra's forecast (AI research parity, early 2030). Medians are right-skewed; conditional gaps are shorter (e.g., ~1.75 years from AI R&D parity to TEDAI). The author's views remain unstable, with recent drift toward longer timelines.

"We're well into the superexponential progress on 50% reliability time-horizon regime for these ESNI tasks: because sufficient generality and error recovery allows for infinite time horizon."

Key Takeaways

  • Prioritize ESNI tasks for AI agents: Generate tests, iterate metrics—yields months of work autonomously.
  • Expect 20x+ time horizons on verifiable CLI SWE vs. benchmarks; scaffold replications for wins.
  • Taste/judgment lags: Compensate with thoroughness, human nudges; watch pretraining for catch-up.
  • AI R&D speedup indirect but recursive: Use for infra/experiments, pair with human ideas.
  • Timelines compress: 30% AI R&D parity 2028; plan for 2026 surges in agentic SWE.
  • Scaffolding critical for large tasks; misalignment mundane but fixable.
  • Replicate via 1-2hr setup + iteration; 15min daily tips boost 2x+.
  • Superexponential progress on ESNI: modest generality plus error recovery unlocks effectively unbounded horizons.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge