AutoResearch: AI Self-Optimizes Code via Experiments

AutoResearch lets AI iteratively improve algorithms without human coding by running experiments in a constrained loop, boosting a chess engine from 750 to 2600 Elo and fixing restaurant inventory failures.

AutoResearch Mechanism: Constrained Experimentation Loop

AutoResearch structures AI optimization around three files: a program file defining the target algorithm, a prepare.py file for evaluation metrics, and strict rules forbidding changes outside the target code. The AI runs incremental experiments, keeping only variants that improve eval scores and discarding failures. This creates a self-improving loop without human intervention during iterations, unlike vibe coding, which builds features sequentially with manual checks. For instance, the AI can be constrained to modify only the scoring logic using predefined tools, keeping progress focused.
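The keep-if-better loop can be sketched in a few lines. This is a minimal illustration, not the video's actual code: eval_score stands in for the prepare.py metric, and the hidden "target" parameters exist only to make the demo self-contained.

```python
import random

def eval_score(params):
    """Hypothetical stand-in for prepare.py's evaluation: higher is better.
    Scores distance to an arbitrary hidden optimum so the demo is self-contained."""
    target = {"aggression": 0.7, "safety": 0.3}
    return -sum((params[k] - target[k]) ** 2 for k in params)

def mutate(params, rng):
    """One 'experiment': propose a small change to a single tunable value."""
    candidate = dict(params)
    key = rng.choice(list(candidate))
    candidate[key] += rng.uniform(-0.1, 0.1)
    return candidate

def autoresearch_loop(params, iterations=500, seed=0):
    """Greedy loop: run an experiment, keep the variant only when its
    eval score improves, otherwise discard it."""
    rng = random.Random(seed)
    best_score = eval_score(params)
    for _ in range(iterations):
        candidate = mutate(params, rng)
        score = eval_score(candidate)
        if score > best_score:  # keep only improving variants
            params, best_score = candidate, score
    return params, best_score

params, score = autoresearch_loop({"aggression": 0.0, "safety": 0.0})
```

The key structural point is that nothing inside the loop asks a human anything: the eval is the only arbiter of which variants survive.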

To succeed, define precise evals upfront—vague goals like 'make this better' fail to scale. Provide fast feedback loops via simulations; without them, iteration stalls.
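What "precise eval" means in practice: a function that runs a fast simulation and returns one comparable scalar. A sketch under assumed, illustrative numbers (the 10-unit starting stock and one-unit daily demand are not from the video):

```python
def precise_eval(order_policy):
    """A precise, fast eval: fraction of 30 simulated days on which stock
    stayed above zero, given one unit of demand per day (illustrative numbers)."""
    stock, days_in_stock = 10, 0
    for _ in range(30):
        stock -= 1                    # one unit of demand per day
        stock += order_policy(stock)  # the policy under test decides reorders
        if stock > 0:
            days_in_stock += 1
    return days_in_stock / 30

# A vague goal ("make this better") gives the loop nothing to compare;
# a scalar does:
score = precise_eval(lambda stock: 5 if stock <= 2 else 0)
```

Because the simulation is cheap, the loop can score thousands of variants per hour; that speed is what keeps iteration from stalling.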

Key Examples: Restaurant Inventory and Chess Engine

In a 30-day restaurant simulation, the initial algorithm fails over 50% of orders: it reorders one ingredient at a time only after stock hits zero, and 3-5 day lead times compound the delays. AutoResearch optimizes it to order aggressively on day one, group order quantities, and preemptively refill inventory before it depletes, sustaining stock through sales fluctuations.
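The failure mode and the fix can be reproduced in a toy simulator. All numbers here (starting stock, demand range, order quantities) are illustrative assumptions, not the video's code; the structure mirrors the description: daily demand, a lead-time pipeline, and a policy that decides reorders.

```python
from collections import deque
import random

def simulate(policy, days=30, lead_time=4, seed=1):
    """Toy 30-day restaurant sim. Each day demand arrives, in-transit orders
    land after `lead_time` days, and `policy(stock)` returns how much to
    reorder. Returns the order fill rate."""
    rng = random.Random(seed)
    stock, pipeline = 5, deque()
    filled = total = 0
    for day in range(days):
        # receive shipments whose lead time has elapsed
        while pipeline and pipeline[0][0] <= day:
            stock += pipeline.popleft()[1]
        demand = rng.randint(1, 3)        # fluctuating daily sales
        total += demand
        served = min(stock, demand)
        filled += served
        stock -= served
        qty = policy(stock)
        if qty:
            pipeline.append((day + lead_time, qty))
    return filled / total

naive = simulate(lambda s: 1 if s == 0 else 0)      # one unit, only at zero
optimized = simulate(lambda s: 10 if s < 8 else 0)  # grouped, preemptive
```

The naive policy starves because each one-unit order arrives days after the stockout, while the reorder-point policy keeps inventory above the depletion point and fills nearly all demand.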

Refining the eval to maximize working capital (not just stock levels) further improves outcomes: the business accumulates cash by avoiding overstock on slow days, channeling revenue efficiently without depleting funds.
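A sketch of that refined eval, with illustrative prices and no lead time to keep it short (all numbers are assumptions): scoring ending cash instead of stock level penalizes a policy that hoards inventory, because every purchased unit consumes working capital up front.

```python
import random

def cash_eval(policy, days=30, price=5, cost=2, seed=7):
    """Score a reorder policy by ending cash, not stock level.
    Revenue accrues per unit sold; purchases consume cash immediately,
    so overstock on slow days ties up working capital."""
    rng = random.Random(seed)
    stock, cash = 5, 0
    for _ in range(days):
        demand = rng.randint(1, 3)
        sold = min(stock, demand)
        cash += sold * price
        stock -= sold
        qty = policy(stock)
        cash -= qty * cost   # purchasing consumes working capital
        stock += qty         # lead time omitted to keep the sketch short
    return cash

hoard = cash_eval(lambda s: 20 if s < 20 else 0)  # maximize stock on hand
lean = cash_eval(lambda s: 4 if s < 4 else 0)     # order just enough
```

Both policies fill every order here, so revenue is identical; the lean policy scores higher purely because less cash is frozen in unsold inventory at the end of the run.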

For chess, the run starts with a 750 Elo engine; after hours of experiments it reaches 2600 Elo by incrementally refining the scoring function, with progress flatlining until a breakthrough and then compounding.
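To put that gap in perspective, the standard Elo expected-score formula (a well-known formula, not something from the video) shows how lopsided a 750-vs-2600 matchup is:

```python
def elo_expected(ra, rb):
    """Standard Elo expected score for player A (rating ra) vs. player B (rating rb)."""
    return 1 / (1 + 10 ** ((rb - ra) / 400))

# The optimized 2600 engine against the original 750 baseline:
edge = elo_expected(2600, 750)  # expected score per game, very close to 1.0
```

An 1850-point gap means the optimized engine is expected to score essentially every available point against its starting version.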

Implications and Limitations for Software Development

This shifts software development from manual coding to problem definition: articulate goals, evals, and constraints clearly for AI to handle iteration. It fits narrow, simulatable domains with quantifiable metrics, revealing a new paradigm for agent-driven optimization.

Trade-offs: it requires human setup to choose the right metrics and structure, and a poor eval (e.g., one that overemphasizes stock levels) ties up capital. It is not universal: without fast feedback or clear goals it fails, which limits it to tasks like algorithm tuning rather than broad engineering.

Video description
Autoresearch from Andrej Karpathy shows an early picture of how iterative self-improvement can be a unique fit for software development. The use case isn't universal, but it adds a new way of looking at software development and how we approach improving software. Also in this video, I announce the GTC 4080 Super Giveaway! #ai #machinelearning #tech Chapters 00:00 Intro 00:55 Autoresearch 02:08 Simulation 03:57 Software Development 04:53 Giveaway Winner

Summarized by x-ai/grok-4.1-fast via openrouter

4291 input / 1111 output tokens in 10595ms

© 2026 Edge