AutoResearch: AI Self-Optimizes Code via Experiments
AutoResearch lets an AI iteratively improve algorithms by running experiments in a constrained loop, with no human writing code between iterations. In the examples here, it raises a chess engine from 750 to 2600 Elo and repairs a restaurant inventory algorithm that was failing most orders.
AutoResearch Mechanism: Constrained Experimentation Loop
AutoResearch structures AI optimization around three components: a program file defining the target algorithm, a prepare.py file that computes the evaluation metrics, and strict rules forbidding changes outside the target code. The AI runs incremental experiments, keeping only the variants that improve the eval score and discarding the rest. The result is a self-improving loop that needs no human intervention between iterations, unlike vibe coding, which builds features one at a time with manual checks. For instance, the AI can be constrained to modify only the scoring logic using predefined tools, which keeps progress focused.
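The keep-if-better loop described above can be sketched in a few lines. This is a minimal illustration, not AutoResearch's actual code: `evaluate`, `propose`, and the two parameters are hypothetical stand-ins, and a random nudge takes the place of the AI's proposed code change.

```python
import random

def evaluate(params):
    """Hypothetical eval: higher is better, best at threshold=7, batch=12."""
    return -((params["threshold"] - 7) ** 2) - (params["batch"] - 12) ** 2

def propose(params, rng):
    """Stand-in for the AI's edit: nudge one of the allowed parameters by 1."""
    candidate = dict(params)
    key = rng.choice(sorted(candidate))
    candidate[key] += rng.choice([-1, 1])
    return candidate

def run_experiments(initial, n_iters=500, seed=0):
    """Keep a candidate only when it strictly improves the eval score."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    history = [best_score]  # best score after each experiment
    for _ in range(n_iters):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score  # keep the improvement
        # otherwise the candidate is discarded and the loop continues
        history.append(best_score)
    return best, best_score, history

best, best_score, history = run_experiments({"threshold": 0, "batch": 0})
print(best, best_score)
```

Because failed variants are thrown away, the recorded best score can never decrease, which is why the loop can run unattended.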
To succeed, define precise evals upfront; a vague goal like 'make this better' gives the loop nothing to measure progress against. Provide fast feedback, for example through simulation; without it, iteration stalls.
Key Examples: Restaurant Inventory and Chess Engine
In a 30-day restaurant simulation, the initial algorithm fails over 50% of orders: it reorders one ingredient at a time, only once stock hits zero, and the 3-5 day lead time then leaves the kitchen without that ingredient until delivery. AutoResearch optimizes it to order aggressively on day one, batch quantities, and restock before inventory reaches the depletion point, sustaining stock through sales fluctuations.
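A toy version of this simulation shows why the two policies behave so differently. It is a sketch under stated assumptions, not the simulation from the source: a single ingredient, a fixed 4-day lead time (inside the 3-5 day range), daily demand of 8 to 14 units, one order in transit at a time, and hypothetical policy parameters.

```python
import random

LEAD_TIME_DAYS = 4  # assumed fixed value inside the 3-5 day range

def make_demand(days=30, seed=1):
    """Hypothetical daily demand with some fluctuation."""
    rng = random.Random(seed)
    return [rng.randint(8, 14) for _ in range(days)]

def failure_rate(demand, reorder_point, order_qty, start_stock=10):
    """Fraction of demanded units that could not be served."""
    stock = start_stock
    pending = []  # (arrival_day, quantity) for orders in transit
    unmet = 0
    for day, wanted in enumerate(demand):
        # Receive any delivery scheduled for today.
        stock += sum(q for arrival, q in pending if arrival == day)
        pending = [(arrival, q) for arrival, q in pending if arrival > day]
        served = min(stock, wanted)
        unmet += wanted - served
        stock -= served
        # Reorder when stock is at or below the reorder point
        # and nothing is already on its way.
        if stock <= reorder_point and not pending:
            pending.append((day + LEAD_TIME_DAYS, order_qty))
    return unmet / sum(demand)

demand = make_demand()
naive = failure_rate(demand, reorder_point=0, order_qty=10)   # wait for zero
tuned = failure_rate(demand, reorder_point=60, order_qty=80)  # restock early
print(f"naive: {naive:.0%} failed, tuned: {tuned:.0%} failed")
```

Under these assumptions the naive policy can receive at most one small delivery every four days, so most demand goes unserved, while the tuned policy only misses orders during the first lead time.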
Refining the eval to maximize working capital (not just stock levels) further improves outcomes: the business accumulates cash by avoiding overstock on slow days, so revenue is not tied up in unsold inventory.
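The difference between a stock-level eval and a working-capital eval can be illustrated with a small sketch. All numbers here are assumptions chosen for illustration (unit cost 3, unit price 5, starting cash 1000, a fixed 4-day lead time, a repeating demand pattern), and `run_policy` is a hypothetical helper rather than anything from the source.

```python
def run_policy(demand, reorder_point, order_qty, start_stock=10,
               start_cash=1000, unit_cost=3, unit_price=5, lead_time=4):
    """Return (fill_rate, ending_cash) for a simple reorder-point policy."""
    stock, cash = start_stock, start_cash
    pending = []  # (arrival_day, quantity) for orders in transit
    served_total = 0
    for day, wanted in enumerate(demand):
        stock += sum(q for arrival, q in pending if arrival == day)
        pending = [(arrival, q) for arrival, q in pending if arrival > day]
        served = min(stock, wanted)
        stock -= served
        served_total += served
        cash += served * unit_price  # revenue from units sold today
        if stock <= reorder_point and not pending:
            cash -= order_qty * unit_cost  # paid when the order is placed
            pending.append((day + lead_time, order_qty))
    return served_total / sum(demand), cash

demand = [8, 14, 10, 6, 12] * 6  # 30 days with slow and busy days
lean_fill, lean_cash = run_policy(demand, reorder_point=60, order_qty=80)
heavy_fill, heavy_cash = run_policy(demand, reorder_point=60, order_qty=300)
print(lean_fill, lean_cash)
print(heavy_fill, heavy_cash)
```

In this setup both policies serve the same share of demand, so an eval based on stock or fill rate cannot tell them apart; an eval based on ending cash prefers the leaner policy because the heavy one spends money on inventory it never sells.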
For chess, the starting point is a 750 Elo engine; after hours of experiments it reaches 2600 Elo by incrementally refining its scoring. Progress is uneven: the rating flatlines until a breakthrough, after which gains compound.
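The source does not say how the engine's rating was measured. One common way to turn match results into a rating signal, shown here only as an assumed example of a chess eval, is the standard Elo expected-score formula and its inverse; the function names are hypothetical.

```python
import math

def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_gap_from_score(score):
    """Rating difference implied by an average match score in (0, 1)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

# A new variant plays a match against the previous best engine;
# wins count 1, draws 0.5, losses 0.
results = [1, 1, 0.5, 1, 0, 1, 0.5, 1]
gap = elo_gap_from_score(sum(results) / len(results))
print(f"estimated rating gain: {gap:+.0f} Elo")
```

An eval like this gives the loop a single number to maximize, although a short match yields a noisy estimate.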
Implications and Limitations for Software Development
This shifts software development from manual coding to problem definition: the human articulates goals, evals, and constraints clearly, and the AI handles the iteration. The approach fits narrow, simulatable domains with quantifiable metrics, and it points to a new paradigm of agent-driven optimization.
Trade-offs: the approach requires human setup to choose the right metrics and structure, and a poor eval (e.g., one that overemphasizes stock levels) ties up capital. It is not universal: it fails without fast feedback or clear goals, which limits it to tasks like algorithm tuning rather than broad engineering.