AutoResearch: AI Self-Optimizes Code via Experiments

AutoResearch Mechanism: Constrained Experimentation Loop

AutoResearch structures AI optimization around three files: a program file defining the goal algorithm, a prepare.py file for evaluation metrics, and strict rules forbidding changes outside the target code. The AI runs incremental experiments, keeping only the variants that improve the eval score and discarding the rest. The result is a self-improving loop that needs no human intervention between iterations, unlike vibe coding, which builds features sequentially with manual checks. For instance, the AI can be constrained to modify only the scoring logic using predefined tools, which keeps progress focused.
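The keep-or-discard loop described above can be sketched in a few lines. This is a minimal illustration, not AutoResearch's actual code: `run_experiments`, `toy_eval`, and `toy_propose` are hypothetical names, the "algorithm" is reduced to a list of numeric weights, and random perturbation stands in for the AI proposing a code change.

```python
import random
from typing import Callable, List, Tuple

Candidate = List[float]  # hypothetical stand-in for "the target code"


def run_experiments(
    baseline: Candidate,
    evaluate: Callable[[Candidate], float],
    propose: Callable[[Candidate, random.Random], Candidate],
    iterations: int = 200,
    seed: int = 0,
) -> Tuple[Candidate, List[float]]:
    """Greedy loop: keep a variant only if it beats the best eval score so far."""
    rng = random.Random(seed)
    best = list(baseline)
    best_score = evaluate(best)
    history = [best_score]
    for _ in range(iterations):
        variant = propose(best, rng)   # one incremental experiment
        score = evaluate(variant)      # fast feedback from the eval
        if score > best_score:         # improvement: keep it
            best, best_score = variant, score
        # otherwise the variant is simply discarded
        history.append(best_score)
    return best, history


def toy_eval(weights: Candidate) -> float:
    """Assumed eval: higher is better, maximal at a fixed target vector."""
    target = [1.0, -2.0, 0.5]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


def toy_propose(weights: Candidate, rng: random.Random) -> Candidate:
    """Assumed proposal step: nudge one weight by a small random amount."""
    variant = list(weights)
    i = rng.randrange(len(variant))
    variant[i] += rng.uniform(-0.5, 0.5)
    return variant


best_weights, score_history = run_experiments([0.0, 0.0, 0.0], toy_eval, toy_propose)
```

Because failed variants are never kept, `score_history` can only stay flat or rise, which is the property that lets the loop run unattended.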

To succeed, define precise evals upfront. Vague goals like 'make this better' give the AI nothing to measure and fail to scale. Provide fast feedback through simulations as well; without it, iteration stalls.

Key Examples: Restaurant Inventory and Chess Engine

In a 30-day restaurant simulation, the initial algorithm fails over 50% of orders because it reorders one ingredient at a time and only once stock hits zero, so each 3-5 day lead time turns into a stretch of unfilled orders. AutoResearch optimizes it to order aggressively on day one, group quantities, and preemptively refill inventory before it reaches the depletion point, sustaining stock through sales fluctuations.
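To make the difference between the two ordering styles concrete, here is a rough single-ingredient simulation. Everything in it is an assumption chosen for illustration (starting stock of 20, daily demand of 8 to 12 units, order sizes, and the `simulate`, `reactive`, and `preemptive` names); it is not the simulation from the source, and it does not try to reproduce the 50% failure figure.

```python
import random
from typing import Callable, List, Tuple

Policy = Callable[[int, int], int]  # (stock, units on order) -> units to order


def simulate(policy: Policy, days: int = 30, seed: int = 0) -> float:
    """Return the fill rate (units served / units demanded) over the run."""
    demand_rng = random.Random(seed)      # same demand stream for every policy
    lead_rng = random.Random(seed + 1)    # separate stream for lead times
    stock = 20                            # assumed starting stock
    pending: List[Tuple[int, int]] = []   # (arrival day, quantity)
    served = demanded = 0
    for day in range(days):
        # Receive deliveries that arrive this morning.
        stock += sum(qty for arrival, qty in pending if arrival == day)
        pending = [(a, q) for a, q in pending if a != day]
        # Serve as much of today's demand as stock allows.
        demand = demand_rng.randint(8, 12)
        sold = min(stock, demand)
        stock -= sold
        served += sold
        demanded += demand
        # End of day: the policy decides how much to order.
        on_order = sum(qty for _, qty in pending)
        qty = policy(stock, on_order)
        if qty > 0:
            pending.append((day + lead_rng.randint(3, 5), qty))  # 3-5 day lead time
    return served / demanded


def reactive(stock: int, on_order: int) -> int:
    """Baseline: reorder a fixed batch only after stock has hit zero."""
    return 30 if stock == 0 and on_order == 0 else 0


def preemptive(stock: int, on_order: int) -> int:
    """Refill to 100 units whenever stock plus open orders drops below 60."""
    position = stock + on_order
    return 100 - position if position < 60 else 0


reactive_fill = simulate(reactive)
preemptive_fill = simulate(preemptive)
```

Under these assumptions the preemptive policy places a large order on day one and reorders before running dry, so after the first delivery it no longer stocks out, while the reactive policy loses sales during every lead time.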

Refining the eval to maximize working capital (not just stock levels) improves outcomes further: the business accumulates cash by avoiding overstock on slow days, so revenue is not locked up in unsold inventory and funds are not depleted.
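The effect of changing the eval can be shown with a small deterministic sketch. The prices, costs, fixed demand, and the `run_base_stock` helper below are all hypothetical, and ending cash is used as a simple stand-in for working capital. Two policies that look identical under a stock-only metric separate clearly once cash is what gets scored.

```python
from typing import Dict


def run_base_stock(
    order_up_to: int,
    days: int = 30,
    demand: int = 10,          # assumed constant daily demand
    lead_time: int = 3,        # assumed fixed lead time in days
    unit_cost: float = 5.0,    # assumed purchase cost per unit
    unit_price: float = 12.0,  # assumed sale price per unit
    start_cash: float = 1000.0,
    start_stock: int = 40,
) -> Dict[str, float]:
    """Simulate an order-up-to policy and report two candidate eval metrics."""
    stock, cash = start_stock, start_cash
    pending: Dict[int, int] = {}  # arrival day -> quantity
    served = 0
    for day in range(days):
        stock += pending.pop(day, 0)          # receive today's delivery
        sold = min(stock, demand)
        stock -= sold
        served += sold
        cash += sold * unit_price             # revenue comes in on sale
        position = stock + sum(pending.values())
        qty = max(0, order_up_to - position)  # top up to the target level
        if qty:
            arrival = day + lead_time
            pending[arrival] = pending.get(arrival, 0) + qty
            cash -= qty * unit_cost           # cash goes out when ordering
    return {"fill_rate": served / (days * demand), "ending_cash": cash}


lean = run_base_stock(order_up_to=40)
heavy = run_base_stock(order_up_to=200)
```

In this toy setup both policies fill every order (`fill_rate` of 1.0), but the lean policy ends with 3100.0 in cash against 2300.0 for the heavy one, because the heavy policy has paid for 160 extra units that are still sitting in stock or on order. An eval that only rewards stock levels cannot see that difference.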

For chess, the starting point is a 750 Elo engine. After hours of experiments it reaches 2600 Elo by incrementally refining its scoring logic: the rating flatlines for stretches, then jumps at each breakthrough, and the gains compound.

Implications and Limitations for Software Development

This shifts software development from manual coding to problem definition: articulate the goals, evals, and constraints clearly, and let the AI handle the iteration. The approach fits narrow, simulatable domains with quantifiable metrics, and it points to a new paradigm of agent-driven optimization.

Trade-offs: the approach requires human setup to choose the right metrics and structure, and poor evals (e.g., overemphasizing stock) tie up capital. It is not universal. It fails without fast feedback or clear goals, which limits it to tasks like algorithm tuning rather than broad engineering.