Auto Research: AI Runs Endless Experiments Overnight
Karpathy's Auto Research pattern lets AI agents autonomously optimize code, prompts, or copy by iterating changes, testing against a score, and keeping winners. Shopify got 53% faster Liquid code after 120 runs; a prompt demo went from 7/15 to a perfect 15/15 for 24¢.
Implement the Auto Research Loop for Non-Stop Optimization
The core pattern automates trial-and-error: AI reads the target (code, prompt, copy), proposes one small change, runs a test via API/CLI/file, scores it numerically (e.g., accuracy, speed, reply rate), commits improvements, reverts failures, logs everything, and repeats indefinitely: "Never stop. The human might be asleep." Requires three elements: (1) an editable artifact, (2) a trackable numeric metric, (3) a fast test mechanism (ideally <30s, enabling 100+ overnight runs). Karpathy's repo (42k+ GitHub stars) uses three files: prepare.py (setup/tokenizer), train.py (editable code), and program.md (agent instructions). The pattern outperforms manual work because agents run 50-500 iterations without fatigue; Karpathy's 2-day run on a small LLM found 20 stacking improvements, including a months-old bug in his attention mechanism.
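In code, the loop is tiny. A minimal sketch, assuming a hypothetical eval.py that prints a score like "7/15" and agent-supplied callables (propose_edit, apply_edit, and revert_edit are stand-ins for the agent's actions, not files from Karpathy's repo):

```python
import json
import subprocess
import time

def run_eval() -> float:
    """Run the test harness and parse its score (assumes eval.py prints e.g. "7/15")."""
    out = subprocess.run(["python", "eval.py"], capture_output=True, text=True)
    passed, total = out.stdout.strip().split("/")
    return int(passed) / int(total)

def auto_research(propose_edit, apply_edit, revert_edit, log_path="log.jsonl"):
    """Keep winners, revert failures, log everything, never stop."""
    best = run_eval()  # baseline score before any changes
    while True:  # "Never stop. The human might be asleep."
        change = propose_edit(best)   # agent proposes one small change
        apply_edit(change)            # apply it to the editable artifact
        score = run_eval()
        kept = score > best
        if kept:
            best = score              # commit the improvement
        else:
            revert_edit(change)       # revert the failure
        with open(log_path, "a") as f:  # append-only experiment log
            f.write(json.dumps({"ts": time.time(), "change": change,
                                "score": score, "best": best,
                                "kept": kept}) + "\n")
```

The only moving parts are the score and the keep/revert decision; everything else is logging.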
Production Wins: Shopify's 20-Year Codebase Transformed
CEO Tobi Lütke applied the pattern to Shopify's 20-year-old Liquid template engine (which powers every store), running 120 experiments over 2 days and yielding 53% faster execution and 61% fewer memory allocations on code already hand-optimized for decades; some ideas were "amazing," though possibly overfit. The pattern generalizes beyond ML training: cold emails (reply rates from 2% to 8-12% via Instantly/SmartLead APIs), landing pages (conversion rates via Webflow/Framer APIs), ad creatives (CTR/CPA via Meta/Google Ads), code performance (execution time). The agent deploys variations, baselines them against current winners, and scales the top performers: your competitor's 30 manual landing page tests per year becomes your 30 per day.
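The same loop works on any channel where a number comes back over HTTP. A minimal scoring sketch, assuming a hypothetical campaign API (the URL, paths, and field names here are invented for illustration; the real Instantly/SmartLead endpoints differ):

```python
import time
import requests  # generic HTTP client; the vendor API below is a placeholder

API = "https://api.example.com/v1"  # hypothetical endpoint, not a real vendor URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def score_variant(campaign_id: str, wait_hours: float = 24.0) -> float:
    """Launch one variant and return its reply rate as the loop's numeric metric."""
    requests.post(f"{API}/campaigns/{campaign_id}/start", headers=HEADERS, timeout=30)
    time.sleep(wait_hours * 3600)  # marketing metrics need real send time
    stats = requests.get(f"{API}/campaigns/{campaign_id}/stats",
                         headers=HEADERS, timeout=30).json()
    return stats["replies"] / max(stats["sent"], 1)  # reply rate in [0, 1]
```

Note the built-in wait: metrics that take days to accumulate drag the loop, which is exactly the fast-feedback trade-off below.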
Prompt Demo: 7/15 to Perfect in Minutes for 24¢
Replicate in Cursor/Claude Code (no GPU needed): clone Karpathy's repo for context, then instruct the agent to adapt the loop for prompt.md (a mediocre starter prompt that extracts JSON from emails: name, email, service, budget, etc.) against eval.py (15 test cases with tricks like the budget phrasing "10 to 12,000," ambiguous services, and informal names). Baseline: 7/15, failing on null websites, titles in names, budget ranges, and urgency. The agent hypothesizes (e.g., "existing website must be true/false, never null"), edits, and re-evals: Experiment 1 (10/15, kept), 2 (12/15), 4 (14/15), 5 (15/15). The full log tracks hypotheses, before/after scores, and status. Costs 24¢ via the Anthropic API; scales to chatbot scripts, email subject lines, and voice prompts. Trade-offs: needs fast feedback (slow tests, like weekly data, drag the loop); optimizes tactics (copy/targeting), not strategy (markets); requires API access for changes/tests and a numeric score rather than vibes.
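A minimal sketch of what an eval harness like the demo's might look like; the test case, expected fields, and model name here are illustrative assumptions, and the real eval.py has 15 cases:

```python
import json
from anthropic import Anthropic  # the demo scores via the Anthropic API

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One illustrative case; the real harness has 15, with tricky budgets and names.
CASES = [
    {"email": "Hi, Dr. Ana Ruiz here. Need a site redesign, budget 10 to 12,000.",
     "expected": {"name": "Ana Ruiz", "service": "redesign",
                  "budget_max": 12000, "existing_website": True}},
]

def extract(prompt: str, email: str) -> dict:
    """Run the current prompt.md over one email and parse the JSON reply."""
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # model choice is an assumption
        max_tokens=500,
        messages=[{"role": "user", "content": f"{prompt}\n\nEmail:\n{email}"}],
    )
    return json.loads(msg.content[0].text)  # assumes the prompt forces raw JSON output

if __name__ == "__main__":
    prompt = open("prompt.md").read()
    passed = sum(extract(prompt, c["email"]) == c["expected"] for c in CASES)
    print(f"{passed}/{len(CASES)}")  # the outer loop parses this line as the score
```

Exact-match scoring keeps the metric unambiguous, so the loop can parse a single "passed/total" line and decide keep-or-revert without judgment calls.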