Auto Research: AI Runs Endless Experiments Overnight

Karpathy's Auto Research pattern lets AI agents autonomously optimize code, prompts, or copy by iterating changes, testing against a numeric score, and keeping winners. Shopify got 53% faster Liquid code after 120 runs, and a prompt demo more than doubled accuracy from 7/15 to 15/15 for 24¢.

Implement the Auto Research Loop for Non-Stop Optimization

The core pattern automates trial-and-error: the AI reads the target (code, prompt, copy), proposes one small change, runs a test via API/CLI/file, scores it numerically (e.g., accuracy, speed, reply rate), commits improvements, reverts failures, logs everything, and repeats indefinitely: "Never stop. The human might be asleep." It requires three elements: (1) an editable artifact, (2) a trackable numeric metric, and (3) a fast test mechanism (ideally under 30s, to fit 100+ overnight runs). Karpathy's repo (42k+ GitHub stars) uses three files: prepare.py (setup/tokenizer), train.py (the editable code), and program.md (agent instructions). It outperforms manual work because agents run 50-500 iterations without fatigue; Karpathy's 2-day run on a small LLM found 20 stacking improvements, including a months-old bug in his attention mechanism.
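The loop described above can be sketched in a few lines of Python. This is a minimal toy, not Karpathy's implementation: the "artifact" is a dict of numeric knobs and `score` is a synthetic metric, standing in for editable code plus a real eval harness.

```python
import random

def score(params):
    """Toy numeric metric: higher is better (peaks at x=3, y=-1)."""
    return -((params["x"] - 3) ** 2 + (params["y"] + 1) ** 2)

def propose_change(params, rng):
    """Mutate one knob slightly, as the agent would edit one small thing."""
    candidate = dict(params)
    key = rng.choice(list(candidate))
    candidate[key] += rng.uniform(-0.5, 0.5)
    return candidate

def auto_research(params, iterations=200, seed=0):
    rng = random.Random(seed)
    best = score(params)
    log = []
    for i in range(iterations):
        candidate = propose_change(params, rng)
        s = score(candidate)
        kept = s > best
        if kept:                     # commit improvements...
            params, best = candidate, s
        log.append((i, s, kept))     # ...log everything, revert failures
    return params, best, log

params, best, log = auto_research({"x": 0.0, "y": 0.0})
print(f"best score after {len(log)} runs: {best:.3f}")
```

In a real setup, `propose_change` would be an LLM editing a file and `score` would shell out to a test command; the keep/revert/log skeleton is the part that stays the same.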

Production Wins: Shopify's 20-Year Codebase Transformed

Applied to Shopify's 20-year-old Liquid template engine (which powers all Shopify stores), CEO Tobi Lütke ran 120 experiments over 2 days, yielding 53% faster execution and 61% fewer memory allocations on code already hand-optimized for decades; some ideas were "amazing," though possibly overfit. The pattern generalizes beyond ML training: cold emails (reply rates from 2% to 8-12% via Instantly/SmartLead APIs), landing pages (conversion rates via Webflow/Framer APIs), ad creatives (CTR/CPA via Meta/Google Ads), and code performance (execution time). The agent deploys variations, baselines against winners, and scales top performers: your competitor's 30 manual landing-page tests per year become your 30 per day.
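The deploy-variations-and-baseline step can be sketched as follows. `fetch_reply_rate` is a hypothetical stand-in for a real email-tool API call (the pattern names Instantly/SmartLead); here it is a deterministic stub so the winner-selection logic actually runs.

```python
def fetch_reply_rate(variant: str) -> float:
    # Stub for an API metric lookup; pretends shorter copy replies better.
    return max(0.0, 0.12 - 0.001 * len(variant))

def pick_winner(variants):
    """Score every deployed variant, return the new baseline to beat."""
    scored = {v: fetch_reply_rate(v) for v in variants}
    baseline = max(scored, key=scored.get)
    return baseline, scored

winner, scores = pick_winner([
    "Quick question about your Q3 pipeline",
    "Hi",
    "A 2-minute idea for your team",
])
print(winner, scores[winner])
```

The agent's job on top of this is generating the next batch of variants from the current winner, which is what turns one-off A/B testing into a continuous loop.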

Prompt Demo: 7/15 to Perfect in Minutes for 24¢

Replicate it in Cursor/Claude Code (no GPU needed): clone Karpathy's repo for context, then instruct the agent to adapt the loop for prompt.md (a mediocre starter prompt that extracts JSON from emails: name, email, service, budget, etc.) against eval.py (15 test cases with tricks like worded budgets such as "10 to 12,000," ambiguous services, and informal names). Baseline: 7/15, failing on null websites, name titles, budget ranges, and urgency. The agent hypothesizes (e.g., "existing website must be true/false, never null"), edits, and re-evals: Experiment 1 (10/15, kept), 2 (12/15), 4 (14/15), 5 (15/15). The full log tracks hypotheses, before/after scores, and status. The run cost 24¢ via the Anthropic API and scales to chatbot scripts, email subjects, and voice prompts. Trade-offs: it needs fast feedback (slow tests like weekly data drag the loop); it optimizes tactics (copy/targeting), not strategy (markets); and it requires API access for changes/tests plus a numeric score rather than vibes.
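An eval.py-style harness for this demo might look like the sketch below. The model call is stubbed: `extract` is a hypothetical placeholder returning canned output (a real harness would send prompt.md plus the email to the Anthropic API and parse the JSON reply), and the two test cases are illustrative, not the video's actual 15.

```python
# Each case pairs an input email with the exact JSON the prompt must produce.
TEST_CASES = [
    {"email": "Hi, I'm Dana. Budget is 10 to 12,000 for a site.",
     "expected": {"name": "Dana", "budget": "10 to 12,000"}},
    {"email": "Dr. Lee here, need branding work.",
     "expected": {"name": "Dr. Lee", "budget": None}},
]

def extract(email: str) -> dict:
    # Placeholder for the LLM's structured output.
    if "Dana" in email:
        return {"name": "Dana", "budget": "10 to 12,000"}
    return {"name": "Lee", "budget": None}  # drops the title: a failure case

def evaluate(cases):
    """Exact-match scoring gives the loop its numeric signal."""
    passed = sum(extract(c["email"]) == c["expected"] for c in cases)
    return passed, len(cases)

passed, total = evaluate(TEST_CASES)
print(f"{passed}/{total}")
```

Exact-match against expected JSON is what makes the score unambiguous: the agent can only raise it by genuinely fixing edge cases like titles and worded budget ranges.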

Video description
🤖 Transform your business with AI: https://salesdone.ai
📚 We help entrepreneurs & industry experts build & scale their AI Agency: https://www.skool.com/theaiaccelerator/about
🤚 Join the best community for AI entrepreneurs and connect with 16,000+ members: https://www.skool.com/systems-to-scale-9517/about
Sign up to our weekly AI newsletter - https://ai-core.beehiiv.com/
🙋 Connect With Me!
Instagram - / nicholas.puru
X - https://x.com/NicholasPuru
LinkedIn - https://www.linkedin.com/in/nicholas-puruczky-113818198/

0:00 - Karpathy's Auto Research explained
2:07 - Inside the GitHub repo
3:20 - Shopify's results: 53% faster
4:18 - The loop visualized
5:00 - Use cases: email, landing pages, ads, code
7:14 - Live demo: prompt optimization
12:27 - Baseline score: 7 out of 15
14:29 - Autonomous loop running
16:28 - Final score: 15/15 for $0.24
17:05 - Why this pattern matters

Summarized by x-ai/grok-4.1-fast via openrouter

© 2026 Edge