AutoAgent Optimizes Harnesses Like Karpathy's Auto-Research

AutoAgent extends Karpathy's auto-research loop (edit code, run five-minute evals, keep improvements) from ML training to agent harnesses (prompts and tools) via a meta-agent, yielding domain-specific agents overnight on benchmarks like SpreadsheetBench.

Core Self-Improvement Loop: Edit, Eval, Iterate Overnight

Karpathy's auto-research uses a simple setup with one GPU and 5-minute training runs: data prep and the tokenizer (prep.py) stay fixed, while an agent edits the training code (train.py) for the model, training loop, and hyperparameters, then evaluates per the human instructions in program.md. If metrics improve, the change is committed; otherwise it is reverted. Humans "program in natural language" via program.md, and the agent handles the code. Run overnight, the loop produces real gains without manual coding.
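The edit-eval-commit loop described above can be sketched in a few lines. This is an illustrative stand-in, not the actual auto-research code: a hyperparameter dict stands in for editing train.py, and a toy metric stands in for a 5-minute training run.

```python
import random

def auto_research_loop(state, propose_edit, evaluate, iterations):
    """Karpathy-style loop: propose an edit, run a short eval,
    commit the change only if the metric improves, otherwise revert."""
    best_metric = evaluate(state)
    for _ in range(iterations):
        candidate = propose_edit(state)   # agent edits train.py (here: a hyperparameter dict)
        metric = evaluate(candidate)      # stands in for a 5-minute training run + eval
        if metric > best_metric:
            state, best_metric = candidate, metric  # commit
        # else: revert by keeping the previous state
    return state, best_metric

# Toy demo: "training" is maximizing -(lr - 3)^2, so the loop should push lr toward 3.
random.seed(0)
start = {"lr": 0.0}
evaluate = lambda s: -(s["lr"] - 3.0) ** 2
propose = lambda s: {"lr": s["lr"] + random.uniform(-0.5, 0.5)}
best, score = auto_research_loop(start, propose, evaluate, iterations=200)
```

The greedy keep/revert rule is the whole trick: no gradient through the agent's edits, just accept whatever measurably helps.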

AutoAgent applies the identical loop to agent harnesses instead of ML training: a meta-agent edits the task agent's prompts, tools, and orchestration (agent.py), runs evals on benchmarks via adapters, and commits improvements based on results and reasoning traces. It starts with a minimal bash tool and discovers domain-specific logic autonomously.

Architecture Enables Parallel, Domain-Agnostic Optimization

The system splits into a meta-agent (orchestrates iterations, spins up thousands of parallel sandboxes) and a task agent (executes domain tasks). It connects to any benchmark (e.g., SpreadsheetBench, TerminalBench) for verification and reuses the same files as auto-research: program.md for human guidance on goals and things to avoid, agent.py as the editable target.
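The source doesn't specify the adapter's API, but "connects to any benchmark" suggests a small interface like the following hypothetical sketch: a benchmark only needs to hand out tasks and verify outputs.

```python
from typing import Callable, Protocol

class BenchmarkAdapter(Protocol):
    """Hypothetical adapter interface: any benchmark that can list tasks
    and verify a task agent's output can plug into the loop."""
    def tasks(self) -> list[str]: ...
    def verify(self, task: str, output: str) -> bool: ...

def benchmark_score(adapter: BenchmarkAdapter, run_task: Callable[[str], str]) -> float:
    """Fraction of benchmark tasks the current harness solves."""
    results = [adapter.verify(t, run_task(t)) for t in adapter.tasks()]
    return sum(results) / len(results)

# Toy adapter standing in for SpreadsheetBench/TerminalBench.
class EchoBench:
    def tasks(self):
        return ["echo hello", "echo world"]
    def verify(self, task, output):
        return output == task.removeprefix("echo ")

score = benchmark_score(EchoBench(), run_task=lambda t: t.split(" ", 1)[1])
```

Because verification lives behind the adapter, the same meta-agent loop is domain-agnostic: swapping benchmarks swaps the reward signal without touching the loop.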

The simplicity mirrors Karpathy's: no complex infrastructure is needed. After the sandbox runs, the meta-agent reads traces and results, decides what to keep or revert, and builds specialized tooling, verification, and orchestration that nobody coded manually.
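The fan-out over sandboxes can be sketched minimally; here threads stand in for the isolated sandboxes (a real system would use containers or VMs), and the ranked results feed the keep/revert decision.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_candidates(candidates, evaluate, max_workers=8):
    """Score many candidate harnesses in parallel; threads stand in for
    the isolated sandboxes the meta-agent spins up. Returns (score,
    candidate) pairs, best first, for the keep/revert decision."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(evaluate, candidates))
    return sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)

# Toy demo: rank three candidate "harnesses" by a made-up score.
ranked = evaluate_candidates([1, 4, 2], evaluate=lambda x: x * x)
```

Parallelism matters because each eval is cheap but noisy; running many candidates at once lets the meta-agent compare them within a single overnight budget.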

Benchmark Gains and Harness Engineering Trade-offs

On SpreadsheetBench and TerminalBench, successive iterations show the harness improving: better prompts and tools yield higher scores, compounding overnight. This enables cheaper, specialized agents per domain or workflow instead of monolithic harnesses.

Harness optimization is critical because domains need tailored prompts and tools (e.g., spreadsheets vs. terminals), which demands both domain and model expertise. Companies benefit from stack-specific harnesses that can run smaller models. The projected future: domain experts write program.md and meta-agents auto-engineer the harness, much as AI now writes code; define success, and get an optimized setup back in 24 hours.
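The real program.md is linked in the video description below; a hypothetical fragment in its spirit, written by a domain expert rather than an engineer, might look like:

```markdown
# Goal
Raise agent.py's SpreadsheetBench score on each iteration.

# Guidance
- You may rewrite prompts, add tools, and change orchestration in agent.py.
- Keep each eval run short; commit only changes that improve the score.

# Avoid
- Hardcoding benchmark answers or peeking at verification data.
```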

Video description
Auto Agent: Self-Improving AI Harnesses Inspired by Karpathy's Auto-Research Loop

The video explains self-improving agents and highlights Kevin Guo's Auto Agent project as an extension of Andrej Karpathy's auto-research idea. Auto-research lets an AI agent iteratively edit training code (e.g., train.py) under a small LLM training setup, run short trainings, evaluate results, and keep or discard changes based on improvement, guided by human-written instructions in program.md. Auto Agent applies the same loop to a different target: optimizing the agent harness itself (prompts, tools, orchestration) rather than ML training code. It uses a meta-agent and a task agent, connects to benchmarks via an adapter, and runs many parallel sandboxes to evaluate iterations using results and reasoning traces. Examples include SpreadsheetBench and TerminalBench, illustrating harness improvements and the broader implications for domain-specific workflows and cheaper, specialized agent setups.

Links:
https://x.com/karpathy/status/2030371219518931079
https://github.com/karpathy/autoresearch
https://x.com/kevingu/status/2039843234760073341
https://github.com/kevinrgu/autoagent/blob/main/program.md

Chapters:
00:00 Self Improving Agents
00:33 Auto Research Recap
01:25 Why Simplicity Worked
02:22 Auto Agent Architecture
03:20 Benchmarks And Results
03:52 Why Harness Optimization Matters
04:36 Future Of Meta Agents
05:01 Wrap Up

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge