Meta Harness: AI Evolves Its Own Code for 6x Gains

Meta Harness automates harness engineering with a coding agent that proposes, tests, and logs self-improving code wrappers around LLMs, beating human-designed baselines by 10+ points on benchmarks while using 10x fewer evaluations.

Harnesses Unlock LLM Potential Beyond Weights Alone

LLM performance hinges as much on the surrounding harness—the code managing memory, retrieval, tool use, and state—as on model weights themselves. Changing the harness around a fixed LLM creates a 6x performance gap on benchmarks, turning raw next-token prediction into agentic capabilities like long-running code execution in tools such as Cursor or Claude Code. Models like Claude 3.5 Sonnet or GPT-4o are already AGI-capable engines; effective harnesses provide the steering wheel, seats, and power delivery to reach destinations reliably. Manual harness engineering by humans limits scaling, as complexity spans long horizons where early retrieval or storage decisions impact distant reasoning steps.
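To make the idea concrete, here is a minimal sketch of what "a harness around a fixed LLM" means. Everything here is an illustrative assumption, not the paper's code: `fake_llm` stands in for any fixed-model API call, and the word-overlap retrieval is a deliberately naive placeholder for real memory management.

```python
from dataclasses import dataclass, field

def fake_llm(prompt: str) -> str:
    """Stub standing in for a call to any fixed LLM API."""
    return f"answer based on {len(prompt)} chars of context"

@dataclass
class Harness:
    memory: list[str] = field(default_factory=list)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Naive relevance: rank stored notes by word overlap with the query.
        def overlap(note: str) -> int:
            return len(set(note.split()) & set(query.split()))
        return sorted(self.memory, key=overlap, reverse=True)[:k]

    def run(self, task: str) -> str:
        # Retrieval, prompting, and state updates are the harness's job;
        # the model itself stays fixed.
        context = "\n".join(self.retrieve(task))
        answer = fake_llm(f"{context}\n\nTask: {task}")
        self.memory.append(f"{task} -> {answer}")  # state visible to later tasks
        return answer

h = Harness(memory=["patent classes follow CPC codes", "symptoms map to diseases"])
print(h.run("classify this patent abstract"))
```

The point of the sketch is that every line outside `fake_llm` is a design decision the model weights never see, which is exactly the surface Meta Harness optimizes.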

Prior text optimizers like MCE (meta-context engineering, which curates skill libraries) or ACE (agentic context engineering, reflective learning) fail here due to short-horizon feedback, scalar scores (e.g., 0-1), and compressed summaries that lose failure traces. These methods cram 100-30k tokens into context, discarding signals from million-token harness runs. Adaptive retrieval—letting the model select relevant memory rather than consuming one monolithic prompt—proves superior, as seen in RAG, memory-augmented agents, and executable code search.
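The contrast between monolithic context and adaptive retrieval can be shown with a toy example (an assumption for illustration, not the paper's code): concatenating every stored trace grows without bound, while selecting only relevant entries keeps the context small no matter how much history accumulates.

```python
# Simulated run history: one short trace per prior execution.
traces = [f"trace {i}: failure in tool call {i}" for i in range(1000)]

def monolithic_context() -> str:
    # Cram everything into the prompt; size grows with every run.
    return "\n".join(traces)

def adaptive_context(query: str, k: int = 5) -> str:
    # Select only entries matching the query; size is bounded by k.
    relevant = [t for t in traces if query in t]
    return "\n".join(relevant[:k])

print(len(monolithic_context()), len(adaptive_context("tool call 42")))
```

Substring matching here is a stand-in for whatever relevance signal a real system uses; the structural point is the bounded context size.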

Self-Improving Loop with Coding Agent Proposer

Meta Harness introduces an outer optimization loop using a single coding agent (e.g., Claude 3 Opus via Claude Code) as proposer, with unrestricted filesystem access to prior harness artifacts. The loop: (1) Proposer inspects code, scores, execution traces (prompts, tool calls, outputs, state updates) from past directories; (2) Diagnoses failures and proposes edits or rewrites; (3) New harness evaluates on tasks; (4) Logs results for next iteration. This avoids context limits by using tools like grep/cat for targeted retrieval, not full ingestion—crucial as 10 iterations yield 10M+ tokens.
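The four-step loop above can be sketched as follows. This is a hedged outline under stated assumptions: the directory layout, the `coding_agent` and `evaluate` stubs, and the file names are all hypothetical, standing in for a real coding agent that inspects prior artifacts with shell tools and a real task-evaluation pipeline.

```python
import json
import pathlib
import tempfile

def coding_agent(history_dir: pathlib.Path) -> str:
    """Stub proposer: the real system has a coding agent inspect prior
    harness code, scores, and traces (via grep/cat) and return revised
    harness source; here we only count prior logged iterations."""
    prior = sorted(history_dir.glob("iter_*/result.json"))
    return f"# harness version {len(prior) + 1}\n"

def evaluate(harness_src: str) -> float:
    """Stub evaluator: would run the proposed harness on validation tasks."""
    return float(len(harness_src))

root = pathlib.Path(tempfile.mkdtemp())
for i in range(3):                                # fixed iteration budget
    src = coding_agent(root)                      # (1)+(2) inspect and propose
    score = evaluate(src)                         # (3) evaluate on tasks
    out = root / f"iter_{i}"                      # (4) log artifacts so the
    out.mkdir()                                   #     next iteration can read them
    (out / "harness.py").write_text(src)
    (out / "result.json").write_text(json.dumps({"score": score}))
```

Because each iteration only reads the logged files it needs, the loop's context cost stays flat even as the total artifact history grows into millions of tokens.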

Unlike fixed scaffolds or archives, the minimal design delegates all decisions to the agent, enabling recursive improvement: better core LLMs enhance the proposer, which in turn refines target harnesses faster. It searches domain-specific strategies (prompts, retrieval, state updates) without hand-coded heuristics, inspecting even low performers to escape local maxima. The loop runs for a fixed number of iterations and ends with a final test-set evaluation; performance scales with agent capability, with no human curation needed.

Superior Results and Generalization Across Tasks

On online text classification (USPTO patents, Symptoms2Disease medical, Law benchmarks), Meta Harness achieves a median accuracy of 50 (best 56.7), surpassing the state of the art (best 45.6, median 39.1) and text optimizers like OpenEvolve by 10+ points, while using 10x fewer full evaluations and 11.4k tokens versus 50.8k for rivals—cheaper and unbiased by human preconceptions. It beats ACE (40.9 average) on Law (45 vs. 29) and on Symptoms2Disease (by 4 points). Generalizing to 9 unseen datasets, it leads by 3 points (73.1 vs. ACE's 70.2) at moderate cost.

For retrieval-augmented math (IMO-level problems), its discovered strategy gains 4.7 points averaged across 5 held-out models by adaptively reusing proof patterns. On TerminalBench-2 (89 long-horizon terminal tasks), it hits 76.4 with Opus 3.5 (topping all but one handwritten harness) and 37.6 with Haiku 3.5 (beating #2 Goose at 35.5), validating the approach on a public agentic coding benchmark.

This echoes the Bitter Lesson: end-to-end learning trumps human heuristics, as in AlphaEvolve's matrix-multiplication breakthrough or Tesla FSD's shift to pure neural nets. Self-evolving harnesses signal a future in which all software becomes autonomous, extrapolating to code libraries that improve themselves overnight without human intervention.

Link to paper: https://yoonholee.com/meta-harness/

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge