Harness Beats Model: 6x Agent Performance Gap

Stanford and Tsinghua results show that agent orchestration (the harness) drives up to a 6x performance variation on the same model; optimize the harness via subtraction and natural language before switching models.

Harness: OS for LLMs, Driving 6x Performance

A harness turns a raw LLM (the inert CPU) into an agent by managing context (RAM), databases (disk), tools (drivers), and the loop of actions, observations, and iteration. It structures nine components, such as a runtime charter (state, contracts, sub-agents) and control logic. The same model with a different harness yields up to a 6x performance gap, as seen when running complex prompts in Claude Code vs. Cursor: different reasoning paths, token spend, and success rates. Optimize the harness first; model choice is secondary.
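The CPU/RAM/drivers analogy above can be sketched as a minimal agent loop. This is an illustrative toy, not any real framework's API: `run_agent`, `calc_tool`, and `stub_model` are invented names, and the stub stands in for the LLM.

```python
# Minimal sketch of a harness loop: context as RAM, a tool as a driver,
# and an action/observation/iteration loop around a (stubbed) model.

def calc_tool(expr: str) -> str:
    """A 'driver': one tool the harness exposes to the model."""
    return str(eval(expr))  # toy only; never eval untrusted input

def stub_model(context: list[str]) -> str:
    """Stands in for the LLM 'CPU': picks the next action from context."""
    if not any(m.startswith("OBSERVATION") for m in context):
        return "CALL calc_tool 2+3"
    return "FINAL 5"

def run_agent(task: str, max_steps: int = 5) -> str:
    context = [f"TASK: {task}"]          # context window = working RAM
    for _ in range(max_steps):           # the action/observation loop
        action = stub_model(context)
        if action.startswith("FINAL"):
            return action.split(" ", 1)[1]
        _, tool, arg = action.split(" ", 2)
        context.append(f"OBSERVATION: {calc_tool(arg)}")
    return "gave up"

print(run_agent("what is 2+3?"))  # → 5
```

The point of the sketch: everything outside `stub_model` is the harness, and that is where the 6x variation lives.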

Tsinghua Ablations: Subtract to Win, Natural Language Boosts

Tsinghua (Pan et al., March 2024) ablated harnesses on SWE-Bench (GPT-4o, max reasoning): the full harness hit 74-76% success but spent 16.3M tokens per sample (600+ tool calls, 32+ min); a stripped version used 1.2M tokens (51 calls, under 7 min), roughly 14x less compute for identical results. Key findings: self-evolution helped consistently; verifiers hurt (-0.8 on SWE-Bench, -8.4 on OSWorld); multi-candidate search hurt (-5.6). Migrating OSWorld desktop automation from code to a structured natural-language harness raised success from 30.4% to 47.2% (+16.8 pts), cut runtime from 361 to 41 min, and cut tool calls from 1,200 to 34. Natural language also enables isolated testing and component swaps for clean experiments.
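The subtract-to-win methodology amounts to toggling harness components off one at a time and comparing score against cost. A hypothetical sketch (the component names mirror the paper's findings, but `evaluate` and its numbers are invented stubs, not the paper's measurements):

```python
# Toy ablation harness: remove one component per run and compare.
def evaluate(components: frozenset) -> tuple[float, int]:
    """Stub benchmark returning (success_rate, token_cost) for a config.
    Effects are illustrative only: self-evolution helps, verifiers and
    multi-candidate search cost tokens without improving the score."""
    score, tokens = 0.70, 1_000_000
    if "self_evolution" in components:
        score += 0.05
    if "verifier" in components:
        score -= 0.01
        tokens += 4_000_000
    if "multi_candidate" in components:
        score -= 0.056
        tokens += 8_000_000
    return round(score, 3), tokens

full = frozenset({"self_evolution", "verifier", "multi_candidate"})
for comp in sorted(full):
    ablated = full - {comp}
    print(f"without {comp}: score={evaluate(ablated)[0]}, "
          f"tokens={evaluate(ablated)[1]:,}")
```

Any component whose removal keeps (or raises) the score while cutting tokens is a pruning candidate.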

Stanford Auto-Optimization: Transferable Across Models

Omar Khattab (DSPy creator, Stanford) auto-optimized harnesses via an LLM (Claude 3 Opus): it analyzes raw failure traces, not summaries (summaries dropped accuracy from 50% to 34.9%), then rewrites the full harness (structured retrieval, memory, topology). The process scaled to 10M tokens per iteration, 400x more feedback, and 82 files rewritten per round. Results: #2 on TerminalBench (76.4%, auto-optimized beating hand-crafted); #1 on 215-text classification (+7.7 pts over SOTA, 4x fewer tokens, with Haiku beating larger models via the harness). Harnesses transfer: one optimized on Opus boosted five other models. Raw traces are irreplaceable; the details drive the gains.
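The optimizer loop described above can be sketched as: run a benchmark, collect full failure traces, hand them (unsummarized) to an optimizer model that rewrites the harness, and repeat. Everything here is a hypothetical stub under that assumption; `run_benchmark` and `optimizer_llm` are invented names, not DSPy APIs.

```python
# Sketch of trace-driven harness optimization. The optimizer consumes
# raw traces, not summaries (the talk reports summarizing dropped
# accuracy from 50% to 34.9%).

def run_benchmark(harness: str) -> tuple[float, list[str]]:
    """Stub: returns (score, raw failure traces) for a harness text."""
    score = 0.5 + 0.1 * harness.count("rule")   # toy scoring
    traces = [f"step {i}: tool call failed" for i in range(3)]
    return min(score, 0.9), traces

def optimizer_llm(harness: str, traces: list[str]) -> str:
    """Stands in for the optimizer model: reads every raw trace line
    and emits a rewritten harness."""
    assert all(isinstance(t, str) for t in traces)  # full traces in
    return harness + "\nrule: avoid the failure seen in traces"

harness = "base prompt"
for _ in range(4):                     # iterative rewrite rounds
    score, traces = run_benchmark(harness)
    harness = optimizer_llm(harness, traces)
print(run_benchmark(harness)[0])
```

The transferability claim then corresponds to reusing the final `harness` text with a different underlying model.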

Subtraction Principle + Audit: Prune, Don't Add

As models advance (e.g., Opus 4.6 dropped context resets), the assumptions baked into harness components expire, so prune ruthlessly (Manus rewrote its harness 5x in 6 months; Warel cut 80% of its tools and improved). Before swapping models, builders should audit with four questions: (1) Can unnecessary context be trimmed? (2) Can rarely used tools be dropped? (3) Should verifiers or search loops that hurt scores be removed? (4) Can control logic be rewritten in natural language (+17 pts potential)? Mature engineering is a craft of subtraction; simpler beats complex.
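The four audit questions can be kept as a pre-swap checklist. The question list comes from the talk; the code shape (`AUDIT`, `audit`) is purely illustrative.

```python
# Hypothetical pre-model-swap audit checklist built from the four
# questions; returns whichever items are still unaddressed.

AUDIT = [
    ("trim_context", "Can unnecessary context window be trimmed?"),
    ("drop_tools", "Can rarely used tools be dropped?"),
    ("remove_verifiers", "Should hurting verifiers/search loops go?"),
    ("natural_language", "Can control logic move to natural language?"),
]

def audit(answers: dict[str, bool]) -> list[str]:
    """Return the audit questions not yet addressed (missing or False)."""
    return [q for key, q in AUDIT if not answers.get(key, False)]

todo = audit({"trim_context": True, "drop_tools": False})
print(len(todo))  # → 3
```

Running the audit to an empty `todo` list before touching model choice is the subtraction-first discipline the section argues for.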

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge