Harness Engineering Delivers 6x Agent Performance Gains Over Model Choice

The harness, the orchestration code around an AI agent, drives up to 6x performance variation versus model choice; natural-language harnesses and automated optimization boost accuracy by 16+ points while cutting compute by up to 14x.

Harness Beats Model: Proven 6x Gains and Transferability

The same model on the same benchmark can show 6x performance differences purely from harness changes. The harness is everything outside the model weights: system prompts, tool definitions, orchestration logic, memory management, verification, and safety guardrails. LangChain moved from outside the TerminalBench 2.0 top 30 to rank 5 by changing only harness infrastructure. Stanford's Meta-Harness ranked Haiku #1 despite its smaller size, outpacing larger models through an optimized harness. Crucially, harnesses transfer across models: a harness optimized on one model boosted five others, making the harness the reusable asset while models remain volatile.
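To make "everything outside the model weights" concrete, here is a minimal sketch of a harness as a declarative object. The class and field names are illustrative assumptions, not taken from any of the cited systems:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ToolDef:
    name: str
    description: str
    handler: Callable[[dict], str]   # executes the tool call outside the model

@dataclass
class Harness:
    system_prompt: str               # instructions wrapped around every request
    tools: list[ToolDef] = field(default_factory=list)
    max_steps: int = 50              # orchestration budget
    memory_window: int = 20          # how many past turns stay in context
    verify_output: Callable[[str], bool] = lambda _: True   # verification hook
    blocked_commands: tuple[str, ...] = ("rm -rf", "sudo")  # safety guardrail

# The same model can be run under two different Harness instances; the 6x
# spread cited above comes from varying only this object, not the weights.
```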

A full harness reaches ~75% pass@4 on SWE-bench but burns roughly 14x the compute of a stripped-down one (16.3M tokens, 642 LLM calls, and 32 minutes versus 1.2M tokens, 51 calls, and 7 minutes). Anthropic's agent patterns (prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer loops) combine into production agents, but when wired together in ad-hoc Python the control logic gets scattered, which blocks ablations.
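A brief sketch of the ablation point, assuming a generic `call_model` client and hypothetical flag names (this is not Anthropic's code): composing two of those patterns behind explicit toggles means each one can be switched off and measured instead of being buried in scattered glue code.

```python
# Hypothetical sketch: agent patterns behind explicit flags so each can be
# ablated independently; call_model() stands in for any LLM client.

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def run_agent(task: str, *, use_routing: bool = True,
              use_evaluator_loop: bool = True, max_rounds: int = 3) -> str:
    # Routing pattern: a cheap first call picks a specialist prompt.
    if use_routing:
        route = call_model(f"Classify this task as 'code' or 'research': {task}")
        prompt = f"[{route.strip()} specialist] {task}"
    else:
        prompt = task

    answer = call_model(prompt)

    # Evaluator-optimizer pattern: critique the answer and revise until it passes.
    if use_evaluator_loop:
        for _ in range(max_rounds):
            critique = call_model(f"Critique this answer to '{task}':\n{answer}")
            if "looks good" in critique.lower():
                break
            answer = call_model(f"Revise the answer using this critique:\n"
                                f"{critique}\n\nOriginal answer:\n{answer}")
    return answer

# Ablation becomes a keyword argument, not a code rewrite:
#   run_agent(task, use_evaluator_loop=False)
```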

Natural Language Harnesses Enable Ablation and Efficiency

Tsinghua's NLAH expresses control logic (contracts, roles, state, failure handling) in structured natural language rather than brittle code, and separates it from the runtime charter (state persistence, child-agent management) so either can be swapped cleanly. Execution contracts bound each call by specifying inputs, budgets, permissions, stop conditions, and expected outputs; file-backed state survives context truncation.
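A minimal sketch of the two ideas just named, with invented field names (the real harness states these in structured natural language, not Python): an execution contract bounding a child-agent call, and file-backed state that outlives any single context window.

```python
import json
from pathlib import Path

# Illustrative execution contract: every field name here is an assumption.
EXAMPLE_CONTRACT = {
    "inputs": ["failing test name", "repository path"],
    "budget": {"max_llm_calls": 10, "max_minutes": 15},
    "permissions": ["read_files", "run_tests"],        # no shell writes allowed
    "stop_conditions": ["test passes", "budget exhausted"],
    "outputs": ["unified diff", "one-paragraph summary"],
}

class FileBackedState:
    """Persist progress to disk so truncating the LLM context loses nothing."""
    def __init__(self, path: str = "harness_state.json"):
        self.path = Path(path)

    def load(self) -> dict:
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def update(self, **fields) -> dict:
        state = self.load()
        state.update(fields)
        self.path.write_text(json.dumps(state, indent=2))
        return state

state = FileBackedState()
state.update(current_step="reproduce failure", contract=EXAMPLE_CONTRACT)
```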

Migrating the OS Symphony code harness to NLAH raised OSWorld accuracy from 30.4% to 47.2%, cut runtime from 361 to 141 minutes, and reduced LLM calls from 1,200 to 34 by replacing GUI loops with durable state. Ablations show self-evolution helps (+4.8 points on SWE-bench, +2.7 on OSWorld) through narrow attempt loops, while verifiers hurt (-8.4 on OSWorld) and multi-candidate search hurts as well (-5.6). About 90% of compute is delegated to child agents; the harness orchestrates rather than reasons.

Disciplined narrowing outperforms broadening: prune features rather than add them as models evolve (e.g., drop context resets once Opus 4.6 no longer needs them).

Automated Optimization and Safety Constraints

Stanford's Meta-Harness uses an agentic proposer (Claude Code + Opus 4.6) that rewrites the harness from raw failure traces (about 10M tokens and 82 files per iteration), while an evaluator scores each proposal. The raw traces are irreplaceable: accuracy drops from 50% to 34.6% without them. The auto-optimized harness reaches 76.4% on TerminalBench 2 (ranked #1) and 48.6% on text classification (+7.7 points with 4x fewer tokens).
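A hypothetical outline of that proposer/evaluator loop; the function names, trace format, and scoring are placeholders rather than Stanford's code, and only illustrate the shape of the optimization.

```python
# Placeholder sketch of an automated harness-optimization loop.

def propose_rewrite(harness_files: dict[str, str], failure_traces: str) -> dict[str, str]:
    """An agentic proposer (e.g. a coding agent) reads raw failure traces and
    returns a rewritten set of harness files."""
    raise NotImplementedError

def evaluate(harness_files: dict[str, str]) -> float:
    """Run the benchmark under this harness and return its score."""
    raise NotImplementedError

def collect_failure_traces(harness_files: dict[str, str]) -> str:
    """Gather full transcripts of failed runs; the summary notes that replacing
    these with condensed summaries drops accuracy sharply."""
    raise NotImplementedError

def optimize_harness(harness: dict[str, str], iterations: int = 5) -> dict[str, str]:
    best, best_score = harness, evaluate(harness)
    for _ in range(iterations):
        traces = collect_failure_traces(best)      # raw traces, not summaries
        candidate = propose_rewrite(best, traces)
        score = evaluate(candidate)
        if score > best_score:                     # keep only improvements
            best, best_score = candidate, score
    return best
```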

Complementary approaches exist: AutoHarness compiles rules into code (0% illegal moves across 145 games), and the AgentSpec DSL prevents 90% of unsafe executions. The field keeps evolving, and harness assumptions expire as models improve: Vercel pruned 80% of its tools and Manus rewrote its harness five times in six months to keep gaining. Investing here beats waiting for better models, with larger, faster, and more reliable returns.
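For flavor, here is a sketch of the general idea behind harness-level safety rules. This is not AgentSpec's syntax; the pattern list and function are invented, and only show the harness vetoing an action before execution instead of trusting the model.

```python
# Illustrative guardrail check run by the harness before any tool call executes.

UNSAFE_PATTERNS = ["rm -rf /", "DROP TABLE", "curl | sh"]

def check_action(tool: str, argument: str) -> None:
    """Raise before execution if a proposed action matches a forbidden pattern."""
    for pattern in UNSAFE_PATTERNS:
        if pattern in argument:
            raise PermissionError(f"blocked {tool!r} call: matched {pattern!r}")

check_action("shell", "ls -la")        # allowed
# check_action("shell", "rm -rf / --no-preserve-root")  # would raise
```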

