Harness Engineering: Humans Steer, Agents Code

With capable LLMs like GPT-5.2, code is free: ban human code editing and build harnesses of skills, prompts, lints, and reviewer agents to steer effectively unlimited agent capacity across the full software engineering job.

Paradigm Shift: Code Abundance Frees Engineers

Ryan Leopo, a Member of Technical Staff at OpenAI, has built software exclusively with AI agents for nine months, banning his team from touching code editors. The core insight: post-GPT-5.2 (late 2025), models are "isomorphic" to human engineers, capable of producing, refactoring, and deleting high-quality code for real user problems. "Code is free"—its production cost is near zero, limited only by GPUs and tokens, not human time. Implementation, previously scarce, is now abundant, shifting the engineer's role toward systems thinking, design, and delegation. Every engineer can access 5-5,000x capacity, 24/7.

Problem: Traditional engineering bottlenecks on synchronous human attention for code maintenance. Opportunity: Agents are patient and infinitely parallelizable, handling even P3 tasks concurrently (run four attempts, pick the winner). Result: Internal tools ship with localization/i18n from day one; migrations complete via 15 parallel agents instead of stalling for six months.
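
The "run N attempts, pick the winner" pattern can be sketched as a simple fan-out. This is an illustrative stub, not the talk's actual harness: `runAttempt` and `score` are hypothetical stand-ins for launching an agent run and grading its patch with lints, tests, and reviewer agents.

```typescript
// Fan out N independent agent attempts, grade each, keep the best.
type Attempt = { id: number; patch: string };

async function runAttempt(id: number): Promise<Attempt> {
  // Stand-in: a real harness would launch a Codex run here.
  return { id, patch: `patch-${id}` };
}

function score(attempt: Attempt): number {
  // Stand-in for lint/test/reviewer-agent verdicts.
  return attempt.patch.length + attempt.id;
}

async function pickWinner(attempts = 4): Promise<Attempt> {
  const runs = await Promise.all(
    Array.from({ length: attempts }, (_, i) => runAttempt(i)),
  );
  return runs.reduce((best, cur) => (score(cur) > score(best) ? cur : best));
}
```

Because attempts are independent, the only scarce input is tokens—exactly the resource the talk argues should be spent freely.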

Tradeoffs: Human time/attention and model context windows remain scarce. Humans must unblock agents over long horizons (1 day to 6 months), automating low-leverage work to focus on high-leverage activities.

"I'm a token billionaire and I believe that in order for us to get into our AGI future, we want everybody to be token billionaires to use the models to do the full job." (Ryan Leopo, opening keynote—frames agent scaling as path to AGI-era productivity.)

Scarce Resources Demand Legible Systems

With code free, prioritize making codebases "legible" to agents: breadcrumb docs, ADRs, persona-oriented guides, ticket histories, code reviews. Structure the codebase natively for agents—respect context limits by favoring sameness (large refactors are free, so standardize aggressively). Agents have internalized trillions of lines of code; humans specify the non-functional requirements (e.g., timeouts/retries, secure interfaces).
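
A non-functional requirement like "every outbound call gets a timeout and bounded retries" can be captured once as a shared utility that lints and reviewer agents then steer all code toward. A minimal sketch; the name `withPolicy` and its defaults are illustrative, not from the talk:

```typescript
// Wrap any async operation with a timeout (via AbortSignal) and bounded retries.
async function withPolicy<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  { timeoutMs = 5_000, retries = 2 } = {},
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      return await fn(controller.signal); // succeed: return immediately
    } catch (err) {
      lastError = err; // fail: fall through to the next attempt
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

Usage with fetch would look like `withPolicy((signal) => fetch(url, { signal }))`; a reviewer agent or lint can then flag any bare `fetch` as a violation of the documented NFR.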

Decisions: Reject slop via explicit guardrails ("do not produce slop"). Accept a short-term velocity hit to diagnose agent failures, then systematize the fix. Diverse team personas (frontend, backend, product-minded) document their expertise once; every agent trajectory inherits it, eliminating low-signal reviews.

"The important thing is not the code but the prompt and the guardrails that got you there." (Emphasizes process over output—docs/processes enable agent success.)

Building Harnesses: Outside-In Agent Optimization

Workflow: Tickets → agent (Codex as the entrypoint) + 5-10 core skills (launch the app, spin up observability, boot Chrome DevTools via a daemon). Repo harnesses: custom workspace-wide ESLint rules and "wholesome tests" (package privacy, dependency edges, Zod dedup, shared utils). Hide infra churn under skills; agents adapt without noticing (e.g., an undetected daemon switch).
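
A "wholesome test" over dependency edges can be a few lines: assert that cross-package imports stay within an allowlist, and fail CI with a readable message otherwise. The package names and edges below are hypothetical, not from the talk:

```typescript
// Allowed dependency edges between workspace packages (hypothetical layout).
const allowedEdges: Record<string, string[]> = {
  web: ["shared", "api-client"],
  "api-client": ["shared"],
  shared: [], // shared depends on nothing
};

// Given observed [from, to] import edges, return human/agent-readable violations.
function violations(edges: Array<[string, string]>): string[] {
  return edges
    .filter(([from, to]) => !(allowedEdges[from] ?? []).includes(to))
    .map(([from, to]) => `${from} -> ${to} is not an allowed dependency edge`);
}
```

In CI, the edge list would be extracted from actual import statements; any non-empty result fails the build, so the constraint is enforced on every agent trajectory rather than re-litigated in review.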

Why only 5-10 skills? Broad skill libraries go obsolete fast; deep ones stack leverage. Per the bitter lesson: minimize custom infra and focus on context management—just-in-time instructions (e.g., a post-prototype lint: decompose React components for statelessness and snapshot testing).
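
Just-in-time delivery can be as simple as attaching instructions to workflow phases instead of frontloading everything into one system prompt. A sketch; the phase names and wording are illustrative:

```typescript
// Map workflow phase -> instruction text injected into the agent's context
// only when that phase is reached (hypothetical phases).
const instructions: Record<string, string> = {
  prototype: "Optimize for a working end-to-end slice; ignore polish.",
  "post-prototype":
    "Decompose React components so they are stateless and snapshot-testable.",
  review: "Check timeouts/retries on every outbound call and secure interfaces.",
};

function instructionFor(phase: string): string {
  return instructions[phase] ?? "";
}
```

The point is that the harness delivers the right text at the right time; the model does the rest.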

Tradeoffs: Overengineering risks obsolescence as models advance (e.g., auto-compaction in GPT-5.4/Codex eliminates /new). A harness is timely text delivery; models follow instructions.

Setup example: No laptop needed—Linear/Beach for tickets, the agent handles the rest. Leopo spends $1K+/day on a billion tokens.

"The whole way we have set up the repository and all of the local dev tools is for codex to invoke them first." (Describes outside-in design—agents drive env, humans delegate.)

Guardrails: Inject Prompts at Every Layer

Enforce quality systematically:

  • Reviewer agents: Security/reliability checks (timeouts/retries on fetch, secure interfaces) in CI/PRs.
  • Lints/tests on source: File ≤350 lines (context-efficient), no awaits in loops, parse-not-validate (Zod types).
  • Error messages: Remediation prompts (e.g., "No unknown here; derive Zod type").
  • Meta-prompts: Embed via lints, PR comments, agent SDKs in tests, and skills (shell out to agents for prompt synthesis from OpenAI cookbooks).
  • QA plans: Product engineer docs critical journeys; review agents assert media/proof.
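
The "error messages as remediation prompts" idea above can be sketched as a tiny source lint whose output tells the agent what to do next. The 350-line limit and the parse-not-validate rule are from the talk; the `lintFile` function and message wording are illustrative:

```typescript
// Lint source text; each problem string doubles as a remediation prompt
// that the agent can act on directly.
const MAX_LINES = 350; // keep any single file context-efficient

function lintFile(path: string, source: string): string[] {
  const problems: string[] = [];
  if (source.split("\n").length > MAX_LINES) {
    problems.push(
      `${path}: exceeds ${MAX_LINES} lines; split it so one file fits in context.`,
    );
  }
  if (/\bunknown\b/.test(source)) {
    problems.push(
      `${path}: no \`unknown\` here; derive a Zod type and parse, don't validate.`,
    );
  }
  return problems;
}
```

Run in CI or as an editor-free pre-commit check, failures feed straight back into the agent loop with the fix already spelled out.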

Observe durable failure modes (e.g., agents preferring local coherence over shared utils), then eliminate them via one-time migrations (code is free). Sub-agents refine output; auto-compaction refreshes context.

Evolution: From driving Chrome DevTools directly to a daemon—seamless. Humans shoulder-surf less, trust rises, and delegation grows.

"Code is free to produce, free to refactor, and it is not a thing to get hung up on anymore." (Core axiom—unblocks bold experimentation/refactors.)

Results and Replication Insights

Outcomes: Agents do the full job (features, reliability); humans act as staff engineers driving concurrent teams. Non-obvious: Agents crave tokens—feed them via sub-agents and tools. Failures are instructive: initial slop came from underspecified NFRs; specify once and systematize.

To replicate: Document team expertise durably; build a minimal harness (skills + just-in-time prompts); iterate on failures. Counterintuitive: Embrace churn—run attempts in parallel and pick the best; fear no migration.

"You can just simply say do not produce slop. Don't accept slop. You won't get slop in your codebase." (Practical anti-slop rule—short-term hit for long-term gains.)

Key Takeaways

  • Ban human coding; steer agents via tickets + 5-10 skills for full features/reliability.
  • Make codebases legible: Persona docs, ADRs, histories for NFR inheritance.
  • Use just-in-time prompts (lints/tests/PRs) over frontloading—timely context wins.
  • Systematize failures: Reviewer agents, source tests (e.g., file size, Zod), error remediation.
  • Parallelize everything (run P3s 4x, pick the winner); free code enables instant migrations and day-one i18n tooling.
  • Minimize skills breadth; depth + auto-compaction handles infra churn.
  • Document QA/user journeys once; agents assert proof, build human trust.
  • Spend on tokens boldly—Leopo's billion tokens/day proves scaling viability.
  • Humans: Focus systems/delegation; agents handle 500 micro-decisions per patch.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge