Harness Engineering: Agents Code, Humans Steer

OpenAI engineer Ryan Lopopolo's team builds exclusively with AI agents by creating 'harnesses' (guardrails, skills, and prompts) that make codebases legible and execution reliable, freeing humans for systems thinking.

Code Abundance Frees Engineers for Steering

Ryan Lopopolo, a Member of Technical Staff at OpenAI, has spent nine months building software solely through AI agents, banning his team from directly editing code. The core shift: "Code is free." Models like GPT-5.2 produce, refactor, and delete code at scale without synchronous human attention, treating implementation as non-scarce. This abundance stems from models' patience, parallelism, and training on trillions of lines of code, making them "isomorphic" to human engineers for real-world tasks.

Constraints of the pre-agent era no longer apply. P3-priority tasks (formerly deprioritized) now run 4x in parallel; agents select the winner. Internal tools ship with localization and internationalization from day one, since capacity is no longer traded off. Lopopolo's team built productivity agents for coworkers across OpenAI offices in London, Dublin, Paris, Brussels, Zurich, and Munich.

Scarcity has shifted to human time and attention and to model context windows, and those resources redefine roles. Engineers become "staff engineers" delegating to effectively unlimited agent teams, focusing on systems design one day, one week, or six months ahead. "Every one of you is a staff engineer. You have as many team members as you can possibly drive concurrently."

Tradeoffs: initial velocity takes a hit while agent outputs are refined, but durable fixes compound into long-term leverage. Humans unblock agents over long horizons rather than micromanaging them.

Legible Codebases via Documentation and Standardization

Agents need humans to externalize "what good looks like." Years of experience yield 500 micro-decisions per patch (e.g., non-functional requirements like timeouts, retries). Models know all variants from training data; humans specify via "breadcrumbs": ADRs, persona docs, ticket histories, code reviews.

Make codebases "native to agents": respect context scarcity through sameness. Large refactors are free; fire 15 agents to complete migrations that had lingered for six months. Tests enforce source-code properties: files stay at or under 350 lines for context efficiency, and lints ensure retries and timeouts on network calls (e.g., via fetch wrappers).
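The 350-line budget lends itself to a plain source-property test. A minimal sketch, assuming a Node.js test environment; `MAX_LINES`, the directory walk, and all function names are illustrative, not from the talk:

```typescript
// Sketch of a source-property test: fail CI when a file exceeds the
// 350-line context budget. Names and layout are invented for illustration.
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

const MAX_LINES = 350;

// Pure check so the policy is testable without touching the filesystem.
function exceedsBudget(source: string, maxLines: number = MAX_LINES): boolean {
  return source.split("\n").length > maxLines;
}

// Recursively collect .ts files under a directory.
function* sourceFiles(dir: string): Generator<string> {
  for (const entry of readdirSync(dir)) {
    const path = join(dir, entry);
    if (statSync(path).isDirectory()) yield* sourceFiles(path);
    else if (path.endsWith(".ts")) yield path;
  }
}

// CI entry point: an empty offender list means the test passes.
function filesOverBudget(dir: string): string[] {
  return [...sourceFiles(dir)]
    .filter((f) => exceedsBudget(readFileSync(f, "utf8")))
    .map((f) => `${f} exceeds ${MAX_LINES} lines`);
}
```

Because the limit lives in a test rather than a style guide, agents hit it immediately and decompose files without a human review round-trip.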

Diverse team expertise amplifies: frontend architects document component patterns; backend experts outline scalability; product minds define QA plans covering user journeys, required PR media (screenshots, videos). One engineer's QA doc becomes every agent's guardrail, eliminating low-signal reviews.

"The important thing is not the code but the prompt and the guardrails that got you there." Review agents scan patches for security/reliability, injecting comments PRs must address pre-merge.

Prompt Injection and Skills for Reliable Execution

Harness engineering operationalizes agent success through just-in-time prompts, minimizing overengineering per the "bitter lesson" (rising model capability obsoletes bespoke complexity). A centralized set of 5-10 skills hides infra churn: launching apps, spinning up observability, booting Chrome DevTools via a daemon. Codex, the agent, is the entry point: development proceeds outside-in, with repo tools invoked agent-first.
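As a purely illustrative sketch of the "small, centralized" shape (the talk does not specify an implementation; every name, description, and return value below is invented), such a skill set might be a tiny registry:

```typescript
// Invented sketch of a centralized skill registry: a handful of named entry
// points that hide infra churn from agents. In a real harness, run() would
// shell out to the actual tooling; here it returns a status string.
interface Skill {
  name: string;
  description: string; // surfaced to the agent just in time
  run: () => string;
}

const skills = new Map<string, Skill>([
  ["launch-app", { name: "launch-app", description: "Start the local dev stack", run: () => "app up" }],
  ["observability", { name: "observability", description: "Spin up tracing and metrics", run: () => "otel up" }],
  ["devtools", { name: "devtools", description: "Boot Chrome DevTools via the daemon", run: () => "devtools attached" }],
]);

function invoke(name: string): string {
  const skill = skills.get(name);
  if (!skill) throw new Error(`Unknown skill: ${name}; keep the set small`);
  return skill.run();
}
```

The design choice the talk emphasizes is the cardinality, not the mechanism: a few deep skills stay maintainable as infrastructure churns underneath them.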

Guardrails embed prompts everywhere: ESLint rules custom to each workspace; codebase-health tests (package privacy, dependency edges, Zod schema dedup); and error messages that carry remediation ("Parse, don't validate at edge; derive type from Zod"), so no `unknown` types leak inward.
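That remediation message names the "parse, don't validate" pattern. The team derives types from Zod; the dependency-free sketch below shows the same idea with an invented `RetryPolicy` shape, so untrusted `unknown` input is parsed once at the edge into a typed value:

```typescript
// Hand-rolled "parse, don't validate" sketch (the team uses Zod; this
// stdlib-only version just illustrates the shape). All names are invented.
interface RetryPolicy {
  timeoutMs: number;
  retries: number;
}

// Parse at the edge: turn `unknown` into a typed value or throw with a
// remediation-style message. No `unknown` flows past this function.
function parseRetryPolicy(raw: unknown): RetryPolicy {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("RetryPolicy must be an object; parse input at the edge");
  }
  const { timeoutMs, retries } = raw as Record<string, unknown>;
  if (typeof timeoutMs !== "number" || timeoutMs <= 0) {
    throw new Error("timeoutMs must be a positive number");
  }
  if (typeof retries !== "number" || !Number.isInteger(retries) || retries < 0) {
    throw new Error("retries must be a non-negative integer");
  }
  return { timeoutMs, retries };
}
```

With Zod the interface would instead be derived from the schema (`z.infer`), which is what makes the error message's "derive type from Zod" advice enforceable by a lint.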

Reviewer agents, sub-agents, and auto-compaction (at which GPT-5.4 excels) keep context fresh. The team also shells out to agents for prompt-writing, with skills synthesized from OpenAI cookbooks. CI runs security checks: "Are there timeouts/retries? Secure interfaces?"
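The "are there timeouts/retries?" check is the kind of property a mandated wrapper makes trivially true. A minimal sketch, assuming a hypothetical `fetchWithPolicy` helper (the name, constants, and injectable `fetchImpl` parameter are all illustrative):

```typescript
// Invented fetch wrapper: every network call gets a timeout and bounded
// retries, which is exactly the property the CI check looks for.
type FetchLike = (url: string, init?: { signal?: AbortSignal }) => Promise<{ ok: boolean; status: number }>;

const TIMEOUT_MS = 5_000;
const MAX_ATTEMPTS = 3;

async function fetchWithPolicy(
  url: string,
  fetchImpl: FetchLike = (globalThis as any).fetch as FetchLike,
): Promise<{ ok: boolean; status: number }> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), TIMEOUT_MS); // enforce timeout
    try {
      const res = await fetchImpl(url, { signal: controller.signal });
      if (res.ok) return res; // success: stop retrying
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err; // network failure or abort: fall through to retry
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

A custom lint can then forbid bare `fetch` outside this module, turning the CI question from a review prompt into a mechanical pass/fail.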

Workflow: Linear tickets go straight to an agent plus skills. No human code editors are involved; Lopopolo showed a beach, margarita, and Linear setup. Agents prototype UIs, then lints enforce decomposition for snapshot tests. Observed failure modes (local coherence winning over shared utils) yield systematic fixes.

The Q&A stressed minimalism: avoid thousands of skills; deepen a few. The harness is timely instruction surfacing. One agent's switch to a daemon went unnoticed for weeks because its docs covered it; humans delegate fully.

"Do not produce slop. Don't accept slop. You won't get slop in your codebase."

Key Takeaways

  • Treat code as free: Parallelize P3s, refactor at scale, delete freely; focus humans on unblocking agents.
  • Externalize expertise: Document personas/ADRs/QA for every agent trajectory; one doc accrues team-wide leverage.
  • Embed guardrails durably: Custom lints/tests on source code (file size, retries, deps); reviewer agents in CI.
  • Centralize 5-10 skills: Hide infra/tools; agent-first entry (e.g., Codex launches dev stack).
  • Just-in-time prompts: Auto-compaction + error remediation; synthesize skills from cookbooks.
  • Minimize harness overengineering: Surface requirements context-efficiently; models follow instructions.
  • Measure by agent autonomy: Trust via QA plans/media; shoulder-surf less, delegate more.
  • Fix failure classes systematically: Observe agent/human errors, devise lints/tests, migrate codebase once.
  • Workflow: Tickets go to agents; no laptops, just Linear plus voice and tools.
  • Scale internal tools globally: i18n/l10n free with abundance.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge