Harness Engineering Solves AI's Reliability Gap in Real Projects
AI models excel at one-off code but fail in real projects due to forgetfulness, self-bias, and improvisation. Harness Engineering turns them into a disciplined system by constraining behavior, standardizing processes, dividing labor, and enforcing verification. The author builds a minimal Go CLI for crypto snapshots (github.com/RyoKusnadi/crypto-snapshot-cli) from scratch to show the implementation. This isn't hype; it's a maturity model from Level 0 (experiments) to Level 4 (autonomous self-healing), where most teams land at Level 2: AI as a 'reliable junior partner' via feedback loops like tests and CI.
Key problem: single prompts treat AI as 'smart autocomplete,' which leads to chaos in production. Solution: a stacked framework ensuring stable, correct outputs across the dev lifecycle. The tradeoff is upfront setup time in exchange for large gains in predictability; rules are 'soft' (the AI forgets them), so scripts provide 'hard gates.' Early attempts with rules alone failed due to the AI's 'lazy bypass'; adding scripts fixed it.
'How to make AI consistently, reliably, and predictably deliver the exact results you want within your project.' (Author defines the core problem Harness solves, emphasizing verifiability over intelligence.)
Core Components Form a Stacked System
Rules as Soft Constraints and Team Policies
Rules are foundational 'engineering guidelines': non-negotiable conventions such as a post-modification checklist (compile, test, validate before claiming 'done'). They prevent basic mistakes but break down when the AI forgets them, claims they don't apply, or lazily skips them. Rules are not specs or scripts; they work like onboarding a new developer: 'here's what's allowed and what's off-limits.' In the Go project, rules forbid saying 'I'm done' without running the checks.
Tradeoff: Rules set baselines cheaply but need enforcement; alone, they're inadequate for complex tasks.
Skills as Standardized Playbooks
Skills turn vague instructions (e.g., 'compile') into exact SOPs. For Go that means go mod tidy, build tags, CGO flags, linker metadata, structured logs, and distinguishing errors from warnings. The AI executes playbooks rather than improvising, which is crucial for repeatability. The author prioritized skills early: compilation, testing, validation.
Why? Improvisation guarantees breakage in real environments. Skills standardize high-frequency operations, bridging rules to execution.
'A Skill tells the AI: “Don’t improvise on this... Just follow these exact steps.”' (Explains why skills prevent derivation errors, using compilation as example.)
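To make this concrete, a Skill can be an executable playbook rather than prose. A minimal Go sketch, not the author's actual skill file; the build tag, version variable, and step list are assumptions chosen to mirror the SOP above:

```go
// compile_skill.go: the "compile" Skill as an executable playbook
// instead of a vague instruction the AI might improvise around.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// step runs one command with visible output and fails loudly,
// so no step can be silently skipped.
func step(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	fmt.Printf("==> %s %v\n", name, args)
	return cmd.Run()
}

func main() {
	version := "dev" // hypothetical linker metadata; a real skill would derive it from git
	steps := [][]string{
		{"go", "mod", "tidy"},  // keep dependencies in sync
		{"go", "vet", "./..."}, // static sanity check
		{"go", "build", "-tags", "netgo", // explicit build tags
			"-ldflags", "-X main.version=" + version, "./..."}, // linker metadata
	}
	for _, s := range steps {
		if err := step(s[0], s[1:]...); err != nil {
			fmt.Fprintln(os.Stderr, "compile skill failed at step:", s)
			os.Exit(1)
		}
	}
	fmt.Println("compile skill: all steps passed")
}
```

Because each step is an explicit command with a hard failure, there is nothing to derive: the AI runs the playbook or the playbook stops it.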
Sub-Agents for Role Separation
A single agent reviews its own work, which biases it toward progress over quality. Sub-Agents divide the labor: requirements specialist, architect, gatekeeper, developer, QA, PM. Each handles one phase and outputs structured docs for handoff. This maps to real teams and prevents context collapse and self-bias.
In practice: isolation keeps each role focused; e.g., the developer gets only the design doc, not the full history.
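Structured handoffs can be made concrete as typed documents. A minimal sketch with hypothetical names (Handoff, Role); the point is that the developer receives only the design payload, never the full conversation:

```go
package main

import "fmt"

// Role names one sub-agent in the pipeline.
type Role string

const (
	Requirements Role = "requirements"
	Architect    Role = "architect"
	Developer    Role = "developer"
	QA           Role = "qa"
)

// Handoff is the only artifact passed between stages: a structured
// document, not the chat history, so each role stays isolated.
type Handoff struct {
	From    Role
	To      Role
	Stage   string            // e.g. "design-approved"
	Payload map[string]string // the one document the next role needs
}

func main() {
	h := Handoff{
		From:  Architect,
		To:    Developer,
		Stage: "design-approved",
		Payload: map[string]string{
			"design.md": "CLI fetches prices, writes JSON snapshot", // design only
		},
	}
	fmt.Printf("%s -> %s: %s\n", h.From, h.To, h.Stage)
}
```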
Workflows as Relay Race Protocols
Workflows orchestrate agents via explicit rules: stage status, handoff outputs, rejection conditions, rollbacks. Three layers: human-readable (philosophy), system-enforced (gates), role-specific (inputs/outputs). Context discipline: gradual rule loading, minimal per stage to avoid overload.
Without: vague scopes, flawed designs pushed forward, unclear status. With: auditable advances and rejections. Analogy: relay handoff rules matter more than fast runners alone.
'Without a Workflow, the engineering floor typically devolves into this: Requirements are vague... Everyone is “working,” but nobody can answer: “What’s the actual status?”' (Highlights chaos without protocols, stressing verifiable state.)
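The relay protocol can be encoded as an explicit state machine with gates and rollbacks. A minimal sketch; the stage names and the stubbed gate (which fails implementation once to show a rollback) are assumptions, and real gates would be the Scripts described next:

```go
package main

import "fmt"

// Stage order mirrors the relay: each leg must pass a gate before handoff.
var stages = []string{"requirements", "design", "implementation", "qa"}

var attempts = map[string]int{}

// gate stands in for an objective check (tests, lint, design doc present).
// Stub behavior: implementation fails its first review, forcing a rollback.
func gate(stage string) bool {
	attempts[stage]++
	if stage == "implementation" && attempts[stage] == 1 {
		return false
	}
	return true
}

func main() {
	for i := 0; i < len(stages); {
		s := stages[i]
		if gate(s) {
			fmt.Println("advance:", s) // auditable advance
			i++
			continue
		}
		// Rejection: roll back instead of pushing a flawed artifact forward.
		if i > 0 {
			i--
			fmt.Println("reject:", s, "-> rolling back to", stages[i])
		}
	}
	fmt.Println("workflow complete; status was answerable at every step")
}
```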
Scripts as Hard Verification Gates
Scripts are the 'most important' component: a Master Gatekeeper runs checks such as no hardcoded values or secrets, structured logging (zap/slog), golangci-lint with zero warnings, race tests, coverage baselines, and no accidental commits of generated files. The AI can't bluff; it passes or fails objectively.
Shift: mature harnesses rely more on scripts than prompts. In the Go project, the gatekeeper verifies go.mod sync and build tags, and bans stray os.Exit calls.
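A minimal sketch of such a gatekeeper as a single Go command; the check list follows the items above, while the file layout and the secrets pattern are assumptions:

```go
// gatekeeper.go: hard verification gate. "I'm done" means nothing
// until every check below passes objectively.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// check pairs a human-readable gate name with the command that proves it.
type check struct {
	name string
	cmd  []string
}

func main() {
	checks := []check{
		{"lint, zero warnings", []string{"golangci-lint", "run"}},
		{"race tests", []string{"go", "test", "-race", "./..."}},
		{"module sync", []string{"go", "mod", "verify"}},
		// Hypothetical secrets pattern; a real gate would use a proper scanner.
		{"no secrets", []string{"grep", "-rn", "--include=*.go", "API_KEY=", "."}},
	}
	for _, c := range checks {
		cmd := exec.Command(c.cmd[0], c.cmd[1:]...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		err := cmd.Run()
		pass := err == nil
		// grep inverts: finding a secret (exit 0) is the failure case.
		if c.name == "no secrets" {
			pass = !pass
		}
		if !pass {
			fmt.Fprintln(os.Stderr, "GATE FAILED:", c.name)
			os.Exit(1)
		}
		fmt.Println("gate passed:", c.name)
	}
	fmt.Println("all gates passed: 'done' is now backed by proof")
}
```

A production version would also distinguish grep's no-match exit code from genuine errors and add the coverage and generated-file checks.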
MCP for External Integration
MCP (Model Context Protocol) bridges the repo to external ecosystems: CI triggers, logs, signing, artifacts, releases. These are exposed as tools/resources under policy control. A nice-to-have now, essential for full delivery loops.
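MCP specifics are out of scope here, but the shape can be sketched as a generic policy-gated tool registry in Go (a stand-in, not the MCP SDK; all names are hypothetical): every external action is a named tool, and every call clears a policy check first.

```go
package main

import (
	"errors"
	"fmt"
)

// Tool is an external capability (CI trigger, release, log fetch)
// exposed to the AI under policy control.
type Tool func(args map[string]string) (string, error)

var registry = map[string]Tool{
	"ci.trigger": func(args map[string]string) (string, error) {
		return "triggered pipeline for " + args["ref"], nil // stub
	},
}

// allowed is the policy layer: only listed tools may be invoked.
var allowed = map[string]bool{"ci.trigger": true}

func call(name string, args map[string]string) (string, error) {
	if !allowed[name] {
		return "", errors.New("policy denied: " + name)
	}
	tool, ok := registry[name]
	if !ok {
		return "", errors.New("unknown tool: " + name)
	}
	return tool(args)
}

func main() {
	out, err := call("ci.trigger", map[string]string{"ref": "main"})
	fmt.Println(out, err)
	_, err = call("release.sign", nil) // not allowlisted: denied
	fmt.Println(err)
}
```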
Full stack: SPEC (goals), Rules, Skills, Sub-Agents, Workflow, Scripts, Dev-Map (structure awareness), Task Board (progress), MCP.
'Scripts are saying: “Your claim that you’re done means nothing. You don’t pass my checkpoint until the system proves it.”' (Emphasizes scripts' role in objective truth over AI claims.)
Implementation Order and Lessons from Go CLI
Start with the SPEC: it defines crypto-snapshot-cli as minimal but with breakage-prone seams (deps, builds, tests). The approach isn't enterprise-only; it's ideal for solo devs and MVPs.
Build order: Rules first (baseline), Skills (standardize), Sub-Agents/Workflow (scale complexity), Scripts (enforce), Dev-Map/Task Board (context), MCP (extend).
Stumbles: an initial rules-only setup led to chaos, so scripts were added as hard gates. A single agent proved inadequate; sub-agents fixed the self-bias. The workflow prevented devolution into unclear status.
Results: AI as a junior partner, with trust coming from the system rather than from eyeballing every change. The harness is evolvable: Level 1 (solo/MVP) to Level 2 (teams).
Dev-Map/Task Board: the Dev-Map captures project structure and patterns; the Task Board keeps history and progress aligned.
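A Dev-Map can start as nothing more than a generated repo tree. A minimal standard-library sketch; the skip list and indentation format are assumptions:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// Walks the repo and prints an indented tree the AI can load as cheap
// structural context, skipping directories that are pure noise.
func main() {
	skip := map[string]bool{".git": true, "vendor": true, "bin": true}
	err := filepath.WalkDir(".", func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() && skip[d.Name()] {
			return filepath.SkipDir
		}
		depth := strings.Count(path, string(os.PathSeparator))
		fmt.Printf("%s%s\n", strings.Repeat("  ", depth), d.Name())
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```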
'Harness Engineering is like building a complete “Engineering Operations System” for AI.' (Analogy capturing the assembled value: mission to awareness.)
Maturity Model Guides Progressive Adoption
Level 0: Throwaways. Level 1: Constraints for MVPs. Level 2: Feedback for production. Level 3: Specialization for multi-service. Level 4: Autonomy for platforms.
Sweet spot: Level 2. Progression fills the gaps each level exposes: e.g., when rules prove insufficient, add scripts.
Key Takeaways
- Start with Rules and Skills for quick wins; they're cheap and prevent 80% of basic errors.
- Implement a Master Gatekeeper Script early—list 10-15 project-specific checks like lint, coverage, no secrets.
- Use Sub-Agents for any non-trivial task: requirements → architect → dev → QA handoffs via structured docs.
- Define Workflows with explicit stages, handoffs, and rollbacks; use context discipline to avoid overload.
- Prioritize scripts over prompts long-term; they provide the 'proof' AI lacks.
- For Go/Python/TS projects, standardize skills for build/test/deploy (e.g., go mod tidy, race tests).
- Build Dev-Map: Repo tree + patterns to onboard AI fast.
- Evolve to MCP only after core loop; governs external access.
- Test on minimal project first: Expose seams like deps, logging, platforms.
- Measure maturity: Can AI reliably deliver full artifacts (code + tests + validation) without babysitting?