Building Reliable AI with the Planner-Generator-Evaluator Pattern
The planner-generator-evaluator pattern is a multi-agent architecture that separates reasoning, execution, and quality assurance into distinct roles. By forcing an adversarial feedback loop between generation and skeptical evaluation, this harness breaks the self-confirmation bias and hallucination loops that plague single-agent systems.
Most AI agents look great in a controlled social media demo. They write a quick Python script or draft an email, and the output seems flawless on the first try. But these same monolithic agents fail spectacularly when asked to write production code, validate infrastructure setups, or manage a multi-step workflow without human intervention.
The core problem is scaling AI from single-shot prompts to reliable, long-running production tasks. When a single foundation model tries to plan a complex task, execute the steps, and check its own work, it inevitably cuts corners. It forgets initial constraints, hallucinates variables, and rubber-stamps its own errors.
The planner-generator-evaluator (PGE) harness offers an architectural blueprint to fix this, avoiding the trap of generic, bloated multi-agent frameworks. It enforces strict separation of concerns across the lifecycle of a task. Engineering teams must stop relying on single models to grade their own homework. To ship reliable AI features, you need structural constraints that force models to prove their work.
The Fallacy of Single-Agent Systems
Single-agent systems fail at long-running tasks because generation and evaluation share the same context window, creating an inescapable self-confirmation bias. When a model drafts code and immediately reviews it, it inherently trusts its own logic, blinding it to edge-case failures and gradual coherence loss.
When you ask a large language model (LLM) to generate a complex script and then follow up with "is this correct?", the model reads its own output in the context window. It recognizes the exact probabilistic patterns it just generated, and because those patterns match its internal weights, it assumes the output is correct. This self-confirmation bias makes single agents terrible at catching their own logical flaws, especially in strict domains like continuous integration pipelines.
As the context window fills with previous steps, the model's reasoning capabilities degrade further. It begins to lose track of the original constraints. If you ask a single agent to generate a React component, add unit tests, and then fix accessibility issues, by the third step it has likely forgotten the styling rules you established in step one. The attention mechanism spreads too thin across the growing token history, leading to hallucinated variables and skipped requirements.
The common industry reaction to these failures is to reach for off-the-shelf multi-agent frameworks. These generic tools often add massive coordination overhead without structurally solving the feedback loop problem. They give you a dozen agents chatting with each other, but no mathematical or architectural guarantee of quality. Reliable AI requires deliberate architectural constraints, not just larger foundation models or noisier agent swarms.
Deconstructing the Planner-Generator-Evaluator Pattern
The planner-generator-evaluator pattern divides complex AI tasks into three specialized roles to enforce strict quality control and prevent context contamination. The Planner expands a high-level prompt into a rigorous specification, the Generator produces the required output against that spec, and the Evaluator independently critiques the result until all criteria are met.
This architecture explicitly isolates reasoning from execution. The Planner role is responsible for scoping edge cases, defining strict acceptance criteria, and setting an ambitious bar for the output. Instead of writing the final code or content, it writes the blueprint by translating a vague user request into a deterministic checklist. By forcing the system to declare its intentions before writing code, the Planner establishes a ground-truth document that anchors the rest of the workflow.
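A minimal sketch of that translation step is below. The `plan_task` helper, the prompt wording, the spec fields, and the choice of the OpenAI Python SDK and model are illustrative assumptions, not a prescribed interface:

```python
import json
from openai import OpenAI  # any chat-completion client works; this SDK is an illustrative choice

client = OpenAI()

PLANNER_SYSTEM_PROMPT = """You are a planning agent. Translate the user's request into a strict
specification. Respond with JSON only, using the keys: objective, constraints (list),
acceptance_criteria (list), and edge_cases (list)."""

def plan_task(user_request: str) -> dict:
    """Expand a vague request into a deterministic, machine-checkable spec."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative: a frontier model suited to scoping and reasoning
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": PLANNER_SYSTEM_PROMPT},
            {"role": "user", "content": user_request},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

The returned spec, not the original conversation, becomes the ground-truth document every later role is measured against.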
The Generator role operates in a narrow, highly focused context. It receives the Planner's specification and executes it without the distraction of higher-level reasoning. By stripping away the conversational history and the original user prompt, the Generator can dedicate its entire attention mechanism to satisfying the exact constraints of the spec. It functions purely as a translation layer between the specification and the final syntax.
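Continuing the same sketch, the Generator receives nothing but that spec; the user's original prompt and the Planner's deliberation never enter its context:

```python
GENERATOR_SYSTEM_PROMPT = """You are an implementation agent. Produce output that satisfies every
item in the specification below. Do not add features the specification does not request."""

def generate(spec: dict) -> str:
    """Produce a draft from the spec alone; no conversational history is passed in."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative: a cheaper model for the narrow execution step
        messages=[
            {"role": "system", "content": GENERATOR_SYSTEM_PROMPT},
            {"role": "user", "content": json.dumps(spec)},
        ],
    )
    return response.choices[0].message.content
```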
The Evaluator role acts as an independent quality assurance mechanism. It provides structured, granular feedback rather than a simple binary pass/fail. This setup adapts the adversarial loop of Generative Adversarial Networks (GANs) for concrete task completion. Instead of training a model's weights, this GAN-inspired architecture iterates on a specific output through rigorous, automated critique. The Evaluator compares the Generator's draft directly against the Planner's exact criteria, sending failures back for correction.
Why the Evaluator Role Dictates Success
A production-grade Evaluator operates as a skeptical quality assurance engineer rather than a passive reviewer. Working from a clean context window, it judges the Generator's output strictly against the Planner's original specification; this adversarial stance prevents rubber-stamping and forces iterative improvement cycles until the output legitimately passes every constraint.
Context isolation is the most critical factor in this phase. The Evaluator must never see the Generator's internal reasoning or the conversational history that led to the draft. It should only receive the final output and the initial specification. If the Evaluator sees the Generator explaining why it made a mistake, the Evaluator is more likely to forgive the error. Maintaining a strict firewall between these contexts forces the Evaluator to judge the work purely on its objective merits.
Prompt engineering tactics for the Evaluator must actively enforce skepticism. You have to explicitly instruct the model to assume the output is flawed. The prompt should demand that the Evaluator search for specific failure modes, edge cases, and missing requirements. It must output a structured assessment—often a JSON payload detailing exact line numbers and missing criteria—rather than a conversational summary.
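One way to encode that skepticism, continuing the sketch above, is to hard-code the assumption of failure into the system prompt and force a structured verdict (the schema here is an illustrative choice, not a standard):

```python
EVALUATOR_SYSTEM_PROMPT = """You are a skeptical reviewer. Assume the submitted output is flawed.
Check it against the specification item by item, hunting for edge cases and missing requirements.
Respond with JSON only, using the keys: passed (bool), failures (list of objects with criterion,
location, and problem), and missing_requirements (list)."""

def evaluate(spec: dict, draft: str) -> dict:
    """Judge the draft against the spec in a clean context; the Generator's reasoning is excluded."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative: a frontier model for critical analysis
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": EVALUATOR_SYSTEM_PROMPT},
            {"role": "user", "content": f"SPECIFICATION:\n{json.dumps(spec)}\n\nOUTPUT:\n{draft}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

The `failures` list is then fed verbatim into the Generator's next attempt.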
Top engineering teams combine this static LLM review with dynamic runtime validation. For example, OpenAI wired the Chrome DevTools Protocol into their agent runtime to validate UI fixes in a real browser. This functional observation provides a ground-truth signal that static code inspection cannot match. When the Evaluator finds a flaw, its detailed, structured feedback becomes the exact prompt for the Generator's next iteration.
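A lightweight stand-in for that kind of runtime signal, assuming the Generator produced a Python project with a test suite, is to execute the tests and hand the result to the Evaluator alongside its static review:

```python
import subprocess

def run_dynamic_checks(project_dir: str) -> dict:
    """Run the generated project's tests; the exit code is a ground-truth signal static review lacks."""
    result = subprocess.run(
        ["python", "-m", "pytest", "--tb=short"],
        cwd=project_dir,
        capture_output=True,
        text=True,
        timeout=300,
    )
    return {"tests_passed": result.returncode == 0, "report": result.stdout[-4000:]}
```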
Implementing the Planner-Generator-Evaluator Pattern in Production
Implementing the planner-generator-evaluator pattern in production requires assigning distinct foundation models to each role based on their strengths. This balances reasoning capabilities with token costs. Teams route complex planning and evaluation to frontier models, assigning repetitive generation to faster, cost-effective alternatives.
You do not need to use the most expensive model for every step. A standard routing strategy looks like this:
- Planner: Claude 3.5 Sonnet or GPT-4o (high reasoning, deep context)
- Generator: Claude 3.5 Haiku or GPT-4o-mini (fast execution, low cost)
- Evaluator: Claude 3.5 Sonnet or GPT-4o (strict logic, critical analysis)
By deliberately splitting these tasks, you optimize for both budget and performance without degrading the final output.
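A minimal way to express that split is a per-role routing table; the model identifiers below are illustrative and will drift as providers ship new versions:

```python
# Role-to-model routing: frontier models where reasoning matters, cheaper models for execution.
MODEL_ROUTING = {
    "planner":   {"model": "gpt-4o",      "temperature": 0.2},
    "generator": {"model": "gpt-4o-mini", "temperature": 0.0},
    "evaluator": {"model": "gpt-4o",      "temperature": 0.0},
}
```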
You must define strict iteration limits to prevent infinite loops and control API spend. Cap the evaluation loop at three to five cycles. If the Generator cannot satisfy the Evaluator after five attempts, the system should halt and escalate the task to a human developer. A loop that fails to converge usually indicates a flawed initial plan or an impossible specification, not a generation error.
Orchestration requires managing state, scheduling, and communication between these isolated contexts. You can wire these agents together using code-first SDKs like ChatBotKit or visual workflow builders. Regardless of the tooling, keep the system prompts entirely isolated per role to maintain the integrity of the adversarial harness. State management must rigorously track which iteration cycle the system is currently executing to enforce the termination thresholds.
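Tying the earlier sketches together, the orchestration loop itself stays small: a hard iteration cap, isolated prompts per role, and an explicit escalation path. It reuses the hypothetical `plan_task`, `generate`, and `evaluate` helpers defined above:

```python
MAX_ITERATIONS = 5  # beyond this, assume the plan itself is flawed and stop spending tokens

def run_pge(user_request: str) -> str:
    """Run one planner-generator-evaluator cycle with a hard termination threshold."""
    spec = plan_task(user_request)
    feedback = None
    for iteration in range(1, MAX_ITERATIONS + 1):
        # The Generator only ever sees the spec plus the Evaluator's structured feedback.
        payload = spec if feedback is None else {**spec, "evaluator_feedback": feedback}
        draft = generate(payload)
        verdict = evaluate(spec, draft)
        # Record prompts, outputs, token counts, and latency here (see the telemetry sketch below).
        if verdict.get("passed"):
            return draft
        feedback = verdict.get("failures", [])
    raise RuntimeError(
        f"PGE loop did not converge after {MAX_ITERATIONS} iterations; escalate to a human developer."
    )
```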
The Cost and Complexity Trade-Offs
Adopting a three-agent harness introduces significant architectural complexity, increasing token overhead, latency, and debugging difficulty. Engineering teams must earn this complexity by proving a single-agent baseline cannot meet the required quality bar, and they must invest heavily in observability tooling to trace which agent introduced an error.
Multi-agent loops cost more and run slower. Passing context back and forth between three distinct models over multiple network requests adds unavoidable latency. If your user expects a real-time response, an iterative evaluation loop will feel sluggish and unresponsive. You are fundamentally trading speed for correctness and reliability.
Debugging also becomes a distributed systems problem. A failure could stem from a vague plan, a hallucinated generation, or an overly pedantic evaluation. As noted in the Codebridge decision framework, teams that ship reliable AI systems are those who earned every agent they added. You must be able to isolate whether the Planner missed a constraint or the Evaluator triggered a false positive.
Do not use this pattern for simple text summarization or basic data extraction. Reserve the PGE harness for high-stakes domains like code generation, continuous integration configuration, or infrastructure as code.¹
To maintain system observability, you must mandate robust logging for every iteration cycle. Record the exact prompt, the raw output, the token count, and the latency for every single role in the loop. Without this granular telemetry, diagnosing a failed multi-agent workflow devolves into blind guesswork.
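A minimal version of that telemetry, assuming nothing fancier than structured logs to start with (the field names are an illustrative choice):

```python
import json
import logging
import time

logger = logging.getLogger("pge")

def log_role_call(role: str, iteration: int, prompt: str, raw_output: str,
                  token_count: int, latency_s: float) -> None:
    """Emit one structured record per role per cycle so failures can be traced to a single agent."""
    logger.info(json.dumps({
        "timestamp": time.time(),
        "role": role,
        "iteration": iteration,
        "prompt": prompt,
        "raw_output": raw_output,
        "token_count": token_count,
        "latency_s": round(latency_s, 3),
    }))
```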
Conclusion
The three-agent harness proves that reliable AI outputs require architectural discipline and strict separation of concerns. Moving beyond monolithic prompts into specialized, adversarial feedback loops catches errors before they reach production. It forces foundation models to operate within the same quality assurance constraints we expect from human engineering teams.
Audit your most error-prone single-agent workflow today. Split the existing prompt into a distinct planning phase and a generation phase, and manually act as the Evaluator yourself to measure the quality delta. Do this before you write any orchestration code.
Footnotes
1. Anthropic engineering notes that harness design for long-running apps demands continuous iteration and monitoring. It is a practice, not a pattern you implement once and forget.