Building Long-Running AI Agents: Harnesses and Adversarial Loops

The Evolution of Agentic Longevity

Building agents capable of multi-hour or multi-day tasks requires moving past the limitations of simple, single-shot prompting. As models have evolved from Claude 3.7 to 4.6, the focus has shifted from basic bash command execution to complex, multi-step application development. The core challenge is managing context: as sessions lengthen, models suffer from 'context rot' (loss of coherence) and 'context anxiety' (rushing to finish as the window closes).

The Adversarial Harness Pattern

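Relying on an agent to evaluate its own work often leads to sycophancy or 'rubber-stamping.' The most effective pattern is instead an adversarial 'generator-critic' loop. It loosely mirrors Generative Adversarial Networks (GANs), in which a generator and a discriminator are pitted against each other, although here the critic returns a rubric-based evaluation rather than a training signal.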

The Generator: Focused solely on building features.
The Critic: A separate agent tasked with harsh, objective evaluation using tools like Playwright to interact with the live application.
The Advantage: It is significantly easier to tune a model to be a harsh critic than to force a generator to be self-critical. If the generator fails to meet the rubric, the harness discards the work and restarts, preventing the 'patching' of fundamentally flawed code.
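
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names run_adversarial_loop, Review, threshold, and max_attempts are hypothetical, the generator and critic are passed in as plain callables standing in for real model calls, and carrying the critic's notes into the next attempt is an assumption the article does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    scores: dict      # rubric criterion -> score in [0, 1] (assumed scale)
    feedback: str     # critic's notes for the next attempt


def run_adversarial_loop(
    generate: Callable[[str, str], str],
    critique: Callable[[str, str], Review],
    task: str,
    threshold: float = 0.8,   # hypothetical pass mark for every criterion
    max_attempts: int = 3,    # hypothetical retry budget
) -> tuple[Optional[str], int]:
    """Return (accepted_candidate_or_None, attempts_used)."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        # Each attempt starts from scratch; only the critic's notes carry over.
        candidate = generate(task, feedback)
        review = critique(task, candidate)
        # Accept only if every rubric criterion clears the threshold.
        if review.scores and min(review.scores.values()) >= threshold:
            return candidate, attempt
        # Otherwise discard the candidate instead of patching it.
        feedback = review.feedback
    return None, max_attempts


# Stub agents standing in for real model calls.
def stub_generate(task: str, feedback: str) -> str:
    return "v2" if feedback else "v1"


def stub_critique(task: str, candidate: str) -> Review:
    if candidate == "v2":
        return Review({"functionality": 0.9, "craft": 0.85}, "")
    return Review({"functionality": 0.4, "craft": 0.9}, "login form does not submit")
```

In a real harness the critic callable would drive the live application (for example through Playwright) before scoring. Gating on the minimum score rather than the average is one possible design choice: it prevents a strong score on one criterion from hiding a failure on another.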

Managing State and Planning

Long-running agents fail when they lack a persistent 'source of truth.'

Persistent Artifacts: Use JSON files for progress tracking rather than Markdown, as models are less likely to accidentally overwrite structured data.
Structured Handoffs: Break large tasks into a series of independent, testable 'sprint contracts.' Each task runs in a fresh context window to reset the model's state, preventing the accumulation of errors.
Rubrics for Subjectivity: You can grade 'taste' and design quality by defining a strict rubric (e.g., originality, craft, functionality). By providing few-shot examples of high-quality output, the evaluator's taste can be calibrated to match the developer's standards.
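
As a rough sketch of the persistent-artifact and sprint-contract ideas above, the harness can keep a small JSON file that each fresh session reads first and updates last. The file layout (a "sprints" list with "id", "goal", and "status" fields) and the helper names are assumptions made for illustration; the article does not prescribe a schema.

```python
import json
import os
import tempfile
from pathlib import Path


def load_progress(path):
    """Read the progress file, or return an empty plan if none exists yet."""
    p = Path(path)
    if not p.exists():
        return {"sprints": []}
    return json.loads(p.read_text(encoding="utf-8"))


def save_progress(path, state):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    p = Path(path)
    fd, tmp = tempfile.mkstemp(dir=str(p.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, p)


def next_sprint(state):
    """Return the first sprint still marked pending, or None if all are done."""
    for sprint in state["sprints"]:
        if sprint["status"] == "pending":
            return sprint
    return None


def mark_done(state, sprint_id):
    """Flip one sprint to done; raise if the id is unknown."""
    for sprint in state["sprints"]:
        if sprint["id"] == sprint_id:
            sprint["status"] = "done"
            return
    raise KeyError(sprint_id)
```

A session would then call load_progress, pick up next_sprint, do the work in its own context window, and finish with mark_done followed by save_progress. Because the model only ever changes a single status field, there is little room to clobber the rest of the plan.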

The Debugging Loop

Treat your agent's execution traces as your primary debugging loop. When an agent fails, don't just tweak the prompt; examine the harness. The goal is to build a system where the harness handles the 'boring' orchestration (environment setup, smoke tests, git commits), allowing the model to focus entirely on the logic. As models improve, parts of the harness will become redundant. The best builders are those who constantly evaluate which parts of their scaffolding can be deleted as the base model's capabilities expand.
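
One lightweight way to make traces inspectable is for the harness to append every step to a JSON Lines file and then scan for the first error. The sketch below assumes a hypothetical event shape ("step", "kind", "payload") and hypothetical helper names; it is not tied to any particular agent framework.

```python
import json
import time


def record_event(trace_path, step, kind, payload):
    """Append one event as a single JSON line; kind is e.g. 'tool_call' or 'error'."""
    event = {"ts": time.time(), "step": step, "kind": kind, "payload": payload}
    with open(trace_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def first_failure(trace_path):
    """Return the earliest event with kind 'error', or None if the run was clean."""
    with open(trace_path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event["kind"] == "error":
                return event
    return None
```

Looking at the earliest failure rather than the last one tends to point at the harness step that went wrong, instead of the downstream symptoms the model produced afterwards.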
