Long-Running Agents Persist Across Sessions for Days

Long-running agents overcome three walls (finite context, no persistent state, and no self-verification) by keeping state in external files such as plans and progress logs, decoupling brain, hands, and sessions, and running loops like Ralph. The payoff is hours-long tasks, such as an 11,000-line app, and even week-scale prospecting.

Overcome Core Walls with External State and Decoupling

Long-running agents address three universal challenges: finite context (even 1M-token windows degrade from context rot well before their limits), no persistent state (each session starts blank, like shift engineers with amnesia), and no self-verification (models prematurely declare completion). The solutions converge on state stored outside the model's context: structured files such as prd.json (the task list), progress.txt (lab notes), and AGENTS.md (rules). The agent remains amnesiac within each session, but the filesystem carries progress forward.
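The external-state idea can be sketched as context assembly: because the model wakes up amnesiac, everything it needs is re-read from disk at session start. A minimal sketch, assuming the file names from the article and a hypothetical prd.json layout:

```python
import json
from pathlib import Path

# Sketch of assembling a session's starting context from external state.
# The model itself remembers nothing; the filesystem is the memory.
# File names follow the article; the JSON layout is an assumption.
def build_context(root: Path) -> str:
    rules = (root / "AGENTS.md").read_text()              # standing rules
    prd = json.loads((root / "prd.json").read_text())     # task list
    notes = (root / "progress.txt").read_text()           # lab notes so far
    pending = [t["name"] for t in prd["tasks"] if not t["done"]]
    return (f"{rules}\n\n"
            f"Pending tasks: {', '.join(pending)}\n\n"
            f"Progress so far:\n{notes}")
```

Every session calls something like this first, so a fresh agent resumes exactly where the last one stopped.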

METR's time-horizon metric shows frontier models doubling their reliable task length every seven months since 2019; the TH1.1 update doubled measured horizons to roughly eight hours, projecting day-scale by 2028 and year-scale by 2034. This enables economic shifts: 10-minute agents fix bugs, while 10-hour agents own features or migrations that had stalled for quarters. Anthropic's Claude Sonnet ran more than 30 hours autonomously, producing an 11,000-line Slack app. Persistence also builds identity: the agent accumulates context such as competitor moves or flaky tests, as in Anthropic's month-long Project Vend vending simulation.
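The doubling trend implies a simple log relation. A back-of-envelope helper (illustrative only: the article's 2028/2034 projections also depend on the baseline date and on what counts as a "day" or "year" of work):

```python
import math

# Back-of-envelope sketch of the doubling trend: if reliable task length
# doubles every 7 months from an ~8-hour horizon, how many months until
# a given target horizon? Parameters are assumptions from the article.
def months_until(target_hours, current_hours=8.0, doubling_months=7.0):
    return doubling_months * math.log2(target_hours / current_hours)
```

Under these assumptions, `months_until(16)` is exactly one doubling period (7 months), and a 24-hour horizon lands a little over 11 months out.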

Architectures: Ralph Loop to Brain/Hands/Sessions

The Ralph loop, popularized by Geoffrey Huntley and Ryan Carson, is a bash script that iterates: pick the next task from prd.json, prompt with context and notes, call the agent, run tests, append to progress.txt, update the task list, repeat. Loops can be chained for analysis, planning, and execution, mirroring planner-generator-evaluator triads. Anthropic productizes this in harnesses: an initializer sets up feature-list.json and init.sh; the coding agent progresses incrementally, commits, and leaves claude-progress.txt; test ratchets prevent the agent from deleting tests.
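The loop's control flow can be paraphrased in a few lines (the original is a bash script; here it is sketched in Python, with `call_agent` and `run_tests` as hypothetical stand-ins for the real harness and an assumed prd.json layout):

```python
import json
from pathlib import Path

# Python paraphrase of the Ralph loop: pick next task, prompt the agent,
# gate on tests, log progress, update the task list, repeat.
def ralph_loop(root: Path, call_agent, run_tests, max_iters=50):
    for _ in range(max_iters):
        prd = json.loads((root / "prd.json").read_text())
        pending = [t for t in prd["tasks"] if not t["done"]]
        if not pending:
            return "all tasks complete"
        task = pending[0]

        notes_file = root / "progress.txt"
        notes = notes_file.read_text() if notes_file.exists() else ""
        call_agent(task, notes)          # one bounded agent turn
        if not run_tests():              # the ratchet: no progress
            continue                     # without passing tests

        task["done"] = True              # update tasks ...
        (root / "prd.json").write_text(json.dumps(prd, indent=2))
        with notes_file.open("a") as f:  # ... and append lab notes
            f.write(f"done: {task['name']}\n")
    return "iteration budget exhausted"
```

Because every iteration re-reads prd.json and progress.txt, the loop survives the agent's amnesia: a failed test simply means the same task comes up again on the next pass.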

Anthropic's Managed Agents decouple the Brain (model + loop), Hands (ephemeral sandboxes), and Session (append-only event log), enabling stateless harnesses, cattle-not-pets sandboxes, and 60% p50 / 90% p95 faster time-to-first-token. Sessions make recovery queryable. Cursor scales with Planners (emit tasks, spawn subagents), Workers (execute), and Judges (decide when an iteration is finished); GPT outperforms Opus for extended work because Opus stops early. Google's Gemini Enterprise Agent Platform offers Agent Runtime (days-long runs with sub-second starts), Sessions (pinnable to CRM records), Memory Bank (curated long-term memory; 50% faster expense submission), and Sandbox.
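The "queryable recovery" property of an append-only session log can be sketched as one JSON object per line: a stateless harness pointed at the log can replay or query it after a crash. A minimal sketch (the file name and event shape are assumptions, not Anthropic's actual format):

```python
import json
from pathlib import Path

# Sketch of an append-only session event log. Appends are the only
# write; recovery is a query over the log, not bespoke crash handling.
class SessionLog:
    def __init__(self, path: Path):
        self.path = path

    def append(self, kind: str, **data) -> None:
        with self.path.open("a") as f:           # append-only: never rewrite
            f.write(json.dumps({"kind": kind, **data}) + "\n")

    def replay(self) -> list:
        if not self.path.exists():
            return []
        return [json.loads(line) for line in self.path.read_text().splitlines()]

    def last_checkpoint(self):
        # A harness restarting from scratch asks the log where to resume.
        events = [e for e in self.replay() if e["kind"] == "checkpoint"]
        return events[-1] if events else None
```

Because the log, not the harness, holds session state, any replica of the harness can pick up the session by reading the same file.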

Production Patterns and Build Paths

Five patterns distinguish production systems:

1) Checkpoint-and-resume: save every N units, recover granularly.
2) Delegated approval: pause with full state for human review, resume sub-second.
3) Memory-layered context: govern drift via identity, registry, and gateway.
4) Ambient processing: event-driven, with policy enforced in the gateway.
5) Fleet orchestration: a coordinator delegates to isolated specialists.
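The first pattern reduces to persisting a cursor every N units of work and resuming from it after a crash. A minimal sketch, assuming an illustrative checkpoint file format and a caller-supplied `handle` for the per-item work:

```python
import json
from pathlib import Path

# Sketch of checkpoint-and-resume: process items in order, persist the
# cursor every N items, and on restart skip straight to the saved cursor.
def process_with_checkpoints(items, ckpt: Path, handle, every: int = 10):
    start = json.loads(ckpt.read_text())["cursor"] if ckpt.exists() else 0
    for i in range(start, len(items)):
        handle(items[i])
        if (i + 1) % every == 0:                 # granular recovery point
            ckpt.write_text(json.dumps({"cursor": i + 1}))
    ckpt.write_text(json.dumps({"cursor": len(items)}))
```

Note the semantics are at-least-once: items between the last checkpoint and the crash are reprocessed on resume, so `handle` should be idempotent.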

To build: for repo coding, use Claude Code or Cursor with a Ralph loop, an AGENTS.md checklist, git commits, and worktrees for overnight runs. For hosted products, pick a managed platform (Google Agent Platform, Claude Managed Agents) for the brain/hands/session split plus observability. For operational agents (monitoring, research), stack ADK + Memory Bank + Cloud Run. Prompting drives behavior; match models to roles (e.g., GPT for endurance). Start by scoping your longest uninterrupted unit of work.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge