Replay Model Breaks for Long-Running Agents

Traditional replay durability, which wraps each step in a journal and replays that journal on resume, made async workflows such as order processing reliable: a card is never charged twice. The journal caches the results of executed steps, provides an audit trail, and lets a workflow resume after human input or a failure. Agents invert this model. An LLM orchestrates tools in a loop, so every LLM call and every tool call becomes a journal step. The log balloons after a single turn, and multi-turn interactions hit limits on entry count or size. Agent runtimes are doubling every 4-7 months, from hours to days, which makes agents sessions rather than transactions. Replay also forces the code outside steps to be rigid and deterministic, complicates versioning, and cannot capture compute state such as files, memory, or subprocesses (e.g., cloned repos, dev servers).
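To make the journal growth concrete, here is a minimal sketch of step-level replay. The `Journal` class and `run_agent` loop are hypothetical and stand in for no particular workflow engine; they only illustrate that recorded steps are skipped on resume while every LLM and tool call still adds an entry.

```python
from typing import Any, Callable, List, Optional


class Journal:
    """Hypothetical step journal: caches each step's result by position."""

    def __init__(self, entries: Optional[List[Any]] = None) -> None:
        self.entries: List[Any] = list(entries or [])
        self.cursor = 0  # index of the next step in the current run

    def step(self, fn: Callable[[], Any]) -> Any:
        if self.cursor < len(self.entries):
            result = self.entries[self.cursor]  # replay: reuse the recorded result
        else:
            result = fn()  # first execution: run the step and record it
            self.entries.append(result)
        self.cursor += 1
        return result


def run_agent(journal: Journal, turns: int, calls: List[str]) -> None:
    """Toy agent loop: each LLM call and each tool call is one journal step."""
    for turn in range(turns):
        # list.append returns None, so `or` yields the string as the step result.
        journal.step(lambda: calls.append(f"llm-{turn}") or f"plan-{turn}")
        journal.step(lambda: calls.append(f"tool-{turn}") or f"result-{turn}")


calls: List[str] = []          # records which side effects actually ran
first = Journal()
run_agent(first, turns=3, calls=calls)

# Resume from the saved journal and run one more turn: the first three
# turns are replayed from the journal, only the fourth executes for real.
resumed = Journal(first.entries)
run_agent(resumed, turns=4, calls=calls)
```

Each turn adds two entries here; a real agent turn with many tool calls and large LLM responses adds far more, which is where entry-count and size limits start to bite.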

Split Durability: Context Logs + Execution Snapshots

Agents need two kinds of state: context (an append-only log of system and user messages, tool calls and results, and assistant responses) and execution (the VM state: files, memory, subprocesses). Store context in a database, object storage, or a distributed filesystem; it stays durable across code versions and crashes, and it scales well. For execution, snapshot and restore the machine: save it when the agent goes idle (say, the user steps out for lunch) and restore it on the next message. This preserves Git clones, installed packages, and datasets far more cheaply than an always-on VM. Recovery can be selective: snapshot the machine during an LLM outage and retry after 15 minutes, or restart execution from the context log after a machine bug or crash. Agents force a shift away from 30 years of stateless compute (CGI in 1993, then LAMP, then Node and serverless) toward stateful compute.
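A rough sketch of the split, assuming a JSONL file as the context store and string labels for the recovery paths. `ContextLog`, `choose_recovery`, and the failure names are all hypothetical; a real system would back the log with a database or object storage and call into an actual snapshot API.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


class ContextLog:
    """Append-only JSONL log of messages, tool calls, and results."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, role: str, content: str) -> None:
        # Entries are only ever appended, never rewritten in place.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"role": role, "content": content}) + "\n")

    def read_all(self) -> List[Dict[str, str]]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def choose_recovery(failure: str) -> str:
    """Map a failure kind to a recovery path (labels are hypothetical)."""
    if failure == "llm_outage":
        # The machine is healthy: keep its snapshot and retry the call later.
        return "restore_snapshot"
    if failure in ("machine_crash", "machine_bug"):
        # Machine state is suspect: boot fresh and rebuild from the log.
        return "rebuild_from_context"
    raise ValueError(f"unknown failure kind: {failure}")


with tempfile.TemporaryDirectory() as tmp:
    log = ContextLog(Path(tmp) / "session.jsonl")
    log.append("user", "clone the repo and run the tests")
    log.append("tool_result", "cloned; 42 tests passed")
    # A new process (for example after a crash) reopens the same log.
    restored = ContextLog(log.path).read_all()
```

The point of the design is that the log survives independently of the machine, so losing the VM never loses the conversation.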

Firecracker Snapshots: 14MB, Sub-Second Ops

CRIU (2011) enabled userspace checkpoint/restore of processes via 'parasite' code injection. It is transparent and container-compatible, but limited to single processes (no ffmpeg or Chrome), misses closed files, and is slow with registries. Firecracker microVMs snapshot the entire machine, so subprocesses resume seamlessly. A naive snapshot of 512MB of RAM bloats storage and transfer; compressing it with a seekable format (only the pages accessed on restore are decompressed) and adding layering yields a tunable size of 14MB. Results: snapshots take under 1s and restores take roughly 100-200ms. The open-source CLI 'fc-run' (docker-like) runs, snapshots, restores, and forks VMs quickly, at 15,000 starts per minute (described as the 30 FPS video equivalent), and powers Trigger.dev's agent compute.
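The seekable-compression idea can be illustrated with a small sketch. This is not Firecracker's or fc-run's actual snapshot format, which the text does not specify; it uses per-page `zlib` compression plus an offset index as a stand-in, and the 4096-byte page size and function names are assumptions.

```python
import zlib
from typing import List, Tuple

PAGE_SIZE = 4096  # assumed page size, for illustration only


def compress_seekable(
    memory: bytes, page_size: int = PAGE_SIZE
) -> Tuple[bytes, List[Tuple[int, int]]]:
    """Compress each page independently and return (blob, index).

    index[i] holds the (offset, length) of page i's compressed bytes in blob.
    """
    blob = bytearray()
    index: List[Tuple[int, int]] = []
    for start in range(0, len(memory), page_size):
        chunk = zlib.compress(memory[start:start + page_size])
        index.append((len(blob), len(chunk)))
        blob += chunk
    return bytes(blob), index


def read_page(blob: bytes, index: List[Tuple[int, int]], page_no: int) -> bytes:
    """Decompress only the requested page; other pages stay compressed."""
    offset, length = index[page_no]
    return zlib.decompress(blob[offset:offset + length])


# A mostly-zero memory image, like a lightly used guest: 256 pages (1 MiB).
memory = bytearray(256 * PAGE_SIZE)
memory[10 * PAGE_SIZE:10 * PAGE_SIZE + 5] = b"hello"

blob, index = compress_seekable(bytes(memory))
page = read_page(blob, index, 10)  # touches one page, not the whole image
```

Because each page is an independent compressed unit, a restore can fault pages in on demand instead of inflating the full RAM image up front, which is what keeps both the stored size and the restore latency low.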