WorldLines: Benchmarking Long-Horizon Stateful Embodied Agents

The Challenge of Long-Horizon Embodied State

Most current embodied AI benchmarks focus on short-term, reactive tasks. WorldLines shifts the focus toward 'long-horizon stateful' interactions, where an agent must maintain a persistent internal representation of the world and of its own history to succeed. The core problem it addresses is 'forgetting', or state drift, which sets in when agents operate over extended timeframes and leaves them unable to reconcile their past actions with what the environment currently requires.

The WorldLines Benchmark Framework

The authors propose WorldLines as a rigorous evaluation suite for testing agent persistence. Unlike static benchmarks, WorldLines requires agents to:

Maintain State: Track object properties, spatial relationships, and task progress across thousands of steps.
Handle Temporal Dependencies: Execute multi-stage plans where the outcome of step 100 is contingent on a decision made at step 5.
Adapt to Dynamic Environments: Manage state updates when the environment changes independently of the agent's actions.
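
The three requirements above can be illustrated with a small sketch of an explicit state store. This is not the WorldLines API or the authors' implementation: the class `WorldState`, its methods, and the key-and-door scenario are all hypothetical names chosen for illustration. The sketch only assumes that an agent records facts with the step at which they were written, records its own decisions, and accepts updates that it did not cause.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class WorldState:
    """Hypothetical explicit world state (illustrative, not a WorldLines API)."""
    # (object, property) -> (value, step at which it was last written)
    facts: Dict[Tuple[str, str], Tuple[Any, int]] = field(default_factory=dict)
    # decision name -> (choice, step at which it was made)
    decisions: Dict[str, Tuple[Any, int]] = field(default_factory=dict)

    def observe(self, step: int, obj: str, prop: str, value: Any) -> None:
        """Record a fact, whether caused by the agent or by the environment."""
        self.facts[(obj, prop)] = (value, step)

    def decide(self, step: int, name: str, choice: Any) -> None:
        """Record a decision so that later steps can depend on it."""
        self.decisions[name] = (choice, step)

    def get(self, obj: str, prop: str, default: Any = None) -> Any:
        """Return the latest known value of a property, or the default."""
        return self.facts.get((obj, prop), (default, -1))[0]

    def recall(self, name: str) -> Any:
        """Return the choice made for an earlier decision."""
        return self.decisions[name][0]


state = WorldState()

# Maintain state: at step 5 the agent stores the key and remembers where.
state.decide(5, "key_location", "drawer")
state.observe(5, "key", "location", "drawer")

# Dynamic environment: at step 60 the door locks without any agent action.
state.observe(60, "door", "locked", True)

# Temporal dependency: the action at step 100 depends on the step-5 decision.
if state.get("door", "locked"):
    action = f"fetch key from {state.recall('key_location')}"
else:
    action = "open door"
```

Storing the write step next to each value is a design choice made for this sketch; it would let an agent notice which facts are stale, which is one plausible way to reason about state drift.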

Modeling and Evaluation

The paper introduces a modeling approach that emphasizes memory-augmented architectures. Agents are forced to manage a 'world state' explicitly rather than relying solely on the context window of a Large Language Model, and the researchers demonstrate that this significantly improves success rates in complex multi-room and multi-object manipulation tasks. The benchmark also provides a standardized way to measure 'state fidelity', which quantifies how accurately an agent's internal model reflects the actual state of the simulated environment over time.
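
The summary above does not give the benchmark's actual definition of state fidelity, so the following is only one plausible way to make the idea concrete. It assumes that both the agent's belief and the simulator's ground truth can be flattened into dictionaries of facts, scores each step as the fraction of ground-truth facts the belief reproduces exactly, and averages over the episode. The function names `step_fidelity` and `mean_fidelity` and the example facts are hypothetical.

```python
from typing import Any, Dict, List


def step_fidelity(belief: Dict[str, Any], truth: Dict[str, Any]) -> float:
    """Fraction of ground-truth facts that the belief reproduces exactly."""
    if not truth:
        return 1.0  # nothing to get wrong at this step
    matches = sum(1 for key, value in truth.items()
                  if key in belief and belief[key] == value)
    return matches / len(truth)


def mean_fidelity(beliefs: List[Dict[str, Any]],
                  truths: List[Dict[str, Any]]) -> float:
    """Average per-step fidelity over an episode (assumed aggregation)."""
    if not beliefs or len(beliefs) != len(truths):
        raise ValueError("need one belief per ground-truth step, at least one step")
    scores = [step_fidelity(b, t) for b, t in zip(beliefs, truths)]
    return sum(scores) / len(scores)


# The door closes on its own at the second step, but the belief is not updated.
truths = [
    {"cup.location": "table", "door.state": "open"},
    {"cup.location": "table", "door.state": "closed"},
]
beliefs = [
    {"cup.location": "table", "door.state": "open"},
    {"cup.location": "table", "door.state": "open"},
]

fidelity = mean_fidelity(beliefs, truths)  # 0.75: steps score 1.0 and 0.5
```

Exact-match scoring treats every fact as equally important; a real metric might weight task-relevant facts more heavily or tolerate small numeric errors in positions.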
