The Failure of Current Spatial Memory in Agents

Most language agents rely on linguistic descriptions or 2D snapshots, and they frequently fail when asked to navigate or reason about 3D environments. The core issue is that these models often lack a persistent, coherent spatial memory. They treat objects as transient entities rather than permanent fixtures in a 3D coordinate system. This leads to "hallucinated" spatial relationships: an agent may lose track of an object's location the moment the object leaves the immediate field of view or is hidden behind another object.

Occlusion as the Definitive Test

The authors argue that occlusion, where one object blocks the view of another, is the most effective litmus test for true spatial understanding. If an agent cannot maintain a mental model of an object that is temporarily hidden, it does not possess a functional spatial memory. The paper proposes a framework in which agents are evaluated on their ability to predict the state, location, and existence of occluded objects. This shifts the focus from simple object recognition to the maintenance of object permanence, requiring the agent to integrate temporal and spatial data to infer the state of the world beyond its current visual input.
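
As a rough illustration of what an occlusion-based evaluation could look like, the sketch below scores an agent's answers about hidden objects. It is not the paper's protocol: the `Probe` record, the `score_occlusion` function, the two metrics, and the 0.25-unit distance tolerance are all assumptions made for this example, and object state is left out for brevity.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Probe:
    """One question about an object that is hidden at query time (hypothetical format)."""
    name: str
    true_exists: bool
    true_position: Optional[Vec3]   # None when the object does not exist
    pred_exists: bool
    pred_position: Optional[Vec3]   # None when the agent gives no location


def score_occlusion(probes: Sequence[Probe], tolerance: float = 0.25) -> dict:
    """Return existence and location accuracy over a set of occlusion probes.

    The tolerance is an assumed threshold, in scene units, for counting a
    predicted location as correct.
    """
    existence_hits = 0
    location_hits = 0
    location_total = 0
    for p in probes:
        # Existence: did the agent correctly say whether the hidden object is there?
        if p.pred_exists == p.true_exists:
            existence_hits += 1
        # Location: only scored for objects that really exist.
        if p.true_exists and p.true_position is not None:
            location_total += 1
            if (
                p.pred_exists
                and p.pred_position is not None
                and math.dist(p.pred_position, p.true_position) <= tolerance
            ):
                location_hits += 1
    return {
        "existence_accuracy": existence_hits / len(probes) if probes else 0.0,
        "location_accuracy": location_hits / location_total if location_total else 0.0,
    }
```

An agent that forgets a hidden object entirely loses credit on both metrics, while one that remembers the object but misplaces it loses credit only on location, which keeps the two failure modes separable.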

Moving Beyond 2D Representations

To address this, the authors suggest that agents must move toward internal representations that explicitly encode 3D geometry and object permanence. Relying on the LLM's latent space to "guess" spatial relationships is insufficient for complex tasks. Instead, agents need a dedicated memory structure that updates in real time as the agent moves and its perspective changes. By forcing agents to pass occlusion-based benchmarks, developers can move away from models that merely describe scenes toward those that can reliably interact with and navigate physical or simulated 3D spaces.
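
One way to picture such a dedicated memory structure is a small store keyed by object identity that keeps the last known 3D position of every object, whether or not the object is currently visible. The sketch below is only an assumption about what this could look like, not the authors' design: `SpatialMemory`, `ObjectRecord`, and their fields are hypothetical names, and positions are assumed to be already expressed in a fixed world frame.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ObjectRecord:
    position: Vec3    # last known position in the world frame
    last_seen: int    # time step of the most recent observation
    visible: bool     # whether the object appeared in the latest observation


@dataclass
class SpatialMemory:
    """Keeps objects after they leave view instead of discarding them."""
    objects: Dict[str, ObjectRecord] = field(default_factory=dict)

    def update(self, step: int, observed: Dict[str, Vec3]) -> None:
        # Objects missing from this observation are marked not visible,
        # but their records (and last known positions) are retained.
        for record in self.objects.values():
            record.visible = False
        # Objects present in this observation are added or refreshed.
        for name, position in observed.items():
            self.objects[name] = ObjectRecord(position, step, True)

    def locate(self, name: str) -> Optional[Vec3]:
        # Return the last known position, even if the object is occluded now.
        record = self.objects.get(name)
        return record.position if record is not None else None


memory = SpatialMemory()
memory.update(0, {"mug": (1.0, 0.0, 0.5), "box": (2.0, 0.0, 0.5)})
memory.update(1, {"box": (1.0, 0.0, 0.4)})  # the box now hides the mug
print(memory.locate("mug"))                 # last known position of the hidden mug
```

Keeping the record and flagging it as not visible, rather than deleting it, is the design choice that gives the agent object permanence. A fuller version would also have to handle objects that move while hidden, which this sketch does not attempt.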