The Problem: The Interface Bottleneck

Vision-Language Models (VLMs) often struggle with precise spatial reasoning, such as calculating distances between objects or understanding 3D trajectories. NVIDIA Research argues that the bottleneck is not necessarily the model's intelligence but the action interface through which it interacts with perception tools. Structured tool-calling, the traditional approach, often lacks the flexibility that complex geometric computation requires.

SpatialClaw: Code as the Action Interface

SpatialClaw bypasses the need for model retraining by wrapping an agent loop around a stateful Python kernel. Instead of calling predefined functions via rigid JSON schemas, the agent writes and executes Python code to manipulate perception data.
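To make "stateful" concrete, here is a minimal sketch of a kernel whose variables persist between code cells. It is an illustration under stated assumptions, not SpatialClaw's implementation: the class name StatefulKernel, its run method, and the placeholder frame list are all hypothetical.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy stand-in for a stateful Python kernel (hypothetical, for illustration)."""

    def __init__(self, preloaded=None):
        # Everything stored here stays visible to every later cell.
        self.namespace = dict(preloaded or {})

    def run(self, code):
        """Execute one cell; return captured stdout, plus a traceback on failure."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            return buffer.getvalue() + traceback.format_exc()
        return buffer.getvalue()


# Placeholder strings stand in for real image frames.
kernel = StatefulKernel(preloaded={"frames": ["frame_0", "frame_1", "frame_2"]})
first = kernel.run("n = len(frames)\nprint(n)")
second = kernel.run("print(n * 2)")  # `n` survives from the previous cell
failed = kernel.run("print(1 / 0)")  # errors come back as text, not as a crash
```

Because the namespace outlives each call, a later cell can reuse an intermediate result instead of recomputing it, and a failing cell returns its traceback as text the agent can read on the next turn.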

Key components include:

  • Stateful Kernel: Pre-loaded with input frames and perception primitives.
  • Perception Primitives: Core tools like tools.Reconstruct (using Depth Anything 3 for depth and camera geometry) and tools.SAM3 (for segmentation).
  • Five-Stage Loop: The agent follows a cycle of planning, code generation, execution, feedback assembly, and final answer submission.
  • Safety: A static Abstract Syntax Tree (AST) checker validates code before execution to prevent unsafe operations.
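The loop and the static check listed above might fit together roughly as follows. This is a sketch under several assumptions: the blocked-module policy, the scripted_policy stand-in for the VLM, planning once rather than every turn, and the convention that the agent submits by assigning a variable named answer are all hypothetical and not taken from the SpatialClaw release.

```python
import ast
import contextlib
import io

BLOCKED_MODULES = {"os", "subprocess", "socket"}  # hypothetical policy


def is_safe(source):
    """Static check: parse the code and reject imports of blocked modules."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            continue
        if any(m.split(".")[0] in BLOCKED_MODULES for m in modules):
            return False
    return True


def run_agent(question, policy, namespace, max_turns=5):
    """Plan once, then generate, check, execute, and collect feedback each turn."""
    feedback = []
    plan = policy("plan", question, feedback)  # stage 1: planning
    for _ in range(max_turns):
        code = policy("code", plan, feedback)  # stage 2: code generation
        if not is_safe(code):  # static check runs before any execution
            feedback.append("rejected by static check")
            continue
        out = io.StringIO()
        try:
            with contextlib.redirect_stdout(out):
                exec(code, namespace)  # stage 3: execution
            observation = out.getvalue()
        except Exception as err:
            observation = f"{type(err).__name__}: {err}"
        feedback.append(observation)  # stage 4: feedback assembly
        if "answer" in namespace:  # stage 5: final answer submission
            return namespace["answer"]
    return None


def scripted_policy(stage, context, feedback):
    """Stand-in for the VLM: returns canned text instead of calling a model."""
    if stage == "plan":
        return "measure the gap between the two points"
    if not feedback:
        return "import os"  # first attempt is rejected by the static check
    if len(feedback) == 1:
        return "gap = abs(points[1] - points[0])\nprint(gap)"
    return "answer = gap"


result = run_agent("How far apart are the points?", scripted_policy, {"points": [2.0, 5.5]})
```

In this toy run the first attempt is rejected before execution, the second computes and prints an intermediate value, and the third submits it. A real checker would need a much broader policy than an import blocklist.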

Performance and Impact

SpatialClaw achieves an average accuracy of 59.9% across 20 benchmarks, outperforming the 'SpaceTools' agent by 11.2 points. Because the framework is training-free, it can be paired with different backbones (such as Qwen3.5/3.6 and Gemma4) without any fine-tuning.

Analysis shows that the performance gains are driven by:

  • Code Composition (52.2%): The ability to chain multiple geometric operations.
  • Control Flow (19.5%): The ability to use loops and conditionals to refine spatial queries.
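The kind of program these two categories describe could look like the sketch below, which chains a pinhole back-projection into a distance computation (composition) and uses a loop and a conditional to skip frames where an object is missing (control flow). It is purely illustrative: the object names, pixel coordinates, depths, and camera intrinsics are made-up values, and it does not call tools.Reconstruct or tools.SAM3, whose signatures are not given here.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a given depth to camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def object_gap(frames, fx, fy, cx, cy):
    """Middle value of the per-frame 3D distances between two objects.

    Each frame maps an object name to (u, v, depth); frames missing either
    object are skipped. For an even count this returns the upper median.
    """
    gaps = []
    for frame in frames:  # control flow: iterate over frames
        a, b = frame.get("cup"), frame.get("lamp")
        if a is None or b is None:  # control flow: skip incomplete frames
            continue
        # composition: back-projection output feeds the distance computation
        pa = backproject(*a, fx, fy, cx, cy)
        pb = backproject(*b, fx, fy, cx, cy)
        gaps.append(math.dist(pa, pb))
    if not gaps:
        return None
    gaps.sort()
    return gaps[len(gaps) // 2]


# Made-up observations: (pixel u, pixel v, depth in metres) per object.
frames = [
    {"cup": (320, 240, 2.0), "lamp": (420, 240, 2.0)},
    {"cup": (300, 240, 2.0)},  # lamp not visible in this frame
    {"cup": (320, 240, 2.0), "lamp": (320, 340, 2.0)},
    {"cup": (320, 240, 2.0), "lamp": (320, 240, 2.5)},
]
gap = object_gap(frames, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Expressing this as a single structured tool call would require a dedicated "median distance across frames" tool to exist in advance; written as code, it is just a few lines composed on the spot.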

The largest improvements were observed on dynamic tasks (e.g., DSI-Bench and MindCube), where the agent must chain geometric computations across multiple frames and viewpoints.