The Limitation of Text-Only Reasoning

Standard Chain-of-Thought (CoT) prompting often struggles with spatial reasoning because text is linear and has no built-in structure for representing 2D or 3D environments. When a model tracks coordinates and object relationships only through linguistic tokens, small mistakes accumulate as the sequence grows. The result is 'drift', where the positions the model believes in gradually diverge from the true layout, along with other logical inconsistencies.

State-Aware Visualization-of-Thought (SVoT)

SVoT is a framework that bridges the gap between linguistic reasoning and spatial awareness. Instead of describing spatial states purely through text, the model is required to generate a 'visualized' state representation at each step of the reasoning process. The spatial layout is treated as a dynamic state that must be updated and verified throughout the inference chain.
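To make the idea of a per-step 'visualized' state concrete, here is a minimal sketch in Python. It is not the SVoT implementation: the grid world, the `GridState` class, the `MOVES` vocabulary, and the ASCII rendering are all assumptions chosen for illustration. It only shows what it might look like to interleave each action with an explicit rendering of the updated layout.

```python
from dataclasses import dataclass

# Hypothetical move vocabulary: action name -> (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


@dataclass(frozen=True)
class GridState:
    """Illustrative spatial state: one agent on a grid with walls."""
    rows: int
    cols: int
    agent: tuple
    walls: frozenset

    def render(self) -> str:
        # Draw the grid as text: 'A' agent, '#' wall, '.' free cell.
        lines = []
        for r in range(self.rows):
            line = ""
            for c in range(self.cols):
                if (r, c) == self.agent:
                    line += "A"
                elif (r, c) in self.walls:
                    line += "#"
                else:
                    line += "."
            lines.append(line)
        return "\n".join(lines)

    def step(self, action: str) -> "GridState":
        # Apply one move and return the new state; reject invalid moves.
        dr, dc = MOVES[action]
        r, c = self.agent[0] + dr, self.agent[1] + dc
        inside = 0 <= r < self.rows and 0 <= c < self.cols
        if not inside or (r, c) in self.walls:
            raise ValueError(f"invalid move: {action}")
        return GridState(self.rows, self.cols, (r, c), self.walls)


def trace(state: GridState, actions: list) -> list:
    """Return the initial rendering, then one 'action + rendering' frame per step."""
    frames = [state.render()]
    for action in actions:
        state = state.step(action)
        frames.append(f"{action}\n{state.render()}")
    return frames


if __name__ == "__main__":
    start = GridState(rows=2, cols=3, agent=(0, 0), walls=frozenset({(0, 1)}))
    print("\n\n".join(trace(start, ["down", "right"])))
```

In this sketch the state is rendered by a simulator. In the setting described above, the model itself would produce the rendering, and a structure like this one would serve as the reference to check it against.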

Reinforcement Learning for Spatial Accuracy

The core innovation of SVoT is the use of Reinforcement Learning (RL) to train the model to produce these visual state representations. The model is rewarded for maintaining spatial consistency and logical accuracy across its 'thought' steps, so it learns to ground its reasoning in a structured spatial map. The visual state acts as a persistent memory buffer: before taking its next textual action, the model must reconcile that action with the current state, which discourages hallucinated object positions and invalid movements. The result is a significant reduction in spatial errors compared to traditional CoT methods, because the model is effectively forced to 'see' the environment it is reasoning about.
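The text does not specify the reward function or the RL algorithm, so the following is only a sketch of one plausible consistency reward, under stated assumptions. It assumes a grid world with a single agent, a model output already parsed into `(action, claimed_grid)` pairs, and a reference simulator that replays the actions. The names `apply_move`, `render`, and `consistency_reward`, and the penalty value, are hypothetical.

```python
# Hypothetical move vocabulary: action name -> (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def apply_move(pos, action, walls, rows, cols):
    """Return the new position, or None if the move is unknown or invalid."""
    if action not in MOVES:
        return None
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    if not (0 <= r < rows and 0 <= c < cols) or (r, c) in walls:
        return None
    return (r, c)


def render(pos, walls, rows, cols):
    """Reference rendering: 'A' agent, '#' wall, '.' free cell."""
    return "\n".join(
        "".join(
            "A" if (r, c) == pos else "#" if (r, c) in walls else "."
            for c in range(cols)
        )
        for r in range(rows)
    )


def consistency_reward(start, walls, rows, cols, steps, invalid_penalty=-1.0):
    """Score a reasoning trace given as a list of (action, claimed_grid) pairs.

    Assumed scheme: an invalid move ends the episode with a penalty;
    otherwise the reward is the fraction of steps whose claimed grid
    matches the reference simulator's grid.
    """
    if not steps:
        return 0.0
    pos = start
    matched = 0
    for action, claimed_grid in steps:
        nxt = apply_move(pos, action, walls, rows, cols)
        if nxt is None:
            return invalid_penalty
        pos = nxt
        if claimed_grid.strip() == render(pos, walls, rows, cols):
            matched += 1
    return matched / len(steps)


if __name__ == "__main__":
    walls = {(0, 1)}
    good = [("down", ".#.\nA.."), ("right", ".#.\n.A.")]
    print(consistency_reward((0, 0), walls, 2, 3, good))
```

A scalar like this could be passed to a policy-gradient style update as the episode return. Which optimizer and which reward shaping SVoT actually uses is not stated in the text, so this should be read as an illustration of the idea and not as a description of the method.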