Addressing the High-Order Theory of Mind Bottleneck
Theory of Mind (ToM), the ability to attribute mental states to oneself and others, remains a significant hurdle for large language models, particularly when reasoning scales to 'high-order' beliefs nested inside other beliefs (e.g., 'I think that you think that he knows...'). Current models often struggle with these recursive social dynamics because existing datasets lack the depth required to train robust ToM capabilities. The authors introduce OSCToM, a framework designed to close this gap by replacing reliance on static datasets with an active, adversarial generation process.
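The "order" of a belief statement is the number of mental-state attributions nested inside it. Counting that nesting makes the idea concrete. The sketch below is a minimal illustration that assumes a simple recursive data structure. `Belief`, `order`, and `render` are hypothetical names that do not come from OSCToM, and the completed sentence about the key is an invented example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Belief:
    """A mental-state attribution: `agent` `verb` that `content` (hypothetical structure)."""
    agent: str
    verb: str                    # e.g. "thinks", "knows"
    content: Union[str, Belief]  # a plain fact (str) or another nested Belief


def order(b: Union[str, Belief]) -> int:
    """Count nested attributions; a plain fact is order 0."""
    return 0 if isinstance(b, str) else 1 + order(b.content)


def render(b: Union[str, Belief]) -> str:
    """Render a (possibly nested) belief as an English sentence."""
    if isinstance(b, str):
        return b
    return f"{b.agent} {b.verb} that {render(b.content)}"


# Illustrative third-order statement (invented example content).
stmt = Belief("I", "think", Belief("you", "think", Belief("he", "knows", "the key is in the drawer")))
print(order(stmt))   # 3
print(render(stmt))  # I think that you think that he knows that the key is in the drawer
```

Each added layer of "X thinks that..." raises the order by one. The source calls this 'high-order' reasoning.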
The OSCToM Framework: RL-Guided Adversarial Generation
OSCToM utilizes a reinforcement learning (RL) loop to guide the generation of adversarial examples that specifically target the weaknesses in a model's ToM reasoning. Instead of relying on human-curated prompts, the system:
- Generates Adversarial Scenarios: Uses a generator to create complex social interactions that test recursive belief states.
- RL-Guided Optimization: Employs an RL agent to evaluate the generated scenarios, rewarding those that successfully challenge the model's current ToM limits while maintaining coherence.
- Iterative Refinement: Continuously updates the generation process based on the model's failure modes, effectively creating a 'curriculum' of increasingly difficult social reasoning tasks.
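The loop in the list above can be illustrated with a toy REINFORCE sketch. It is not the OSCToM implementation. The generator, the target model, the reward definition, and every name here (`generate_scenario`, `target_is_correct`, `reinforce_step`, the five-way belief-order action space) are hypothetical stand-ins. In this toy, the "policy" only chooses which belief order to target, whereas the source describes a generator that produces full social scenarios.

```python
import math
import random

ORDERS = [1, 2, 3, 4, 5]  # hypothetical action space: belief order to target


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(coherent, target_correct):
    """1.0 only for scenarios that are coherent AND make the target model fail."""
    return 1.0 if coherent and not target_correct else 0.0


def reinforce_step(logits, action, advantage, lr=0.5):
    """REINFORCE update for a softmax policy, using grad log pi(a) = onehot(a) - pi."""
    probs = softmax(logits)
    return [
        theta + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (theta, p) in enumerate(zip(logits, probs))
    ]


# --- Toy stand-ins for the scenario generator and the target model (hypothetical) ---
def generate_scenario(order, rng):
    """Deeper nesting is assumed to be more likely to yield an incoherent story."""
    return {"order": order, "coherent": rng.random() < 1.0 - 0.15 * (order - 1)}


def target_is_correct(scenario, skill, rng):
    """Success probability is a logistic function of (skill - order) for that order."""
    k = scenario["order"]
    p_correct = 1.0 / (1.0 + math.exp(k - skill[k]))
    return rng.random() < p_correct


def run(steps=2000, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(ORDERS)
    skill = {k: 1.0 for k in ORDERS}  # simulated per-order competence of the target
    baseline = 0.0
    for _ in range(steps):
        a = rng.choices(range(len(ORDERS)), weights=softmax(logits))[0]
        scenario = generate_scenario(ORDERS[a], rng)
        correct = target_is_correct(scenario, skill, rng)
        r = reward(scenario["coherent"], correct)
        logits = reinforce_step(logits, a, r - baseline)
        baseline = 0.99 * baseline + 0.01 * r  # running-average baseline
        if r > 0:
            skill[scenario["order"]] += 0.01  # crude proxy for training on the failure
    return softmax(logits), skill


probs, skill = run()
print({k: round(p, 3) for k, p in zip(ORDERS, probs)})
```

The reward requires both a target failure and a coherent scenario. That pushes the policy toward orders that are hard for the target but still well-formed. As the simulated skill rises on those orders, their failure rate falls, a toy analogue of the 'curriculum' effect described above. The coherence probabilities, learning rate, and baseline are arbitrary choices for this illustration.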
This approach enables systematic exploration of high-order mental states that are rarely captured in standard training corpora. By forcing the model to resolve contradictions and recursive dependencies in these generated scenarios, OSCToM improves the model's internal representation of social dynamics and intent attribution. The authors release their implementation in a public repository, so researchers can apply this adversarial training approach to their own models to improve performance on social reasoning benchmarks.
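To see what a 'recursive dependency' looks like inside a generated scenario, the sketch below computes ground-truth answers to nested belief questions over a simple event log. The source does not describe OSCToM's scenario format. This sketch assumes a Sally-Anne-style story in which the observers of an event also see each other. Under that simplification, a nested belief reflects only events witnessed by every agent in the chain. `Event`, `nested_belief`, and `question` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Event:
    description: str
    location: str              # where the object is after this event
    observers: FrozenSet[str]  # agents who witness the event


def nested_belief(chain: Sequence[str], events: List[Event], initial: str) -> str:
    """Answer 'Where does chain[0] think chain[1] thinks ... the object is?'.

    Simplifying assumption: observers of an event also see each other, so a
    nested belief reflects only events witnessed by every agent in the chain.
    """
    location = initial  # assume everyone saw the starting location
    for event in events:
        if all(agent in event.observers for agent in chain):
            location = event.location
    return location


def question(chain: Sequence[str], obj: str = "the marble") -> str:
    """Render the nested question in English."""
    inner = "".join(f"{agent} thinks " for agent in chain[1:])
    return f"Where does {chain[0]} think {inner}{obj} is?"


events = [
    Event("Sally puts the marble in the basket", "basket", frozenset({"Sally", "Anne"})),
    Event("Anne moves the marble to the box while Bob watches", "box", frozenset({"Anne", "Bob"})),
]

for chain in (["Sally"], ["Anne", "Sally"], ["Sally", "Anne", "Bob"]):
    print(question(chain), "->", nested_belief(chain, events, initial="bag"))
```

Under these assumptions, the first two questions evaluate to "basket". The third-order question evaluates to "bag", because no single move was witnessed by Sally, Anne, and Bob together. A model that answers "box" has tracked the true location instead of the nested belief.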