The Problem: Reactive Agents vs. Predictive Planning

Most current LLM agents operate reactively: they cannot simulate future outcomes before committing to a decision. Fine-tuning agents on look-ahead traces is a common post-training strategy, but it often results in "superficial mimicry", where the agent reproduces the format of foresight without genuine predictive grounding. To move beyond this, the authors propose internalizing a world model directly into the autoregressive policy, so that the model can verbalize prospective state rollouts and plan-conditioned success estimates (a textual analogue of Q-values).
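To make the Q-value analogy concrete, here is a minimal sketch of how verbalized foresight could drive plan selection. The `PLAN | NEXT_STATE | P_SUCCESS` line format, the `Foresight` class, and both function names are assumptions invented for this illustration; the source does not specify the authors' actual output format.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Foresight:
    plan: str                # candidate plan described in text
    predicted_state: str     # verbalized rollout: the state the model expects
    success_estimate: float  # verbalized success probability in [0, 1]


# Hypothetical line format, assumed only for this sketch.
_PATTERN = re.compile(
    r"PLAN:\s*(?P<plan>.+?)\s*\|\s*"
    r"NEXT_STATE:\s*(?P<state>.+?)\s*\|\s*"
    r"P_SUCCESS:\s*(?P<p>[01](?:\.\d+)?)"
)


def parse_foresight(text: str) -> List[Foresight]:
    """Extract every well-formed foresight line; skip anything else."""
    candidates = []
    for line in text.splitlines():
        match = _PATTERN.search(line)
        if match:
            # Clamp so a malformed estimate such as 1.5 stays in [0, 1].
            p = min(max(float(match.group("p")), 0.0), 1.0)
            candidates.append(
                Foresight(match.group("plan"), match.group("state"), p)
            )
    return candidates


def choose_plan(candidates: List[Foresight]) -> Foresight:
    """Pick the plan with the highest verbalized success estimate,
    mirroring an argmax over Q-values."""
    if not candidates:
        raise ValueError("no foresight candidates to choose from")
    return max(candidates, key=lambda f: f.success_estimate)


model_output = """\
PLAN: search the docs | NEXT_STATE: relevant page found | P_SUCCESS: 0.7
PLAN: guess the answer | NEXT_STATE: unverified answer | P_SUCCESS: 0.2
"""
best = choose_plan(parse_foresight(model_output))
print(best.plan)  # prints "search the docs"
```

The selection step is only as good as the estimates it reads, which is why the distinction between mimicking this format and producing grounded predictions matters.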

A Three-Stage Training Paradigm

To bridge the gap between format mimicry and true predictive capability, the authors introduce a structured, capability-first training pipeline:

  1. World Model Agentic Mid-Training (WM-AMT): This initial stage injects latent predictive capability into the policy, so that the model learns to capture world dynamics rather than just predict the next token in a sequence.
  2. Format-Eliciting SFT (FE-SFT): Once the underlying capability is established, this stage uses Supervised Fine-Tuning to structure the output, teaching the model how to express its internal simulations in a coherent, usable format.
  3. Foresight-Conditioned Reinforcement Learning (FC-RL): The final stage refines the model through RL, focusing specifically on the calibration and utility of the generated simulations. The goal is "what-if" reasoning that is both accurate and useful for decision-making.
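To illustrate what rewarding calibration and utility together might look like in the third stage, the sketch below combines a task-success term with a Brier-score penalty on the verbalized success estimate. This is an assumed reward shape, not the authors' objective: the source does not give the FC-RL reward, and `brier_penalty`, `foresight_reward`, and `calibration_weight` are hypothetical names chosen for the example.

```python
def brier_penalty(predicted_success: float, succeeded: bool) -> float:
    """Squared error between the stated success probability and the outcome.
    0.0 means a perfectly confident, correct estimate; 1.0 is the worst case."""
    outcome = 1.0 if succeeded else 0.0
    return (predicted_success - outcome) ** 2


def foresight_reward(
    predicted_success: float,
    succeeded: bool,
    calibration_weight: float = 0.5,  # assumed trade-off, not from the source
) -> float:
    """Utility term (did the episode succeed?) minus a calibration penalty."""
    task_reward = 1.0 if succeeded else 0.0
    return task_reward - calibration_weight * brier_penalty(
        predicted_success, succeeded
    )


# An overconfident failure is penalized far more than a failure the
# agent anticipated with a low success estimate.
print(foresight_reward(0.9, succeeded=False))
print(foresight_reward(0.1, succeeded=False))
```

Under this assumed shape, an agent cannot raise its reward by always verbalizing high confidence, because a confident estimate that turns out wrong costs more than an honest low one.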

By separating the acquisition of predictive capability from the formatting of that output, the model achieves more grounded and reliable foresight than standard fine-tuning approaches. The authors report that this approach consistently outperforms existing baselines on search and mathematical reasoning tasks, which they present as evidence that effective internal world modeling requires a multi-stage, capability-first training pipeline.