The Hallucination Problem in Agentic Planning

LLM-based agents often struggle with world modeling. When they reason about state changes in complex environments, they frequently hallucinate transitions that are logically impossible or inconsistent with the environment's rules. Because these agents reason in natural language, such errors are difficult to quantify with standard regression metrics. Parameterized world models (trained transition predictors) are easier to evaluate, using metrics such as NodeMSE and delta accuracy, but they often lack the flexible reasoning capabilities of LLMs.
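
To make the contrast concrete, the sketch below scores both kinds of error on a toy graph. It is illustrative only: the adjacency-list environment, the rule that a transition is valid only along an existing edge, and the function names are all assumptions, and the per-node mean squared error shown here is a generic stand-in, not necessarily how NodeMSE is defined in the original work.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy environment: a directed graph as an adjacency list.
Graph = Dict[str, List[str]]


def hallucinated_state_rate(graph: Graph, transitions: Sequence[Tuple[str, str]]) -> float:
    """Fraction of imagined (source, destination) moves that follow no edge.

    Assumption: a transition counts as hallucinated when the graph has no
    edge from source to destination.
    """
    if not transitions:
        return 0.0
    bad = sum(1 for src, dst in transitions if dst not in graph.get(src, []))
    return bad / len(transitions)


def node_mse(pred: Sequence[Sequence[float]], true: Sequence[Sequence[float]]) -> float:
    """Mean squared error over every feature of every node (generic definition)."""
    squared = [(p - t) ** 2 for p_node, t_node in zip(pred, true) for p, t in zip(p_node, t_node)]
    return sum(squared) / len(squared)


toy_graph: Graph = {"A": ["B"], "B": ["C"], "C": []}
imagined = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]

rate = hallucinated_state_rate(toy_graph, imagined)  # two of four moves follow no edge
mse = node_mse([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 0.0]])  # one of four features is off by 1
print(rate, mse)
```

The first function needs explicit environment rules to check against, which free-form language reasoning does not supply on its own; the second only needs numeric predictions and targets.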

The GILP Approach: Combining Reasoning with Grounding

Grounded Iterative Language Planning (GILP) bridges this gap by integrating a small, trained parameterized backbone with an API-based LLM agent. The workflow functions as follows:

  1. Backbone Guidance: The parameterized model provides the agent with valid actions, predicted state deltas, risk assessments, and value estimates.
  2. LLM Drafting: The LLM agent generates a proposed action and an imagined state delta based on the backbone's input.
  3. Consistency Gating: A consistency gate compares the LLM's proposed action and imagined state delta with the backbone's predictions. If the two disagree, the system triggers a revision, forcing the agent to re-evaluate its plan.
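
The three steps above can be sketched as a short control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Guidance` and `Draft` types, the exact-match consistency rule, the revision budget, and the fallback to the backbone's highest-value action are all hypothetical, and the LLM is replaced by a stub function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Delta = Tuple[str, str]  # hypothetical state delta: (node, new_status)


@dataclass
class Guidance:
    """Step 1: what the backbone hands to the agent (field names are assumed)."""
    valid_actions: List[str]
    predicted_delta: Dict[str, Delta]
    risk: Dict[str, float]
    value: Dict[str, float]


@dataclass
class Draft:
    """Step 2: the LLM's proposed action and imagined state delta."""
    action: str
    imagined_delta: Delta


# The drafting function receives feedback=None on the first call.
DraftFn = Callable[[Guidance, Optional[str]], Draft]


def is_consistent(draft: Draft, guidance: Guidance) -> bool:
    """Step 3: assumed gate rule, valid action plus exact delta agreement."""
    if draft.action not in guidance.valid_actions:
        return False
    return guidance.predicted_delta.get(draft.action) == draft.imagined_delta


def plan_step(guidance: Guidance, draft_fn: DraftFn, max_revisions: int = 2) -> Tuple[Draft, int]:
    """Draft, gate, and revise; returns the accepted draft and the LLM call count."""
    feedback: Optional[str] = None
    calls = 0
    for _ in range(max_revisions + 1):
        draft = draft_fn(guidance, feedback)
        calls += 1
        if is_consistent(draft, guidance):
            return draft, calls
        feedback = f"{draft} disagrees with the backbone; please revise."
    # Assumed fallback once the revision budget is spent.
    best = max(guidance.valid_actions, key=lambda a: guidance.value[a])
    return Draft(best, guidance.predicted_delta[best]), calls


def stub_llm(guidance: Guidance, feedback: Optional[str]) -> Draft:
    """Stand-in for the API-based LLM: hallucinates once, then corrects itself."""
    if feedback is None:
        return Draft("move_to_C", ("C", "visited"))  # not a valid action
    return Draft("move_to_B", ("B", "visited"))


guidance = Guidance(
    valid_actions=["move_to_B"],
    predicted_delta={"move_to_B": ("B", "visited")},
    risk={"move_to_B": 0.1},
    value={"move_to_B": 0.9},
)
accepted, llm_calls = plan_step(guidance, stub_llm)
print(accepted.action, llm_calls)
```

In this sketch the extra LLM calls occur only when the gate rejects a draft, so overhead grows with the disagreement rate and not with the number of planning steps.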

Performance and Impact

This hybrid architecture improves reliability in graph-structured planning tasks. In experiments using GPT-4o-mini, GILP reduced the hallucinated-state rate from 0.176 to 0.035. In calibrated simulator tests, it increased the success rate from 0.668 to 0.838. These gains come at modest overhead: roughly 22% additional LLM calls.
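
For scale, the reported figures can be converted into relative changes. The snippet below only does arithmetic on the numbers quoted above; the baseline of 100 LLM calls is a hypothetical value chosen to illustrate the overhead.

```python
# Figures quoted in the text.
halluc_before, halluc_after = 0.176, 0.035
success_before, success_after = 0.668, 0.838
extra_call_fraction = 0.22  # "~22% additional LLM calls"

relative_halluc_reduction = (halluc_before - halluc_after) / halluc_before
absolute_success_gain = success_after - success_before
relative_success_gain = absolute_success_gain / success_before

# Hypothetical baseline of 100 LLM calls, for illustration only.
baseline_calls = 100
gilp_calls = round(baseline_calls * (1 + extra_call_fraction))

print(f"hallucinated-state rate: {relative_halluc_reduction:.1%} relative reduction")  # 80.1%
print(f"success rate: +{absolute_success_gain:.3f} absolute, {relative_success_gain:.1%} relative")  # +0.170, 25.4%
print(f"LLM calls: {baseline_calls} -> about {gilp_calls}")  # 100 -> about 122
```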