Moving Beyond Static Output Evaluation

Traditional model evaluation often relies on static benchmarks or single-turn prompt-response pairs. This research argues that understanding model reasoning requires examining 'trajectories': the full sequence of thoughts, tool calls, and intermediate actions an agent takes to reach a conclusion. By dissecting these paths, researchers can identify where models deviate from logical reasoning, get stuck in loops, or rely on superficial patterns rather than robust problem-solving strategies.

Trajectory Analysis as a Diagnostic Tool

With 50 figures and 16 tables, the paper establishes a methodology for mapping agent behavior. The core insight is that an agent's trajectory acts as a high-fidelity trace of its internal state and decision-making process. By analyzing these traces, developers can:

  • Identify Failure Modes: Distinguish 'hallucination' caused by poor tool usage from 'hallucination' caused by incorrect internal knowledge retrieval.
  • Optimize Reasoning Paths: Determine whether a model is taking unnecessary steps that increase latency and cost without improving accuracy.
  • Improve Robustness: Detect patterns in trajectories that lead to brittle performance, allowing for targeted fine-tuning or prompt refinement to steer the model toward more reliable decision-making sequences.
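
To make the checks in this list concrete, here is a minimal Python sketch of two trajectory-level diagnostics: flagging tool calls repeated with identical arguments (a candidate unnecessary step) and detecting a block of steps that repeats back to back (a candidate loop). The Step schema, the function names, and the sample trajectory are assumptions made for illustration; they are not taken from the paper, which may represent and analyze trajectories differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One entry in an agent trajectory (hypothetical schema, not the paper's)."""
    kind: str     # "thought", "tool_call", or "observation"
    name: str     # tool name, or "" for thoughts
    payload: str  # tool arguments or free text


def repeated_tool_calls(trajectory):
    """Return tool calls issued more than once with identical arguments."""
    counts = Counter(
        (s.name, s.payload) for s in trajectory if s.kind == "tool_call"
    )
    return {call: n for call, n in counts.items() if n > 1}


def has_loop(trajectory, window=2):
    """Return True if some block of `window` steps occurs twice back to back."""
    steps = list(trajectory)
    for i in range(len(steps) - 2 * window + 1):
        if steps[i:i + window] == steps[i + window:i + 2 * window]:
            return True
    return False


# Illustrative trajectory: the agent retries the same failing search.
trajectory = [
    Step("thought", "", "look up the population"),
    Step("tool_call", "search", "population of Lyon"),
    Step("observation", "search", "no results"),
    Step("tool_call", "search", "population of Lyon"),
    Step("observation", "search", "no results"),
    Step("thought", "", "answer from memory"),
]

print(repeated_tool_calls(trajectory))
print(has_loop(trajectory))
```

Exact-match comparison is deliberately simple. A real analysis would likely need fuzzier matching, since an agent stuck in a loop often rephrases its query slightly on each retry.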

This approach shifts the focus from 'does the model get the right answer?' to 'how does the model arrive at the answer?', providing a more granular understanding of model reliability in production environments.