The Shift to Agentic Evaluation

Modern frontier models are no longer just chatbots; they are agentic systems that use tools, maintain state, and execute multi-step workflows. Consequently, evaluation results reflect not only the model but the entire "harness": the surrounding environment, control logic, and scaffolding. A trustworthy evaluation report must state explicitly which claim is being tested (capability elicitation, safeguard performance, or comparative benchmarking) and provide evidence that the harness is a credible proxy for that claim.

The Critical Role of the Harness

The harness determines the observed performance ceiling. Evaluators must choose a harness that aligns with their specific goal:

  • Capability Elicitation: Use the strongest credible setup (tools, budget, scaffolding) a capable user would employ. If performance is still improving as compute or budget increases, report the result as a lower-bound estimate rather than a fixed ceiling.
  • Controlled Comparison: Use a fixed, standardized harness so that differences in scores reflect differences between models rather than differences in setup.
  • Safeguard Testing: Match the adversary. If testing robustness against expert misuse, the harness must support the strongest credible end-to-end attack strategy, including multi-turn persistence.
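
The lower-bound rule for capability elicitation can be sketched as a check over a budget sweep. This is only an illustration under stated assumptions: the `BudgetPoint` record, the `label_estimate` helper, the `min_gain` threshold, and the sample numbers are all hypothetical and do not come from any particular evaluation framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetPoint:
    """Success rate observed at one budget level (hypothetical record)."""
    budget_tokens: int
    success_rate: float


def label_estimate(points: list[BudgetPoint], min_gain: float = 0.02) -> str:
    """Label a capability result as "lower bound" or "plateau".

    Assumption: if the success rate at the largest tested budget exceeds the
    rate at the second-largest budget by at least `min_gain`, performance is
    still improving and the headline number is only a lower bound.
    """
    if len(points) < 2:
        # One budget level cannot show saturation, so stay conservative.
        return "lower bound"
    ordered = sorted(points, key=lambda p: p.budget_tokens)
    gain = ordered[-1].success_rate - ordered[-2].success_rate
    return "lower bound" if gain >= min_gain else "plateau"


# Made-up sweep for illustration only.
sweep = [
    BudgetPoint(100_000, 0.20),
    BudgetPoint(400_000, 0.35),
    BudgetPoint(1_600_000, 0.50),
]
print(label_estimate(sweep))
```

A "plateau" label here only means no meaningful gain was seen within the tested budgets; it does not establish a true ceiling.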

Validating Results Against Hazards

Headline scores are easily distorted by several known hazards. Evaluators must manually review intermediate artifacts (such as reasoning traces) to identify:

  • Reward Hacking: High scores achieved by exploiting task shortcuts rather than by demonstrating the intended capability.
  • Sandbagging: Strategic underperformance when a model recognizes it is being evaluated.
  • Contamination: Inflated scores because the model saw the evaluation tasks during training or retrieved them through external browsing.
  • Broken Problems: Unsolvable environments, incorrect ground truth, or ambiguous prompts that lead to unfair scoring.
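
Manual review is easier to prioritize when tasks are first sorted into queues. The sketch below is a triage heuristic, not a detector: it assumes a hypothetical run record with `task_id`, `config`, and `solved` fields, and it only suggests where a reviewer might look first. Tasks that no configuration ever solves are candidates for broken problems, and tasks that every configuration solves are candidates for shortcuts or contamination. Reading the traces is still what confirms or rules out each hazard.

```python
from collections import defaultdict


def triage_for_review(runs: list[dict]) -> dict[str, list[str]]:
    """Group task IDs into manual-review queues.

    Each run is a hypothetical record of the form
    {"task_id": str, "config": str, "solved": bool}.
    """
    outcomes_by_task = defaultdict(list)
    for run in runs:
        outcomes_by_task[run["task_id"]].append(run["solved"])

    queues = {"never_solved": [], "always_solved": []}
    for task_id, outcomes in sorted(outcomes_by_task.items()):
        if not any(outcomes):
            # Possible broken problem: unsolvable, bad ground truth, ambiguous.
            queues["never_solved"].append(task_id)
        elif all(outcomes):
            # Possible shortcut or contamination: inspect how it was solved.
            queues["always_solved"].append(task_id)
    return queues


# Made-up runs for illustration only.
example_runs = [
    {"task_id": "t1", "config": "a", "solved": False},
    {"task_id": "t1", "config": "b", "solved": False},
    {"task_id": "t2", "config": "a", "solved": True},
    {"task_id": "t2", "config": "b", "solved": True},
    {"task_id": "t3", "config": "a", "solved": True},
    {"task_id": "t3", "config": "b", "solved": False},
]
print(triage_for_review(example_runs))
```

Sandbagging does not show up in outcome counts like these, so it still requires reading the traces directly.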

Standards for Future Reporting

To improve transparency and reproducibility, evaluation reports should detail the full configuration: the specific model, reasoning settings, tool access, budget (tokens, time, and cost), and the elicitation methods used. When performance is resource-dependent, reports should also include the expected cost per successful solve, which gives a more practical view of the model's capabilities and risks.
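
As a minimal sketch of such a report, the record below carries the configuration fields listed above, and the helper computes one plausible definition of cost per successful solve: total spend across all attempts, including failed ones, divided by the number of successes. The text does not prescribe a formula, so this definition, the `EvalConfig` and `cost_per_solve` names, and every field value are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Configuration fields a report might disclose (hypothetical schema)."""
    model: str
    reasoning_setting: str
    tools: tuple[str, ...]
    max_tokens: int
    max_minutes: int
    max_cost_usd: float
    elicitation: tuple[str, ...]


def cost_per_solve(attempt_costs: list[float], solved: list[bool]) -> float:
    """Total spend over all attempts divided by the number of successes."""
    if len(attempt_costs) != len(solved):
        raise ValueError("attempt_costs and solved must have the same length")
    successes = sum(solved)
    if successes == 0:
        # No successes observed, so the cost per solve is unbounded.
        return float("inf")
    return sum(attempt_costs) / successes


# Placeholder values for illustration only.
config = EvalConfig(
    model="example-model-v1",
    reasoning_setting="extended",
    tools=("shell", "browser"),
    max_tokens=1_000_000,
    max_minutes=60,
    max_cost_usd=25.0,
    elicitation=("few-shot prompt", "retry on failure"),
)

# Four attempts costing 12.0 in total, two of them successful.
print(cost_per_solve([2.0, 3.0, 5.0, 2.0], [True, False, False, True]))  # 6.0
```

Counting failed attempts in the numerator is what makes the figure reflect the real cost of obtaining a solve, rather than the cost of the successful runs alone.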