A Playbook for Trustworthy AI Model Evaluations

The Shift to Agentic Evaluation

Modern frontier models are no longer just chatbots; they are agentic systems that use tools, maintain state, and execute multi-step workflows. Consequently, evaluation results reflect not only the model but the entire "harness": the surrounding environment, control logic, and scaffolding. A trustworthy evaluation report must explicitly state the claim being tested (capability elicitation, safeguard performance, or comparative benchmarking) and provide evidence that the harness setup is a credible proxy for that claim.

The Critical Role of the Harness

The harness determines the observed performance ceiling. Evaluators must choose a harness that aligns with their specific goal:

Capability Elicitation: Use the strongest credible setup (tools, budget, scaffolding) a capable user would employ. If performance is still improving as compute or budget increases, report the result as a lower-bound estimate rather than a fixed ceiling.
Controlled Comparison: Use a fixed, standardized harness so that differences in scores reflect differences between models rather than differences in setup.
Safeguard Testing: Match the adversary. If testing robustness against expert misuse, the harness must support the strongest credible end-to-end attack strategy, including multi-turn persistence.
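
The lower-bound rule for capability elicitation can be sketched in a few lines of Python. This is a minimal illustration, not a standard method: the names BudgetResult and label_elicitation_result, the token-based budget unit, and the min_gain threshold of 0.02 are all assumptions introduced here. The idea is to compare the success rate at the two largest budgets tested and label the result a lower bound if the score was still climbing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetResult:
    budget_tokens: int   # per-task budget (hypothetical unit)
    success_rate: float  # fraction of tasks solved at this budget


def label_elicitation_result(results, min_gain=0.02):
    """Return (best observed success rate, label).

    The label is "lower bound" when the score was still improving at the
    largest budget tested, and "plateau observed" otherwise. min_gain is an
    assumed threshold for what counts as "still improving".
    """
    if not results:
        raise ValueError("at least one budget point is required")
    ordered = sorted(results, key=lambda r: r.budget_tokens)
    best = max(r.success_rate for r in ordered)
    if len(ordered) < 2:
        # A single budget point cannot show a plateau, so stay conservative.
        return best, "lower bound"
    gain = ordered[-1].success_rate - ordered[-2].success_rate
    label = "lower bound" if gain >= min_gain else "plateau observed"
    return best, label


if __name__ == "__main__":
    sweep = [
        BudgetResult(1_000, 0.20),
        BudgetResult(2_000, 0.30),
        BudgetResult(4_000, 0.45),
    ]
    print(label_elicitation_result(sweep))
```

A real report would likely use more budget points and account for run-to-run noise before declaring a plateau; the two-point comparison here only shows the shape of the check.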

Validating Results Against Hazards

Headline scores are easily distorted by several known hazards. Evaluators must conduct manual reviews of intermediate artifacts (like reasoning traces) to identify:

Reward Hacking: When a model achieves high scores by exploiting task shortcuts rather than demonstrating the intended capability.
Sandbagging: Strategic underperformance when a model recognizes it is being evaluated.
Contamination: Overperformance due to the model having seen evaluation tasks during training or via external browsing.
Broken Problems: Unsolvable environments, incorrect ground truth, or ambiguous prompts that lead to unfair scoring.
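
One way to make the manual review visible in a report is to record which hazards a reviewer flagged on each run and to show the headline score next to the score on unflagged runs. The sketch below assumes a simple, hypothetical record format (ReviewedRun, the flag names, and summarize are illustrative, not an established schema), and excluding flagged runs is only one possible treatment.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hazard labels taken from the list above; the string spellings are assumptions.
HAZARDS = {"reward_hacking", "sandbagging", "contamination", "broken_problem"}


@dataclass
class ReviewedRun:
    task_id: str
    passed: bool
    flags: set = field(default_factory=set)  # hazards noted by a human reviewer


def summarize(runs):
    """Report the headline pass rate, the pass rate on unflagged runs,
    and how often each hazard was flagged."""
    if not runs:
        raise ValueError("no runs to summarize")
    for run in runs:
        unknown = run.flags - HAZARDS
        if unknown:
            raise ValueError(f"unknown hazard flags on {run.task_id}: {unknown}")
    headline = sum(r.passed for r in runs) / len(runs)
    clean = [r for r in runs if not r.flags]
    clean_rate = sum(r.passed for r in clean) / len(clean) if clean else None
    flag_counts = Counter(flag for r in runs for flag in r.flags)
    return {
        "headline_pass_rate": headline,
        "unflagged_pass_rate": clean_rate,
        "flag_counts": dict(flag_counts),
    }


if __name__ == "__main__":
    runs = [
        ReviewedRun("t1", True),
        ReviewedRun("t2", True, {"reward_hacking"}),
        ReviewedRun("t3", True, {"contamination"}),
        ReviewedRun("t4", False),
    ]
    print(summarize(runs))
```

In this toy input the headline pass rate is 0.75 but the unflagged pass rate is 0.5, which is the kind of gap a reader of the report would want to see.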

Standards for Future Reporting

To improve transparency and reproducibility, evaluation reports should detail the full configuration: the specific model, reasoning settings, tool access, budget (tokens/time/cost), and the elicitation methods used. When performance is resource-dependent, reports should include the expected cost per successful solve, providing a more practical view of the model's capabilities and risks.
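
As a rough illustration, the reported configuration and the cost metric could be captured as follows. The field names in EvalConfig, the example values, and the choice of US dollars are assumptions made for this sketch. Cost per successful solve is computed here as total spend across all attempts divided by the number of successes, which charges failed attempts to the successes they accompany.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalConfig:
    # Field names are illustrative; they mirror the items listed above.
    model: str
    reasoning_setting: str
    tools: tuple
    max_tokens_per_attempt: int
    max_wall_clock_s: int
    elicitation: str


def cost_per_successful_solve(attempt_costs_usd, successes):
    """Total spend over all attempts divided by the number of successes.

    Returns float('inf') when nothing was solved, since no amount of the
    observed spend produced a solve.
    """
    if len(attempt_costs_usd) != len(successes):
        raise ValueError("costs and successes must have the same length")
    n_success = sum(1 for s in successes if s)
    if n_success == 0:
        return float("inf")
    return sum(attempt_costs_usd) / n_success


if __name__ == "__main__":
    config = EvalConfig(
        model="example-model",            # placeholder, not a real model name
        reasoning_setting="high",
        tools=("shell", "browser"),
        max_tokens_per_attempt=200_000,
        max_wall_clock_s=3_600,
        elicitation="few-shot prompt with retry scaffold",
    )
    print(asdict(config))
    print(cost_per_successful_solve([1.0, 2.0, 3.0, 2.0], [True, False, True, False]))
```

With four attempts costing 8.0 in total and two successes, the function returns 4.0 per successful solve, even though the average attempt cost only 2.0.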
