The Shift to Agentic Evaluation
Modern frontier models are no longer just chatbots; they are agentic systems that use tools, maintain state, and execute multi-step workflows. Consequently, evaluation results reflect not only the model but the entire "harness": the surrounding environment, control logic, and scaffolding. A trustworthy evaluation report must explicitly state the claim being tested (capability elicitation, safeguard performance, or comparative benchmarking) and provide evidence that the harness setup is a credible proxy for that claim.
The Critical Role of the Harness
The harness determines the observed performance ceiling. Evaluators must choose a harness that aligns with their specific goal:
- Capability Elicitation: Use the strongest credible setup (tools, budget, scaffolding) a capable user would employ. If performance is still improving as compute or budget increases, report the result as a lower bound on capability rather than as a ceiling.
- Controlled Comparison: Use a fixed, standardized harness so that score differences reflect differences between models rather than differences in setup.
- Safeguard Testing: Match the adversary. If testing robustness against expert misuse, the harness must support the strongest credible end-to-end attack strategy, including multi-turn persistence.
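The lower-bound rule for capability elicitation can be sketched in code. This is a minimal illustration, not a prescribed method: the function name `label_result`, the `tolerance` threshold, and the two labels are all assumptions introduced here. It compares the scores at the two largest budgets and labels the result a lower bound if the score was still rising.

```python
from typing import Sequence


def label_result(
    budgets: Sequence[float],
    scores: Sequence[float],
    tolerance: float = 0.01,  # hypothetical threshold for "still improving"
) -> str:
    """Return 'lower_bound' if the score was still rising at the largest budget."""
    if len(budgets) != len(scores) or len(scores) < 2:
        raise ValueError("need at least two (budget, score) points of equal length")
    # Sort by budget so the last two points correspond to the two largest budgets.
    points = sorted(zip(budgets, scores))
    gain = points[-1][1] - points[-2][1]
    return "lower_bound" if gain > tolerance else "plateau"


# Score keeps climbing as the token budget doubles, so report a lower bound.
print(label_result([1000, 2000, 4000], [0.20, 0.30, 0.40]))
```

A "plateau" label here only means that no further gain was observed within the budgets tested; a stronger harness or larger budget might still raise the score.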
Validating Results Against Hazards
Headline scores are easily distorted by several known hazards. Evaluators must conduct manual reviews of intermediate artifacts (like reasoning traces) to identify:
- Reward Hacking: High scores achieved by exploiting task shortcuts rather than by demonstrating the intended capability.
- Sandbagging: Strategic underperformance when a model recognizes it is being evaluated.
- Contamination: Inflated performance because the model saw evaluation tasks during training or retrieved them through external browsing.
- Broken Problems: Unsolvable environments, incorrect ground truth, or ambiguous prompts that lead to unfair scoring.
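One way to make the outcome of such a manual review visible next to the headline score is to record a hazard flag per task and report both the raw pass rate and the pass rate on unflagged tasks. The sketch below assumes a hypothetical `ReviewedTask` record and hazard labels invented for illustration; excluding flagged tasks is one possible convention, not the only defensible one.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels mirroring the four hazards listed above.
HAZARDS = {"reward_hacking", "sandbagging", "contamination", "broken_problem"}


@dataclass
class ReviewedTask:
    task_id: str
    passed: bool
    hazard: Optional[str] = None  # None means the reviewer found no hazard


def summarize(reviews: List[ReviewedTask]) -> dict:
    """Report the raw pass rate, the pass rate on unflagged tasks, and hazard counts."""
    if not reviews:
        raise ValueError("no reviewed tasks")
    for r in reviews:
        if r.hazard is not None and r.hazard not in HAZARDS:
            raise ValueError(f"unknown hazard label: {r.hazard}")
    clean = [r for r in reviews if r.hazard is None]
    return {
        "raw_pass_rate": sum(r.passed for r in reviews) / len(reviews),
        # None when every task was flagged, so there is nothing clean to score.
        "clean_pass_rate": sum(r.passed for r in clean) / len(clean) if clean else None,
        "hazard_counts": dict(Counter(r.hazard for r in reviews if r.hazard)),
    }
```

Reporting both rates side by side lets a reader see how much of the headline score depends on tasks the reviewers flagged.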
Standards for Future Reporting
To improve transparency and reproducibility, evaluation reports should detail the full configuration: the specific model, reasoning settings, tool access, budget (tokens/time/cost), and the elicitation methods used. When performance is resource-dependent, reports should include the expected cost per successful solve, providing a more practical view of the model's capabilities and risks.
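As a rough sketch of what such a report might carry, the example below pairs a configuration record with an empirical cost-per-success calculation. The `EvalConfig` field names and the `cost_per_success` helper are hypothetical, chosen only to mirror the items listed above. The cost figure is computed as total spend across all attempts, failed ones included, divided by the number of successful solves.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical record of the settings a report should disclose."""
    model: str
    reasoning_setting: str
    tools: Tuple[str, ...]
    max_tokens_per_attempt: int
    elicitation: str


def cost_per_success(attempt_costs_usd: Sequence[float], successes: Sequence[bool]) -> float:
    """Total spend over all attempts divided by the number of successful solves."""
    if len(attempt_costs_usd) != len(successes):
        raise ValueError("one success flag is required per attempt")
    solved = sum(successes)
    if solved == 0:
        return math.inf  # no solve was observed at this budget
    return sum(attempt_costs_usd) / solved


config = EvalConfig(
    model="example-model",  # placeholder, not a real model name
    reasoning_setting="high",
    tools=("shell", "browser"),
    max_tokens_per_attempt=200_000,
    elicitation="best-of-4 with retries",
)
# Four attempts costing 12.0 in total, two of which succeeded.
print(cost_per_success([2.0, 3.0, 5.0, 2.0], [True, False, False, True]))  # 6.0
```

Counting failed attempts in the numerator matters: a model that solves a task once in ten tries costs roughly ten times the per-attempt price per solve.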