The Problem with Accuracy-Centric Evaluation

Traditional AI evaluation often treats benchmarks as binary: once a model reaches high accuracy, the benchmark is considered 'solved' and retired. This practice discards useful signal, because a single accuracy score says little about how agents actually perform in real-world environments. Relying solely on accuracy can mask failures in reasoning, efficiency, and robustness that become apparent only when evaluation looks beyond the final score.

A Multi-Dimensional Evaluation Framework

Using CORE-Bench Hard as a case study, the authors propose shifting focus toward six key dimensions of agent performance that remain relevant even after accuracy saturates. These include:

  • Construct Validity: Identifying 'shortcuts' where agents pass tests without true understanding. The authors introduced CORE-Bench v1.1 and an out-of-distribution (OOD) suite to better stress-test agent capabilities.
  • Efficiency & Reliability: Measuring the cost, time, and consistency of agent outputs rather than just the correctness of the final result.
  • Model vs. Scaffold Performance: Separating the contribution of the underlying LLM from that of the 'scaffold' (the code and tooling that wraps the model).
  • Human-Agent Collaboration: Assessing the 'uplift' provided by AI in real-world workflows. In a randomized experiment on computational reproducibility, human-agent teams achieved a 2x speedup compared to humans alone. The study noted that many human-only attempts hit the time limit, so their recorded times understate how long those participants would actually have needed, and the true performance gap may be even wider.
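
Two of the dimensions above can be illustrated with a short Python sketch. It is not the authors' code: the Run record, its field names, the cost unit, and both functions are hypothetical, chosen only to show how repeated runs might be summarized beyond a single accuracy number, and why a time limit on human-only attempts makes a measured speedup a lower bound.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One agent attempt at one task (all fields are illustrative)."""
    task_id: str
    correct: bool     # was the final answer right?
    cost_usd: float   # assumed cost unit for this run
    seconds: float    # wall-clock time for this run


def _rate(flags) -> float:
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)


def summarize(runs: list[Run]) -> dict[str, float]:
    """Report accuracy next to consistency, cost, and time."""
    by_task: dict[str, list[Run]] = {}
    for run in runs:
        by_task.setdefault(run.task_id, []).append(run)
    attempts = by_task.values()
    return {
        "accuracy": _rate(r.correct for r in runs),
        # A task solved in some runs but not others is solved unreliably.
        "solved_at_least_once": _rate(any(r.correct for r in rs) for rs in attempts),
        "solved_every_time": _rate(all(r.correct for r in rs) for rs in attempts),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "mean_seconds": mean(r.seconds for r in runs),
    }


def speedup_lower_bound(human_secs: list[float], team_secs: list[float],
                        time_limit: float) -> float:
    """Ratio of mean human-only time to mean team time.

    Human-only times are capped at time_limit, so the capped mean can only
    underestimate the true mean and the ratio is a lower bound (assuming
    team times were not themselves cut off).
    """
    capped = [min(t, time_limit) for t in human_secs]
    return mean(capped) / mean(team_secs)


if __name__ == "__main__":
    runs = [
        Run("task-a", True, 1.0, 10.0), Run("task-a", True, 1.0, 10.0),
        Run("task-b", True, 2.0, 20.0), Run("task-b", False, 2.0, 20.0),
    ]
    print(summarize(runs))
    print(speedup_lower_bound([120.0, 120.0], [60.0, 60.0], time_limit=120.0))
```

In this toy data the agent looks 75% accurate overall, yet only half of the tasks are solved on every attempt. A gap of that kind is invisible in a single accuracy score.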

Moving Toward Rigorous Benchmarking

The authors argue that saturation is not an end-point but an opportunity to deepen evaluation. By looking beyond simple accuracy metrics, developers can build more reliable systems that are actually useful in production. The findings indicate that even when agents appear to 'solve' a task, they often lack the reliability and efficiency required for professional scientific or engineering work. Future benchmarking should prioritize these operational metrics to ensure AI tools are ready for real-world deployment.