The Philosophy of Evals: Beyond the Leaderboard
Evaluation metrics are often misunderstood. They are neither objective truths nor useless vibes, but a tool for iterative engineering. The core problem with current benchmarks is that many are outdated, and models are often tuned for "benchmark maxing" rather than real-world utility.
To effectively use evals, adopt three heuristics:
- Be Skeptical: Treat benchmark numbers reported by model providers as approximations. They rarely capture the nuance of real-world performance.
- Avoid Bleeding Edge: Let new model releases settle for a few weeks. Only switch models once they have proven their stability and utility in production environments.
- Prioritize Precision: Use benchmarks that measure actual capabilities (e.g., complex engineering tasks) rather than standardized tests that no longer reflect frontier performance.
The Hill-Climbing Framework
Improving an agent is an engineering and philosophy problem. Because agentic workflows involve complex, multi-step processes (searching docs, installing environments, running code), you cannot rely on simple binary evals. Instead, implement a systematic hill-climbing process:
- Environment Isolation: Run tasks in isolated, standardized environments (e.g., using Docker or virtual machines) to ensure reproducibility.
- Parallelization: Use infrastructure like Modal or Harbor to run large test suites in parallel, so that, given enough workers, the total wall-clock time is bounded by the slowest task.
- Portfolio Allocation of Failures: After a run, use a secondary agent to analyze the failure traces. The analysis shows which specific levers (such as container CPU/memory settings, timeout adjustments, or model-specific prompt engineering) actually move the needle.
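The run-then-analyze loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference harness: `TaskResult`, `run_suite`, `failure_breakdown`, and `fake_run_task` are hypothetical names, the task IDs and failure labels are made up, and `fake_run_task` stands in for whatever actually launches an isolated container and runs the agent in it. The `failure_reason` label is assumed to come from the secondary agent that reads the traces.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    passed: bool
    # Hypothetical label assigned by the trace-analysis step, e.g. "timeout".
    failure_reason: Optional[str] = None


def run_suite(
    task_ids: Iterable[str],
    run_task: Callable[[str], TaskResult],
    max_workers: int = 8,
) -> List[TaskResult]:
    """Run all tasks concurrently; results are returned in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, task_ids))


def failure_breakdown(results: Iterable[TaskResult]) -> Counter:
    """Count failed tasks per failure reason to see which lever matters most."""
    return Counter(r.failure_reason or "unknown" for r in results if not r.passed)


def fake_run_task(task_id: str) -> TaskResult:
    # Stand-in for an isolated run (assumption: a real harness would start
    # a container or VM here and return the graded outcome).
    canned = {
        "install-deps": TaskResult("install-deps", False, "timeout"),
        "fix-bug": TaskResult("fix-bug", True),
        "build-image": TaskResult("build-image", False, "out_of_memory"),
        "search-docs": TaskResult("search-docs", False, "timeout"),
    }
    return canned[task_id]


results = run_suite(
    ["install-deps", "fix-bug", "build-image", "search-docs"], fake_run_task
)
breakdown = failure_breakdown(results)
print(breakdown.most_common())  # [('timeout', 2), ('out_of_memory', 1)]
```

In this toy run, timeouts account for two of the three failures, which would suggest adjusting the timeout before touching memory limits. A thread pool is enough here because each task is assumed to spend its time waiting on an external sandbox rather than on local CPU.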
Categorizing Improvements
When analyzing your results, group your findings into three zones:
- Zone 1 (Obvious Flaws): Fix clear bugs, rate-limiting issues, or harness crashes.
- Zone 2 (Nuance Improvements): This is the most critical area. It involves tailoring prompt engineering and harness logic to specific model families (e.g., Claude vs. Gemini). This explains why a model might perform exceptionally well in general but fail in your specific harness.
- Zone 3 (The Danger Zone): Avoid overfitting to the benchmark. Cheating to get a higher score for a tweet provides no value to the end user and degrades the product's actual intelligence.
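One way to keep the three zones explicit during triage is to tag each finding and sort it before deciding what to work on. The sketch below is only an illustration: `Zone`, `Finding`, `classify`, and the two boolean flags are hypothetical, and in practice deciding whether a change "only helps the benchmark" is a judgment call rather than a field you can read off a trace.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class Zone(Enum):
    OBVIOUS_FLAW = 1  # bugs, rate limits, harness crashes
    NUANCE = 2        # model-specific prompt and harness tuning
    DANGER = 3        # changes that only raise the benchmark score


@dataclass(frozen=True)
class Finding:
    description: str
    is_harness_bug: bool        # assumption: set by whoever reviews the trace
    only_helps_benchmark: bool  # assumption: a human judgment, not a metric


def classify(finding: Finding) -> Zone:
    """Map a finding to a zone; the overfitting check takes priority."""
    if finding.only_helps_benchmark:
        return Zone.DANGER
    if finding.is_harness_bug:
        return Zone.OBVIOUS_FLAW
    return Zone.NUANCE


def group_by_zone(findings: Iterable[Finding]) -> Dict[Zone, List[str]]:
    """Collect finding descriptions under the zone each one falls into."""
    grouped: Dict[Zone, List[str]] = defaultdict(list)
    for finding in findings:
        grouped[classify(finding)].append(finding.description)
    return dict(grouped)


findings = [
    Finding("harness crashes on empty tool output", True, False),
    Finding("model ignores long system prompt", False, False),
    Finding("hard-code answers for known test cases", False, True),
]
zones = group_by_zone(findings)
for zone in Zone:
    print(zone.name, zones.get(zone, []))
```

Findings in `Zone.DANGER` are listed so they can be rejected deliberately, not so they can be scheduled.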