Shift from Unit Testing to Failure-Mode Evals

Evaluation is not about exhaustive coverage; it is about managing risk and building confidence. Where unit tests aim to cover every edge case, agent evals should target known failure modes. Trying to test everything is an unbounded task that keeps you from shipping. Instead, use subject matter expertise to identify specific failure modes, then build targeted tests around those risks.
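
As a rough illustration of what "targeted at known failure modes" can look like in practice, the sketch below registers one check per named failure mode and runs each against an agent. Everything here is hypothetical: the FailureModeEval class, the two example failure modes, and stub_agent are placeholders, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailureModeEval:
    """One targeted eval: a prompt that provokes a known failure mode."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output avoids the failure


def run_evals(agent: Callable[[str], str], evals: list[FailureModeEval]) -> dict[str, bool]:
    """Run every eval against the agent and report pass/fail per failure mode."""
    return {e.name: e.check(agent(e.prompt)) for e in evals}


# Hypothetical failure modes that a domain expert might have flagged.
EVALS = [
    FailureModeEval(
        name="leaks_internal_ids",
        prompt="What is my order status?",
        check=lambda out: "INTERNAL-" not in out,
    ),
    FailureModeEval(
        name="promises_refund",
        prompt="I want my money back.",
        check=lambda out: "guarantee" not in out.lower(),
    ),
]


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent call (assumption: agents map text to text)."""
    return "I can look into that for you."


results = run_evals(stub_agent, EVALS)
```

The list of evals stays short on purpose: each entry exists because someone identified a concrete risk, not because a coverage target demanded it.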

The Evaluation Maturity Continuum

Teams typically progress through four stages of maturity as their agents grow in complexity:

  1. Vibe Checking with Justification: Starting with manual review is acceptable, provided it is documented. When a human annotator gives a 'thumbs up' or 'thumbs down,' they must provide a written justification. This captures domain-specific knowledge that is essential for scaling.
  2. Automated Scaling (LLM-as-Judge): Use the justifications gathered in the first stage to train or prompt an LLM to act as a judge. This automates the evaluation process. Crucially, treat the LLM-as-Judge as a component that itself requires validation; do not trust its output blindly.
  3. Production-Trace Flywheel: Stop thinking of evals as static tests and start thinking of them as production replays. Capture real production traces, identify failures, feed them into an offline experimentation environment, and use the results to guide improvements. This creates a feedback loop that allows you to play 'offense' by measuring the impact of every tweak.
  4. Complex Tool Call Evaluation: When agents interact with external systems, evaluation becomes harder. For context-gathering tools, you must introspect the entire trace. For CRUD (Create, Read, Update, Delete) tools, the challenge is state: a replayed call only behaves as it did originally if the underlying data matches what existed at the time. To solve this, either inject the captured system state directly into the trace, or use timestamp-based queries against your vector database to reconstruct the state as it existed when the original trace occurred.
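
The timestamp-based reconstruction in stage 4 can be sketched as an "as-of" query. The example below uses a small in-memory versioned store as a stand-in for a real database; the VersionedStore class and its write and as_of methods are assumptions made for illustration, and an actual vector database would expose its own filtering mechanism.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """In-memory stand-in for a store that keeps every write with a timestamp."""
    # key -> list of (timestamp, value), kept sorted by timestamp.
    # A value of None marks a delete.
    _history: dict[str, list[tuple[float, object]]] = field(default_factory=dict)

    def write(self, key: str, value: object, ts: float) -> None:
        versions = self._history.setdefault(key, [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0])  # stable sort keeps same-timestamp order

    def as_of(self, ts: float) -> dict[str, object]:
        """Return the state as it existed at time ts (inclusive)."""
        snapshot = {}
        for key, versions in self._history.items():
            # Index of the first version written strictly after ts.
            idx = bisect_right([t for t, _ in versions], ts)
            if idx:
                value = versions[idx - 1][1]
                if value is not None:  # skip keys deleted at or before ts
                    snapshot[key] = value
        return snapshot


store = VersionedStore()
store.write("ticket-1", {"status": "open"}, ts=100.0)
store.write("ticket-2", {"status": "open"}, ts=150.0)
store.write("ticket-1", {"status": "closed"}, ts=200.0)

# Suppose the original production trace ran at t=120: replay against that state.
state_at_trace = store.as_of(120.0)
```

Replaying the trace against state_at_trace rather than the live store means a tool call such as "close ticket-1" sees the ticket as open, as it did in production.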

Practical Implementation Advice

  • Custom Annotation Views: Do not use generic annotation platforms. Build custom views that reflect how your specific agents and data actually look, so that annotators can give high-quality feedback.
  • Deterministic vs. Probabilistic: Deterministic code checks (e.g., token counts, tool call frequency) are useful, but embrace LLM-as-Judge for subjective tasks. To maintain rigor, build a human-labeled 'ground truth' dataset and verify that your LLM judges agree with it.
  • Emerging Patterns: Future-proof your pipeline in two ways. Use topic modeling to automatically uncover new failure modes in production, and use CLI-based automation to integrate evals directly into your deployment workflows.
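
One way to check an LLM judge against a ground-truth dataset is to measure how often its verdicts match the human labels. The sketch below computes raw agreement and Cohen's kappa for binary pass/fail labels; the function names and the sample labels are illustrative assumptions, and the judge verdicts would in practice come from your own judge prompt.

```python
def agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Agreement corrected for chance, for binary labels."""
    n = len(human)
    p_observed = agreement(human, judge)
    p_human_pass = sum(human) / n
    p_judge_pass = sum(judge) / n
    # Probability that both raters agree by chance alone.
    p_expected = (p_human_pass * p_judge_pass
                  + (1 - p_human_pass) * (1 - p_judge_pass))
    if p_expected == 1.0:
        # Kappa is undefined when both raters give one identical constant label;
        # this sketch treats that case as perfect agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Illustrative labels only: True means "thumbs up".
human_labels = [True, True, False, False, True, False, True, False]
judge_labels = [True, True, False, True, True, False, False, False]

raw = agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
```

Raw agreement can look high when one label dominates the dataset, which is why a chance-corrected measure is worth tracking alongside it. The threshold at which a judge counts as trustworthy is a judgment call for your team.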