Calibrate LLM Judges with GEPA for Reliable Evals
Use GEPA to optimize LLM-as-a-judge prompts against human annotations, creating evaluators that match SME judgments and accelerate agent iteration.
Derive Metrics from Real-World Error Clusters
Start by analyzing production traces with subject matter experts (SMEs) to identify failure modes specific to your use case. For a customer support agent like the τ-bench airline benchmark (599 traces, 62% compliant), SMEs review conversations and cluster errors into categories: policy adherence, response style, information delivery, and tool usage. Avoid generic metrics like 'hallucination': they fail because the judge LLM can't detect issues the agent's LLM couldn't prevent in the first place.
Make metrics binary (compliant/non-compliant) with required reasoning. This simplifies calibration: humans struggle with 1-5 scales, let alone LLMs. For policy adherence, annotations explain the rules, e.g., 'non-compliant because it approved the cancellation without verifying the reservation met airline rules.' Quality data is paramount: pre-process for balance (training: 480 traces, ~2/3 compliant; validation: 112 traces), ensure the reasoning reveals policy nuances, and split by task to avoid leakage.
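To make the leakage point concrete, here is a minimal sketch of a task-grouped split: traces derived from the same benchmark task stay in the same split. Field names like `task_id` are illustrative assumptions, not τ-bench's actual schema.

```python
# Minimal sketch of a leakage-free split: all traces from the same task
# land in the same split. Field names (task_id) are assumptions.
import random
from collections import defaultdict

def split_by_task(traces, val_fraction=0.2, seed=0):
    """Group traces by task_id, then assign whole tasks to train or validation."""
    by_task = defaultdict(list)
    for trace in traces:
        by_task[trace["task_id"]].append(trace)

    task_ids = sorted(by_task)
    random.Random(seed).shuffle(task_ids)

    n_val_tasks = max(1, int(len(task_ids) * val_fraction))
    val_tasks = set(task_ids[:n_val_tasks])

    train = [t for tid in task_ids if tid not in val_tasks for t in by_task[tid]]
    val = [t for tid in task_ids if tid in val_tasks for t in by_task[tid]]
    return train, val
```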
Common mistake: Skipping SME error analysis leads to unlearnable metrics. Principle: Metrics must encode business rules via annotations, enabling the judge to 'learn' compliance without prior knowledge.
Build Annotations as Learning Signals
Use tools like Agenta to queue traces for SME labeling. For each trace, require: binary label + reasoning. This reasoning acts as ground truth supervision—without it, optimization can't infer why a trajectory fails.
Example non-compliant annotation: 'The agent is not compliant because it approved the cancellation without verifying that the reservation met the airline cancellation rule.' Compliant: 'Compliant because it correctly identified the basic economy reservation.' In τ-bench, the original assertions were post-processed into this format.
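As a sketch, an annotation record might look like this; the class, field names, and trace ids are illustrative assumptions, not Agenta's schema:

```python
# Minimal sketch of an annotation record: binary label plus free-text
# reasoning. Names and ids are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Annotation:
    trace_id: str
    compliant: bool   # binary label: compliant / non-compliant
    reasoning: str    # SME explanation, the learning signal for optimization

examples = [
    Annotation(
        trace_id="trace_017",  # hypothetical id
        compliant=False,
        reasoning=("The agent is not compliant because it approved the "
                   "cancellation without verifying that the reservation "
                   "met the airline cancellation rule."),
    ),
    Annotation(
        trace_id="trace_042",  # hypothetical id
        compliant=True,
        reasoning="Compliant because it correctly identified the basic economy reservation.",
    ),
]
```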
Principle: Annotations aren't just labels; they're few-shot examples embedding domain knowledge. Validate the label distribution (no heavy skew), check for redundancies, and confirm the reasoning carries enough information: complex policies need explicit explanations.
"The reasoning here is very important because without the reasoning we will see the optimization algorithm will need to discover itself like why it failed and it's going to be very hard."
GEPA Optimization: Seed, Mutate, and Pareto Filter
GEPA (Genetic-Pareto) evolves prompts through genetic-style iterations: sample candidate prompts, evaluate them on batches of the eval set, filter via a Pareto frontier, and repeat until the budget is exhausted. It beats naive search by balancing performance with diversity.
Seed prompt: Start conservative: "Assume the agent is compliant unless a specific violation is found." Example: "Evaluate if the customer service agent violated policy. Output: Compliant or Non-compliant with reasoning. Assume compliant by default."
Sampling:
- Mutation: Run the judge on failing cases; an LLM reflects on the errors and proposes an improved prompt (e.g., adding guidelines drawn from observed mistakes); see the sketch after this list.
- Merge: Combine guidelines from top prompts (e.g., policy checks from A + style rules from B).
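A minimal sketch of the reflective mutation step, assuming a generic `llm(prompt) -> str` helper rather than any specific library API:

```python
# Sketch of reflective mutation: show the LLM its own failures plus the
# human reasoning, and ask for an improved judge prompt. `llm` is a
# placeholder chat-completion helper, not a real library call.
REFLECTION_TEMPLATE = """You are improving an LLM-judge prompt.

Current judge prompt:
{current_prompt}

The judge got these cases wrong (trace, human label, human reasoning, judge output):
{failures}

Reflect on why the judge failed, then propose an improved prompt that adds
concrete guidelines addressing these failures. Return only the new prompt."""

def mutate_prompt(llm, current_prompt, failing_cases):
    """Ask an LLM to reflect on failures and propose an improved judge prompt."""
    failures = "\n\n".join(
        f"Trace: {c['trace']}\nHuman: {c['label']} ({c['reasoning']})\n"
        f"Judge said: {c['judge_output']}"
        for c in failing_cases
    )
    return llm(REFLECTION_TEMPLATE.format(
        current_prompt=current_prompt, failures=failures))
```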
Evaluation: A custom evaluator logs the judge's output, the error (any mismatch with the human annotation), and reflection reasoning for the next mutation. Use litellm for model calls and the gepa library's entry point: optimize_anything(seed_prompt, evaluator, config).
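Putting this together, a hedged sketch of the evaluator wiring; treat the exact optimize_anything signature as an assumption about the library, and `judge_llm` as a placeholder for your judge call:

```python
# Sketch of a custom evaluator for the optimization loop. The score says
# whether the judge matched the human label; the feedback carries the SME
# reasoning so the next reflection/mutation step can learn from it.
SEED_PROMPT = (
    "Evaluate if the customer service agent violated policy. "
    "Output: Compliant or Non-compliant with reasoning. "
    "Assume the agent is compliant unless a specific violation is found."
)

def evaluator(prompt, batch):
    """Score a candidate judge prompt against SME annotations on one batch."""
    results = []
    for ex in batch:
        label, judge_reasoning = judge_llm(prompt, ex["trace"])  # placeholder call
        correct = (label == ex["compliant"])
        results.append({
            "score": 1.0 if correct else 0.0,
            # SME reasoning is surfaced only on mismatches, to guide mutation:
            "feedback": None if correct else ex["reasoning"],
        })
    return results

# result = optimize_anything(SEED_PROMPT, evaluator, config={"budget": 200})
```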
Pareto frontier: Don't pick the prompt with the best average score; instead, for each eval trajectory, select the best-performing prompt. This ensures coverage (every test case has a strong solver), promoting diversity over convergence to one 'average' prompt.
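A minimal sketch of that per-task selection rule; the scores and prompt names are made up for illustration:

```python
# Keep every prompt that is best on at least one task, rather than the
# single best-on-average prompt.
def pareto_candidates(scores):
    """scores[prompt_id][task_id] -> float. Returns the set of prompt ids to keep."""
    keep = set()
    task_ids = {t for per_task in scores.values() for t in per_task}
    for task in task_ids:
        best = max(scores, key=lambda p: scores[p].get(task, float("-inf")))
        keep.add(best)
    return keep

scores = {
    "prompt_A": {"task_1": 1.0, "task_2": 0.0},
    "prompt_B": {"task_1": 0.0, "task_2": 1.0},
    "prompt_C": {"task_1": 0.6, "task_2": 0.6},  # best on average, best on no task
}
assert pareto_candidates(scores) == {"prompt_A", "prompt_B"}
```

Note how the highest-average candidate (prompt_C) is dropped: it wins nowhere, which is exactly the 'average prompt' the Pareto filter avoids.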
Run on the train set (in batches to save compute); each iteration samples N candidates and evaluates them on a fraction of the data. Budget: on the order of hours. Post-optimization, GEPA yields prompts correlating >0.8 with human labels.
Principle: Diversity via Pareto prevents overfitting to easy cases; mutation leverages LLM self-improvement.
"The way we select which prompts or which candidates we're going to use as a seed for the new iteration is not that we select the ones that have the average best score... instead what they do is that they try to add diversity by trying to look at what are the best candidate per task."
Validate Calibration Against Held-Out Humans
Test the optimized judge on the validation set: compute agreement with human annotations (e.g., accuracy or Cohen's kappa). τ-bench results: the naive judge reaches ~60% accuracy; the GEPA-optimized judge ~85% on policy adherence.
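A minimal sketch of the agreement check with scikit-learn; the label lists are placeholders for real held-out data:

```python
# Compare judge verdicts to SME labels on the held-out validation set.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human = [True, False, True, True, False]   # SME labels (placeholder data)
judge = [True, False, True, False, False]  # optimized judge's verdicts

print(f"accuracy:      {accuracy_score(human, judge):.2f}")
print(f"cohen's kappa: {cohen_kappa_score(human, judge):.2f}")
```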
Assess robustness: Vary models (GPT-4o, Claude), temperatures. Check failure modes—optimized judges handle edge cases better due to merged guidelines.
Before/after: Naive: Misses subtle policy violations (e.g., unverified membership). Optimized: Incorporates checks like 'Verify user membership before changes.'
"Miscalibrated evals are worse than no evals. They give false confidence while being, at best, useless."
Integrate into Dev Loops for Agents
Deploy calibrated judges for:
- Offline evals: Replace slow human loops; iterate prompts 10x faster.
- Online monitoring: Detect distribution shifts in prod traces.
- Data flywheel: Auto-generate evals from traces, re-optimize agents.
For multi-metric setups (here, 4 judges), run the judges in parallel, as in the sketch below. Scale with Agenta for observability and annotation queues.
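A minimal sketch of running per-metric judges concurrently with asyncio; `run_judge` is a placeholder for your actual async judge call:

```python
# Run one optimized judge per metric, concurrently, over a single trace.
import asyncio

JUDGE_PROMPTS = {
    "policy_adherence": "...optimized prompt...",
    "response_style": "...optimized prompt...",
    "information_delivery": "...optimized prompt...",
    "tool_usage": "...optimized prompt...",
}

async def run_judge(metric, prompt, trace):
    ...  # placeholder: call your judge model here
    return metric, {"compliant": True, "reasoning": "..."}

async def evaluate_trace(trace):
    tasks = [run_judge(m, p, trace) for m, p in JUDGE_PROMPTS.items()]
    return dict(await asyncio.gather(*tasks))

# verdicts = asyncio.run(evaluate_trace(trace))
```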
Common pitfalls: Poor data (small N, heavy skew) caps gains, so augment if needed. Over-complex seed prompts fail; start naive. Skipping validation risks false positives in production.
"Having calibrated LM as a judge with a similar quality as human annotator will make your development much faster."
Key Takeaways
- Analyze traces with SMEs to cluster 3-5 binary metrics tied to business rules; include rich reasoning in annotations.
- Seed GEPA with naive, conservative prompts assuming success; use mutation for self-reflection on failures.
- Apply Pareto frontier filtering to maintain prompt diversity across eval cases.
- Validate on held-out human data with correlation metrics; aim for >80% match before prod.
- Prioritize data quality over compute—bad annotations doom optimization.
- Build separate judges per metric; binary + reasoning beats scalar scores.
- Use the optimize_anything API for any configurable system, not just prompts.
- Integrate into flywheels: Evals from traces → auto-optimize → repeat.
"Quality of the data is really paramount of being able to learn this... without kind of information about what is compliant like why is something correct or not correct it would be quite impossible."
Full code and data: GitHub repo linked in the original video.