Offline Eval Gates: Catch LLM Regressions via Scenario Buckets & Paired Scores
Design gates around 4-6 failure scenario buckets with multi-dimension scoring (outcome, process, action, efficiency); always compare baseline vs candidate on identical fixed cases to detect regressions before shipping prompt/model changes.
Scenario Buckets Over Topic Piles for Debuggable Regressions
Focus gates on failure surfaces like 'policy lookup with evidence present, answer expected' or 'missing evidence, abstain expected', not broad topics like 'customer support.' Each scenario specifies user intent, evidence regime (retrieved, tool-derived, parametric, unsupported), expected action (answer, abstain, clarify, refuse, escalate), risk tier, and change-sensitivity tags (e.g., retrieval_sensitive). Start with 4-6 buckets and a few cases per bucket, prioritizing direct-answerable cases, missing-evidence abstains, refusal/escalation boundaries, and past failures. This catches regressions on the cases that matter and enables per-bucket debugging: if retrieval changes, filter to retrieval-sensitive rows instantly.
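A minimal sketch of what a bucketed case record could look like; the field names are illustrative, not the llm-eval-ops schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One gold case inside a scenario bucket (illustrative field names)."""
    case_id: str
    bucket: str                  # e.g. "missing_evidence_abstain"
    user_intent: str             # what the user is trying to accomplish
    evidence_regime: str         # "retrieved" | "tool_derived" | "parametric" | "unsupported"
    expected_action: str         # "answer" | "abstain" | "clarify" | "refuse" | "escalate"
    risk_tier: str               # "low" | "medium" | "high"
    change_sensitivity: list[str] = field(default_factory=list)  # e.g. ["retrieval_sensitive"]
    release_blocker: bool = False

def retrieval_sensitive(cases: list[EvalCase]) -> list[EvalCase]:
    """After a retriever change, look only at the rows tagged as retrieval-sensitive."""
    return [c for c in cases if "retrieval_sensitive" in c.change_sensitivity]
```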
Build a minimum viable gold set by drafting expected actions first, adding reference answers/tools only where needed, and marking release blockers. Accelerate with LLMs for clustering logs, generating paraphrases/variants, proposing labels, and drafting rubrics, but let humans adjudicate critical cases and use deterministic checks for verifiable tasks. Synthetic data stress-tests the system but doesn't replace production-shaped evals.
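For verifiable tasks, a deterministic check can adjudicate without any judge model. A minimal sketch, assuming the task emits JSON with a known set of keys:

```python
import json

def check_structured_answer(raw_output: str, required_keys: set[str]) -> dict:
    """Deterministic structure check: parse JSON and verify required keys are present.
    Returns per-check results rather than a single pass/fail score."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"parsed": False, "keys_ok": False}
    if not isinstance(parsed, dict):
        return {"parsed": True, "keys_ok": False}
    return {"parsed": True, "keys_ok": required_keys.issubset(parsed.keys())}
```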
Multi-Dimension Scoring Separates Signal from Noise
Ditch single 'accuracy' scores; evaluate four dimensions independently: outcome correctness (right thing?), process correctness (valid structure, no unsupported claims, correct evidence/tools?), action correctness (right mode: answer vs abstain vs refuse?), and efficiency (cost/latency bounds). Score baseline and candidate on the exact same frozen dataset/case IDs/snapshots to compute per-case deltas, surfacing new high-risk blockers even if averages improve.
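A sketch of paired scoring over frozen case IDs; the dimension names follow the four above, everything else is illustrative:

```python
DIMENSIONS = ("outcome", "process", "action", "efficiency")

def per_case_deltas(baseline: dict[str, dict], candidate: dict[str, dict]) -> dict[str, dict]:
    """baseline/candidate map case_id -> {dimension: score in [0, 1]}.
    Both runs must cover exactly the same frozen case IDs."""
    assert baseline.keys() == candidate.keys(), "runs must share identical case IDs"
    return {
        case_id: {dim: candidate[case_id][dim] - baseline[case_id][dim] for dim in DIMENSIONS}
        for case_id in baseline
    }

def regressed_cases(deltas: dict[str, dict]) -> list[str]:
    """Case IDs where any dimension got worse, even if the averages improved."""
    return [cid for cid, d in deltas.items() if any(v < 0 for v in d.values())]
```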
Progress scorers from deterministic checks (the starting point for structure/parsing) to reference-based comparisons (early, for outcomes), then LLM-as-judge (after core gates stabilize) and human review (critical/ambiguous cases only). Treat refusal as action selection within evidence/policy constraints, tracking refusal precision/recall, over-refusal, and unsafe compliance rather than a binary yes/no. This reveals regressions like answering despite missing evidence (dangerous) or refusing clarifiable requests (frustrating).
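A sketch of refusal-as-action metrics, assuming each row records the expected and observed action from the shared action vocabulary:

```python
def refusal_metrics(rows: list[dict]) -> dict[str, float]:
    """rows: [{"expected_action": ..., "observed_action": ...}, ...].
    Treats refusal/abstention as an action choice, not a binary safety flag."""
    withhold = ("refuse", "abstain")
    should_withhold = [r for r in rows if r["expected_action"] in withhold]
    did_withhold = [r for r in rows if r["observed_action"] in withhold]
    correct = [r for r in should_withhold if r["observed_action"] in withhold]

    precision = len(correct) / len(did_withhold) if did_withhold else 1.0
    recall = len(correct) / len(should_withhold) if should_withhold else 1.0
    # Refused when an answer/clarification was expected:
    over_refusal = (len(did_withhold) - len(correct)) / max(len(rows), 1)
    # Answered when a refusal/abstention was expected:
    unsafe_compliance = (len(should_withhold) - len(correct)) / max(len(rows), 1)
    return {"precision": precision, "recall": recall,
            "over_refusal": over_refusal, "unsafe_compliance": unsafe_compliance}
```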
Blocker Policies and Evidence-Driven Growth
Explicitly define blockers: unsupported claims in high-risk cases, wrong actions in critical buckets, answering when abstain is expected. Keep tone/brand issues advisory and treat latency/cost bounds as hard limits. Gates enforce decisions: ship only if there are no blockers, regardless of average uplift.
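A minimal verdict sketch reusing the EvalCase and per-case delta shapes from the earlier examples; the repo's actual verdict logic may differ:

```python
def gate_verdict(deltas: dict[str, dict], cases: dict) -> str:
    """Ship only if no release-blocking case regressed on any dimension.
    Average uplift never overrides a blocker."""
    blockers = [
        cid for cid, d in deltas.items()
        if cases[cid].release_blocker and any(v < 0 for v in d.values())
    ]
    if blockers:
        return "BLOCK: regressions in " + ", ".join(sorted(blockers)[:5])
    return "SHIP"
```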
Grow the set from real failures (manual reviews, dogfooding bugs, log patterns, candidate regressions), not brainstorming. Maturity path: Phase 1 (small high-risk set plus rules), Phase 2 (more cases, run history), Phase 3 (stable core plus shadow sets, richer scorers). Version policies to maintain trust, and change one variable (prompt, retriever) per run to isolate causes. Common pitfalls: headline scores hide root causes; topic buckets resist debugging; unversioned policies erode reliability; retrieval drift goes untracked; refusal gets treated as binary.
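Versioning the policy and the run manifest makes 'change one variable per run' auditable. A hypothetical manifest sketch; the version strings and model names are placeholders:

```python
# Hypothetical run manifest: pin everything, change exactly one variable per run.
RUN_MANIFEST = {
    "policy_version": "gate-policy-v3",    # versioned blocker/scoring policy
    "dataset_snapshot": "gold-2025-q1",    # frozen case IDs and snapshots
    "baseline": {"prompt": "support-v12", "retriever": "bm25-v4", "model": "model-a"},
    "candidate": {"prompt": "support-v13", "retriever": "bm25-v4", "model": "model-a"},
}

def changed_variables(manifest: dict) -> list[str]:
    """Variables that differ between baseline and candidate; should be exactly one."""
    return [k for k in manifest["baseline"]
            if manifest["baseline"][k] != manifest["candidate"][k]]
```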
A reference implementation in the llm-eval-ops repo demonstrates bucketed cases, structured checks, paired runs, and verdict logic, showing that strong gates come from design, not dataset size.