Replace Vibes Testing with Systematic Evals to Catch Regressions

Agents fail silently on untested inputs such as adversarial queries, edge cases, or overly simple user phrasing, and traditional unit tests can't catch this because outputs are non-deterministic: the same prompt yields different text on every run, and many of those variants are still correct. Human review doesn't scale, misses regressions in CI, and can't validate a model switch or prompt tweak without re-checking everything by hand. Evals solve this by treating traces (nested JSON logs of LLM and tool calls with their inputs, outputs, and metadata like token counts and timing) as test data. Run evals in CI so a prompt fix for one issue doesn't introduce hallucinated features or an unwanted tone shift elsewhere. Real teams like Decrypt, Bolt, and Anthropic iterated from vibes testing to evals for production agents.

Agents amplify issues via cascading failures: a wrong tool choice, bad parameters, a misparsed tool output, or a multi-agent routing error compounds into disasters like confusing Tesla (the car company) with Nikola Tesla (the inventor) in a report. Evals must tolerate non-prescriptive paths: agents evolve with model upgrades and find clever shortcuts that break rigid tests. Distinguish capability evals (hard tasks that benchmark improvement) from regression evals (which ensure existing baselines still hold). Eval outputs include a score, a label, and an LLM-written explanation, and those explanations reveal patterns such as systematic prompt flaws versus one-off mistakes.

Quote: "The usual fix unit test doesn't work here... because the same prompt will produce different text on every single run, but those outputs might all be correct."

Trace First, Then Diagnose Failures Before Writing Evals

Start every pipeline with instrumentation: use Phoenix (Arize's open-source observability tool) to capture spans (individual LLM and tool steps), with no local install needed if you use Phoenix Cloud (free account plus an API key). Install the dependencies with pip install arize-phoenix[crewai] claude-agent-sdk (this assumes a Claude API key; the setup is adaptable to OpenAI or Gemini). Then run the pre-built financial analysis agent (Claude-powered; it fetches Yahoo Finance data and generates reports) on 13 test queries, which auto-traces every run to the Phoenix UI.
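A minimal sketch of that wiring, assuming the phoenix.otel register helper and the OpenInference CrewAI instrumentor (exact module names, extras, and environment variables may differ between Phoenix versions):

import os
from phoenix.otel import register
from openinference.instrumentation.crewai import CrewAIInstrumentor

# Point the tracer at Phoenix Cloud (hosted collector plus the API key from your account).
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.environ['PHOENIX_API_KEY']}"

# Register a tracer provider under a named project, then instrument the agent
# framework so every LLM call and tool call is emitted as a span.
tracer_provider = register(project_name="financial-analysis-agent")
CrewAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, each of the 13 test queries produces one trace in the Phoenix UI,
# with nested spans for the LLM steps and the Yahoo Finance tool calls.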

Inspect traces in Phoenix before writing evals: filter by span, view inputs and outputs, and identify failure modes by hand. Categorize root causes, e.g. the model doesn't know the current year and so fails on forward-looking data, tool parameter errors, or hallucinated facts. At scale, use an LLM to auto-categorize eval explanations (it's LLMs all the way down; a sketch follows below). This data-driven step avoids the mistake most tutorials make: writing evals blind and measuring the wrong thing. Example: a correctness eval scores 0/13 on this agent (it can't verify future data), while a faithfulness eval (does the report stick to its sources?) scores 13/13, which shows that choosing the right eval matters more than tuning it.
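A rough sketch of that auto-categorization step, assuming the Anthropic Python SDK and a hypothetical list of explanation strings exported from your eval runs:

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CATEGORIES = ["wrong tool choice", "bad tool parameters", "stale date assumption", "hallucinated fact", "other"]

def categorize_failure(explanation: str) -> str:
    """Ask a model to bucket one eval explanation into a root-cause category."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=20,
        messages=[{
            "role": "user",
            "content": f"Classify this eval failure explanation into exactly one of "
                       f"{CATEGORIES}. Reply with the category only.\n\n{explanation}",
        }],
    )
    return response.content[0].text.strip()

# explanations: a hypothetical list of judge explanations pulled from your eval runs
# from collections import Counter
# counts = Counter(categorize_failure(e) for e in explanations)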

Key principle: read traces to define your rubric, i.e. what counts as "good". For the financial agent: accurate tool use, fidelity to sources, and complete reports with nothing extra bolted on. Avoid prescriptive evals (e.g., requiring an exact tool sequence) that penalize a smarter agent for finding a better path. Humans build golden datasets and review novel failures; code and LLM evals handle the volume.

Quote: "We're going to do something that most of the tutorials skip. We're going to actually look at the data. We're going to read our traces, categorize what went wrong, and figure out what to measure before we write a single eval."

Layer Code, Built-in, and Custom LLM Evals for Comprehensive Coverage

Build evals only after tracing, and treat the two kinds as complementary: code evals for deterministic checks (fast and cheap), LLM-as-judge evals for semantic checks (flexible, but costlier and itself nondeterministic).

Code evals (Python functions): Validate JSON output, token limits (<500), required fields, forbidden phrases, keyword presence. Example:

import json

def json_eval(output: str) -> dict:
    """Code eval: does the agent's output parse as valid JSON?"""
    try:
        json.loads(output)
        return {"score": 1.0, "label": "valid", "reason": "Parses as JSON"}
    except json.JSONDecodeError as exc:
        return {"score": 0.0, "label": "invalid", "reason": f"JSON parse error: {exc}"}

Run it via Phoenix, e.g. evaluate(pnx.Eval(name="json").with_code(json_eval), dataset); code evals finish in milliseconds and are fully reproducible.
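The runner is optional: the same check applies directly over a dataframe of trace outputs. A minimal sketch reusing json_eval from above, with a made-up traces_df standing in for your exported traces:

import pandas as pd

# One row per trace, holding the agent's final output text.
traces_df = pd.DataFrame({"output": ['{"ticker": "TSLA"}', "not json at all"]})

results = traces_df["output"].apply(json_eval).apply(pd.Series)
print(results[["score", "label", "reason"]])
# Low-scoring rows point straight at the traces worth reading in Phoenix.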

Built-in LLM evals (Arize Phoenix), e.g. pnx.qa_correctness and pnx.answer_relevancy: they prompt a powerful judge LLM (e.g., Claude 3.5 Sonnet) to compare the agent's output against the question and reference material, returning scores from 0 to 1 along with explanations.
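For reference, a sketch of what invoking a built-in judge can look like with the phoenix.evals helpers (llm_classify plus the shipped QA template; names and signatures may vary by version):

import pandas as pd
from phoenix.evals import (
    AnthropicModel,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    llm_classify,
)

# One row per trace: the user question, the agent's answer, and the retrieved context.
eval_df = pd.DataFrame({
    "input": ["Summarize TSLA's latest quarter."],
    "output": ["Tesla reported revenue of ..."],
    "reference": ["<Yahoo Finance data the agent retrieved>"],
})

qa_results = llm_classify(
    dataframe=eval_df,
    template=QA_PROMPT_TEMPLATE,
    model=AnthropicModel(model="claude-3-5-sonnet-20240620"),
    rails=list(QA_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # keep the judge's reasoning for failure analysis
)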

Custom LLM rubric evals: define the rules in the judge prompt and add few-shot examples taken from real traces. For faithfulness:

faithfulness_eval = pnx.LLMEval(
    name="faithfulness",
    prompt_template="""Judge if {output} is faithful to {sources}...""",
    examples=[{"input": ..., "output": ..., "reference": ..., "score": 1.0, "explanation": ...}],
    model="claude-3-5-sonnet-20240620"
)

Meta-evaluate your judges: run a human-labeled golden dataset through the eval and score how often the judge agrees with the human label (90%+ agreement is a reasonable bar for reliability). Use a stronger model for judging than the one powering the agent.
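Agreement itself is just a code check. A minimal sketch, assuming a hypothetical golden_df that pairs each human label with the judge's label:

import pandas as pd

# Hypothetical golden set: human label vs. the LLM judge's label per example.
golden_df = pd.DataFrame({
    "human_label": ["faithful", "unfaithful", "faithful", "faithful"],
    "judge_label": ["faithful", "unfaithful", "unfaithful", "faithful"],
})

agreement = (golden_df["human_label"] == golden_df["judge_label"]).mean()
print(f"Judge agreement: {agreement:.0%}")  # below ~90%? fix the rubric or examples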

When to use which: code evals for format and length; LLM evals for accuracy, faithfulness, and tone. Agents also need end-to-end coverage of tool selection, tool parameters, and output parsing; a sketch of a tool-selection check follows below.
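A minimal sketch of such an end-to-end tool check, assuming each trace's tool-call spans have been exported as dicts with tool_name and arguments keys (the tool and field names here are illustrative, not from the notebook):

def tool_selection_eval(tool_spans: list[dict], expected_tool: str, expected_ticker: str) -> dict:
    """Did the agent call the right tool, with the right ticker symbol?"""
    calls = [s for s in tool_spans if s["tool_name"] == expected_tool]
    if not calls:
        return {"score": 0.0, "label": "wrong_tool", "reason": f"{expected_tool} was never called"}
    if not any(c["arguments"].get("ticker") == expected_ticker for c in calls):
        return {"score": 0.0, "label": "bad_params", "reason": f"called without ticker={expected_ticker}"}
    return {"score": 1.0, "label": "correct", "reason": "right tool, right parameters"}

# Example: catch the Tesla-the-company vs. Nikola-Tesla confusion at the tool layer.
spans = [{"tool_name": "yahoo_finance_lookup", "arguments": {"ticker": "TSLA"}}]
print(tool_selection_eval(spans, "yahoo_finance_lookup", "TSLA"))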

Quote: "Choosing the right eval matters more than tuning it. A correctness eval scored 0 out of 13 on the same agent that a faithfulness eval scored 13 out of 13."

Datasets and Experiments: Prove Iterations Work

Create datasets from traces, e.g. dataset = pnx.Dataset.from_pandas(traces_df), or use human-labeled golden sets; one way to get traces into a dataframe is sketched below. Then run experiments that compare a baseline against prompt variants (the pnx.Experiment snippet that follows).
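One possible way to pull traced data back into pandas, assuming the Phoenix client's get_spans_dataframe helper (column names depend on your Phoenix version and instrumentation):

import phoenix as px

# Pull the project's traced spans back out of Phoenix as a dataframe.
spans_df = px.Client().get_spans_dataframe(project_name="financial-analysis-agent")

# Keep one row per trace (root spans only) and just the columns an eval needs;
# the exact column names ("attributes.input.value", etc.) depend on your setup.
roots = spans_df[spans_df["parent_id"].isna()]
traces_df = roots[["attributes.input.value", "attributes.output.value"]].rename(
    columns={"attributes.input.value": "input", "attributes.output.value": "output"}
)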

exp = pnx.Experiment(name="prompt-v2", trace_dataset=dataset)
exp.log_evals([json_eval, faithfulness_eval], variant="v2_prompt")
exp.compare()  # Tables/charts: scores, spans, explanations
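The comparison itself needs nothing exotic. A sketch of the underlying idea in plain pandas, with hypothetical per-trace scores for the baseline and the v2 prompt:

import pandas as pd

# Hypothetical per-trace faithfulness scores for the 13 test queries.
baseline = pd.DataFrame({"query_id": range(13), "faithfulness": [1.0] * 10 + [0.0] * 3})
v2_prompt = pd.DataFrame({"query_id": range(13), "faithfulness": [1.0] * 12 + [0.0] * 1})

summary = pd.DataFrame({
    "baseline": baseline.mean(numeric_only=True),
    "v2_prompt": v2_prompt.mean(numeric_only=True),
})
print(summary.loc[["faithfulness"]])  # did the prompt change actually move the metric?

# Regressions are individual rows that got worse, not the averages:
merged = baseline.merge(v2_prompt, on="query_id", suffixes=("_base", "_v2"))
print(merged[merged["faithfulness_v2"] < merged["faithfulness_base"]])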

Visualize regressions, filter to the low-scoring traces, and iterate on the prompt. At thousands of traces, patterns emerge (e.g., budget-related queries consistently omit costs).

Advanced frameworks: an impact hierarchy (prioritize the evals with the highest failure rates); the data flywheel (evals → insights → prompt fixes → better traces); pairwise comparison (A/B two outputs with a judge); and reliability scoring (measure the judge's own variance).

Common pitfalls: overly brittle code evals; LLM judges that don't agree with humans (meta-evaluation fixes this); ignoring cascading failures and multi-agent routing; and overly prescriptive tests.

Quality criteria: the regression suite passes at 100%; explanations are actionable; the whole suite runs in CI; humans validate the outliers.

Practice: fork the speaker's notebook (GitHub: seldo), trace your own agent, build three evals, and run an experiment on 50+ traces. Prerequisites: Python, an LLM API key, and basic familiarity with agents (roughly 2+ years of dev experience).

Quote: "Without evals you can't change your system prompt to fix a tone issue because the tone might get better but suddenly the bot might be hallucinating product features."

Key Takeaways

  • Instrument with Phoenix traces before any evals; inspect spans to pinpoint failures like wrong tool choices or missing awareness of the current date.
  • Layer evals: code for deterministic (JSON, length), LLM for semantic (faithfulness > correctness for sourced tasks).
  • Meta-evaluate LLM judges on golden data to ensure reliability >90%.
  • Use capability evals for new skills, convert to regressions; run experiments to validate changes, not eyeballing.
  • Categorize failures from explanations to fix systematic prompt issues at scale.
  • Avoid prescriptive tests; agents optimize their own paths, so focus on outcome metrics that aren't brittle.
  • Humans for golden sets/outliers only; evals scale CI for model/prompt upgrades.
  • Start simple: 13-query financial agent → full pipeline in notebook.