Build Golden Datasets and Custom Evals for Reliable Agent Testing
Samuel Colvin demonstrates optimizing agents post-deployment by first establishing a baseline with structured evaluations against a "golden dataset"—manually verified ground truth data. For the case study, he scrapes Wikipedia pages for UK MPs, extracts text via BeautifulSoup, and defines Pydantic schemas for MP details and political relations (focusing on ancestors like parents/grandparents, excluding spouses/children).
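A minimal sketch of what the two extraction schemas might look like (field names are illustrative, not the exact models from the workshop repo):

```python
from pydantic import BaseModel


class Relation(BaseModel):
    """One family relation of an MP; ancestors only (parents, grandparents)."""
    name: str                        # related person's name
    relation_type: str               # e.g. "parent", "grandparent"
    description: str | None = None   # role/notability, e.g. "former MP"


class MPDetails(BaseModel):
    """Structured output the agent extracts from one Wikipedia page."""
    name: str
    constituency: str | None = None
    relations: list[Relation] = []   # spouses and children deliberately excluded
```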
The golden dataset (golden_relations.json) contains exact relations for ~650 MPs, created by running a high-end model (e.g., Opus) once and then verifying the results manually. Custom evaluators compare agent outputs against this ground truth:
- Accuracy: Exact match on relations list (1.0 if perfect, partial scores like 0.9 for minor name/description diffs).
- Assertions for relation types, roles, and ancestor filtering.
Key principle: Prefer deterministic, rule-based evals over "LLM-as-judge" to avoid bias. "Defining your own evaluators is far better than LLM as a judge because the LLM as a judge is effectively the kind of lunatics running the asylum."
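A sketch of such a hand-written, deterministic evaluator using the pydantic_evals Evaluator interface (API details may vary by version); it assumes the expected output is the MPDetails model above, and the overlap-based partial scoring is a stand-in for the workshop's exact rules:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class RelationAccuracy(Evaluator):
    """Exact-match accuracy on the relations list, with partial credit for overlap."""

    def evaluate(self, ctx: EvaluatorContext) -> float:
        expected = {(r.name, r.relation_type) for r in ctx.expected_output.relations}
        actual = {(r.name, r.relation_type) for r in ctx.output.relations}
        if expected == actual:
            return 1.0
        # Partial credit: fraction of relations both sides agree on.
        return len(expected & actual) / max(len(expected | actual), 1)
```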
To run: load the dataset with load_dataset(), register evaluators, then call dataset.evaluate(agent_func, name="eval-name"), using Pydantic AI's override for prompts/models. Concurrency limits (e.g., max_concurrency=5) prevent rate-limit errors. Results appear in the Logfire UI: spans show inputs/outputs/costs, and the evals tab aggregates metrics (e.g., 85% accuracy for the simple prompt).
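Put together, a run might look roughly like this. It assumes the `agent` and MPDetails from above; load_dataset() in the repo presumably wraps something like Dataset.from_file, and the exact registration/evaluation calls from pydantic_evals may differ by version:

```python
from pydantic_evals import Dataset

# Golden dataset of ~650 MPs with hand-verified relations.
dataset = Dataset.from_file("golden_relations.json")
dataset.add_evaluator(RelationAccuracy())


async def extract_relations(page_text: str) -> MPDetails:
    # override() swaps the model (or other settings) for this run only,
    # so the same agent definition can be evaluated under different configs.
    with agent.override(model="gateway:gpt-4o-mini"):
        result = await agent.run(page_text)
    return result.output


# Capped concurrency keeps the run under provider rate limits.
report = dataset.evaluate_sync(extract_relations, name="simple-prompt", max_concurrency=5)
print(report)
```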
Common mistake: over-relying on console logs; disable terminal output (LOGFIRE_NO_CONSOLE=true) for clean traces. Before: the simple one-liner prompt gets 85% accuracy and confuses non-ancestors and political vs. public figures. After a better prompt: improves to ~90%+ by explicitly discounting same-generation relations.
Setup prerequisites: uv sync, Logfire project (logfire project use demo), API keys (Pydantic Gateway for multi-model access or direct OpenAI/Anthropic). Quality criteria: High accuracy on ancestors, low false positives on spouses/kids.
Evolve Prompts Genetically with GEPA on Production Traces
GEPA (Genetic-Pareto prompt optimization, available as the gepa library) optimizes prompts as strings or JSON by breeding top performers. It evaluates candidates on a dataset, selects the Pareto frontier (the best trade-offs), mutates/crosses them (e.g., mixing phrases from high scorers), and iterates.
Process:
- Define initial prompts (simple vs. advanced) and models as Pydantic models.
- Run evals on split dataset (e.g., 65 test cases for speed).
- Launch GEPA, e.g. gepa.optimize(evaluate_fn, initial_candidates, generations=10, population_size=20); it parallelizes evals and instruments runs via Logfire for traces.
- Output: prompts ranked by composite score (accuracy plus cost/efficiency).
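GEPA itself only needs a way to score a candidate. Below is a sketch of such a scoring function, assuming a hypothetical test_cases list of (page text, expected MPDetails) pairs drawn from the golden split; the exact callback signature gepa expects is not reproduced here:

```python
from pydantic_ai import Agent


def score_candidate(candidate_prompt: str) -> float:
    """Score one evolved prompt against the held-out golden cases."""
    candidate_agent = Agent(
        "gateway:gpt-4o-mini",
        instructions=candidate_prompt,
        output_type=MPDetails,
    )
    total = 0.0
    for case in test_cases:  # hypothetical: (page_text, expected) pairs
        result = candidate_agent.run_sync(case.page_text)
        expected = {(r.name, r.relation_type) for r in case.expected.relations}
        actual = {(r.name, r.relation_type) for r in result.output.relations}
        total += len(expected & actual) / max(len(expected | actual), 1)
    return total / len(test_cases)
```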
In the demo: the simple prompt → 85% accuracy; the advanced prompt (explicit ancestor rules) → better; GEPA evolves hybrids exceeding both (e.g., 92%+ accuracy). It handles systemic errors like over-including spouses by evolving phrasing such as: "Only ancestors (parents, grandparents)—exclude spouses, children, siblings."
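For reference, the two starting candidates might look something like this (illustrative wording, not the repo's exact prompts):

```python
# Illustrative starting prompts; GEPA breeds variants of strings like these.
SIMPLE_PROMPT = "Extract the MP's details and family relations from the page."

ADVANCED_PROMPT = (
    "Extract the MP's details and family relations from the page. "
    "Only include ancestors (parents, grandparents); exclude spouses, "
    "children, and siblings."
)
```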
Trade-offs: compute-heavy (hundreds of evals per generation); start with a small dataset. Common mistake: treating mutation as purely random; GEPA biases selection toward the elites, like horse breeding. "It takes the best racehorses and breeds them... you take all of the best resources and breed them."
Extend to production: Use real traces/feedback as eval inputs. Future: Autonomous optimization from Logfire.
Quote: "GEPA is ultimately an optimization library that optimizes a string... it can be a simple text prompt or some JSON data."
Enable Zero-Downtime Tuning with Managed Variables in Production
Logfire's managed variables let you update any Pydantic-serializable object (prompts, models, params) live, without restarts. Define the configuration as a Pydantic model:

```python
from pydantic import BaseModel
from logfire.managed import managed_variable


class AgentConfig(BaseModel):
    model: str = "gateway:gpt-4o-mini"
    instructions: str = "..."


config = managed_variable(AgentConfig)

# In the agent definition:
agent = Agent(..., instructions=config.instructions, model=config.model)
```

Changes made in the Logfire UI propagate to running agents without a restart (the client polls for updates, roughly every 30s).
Production demo: FastAPI server with /analyze endpoint runs agent on live Wikipedia HTML. Update prompt/model via Logfire—tune for better ancestor detection without deploy.
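A minimal sketch of what that endpoint could look like, assuming the agent and managed config defined above; the route shape and payload are illustrative:

```python
import httpx
from fastapi import FastAPI

app = FastAPI()


@app.post("/analyze")
async def analyze(wikipedia_url: str) -> MPDetails:
    # Fetch the live page; in the demo the agent works on raw Wikipedia HTML.
    async with httpx.AsyncClient() as client:
        html = (await client.get(wikipedia_url)).text
    # instructions/model come from the Logfire-managed config, so an edit in
    # the UI is picked up on the next poll with no redeploy.
    result = await agent.run(html)
    return result.output
```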
Implicit feedback: log user thumbs-up/down and aggregate it into evals (see the sketch after the Q&A list below). Q&A insights:
- Prompt bloat: GEPA prunes inefficient phrasing.
- Context engineering: Chain-of-thought in prompts.
- Internal use: Pydantic team tunes agents on traces.
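One way to capture that implicit feedback (a sketch; Logfire may offer more direct feedback hooks): record each thumbs-up/down as a log with attributes so it can later be joined against the trace and folded into an eval set:

```python
import logfire


def record_feedback(trace_id: str, thumbs_up: bool) -> None:
    # Emit a log line carrying the user's reaction plus the trace it refers to;
    # these can be queried later and turned into eval cases.
    logfire.info(
        "user feedback",
        trace_id=trace_id,
        thumbs_up=thumbs_up,
    )
```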
Trade-offs: polling overhead is low; the free tier is generous. Common mistake: hand-rolled mutable globals for config; managed variables are safe and versioned.
Quote: "Managed variables... don't have to be just text they can be effectively any object that you can define with a Pydantic model."
From Manual to Continuous Optimization Workflow
Full loop: golden evals → GEPA on traces → deploy via managed variables → feedback-driven evals. The material sits mid-workshop: it assumes Python/Pydantic familiarity and agent-building basics. More broadly, any structured-output task (invoices, addresses) benefits.
Exercise: fork the repo (github.com/pydantic/talks/2024-ai-engineer), run uv run main.py eval --split test --prompt initial, compare prompts, optimize with GEPA, and deploy to the FastAPI server.
Quote: "Deploying an agent is only the start... change prompts, models... without redeploying."
Key Takeaways
- Create golden datasets from high-model runs + manual verification for deterministic evals—beats LLM judges.
- Use GEPA to breed prompts: Start with 2-5 candidates, 10 generations on 65-case split for quick wins.
- Define managed variables as Pydantic models for instant prod updates—no restarts needed.
- Instrument everything with Logfire: Traces reveal confusions (e.g., spouses as ancestors).
- Prioritize ancestor filtering in political/relation extraction: Evolve phrasing like "exclude same-gen or descendants."
- Run evals in parallel but cap concurrency (max_concurrency=5) to stay under rate limits and control costs during optimization.
- For FastAPI agents: Override configs live, log implicit feedback for GEPA inputs.
- Avoid hype: "I don't really believe in AI observability I think it's a feature not a category."
- Scale: Free Logfire tier handles workshops; Gateway simplifies multi-model testing.