Ship Reliable AI Agents: Braintrust Hands-On
Build production-grade, multi-step AI agents by breaking work into specialist stages, instrumenting traces, evaluating against golden datasets, and monitoring real logs, following the workflow Trainline has proven in production.
Overcome Prototype-to-Production Gaps with Operational Rigor
Prototypes shine in demos but crumble under real users because LLMs are non-deterministic: 2+2 can come back as 10. Traditional software's determinism (1+1=2) doesn't apply, and agentic flows with tools amplify the variability. The solution: decompose into microservices-like stages, each with a single responsibility, and avoid monolithic prompts that "work on my machine" but fail at scale. Trainline handles 27M users and 6.3B tickets via agentic travel assistants that manage refunds and reroutes without handoffs, proof that this rigor scales.
Key principle: Observability over logs. Logs show what happened; traces reveal why. Braintrust's platform is LLM- and framework-agnostic and stores semi-structured trace data at scale in its custom Brainstore database. Start the flywheel: Instrument → Evaluate → Remediate → Monitor → Repeat. The target isn't 100% coverage but closing gaps iteratively.
"Works on my machine, fails in production. Patch the prompt, repeat." — Common trap; systematize instead.
Architect Agentic Flows: From Single-Shot to Multi-Stage
Build a Support Triage Agent hands-on: classify tickets, then route them to specialists (refund, change, etc.). Assumes Python basics and LLM familiarity (e.g., OpenAI API); no prior Braintrust experience needed.
Step 1: Single-Shot Prompting Baseline. Prompt GPT-4o-mini: "Categorize this support ticket: text. Output JSON: {category, confidence, reasoning}." Fast but brittle—hallucinations, context loss in complex domains like train refunds (return vs. advance tickets, delays).
Mistake to avoid: Over-relying on one prompt. It fails on edge cases (e.g., ambiguous queries).
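A minimal sketch of the Step 1 single-shot baseline, assuming the OpenAI Python SDK and an OPENAI_API_KEY in the environment (the prompt wording follows the step description above):
import json
from openai import OpenAI

client = OpenAI()

def classify_ticket(text: str) -> dict:
    # Single-shot baseline: one prompt, JSON-mode output, no tools.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "Categorize this support ticket. "
                "Output JSON with keys: category, confidence, reasoning."},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(response.choices[0].message.content)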
Step 2: Add Local Tools for Determinism. Inject functions like get_ticket_details(ticket_id) or check_disruption_status(route), and use structured outputs (JSON mode) for parseable responses. This reduces non-determinism by grounding answers in APIs.
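A sketch of those tool hooks; the function names come from the text, while the return payloads and JSON schemas are illustrative stand-ins for real internal APIs:
def get_ticket_details(ticket_id: str) -> dict:
    # Deterministic lookup against the bookings system (illustrative payload).
    return {"ticket_id": ticket_id, "ticket_type": "advance", "route": "LDN-MAN"}

def check_disruption_status(route: str) -> dict:
    # Grounds the model in live operational data instead of letting it guess.
    return {"route": route, "disrupted": True, "delay_minutes": 45}

# Exposed via function calling so the LLM decides *when* to call them,
# while the answers themselves stay deterministic.
tools = [
    {"type": "function", "function": {
        "name": "get_ticket_details",
        "description": "Fetch booking details for a ticket",
        "parameters": {"type": "object",
                       "properties": {"ticket_id": {"type": "string"}},
                       "required": ["ticket_id"]}}},
    {"type": "function", "function": {
        "name": "check_disruption_status",
        "description": "Check live disruption on a route",
        "parameters": {"type": "object",
                       "properties": {"route": {"type": "string"}},
                       "required": ["route"]}}},
]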
Step 3: Specialist Stages (True Agentic). Break into chain:
- Router: Classify → {refund_agent, change_agent, escalation}.
- Each specialist: Prompt + tools specific to the task (e.g., refund_agent checks eligibility via is_refundable(ticket_type, delay_minutes)).
- Orchestrator aggregates.
Code skeleton:
from openai import OpenAI

class Router:
    def __init__(self):
        self.client = OpenAI()

    def route(self, ticket):
        # route_tool: a function-calling schema (defined elsewhere) whose single
        # enum argument looks like {"destination": "refund" | "change" | "escalate"}.
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Route to: refund|change|escalate"},
                {"role": "user", "content": ticket},
            ],
            tools=[route_tool],
        )
        return response.choices[0].message.tool_calls[0].function.arguments

# Chain: router -> specialist -> final_response
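And a sketch of that orchestration step; the specialist class names (RefundAgent, ChangeAgent, EscalationAgent) are hypothetical:
import json

def handle_ticket(ticket: str) -> str:
    # Orchestrator: route first, then hand off to the matching specialist.
    specialists = {
        "refund": RefundAgent(),
        "change": ChangeAgent(),
        "escalate": EscalationAgent(),
    }
    decision = json.loads(Router().route(ticket))  # e.g. {"destination": "refund"}
    return specialists[decision["destination"]].handle(ticket)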
Trade-off: Latency goes up 2-3x, but accuracy improves 20-30% on Trainline's complex cases. The agentic stage also slots into the broader workflow after upstream ML predictions (e.g., disruption forecasts).
"Good luck doing train changes yourself even with ChatGPT." — Trainline on agent superiority.
Instrument and Trace for Deep Visibility
Wrap calls in Braintrust:
import braintrust

# Assumes a Braintrust project named "support-triage".
experiment = braintrust.init(project="support-triage")

@braintrust.traced
def router(ticket):
    # LLM call that classifies the ticket
    return category
Captures inputs/outputs, intermediate states, tool calls. UI visualizes spans (prompt → tool → response). Query traces by score, filter failures.
Quality criteria: Scores >0.8 pass; <0.6 trigger auto-remediation. Braintrust can run LLM-as-judge evals automatically (e.g., "Is the reasoning correct?") as well as custom scorers.
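For intuition, a hand-rolled LLM-as-judge scorer can be a plain function; this sketch calls the OpenAI API directly, and the judge prompt and yes/no scale are illustrative (Braintrust's autoevals library also ships ready-made judge scorers):
from openai import OpenAI

judge = OpenAI()

def reasoning_correct(input, output, expected=None) -> float:
    # LLM-as-judge: ask a second model whether the agent's stated reasoning holds up.
    verdict = judge.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": (
            f"Ticket: {input}\nAgent reasoning: {output}\n"
            "Is the reasoning correct? Answer YES or NO.")}],
    ).choices[0].message.content
    return 1.0 if "YES" in verdict.upper() else 0.0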
Before: Blind patching. After: Pinpoint token spikes, model drift.
Evaluate Offline with Golden Datasets
Create a golden set: 100+ real tickets with human-labeled {expected_category, reasoning}. Trainline pulls these from prod logs.
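Rows can be as simple as an input plus an expected label; the examples below are illustrative:
golden_dataset = [
    {"input": "My 09:15 to Manchester was cancelled, can I get my money back?",
     "expected": {"expected_category": "refund",
                  "reasoning": "Cancelled service, so the ticket is refundable."}},
    {"input": "Can I move my Advance ticket to tomorrow evening?",
     "expected": {"expected_category": "change",
                  "reasoning": "Advance tickets can be changed before departure."}},
    # ...100+ human-labeled rows mined from production logs.
]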
Run evals with Braintrust's Eval entry point (assumes a triage_agent(input) task function and the golden rows above):
from braintrust import Eval
Eval("support-triage", data=golden_dataset, task=triage_agent, scores=[accuracy_scorer, helpfulness_scorer])
Metrics: Exact match (category), semantic similarity (reasoning via embedding cosine), custom (e.g., refund logic correctness).
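The scorers referenced above can be plain Python functions; a sketch, assuming the golden-row field names shown earlier and autoevals' EmbeddingSimilarity for the reasoning check:
from autoevals import EmbeddingSimilarity  # Braintrust's open-source scorer library

def accuracy_scorer(input, output, expected) -> float:
    # Exact match on the predicted category.
    return 1.0 if output.get("category") == expected["expected_category"] else 0.0

# Semantic similarity (embedding cosine) between the model's reasoning
# and the human-written rationale in the golden row.
reasoning_similarity = EmbeddingSimilarity()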
Remediate failures: Low-score traces → analyze (e.g., prompt lacks delay threshold). Iterate prompts/tools.
Exercise: Build your own golden set from 20 prod logs; then eval a model swap (e.g., GPT-4o-mini to o1-mini) and verify performance parity.
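That comparison is just the same eval run once per candidate model; a sketch, assuming triage_agent accepts a model parameter:
from functools import partial
from braintrust import Eval

for model in ("gpt-4o-mini", "o1-mini"):
    # One experiment per candidate model, scored against the same golden set.
    Eval(
        "support-triage",
        data=golden_dataset,
        task=partial(triage_agent, model=model),
        scores=[accuracy_scorer, reasoning_similarity],
    )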
"Before Braintrust, no way to simulate cheaper model perf." — Trainline on cost optimization.
Deploy, Score Online, and Close the Loop
Production flow:
- Deploy via Braintrust API: Prod traces auto-log (see the sketch after this list).
- Online scoring: Real-time evals on a sample (e.g., 1% of traffic); alert when scores drop below threshold.
- Monitor dashboards: P95 latency, failure rate, token $/query.
- Feedback loop: Failed prod traces → new golden data → an expanded eval set.
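A sketch of the logging side of this flow, assuming the handle_ticket chain from earlier; online scoring rules and alerts are then layered on top of these logs:
import braintrust
from braintrust import wrap_openai
from openai import OpenAI

# Production: log traces to Braintrust instead of an offline experiment.
logger = braintrust.init_logger(project="support-triage")
client = wrap_openai(OpenAI())  # hand this client to the agents so every LLM call is traced

@braintrust.traced
def handle_ticket_prod(ticket: str) -> str:
    answer = handle_ticket(ticket)  # the router -> specialist chain from earlier
    braintrust.current_span().log(input=ticket, output=answer)
    return answer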
Trainline example: Travel assistant evals on tone, helpfulness, complex reasoning (ticket types/delays). Ships features 2x faster.
Edge cases: No substitute for prod data. Use Braintrust to mine failures (e.g., 5% refund misclassifications → a fix in the refund specialist).
"Move fast without breaking things at Trainline scale." — Core mindset.
Key Takeaways
- Decompose agents into single-responsibility stages + tools over monolithic prompts for +20% accuracy.
- Instrument everything with Braintrust traces from day 0—reveal hidden failure modes logs miss.
- Build golden datasets from real logs; eval offline before model/cost changes.
- Online scoring on prod subset + alerts prevents regressions.
- Flywheel: Trace → Eval → Fix → Monitor; Trainline ships agent features confidently at 27M-user scale.
- Start small: Instrument existing app, add 50 golden examples, iterate weekly.
- Custom scorers beat generic (e.g., domain-specific refund rules).
- Trade latency for reliability in agentic chains; users value correct answers over instant ones.
- Platform-agnostic: Works with any LLM/agent framework.
"Perfection is the enemy of good—start the flywheel somewhere." — Giran Moodley.