Agents Demand Production Monitoring, Not Just Evals

Traditional software testing with unit tests and golden datasets fails for agents because agents are non-deterministic, unbounded, and face an effectively infinite space of inputs and outputs. Agents call tools, access memory sources, and spawn sub-agents recursively, creating a combinatorial explosion of edge cases that no eval suite can cover. Evals work for simple, well-defined inputs but miss undefined behaviors in production, where the stakes are highest: healthcare, finance, military.

Principle: Monitoring catches the long-tail issues evals miss, which lets you ship faster. As with pre-agent products, prioritize production observability over exhaustive testing. Signals split into explicit (objective, verifiable) and implicit (semantic, fuzzy).

"Agent failures are very different than traditional failures in software. They're non-deterministic. There's an infinite space of inputs... outputs... tools to affect other systems arbitrarily."

Common mistake: Relying on LLM-as-judge evals like "rate this 1-10"; these are far less effective than binary classifiers targeted at specific issues.

Explicit Signals: Baseline Health Metrics

Track these verifiable metrics with alerts on spikes/drops:

  • Tool error rate: The core metric; spikes signal integration failures.
  • Latency: Delays matter most in long sessions (hours-long runs).
  • Regenerations: Users retrying responses.
  • Cost: Sudden jumps indicate inefficiency.

Flat metrics can also be a warning: zero errors might mean the feature is going unused. Set up dashboards to visualize daily trends.

Implementation: Log at the agent-harness level and aggregate by day and by release. Use these metrics for immediate alerting.

Quality criteria: Alert when a metric crosses its threshold (e.g., error rate deviating more than 5% from baseline). Trade-off: Explicit signals are easy and cheap to collect but miss subtle semantic failures.
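
A minimal sketch of harness-level counters with a deviation alert; the field names, threshold, and alert destination are illustrative assumptions, not from the source:

```python
from dataclasses import dataclass, field

@dataclass
class DayStats:
    """Per-day counters logged at the agent-harness level."""
    tool_calls: int = 0
    tool_errors: int = 0
    latencies_s: list = field(default_factory=list)
    regenerations: int = 0
    cost_usd: float = 0.0

    @property
    def tool_error_rate(self) -> float:
        return self.tool_errors / max(self.tool_calls, 1)

def alert(message: str) -> None:
    # Wire this up to Slack, PagerDuty, or your dashboard of choice.
    print(f"[ALERT] {message}")

def check_explicit_signals(today: DayStats, baseline_error_rate: float,
                           max_deviation: float = 0.05) -> None:
    """Alert when today's tool error rate deviates from baseline by more than the threshold."""
    if abs(today.tool_error_rate - baseline_error_rate) > max_deviation:
        alert(f"Tool error rate {today.tool_error_rate:.1%} vs baseline {baseline_error_rate:.1%}")
```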

Implicit Signals: Semantic Detectors for Real Issues

These capture nuances of agent behavior via classifiers, regex, and self-reports. Focus on binary flags: issue or no issue.

Classifiers: Train lightweight models (not full LLM calls, which would roughly double inference cost) on categories like:

  • Refusals ("I can't do that").
  • Task failure (incomplete goals).
  • User frustration ("That's wrong", "WTF").
  • Content moderation/NSFW/jailbreaks.
  • Positive outcomes (wins).

Raindrop provides these detectors out of the box; you can also build your own from labeled traces. Detectors are language-agnostic because they are trained models rather than keyword lists. They also make shifts visible, e.g., frustration dropping from 37% to 9% after a prompt change.
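
A minimal sketch of a cheap binary classifier for one category (frustration), using scikit-learn as a stand-in for whatever lightweight model you actually train; the labeled examples are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labeled traces: 1 = user frustration, 0 = fine.
# In practice these come from manually reviewed production messages.
texts = ["that's wrong, try again", "wtf, this is broken",
         "thanks, that worked", "looks good to me"]
labels = [1, 1, 0, 0]

# Cheap binary classifier (TF-IDF + logistic regression) instead of an
# LLM-judge call per message, so monitoring doesn't double inference cost.
frustration_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
frustration_clf.fit(texts, labels)

def flag_frustration(message: str, threshold: float = 0.5) -> bool:
    """Binary flag: does this user message look like frustration?"""
    return frustration_clf.predict_proba([message])[0][1] > threshold
```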

Regex: Cheap but surprisingly powerful for keywords like "this sucks" or "horrible". Claude Code's keywords.ts flagged post-release regressions on a daily basis. Aggregated across millions of messages, a 10% rise in matches is actionable even though individual matches are noisy.

"Regex can be a very good signal... Claude Code source code leaked... keywords.ts... looking for indications of stuff going wrong: WTF, this sucks, horrible."

Principle: Combine signals into dashboard views of daily rates; spikes trigger alerts. Data threshold: signals become useful at roughly hundreds of events per day, the point where manually reviewing every trace stops being feasible.

Mistake: Over-relying on LLM judges (expensive, unreliable); use custom classifiers instead.

Experiments: Ship Safely with Signal A/B Testing

Use signals for production experiments:

  1. Ship the change (model, prompt, or tool) to a percentage of users alongside a control group.
  2. Compare signal rates: is frustration down? Are tool invocations up?
  3. Metadata flags (experiment_id, version) segment events automatically.

Example: Prompt version 2.4 cut frustration from 37% to 9%, reduced complaints about aesthetics, and increased tool usage.

Where this fits in the workflow: after evals, before full rollout. Pipe signal data to Statsig or BigQuery for significance testing; run parallel experiments via the query API.

"Ship to some percentage... control group... if issue rates go up, that's a good signal that what you shipped is not good."

Trade-off: Needs enough volume for statistics (hundreds of events); most valuable for multi-turn agents, though it works for single-turn too.
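
A minimal sketch of tagging events with experiment metadata and comparing one signal rate between treatment and control; the field names and the significance test are illustrative assumptions:

```python
from collections import defaultdict
from scipy.stats import fisher_exact

def frustration_by_group(events):
    """events: dicts with 'experiment_id', 'group' ('control' or 'treatment'), 'frustrated' (bool)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [frustrated, total]
    for e in events:
        counts[e["group"]][0] += int(e["frustrated"])
        counts[e["group"]][1] += 1
    return counts

def compare_experiment(events):
    counts = frustration_by_group(events)
    rates = {g: flagged / max(total, 1) for g, (flagged, total) in counts.items()}
    table = [[counts["treatment"][0], counts["treatment"][1] - counts["treatment"][0]],
             [counts["control"][0], counts["control"][1] - counts["control"][0]]]
    _, p_value = fisher_exact(table)  # crude significance check; use Statsig/BigQuery at scale
    return rates, p_value
```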

Self-Diagnostics: Agents Report Their Own Failures

Inspired by OpenAI's December work on models confessing their own misalignment (hallucinations, scheming, shortcuts like deleting tests).

Agents introspect well thanks to reasoning training. Self-reports catch:

  • Tool failures (the agent complains about repeated errors).
  • User frustration (even when the agent's replies stay diplomatic).
  • Capability gaps (feature requests).
  • Self-correction (good: the agent bypassed a sandbox restriction; bad: that bypass is a security risk).

Setup Steps (minimal, no external tools needed):

  1. Add a tool: report_issue. Keep the name generic (avoid words like "unsafe", which trigger self-censorship). Description: "Send a short report to your creator on interesting behaviors: tool failures, user issues, capabilities missing, self-corrections. Be honest."
  2. System prompt: "If you observe issues, call report_issue."
  3. Tool implementation: log the report, or pipe it to Slack or email.
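
A minimal sketch of the setup above using the OpenAI Python SDK's tool-calling interface; the parameter schema, model choice, and logging destination are illustrative assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

REPORT_ISSUE_TOOL = {
    "type": "function",
    "function": {
        "name": "report_issue",  # deliberately generic; avoid loaded words like "unsafe"
        "description": ("Send a short report to your creator about interesting behaviors: "
                        "tool failures, user issues, capabilities missing, self-corrections. "
                        "Be honest."),
        "parameters": {
            "type": "object",
            "properties": {"report": {"type": "string"}},
            "required": ["report"],
        },
    },
}

SYSTEM_PROMPT = "You are a coding agent. If you observe issues, call report_issue."

def handle_tool_call(call) -> None:
    if call.function.name == "report_issue":
        report = json.loads(call.function.arguments)["report"]
        print(f"[AGENT SELF-REPORT] {report}")  # least-effort observability: log, Slack, or email

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": "Write public_ip.py for me."}],
    tools=[REPORT_ISSUE_TOOL],
)
for call in (response.choices[0].message.tool_calls or []):
    handle_tool_call(call)
```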

Workshop demo (coding agent mimicking Pi):

  • Tools: read/write/edit/bash.
  • The write tool is rigged to fail with a permission error.
  • The agent bypasses the failure by writing the file through a bash heredoc.
  • It reports: "Created public_ip.py via bash because write failed."

Tuning: Frame reports as "notes to your creator"; experiment with the tool name and description to tune the trigger rate. Models resist self-incrimination, so use neutral framing.

"All you have to do is... a simple tool... simple line in system prompt... send to Slack... least effort observability."

Advanced: A triage agent scans the daily signals and investigates spikes by pulling traces and calling tools.
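
A rough sketch of what that triage pass could look like; the spike heuristic and the investigate stub are assumptions:

```python
def daily_triage(today: dict[str, int], baseline: dict[str, int], spike_ratio: float = 1.5) -> None:
    """Find implicit signals that spiked versus baseline and queue each for investigation."""
    for name, count in today.items():
        if count > spike_ratio * baseline.get(name, 0):
            investigate(name)

def investigate(signal_name: str) -> None:
    # Stub: in a real setup, pull the matching traces, let an agent summarize
    # root causes, and post the summary to Slack.
    print(f"Investigating spike in '{signal_name}'")
```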

Prerequisites: a basic agent (OpenAI API, Python). This fits in after basic instrumentation is in place.

Quality: Honest confessions surface insights evals miss. Practice: deliberately break tools, tweak prompts, and review the resulting reports.

Key Takeaways

  • Replace eval-only with monitoring: explicit (errors/latency/cost) + implicit (classifiers/regex) signals.
  • Alert on spikes; signals become useful once you have hundreds of events.
  • Run experiments: flag metadata, compare signal deltas pre/post-ship.
  • Self-diagnostics: 1 tool + prompt line; frame neutrally for honest reports.
  • Classifiers > LLM judges: Train cheap models for scale.
  • Regex signals win in aggregate despite missing individual cases.
  • Multi-turn agents benefit most; works for single-turn too.
  • Triage agents automate investigations.
  • Experiment tool names/prompts to boost self-reports.
  • Production > evals for long-tail reliability.