The Shift from Uptime to Quality
Traditional observability tools (e.g., Datadog, Grafana) are designed to monitor deterministic systems, focusing on uptime, latency, and 4xx/5xx HTTP errors. Agent observability must also account for the non-deterministic nature of LLMs. Technical metrics such as time-to-first-token and total duration remain relevant, but the primary goal shifts to qualitative assessment: Was the response grounded in the provided context? Did the agent use the correct tools? Is the output aligned with brand standards?
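Of the quality questions above, "did the agent use the correct tools?" is the easiest to check mechanically. The following is a minimal sketch, not a reference implementation. The `AgentTrace` record, its field names, the tool names, and the `tool_usage_check` helper are all hypothetical, chosen only to show operational metrics and a quality check sitting side by side on the same trace.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    """Hypothetical trace record: operational metrics plus the tools the agent called."""
    time_to_first_token_ms: float
    duration_ms: float
    tools_called: list[str] = field(default_factory=list)


def tool_usage_check(trace: AgentTrace, expected_tools: set[str]) -> dict:
    """Compare the tools the agent called against the tools a reviewer expected."""
    called = set(trace.tools_called)
    return {
        "missing": sorted(expected_tools - called),     # expected but never called
        "unexpected": sorted(called - expected_tools),  # called but not expected
        "passed": called == expected_tools,
    }


# Usage: a trace can look healthy on latency and still fail the quality check.
trace = AgentTrace(
    time_to_first_token_ms=120.0,
    duration_ms=950.0,
    tools_called=["search_kb", "send_email"],
)
result = tool_usage_check(trace, expected_tools={"search_kb", "issue_refund"})
```

Groundedness and brand alignment are harder to reduce to a set comparison and usually need human review or a learned scorer.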
The Data Engineering Challenge
Agent traces are fundamentally different from traditional application traces. They are:
- Voluminous: A single agent trace can exceed 1GB, with individual spans reaching 20MB.
- Unstructured: They contain vast amounts of raw text data that require full-text search capabilities.
- Real-time: Engineers need immediate visibility into these complex interactions as they happen.
Because general-purpose databases struggle with this combination of high-volume, semi-structured data and full-text indexing, specialized infrastructure is required. This includes write-ahead logs, which make newly written spans visible immediately, and full-text search libraries (such as a forked version of Tantivy), which let engineers query traces by specific keywords or semantic content.
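The interplay between a write-ahead log and a full-text index can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular product or Tantivy itself works. The `TraceStore` class, the `Span` record, and the `flush_threshold` parameter are hypothetical, the "log" is an in-memory list rather than a durable file, and the index is a plain inverted index with AND semantics. Only the idea is shown: new spans are searchable from the log immediately and move into the index in batches.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    span_id: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class TraceStore:
    """Toy store: an in-memory stand-in for a write-ahead log plus an inverted index."""

    def __init__(self, flush_threshold: int = 2):
        self.wal: list[Span] = []        # recently appended spans, not yet indexed
        self.segments: list[Span] = []   # spans that have been indexed
        self.index: dict[str, set[int]] = defaultdict(set)  # token -> positions in segments
        self.flush_threshold = flush_threshold

    def append(self, span: Span) -> None:
        """Record the span in the log; index in batches once the threshold is reached."""
        self.wal.append(span)
        if len(self.wal) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Move every logged span into the inverted index and clear the log."""
        for span in self.wal:
            position = len(self.segments)
            self.segments.append(span)
            for token in tokenize(span.text):
                self.index[token].add(position)
        self.wal.clear()

    def search(self, query: str) -> list[Span]:
        """Return spans containing every query term, from both the index and the log."""
        terms = tokenize(query)
        if not terms:
            return []
        postings = [self.index.get(term, set()) for term in terms]
        hits = [self.segments[i] for i in sorted(set.intersection(*postings))]
        # Spans still in the log are scanned linearly, so they are visible immediately.
        hits += [span for span in self.wal if terms <= tokenize(span.text)]
        return hits


store = TraceStore(flush_threshold=2)
store.append(Span("t1", "s1", "Agent called the refund tool"))
fresh_hits = store.search("refund tool")  # served from the log, before any indexing
```

The linear scan of the log is acceptable only because the log stays small. A production system would also need durability, segment merging, and a way to handle multi-megabyte spans, none of which is attempted here.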
Human-in-the-Loop Evaluation
Traditional observability is the domain of systems engineers. Agent observability also involves subject matter experts, such as lawyers, clinicians, and wealth advisors, who review traces to grade agent performance. This human feedback is critical for two reasons:
- Failure Mode Identification: Humans identify nuances in agent behavior that automated systems miss.
- Training Signal Generation: Human justifications for their grades serve as the foundation for building automated, scalable scoring functions.
By treating observability and evaluation as parts of a single workflow, teams can move from manual review to automated, batch-processed evaluation, closing the loop between production failures and iterative improvement.
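One way to picture the move from manual review to automated evaluation is to check a candidate scoring function against human grades before running it in batch. The sketch below is illustrative only. The `GradedTrace` record, the `groundedness_score` heuristic, and the threshold value are assumptions made for this example. Token overlap is a deliberately crude stand-in for whatever scorer a team would build from expert justifications.

```python
import re
from dataclasses import dataclass


@dataclass
class GradedTrace:
    """Hypothetical record of one expert review of one trace."""
    trace_id: str
    context: str
    response: str
    human_pass: bool
    justification: str


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness_score(context: str, response: str) -> float:
    """Fraction of response tokens that also appear in the context (crude heuristic)."""
    response_tokens = _tokens(response)
    if not response_tokens:
        return 0.0
    return len(response_tokens & _tokens(context)) / len(response_tokens)


def agreement(traces: list[GradedTrace], threshold: float) -> float:
    """Share of traces where the automated verdict matches the human grade."""
    matches = sum(
        (groundedness_score(t.context, t.response) >= threshold) == t.human_pass
        for t in traces
    )
    return matches / len(traces)


def batch_evaluate(traces: list[GradedTrace], threshold: float) -> dict[str, bool]:
    """Apply the scorer to a whole batch of traces and return pass/fail per trace."""
    return {
        t.trace_id: groundedness_score(t.context, t.response) >= threshold
        for t in traces
    }


graded = [
    GradedTrace("t1", "The refund window is 30 days", "The refund window is 30 days",
                True, "Answer restates the policy in the context."),
    GradedTrace("t2", "The refund window is 30 days", "Refunds are available for one year",
                False, "Claim is not supported by the context."),
]
rate = agreement(graded, threshold=0.5)
verdicts = batch_evaluate(graded, threshold=0.5)
```

When agreement with the experts is low, the justifications attached to the mismatched traces indicate what the scorer is missing. That is the feedback loop described above.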