The Automated Observability Pipeline

PostHog is moving observability from passive dashboard monitoring toward active, self-healing systems. The goal is to shrink the time between a product signal (an error, a session replay, or user feedback) and a code fix. The pipeline has four stages:

  1. Ingestion & Normalization: Signals are ingested at scale (trillions of events/month). A safety-focused LLM classifier first filters out malicious inputs. The remaining signals, which arrive as diverse data types (logs, stack traces, experiment results), are then normalized into a unified structure, and each is assigned a weight.
  2. Grouping & Reporting: Raw signals are grouped into "reports." Embedding raw signals tends to cluster them by structure rather than by meaning, so the team instead has an LLM generate a query from each signal and embeds those queries. This ensures that semantically related signals (e.g., a Slack message about checkout and a checkout error) are grouped together.
  3. Research & Actionability: A research agent, built on the Claude Agent SDK, runs in a sandbox and uses an MCP (Model Context Protocol) server to pull in logs, codebase context, and external data from tools like Linear or Notion. The agent then assesses actionability: if the problem is too ambiguous, it requests human input; if it is actionable, the agent proceeds to execution.
  4. Execution: The agent clones the repository into a sandbox, generates a fix, and submits a PR. If CI fails or feedback is provided, the agent rehydrates the sandbox snapshot to iterate until the PR is green.
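
The control flow of stages 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not PostHog's implementation: the names `Assessment`, `Outcome`, `run_report`, and the three injected callables are hypothetical, and the `max_attempts` cap is an added assumption (the source only says the agent iterates until the PR is green).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Assessment:
    actionable: bool
    summary: str


@dataclass
class Outcome:
    status: str        # "needs_human", "green", or "gave_up"
    attempts: int = 0


def run_report(
    assess: Callable[[str], Assessment],
    propose_fix: Callable[[str, Optional[str]], str],
    run_ci: Callable[[str], Optional[str]],
    report: str,
    max_attempts: int = 3,  # assumption: a safety cap, not from the source
) -> Outcome:
    """Drive one report through research (stage 3) and execution (stage 4)."""
    assessment = assess(report)
    if not assessment.actionable:
        # Too ambiguous: ask a person instead of opening a PR.
        return Outcome("needs_human")

    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # Each retry sees the previous CI failure or review feedback.
        patch = propose_fix(assessment.summary, feedback)
        feedback = run_ci(patch)  # None means CI passed
        if feedback is None:
            return Outcome("green", attempt)
    return Outcome("gave_up", max_attempts)


# Stub callables standing in for the agent and the CI system.
def fake_fix(summary: str, feedback: Optional[str]) -> str:
    return "patch v2" if feedback else "patch v1"


def fake_ci(patch: str) -> Optional[str]:
    return None if "v2" in patch else "test_checkout failed"


outcome = run_report(lambda r: Assessment(True, r), fake_fix, fake_ci, "checkout 500s")
print(outcome)  # Outcome(status='green', attempts=2)
```

Passing the agent and CI in as callables keeps the loop testable without a sandbox; in a real system each callable would wrap the agent run and the repository's CI status.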

Lessons in Agentic Engineering

Building this system required moving beyond initial assumptions about how LLMs and agents interact with production data:

  • Embed Queries, Not Signals: Off-the-shelf embedding models tend to prioritize structural similarity, for example grouping all errors together regardless of which feature they concern. Using an LLM to first distill the meaning of each signal into a query, and then embedding that query, groups signals by what they are about rather than by what they look like.
  • Prioritize Specificity: Signals differ widely in how specific they are. Error tracking is highly specific and often immediately actionable for an agent, whereas session replays and Slack messages are often too generic to yield a precise code fix. The system ignores low-specificity signals to avoid creating noisy, useless PRs.
  • Invest in Evals: Vibe-checking on local data is insufficient for production-grade pipelines. Testing must be performed on representative, diverse customer data to understand how the agent behaves at scale.
  • Don't Optimize Costs Too Early: While agents are expensive, running them repeatedly on the same problem reveals patterns. Once these patterns are identified, expensive agentic steps can be collapsed into cheaper, one-shot LLM calls or specialized models.
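
To make the "embed queries, not signals" lesson concrete, here is a small sketch under heavy assumptions. The bag-of-words `embed` function is a toy stand-in for a real embedding model, `canned_queries` stands in for LLM-generated queries, and the greedy `group` function and its `threshold` are hypothetical choices rather than the team's actual clustering method.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group(signals, to_text, threshold=0.5):
    """Greedy grouping: join the first report whose seed vector is similar enough."""
    reports = []  # list of (seed_vector, member_signals)
    for signal in signals:
        vec = embed(to_text(signal))
        for seed, members in reports:
            if cosine(seed, vec) >= threshold:
                members.append(signal)
                break
        else:
            reports.append((vec, [signal]))
    return [members for _, members in reports]


signals = [
    "TypeError: cannot read property 'total' of undefined at CheckoutForm.submit",
    "hey team, a customer says paying for their cart just spins forever",
    "TypeError: cannot read property 'name' of undefined at ProfilePage.render",
]

# Hand-written stand-ins for what an LLM might extract from each signal.
canned_queries = {
    signals[0]: "checkout payment submission fails",
    signals[1]: "checkout payment submission hangs",
    signals[2]: "profile page fails to render",
}

by_raw = group(signals, lambda s: s)                     # embed the raw signal
by_query = group(signals, canned_queries.__getitem__)    # embed the query instead

print(by_raw)    # the two TypeErrors land together; the Slack message is alone
print(by_query)  # the checkout error and the Slack message land together
```

Grouping on raw text pairs the two stack traces because they share boilerplate tokens, even though they concern different features. Grouping on the extracted queries pairs the checkout error with the Slack message about checkout, which is the behavior the lesson describes.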