Building an Agentic Incident Resolution System

The Gap Between Detection and Resolution

Traditional monitoring tools such as Datadog excel at detecting anomalies, but they lack the operational context required to resolve them. When an alert fires, engineers typically spend the first several minutes gathering basic information: Who owns this service? Was there a recent deployment? What is the runbook? This context-gathering phase is where incident response time is lost.

The Architecture of Agentic Observability

To move from blind alerting to intelligent response, you must bridge the gap between technical telemetry and organizational context. The author proposes a three-layer stack:

Telemetry Layer (Datadog): Detects abnormal behavior and triggers alerts based on thresholds.
Context Layer (Port): Acts as the system of record for service ownership, dependencies, criticality, and runbooks.
Execution Layer (GitHub Actions): Acts as the agent that performs investigations and remediation.

When a monitor breaches a threshold, the system queries the context layer to understand the service's role. It then executes a workflow to investigate the issue. If the failure matches a known, safe recovery path, the system performs an auto-resolution. If not, it escalates the incident to the correct team and attaches a "context package" that includes the service owner, severity, and relevant runbooks, so the receiving team does not need to triage manually.
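The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CATALOG` is a hypothetical in-memory stand-in for the context layer, `SAFE_RECOVERIES` is a hypothetical table of known recovery paths, and the service name, failure labels, and runbook path are invented. It does not call the Datadog, Port, or GitHub Actions APIs, and it assumes severity can be derived from the service's criticality.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceContext:
    """Organizational context for one service (hypothetical shape)."""
    owner: str
    criticality: str
    runbooks: list[str] = field(default_factory=list)


# Hypothetical stand-in for the context layer's system of record.
CATALOG = {
    "checkout-api": ServiceContext(
        owner="payments-team",
        criticality="high",
        runbooks=["runbooks/checkout-api.md"],
    ),
}

# Hypothetical failure signatures that have a known, safe recovery path.
SAFE_RECOVERIES = {"stuck_worker": "restart_component"}


def handle_alert(service: str, failure: str) -> dict:
    """Auto-resolve known failures; otherwise escalate with a context package."""
    ctx = CATALOG[service]
    action = SAFE_RECOVERIES.get(failure)
    if action is not None:
        return {"outcome": "auto_resolved", "service": service, "action": action}
    # Assumption: severity is taken directly from the service's criticality.
    return {
        "outcome": "escalated",
        "service": service,
        "context_package": {
            "owner": ctx.owner,
            "severity": ctx.criticality,
            "runbooks": ctx.runbooks,
        },
    }
```

Calling `handle_alert("checkout-api", "stuck_worker")` takes the auto-resolution branch, while an unrecognized failure label returns an escalation that already carries the owner, severity, and runbooks.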

Principles for Reliable Automation

Agentic engineering should not be treated as a license to automate everything. To avoid reckless behavior, the author recommends a strict hierarchy for incident handling:

Automate what is repetitive, low-ambiguity, and reversible (e.g., restarting a component or rolling back a deployment).
Enrich what is ambiguous by automatically gathering logs, traces, and ownership data.
Escalate what is risky, novel, or requires complex engineering judgment.
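
The hierarchy above could be encoded as a small decision function. The sketch below is one possible reading, not the author's implementation: the `Handling` enum, the `choose_handling` function, and the boolean flags are all hypothetical names, and checking the escalation conditions first is an assumption chosen so that a risky incident is never automated.

```python
from enum import Enum


class Handling(Enum):
    AUTOMATE = "automate"
    ENRICH = "enrich"
    ESCALATE = "escalate"


def choose_handling(
    repetitive: bool,
    low_ambiguity: bool,
    reversible: bool,
    risky: bool = False,
    novel: bool = False,
    needs_judgment: bool = False,
) -> Handling:
    """Pick a handling mode for an incident (hypothetical policy)."""
    # Assumption: escalation wins over everything else.
    if risky or novel or needs_judgment:
        return Handling.ESCALATE
    # Automate only when all three safety conditions hold.
    if repetitive and low_ambiguity and reversible:
        return Handling.AUTOMATE
    # Anything else is ambiguous: gather logs, traces, and ownership data.
    return Handling.ENRICH
```

Under this policy a routine, reversible restart is automated, an incident that fails any one of the three safety conditions is enriched, and anything flagged as risky or novel goes to a human.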

By following this approach, the system acts as "operational memory" rather than a black-box script. Even when an incident cannot be auto-resolved, the system provides a structured record of what happened, what was attempted, and who needs to be involved, ensuring that human engineers can focus their attention on high-value problem solving rather than administrative toil.
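
One way to picture the "operational memory" idea is a structured incident record that is filled in whether or not auto-resolution succeeds. The following is a minimal sketch under assumptions: the `IncidentRecord` class, its field names, and the example values are hypothetical and are not taken from the source or from any specific tool.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class IncidentRecord:
    """Structured record of one incident (hypothetical schema)."""
    service: str
    what_happened: str
    attempted: list[str] = field(default_factory=list)
    involve: list[str] = field(default_factory=list)
    resolved: bool = False

    def log_attempt(self, action: str, succeeded: bool) -> None:
        """Record a remediation attempt and its outcome."""
        status = "ok" if succeeded else "failed"
        self.attempted.append(f"{action}: {status}")
        self.resolved = self.resolved or succeeded

    def to_json(self) -> str:
        """Serialize the record so it can be attached to an escalation."""
        return json.dumps(asdict(self), indent=2)


record = IncidentRecord(
    service="checkout-api",
    what_happened="error-rate monitor breached its threshold",
    involve=["payments-team"],
)
record.log_attempt("restart_component", succeeded=False)
```

Because the failed restart is kept in `attempted`, the engineer who picks up the escalation can see what was already tried instead of repeating it.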
