Google's Auto-Diagnose: LLM Diagnoses Test Failures at 90% Accuracy
Prompt-engineer Gemini 2.5 Flash over timestamp-sorted logs to auto-diagnose the root causes of integration test failures, posting diagnoses to code reviews: 90.14% accurate on 71 real failures, with just 5.8% 'Not helpful' ratings in production across 52k+ failing tests.
Slash Integration Test Debug Time with LLM Log Analysis
Integration tests at Google, 78% of which are functional per a 239-developer survey, often fail with generic symptoms such as timeouts while the root cause hides in noisy SUT (system under test) component logs. Developers report that 38.4% of integration test failures take over an hour to diagnose (vs. 2.7% for unit tests) and 8.9% exceed a day, the top complaint in a 6,059-developer EngSat survey. Auto-Diagnose triggers on failure via pub/sub, aggregates logs at INFO level and above across data centers, processes, and threads, joins and sorts them by timestamp into one stream, adds component metadata, and feeds the result to Gemini 2.5 Flash (temperature=0.1, top_p=0.8). This yields a p50 latency of 56s and a p90 of 346s, with roughly 110k input and 6k output tokens per run, letting developers act before context-switching away.
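The aggregation step can be sketched as a k-way timestamp merge. This is a minimal illustration, not Google's implementation: the `LogLine` shape, level names, and output format are assumptions; it merges per-component streams (each already time-sorted), drops lines below INFO, and tags each surviving line with its source component.

```python
import heapq
from dataclasses import dataclass, field

LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "FATAL": 50}

@dataclass(order=True)
class LogLine:
    ts: float                             # only the timestamp participates in ordering
    component: str = field(compare=False)
    level: str = field(compare=False)
    msg: str = field(compare=False)

def aggregate(streams, min_level="INFO"):
    """Merge time-sorted per-component log streams into one timestamp-ordered
    stream, filtering below min_level and annotating with component metadata."""
    threshold = LEVELS[min_level]
    for line in heapq.merge(*streams):    # k-way merge by timestamp
        if LEVELS[line.level] >= threshold:
            yield f"[{line.ts:.3f}][{line.component}][{line.level}] {line.msg}"
```

Using `heapq.merge` keeps memory flat even when each component's log is large, since streams are consumed lazily rather than concatenated and re-sorted.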
Step-by-Step Prompting Ensures Reliable Root Causes
No fine-tuning is needed; pure prompt engineering guides the LLM: scan sections, read context, locate the failure, summarize errors, conclude only with evidence, and apply hard negatives such as 'draw no conclusion if component logs are missing.' The output is post-processed into markdown with ==Conclusion== (root cause), ==Investigation Steps==, and ==Most Relevant Log Lines== (clickable links), then auto-posted to Critique code reviews. Manual evaluation on 71 failures from 39 teams reached 90.14% root cause accuracy; the misses exposed infrastructure bugs such as unsaved crash logs, fixed via a feedback loop.
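A chain-of-thought prompt along these lines can be assembled mechanically. The exact wording of the steps and guardrails below is a paraphrase of the article's description, not the real Auto-Diagnose prompt:

```python
# Hypothetical step list and hard negatives, paraphrased from the
# described prompting strategy (not the actual production prompt).
STEPS = (
    "Scan each log section and note which component produced it.",
    "Read the surrounding context of any suspicious lines.",
    "Locate the first point of failure in timestamp order.",
    "Summarize every error message you find.",
    "State a root-cause conclusion only if the logs contain evidence for it.",
)

HARD_NEGATIVES = (
    "Draw no conclusion if a component's logs are missing.",
    "Do not speculate beyond what the log lines show.",
)

def build_prompt(merged_logs: str) -> str:
    """Assemble a step-by-step diagnosis prompt: numbered reasoning steps,
    hard-negative constraints, then the aggregated log stream."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(STEPS, 1))
    negatives = "\n".join(f"- {n}" for n in HARD_NEGATIVES)
    return (
        "Diagnose the root cause of this integration test failure.\n"
        f"Follow these steps:\n{steps}\n"
        f"Constraints:\n{negatives}\n\n"
        f"=== Aggregated logs (timestamp-sorted) ===\n{merged_logs}"
    )
```

Keeping steps and negatives as data makes the feedback loop cheap: a misdiagnosis pattern becomes one new entry rather than a prompt rewrite.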
Production Feedback Ranks It Top 3.8% of Tools
Deployed on 52,635 failing tests across 224,782 executions and 91,130 changes by 22,962 developers. Of 517 pieces of feedback from 437 developers, 84.3% were reviewer 'Please fix' requests; developers rated 63% of diagnoses helpful and only 5.8% 'Not helpful' (under the 10% threshold for staying live), ranking it #14 of 370 Critique tools (top 3.78%). To replicate, build a similar pipeline: aggregate and sort logs, prompt a general-purpose LLM with chain-of-thought instructions, and integrate with your review tooling for immediate value.
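The replication advice above reduces to a three-stage orchestration. This sketch keeps every external dependency as an injected callable (all three are hypothetical stand-ins, since the source names no client libraries): `fetch_logs(test_id)` returns the merged log text, `call_llm(prompt)` returns the diagnosis, and `post_comment(test_id, text)` posts it to the review tool.

```python
def diagnose_failure(fetch_logs, call_llm, post_comment, test_id):
    """Pipeline skeleton: aggregate logs -> prompt the LLM -> post the result.
    All three callables are placeholders for real infrastructure clients."""
    logs = fetch_logs(test_id)
    prompt = (
        "Diagnose the root cause of this integration test failure. "
        "Work step by step and conclude only with evidence from the logs.\n\n"
        + logs
    )
    diagnosis = call_llm(prompt)
    post_comment(test_id, diagnosis)   # surface the diagnosis in code review
    return diagnosis
```

Wiring this behind a pub/sub trigger on test failure, as the article describes, means the diagnosis is often waiting in the review before the developer has context-switched away.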