Google's Auto-Diagnose: LLM Diagnoses Test Failures at 90% Accuracy
Prompt-engineer Gemini 2.5 Flash over timestamp-sorted logs to auto-diagnose the root causes of integration test failures, posting diagnoses to code reviews: 90.14% accurate on 71 real failures, with just 5.8% 'Not helpful' ratings in production across 52k+ failing tests.
Slash Integration Test Debug Time with LLM Log Analysis
Integration tests at Google, 78% of which are functional per a 239-developer survey, often fail with generic symptoms such as timeouts while the root cause hides in noisy SUT (system under test) component logs. Developers report that 38.4% of integration test failures take over an hour to diagnose (vs. 2.7% for unit tests) and 8.9% exceed a day, the top complaint in a 6,059-developer EngSat survey. Auto-Diagnose triggers on failure via pub/sub, aggregates logs at INFO level and above across data centers, processes, and threads, joins and sorts them by timestamp into one stream, adds component metadata, and feeds the result to Gemini 2.5 Flash (temperature=0.1, top_p=0.8). This yields a p50 latency of 56s and a p90 of 346s, with roughly 110k input and 6k output tokens per run, letting developers act before context-switching away.
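The aggregation step can be sketched as a k-way timestamp merge. This is a minimal illustration, not Google's implementation: the `LogLine` shape, level names, and output format are assumptions; it merges per-component streams (each already time-sorted), drops lines below INFO, and tags each surviving line with its source component.

```python
import heapq
from dataclasses import dataclass, field

LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "FATAL": 50}

@dataclass(order=True)
class LogLine:
    ts: float                             # only the timestamp participates in ordering
    component: str = field(compare=False)
    level: str = field(compare=False)
    msg: str = field(compare=False)

def aggregate(streams, min_level="INFO"):
    """Merge time-sorted per-component log streams into one timestamp-ordered
    stream, filtering below min_level and annotating with component metadata."""
    threshold = LEVELS[min_level]
    for line in heapq.merge(*streams):    # k-way merge by timestamp
        if LEVELS[line.level] >= threshold:
            yield f"[{line.ts:.3f}][{line.component}][{line.level}] {line.msg}"
```

Using `heapq.merge` keeps memory flat even when each component's log is large, since streams are consumed lazily rather than concatenated and re-sorted.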
Step-by-Step Prompting Ensures Reliable Root Causes
No fine-tuning is needed; pure prompt engineering guides the LLM: scan sections, read context, locate the failure, summarize errors, conclude only with evidence, and apply hard negatives such as 'draw no conclusion if component logs are missing.' The output is post-processed into markdown with ==Conclusion== (root cause), ==Investigation Steps==, and ==Most Relevant Log Lines== (clickable links), then auto-posted to Critique code reviews. Manual evaluation on 71 failures from 39 teams reached 90.14% root cause accuracy; the misses exposed infrastructure bugs such as unsaved crash logs, fixed via a feedback loop.
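A chain-of-thought prompt along these lines can be assembled mechanically. The exact wording of the steps and guardrails below is a paraphrase of the article's description, not the real Auto-Diagnose prompt:

```python
# Hypothetical step list and hard negatives, paraphrased from the
# described prompting strategy (not the actual production prompt).
STEPS = (
    "Scan each log section and note which component produced it.",
    "Read the surrounding context of any suspicious lines.",
    "Locate the first point of failure in timestamp order.",
    "Summarize every error message you find.",
    "State a root-cause conclusion only if the logs contain evidence for it.",
)

HARD_NEGATIVES = (
    "Draw no conclusion if a component's logs are missing.",
    "Do not speculate beyond what the log lines show.",
)

def build_prompt(merged_logs: str) -> str:
    """Assemble a step-by-step diagnosis prompt: numbered reasoning steps,
    hard-negative constraints, then the aggregated log stream."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(STEPS, 1))
    negatives = "\n".join(f"- {n}" for n in HARD_NEGATIVES)
    return (
        "Diagnose the root cause of this integration test failure.\n"
        f"Follow these steps:\n{steps}\n"
        f"Constraints:\n{negatives}\n\n"
        f"=== Aggregated logs (timestamp-sorted) ===\n{merged_logs}"
    )
```

Keeping steps and negatives as data makes the feedback loop cheap: a misdiagnosis pattern becomes one new entry rather than a prompt rewrite.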
Production Feedback Ranks It Top 3.8% of Tools
Deployed on 52,635 failing tests across 224,782 executions and 91,130 changes by 22,962 developers. Of 517 pieces of feedback from 437 developers, 84.3% were reviewer 'Please fix' requests; developers rated 63% of diagnoses helpful and only 5.8% 'Not helpful' (under the 10% threshold for staying live), ranking it #14 of 370 Critique tools (top 3.78%). To replicate, build a similar pipeline: aggregate and sort logs, prompt a general-purpose LLM with chain-of-thought instructions, and integrate with your review tooling for immediate value.
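The replication advice above reduces to a three-stage orchestration. This sketch keeps every external dependency as an injected callable (all three are hypothetical stand-ins, since the source names no client libraries): `fetch_logs(test_id)` returns the merged log text, `call_llm(prompt)` returns the diagnosis, and `post_comment(test_id, text)` posts it to the review tool.

```python
def diagnose_failure(fetch_logs, call_llm, post_comment, test_id):
    """Pipeline skeleton: aggregate logs -> prompt the LLM -> post the result.
    All three callables are placeholders for real infrastructure clients."""
    logs = fetch_logs(test_id)
    prompt = (
        "Diagnose the root cause of this integration test failure. "
        "Work step by step and conclude only with evidence from the logs.\n\n"
        + logs
    )
    diagnosis = call_llm(prompt)
    post_comment(test_id, diagnosis)   # surface the diagnosis in code review
    return diagnosis
```

Wiring this behind a pub/sub trigger on test failure, as the article describes, means the diagnosis is often waiting in the review before the developer has context-switched away.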