Integration Test Failures Overwhelm Developers with Log Chaos

Diagnosing integration test failures at Google is notoriously painful due to massive, unstructured logs from test drivers and distributed SUT components. A company-wide EngSat survey of 6,059 developers ranked it among the top five complaints. A follow-up Survey-2 with 116 developers confirmed that integration test failures occur less frequently than unit test failures (monthly vs. daily/weekly) but take far longer to diagnose, often over an hour and sometimes a full day (Figure 1b). In production, the median failing test produces 16 log files and 2,801 log lines (means: 26 files and 11,058 lines). Developers start with high-level test driver logs showing generic errors like timeouts, then manually hunt across heterogeneous SUT logs (dynamically named by component, split by levels like .info/.error). Low signal-to-noise buries root causes amid irrelevant warnings, creating high cognitive load. Common workarounds, such as pinging experienced colleagues or infra teams, don't scale.

Why integration over unit tests? Unit tests run early and often in isolation; integration tests run later, exercising multi-component interactions in hermetic environments (no external dependencies). A survey of 239 teams showed functional hermetic tests as the most common kind (Figure 2). Failures surface as Critique findings during code review, blocking submission until fixed (Figure 3). Traditional automated diagnosis tools (statistical debugging, spectrum analysis) target the unit level; the distributed logs and complex setups of integration tests remain unsolved.

"Diagnosing integration test failures was identified as one of the top five most frequent complaints in a company-wide survey [5] of 6,059 developers." (From Section 2.1: Quantifies the scale of developer frustration, justifying LLM focus.)

Auto-Diagnose Leverages LLM Strengths for Log Synthesis

Auto-Diagnose automates diagnosis by feeding all INFO-and-above logs (test driver + SUT components) into Gemini 2.5 Flash. On a failure notification via pub/sub, logs from multiple data centers, processes, and threads are timestamp-sorted into one stream (e.g., Listing 1: server-a.info/error lines). A meticulously engineered prompt (Figure 7) guides step-by-step reasoning: scan log sections, correlate events, identify the root cause, extract the most relevant lines, conclude precisely. Key decisions:

  • LLM Choice: Gemini 2.5 Flash for speed and cost (mean 110k input / 6k output tokens per run). Params: temperature=0.1 (near-deterministic), top_p=0.8 (balanced creativity). No fine-tuning on Google's logs.
  • Prompt Iteration: Refined via real failures to enforce chain-of-thought, negative constraints (no speculation), strict markdown output with linked log lines.
  • Post-Processing: Formats as Critique finding (Figure 6) with clickable log links, conclusion, relevant lines.
  • Integration: Posts to Critique in p50=56s, p90=346s, far faster than manual debugging.
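The timestamp-join step above can be sketched minimally. This is an illustrative assumption, not the paper's implementation: `merge_logs` is a hypothetical helper, and the leading ISO-style timestamp prefix stands in for Google's unpublished internal log format.

```python
import heapq
from pathlib import Path

def merge_logs(paths):
    """Timestamp-join several log files into one stream.

    Assumes every line starts with a lexically sortable timestamp
    (e.g. ISO-8601); the actual internal log layout is not public,
    so this format is illustrative only.
    """
    def tagged_lines(path):
        name = Path(path).name
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    ts = line.split(" ", 1)[0]  # sort key: leading timestamp
                    yield ts, f"[{name}] {line}"

    # heapq.merge requires each input to be sorted already, which holds
    # for append-only log files; it streams lazily instead of loading
    # every file into memory at once.
    merged = heapq.merge(*(tagged_lines(p) for p in paths), key=lambda t: t[0])
    return [entry for _, entry in merged]
```

Tagging each line with its source file preserves the traceability that clickable log links in the Critique finding depend on.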

Tradeoffs: Relies on complete logs; diagnoses miss when infra bugs drop them (addressed post-evaluation). Handles log heterogeneity without custom parsing. Versus alternatives: LLMs excel at summarization where rule-based tools fail on log variety.
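The prompt structure described above (chain-of-thought steps, negative constraints, strict markdown output) can be paraphrased as a template. This is a hypothetical sketch of that structure, not the actual Figure 7 prompt, and `build_prompt` is an assumed helper:

```python
# Hypothetical paraphrase of the prompt structure the paper describes;
# the real Figure 7 prompt is not reproduced here.
DIAGNOSIS_PROMPT = """\
You are diagnosing a failed integration test from its merged logs.

Reason step by step:
1. Scan each log section; note errors, warnings, and timing gaps.
2. Correlate events across components by timestamp.
3. Identify the most likely root cause.
4. Extract the log lines most relevant to that cause.
5. State a precise conclusion.

Constraints:
- Do not speculate beyond what the logs show.
- If the logs look incomplete, say so instead of guessing.

Answer in strict markdown:
## Conclusion
<one-paragraph root cause>
## Relevant log lines
<bulleted list of quoted log lines>

Logs:
{logs}
"""

# Sampling parameters reported in the paper: low temperature for
# near-deterministic output, moderate top_p for flexibility in wording.
GENERATION_PARAMS = {"temperature": 0.1, "top_p": 0.8}

def build_prompt(merged_log_lines):
    """Fill the template with the timestamp-joined log stream."""
    return DIAGNOSIS_PROMPT.format(logs="\n".join(merged_log_lines))
```

The negative constraints ("do not speculate", "say so instead of guessing") are what the text calls anti-speculation guardrails; the fixed markdown skeleton makes the output machine-parseable for posting as a Critique finding.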

"LLMs are highly successful in diagnosing integration test failures due to their capacity to process and summarize complex textual data." (Abstract conclusion: Core insight on why LLMs fit this unstructured domain over prior methods.)

Rigorous Evaluation Proves High Accuracy and Adoption

Manual case study: Ran on 71 failures from 39 teams (Table 1). Three expert infra developers (5+ years of experience) assessed whether the conclusion and relevant log lines identified the root cause, aligning their judgments in a meeting. Result: 64/71 accurate (90.14%). All 7 misses traced to infra bugs (4 cases where test driver logs were not saved on crash, 3 where SUT logs were lost); these were reported and fixed.

Production launch (May 2025): Analyzed 224,782 executions of 52,635 distinct tests across 91,130 code changes by 22,962 authors (Table 2). Feedback buttons: "Not helpful" in only 5.8% of responses (94.2% neutral or positive). Ranked #14 of 370 Critique tools (top 3.78%) by helpfulness. Interviewed developers praised the workflow integration.
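The headline percentages follow directly from the raw counts reported above; a quick arithmetic check:

```python
# Sanity-check the reported evaluation figures against their raw counts.
accurate, total = 64, 71
assert round(100 * accurate / total, 2) == 90.14  # case-study accuracy

rank, tools = 14, 370
assert round(100 * rank / tools, 2) == 3.78       # helpfulness percentile

not_helpful_pct = 5.8
assert round(100 - not_helpful_pct, 1) == 94.2    # neutral/positive share
```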

Decision chain: surveys → focus on functional hermetic tests → LLM prompt engineering over rule-based parsing → embedding in Critique. Pivot: the evaluation surfaced and fixed log-saving bugs. Non-obvious results: 90% accuracy without domain fine-tuning; sub-minute turnaround beats human ramp-up.

"Developers consistently report spending substantially more time diagnosing integration test failures, often more than an hour and sometimes exceeding a day, compared to unit test failures." (Section 1: Highlights time savings potential, as Auto-Diagnose posts in <1min.)

Lessons on LLM Reliability and Infra Dependencies

Failures revealed infra fragility: crashes dropped logs in 7/71 cases, but this surfaced bugs proactively. The production launch validated robustness on real log volume and variety. User perception tracks accuracy: high marks despite no hype. Tradeoff: LLM creativity (top_p=0.8) risks hallucination, mitigated by low temperature and strict prompt constraints.

To replicate: prioritize hermetic tests; timestamp-join logs; iterate prompts on real failures; integrate into review flows. Surprising: LLMs handle distributed log correlation better than expected, contradicting unit-test-only benchmarks.

"The sheer volume of logs... presents a significant challenge. Developers must manually sift through a multitude of log files, each with its own formatting." (Section 2.4: Pinpoints why LLMs win—zero-shot text processing scales where humans don't.)

Key Takeaways

  • Target integration tests: Focus on functional hermetic ones for reproducibility; they're pain points despite lower frequency.
  • Use off-the-shelf LLMs like Gemini Flash: No fine-tuning needed for log summarization; tune params for determinism (temp=0.1, top_p=0.8).
  • Engineer prompts rigorously: Step-by-step reasoning + negative constraints + strict output format; iterate on real failures.
  • Timestamp-join all logs: Merge INFO-and-above logs from every source into one stream for context.
  • Integrate into workflows: Post findings to code review (e.g., Critique) in <1min to cut context-switching.
  • Evaluate with experts: Use 3+ senior engineers for ground truth; expect 90%+ accuracy if logs are complete.
  • Monitor for infra gaps: Misses often reveal logging bugs—fix them.
  • Gather production feedback: Buttons + rankings guide iteration; aim for top 5% tool adoption.
  • Tradeoff honesty: LLMs shine on text but fail without logs; pair with basics like reliable log saving.