The Gap Between Benchmarks and Clinical Reality
Most evaluations of clinical AI rely on hypothetical or exam-style questions, which fail to capture the complexity of actual medical practice. This study introduces the Real-world Point-Of-Care Queries (Real-POCQi) benchmark, consisting of 620 authentic questions submitted by physicians across 30 specialties. The researchers compare these against 187 questions from the existing HealthBench dataset to provide a more accurate assessment of how AI performs when doctors need immediate, reliable decision support.
Specialized Engineering vs. General-Purpose Models
The study conducted a head-to-head, blinded evaluation involving 149 practicing physicians who compared answers from three frontier general-purpose models (Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5) against a specialized clinical tool (OpenEvidence). The specialized tool consistently outperformed general-purpose models across five critical dimensions: accuracy, clinical utility, source quality, verifiability, and completeness. The win margins for the specialized tool ranged from 25 to 39 percentage points (p<0.001).
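To make the "win margin" figures concrete, here is a minimal Python sketch of how such a number could be computed from blinded pairwise preferences. It rests on assumptions not stated in this summary: that the margin is the tool's win rate minus the comparator's win rate over all comparisons (ties included in the denominator), and that significance is checked with an exact two-sided sign test. The study's actual definition and statistical method may differ, and the function names and counts are hypothetical.

```python
from math import comb


def win_margin(wins_a: int, wins_b: int, ties: int = 0) -> float:
    """Win margin of A over B in percentage points.

    Assumed definition: (A wins - B wins) / all comparisons, ties included.
    """
    total = wins_a + wins_b + ties
    if total == 0:
        raise ValueError("at least one comparison is required")
    return 100.0 * (wins_a - wins_b) / total


def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test p-value under H0: A and B equally preferred.

    Ties are dropped, which is the usual convention for the sign test.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of a result at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical counts for one dimension, not taken from the study.
    print(win_margin(wins_a=60, wins_b=30, ties=10))
    print(sign_test_p(wins_a=60, wins_b=30))
```

Because each physician typically rates several answers, a real analysis would likely also need to account for clustering by rater and by question; the sign test above ignores that.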
These results suggest that while general-purpose models are capable, targeted engineering (such as specialized retrieval, citation management, and clinical customization) provides a measurable, statistically significant performance advantage in high-stakes medical environments. Furthermore, the study found that although LLM-based judges can be useful, their ratings differ systematically from those of human experts, reinforcing the need for domain-specialized human graders in clinical AI validation.
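One common way to quantify how closely an LLM judge tracks human graders is chance-corrected agreement. The sketch below computes Cohen's kappa over paired preference labels. This is an illustrative choice, not necessarily the metric the study used, and the labels and function name are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa between two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n**2
    if expected == 1.0:
        # Both raters used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical preferences: which answer each rater judged better.
    physician = ["tool", "tool", "model", "model"]
    llm_judge = ["tool", "model", "model", "model"]
    print(cohens_kappa(physician, llm_judge))
```

Kappa measures overall agreement only. A systematic difference of the kind described above would also show up as a gap between the two raters' label frequencies, for example the judge preferring one system more often than physicians do.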
Implications for AI Evaluation
The authors argue that the field must shift toward evaluation frameworks that mirror real-world query distributions. Relying on static, exam-style benchmarks is insufficient for validating tools intended for clinical decision support. By releasing the Real-POCQi dataset, the researchers aim to provide a standard for future development that prioritizes the specific needs of modern, specialized medicine.