o1 Beats Doctors 67% to 50-55% in ER Triage Study

OpenAI's o1 model delivered exact or near-exact diagnoses in 67% of 76 real ER triage cases using raw EMR data, outperforming two internal medicine physicians who scored 55% and 50%; critics caution that ER-specialist baselines and real-world trials are still needed.

This TechCrunch article reports on a thin but headline-grabbing Harvard study testing LLMs on medical diagnosis, highlighting AI's edge in low-information scenarios while stressing major gaps for production use.

Raw EMR Data Reveals o1's Triage Superiority

Researchers from Harvard Medical School and Beth Israel Deaconess Medical Center fed OpenAI's o1 and GPT-4o models the exact text from the electronic medical records (EMRs) of 76 real ER patients, with no preprocessing. Two independent attending physicians blindly compared the models' diagnoses against those of two internal medicine attending physicians at multiple points in each patient's care.

o1 matched or exceeded physicians overall, with the gap widest at initial triage (least data, highest urgency): o1 hit exact or very close diagnoses in 67% of cases, vs. 55% for one physician and 50% for the other. Lead author Arjun Manrai stated o1 'eclipsed both prior models and our physician baselines' across benchmarks. Use this to prioritize reasoning-focused models like o1 for data-sparse diagnostic pipelines, but only as a triage aid.
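A minimal sketch of the "triage aid" pattern described above: wrap raw EMR text in an advisory prompt before sending it to a reasoning model. The function name, prompt wording, and example note are illustrative assumptions, not details from the study; the (commented-out) API call mirrors OpenAI's public chat completions interface.

```python
def build_triage_prompt(emr_text: str) -> str:
    """Wrap raw EMR text in an advisory triage prompt.

    The prompt asks for a ranked differential diagnosis and explicitly
    frames the output as decision support, not a decision.
    """
    instructions = (
        "You are a clinical decision-support assistant. Given the raw ER "
        "triage note below, list a ranked differential diagnosis and flag "
        "any can't-miss conditions. Your output is advisory only; a "
        "clinician makes all final decisions."
    )
    return f"{instructions}\n\nEMR:\n{emr_text}"


# Usage (requires an API key; commented out so the sketch stays self-contained):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="o1",
#     messages=[{"role": "user", "content": build_triage_prompt(note_text)}],
# )
# print(resp.choices[0].message.content)
```

Keeping the prompt builder separate from the API call makes the advisory framing testable and auditable on its own, which matters in a setting with no accountability framework for AI decisions.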

Text-Only Wins Demand Real-World Trials and Specialist Baselines

The study was limited to text inputs; the authors note that foundation models still struggle with images and other non-text data. They call for 'prospective trials' before any ER deployment, since no accountability framework exists for AI-driven decisions. Co-author Adam Rodman emphasized that humans must still guide life-or-death choices.

ER physician Kristen Panthagani called the headlines overhyped: internal medicine doctors aren't ER specialists, whose goal is to rule out life-threatening conditions first, not to reach a final diagnosis. Comparing LLMs against physicians outside their specialty (like grading a dermatologist on neurosurgery) yields unhelpful results. Builders: benchmark AI against domain experts and integrate it as a 'second opinion' tool that catches oversights without replacing clinicians.
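One way to read the 'second opinion' advice is as a discrepancy check: surface conditions the model raises that the clinician's differential does not mention, rather than overriding the clinician. This sketch uses naive case-insensitive string matching; real condition matching (synonyms, ICD codes) would be far more involved, and none of this comes from the study itself.

```python
def second_opinion_gaps(clinician_ddx: list[str], model_ddx: list[str]) -> list[str]:
    """Return model-suggested conditions absent from the clinician's differential.

    Matching is a deliberate simplification: trimmed, lowercased exact
    string comparison. Anything the model lists that the clinician did
    not is flagged for human review, never auto-acted upon.
    """
    seen = {d.strip().lower() for d in clinician_ddx}
    return [d for d in model_ddx if d.strip().lower() not in seen]


# Example: the model adds a can't-miss condition the clinician didn't list.
gaps = second_opinion_gaps(
    ["migraine", "tension headache"],
    ["Migraine", "Subarachnoid hemorrhage"],
)
# gaps → ["Subarachnoid hemorrhage"]
```

Flagging gaps instead of scoring agreement keeps the clinician in charge while still catching the oversights the article describes.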

Summarized by x-ai/grok-4.1-fast via openrouter

9200 input / 2938 output tokens in 30868ms

© 2026 Edge