Regex and Format Checks Fail Clinical Safety
Regex validation ensures LLM outputs match structures like "Warfarin 10mg daily" by parsing the drug name, a numeric dose (positive, <1000mg), units (mg/mcg), and frequency (daily/BID), but it ignores patient-specific risk. For a 78-year-old, 62kg male with CrCl 38mL/min who is already on amiodarone, 10mg warfarin passes every format check yet risks an INR above 8 and a 15% chance of intracranial hemorrhage within 72 hours; the correct dose is 2-3mg. In a real October 2025 incident at a 240-bed hospital, regex approved "Enoxaparin 40mg BID" for a 48kg elderly patient with CrCl 42, causing a retroperitoneal hematoma, a hemoglobin drop to 7.8, a transfusion, and a $180K settlement. Audits of seven deployments found 65% relying on this pattern, yielding 3-5 near-misses per 1,000 outputs: format issues get caught, but drug interactions (amiodarone+warfarin), contraindications (renal impairment), allergies, therapeutic duplicates, and weight-based dosing needs do not.
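A minimal sketch of the format-only validator described above makes the failure concrete; the regex and dose bounds here are illustrative stand-ins, not any deployment's actual rules:

```python
import re

# Format-only check: drug name, numeric dose, unit, frequency.
# Pattern and bounds are assumptions for illustration.
RX = re.compile(
    r"^(?P<drug>[A-Za-z]+)\s+(?P<dose>\d+(?:\.\d+)?)(?P<unit>mg|mcg)\s+(?P<freq>daily|BID)$"
)

def format_check(order: str) -> bool:
    m = RX.match(order)
    if not m:
        return False
    dose = float(m.group("dose"))
    return 0 < dose < 1000  # numeric sanity only; no patient context

# Passes format validation even though 10mg warfarin is unsafe
# for a 78-year-old with CrCl 38 mL/min on amiodarone.
print(format_check("Warfarin 10mg daily"))  # True
```

The check has no access to age, weight, renal function, or the current medication list, so every patient-specific risk named above is invisible to it.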
Studies quantify the gap: hallucination and omission rates of 1.47% and 3.45% in clinical notes (12,999 clinician-annotated sentences across 18 configurations) translate to 7.35 hallucinations per day across 500 encounters, or roughly 220 per month; if even 10% go undetected, that is 22 false recommendations. Adversarial prompts drive hallucination rates to 50-82% across six LLMs, with models inventing clinical significance for fake biomarkers like "fictitious-enzyme-marker."
LLM Self-Validation Inherits the Same Errors
Asking the generating LLM (or another instance of it) to review its outputs fails because both share the same training gaps. For the warfarin case, Claude-sonnet-4 often returns {"safe": true, "concerns": []}, missing the overdose. In a September 2025 academic-center case, GPT-4 recommended sumatriptan+ketorolac+metoclopramide for migraine, then validated its own plan as safe, overlooking the patient's coronary disease (MI history) and propranolol; sumatriptan was contraindicated (vasoconstriction risk) and the combination risked hypertensive crisis. Mitigation prompts cut hallucinations from 66% to 44%, but GPT-4o still reaches 23%. Because the errors are correlated, the validator confirms plausible-but-wrong logic, optimizing for fluent language over clinical accuracy.
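The anti-pattern can be sketched in a few lines; `llm_call` below is a hypothetical stand-in for any chat-completion client, stubbed to mimic the observed failure mode:

```python
import json

# Hedged sketch of LLM self-validation. The "validator" shares the
# generator's training gaps, so it approves plausible-looking orders.
def llm_call(prompt: str) -> str:
    # Hypothetical stub: a real client call would go here. The canned
    # response reproduces the failure described above.
    return '{"safe": true, "concerns": []}'

def self_validate(order: str, context: str) -> dict:
    prompt = (
        f"Review this medication order for patient safety.\n"
        f"Order: {order}\nContext: {context}\n"
        f'Reply as JSON: {{"safe": bool, "concerns": [...]}}'
    )
    return json.loads(llm_call(prompt))

verdict = self_validate("Warfarin 10mg daily", "78M, 62kg, CrCl 38, on amiodarone")
print(verdict)  # {'safe': True, 'concerns': []} -- the overdose sails through
```

Nothing in this loop consults a source of truth outside the model, which is exactly why correlated errors survive review.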
Multi-Layer External Checks Ensure Safety
Validate independently via seven layers: (1) regex format; (2) RxNorm drug existence; (3) interaction APIs (e.g., amiodarone+warfarin flags high bleeding risk); (4) FHIR/SNOMED contraindications; (5) allergies; (6) patient-specific dosing (age/weight/CrCl); (7) renal adjustments (<60mL/min). Critical issues (interactions/contras/allergies) block EHR entry; warnings queue pharmacist review.
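The gating logic above can be sketched as a chain of layer functions feeding one decision; the two layers shown (interaction and renal) are illustrative stubs, where a real deployment would call RxNorm, an interaction API, and the EHR's FHIR server:

```python
from typing import Callable

# Each layer returns a list of (severity, message) findings.
Layer = Callable[[str, dict], list[tuple[str, str]]]

def interaction_layer(order: str, patient: dict) -> list[tuple[str, str]]:
    # Illustrative rule only: amiodarone potentiates warfarin.
    if "warfarin" in order.lower() and "amiodarone" in patient["meds"]:
        return [("CRITICAL", "amiodarone+warfarin: high bleeding risk")]
    return []

def renal_layer(order: str, patient: dict) -> list[tuple[str, str]]:
    if patient["crcl"] < 60:
        return [("WARNING", f"CrCl {patient['crcl']} mL/min: renal dose adjustment")]
    return []

LAYERS: list[Layer] = [interaction_layer, renal_layer]  # layers 3 and 7 only

def gate(order: str, patient: dict) -> dict:
    issues = [i for layer in LAYERS for i in layer(order, patient)]
    critical = any(sev == "CRITICAL" for sev, _ in issues)
    return {
        "approved": not critical,         # CRITICAL findings block EHR entry
        "requires_review": bool(issues),  # any finding queues pharmacist review
        "issues": issues,
    }

patient = {"meds": ["amiodarone"], "crcl": 38}
print(gate("Warfarin 10mg daily", patient)["approved"])  # False
```

The key design choice is that layers are independent and additive: each queries its own external source, and the gate aggregates severities rather than trusting any single check.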
Implementation takes a PatientContext (age, weight, CrCl, current meds, allergies, conditions, labs) and returns {'approved': bool, 'issues': [ValidationIssue(severity, category, desc, source, rec)], 'requires_review': bool}. For the warfarin example, the layers flag the interaction (CRITICAL), renal dosing (WARNING), and dose inappropriateness (WARNING), blocking approval. The architecture relies on no LLM-internal knowledge: every check queries an external source, catching what format checks and self-review miss and preventing incidents like the bleeding events seen in audited deployments.
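The data shapes named above might look like the following; field names mirror the prose (PatientContext, ValidationIssue), but the exact types are assumptions:

```python
from dataclasses import dataclass, field

# Sketch of the structures described in the text; types are illustrative.
@dataclass
class PatientContext:
    age: int
    weight_kg: float
    crcl: float
    meds: list[str]
    allergies: list[str]
    conditions: list[str]
    labs: dict[str, float] = field(default_factory=dict)

@dataclass
class ValidationIssue:
    severity: str  # "CRITICAL" | "WARNING"
    category: str  # e.g. "interaction", "renal_dosing"
    desc: str
    source: str    # which external source flagged it
    rec: str       # recommended action

def summarize(issues: list[ValidationIssue]) -> dict:
    critical = any(i.severity == "CRITICAL" for i in issues)
    return {"approved": not critical, "issues": issues, "requires_review": bool(issues)}

# The warfarin example: one CRITICAL and two WARNINGs block approval.
issues = [
    ValidationIssue("CRITICAL", "interaction", "amiodarone+warfarin", "interaction API", "hold order"),
    ValidationIssue("WARNING", "renal_dosing", "CrCl 38 mL/min", "labs", "reduce dose"),
    ValidationIssue("WARNING", "dose", "10mg high for age 78", "dosing reference", "start 2-3mg"),
]
print(summarize(issues)["approved"])  # False
```

Keeping severity and source on every issue lets the review queue show a pharmacist which external check fired and why, rather than a bare approve/reject flag.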