T2D-Bench: Evidence-Gated Evaluation for Clinical LLM Accuracy

The Problem: Clinical Fluency vs. Evidence Compliance

Large language models often generate text that sounds clinically authoritative yet departs from established medical guidelines. In Type 2 Diabetes (T2D) management, that gap is a safety risk. The authors report that models such as GPT-4o and GPT-4o-mini, although highly fluent, frequently omit required evidence or give recommendations that conflict with established standards of care.

The T2D-Bench Framework

The researchers developed T2D-Bench, an evaluation framework that uses a multi-layer knowledge graph to verify LLM outputs against explicit, computable evidence requirements. The knowledge graph integrates three distinct layers:

Biomedical Spine: Incorporates data from UMLS, DrugBank, and SIDER to ground medical terminology and drug safety.
Clinical Rules: Encodes computable American Diabetes Association (ADA) Standards of Care.
Mechanistic Bridge: Connects lifestyle factors (e.g., diet, exercise) to specific glycemic laboratory effects.
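
To make the three-layer idea concrete, here is a minimal sketch of a layered graph in which every edge is tagged with the layer it came from. It is not the authors' implementation: the class name LayeredGraph, the layer labels, and the handful of edges are hypothetical and chosen only for illustration. A real graph would be populated from UMLS, DrugBank, SIDER, and the ADA Standards of Care.

```python
from collections import defaultdict, deque


class LayeredGraph:
    """Toy knowledge graph; each edge records which layer it belongs to."""

    def __init__(self):
        # node -> list of (relation, target, layer)
        self._out = defaultdict(list)

    def add_edge(self, source, relation, target, layer):
        self._out[source].append((relation, target, layer))

    def find_path(self, start, goal):
        """Breadth-first search; returns a list of edges or None if unreachable."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, path = queue.popleft()
            if node == goal:
                return path
            for relation, target, layer in self._out[node]:
                if target not in seen:
                    seen.add(target)
                    queue.append((target, path + [(node, relation, target, layer)]))
        return None


# Illustrative edges only (assumed, not taken from the benchmark).
kg = LayeredGraph()
kg.add_edge("metformin", "has_side_effect", "lactic acidosis", "biomedical_spine")
kg.add_edge("eGFR < 30", "contraindicates", "metformin", "clinical_rules")
kg.add_edge("aerobic exercise", "improves", "insulin sensitivity", "mechanistic_bridge")
kg.add_edge("insulin sensitivity", "lowers", "HbA1c", "mechanistic_bridge")

# An "evidence path" from a lifestyle factor to a glycemic lab value.
path = kg.find_path("aerobic exercise", "HbA1c")
layers_used = [edge[3] for edge in path]
```

Tagging edges by layer keeps provenance visible: a checker can require, for example, that a medication recommendation is backed by at least one clinical-rules edge rather than by terminology links alone.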

Performance and Correction

Across 100 structured vignettes covering diagnosis, medication safety, and adversarial lifestyle conflicts, the researchers found that GPT-4o-mini failed evidence-path checks in 35% of cases, while GPT-4o failed in 33%.

The framework introduces an "evidence gate" that flags these failures as unsupported omissions. Constrained revision then requires the LLM to bring its output in line with the benchmark's evidence requirements. The approach shows that clinical LLM outputs can be made measurable and correctable by anchoring them to verifiable, graph-based constraints rather than relying on the model's internal probabilistic generation alone.
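
The gate-then-revise loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's method: evidence requirements are modeled as plain strings, the gate uses case-insensitive substring matching, and the function names (evidence_gate, constrained_revision) and the stub reviser standing in for an LLM call are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    missing: list = field(default_factory=list)


def evidence_gate(answer, required_evidence):
    """Flag required evidence items that the answer never mentions.

    Assumption: substring matching stands in for a graph-based path check.
    """
    text = answer.lower()
    missing = [item for item in required_evidence if item.lower() not in text]
    return GateResult(passed=not missing, missing=missing)


def constrained_revision(answer, required_evidence, revise, max_rounds=2):
    """Ask `revise` (any callable taking a prompt string) to fix gate failures."""
    result = evidence_gate(answer, required_evidence)
    rounds = 0
    while not result.passed and rounds < max_rounds:
        prompt = (
            "Revise the answer so that it explicitly addresses: "
            + "; ".join(result.missing)
            + "\n\nAnswer:\n"
            + answer
        )
        answer = revise(prompt)
        result = evidence_gate(answer, required_evidence)
        rounds += 1
    return answer, result


def stub_revise(prompt):
    # Hypothetical stand-in for a model call; returns a fixed revised answer.
    return "Check eGFR before starting metformin and recheck HbA1c in 3 months."


required = ["eGFR", "HbA1c"]
first_check = evidence_gate("Start metformin.", required)
final_answer, final_check = constrained_revision("Start metformin.", required, stub_revise)
```

The revision step is "constrained" in the sense that the prompt names exactly the items the gate found missing, and the revised output is re-checked by the same gate before it is accepted.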
