SimpleQA: Benchmark Exposing LLM Hallucinations on Facts

SimpleQA's 4,326 short, diverse questions reveal that GPT-4o scores under 40% accuracy without retrieval, that the o1 reasoning models choose "not attempted" more often to avoid hallucinating, and that all models systematically overstate their confidence despite showing some calibration.

Building a Reliable Factuality Benchmark

SimpleQA tackles LLM hallucinations by focusing on short, fact-seeking questions with single indisputable answers that don't change over time, unlike older benchmarks such as TriviaQA (2017) or Natural Questions (2019), which frontier models have saturated. To ensure high quality, AI trainers browsed the web to create questions, and each question required agreement between two independent trainers; a third trainer then re-answered 1,000 sampled questions, matching the original answers 94.4% of the time. Of the 5.6% disagreement, roughly half (2.8%) traced to grader or trainer mistakes and half (2.8%) to genuine issues such as ambiguous questions, implying an inherent dataset error rate of about 3%. At 4,326 questions spanning science, tech, TV, games, and more, the benchmark offers low-variance evaluations and a fast researcher workflow: questions and answers are concise, so responses can be graded quickly via API.

Grading uses a prompted ChatGPT classifier comparing model predictions to ground truth:

  • Correct: the prediction fully contains the ground-truth answer without contradicting it (e.g., "Wout Weghorst", with or without extra consistent detail).
  • Incorrect: the prediction contradicts the ground truth in any way, even when hedged (e.g., a wrong name or an incomplete list).
  • Not attempted: the ground truth is not fully given, but nothing contradicts it (e.g., "I don't know").
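
Concretely, a prompted grader might be wired up as in the minimal sketch below, assuming an OpenAI-style chat completions client; the prompt wording, model choice, and function name are illustrative assumptions, not the exact grader shipped in openai/simple-evals.

```python
# Minimal sketch of a prompted three-way grader (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRADER_TEMPLATE = """\
Grade the predicted answer against the gold target as exactly one of:
A: CORRECT       - fully contains the gold target, no contradictions
B: INCORRECT     - contradicts the gold target in any way, even if hedged
C: NOT_ATTEMPTED - does not fully give the target, but does not contradict it

Question: {question}
Gold target: {target}
Predicted answer: {predicted}

Reply with a single letter: A, B, or C.
"""

def grade_answer(question: str, target: str, predicted: str) -> str:
    """Return 'CORRECT', 'INCORRECT', or 'NOT_ATTEMPTED'."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable grader model works here
        messages=[{"role": "user", "content": GRADER_TEMPLATE.format(
            question=question, target=target, predicted=predicted)}],
        temperature=0,
    )
    letter = resp.choices[0].message.content.strip()[:1].upper()
    return {"A": "CORRECT", "B": "INCORRECT"}.get(letter, "NOT_ATTEMPTED")
```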

An ideal model maximizes correct answers while minimizing incorrect ones, preferring an admission of ignorance over a wrong guess.
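
To make that trade-off concrete, here is a small sketch of how per-question grades might be aggregated; the metric names are descriptive choices rather than the benchmark's official reporting format, though a "correct given attempted" rate captures why declining beats guessing wrong.

```python
from collections import Counter

def summarize(grades: list[str]) -> dict[str, float]:
    """Aggregate grades ('CORRECT' / 'INCORRECT' / 'NOT_ATTEMPTED') into
    headline rates. Not-attempted answers drop out of the denominator of
    correct_given_attempted, so admitting ignorance is never penalized
    there, while a wrong guess is."""
    n = len(grades)
    c = Counter(grades)
    attempted = c["CORRECT"] + c["INCORRECT"]
    return {
        "correct": c["CORRECT"] / n,
        "incorrect": c["INCORRECT"] / n,
        "not_attempted": c["NOT_ATTEMPTED"] / n,
        "correct_given_attempted": c["CORRECT"] / attempted if attempted else 0.0,
    }
```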

Model Comparisons Highlight Reasoning Trade-offs

Without retrieval, smaller models such as gpt-4o-mini and o1-mini answer fewer questions correctly, reflecting their smaller store of world knowledge. The reasoning models o1-mini and o1-preview, however, choose "not attempted" far more often, apparently using reasoning to recognize their own uncertainty, whereas gpt-4o and gpt-4o-mini guess and hallucinate. The net effect is fewer incorrect answers for the reasoning models: o1-preview excels by answering confidently only on facts it knows, avoiding the pitfalls of direct-response models.

Calibration Reveals Overconfidence Gaps

SimpleQA quantifies whether LLMs "know what they know" via two methods:

  1. Stated confidence: prompt the model to state a confidence percentage alongside its answer, then plot accuracy against claimed confidence. All models show a positive correlation (reassuring), and larger models calibrate better (o1-preview > o1-mini; gpt-4o > gpt-4o-mini), but every curve falls below the y=x line: models systematically overstate confidence (e.g., answers given with a stated 75% confidence are correct less than 75% of the time).
  2. Response consistency: ask the same question 100 times, bin answers by how often the same string recurs, and plot accuracy against that frequency (a sketch follows this list). Accuracy rises with frequency across all models, and o1-preview calibrates best (frequency ≈ accuracy), confirming that reasoning aids self-awareness.
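
As a rough illustration of method 2, the sketch below implements the repeat-and-bin procedure, assuming an OpenAI-style chat client; the helper names (query_model, consistency_point, bin_accuracy) are hypothetical, not part of simple-evals. Method 1 is analogous, with the model's stated confidence replacing the modal-answer frequency.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def query_model(question: str, model: str = "gpt-4o") -> str:
    """Sample one short answer; temperature > 0 so repeats can differ."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=1.0,
    )
    return resp.choices[0].message.content.strip()

def consistency_point(question: str, target: str, n: int = 100):
    """Return (frequency of the modal answer, whether it matches target)."""
    counts = Counter(query_model(question).lower() for _ in range(n))
    modal_answer, modal_count = counts.most_common(1)[0]
    return modal_count / n, modal_answer == target.strip().lower()

def bin_accuracy(points, n_bins: int = 10):
    """Bin (frequency, is_correct) pairs by frequency and return the mean
    accuracy per bin; a well-calibrated model tracks the y=x line."""
    bins = [[] for _ in range(n_bins)]
    for freq, ok in points:
        bins[min(int(freq * n_bins), n_bins - 1)].append(ok)
    return [sum(b) / len(b) if b else None for b in bins]
```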

Limitations: SimpleQA tests only short-answer factuality; whether accuracy here predicts factuality in long-form generation remains an open question. The benchmark is open-sourced at github.com/openai/simple-evals to spur research on trustworthy AI.
