OpenAI Simple Evals: Zero-Shot CoT Benchmarks
Use this lightweight library to run transparent zero-shot chain-of-thought evals on MMLU (o3-high: 93.3%), GPQA (o3-high: 83.4%), MATH (o4-mini-high: 98.2%), HumanEval, MGSM, DROP, and SimpleQA for accurate model comparisons without few-shot prompts.
Zero-Shot Chain-of-Thought Beats Few-Shot for Instruction-Tuned Models
Apply simple zero-shot prompts like "Solve the following multiple choice problem" to better reflect the real-world performance of chat-tuned LLMs, avoiding the few-shot and role-playing prompting techniques of the base-model era. This approach reduces eval sensitivity to prompt variations and enables fairer comparisons. OpenAI open-sources the library for transparency around its published accuracy numbers, but as of July 2025 the repo is deprecated for new models and benchmarks; only the HealthBench, BrowseComp, and SimpleQA implementations are retained. It is not a replacement for the more comprehensive openai/evals repo.
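A minimal sketch of this prompting style, assuming an illustrative template and parser (the exact wording, regex, and function names here are assumptions, not the library's own):

```python
# Illustrative zero-shot chain-of-thought multiple-choice prompt and a
# regex-based answer parser; template text and names are assumptions, not
# the repository's exact implementation.
import re

ZERO_SHOT_COT_TEMPLATE = """\
Solve the following multiple choice problem. Think step by step, then write
your final answer on its own line as "Answer: X", where X is one of A, B, C, D.

{question}

A) {a}
B) {b}
C) {c}
D) {d}
"""

ANSWER_PATTERN = re.compile(r"Answer:\s*([ABCD])", re.IGNORECASE)

def extract_answer(completion: str) -> str | None:
    """Return the predicted letter, or None if the model never committed to one."""
    match = ANSWER_PATTERN.search(completion)
    return match.group(1).upper() if match else None

def score(completion: str, gold_letter: str) -> float:
    """Exact-match scoring: 1.0 if the parsed letter equals the gold letter."""
    return float(extract_answer(completion) == gold_letter.upper())
```

Note that the prompt carries no few-shot exemplars and no role-playing system message; the only scaffolding is the instruction to reason step by step and commit to a single letter.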
OpenAI Models Dominate Key Benchmarks
o3-high leads with 93.3% MMLU, 83.4% GPQA, 98.1% MATH, 88.4% HumanEval, 92.0% MGSM, 89.8% DROP (3-shot F1), and 48.6% SimpleQA. o4-mini-high excels at MATH (98.2%) and HumanEval (99.3%), while o3-mini-high reaches 97.9% MATH at lower cost. GPT-4.5-preview posts the table's best SimpleQA score at 62.5% but trails o3 on most other metrics. Competitors lag: Claude 3.5 Sonnet scores 88.3% MMLU and 59.4% GPQA; Llama 3.1 405B scores 88.6% MMLU and 50.7% GPQA. Use the full table to select models by task, e.g., o4-mini-high for math and coding efficiency.
Run Evals on OpenAI or Claude APIs
Install per-eval dependencies (e.g., pip install -e human-eval for HumanEval) and set OPENAI_API_KEY or ANTHROPIC_API_KEY. Benchmarks cover MMLU (multitask language understanding), MATH, GPQA, and MGSM (math and reasoning), DROP (reading comprehension requiring discrete reasoning), HumanEval (code generation), SimpleQA (factuality), BrowseComp (browsing agents), and HealthBench (health applications). Scripts such as mmlu_eval.py and math_eval.py handle sampling and answer parsing; a minimal version of that flow is sketched below. Contribute new model adapters or results via PRs; other changes are accepted only as bug fixes. Multilingual MMLU results live in multilingual_mmlu_benchmark_results.md.
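A hypothetical end-to-end sketch of the sample-then-parse loop those scripts automate, using the OpenAI Python client; the model name, toy dataset, and parsing heuristic are placeholders, not the repository's actual code:

```python
# Toy eval loop: zero-shot sampling plus crude answer parsing and scoring.
# Requires OPENAI_API_KEY; the model name and dataset are placeholders.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

EXAMPLES = [  # stand-in for a real benchmark split
    {"prompt": "What is 17 * 24? Think step by step, then write 'Answer: N'.",
     "gold": "408"},
]

def sample(prompt: str, model: str = "gpt-4.1-mini") -> str:
    """One zero-shot completion: a single user message, no few-shot examples."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

correct = 0
for example in EXAMPLES:
    completion = sample(example["prompt"])
    # Take whatever follows the last "Answer:"; the real scripts use
    # benchmark-specific regexes and graders instead.
    predicted = completion.rsplit("Answer:", 1)[-1].strip().rstrip(".")
    correct += int(predicted == example["gold"])

print(f"accuracy: {correct / len(EXAMPLES):.3f}")
```

Swapping in an Anthropic client when using ANTHROPIC_API_KEY follows the same pattern: one sampler per API, with parsing and scoring shared across models.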