Gemma 2: Open LLMs Trained on Up to 13T Tokens with Top Benchmark Scores
Google's Gemma 2 family (2B, 9B, and 27B parameters) comprises lightweight, open, decoder-only LLMs trained on 2-13T tokens that outperform similarly sized open models on MMLU (75.2 for 27B), HumanEval (51.8), and safety benchmarks while running on laptop-class hardware.
Deploy High-Performance LLMs on Limited Hardware
Gemma 2 models (2B, 9B, and 27B parameters) are text-to-text, decoder-only LLMs optimized for question answering, summarization, and reasoning. Their small size enables deployment on laptops, desktops, or personal cloud setups, unlike larger models that require massive clusters. The 27B was trained on 13T tokens, the 9B on 8T, and the 2B on 2T, drawn from diverse sources including web documents, code, math/science text, and multilingual data. Preprocessing removes duplicates, PII, low-quality content, and adult material using heuristics and classifiers, giving broad task coverage while avoiding common failure modes.
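As a back-of-envelope check on the laptop-deployment claim, weight memory is roughly parameter count times bytes per parameter. This is an illustrative estimate, not a figure from the model card: it uses nominal parameter counts and ignores activations and the KV cache.

```python
# Rough weight-only memory footprint for each Gemma 2 size.
# Nominal parameter counts (actual counts are somewhat higher);
# activations and KV cache are ignored, so treat these as lower bounds.
PARAMS = {"2B": 2e9, "9B": 9e9, "27B": 27e9}
BYTES_PER_PARAM = {"bf16": 2, "int8": 1, "int4": 0.5}

def weight_gb(size: str, dtype: str) -> float:
    """Gigabytes needed just to hold the weights."""
    return PARAMS[size] * BYTES_PER_PARAM[dtype] / 1e9

for size in PARAMS:
    print(size, {d: round(weight_gb(size, d), 1) for d in BYTES_PER_PARAM})
```

Under these assumptions, the 2B fits comfortably on a laptop even in bf16, while the 27B becomes tractable on consumer hardware mainly through quantization.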
On benchmarks, the larger variants excel: the 27B PT hits 75.2 MMLU (5-shot), 86.4 HellaSwag (10-shot), 51.8 HumanEval pass@1, and 74.0 GSM8K (5-shot maj@1); the 9B PT reaches 71.3 MMLU and 40.2 HumanEval; the 2B PT scores 51.3 MMLU. They surpass comparably sized open alternatives across reasoning (ARC-c 71.4 for 27B), QA (TriviaQA 83.7), and math (MATH 42.3), demonstrating state-of-the-art quality per parameter.
| Benchmark | 2B PT | 9B PT | 27B PT |
|---|---|---|---|
| MMLU 5-shot | 51.3 | 71.3 | 75.2 |
| HumanEval pass@1 | 17.7 | 40.2 | 51.8 |
| GSM8K 5-shot | 23.9 | 68.6 | 74.0 |
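The scaling trend in the table can be sanity-checked in a few lines; the scores below are copied directly from the table above:

```python
# Benchmark scores by model size, copied from the table above.
SCORES = {
    "MMLU 5-shot":      {"2B": 51.3, "9B": 71.3, "27B": 75.2},
    "HumanEval pass@1": {"2B": 17.7, "9B": 40.2, "27B": 51.8},
    "GSM8K 5-shot":     {"2B": 23.9, "9B": 68.6, "27B": 74.0},
}
ORDER = ["2B", "9B", "27B"]  # ascending parameter count

def monotone(bench: str) -> bool:
    """True if the score strictly increases with model size."""
    vals = [SCORES[bench][size] for size in ORDER]
    return all(a < b for a, b in zip(vals, vals[1:]))

# Every benchmark here improves monotonically with scale.
assert all(monotone(bench) for bench in SCORES)
```

Note the jump from 2B to 9B is much larger than from 9B to 27B on GSM8K, a common pattern for emergent math performance.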
Train Efficiently with TPUv5p, JAX, and Pathways
Gemma 2 was trained on TPUv5p hardware, which is purpose-built for the matrix-heavy computation that dominates LLM training and offers high throughput for this workload. JAX provides hardware-accelerated numerical computing, and ML Pathways orchestrates large multi-task workloads from a single Python process, the same single-controller setup described in the Gemini papers. Together they scale training to 13T tokens while cutting development overhead, a useful template if you are replicating on custom infrastructure.
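The core JAX pattern behind this setup is a pure loss function transformed by `jax.grad` and compiled end-to-end with `jax.jit`. Here is a minimal sketch on a toy linear model; this is illustrative only and bears no relation to the actual Gemma 2 training code or the Pathways orchestration layer:

```python
# Minimal sketch of a jit-compiled training step in JAX (illustrative;
# not the actual Gemma 2 training code, which runs via ML Pathways on TPUv5p).
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    # Toy linear model with mean-squared-error loss.
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

@jax.jit  # XLA-compiles the whole update for the accelerator (TPU/GPU/CPU)
def train_step(w, x, y, lr=0.1):
    grads = jax.grad(loss_fn)(w, x, y)  # reverse-mode autodiff
    return w - lr * grads               # plain SGD update

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 4))
w_true = jnp.array([1.0, -2.0, 3.0, 0.5])
y = x @ w_true
w = jnp.zeros(4)
for _ in range(200):
    w = train_step(w, x, y)
```

The same `jit`/`grad` composition scales from this toy loop to multi-pod training: the function stays pure Python, and XLA handles device placement and fusion.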
The data mix spans web documents, code, math, and multilingual sources; it is deduplicated at sentence and paragraph level, filtered with quality classifiers, and scrubbed of PII and adult content to improve generalization without memorization risks.
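A minimal sketch of that kind of cleaning pipeline, using hash-based exact dedup and a crude regex for PII. This is illustrative only: Gemma 2's actual filters and safety classifiers are not public, and real pipelines use learned quality models rather than regexes.

```python
import hashlib
import re

# Crude email pattern standing in for a real PII detector (assumption).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def clean_corpus(docs):
    """Drop exact-duplicate paragraphs and redact email-like PII.

    Stands in for Gemma 2's sentence/paragraph-level dedup and PII
    filtering; the real pipeline also applies quality classifiers.
    """
    seen = set()
    kept = []
    for doc in docs:
        for para in doc.split("\n\n"):
            digest = hashlib.sha256(para.strip().lower().encode()).hexdigest()
            if digest in seen:
                continue  # paragraph-level exact dedup
            seen.add(digest)
            kept.append(EMAIL_RE.sub("[PII]", para))
    return kept

docs = ["Hello world.\n\nContact: a@b.com", "Hello world.\n\nFresh text."]
print(clean_corpus(docs))  # duplicate paragraph kept once, email redacted
```

Hashing paragraphs keeps memory proportional to unique content rather than corpus size, which matters at the multi-trillion-token scale described above.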
Pass Safety and Dangerous Capability Thresholds
Instruction-tuned (IT) variants show low toxicity (RealToxicity avg 8.84 for 27B IT) and low bias (CrowS-Pairs 36.67 top-1), alongside strong BBQ (86.94 Disambig for 27B) and TruthfulQA (51.60) results. They meet Google's internal policy thresholds for child safety, content harms, and memorization.
Dangerous-capability evals show limited risk: the 27B IT solves 34/76 InterCode-CTF cyber challenges, 1/13 on an internal CTF suite, and 0/13 on Hack the Box. In persuasion studies, 81% of participants found the model interesting, but harmful belief shifts were minimal (1% toward incorrect beliefs; £3.72 mean donation elicited). Risks are mitigated via data preprocessing, post-training, and monitoring, but users must add their own safeguards for production.
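The quoted cyber-offense counts translate into per-suite solve rates with simple arithmetic on the figures above:

```python
# Solve counts for the 27B IT model on cyber-offense evals,
# taken directly from the text above.
CTF_RESULTS = {
    "InterCode-CTF": (34, 76),
    "Internal CTF":  (1, 13),
    "Hack the Box":  (0, 13),
}

def solve_rate(name: str) -> float:
    """Percentage of challenges solved, rounded to one decimal."""
    solved, total = CTF_RESULTS[name]
    return round(100 * solved / total, 1)

rates = {name: solve_rate(name) for name in CTF_RESULTS}
print(rates)  # InterCode-CTF comes out to 44.7%
```

The gap between the introductory InterCode-CTF suite (~45%) and the harder internal/Hack the Box suites (under 8%) is what supports the "low success" characterization.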
| Safety Benchmark | 2B IT | 9B IT | 27B IT |
|---|---|---|---|
| RealToxicity avg | 8.16 | 8.25 | 8.84 |
| TruthfulQA | 43.72 | 50.27 | 51.60 |
Limitations: the models may amplify biases, hallucinate, or produce policy-violating content without additional filtering; they are not suitable for high-risk uses such as medical or legal advice.