Gemma 2: Open LLMs Trained on Up to 13T Tokens with Top Benchmark Scores
Google's Gemma 2 family (2B, 9B, and 27B parameters) comprises lightweight, open, decoder-only LLMs trained on 2-13T tokens that outperform similarly sized open models on MMLU (75.2 for 27B), HumanEval (51.8), and safety benchmarks while running on laptop-class hardware.
Deploy High-Performance LLMs on Limited Hardware
Gemma 2 models (2B, 9B, and 27B parameters) are text-to-text, decoder-only LLMs optimized for question answering, summarization, and reasoning. Their small size enables deployment on laptops, desktops, or personal cloud setups, unlike larger models that require massive clusters. The 27B was trained on 13T tokens, the 9B on 8T, and the 2B on 2T, drawn from diverse sources including web documents, code, math/science text, and multilingual data. Preprocessing removes duplicates, PII, low-quality content, and adult material using heuristics and classifiers, giving broad task coverage while avoiding common failure modes.
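As a back-of-envelope check on the laptop-deployment claim, weight memory is roughly parameter count times bytes per parameter. This is an illustrative estimate, not a figure from the model card: it uses nominal parameter counts and ignores activations and the KV cache.

```python
# Rough weight-only memory footprint for each Gemma 2 size.
# Nominal parameter counts (actual counts are somewhat higher);
# activations and KV cache are ignored, so treat these as lower bounds.
PARAMS = {"2B": 2e9, "9B": 9e9, "27B": 27e9}
BYTES_PER_PARAM = {"bf16": 2, "int8": 1, "int4": 0.5}

def weight_gb(size: str, dtype: str) -> float:
    """Gigabytes needed just to hold the weights."""
    return PARAMS[size] * BYTES_PER_PARAM[dtype] / 1e9

for size in PARAMS:
    print(size, {d: round(weight_gb(size, d), 1) for d in BYTES_PER_PARAM})
```

Under these assumptions, the 2B fits comfortably on a laptop even in bf16, while the 27B becomes tractable on consumer hardware mainly through quantization.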
On benchmarks, the larger variants excel: the 27B PT hits 75.2 MMLU (5-shot), 86.4 HellaSwag (10-shot), 51.8 HumanEval pass@1, and 74.0 GSM8K (5-shot maj@1); the 9B PT reaches 71.3 MMLU and 40.2 HumanEval; the 2B PT scores 51.3 MMLU. They surpass comparably sized open alternatives across reasoning (ARC-c 71.4 for 27B), QA (TriviaQA 83.7), and math (MATH 42.3), demonstrating state-of-the-art quality per parameter.
| Benchmark | 2B PT | 9B PT | 27B PT |
|---|---|---|---|
| MMLU 5-shot | 51.3 | 71.3 | 75.2 |
| HumanEval pass@1 | 17.7 | 40.2 | 51.8 |
| GSM8K 5-shot | 23.9 | 68.6 | 74.0 |
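The scaling trend in the table can be sanity-checked in a few lines; the scores below are copied directly from the table above:

```python
# Benchmark scores by model size, copied from the table above.
SCORES = {
    "MMLU 5-shot":      {"2B": 51.3, "9B": 71.3, "27B": 75.2},
    "HumanEval pass@1": {"2B": 17.7, "9B": 40.2, "27B": 51.8},
    "GSM8K 5-shot":     {"2B": 23.9, "9B": 68.6, "27B": 74.0},
}
ORDER = ["2B", "9B", "27B"]  # ascending parameter count

def monotone(bench: str) -> bool:
    """True if the score strictly increases with model size."""
    vals = [SCORES[bench][size] for size in ORDER]
    return all(a < b for a, b in zip(vals, vals[1:]))

# Every benchmark here improves monotonically with scale.
assert all(monotone(bench) for bench in SCORES)
```

Note the jump from 2B to 9B is much larger than from 9B to 27B on GSM8K, a common pattern for emergent math performance.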
Train Efficiently with TPUv5p, JAX, and Pathways
Gemma 2 was trained on TPUv5p hardware, which is purpose-built for the matrix-heavy computation that dominates LLM training and offers high throughput for this workload. JAX provides hardware-accelerated numerical computing, and ML Pathways orchestrates large multi-task workloads from a single Python process, the same single-controller setup described in the Gemini papers. Together they scale training to 13T tokens while cutting development overhead, a useful template if you are replicating on custom infrastructure.
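The core JAX pattern behind this setup is a pure loss function transformed by `jax.grad` and compiled end-to-end with `jax.jit`. Here is a minimal sketch on a toy linear model; this is illustrative only and bears no relation to the actual Gemma 2 training code or the Pathways orchestration layer:

```python
# Minimal sketch of a jit-compiled training step in JAX (illustrative;
# not the actual Gemma 2 training code, which runs via ML Pathways on TPUv5p).
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    # Toy linear model with mean-squared-error loss.
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

@jax.jit  # XLA-compiles the whole update for the accelerator (TPU/GPU/CPU)
def train_step(w, x, y, lr=0.1):
    grads = jax.grad(loss_fn)(w, x, y)  # reverse-mode autodiff
    return w - lr * grads               # plain SGD update

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 4))
w_true = jnp.array([1.0, -2.0, 3.0, 0.5])
y = x @ w_true
w = jnp.zeros(4)
for _ in range(200):
    w = train_step(w, x, y)
```

The same `jit`/`grad` composition scales from this toy loop to multi-pod training: the function stays pure Python, and XLA handles device placement and fusion.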
The data mix spans web documents, code, math, and multilingual sources; it is deduplicated at sentence and paragraph level, filtered with quality classifiers, and scrubbed of PII and adult content to improve generalization without memorization risks.
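A minimal sketch of that kind of cleaning pipeline, using hash-based exact dedup and a crude regex for PII. This is illustrative only: Gemma 2's actual filters and safety classifiers are not public, and real pipelines use learned quality models rather than regexes.

```python
import hashlib
import re

# Crude email pattern standing in for a real PII detector (assumption).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def clean_corpus(docs):
    """Drop exact-duplicate paragraphs and redact email-like PII.

    Stands in for Gemma 2's sentence/paragraph-level dedup and PII
    filtering; the real pipeline also applies quality classifiers.
    """
    seen = set()
    kept = []
    for doc in docs:
        for para in doc.split("\n\n"):
            digest = hashlib.sha256(para.strip().lower().encode()).hexdigest()
            if digest in seen:
                continue  # paragraph-level exact dedup
            seen.add(digest)
            kept.append(EMAIL_RE.sub("[PII]", para))
    return kept

docs = ["Hello world.\n\nContact: a@b.com", "Hello world.\n\nFresh text."]
print(clean_corpus(docs))  # duplicate paragraph kept once, email redacted
```

Hashing paragraphs keeps memory proportional to unique content rather than corpus size, which matters at the multi-trillion-token scale described above.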
Pass Safety and Dangerous Capability Thresholds
Instruction-tuned (IT) variants show low toxicity (RealToxicity avg 8.84 for 27B IT) and low bias (CrowS-Pairs 36.67 top-1), alongside strong BBQ (86.94 Disambig for 27B) and TruthfulQA (51.60) results. They meet Google's internal policy thresholds for child safety, content harms, and memorization.
Dangerous-capability evals show limited risk: the 27B IT solves 34/76 InterCode-CTF cyber challenges, 1/13 on an internal CTF suite, and 0/13 on Hack the Box. In persuasion studies, 81% of participants found the model interesting, but harmful belief shifts were minimal (1% toward incorrect beliefs; £3.72 mean donation elicited). Risks are mitigated via data preprocessing, post-training, and monitoring, but users must add their own safeguards for production.
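The quoted cyber-offense counts translate into per-suite solve rates with simple arithmetic on the figures above:

```python
# Solve counts for the 27B IT model on cyber-offense evals,
# taken directly from the text above.
CTF_RESULTS = {
    "InterCode-CTF": (34, 76),
    "Internal CTF":  (1, 13),
    "Hack the Box":  (0, 13),
}

def solve_rate(name: str) -> float:
    """Percentage of challenges solved, rounded to one decimal."""
    solved, total = CTF_RESULTS[name]
    return round(100 * solved / total, 1)

rates = {name: solve_rate(name) for name in CTF_RESULTS}
print(rates)  # InterCode-CTF comes out to 44.7%
```

The gap between the introductory InterCode-CTF suite (~45%) and the harder internal/Hack the Box suites (under 8%) is what supports the "low success" characterization.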
| Safety Benchmark | 2B IT | 9B IT | 27B IT |
|---|---|---|---|
| RealToxicity avg | 8.16 | 8.25 | 8.84 |
| TruthfulQA | 43.72 | 50.27 | 51.60 |
Limitations: the models may amplify biases, hallucinate, or produce policy-violating content without additional filtering; they are not suitable for high-risk uses such as medical or legal advice.