Gemma 2: Open LLMs Trained on Up to 13T Tokens with Strong Benchmark Results

Google's Gemma 2 family (2B, 9B, and 27B parameters) comprises lightweight, open, decoder-only LLMs trained on 2-13T tokens. They outperform similarly sized open models on MMLU (75.2 for the 27B), HumanEval (51.8), and safety benchmarks while remaining small enough to run on laptops.

Deploy High-Performance LLMs on Limited Hardware

Gemma 2 models (2B, 9B, 27B parameters) are text-to-text, decoder-only LLMs optimized for question answering, summarization, and reasoning. Their small size enables deployment on laptops, desktops, or personal cloud setups, unlike larger models that need massive clusters. The 27B was trained on 13T tokens, the 9B on 8T, and the 2B on 2T, drawn from diverse sources including web documents, code, math/science text, and multilingual data. Preprocessing filters out duplicates, PII, low-quality content, and adult material using heuristics and classifiers, ensuring broad task coverage while avoiding common failure modes.
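To see why these sizes fit on consumer hardware, a back-of-envelope estimate of weight memory helps (weights only, ignoring activations and KV cache). The parameter counts and dtypes below are our assumptions for illustration, not official figures.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for model weights alone (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Assumed rough parameter counts; actual counts differ slightly from the names.
for name, params in [("2B", 2.6e9), ("9B", 9.2e9), ("27B", 27.2e9)]:
    for dtype, nbytes in [("bf16", 2), ("int4", 0.5)]:
        print(f"{name} @ {dtype}: ~{weight_memory_gb(params, nbytes):.1f} GB")
```

At bf16, even the 27B needs roughly 54 GB for weights alone, which is why the 2B and 9B (or quantized variants) are the laptop-friendly options.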

On benchmarks, the larger variants excel: the 27B PT model hits 75.2 on MMLU (5-shot), 86.4 on HellaSwag (10-shot), 51.8 HumanEval pass@1, and 74.0 on GSM8K (5-shot maj@1); the 9B PT scores 71.3 MMLU and 40.2 HumanEval; the 2B PT scores 51.3 MMLU. They surpass comparably sized open alternatives across reasoning (ARC-c 71.4 for the 27B), QA (TriviaQA 83.7), and math (MATH 42.3), demonstrating strong performance for their size.

| Benchmark | 2B PT | 9B PT | 27B PT |
| --- | --- | --- | --- |
| MMLU (5-shot) | 51.3 | 71.3 | 75.2 |
| HumanEval (pass@1) | 17.7 | 40.2 | 51.8 |
| GSM8K (5-shot) | 23.9 | 68.6 | 74.0 |
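The HumanEval pass@1 figures above use the standard unbiased pass@k estimator from the Codex paper; a minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c of them correct), passes."""
    if n - c < k:
        return 1.0  # too few failures left to fill a sample of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this reduces to the plain pass rate c/n.
print(pass_at_k(10, 5, 1))  # 0.5
```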

Train Efficiently with TPUv5p, JAX, and Pathways

Leverage TPUv5p hardware for matrix-heavy training; it offers higher throughput than comparable GPU setups for LLM workloads. Use JAX for hardware acceleration and ML Pathways for multi-task orchestration in a single Python process, simplifying workflows as in the Gemini papers. This combination scales to 13T tokens while cutting development overhead, and serves as a useful template for replicating the setup on custom infrastructure.
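A toy sketch of the JAX style alluded to here: a jitted SGD step for a linear model, compiled once and run on whatever accelerator is available (TPU, GPU, or CPU). This is our illustration, not the actual training code.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Mean squared error of a linear model y ≈ x @ w.
    return jnp.mean((x @ w - y) ** 2)

# jax.jit compiles the update step for the available accelerator.
@jax.jit
def sgd_step(w, x, y, lr=0.1):
    grads = jax.grad(loss)(w, x, y)
    return w - lr * grads

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 4))
true_w = jnp.array([1.0, -2.0, 0.5, 3.0])
y = x @ true_w
w = jnp.zeros(4)
for _ in range(200):
    w = sgd_step(w, x, y)
print(w)  # converges toward true_w
```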

The data mix includes web, code, math, and multilingual sources; deduplicate at the sentence and paragraph levels, filter via quality classifiers, and remove PII and adult content to improve generalization and reduce memorization risks.
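A minimal sketch of the kind of hash-based deduplication and heuristic filtering described above; the exact filters Google used are not public, so the normalization and threshold here are illustrative only.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so near-identical texts hash the same.
    return re.sub(r"\s+", " ", text.strip().lower())

def dedupe_and_filter(docs, min_words=3):
    seen = set()
    kept = []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h in seen:
            continue  # exact duplicate after normalization
        seen.add(h)
        if len(doc.split()) < min_words:
            continue  # heuristic quality filter: drop very short fragments
        kept.append(doc)
    return kept

docs = ["Hello   world example.", "hello world example.", "ok",
        "A longer clean document."]
print(dedupe_and_filter(docs))
```

Production pipelines add fuzzy (e.g. MinHash) dedup, classifier-based quality scoring, and PII detection on top of simple exact-match passes like this.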

Pass Safety and Dangerous Capability Thresholds

Instruction-tuned (IT) variants score low on toxicity (RealToxicity average of 8.84 for the 27B IT) and bias (CrowS-Pairs 36.67 top-1), with strong BBQ (86.94 Disambig for the 27B) and TruthfulQA (51.60) results. They meet Google's internal policies on child safety, harms, and memorization.

Dangerous-capability evaluations show limited risk: the 27B IT solves 34/76 InterCode-CTF cyber challenges (a low success rate), 1/13 internal CTF challenges, and 0/13 HackTheBox challenges. In persuasion tests, 81% of participants found the model interesting, but harmful shifts were minimal (1% moved toward incorrect beliefs; £3.72 mean donation). Risks are mitigated via preprocessing, post-training, and monitoring, but users must add their own safeguards for production.

| Safety Benchmark | 2B IT | 9B IT | 27B IT |
| --- | --- | --- | --- |
| RealToxicity (avg) | 8.16 | 8.25 | 8.84 |
| TruthfulQA | 43.72 | 50.27 | 51.60 |

Limitations: the models may amplify biases, hallucinate, or produce policy-violating output without additional filtering; they are not intended for high-risk uses such as medical or legal advice.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge