Gemma 4: Multimodal Open Models Excelling in Reasoning and Coding
Google DeepMind's Gemma 4 family delivers open-weights multimodal models (2.3B to 31B parameters) with 128K-256K context windows, topping benchmarks in reasoning (MMLU Pro 85.2%), coding (LiveCodeBench 80%), vision (MMMU Pro 76.9%), and audio, and spanning deployments from on-device to server.
Architectural Designs Enable Scalable, Efficient Deployment
Gemma 4 models use a hybrid attention design that interleaves local sliding-window attention with full global attention and ends in global layers with unified Keys/Values and Proportional RoPE (p-RoPE), balancing speed, memory footprint, and long-context handling.

The lineup covers four models. Dense E2B (2.3B effective / 5.1B total params, 35 layers, 512-token window, 128K context, text/image/audio) and E4B (4.5B effective / 8B total, 42 layers, same window and context, text/image/audio) both use Per-Layer Embeddings (PLE), which supply token-specific embeddings through per-layer lookups rather than extra layers, for on-device efficiency. The dense 31B model has 30.7B params, 60 layers, a 1024-token window, 256K context, and text/image input. The MoE 26B A4B activates only 3.8B of its 25.2B params across 30 layers (8 of 128 experts plus 1 shared, 1024-token window, 256K context, text/image), running at roughly the speed of a 4B dense model.

All variants share a 262K vocabulary and handle variable image aspect ratios and resolutions through a token budget (higher for detail, lower for speed); E2B/E4B accept audio up to 30s, and video is supported up to 60s at 1 fps. A native system role and function calling power agentic use, and thinking is configurable via dedicated tokens (<|think|>, <|channel|>thought\n<|channel|>). Load via Transformers with AutoProcessor/AutoModelForCausalLM in torch.bfloat16 and device_map="auto", apply the chat template with enable_thinking=True/False, and parse the response, as sketched below.
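A minimal loading-and-generation sketch following the description above. The checkpoint ID is a placeholder (not a confirmed model name), and forwarding enable_thinking through the chat template is an assumption taken from the text.

```python
# Sketch: load a Gemma 4 checkpoint via Transformers and generate with thinking mode.
# Assumptions: placeholder model ID; enable_thinking is forwarded to the chat template.
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "google/gemma-4-e4b-it"  # placeholder; substitute the released checkpoint name

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 weights as described
    device_map="auto",           # spread layers across available devices
)

messages = [
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Explain what a sliding-window attention layer does."},
]

# Render and tokenize the chat template; enable_thinking toggles the reasoning preamble.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # assumption: template flag as described in the text
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens; in thinking mode, split the internal
# reasoning off at the thinking-token boundary defined by the released template.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=False))
```

Setting enable_thinking=False keeps the same template but skips the reasoning preamble, so the response can be used directly without parsing out a thought section.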
Benchmark Leadership in Reasoning, Coding, Multimodality
Instruction-tuned Gemma 4 outperforms Gemma 3 27B at every size (multi-value figures below are ordered 31B / 26B A4B / E4B / E2B unless noted). The 31B model reaches MMLU Pro 85.2%, AIME 2026 89.2%, LiveCodeBench v6 80.0%, Codeforces ELO 2150, GPQA Diamond 84.3%, Tau2 76.9%, and BigBench Extra Hard 74.4%; 26B A4B follows closely at 82.6% / 88.3% / 77.1% / 1718 / 82.3% / 68.2% / 64.8%, with E4B/E2B at 69.4%/60.0% (MMLU Pro), 42.5%/37.5% (AIME), 52.0%/44.0% (LiveCodeBench), etc. Vision: MMMU Pro 76.9% / 73.8% / 52.6% / 44.2%, MATH-Vision 85.6% / 82.4% / 59.5% / 52.4%, OmniDocBench edit distance 0.131 / 0.149 / 0.181 / 0.290 (lower is better). Audio (E4B/E2B): CoVoST 35.54%/33.47%, FLEURS 0.08/0.09. On long-context MRCR v2 at 128K the family scores 66.4% / 44.1% / 25.4% / 19.1% versus 13.5% for Gemma 3; HLE reaches 19.5%/8.7% without tools and 26.5%/17.2% with search. Models are pre-trained on web text, code, images, and audio (knowledge cutoff January 2025), filtered via deduplication, PII removal, toxicity scoring, and quality heuristics.
Best Practices Maximize Reasoning and Multimodal Outputs
Sample with temperature=1.0, top_p=0.95, and top_k=64. Use the standard system/assistant/user roles and let the library apply the chat template. For multi-turn conversations, append the prior exchanges to the message list. Order modalities with text first, then images/audio/video. For audio, prompt with 'Transcribe audio in {LANGUAGE}...' (digits as numerals, no newlines) or combine transcription and translation ('{TARGET_LANGUAGE}: translation'). In thinking mode, parse the internal reasoning out of the response before using the final answer. Deploy E2B/E4B on phones and laptops (PLE keeps on-device memory low) and 26B A4B/31B on GPUs/servers for agentic and coding tasks; a minimal multi-turn sketch follows.
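A minimal multi-turn, image-and-text sketch using the recommended sampling settings. The checkpoint ID and image URL are placeholders, and the content-list message schema (with the processor's chat template loading images from URLs) is assumed from recent Transformers behavior rather than confirmed for Gemma 4.

```python
# Sketch: multi-turn chat with text-first modality ordering and the recommended sampling.
# Assumptions: placeholder checkpoint ID and image URL; standard multimodal content-list schema.
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "google/gemma-4-e4b-it"  # placeholder checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a concise assistant."}]},
    # Earlier turns are appended verbatim to carry multi-turn context.
    {"role": "user", "content": [{"type": "text", "text": "Remember that my favorite color is teal."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Noted: your favorite color is teal."}]},
    {
        "role": "user",
        "content": [
            # Text first, then the image, per the recommended modality order.
            {"type": "text", "text": "Describe this photo in one sentence."},
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder URL
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=1.0,  # recommended sampling settings
        top_p=0.95,
        top_k=64,
    )

print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```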
Safety Evaluations Show Major Gains Over Prior Models
Rigorous automated and human evaluations, run without safety filters, align the models with Google's AI principles and target harms such as violent or sexual content, hate speech, and harassment. Gemma 4 reduces policy violations versus Gemma 3/3n across text and image-to-text inputs at all sizes while keeping unjustified refusals low. Remaining risks (hallucinations, biases, misuse) are mitigated through diverse training data and safety tuning; known limitations include factual errors, weaker non-English performance, and no knowledge of events after the January 2025 cutoff.