Gemma 4 31B-IT: Multimodal Open Model with 256K Context

Gemma 4 31B-IT achieves 85.2% on MMLU Pro and 80.0% on LiveCodeBench, supports text and image inputs (with video/audio on the smaller variants), offers a 256K context window via hybrid attention, and ships under Apache 2.0 for deployment from phones to servers.

Architectural Designs for Scalable Multimodal Deployment

The Gemma 4 family spans three dense models and one MoE:

- E2B (dense): 2.3B effective / 5.1B total params, 35 layers, 128K context
- E4B (dense): 4.5B effective / 8B total params, 42 layers, 128K context
- 31B (dense): 30.7B params, 60 layers, 256K context
- 26B A4B (MoE): 25.2B total / 3.8B active params, 30 layers, 8-of-128 experts, 256K context

All variants share a 262K-token vocabulary and hybrid attention: sliding-window layers (512-1024-token windows) interleaved with global layers that use a unified KV cache and p-RoPE for memory efficiency. The smaller E2B/E4B employ Per-Layer Embeddings (PLE) for on-device efficiency, paired with compact encoders (~150M vision, ~300M audio); the larger models carry a ~550M vision encoder. All sizes handle text and images, while audio and video are limited to the small models (audio up to 30s, video up to 60s at 1 fps). Native system prompts, function calling, and configurable thinking modes (via <|think|> and channel-style thought markers) boost reasoning, coding, and agentic performance.
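To make the hybrid-attention memory claim concrete, here is a back-of-the-envelope KV-cache estimate: global layers cache keys/values for the full context, while sliding-window layers cache at most one window. The layer split, head counts, and dtype below are assumptions for illustration, not published figures.

```python
# Illustrative KV-cache estimate for hybrid attention.
# All shape parameters below are assumptions, not released specs.
def kv_cache_bytes(n_layers, n_global, context_len, window,
                   n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Global layers cache the full context; sliding-window layers
    cache at most `window` tokens, whatever the context length."""
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    global_cost = n_global * context_len * per_token
    local_cost = (n_layers - n_global) * min(window, context_len) * per_token
    return global_cost + local_cost

full = kv_cache_bytes(60, 60, 262_144, 262_144)  # all-global baseline
hybrid = kv_cache_bytes(60, 10, 262_144, 1_024)  # hypothetical 1:5 global:local split
print(f"all-global: {full / 1e9:.1f} GB, hybrid: {hybrid / 1e9:.1f} GB")
# -> all-global: 64.4 GB, hybrid: 10.9 GB (under these assumptions)
```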

The 26B A4B MoE activates only ~3.8B parameters per token, matching E4B-class speed with far greater capacity, while the dense 31B suits workstation-class hardware. Variable image resolution via a token budget trades visual detail for speed.
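A minimal top-k routing sketch (PyTorch, toy dimensions) shows why the active-parameter count stays small: each token runs only its k selected experts, here 8 of 128 as in the A4B spec. The hidden size and expert structure are invented for illustration.

```python
import torch

# Toy top-k MoE routing: 8 of 128 experts per token (matching the
# A4B spec above); the hidden size d is an arbitrary small value.
n_experts, k, d = 128, 8, 256
router = torch.nn.Linear(d, n_experts, bias=False)
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_experts))

def moe_forward(x):  # x: (tokens, d)
    logits = router(x)
    weights, idx = torch.topk(logits.softmax(dim=-1), k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over top-k
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):  # per token, only k expert MLPs execute
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])
    return out

print(moe_forward(torch.randn(4, d)).shape)  # torch.Size([4, 256])
```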

Superior Benchmarks in Reasoning, Coding, Multimodality

Instruction-tuned models excel across the board. The 31B leads with 85.2% MMLU Pro, 89.2% AIME 2026 (no tools), 80.0% LiveCodeBench v6, 2150 Codeforces Elo, 84.3% GPQA Diamond, 76.9% Tau2, 19.5% HLE without tools (26.5% with search), and 74.4% BigBench Hard. The 26B A4B follows closely: 82.6% MMLU Pro, 88.3% AIME, 77.1% LiveCodeBench, 1718 Elo. Among the small models, E4B reaches 69.4% MMLU Pro and E2B 60.0%. Multimodal results for the 31B include 88.4% MMMLU, 76.9% MMMU Pro, 0.131 OmniDocBench edit distance, and 85.6% MATH-Vision; on audio, E4B scores 35.54% CoVoST and 0.08 FLEURS. For long context, the 31B hits 66.4% on MRCR v2 at 128K. It outperforms Gemma 3 27B throughout (e.g., 85.2% vs. 67.6% MMLU Pro).

Integration Code and Best Practices for Production

Load via Transformers: pip install -U transformers torch accelerate, then use AutoProcessor with AutoModelForCausalLM or AutoModelForMultimodalLM (add torchvision, librosa, and torchcodec for vision, audio, and video support). The chat template supports system and user roles, and enable_thinking=True turns on reasoning-trace parsing. Multimodal prompts embed {"type": "image" | "audio" | "video", "url": URL} entries before the text part.
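A minimal text-only loading sketch following the setup above. The model id is hypothetical, and the enable_thinking plumbing follows this summary's description rather than a confirmed API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-31b-it"  # hypothetical id, assumed for illustration

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize hybrid attention in two sentences."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # per this summary, toggles thinking-mode markup
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```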

Recommended sampling: temperature=1.0, top_p=0.95, top_k=64. For audio prompts, transcribe numbers as digits and avoid newlines; for translation, emit the source text followed by '{TARGET}: translation'. Multi-turn conversations use the standard chat roles. On safety, rigorous evaluations match Gemini-level standards, with low violation rates even without output filters, outperforming prior Gemma releases.
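Putting the multimodal prompt format and the sampling defaults together, here is a hedged sketch. The model id is assumed, and AutoModelForMultimodalLM is taken from this summary; the shipped class name and processor behavior may differ.

```python
import torch
from transformers import AutoProcessor, AutoModelForMultimodalLM  # class name per this summary

model_id = "google/gemma-4-31b-it"  # hypothetical id, assumed for illustration

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Media entries precede the text part, as described above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "Describe this chart."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,  # recommended sampling settings from above
    top_p=0.95,
    top_k=64,
)
print(processor.decode(output[0], skip_special_tokens=True))
```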

Pretraining drew on web, code, image, and audio data (knowledge cutoff January 2025), cleaned via deduplication and PII filtering. Limitations: no fine-grained video or audio understanding beyond the stated specs, plus the usual risks of bias and hallucination. Intended uses center on reasoning, coding, and agentic tasks, though this is not an exhaustive list.
