MoE Architecture Delivers Dense-Like Speed at Scale
Gemma 4 26B A4B-it employs a Mixture-of-Experts (MoE) design with 25.2B total parameters but only 3.8B active per inference pass (8 routed of 128 total experts, plus 1 shared expert), letting it run as fast as a 4B dense model while outperforming Gemma 3 27B. The architecture has 30 layers, a 1024-token sliding window, 256K context via hybrid attention (local sliding window plus global layers with a unified KV cache and p-RoPE), and a 262K-entry vocabulary. It supports text and image inputs (~550M vision parameters); audio and video are available on the smaller E2B/E4B variants. Dense siblings: E2B (2.3B effective / 5.1B total parameters, 35 layers, 128K context, text/image/audio), E4B (4.5B effective / 8B total, 42 layers, same context and modalities), and 31B (30.7B parameters, 60 layers, 256K context, text/image). Per-Layer Embeddings (PLE) on the small models boost efficiency for on-device use. Native system prompts, function calling, and configurable thinking modes (via <|think|> and channel-delimited thought tags) enable agentic workflows.
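To make the active-versus-total split concrete, the sketch below shows top-k expert routing: a router selects 8 of 128 experts per token and a shared expert always runs, so only a small slice of the total weights touches any given token. This is a minimal illustration with assumed dimensions and naive per-token dispatch, not Gemma's actual implementation.

```python
# Minimal top-k MoE routing sketch (illustrative only; layer sizes,
# names, and dispatch strategy are assumptions, not Gemma 4's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=128, k=8):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # One shared expert runs for every token, alongside the routed ones.
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                      # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # route each token to k experts
        weights = F.softmax(weights, dim=-1)
        outputs = []
        for t in range(x.size(0)):                   # naive per-token dispatch
            y = self.shared(x[t])                    # shared expert: always active
            for j in range(self.k):
                y = y + weights[t, j] * self.experts[int(idx[t, j])](x[t])
            outputs.append(y)
        return torch.stack(outputs)

moe = TopKMoE()
print(moe(torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```

Because only k of n_experts expert FFNs (plus the shared expert) execute per token, compute scales with the active parameter count, which is how a 25.2B-parameter model can run with only ~3.8B parameters active.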
Benchmark Leadership in Reasoning, Coding, Multimodality
The instruction-tuned models lead across reasoning, coding, and multimodal benchmarks. 26B A4B-it scores 82.6% MMLU Pro, 88.3% AIME 2026 (no tools), 77.1% LiveCodeBench v6, 1718 Codeforces Elo, 82.3% GPQA Diamond, 86.3% MMMLU, 73.8% MMMU Pro, and 82.4% MATH-Vision, with 44.1% on long-context MRCR v2 at 128K. The larger 31B reaches 85.2% MMLU Pro, 89.2% AIME, 80.0% LiveCodeBench, and 2150 Elo. The small models hold up: E4B/E2B score 69.4%/60.0% MMLU Pro and 35.54/33.47 on CoVoST audio translation. All beat Gemma 3 27B (no thinking) by wide margins, e.g., Tau2 68.2% vs 16.2% and BigBench Hard 64.8% vs 19.3%. Vision stands out on OmniDocBench with a 0.149 average edit distance (lower is better). Use thinking mode (enable_thinking=True) for complex reasoning and parse the output with processor.parse_response.
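A minimal sketch of that thinking-mode flow, assuming the checkpoint name is as shown and that enable_thinking and parse_response behave as described above (verify both against the actual model card):

```python
# Sketch: enabling thinking mode and parsing the response.
# The model ID and the exact enable_thinking / parse_response behavior
# are assumptions based on the description in this section.
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "google/gemma-4-26b-a4b-it"  # hypothetical checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a careful math assistant."},
    {"role": "user", "content": "How many primes are there below 100?"},
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # emit <|think|> reasoning before the final answer
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024)
# parse_response (per this section) separates the hidden reasoning
# from the user-facing answer.
result = processor.parse_response(outputs[0][inputs.shape[-1]:])
print(result)
```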
Transformers Integration for Text, Image, Audio, Video
Install with pip install -U transformers torch accelerate. Use AutoProcessor with AutoModelForCausalLM for text-only work, or AutoModelForMultimodalLM for multimodal inputs. Text generation follows the usual flow: build messages with system/user roles, run apply_chat_template, call generate with max_new_tokens=1024, then decode and parse. For multimodal input, add {"type": "image"/"audio"/"video", "url": "path"} entries before the text in the user content; audio requires librosa, video requires torchcodec/torchvision. Recommended sampling: temperature=1.0, top_p=0.95, top_k=64. Audio is capped at 30s and video at 60s (sampled at 1 fps). Images are processed at variable resolution and aspect ratio via a token budget; for audio transcription, a prompt such as "Transcribe in {LANG}, digits only, no newlines" works well. Multi-turn conversations use the standard roles, with non-text modalities always placed before text, as shown in the sketch below.
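Putting the pieces together, here is a sketch of an image-plus-text call following that pattern. The checkpoint name is hypothetical, and the AutoModelForMultimodalLM class and sampling values are taken as stated in this section:

```python
# Sketch: image + text generation (non-text entries go before the text).
# Model ID is a placeholder until the actual release is confirmed.
from transformers import AutoProcessor, AutoModelForMultimodalLM

model_id = "google/gemma-4-26b-a4b-it"  # hypothetical checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "invoice.png"},  # non-text first
            {"type": "text", "text": "Summarize this invoice in one sentence."},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Sampling settings recommended in this section.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)
print(processor.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```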
Safety Gains with Provenance Focus
Trained on web text, code, images, and audio (cutoff January 2025) with deduplication, filtering, and PII removal. Safety evaluations match Gemini's: minimal violations on text and image policies, and lower unjustified-refusal rates than Gemma 3. Development aligns with Google's AI Principles; risks such as bias and hallucination are mitigated through evaluations and partnerships. Intended uses cover reasoning, coding, agents, and multimodal applications; limitations include no native video/audio support on the larger models and the ethical issues common to VLMs (fairness, misuse). The models are Apache 2.0 licensed for enterprise and on-device deployment.