Gemma 4: Efficient Architectures Power Top Small Open Models
Gemma 4's 2B-31B models outperform prior open models with interleaved local/global attention, mixture-of-experts (the 26B activates 3.9B params), per-layer embeddings (PLE) for on-device use, and native multimodal support, with the large models ranking in LMSYS Arena's top 6 under an Apache 2.0 license.
Model Sizes and Capabilities Set New Benchmarks for Open Efficiency
Gemma 4 launches four variants optimized for distinct use cases: an effective 2B (2.3B active params, 5.1B representational) and a 4B for on-device text/vision/audio on phones and laptops; a 26B MoE (3.9B active params, selecting 8 of 128 experts per pass) for efficient inference; and a 31B dense model for advanced reasoning. The 31B ranks #3 on global AI leaderboards, outperforming models 20x larger, and both large models sit in LMSYS Arena's top 6. All variants support 256k context, native function calling, structured JSON output, and agentic workflows. The switch to an Apache 2.0 license enables a seamless development cycle from prototyping to deployment, with weights downloadable from Hugging Face, Kaggle, or Ollama, or hosted in the cloud via AI Studio and Vertex AI.
The small models excel on coding, multilingual, and multimodal benchmarks, surpassing Gemma 3 by wide margins. The effective 2B and 4B handle vision, text, and audio inputs with text outputs, making them ideal for speech recognition and translation without API costs.
Attention Optimizations Balance Speed and Context
Dense models (31B, effective 2B/4B) use a 5:1 local-to-global attention ratio (4:1 in the 2B), with sliding windows of 512 tokens (small models) or 1024 (large) in local layers and each block ending on a global layer that attends to all prior tokens. Grouped Query Attention (GQA) groups 2 queries per KV head in local layers (256 dim) and 8 in global layers (doubled to 512 dim), cutting memory costs while preserving performance and enabling efficient long-context reasoning without full recompute overhead.
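To make the interleaving concrete, here is a minimal PyTorch sketch: local layers apply a causal sliding-window mask, the periodic global layer applies a full causal mask, and GQA shares each KV head across a group of query heads. The window length, head counts, and dimensions are toy values, not Gemma 4's exact configuration.

```python
import torch

def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask: token i attends to tokens in (i - window, i]."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

def causal_mask(seq_len):
    """Full causal mask: token i attends to every token j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return j <= i

def gqa_attention(q, k, v, mask):
    """Grouped Query Attention: more query heads than KV heads; each KV head
    is shared by a contiguous group of query heads."""
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)            # align KV heads with query groups
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy 5:1 interleaving: five sliding-window layers, then one global layer.
seq_len, window = 64, 16
layer_masks = [sliding_window_mask(seq_len, window)] * 5 + [causal_mask(seq_len)]

q = torch.randn(1, 8, seq_len, 32)   # 8 query heads
k = torch.randn(1, 4, seq_len, 32)   # 4 KV heads -> 2 query heads per KV head
v = torch.randn(1, 4, seq_len, 32)
for mask in layer_masks:
    out = gqa_attention(q, k, v, mask)
print(out.shape)  # torch.Size([1, 8, 64, 32])
```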
In the 26B's MoE (OURE) architecture, a shared expert (3x the size of a regular expert) works alongside a router that selects 8 of 128 small FFN experts per pass, matching 31B-level performance at far lower active parameter counts for scalable inference.
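A rough sketch of this routing pattern, assuming standard top-k token routing plus an always-active shared expert; expert widths and model dimensions are illustrative, not the 26B's actual shapes.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy MoE layer: a router picks k of n_experts small FFNs per token, and a
    larger always-active shared expert is added on top (reflecting the
    description above; all sizes here are illustrative)."""
    def __init__(self, d_model=64, d_ff=128, n_experts=128, k=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # Shared expert, ~3x the width of a regular expert.
        self.shared = nn.Sequential(
            nn.Linear(d_model, 3 * d_ff), nn.GELU(), nn.Linear(3 * d_ff, d_model)
        )
        self.k = k

    def forward(self, x):                                  # x: (tokens, d_model)
        weights = torch.softmax(self.router(x), dim=-1)
        top_w, top_idx = weights.topk(self.k, dim=-1)      # (tokens, k)
        routed = []
        for t in range(x.shape[0]):                        # naive per-token routing loop
            routed.append(sum(w * self.experts[int(i)](x[t])
                              for w, i in zip(top_w[t], top_idx[t])))
        return self.shared(x) + torch.stack(routed)

layer = TopKMoE()
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64]); only 8 of 128 experts ran per token
```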
Per-Layer Embeddings and Multimodality Drive On-Device Gains
Effective models use Per-Layer Embeddings (PLE): standard token embeddings (1536 dim in the 2B, 2560 in the 4B) are paired with 256-dim per-layer tables (35 layers in the 2B, 42 in the 4B) that are stored in flash memory rather than VRAM and projected up at the end of each layer, easing on-device memory bottlenecks and boosting inference speed.
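A simplified sketch of the PLE idea under these assumptions: each layer owns a small 256-dim table that could be memory-mapped from flash, and its lookup is projected up to the model width and folded into the hidden state when that layer runs (simple addition is an assumption here, not the documented merge rule).

```python
import torch
import torch.nn as nn

# Toy sizes echoing the 4B description: 2560-wide model, 256-dim PLE, 42 layers.
VOCAB, D_MODEL, D_PLE, N_LAYERS = 32_000, 2560, 256, 42   # toy vocab size

tok_emb = nn.Embedding(VOCAB, D_MODEL)                    # standard embeddings (kept in VRAM)
# One small table per layer; in Gemma 4 these live in flash rather than VRAM
# and are only materialized when their layer needs them.
ple_tables = [nn.Embedding(VOCAB, D_PLE) for _ in range(N_LAYERS)]
up_projs = [nn.Linear(D_PLE, D_MODEL) for _ in range(N_LAYERS)]

ids = torch.randint(0, VOCAB, (1, 8))                     # a toy batch of token ids
h = tok_emb(ids)
for layer_idx in range(N_LAYERS):
    # ...the transformer layer would run on h here...
    # Fold in this layer's per-layer embedding, projected up to the model width.
    h = h + up_projs[layer_idx](ple_tables[layer_idx](ids))
print(h.shape)  # torch.Size([1, 8, 2560])
```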
Vision (all models) supports variable aspect ratios and resolutions across five token budgets (up to 1120 soft tokens), pooling 3x3 grids of 16x16 patches into soft-token embeddings; a 280-token budget, for example, covers 2520 patches. This avoids Gemma 3's pan-and-scan approach by preserving spatial positions, suiting OCR and object detection (high resolution) or text-heavy apps (low resolution). The encoders are 550M params in the large models and 150M in the small ones.
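The patch arithmetic and pooling can be sketched as follows; mean pooling and the embedding width are assumptions, since only the 3x3 grouping and the token budgets are stated above.

```python
import torch
import torch.nn as nn

# Each soft token summarizes a 3x3 grid of 16x16-pixel patches, so a 280-token
# budget covers 280 * 9 = 2520 patches.
print(280 * 9)  # 2520

# Sketch of the pooling step, assuming mean pooling over each 3x3 patch grid
# (the exact pooling operator isn't specified above); widths are illustrative.
d_vision = 1152                                   # patch-embedding width (assumed)
patch_grid = torch.randn(1, d_vision, 48, 48)     # 48x48 grid of patch embeddings
pool = nn.AvgPool2d(kernel_size=3, stride=3)      # each 3x3 grid -> one soft token
soft_tokens = pool(patch_grid).flatten(2).transpose(1, 2)
print(soft_tokens.shape)  # torch.Size([1, 256, 1152]) -> 256 soft tokens
```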
Audio (effective models) uses a 35M-parameter conformer encoder: raw audio is converted to a mel spectrogram, then convolutionally downsampled to roughly n/4 soft tokens, enabling translation and speech recognition without sequential processing.
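A minimal sketch of that front end, assuming torchaudio's mel spectrogram transform and two stride-2 convolutions for the 4x downsampling; the conformer blocks themselves are omitted and the channel widths are illustrative.

```python
import torch
import torch.nn as nn
import torchaudio

# Raw audio -> mel spectrogram -> convolutional downsampling, so n spectrogram
# frames become roughly n/4 soft tokens. Mel settings are illustrative, not Gemma 4's.
waveform = torch.randn(1, 16_000)                 # one second of 16 kHz audio
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16_000, n_mels=80)(waveform)
frames = mel.transpose(1, 2)                      # (batch, n_frames, n_mels)

downsample = nn.Sequential(                       # two stride-2 convs -> ~n/4 frames
    nn.Conv1d(80, 256, kernel_size=3, stride=2, padding=1), nn.GELU(),
    nn.Conv1d(256, 256, kernel_size=3, stride=2, padding=1), nn.GELU(),
)
soft_tokens = downsample(frames.transpose(1, 2)).transpose(1, 2)
print(frames.shape[1], "frames ->", soft_tokens.shape[1], "soft tokens")
```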
Practical Deployment Trade-offs
On-device effective models prioritize flash and VRAM efficiency for local runs, trading some representational parameters for speed, while the large models favor reasoning and coding via dense depth or MoE sparsity. Developers can allocate image tokens dynamically (e.g., high budgets for spatial tasks), test agentic flows in the cloud, then quantize for the edge, yielding production-ready open systems that rival closed giants at sub-31B scale.
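As one concrete example of the last step, here is a hedged sketch of loading a checkpoint with 4-bit quantization via transformers and bitsandbytes; the model id is a placeholder, not a confirmed repository name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-4-4b-it"   # placeholder id; substitute the published checkpoint name

# 4-bit NF4 quantization via bitsandbytes: a common way to shrink a small model
# onto edge-class GPUs after validating behavior at full precision.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
inputs = tokenizer("List three on-device use cases for a 4B multimodal model.",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```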