Gemma 4: Elite Open Performance at 31B Params
Google's Gemma 4 31B dense model ranks #3 on the Arena leaderboard (ELO ~1452), matching Qwen 3.5's intelligence at roughly one-tenth the size, and runs on consumer GPUs for agents and edge devices.
Unmatched Efficiency: High ELO and Arena Ranking in Small Models
Gemma 4 delivers top-tier intelligence per parameter: on ELO-versus-size charts it plots high on the ELO axis while staying far left on the parameter axis (total params in billions). The 31B dense model and the 26B MoE (4B active params) rival Qwen 3.5's 397B (17B active) and outperform DeepSeek V3.2 and GPT-O OSS, yet remain runnable on upper-mid-range consumer GPUs, with no need for GB300-scale hardware. This shift toward smaller, faster open models supports hybrid setups: frontier hosted models for the toughest tasks, edge compute for most workloads. On the Arena text leaderboard, the 31B ranks #3 worldwide (score 1452), trailing only the massive GLM-5 and Kimi K2.5 while enabling local runs where those giants cannot fit.
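As a rough sanity check on the consumer-GPU claim, weight memory scales linearly with parameter count and bits per parameter. A minimal sketch (the quantization levels are standard practice; the exact fit on any given card also depends on KV cache and activations, which this ignores):

```python
def weight_memory_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone (no KV cache or activations)."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

# 31B dense model at common quantization levels
for bits, label in [(16, "fp16/bf16"), (8, "int8"), (4, "int4")]:
    print(f"{label}: ~{weight_memory_gb(31, bits):.0f} GB")
# fp16/bf16: ~62 GB, int8: ~31 GB, int4: ~16 GB
```

At 4-bit quantization the 31B's weights take roughly 16 GB, which leaves room for KV cache on a 24 GB consumer card; at full bf16 it would need datacenter-class memory.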
Benchmarks reinforce this: MMLU 85.2%, AIME 2026 89% (frontier models near 100%), LiveCodeBench 80%, T2Bench 86%, GPQA Diamond 84.3%. The 31B also scores perfectly on ToolCall-15, indicating reliable function calling for agents.
Edge-Optimized Variants for Multimodal Agents
Four sizes target diverse hardware: the effective-2B (E2B) and effective-4B (E4B) variants use per-layer embeddings (PLE), small token-specific lookup tables that keep the inference footprint tiny for mobile targets (phones, Raspberry Pi, Nvidia Jetson, Orin Nano). Developed with the Pixel, Qualcomm, and MediaTek teams, they run fully offline with no network latency and handle native audio plus video/images (variable resolution, OCR, charts). The larger 26B MoE and 31B dense add 256K context (128K on edge variants), native function calling, JSON outputs, and system prompts for autonomous agents interacting with tools and APIs.
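The exact chat template is model-specific, but when a model with native function calling is served behind an OpenAI-compatible endpoint (as vLLM, Ollama, and LM Studio expose), the agent side typically looks like this sketch. The tool name, model id, and prompts are illustrative assumptions, not documented Gemma 4 identifiers:

```python
import json

# Illustrative tool definition in the common OpenAI-compatible format.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for this sketch
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Request body a local agent would POST to /v1/chat/completions.
request = {
    "model": "gemma-4-31b",  # hypothetical local model id
    "messages": [
        {"role": "system", "content": "You are a weather agent."},
        {"role": "user", "content": "What's the weather in Oslo?"},
    ],
    "tools": [get_weather],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(json.dumps(request, indent=2))
```

The server would answer with a `tool_calls` entry containing JSON arguments, which the agent executes and feeds back as a `tool` message; a perfect ToolCall-style score means those arguments reliably parse and match the schema.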
Gemma 4 supports code generation as a local assistant (though hosted models like GPT-5/o1 remain best for serious coding), multi-step reasoning, math, and instruction following. The Apache 2.0 license permits commercial use; the models are available on Hugging Face, vLLM, llama.cpp, MLX, Ollama, Nvidia NIMs, LM Studio, and Unsloth for fine-tuning and inference.
Trade-offs: Strong on Agents, Short on Context
Gemma 4 excels in agentic workflows and logic but lags on context: its 128K/256K windows fall short of the longer windows that long-document tasks demand. Prioritize it for on-device agents over chat, and pair it with hosted models for peak code and math performance. Download and test it: a 31B model that matches trillion-parameter giants on consumer hardware accelerates local-first AI products.