Gemma 4: Open-Source LLMs Run Offline on Phones
Google's Gemma 4 family delivers frontier-quality AI locally, on phones and on an $80 Raspberry Pi, under an Apache 2 license. The flagship ranks #3 among open models (Elo 1452) with a 4.3x gain on competition math, cutting API costs and vendor lock-in.
Gemma 4 Closes Open-Source Performance Gap
Google's Gemma 4 family includes four multimodal models: E2B (effective 2B parameters; fits on phones), E4B (effective 4B), a 26B MoE (25B total parameters but only 4B active per token, for efficiency), and a 31B dense flagship. All four handle text, images, and audio (the two larger models add video), offer a 256K-token context window, support native function calling via special tokens, and include built-in step-by-step reasoning. The Apache 2 license allows full commercial use, modification, and fine-tuning without restriction, unlike prior Gemma versions.
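The MoE efficiency claim can be sanity-checked with the common rule of thumb that decode compute per token is roughly 2 FLOPs per active parameter. A back-of-envelope sketch using the parameter counts above, not a measurement:

```python
# Back-of-envelope: decode compute per token ~= 2 * active parameters (FLOPs).
# Parameter counts come from the article; the 2x rule is a rough approximation.

def flops_per_token(active_params_billion: float) -> float:
    """Approximate forward-pass FLOPs per generated token."""
    return 2 * active_params_billion * 1e9

dense_31b = flops_per_token(31.0)  # dense flagship: all 31B weights active
moe_26b = flops_per_token(4.0)     # MoE: only ~4B of the total weights active per token

print(f"dense 31B : {dense_31b:.1e} FLOPs/token")
print(f"MoE 26B   : {moe_26b:.1e} FLOPs/token")
print(f"MoE needs ~{dense_31b / moe_26b:.1f}x less compute per token")
```

By this estimate the MoE does roughly an eighth of the flagship's per-token work, which is why it can approach flagship quality on laptop-class hardware.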
On the Arena leaderboard, the 31B model ranks #27 overall (Elo 1452) but #3 among open models, ahead of Llama (the biggest ecosystem), Qwen (200+ languages), and DeepSeek (the top SWE-bench coder). The 26B MoE ranks #6 open-source (Elo 1441) despite roughly 10x fewer active parameters. On benchmarks, the 31B scores 89.2% on AIME 2026 math (4.3x Gemma 3's 20.8%), 80% on LiveCodeBench coding, and 84.3% on GPQA Diamond science. At the edge, E4B hits 42.5% AIME and 52% LiveCodeBench on a T4 GPU; E2B reaches 37.5% AIME on phones. That shrinks the open-closed gap to roughly 90% of closed-model capability for most tasks, making local runs a viable alternative to paid APIs.
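The headline 4.3x math gain follows directly from the two AIME scores quoted above; a quick arithmetic check:

```python
# Verify the quoted improvement: Gemma 4 31B vs Gemma 3 on AIME 2026.
# Both scores are taken from the article.
gemma4_aime = 89.2
gemma3_aime = 20.8

gain = gemma4_aime / gemma3_aime
print(f"{gain:.1f}x improvement")  # ~4.3x, matching the headline figure
```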
Edge Efficiency Enables New Apps
E2B runs in under 1.5GB of RAM when quantized, delivering 133 prefill and 7.6 decode tokens/sec on the $80 Raspberry Pi 5's CPU: it reads prompts near-instantly and writes about 8 words per second. On a Qualcomm Snapdragon NPU it hits 3,700 prefill and 31 decode tokens/sec, real-time chat speed, 4x faster than Gemma 3 with 60% less battery drain. No internet connection, zero data leakage, unlimited use.
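Those throughput figures translate into concrete end-to-end latency. A sketch using the article's Pi 5 and Snapdragon numbers; the 500-token prompt and 200-token reply are illustrative assumptions:

```python
# End-to-end latency estimate: time to ingest a prompt plus time to generate
# a reply, from the prefill/decode throughput figures in the article.
devices = {
    "Raspberry Pi 5 CPU": {"prefill": 133, "decode": 7.6},
    "Snapdragon NPU": {"prefill": 3700, "decode": 31},
}
prompt_tokens, reply_tokens = 500, 200  # assumed workload sizes

for name, tps in devices.items():
    total = prompt_tokens / tps["prefill"] + reply_tokens / tps["decode"]
    print(f"{name}: ~{total:.0f}s end-to-end")
```

Decode speed dominates: even with its fast prefill, the Pi spends nearly all of its ~30 seconds generating, while the NPU finishes the same exchange in under 7.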
Builders are already shipping: a browser vision app that pairs Roboflow RFDeer object detection with Gemma describing scenes as a medieval bard, running via WebGPU and transformers.js; the Envision accessibility app, which describes scenes on-device for blind users; and fully local agents that browse the web, manage files, execute code, and chain workflows. Running the laptop-friendly 26B MoE gets flagship-class quality at edge costs.
Install Locally and Weigh Trade-offs
Use Ollama: download it from ollama.com and run ollama pull gemma4:26b (about a 6-minute install). Test it in the Ollama app; asked to contrast MoE with dense models, it explains that a dense model activates every parameter on each token (cost scales with full size), while an MoE gates each token to a subset of experts, giving a larger effective model at lower compute. It integrates with OpenClaw for local agents (web, file, and code tasks). Hugging Face offers a browser WebGPU demo with no install.
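Beyond the CLI, Ollama exposes a local REST API (default port 11434), which is how scripts and agents talk to the model. A minimal sketch using only the standard library; the gemma4:26b tag mirrors the pull command above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> dict:
    # stream=False returns one JSON object instead of a stream of token chunks
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the reply text."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]  # non-streaming replies put text here

# Requires a running Ollama server with the model already pulled:
# print(ask("gemma4:26b", "Explain MoE vs dense in two sentences."))
```

Everything stays on localhost, which is the point: the same loop that would hit a paid API now runs with no network egress at all.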
Limitations: the edge E2B/E4B models are weak on complex reasoning, deep coding, and large documents; use the 31B or a closed model instead. Quantization (4-bit or 2-bit for devices) drops quality below the full-precision benchmark numbers. The release is four days old and lacks Llama's fine-tunes, adapters, and ecosystem. Video input is limited to the 26B and 31B. Twenty-six models still rank above the 31B overall; the gap is shrinking but persists for peak tasks. Ideal for simple, local, or offline work; pair with closed models for the heavy lifts.