Gemma 4 Revives US Open-Weight Edge

Google's Gemma 4 delivers competitive 31B dense and 26B MoE models under Apache 2.0 for self-hosting on single GPUs, targeting privacy-focused enterprises as hosted-API run-rates reach $30B.

Gemma 4 Benchmarks and Deployment Specs

Google DeepMind's Gemma 4 family ships under Apache 2.0, following 400M+ downloads for the Gemma line: E2B/E4B edge models (512K-token window), a 31B dense model (30.7B effective params, 1,024K-token window), and a 26B A4B MoE (25.2B total params, 3.8B active per token, 8 of 128 experts routed per token). The 31B scores 1,452 on the Arena text leaderboard, 84.3% on GPQA Diamond, 89.2% on AIME 2026, 80.0% on LiveCodeBench v6, 76.9% on MMMU Pro, 86.4% on Tau2-bench retail (vs. 6.6% for Gemma 3 27B), and 19.5% on Humanity's Last Exam (26.5% with search). The 26B trails slightly: 1,441 Arena, 82.3% GPQA, 88.3% AIME, 77.1% LiveCodeBench. On the Artificial Analysis Intelligence Index, the 31B scores 39, trailing Qwen 3.5 27B's 42 while using 2.5x fewer tokens; it is strong on SciCode (43) and TerminalBench Hard (36) but weaker on agentic tasks.
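
To make the MoE arithmetic concrete, here is a minimal top-k routing sketch in PyTorch. It is illustrative only, not Gemma 4's actual architecture; the class name, dimensions, and expert layout are assumptions. The point is that routing 8 of 128 experts per token means only a small fraction of total parameters runs on each forward step.

```python
# Illustrative top-k MoE routing (not Gemma 4's actual code; dims are toy-sized).
# With k=8 of n_experts=128 routed per token, only 1/16 of expert FFN parameters
# are active per token. Shared layers (attention, embeddings) always run, which
# is why Gemma 4's active count (3.8B) is above 25.2B/16.
import torch

class TopKMoE(torch.nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=128, k=8):
        super().__init__()
        self.router = torch.nn.Linear(d_model, n_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, d_ff),
                torch.nn.GELU(),
                torch.nn.Linear(d_ff, d_model),
            )
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # pick top-8 experts per token
        weights = weights.softmax(dim=-1)           # normalize gate weights
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                  # naive per-token dispatch
            for j in range(self.k):
                e = int(idx[t, j])
                out[t] += weights[t, j] * self.experts[e](x[t])
        return out

moe = TopKMoE()
y = moe(torch.randn(4, 256))  # only 8 of 128 expert FFNs run per token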

The architecture uses hybrid sliding-window/global attention and Proportional RoPE. The release supports a configurable thinking mode, native system-role prompting, function calling with dedicated tool tokens, and text/image input (video/audio on the edge models), with concrete guidance in the docs: validate function arguments, summarize reasoning for agents, and strip thought traces from context. Edge models run on phones, Raspberry Pi, and Jetson; the larger models run on consumer GPUs or an H100, with the full MoE loading into memory. Day-one ecosystem support spans Hugging Face, Ollama, vLLM, llama.cpp, and more. On Android it serves as the base for Gemini Nano 4, running 4x faster with 60% less battery use.
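
As a concrete illustration of that docs guidance, the sketch below validates tool-call arguments against a JSON Schema and strips reasoning traces before history is re-sent. The schema, field names, and helpers are hypothetical, not Gemma 4's API; it assumes the model emits JSON arguments and a separate "thought" field on messages.

```python
# Minimal sketch of the docs' guidance (hypothetical helpers, not Gemma 4 API):
# validate function-call arguments before executing, strip thought traces
# before feeding conversation history back to the model.
import json
import jsonschema  # third-party: pip install jsonschema

WEATHER_SCHEMA = {  # hypothetical tool schema for illustration
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
    "additionalProperties": False,
}

def validate_tool_call(raw_args: str, schema: dict) -> dict:
    """Parse model-emitted JSON args and reject malformed calls early."""
    args = json.loads(raw_args)        # raises on invalid JSON
    jsonschema.validate(args, schema)  # raises on schema violations
    return args

def strip_thoughts(messages: list[dict]) -> list[dict]:
    """Drop reasoning-trace fields before re-sending conversation history."""
    return [{k: v for k, v in m.items() if k != "thought"} for m in messages]

args = validate_tool_call('{"city": "Oslo", "unit": "celsius"}', WEATHER_SCHEMA)
```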

Open Weights Niche Sharpens as Hosted APIs Explode

Gemma 4 counters Chinese MoE dominance (trillion-parameter models that are hard to self-host) with US-origin models for air-gapped, edge, and regulated use cases that need data locality, inspectability, and customization. Anthropic's run-rate hit $30B (30x the $1B of Dec 2024), signaling a market shift to APIs and enterprise agents; clearer lab policies are easing enterprise adoption. The likely future is hybrid: frontier APIs for peak capability, open weights for privacy and cost control. The bar for fine-tuning remains high (prompting and retrieval still dominate), but Gemma raises the floor for open customization.

Other releases: Z.ai's GLM-5V-Turbo (multimodal coding, 200K context, MTP architecture); Microsoft's MAI-Transcribe-1 (25 languages), MAI-Voice-1 (one minute of audio generated per second), and MAI-Image-2; OpenAI's $122B round ($852B valuation, $2B/month revenue); Cursor 3 (multi-workspace agents, Design Mode UI annotation, parallel model runs); Alibaba's Qwen3.6-Plus (1M context, agentic coding positioned against Claude 4.5); and Google's Veo 3.1 Lite (50% cheaper video, 720p/1080p, 4-8s clips).

RAG Tuning: 10-20% Chunk Overlap Boosts Recall

Set chunk overlap to 10-20% of chunk size so that context spanning chunk boundaries is captured, avoiding incomplete retrievals that degrade answers. Zero overlap splits explanations across chunks and misses them at retrieval time; excessive overlap bloats the index and slows queries. Test recall on domain-specific queries before scaling.
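
A minimal sketch of the idea, with illustrative sizes (the 500-word chunk, word-based splitting, and file path are assumptions, not from the article):

```python
# Fixed-size chunking with fractional overlap. At 15% overlap, each 500-word
# chunk shares ~75 words with its neighbor, so an explanation split at a
# boundary still appears whole in at least one chunk.
def chunk_text(text: str, chunk_size: int = 500, overlap_pct: float = 0.15):
    """Split text into ~chunk_size-word chunks with overlap_pct overlap."""
    assert 0.0 <= overlap_pct < 1.0
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_pct)))  # advance per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last chunk reached the end
            break
    return chunks

# Usage (hypothetical file): index these chunks, then measure recall on a
# held-out set of domain queries before committing to the settings.
docs = chunk_text(open("manual.txt").read(), chunk_size=500, overlap_pct=0.15)
```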

Summarized by x-ai/grok-4.1-fast via openrouter

8588 input / 1767 output tokens in 14029ms

© 2026 Edge