Gemma 4: Apache 2.0 Multimodal Models for Any Use

Google's Gemma 4 release delivers four models under a true Apache 2.0 license, with native vision, audio, reasoning, and function calling. They can be run commercially on edge devices or workstations without restrictions.

Apache 2.0 License Enables Unrestricted Commercial Deployment

Gemma 4's standout feature is its pure Apache 2.0 license, allowing full modification, fine-tuning, and commercial deployment without custom restrictions like non-compete clauses. This addresses past frustrations with Gemma 3's more limited license and positions the family competitively against Llama and Qwen. Built on Gemini 3 research, these models bring flagship innovations into open weights, letting builders ship production AI features like local coding assistants or on-device agents without legal hurdles.

Native Multimodality and Reasoning Boost Agentic Workflows

All four models integrate vision, audio, long chain-of-thought reasoning, and function calling at the architecture level rather than as bolted-on prompts. Reasoning spans text, images, and audio (audio on the edge models), lifting benchmarks like MMMU Pro and SWE-bench Pro. Function calling supports multi-turn agentic flows with multiple tools, outperforming instruction-following workarounds. The edge models (E2B, E4B) handle ASR, speech-to-text translation (e.g., English to Japanese), and interleaved multi-image inputs for video or OCR. The workstation models excel at code generation, completion, and correction, with pre-training across 140 languages and fine-tuning across 35.
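As a minimal sketch of what tool-driven agentic flows look like in practice, here is the Hugging Face transformers tool-calling chat-template pattern. The checkpoint name is a hypothetical placeholder (actual Gemma 4 repo names may differ), and get_weather is a toy tool defined purely for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_weather(city: str) -> str:
    """Return the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny, 24 degrees C"  # stub: a real tool would call a weather API

model_id = "google/gemma-4-e4b-it"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What's the weather in Tokyo right now?"}]

# apply_chat_template renders the tool schema into the prompt; the model is
# expected to reply with a structured tool call, whose result you append as a
# "tool" message before generating the next turn.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```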

Optimized Architectures for Edge Efficiency and Workstation Power

Workstation tier: a 31B dense model (fewer layers, value normalization, and attention optimized for 256K context) and a 26B MoE (128 small experts plus a shared expert, with roughly 3.8B-4B parameters active, approaching 27B-class intelligence at 4B-class compute cost). Both support 256K context and native aspect-ratio vision encoders for document understanding.

Edge tier: E2B and E4B with 128K context, a compressed audio encoder (305M parameters, 87MB, down from the prior 681M/390MB, with 40ms frames for responsive transcription), and a 150M vision encoder (down from 300-350M). QAT checkpoints preserve quality at low precision. Run the edge models on T4 GPUs or phones, and the workstation models on H100 or RTX 6000 GPUs, or on serverless Cloud Run with G4 GPUs (96GB VRAM).
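As a rough sense of how the two tiers map to that hardware, here is a loading sketch with transformers. Both repo names are assumptions modeled on Gemma 3's naming convention, not confirmed identifiers.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical repo names, modeled on Gemma 3's naming convention.
EDGE_ID = "google/gemma-4-e4b-it"         # edge tier: ~4B-class, 128K context
WORKSTATION_ID = "google/gemma-4-31b-it"  # workstation tier: dense 31B, 256K context

# fp16 keeps the E4B model within a 16GB T4's VRAM budget; device_map="auto"
# spreads the weights across whatever accelerators are visible.
edge_model = AutoModelForCausalLM.from_pretrained(
    EDGE_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

# The 31B dense model wants H100 / RTX 6000-class VRAM in bf16; the QAT
# checkpoints mentioned above would be the route to lower-precision serving.
workstation_model = AutoModelForCausalLM.from_pretrained(
    WORKSTATION_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```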

Hands-On Usage Yields Immediate Results

Enable thinking via the chat template (enable_thinking=true) for better outputs on tasks like deep-learning use cases in finance. Process images and videos with AutoProcessor for detailed scene breakdowns (e.g., "girl on beach with dog"). The audio demos transcribe two overlapping speakers accurately and translate speech end-to-end. Base and instruction-tuned versions on Hugging Face suit fine-tuning; expect strong results from the solid base models, a clear step up from Gemma 3's 32K context, older encoders, and text/vision-only scope.
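A sketch of that demo flow, assuming a hypothetical checkpoint name and the standard transformers multimodal chat-template API; enable_thinking is passed through apply_chat_template as a template flag, as described above, and the image URL is a placeholder.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-4-e4b-it"  # hypothetical checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/beach.jpg"},  # placeholder
        {"type": "text", "text": "Describe this scene in detail."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # template flag: emit a reasoning block before the answer
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```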

Video description
In this video, we look at the launch of the Gemma 4 family of models: four models, two small and two larger, which combine multilingual and multimodal features.

Blog: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
Colab: https://dripl.ink/EYT7h
HF Collection: https://huggingface.co/collections/google/gemma-4
Twitter: https://x.com/Sam_Witteveen

🕵️ Interested in building LLM Agents? Fill out the form below.
Building LLM Agents Form: https://drp.li/dIMes

👨‍💻 Github: https://github.com/samwit/llm-tutorials

⏱️ Time Stamps:
00:00 Intro
00:15 Gemma 4 License
00:55 Quick Orientation
01:01 Gemma 4: 2 Model Tiers
03:05 Gemma 4: Thinking, Audio, Image & Video, Function Calling
03:27 Reasoning
04:06 Function Calling
04:56 Audio Support
05:38 Image and Video
06:23 Model Comparison
06:47 Gemma 4 Model Sizes
07:59 Workstation models
09:27 Edge models
11:30 Demo
16:40 Gemma 4 Availability

Summarized by x-ai/grok-4.1-fast via openrouter
