Gemma 4: Apache 2.0 Multimodal Models for Any Use

Google's Gemma 4 release delivers four models under a true Apache 2.0 license, with native vision, audio, reasoning, and function calling. They can be run commercially on edge devices or workstations without restrictions.

Apache 2.0 License Enables Unrestricted Commercial Deployment

Gemma 4's standout feature is its pure Apache 2.0 license, allowing full modification, fine-tuning, and commercial deployment without custom restrictions like non-compete clauses. This addresses past frustrations with Gemma 3's more limited license and positions the family competitively against Llama and Qwen. Built on Gemini 3 research, these models bring flagship innovations into open weights, letting builders ship production AI features like local coding assistants or on-device agents without legal hurdles.

Native Multimodality and Reasoning Boost Agentic Workflows

All four models integrate vision, audio, long chain-of-thought reasoning, and function calling at the architecture level rather than as bolted-on prompts. Reasoning spans text, images, and audio (audio on the edge models), lifting benchmarks like MMMU Pro and SWE-bench Pro. Function calling supports multi-turn agentic flows with multiple tools, outperforming instruction-following workarounds. The edge models (E2B, E4B) handle ASR, speech-to-text translation (e.g., English to Japanese), and interleaved multi-image inputs for video or OCR. The workstation models excel at code generation, completion, and correction, with pre-training across 140 languages and fine-tuning across 35.
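As a minimal sketch of what tool-driven agentic flows look like in practice, here is the Hugging Face transformers tool-calling chat-template pattern. The checkpoint name is a hypothetical placeholder (actual Gemma 4 repo names may differ), and get_weather is a toy tool defined purely for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_weather(city: str) -> str:
    """Return the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny, 24 degrees C"  # stub: a real tool would call a weather API

model_id = "google/gemma-4-e4b-it"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What's the weather in Tokyo right now?"}]

# apply_chat_template renders the tool schema into the prompt; the model is
# expected to reply with a structured tool call, whose result you append as a
# "tool" message before generating the next turn.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```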

Optimized Architectures for Edge Efficiency and Workstation Power

Workstation tier: a 31B dense model (fewer layers, value normalization, and attention optimized for 256K context) and a 26B MoE (128 small experts plus a shared expert, with roughly 3.8B-4B parameters active, approaching 27B-class intelligence at 4B-class compute cost). Both support 256K context and native aspect-ratio vision encoders for document understanding.

Edge tier: E2B and E4B with 128K context, a compressed audio encoder (305M parameters, 87MB, down from the prior 681M/390MB, with 40ms frames for responsive transcription), and a 150M vision encoder (down from 300-350M). QAT checkpoints preserve quality at low precision. Run the edge models on T4 GPUs or phones, and the workstation models on H100 or RTX 6000 GPUs, or on serverless Cloud Run with G4 GPUs (96GB VRAM).
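As a rough sense of how the two tiers map to that hardware, here is a loading sketch with transformers. Both repo names are assumptions modeled on Gemma 3's naming convention, not confirmed identifiers.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical repo names, modeled on Gemma 3's naming convention.
EDGE_ID = "google/gemma-4-e4b-it"         # edge tier: ~4B-class, 128K context
WORKSTATION_ID = "google/gemma-4-31b-it"  # workstation tier: dense 31B, 256K context

# fp16 keeps the E4B model within a 16GB T4's VRAM budget; device_map="auto"
# spreads the weights across whatever accelerators are visible.
edge_model = AutoModelForCausalLM.from_pretrained(
    EDGE_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

# The 31B dense model wants H100 / RTX 6000-class VRAM in bf16; the QAT
# checkpoints mentioned above would be the route to lower-precision serving.
workstation_model = AutoModelForCausalLM.from_pretrained(
    WORKSTATION_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```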

Hands-On Usage Yields Immediate Results

Enable thinking via the chat template (enable_thinking=true) for better outputs on tasks like deep-learning use cases in finance. Process images and videos with AutoProcessor for detailed scene breakdowns (e.g., "girl on beach with dog"). The audio demos transcribe two overlapping speakers accurately and translate speech end-to-end. Base and instruction-tuned versions on Hugging Face suit fine-tuning; expect strong results from the solid base models, a clear step up from Gemma 3's 32K context, older encoders, and text/vision-only scope.
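A sketch of that demo flow, assuming a hypothetical checkpoint name and the standard transformers multimodal chat-template API; enable_thinking is passed through apply_chat_template as a template flag, as described above, and the image URL is a placeholder.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-4-e4b-it"  # hypothetical checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/beach.jpg"},  # placeholder
        {"type": "text", "text": "Describe this scene in detail."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # template flag: emit a reasoning block before the answer
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```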

Video description
In this video, we look at the launch of the Gemma 4 family of models: four models, two small and two larger, which combine multilingual and multimodal features.

Blog: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
Colab: https://dripl.ink/EYT7h
HF Collection: https://huggingface.co/collections/google/gemma-4
Twitter: https://x.com/Sam_Witteveen

🕵️ Interested in building LLM Agents? Fill out the form below.
Building LLM Agents Form: https://drp.li/dIMes

👨‍💻 Github: https://github.com/samwit/llm-tutorials

⏱️ Time Stamps:
00:00 Intro
00:15 Gemma 4 License
00:55 Quick Orientation
01:01 Gemma 4: 2 Model Tiers
03:05 Gemma 4: Thinking, Audio, Image & Video, Function Calling
03:27 Reasoning
04:06 Function Calling
04:56 Audio Support
05:38 Image and Video
06:23 Model Comparison
06:47 Gemma 4 Model Sizes
07:59 Workstation models
09:27 Edge models
11:30 Demo
16:40 Gemma 4 Availability

Summarized by x-ai/grok-4.1-fast via openrouter
