Quantization-Aware Training (QAT) vs. Post-Training Quantization

Google DeepMind has released QAT checkpoints for the Gemma 4 model family, designed to improve performance on edge devices and consumer GPUs. Standard Post-Training Quantization (PTQ) compresses a finished model and often degrades its quality. QAT instead simulates quantization during training, so the model learns to compensate for the precision loss. The result is higher quality than PTQ at the same memory footprint.
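The idea of simulating quantization during training can be sketched as a "fake quantize" step: weights are rounded to a low-precision grid and immediately converted back to floats, so the forward pass sees the rounding error. The sketch below is illustrative only. It assumes a symmetric 4-bit grid with a single scale for the whole list, which is not necessarily the scheme used for the Gemma checkpoints, and the function name and sample weights are hypothetical.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a signed integer grid, then map them back to floats.

    Assumption: symmetric quantization with one scale for the whole list.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    simulated = []
    for w in weights:
        q = round(w / scale)                # snap to the nearest grid point
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        simulated.append(q * scale)         # dequantize back to float
    return simulated, scale


# Hypothetical weights, chosen only for illustration.
weights = [0.81, -0.35, 0.02, -1.40, 0.66]
simulated, scale = fake_quantize(weights, bits=4)
errors = [abs(w - s) for w, s in zip(weights, simulated)]

for w, s, e in zip(weights, simulated, errors):
    print(f"{w:+.2f} -> {s:+.2f} (error {e:.3f})")
```

In a real training framework the loss is computed on the simulated values, and gradients are typically passed through the rounding step unchanged (a straight-through estimator), which is what lets the weights adapt to the grid.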

The Mobile-Optimized Schema

To facilitate deployment on mobile hardware, Google introduced a specialized mobile schema that achieves a footprint of approximately 1GB for the Gemma 4 E2B model. This is accomplished through four key engineering techniques:

  • Static Activations: Pre-calculating scaling factors during training to reduce computational overhead on-device.
  • Channel-wise Quantization: Using a separate scale for each channel rather than one for the whole tensor, a layout suited to the architecture of mobile accelerators.
  • Targeted 2-bit Compression: Applying aggressive 2-bit quantization specifically to token-generation layers while maintaining higher precision for core reasoning layers to preserve model capability.
  • Memory Optimization: Shrinking the active memory footprint by optimizing embeddings and the KV cache. Developers can further reduce memory usage by dropping audio and vision encoders if the use case is text-only.
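Two of the techniques above, channel-wise quantization and static activation scales, can be illustrated with a small sketch. It is a toy model, not the mobile schema itself: the weight values, the 4-bit and 8-bit widths, and the calibrated activation range of [-4, 4] are all assumptions made for the example.

```python
def abs_max(values):
    return max(abs(v) for v in values)


def quantize(values, scale, qmax):
    """Map floats to clamped signed integers using the given scale."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def mean_roundtrip_error(matrix, scales, qmax):
    """Mean absolute error after quantizing each row with its own scale."""
    total, count = 0.0, 0
    for row, scale in zip(matrix, scales):
        for v, q in zip(row, quantize(row, scale, qmax)):
            total += abs(v - q * scale)
            count += 1
    return total / count


# Hypothetical weights: two output channels with very different ranges.
weights = [
    [0.010, -0.020, 0.015, -0.005],
    [1.200, -0.900, 0.400, -1.500],
]
QMAX_4BIT = 7

# Per-tensor: one scale shared by every channel.
tensor_scale = abs_max([w for row in weights for w in row]) / QMAX_4BIT
per_tensor_err = mean_roundtrip_error(
    weights, [tensor_scale] * len(weights), QMAX_4BIT
)

# Channel-wise: one scale per row, so the small channel keeps its detail.
per_channel_scales = [abs_max(row) / QMAX_4BIT for row in weights]
per_channel_err = mean_roundtrip_error(weights, per_channel_scales, QMAX_4BIT)

# Static activations: the scale is fixed ahead of time, so nothing has to be
# measured at inference. The [-4, 4] range is an assumed calibration result.
QMAX_8BIT = 127
STATIC_ACT_SCALE = 4.0 / QMAX_8BIT


def quantize_activations_static(acts):
    return quantize(acts, STATIC_ACT_SCALE, QMAX_8BIT)


def quantize_activations_dynamic(acts):
    scale = abs_max(acts) / QMAX_8BIT   # extra pass over the data at runtime
    return quantize(acts, scale, QMAX_8BIT), scale


print(f"per-tensor error:  {per_tensor_err:.4f}")
print(f"per-channel error: {per_channel_err:.4f}")
```

With a shared scale, the small-range channel collapses to zeros, while per-channel scales preserve it. The static variant skips the runtime range measurement at the cost of clipping any activation that falls outside the calibrated range.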

Deployment Trade-offs

The Q4_0 QAT format remains the practical default for consumer GPUs and laptops, with a 3.2GB footprint for E2B, while the new mobile schema is purpose-built for on-phone inference. Both are far smaller than the unquantized BF16 baseline, which requires 9.6GB for the E2B model: Q4_0 is one third of that size, and the mobile schema roughly one tenth. Because the two formats target different hardware, the choice between them comes down to how each deployment weighs memory footprint, decode speed, and quality preservation.
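The footprint figures quoted above can be compared with a few lines of arithmetic. The sketch below uses only the numbers from this article (the mobile schema value is approximate), and the `formats_that_fit` helper is a hypothetical convenience that considers model footprint alone, ignoring the KV cache and other runtime memory.

```python
# E2B footprints in GB, as quoted in the text (mobile schema is approximate).
FOOTPRINT_GB = {
    "BF16 baseline": 9.6,
    "Q4_0 QAT": 3.2,
    "mobile schema": 1.0,
}

baseline = FOOTPRINT_GB["BF16 baseline"]

savings = {
    name: {
        "ratio": baseline / gb,                   # how many times smaller
        "saved_pct": 100.0 * (1.0 - gb / baseline),
    }
    for name, gb in FOOTPRINT_GB.items()
}


def formats_that_fit(budget_gb):
    """Return the formats whose footprint fits within the given budget."""
    return [name for name, gb in FOOTPRINT_GB.items() if gb <= budget_gb]


for name, s in savings.items():
    print(f"{name}: {s['ratio']:.1f}x smaller, {s['saved_pct']:.0f}% saved")

print(formats_that_fit(4.0))
```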