The Android AI Architecture

Google’s approach to on-device AI centers on the AI Core system service. Because models like Gemini Nano are large (3-4 GB), shipping them within individual apps is impractical. AI Core centralizes the model on the device, allowing multiple apps to share the same resource. This architecture handles hardware optimization, memory management, and execution priority automatically. Foreground apps receive priority for inference, while background tasks are queued and managed by the system to preserve battery life.
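The foreground-first behavior described above can be pictured as a priority queue. The following is a minimal sketch for illustration only: AI Core's actual scheduler is internal to the system service, and the names InferenceRequest, SharedModelScheduler, submit, and drainOrder are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative model only; not the AI Core implementation or API.
final class InferenceRequest {
    final String appName;
    final boolean foreground;
    final long sequence; // arrival order, keeps FIFO order within a priority class

    InferenceRequest(String appName, boolean foreground, long sequence) {
        this.appName = appName;
        this.foreground = foreground;
        this.sequence = sequence;
    }
}

final class SharedModelScheduler {
    private long nextSequence = 0;

    // Foreground requests sort first (false < true for the negated flag);
    // ties are broken by arrival order.
    private final PriorityQueue<InferenceRequest> queue = new PriorityQueue<>(
        Comparator.comparing((InferenceRequest r) -> !r.foreground)
                  .thenComparingLong(r -> r.sequence));

    void submit(String appName, boolean foreground) {
        queue.add(new InferenceRequest(appName, foreground, nextSequence++));
    }

    // Drains the queue and returns app names in the order they would be served.
    List<String> drainOrder() {
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().appName);
        }
        return order;
    }
}
```

In this sketch, a background job submitted before a foreground request is still served after it, which mirrors the idea of one shared model arbitrating between several apps.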

Implementation Strategies

Developers have three primary paths for building intelligent experiences on Android:

  • ML Kit GenAI APIs: The most straightforward path. It provides access to Gemini Nano for tasks like summarization, proofreading, and general prompting. It abstracts away the complexity of TPU inference and model configuration.
  • Hybrid Inference: Gemini Nano is only available on recent flagship devices, which limits reach. To address this, developers can use Firebase AI Logic, which lets an app attempt on-device inference first and automatically fall back to a cloud model (such as Gemini Flash) if the local model is unavailable.
  • LiteRT: For developers requiring custom models or specific optimizations beyond what the standard APIs offer, LiteRT provides a lower-level framework for running custom models on-device.
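The hybrid path in the list above follows a simple try-local-then-cloud pattern. Here is a rough sketch of that control flow, assuming a hypothetical TextModel interface; it does not use the Firebase AI Logic or ML Kit APIs, and FakeModel exists only to make the example runnable.

```java
// Hypothetical abstraction for illustration; not a Firebase or ML Kit type.
interface TextModel {
    boolean isAvailable();
    String generate(String prompt);
}

// Stand-in model that echoes the prompt with its own name as a prefix.
final class FakeModel implements TextModel {
    private final String name;
    private final boolean available;

    FakeModel(String name, boolean available) {
        this.name = name;
        this.available = available;
    }

    @Override public boolean isAvailable() { return available; }

    @Override public String generate(String prompt) {
        if (!available) {
            throw new IllegalStateException(name + " is not available");
        }
        return name + ":" + prompt;
    }
}

final class HybridGenerator {
    private final TextModel onDevice;
    private final TextModel cloud;

    HybridGenerator(TextModel onDevice, TextModel cloud) {
        this.onDevice = onDevice;
        this.cloud = cloud;
    }

    // Prefers the on-device model; uses the cloud model when the local one
    // is missing or fails at inference time.
    String generate(String prompt) {
        if (onDevice.isAvailable()) {
            try {
                return onDevice.generate(prompt);
            } catch (RuntimeException e) {
                // Local inference failed; fall through to the cloud model.
            }
        }
        return cloud.generate(prompt);
    }
}
```

A production app would likely also consider network state and whether the prompt may leave the device at all, since falling back to the cloud gives up the privacy benefit of local inference.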

Managing Trade-offs and Capabilities

While the GenAI APIs are optimized for privacy and low latency, they currently require flagship hardware from the last two years. For broader compatibility, developers should use "classic" ML Kit APIs (vision, OCR), which support over a billion devices.

For RAG-style (retrieval-augmented generation) applications, the current Prompt API supports text and image inputs. A dedicated embedding API is not yet live, but Google confirmed it is coming soon to support vectorization and similarity tasks. Developers are encouraged to treat these tools as foundational layers: build your own logic, prompts, and "skills" on top of the provided APIs rather than expecting the system to handle high-level task orchestration.
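To make the "similarity tasks" concrete, the following sketch ranks stored snippets by cosine similarity to a query vector, which is the retrieval step of a typical RAG pipeline. It assumes embeddings already exist as float arrays from some source; because the embedding API is not yet available, the vectors here are hand-written placeholders, and VectorIndex is a hypothetical helper rather than part of any Google library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory index; embeddings are supplied by the caller.
final class VectorIndex {
    private final List<String> texts = new ArrayList<>();
    private final List<float[]> vectors = new ArrayList<>();

    void add(String text, float[] embedding) {
        texts.add(text);
        vectors.add(embedding);
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 for a zero vector.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns up to k stored texts, most similar to the query first.
    List<String> topK(float[] query, int k) {
        double[] scores = new double[texts.size()];
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            scores[i] = cosine(query, vectors.get(i));
            indices.add(i);
        }
        indices.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, indices.size()); i++) {
            result.add(texts.get(indices.get(i)));
        }
        return result;
    }
}
```

The retrieved snippets would then be placed into the prompt sent to the model. A linear scan like this is usually adequate for the small, per-app corpora that fit on a phone; larger collections would call for an approximate nearest-neighbor index.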