On-Device AI Architecture: System vs. Custom Models

Developers have two primary paths for integrating AI into mobile applications. The first is System GenAI (e.g., Gemini Nano via AICore), which is preinstalled, highly optimized, and adds nothing to the app's footprint. It is the recommended starting point for general tasks. The second is LiteRT-LM, a runtime for custom tiny LLMs (under 1 billion parameters) that are bundled directly with the app. This path takes more engineering effort, but it gives developers full customization and control, so they can build specialized features that system-level models may not support.
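The choice between the two paths can be written down as a small decision rule. The sketch below is only an illustration of the guidance above: the `FeatureRequirements` fields and the `choose_integration_path` function are hypothetical names, not part of AICore or LiteRT-LM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRequirements:
    """Hypothetical checklist for one AI feature; field names are illustrative."""
    needs_custom_model: bool        # feature depends on fine-tuned or specialized weights
    system_model_on_device: bool    # a system model is present on the target device
    system_model_covers_task: bool  # the system model handles this task well enough


def choose_integration_path(req: FeatureRequirements) -> str:
    """Return "system" or "bundled", preferring the system model when it fits."""
    if req.needs_custom_model:
        return "bundled"
    if req.system_model_on_device and req.system_model_covers_task:
        return "system"
    # No suitable system model: fall back to a model shipped with the app.
    return "bundled"
```

In this sketch the system model is the default, and the bundled path is taken only when customization is required or the system model is missing or unsuitable.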

Building Robust Agentic Skills

Modern on-device agents can be built using a simple harness that combines a base model (like Gemma 4) with a tool-calling framework. Instead of exposing all function details to the model at once, developers should use a selective loading approach: the model identifies the intent, then loads the specific skill description and associated JavaScript UI on demand.
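A minimal sketch of selective loading follows. It assumes a hypothetical `SkillRegistry` in which only a one-line summary per skill is always visible, while the full description and JavaScript UI are fetched after the intent is known. In a real harness the model itself would pick the intent; here `identify_intent` is a keyword-overlap stand-in so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Skill:
    name: str
    summary: str      # short line that is always in the prompt
    description: str  # full function details, loaded on demand
    ui_script: str    # JavaScript UI source, loaded on demand


class SkillRegistry:
    def __init__(self, skills: Iterable[Skill]) -> None:
        self._skills: Dict[str, Skill] = {s.name: s for s in skills}

    def index_prompt(self) -> str:
        """Compact index: names and summaries only, no full descriptions."""
        return "\n".join(f"- {s.name}: {s.summary}" for s in self._skills.values())

    def skills(self) -> Iterable[Skill]:
        return self._skills.values()

    def load(self, name: str) -> Skill:
        """Fetch the full skill once the intent is known."""
        return self._skills[name]


def identify_intent(user_text: str, registry: SkillRegistry) -> Optional[str]:
    """Stand-in for the model's intent step: pick the best keyword overlap."""
    words = set(user_text.lower().split())
    best_name, best_score = None, 0
    for skill in registry.skills():
        score = len(words & set(skill.summary.lower().split()))
        if score > best_score:
            best_name, best_score = skill.name, score
    return best_name


def build_prompt(user_text: str, registry: SkillRegistry) -> str:
    """Include the full description only for the skill that was selected."""
    name = identify_intent(user_text, registry)
    if name is None:
        return registry.index_prompt()
    skill = registry.load(name)
    return f"{registry.index_prompt()}\n\nActive skill:\n{skill.description}"
```

The point of the design is that prompt size grows with the number of skill summaries rather than with the number of full skill descriptions.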

When reliability matters, fine-tuning outperforms prompt engineering alone. For example, FunctionGemma (270M parameters) improved from 46% to over 90% accuracy on specific app intents after being fine-tuned on a synthetic dataset. Developers can generate such datasets synthetically (e.g., via Flash) and use the FunctionGemma fine-tuning lab on Hugging Face to train models for specific, narrow tasks.
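The dataset and the accuracy figure can both be made concrete with a small sketch. Everything here is an assumption for illustration: the templates stand in for utterances that a larger model would generate, the `prompt`/`completion` JSONL schema is hypothetical rather than the format the fine-tuning lab expects, and accuracy is measured as exact match on the serialized function call.

```python
import json
from typing import Callable, Dict, List, Sequence

# Hypothetical templates; in practice a larger model would write varied utterances.
TEMPLATES = [
    ("set a timer for {n} minutes", "set_timer"),
    ("remind me in {n} minutes", "set_timer"),
    ("turn the volume to {n} percent", "set_volume"),
]


def generate_examples(values: Sequence[int] = (5, 10, 15)) -> List[Dict[str, str]]:
    """Expand each template into (prompt, expected function call) pairs."""
    examples = []
    for template, function in TEMPLATES:
        for n in values:
            call = {"name": function, "args": {"value": n}}
            examples.append({
                "prompt": template.format(n=n),
                "completion": json.dumps(call, sort_keys=True),
            })
    return examples


def to_jsonl(examples: List[Dict[str, str]]) -> str:
    """Serialize one example per line, a common layout for fine-tuning data."""
    return "\n".join(json.dumps(e) for e in examples)


def exact_match_accuracy(predict: Callable[[str], str],
                         examples: List[Dict[str, str]]) -> float:
    """Fraction of prompts whose predicted call equals the expected call exactly."""
    if not examples:
        return 0.0
    hits = sum(1 for e in examples if predict(e["prompt"]) == e["completion"])
    return hits / len(examples)
```

Running `exact_match_accuracy` on a held-out split before and after fine-tuning, with `predict` wrapping the model under test, gives the kind of before/after comparison described above.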

Deployment and Performance

LiteRT-LM is a cross-platform runtime (supporting Android, iOS, and desktop) that executes models packaged as a single file containing the model weights and tokenizer. The runtime is hardware-agnostic: it can run on the CPU or use GPU or NPU acceleration.
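Because the same model file can target several accelerators, an app typically needs a fallback order. The sketch below shows one way to express that, assuming a hypothetical `init` callable that loads a model on a named backend and raises `RuntimeError` when that backend is unavailable. It does not use the LiteRT-LM API; the backend names and the preference order are illustrative.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

# Illustrative preference: fastest accelerator first, CPU as the last resort.
DEFAULT_ORDER = ("npu", "gpu", "cpu")


def load_with_fallback(
    model_path: str,
    init: Callable[[str, str], Any],
    order: Sequence[str] = DEFAULT_ORDER,
) -> Tuple[str, Any]:
    """Try each backend in order and return (backend, handle) for the first success."""
    errors: Dict[str, str] = {}
    for backend in order:
        try:
            return backend, init(model_path, backend)
        except RuntimeError as exc:
            # Remember why this backend failed, then try the next one.
            errors[backend] = str(exc)
    raise RuntimeError(f"no backend could load {model_path}: {errors}")
```

Keeping the failure reasons makes it easier to see, on a given device, why the runtime ended up on a slower backend.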

Key takeaways for production deployment include:

  • Model Chaining: Complex features, such as the "Eloquent" transcription app, can be built by chaining multiple tiny models (e.g., one for ASR, one for text polishing), which achieves high performance while keeping the memory footprint small.
  • Hardware Optimization: Specialized runtimes allow tiny models (such as 500M-parameter VLMs) to run at high speeds on mobile NPUs.
  • Iterative Testing: The Google AI Edge Gallery serves as an open-source reference implementation for testing these models, allowing developers to load custom skills from URLs and debug via ADB.
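
The model-chaining idea from the list above can be sketched as a pipeline of stages, where each stage is a small model wrapped in a function. Both stages here are placeholders so the example runs without any model: `fake_asr` pretends the audio bytes already contain text, and `fake_polish` only drops filler words and fixes capitalization. The names are hypothetical and do not describe how the "Eloquent" app is implemented.

```python
from typing import Any, Callable, Sequence

Stage = Callable[[Any], Any]

FILLERS = {"um", "uh"}


def fake_asr(audio: bytes) -> str:
    """Placeholder for a speech recognition model: treat the bytes as UTF-8 text."""
    return audio.decode("utf-8")


def fake_polish(text: str) -> str:
    """Placeholder for a text-polishing model: drop fillers, capitalize, add a period."""
    cleaned = " ".join(w for w in text.split() if w.lower() not in FILLERS)
    if not cleaned:
        return ""
    return cleaned[0].upper() + cleaned[1:] + "."


def run_chain(stages: Sequence[Stage], data: Any) -> Any:
    """Feed the output of each stage into the next one, in order."""
    for stage in stages:
        data = stage(data)
    return data


if __name__ == "__main__":
    print(run_chain([fake_asr, fake_polish], b"um so the meeting uh starts at nine"))
    # prints "So the meeting starts at nine."
```

Because each stage is an independent callable, a stage can be swapped for a different tiny model, or loaded and released separately, without touching the rest of the chain.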