Tiny LLMs and On-Device Agents via LiteRT-LM on Edge Hardware

LiteRT-LM runs Gemma models with 2B-4B effective parameters at 1000+ tokens/sec on high-end phones and enables agent skills via built-in function calling, while tiny 100-500M parameter models, fine-tuned per task, deliver in-app capabilities like voice-to-action at 85-90% reliability.

Edge AI Benefits Drive On-Device LLMs

Running LLMs on edge devices addresses key constraints: ultra-low latency for in-the-loop UX like live voice translation (impractical over a cloud round trip), full privacy in messaging apps, offline capability, and cost savings on laptops. Cormac Brick, Google AI Edge tech lead, emphasizes these benefits over cloud alternatives, drawing on ten years of optimizing for hardware from Raspberry Pi to NPUs. The tradeoffs are RAM limits (e.g., 2-4GB for viable models) and hardware variability, which push optimizations like memory-mapped per-layer embeddings to keep the effective parameter count low.
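
To make the per-layer-embedding idea concrete, here is a minimal sketch of on-demand embedding lookup via memory mapping; the file name, flat float16 layout, and sizes are illustrative assumptions, not the actual Gemma format.

```python
# Sketch: memory-mapped embedding table, so only the rows for tokens actually
# seen get paged into RAM. Layout assumed: flat float16 [vocab, dim] file.
import numpy as np

VOCAB_SIZE, EMBED_DIM = 262_144, 256  # illustrative sizes

# np.memmap reads lazily; no row is resident until it is indexed.
table = np.memmap("per_layer_embeddings.bin", dtype=np.float16,
                  mode="r", shape=(VOCAB_SIZE, EMBED_DIM))

def embed(token_ids):
    # Each row here is 256 * 2 = 512 bytes: "hundreds of bytes per token".
    return np.asarray(table[token_ids])

print(embed([17, 42]).shape)  # (2, 256)
```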

"There's a lot of benefits to running on the edge. There's latency or UX improvements for some really sensitive in-the-loop things like live voice translation." — Cormac Brick, highlighting why Pixel's on-device translation beats cloud latency.

Google's stack, comprising LiteRT (formerly TensorFlow Lite), MediaPipe, and LiteRT-LM, already ships in Google Photos, YouTube Shorts effects, and Android system services. A single .tflite file deploys cross-platform (Android, iOS, macOS, Linux, Windows, Web, IoT) on CPU and GPU; NPUs need separate compilation. This enables broad reach beyond premium devices.
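
The cross-platform story bottoms out in the same interpreter API everywhere. A minimal sketch of the Python flow, assuming the ai-edge-litert package and a placeholder model path:

```python
# Load and run a .tflite model with the LiteRT Python interpreter.
# "model.tflite" is a placeholder; tf.lite.Interpreter is the older
# equivalent of this API.
import numpy as np
from ai_edge_litert.interpreter import Interpreter

interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Feed a dummy input matching the model's declared shape and dtype.
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()

print(interpreter.get_tensor(out["index"]).shape)
```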

System GenAI vs. In-App Tiny LLMs: Deployment Patterns

Two deployment patterns are emerging. System-level GenAI integrates 2-5B parameter models into the OS (Android AICore, Apple Intelligence), exposing broad APIs such as summarization and general prompting, pre-loaded on premium devices. Apps customize via prompting or skills; no model download is needed.

In-app GenAI bundles tiny LLMs (TLMs, 100-500M params) with the app or webpage itself, reaching a far wider range of devices. Below roughly 500M params, fine-tuning is essential for production reliability on tasks like summarization, transcription, and voice-to-function (e.g., Function Gemma at 270M params hits 85-90% across 10 Android functions; see the dispatch sketch below). Prompting alone fails for tiny models; fine-tuning yields "really reliable performance."
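
As a concrete sketch of the voice-to-function pattern: the fine-tuned model emits a structured call and the app routes it to a handler. The registry, function names, and JSON shape below are hypothetical, not the Function Gemma spec.

```python
# Illustrative voice-to-function dispatch for a fine-tuned tiny model.
import json

def set_alarm(hour: int, minute: int) -> str:
    return f"Alarm set for {hour:02d}:{minute:02d}"

def send_message(contact: str, body: str) -> str:
    return f"Sent to {contact}: {body}"

# The handful of app functions the model was fine-tuned to target.
REGISTRY = {"set_alarm": set_alarm, "send_message": send_message}

def dispatch(model_output: str) -> str:
    """Parse the model's structured output and invoke the matching function."""
    call = json.loads(model_output)  # e.g. {"name": ..., "args": {...}}
    fn = REGISTRY.get(call["name"])
    if fn is None:
        return "Unrecognized function; ask the user to rephrase."
    return fn(**call["args"])

# Transcribed speech "wake me at 6:30" -> the model emits a call like:
print(dispatch('{"name": "set_alarm", "args": {"hour": 6, "minute": 30}}'))
```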

The decision chain: prefer the system model for foundation tasks (leveraging the OS vendor's investment); go in-app for custom, task-specific reliability. The tradeoff: system GenAI is confined to premium hardware, while tiny models sacrifice generality to gain deployability.

"For the really really tiny models certainly less than 500 million parameters you need to fine-tune to get production level reliability." — Brick on why prompting isn't enough for edge-scale models.

Gemma 2B/4B: Edge-Optimized for Agents and Multimodality

Gemma 3n (E2B: 2B effective params; E4B: 4B) targets the edge with RAM efficiency via partial embedding loads (hundreds of bytes per token). It is multimodal (audio, image, and text even at these small sizes), and built-in function calling plus thinking unlocks on-device agents. An Apache 2.0 license broadens use.

Performance snapshot (optimizations ongoing with Qualcomm, Intel, and Raspberry Pi):

Device                 | Gemma 2B Prefill/Decode (tok/s) | Gemma 4B Prefill/Decode (tok/s)
High-end Android (GPU) | 2000+ / 1000+                   | ~half the 2B figures
MacBook                | 1000s                           | proportionally lower
Raspberry Pi 5         | 20 / 133                        | N/A
Qualcomm IoT NPU       | high (NPU boost)                | high

E2B/E4B are on the AICore roadmap for Android integration; larger Gemma variants target laptops (32GB RAM).

"One of the big step ups... was they've kind of built in function calling which is excellent and they also have built-in thinking. So that combination... unlocks our ability to now do skills on device." — Brick on Gemma's agent enablers.

Progressive Skills: Token-Efficient On-Device Agents

The Google AI Edge Gallery app demos agent skills: mood journaling (log entries and analyze trends via voice), calendar checks, Wikipedia queries, and music synthesis from images. No fine-tuning involved; skills ship as on-demand JS snippets, each with a one-line description.

The mechanism is progressive disclosure: the model first sees only one-line skill summaries, then pulls in a skill's full function definitions via a "load skill" meta-function only when it judges the skill relevant (sketched below). This cuts context bloat and boosts reliability on lightweight models, which degrade on long contexts. Recurring patterns: knowledge augmentation (Wikipedia), interactive UI (flashcards), and web services (weather, maps, music).
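
A minimal sketch of that conditional-depth mechanism; the skill names, spec strings, and load_skill signature are illustrative assumptions, not the Gallery app's actual format.

```python
# Progressive disclosure: the context carries only one-line skill summaries
# plus a load_skill meta-function; full specs enter the context on demand.
SKILLS = {
    "wikipedia": {
        "summary": "Look up facts on Wikipedia.",
        "spec": "search(query: str) -> list[str]; get_page(title: str) -> str",
    },
    "calendar": {
        "summary": "Check and create calendar events.",
        "spec": "list_events(day: str) -> list[dict]; add_event(e: dict) -> bool",
    },
}

def initial_context() -> str:
    """What the model sees up front: a few tokens per skill, nothing more."""
    lines = [f"- {name}: {s['summary']}" for name, s in SKILLS.items()]
    return ("Available skills:\n" + "\n".join(lines)
            + "\nCall load_skill(name) for full details.")

def load_skill(name: str) -> str:
    """Meta-function the model calls when a summary looks relevant."""
    return SKILLS[name]["spec"]

print(initial_context())
print(load_skill("wikipedia"))
```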

"The way we've built the skills is there's a kind of one-line description... if it thinks that sounds interesting, then it asks for more... This is particularly important for token efficiency and frankly reliability on edge models." — Brick explaining conditional depth over full MCP descriptions.

Tiny Model Workflow: Fine-Tune and Deploy

For TLMs: fine-tune a Gemma-based model (e.g., 100-500M params) on task data, quantize, and deploy via LiteRT-LM; a hedged sketch of the fine-tuning step follows. An example app built by the team shows real-world tiny-LLM use for voice-to-action, with cross-platform speed via hardware acceleration (GPU/NPU).
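
One common route for the fine-tuning step is Hugging Face Transformers, sketched below; this is an assumption, not the specific tooling named here. The model id, dataset path, and field names are placeholders, and quantization/conversion to a LiteRT-LM artifact is a separate step not shown.

```python
# Fine-tune a small Gemma checkpoint on (utterance, target call) pairs.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "google/gemma-3-270m"  # placeholder small checkpoint

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# One JSON object per line: {"utterance": ..., "target_call": ...}
ds = load_dataset("json", data_files="voice_to_action.jsonl")["train"]

def to_tokens(ex):
    text = ex["utterance"] + "\n" + ex["target_call"] + tok.eos_token
    return tok(text, truncation=True, max_length=256)

ds = ds.map(to_tokens, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tlm-ft", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=ds,
    # Pads batches and derives causal-LM labels from input_ids.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
trainer.save_model("tlm-ft/final")
```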

Tradeoffs: tiny models trade generality for task-specific excellence and require fine-tuning. Results: voice-to-function at 85-90% reliability on small models, deployable almost everywhere.

Key Takeaways

  • Prioritize edge for latency/privacy/offline/cost; use LiteRT-LM for cross-platform .tflite deployment (CPU/GPU out of the box, NPU via separate compilation).
  • Choose system GenAI (2-5B params via OS APIs) for foundation tasks on premium devices; in-app TLMs (100-500M) for custom tasks with fine-tuning.
  • Gemma E2B/E4B: 2B-4B effective params within tight RAM budgets (2-4GB), multimodal, agent-ready; expect roughly 100 to 2000+ tok/s depending on hardware.
  • Build skills progressively: One-line summaries → on-demand JS loads for token efficiency and dynamic tools.
  • Fine-tune tiny models below 500M params for 85-90% reliability on voice/action tasks; avoid prompting alone.
  • Optimize embeddings (memory-map per-layer embeddings, PLE) to fit RAM constraints; track partners like Qualcomm for NPU gains.
  • Test on real hardware: a Raspberry Pi 5 at 133 tok/s decode is viable for simple analysis; high-end phones hit production speeds.
  • Extend models low-code: Wikipedia/maps/music skills turn static LLMs into fresh-knowledge agents.
