Run Gemma 4 on iPhone at 40 Tokens/Sec with MLX

Install the MLX Swift LM repo, grab a 4-8 bit quantized Gemma 4 model from the Hugging Face MLX Community, and integrate it via a simple API for fast on-device inference on iPhone: 40 tokens/sec on the latest devices.

Integrate MLX Swift LM for On-Device LLM Apps

To build iOS, iPadOS, or macOS apps that run LLMs locally on Apple Silicon, install the MLX Swift LM GitHub repo, a framework Apple has optimized for iPhone and Mac chips. The API is straightforward: pass a Hugging Face model ID and it auto-downloads and runs the model (see the sketch below). Integration takes under 10 minutes and enables native chatbots like Locally AI, which supports Gemma 4, Qwen, SmolLM, and Apple's Foundation models. For Python or Mac apps, use MLX variants such as MLX VLM for vision-language or MLX Audio for speech. The result is fully offline, optimized performance with no cloud dependency.
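For orientation, here is a minimal sketch of that flow using the MLXLLM / MLXLMCommon libraries from the mlx-swift-examples repo. The type and function names (ModelConfiguration, LLMModelFactory, the generate callback) are assumptions based on that repo's example code and may differ in the current release, and the model ID is a placeholder, so treat this as illustrative rather than canonical:

```swift
import MLXLLM
import MLXLMCommon

// Placeholder repo ID: pick a real quantized model from huggingface.co/mlx-community.
let configuration = ModelConfiguration(id: "mlx-community/<your-4bit-model>")

// Downloads the weights on first run, then loads them on the Apple Silicon GPU.
let container = try await LLMModelFactory.shared.loadContainer(configuration: configuration) { progress in
    print("Download: \(Int(progress.fractionCompleted * 100))%")
}

// Run a single prompt and print the completion.
let result = try await container.perform { context in
    let input = try await context.processor.prepare(
        input: UserInput(prompt: "Summarize MLX in one sentence."))
    return try MLXLMCommon.generate(input: input,
                                    parameters: GenerateParameters(temperature: 0.6),
                                    context: context) { tokens in
        tokens.count < 512 ? .more : .stop  // stop after a reasonable length
    }
}
print(result.output)
```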

Quantization is key on iPhone: pick 4-8 bit versions from Hugging Face's MLX Community (nearly 5,000 models, typically quantized to 4-bit, 6-bit, and 8-bit within 30 minutes of a release). Avoid going below 4-bit because quality degrades; full-precision models exceed device limits, and even quantized downloads run 1-3 GB. For example, Gemma 4 runs smoothly in 4-bit or 8-bit, while tiny 300-350M parameter models enable Shortcuts automation for text processing.
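One way to act on that guidance is a small heuristic that picks the quantization level from available memory. The RAM threshold and the repo naming pattern here are assumptions (MLX Community repos commonly end in a quantization suffix, but verify the exact ID exists before shipping):

```swift
import Foundation

// Rough sketch: prefer 8-bit when the device has memory headroom, else fall back to 4-bit.
let ramGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
let quantSuffix = ramGB >= 8 ? "8bit" : "4bit"

// Placeholder model name; MLX Community repos typically follow "<model>-<quant>" naming.
let repoID = "mlx-community/<model-name>-\(quantSuffix)"
```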

Benchmark Performance and Real-World Speed

On the latest iPhones, 4-bit quantized Gemma 4 hits 40 tokens/second with streaming, fast enough for responsive chat UIs that generate long outputs in seconds. Older iPhones deliver around 20 tokens/second, still viable for most apps. The demo shows live, offline generation rivaling cloud speed, with no network latency. The main trade-off is model size (1-3 GB downloads), but shrinking models and improving hardware (e.g., the next iPhone) keep pushing usability up. Use non-streaming generation for batch tasks and streaming for interactive use, as sketched below.
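A rough sketch of streaming plus a tokens-per-second readout, reusing the container loaded in the earlier sketch. The callback shape, tokenizer.decode, and result.tokensPerSecond are assumptions based on the mlx-swift-examples demo app, so check them against the current API:

```swift
// Stream partial text as tokens arrive, then report throughput.
let result = try await container.perform { context in
    let input = try await context.processor.prepare(
        input: UserInput(prompt: "Write a short product description."))
    return try MLXLMCommon.generate(input: input,
                                    parameters: GenerateParameters(temperature: 0.7),
                                    context: context) { tokens in
        // Decode every few tokens so a chat view can update live.
        if tokens.count % 4 == 0 {
            let partial = context.tokenizer.decode(tokens: tokens)
            print(partial)  // in a real app, push this to the UI on the main actor
        }
        return tokens.count < 1024 ? .more : .stop
    }
}
print("\(result.tokensPerSecond) tokens/sec")  // ~40 t/s on recent iPhones per the demo
```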

MLX Swift LM supports tool calling (improved in recent models), though structured generation requires third-party packages from Hugging Face. The ecosystem also extends to Omni models for text-to-speech, speech-to-speech, and image/video generation.

Try and Scale with Apps and Servers

Test with the free Locally AI app from the App Store and select verified MLX-compatible models; not all Hugging Face uploads work perfectly on iPhone. Locally AI was recently acquired by LM Studio, which downloads and runs models via llama.cpp or MLX and exposes OpenAI- and Anthropic-compatible servers for app integration. This combination lets you prototype on-device, scale out to local servers, and compare engines for the best speed/quality trade-off (see the client sketch below).
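To show how the server side fits in, here is a minimal client for an OpenAI-compatible chat-completions endpoint such as LM Studio's local server. The port (1234 is LM Studio's usual default), the model string, and the trimmed-down JSON shapes are assumptions to verify against your LM Studio settings:

```swift
import Foundation

// Minimal OpenAI-compatible chat request against a local LM Studio server.
struct ChatMessage: Codable { let role: String; let content: String }
struct ChatRequest: Codable { let model: String; let messages: [ChatMessage] }
struct ChatChoice: Codable { let message: ChatMessage }
struct ChatResponse: Codable { let choices: [ChatChoice] }

func askLocalServer(_ prompt: String) async throws -> String {
    var request = URLRequest(url: URL(string: "http://localhost:1234/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(
        ChatRequest(model: "local-model",  // placeholder; LM Studio serves whichever model is loaded
                    messages: [ChatMessage(role: "user", content: prompt)]))
    let (data, _) = try await URLSession.shared.data(for: request)
    return try JSONDecoder().decode(ChatResponse.self, from: data).choices[0].message.content
}
```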

