Run Gemma 4 on iPhone at 40 tok/s with MLX Swift LM

Install MLX Swift LM in an iOS app to run 4- to 8-bit quantized Gemma 4 from the Hugging Face MLX Community and reach 40 tokens/second on the latest iPhones for fully offline chatbot inference.

Build On-Device LLM Apps in Under 10 Minutes

Use the MLX Swift LM GitHub repo to add native LLM inference to iOS, iPadOS, or macOS apps. The API downloads and loads models directly through its Hugging Face integration; you just pass a model ID. For Python or macOS scripting, use the MLX examples from mlx-community instead. This stack powers apps like Locally AI, a free App Store chatbot that supports both Apple Foundation models and open-source options. Quantize to 4-8 bits for iPhone compatibility: below 4-bit, output quality degrades significantly, while 8-bit is best reserved for smaller models under roughly 350M parameters. Models typically weigh 1-3 GB, which makes storage the main barrier, but the latest iPhones handle them efficiently for text processing, automation via Shortcuts, and streaming UIs.
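
A minimal sketch of that flow, assuming the ChatSession convenience API shipped with the MLX Swift LLM packages (MLXLLM and MLXLMCommon, added via Swift Package Manager); the model ID below is a placeholder, not a vetted recommendation:

```swift
import MLXLLM      // registers the LLM model types
import MLXLMCommon // loadModel, ChatSession

// Runs inside an async context (e.g., a Task or an async @main).
// Pass a Hugging Face repo ID; the library downloads the quantized
// weights on first use and caches them locally. The ID here is a
// placeholder: substitute any 4-8 bit repo from mlx-community.
let model = try await loadModel(id: "mlx-community/Qwen2.5-1.5B-Instruct-4bit")

// ChatSession keeps multi-turn conversation state for you.
let session = ChatSession(model)
let reply = try await session.respond(to: "Summarize MLX in one sentence.")
print(reply)
```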

Source Quantized Models from MLX Community

Search Hugging Face's MLX Community for its 4,000-5,000+ sets of quantized weights (4-bit, 5-bit, 6-bit, 8-bit, BF16, etc.), which often appear within about 30 minutes of a lab's release. For Gemma 4 (Google's smaller variants), grab the 8-bit version and quantize it to 4-bit for iPhone. Pass the repo ID (e.g., mlx-community/Gemma-4-8bit) to MLX Swift LM and it auto-downloads and runs the model. Test smaller Qwen or SmolLM models for speed; larger ones like Gemma 4 excel at chat. The ecosystem keeps expanding with MLX VLM (vision), MLX Audio (speech), and MLX Video (generation), enabling multimodal on-device apps.
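
A hedged streaming variant of the same flow, useful for a live chat UI; the repo ID mirrors the article's example and should be checked against the actual Hugging Face listing, and streamResponse(to:) assumes the same ChatSession API as above:

```swift
import MLXLLM
import MLXLMCommon

// Repo ID straight from the mlx-community organization; this one
// mirrors the article's example and may not be the exact listing
// on Hugging Face, so check the hub for the real name.
let model = try await loadModel(id: "mlx-community/Gemma-4-8bit")
let session = ChatSession(model)

// Stream tokens as they are generated, e.g., to drive a chat UI.
for try await chunk in session.streamResponse(to: "Why is the sky blue?") {
    print(chunk, terminator: "")
}
print()
```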

Hit 40 tok/s Offline and Scale to Older Devices

On the latest iPhones, 4-bit Gemma 4 streams at 40 tokens/second, fast enough for real-time chat without noticeable waiting (e.g., a long output in about 4 seconds). Older iPhones drop to around 20 tok/s, still viable for many apps. The demo shows live, fully offline generation that rivals cloud speed. MLX Swift LM supports tool calling (improved in recent models); structured outputs and custom packages are emerging through community efforts. Following the acquisition by LM Studio, you can also integrate with its local server for OpenAI- and Anthropic-compatible endpoints backed by MLX or llama.cpp. To try pre-vetted models instantly with no dev setup, download Locally AI from the App Store.
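
For the LM Studio route, a rough URLSession sketch against its local OpenAI-compatible server; port 1234 is LM Studio's documented default, and the model name is a placeholder for whatever you have loaded there:

```swift
import Foundation

// Minimal chat-completion request against LM Studio's local,
// OpenAI-compatible server (http://localhost:1234/v1 by default;
// plain-HTTP localhost may need an ATS exception in Info.plist).
var request = URLRequest(url: URL(string: "http://localhost:1234/v1/chat/completions")!)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")

// "local-model" is a placeholder; LM Studio serves whichever
// model is currently loaded in its server tab.
let body: [String: Any] = [
    "model": "local-model",
    "messages": [["role": "user", "content": "Hello from Swift!"]],
]
request.httpBody = try JSONSerialization.data(withJSONObject: body)

let (data, _) = try await URLSession.shared.data(for: request)
print(String(data: data, encoding: .utf8) ?? "")
```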
