Ditching Cloud APIs for Local Apple Silicon Power

API outages like Claude's highlight the case for local models: private, cheap, and fast. The creator benchmarks a fully specced M5 MacBook Pro (128GB RAM) against an M4 Max using Qwen 3.5 (35B MoE, NVFP4 quantized) and Google's Gemma 4 (27B), each in GGUF (the cross-platform llama.cpp format) and MLX (Apple-optimized). Tools include Ollama for MLX support, J Bench (live-streaming multi-device benchmarks tracking prefill and decode t/s, wall time, and RAM), MacMon for real-time GPU/RAM/power visualization, and graph-walk prompts for context scaling. Decision: prioritize MLX on Apple silicon for production local inference, since GGUF wastes cycles without hardware-specific ops.

Cold starts load the model into unified memory; subsequent warm runs reveal true performance. Simple prompts (e.g., "explain a hash table in 2 sentences," "design a rate limiter") test the baseline. The M5 warms Qwen MLX and Gemma faster after the initial load. Wall time, what users actually feel, takes priority over raw decode speed because it folds in prefill, KV cache, and overhead.
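As an illustration of that methodology, here is a minimal sketch assuming the mlx-lm Python package (the model repo id is a placeholder, not necessarily the exact build benchmarked here): run the same prompt twice and treat only the warm run as representative.

```python
import time
from mlx_lm import load, generate

# Placeholder repo id; substitute the MLX build you actually run.
model, tokenizer = load("mlx-community/your-mlx-model")

def timed_run(prompt: str) -> float:
    start = time.perf_counter()
    generate(model, tokenizer, prompt=prompt, max_tokens=256,
             verbose=True)  # prints prompt (prefill) and generation (decode) t/s
    return time.perf_counter() - start

cold = timed_run("Design a rate limiter.")  # first pass pays load/compile costs
warm = timed_run("Design a rate limiter.")  # the run worth reporting
print(f"wall time: cold {cold:.1f}s vs warm {warm:.1f}s")
```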

"If you're running on Apple silicon, always find an MLX model. There's just really no debate about this, and they're up to twice as good as their GG counterparts." This quote underscores the format choice: MLX leverages Apple's GPU/Neural Engine for unified memory ops, Mixture-of-Experts (MoE) routing, and NVFP4 quantization from Nvidia.

MLX Unlocks 2x Speed on Apple Hardware

GGUF suits cross-platform use (e.g., llama.cpp), but MLX crushes it on M-series chips: Qwen 3.5 MLX hits 118 t/s decode versus 60 t/s for GGUF on the M5 (consistent across five prompts). Gemma 4 MLX prefills at 550 t/s, decodes at ~100 t/s, and peaks around 16GB of RAM. Qwen GGUF has the slowest prefill and decodes around 50 t/s, still usable (above the 30 t/s threshold). Gemma edges out Qwen on prefill and efficiency; Qwen wins some wall times thanks to its density.

The M5 stays quieter (fans barely spin) versus the M4's spin-up, and draws less power (35W vs 40W peak). RAM: Gemma MLX peaks around 16-42GB (and swaps efficiently); larger contexts spike it to 55GB. The non-obvious point: prefill barely matters on short prompts but scales poorly as prompts grow, which is key for RAG and agents that stack context.
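A back-of-the-envelope split shows why: prompt tokens are paid at the prefill rate, output tokens at the decode rate. The sketch below uses Gemma's reported MLX figures (~550 t/s prefill, ~100 t/s decode) as rough inputs; real runs add overhead, and prefill throughput itself degrades at long context, so these numbers are a floor.

```python
# Prompt tokens are paid at the prefill rate, output tokens at the decode rate.
# Ignores model load, sampling, and the prefill slowdown at long context.
def estimate_wall_time(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

short = estimate_wall_time(200, 400, prefill_tps=550, decode_tps=100)     # ~4.4 s
agent = estimate_wall_time(32_000, 400, prefill_tps=550, decode_tps=100)  # ~62 s
print(f"short prompt {short:.1f}s vs agent-sized prompt {agent:.1f}s")
```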

"Anything over 30 tokens per second, I consider fully usable. Once you drop below 20, I consider that the dead zone." Speaker's benchmark sets practical bar; MLX clears it effortlessly, enabling real workflows sans cloud.

Tradeoffs: MLX locks you to Apple hardware (no easy Windows/Linux port), but for Mac users it's non-negotiable; keep GGUF for portability if you run multi-platform. Both models are MoE builds (A4B/A3B active params), maximizing intelligence per parameter, in line with Gemma's design.

M5 Hardware Leaps 15-50% Over M4 in Real Workloads

The M5's architecture (a new super core?) shines: 15-50% faster overall, doubling prefill on long contexts (e.g., M5 Gemma MLX finishes an 8K-token graph walk in 13s; the M4 lags). Context scaling (BFS graph walks from 200 to 32K tokens) exposes the gap: the M5 totals 280s for the full run versus the M4's 400s (roughly a 40% win). Decode drops about 20% as prompts grow (the KV cache balloons), but the M5 holds a steady ~117 t/s.
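The graph-walk probe is easy to approximate at home. A minimal sketch of a hypothetical prompt format (not the creator's exact harness): generate a random tree, serialize its edge list to pad the prompt toward a target size, and ask for the BFS order.

```python
import random

def build_graph_walk_prompt(num_nodes: int, seed: int = 0):
    rng = random.Random(seed)
    # Each node i > 0 gets a random earlier parent, so the result is a tree.
    edges = [(rng.randrange(i), i) for i in range(1, num_nodes)]
    adjacency = "\n".join(f"{parent} -> {child}" for parent, child in edges)
    prompt = (
        "Below is a directed tree as an edge list:\n"
        f"{adjacency}\n"
        "Starting from node 0, output its breadth-first visit order as a "
        "comma-separated list of node ids, and nothing else."
    )
    return prompt, edges

# More nodes => longer edge list => bigger prompt; sweep this parameter to
# walk the 200-32K-token context range described above.
prompt, edges = build_graph_walk_prompt(num_nodes=500)
```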

Fans and GPU max out (100% utilization, with efficiency cores idle for some tasks). Accuracy: both models nail short graphs but falter at 8-32K (e.g., Qwen misses a node in a depth-14 tree; so does Gemma). That caps local SLMs at roughly 32K of effective context before performance craters, and agentic stacks (e.g., 2-3 Claude Code turns can hit 32K) amplify the problem.
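Scoring accuracy is just as mechanical. A sketch of a checker, assuming the edge list from the prompt builder above: compute the true BFS order (children taken in the listed order) and compare it to the model's reply.

```python
from collections import defaultdict, deque

def bfs_order(edges: list[tuple[int, int]], start: int = 0) -> list[int]:
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children[node])  # siblings visited in listed order
    return order

def is_correct(reply: str, edges: list[tuple[int, int]]) -> bool:
    # Parse a comma/space-separated list of node ids out of the model's text.
    try:
        answer = [int(tok) for tok in reply.replace(",", " ").split()]
    except ValueError:
        return False
    return answer == bfs_order(edges)
```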

Upgrade rationale: the M5's prefill edge scales with agent/RAG prompts, while the M4 works harder (noisier, hotter). Coming from anything M1 through M4, the gains compound for daily local AI.

"Upgrade from your M4, from your M3, from your M2, from your M1, whatever you're currently using. I have a fully maxed out M4, and the M5 is outperforming it by a wide margin." Hands-on verdict after side-by-side; quantifies why holdouts should spec M5 Max for MLX.

Agentic Limits and Future-Proofing Local AI

Simple benchmarks undersell reality: context stacking in agents kills performance. Graph walks mimic reasoning (precise traversal over tokens in context); local 30-35B MoE models handle BFS correctly on short inputs but degrade on long ones (versus cloud SOTA like Mythos at 80% on 1M tokens). Wall time balloons: 32K prompts take minutes, not seconds.

Insight: local is viable now for private/offline use (no API dependency), but architect agents around short contexts or KV-cache optimizations. The US-built Gemma competes with the Chinese Qwen: open, dense, and RAM-thrifty. Prepare by benchmarking your own stack; J Bench streams multi-device results for apples-to-apples comparisons.
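As a rougher single-machine starting point (not J Bench, whose interface isn't shown here), a hedged sketch of a wall-clock sweep over a couple of MLX builds and prompts; the repo ids are placeholders, and RAM/power are left to MacMon.

```python
import time
from mlx_lm import load, generate

MODELS = [
    "mlx-community/placeholder-qwen-mlx",   # substitute the builds you
    "mlx-community/placeholder-gemma-mlx",  # actually run
]
PROMPTS = [
    "Explain a hash table in two sentences.",
    "Design a rate limiter.",
]

for repo in MODELS:
    model, tokenizer = load(repo)
    generate(model, tokenizer, prompt="warm up", max_tokens=8)  # discard cold start
    for prompt in PROMPTS:
        start = time.perf_counter()
        text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
        wall = time.perf_counter() - start
        tokens = len(tokenizer.encode(text))
        print(f"{repo} | wall {wall:5.1f}s | ~{tokens / wall:4.0f} t/s end-to-end")
```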

"As prompt size increases, local model performance goes down very very quickly. This might sound obvious, but it's important to realize the impacts of this when you're expecting your local model to do agentic work." Counterintuitive for demo-focused devs; forces context pruning in production agents.

Key Takeaways

  • Always hunt MLX variants for Apple silicon—2x decode (100+ t/s), quieter, efficient vs GGUF.
  • M5 Max beats M4 Max by 15-50% (up to 40% on 32K-context wall time); upgrade if local AI is a core workflow.
  • Gemma 4 MLX: Prefill king (550 t/s), 16GB RAM fit—max IQ/param for agents.
  • Qwen 3.5 MLX: Decode beast (118 t/s), NVFP4/MoE shine; viable >30 t/s.
  • Context kills speed (decode drops 20% at 32K)—prune for agents, track wall time over raw t/s.
  • Benchmark live: J Bench + MacMon for prefill/decode/wall/RAM/power; >30 t/s = usable.
  • Local SLMs ready for private reasoning (BFS graphs), but cap at 32K; cloud for ultra-long.