M5 Max Crushes M4 in Local LLM Benchmarks via MLX
M5 Max MacBook Pro outperforms M4 Max by 15-50% across prefill, decode, and wall times; MLX models double GGUF speeds for Qwen 3.5 and Gemma 4 on Apple Silicon, enabling private, fast local inference.
Upgrade to M5 for 15-50% Faster Local LLM Inference
The core opportunity: reduce dependency on flaky cloud APIs like Claude or OpenAI by running capable local models privately, cheaply, and quickly on Apple Silicon. The speaker benchmarks a fully specced M5 Max MacBook Pro (new) against an M4 Max, using Qwen 3.5 (Alibaba, 35B MoE) and Gemma 4 (Google, 26B/27B MoE) in both GGUF (NVFP4-optimized) and MLX formats. Cold starts pay the cost of loading weights into memory; subsequent warm runs measure true performance.
Key metrics: prefill (prompt-processing speed), decode (generation tokens/second), wall (end-to-end time including overheads), and peak RAM. The M5 consistently leads: prefill nearly doubles the M4 on long contexts (e.g., 32K tokens), and Qwen decode hits 118 t/s in MLX vs. 60 t/s in GGUF. Wall times drop roughly 40% as context scales (M5: 280s total vs. M4: 400s for graph walks up to 32K). The M5 also runs quieter at lower power (35W vs. 40W) while maxing out the GPU efficiently. Tradeoff: larger prompts tank performance on both machines (tokens/s drop 20%+ at scale), limiting agentic workflows without KV-cache optimizations.
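To make those metrics concrete, here is a minimal sketch that derives prefill, decode, and wall figures from the timing fields a local Ollama server returns for a non-streaming generation. It assumes Ollama is running on its default localhost port; the model tag is a placeholder, not the speaker's exact build.

```python
import requests

def bench_ollama(model: str, prompt: str) -> dict:
    """One non-streaming generation against a local Ollama server;
    derive prefill/decode/wall metrics from its timing fields (nanoseconds)."""
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    r.raise_for_status()
    d = r.json()
    ns = 1e9  # Ollama reports durations in nanoseconds
    return {
        "prefill_tps": d["prompt_eval_count"] / (d["prompt_eval_duration"] / ns),
        "decode_tps": d["eval_count"] / (d["eval_duration"] / ns),
        "wall_s": d["total_duration"] / ns,
    }

# Model tag is a placeholder; use whatever GGUF build you have pulled.
print(bench_ollama("qwen3:30b", "Explain a hash table in two sentences."))
```

Run the same prompt twice and keep the second result: the first call includes model load time, which is exactly the cold-start effect described above.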
"The M5 is about 15 to 50% faster than the M4, which is a pretty massive jump. And we're going to see the trend continue as we increase the context size." (Context: Summarizing simple prompt benchmarks; highlights hardware upgrade value for growing prompts in agents.)
MLX Doubles GGUF Speeds—Mandatory for Apple Silicon
GGUF (via Ollama) works but lags; MLX (Apple's framework) exploits unified memory and silicon-specific optimizations for large gains. On simple prompts (from "explain a hash table in 2 sentences" up to "design a rate limiter"): Qwen decode reaches 118 t/s in MLX vs. 60 t/s in GGUF, Gemma 4 MLX prefill hits 550 t/s, and wall times roughly halve. RAM: Gemma 4 MLX has the smaller footprint (~16GB of weights, with a 42GB peak observed on a 128GB system); Qwen needs more.
Reasoning: MLX pairs NVFP4 quantization with MoE efficiency and avoids GGUF's overheads. Always look for an MLX variant first; there is no debate, since they beat GGUF by up to 2x. On the M4 the gap is smaller but still present; the M5 amplifies it. Tools like J-Bench stream live metrics across devices and warm the models before measuring.
"If you're on Apple silicon, use MLX. There's just really no debate about this, and they're up to twice as good as their GG counterparts." (Context: Post-simple benchmarks; distills why engineers waste time on suboptimal formats.)
Qwen 3.5 Edges Gemma 4 on Wall Time, But Gemma Packs Denser
Qwen 3.5 MLX (A4B active parameters) vs. Gemma 4 MLX (A3B): neck-and-neck on simple tasks, with Qwen winning on wall time thanks to consistent decode after prefill. Context scaling (BFS graph walks from 200 to 32K tokens) reveals the tradeoffs: both struggle past 8K (F1 scores drop, e.g., misses at 16K), yet they still answer correctly fairly often. Qwen's wall time stays faster (e.g., 13s at 8K); Gemma is slower but denser, packing more intelligence per parameter. US-origin Gemma remains competitive with Chinese labs such as Alibaba's Qwen.
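The exact graph-walk harness is not shown, but one rough way to reproduce the idea is to serialize a random edge list and ask for a BFS order, then grow the node count to sweep prompt length. This is an illustrative sketch, not the speaker's benchmark code.

```python
import random

def graph_walk_prompt(n_nodes: int, seed: int = 0) -> str:
    """Build a random directed graph as an edge list and ask for a BFS order.
    Growing n_nodes scales the prompt from a few hundred tokens toward 32K."""
    rng = random.Random(seed)
    edges = []
    for src in range(n_nodes):
        for dst in rng.sample(range(n_nodes), k=min(3, n_nodes - 1)):
            if dst != src:
                edges.append(f"{src} -> {dst}")
    edge_text = "\n".join(edges)
    return (
        "Here is a directed graph as an edge list:\n"
        f"{edge_text}\n"
        "Starting from node 0, list the nodes in breadth-first order, "
        "breaking ties by ascending node id. Answer with the node ids only."
    )

# A few hundred tokens at ~25 nodes; thousands of nodes push past 32K context.
print(graph_walk_prompt(25)[:400])
```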
Decision chain: start with simple prompts (usable above 30 t/s), then scale to agent-like workloads, where context stacks fast in tools like Claude Code. Prefill matters more at scale (the M5 roughly doubles the M4). Run-to-run variance exists, but MLX keeps speeds consistently high. Failure insight: local models hit limits on precise reasoning (BFS traversal errors), unlike cloud SOTA such as Mythos (80% at 1M tokens).
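Why prefill dominates at scale falls out of simple arithmetic; the sketch below reuses the section's MLX figures (550 t/s prefill, 118 t/s decode) purely as illustrative rates, not new measurements.

```python
def est_wall_s(prompt_tokens: int, output_tokens: int,
               prefill_tps: float, decode_tps: float) -> float:
    """Rough wall-time model: prompt processed at the prefill rate,
    generation at the decode rate (ignores load and sampling overhead)."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Illustrative rates only: short prompts are decode-bound,
# 32K-token prompts make prefill the bulk of the wall time.
print(est_wall_s(500, 300, prefill_tps=550, decode_tps=118))     # ~3.5 s, mostly decode
print(est_wall_s(32_000, 300, prefill_tps=550, decode_tps=118))  # ~61 s, mostly prefill
```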
"Anything over 30 tokens per second, I consider fully usable. Once you drop below 20, I consider that the dead zone." (Context: Defining practical thresholds; invites reader input on viable local speeds.)
Context Explosion Kills Local Agent Performance
Simple prompts (2-4 sentences) fly (>100 t/s decode), but real workloads (agents, 32K+ context) expose the drop-off: tokens/s fall as the prompt grows and wall time balloons (400s for the M4's full run). Agent context stacks through tool calls and history; 2-3 interactions can hit 32K tokens (see the sketch below). Prepare now: MLX on an M5 is viable for local production use, but KV-cache innovations are needed to scale. Fans spin, the GPU sits at 100%, RAM swaps; the machines "cook", though the M5 stays more efficient.
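To see how quickly an agent loop eats context, here is a back-of-envelope sketch; the per-turn sizes (system prompt, user message, tool output, model reply) are assumed illustrative numbers, not measurements from the video.

```python
def agent_context_after(turns: int, system_tokens: int = 2_000,
                        user_tokens: int = 300, tool_output_tokens: int = 9_000,
                        reply_tokens: int = 800) -> int:
    """Every turn's user message, tool output, and reply stays in history,
    so the prompt the model sees grows by their sum each round."""
    return system_tokens + turns * (user_tokens + tool_output_tokens + reply_tokens)

for t in range(1, 5):
    print(t, agent_context_after(t))
# With these assumed sizes, turn 3 is already past 32K tokens of context.
```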
Tradeoffs: local is private, fast, and cheap versus the cloud's reliability. Benchmarks run through J-Bench (multi-device streaming) and MacMon (live GPU/RAM visualization). Progression: warmup → simple prompts → context scaling → agentic (truncated; a Pi coding agent is teased).
"As prompt size increases, local model performance goes down very very quickly... What's limiting our models now... isn't so much the performance... but context window stacks up very very quickly." (Context: Graph walks results; warns on agentic real-world use.)
Key Takeaways
- Always use MLX over GGUF on Apple Silicon for 2x prefill/decode gains; check Ollama/MLX hubs first.
- Upgrade M-series: M5 Max 15-50% faster wall times than M4 Max, especially prefill on long contexts—worth it for local AI workflows.
- Qwen 3.5 MLX edges Gemma 4 on speed; Gemma is denser (~16GB footprint). Pick by use case (reasoning vs. efficiency).
- Target >30 t/s decode for usable local inference; benchmark your prompts, as context scaling tanks speeds 20%+.
- Warm models, measure wall time over raw t/s; tools like J-Bench/MacMon reveal real costs (fans, power, RAM).
- Local models handle simple/medium tasks now (BFS graphs up to 8K), but agent-scale context stacks need optimizations; preparation for ditching cloud dependency starts today.
- Non-obvious: the M5 runs quieter at lower power despite higher loads; US-made Gemma 4 shows open models remain competitive.