Two-Tier KV Cache Bypasses RAM Limits for Large Models

Run 35B-parameter models like Qwen 3.6 (4-bit quantized) on a standard M2 MacBook Pro without exhausting RAM by using oMLX's two-tier KV cache. Recent conversation context stays in fast unified memory, where MLX's zero-copy arrays and lazy computation give both CPU and GPU instant access and fuse operations on the fly without redundant copies (Apple Silicon's unified memory means no PCIe transfers). Inactive history, such as system prompts and tool definitions, is frozen and swapped to the SSD, mimicking OS paging. In one coding task this yielded 89% cache efficiency: of 1.78M tokens processed, 1.59M were served from the disk cache rather than recomputed from scratch. The SSD cache also persists across /clear commands; oMLX detects matching prefixes and rehydrates the frozen state instantly, avoiding the context loss (and resulting hallucinations) that memory wipes cause in other runners.
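The freeze-and-rehydrate flow described above can be sketched in a few lines. This is a minimal illustration, not oMLX's implementation: the class name, the pickle-to-disk serialization, and the longest-prefix lookup are all assumptions standing in for whatever oMLX actually does with MLX tensors.

```python
# Hedged sketch of a two-tier KV cache: hot entries live in memory, frozen
# prefix blocks are swapped to SSD and rehydrated on a prefix match.
# All names here are illustrative, not oMLX's API.
import pickle
import tempfile
from pathlib import Path

class TwoTierKVCache:
    def __init__(self, cache_dir: Path):
        self.hot = {}              # token-prefix -> KV state (unified memory tier)
        self.cache_dir = cache_dir # SSD tier

    def freeze_prefix(self, prefix: tuple, kv_state) -> None:
        """Move an inactive prefix (e.g. system prompt) from RAM to SSD."""
        path = self.cache_dir / f"{hash(prefix) & 0xFFFFFFFF:08x}.kv"
        path.write_bytes(pickle.dumps(kv_state))
        self.hot.pop(prefix, None)

    def rehydrate(self, prompt: tuple):
        """Find the longest frozen prefix of `prompt` and reload its KV state."""
        for n in range(len(prompt), 0, -1):
            path = self.cache_dir / f"{hash(prompt[:n]) & 0xFFFFFFFF:08x}.kv"
            if path.exists():
                return prompt[:n], pickle.loads(path.read_bytes())
        return (), None

cache = TwoTierKVCache(Path(tempfile.mkdtemp()))
cache.freeze_prefix(("sys", "tools"), {"layer0": [0.1, 0.2]})
prefix, state = cache.rehydrate(("sys", "tools", "user turn"))
print(len(prefix))  # tokens of the new prompt served from disk
```

The key property this mimics is that a /clear only empties the hot tier; the next prompt that starts with the same system prompt hits the frozen copy on disk instead of forcing a full re-prefill.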

Trade-off: occasional 400 errors when the 32K context limit is hit require manual clears, but the speed gains outweigh this for background use.
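One way to avoid those 400s is a client-side guard that clears before the server-side limit is reached. The sketch below is an assumption about a sensible policy, not an oMLX feature; the limit constant and margin are placeholders.

```python
# Illustrative guard against 400 errors at the 32K-token context limit:
# track the running token count and clear/re-seed before the next turn
# would overflow. CONTEXT_LIMIT and the margin are assumptions.
CONTEXT_LIMIT = 32_768
SAFETY_MARGIN = 1_024

def should_clear(used_tokens: int, next_turn_tokens: int) -> bool:
    """Clear when the next turn would push past the limit minus a margin."""
    return used_tokens + next_turn_tokens > CONTEXT_LIMIT - SAFETY_MARGIN

print(should_clear(30_000, 2_000))  # True: 32,000 exceeds the 31,744 budget
print(should_clear(10_000, 2_000))  # False: plenty of headroom
```

Since oMLX's SSD cache survives /clear, a proactive clear like this is cheap: the frozen system-prompt prefix rehydrates on the next request.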

3x Inference Speed and System Responsiveness vs. LM Studio

oMLX averages 47 tokens/second on Qwen 3.6 during agentic coding versus LM Studio's 16 t/s, which explains why the same movie search/wishlist app task (MovieDB API integration) took 20 minutes on oMLX and 35 minutes on LM Studio. oMLX also leaves RAM free for multitasking: you can browse the web or watch videos on a secondary monitor without lag. LM Studio maxes out system resources because it keeps the full KV cache hot in RAM, causing stuttering even in idle apps.
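A quick sanity check on the reported numbers:

```python
# Ratios from the figures above: throughput and end-to-end wall clock.
omlx_tps, lmstudio_tps = 47, 16    # tokens/second
omlx_min, lmstudio_min = 20, 35    # minutes for the same task

print(round(omlx_tps / lmstudio_tps, 1))   # ~2.9x throughput
print(round(lmstudio_min / omlx_min, 2))   # 1.75x faster end-to-end
```

The wall-clock gain (1.75x) being smaller than the throughput gain (~2.9x) is expected: an agentic task also spends time on tool calls, file I/O, and prompt processing, which dilute raw generation speed.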

LM Studio wins on stability (no context errors), but oMLX's SSD extension roughly triples effective memory, showing that 128GB of RAM isn't necessary for powerful local agents on Apple Silicon. Outputs match across tools (same model, same seeds), so judge output quality separately.

Setup and Testing: Intuitive Dashboard for Agents

Launch the oMLX server from a simple UI: pick a model directory, add an API key, and open the dashboard, which offers pre-built snippets for agent harnesses like Codex CLI (leaner than Claude Code CLI, which spends 16.2K of the 32K-token context on its own prompts). Test with a codex command plus a task prompt and monitor live metrics (tokens/sec, cache hit %). It handled real-world coding without bloat, though the generated app needed database fixes to persist data after a refresh.
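If a harness lacks a pre-built snippet, pointing any OpenAI-compatible client at the local server follows the usual pattern. The endpoint URL, port, and model id below are assumptions inferred from the dashboard flow; substitute the values your own dashboard snippet shows.

```python
# Sketch of an OpenAI-compatible request to a local oMLX server.
# Base URL, port, and model name are placeholders, not documented defaults.
import json
import urllib.request

payload = {
    "model": "qwen-local",  # placeholder model id from the dashboard
    "messages": [{"role": "user", "content": "Add a wishlist button"}],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed endpoint path
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would send it; left out so the sketch
# runs without a live server.
print(req.get_method())  # POST
```

Any client that speaks the OpenAI chat-completions schema (including Codex CLI with a custom base URL) should work the same way against the local endpoint.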