Q4_K_M Quant Cuts LLM VRAM 72% with 2-3% Quality Drop
Quantize LLMs to Q4_K_M (~0.56 bytes/param) to fit an 8B model in roughly 5.5GB of total VRAM (weights plus ~1GB of KV cache and overhead); MoE models load all parameters but activate only a subset per token, so they run at the speed of a much smaller dense model.
Quantization Slashes VRAM While Preserving Quality
Model weights dominate VRAM usage, estimated as parameter_count × bytes_per_weight + KV_cache + ~1GB overhead. Q4_K_M quantization averages 0.56 bytes/param (~4 bits via k-quants), cutting F16 (2 bytes/param) by 72% for a typical 2-3% quality loss. Q5_K_M (0.69 bytes, ~1% loss), Q6_K (0.81 bytes, ~0.5% loss), and Q8_0 (1.06 bytes, ~0.1% loss) trade more VRAM for fidelity. Rule of thumb: 1B params ≈ 0.56GB at Q4_K_M. Example: Llama 3.1 8B needs 4.5GB of weights at Q4_K_M, totaling ~5.5GB with a 512MB F16 KV cache (4K context) and 512MB overhead—comfortably within 8GB GPUs.
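A minimal sketch of that estimate in Python; the bytes-per-parameter averages come from the figures above, and the helper name and default KV/overhead values are illustrative, not from any library:

```python
# Rough VRAM estimator following the formula above:
# total = params * bytes_per_weight + kv_cache + overhead.
BYTES_PER_PARAM = {
    "Q2_K": 0.31, "Q4_K_M": 0.56, "Q5_K_M": 0.69,
    "Q6_K": 0.81, "Q8_0": 1.06, "F16": 2.0,
}

def estimate_vram_gb(params_b: float, quant: str = "Q4_K_M",
                     kv_cache_gb: float = 0.5, overhead_gb: float = 0.5) -> float:
    """params_b is the parameter count in billions; KV cache and overhead
    default to the small-model, 4K-context figures used in the text."""
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    return weights_gb + kv_cache_gb + overhead_gb

# Llama 3.1 8B at Q4_K_M: ~4.5GB weights + 0.5GB KV + 0.5GB overhead ≈ 5.5GB
print(f"{estimate_vram_gb(8.0):.2f} GB")
```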
K-quants mix bit depths across tensors (more precision where it matters most), outperforming uniform quantization at the same average size. Avoid Q2_K (0.31 bytes/param, noticeable quality loss) unless nothing else fits.
MoE Models Load All Weights but Infer at Active Param Speed
Mixture-of-Experts (MoE) models like Qwen 3 30B-A3B (30B total / 3B active) require VRAM for all parameters (16.8GB at Q4_K_M) but compute only the routed experts per token, matching 3B-dense speed with roughly 14-20B-class quality. DeepSeek R1 671B (671B total / 37B active) loads 375GB at Q4_K_M while inferring only the active subset—a footprint that needs a multi-GPU cluster or an extremely high-memory unified-memory machine (even an M4 Ultra-class Mac with ~140GB usable VRAM falls far short at Q4), not consumer GPUs. Dense points of comparison: Mistral 7B (4GB Q4), Llama 3.1 8B (4.5GB), 70B class (39.5GB Q4). More sizes at Q4_K_M: Llama 3.2 3B (1.8GB), Phi-4-mini 3.8B (2.1GB), Qwen 3 14B (8.3GB), DeepSeek R1 32B (18.4GB), Llama 3.3 70B (39.5GB), Qwen 3 235B-A22B (131GB).
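A small sketch of the MoE trade-off, assuming the same 0.56 bytes/param figure: weights scale with total parameters, per-token compute (and thus speed) with active parameters. The function name is illustrative:

```python
# VRAM tracks total parameters; per-token compute tracks only the active ones.
BYTES_PER_PARAM_Q4_K_M = 0.56

def moe_profile(total_params_b: float, active_params_b: float) -> dict:
    return {
        "weights_gb_q4_k_m": total_params_b * BYTES_PER_PARAM_Q4_K_M,  # all experts resident
        "compute_class_b": active_params_b,  # per-token FLOPs comparable to a dense model this size
    }

print(moe_profile(30, 3))    # Qwen 3 30B-A3B: ~16.8GB of weights, ~3B-class speed
print(moe_profile(671, 37))  # DeepSeek R1: ~375GB of weights, ~37B-class speed
```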
KV Cache and Context Scale VRAM Predictably
KV cache = 2 × layers × d_head × kv_heads × context × 2 bytes (F16). Llama 3.1 8B (32 layers, 8 KV heads, d_head 128): 512MB at 4K, 4GB at 32K, 16GB at 128K—pushing the ~5GB Q4 total to roughly 21GB at full context, or ~13GB with a Q8 KV cache. 70B models (80 layers) hit ~10GB of F16 KV at 32K. Quantizing the cache to Q8_0 halves it (Q4_0 roughly halves it again) via llama.cpp's --cache-type-k / --cache-type-v flags. Limit context to what you actually need: 4-8K for chat (0.5-1GB of KV), 32K+ only for long documents. Flash attention cuts peak memory; leave 1-2GB of headroom.
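A quick calculator for that formula, assuming Llama 3.1 8B's published shape (32 layers, 8 KV heads, head_dim 128); the helper name is illustrative:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float = 2) -> float:
    """The leading 2 accounts for keys and values; bytes_per_elem is 2 for F16, 1 for Q8_0."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
for ctx in (4_096, 32_768, 131_072):
    f16 = kv_cache_gb(32, 8, 128, ctx)
    q8 = kv_cache_gb(32, 8, 128, ctx, bytes_per_elem=1)
    print(f"{ctx:>7} tokens: {f16:5.2f} GB (F16), {q8:5.2f} GB (Q8_0)")
```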
Match Models to GPU Tiers for Optimal Performance
8GB (RTX 4060): Llama 3.2 3B Q6 (2.6GB weights, ~4GB total), Llama 3.1 8B Q4 (~5.5GB total). 12GB (RTX 4070): Qwen 3 8B Q6 (6.6GB weights, ~8GB with cache and overhead), Phi-4 14B Q4 (7.8GB). 16GB (RTX 4080 Super): Mistral Small 24B Q4 (13.4GB). 24GB (RTX 4090): Qwen 3 30B-A3B Q4 (16.8GB), DeepSeek R1 32B Q4 (18.4GB); a 70B Q4 needs roughly half its layers offloaded to CPU (3-5 t/s). Mac M4 Max 64GB (~46GB usable as VRAM): 70B Q4 (39.5GB) fits. Dual RTX 4090s (48GB): 70B Q4 with headroom; 70B Q5 (~48GB) is a tight squeeze. Offload to CPU gradually (lower --n-gpu-layers): keeping 10-20% of layers on the CPU barely slows inference, but past ~30% throughput can drop 5-20x due to PCIe bandwidth limits, as estimated in the sketch below. Monitor with nvidia-smi; sanity-check fits with the Will It Run AI calculator.
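A rough way to pick an --n-gpu-layers value, assuming weights split evenly across layers (a simplification; real layer sizes vary, and the helper is hypothetical):

```python
def gpu_layers_that_fit(total_layers: int, weights_gb: float,
                        kv_and_overhead_gb: float, vram_gb: float) -> int:
    """Approximate --n-gpu-layers value: weights assumed evenly split across
    layers, with the KV cache and overhead kept on the GPU."""
    per_layer_gb = weights_gb / total_layers
    budget_gb = vram_gb - kv_and_overhead_gb
    return max(0, min(total_layers, int(budget_gb / per_layer_gb)))

# Llama 3.3 70B Q4_K_M (~39.5GB, 80 layers) on a 24GB card with ~2GB KV + overhead:
print(gpu_layers_that_fit(80, 39.5, 2.0, 24))  # ~44 of 80 layers on GPU -> roughly half offloaded
```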