Load LLMs Fast with mmap and Quantize for Consumer Hardware

Inference engines like llama.cpp use mmap to load 15GB models in under 10 seconds by lazily pulling weights from SSD into RAM/GPU memory, avoiding duplicate copies. Quantizing to GGUF Q4_K_M gives the best speed-quality trade-off on 32GB consumer GPUs, balancing compression against perplexity.

Memory Mapping Accelerates Model Loading Without RAM Waste

Downloaded LLM artifacts, such as Gemma's 15GB model.safetensors (raw weight tensors behind a JSON header) and config.json (architecture metadata: attention heads, layer count, vocab size), aren't executables; an inference engine must load them through the memory hierarchy (SSD → RAM → GPU). Naively copying the file duplicates 15GB inside 32GB of RAM, wasting half of it. llama.cpp instead uses mmap: the OS maps the SSD file into the process's address space and loads pages lazily on first access. Evicted pages reload from SSD over PCIe (~7GB/s for NVMe), so re-faulting 750MB (5% of the model) costs only ~107ms. This gets Qwen 2.5 to first token in under 10 seconds, versus minutes for vLLM with its compilation overhead. mmap also frees RAM for apps like Chrome, since the OS can evict unused weight pages at will.
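A minimal Python sketch of this lazy-loading pattern, assuming a hypothetical local GGUF path; only the pages actually touched are read from disk, and the madvise hint is Linux-specific:

```python
import mmap
import os

path = "models/qwen2.5-7b-q4_k_m.gguf"  # hypothetical path for illustration

with open(path, "rb") as f:
    size = os.fstat(f.fileno()).st_size
    # Map the whole file read-only: this reserves address space but
    # reads no bytes from disk yet.
    mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)
    # Linux-only hint: we'll scan mostly sequentially, so prefetch.
    if hasattr(mmap, "MADV_SEQUENTIAL"):
        mm.madvise(mmap.MADV_SEQUENTIAL)
    # Touching bytes faults in only those pages (typically 4 KiB each);
    # the rest of a 15GB file stays on SSD until some layer needs it.
    magic = mm[:4]  # GGUF files begin with the b"GGUF" magic
    print(f"{size / 1e9:.1f} GB mapped, magic = {magic!r}")
    mm.close()
```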

vLLM (Python) sometimes outperforms llama.cpp (C++) despite the language-speed myth: Python overhead is negligible here, and architecture and scheduling matter far more. TGI and TensorRT-LLM mix Rust, C++, and Python and support hybrid offloading (weights in RAM, compute on GPU).

Quantization Compresses Weights with Minimal Accuracy Loss

Quantization reduces BF16 weights to INT4/INT8 (like downscaling 4K to 1080p) via formats such as GGUF, EXL2/3, AWQ, FP8, and NVFP4. Group quantization splits weights into groups (e.g., 32 or 256), normalizes each group to its min/max range, rounds to low-precision integers (-8 to 7 for INT4), and dequantizes at runtime with the stored scale/bias. The main GGUF variants (a runnable sketch follows this list):

  • Symmetric (Q4_0): ±max range.
  • Asymmetric (Q4_1): min-to-max + bias shift.
  • K-Quants (Q4_K_S/M): hierarchical scales (one superblock scale per 256 weights plus local scales per 32); mixed precision (e.g., Q4_K_M keeps most tensors at 4-bit, with 6-bit for output/FFN-gate/norm tensors). Preserves outliers better; the most popular family on Hugging Face.
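A minimal numpy sketch of the symmetric and asymmetric variants under simplified assumptions (one scale per group, no bit-packing; real Q4_0/Q4_1 blocks also pack two 4-bit values per byte):

```python
import numpy as np

def quantize_q4_0(w: np.ndarray, group: int = 32):
    """Symmetric, Q4_0-style: one scale per group, integers in -8..7."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-12)             # avoid division by zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def quantize_q4_1(w: np.ndarray, group: int = 32):
    """Asymmetric, Q4_1-style: min-to-max range plus bias, integers in 0..15."""
    w = w.reshape(-1, group)
    bias = w.min(axis=1, keepdims=True)
    scale = np.maximum((w.max(axis=1, keepdims=True) - bias) / 15.0, 1e-12)
    q = np.clip(np.round((w - bias) / scale), 0, 15).astype(np.uint8)
    return q, scale, bias

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4_0(w)
w_hat = (q.astype(np.float32) * s).reshape(-1)   # dequantize
print(f"mean abs error (Q4_0 sketch): {np.abs(w - w_hat).mean():.4f}")
```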

AWQ calibrates on sample data and rescales 'salient' weights (those seeing high activation magnitudes) to minimize quantization error; a sketch of the idea follows. EXL2 uses the Hessian (second derivative of the loss) to estimate sensitivity and assigns 2-6 bits per group; it is the fastest option for Llama-13B (high tokens/sec, low perplexity, comparable size). GGUF dominates local runs on 32GB consumer GPUs (the hobbyist ceiling); EXL3 is newer but less adopted. Hardware-native formats: FP8 (Hopper GPUs), NVFP4 (Blackwell).
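A toy numpy sketch of AWQ's core trick, assuming per-input-channel activation magnitudes from a calibration set drive the scaling (function names here are illustrative, not the real AutoAWQ API):

```python
import numpy as np

def awq_style_rescale(W: np.ndarray, X_calib: np.ndarray, alpha: float = 0.5):
    """W: [out, in] weights; X_calib: [samples, in] calibration activations.
    Salient input channels (large mean |activation|) get scaled up before
    quantization, so rounding error hits them proportionally less; real AWQ
    searches alpha per layer to minimize output error."""
    act_mag = np.abs(X_calib).mean(axis=0)        # per-input-channel magnitude
    s = np.maximum(act_mag, 1e-8) ** alpha        # scale factors, shape [in]
    W_scaled = W * s                              # quantize W_scaled, not W
    return W_scaled, s

# Mathematically neutral before rounding: (X / s) @ (W * s).T == X @ W.T,
# so the inverse scale folds into the activations (or the previous layer)
# and only the rounding error changes.
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64)).astype(np.float32)
X = rng.standard_normal((256, 64)).astype(np.float32)
W_scaled, s = awq_style_rescale(W, X)
assert np.allclose((X / s) @ W_scaled.T, X @ W.T, atol=1e-3)
```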

Trade-offs: fewer bits mean smaller files and faster inference but higher perplexity. Q4_K_M hits the sweet spot for 30B models on 32-70GB of VRAM; a quick size estimate follows.
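Back-of-envelope sizing, assuming commonly cited averages of ~4.5 bits/weight for Q4_0, ~4.85 for Q4_K_M, and 8.5 for Q8_0 (weights only, excluding KV cache and activations):

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk/in-memory size of quantized weights:
    1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB."""
    return params_billion * bits_per_weight / 8

for name, bpw in [("BF16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q4_0", 4.5)]:
    print(f"{name:7s} ~{weights_gb(30, bpw):5.1f} GB for a 30B model")
```

At ~18GB of weights, a 30B Q4_K_M model leaves headroom for KV cache and activations within the stated VRAM range; BF16 at ~60GB does not.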
