LLM Inference: mmap Loading & Quantization Deep Dive
Efficient LLM inference hinges on mmap for lazy memory loading (e.g., sub-10s startup in llama.cpp) and on quantization schemes like GGUF K-quants or AWQ/EXL2 that shrink 15GB models while preserving quality via salient-weight handling and mixed precision.
Memory-Efficient Model Loading with mmap
LLM model artifacts from Hugging Face, such as a 15GB model.safetensors (weights in bfloat16) and config.json (architecture details: attention heads, layers, vocab size), reside on SSD and must be loaded into the RAM/GPU hierarchy without exhausting resources. Naive copying duplicates the data temporarily, wasting space. mmap solves this by letting the OS map the SSD file into virtual memory addresses and load weights lazily on first access. Evicted pages reload from SSD over PCIe (~7GB/s NVMe), so re-faulting 5% of a 15GB model (750MB) costs roughly 750MB / 7GB/s ≈ 107ms. This enables fast starts: llama.cpp loads a Qwen 2.5 model in under 10s by splitting weights between RAM and GPU for compute (layer offloading). vLLM uses mmap too but takes minutes due to compilation and initialization overhead for concurrent serving.
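To make the lazy loading concrete, here is a minimal Python sketch (the file path is a stand-in; any large weights file works). Mapping reserves virtual addresses only; bytes are paged in from SSD on first touch:

```python
import mmap
import os

WEIGHTS_PATH = "model.safetensors"  # hypothetical path to a large weights file

with open(WEIGHTS_PATH, "rb") as f:
    size = os.fstat(f.fileno()).st_size
    # Map the whole file into virtual memory; no bytes are read from disk yet.
    mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)

    # Touching a slice faults in only the pages backing that range, so
    # startup cost scales with pages accessed, not with file size.
    header = mm[0:4096]                        # OS pages in ~4 KiB from SSD
    middle = mm[size // 2 : size // 2 + 4096]  # another lazy page-in

    # Under memory pressure the OS can evict these clean pages and re-fault
    # them from SSD later: the ~107ms-per-750MB reload cost described above.
    mm.close()
```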
Trade-off: mmap accepts minor disk latency in exchange for not monopolizing RAM, which is ideal when Chrome and other apps compete for memory. Engines like llama.cpp (C++) excel here, yet Python-based vLLM outperforms it in tokens/s despite language overhead, showing that architecture matters more than raw language speed (the intuition from a "C++ beats Python" Fibonacci microbenchmark does not transfer to inference).
Quantization: Compress Weights Without Quality Loss
Quantization reduces bfloat16 weights to int4/int8 (like downscaling 4K to 1080p), shrinking models to fit 32GB consumer GPUs (the hobbyist ceiling) or 60-70GB enthusiast cards. Standard round-to-nearest (RTN) quantizes tensors per channel or per group, but a single uniform scale costs accuracy when disparate values (e.g., 0.9124 and 6.34) must share int4's -8 to 7 range.
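A toy numpy sketch of per-group RTN (the function names are ours, not any library's API) shows the outlier problem: one large value stretches the group scale, leaving small values only a few representable levels:

```python
import numpy as np

def rtn_quantize_int4(w: np.ndarray, group_size: int = 32):
    """Symmetric round-to-nearest int4: one scale per group of weights."""
    groups = w.reshape(-1, group_size)
    # Map the largest magnitude in each group onto the int4 edge (-8..7).
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# One outlier (6.34) dominates the group: the scale becomes ~0.91, so the
# thirty 0.05-weights all round to 0 and lose essentially all information.
w = np.array([0.9124, 6.34] + [0.05] * 30, dtype=np.float32)
q, s = rtn_quantize_int4(w)
print(np.abs(dequantize(q, s).ravel() - w))  # per-weight reconstruction error
```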
GGUF improves on this via grouping: 32 weights are normalized by the group's min/max (symmetric: ±max; asymmetric: min to max). Q4_0 is symmetric (one scale per group); Q4_1 is asymmetric (scale plus bias). K-quants (Q4_K_S/M) add a hierarchy, a 256-weight supergroup (global scale) containing 32-weight subgroups (local scales), plus mixed precision (e.g., Q4_K_M: 4-bit for most tensors, 6-bit for sensitive ones like the output, FFN gate, and norms). K-quants are popular on Hugging Face and balance compression against quality. Both ideas are sketched in code below.
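This numpy sketch uses our own naming and simplifies the layouts (real GGUF packs 4-bit nibbles and fp16 scales into fixed block structs): Q4_1-style asymmetric scale-plus-bias per 32-weight group, and the K-quant trick of quantizing the subgroup scales themselves against one supergroup scale, which is what keeps per-group metadata overhead low:

```python
import numpy as np

def q4_1_style(group: np.ndarray):
    """Asymmetric 4-bit (Q4_1-like): per 32-weight group, store a scale and
    a bias (the min), mapping [min, max] onto unsigned levels 0..15."""
    lo, hi = float(group.min()), float(group.max())
    scale = max((hi - lo) / 15.0, 1e-8)
    q = np.clip(np.round((group - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo                             # dequant: q * scale + lo

def k_quant_scales(supergroup: np.ndarray):
    """K-quant-like hierarchy: one float scale per 256-weight supergroup;
    each 32-weight subgroup keeps only a 6-bit scale code relative to it."""
    sub = supergroup.reshape(8, 32)                 # 256 weights = 8 subgroups of 32
    sub_scales = np.abs(sub).max(axis=1) / 7.0      # ideal per-subgroup scales
    super_scale = sub_scales.max() / 63.0 + 1e-12   # 6-bit codes span 0..63
    codes = np.round(sub_scales / super_scale).astype(np.uint8)
    return super_scale, codes                       # effective scale = code * super_scale
```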
AWQ calibrates with data to identify salient weights (those multiplied by high-magnitude activations) and scales them up before quantization to minimize error. EXL2/3 uses Hessian information (second-order loss sensitivity) to assign per-group mixed precision (salient groups: 4-6 bits; others: 2-3 bits). In benchmarks, EXL2 leads Llama-13B tokens/s with low perplexity at comparable size. There are also hardware-native formats: FP8 (Hopper GPUs) and NVFP4 (Blackwell). All of these are akin to choosing zip vs. tar: pick by engine and hardware; GGUF wins locally thanks to offloading support.
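The AWQ intuition also fits in a few lines of numpy. This is a hedged toy (the scale grid, salience threshold, and helper names are ours; real AWQ searches per-channel scales and fuses the inverse scale into the preceding op), but it shows why boosting salient channels before quantization reduces error where activations are loudest:

```python
import numpy as np

def rtn_q4(w: np.ndarray) -> np.ndarray:
    """Symmetric int4 RTN per output column (toy quantizer for this sketch)."""
    scale = np.abs(w).max(axis=0, keepdims=True) / 7.0 + 1e-8
    return np.clip(np.round(w / scale), -8, 7) * scale

def awq_style_search(w: np.ndarray, x: np.ndarray, grid=(1.0, 1.5, 2.0, 4.0)):
    """Pick a boost factor s for salient input channels: scale their weights
    up by s before quantization and divide back after. Float math is
    unchanged, but salient channels use more of the int4 range."""
    y_ref = x @ w                                  # full-precision reference output
    salience = np.abs(x).mean(axis=0)              # per-channel activation magnitude
    mask = salience > salience.mean()              # treat loud channels as salient
    best_err, best_s = np.inf, 1.0
    for s in grid:
        scales = np.where(mask, s, 1.0)[:, None]   # one factor per input channel (row of w)
        w_q = rtn_q4(w * scales) / scales          # quantize boosted weights, then undo
        err = np.abs(x @ w_q - y_ref).mean()       # calibration output error
        if err < best_err:
            best_err, best_s = err, s
    return best_s

# Synthetic calibration activations with a few loud channels, random weights:
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 128)) * (1.0 + 5.0 * (rng.random(128) < 0.05))
w = rng.normal(size=(128, 256))
print(awq_style_search(w, x))  # typically > 1.0: boosting salient channels helps
```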
Engine Trade-offs for Prefill, Decoding, Serving
Loading sets up prefill (prompt processing), decoding (token generation), and serving (concurrency and scheduling). llama.cpp (C++) optimizes for memory; vLLM/SGLang (Python) prioritize throughput and scheduling; TGI and TensorRT-LLM (Rust/C++/Python) mix approaches for speed. That vLLM beats llama.cpp in some speed tests despite Python again hints that optimized kernels and architecture matter most. Future phases cover speculative decoding, KV cache management, and more, but getting loading and quantization right is what prevents outright failures from memory exhaustion.