The Bottleneck: Memory Management in LLM Inference
LLM inference consists of two distinct phases: the prefill phase, where the model processes the whole input prompt in parallel and is typically compute-bound, and the decode phase, where it generates output tokens one at a time and is typically memory-bound. During decode, every step must read the Key-Value (KV) cache: the attention key and value tensors stored for all previous tokens, kept so they do not have to be recomputed.
Traditional systems waste much of this memory through naive allocation: they reserve one contiguous region of GPU memory per request, sized for the maximum possible output length. The unused tail of each reservation is internal fragmentation, and the gaps left between differently sized reservations are external fragmentation. Together they can leave 60-80% of the memory allocated for the KV cache empty, which severely limits the number of concurrent requests a GPU can handle.
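To make the scale of that waste concrete, here is a rough back-of-the-envelope sketch. The model dimensions (32 layers, 32 KV heads, head dimension 128, 16-bit values) and the request lengths are assumptions picked for illustration, not figures from any particular deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, chosen only for illustration."""
    num_layers: int = 32
    num_kv_heads: int = 32
    head_dim: int = 128
    bytes_per_value: int = 2  # fp16 / bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # One key vector and one value vector per layer, per KV head.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.bytes_per_value


def reserved_vs_used(shape: ModelShape, max_len: int, actual_len: int) -> tuple[int, int, float]:
    """Return (reserved_bytes, used_bytes, wasted_fraction) for one request
    when the allocator reserves room for max_len tokens up front."""
    per_token = kv_bytes_per_token(shape)
    reserved = per_token * max_len
    used = per_token * actual_len
    return reserved, used, 1 - used / reserved


shape = ModelShape()
reserved, used, wasted = reserved_vs_used(shape, max_len=2048, actual_len=512)
print(f"per token: {kv_bytes_per_token(shape) / 1024:.0f} KiB")   # 512 KiB
print(f"reserved:  {reserved / 2**20:.0f} MiB, used: {used / 2**20:.0f} MiB")  # 1024 MiB, 256 MiB
print(f"wasted:    {wasted:.0%}")                                  # 75%
```

Under these assumptions, a request that stops after 512 tokens leaves three quarters of its 2048-token reservation untouched, which falls inside the 60-80% range quoted above.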
Solving Fragmentation with Paged Attention
Paged attention applies the operating system concept of virtual memory paging to GPU VRAM. Instead of requiring a single contiguous region for a request's KV cache, it splits the cache into small, fixed-size pages, or blocks (16 tokens by default). A per-request block table maps these logical blocks to physical blocks that can sit anywhere in VRAM. Because every block is the same size, external fragmentation disappears, and internal fragmentation is limited to the partly filled last block of each sequence. The indirection also enables memory reuse, such as sharing the blocks of a common system prompt across multiple requests.
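The block-table idea can be sketched in a few lines. This is a toy model of the bookkeeping only, assuming a simple free list. `BlockAllocator` and `Sequence` are hypothetical names, not any engine's API, and no attention kernel or GPU memory is involved.

```python
class BlockAllocator:
    """Toy pool of fixed-size physical KV blocks (not a real engine's API)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks - 1, -1, -1))  # pop() yields 0, 1, 2, ...

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


class Sequence:
    """Tracks one request's block table: logical block index -> physical block id."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_index: int) -> tuple[int, int]:
        """Translate a token position to (physical block id, offset in block)."""
        logical_block, offset = divmod(token_index, self.allocator.block_size)
        return self.block_table[logical_block], offset

    def free_all(self) -> None:
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8, block_size=16)
a, b = Sequence(allocator), Sequence(allocator)
for _ in range(20):  # interleave growth so the two tables end up non-contiguous
    a.append_token()
    b.append_token()

print(a.block_table, b.block_table)  # [0, 2] [1, 3]
print(a.physical_slot(17))           # (2, 1)
```

Each sequence holds 20 tokens in two 16-token blocks, so at most 12 slots per sequence go unused no matter how long the request could have grown. Memory is claimed one block at a time as tokens arrive, not reserved up front.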
Tuning for Production Throughput
To maximize GPU utilization, developers should focus on three primary configuration strategies:
- GPU Memory Utilization: Set the fraction of VRAM the engine is allowed to claim. The model weights and activation workspace come out of this budget, and whatever remains becomes the KV cache. The default is 0.9; stable workloads can be pushed to 0.95 to increase concurrency, whereas bursty workloads may require lowering it to 0.8 to avoid Out-of-Memory (OOM) errors.
- Prefix Caching: By hashing each KV block over the tokens it holds and the tokens that precede it, the system can point multiple requests that share the same system prompt at the same physical memory and skip recomputing that prefix. This is particularly effective for RAG pipelines and coding agents, where long shared prompts are common.
- Chunked Prefill: This technique splits a long prefill into smaller chunks so that decode steps from other requests can be batched alongside each chunk. This prevents long prompts from causing "stuttering" in token streams and, depending on the workload, can improve throughput by up to 50% in high-load scenarios.
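The prefix-caching strategy in the list above can be illustrated with a small sketch of hash-chained blocks. This is a simplified model under stated assumptions: `block_hashes` and `PrefixCache` are hypothetical helpers, only full blocks are cached, and there is no eviction or reference counting, which a production engine would need.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block, matching the page size quoted earlier


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each *full* block together with everything before it.

    Chaining the parent hash in means two blocks only match when their
    entire prefixes match, not just their own 16 tokens.
    """
    hashes = []
    parent = ""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy map from block hash to physical block id (hypothetical, no eviction)."""

    def __init__(self):
        self.table: dict[str, int] = {}
        self.next_block = 0

    def lookup_or_allocate(self, token_ids: list[int]) -> tuple[list[int], int]:
        """Return (physical block ids, number of blocks reused from cache)."""
        blocks, hits = [], 0
        for h in block_hashes(token_ids):
            if h in self.table:
                hits += 1
            else:
                self.table[h] = self.next_block
                self.next_block += 1
            blocks.append(self.table[h])
        return blocks, hits


system_prompt = list(range(100, 148))             # 48 tokens = 3 full blocks
request_1 = system_prompt + list(range(1, 21))    # + 20 user tokens
request_2 = system_prompt + list(range(50, 70))   # same prefix, different question

cache = PrefixCache()
blocks_1, hits_1 = cache.lookup_or_allocate(request_1)
blocks_2, hits_2 = cache.lookup_or_allocate(request_2)
print(blocks_1, hits_1)  # [0, 1, 2, 3] 0
print(blocks_2, hits_2)  # [0, 1, 2, 4] 3
```

The second request reuses the three blocks that cover the shared system prompt and allocates a new block only where its tokens diverge. Chaining the parent hash is the design choice that matters here: hashing each block in isolation would wrongly match identical 16-token runs that follow different contexts.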
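For latency-sensitive applications, speculative decoding can put the GPU compute that sits idle during the memory-bound decode phase to work. A smaller "draft" model proposes several tokens, which the larger model verifies in a single forward pass. With the standard acceptance rule, the output matches what the larger model would have produced on its own, so quality is preserved, and generation speeds up whenever most of the draft's guesses are accepted.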
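The draft-then-verify loop can be sketched as follows. This is a deliberately simplified greedy variant: a drafted token is accepted only if it equals the larger model's own top choice, whereas production systems typically use a probabilistic acceptance rule over the two models' distributions. The two "models" are toy functions invented for the example, and the verification loop only simulates what a real engine would obtain from one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def speculative_step(context: List[int], draft: NextToken, target: NextToken, k: int = 4) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append.

    Greedy variant: a drafted token is accepted only if it equals the
    target's own greedy choice at that position.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. Target scores every drafted position. A real engine gets all k + 1
    #    predictions from one batched forward pass; this loop only simulates it.
    target_choice = [target(context + proposed[:i]) for i in range(k + 1)]

    # 3. Keep the longest agreeing prefix, then add one token from the target
    #    (the correction at the first mismatch, or a bonus token if all matched).
    accepted: List[int] = []
    for i, token in enumerate(proposed):
        if token != target_choice[i]:
            break
        accepted.append(token)
    accepted.append(target_choice[len(accepted)])
    return accepted


# Toy stand-ins for real models: the target counts upward, and the draft
# agrees except when the next value would be a multiple of 5.
def target_model(ctx: List[int]) -> int:
    return ctx[-1] + 1


def draft_model(ctx: List[int]) -> int:
    nxt = ctx[-1] + 1
    return nxt if nxt % 5 else 0


print(speculative_step([1], draft_model, target_model))  # [2, 3, 4, 5]
print(speculative_step([5], draft_model, target_model))  # [6, 7, 8, 9, 10]
```

In the first round the draft's fourth guess is wrong, so the target's correction replaces it. In the second round all four guesses are accepted and the target contributes a fifth token. Either way, each round yields at least one token that the target itself would have chosen, so the result is identical to decoding with the target alone.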