LLM Inference: Fast Prefill, Slow Decode
LLM generation splits into parallel prefill (prompt processing at ~0.5-3 ms/token) and sequential decode (output generation at ~40 ms/token), making prompt processing up to 50x faster per token than generation.
Core Phases of LLM Inference
LLM inference divides into two distinct stages: prefill (processing the input prompt) and decode (generating output tokens). Prefill runs all input tokens in parallel on the GPU, achieving 0.55-2.98 ms per token (e.g., 219 tokens in ~120-167 ms, or 1378 tokens/sec). Decode processes one token at a time sequentially, taking ~38-42 ms per token (e.g., 199 tokens in ~7800-8400 ms, or 23-25 tokens/sec). This is why prompts process 5-50x faster per token than generation, even at equal lengths: prefill's parallelism saturates GPU compute, while decode's one-token-at-a-time dependency chain cannot.
The benchmarks use Phi-3 Mini (3.8B parameters, FP16 weights, 4k context) on a T4 GPU (16GB VRAM, all layers offloaded via n_gpu_layers=-1); load time is consistently 677 ms. Resetting the model between runs avoids KV cache interference and keeps measurements clean.
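A minimal sketch of this setup, assuming the llama-cpp-python bindings and a local Phi-3 Mini GGUF file (the model path is illustrative). Prefill is approximated here as time-to-first-token when streaming, which also includes one decode step:

    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="phi-3-mini-4k-instruct-fp16.gguf",  # hypothetical local path
        n_gpu_layers=-1,  # offload all layers to the T4
        n_ctx=4096,       # 4k context window
        verbose=True,     # print llama.cpp perf logs after each call
    )

    def time_generation(prompt: str, max_tokens: int):
        """Approximate prefill as time-to-first-token (streaming) and
        decode as the per-token average over the remaining tokens."""
        llm.reset()  # clear the KV cache so runs don't interfere
        start = time.perf_counter()
        first_token_at = None
        n_out = 0
        for _chunk in llm(prompt, max_tokens=max_tokens, stream=True):
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_out += 1
        end = time.perf_counter()
        prefill_s = first_token_at - start  # includes one decode step
        per_decode_s = (end - first_token_at) / max(n_out - 1, 1)
        return prefill_s, per_decode_s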
Prompt Length Slows Generation via KV Cache Overhead
Larger prompts increase total prefill time linearly (e.g., 3567 tokens: 2689 ms total, 0.75 ms/token) but hit peak efficiency around 400 tokens (0.57 ms/token at 404 tokens, up to 1309 tokens/sec). Shorter prompts (<111 tokens) underutilize the GPU: per-token prefill time falls as prompt length grows toward ~400 tokens, then rises slightly beyond that.
Critically, longer prompts tax decode: a fixed 199 output tokens take from 7.05 s (111 input tokens, 28 tokens/sec) to 9.93 s (3567 input tokens, 20 tokens/sec). This ~40% slowdown stems from each sequential decode step attending over a larger KV cache, showing that input context directly affects output speed even when generation length is identical.
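The sweep is straightforward to reproduce; a hedged sketch, reusing the time_generation helper from the setup sketch above (the padding text and target lengths are illustrative, and the repeated word only roughly hits the intended token counts):

    for target in (111, 404, 1378, 3567):
        prompt = "word " * target  # rough way to reach a target token count
        n_tok = len(llm.tokenize(prompt.encode("utf-8")))  # actual token count
        prefill_s, per_decode_s = time_generation(prompt, max_tokens=199)
        print(f"{n_tok:5d} input tokens: prefill {prefill_s * 1000:.0f} ms, "
              f"decode {1.0 / per_decode_s:.1f} tokens/sec")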
Output Length Drives Linear Costs, Minimal Per-Token Variance
With a fixed minimal prompt, decode scales linearly with output length: 50 tokens take ~1.6 s total, 1500 tokens ~50 s. Per-token time stays stable at 33-36 ms (32.91 ms at 50 tokens rising to 35.84 ms at 1500, +8.9%), with the minor degradation attributable to the growing KV cache. Repeated runs (10-20) converge to 40-42 ms/token, confirming that initial variance comes from GPU warmup or noise; always average repeated runs for reliable figures.
Prefill time remains constant regardless of output length (e.g., ~90 ms for the 111-token prompt across all tests), cleanly separating it from the generation phase.
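Averaging repeated runs is the simplest guard against warmup effects; a sketch under the same assumptions, again reusing time_generation (the run count and output lengths mirror the ranges above):

    import statistics

    def mean_decode_ms(prompt: str, max_tokens: int, runs: int = 10) -> float:
        """Average per-token decode time (ms) over several runs."""
        samples = [time_generation(prompt, max_tokens)[1] * 1000
                   for _ in range(runs)]
        return statistics.mean(samples)

    for n_out in (50, 200, 500, 1500):
        print(f"{n_out:4d} output tokens: {mean_decode_ms('Hi', n_out):.2f} ms/token")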
Optimization Insights from Phase Trade-offs
To minimize latency, keep prompts concise but large enough (~400 tokens) to saturate the GPU. Long contexts impose a tax on both phases: prefill time grows linearly, and decode slows from the larger KV cache. Generation dominates total time for longer outputs (e.g., 7571 ms decode vs 48 ms prefill in the first test). Use tools like llama_cpp with verbose perf logs (llama_perf_context_print) to profile: track prompt eval time, eval time, tokens/sec, and graphs reused (KV cache hits). These mechanics inform better model selection, quantization, and prompt engineering for production AI pipelines.
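With verbose=True (as in the setup sketch above), no extra profiling code is needed; llama.cpp prints its perf block after each completion. A brief sketch of which fields to read, mapped onto the two phases discussed here:

    # After each call, llama.cpp's perf output (llama_perf_context_print)
    # appears on stderr. The fields named in the text map as follows:
    #   prompt eval time -> prefill (total ms, ms/token, tokens/sec)
    #   eval time        -> decode  (total ms, ms/token, tokens/sec)
    #   graphs reused    -> compute-graph/KV-cache reuse across decode steps
    llm.reset()
    _ = llm("Summarize KV caching in one sentence.", max_tokens=64)
    # ...perf lines print automatically; no extra API call is needed.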