Elastic KV Cache: Boost LLM Serving Efficiency
kvcached on vLLM enables dynamic KV-cache allocation: it reserves no VRAM upfront, absorbs bursty loads without a latency penalty, and lets multiple models share one GPU by releasing memory the moment an instance goes idle.
Why Dynamic KV-Cache Beats Static Allocation
Static KV-cache allocation in engines like vLLM pre-reserves a fixed GPU memory pool for potential requests, wasting VRAM during the idle periods common in bursty LLM serving (think chat apps with sporadic user spikes). kvcached replaces this with elastic allocation: memory expands on demand during bursts and shrinks to zero when idle, freeing VRAM for other models or processes. The principle: KV-cache (the key-value attention states of a transformer) is request-specific and temporary, so holding it statically ignores how real workloads behave. A common mistake is over-provisioning --gpu-memory-utilization (default 0.9), which bloats idle usage without any throughput gain. kvcached autopatches vLLM via environment variables (ENABLE_KVCACHED=true, KVCACHED_AUTOPATCH=1) and coordinates multiple instances through shared IPC, so no code changes are needed.
Hands-on principle: always benchmark against a static baseline to quantify the win. Make the target workload mimic production reality: concurrent requests arriving in bursts (e.g., 6 parallel chats), followed by pauses of 6s or more. Quality criteria: idle VRAM close to model weights alone; peak VRAM matching the static pool; comparable p50/p95 latency; and release back to the idle baseline after each burst. A sketch of these criteria as acceptance checks follows below.
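As a concrete illustration, here is a minimal sketch of those quality criteria expressed as acceptance checks. The metric names (idle_mb, peak_mb, and so on) are hypothetical placeholders for values gathered by the benchmark code later in this post:

def check_quality(idle_mb, peak_mb, post_burst_mb, p50_kvc_s, p50_base_s,
                  weights_mb, static_peak_mb, tol=0.15):
    # Idle VRAM should be close to model weights alone
    assert idle_mb <= weights_mb * (1 + tol)
    # Peak under load should match the static pool, not exceed it
    assert peak_mb <= static_peak_mb * (1 + tol)
    # Median latency should stay comparable to the static baseline
    assert p50_kvc_s <= p50_base_s * (1 + tol)
    # After the burst, memory should release back to the idle baseline
    assert post_burst_mb <= idle_mb * (1 + tol)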
Reproducible Setup for GPU Experiments
Prerequisites: Python 3.10+, NVIDIA GPU (T4/A100 tested), CUDA 12+. Assumes vLLM familiarity; no ML PhD needed. Clone the full notebook from GitHub for one-click Colab run.
Step 1: Verify GPU and install.
import torch
assert torch.cuda.is_available()
print(f"GPU: {torch.cuda.get_device_name(0)} ({torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB)") # E.g., Tesla T4 (15.0 GB)
Install pinned versions:
pip install vllm==0.10.2  # Stable release for the autopatch path
pip install kvcached --no-build-isolation  # Compiles a CUDA kernel (~1 min)
pip install matplotlib requests pynvml numpy
Models: lightweight Qwen2.5-0.5B/1.5B-Instruct (HuggingFace) for fast loads; the same setup scales to Llama-3.1-8B.
Step 2: Launch servers. Core function:
import os
import subprocess

def launch_vllm(model, port, kvcached=True, gpu_mem_util=0.55):
    env = os.environ.copy()
    env["VLLM_USE_V1"] = "1"
    if kvcached:
        env["ENABLE_KVCACHED"] = "true"
        env["KVCACHED_AUTOPATCH"] = "1"
        env["KVCACHED_IPC_NAME"] = f"kvc_{port}"  # Unique shared-memory segment per instance
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model, "--port", str(port),
        "--max-model-len", "2048",
        "--disable-log-requests", "--enforce-eager",  # Eager mode keeps memory measurements clean
    ]
    if not kvcached:
        cmd += ["--gpu-memory-utilization", str(gpu_mem_util)]  # Static pool for the baseline
    proc = subprocess.Popen(cmd, env=env,
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)  # Log handling: one reasonable choice
    return proc
Wait for readiness by polling the /v1/models endpoint (420s timeout). Shut down gracefully: SIGTERM first, then SIGKILL if the process hangs.
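A minimal sketch of these two helpers, assuming vLLM's OpenAI-compatible /v1/models route (the names wait_ready and shutdown match how they are used later in this post):

import subprocess
import time
import requests

def wait_ready(port, timeout=420):
    # Poll until the server answers on /v1/models or the timeout expires
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://localhost:{port}/v1/models", timeout=2).ok:
                return
        except requests.exceptions.RequestException:
            pass
        time.sleep(2)
    raise TimeoutError(f"vLLM on port {port} not ready within {timeout}s")

def shutdown(proc, grace=15):
    proc.terminate()  # SIGTERM: let vLLM clean up
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL as a last resort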
Step 3: Monitor VRAM precisely.
import threading, time, pynvml

pynvml.nvmlInit()
NV_HANDLE = pynvml.nvmlDeviceGetHandleByIndex(0)
def vram_used_mb():
    return pynvml.nvmlDeviceGetMemoryInfo(NV_HANDLE).used / (1024**2)
class MemorySampler(threading.Thread):  # Samples VRAM on a background thread
    def __init__(self, interval=0.2):  # 5Hz sampling
        super().__init__(daemon=True)
        self.interval, self.samples, self._stop = interval, [], threading.Event()
    def run(self):  # Append (timestamp, MB) pairs until stop() is called
        while not self._stop.is_set():
            self.samples.append((time.time(), vram_used_mb()))
            time.sleep(self.interval)
    def stop(self):
        self._stop.set(); self.join()
Avoid a common mistake: use pynvml rather than torch.cuda memory stats. torch.cuda only sees the current process's PyTorch allocator, while pynvml reports device-wide usage (including the vLLM server subprocesses) and is more accurate for fragmented VRAM.
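Typical usage wraps a measurement window; a short example assuming the MemorySampler above:

sampler = MemorySampler(interval=0.2)
sampler.start()
# ... run the workload under test ...
sampler.stop()
timestamps, mbs = zip(*sampler.samples)  # Split (time, MB) pairs for plotting
print(f"idle~{min(mbs):.0f}MB peak~{max(mbs):.0f}MB flex~{max(mbs)-min(mbs):.0f}MB")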
Benchmarking Bursty Workloads: Code and Metrics
Simulate realistic traffic: 3 bursts of 6 concurrent /chat/completions requests (180 max tokens, temperature 0.7), with varied prompts ranging from a quantum explainer to a haiku. The 6s pauses between bursts are what trigger memory release.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

def bursty_workload(port, model, n_bursts=3, burst_size=6, pause=6.0):
    def one(i):
        # PROMPTS is the list of 7 varied prompts described above
        body = {"model": model, "max_tokens": 180, "temperature": 0.7,
                "messages": [{"role": "user", "content": PROMPTS[i % len(PROMPTS)]}]}
        r = requests.post(f"http://localhost:{port}/v1/chat/completions", json=body)
        return r.elapsed.total_seconds()  # Latency in seconds
    latencies = []
    with ThreadPoolExecutor(max_workers=burst_size) as ex:
        for b in range(n_bursts):
            latencies += list(ex.map(one, range(burst_size)))  # One concurrent burst
            time.sleep(pause)  # Idle gap: kvcached should release KV memory here
    return latencies
Run paired experiments:
- kvcached=True: Idle ~model weights (e.g., 1100MB on T4 for 0.5B).
- Baseline (kvcached=False, gpu_mem_util=0.55): Idle bloats to 4500MB (reserved pool).
Capture with sampler.start() before the first burst and sampler.stop() after the final pause. Metrics: peak VRAM, median latency, and flex (peak minus idle). A condensed sketch of one paired run follows.
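Assuming the helpers defined above (launch_vllm, wait_ready, shutdown, MemorySampler, bursty_workload), one paired run might look like this; the results layout is illustrative:

import statistics

results = {}
for name, use_kvc in [("kvcached", True), ("baseline", False)]:
    proc = launch_vllm("Qwen/Qwen2.5-0.5B-Instruct", 8000, kvcached=use_kvc)
    wait_ready(8000)
    sampler = MemorySampler()
    sampler.start()
    lats = bursty_workload(8000, "Qwen/Qwen2.5-0.5B-Instruct")
    time.sleep(8)  # Let kvcached release memory after the final burst
    sampler.stop()
    shutdown(proc)
    mbs = [m for _, m in sampler.samples]
    results[name] = {"idle_mb": min(mbs), "peak_mb": max(mbs),
                     "flex_mb": max(mbs) - min(mbs),
                     "p50_s": statistics.median(lats)}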
Visualization template:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))
# Left: VRAM over time (tk/mk: kvcached timestamps/MB; tb/mb: baseline)
axes[0].plot(tk, mk, label="kvcached", lw=2)
axes[0].plot(tb, mb, label="baseline", ls="--", lw=2)
axes[0].axhline(idle_kvc, ls=":", alpha=0.3)  # Annotate the idle baseline
axes[0].set_xlabel("time (s)"); axes[0].set_ylabel("VRAM (MB)"); axes[0].legend()
# Right: latency distributions
axes[1].boxplot([lat_kvc, lat_base], labels=["kvcached", "baseline"])
plt.savefig("kvcached_bursty.png")
Expected: kvcached idles at 1100MB, peaks at 4500MB during bursts, then releases back to 1100MB; the baseline stays pinned at 4500MB throughout. Latencies match (median ~1.2s). Idle savings: 3400MB.
"The idle gap is where kvcached releases physical VRAM -- a static-allocation engine simply cannot."
Multi-Model GPU Sharing: Dynamic Memory Arbitration
Load two models on one GPU (ports 8001/8002) and alternate bursts between them: one burst of 4 concurrent requests per round, then a 5s settle before switching models.
MODEL_A, MODEL_B = "Qwen/Qwen2.5-0.5B", "Qwen/Qwen2.5-1.5B"
pA = launch_vllm(MODEL_A, 8001, kvcached=True)
wait_ready(8001)
pB = launch_vllm(MODEL_B, 8002, kvcached=True)
wait_ready(8002)  # Total idle ~2000MB: both models' weights, no KV pools
sampler = MemorySampler()
sampler.start()
for i in range(4):
    # Alternate traffic between the two instances
    port, model = (8001, MODEL_A) if i % 2 == 0 else (8002, MODEL_B)
    bursty_workload(port, model, n_bursts=1, burst_size=4)
    time.sleep(5)  # Settle before switching
sampler.stop()
Observation: VRAM flexes from ~2000MB idle → 4500MB (model A burst) → back to ~2000MB → 5000MB (model B burst; the larger model). No OOM; two static pools of this size would not fit on the card.
Principle: the IPC-shared cache pool arbitrates memory fairly, and idle instances yield their share immediately; the same setup scales to 4+ models on an A100. Mistake to avoid: colliding KVCACHED_IPC_NAME values break the shared-memory coordination, so assign a unique name per port (as launch_vllm does).
"Two LLMs on one T4 via kvcached — memory flexes per active model."
CLI Tools for Production Monitoring
kvcached bundles:
- kvtop: live per-instance KV-cache view (like htop/nvtop). Run kvtop to watch allocation and release in real time.
- kvctl: budget caps per instance, e.g. kvctl kvc_8001 limit 2GB.
Test availability with shutil.which("kvtop") post-install (a quick check follows below); integrate with Prometheus for dashboards.
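A minimal post-install sanity check:

import shutil

for tool in ("kvtop", "kvctl"):
    assert shutil.which(tool), f"{tool} not on PATH; re-run the kvcached install"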
"kvtop — live per-instance KV memory monitor (like nvtop for kvcached)."
Full reproducibility: GitHub notebook auto-generates plots/summaries. Extend: Ray Serve integration, Kubernetes multi-GPU.
Key Takeaways
- Install kvcached on vLLM 0.10.2; autopatch via ENABLE_KVCACHED=true—no engine fork needed.
- Benchmark bursty: 3x6 requests, 6s pauses; expect 70%+ idle VRAM savings vs static gpu_mem_util=0.55.
- Monitor with pynvml sampler (0.2s interval) + matplotlib for proof.
- Multi-model: Unique KVCACHED_IPC_NAME per port; alternate loads show flex.
- Avoid the static pitfall: memory that is never released after a burst wastes capacity other tenants could use.
- Production: kvtop/kvctl for observability; target <20% overhead.
- Replicate on Colab T4: Full code yields plots in <10min.
- Principle: demand-driven KV-cache beats fixed pools for 90% of real workloads.
Notable quotes:
- "kvcached enables significant VRAM savings during idle periods while maintaining competitive latency under load."
- "By running multiple models on a single GPU and alternating traffic, we clearly saw how memory is allocated only when needed and released when idle."
- "VRAM flex: kvcached peak-idle = XXX MB (baseline can't release -- static pool)."
- "This is great for bursty or multi-tenant inference environments."