Elastic KV Cache: Boost LLM Serving Efficiency

kvcached on vLLM enables dynamic KV-cache allocation: it reserves no VRAM upfront (slashing idle usage), absorbs bursty loads without a latency hit, and lets multiple models share a GPU by releasing memory when idle.

Why Dynamic KV-Cache Beats Static Allocation

Static KV-cache allocation in engines like vLLM pre-reserves a fixed GPU memory pool for potential requests, wasting VRAM during the idle periods common in bursty LLM serving (think chat apps with sporadic user spikes). kvcached replaces this with elastic allocation: memory expands on demand during bursts and shrinks toward zero when idle, freeing VRAM for other models or processes. Principle: the KV-cache (key-value states for transformer attention) is request-specific and temporary; holding it statically ignores how real workloads behave. Common mistake: over-provisioning --gpu-memory-utilization (vLLM default 0.9) bloats idle usage without any throughput gain. kvcached autopatches vLLM via environment variables (ENABLE_KVCACHED=true, KVCACHED_AUTOPATCH=1) and uses shared IPC for multi-instance coordination; no code changes needed.

Hands-on principle: always baseline against static allocation to quantify wins. For production relevance, make the target workload mimic reality: concurrent request bursts (e.g., 6 parallel chats) followed by idle pauses (6 s+). Quality criteria: idle VRAM near model weights only; peak matches static; p50/p95 latency comparable; post-burst release back to the idle baseline.
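These quality criteria can be encoded as a small acceptance check. A hypothetical sketch (the function name, argument list, and the 15% tolerance are illustrative choices, not from the original notebook):

```python
def passes_quality_bar(idle_mb, weights_mb, peak_mb, static_peak_mb,
                       p50_s, static_p50_s, post_mb, tol=0.15):
    """Hypothetical acceptance check encoding the quality criteria above.

    tol is an assumed 15% tolerance on each criterion."""
    return (
        idle_mb <= weights_mb * (1 + tol)          # idle VRAM ~ model weights only
        and peak_mb <= static_peak_mb * (1 + tol)  # peak matches static allocation
        and p50_s <= static_p50_s * (1 + tol)      # latency comparable
        and post_mb <= idle_mb * (1 + tol)         # released back after the burst
    )
```

Run it once per experiment pair; a False result points at which criterion to investigate first.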

Reproducible Setup for GPU Experiments

Prerequisites: Python 3.10+, NVIDIA GPU (T4/A100 tested), CUDA 12+. Assumes vLLM familiarity; no ML PhD needed. Clone the full notebook from GitHub for one-click Colab run.

Step 1: Verify GPU and install.

import torch
assert torch.cuda.is_available()
print(f"GPU: {torch.cuda.get_device_name(0)} ({torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB)")  # E.g., Tesla T4 (15.0 GB)

Install pinned versions:

pip install vllm==0.10.2                   # stable for autopatch
pip install kvcached --no-build-isolation  # compiles the CUDA kernel (~1 min)
pip install matplotlib requests pynvml numpy

Models: Lightweight Qwen2.5-0.5B/1.5B-Instruct (HuggingFace) for fast loads; scale to Llama3.1-8B.

Step 2: Launch servers. Core function:

import os
import subprocess

def launch_vllm(model, port, kvcached=True, gpu_mem_util=0.55):
    env = os.environ.copy()
    env["VLLM_USE_V1"] = "1"
    if kvcached:
        env["ENABLE_KVCACHED"] = "true"
        env["KVCACHED_AUTOPATCH"] = "1"
        env["KVCACHED_IPC_NAME"] = f"kvc_{port}"  # unique shm segment per instance
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model, "--port", str(port),
        "--max-model-len", "2048",
        "--disable-log-requests", "--enforce-eager",  # eager mode keeps memory readings clean
    ]
    if not kvcached:
        cmd += ["--gpu-memory-utilization", str(gpu_mem_util)]
    proc = subprocess.Popen(cmd, env=env,
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return proc

Wait for readiness by polling the /v1/models endpoint (420s timeout). Shut down gracefully: SIGTERM first, then SIGKILL if the process lingers.
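A minimal sketch of that readiness poll and graceful shutdown (function names, the poll interval, and the per-request timeout are assumptions, not the notebook's exact code):

```python
import subprocess
import time

import requests

def wait_ready(port, timeout=420.0, poll=2.0):
    """Poll the OpenAI-compatible /v1/models endpoint until the server answers."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://localhost:{port}/v1/models", timeout=2).ok:
                return True
        except requests.RequestException:
            pass  # server not up yet; keep polling
        time.sleep(poll)
    return False

def shutdown(proc, grace=10.0):
    """SIGTERM first; SIGKILL only if the server ignores it."""
    proc.terminate()
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()
```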

Step 3: Monitor VRAM precisely.

import threading
import time

import pynvml

pynvml.nvmlInit()
NV_HANDLE = pynvml.nvmlDeviceGetHandleByIndex(0)

def vram_used_mb():
    return pynvml.nvmlDeviceGetMemoryInfo(NV_HANDLE).used / (1024**2)

class MemorySampler(threading.Thread):
    def __init__(self, interval=0.2):  # 5 Hz sampling
        super().__init__(daemon=True)
        self.interval, self.samples, self._stop = interval, [], threading.Event()
    def run(self):
        while not self._stop.is_set():
            self.samples.append((time.time(), vram_used_mb()))
            time.sleep(self.interval)
    def stop(self):
        self._stop.set(); self.join()

Avoid a common mistake: use pynvml rather than torch.cuda memory counters. torch only sees its own allocator, while pynvml reports true device-level usage, including fragmentation and other processes.

Benchmarking Bursty Workloads: Code and Metrics

Simulate realistic traffic: 3 bursts of 6 concurrent /v1/chat/completions requests (180 max tokens, temperature 0.7). Prompts vary (quantum explainer to haiku). The 6 s pauses between bursts give kvcached time to release memory.
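The benchmark code indexes into a PROMPTS list that this summary does not reproduce. A placeholder set with the described variety (seven prompts, quantum explainer to haiku) might look like:

```python
# Hypothetical prompt set; the original notebook's exact prompts are not shown.
PROMPTS = [
    "Explain quantum entanglement to a high-school student.",
    "Summarize the plot of Hamlet in three sentences.",
    "Write a Python function that reverses a linked list.",
    "List five tips for improving sleep quality.",
    "Describe how a transformer attention layer works.",
    "Translate 'good morning' into French, Spanish, and German.",
    "Write a haiku about GPUs.",
]
```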

from concurrent.futures import ThreadPoolExecutor
import time

import requests

def bursty_workload(port, model, n_bursts=3, burst_size=6, pause=6.0):
    latencies = []
    def one(i):
        body = {"model": model, "max_tokens": 180, "temperature": 0.7,
                "messages": [{"role": "user", "content": PROMPTS[i % len(PROMPTS)]}]}
        r = requests.post(f"http://localhost:{port}/v1/chat/completions", json=body)
        return r.elapsed.total_seconds()
    with ThreadPoolExecutor(max_workers=burst_size) as ex:
        for _ in range(n_bursts):
            latencies += list(ex.map(one, range(burst_size)))
            time.sleep(pause)  # idle gap; lets kvcached release VRAM
    return latencies

Run paired experiments:

  1. kvcached=True: Idle ~model weights (e.g., 1100MB on T4 for 0.5B).
  2. Baseline (kvcached=False, gpu_mem_util=0.55): Idle bloats to 4500MB (reserved pool).

Capture: sampler.start() before the first burst, sampler.stop() after the final pause. Metrics: peak VRAM, median latency, and flex (peak minus idle).
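The capture-and-reduce step can be sketched as follows (assuming MemorySampler records (timestamp, used_mb) pairs; the function and key names are illustrative):

```python
import statistics

def summarize(samples, latencies, idle_mb):
    """Reduce sampler output and request latencies to the headline metrics.

    samples: list of (timestamp, used_mb) pairs from the VRAM sampler;
    idle_mb: steady-state VRAM measured before the first burst."""
    peak = max(mb for _, mb in samples)
    return {
        "peak_vram_mb": peak,
        "flex_mb": peak - idle_mb,  # memory kvcached can give back when idle
        "median_latency_s": statistics.median(latencies),
    }
```

Compute one summary per configuration (kvcached vs baseline) and compare the dictionaries directly.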

Visualization template:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 4.5))
# Left: time vs VRAM (kvcached solid, baseline dashed)
axes[0].plot(tk, mk, label="kvcached", lw=2)
axes[0].plot(tb, mb, label="baseline", ls="--", lw=2)
axes[0].axhline(idle_kvc, ls=":", alpha=0.3)  # annotate idle level
axes[0].set_xlabel("time (s)"); axes[0].set_ylabel("VRAM (MB)")
axes[0].legend()
# Right: latency distributions
axes[1].boxplot([lat_kvc, lat_base], labels=["kvcached", "baseline"])
axes[1].set_ylabel("request latency (s)")
plt.savefig("kvcached_bursty.png")

Expected: kvcached idle 1100MB → burst peak 4500MB → release to 1100MB. Baseline stuck at 4500MB. Latencies match (median ~1.2s). Savings: 3400MB idle.

"The idle gap is where kvcached releases physical VRAM -- a static-allocation engine simply cannot."

Multi-Model GPU Sharing: Dynamic Memory Arbitration

Load two models on one GPU (ports 8001/8002) and alternate single bursts between them (4 concurrent requests, 5 s settle between rounds).

MODEL_A, MODEL_B = "Qwen/Qwen2.5-0.5B-Instruct", "Qwen/Qwen2.5-1.5B-Instruct"
pA = launch_vllm(MODEL_A, 8001, kvcached=True)
wait_ready(8001)
pB = launch_vllm(MODEL_B, 8002, kvcached=True)
wait_ready(8002)  # total idle ~2000MB (both models' weights)
sampler = MemorySampler()
sampler.start()
for i in range(4):
    port, model = (8001, MODEL_A) if i % 2 == 0 else (8002, MODEL_B)
    bursty_workload(port, model, n_bursts=1, burst_size=4)
    time.sleep(5)  # settle before switching models
sampler.stop()

Observation: VRAM flexes 2000MB idle → 4500MB (model A burst) → 2000MB → 5000MB (model B, larger). No OOM; static would fail.

Principle: the IPC-shared cache pool arbitrates memory fairly; idle instances yield it almost instantly. Scale to 4+ models on an A100. Mistake: reusing the same KVCACHED_IPC_NAME across instances causes shm collisions; use a unique name per port.

"Two LLMs on one T4 via kvcached — memory flexes per active model."

CLI Tools for Production Monitoring

kvcached bundles:

  • kvtop: Live KV-per-instance (like htop/nvtop). Run: kvtop → see alloc/release realtime.
  • kvctl: Budget caps, e.g., kvctl kvc_8001 limit 2GB.

Test: shutil.which("kvtop") post-install. Integrate with Prometheus for dashboards.
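A quick availability check for the bundled CLIs, in the spirit of the shutil.which test mentioned above (the helper name is illustrative):

```python
import shutil

def check_kvcached_tools(tools=("kvtop", "kvctl")):
    """Report which kvcached CLI tools are on PATH (None means missing)."""
    return {t: shutil.which(t) for t in tools}

print(check_kvcached_tools())
```

Run this right after installation; a None value usually means the install's script directory is not on PATH.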

"kvtop — live per-instance KV memory monitor (like nvtop for kvcached)."

Full reproducibility: GitHub notebook auto-generates plots/summaries. Extend: Ray Serve integration, Kubernetes multi-GPU.

Key Takeaways

  • Install kvcached on vLLM 0.10.2; autopatch via ENABLE_KVCACHED=true—no engine fork needed.
  • Benchmark bursty: 3x6 requests, 6s pauses; expect 70%+ idle VRAM savings vs static gpu_mem_util=0.55.
  • Monitor with pynvml sampler (0.2s interval) + matplotlib for proof.
  • Multi-model: Unique KVCACHED_IPC_NAME per port; alternate loads show flex.
  • Avoid static pitfalls: No release post-burst wastes tenant slots.
  • Production: kvtop/kvctl for observability; target <20% overhead.
  • Replicate on Colab T4: Full code yields plots in <10min.
  • Principle: Demand-driven KV > fixed pools for 90% real workloads.

Notable quotes:

  1. "kvcached enables significant VRAM savings during idle periods while maintaining competitive latency under load."
  2. "By running multiple models on a single GPU and alternating traffic, we clearly saw how memory is allocated only when needed and released when idle."
  3. "VRAM flex: kvcached peak-idle = XXX MB (baseline can't release -- static pool)."
  4. "This is great for bursty or multi-tenant inference environments."

