Solving the Inference Cold-Start Bottleneck

Inference workloads on Kubernetes often suffer from multi-minute cold starts: the container image must be pulled, model weights loaded, and CUDA kernels compiled before a worker can serve requests. GPUs sit idle throughout, and during traffic spikes the delay can cause SLA violations. NVIDIA Dynamo Snapshot addresses this by checkpointing a fully initialized, "warm" inference worker and restoring it on demand.

The Checkpoint/Restore Mechanism

Dynamo Snapshot uses two tools to capture the full state of an inference worker:

  • cuda-checkpoint: Serializes GPU-side state (CUDA contexts, streams, and memory) into CPU memory.
  • CRIU (Checkpoint/Restore in Userspace): Serializes the host-side process tree (threads, file descriptors, namespaces) to disk.
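
The ordering of the two tools can be sketched as a pair of command sequences. This is a minimal illustration of a manual flow, assuming the standalone cuda-checkpoint and criu command-line tools are installed; the helper names (checkpoint_commands, restore_commands) are hypothetical, and nothing here is claimed to be how Dynamo Snapshot itself invokes the tools. The functions only build the argument lists and do not execute anything.

```python
from pathlib import Path


def checkpoint_commands(pid: int, image_dir: Path) -> list[list[str]]:
    """Build the GPU-then-host checkpoint sequence for one process (hypothetical helper)."""
    return [
        # Step 1: suspend CUDA in the process and copy GPU state into host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Step 2: dump the process tree, which now holds only host-side state, to disk.
        ["criu", "dump", "--tree", str(pid), "--images-dir", str(image_dir)],
    ]


def restore_commands(pid: int, image_dir: Path) -> list[list[str]]:
    """Build the mirror-image sequence: host state first, then GPU state."""
    return [
        # Step 1: recreate the process tree from the on-disk images.
        ["criu", "restore", "--images-dir", str(image_dir), "--restore-detached"],
        # Step 2: move the saved GPU state from host memory back onto the device.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]
```

Each inner list could be passed to subprocess.run on a suitably privileged host. The point of the sketch is the ordering: GPU state has to be folded into host memory before CRIU runs, and restored only after CRIU has recreated the process.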

To manage this in Kubernetes, NVIDIA provides a privileged snapshot-agent DaemonSet. The agent orchestrates checkpointing and restoration at the container level so that filesystem and process state stay synchronized. Distributed runtime constraints are handled with quiesce/resume hooks: the worker waits for a signal file before initializing network connections, so the restored process resumes in a consistent state.
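
The worker-side half of the quiesce/resume hook might look like the following. It is a minimal sketch that assumes the signal is simply the appearance of a file at an agreed path; the function name, the polling interval, and the timeout are illustrative assumptions, not the actual Dynamo Snapshot interface.

```python
import time
from pathlib import Path


def wait_for_resume_signal(signal_file: Path, timeout_s: float = 60.0, poll_s: float = 0.05) -> bool:
    """Block until signal_file exists; return False if timeout_s elapses first.

    Hypothetical helper: the worker calls this before opening any network
    connections, so the checkpoint is taken while it is parked here.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if signal_file.exists():
            return True
        time.sleep(poll_s)
    # One last check in case the file appeared during the final sleep.
    return signal_file.exists()
```

A worker would call this once, after model initialization and before joining the distributed runtime, and would open its connections only when the call returns True. Because no sockets exist at checkpoint time, the restored process has no stale connections to repair.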

Performance Optimizations

To make checkpointing viable for large models, NVIDIA has implemented or planned several optimizations:

  • KV Cache Unmap: By using the CUDA Virtual Memory Management API (cuMemUnmap and cuMemRelease), the system releases physical memory for the KV cache while maintaining the virtual address range, significantly reducing checkpoint sizes (e.g., from ~190 GiB to ~6 GiB for Qwen3-0.6B).
  • CRIU Enhancements: Future upstream improvements include parallel memfd restoration and the use of Linux native AIO (io_submit) to eliminate serial bottlenecks in memory restoration, allowing restore speeds to approach the theoretical storage bandwidth limit.
  • GPU Memory Service (GMS): This upcoming feature decouples model weights from the core CRIU checkpoint. By offloading weights to a separate artifact, GMS allows process state and weight restoration to occur concurrently, reducing end-to-end startup time for large models (like gpt-oss-120b) by up to 21×.
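
The benefit of decoupling weights from the process checkpoint comes from overlap: once the two artifacts are independent, they can be restored at the same time. The sketch below illustrates only that idea with two caller-supplied callables run on a thread pool; restore_worker and its parameters are hypothetical stand-ins and do not reflect the GMS API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

P = TypeVar("P")
W = TypeVar("W")


def restore_worker(
    restore_process: Callable[[], P],
    load_weights: Callable[[], W],
) -> tuple[P, W]:
    """Run process-state restore and weight loading concurrently (illustrative only)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Submit both tasks before waiting on either, so they overlap.
        process_future = pool.submit(restore_process)
        weights_future = pool.submit(load_weights)
        return process_future.result(), weights_future.result()
```

With this structure the wall-clock time is roughly that of the slower task rather than the sum of both, which is why separating weights from the process checkpoint pays off most for large models.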