DGX Spark Runs 14B LLMs at 20 Tokens/Sec Locally

NVIDIA DGX Spark's 128GB of Grace Blackwell unified memory fits models up to 200B parameters locally and delivers 20.19 tokens/sec on a 14B NVFP4 model via vLLM, making it ideal for prototyping on a cloud-equivalent stack.

Unified Memory Unlocks Local 200B-Param Workloads

NVIDIA DGX Spark, powered by the GB10 Grace Blackwell superchip, combines CPU and GPU with 128GB of unified memory and FP4 support. That capacity fits models up to 200B parameters on a desk-sized workstation running the same NVIDIA AI software stack as data centers and clouds. Running locally avoids cloud scheduling delays, variable costs, and data-residency constraints; scale to the cloud only when ready. Memory capacity determines which models fit, but memory bandwidth governs speed; NVFP4 quantization boosts "intelligence per byte," making 14B models feel as responsive as smaller ones.
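
To make the capacity claim concrete, here is a back-of-envelope sizing sketch (my own illustration, not from the talk) of raw weight bytes per parameter at different precisions, showing why 4-bit weights are what fit a 200B model under the 128GB ceiling:

```python
# Back-of-envelope weight sizing against DGX Spark's 128GB unified memory.
# Raw weights only: KV cache, activations, and runtime overhead come on
# top, so real headroom is smaller than these numbers suggest.

BYTES_PER_PARAM = {
    "fp16": 2.0,
    "fp8": 1.0,
    "nvfp4": 0.5,  # 4-bit weights, ignoring scale-factor overhead
}

def weight_footprint_gb(params_billion: float, dtype: str) -> float:
    """Raw weight size in GB for a model with the given parameter count."""
    # params_billion * 1e9 params * bytes/param / 1e9 bytes/GB
    return params_billion * BYTES_PER_PARAM[dtype]

for size in (1.5, 14, 70, 200):
    print(f"{size}B  nvfp4: {weight_footprint_gb(size, 'nvfp4'):6.1f} GB   "
          f"fp16: {weight_footprint_gb(size, 'fp16'):6.1f} GB")

# 200B at nvfp4 is ~100 GB of weights (fits under 128GB); at fp16 it
# would need ~400 GB, which is why quantization is the enabler here.
```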

Reproducible vLLM Benchmarks Capture Real UX Metrics

Serve models (1.5B to 14B) from NVIDIA-optimized Docker containers so the dev environment matches prod exactly. Automate with an orchestrator script: generate a unique run directory from a timestamp plus the model ID, enforce environment isolation, require warm-up runs, and log GPU metrics every second (a sketch of this pattern follows). Measure end-to-end latency through the streaming API, timestamping the first token chunk precisely in stream_once() to capture Time to First Token (TTFT), the key user-perceived responsiveness metric. Each run's artifacts include metadata, responses, and results for verification; start from the example commands in the build.nvidia.com/spark playbooks.
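
A minimal sketch of the orchestration pattern, assuming a plain Python script; the helper names (make_run_dir, log_gpu_metrics) and the exact nvidia-smi query are my illustration, not the talk's actual code:

```python
# Unique run directory from timestamp + model ID, plus a background
# thread that appends one line of GPU metrics per second via nvidia-smi.

import subprocess
import threading
import time
from datetime import datetime, timezone
from pathlib import Path

def make_run_dir(model_id: str, root: str = "runs") -> Path:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_dir = Path(root) / f"{stamp}_{model_id.replace('/', '_')}"
    run_dir.mkdir(parents=True, exist_ok=False)  # never reuse a run dir
    return run_dir

def log_gpu_metrics(run_dir: Path, stop: threading.Event) -> None:
    """Append a CSV line of GPU utilization and memory once per second."""
    with open(run_dir / "gpu_metrics.csv", "a") as f:
        while not stop.is_set():
            out = subprocess.run(
                ["nvidia-smi",
                 "--query-gpu=timestamp,utilization.gpu,memory.used",
                 "--format=csv,noheader"],
                capture_output=True, text=True, check=True,
            )
            f.write(out.stdout)
            f.flush()
            time.sleep(1)

# Typical use: start the logger thread, run warm-up plus measured
# passes, then set the event and join the thread before writing results.
```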
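
The talk names stream_once() but not its body; a plausible reconstruction follows, assuming vLLM's OpenAI-compatible endpoint on localhost:8000 and treating each streamed content chunk as roughly one token:

```python
# TTFT measurement over the streaming API: stamp the wall clock just
# before the request, then again on the first non-empty content chunk.

import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def stream_once(model: str, prompt: str) -> dict:
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices or not chunk.choices[0].delta.content:
            continue  # skip role-only or empty chunks
        if first_token_at is None:
            first_token_at = time.perf_counter()  # TTFT boundary
        n_chunks += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise RuntimeError("stream produced no content")
    return {
        "ttft_s": first_token_at - start,
        # Decode rate after the first token; one chunk ~ one token.
        "decode_tok_per_s": (n_chunks - 1) / (end - first_token_at)
        if n_chunks > 1 else 0.0,
    }
```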

Quantization Drives Throughput: 61 Tokens/Sec at 1.5B, 20 at 14B

Across the tested instruct and base models, throughput drops sharply with model size, but NVFP4 closes much of the gap. The 1.5B instruct model hits 61.73 tokens/sec; 14B NVFP4 reaches 20.19 tokens/sec (still faster than human reading speed), versus 8.40 tokens/sec for the unquantized 14B base. TTFT grows with parameter count, but NVFP4 on the 14B model is 3.4x faster to first token than the base, showing how quantization balances compute cost and UX on Blackwell hardware. These figures reflect realistic dev workflows, not theoretical peaks.
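
A quick sanity check on the reported numbers (pure arithmetic; the reading-speed figure is a typical estimate, not from the talk):

```python
# Turn the reported throughput figures into speedups and per-token latency.

nvfp4_14b = 20.19      # tokens/sec, 14B NVFP4
base_14b = 8.40        # tokens/sec, 14B unquantized base
instruct_1_5b = 61.73  # tokens/sec, 1.5B instruct

print(f"NVFP4 decode speedup on 14B:  {nvfp4_14b / base_14b:.2f}x")      # ~2.40x
print(f"1.5B vs 14B NVFP4 ratio:      {instruct_1_5b / nvfp4_14b:.2f}x")  # ~3.06x
print(f"Per-token latency, 14B NVFP4: {1000 / nvfp4_14b:.0f} ms")         # ~50 ms

# Typical silent reading is roughly 4-6 words/sec, well under
# 20 tokens/sec, so the 14B NVFP4 stream outpaces a reader.
```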

Choose Local for Prototyping, Privacy, Steady-State

Opt for DGX Spark when cloud iteration lags: privacy-sensitive data stays local, rapid prototyping and fine-tuning run on the same stack as production, and steady-state workloads avoid variable cloud costs and latency. Run locally to iterate fast, then port to the cloud seamlessly; the goal is developer productivity, not full cloud replacement.

Video description
Moving LLM workloads from the cloud to local infrastructure requires a shift in engineering strategy. In this talk, I share my journey of serving and benchmarking open-source models (1.5B to 14B) on an NVIDIA DGX Spark workstation. Using a reproducible methodology with vLLM, I analyze real-world trade-offs in throughput, latency, and the benefits of the 128GB Grace Blackwell unified memory architecture. You will leave with a clear framework for local model sizing, an understanding of quantization performance like NVFP4, and a guide for when local compute is the right choice for your AI stack.

Speaker info:
- LinkedIn: https://www.linkedin.com/in/mozhgankch/

