
DeepSeek V4: 10x KV Savings for 1M-Token Agents

DeepSeek V4 Pro cuts FLOPs to 27% and KV cache to 10% of V3.2 at 1M tokens via hybrid attention, delivering near-frontier performance at $1.74/M input tokens for long-horizon agents.

Efficiency Breakthroughs Enable Long-Context Scaling

DeepSeek V4 Pro, a 1.6-trillion-parameter model with 49 billion active parameters, uses only 27% of the FLOPs and 10% of the KV cache of V3.2 at a 1-million-token context window. The smaller V4 Flash has 284 billion total parameters with 13 billion active, and offers the same native 1M-token context. These reductions make million-token inference feasible without massive memory costs. In max reasoning mode (V4 Pro Max), the model matches or approaches Claude Opus 4.6 and GPT-5.4 on knowledge and agentic benchmarks, prioritizing efficiency over raw scale.
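For a sense of scale, here is a back-of-the-envelope sketch of what a 10x KV-cache cut means at 1M tokens. The layer, head, and dimension counts below are placeholder assumptions (the article does not give V4's attention shape); only the 10% ratio comes from the text.

```python
# Illustrative KV-cache sizing at a 1M-token context.
# layers / kv_heads / head_dim are assumed placeholders, NOT V4's real shape.
def kv_cache_gib(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    # factor of 2 for keys + values, stored in bf16/fp16 (2 bytes per value)
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 2**30

baseline = kv_cache_gib(1_000_000, layers=61, kv_heads=8, head_dim=128)
print(f"uncompressed-style cache: {baseline:.0f} GiB")    # ~233 GiB
print(f"at 10% of baseline:       {baseline * 0.1:.0f} GiB")  # ~23 GiB
```

At those assumed dimensions, the cache for a single 1M-token sequence drops from hundreds of GiB to a size a single accelerator can hold, which is what makes million-token sessions practical.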

Hybrid Attention Stack Drives Speed

V4 interleaves two attention mechanisms across layers: Compressed Sparse Attention (CSA) and Heavy Compressed Attention (HCA). CSA collapses every four KV tokens into one compressed entry, then applies sparse attention over the top-K most relevant compressed blocks selected by a fast indexer, combining compression with sparsity for query-focused efficiency. HCA is more aggressive: it collapses every 128 tokens into a single KV entry and runs full attention over the short compressed stream, with no sparsity. A sliding-window branch paired with both preserves fine-grained local detail that compression would otherwise lose in the interleaved layers. Together, this hybrid stack yields the dramatic KV and FLOPs savings, and it is optimized for agent loops over large documents and multi-step reasoning.
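A minimal sketch of the three branches may make the division of labor concrete. This is a single-decode-step toy in PyTorch under stated assumptions: mean pooling stands in for the learned compressor, a plain dot-product score stands in for the indexer, and the window and top-K sizes are illustrative. Only the 4:1 and 128:1 ratios come from the article.

```python
import torch
import torch.nn.functional as F

def compress_kv(k, v, ratio):
    """Mean-pool every `ratio` cached KV tokens into one compressed entry.
    (Mean pooling is an assumption; the real compressor is presumably learned.)"""
    t = k.shape[1] // ratio * ratio            # drop the ragged tail for brevity
    k_c = k[:, :t].reshape(k.shape[0], -1, ratio, k.shape[-1]).mean(dim=2)
    v_c = v[:, :t].reshape(v.shape[0], -1, ratio, v.shape[-1]).mean(dim=2)
    return k_c, v_c

def csa(q, k, v, ratio=4, top_k=64):
    """Compressed Sparse Attention: 4:1 compression, then attend only to the
    top-K compressed entries picked by a cheap relevance score (the 'indexer')."""
    k_c, v_c = compress_kv(k, v, ratio)
    scores = q @ k_c.transpose(-1, -2) / q.shape[-1] ** 0.5
    idx = scores.topk(min(top_k, k_c.shape[1]), dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, idx, 0.0)                # keep selected blocks, drop the rest
    return F.softmax(scores + mask, dim=-1) @ v_c

def hca(q, k, v, ratio=128):
    """Heavy Compressed Attention: 128:1 compression, dense attention over
    the short compressed stream, no sparsity."""
    k_c, v_c = compress_kv(k, v, ratio)
    scores = q @ k_c.transpose(-1, -2) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v_c

def local_branch(q, k, v, window=256):
    """Sliding-window branch: the current step attends to the last `window`
    raw tokens, preserving detail the compressed branches discard."""
    k_w, v_w = k[:, -window:], v[:, -window:]
    scores = q @ k_w.transpose(-1, -2) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v_w

# Layers alternate the two compressed mechanisms; each is paired with the
# local branch (summed here, which is an illustrative combination rule).
def layer_attention(layer_idx, q, k, v):
    compressed = csa(q, k, v) if layer_idx % 2 == 0 else hca(q, k, v)
    return compressed + local_branch(q, k, v)

B, T, D = 1, 100_000, 128                      # toy sizes
q = torch.randn(B, 1, D)                       # one decode-step query
k, v = torch.randn(B, T, D), torch.randn(B, T, D)
out = layer_attention(0, q, k, v)
```

The payoff is visible in the shapes: at 100K cached tokens, CSA scores 25K compressed entries but attends to only 64 of them, HCA attends densely to fewer than 800 entries, and the local branch touches 256 raw tokens, so no branch ever materializes attention over the full cache.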

Cost-Effective Agent Applications and Access

At $1.74 per million input tokens and $3.48 per million output tokens for V4 Pro (Flash is 14¢ input / 28¢ output), plus context caching at 14¢/M for Pro and 3¢/M for Flash, the economics now support long-horizon agents without RAG pipelines. Entire large documents or multi-step agent runs can fit in a single context, enabling integration with tools like Claude Code or OpenCode; DeepSeek uses the model in-house for its own agents. Access is via the chat.deepseek.com API or Hugging Face weights, with third-party inference hosts expected soon. This lowers the barrier for production apps where prior models' costs deterred long-context experimentation.
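As a quick sanity check on those numbers, here is the per-turn cost of pushing a full 1M-token context through each model. The per-token prices are the ones quoted above; the 4K-token output and 80% cache-hit rate are illustrative assumptions.

```python
# USD per million tokens: (fresh input, output, cached input), from the article.
PRICES = {
    "v4-pro":   (1.74, 3.48, 0.14),
    "v4-flash": (0.14, 0.28, 0.03),
}

def turn_cost(model, input_tokens, output_tokens, cache_hit=0.8):
    """Cost of one agent turn in USD; cache_hit is the assumed fraction of
    the context already resident in the provider-side cache."""
    fresh, out, cached = PRICES[model]
    return (input_tokens * (1 - cache_hit) * fresh
            + input_tokens * cache_hit * cached
            + output_tokens * out) / 1e6

print(f"{turn_cost('v4-pro',   1_000_000, 4_000):.2f}")  # ~0.47 USD
print(f"{turn_cost('v4-flash', 1_000_000, 4_000):.2f}")  # ~0.05 USD
```

Even with the whole million-token context billed fresh, a Pro turn comes in under $2, which is the regime where carrying the documents directly in context starts to beat maintaining a retrieval pipeline.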
