DeepSeek's Visual Primitives: 10x KV Cache Efficiency
DeepSeek's 'Thinking with Visual Primitives' embeds bounding boxes and points as inline chain-of-thought tokens to close the visual reference gap, compressing the KV cache 10x (90 entries vs. 870 for Sonnet on 80x80 images) for frontier-grade vision at 1/10th the cost.
Visual Primitives Fix Reference Gaps in Multimodal Chain-of-Thought
Current multimodal models suffer from a 'reference gap': even with perfect perception, language descriptions lose precision over long reasoning chains (e.g., 'the third bear from the left'). DeepSeek solves this by treating bounding boxes and points as first-class tokens in the vocabulary, emitted inline during chain-of-thought. For a team-photo counting query, the model generates a tag like [label:person]box:(x1,y1,x2,y2) for each entity, enabling reliable counting in dense scenes, multi-hop spatial reasoning, and disambiguation of look-alikes such as Chihuahua vs. muffin. This builds on DeepSeek's two-year lineage of prioritizing cheap representations: DeepSeek-VL (hybrid SigLIP/SAM encoders), Janus (decoupled understanding/generation encoders), DeepSeek-VL2 (MoE/MLHA with 1B active params, scoring 80.9 on OCR and 88.9 on DocVQA), Janus-Pro-7B (runs on a consumer GPU, beats DALL-E 3 with 80% on GenEval), and DeepSeek-OCR (renders 1000 text tokens as an image, compressing them to ~100 vision tokens at 97% accuracy). The throughline: seek the minimal representation that preserves information, like pixels over tokens (per Karpathy: 'the tokenizer must go').
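Because the primitives are just text in the trace, downstream tooling can read them back out. Below is a minimal Python sketch of parsing and counting such tags, assuming the [label:...]box:(x1,y1,x2,y2) syntax shown above; the model's actual special-token vocabulary and coordinate format may differ.

```python
import re

# Sketch: parse inline box primitives out of a chain-of-thought trace.
# The tag syntax is assumed from the article's example, not DeepSeek's spec.
BOX_PATTERN = re.compile(r"\[label:(?P<label>[^\]]+)\]box:\((?P<coords>[\d.,\s-]+)\)")

def extract_boxes(cot_text: str) -> list[dict]:
    """Pull (label, box) pairs out of a reasoning trace."""
    boxes = []
    for m in BOX_PATTERN.finditer(cot_text):
        x1, y1, x2, y2 = (float(v) for v in m.group("coords").split(","))
        boxes.append({"label": m.group("label"), "box": (x1, y1, x2, y2)})
    return boxes

def count_label(cot_text: str, label: str) -> int:
    """Count grounded references to one label, e.g. 'person' in a team photo."""
    return sum(1 for b in extract_boxes(cot_text) if b["label"] == label)

# Hypothetical trace for the team-photo counting query described above.
trace = (
    "I see [label:person]box:(12,40,88,210) on the far left, "
    "[label:person]box:(95,38,170,212) next to them, and "
    "[label:person]box:(180,45,255,215) on the right."
)
print(count_label(trace, "person"))  # -> 3
```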
Architecture Delivers 7000x Compression on DeepSeek-V4 Flash
The base is standard: image → custom Vision Transformer (arbitrary resolution, 14x14 patches) → LLM (DeepSeek-V4 Flash: 284B-param MoE, 13B active), with a text tokenizer feeding the LLM and a detokenizer on the output. The efficiency magic is in the ViT path: a 756x756 image (571k pixels) → 2916 patch tokens → 3x3 channel compression down to 324 tokens → V4's compressed sparse attention for a further 4x KV reduction → 81 KV entries (7000x compression). An 80x80 image uses 90 KV entries vs. Sonnet 4.6's 870 or Gemini 3 Flash's ~1000, roughly 10x less compute. Training: (1) trillions-scale pretraining; (2) SFT on separate box- and point-grounding models; (3) GRPO RL with format, quality, and accuracy rewards; (4) unified RFD merge; (5) on-policy distillation into a single student. Result: frontier reasoning at 1/10th the vision inference cost.
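The compression figures follow directly from the stated factors. The sketch below reproduces the arithmetic for a 756x756 image, assuming the 14x14 patch size, 3x3 channel compression, and 4x sparse-attention KV reduction described above; the function and parameter names are illustrative, not DeepSeek's API.

```python
# Back-of-the-envelope arithmetic for the vision compression pipeline.
def kv_entries(width: int, height: int,
               patch: int = 14,          # ViT patch size from the article
               spatial_merge: int = 3,   # 3x3 channel compression
               kv_reduction: int = 4) -> tuple[int, int, int]:
    """Return (patch_tokens, merged_tokens, kv_entries) for one image."""
    patch_tokens = (width // patch) * (height // patch)
    merged_tokens = patch_tokens // (spatial_merge ** 2)
    entries = merged_tokens // kv_reduction
    return patch_tokens, merged_tokens, entries

patches, merged, entries = kv_entries(756, 756)
pixels = 756 * 756
print(patches, merged, entries)               # 2916 324 81
print(f"compression ~{pixels // entries}x")   # ~7056x, i.e. the ~7000x figure
```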
Strong Grounded Reasoning Wins, But Limited to Triggered Use
Excels on pointer-dependent tasks: 67% on maze navigation (vs. 49% for Gemini 3 Flash, GPT-4o, and Sonnet 4.6); doubles path-tracing scores; ties or wins on counting and spatial benchmarks. Gemini 3 Flash still leads raw count QA, but primitives lift topology tasks where language-only descriptions fail to track trajectories. Caveats (per the paper): scores are reported only on relevant subsets, not as overall superiority; performance is resolution-bound (fine-grained scenes fail); an explicit trigger is needed (no automatic use); and point reasoning generalizes poorly across scenarios. DeepSeek emphasizes honesty over hype. Rollout started April 29, 2025, in the app and web fast/expert modes; the paper appeared briefly on GitHub.