Visual Primitives Solve LMM Reference Gap
DeepSeek's withdrawn paper introduces 'Thinking with Visual Primitives'—embedding bounding boxes and points into every reasoning step—to fix ambiguous referencing in multimodal models, achieving 77.2% on spatial benchmarks with 10x fewer tokens than rivals.
Embed Coordinates as Core Reasoning Units to Eliminate Reference Gap
Current large multimodal models (LMMs) suffer from a 'Reference Gap': natural language can't precisely pinpoint visual entities, causing failures in dense counting, multi-step spatial reasoning, and tracking. For example, asking 'What is the leftmost bird doing?' among 50 birds forces vague descriptions like 'gray bird near left edge,' collapsing logic chains.
DeepSeek's solution elevates bounding boxes (x1,y1,x2,y2) and points (x,y) from final outputs to 'visual primitives', the minimum units of thought. The model outputs coordinates inline during reasoning: 'I see a 452,23,804,411 climbing a tree (exclude); 50,447,647,771 on ground (include).' This anchors every reasoning step visually, much as a human points while scanning, so the model does not lose track of entities in dense scenes.
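To make the inline-coordinate idea concrete, here is a minimal sketch (not from the paper) that extracts box and point primitives from a reasoning trace with regular expressions; the bare comma-separated serialization mirrors the quote above and is an assumption for illustration.

```python
import re

# A reasoning trace with inline visual primitives: boxes as x1,y1,x2,y2
# and points as x,y. The exact serialization the paper uses is not given
# here, so this bare comma-separated format is an illustrative assumption.
trace = (
    "I see a 452,23,804,411 climbing a tree (exclude); "
    "50,447,647,771 on ground (include)."
)

BOX_RE = re.compile(r"\b(\d+),(\d+),(\d+),(\d+)\b")    # four coords -> box
POINT_RE = re.compile(r"\b(\d+),(\d+)\b(?!,\d)")       # two coords -> point

def extract_primitives(text: str):
    """Return (boxes, points) referenced in a reasoning step."""
    boxes = [tuple(map(int, m.groups())) for m in BOX_RE.finditer(text)]
    # Strip box spans first so their trailing coordinate pairs are not
    # mistaken for standalone points.
    stripped = BOX_RE.sub("", text)
    points = [tuple(map(int, m.groups())) for m in POINT_RE.finditer(stripped)]
    return boxes, points

boxes, points = extract_primitives(trace)
print(boxes)   # [(452, 23, 804, 411), (50, 447, 647, 771)]
print(points)  # []
```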
The model is built on DeepSeek-V4-Flash with a DeepSeek-ViT vision encoder in a LLaVA-style architecture (ViT features fed into the LLM); the fusion itself is standard, and the innovation lies in the reasoning paradigm.
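As a rough sketch of that LLaVA-style fusion (module names, dimensions, and the two-layer MLP projector are assumptions, not details from the paper):

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Minimal LLaVA-style fusion: project ViT patch features into the
    LLM embedding space and prepend them to the text token embeddings.
    The dimensions and the 2-layer MLP projector are illustrative only."""
    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_features: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # vit_features: (batch, num_patches, vit_dim) from the vision encoder
        # text_embeds:  (batch, seq_len, llm_dim) from the LLM embedding table
        visual_tokens = self.projector(vit_features)
        # The fused sequence is consumed by the LLM as an ordinary token stream.
        return torch.cat([visual_tokens, text_embeds], dim=1)
```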
Achieve 7056x Token Compression with No Capability Loss
Processing an 800x800 image yields 2,916 patch tokens, bloating the KV cache and slowing inference. DeepSeek applies two-stage compression: spatial pooling (3x3 patches to 1 token, 2,916 → 324) plus DeepSeek-V4-Flash's 4x Compressed Sparse Attention (324 → 81 tokens, ~90 KV slots total). Result: 7056x overall compression.
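The arithmetic is easy to verify; in the back-of-the-envelope sketch below, reading '7056x' as pixels per KV slot is my interpretation rather than the paper's stated definition.

```python
# Back-of-the-envelope check of the compression figures quoted above.
pixels = 800 * 800            # 640,000 input pixels
patch_tokens = 2_916          # ViT patch tokens (a 54 x 54 grid)

spatial = patch_tokens // 9   # 3x3 patches -> 1 token
sparse = spatial // 4         # 4x Compressed Sparse Attention

print(spatial, sparse)            # 324 81
# Interpreting "7056x" as pixels per KV slot (an assumption, not the
# paper's stated definition): 640,000 / 7,056 ~= 90.7 slots, which
# matches the "~90 KV slots total" figure.
print(round(pixels / 7056, 1))    # 90.7
```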
Reported comparisons: Gemma-4-31B (289 tokens), GPT-4o (~740), Claude-3.5-Sonnet (~870), Gemini-1.5-Flash (1,100). DeepSeek uses roughly one tenth of Claude's tokens.
Performance holds: 77.2% average across 7 benchmarks (counting, spatial reasoning, maze navigation, path tracking), beating GPT-4o (71.1%), Claude-3.5-Sonnet (65.3%), Gemini-1.5-Flash (76.5%). Excels in multi-step tasks: maze navigation 66.9% (vs GPT-4o 50.6%), path tracking 56.7% (vs 46.5%), Pixmo-Count 89.2% (vs Gemini 88.2%), fine-grained counting 88.7% (vs Qwen2-VL 87.2%).
Five-Step Training Pipeline Yields Unified Spatial Expert
Pre-training: Crawl 97,984 bounding box sources (HuggingFace etc.), filter via Semantic Review (MLLM checks labels for nonsense/ambiguity/harm) + Geometric Review (valid framing, no truncation/giant boxes >90% area), retaining 31,701 sources → 40M+ samples.
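A minimal sketch of what the Geometric Review stage might check; the >90%-area rule comes from the description above, while the border-truncation heuristic and the function signature are assumptions.

```python
def passes_geometric_review(box, img_w, img_h, max_area_frac=0.90):
    """Reject invalid or uninformative boxes: malformed coordinates,
    boxes cut off at the image border, and 'giant' boxes covering more
    than 90% of the image (the threshold quoted above). Any rules beyond
    these are assumptions for illustration."""
    x1, y1, x2, y2 = box
    if not (0 <= x1 < x2 <= img_w and 0 <= y1 < y2 <= img_h):
        return False                      # malformed or out of bounds
    if x1 == 0 or y1 == 0 or x2 == img_w or y2 == img_h:
        return False                      # heuristic: likely truncated object
    area_frac = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
    return area_frac <= max_area_frac     # drop near-full-image boxes

print(passes_geometric_review((50, 447, 647, 771), 800, 800))  # True
print(passes_geometric_review((0, 0, 800, 800), 800, 800))     # False
```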
SFT: Train separate box/point experts to avoid conflicts on small data.
RL: GRPO with three rewards: format (correct syntax, no duplicates or loops), quality (LLM-judged reasoning), and accuracy (task-specific; for counting, reward = 1 / (1 + |pred - gt| / (gt + 1)), with tolerance hyperparameters α=0.7 and β=3 for dense scenes; for mazes, a causal progress ratio plus a completeness term, with trajectories truncated at illegal wall-passes).
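The counting reward can be sketched directly; the α=0.7, β=3 tolerance terms are mentioned above, but their placement in the formula isn't spelled out here, so this sketch keeps only the base form.

```python
def counting_reward(pred: int, gt: int) -> float:
    """Soft accuracy reward for counting: 1 / (1 + |pred - gt| / (gt + 1)).
    Exact match gives 1.0, and the penalty for being off by k shrinks as
    the ground-truth count grows, tolerating small errors in dense scenes.
    (The alpha=0.7, beta=3 shaping terms cited above are omitted because
    their exact placement isn't specified here.)"""
    return 1.0 / (1.0 + abs(pred - gt) / (gt + 1))

print(counting_reward(50, 50))  # 1.0
print(counting_reward(48, 50))  # ~0.96 -- off by 2 among 50 objects
print(counting_reward(3, 5))    # 0.75  -- off by 2 among only 5 objects
```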
Rejection Fine-Tuning: Merge the box and point experts into a single model.
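A rough sketch of how such a merge is typically done, assuming the usual rejection-sampling recipe (sample from each specialist expert, keep only traces that pass a task-specific check, and fine-tune the unified model on the pooled data); none of the names below come from the paper.

```python
def build_rejection_ft_dataset(experts, prompts, verify, samples_per_prompt=8):
    """Collect only traces that pass verification (e.g. an exact-match
    answer check) from the box and point experts; the merged model is then
    fine-tuned on this pooled dataset. `experts`, `verify`, and the
    generate() API are hypothetical placeholders."""
    dataset = []
    for prompt in prompts:
        for expert in experts:
            for _ in range(samples_per_prompt):
                trace = expert.generate(prompt)   # assumed API
                if verify(prompt, trace):         # task-specific checker
                    dataset.append({"prompt": prompt, "response": trace})
    return dataset
```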
On-Policy Distillation: The experts teach the student via full-vocabulary logits and a reverse-KL objective, which is mode-seeking: it concentrates the student on the peaks of multimodal distributions and cuts hallucinations.
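A minimal sketch of a reverse-KL distillation loss over full-vocabulary logits (PyTorch; the function and its shapes are assumptions, not the paper's implementation).

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor) -> torch.Tensor:
    """On-policy distillation loss KL(student || teacher) over the full
    vocabulary. Unlike forward KL, the reverse direction is mode-seeking:
    the student concentrates on the teacher's dominant modes instead of
    spreading mass across all of them, the 'peaking' effect described
    above. Shapes: (batch, seq_len, vocab_size)."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_p = student_logp.exp()
    # KL(p_s || p_t) = sum_v p_s * (log p_s - log p_t), averaged over tokens
    return (student_p * (student_logp - teacher_logp)).sum(-1).mean()
```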
Evaluations span counting (coarse and fine-grained, with an anchor per object), spatial reasoning (multi-hop and embodied), mazes (grid, circular, and honeycomb layouts, including unsolvable ones), and path tracking (following a line's curvature through intersections where color offers no cue).
Real tasks shine: the model distinguishes Chihuahuas from muffins by combining semantics with boxes; infers from a scale's tilt that a gummy bear outweighs a cabinet; links a Golden Gate box to the Warriors NBA team; and diagrams latte-making steps on a photo of an espresso machine.
Outperforms Language-Only and Auxiliary Grounding Paradigms
Text-only CoT (GPT-4V, Claude 3) fails on referential ambiguity. High-resolution cropping (InternVL) sharpens detail but cannot reference entities across patches. Post-verification approaches (GRIT, DeepEyesV2) ground answers linguistically after the fact. VGR adds visual grounding but keeps it subordinate to the text.
DeepSeek makes primitives intrinsic: pointing-while-thinking drives the reasoning itself, unlike Argus (arXiv:2505.23766, 2025), which explores a similar architecture but with less depth on data and rewards.