Tag: multimodal

Summaries

Data and Beyond

May 5, 2026

Visual Primitives Solve LMM Reference Gap

DeepSeek's withdrawn paper introduces 'Thinking with Visual Primitives', which embeds bounding boxes and points into every reasoning step to fix ambiguous referencing in multimodal models. It reports 77.2% on spatial benchmarks with 10x fewer tokens than rivals.

Gemma 3: Open Multimodal Models from 270M to 27B Params

Gemma 3 is a family of lightweight, open-weight multimodal LLMs (text and image input, text output) in sizes from 270M to 27B parameters. The models offer a 128K-token context window (32K for the smallest sizes), are trained on 6T to 14T tokens across 140+ languages, and are suited to resource-constrained deployment.

llm

open-source

multimodal