LFM2.5-VL-450M Delivers Edge VLM with Grounding in <250ms

The 450M vision-language model scales pre-training to 28T tokens, adds bounding box detection (81.28 RefCOCO-M) and multilingual support (68.09 MMMB), and processes a 512x512 image in 242ms on Jetson Orin for real-time edge apps.

Core Upgrades Enable Structured Outputs on Edge

Pre-training scales from 10T to 28T tokens, followed by preference optimization and RL, yielding production multimodal gains: bounding box prediction jumps from 0 to 81.28 on RefCOCO-M, enabling object localization; multilingual image understanding rises from 54.29 to 68.09 on MMMB across Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish; and instruction following improves from 32.93 to 45.00 on MM-IFEval for better steerability. The model also adds text-side function calling (21.08 on BFCLv4). Together these yield grounded, actionable outputs from images without a separate detection model.
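Grounded outputs like the RefCOCO-M boxes above typically arrive as tagged spans in the model's text output that the application parses into pixel coordinates. A minimal sketch of that parsing step, assuming a hypothetical `<box>x1,y1,x2,y2</box>` format with coordinates normalized to 0-1000 (a common VLM grounding convention; the actual LFM2.5-VL output format may differ):

```python
import re

def parse_boxes(text, img_w, img_h):
    """Extract bounding boxes from model output text.

    Assumes boxes appear as <box>x1,y1,x2,y2</box> spans with
    coordinates normalized to 0-1000. This format is an
    illustrative assumption, not the documented model output.
    """
    boxes = []
    for m in re.finditer(r"<box>(\d+),(\d+),(\d+),(\d+)</box>", text):
        x1, y1, x2, y2 = (int(v) for v in m.groups())
        # Rescale normalized coordinates to the source image size.
        boxes.append((
            x1 / 1000 * img_w, y1 / 1000 * img_h,
            x2 / 1000 * img_w, y2 / 1000 * img_h,
        ))
    return boxes

# Hypothetical grounded caption for a 512x512 frame.
out = "forklift <box>100,200,400,800</box> near pallet <box>500,100,900,600</box>"
print(parse_boxes(out, 512, 512))
```

Keeping localization in the generation pass, as here, is what removes the need for a separate detection model downstream.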

Benchmark Leadership in Compact VLMs

Outperforms the prior LFM2-VL-450M and SmolVLM2-500M: MMStar 43.00 (vs 40.87/38.20), RealWorldQA 58.43 (52.03/49.90), MMBench dev-en 60.91 (56.27/52.32), POPE 86.93 (83.79/82.67), MMVet 41.10 (33.85/29.90), OCRBench 684 (657/609), CountBench 73.31 (47.64/61.81). Text-only: GPQA 25.66 (23.13/23.84), MMLU-Pro 19.32 (17.22/13.57), IFEval 61.16 (51.75/30.14). MMMU val is slightly lower at 32.67, but overall vision and language reliability is higher, reflecting a focus on real-world tasks over academic evals.

Real-Time Inference Fits Tight Edge Constraints

The Q4_0 quantized model processes live camera feeds responsively: a 512x512 image in 242ms on Jetson Orin (~4 FPS video with full reasoning), 2.4s on Snapdragon 8 Elite (Samsung S25 Ultra), and 944ms on Ryzen AI Max+ 395; 256x256 inputs run in under 1s on all three platforms. This enables semantic scene understanding beyond plain detection, suiting power- and privacy-constrained hardware without cloud dependency.
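The ~4 FPS figure follows directly from the per-image latency: sustained throughput is the reciprocal of latency when inference is the only pipeline stage. A quick check using the latencies quoted above:

```python
# Latency figures quoted in the text (Q4_0 quantization, 512x512 input).
latencies_ms = {
    "Jetson Orin": 242,
    "Snapdragon 8 Elite": 2400,
    "Ryzen AI Max+ 395": 944,
}

def fps(latency_ms):
    """Sustained frames per second, assuming inference is the only
    stage (no capture, preprocessing, or postprocessing overhead)."""
    return 1000.0 / latency_ms

for device, ms in latencies_ms.items():
    print(f"{device}: {fps(ms):.1f} FPS")
```

In a real pipeline the achievable frame rate will be lower, since capture and preprocessing add to the per-frame budget.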

Production Fit for Constrained, High-Throughput Apps

Industrial automation (warehouses, vehicles): single-pass grounded reasoning over worker and forklift actions on Jetson Orin. Wearables and monitoring (glasses, dashcams): local semantic outputs preserve privacy under tight power limits. Retail and e-commerce: low-latency structured reasoning scales visual search and cataloging to millions of images. Run or fine-tune via Hugging Face, LEAP, and Playground; docs cover local setup.

Summarized by x-ai/grok-4.1-fast via openrouter

© 2026 Edge