Fix VLM Counting: Gemma 4 + 300M Segmentation Agent

Vision language models like Gemma 4 fail at accurate object counting; pairing them with the 300M-parameter Falcon Perception segmentation model in an agentic loop yields precise local detection, counting, and reasoning.

VLM Weaknesses Exposed, Segmentation as Grounding

Vision language models (VLMs) like Gemma 4 excel at fast scene understanding but consistently fail at precise object counting, localization, and handling occlusions—e.g., miscounting 8 apples and 5 oranges as 5 each. This stems from their inability to reliably isolate objects without additional tooling. The fix: integrate Falcon Perception, a 300M-parameter segmentation model from TII UAE (similar to SAM but far smaller and local-friendly), which generates full-resolution binary masks, bounding boxes, and detections via chain-of-perception decoding. It processes text-plus-image queries to identify objects without exhaustive prompting, enabling accurate counts even for occluded or distant items. Trade-off: it adds latency over a pure VLM, but runs efficiently on edge hardware like DGX Spark or Apple Silicon (MLX version), outperforming larger models like SAM in speed.
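Once the segmentation model returns a binary mask per object class, counting becomes deterministic: each connected component in the mask is one instance. A minimal, self-contained sketch (the function name and the 2D-list mask format are illustrative, not Falcon Perception's actual output API):

```python
from collections import deque

def count_instances(mask):
    """Count connected components (object instances) in a binary mask.

    `mask` is a 2D list of 0/1 values, standing in for one per-class
    binary mask from a segmentation model. Uses 4-connectivity flood fill.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1  # found a new, unvisited blob
                queue = deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

# Two separate blobs in a tiny mask -> 2 instances.
demo_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
print(count_instances(demo_mask))  # -> 2
```

This is why segmentation grounds the VLM: the count comes from pixel evidence rather than the language model's guess.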

Gemma 4's Apache 2.0-licensed family (sizes from 2B up) provides the reasoning backbone; use the 4B instruction-tuned variant for this pipeline to balance speed and capability on local devices.

Agentic Loop for Robust Visual Reasoning

Wrap the VLM and segmentation model in a dynamic agentic loop driven by Gemma 4, with four tools for planning, detection, and analysis. Start with a planning router: Gemma 4 assesses the query and image to choose between simple sequential processing (segment → reason) and the full loop for complex tasks. In loop mode:

  1. Extract target objects from query (e.g., "dogs", "cars vs. people").
  2. Call Falcon Perception's detect_each for segmented images/masks per object.
  3. Feed results back to Gemma 4 for visual reasoning, re-planning if needed (capped at 8 steps for safety).
  4. Output final grounded answer.
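The router and loop above can be sketched as follows. This is a hypothetical skeleton, not a specific library's API: `llm` and `tools` are stand-in callables, and the prompt prefixes (`PLAN:`, `TARGETS:`, etc.) are illustrative conventions.

```python
MAX_STEPS = 8  # safety cap from the pipeline description

def agentic_answer(query, image, llm, tools):
    """Planning router plus capped tool loop (hypothetical interfaces)."""
    # Router: the VLM decides simple vs. full-loop handling.
    plan = llm(f"PLAN: {query}")
    if plan == "simple":
        masks = tools["detect_each"](image, llm(f"TARGETS: {query}"))
        return llm(f"REASON: {query} | {masks}")
    # Full loop: extract targets, segment, reason, re-plan until done.
    evidence = []
    for _ in range(MAX_STEPS):
        targets = llm(f"TARGETS: {query}")              # step 1
        evidence.append(tools["detect_each"](image, targets))  # step 2
        answer = llm(f"REASON: {query} | {evidence}")   # step 3
        if llm(f"DONE? {answer}") == "yes":
            return answer                                # step 4
    return answer  # best effort after hitting the cap

# Stubbed model and tools to exercise the control flow:
def stub_llm(prompt):
    if prompt.startswith("PLAN"):
        return "loop"
    if prompt.startswith("TARGETS"):
        return "dogs"
    if prompt.startswith("DONE?"):
        return "yes"
    return "2 dogs"

stub_tools = {"detect_each": lambda image, targets: {targets: 2}}
print(agentic_answer("how many dogs?", "img.png", stub_llm, stub_tools))  # -> 2 dogs
```

The cap matters: without it, a model that never answers "done" would loop indefinitely on hard scenes.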

In the demos, this hybrid corrects 100% of the standalone VLM's counting errors, as segmentation provides verifiable, isolated instances for reasoning. It is also expandable: add tools for video frame processing or real-time tracking.
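Expandability is easiest when tools live in a registry the loop looks up by name, so new capabilities (video, tracking) plug in without touching the loop itself. A small sketch, assuming hypothetical tool names and placeholder bodies:

```python
TOOLS = {}

def register_tool(name):
    """Decorator: add a callable tool to the agent's registry."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("detect_each")
def detect_each(image, targets):
    # Placeholder: would call the segmentation model once per target class.
    return {t: [] for t in targets}

@register_tool("track_frames")
def track_frames(frames, targets):
    # Placeholder video extension: per-frame detection over a clip.
    return [detect_each(f, targets) for f in frames]

print(sorted(TOOLS))  # -> ['detect_each', 'track_frames']
```

The agent then dispatches via `TOOLS[name](...)`, so adding a tracking tool is one registration, not a rewrite.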

Proven Accuracy in Complex Scenes

On a busy street image, the agent counts 14 cars (focusing on visible, nearby ones while handling the background) versus 12 people (including some occluded), correctly concluding there are more cars—a scene where VLMs alone hallucinate. For dog breeds, it segments 2 dogs, then classifies candidate breeds. In the fruit demo, it isolates 5 oranges and 8 apples for an exact comparison, fixing Gemma 4's 5-vs-5 error. Even dense or occluded scenes yield reliable results, with minor misses on distant objects. The pure Gemma 4 baseline fails; the agentic pipeline succeeds via multi-step verification.
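The final "more cars than people" conclusion is the easy part once per-class counts come from segmentation. A tiny sketch of that last grounding step (the function is illustrative, not from the repo):

```python
def compare_counts(counts):
    """Turn per-class instance counts into a grounded comparison.

    `counts` maps class name -> instance count produced by segmentation,
    e.g. {'cars': 14, 'people': 12}. The answer cites the evidence.
    """
    # Sort classes by count, largest first (stable for ties).
    (a, na), (b, nb) = sorted(counts.items(), key=lambda kv: -kv[1])
    if na == nb:
        return f"equal numbers of {a} and {b} ({na} each)"
    return f"more {a} than {b} ({na} vs {nb})"

print(compare_counts({"cars": 14, "people": 12}))   # -> more cars than people (14 vs 12)
print(compare_counts({"apples": 8, "oranges": 5}))  # -> more apples than oranges (8 vs 5)
```

Because the numbers are carried through from masks, the answer is auditable: each count can be traced back to a set of segmented instances, unlike a one-shot VLM guess.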

The open-source GitHub repo (DGX/NVIDIA or MLX/Apple builds) includes setup instructions, pre-loaded images, and a dual-mode UI (agentic vs. baseline). Scale to larger Gemma variants for production; the current 4B suits experimentation.

Video description
Vision language models like Gemma 4 are great at understanding images but terrible at counting objects. In this video, I combine Gemma 4 with Falcon Perception, a tiny 300M parameter segmentation model, inside an agentic loop to build a local vision system that can actually detect, count, and reason about objects accurately.

https://github.com/PromtEngineer/Gemma4-Visual-Agent/tree/dgx-spark-gb10
https://huggingface.co/blog/tiiuae/falcon-perception
https://deepmind.google/models/gemma/
https://developer.nvidia.com/blog/bringing-ai-closer-to-the-edge-and-on-device-with-gemma-4/

My Dictation App: www.whryte.com
Website: https://engineerprompt.ai/
RAG Beyond Basics Course: https://prompt-s-site.thinkific.com/courses/rag
Signup for Newsletter, localgpt: https://tally.so/r/3y9bb0

Let's Connect:
🦾 Discord: https://discord.com/invite/t4eYQRUcXB
☕ Buy me a Coffee: https://ko-fi.com/promptengineering
🔴 Patreon: https://www.patreon.com/PromptEngineering
💼 Consulting: https://calendly.com/engineerprompt/consulting-call
📧 Business Contact: engineerprompt@gmail.com
Become Member: http://tinyurl.com/y5h28s6h
💻 Pre-configured localGPT VM: https://bit.ly/localGPT (use Code: PromptEngineering for 50% off)

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge