VLM Weaknesses Exposed: Segmentation as Grounding

Vision language models (VLMs) like Gemma 4 excel at fast scene understanding but consistently fail at precise object counting, localization, and occlusion handling; for example, a scene with 8 apples and 5 oranges gets miscounted as 5 of each. The root cause is that a VLM cannot reliably isolate individual objects without additional tooling. The fix: integrate Falcon Perception, a 300M-parameter segmentation model from TII UAE (similar in spirit to SAM but far smaller and local-friendly), which generates full-resolution binary masks, bounding boxes, and detections via chain-of-perception decoding. Given a text+image query it identifies the relevant objects without exhaustive prompting, enabling accurate counts even for occluded or distant items. The trade-off is added latency over a pure VLM call, but the model runs efficiently on edge hardware such as DGX Spark or Apple Silicon (via the MLX version) and is faster than larger models like SAM.
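
A minimal sketch of how the segmentation model grounds a count. The `detect_each` call is mentioned above, but its constructor and exact signature are assumptions here, not the documented Falcon Perception API:

```python
# Illustrative sketch: the detect_each signature below is an assumption,
# not the documented Falcon Perception API.
from PIL import Image

def count_instances(perception, image_path: str, label: str) -> int:
    """Count instances of `label` by asking the segmentation model for one
    mask/box per object instead of trusting the VLM's estimate."""
    image = Image.open(image_path).convert("RGB")
    detections = perception.detect_each(image=image, query=label)  # assumed signature
    return len(detections)

# e.g. count_instances(perception, "fruit.jpg", "apple")  -> 8
#      count_instances(perception, "fruit.jpg", "orange") -> 5
```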

Gemma 4's Apache 2.0-licensed family (sizes from 2B up) provides the reasoning backbone; use the 4B instruction-tuned variant for this pipeline to balance speed and capability on local devices.
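
Loading the 4B instruction-tuned backbone could look roughly like the following; the checkpoint id is an assumption for illustration, not a verified Hugging Face model name:

```python
# Sketch only: the model id "google/gemma-4-4b-it" is assumed, not verified.
from transformers import pipeline

vlm = pipeline(
    "image-text-to-text",           # multimodal chat-style pipeline
    model="google/gemma-4-4b-it",   # assumed 4B instruction-tuned checkpoint
    device_map="auto",              # fits on local/edge hardware
)
```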

Agentic Loop for Robust Visual Reasoning

Wrap the VLM and segmentation model in a dynamic agentic loop driven by Gemma 4, with four tools for planning, detection, and analysis. Start with a planning router: Gemma 4 assesses the query+image and decides between simple sequential processing (segment → reason) and the full loop for complex tasks. In loop mode (see the sketch after this list):

  1. Extract target objects from query (e.g., "dogs", "cars vs. people").
  2. Call Falcon Perception's detect_each for segmented images/masks per object.
  3. Feed results back to Gemma 4 for visual reasoning, re-planning if needed (capped at 8 steps for safety).
  4. Output final grounded answer.
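
The loop can be sketched as below. Here `plan`, `extract_targets`, and `reason` stand in for Gemma-driven prompt calls; they are illustrative placeholders, not the repo's actual API:

```python
# Minimal sketch of the agentic loop; plan/extract_targets/reason are
# illustrative stand-ins for Gemma 4 prompt calls, not the real interface.
MAX_STEPS = 8  # safety cap described above

def run_agent(plan, extract_targets, reason, perception, image, query):
    if plan(image, query) == "simple":                  # planning router
        detections = perception.detect_each(image=image, query=query)
        return reason(image, {query: detections}, query)[0]

    evidence = {}
    answer = None
    for _ in range(MAX_STEPS):
        for label in extract_targets(query, evidence):   # step 1: target objects
            evidence[label] = perception.detect_each(    # step 2: masks per object
                image=image, query=label
            )
        answer, done = reason(image, evidence, query)    # step 3: visual reasoning
        if done:
            break                                        # step 4: grounded answer
    return answer
```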

In the demos, this hybrid corrects the standalone VLM's counting errors outright, because segmentation provides verifiable, isolated objects for the reasoning step. It is also expandable: add tools for video frame processing or real-time tracking, as sketched below.
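
One way to expose a new capability to the planner is a simple tool registry; this is an illustrative pattern, and the real pipeline's tool interface may differ:

```python
# Illustrative tool registry; the actual pipeline's tool interface may differ.
TOOLS = {}

def register_tool(name, fn, description):
    """Make a new capability available to the planning router."""
    TOOLS[name] = {"fn": fn, "description": description}

# Example: per-frame detection for video, reusing the same segmentation call.
def detect_in_frames(perception, frames, label):
    return [perception.detect_each(image=frame, query=label) for frame in frames]

register_tool(
    "detect_in_frames",
    detect_in_frames,
    "Run Falcon Perception over sampled video frames to count objects per frame.",
)
```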

Proven Accuracy in Complex Scenes

On a busy street image, the agent counts 14 cars (focusing on visible, nearby ones while handling the background) versus 12 people (including some occluded), correctly concluding there are more cars; a VLM alone hallucinates here. For dog breeds, it segments 2 dogs and then classifies their likely breeds. In the fruit demo it isolates 5 oranges and 8 apples for an exact comparison, fixing Gemma 4's 5-vs-5 error. Even dense or occluded scenes yield reliable results, with minor misses on distant objects. The pure Gemma 4 baseline fails; the agentic pipeline succeeds via multi-step verification.

The open-source GitHub repo (DGX/NVIDIA or MLX/Apple versions) includes setup instructions, pre-loaded images, and a dual-mode UI (agentic vs. baseline). Scale to larger Gemma variants for production; the current 4B suits experimentation.
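
For illustration only, wiring the earlier sketches together for a quick local experiment; the repo's actual UI and entry points may differ:

```python
# Hypothetical usage of the sketches above; the repo's real entry points differ.
from PIL import Image

image = Image.open("fruit.jpg").convert("RGB")   # stand-in for a pre-loaded demo image
question = "Are there more apples or oranges?"

# Agentic mode: planning router + segmentation-grounded reasoning.
print(run_agent(plan, extract_targets, reason, perception, image, question))
```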