VLM Weaknesses Exposed: Segmentation as Grounding

Vision language models (VLMs) like Gemma 4 excel at fast scene understanding but consistently fail at precise object counting, localization, and occlusion handling; for example, a scene with 8 apples and 5 oranges gets miscounted as 5 of each. The root cause is that a VLM cannot reliably isolate individual objects without additional tooling. The fix: integrate Falcon Perception, a 300M-parameter segmentation model from TII UAE (similar in spirit to SAM but far smaller and local-friendly), which generates full-resolution binary masks, bounding boxes, and detections via chain-of-perception decoding. Given a text+image query it identifies the relevant objects without exhaustive prompting, enabling accurate counts even for occluded or distant items. The trade-off is added latency over a pure VLM call, but the model runs efficiently on edge hardware such as DGX Spark or Apple Silicon (via the MLX version) and is faster than larger models like SAM.
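
A minimal sketch of how the segmentation model grounds a count. The `detect_each` call is mentioned above, but its constructor and exact signature are assumptions here, not the documented Falcon Perception API:

```python
# Illustrative sketch: the detect_each signature below is an assumption,
# not the documented Falcon Perception API.
from PIL import Image

def count_instances(perception, image_path: str, label: str) -> int:
    """Count instances of `label` by asking the segmentation model for one
    mask/box per object instead of trusting the VLM's estimate."""
    image = Image.open(image_path).convert("RGB")
    detections = perception.detect_each(image=image, query=label)  # assumed signature
    return len(detections)

# e.g. count_instances(perception, "fruit.jpg", "apple")  -> 8
#      count_instances(perception, "fruit.jpg", "orange") -> 5
```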

Gemma 4's Apache 2.0-licensed family (sizes from 2B up) provides the reasoning backbone; use the 4B instruction-tuned variant for this pipeline to balance speed and capability on local devices.
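
Loading the 4B instruction-tuned backbone could look roughly like the following; the checkpoint id is an assumption for illustration, not a verified Hugging Face model name:

```python
# Sketch only: the model id "google/gemma-4-4b-it" is assumed, not verified.
from transformers import pipeline

vlm = pipeline(
    "image-text-to-text",           # multimodal chat-style pipeline
    model="google/gemma-4-4b-it",   # assumed 4B instruction-tuned checkpoint
    device_map="auto",              # fits on local/edge hardware
)
```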

Agentic Loop for Robust Visual Reasoning

Wrap the VLM and segmentation model in a dynamic agentic loop driven by Gemma 4, with four tools for planning, detection, and analysis. Start with a planning router: Gemma 4 assesses the query+image and decides between simple sequential processing (segment → reason) and the full loop for complex tasks. In loop mode (see the sketch after this list):

  1. Extract target objects from query (e.g., "dogs", "cars vs. people").
  2. Call Falcon Perception's detect_each for segmented images/masks per object.
  3. Feed results back to Gemma 4 for visual reasoning, re-planning if needed (capped at 8 steps for safety).
  4. Output final grounded answer.
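
The loop can be sketched as below. Here `plan`, `extract_targets`, and `reason` stand in for Gemma-driven prompt calls; they are illustrative placeholders, not the repo's actual API:

```python
# Minimal sketch of the agentic loop; plan/extract_targets/reason are
# illustrative stand-ins for Gemma 4 prompt calls, not the real interface.
MAX_STEPS = 8  # safety cap described above

def run_agent(plan, extract_targets, reason, perception, image, query):
    if plan(image, query) == "simple":                  # planning router
        detections = perception.detect_each(image=image, query=query)
        return reason(image, {query: detections}, query)[0]

    evidence = {}
    answer = None
    for _ in range(MAX_STEPS):
        for label in extract_targets(query, evidence):   # step 1: target objects
            evidence[label] = perception.detect_each(    # step 2: masks per object
                image=image, query=label
            )
        answer, done = reason(image, evidence, query)    # step 3: visual reasoning
        if done:
            break                                        # step 4: grounded answer
    return answer
```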

In the demos, this hybrid corrects the standalone VLM's counting errors outright, because segmentation provides verifiable, isolated objects for the reasoning step. It is also expandable: add tools for video frame processing or real-time tracking, as sketched below.
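
One way to expose a new capability to the planner is a simple tool registry; this is an illustrative pattern, and the real pipeline's tool interface may differ:

```python
# Illustrative tool registry; the actual pipeline's tool interface may differ.
TOOLS = {}

def register_tool(name, fn, description):
    """Make a new capability available to the planning router."""
    TOOLS[name] = {"fn": fn, "description": description}

# Example: per-frame detection for video, reusing the same segmentation call.
def detect_in_frames(perception, frames, label):
    return [perception.detect_each(image=frame, query=label) for frame in frames]

register_tool(
    "detect_in_frames",
    detect_in_frames,
    "Run Falcon Perception over sampled video frames to count objects per frame.",
)
```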

Proven Accuracy in Complex Scenes

On a busy street image, the agent counts 14 cars (focusing on visible, nearby ones while handling the background) versus 12 people (including some occluded), correctly concluding there are more cars; a VLM alone hallucinates here. For dog breeds, it segments 2 dogs and then classifies their likely breeds. In the fruit demo it isolates 5 oranges and 8 apples for an exact comparison, fixing Gemma 4's 5-vs-5 error. Even dense or occluded scenes yield reliable results, with minor misses on distant objects. The pure Gemma 4 baseline fails; the agentic pipeline succeeds via multi-step verification.

The open-source GitHub repo (DGX/NVIDIA or MLX/Apple versions) includes setup instructions, pre-loaded images, and a dual-mode UI (agentic vs. baseline). Scale to larger Gemma variants for production; the current 4B suits experimentation.
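
For illustration only, wiring the earlier sketches together for a quick local experiment; the repo's actual UI and entry points may differ:

```python
# Hypothetical usage of the sketches above; the repo's real entry points differ.
from PIL import Image

image = Image.open("fruit.jpg").convert("RGB")   # stand-in for a pre-loaded demo image
question = "Are there more apples or oranges?"

# Agentic mode: planning router + segmentation-grounded reasoning.
print(run_agent(plan, extract_targets, reason, perception, image, question))
```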