Building a Local Multimodal Search Engine with Gemma 4

The "Describe-Then-Embed" Strategy

The core challenge of multimodal search is that generative models like Gemma 4 are not trained as contrastive dual-encoders (like CLIP), so their raw image or audio activations do not form a space in which distance tracks meaning. Using those activations directly as embeddings produces noisy retrieval. The solution is to treat the LLM as a translator: use Gemma 4 to produce text for each modality, namely descriptions of video keyframes and transcripts of audio. Once all modalities are converted to text, a single embedding model (nomic-embed-text) maps them into a unified 768-dimensional space, so cross-modal retrieval reduces to ordinary text similarity.
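The flow above can be sketched in a few lines. This is a minimal illustration, not the author's code: `describe_then_embed`, `fake_describe`, and `fake_embed` are hypothetical names, and the two callables stand in for whatever wraps the Gemma 4 call and the nomic-embed-text call in a real pipeline.

```python
from typing import Callable, Iterable


def describe_then_embed(
    items: Iterable[tuple[str, object]],
    describe: Callable[[str, object], str],
    embed: Callable[[str], list[float]],
) -> list[dict]:
    """Turn (modality, raw_input) pairs into text-backed vector records.

    `describe` and `embed` are hypothetical callables: in a real pipeline
    they would wrap the Gemma 4 call and the nomic-embed-text call.
    """
    records = []
    for modality, raw in items:
        text = describe(modality, raw).strip()
        if not text:
            # An empty description carries no signal, so skip it rather than embed it.
            continue
        records.append({"modality": modality, "text": text, "vector": embed(text)})
    return records


# Stand-ins so the sketch runs without any model installed.
def fake_describe(modality: str, raw: object) -> str:
    return f"{modality}: {raw}"


def fake_embed(text: str) -> list[float]:
    # Toy 3-dimensional "embedding"; the article's model produces 768 dimensions.
    return [float(len(text)), float(text.count(" ")), 1.0]


records = describe_then_embed(
    [("frame", "a dog on a beach"), ("audio", "waves crashing")],
    fake_describe,
    fake_embed,
)
```

Because every record ends up as text plus one vector, a text query embedded with the same model can be compared against frames and audio alike.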

Dual-Granularity Indexing in Qdrant

To balance high-recall asset searching with precise timestamp playback, the system uses two distinct indexing granularities within a single Qdrant collection:

Fragment Points (Primary): Each keyframe, audio chunk, or caption line is stored as a single point with its own vector and payload (start/end timestamps). This enables pinpoint playback.
Asset-Level MaxSim Points (Secondary): Each asset holds an array of vectors representing its frames and audio. By using Qdrant's MultiVectorConfig with MAX_SIM (ColBERT-style late interaction), the system can quickly identify which entire files are relevant to a query before drilling down into specific fragments.
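To make the two granularities concrete, the sketch below builds both kinds of records from the same list of fragments using plain Python dictionaries. The `Fragment` class, the `build_points` helper, and the payload field names are assumptions made for illustration; they are not Qdrant's API and not necessarily the layout the original system uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Fragment:
    """One keyframe, audio chunk, or caption line (hypothetical shape)."""
    asset_id: str
    modality: str      # e.g. "frame", "audio", "caption"
    start_s: float
    end_s: float
    text: str
    vector: list[float]


def build_points(fragments):
    """Return (fragment_points, asset_points) as plain dicts.

    Fragment points carry one vector plus timestamps for playback.
    Asset points carry the list of all sub-vectors belonging to one asset.
    """
    fragment_points = []
    by_asset = defaultdict(list)
    for i, frag in enumerate(fragments):
        fragment_points.append({
            "id": i,
            "vector": frag.vector,
            "payload": {
                "kind": "fragment",
                "asset_id": frag.asset_id,
                "modality": frag.modality,
                "start_s": frag.start_s,
                "end_s": frag.end_s,
                "text": frag.text,
            },
        })
        by_asset[frag.asset_id].append(frag.vector)

    asset_points = [
        {
            "id": asset_id,
            "vectors": vectors,
            "payload": {"kind": "asset", "asset_id": asset_id},
        }
        for asset_id, vectors in by_asset.items()
    ]
    return fragment_points, asset_points
```

A query would then typically hit the asset records first to shortlist files, and the fragment records second to find the timestamp to play.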

Implementation & Performance Traps

Embedded Storage: By using Qdrant’s local-disk mode (QdrantClient(path=...)), the entire engine runs in-process, with no separate server or Docker container, making it portable across local machines.
Thinking Models: When using Gemma 4 for description, disable the model's "thinking" process (think=False). Otherwise, reasoning tokens consume the output budget, leading to empty descriptions and ingestion failures.
Client-Side MaxSim: Because the embedded Qdrant client may have limitations with native multivector queries, the author recommends computing MaxSim on the client side by taking the maximum dot product between the query vector and the stored sub-vectors.
Quantization: Use INT8 scalar quantization with the always_ram option enabled to keep the index fast and memory-efficient on hardware like Apple Silicon.
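The client-side MaxSim step described above is small enough to write out in full. The sketch below assumes a single query vector and stored vectors that are already normalized, so the dot product behaves like cosine similarity; `max_sim` and `rank_assets` are hypothetical helper names, and the `assets` mapping is an illustrative in-memory structure rather than anything returned by Qdrant.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_sim(query, sub_vectors):
    """Score one asset: the best dot product between the query and any sub-vector."""
    return max(dot(query, v) for v in sub_vectors)


def rank_assets(query, assets, top_k=5):
    """Rank assets by MaxSim, highest first.

    `assets` maps an asset id to its list of sub-vectors (frames, audio chunks).
    Assets with no sub-vectors are skipped.
    """
    scored = [
        (asset_id, max_sim(query, vectors))
        for asset_id, vectors in assets.items()
        if vectors
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

For example, `rank_assets([0.0, 1.0], {"a": [[1.0, 0.0], [0.0, 1.0]], "b": [[0.6, 0.8]]})` ranks "a" ahead of "b", because one of a's sub-vectors matches the query exactly even though its other sub-vector does not match at all.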
