Liquid AI's New 350M Multilingual Retrieval Models

Bidirectional Architecture for Retrieval

Liquid AI has expanded its LFM2.5 family with two 350M-parameter retrieval models: LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M. Both are built on the LFM2.5-350M-Base backbone, converted from a causal decoder into a bidirectional encoder. The conversion replaces the causal attention masks with bidirectional ones and makes the short convolutions non-causal, so each token can draw on both its left and right context. The design aims to preserve the efficiency of the LFM2 backbone while providing the full-context representations that high-quality retrieval requires.
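
The two changes described above can be illustrated with a minimal, pure-Python sketch. It is not Liquid AI's implementation: the function names, the kernel width of 3, and the zero-padding scheme are assumptions chosen only to show how a causal mask and a causal convolution differ from their bidirectional counterparts.

```python
def attention_mask(n, causal):
    """Return an n x n matrix; mask[i][j] is True when token i may attend to token j."""
    # Causal: token i sees only positions j <= i. Bidirectional: it sees every position.
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]


def short_conv(x, kernel, causal):
    """Same-length 1-D convolution (cross-correlation form) with zero padding.

    causal=True  pads only on the left, so output[t] depends on x[t-k+1 .. t].
    causal=False pads on both sides, so output[t] depends on a window centered on t.
    The padding scheme is an illustrative assumption, not taken from the source.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [sum(kernel[i] * padded[t + i] for i in range(k)) for t in range(len(x))]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    box = [1.0, 1.0, 1.0]  # hypothetical width-3 kernel
    print(attention_mask(3, causal=True))
    print(short_conv(signal, box, causal=True))   # each output sums the current and two previous inputs
    print(short_conv(signal, box, causal=False))  # each output sums the previous, current, and next input
```

In the non-causal case, the output at a position already reflects the following token, which is the property a retrieval encoder needs when it reads a whole query or document at once.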

Model Selection: Dense vs. Late-Interaction

LFM2.5-Embedding-350M (Dense Bi-Encoder): Compresses each document into a single 1024-dimensional vector. This model suits scenarios that prioritize speed, minimal storage, and low-cost indexing.
LFM2.5-ColBERT-350M (Late-Interaction): Retains a 128-dimensional embedding for every token, enabling token-by-token matching between query and document. It requires a larger index but offers higher accuracy and better generalization. It can also serve as a reranker for existing retrieval pipelines without requiring a pre-built index.
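
To make the trade-off concrete, the sketch below contrasts the two scoring schemes in pure Python. It assumes cosine similarity for the dense model and the standard ColBERT-style MaxSim sum for late interaction; the source does not state which similarity function these models use, and the helper names (`cosine`, `maxsim`, `rerank`, `floats_per_document`) are hypothetical. Real embeddings would come from the models themselves; toy vectors stand in here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def maxsim(query_tokens, doc_tokens):
    """Late interaction: each query token keeps its best-matching document token,
    and the per-token maxima are summed into one relevance score."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


def rerank(query_tokens, candidates):
    """Sort (doc_id, doc_tokens) pairs by MaxSim, best first.
    Candidates are scored on the fly, so no pre-built index is needed."""
    return sorted(candidates, key=lambda c: maxsim(query_tokens, c[1]), reverse=True)


def floats_per_document(num_tokens, dense_dim=1024, token_dim=128):
    """Stored values per document: one 1024-d vector vs. one 128-d vector per token."""
    return {"dense": dense_dim, "late_interaction": num_tokens * token_dim}


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-d token embeddings
    docs = [
        ("partial", [[1.0, 0.0], [1.0, 0.0]]),
        ("full", [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    ]
    print([doc_id for doc_id, _ in rerank(query, docs)])
    print(floats_per_document(num_tokens=256))
```

Under these assumptions, a 256-token document stores 1024 values in the dense index and 32,768 values in the late-interaction index, which is the storage cost paid for token-level matching.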

Performance and Deployment

Both models support 11 languages (Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Norwegian, Portuguese, and Swedish). Benchmarks indicate that LFM2.5-ColBERT-350M outperforms existing models in its class on the NanoBEIR and MKQA-11 benchmarks.

For deployment, Liquid AI provides GGUF variants compatible with llama.cpp, enabling execution on CPUs and edge devices. On a MacBook Pro M4 Max, query latency for the Embedding model is approximately 7.3 ms, while the ColBERT model achieves 8.2 ms when document embeddings are pre-computed. The models are available on Hugging Face under the LFM Open License v1.0.
