Architecture and Efficiency

Liquid AI's LFM2.5-8B-A1B is a sparse Mixture-of-Experts (MoE) model designed for on-device deployment. It has 8.3B total parameters but activates only 1.5B per token, so each forward pass uses a fraction of the compute a dense model of the same size would need. The architecture consists of 24 layers: 18 double-gated LIV (linear input-varying) convolution blocks and 6 Grouped Query Attention (GQA) layers. This design keeps the model efficient enough to run on consumer hardware, reaching about 30 tokens/s on mobile devices and 253 tokens/s on an M5 Max CPU.
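The ratios implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the active fraction is only a rough proxy for per-token compute.

```python
# Figures quoted in the text; variable names are illustrative.
TOTAL_PARAMS_B = 8.3    # total parameters, in billions
ACTIVE_PARAMS_B = 1.5   # parameters activated per token, in billions
CONV_BLOCKS = 18        # double-gated LIV convolution blocks
GQA_LAYERS = 6          # Grouped Query Attention layers

# Share of the weights that take part in processing a single token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Layer budget and the share of it spent on attention.
total_layers = CONV_BLOCKS + GQA_LAYERS
attention_share = GQA_LAYERS / total_layers

print(f"active fraction: {active_fraction:.1%}")   # 18.1%
print(f"total layers: {total_layers}")             # 24
print(f"attention share: {attention_share:.0%}")   # 25%
```

Roughly 18% of the weights are used for any given token, and only a quarter of the layers are attention layers.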

Training and Reasoning Capabilities

Unlike its predecessor, LFM2.5-8B-A1B is a reasoning-only model: it always produces an explicit chain of thought before generating a final answer. Liquid AI scaled pretraining to 38T tokens and expanded the context window to 128K tokens. The vocabulary was also doubled to 128,000 tokens, which improves tokenization efficiency for languages beyond English, particularly Hindi, Thai, Vietnamese, Indonesian, and Arabic.
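Tokenization efficiency is commonly expressed as tokens per character: the fewer tokens a text needs, the more of it fits in the context window. The sketch below illustrates the metric with a toy byte-level encoder, which stands in for a vocabulary that has no dedicated entries for a script. It is not the model's tokenizer, and `tokens_per_char` and `byte_fallback_encode` are hypothetical helper names.

```python
from typing import Callable, List


def byte_fallback_encode(text: str) -> List[int]:
    """Toy encoder: one token per UTF-8 byte (worst case for uncovered scripts)."""
    return list(text.encode("utf-8"))


def tokens_per_char(encode: Callable[[str], List[int]], text: str) -> float:
    """Average number of tokens spent per Unicode code point."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(encode(text)) / len(text)


english = "hello"
thai = "สวัสดี"  # Thai code points each take 3 bytes in UTF-8

print(tokens_per_char(byte_fallback_encode, english))
print(tokens_per_char(byte_fallback_encode, thai))
```

Passing a real tokenizer's encode function in place of `byte_fallback_encode` would give the same measurement for an actual vocabulary.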

To improve reliability, training included two reinforcement learning stages: one to reduce "doom loops" (reasoning traces that repeat without converging) and another that uses an avg@k-based reward to reduce hallucinations. Benchmark scores rose sharply: the AA-Omniscience Non-Hallucination Rate increased from 7.46 to 63.47, and the Tau² Telecom score from 13.60 to 88.07.
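The two signals mentioned above can be illustrated in general terms. The sketch below is not Liquid AI's training code, since the text does not describe the exact reward or loop criterion. It assumes avg@k means the mean score over k sampled responses to one prompt, and it uses a simple repeated n-gram heuristic as a stand-in loop detector. The function names and thresholds are hypothetical.

```python
from collections import Counter
from typing import Sequence


def avg_at_k(scores: Sequence[float]) -> float:
    """Mean score over k sampled responses to a single prompt."""
    if not scores:
        raise ValueError("need at least one sample")
    return sum(scores) / len(scores)


def has_doom_loop(tokens: Sequence[str], n: int = 4, max_repeats: int = 3) -> bool:
    """Flag a trace if any n-gram occurs more than max_repeats times.

    A deliberately simple heuristic; the thresholds are arbitrary assumptions.
    """
    ngrams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(ngrams)
    return any(count > max_repeats for count in counts.values())


# Four samples for one prompt, three of them judged correct (1.0).
print(avg_at_k([1.0, 0.0, 1.0, 1.0]))

looping_trace = "wait let me check".split() * 5
clean_trace = "first compute the sum then divide by the count".split()
print(has_doom_loop(looping_trace), has_doom_loop(clean_trace))
```

Averaging over k samples rewards answers that are consistently right rather than occasionally right, which is one plausible reason such a reward would discourage confident guessing.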

Deployment and Tool Use

Designed for agentic workflows, the model natively emits Pythonic function calls wrapped in the special tokens <|tool_call_start|> and <|tool_call_end|>; the system prompt can instruct it to emit JSON instead. It has day-one support across major inference frameworks, including llama.cpp, MLX, vLLM, SGLang, and ONNX.
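A client application has to extract these calls from the generated text before it can run them. The sketch below shows one way to do that with Python's standard `ast` module. It assumes the payload between the two tokens is either a single call or a list of calls with keyword arguments holding literal values. The helper `parse_tool_calls` and the example tool `get_weather` are hypothetical, and the exact payload format should be checked against the model's chat template.

```python
import ast
import re

# The two special tokens named in the text; everything between them is the payload.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract Pythonic tool calls as {"name": ..., "arguments": {...}} dicts."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        # Parse without executing anything: we only inspect the syntax tree.
        tree = ast.parse(payload.strip(), mode="eval").body
        nodes = tree.elts if isinstance(tree, ast.List) else [tree]
        for node in nodes:
            if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
                raise ValueError("expected a plain function call")
            if node.args or any(kw.arg is None for kw in node.keywords):
                raise ValueError("only keyword arguments are supported in this sketch")
            arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append({"name": node.func.id, "arguments": arguments})
    return calls


output = (
    "I will look that up."
    '<|tool_call_start|>[get_weather(city="Paris", days=3)]<|tool_call_end|>'
)
print(parse_tool_calls(output))
```

Using `ast.literal_eval` on the argument values means only literals (strings, numbers, lists, dicts, and so on) are accepted, so model output is never executed as code.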