LFM 2.5: Train Small Models to Beat Doom Loops & Use Tools
Pre-train a 350M edge model on 28T tokens, then post-train with narrow SFT, on-policy DPO, and RL with verifiable rewards to fix doom loops (15% to <1%) and enable reliable on-device tool use under 1GB.
Edge-Optimized Architectures Maximize Effective Parameters
Small models (350M-24B params) are not just scaled-down giants: they are memory-bound (<1GB on-device), task-specific, and latency-sensitive. Standard distillation from large models bloats the embedding layer (Gemma 3 270M spends 63% of its params on embeddings; Gemma 2.5 0.8B spends 29%), leaving fewer effective params for reasoning. Liquid AI's LFM 2 cuts embeddings to ~10% of params and chose its architecture by profiling on target hardware (AMD Ryzen AI Max+ 395 CPU, Samsung Galaxy S25 Ultra), which favored gated short convolutions over sliding-window attention, gated DeltaNet, and linear attention. The payoff: 2-5x faster inference (lower cost ratio) and higher throughput even at peak GPU concurrency, using less memory while packing more reasoning capacity into the same footprint.
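A back-of-envelope check of where those embedding percentages come from; the vocab sizes and widths below are illustrative assumptions, not published configs:

```python
# What fraction of a small model's parameters sit in a (tied) embedding
# table? Vocab sizes and widths here are illustrative assumptions.

def embedding_share(vocab_size: int, d_model: int, total_params: float) -> float:
    """Fraction of total parameters consumed by a tied input/output embedding."""
    return vocab_size * d_model / total_params

# A 256k vocab at d_model=640 on a 270M-param model: ~62% of all params,
# in line with the ~63% figure cited for Gemma 3 270M.
print(f"{embedding_share(262_144, 640, 270e6):.0%}")

# A smaller vocab and wider trunk (e.g., 65k vocab, d_model=2048 on 1.2B
# params) lands near the ~10% share LFM 2 targets.
print(f"{embedding_share(65_536, 2048, 1.2e9):.0%}")
```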
Target summarization or data extraction rather than general chat; the latency wins make these models ideal for phones and cars without internet access.
28T Pre-Training + Targeted Post-Training Builds Frontier Capabilities
Defy Chinchilla: pre- and mid-train the 350M LFM 2.5 on 28T tokens, far beyond compute-optimal, for continued performance gains, in line with newer scaling laws (Roberts et al.). Post-training mirrors big-model pipelines but narrows the focus: SFT on task-specific data (e.g., function calling), on-policy length-normalized DPO for broad quality lifts (smoother outputs), and RL across diverse environments for generalization.
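A minimal sketch of a length-normalized DPO objective, assuming summed per-token log-probs from the policy and a frozen reference model; the per-token averaging is the standard length-normalized form, and beta plus the toy numbers are assumptions, not LFM's exact recipe:

```python
import torch
import torch.nn.functional as F

def length_normalized_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # summed token log-probs, chosen rollout
    policy_rejected_logps: torch.Tensor,  # summed token log-probs, rejected rollout
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    chosen_lens: torch.Tensor,            # token counts of each completion
    rejected_lens: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Length normalization: average per-token log-prob instead of the raw sum,
    # so short and long completions compete on equal footing (loops run long).
    pi_logratio = policy_chosen_logps / chosen_lens - policy_rejected_logps / rejected_lens
    ref_logratio = ref_chosen_logps / chosen_lens - ref_rejected_logps / rejected_lens
    # Standard DPO: widen the chosen-vs-rejected margin, anchored to the reference.
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()

t = torch.tensor
print(length_normalized_dpo_loss(t([-40.]), t([-90.]), t([-42.]), t([-80.]),
                                 t([20.]), t([60.])))  # tensor(0.68...)
```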
Cold-start fix: seed SFT with RL-style samples; a weak RL signal flags missing SFT data, so restart SFT to recover (a sketch of this routing follows). Result: LFM 2.5 350M beats prior models of its size on GPQA Diamond (knowledge), IFEval (instruction following), CaseReportBench (extraction), and BFCL/Dow2 (tool use), prioritizing extraction and tool use over math or MT-Bench averages. RL shines at small scale for narrow gains, and it's cheap to run.
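The cold-start routing, sketched under one assumption: a "poor RL signal" means near-zero reward variance across rollouts (the threshold, helper name, and toy traces are made up for illustration):

```python
from statistics import pstdev

def needs_sft_seed(rewards: list[float], min_std: float = 0.05) -> bool:
    """Flat rewards (all-fail or all-pass) give RL no gradient to follow."""
    return pstdev(rewards) < min_std

# Toy reward traces per task family (illustrative numbers).
reward_log = {
    "function_calling": [0.0, 0.0, 0.0, 0.0],   # no signal -> seed SFT first
    "math_extraction":  [0.0, 1.0, 0.0, 1.0],   # mixed signal -> RL can learn
}
sft_pool = [t for t, r in reward_log.items() if needs_sft_seed(r)]
rl_pool = [t for t, r in reward_log.items() if not needs_sft_seed(r)]
print(sft_pool, rl_pool)   # ['function_calling'] ['math_extraction']
```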
Crush Doom Loops with DPO Rejection and Verifiable RL
Doom loops (endless repetition of the same tokens) spike in tiny reasoning models on hard tasks (e.g., 50%+ in Gemma 3.5 0.8B reasoning). LFM 2.5 1.2B starts at a 15-16% loop rate after pre-training; SFT barely dents it.
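Measuring that rate needs a loop detector; a minimal sketch that flags completions whose tail is a single repeated n-gram (the window and repeat thresholds are illustrative, not LFM's metric):

```python
# Minimal doom-loop detector: flag a completion whose tail is dominated by
# one repeated n-gram. The n and min_repeats thresholds are illustrative.
def is_doom_loop(tokens: list[str], n: int = 4, min_repeats: int = 8) -> bool:
    """True if some n-gram repeats min_repeats+ times back-to-back at the tail."""
    for n_ in range(1, n + 1):
        tail = tokens[-n_ * min_repeats:]
        if len(tail) == n_ * min_repeats and len(set(
            tuple(tail[i:i + n_]) for i in range(0, len(tail), n_)
        )) == 1:
            return True
    return False

completions = ["the answer is is is is is is is is is", "the answer is 4"]
rate = sum(is_doom_loop(c.split(), n=2) for c in completions) / len(completions)
print(f"loop rate: {rate:.0%}")   # loop rate: 50%
```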
Fix 1 (DPO): generate 1M prompts → sample 5 diverse temperature-sampled rollouts + 1 greedy rollout per prompt from the policy model → an LLM jury picks the best (chosen) vs. the worst (rejected). Looping rollouts get rejected, training the model to avoid them and dropping the loop rate sharply.
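The rollout-and-jury pipeline as a runnable sketch; `sample` and `jury_rank` are toy stand-ins for the policy model and the LLM jury, not real APIs:

```python
import random

# Toy stand-ins so the sketch runs; swap in your policy model and LLM jury.
def sample(prompt: str, temperature: float) -> str:
    loops = temperature > 0 and random.random() < 0.2
    return prompt + (" looped looped looped" if loops else " clean answer")

def jury_rank(prompt: str, rollouts: list[str]) -> list[str]:
    # Stand-in jury: penalize obvious repetition; the real jury is an LLM.
    return sorted(rollouts, key=lambda r: r.count("looped"))

def build_preference_pair(prompt: str) -> dict:
    rollouts = [sample(prompt, temperature=1.0) for _ in range(5)]  # 5 diverse
    rollouts.append(sample(prompt, temperature=0.0))                # 1 greedy
    ranked = jury_rank(prompt, rollouts)                            # best first
    # Chosen = jury's best, rejected = jury's worst; a looping rollout reads
    # badly and lands last, so DPO pushes probability mass away from loops.
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

print(build_preference_pair("Summarize the report."))
```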
Fix 2 (RL): verifiable rewards (exact-match on the extracted final math answer, else zero reward) plus an n-gram repetition penalty and temperature sampling. This near-eliminates loops (<1%). Takeaway: avoid shipping scaled-down big models; tailor the stack to what makes edge deployment unique.
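A hedged sketch of such a reward: exact-match on a GSM8K-style extracted answer, minus a duplicate-n-gram penalty (the `####` regex, n, and penalty weight alpha are assumptions, not LFM's published reward):

```python
import re

def repetition_penalty(tokens: list[str], n: int = 4) -> float:
    """Fraction of n-grams that are duplicates; 0.0 for loop-free text."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def verifiable_reward(completion: str, gold_answer: str, alpha: float = 0.5) -> float:
    m = re.search(r"####\s*(.+)", completion)          # GSM8K-style final answer
    correct = 1.0 if m and m.group(1).strip() == gold_answer else 0.0
    return correct - alpha * repetition_penalty(completion.split())

print(verifiable_reward("x=2, so 2+2 #### 4", "4"))    # 1.0: correct, no loop
print(verifiable_reward("4 4 4 4 4 4 4 4 4 4", "4"))   # negative: loop, no answer
```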
Agentic RL Unlocks Small Models Despite Low Knowledge
Limited memory causes hallucinations and long-context failures, but agentic tools (web search, Python execution) route around them. Small models can excel at reliable tool calling and reasoning when post-trained right, beating big models wherever latency, privacy, or offline operation matter (cars, finance, healthcare). Still underexplored: pairing edge models with agents for production wins. And distill RL anti-looping behavior cautiously, since distilling it back through SFT risks reintroducing SFT-like looping issues.
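A minimal agent loop of the kind described; the JSON tool-call convention and the `llm` stand-in are assumptions for illustration, not LFM's actual interface:

```python
import json

def web_search(query: str) -> str:                 # stand-in tool
    return f"top result for {query!r}"

TOOLS = {"web_search": web_search}

def agent_loop(llm, user_msg: str, max_steps: int = 4) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = llm(messages)                      # model emits text or a tool call
        try:
            call = json.loads(reply)               # {"tool": ..., "args": {...}}
            result = TOOLS[call["tool"]](**call["args"])
            messages.append({"role": "tool", "content": result})
        except (json.JSONDecodeError, KeyError):
            return reply                           # plain text = final answer
    return "max steps exceeded"

# Stand-in model: search once, then answer from the tool result.
def llm(messages):
    if messages[-1]["role"] == "user":
        return json.dumps({"tool": "web_search",
                           "args": {"query": messages[-1]["content"]}})
    return f"Answer based on: {messages[-1]['content']}"

print(agent_loop(llm, "LFM 2.5 release date"))
```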