ByteDance's Lance: A Unified 3B Model for Vision and Video

Unified Architecture for Multimodal Tasks

Most multimodal AI systems handle understanding (semantic alignment) and generation (continuous representation) with separate architectures. ByteDance's Lance bridges this gap with a dual-stream mixture-of-experts architecture that handles both natively. It processes text, images, and videos as a single interleaved sequence, so one model can perform captioning, visual reasoning, text-to-video generation, and multi-turn editing without separate components.

Technical Innovations

Dual-Stream MoE: The architecture uses an understanding expert (LLMUND) for semantic tasks and a generation expert (LLMGEN) for visual synthesis. Both share the same context but are trained with different objectives: next-token prediction for understanding and flow matching in continuous latent space for generation.
Modality-Aware Rotary Positional Encoding (MaPE): To prevent positional ambiguity when mixing different token types (ViT semantic tokens, VAE condition tokens, and noisy VAE target tokens), Lance applies a fixed temporal offset to each modality group. This preserves spatial layout while separating token groups in the global positional space, which is critical for cross-task alignment.
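
The two ideas above can be illustrated with a small Python sketch. It is not Lance's implementation: the function names, the offset values, and the (t, h, w) position layout are hypothetical, and the linear interpolation path used for flow matching is one common convention that the article does not confirm. The first function shifts only the temporal index of each modality group, so spatial coordinates stay aligned across groups; the second builds the noisy input and velocity target that a flow-matching generation expert would regress.

```python
# Hypothetical per-group temporal offsets (assumed values, not from the article).
TEMPORAL_OFFSET = {"vit_semantic": 0, "vae_condition": 1000, "vae_target": 2000}


def mape_positions(group, frames, height, width):
    """Return (t, h, w) position triples for one modality group's token grid.

    Only the temporal index is shifted by the group's fixed offset;
    the spatial indices (h, w) are left unchanged.
    """
    base = TEMPORAL_OFFSET[group]
    return [
        (base + t, h, w)
        for t in range(frames)
        for h in range(height)
        for w in range(width)
    ]


def flow_matching_pair(noise, data, t):
    """Build one flow-matching training pair on a straight noise-to-data path.

    Assumed convention: x_t = (1 - t) * noise + t * data, and the regression
    target is the constant velocity data - noise.
    """
    x_t = [(1 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


if __name__ == "__main__":
    cond = mape_positions("vae_condition", frames=1, height=2, width=2)
    target = mape_positions("vae_target", frames=1, height=2, width=2)
    # Same spatial layout, different temporal band per group.
    print(cond)
    print(target)

    x_t, v = flow_matching_pair(noise=[0.0, 2.0], data=[1.0, 0.0], t=0.5)
    print(x_t, v)
```

In this toy setup, condition and target tokens at the same pixel location share their (h, w) coordinates but never collide in the temporal dimension, which is the separation property the MaPE description refers to.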

Performance and Training

Lance was trained in four stages (Pre-Training, Continual Training, Supervised Fine-Tuning, and Reinforcement Learning using GRPO) within a budget of 128 GPUs. Despite having only 3B activated parameters, it matches or outperforms larger or specialized models on key benchmarks:

Video Generation: Achieves a VBench score of 85.11, surpassing dedicated models like HunyuanVideo and Wan2.1-T2V.
Video Understanding: Scores 62.0 on MVBench, leading all unified models.
Image Generation: Matches top-tier unified models with a 0.90 score on GenEval.

Lance is available under an Apache 2.0 license and requires a minimum of 40GB of VRAM for inference.
