Unified Architecture for Multimodal Tasks

Most multimodal systems handle understanding (semantic alignment) and generation (continuous visual representation) with separate architectures. ByteDance's Lance bridges this gap with a dual-stream mixture-of-experts architecture that handles both tasks natively. It processes text, images, and videos as a single interleaved sequence, so one model can perform captioning, visual reasoning, text-to-video generation, and multi-turn editing without separate components.

Technical Innovations

  • Dual-Stream MoE: The architecture uses an understanding expert (LLM_UND) for semantic tasks and a generation expert (LLM_GEN) for visual synthesis. Both share the same context but are trained with different objectives: next-token prediction for understanding and flow matching in continuous latent space for generation.
  • Modality-Aware Rotary Positional Encoding (MaPE): To prevent positional ambiguity when mixing different token types (ViT semantic tokens, VAE condition tokens, and noisy VAE target tokens), Lance applies a fixed temporal offset to each modality group. This preserves spatial layout while separating token groups in the global positional space, which is critical for cross-task alignment.
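The two ideas above can be illustrated with a small, self-contained sketch. It is not Lance's implementation: the linear interpolation path used for flow matching, the offset values in TEMPORAL_OFFSETS, the (t, h, w) position layout, and all function names are assumptions made for illustration only. The first half shows the two objectives that the experts would optimize over a shared context; the second half shows how shifting only the temporal coordinate keeps token groups apart while leaving their spatial layout unchanged.

```python
import math
from itertools import product

# --- Two objectives over a shared context (illustrative, pure Python) ---

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log softmax(logits)[target_index]."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target_index]

def flow_matching_pair(noise, latent, t):
    """Linear-path flow matching (an assumed convention, not confirmed for Lance).

    Returns the noisy input x_t = (1 - t) * noise + t * latent and the
    velocity target latent - noise that a generation expert would regress.
    """
    x_t = [(1 - t) * n + t * x for n, x in zip(noise, latent)]
    velocity = [x - n for n, x in zip(noise, latent)]
    return x_t, velocity

def mse(prediction, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)

# --- Modality-aware positions: shift only the temporal axis per group ---

# Hypothetical offsets; the text does not state the values Lance uses.
TEMPORAL_OFFSETS = {"vit_semantic": 0, "vae_condition": 1000, "vae_target": 2000}

def mape_positions(groups):
    """Map an ordered list of (modality, frames, height, width) to (t, h, w) ids."""
    positions = []
    for modality, frames, height, width in groups:
        offset = TEMPORAL_OFFSETS[modality]
        for t, h, w in product(range(frames), range(height), range(width)):
            positions.append((t + offset, h, w))  # spatial (h, w) left untouched
    return positions

if __name__ == "__main__":
    groups = [
        ("vit_semantic", 1, 2, 2),
        ("vae_condition", 1, 2, 2),
        ("vae_target", 1, 2, 2),
    ]
    ids = mape_positions(groups)
    print(ids[0], ids[4], ids[8])  # first token of each group
```

With these made-up offsets, the first token of each group lands at (0, 0, 0), (1000, 0, 0), and (2000, 0, 0). Without the offsets, all three groups would share identical position ids, which is the ambiguity the encoding is described as avoiding.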

Performance and Training

Lance was trained in four stages (Pre-Training, Continual Training, Supervised Fine-Tuning, and Reinforcement Learning with GRPO) within a budget of 128 GPUs. Despite having only 3B activated parameters, it matches or outperforms larger or specialized models on key benchmarks:

  • Video Generation: Achieves a VBench score of 85.11, surpassing dedicated models like HunyuanVideo and Wan2.1-T2V.
  • Video Understanding: Scores 62.0 on MVBench, leading all unified models.
  • Image Generation: Matches top-tier unified models with a 0.90 score on GenEval.

Lance is available under an Apache 2.0 license and requires at least 40 GB of VRAM for inference.