LPM-1.0: Real-Time Video for Conversational Characters
LPM-1.0 generates identity-consistent, real-time video from image, audio, and text inputs for full-duplex AI conversations, supporting unbounded-length interactions with listening, speaking, and idle states.
Full-Duplex Real-Time Pipeline for Conversations
LPM-1.0 integrates with audio-to-audio models such as ChatGPT or Doubao via a three-state streaming process: (1) Listen mode streams video of reactive behaviors (nods, gaze shifts, micro-expressions) from user audio while forwarding that audio to the LLM; (2) Speak mode drives a lip-synced speaking performance from the LLM's response audio; (3) Silence mode generates idle video from text prompts for natural pauses. This replaces static animations with dynamic, low-latency video (480p streaming demos of up to 45 minutes) and handles speak-listen handoffs, though audio source-separation errors can introduce minor lip-sync issues. Use multimodal inputs (a first-frame image + optional reference images + audio + text) in a single pass for plug-and-play compatibility in agents, live streams, or NPCs.
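A minimal sketch of this three-state control loop, assuming a hypothetical per-chunk generation API (LPM-1.0 publishes no API, so all names here are illustrative):

```python
from enum import Enum, auto

class Mode(Enum):
    LISTEN = auto()   # user speaking: render reactive behaviors
    SPEAK = auto()    # LLM audio playing: render lip-synced speech
    SILENCE = auto()  # nobody speaking: render text-prompted idle video

class FullDuplexDriver:
    """Hypothetical controller for the listen/speak/silence loop.

    `video_model` and `llm_audio_agent` are assumed interfaces; LPM-1.0
    releases no weights or APIs, so this only illustrates control flow.
    """

    def __init__(self, video_model, llm_audio_agent, idle_prompt: str):
        self.model = video_model
        self.agent = llm_audio_agent   # audio-to-audio model (e.g. a voice agent)
        self.idle_prompt = idle_prompt
        self.mode = Mode.SILENCE

    def step(self, user_audio, agent_audio):
        """Emit one low-latency video chunk per audio tick."""
        if agent_audio is not None:
            # (2) Speak: lip-synced performance from the LLM's response audio.
            self.mode = Mode.SPEAK
            return self.model.generate_chunk(audio=agent_audio)
        if user_audio is not None:
            # (1) Listen: reactive behaviors from user audio, which is also
            # forwarded to the LLM so it can compose its reply.
            self.mode = Mode.LISTEN
            self.agent.feed(user_audio)
            return self.model.generate_chunk(audio=user_audio, reactive=True)
        # (3) Silence: idle behavior driven by a text prompt.
        self.mode = Mode.SILENCE
        return self.model.generate_chunk(text=self.idle_prompt)
```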
Multi-Granularity Identity and Multimodal Control
Preserve character details such as teeth, wrinkles, profile geometry, and body shape without hallucination by conditioning on global appearance references, multi-view body images, and facial expression exemplars. Control performance via unified text (e.g., 'speak while slouched,' 'show hurt then sadness'), audio (lip sync, rhythm, emotion), and images, enabling steerable emotions, interactions (human-object, human-animal), and expression transitions (e.g., smugness to amusement). Zero-shot generalization extends to photoreal humans, 2D anime characters, 3D models, and creatures without fine-tuning, producing expressive outputs such as singing in English, Chinese, and Portuguese with melody-aligned visemes and breathing motion.
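To show what a single-pass multimodal request might look like, here is a hypothetical input bundle; the field names and types are assumptions, not a published interface:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConditioningBundle:
    """Hypothetical single-pass input bundle for identity and control.

    Mirrors the multi-granularity conditioning described above: a required
    first-frame identity image, optional global/body/expression references,
    and audio plus text controls, all consumed together in one pass.
    """
    first_frame: bytes                                           # identity image
    appearance_refs: list[bytes] = field(default_factory=list)   # global appearance
    body_views: list[bytes] = field(default_factory=list)        # multi-view body
    expression_refs: list[bytes] = field(default_factory=list)   # facial exemplars
    audio: Optional[bytes] = None   # drives lip sync, rhythm, emotion
    text: Optional[str] = None      # e.g. "speak while slouched"

# Example: text-only control over a single identity image.
bundle = ConditioningBundle(
    first_frame=b"<png bytes>",
    text="show hurt, then transition to sadness",
)
```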
Long-Term Stability and Expressive Behaviors
Maintain consistency over unbounded lengths (22- to 45-minute full-duplex demos) via a streaming architecture that avoids drift in identity or quality (see the sketch after this paragraph). Speaking demos cover emotions such as terror, grief, and resentment with accurate delivery, breathing, and body language; listening generates context-aware reactions (guarded nods, empathetic brows) based on persona and relationship; singing stays aligned to fast lyrics across genres and languages. All videos on the site are single generations from synthetic inputs (no real likenesses), with visible artifacts confirming their synthetic nature. The work is for non-commercial academic use only; no model weights, code, or APIs are released. It focuses on positive applications such as education and accessibility, and opposes deceptive use.
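One plausible way to realize the drift-avoidance property mentioned above, sketched under the assumption of chunked autoregressive streaming (the actual LPM-1.0 architecture is not published): each chunk re-anchors appearance to the fixed identity references rather than copying it from recent frames, while a short frame tail carries motion continuity.

```python
def stream_unbounded(model, bundle, audio_stream,
                     chunk_frames=48, context_frames=8):
    """Hypothetical chunked streaming loop for unbounded-length video.

    `model` and `bundle` are the assumed interfaces from the sketches
    above; this illustrates the idea, not the released system.
    """
    context = []  # short tail of generated frames for temporal continuity
    for audio_chunk in audio_stream:          # runs as long as audio arrives
        frames = model.generate_chunk(
            identity=bundle,                  # fixed refs re-anchor identity
            context=context[-context_frames:],
            audio=audio_chunk,
            num_frames=chunk_frames,
        )
        yield frames
        context = frames                      # keep only the newest chunk
```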