VOID Erases Video Objects While Rewriting Physics
Netflix's open-source VOID model uses a two-pass pipeline—reasoning with VLM + SAM 2 for quad masks, then diffusion generation—to remove objects and simulate counterfactual scenes without ghost interactions, excelling in dance but struggling with fights.
VOID's Two-Pass Pipeline Fixes Ghost Interactions
Standard video inpainting tools erase objects like watermarks or static people by filling pixels from the surroundings, but they ignore physics, leaving effects without causes: a blender that keeps spinning or pins that keep falling after the person who triggered them is gone. VOID counters this by reimagining a 'counterfactual reality' in which the object never existed.
First pass: Reasoning. A vision-language model (VLM) paired with SAM 2 (Segment Anything Model 2) tracks the target pixel-perfectly and predicts causal effects; removing one domino, for example, flags the downstream chain reaction. The result is a 'quad mask' that expands beyond the object itself to map the zones where physics must be rewritten.
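VOID's actual mask format isn't spelled out here, so the sketch below is only a minimal illustration of the idea, with hypothetical names (build_quad_mask, effect_boxes) and SciPy dilation standing in for whatever the model really uses: merge the SAM 2 segmentation with the VLM-predicted effect regions, then dilate so edges regenerate cleanly.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def build_quad_mask(object_mask, effect_boxes, dilate_px=16):
    """Combine a SAM 2 object mask with VLM-flagged effect regions.

    object_mask:  (T, H, W) bool array, per-frame segmentation of the target
    effect_boxes: length-T list of (x0, y0, x1, y1) boxes the VLM predicts
                  will be physically affected by the removal
    """
    quad = object_mask.copy()
    for t, boxes in enumerate(effect_boxes):
        for x0, y0, x1, y1 in boxes:
            quad[t, y0:y1, x0:x1] = True  # mark causal-effect zones for rewrite
    # Dilate each frame independently so the mask expands beyond the object
    # without bleeding across time.
    return np.stack([binary_dilation(f, iterations=dilate_px) for f in quad])
```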
Second pass: Generation and refinement. A video diffusion model inpaints within the quad mask. To prevent morphing or dreaminess, an optional flow warp noise step locks the shapes and temporal consistency of the remaining objects, as sketched below. Prompts describe the desired scene without mentioning the removed object: 'fighter in dark kimono in gym' rather than any reference to the erased white-kimono fighter.
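The write-up doesn't show how VOID implements flow warp noise, so here is a minimal sketch of the general technique using OpenCV's Farneback optical flow (flow_warped_noise is a hypothetical name, and the grayscale (T, H, W) input is an assumption): sample noise once for frame 0, then warp it along scene motion so the diffusion model denoises toward temporally stable shapes instead of re-dreaming them per frame.

```python
import numpy as np
import cv2

def flow_warped_noise(gray_frames, seed=0):
    """Warp frame 0's noise along optical flow for per-frame consistency.

    gray_frames: (T, H, W) uint8 grayscale frames of the input video
    """
    T, H, W = gray_frames.shape
    rng = np.random.default_rng(seed)
    noises = [rng.standard_normal((H, W)).astype(np.float32)]
    grid_x, grid_y = np.meshgrid(np.arange(W, dtype=np.float32),
                                 np.arange(H, dtype=np.float32))
    for t in range(1, T):
        flow = cv2.calcOpticalFlowFarneback(gray_frames[t - 1], gray_frames[t],
                                            None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # Pull each pixel's noise from where it was in the previous frame.
        map_x = grid_x - flow[..., 0]
        map_y = grid_y - flow[..., 1]
        noises.append(cv2.remap(noises[-1], map_x, map_y, cv2.INTER_LINEAR,
                                borderMode=cv2.BORDER_REFLECT))
    return np.stack(noises)
```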
Trade-off: Works best for simple interactions; complex dynamics like fights produce ghost-like remnants because physics simulation can't fully rewrite human behavior.
Training on Synthetic Physics Simulations
Real-world data lacks 'unhappened' events, so Netflix/Insight trained VOID in synthetic environments like Kubric. They ran thousands of paired physics simulations: each scene once with the object (and its collisions) and once without it. The model learns mappings from an object's presence to its environmental impact, teaching cause and effect without filming impossibilities like 'uncrashed cars.' A toy illustration follows.
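Kubric's actual API isn't shown in the source, so the snippet below is a toy stand-in that captures only the training recipe: simulate the same scene twice, once with every object present and once with one removed, and keep the outcome pair as a cause-and-effect label.

```python
def simulate_chain(present):
    """Toy domino chain: a pushed domino falls and pushes its neighbour,
    but a missing domino breaks the chain."""
    fallen = [False] * len(present)
    pushing = True  # domino 0 receives the initial push
    for i, here in enumerate(present):
        if here and pushing:
            fallen[i] = True
        else:
            pushing = False
    return fallen

def paired_sample(n=8, removed=3):
    """One training pair: the same scene with and without one object."""
    factual = simulate_chain([True] * n)              # all n dominoes fall
    counterfactual = simulate_chain([i != removed for i in range(n)])
    return factual, counterfactual

print(paired_sample())
# factual: all True; counterfactual: dominoes 0-2 fall, chain stops at the gap
```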
Outcome: VOID generalizes to real videos, handling interactions better than pixel-fill alone, but requires precise segmentation and prompts for optimal masks.
Streamlined Setup with Custom Web App
The raw GitHub repo (https://github.com/Netflix/void-model) has gaps: undocumented SAM 3 requirements, strict 'quad_mask_0.mp4' file naming, and no built-in GUI for masking. Fix this by deploying on a Runpod H100 GPU pod (100 GB container, port 8998):
- SSH, clone https://github.com/andrisgauracs/netflix-void-web-app.
- Run run.sh with a Hugging Face token (for models), SAM 3 gated access, and a Gemini API key (pose estimation).
- Access the UI tabs: Segment (prompt + points for the SAM 2 mask), Inference (counterfactual prompt), Results (view + optional second-pass refinement).
The app automates the workflow: upload video → mask → infer → refine. It cuts testing from hours of CLI debugging to minutes, but demands a beefy GPU (H100 recommended) and API approvals.
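If you bypass the web app and drive the raw repo directly, the strict file naming still bites; a tiny helper like the hypothetical stage_mask below (directory layout is an assumption) keeps exported masks in the expected slot.

```python
import shutil
from pathlib import Path

def stage_mask(exported_mask: str, work_dir: str, index: int = 0) -> Path:
    """Copy an exported mask video into the strictly named slot the repo
    reportedly expects, e.g. 'quad_mask_0.mp4'."""
    dst = Path(work_dir) / f"quad_mask_{index}.mp4"
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(exported_mask, dst)
    return dst
```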
Test Results: Strengths in Motion, Weak in Combat
Matrix fight (remove Neo): Morpheus punches at air and ghostly remnants; hand inconsistencies persist even after refinement. The model fails to make the opponent static, since it can't invent idle behavior.
La La Land dance (remove Emma Stone): Near-flawless. Ryan Gosling dances solo seamlessly, even through occlusions; minor artifacts only. Best result—proves strength in rhythmic, predictable motion.
Titanic bow (remove Jack): Kate stands alone convincingly, but arm artifacts and a morphing face create an uncanny-valley effect. User error in segmentation left hand remnants, underscoring the need for precise point selection.
Overall: VOID delivers on the physics rewrite in two of three tests, with artifacts under occlusion and complex interaction. Future directions point toward Netflix interactive narratives like Bandersnatch and user-driven edits. Use it for VFX cleanup and personalized video, and test your own clips to gauge fit.