Addressing Robotics Data Fragmentation

Robotics research is currently hampered by fragmented data formats across different hardware and tasks. The Qwen-RobotSuite attempts to solve this by providing three distinct foundation models that standardize how robots perceive, predict, and act.

Qwen-RobotManip: Unified Manipulation

Qwen-RobotManip is a Vision-Language-Action (VLA) model built on Qwen3.5-4B. Its core innovation is a canonical state-action representation: an 80-dimensional vector paired with a per-dimension binary mask. The mask lets the model handle diverse robot morphologies, because only the dimensions that exist on a given robot's hardware are supervised. Trained on 38,100 hours of synthesized and open-source data, it achieves gains in out-of-distribution (OOD) settings, outperforming prior state-of-the-art models such as π0.5 by 3.2× on cross-embodiment transfer tasks.
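
The masking idea can be illustrated with a minimal sketch. It assumes a mean-squared-error loss and a hypothetical slot layout (a 7-DoF arm plus a 1-DoF gripper in slots 0 to 7). Neither the loss function nor the slot assignment is given in the text, and the names make_mask and masked_mse are illustrative only.

```python
from typing import Sequence

CANONICAL_DIM = 80  # size of the canonical state-action vector (from the text)


def make_mask(active_dims: Sequence[int], dim: int = CANONICAL_DIM) -> list[int]:
    """Return a binary mask with 1 at every dimension the robot actually uses."""
    mask = [0] * dim
    for i in active_dims:
        mask[i] = 1
    return mask


def masked_mse(pred: Sequence[float], target: Sequence[float],
               mask: Sequence[int]) -> float:
    """Mean squared error computed over the unmasked dimensions only."""
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target and mask must have the same length")
    active = sum(mask)
    if active == 0:
        return 0.0  # nothing to supervise for this sample
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / active


# Hypothetical layout (an assumption): arm joints and gripper occupy slots 0..7.
arm_mask = make_mask(range(8))
target = [0.0] * CANONICAL_DIM
pred = [0.5] * CANONICAL_DIM  # wrong in all 80 slots, but only 8 are supervised
loss = masked_mse(pred, target, arm_mask)
```

Because the error is normalized by the number of active dimensions rather than by 80, a robot with few degrees of freedom is not under-weighted relative to one that fills most of the vector. That normalization is a design choice of this sketch, not something the text specifies.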

Qwen-RobotWorld: Language as a Universal Interface

Qwen-RobotWorld treats natural language as a universal action interface for a video world model. Using a 60-layer double-stream Multimodal Diffusion Transformer (MMDiT), it predicts future video trajectories conditioned on language instructions. This keeps the model embodiment-agnostic, because the instruction encodes the goal and constraints regardless of the underlying hardware (e.g., grippers vs. humanoids). It ranks first on benchmarks such as EWMBench and DreamGen, demonstrating high motion fidelity and adherence to physical laws such as gravity and fluid dynamics.

Qwen-RobotNav: Scalable Navigation

Qwen-RobotNav reframes navigation as a problem of parameterizing the observation context. Built on Qwen3-VL, it predicts 8-waypoint trajectories using a lightweight MLP head. A configurable interface lets the model adapt to different navigation tasks (PointNav, ObjNav, Tracking) by adjusting visual token budgets and temporal decay parameters. It functions as a reactive executor within an agentic system, in which a higher-level planner (Qwen3.6-Plus) decomposes long-horizon goals into sub-goals communicated via natural language.
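
One way to picture how a visual token budget and a temporal decay parameter could interact is sketched below. The text does not describe the actual allocation rule, so the exponential weighting, the function name allocate_tokens, and the parameters decay and min_tokens are all assumptions made for illustration.

```python
def allocate_tokens(num_frames: int, total_budget: int, decay: float,
                    min_tokens: int = 1) -> list[int]:
    """Split total_budget across frames, ordered oldest first, newest last.

    Assumed rule: a frame that is `age` steps old gets weight decay**age,
    so the newest frame (age 0) receives the largest share.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    if total_budget < num_frames * min_tokens:
        raise ValueError("budget too small for the per-frame minimum")

    # Ages run from num_frames - 1 (oldest) down to 0 (newest).
    weights = [decay ** age for age in range(num_frames - 1, -1, -1)]
    spare = total_budget - num_frames * min_tokens
    scale = spare / sum(weights)
    tokens = [min_tokens + int(w * scale) for w in weights]
    # Tokens lost to rounding down go to the newest frame.
    tokens[-1] += total_budget - sum(tokens)
    return tokens


# Hypothetical settings: 4 frames of history, 100 visual tokens in total.
budget = allocate_tokens(num_frames=4, total_budget=100, decay=0.5)
uniform = allocate_tokens(num_frames=4, total_budget=100, decay=1.0)
```

Under this assumed rule, a decay of 1.0 spreads tokens evenly over the history, while smaller values concentrate them on recent frames. A task such as Tracking might plausibly favor the latter, though the text does not state which settings each task uses.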