Motus: A Unified Latent Action World Model (Stage 2 Pretrained)

Motus is a unified latent action world model that builds on existing pretrained models and rich, shareable motion information. It introduces a Mixture-of-Transformers (MoT) architecture to integrate three experts (understanding, action, and video generation) and adopts a UniDiffuser-style scheduler to switch flexibly between modeling modes (World Model, Vision-Language-Action Model, Inverse Dynamics Model, Video Generation Model, and Video-Action Joint Prediction Model). Motus further leverages optical flow to learn latent actions and adopts a three-phase training pipeline with a six-layer data pyramid, thereby extracting pixel-level "delta actions" and enabling large-scale action pretraining.
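
The UniDiffuser-style scheduler is what enables the mode switching: each modality carries its own diffusion timestep, so conditioning and generation can be mixed per modality. The sketch below illustrates only the general mechanism; the mode names mirror the list above, but the exact timestep pairings and the function shown are assumptions for illustration, not the repository's scheduler API.

# Conceptual sketch of UniDiffuser-style per-modality timesteps (illustration only).
# Timestep 0 marks a modality given clean and conditioned on; a sampled timestep t marks
# a modality being denoised; the maximum timestep marks a modality treated as pure noise.
T_MAX = 1000  # assumed number of diffusion steps

def timesteps_for_mode(mode: str, t: int) -> dict:
    """Illustrative mapping from modeling mode to (video, action) timesteps."""
    if mode == "world_model":        # generate future video conditioned on clean actions
        return {"video": t, "action": 0}
    if mode == "inverse_dynamics":   # infer actions conditioned on clean video
        return {"video": 0, "action": t}
    if mode == "joint_prediction":   # denoise video and actions together
        return {"video": t, "action": t}
    if mode == "video_generation":   # video only; the action branch stays pure noise
        return {"video": t, "action": T_MAX}
    if mode == "vla":                # actions only; the future-video branch stays pure noise
        return {"video": T_MAX, "action": t}
    raise ValueError(f"unknown mode: {mode}")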

This checkpoint contains the Stage 2 pretrained Motus model.

Homepage | GitHub | arXiv | Feishu | WeChat


Highlights

  • 87.02% average success rate on RoboTwin 2.0 (+15% over X-VLA, +45% over π₀.₅)
  • Unified 5-in-1 Model: VLA, World Model, IDM, VGM, and Video-Action Joint Prediction
  • Tri-model Joint Attention: Video, action, and understanding experts share attention layers
  • Latent Action Pretraining: Pretrained on optical flow-derived latent actions

Model Details

Architecture

Component                       Base Model      Parameters
VGM (Video Generation Model)    WAN 2.2         ~5.00B
VLM (Vision-Language Model)     Qwen3-VL-2B     ~2.13B
Action Expert                   -               ~641.5M
Understanding Expert            -               ~253.5M
Total                           -               ~8B

Action Representation

  • Control frequency: 30Hz (default)
  • Action chunk size: 48 steps (default)
  • Action dimension: 14 (bimanual: 7 per arm)
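
As a concrete illustration of these defaults, the sketch below steps through one predicted action chunk at 30 Hz and splits each 14-dimensional step into two 7-DoF arm commands. The per-arm ordering and the send_command interface are assumptions for illustration, not part of the repository.

# Minimal sketch: execute one 48 x 14 action chunk at 30 Hz on a bimanual robot.
import time
import numpy as np

CONTROL_HZ = 30
action_chunk = np.zeros((48, 14), dtype=np.float32)  # placeholder; use the model's output

for step in action_chunk:
    left_cmd, right_cmd = step[:7], step[7:]   # 7 DoF per arm (ordering assumed)
    # send_command(left_cmd, right_cmd)        # hypothetical robot/sim interface
    time.sleep(1.0 / CONTROL_HZ)               # hold the 30 Hz control rate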

Hardware & Software Requirements

Mode                                  VRAM      Recommended GPU
Inference (with pre-encoded T5)       ~24 GB    RTX 5090
Inference (without pre-encoded T5)    ~41 GB    A100 (40GB) / A100 (80GB) / H100 / B200
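
The VRAM gap between the two rows comes from keeping the T5 text encoder out of memory at inference time. The sketch below shows one way to pre-encode language embeddings offline, assuming a UMT5-style encoder such as google/umt5-xxl; the exact checkpoint and expected embedding layout are assumptions, so defer to the repository's own encoding script.

# Minimal sketch: pre-encode the instruction once, save the embeddings, and pass them
# later as language_embeddings so the text encoder is never loaded during inference.
import torch
from transformers import AutoTokenizer, UMT5EncoderModel

encoder_name = "google/umt5-xxl"  # assumed encoder; verify against the Motus/WAN config
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
encoder = UMT5EncoderModel.from_pretrained(encoder_name, torch_dtype=torch.bfloat16)
encoder = encoder.to("cuda").eval()

instruction = "pick up the red block and place it into the box"
with torch.no_grad():
    tokens = tokenizer(instruction, return_tensors="pt").to("cuda")
    t5_embeddings = encoder(**tokens).last_hidden_state.cpu()  # [1, seq_len, hidden_dim]

torch.save(t5_embeddings, "t5_embeddings.pt")  # reload at inference time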

Quickstart (Inference)

# Run from the root of the Motus repository
import torch
import yaml
from pathlib import Path

from models.motus import Motus, MotusConfig

# Load config
with open("configs/robotwin.yaml", "r") as f:
    config = yaml.safe_load(f)

# Create model config
model_config = MotusConfig(
    wan_checkpoint_path=config['model']['wan']['checkpoint_path'],
    vae_path=config['model']['wan']['vae_path'],
    wan_config_path=config['model']['wan']['config_path'],
    video_precision=config['model']['wan']['precision'],
    vlm_checkpoint_path=config['model']['vlm']['checkpoint_path'],
    action_dim=config['common']['action_dim'],
    action_state_dim=config['common']['state_dim'],
    num_video_frames=config['common']['num_video_frames'],
    video_height=config['common']['video_height'],
    video_width=config['common']['video_width'],
    load_pretrained_backbones=False,  # Load from checkpoint
)

# Initialize and load checkpoint
device = "cuda:0"
model = Motus(model_config).to(device).eval()
model.load_checkpoint("./pretrained_models/Motus", strict=False)

# Run inference (first_frame_tensor, state_tensor, t5_embeddings, and vlm_inputs must
# be prepared beforehand; see the input-preparation sketch after this block)
with torch.no_grad():
    predicted_frames, predicted_actions = model.inference_step(
        first_frame=first_frame_tensor,  # [1, C, H, W]
        state=state_tensor,              # [1, state_dim]
        num_inference_steps=20,          # number of denoising steps
        language_embeddings=t5_embeddings,  # pre-encoded T5 text embeddings
        vlm_inputs=[vlm_inputs],            # inputs for the VLM (understanding) branch
    )

# predicted_actions: [1, action_chunk_size, action_dim]
action_chunk = predicted_actions.squeeze(0).cpu().numpy()
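
The quickstart assumes first_frame_tensor, state_tensor, t5_embeddings, and vlm_inputs already exist. Below is a minimal sketch of preparing the image and state tensors with the shapes shown in the comments above, continuing from the config and device defined in the quickstart; the value range, resize mode, and state layout are assumptions, so defer to the repository's data-processing code.

# Minimal sketch: build the image and state tensors the quickstart expects.
import numpy as np
import torch
import torch.nn.functional as F

H, W = config['common']['video_height'], config['common']['video_width']

# Current camera frame as an H x W x 3 uint8 RGB array (e.g. read from a camera or dataset).
frame = np.zeros((480, 640, 3), dtype=np.uint8)                      # placeholder image
frame_t = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0   # [C, H, W]; 0-1 range assumed
first_frame_tensor = F.interpolate(frame_t[None], size=(H, W), mode="bilinear").to(device)  # [1, C, H, W]

# Proprioceptive state (e.g. joint positions for both arms), matching state_dim in the config.
state = np.zeros(config['common']['state_dim'], dtype=np.float32)    # placeholder state
state_tensor = torch.from_numpy(state)[None].to(device)              # [1, state_dim]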

Citation

@misc{bi2025motusunifiedlatentaction,
      title={Motus: A Unified Latent Action World Model}, 
      author={Hongzhe Bi and Hengkai Tan and Shenghao Xie and Zeyuan Wang and Shuhe Huang and Haitian Liu and Ruowen Zhao and Yao Feng and Chendong Xiang and Yinze Rong and Hongyan Zhao and Hanyu Liu and Zhizhong Su and Lei Ma and Hang Su and Jun Zhu},
      year={2025},
      eprint={2512.13030},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.13030}, 
}