Motus: A Unified Latent Action World Model (Stage 2 Pretrained)
Motus is a unified latent action world model that leverages existing pretrained models and rich, shareable motion information. Motus introduces a Mixture-of-Transformers (MoT) architecture to integrate three experts (understanding, action, and video generation) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (World Models, Vision-Language-Action Models, Inverse Dynamics Models, Video Generation Models, and Video-Action Joint Prediction Models). Motus further leverages optical flow to learn latent actions and adopts a three-phase training pipeline with a six-layer data pyramid, thereby extracting pixel-level "delta actions" and enabling large-scale action pretraining.
This checkpoint contains the Stage 2 pretrained Motus model.
Homepage | GitHub | arXiv | Feishu | WeChat
Highlights
- 87.02% average success rate on RoboTwin 2.0 (+15% over X-VLA, +45% over π₀.₅)
- Unified 5-in-1 Model: VLA, World Model, IDM, VGM, and Video-Action Joint Prediction
- Tri-model Joint Attention: Video, action, and understanding experts share attention layers
- Latent Action Pretraining: Pretrained on optical flow-derived latent actions
Model Details
Architecture
| Component | Base Model | Parameters |
|---|---|---|
| VGM (Video Generation Model) | WAN 2.2 | ~5.00B |
| VLM (Vision-Language Model) | Qwen3-VL-2B | ~2.13B |
| Action Expert | - | ~641.5M |
| Understanding Expert | - | ~253.5M |
| Total | - | ~8B |
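The per-component counts sum to roughly 5.00B + 2.13B + 0.64B + 0.25B ≈ 8.0B parameters, which is where the ~8B total comes from.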
Action Representation
- Control frequency: 30 Hz (default)
- Action chunk size: 48 steps (default)
- Action dimension: 14 (bimanual: 7 per arm)
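With these defaults, a single chunk covers 48 / 30 = 1.6 s of motion, and each 14-dim action splits into two 7-DoF arm commands. Below is a minimal sketch of the resulting shapes; the ordering of the two 7-dim halves is an assumption, so check the repository configs for the actual joint layout.

```python
import numpy as np

CONTROL_HZ = 30   # default control frequency
CHUNK_SIZE = 48   # default action chunk size
ACTION_DIM = 14   # bimanual: 7 per arm

# One chunk spans 48 / 30 = 1.6 seconds of motion.
chunk_duration_s = CHUNK_SIZE / CONTROL_HZ

# Dummy chunk with the documented shape [chunk_size, action_dim].
action_chunk = np.zeros((CHUNK_SIZE, ACTION_DIM), dtype=np.float32)

# Assumed split: first 7 dims for one arm, last 7 for the other.
arm_a, arm_b = action_chunk[:, :7], action_chunk[:, 7:]
```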
Hardware & Software Requirements
| Mode | VRAM | Recommended GPU |
|---|---|---|
| Inference (with pre-encoded T5) | ~24 GB | RTX 5090 |
| Inference (without pre-encoded T5) | ~41 GB | A100 (40 GB) / A100 (80 GB) / H100 / B200 |
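The lower ~24 GB figure assumes the instruction embeddings are produced ahead of time, so the large text encoder never has to share GPU memory with Motus at inference. The following is a minimal sketch of such offline pre-encoding, assuming a UMT5-style encoder loaded via `transformers`; the `google/umt5-xxl` checkpoint and the use of `last_hidden_state` are assumptions, so follow the repository scripts for the exact procedure Motus expects.

```python
import torch
from transformers import AutoTokenizer, UMT5EncoderModel

# Assumed encoder checkpoint; replace with whatever the Motus/WAN setup actually uses.
tokenizer = AutoTokenizer.from_pretrained("google/umt5-xxl")
encoder = UMT5EncoderModel.from_pretrained(
    "google/umt5-xxl", torch_dtype=torch.bfloat16
).to("cuda").eval()

instruction = "pick up the red block and place it in the box"
with torch.no_grad():
    tokens = tokenizer(instruction, return_tensors="pt").to("cuda")
    t5_embeddings = encoder(**tokens).last_hidden_state  # [1, seq_len, hidden]

# Cache to disk and reload at inference time instead of keeping the encoder resident.
torch.save(t5_embeddings.cpu(), "t5_embeddings.pt")
```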
Quickstart (Inference)
```python
# Run from the root of the Motus repository
import torch
import yaml
from pathlib import Path
from models.motus import Motus, MotusConfig
# Load config
with open("configs/robotwin.yaml", "r") as f:
config = yaml.safe_load(f)
# Create model config
model_config = MotusConfig(
wan_checkpoint_path=config['model']['wan']['checkpoint_path'],
vae_path=config['model']['wan']['vae_path'],
wan_config_path=config['model']['wan']['config_path'],
video_precision=config['model']['wan']['precision'],
vlm_checkpoint_path=config['model']['vlm']['checkpoint_path'],
action_dim=config['common']['action_dim'],
action_state_dim=config['common']['state_dim'],
num_video_frames=config['common']['num_video_frames'],
video_height=config['common']['video_height'],
video_width=config['common']['video_width'],
load_pretrained_backbones=False, # Load from checkpoint
)
# Initialize and load checkpoint
device = "cuda:0"
model = Motus(model_config).to(device).eval()
model.load_checkpoint("./pretrained_models/Motus", strict=False)
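# The inference inputs below are placeholders that must be prepared beforehand;
# the exact preprocessing is not shown here (see the repository for details):
#   first_frame_tensor - current camera frame, shaped [1, C, H, W] to match
#                        the configured video_height / video_width
#   state_tensor       - robot proprioceptive state, shaped [1, state_dim]
#   t5_embeddings      - pre-encoded language embeddings for the task instruction
#   vlm_inputs         - processed inputs for the Qwen3-VL understanding expert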
# Run inference
with torch.no_grad():
predicted_frames, predicted_actions = model.inference_step(
first_frame=first_frame_tensor, # [1, C, H, W]
state=state_tensor, # [1, state_dim]
num_inference_steps=20,
language_embeddings=t5_embeddings,
vlm_inputs=[vlm_inputs],
)
# predicted_actions: [1, action_chunk_size, action_dim]
action_chunk = predicted_actions.squeeze(0).cpu().numpy()
```
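The returned chunk can then be streamed to the robot at the default 30 Hz control rate. A hedged sketch follows, where `apply_action` is a hypothetical stand-in for your own controller interface:

```python
import time

CONTROL_HZ = 30  # default control frequency

def apply_action(action):
    """Hypothetical stand-in for your robot controller interface."""
    pass

# Step through the 48 x 14 chunk, one 14-dim bimanual command per control tick.
for action in action_chunk:
    apply_action(action)
    time.sleep(1.0 / CONTROL_HZ)
```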
Citation
```bibtex
@misc{bi2025motusunifiedlatentaction,
title={Motus: A Unified Latent Action World Model},
author={Hongzhe Bi and Hengkai Tan and Shenghao Xie and Zeyuan Wang and Shuhe Huang and Haitian Liu and Ruowen Zhao and Yao Feng and Chendong Xiang and Yinze Rong and Hongyan Zhao and Hanyu Liu and Zhizhong Su and Lei Ma and Hang Su and Jun Zhu},
year={2025},
eprint={2512.13030},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.13030},
}
```