arxiv:2606.16993

DreamX-World 1.0: A General-Purpose Interactive World Model

Published on Jun 15

Submitted by Adina Yakefu on Jun 16

AMAP-ML

Authors: Rui Chen, Jing Tang

Abstract

DreamX-World 1.0 is an interactive text/image-to-video model that generates long-horizon content with camera control and scene persistence using specialized encoding, training techniques, and optimization methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, exceeding the overall scores of HY-WorldPlay 1.5 (80.79) and LingBot-World (80.45).
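
The abstract does not specify how camera-geometry-based retrieval is scored, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes each stored frame keeps a camera position and a unit viewing direction, and it ranks stored frames by a hypothetical similarity that favors cameras that are close to the query and looking the same way. The names `CameraPose`, `pose_similarity`, `retrieve_memory_frames`, and the `distance_scale` parameter are all invented for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class CameraPose:
    """Hypothetical minimal camera record: where it is and where it looks."""
    position: Vec3
    forward: Vec3  # assumed to be unit length


def _dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))


def pose_similarity(query: CameraPose, memory: CameraPose,
                    distance_scale: float = 1.0) -> float:
    """Score in [0, 1]: high when the cameras are nearby and aligned.

    This scoring rule is an assumption made for illustration only.
    """
    distance = math.dist(query.position, memory.position)
    # Views facing away from the query direction contribute nothing.
    alignment = max(0.0, _dot(query.forward, memory.forward))
    return alignment * math.exp(-distance / distance_scale)


def retrieve_memory_frames(query: CameraPose,
                           memory_poses: Sequence[CameraPose],
                           top_k: int = 2) -> List[int]:
    """Return indices of the top_k stored poses most similar to the query."""
    ranked = sorted(
        range(len(memory_poses)),
        key=lambda i: pose_similarity(query, memory_poses[i]),
        reverse=True,
    )
    return ranked[:top_k]
```

In a design like this, the returned indices would select which earlier frames (or their latents) are fed back as conditioning when the camera revisits a region. A real system would more likely measure view-frustum overlap using full intrinsics and extrinsics than rely on this simple position-and-direction heuristic.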

Community

ruichen9618

Paper author about 17 hours ago

Project Page: https://amap-ml.github.io/DreamX_World
Code: https://github.com/AMAP-ML/DreamX-World

Jiashuz

about 17 hours ago

Great work! Looking forward to the code and checkpoints of DreamX-World 1.0!

xiaochonglinghu

about 17 hours ago

A very interesting work!
Featuring:
🎮 Real-time interaction
🧠 Long-term memory
🎨 Multi-style support
⏱️ 1-minute continuous generation

ruichen9618

Paper author about 13 hours ago

Model: https://huggingface.co/GD-ML/DreamX-World-5B

noahml

about 1 hour ago

Neat paper. The approach to mixing Unreal Engine rendering with real-world videos to build this world model seems like a solid way to handle data diversity. I’m especially interested in how it keeps style and color drift under control when generating such long-horizon content.

I am curious about the Memory-Conditioned Scene Persistence. Does the reliance on camera-geometry-based retrieval ever struggle when the generated scene deviates too far from the initial geometry?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/66e5befb-5721-4c75-a21f-84cac53e25b7