VideoMDM: Towards 3D Human Motion Generation From 2D Supervision
Abstract
VideoMDM trains 3D human motion priors from 2D poses using a diffusion framework with 2D reprojection loss and 3D motion regularizers, achieving near-3D supervised performance without requiring 3D ground truth.
We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these sequences are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing it against the accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt two standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D, it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs. 0.54). On the real-video datasets Fit3D and NBA, the method learns to generate motions that are consistently preferred by humans, with strong quantitative results.
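To make the depth-weighted reprojection idea concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the pinhole camera model, the `focal` value, the function names, and the specific weight `z / focal` are all assumptions chosen for illustration. Under a pinhole model a pixel offset scales as `focal / z`, so multiplying the 2D residual by the predicted depth over the focal length turns it back into a camera-space lateral offset when the predicted depth is correct.

```python
import numpy as np

def project(points_3d, focal=1000.0):
    """Pinhole projection (assumed camera model, principal point at origin).

    points_3d: array of shape (..., J, 3) in camera space with z > 0.
    Returns 2D coordinates of shape (..., J, 2) in pixel units.
    """
    xy = points_3d[..., :2]
    z = points_3d[..., 2:3]
    return focal * xy / z

def depth_weighted_reprojection_loss(pred_3d, target_2d, focal=1000.0):
    """Mean squared 2D reprojection error, weighted by predicted depth.

    Hypothetical weighting: each pixel residual is scaled by z / focal,
    which undoes the perspective division and expresses the error in
    camera-space units instead of pixels.
    """
    z = pred_3d[..., 2:3]
    residual_px = project(pred_3d, focal) - target_2d
    residual_metric = residual_px * z / focal
    return float(np.mean(np.sum(residual_metric ** 2, axis=-1)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic "ground truth" used only to fabricate 2D keypoints for the demo.
    gt_3d = rng.normal(size=(4, 17, 3))
    gt_3d[..., 2] = np.abs(gt_3d[..., 2]) + 3.0  # keep joints in front of the camera
    keypoints_2d = project(gt_3d)

    # A prediction shifted sideways by 0.1 units at the correct depth.
    pred_3d = gt_3d.copy()
    pred_3d[..., 0] += 0.1
    print(depth_weighted_reprojection_loss(pred_3d, keypoints_2d))
```

In this toy setup the loss for a purely lateral error at the correct depth matches the squared 3D offset regardless of how far each joint is from the camera, which is the intuition behind weighting by depth. An unweighted pixel loss would instead penalize nearby joints more heavily than distant ones for the same 3D error.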
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion (2026)
- Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling (2026)
- Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos (2026)
- Enhancing Domain Generalization in 3D Human Pose Estimation through Controllable Generative Augmentation (2026)
- FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery (2026)
- TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking (2026)
- AnyAct: Towards Human Reenactment of Character Motion From Video (2026)
Get this paper in your agent:
hf papers read 2606.13364
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash