MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation Paper • 2601.06874 • Published 25 days ago • 12
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models Paper • 2407.21534 • Published Jul 31, 2024 • 5
Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation Paper • 2503.24379 • Published Mar 31, 2025 • 76
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization Paper • 2503.23377 • Published Mar 30, 2025 • 57