Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination Paper • 2511.17490 • Published Nov 21, 2025 • 21
Building a Foundational Guardrail for General Agentic Systems via Synthetic Data Paper • 2510.09781 • Published Oct 10, 2025 • 26
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting Paper • 2504.05541 • Published Apr 7, 2025 • 15
The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report Paper • 2504.10686 • Published Apr 14, 2025
MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models Paper • 2505.19415 • Published May 26, 2025 • 2
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness Paper • 2505.20426 • Published May 26, 2025 • 7
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models Paper • 2510.05034 • Published Oct 6, 2025 • 48
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing Paper • 2503.12652 • Published Mar 16, 2025
CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling Paper • 2502.00965 • Published Feb 3, 2025
GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing Paper • 2505.11493 • Published May 16, 2025 • 3
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Paper • 2509.16197 • Published Sep 19, 2025 • 56
CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching Paper • 2509.19300 • Published Sep 23, 2025 • 6
Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning Paper • 2505.20561 • Published May 26, 2025 • 7
MOFI: Learning Image Representations from Noisy Entity Annotated Images Paper • 2306.07952 • Published Jun 13, 2023 • 2
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models Paper • 2404.07973 • Published Apr 11, 2024 • 32
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models Paper • 2410.02740 • Published Oct 3, 2024 • 54
STIV: Scalable Text and Image Conditioned Video Generation Paper • 2412.07730 • Published Dec 10, 2024 • 74