- Emu3.5: Native Multimodal Models are World Learners
  Paper • 2510.26583 • Published • 111
- RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging
  Paper • 2510.20479 • Published • 12
- A Definition of AGI
  Paper • 2510.18212 • Published • 36
- Video-As-Prompt: Unified Semantic Control for Video Generation
  Paper • 2510.20888 • Published • 50
Collections including paper arxiv:2510.23451
- RL makes MLLMs see better than SFT
  Paper • 2510.16333 • Published • 49
- Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
  Paper • 2510.16888 • Published • 22
- Reasoning with Sampling: Your Base Model is Smarter Than You Think
  Paper • 2510.14901 • Published • 48
- Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation
  Paper • 2510.21583 • Published • 31

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 15
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- Is Multilingual LLM Watermarking Truly Multilingual? A Simple Back-Translation Solution
  Paper • 2510.18019 • Published • 18
- PORTool: Tool-Use LLM Training with Rewarded Tree
  Paper • 2510.26020 • Published • 5
- POWSM: A Phonetic Open Whisper-Style Speech Foundation Model
  Paper • 2510.24992 • Published • 4
- Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
  Paper • 2510.24821 • Published • 41

- InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
  Paper • 2502.11573 • Published • 9
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
  Paper • 2502.02339 • Published • 23
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
  Paper • 2502.11775 • Published • 9
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
  Paper • 2412.18319 • Published • 39