From Perception to Action: An Interactive Benchmark for Vision Reasoning
Abstract
Current vision-language models lack the capability to understand the physical structures and causal constraints needed for complex, interactive 3D tasks, as shown by the CHAIN benchmark, which evaluates structured action planning under physical constraints.
Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess an agent's ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, physics-driven 3D testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.
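To make the contrast between single-turn and interactive evaluation concrete, the following is a minimal sketch of a closed-loop episode in which support relations constrain which actions are feasible. It is not CHAIN's actual interface: the environment `ToyStackEnv`, the helper `make_tower`, the policies `always_target` and `clear_then_take`, and the reported fields are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Observation = Dict[str, List[str]]
Policy = Callable[[Observation, str], str]


@dataclass
class ToyStackEnv:
    """Hypothetical toy world: supports[b] holds the blocks resting directly on b."""
    supports: Dict[str, Set[str]]
    target: str
    removed: List[str] = field(default_factory=list)

    def observe(self) -> Observation:
        # Expose the current support relations (sorted for determinism).
        return {b: sorted(on_top) for b, on_top in self.supports.items()}

    def is_feasible(self, block: str) -> bool:
        # A block can be removed only if it exists and nothing rests on it.
        return block in self.supports and not self.supports[block]

    def step(self, block: str) -> bool:
        # Apply the action if feasible; return False for an invalid action.
        if not self.is_feasible(block):
            return False
        del self.supports[block]
        for on_top in self.supports.values():
            on_top.discard(block)
        self.removed.append(block)
        return True

    def solved(self) -> bool:
        return self.target in self.removed


def make_tower() -> ToyStackEnv:
    # Assumed example scene: C rests on B, B rests on A; the goal is to extract A.
    return ToyStackEnv(supports={"A": {"B"}, "B": {"C"}, "C": set()}, target="A")


def always_target(obs: Observation, target: str) -> str:
    # Structure-agnostic baseline: grab the target regardless of what sits on it.
    return target


def clear_then_take(obs: Observation, target: str) -> str:
    # Structure-aware policy: climb the support chain to a block with nothing on top.
    block = target
    while obs[block]:
        block = obs[block][0]
    return block


def run_episode(env: ToyStackEnv, policy: Policy, max_steps: int = 10) -> Dict[str, object]:
    # Closed loop: observe, act, let the environment accept or reject, repeat.
    steps = 0
    invalid = 0
    for _ in range(max_steps):
        if env.solved():
            break
        action = policy(env.observe(), env.target)
        steps += 1
        if not env.step(action):
            invalid += 1
    return {"success": env.solved(), "steps": steps, "invalid": invalid}


if __name__ == "__main__":
    print(run_episode(make_tower(), clear_then_take))
    print(run_episode(make_tower(), always_target, max_steps=5))
```

In this toy setting the structure-aware policy must remove C and B before A, so success depends on a multi-step plan that respects the support relations, while the baseline only accumulates rejected actions. A real benchmark of this kind would replace the symbolic observation with rendered views and the hand-written policy with a model under test.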
Community
Introducing CHAIN, a new interactive benchmark that moves beyond static visual QA. It evaluates Vision-Language Models (VLMs) on their ability to close the loop between perception, multi-step reasoning, and actionable execution in dynamic environments.
Code: https://github.com/Social-AI-Studio/CHAIN
Website: https://social-ai-studio.github.io/CHAIN/
