MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences
Abstract
MuSEAgent enhances multimodal reasoning through stateful experience learning, abstracting interactions into atomic decision experiences that support policy-driven retrieval and adaptive search strategies.
Research agents have recently achieved significant progress in information seeking and synthesis across heterogeneous textual and visual sources. In this paper, we introduce MuSEAgent, a multimodal reasoning agent that enhances decision-making by extending the capabilities of research agents to discover and leverage stateful experiences. Rather than relying on trajectory-level retrieval, we propose a stateful experience learning paradigm that abstracts interaction data into atomic decision experiences through hindsight reasoning. These experiences are organized into a quality-filtered experience bank that supports policy-driven experience retrieval at inference time. Specifically, MuSEAgent enables adaptive experience exploitation through complementary wide- and deep-search strategies, allowing the agent to dynamically retrieve multimodal guidance across diverse compositional semantic viewpoints. Extensive experiments demonstrate that MuSEAgent consistently outperforms strong trajectory-level experience retrieval baselines on both fine-grained visual perception and complex multimodal reasoning tasks. These results validate the effectiveness of stateful experience modeling in improving multimodal agent reasoning.
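The abstract's notions of atomic decision experiences and a quality-filtered experience bank can be made concrete with a minimal Python sketch. The class and field names below (`DecisionExperience`, `ExperienceBank`, `quality_threshold`, `viewpoint_embeddings`) are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class DecisionExperience:
    """One atomic, state-level experience distilled via hindsight reasoning (names assumed)."""
    state_summary: str        # abstracted agent state at the decision point
    action: str               # action the agent took in that state
    hindsight_lesson: str     # lesson distilled by hindsight evaluation of the outcome
    quality: float            # estimated usefulness of this experience
    viewpoint_embeddings: dict = field(default_factory=dict)  # viewpoint name -> embedding vector

class ExperienceBank:
    """Stores only those decision experiences that pass a quality threshold."""

    def __init__(self, quality_threshold: float = 0.5):
        self.quality_threshold = quality_threshold
        self.experiences: list[DecisionExperience] = []

    def add(self, exp: DecisionExperience) -> bool:
        # Quality filtering: keep an experience only if it clears the threshold.
        if exp.quality >= self.quality_threshold:
            self.experiences.append(exp)
            return True
        return False
```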
Community
MuSEAgent enhances multimodal agent reasoning by leveraging fine-grained stateful experiences. The approach consists of two phases: (1) Experience Abstraction, which extracts state-level experiences via hindsight evaluation and builds multi-viewpoint embeddings for each experience; and (2) Experience Exploitation, in which the agent performs a wide-and-deep search over the experience bank to determine its next action at inference time.
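A hedged sketch of the exploitation phase follows, assuming the experience bank above and plain cosine similarity in place of the paper's learned, policy-driven scoring: a wide pass collects the best-matching experience under each semantic viewpoint, and a deep pass drills further into the single viewpoint with the strongest match. The function names and the `wide_k` / `deep_k` parameters are hypothetical.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def wide_and_deep_search(bank, query_embeddings: dict, wide_k: int = 3, deep_k: int = 5):
    """Illustrative wide-and-deep retrieval over the experience bank (assumed interface)."""
    # Wide pass: score every stored experience under each semantic viewpoint.
    per_viewpoint = {}
    for view, query in query_embeddings.items():
        scored = [
            (cosine(query, exp.viewpoint_embeddings[view]), exp)
            for exp in bank.experiences
            if view in exp.viewpoint_embeddings
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        per_viewpoint[view] = scored

    # Wide hits: the single best experience from each viewpoint.
    wide_hits = [scored[0][1] for scored in per_viewpoint.values() if scored][:wide_k]

    # Deep pass: expand within the viewpoint whose top match is strongest.
    best_view = max(
        per_viewpoint,
        key=lambda v: per_viewpoint[v][0][0] if per_viewpoint[v] else float("-inf"),
    )
    deep_hits = [exp for _, exp in per_viewpoint[best_view][:deep_k]]
    return wide_hits, deep_hits
```

Keeping the two passes separate preserves coverage across compositional viewpoints (wide) while still allowing focused evidence gathering from the most relevant viewpoint (deep).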
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- XSkill: Continual Learning from Experience and Skills in Multimodal Agents (2026)
- V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval (2026)
- MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline (2026)
- MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions (2026)
- VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph (2026)
- A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding (2026)
- Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation (2026)