---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- linear-attention
- gated-deltanet
- infinitevl
- multimodal
---
### InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Hongyuan Tao<sup>1</sup>, [Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>, [Shaoyu Chen](https://scholar.google.com/citations?user=PIeNN2gAAAAJ&hl=en&oi=sra)<sup>2</sup>, Haoran Yin<sup>2</sup>, [Qian Zhang](https://scholar.google.com/citations?user=pCY-bikAAAAJ&hl=zh-CN)<sup>2</sup>, [Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>, [Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>

<sup>1</sup> Huazhong University of Science and Technology, <sup>2</sup> Horizon Robotics

(✉️) Corresponding author: xgwang@hust.edu.cn
## Introduction
**InfiniteVL** is a novel linear-complexity Vision-Language Model (VLM) architecture designed to overcome the computational bottlenecks of traditional Transformers in processing **unlimited multimodal streams**.
By synergizing **Sliding Window Attention (SWA)** for fine-grained local perception with **Gated DeltaNet** for efficient long-term memory, InfiniteVL achieves a "best of both worlds" trade-off between accuracy and efficiency: it delivers competitive performance on standard benchmarks (comparable to Qwen2.5-VL) while enabling constant-memory inference and high-throughput streaming.
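
To make the "constant-memory inference" claim concrete, the sketch below shows a simplified, single-head gated delta-rule recurrence of the kind used in Gated DeltaNet-style linear attention: the entire past is compressed into a fixed-size state matrix `S`, which is decayed by a gate `alpha` and updated by a delta-rule write of strength `beta`. This is a conceptual illustration only, not the released InfiniteVL implementation; the tensor shapes, gating parameterization, and function name are assumptions for exposition.

```python
import torch

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a simplified gated delta rule (conceptual sketch).

    S: (d_v, d_k) fast-weight state carrying long-term memory (constant size).
    q, k: (d_k,) query / key vectors; v: (d_v,) value vector.
    alpha, beta: scalars in (0, 1) -- decay gate and writing strength.
    """
    # Decay the old memory, erase the component along k, then write the new
    # key-value association: S_t = alpha * (S - beta * (S k) k^T) + beta * v k^T
    S = alpha * (S - beta * torch.outer(S @ k, k)) + beta * torch.outer(v, k)
    # Read-out is a linear function of the fixed-size state, so per-token cost
    # and memory stay constant no matter how long the multimodal stream gets.
    o = S @ q
    return S, o

# Toy usage: stream tokens one by one with O(1) memory.
d_k, d_v = 64, 64
S = torch.zeros(d_v, d_k)
for _ in range(1000):  # arbitrarily long stream
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    S, o = gated_delta_step(S, q, k, v, alpha=0.95, beta=0.5)
```

In the full architecture, such linear-attention layers are interleaved with SWA layers, so local details within the window are modeled exactly while everything older is summarized in the constant-size state.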