---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- linear-attention
- gated-deltanet
- infinitevl
- multimodal
---

<div align="center">

<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/Logo.png" width="500" alt="InfiniteVL Logo">

<hr>

### InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Hongyuan Tao<sup>1</sup>,
[Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>,
[Shaoyu Chen](https://scholar.google.com/citations?user=PIeNN2gAAAAJ&hl=en&oi=sra)<sup>2</sup>,
Haoran Yin<sup>2</sup>,
[Qian Zhang](https://scholar.google.com/citations?user=pCY-bikAAAAJ&hl=zh-CN)<sup>2</sup>,
[Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>,
[Xinggang Wang](https://xwcv.github.io)<sup>1,βœ‰οΈ</sup>

<sup>1</sup>Huazhong University of Science and Technology,
<sup>2</sup>Horizon Robotics

(βœ‰οΈ) corresponding author: <a href="mailto:[email protected]">[email protected]</a>

<br>
<a href="https://arxiv.org/abs/2512.08829"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>
<a href="https://huggingface.co/hustvl/InfiniteVL/"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>

</div>

## Introduction

**InfiniteVL** is a novel linear-complexity Vision-Language Model (VLM) architecture designed to overcome the computational bottlenecks of traditional Transformers in processing **unlimited multimodal streams**.


By synergizing **Sliding Window Attention (SWA)** for fine-grained local perception and **Gated DeltaNet** for efficient long-term memory, InfiniteVL achieves a "best of both worlds" balance. It delivers competitive performance on standard benchmarks (comparable to Qwen2.5-VL) while enabling constant-memory inference and high-throughput streaming.

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/image1_new_01.png" width="800" alt="InfiniteVL Logo">
</div>

### ✨ Key Highlights
*   πŸš€ **High Efficiency:** Achieves **>3.6Γ—** inference speedup and constant memory footprint compared to FlashAttention-2 accelerated Transformers.
*   ⚑ **Real-Time Streaming:** Sustains a stable **24 FPS** prefill speed on a single **NVIDIA RTX 4090** for continuous video understanding.
*   🧠 **Unlimited Context:** Effectively retains context over extremely long sequences (tested >500K tokens) without OOM errors.
*   πŸ† **Strong Performance:** Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) on comprehensive aspects.

## News
*   `Dec. 10th, 2025`: We release the **InfiniteVL** model weights and inference code! Please check [Model Zoo](#model-zoo).
*   `Dec. 10th, 2025`: We release our paper on [arXiv](https://arxiv.org/abs/2512.08829).

## Table of Contents

*   [Introduction](#introduction)
*   [Key Highlights](#key-highlights)
*   [News](#news)
*   [Architecture](#architecture)
*   [Training Strategy](#training-strategy)
*   [Performance](#performance)
*   [Model Zoo](#model-zoo)
*   [Getting Started](#getting-started)
*   [Advanced Usage: CUDA Graph Acceleration](#advanced-usage-cuda-graph-acceleration)
*   [Qualitative Analysis & Visualization](#qualitative-analysis--visualization)
*   [Contact](#contact)
*   [Citation](#citation)
*   [Acknowledgement](#acknowledgement)

## Architecture

<div align="center">
  <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/architecture.png" alt="InfiniteVL Architecture" width="50%">
</div>
<br>

**InfiniteVL** adopts a hybrid architecture that synergizes the efficiency of linear attention with the precision of window-based attention. The model comprises a **Vision Encoder** (adapted from Qwen2.5-VL), a **Projection MLP**, and a **Decoder-only LLM Backbone**.

### Key Design Highlights

*   **Hybrid Block Design**: The LLM backbone consists of **9 Hybrid Blocks**. Within each block, we strategically interleave the following layers (a layout sketch follows this list):
    *   **1 Sliding Window Attention (SWA) Layer**: Responsible for capturing high-resolution local context and fine-grained visual details.
    *   **3 Gated DeltaNet Layers**: Responsible for modeling long-range global dependencies with linear complexity.

*   **Constant Memory Footprint**: Unlike traditional Transformers where the Key-Value (KV) cache grows linearly with sequence length ($O(N)$), the **Gated DeltaNet** layers compress history into a fixed-size memory state (e.g., $16 \times 128 \times 256$). This enables **constant memory usage** and constant inference latency, even when processing unlimited input streams.

*   **Seamless Integration**: By combining SWA and Gated DeltaNet, InfiniteVL achieves the "best of both worlds":
    *   Local attention ensures high performance on information-intensive tasks (e.g., OCR, Document Understanding).
    *   Linear attention ensures efficiency and stability for long-context scenarios (e.g., Streaming Video Understanding).
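
To make the layout concrete, here is a minimal sketch of the 9-block interleaving and the fixed-size Gated DeltaNet state expressed as a configuration. The class name, field names, and the in-block ordering are illustrative assumptions, not the repository's actual config schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch only: names and the in-block ordering below are assumptions,
# not InfiniteVL's real configuration schema.
@dataclass
class HybridBackboneSketch:
    num_hybrid_blocks: int = 9                     # 9 hybrid blocks in the LLM backbone
    swa_layers_per_block: int = 1                  # 1 sliding window attention layer per block
    deltanet_layers_per_block: int = 3             # 3 Gated DeltaNet layers per block
    # Fixed-size recurrent memory per Gated DeltaNet layer (e.g. 16 x 128 x 256),
    # independent of sequence length -> constant memory during inference.
    deltanet_state_shape: Tuple[int, int, int] = (16, 128, 256)

    def layer_pattern(self) -> List[str]:
        """Flattened layer order; each block contributes [SWA, DeltaNet x 3]."""
        block = (["swa"] * self.swa_layers_per_block
                 + ["gated_deltanet"] * self.deltanet_layers_per_block)
        return block * self.num_hybrid_blocks

print(HybridBackboneSketch().layer_pattern()[:6])
# ['swa', 'gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'swa', 'gated_deltanet']
```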

## Training Strategy

To achieve strong multimodal performance with minimal training resources, InfiniteVL employs a **three-stage progressive training strategy**. This approach allows our linear-complexity model to inherit the vast knowledge of a Transformer teacher before adapting to long-context scenarios.

<div align="center">
  <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/training_strategy.png" alt="Training Pipeline" width="90%">
</div>

### Stage 1: Distillation Pretraining (Efficient Initialization)
*   **Goal:** Rapidly transfer knowledge from the **Qwen2.5-VL** teacher to the InfiniteVL student.
*   **Method:** We replace the teacher's attention layers with **Gated DeltaNet** while keeping other parameters frozen. We use **Layer-wise MSE Loss** (to align internal states) and **End-to-End KL Divergence** (to align output logits); a sketch of this objective follows the list below.
*   **Significance:** This bypasses the difficulty of training linear attention from scratch, ensuring a robust initialization.
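
As a concrete reference, the Stage 1 objective can be sketched as a layer-wise MSE on hidden states plus a KL divergence on output logits, as below. The loss weighting and temperature are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def stage1_distill_loss(student_hiddens, teacher_hiddens,
                        student_logits, teacher_logits,
                        kl_weight: float = 1.0, temperature: float = 1.0) -> torch.Tensor:
    """Sketch: layer-wise MSE (state alignment) + end-to-end KL (logit alignment).

    `student_hiddens` / `teacher_hiddens` are lists of [B, T, D] hidden states from
    aligned layers. The weighting and temperature are assumptions for illustration.
    """
    mse = sum(F.mse_loss(s, t.detach()) for s, t in zip(student_hiddens, teacher_hiddens))
    mse = mse / len(student_hiddens)

    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return mse + kl_weight * kl
```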

### Stage 2: Instruction SFT (General Capabilities)
*   **Goal:** Unlock strong instruction-following and reasoning capabilities.
*   **Data:** **~8M** diverse multimodal instruction pairs, covering General VQA, OCR, Mathematics, and Code.
*   **Settings:** Image resolution increased to **1344Γ—1344**; max context length set to 8,192.
*   **Outcome:** Produces the **Stage 2 Model**, which offers the best performance on standard benchmarks.

### Stage 3: Long-Sequence SFT (Context Extension)
*   **Goal:** Activate the architecture's potential for **unlimited-length processing** and streaming.
*   **Data:** A mixture of Stage 2 data (800K) and **~200K long-sequence samples** (e.g., long videos, multi-page documents).
*   **Method:** **LoRA** fine-tuning with context length extended to **32,768** (a minimal setup sketch follows this list).
*   **Outcome:** Produces the **Stage 3 Model**, enabling length extrapolation and stable streaming inference.
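
For readers who want to reproduce something like Stage 3, a minimal PEFT-style LoRA setup is sketched below. The rank, dropout, and target module names are assumptions for illustration; only "LoRA fine-tuning at a 32K training context, starting from the Stage 2 checkpoint" comes from this card.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Start from the Stage 2 checkpoint, the recommended base for further training (see Model Zoo).
model = AutoModelForCausalLM.from_pretrained(
    "hustvl/InfiniteVL",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    r=16,                                                       # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],    # assumed projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

MAX_TRAIN_CONTEXT = 32_768   # Stage 3 extends the training context length to 32K tokens
```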

## Performance

### πŸš€ Efficiency & Streaming

**InfiniteVL** is engineered for unlimited-input scenarios. Unlike Transformer-based models where cost grows linearly with history, InfiniteVL maintains **constant** computational cost and memory usage.

> **Hardware Setup:** All efficiency results are measured on a single NVIDIA RTX 4090 GPU.

<div align="center">
  <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/plot_line.png" width="80%" alt="Efficiency Comparison">
  <br>
  <em>Figure 1: Comparison of streaming FPS and latency. InfiniteVL sustains real-time performance while Transformer baselines degrade rapidly.</em>
</div>

### πŸ† Multimodal Benchmarks

InfiniteVL achieves state-of-the-art performance among linear-complexity VLMs. Crucially, thanks to our **hybrid architecture** and **high-quality training strategy**, it overcomes the traditional weakness of linear models on information-intensive tasks (e.g., OCR, Document Understanding), achieving results comparable to top-tier Transformer VLMs.

<div align="center">
  <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance1.png" width="100%" alt="Performance Comparison">
  <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance2.png" width="100%" alt="Performance Comparison">
  <br>
  <em>Figure 2: Comparison of InfiniteVL with existing VLMs on public multimodal understanding, real-world comprehension, text-rich, and reasoning-centric multimodal benchmarks.</em>
</div>
<br>

**Key Takeaways:**
*   **Best-in-Class Linear Model:** Outperforms previous linear VLMs (Cobra, MaTVLM) by large margins (+40-60 points on DocVQA/OCRBench).
*   **Transformer-Level Quality:** Matches the performance of Qwen2.5-VL-3B on complex reasoning and text-rich tasks while being significantly faster in long contexts.

## Model Zoo

We release two versions of InfiniteVL-4B to cater to different application scenarios.

| Model | Stage | Description | Training Context Length | Download |
| :--- | :---: | :--- | :---: | :---: |
| **InfiniteVL-4B** | **Stage 2** | **Best Generalist / Base.** The checkpoint directly after Instruction SFT. It delivers the **peak foundational performance** on standard multimodal benchmarks (e.g., OCR, MMMU, MathVista) and preserves the most robust knowledge. | 8K | [πŸ€— Hugging Face](https://huggingface.co/hustvl/InfiniteVL) |
| **InfiniteVL-4B-LongSFT** | **Stage 3** | **Long-Context Adapted.** Fine-tuned using only a **small amount** of long-sequence multimodal data. It successfully activates length generalization for streaming scenarios, though its full potential on extreme contexts is not yet fully exploited. | 32K | [πŸ€— Hugging Face](https://huggingface.co/hustvl/InfiniteVL-LongSFT) |


> **πŸ’‘ Recommendations:**
>
> *   **For Long-Context Inference:** Please use the **Stage 3** model. It enables stable streaming inference and avoids memory explosion.
> *   **For Training / Fine-tuning:** We strongly recommend using the **Stage 2** model as your starting point. Since it maintains the strongest general capabilities and hasn't shifted towards the specific long-context distribution, it serves as the best foundation for adaptation to new tasks or domains.

## Getting Started

### πŸ› οΈ Environment Setup

We recommend using **Anaconda** or **Miniconda** to manage the environment. The code is tested on **Python 3.11** + **PyTorch 2.6.0** + **CUDA 12.1**.

**1. Create and activate a virtual environment:**
```bash
conda create -n infinitevl python=3.11 -y
conda activate infinitevl
``` 
**2. Install dependencies:**

The core dependencies are listed as follows:
```bash
# --- Core Deep Learning ---
torch==2.6.0
torchvision==0.21.0
torchaudio==2.6.0
transformers==4.57.0
accelerate==1.8.1

# --- Vision & Multimodal ---
qwen-vl-utils==0.0.11
decord==0.6.0
opencv-python==4.11.0.86
pillow==10.4.0
timm==1.0.22
einops==0.8.1

# --- Linear Attention & Kernels (Critical) ---
# Note: These often require specific CUDA environments to build
flash-attn==2.7.4.post1
flash-linear-attention==0.4.0
fla-core==0.4.0
causal-conv1d==1.5.0.post5
triton==3.2.0
``` 
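
After installing, a quick import check helps catch CUDA/kernel build problems early. This sketch assumes the usual import names for these packages (`flash_attn`, `fla`, `causal_conv1d`); adjust if your installation differs.

```python
# Sanity check for the kernel-heavy dependencies (assumed import names).
import torch

print("torch:", torch.__version__,
      "| CUDA build:", torch.version.cuda,
      "| GPU available:", torch.cuda.is_available())

for name in ("transformers", "flash_attn", "fla", "causal_conv1d", "triton"):
    try:
        module = __import__(name)
        print(f"{name}: OK ({getattr(module, '__version__', 'unknown version')})")
    except Exception as exc:   # build/ABI mismatches usually surface here
        print(f"{name}: FAILED -> {exc}")
```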

### Using πŸ€— Transformers to Chat

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load Model
model_path = "hustvl/InfiniteVL-LongSFT" # Replace with your HF repo ID
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare Inputs
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Process Inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
<details>
<summary><strong>πŸ–ΌοΈ Multi-Image Inference (Click to expand)</strong></summary>

InfiniteVL supports inputting multiple images in a single turn for comparison or storytelling.

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "What are the similarities between these two images?"},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```

</details>
<details>
<summary><strong>πŸŽ₯ Video Inference (Click to expand)</strong></summary>

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0, 
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```
</details>

## πŸš€ Advanced Usage: CUDA Graph Acceleration

Unlike Transformer-based VLMs where the KV cache grows dynamically, **InfiniteVL maintains a constant-size memory state**. This unique property allows us to use **CUDA Graphs** to capture the entire computation graph for both streaming prefill and decoding, eliminating kernel launch overheads and maximizing GPU utilization.

This is the key technology behind our **24 FPS** real-time streaming performance.
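
Because each streaming step operates on tensors of fixed shape, it can be captured with PyTorch's standard CUDA Graph API. The sketch below shows the generic warm-up / capture / replay pattern on a placeholder `step` function; it illustrates the mechanism only and is not the project's actual streaming implementation (which lives in `infinitevl/infinitevl_streaming`).

```python
import torch

# Placeholder for one fixed-shape streaming step that updates the constant-size memory state.
def step(frame_tokens: torch.Tensor) -> torch.Tensor:
    return frame_tokens * 2.0                          # dummy compute; same shape in and out

static_in = torch.zeros(1, 256, 2048, device="cuda")   # fixed shape -> graph-capturable

# 1) Warm up on a side stream so lazy kernel/allocator setup happens outside capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_out = step(static_in)
torch.cuda.current_stream().wait_stream(s)

# 2) Capture one step into a CUDA Graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = step(static_in)

# 3) Replay: copy fresh data into the static input buffer, then launch the whole graph at once.
static_in.copy_(torch.randn_like(static_in))
graph.replay()                                          # no per-kernel launch overhead
print(static_out.mean().item())
```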

### ⚑ Accelerated Streaming Inference

Because the memory state never changes size, the entire streaming-prefill computation can be captured once as a **CUDA Graph** and replayed for every incoming chunk, eliminating kernel launch overheads.

We provide a complete script in [`examples/demo_streaming_inference.py`](examples/demo_streaming_inference.py) to demonstrate this capability.

> **πŸŽ₯ Simulation Note:** This script **simulates a real-time streaming scenario** by reading a local video file frame-by-frame. It treats the video as a continuous data stream, updating the global linear memory state on-the-fly without retraining.
>
> **⚠️ Requirement:** This demo relies on the specialized model implementation (supporting `StaticCachePrealloc` and CUDA Graphs) located in the **[`infinitevl/infinitevl_streaming`](infinitevl/infinitevl_streaming)** directory. Please ensure your environment is set up correctly to import these modules.

#### 1. Run the Simulation Demo
```bash
# Make sure you are in the project root
python examples/demo_streaming_inference.py \
    --model_path /path/to/InfiniteVL-4B \
    --video_path assets/demo.mp4 \
    --fps 30
```
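
The core loop of the simulation can be approximated as follows: sample the video at a target rate with `decord` and feed each frame through a prefill step that updates the persistent memory state. The `streaming_prefill` call below is a **hypothetical placeholder** for whatever the `infinitevl_streaming` implementation exposes; only the frame sampling and image preprocessing use standard APIs.

```python
from decord import VideoReader

def simulate_stream(model, processor, video_path: str, target_fps: float = 1.0):
    """Sketch of treating a local video file as a continuous stream.

    `model.streaming_prefill` is a HYPOTHETICAL placeholder (see
    `infinitevl/infinitevl_streaming` for the real API); it is assumed to consume one
    frame's visual tokens and update the fixed-size linear memory state in place.
    """
    vr = VideoReader(video_path)
    stride = max(1, round(vr.get_avg_fps() / target_fps))    # sample at ~target_fps

    for idx in range(0, len(vr), stride):
        frame = vr[idx].asnumpy()                             # H x W x 3 uint8 frame
        # Exact preprocessing may differ; the image-processor call here is illustrative.
        pixel_inputs = processor.image_processor(images=[frame], return_tensors="pt").to(model.device)
        model.streaming_prefill(**pixel_inputs)               # hypothetical: updates memory state
```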

### ⚑ Accelerated Decode

In addition to streaming prefill, InfiniteVL natively supports **CUDA Graph-accelerated decoding**. By capturing the decoding step into a static graph, we can achieve extremely low-latency token generation, further enhancing the responsiveness of real-time interactions.

> 🚧 **Coming Soon:** The code for accelerated decoding is currently being refactored and cleaned up. We are working hard to release it as soon as possible. Please stay tuned!

## Qualitative Analysis & Visualization

We provide visualization cases to demonstrate InfiniteVL's robust performance across diverse scenarios, ranging from information-intensive static tasks to ultra-long streaming video understanding.

### 1. Fundamental Visual-Language Capabilities (OCR & Reasoning)
InfiniteVL effectively overcomes the traditional limitations of linear attention in detailed visual perception. By combining Sliding Window Attention with Gated DeltaNet, it excels at **Dense Text Recognition (OCR), Chart Interpretation, and Complex Scene Description**, delivering performance comparable to full-attention Transformers.

<div align="center">
  <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/image_case1_01.png" width="80%" alt="Fundamental Capabilities">
</div>

### 2. Long-Term Streaming Understanding
The core strength of InfiniteVL lies in its ability to maintain coherent memory over **unlimited input streams**.

The examples below demonstrate a continuous street-view video stream. InfiniteVL maintains a constant memory state and accurately answers questions at various timestamps (e.g., Frame 3100, ~1M tokens processed), recalling specific details like "NBC Studios" text or the color of a pedestrian's bag without forgetting.

<div align="center">
  <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case1_01.png" width="80%" alt="Streaming Capabilities">
  <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case2_01.png" width="80%" alt="Streaming Capabilities">
</div>

## Contact
If you have any questions, please contact Hongyuan Tao via email ([email protected]).

## Citation

If you find InfiniteVL useful for your research or applications, please consider citing our paper:

```bibtex
@article{tao2025infinitevl,
  title={InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models},
  author={Tao, Hongyuan and Liao, Bencheng and Chen, Shaoyu and Yin, Haoran and Zhang, Qian and Liu, Wenyu and Wang, Xinggang},
  journal={arXiv preprint arXiv:2512.08829},
  year={2025}
}
``` 

## Acknowledgement

InfiniteVL is built upon the giants of the open-source community. We would like to express our gratitude to:

*   **[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)**: For providing a powerful vision-language codebase and vision encoder.
*   **[Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention)**: For the efficient linear attention mechanism and CUDA kernel implementations (FLA).
*   **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models.