---
license: apache-2.0
---
# TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding
[[๐Ÿ“– Paper](https://arxiv.org/pdf/2508.04369)] [[๐Ÿค— TSPO-model](https://huggingface.co/hzf666/TSPO-0.4B)] [[๐Ÿค— TSPO-train-data](https://huggingface.co/datasets/hzf666/TSPO-10K)]
## ๐Ÿ‘€ Overview
Inspired by DeepSeek-R1's GRPO algorithm, we propose **Temporal Sampling Policy Optimization (TSPO)**, a reinforcement learning framework that advances long-form video understanding by addressing the challenges of unsupervised, non-differentiable sparse frame sampling.
<div align="center">
<img src="./assets/main_fig.png" width="800" height="400" style="object-fit: contain;">
</div>
## ๐Ÿ† Performance
- Our method achieves **63.9%** accuracy on LongVideoBench and **76.3%** on MLVU, setting a new state of the art among 7B video-MLLMs.
- Our trained temporal agent (TSPO-0.4B) demonstrates strong generalizability. When applied to LLaVA-Video-7B, it achieves an average improvement of **4.3%** across four benchmarks; with Qwen2.5VL-7B, the gain reaches **5.3%**. Transferability to other backbones is further analyzed in Table 2 of our paper.
<div align="center">
<img src="./assets/main_results.png" width="700" height="350" style="object-fit: contain;">
</div>
## ๐Ÿ“ Code
[โœ’๏ธ TSPO](https://github.com/Hui-design/TSPO)
## ๐Ÿงธ Toy example
We present a toy example to show how TSPO works. The intuition is that a Video-MLLM can only give the correct answer if the temporal agent samples the right keyframes. Thanks to our joint modeling of keyframe sampling and language generation, we can use the language accuracy reward $R_A$ derived from multiple-choice training data to supervise the temporal agent.
- As shown in the GIF, through TSPO training, the temporal agent learns to select frames that lead to the correct answer for the question *"What is the scene at the beginning of the video?"*. As a result, $R_A$ increases, the predicted score peaks converge at the video's beginning, and the sampled frames converge to this time segment.
- **To reproduce this example**, first download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2), [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14), and [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), then set ``model_name_or_path`` and ``clip_path`` in ``toy_example.sh``. The script runs on a single GPU with at least 28 GB of memory.
<div align="center">
<img src="./assets/gif.gif" width="800" height="400" style="object-fit: contain;">
</div>
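The reward signal in this example can be sketched in a few lines: each rollout samples a set of frames, the Video-MLLM answers, and the binary accuracy reward $R_A$ is normalized within the group (GRPO-style) to weight the log-probabilities of the sampled frames. This is an illustrative, self-contained sketch under our own naming, not the repository's actual implementation.

```python
import math

def grpo_advantages(rewards):
    # Group-relative advantages (GRPO-style): normalize each rollout's
    # accuracy reward R_A by the group's mean and standard deviation.
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-6) for r in rewards]

def sampling_policy_loss(frame_log_probs, rewards):
    # REINFORCE-style surrogate: raise the log-probability of frame
    # selections that came from rollouts with above-average rewards.
    advs = grpo_advantages(rewards)
    return -sum(lp * a for lp, a in zip(frame_log_probs, advs)) / len(rewards)

# Four rollouts: two answered correctly (R_A = 1), two did not (R_A = 0).
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts that answered correctly receive positive advantage (roughly +1 in this example), so gradient ascent raises the probability of the frames they sampled, while incorrect rollouts are pushed down symmetrically.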
## ๐Ÿ“ Set up
```
conda create -n TSPO python=3.10
conda activate TSPO
pip install -r requirement.txt
pip install flash-attn==2.5.9.post1 --no-build-isolation
cd lmms-eval
pip install -e .
cd ../
```
## ๐ŸŽฅ Demo
- Download [LLaVA-Video-Qwen-7B](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) or [Qwen2.5-VL-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), and our [TSPO-0.4B](https://huggingface.co/hzf666/TSPO-0.4B). Then try ``demo/llava_video_tspo.py`` or ``demo/qwen25vl_tspo.py``.
- We provide example long videos: [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), [7XWqI121-Q4.mp4](https://drive.google.com/file/d/1qh-8I1DsgH5TbqEbr05PPO5hdGtvUK23/view?usp=sharing), [5dJUUQufzw4.mp4](https://drive.google.com/file/d/1lBf6Oo7jkhi7-fSvrc_U7SqvqET3vhrh/view?usp=sharing). Feel free to edit the "video_path" and "question". The model prints its response, and the sampled frames are saved under the demo directory.
```
# using llava_video as backbone
CUDA_VISIBLE_DEVICES=0 python demo/llava_video_tspo.py
# using Qwen2.5vl as backbone
CUDA_VISIBLE_DEVICES=0 python demo/qwen25vl_tspo.py
```
<div align="center">
<img src="./assets/demo2.png" width="700" height="350" style="object-fit: contain;">
</div>
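Under the hood, the temporal agent assigns a relevance score to each uniformly decoded candidate frame and keeps only a sparse subset for the Video-MLLM. A minimal sketch of that scoring-then-selection step (top-k over per-frame scores, returned in temporal order); the function name and signature are illustrative, not the demo's actual API:

```python
def select_keyframes(scores, k=8):
    # scores: one relevance score per candidate frame, in temporal order.
    # Keep the k highest-scoring frames, but return their indices sorted
    # so the Video-MLLM still sees a chronological frame sequence.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Returning indices in temporal rather than score order matters: video-language models assume chronologically ordered inputs, so only the *selection* is score-driven.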
## ๐Ÿ’พ Dataset
We provide the TSPO-10K training dataset, available at [[🤗 TSPO-train-data](https://huggingface.co/datasets/hzf666/TSPO-10K)].
## ๐Ÿš€ Training
First download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) and [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14), then set ``model_name_or_path`` and ``clip_path`` in ``train_deepspeed.sh``:
```
bash train_deepspeed.sh
```
To obtain your trained TSPO-0.4B weights, run ``merge_weights.py``:
```
python scripts/merge_weights.py
```
## ๐Ÿ”ฎ Evaluation
For LongVideoBench, VideoMME, and MLVU, we use lmms-eval, which we have adapted to the current project by adding files such as `llava_vid_tspo.py` and `qwen_2_5_vl_tspo.py`.
```
# For Qwen2.5-VL+TSPO
bash eval_scripts/TSPO_qwen25_vl.sh
# For LLaVA-Video
bash eval_scripts/TSPO_llava_video.sh
```
You can also evaluate the original models without TSPO:
```
# For Original Qwen2.5-VL
bash eval_scripts/original_qwen25_vl.sh
# For Original LLaVA-Video
bash eval_scripts/original_llava_video.sh
```
For [LVBench](https://github.com/zai-org/LVBench), we follow its own evaluation protocol: combine our demo scripts with LVBench's official GitHub repository to evaluate it.
## Acknowledgements
## Citations
If you find our work helpful for your research, please consider citing our work.
```
```