---
license: apache-2.0
---
# TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding
[[๐Ÿ“– Paper](https://arxiv.org/pdf/2508.04369)] [[๐Ÿค— TSPO-model](https://huggingface.co/hzf666/TSPO-0.4B)] [[๐Ÿค— TSPO-train-data](https://huggingface.co/datasets/hzf666/TSPO-10K)]
## ๐Ÿ‘€ Overview
Inspired by DeepSeek-R1's GRPO algorithm, we propose **Temporal Sampling Policy Optimization (TSPO)**, a reinforcement learning framework that advances long-form video understanding by addressing the challenges of unsupervised, non-differentiable sparse frame sampling.
<div align="center">
<img src="./assets/main_fig.png" width="800" height="400" style="object-fit: contain;">
</div>
## ๐Ÿ† Performance
- Our method achieves **63.9%** accuracy on LongVideoBench and **76.3%** on MLVU, setting a new state of the art among 7B video-MLLMs.
- Our trained temporal agent (TSPO-0.4B) demonstrates strong generalizability. When applied to LLaVA-Video-7B, it achieves an average improvement of **4.3%** across four benchmarks; with Qwen2.5VL-7B, the gain reaches **5.3%**. Transferability to other backbones is further analyzed in Table 2 of our paper.
<div align="center">
<img src="./assets/main_results.png" width="700" height="350" style="object-fit: contain;">
</div>
## ๐Ÿ“ Code
[โœ’๏ธ TSPO](https://github.com/Hui-design/TSPO)
## ๐Ÿงธ Toy example
We present a toy example to show how TSPO works. The intuition is that a Video-MLLM can only give the correct answer if the temporal agent samples the right keyframes. Thanks to our joint modeling of keyframe sampling and language generation, we can use the language accuracy reward $R_A$ derived from multiple-choice training data to supervise the temporal agent.
- As shown in the GIF, through TSPO training, the temporal agent learns to select frames that lead to the correct answer for the question *"What is the scene at the beginning of the video?"*. As a result, $R_A$ increases, the predicted score peaks converge at the video's beginning, and the sampled frames converge to this time segment.
- **To reproduce this example**, first download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2), [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14), and [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), then set ``model_name_or_path`` and ``clip_path`` in ``toy_example.sh``. The script runs on a single GPU with at least 28 GB of memory.
<div align="center">
<img src="./assets/gif.gif" width="800" height="400" style="object-fit: contain;">
</div>
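The reward signal in this example can be sketched in a few lines: each rollout samples a set of frames, the Video-MLLM answers, and the binary accuracy reward $R_A$ is normalized within the group (GRPO-style) to weight the log-probabilities of the sampled frames. This is an illustrative, self-contained sketch under our own naming, not the repository's actual implementation.

```python
import math

def grpo_advantages(rewards):
    # Group-relative advantages (GRPO-style): normalize each rollout's
    # accuracy reward R_A by the group's mean and standard deviation.
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-6) for r in rewards]

def sampling_policy_loss(frame_log_probs, rewards):
    # REINFORCE-style surrogate: raise the log-probability of frame
    # selections that came from rollouts with above-average rewards.
    advs = grpo_advantages(rewards)
    return -sum(lp * a for lp, a in zip(frame_log_probs, advs)) / len(rewards)

# Four rollouts: two answered correctly (R_A = 1), two did not (R_A = 0).
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts that answered correctly receive positive advantage (roughly +1 in this example), so gradient ascent raises the probability of the frames they sampled, while incorrect rollouts are pushed down symmetrically.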
## ๐Ÿ“ Set up
```
conda create -n TSPO python=3.10
conda activate TSPO
pip install -r requirement.txt
pip install flash-attn==2.5.9.post1 --no-build-isolation
cd lmms-eval
pip install -e .
cd ../
```
## ๐ŸŽฅ Demo
- Download [LLaVA-Video-Qwen-7B](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) or [Qwen2.5-VL-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), and our [TSPO-0.4B](https://huggingface.co/hzf666/TSPO-0.4B). Then try ``demo/llava_video_tspo.py`` or ``demo/qwen25vl_tspo.py``.
- We provide example long videos: [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), [7XWqI121-Q4.mp4](https://drive.google.com/file/d/1qh-8I1DsgH5TbqEbr05PPO5hdGtvUK23/view?usp=sharing), [5dJUUQufzw4.mp4](https://drive.google.com/file/d/1lBf6Oo7jkhi7-fSvrc_U7SqvqET3vhrh/view?usp=sharing). Feel free to edit the "video_path" and "question". The model prints its response, and the sampled frames are saved under the demo directory.
```
# using llava_video as backbone
CUDA_VISIBLE_DEVICES=0 python demo/llava_video_tspo.py
# using Qwen2.5vl as backbone
CUDA_VISIBLE_DEVICES=0 python demo/qwen25vl_tspo.py
```
<div align="center">
<img src="./assets/demo2.png" width="700" height="350" style="object-fit: contain;">
</div>
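Under the hood, the temporal agent assigns a relevance score to each uniformly decoded candidate frame and keeps only a sparse subset for the Video-MLLM. A minimal sketch of that scoring-then-selection step (top-k over per-frame scores, returned in temporal order); the function name and signature are illustrative, not the demo's actual API:

```python
def select_keyframes(scores, k=8):
    # scores: one relevance score per candidate frame, in temporal order.
    # Keep the k highest-scoring frames, but return their indices sorted
    # so the Video-MLLM still sees a chronological frame sequence.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Returning indices in temporal rather than score order matters: video-language models assume chronologically ordered inputs, so only the *selection* is score-driven.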
## ๐Ÿ’พ Dataset
We provide the TSPO-10K training dataset, available at [[🤗 TSPO-train-data](https://huggingface.co/datasets/hzf666/TSPO-10K)].
## ๐Ÿš€ Training
First download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) and [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14), then set ``model_name_or_path`` and ``clip_path`` in ``train_deepspeed.sh``:
```
bash train_deepspeed.sh
```
To obtain your trained TSPO-0.4B weights, run ``merge_weights.py``:
```
python scripts/merge_weights.py
```
## ๐Ÿ”ฎ Evaluation
For LongVideoBench, VideoMME, and MLVU, we use lmms-eval, which we have adapted to the current project by adding files such as `llava_vid_tspo.py` and `qwen_2_5_vl_tspo.py`.
```
# For Qwen2.5-VL+TSPO
bash eval_scripts/TSPO_qwen25_vl.sh
# For LLaVA-Video
bash eval_scripts/TSPO_llava_video.sh
```
You can also evaluate the original models without TSPO:
```
# For Original Qwen2.5-VL
bash eval_scripts/original_qwen25_vl.sh
# For Original LLaVA-Video
bash eval_scripts/original_llava_video.sh
```
For [LVBench](https://github.com/zai-org/LVBench), we follow its own evaluation protocol: combine our demo scripts with LVBench's official GitHub repository to evaluate it.
## Acknowledgements
## Citations
If you find our work helpful for your research, please consider citing our work.
```
```