---
license: apache-2.0
---
# TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding
[[Paper](https://arxiv.org/pdf/2508.04369)] [[TSPO-model](https://huggingface.co/hzf666/TSPO-0.4B)] [[TSPO-train-data](https://huggingface.co/datasets/hzf666/TSPO-10K)]
## Overview
Inspired by DeepSeek-R1's GRPO algorithm, we propose **Temporal Sampling Policy Optimization (TSPO)**, a reinforcement learning framework that advances long-form video understanding by addressing the challenges of unsupervised and non-differentiable sparse frame sampling.
<div align="center">
<img src="./assets/main_fig.png" width="800" height="400" style="object-fit: contain;">
</div>
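As a rough illustration of the core idea (not the authors' implementation), a temporal sampling policy can be viewed as scoring candidate frames by query relevance and sampling a sparse subset from the resulting distribution; all names and numbers below are hypothetical:

```python
import numpy as np

def sample_frames(relevance_scores, k=4, temperature=0.5, rng=None):
    """Sample k frame indices from a softmax policy over per-frame
    relevance scores (hypothetical sketch of a temporal sampling policy)."""
    rng = rng or np.random.default_rng(0)
    scores = np.asarray(relevance_scores, dtype=np.float64)
    # Softmax with temperature: a lower temperature sharpens the policy.
    logits = scores / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sample distinct frames, then sort to preserve temporal order.
    idx = rng.choice(len(scores), size=k, replace=False, p=probs)
    return np.sort(idx), probs

if __name__ == "__main__":
    # Toy scores: the first frames are most relevant to the query.
    scores = [3.0, 2.8, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1]
    idx, probs = sample_frames(scores, k=4)
    print(idx)
```

Because sampling a discrete frame subset is non-differentiable, TSPO supervises such a policy with a reinforcement-learning reward rather than backpropagation through the sampling step.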
## Performance
- Our method achieves **63.9%** accuracy on LongVideoBench and **76.3%** on MLVU, setting a new state of the art among 7B video-MLLMs.
- Our trained temporal agent (TSPO-0.4B) demonstrates strong generalizability. When applied to LLaVA-Video-7B, it achieves an average improvement of **4.3%** across four benchmarks; with Qwen2.5VL-7B, the gain reaches **5.3%**. Transferability to other backbones is further analyzed in Table 2 of our paper.
<div align="center">
<img src="./assets/main_results.png" width="700" height="350" style="object-fit: contain;">
</div>
## Code
[TSPO](https://github.com/Hui-design/TSPO)
## Toy example
We present a toy example to show how TSPO works. The intuition is that Video-MLLMs can only give correct answers if the temporal agent samples the correct keyframes. Thanks to our joint modeling of keyframe sampling and language generation, we can use the language accuracy reward $R_A$ derived from multiple-choice training data to supervise the temporal agent.
- As shown in the GIF, through TSPO training, the temporal agent learns to select frames that lead to the correct answer for the question *"What is the scene at the beginning of the video?"*. As a result, $R_A$ increases, the predicted score peaks converge at the video's beginning, and the sampled frames converge to this time segment.
- **To reproduce this example**, first download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2), [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14), and [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), then modify ``model_name_or_path`` and ``clip_path`` in ``toy_example.sh``. The script can be run on a single GPU with at least 28 GB of memory.
<div align="center">
<img src="./assets/gif.gif" width="800" height="400" style="object-fit: contain;">
</div>
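To make the reward mechanics concrete, here is a minimal sketch of how a GRPO-style update could turn the binary accuracy reward $R_A$ into per-rollout advantages. This is an illustration under the assumption of group-normalized advantages as in GRPO, not the repository's actual training code:

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the GRPO style: each rollout's
    reward is standardized against the group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # Four sampled frame subsets for one question; only the first two
    # led the MLLM to the correct choice, so they get R_A = 1, the rest 0.
    rewards = [1.0, 1.0, 0.0, 0.0]
    print(grpo_advantages(rewards))
```

Frame subsets whose answers were correct receive positive advantages, pushing the policy toward the time segments that produced them.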
## Set up
```
conda create -n TSPO python=3.10
conda activate TSPO
pip install -r requirement.txt
pip install flash-attn==2.5.9.post1 --no-build-isolation
cd lmms-eval
pip install -e .
cd ../
```
## Demo
- Download [LLaVA-Video-Qwen-7B](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) or [Qwen2.5vl-7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), and our [TSPO-0.4B](https://huggingface.co/hzf666/TSPO-0.4B). Then you can try ``demo/llava_video_tspo.py`` or ``demo/qwen25vl_tspo.py``.
- We provide example long videos: [208.mp4](https://drive.google.com/file/d/1FDIxxIcyjL0v2O6sGc0KljXi8MRLDFJk/view?usp=sharing), [7XWqI121-Q4.mp4](https://drive.google.com/file/d/1qh-8I1DsgH5TbqEbr05PPO5hdGtvUK23/view?usp=sharing), [5dJUUQufzw4.mp4](https://drive.google.com/file/d/1lBf6Oo7jkhi7-fSvrc_U7SqvqET3vhrh/view?usp=sharing). Feel free to edit the "video_path" and "question". Our model will output the responses, and the sampled frames will be saved under the demo directory.
```
# using llava_video as backbone
CUDA_VISIBLE_DEVICES=0 python demo/llava_video_tspo.py
# using Qwen2.5vl as backbone
CUDA_VISIBLE_DEVICES=0 python demo/qwen25vl_tspo.py
```
<div align="center">
<img src="./assets/demo2.png" width="700" height="350" style="object-fit: contain;">
</div>
## Dataset
We provide the TSPO-10K training dataset, which is available at [[TSPO-train-data](https://huggingface.co/datasets/hzf666/TSPO-10K)].
## Training
First download [LLaVA-Video-Qwen](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) and [CLIP-Large](https://huggingface.co/openai/clip-vit-large-patch14), then modify ``model_name_or_path`` and ``clip_path`` in ``train_deepspeed.sh``.
```
bash train_deepspeed.sh
```
To get your trained TSPO-0.4B weights, run ``scripts/merge_weights.py``:
```
python scripts/merge_weights.py
```
## Evaluation
For LongVideoBench, VideoMME, and MLVU, we use lmms-eval, which we have adapted to the current project by adding files such as `llava_vid_tspo.py` and `qwen_2_5_vl_tspo.py`.
```
# For Qwen2.5-VL+TSPO
bash eval_scripts/TSPO_qwen25_vl.sh
# For LLaVA-Video
bash eval_scripts/TSPO_llava_video.sh
```
You can evaluate the original models without TSPO by:
```
# For Original Qwen2.5-VL
bash eval_scripts/original_qwen25_vl.sh
# For Original LLaVA-Video
bash eval_scripts/original_llava_video.sh
```
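All of the benchmarks above score multiple-choice accuracy. As a minimal illustration of the metric (the evaluation harnesses compute this internally; the function name here is hypothetical):

```python
def mc_accuracy(preds, golds):
    """Fraction of multiple-choice predictions matching the gold option."""
    assert len(preds) == len(golds) and golds
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

if __name__ == "__main__":
    # Three of four predicted options match the gold answers.
    print(mc_accuracy(["A", "C", "B", "D"], ["A", "C", "C", "D"]))  # 0.75
```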
For [LVBench](https://github.com/zai-org/LVBench), we use its own evaluation protocol. You can combine our demo scripts with LVBench's official GitHub repository to evaluate it.
## Acknowledgements
## Citations
If you find our work helpful for your research, please consider citing our work.
```
```