	---
	license: apache-2.0
	language:
	- en
	metrics:
	- accuracy
	base_model:
	- Qwen/Qwen3-8B
	library_name: transformers
	tags:
	- multi-modal
	- large-language-model
	- vision-language-model
	- vision-encoder
	---

	<p align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
	</p>

	<h2 align="center">Penguin-VL</h2>
	<h4 align="center">
	Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
	</h4>

	<h4 align="center">
	<b>Project Page:</b> <a href="https://penguin-vl.github.io">penguin-vl.github.io</a> |
	<b>GitHub:</b> <a href="https://github.com/tencent-ailab/Penguin-VL">tencent-ailab/Penguin-VL</a> |
	<b>arXiv:</b> <a href="https://arxiv.org/abs/2603.06569">2603.06569</a>
	<br><br>
	<a href="https://penguin-vl.github.io"><img src="https://img.shields.io/badge/Project-Page-green?logo=github" alt="Project Page"></a>
	<a href="https://github.com/tencent-ailab/Penguin-VL"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub Badge"></a>
	<a href="https://huggingface.co/spaces/tencent/Penguin-VL"><img src="https://img.shields.io/badge/HuggingFace-Spaces-yellow?logo=huggingface" alt="Hugging Face Spaces"></a>
	<a href="https://arxiv.org/abs/2603.06569"><img src="https://img.shields.io/badge/arXiv-2603.06569-b31b1b.svg?logo=arxiv" alt="arXiv"></a>
	</h4>

	---

	## 📰 News

	* 2026.03 — PenguinVL-Encoder now available for general use.
	* 2026.03 — Released PenguinVL-2B, PenguinVL-8B.

	---

	## 🌟 Model Overview

	PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning.

	Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
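
To make the objective mismatch concrete, the minimal sketch below contrasts the two losses in plain Python. It assumes the standard symmetric InfoNCE loss used by CLIP-style encoders and an ordinary next-token cross-entropy. The function names, toy numbers, and temperature value are illustrative and are not taken from the Penguin-VL codebase.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N image-text similarity matrix.

    sim[i][j] is the similarity of image i and caption j; matching pairs
    sit on the diagonal. Each image contributes one pooled vector only.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    img_to_txt = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    txt_to_img = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)


def next_token_loss(logits_per_step, target_ids):
    """Autoregressive cross-entropy: one vocabulary distribution per token."""
    nll = [-log_softmax(l)[t] for l, t in zip(logits_per_step, target_ids)]
    return sum(nll) / len(nll)


# Toy inputs (made up): two image-caption pairs, and three decoding
# steps over a four-word vocabulary.
sim = [[0.9, 0.1], [0.2, 0.8]]
step_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
targets = [0, 1, 3]

print(f"contrastive loss: {contrastive_loss(sim):.4f}")
print(f"next-token loss:  {next_token_loss(step_logits, targets):.4f}")
```

The contrastive loss scores a whole image against a whole caption, while the language-modeling loss supervises every output token. An encoder trained only with the former is never optimized for the token-level signal the language backbone consumes.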

	### Key Characteristics

	- 🧠 LLM-based Vision Encoder
	  The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
	  This provides strong semantic priors and native compatibility with the downstream LLM.

	- 🎥 Efficient Video Understanding
	  A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.

	- 🏗 Unified Architecture
	  The model consists of:
	  1. LLM-initialized vision encoder
	  2. Lightweight MLP projector
	  3. Qwen3 language backbone

	- 📊 Compact but Strong
	  At 8B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
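
As a rough illustration of the 2D-RoPE idea named in the first item, the sketch below applies rotary embeddings along two axes in plain Python: half of each feature vector is rotated by the patch's row index and the other half by its column index. This axial split, the base of 10000, and the function names are assumptions for illustration; the exact formulation used by Penguin-VL may differ.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Axial 2D rotary embedding: first half encodes `row`, second half `col`."""
    if len(vec) % 4 != 0:
        raise ValueError("vector length must be divisible by 4")
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary query/key features for two image patches.
q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.5, 0.8, 0.6, -0.3, 0.4, 0.7]

# Two placements that share the same offset (2 rows, 3 columns).
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 10, 9), rope_2d(k, 8, 6))
print(abs(score_a - score_b) < 1e-9)
```

The two scores agree up to floating-point error because a rotary attention score depends only on the relative row and column offset between patches, which is what makes the scheme suitable for spatial modeling.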

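The following is a purely hypothetical sketch of the idea behind dynamic token budgets for video, not the TRA algorithm itself (see the paper for that): frames that differ more from their predecessor receive a larger share of a fixed token budget. The change scores, the per-frame floor, and the proportional rule are all assumptions made for illustration.

```python
def allocate_token_budget(change_scores, total_budget, min_tokens=16):
    """Split `total_budget` tokens across frames in proportion to novelty.

    change_scores[i] is a non-negative measure of how much frame i differs
    from frame i - 1 (hypothetical, e.g. a mean absolute pixel difference).
    Every frame keeps at least `min_tokens`.
    """
    n = len(change_scores)
    if n == 0:
        return []
    if total_budget < n * min_tokens:
        raise ValueError("budget too small for the per-frame minimum")

    spare = total_budget - n * min_tokens
    total_change = sum(change_scores)
    if total_change == 0:
        shares = [spare / n] * n  # fully static clip: split evenly
    else:
        shares = [spare * s / total_change for s in change_scores]

    budgets = [min_tokens + int(share) for share in shares]
    # Hand out tokens lost to rounding, largest fractional part first.
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


# A mostly static clip with a scene change at frame 3 (scores are made up).
scores = [1.0, 0.05, 0.05, 0.9, 0.1, 0.05]
budgets = allocate_token_budget(scores, total_budget=1024)
print(budgets, sum(budgets))
```

Near-duplicate frames fall back to the floor, so a long clip with little motion consumes far fewer tokens than uniform allocation would spend on it.
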
	---

	## 🧪 Quick Start — Transformers Inference

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoProcessor

	model_name = "tencent/Penguin-VL-8B"

	# The checkpoint ships custom modeling code, hence trust_remote_code=True.
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    trust_remote_code=True,
	    device_map="auto",
	    torch_dtype=torch.bfloat16,
	)

	processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

	# Example: Image + Text
	inputs = processor(
	    conversation=[
	        {"role": "system", "content": "You are a helpful assistant."},
	        {
	            "role": "user",
	            "content": [
	                {"type": "image", "image": {"image_path": "assets/example.jpg"}},
	                {"type": "text", "text": "Describe this image."},
	            ],
	        },
	    ],
	    return_tensors="pt",
	)

	# Move tensors to the GPU and cast image inputs to the model's bfloat16 dtype.
	inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
	if "pixel_values" in inputs:
	    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

	# Generate up to 128 new tokens and decode the first sequence.
	output_ids = model.generate(**inputs, max_new_tokens=128)
	response = processor.decode(output_ids[0], skip_special_tokens=True)

	print(response)
	```
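
When running the example over many images, it can help to build the `conversation` argument programmatically. The helper below is a small sketch that only reproduces the message layout shown in the snippet above; `build_image_conversation` is a hypothetical name, not part of the Penguin-VL processor API.

```python
def build_image_conversation(image_path, prompt, system_prompt="You are a helpful assistant."):
    """Return a conversation list in the layout used by the quick-start example."""
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": image_path}},
                {"type": "text", "text": prompt},
            ],
        },
    ]


conversation = build_image_conversation("assets/example.jpg", "Describe this image.")
print(conversation[1]["content"][0]["image"]["image_path"])
```

The result can then be passed as `processor(conversation=conversation, return_tensors="pt")`, exactly as in the quick-start code.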

	## 🌎 Model Zoo
	| Model | Base Model | HF Link |
	| -------------------- | ------------ | ------------------------------------------------------------ |
	| PenguinVL-8B | Qwen3-8B | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
	| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
	| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |

	## 🚀 Main Results
	### Chart / OCR / Document Understanding

	| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
	|---|---:|---:|---:|---:|
	| InfoVQA | **86.8** | 83.1 | 79.1 | 49.2 |
	| ChartQA | **90.5** | 89.6 | 86.7 | 48.6 |
	| DocVQA | **96.2** | 96.1 | 92.3 | 78.3 |
	| CharXiv (DQ / RQ) | 75.7 / 40.0 | **83.0 / 46.4** | 72.2 / 44.4 | 64.4 / 31.7 |
	| OCRBench | 852 | **896** | 840 | 701 |

	### General Knowledge / Multi-Image / Math Reasoning

	| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
	|---|---:|---:|---:|---:|
	| AI2D | **86.1** | 85.7 | 84.0 | 65.7 |
	| RealWorldQA | **75.8** | 71.5 | 67.5 | 60.7 |
	| V-star | **90.2** | 90.1 | 70.7 | 63.4 |
	| MMMU-Pro | 40.2 | **55.9** | 39.7 | 36.5 |
	| BLINK | 58.2 | **69.1** | 59.5 | 42.2 |
	| MathVista | **77.4** | 77.2 | 74.2 | 40.9 |
	| MathVerse | 50.8 | **62.1** | 55.8 | 27.0 |
	| LogicVista | 53.8 | 55.3 | **57.3** | 40.5 |

	### Video Understanding

	| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
	|---|---:|---:|---:|---:|
	| MVBench | 71.7 | 68.7 | **72.1** | 52.9 |
	| LongVideoBench | **67.0** | 62.6 | 62.1 | 38.1 |
	| VideoMME | 66.2 | **71.4** | 66.0 | 49.4 |
	| EgoSchema | 67.0 | **70.2** | 61.0 | 34.8 |
	| MMVU | 53.9 | **58.7** | 51.5 | 51.0 |
	| CharadesSTA | **61.4** | 56.0 | 32.8 | 5.0 |
	| NextQA | **85.4** | 82.3 | 81.3 | 59.3 |
	| ActivityNetQA | **65.2** | 63.7 | 60.1 | – |
	| Perception Test | **78.0** | 72.7 | 72.7 | – |

	> Bold indicates the best result among compared models.
	> See our paper for more details.


	## Citation

	If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:
	```bibtex
	@article{Penguin-VL,
	title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
	author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
	journal={arXiv preprint arXiv:2603.06569},
	year={2026}
	}
	```