---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-8B
library_name: transformers
tags:
- multi-modal
- large-language-model
- vision-language-model
- vision-encoder
---

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
</p>

<h2 align="center">Penguin-VL</h2>
<h4 align="center">
Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
</h4>

<h4 align="center">
<b>Project Page:</b> <a href="https://penguin-vl.github.io">penguin-vl.github.io</a> |
<b>GitHub:</b> <a href="https://github.com/tencent-ailab/Penguin-VL">tencent-ailab/Penguin-VL</a> |
<b>arXiv:</b> <a href="https://arxiv.org/abs/2603.06569">2603.06569</a>
<br><br>
<a href="https://penguin-vl.github.io"><img src="https://img.shields.io/badge/Project-Page-green?logo=github" alt="Project Page"></a>
<a href="https://github.com/tencent-ailab/Penguin-VL"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub Badge"></a>
<a href="https://huggingface.co/spaces/tencent/Penguin-VL"><img src="https://img.shields.io/badge/HuggingFace-Spaces-yellow?logo=huggingface" alt="Hugging Face Spaces"></a>
<a href="https://arxiv.org/abs/2603.06569"><img src="https://img.shields.io/badge/arXiv-2603.06569-b31b1b.svg?logo=arxiv" alt="arXiv"></a>
</h4>

---

## News

* **2026.03**: PenguinVL-Encoder is now available for general use.
* **2026.03**: Released PenguinVL-2B and PenguinVL-8B.

---

## Model Overview

Penguin-VL is a compact Vision-Language Model (VLM) designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, Penguin-VL is built from the ground up through **LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning**.

Unlike most existing VLMs, which rely on contrastively pretrained vision encoders (e.g., CLIP or SigLIP), Penguin-VL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

### Key Characteristics

- **LLM-based Vision Encoder**
  The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
  This provides strong semantic priors and native compatibility with the downstream LLM.

- **Efficient Video Understanding**
  A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.

- **Unified Architecture**
  The model consists of:
  1. LLM-initialized vision encoder
  2. Lightweight MLP projector
  3. Qwen3 language backbone

- **Compact but Strong**
  At 8B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
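
As a rough illustration of the 2D-RoPE idea named in the first item above, the sketch below rotates one half of a feature vector by a patch's row index and the other half by its column index, so that the dot product between two rotated vectors depends only on their relative 2D offset. The helper names (`rope_1d`, `rope_2d`), the half/half channel split, and the base of 10000 are assumptions made for illustration; this card does not specify how Penguin-VL lays out its rotary dimensions.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    dim = len(vec)
    assert dim % 2 == 0, "rotary embedding needs an even number of channels"
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Assumed layout: first half of the channels encodes the row, second half the column."""
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors with 8 channels (4 for rows, 4 for columns).
q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 0.75, 1.0]
k = [1.0, 0.5, -0.5, 2.0, 0.25, 1.0, -1.5, 0.5]

# Both pairs of patches are separated by the same offset (2 rows, 3 columns),
# so the rotated dot products agree up to floating-point error.
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 10, 7), rope_2d(k, 8, 4))
assert math.isclose(score_a, score_b, abs_tol=1e-9)
```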

---

## Quick Start: Transformers Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "tencent/Penguin-VL-8B"

# Load the model with its custom code, in bfloat16, placed automatically on available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example: Image + Text
inputs = processor(
    conversation=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": "assets/example.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ],
    return_tensors="pt",
)

# Move tensors to the GPU (requires CUDA) and match the model's bfloat16 precision.
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Generate up to 128 new tokens and decode the first sequence.
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.decode(output_ids[0], skip_special_tokens=True)

print(response)
```
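
The `conversation` structure used above can also be assembled programmatically. The sketch below is a hypothetical helper (`build_conversation` is not part of the Penguin-VL code base) that reproduces the message layout from the Quick Start. Passing more than one image path assumes the processor accepts several image entries in a single user turn, which this card does not confirm.

```python
def build_conversation(image_paths, prompt, system_prompt="You are a helpful assistant."):
    """Build a conversation list in the message layout shown in the Quick Start.

    Hypothetical helper for illustration; multi-image input is an assumption.
    """
    # One image entry per path, followed by a single text entry.
    content = [{"type": "image", "image": {"image_path": path}} for path in image_paths]
    content.append({"type": "text", "text": prompt})
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": content},
    ]


# Rebuilds exactly the single-image conversation from the Quick Start.
conversation = build_conversation(["assets/example.jpg"], "Describe this image.")
```

The result can then be passed as `processor(conversation=conversation, return_tensors="pt")`, as in the Quick Start.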

## Model Zoo

| Model | Base Model | HF Link |
| --- | --- | --- |
| PenguinVL-8B | Qwen3-8B | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |

## Main Results
### Chart / OCR / Document Understanding

| Benchmark | **Penguin-VL 8B** | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
|---|---:|---:|---:|---:|
| InfoVQA | **86.8** | 83.1 | 79.1 | 49.2 |
| ChartQA | **90.5** | 89.6 | 86.7 | 48.6 |
| DocVQA | **96.2** | 96.1 | 92.3 | 78.3 |
| CharXiv (DQ / RQ) | 75.7 / 40.0 | **83.0 / 46.4** | 72.2 / 44.4 | 64.4 / 31.7 |
| OCRBench | 852 | **896** | 840 | 701 |

### General Knowledge / Multi-Image / Math Reasoning

| Benchmark | **Penguin-VL 8B** | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
|---|---:|---:|---:|---:|
| AI2D | **86.1** | 85.7 | 84.0 | 65.7 |
| RealWorldQA | **75.8** | 71.5 | 67.5 | 60.7 |
| V-star | **90.2** | 90.1 | 70.7 | 63.4 |
| MMMU-Pro | 40.2 | **55.9** | 39.7 | 36.5 |
| BLINK | 58.2 | **69.1** | 59.5 | 42.2 |
| MathVista | **77.4** | 77.2 | 74.2 | 40.9 |
| MathVerse | 50.8 | **62.1** | 55.8 | 27.0 |
| LogicVista | 53.8 | 55.3 | **57.3** | 40.5 |

### Video Understanding

| Benchmark | **Penguin-VL 8B** | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
|---|---:|---:|---:|---:|
| MVBench | 71.7 | 68.7 | **72.1** | 52.9 |
| LongVideoBench | **67.0** | 62.6 | 62.1 | 38.1 |
| VideoMME | 66.2 | **71.4** | 66.0 | 49.4 |
| EgoSchema | 67.0 | **70.2** | 61.0 | 34.8 |
| MMVU | 53.9 | **58.7** | 51.5 | 51.0 |
| CharadesSTA | **61.4** | 56.0 | 32.8 | 5.0 |
| NextQA | **85.4** | 82.3 | 81.3 | 59.3 |
| ActivityNetQA | **65.2** | 63.7 | 60.1 | - |
| Perception Test | **78.0** | 72.7 | 72.7 | - |

> **Bold** indicates the best result among the compared models.
> See our paper for more details.

## Citation

If you find Penguin-VL useful for your research and applications, please cite it using this BibTeX entry:

```bibtex
@article{Penguin-VL,
  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
  journal={arXiv preprint arXiv:2603.06569},
  year={2026}
}
```