# Switch-KD-Qwen2.5-CLIP-1.8B
Switch-KD-Qwen2.5-CLIP-1.8B is a compact vision-language model (VLM) trained using the Switch-KD (Visual-Switch Knowledge Distillation) framework from Li Auto's MindKD technology. This model achieves competitive performance on multimodal benchmarks while being efficient for deployment.
## Model Details
- Base Model: Qwen2.5-1.5B-Instruct
- Visual Encoder: CLIP-ViT-L/14-336
- Projector: LDPNetV2 (FeatureIRLayer + TokenDownLayer + PosInjectLayer)
- Total Parameters: ~1.8B
- Image Resolution: 336×336
- Context Length: 32,768 tokens
- Training Method: Switch-KD distillation with DBiLD Loss
- License: Apache 2.0
## Architecture
```
Image (336×336) → CLIP ViT-L/14 (576 tokens) → LDPNetV2 Projector (144 tokens) → Qwen2.5-1.5B
```
The model uses a custom architecture where:
- CLIP Vision Encoder extracts visual features at 336×336 resolution
- LDPNetV2 Projector reduces visual tokens from 576 to 144 while preserving information
- Qwen2.5-1.5B processes both visual and textual inputs for generation
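The end-to-end forward pass can be summarized in a few lines of PyTorch. This is a minimal sketch, not the repository's actual module code: the class and attribute names here are illustrative assumptions.

```python
import torch
from torch import nn

# Minimal sketch of the three-stage forward pass described above.
# Class and attribute names are illustrative, not the repo's actual code.
class SwitchKDVLMSketch(nn.Module):
    def __init__(self, vision_encoder, projector, language_model):
        super().__init__()
        self.vision_encoder = vision_encoder  # CLIP ViT-L/14-336
        self.projector = projector            # LDPNetV2: 576 -> 144 tokens
        self.language_model = language_model  # Qwen2.5-1.5B-Instruct

    def forward(self, pixel_values, text_embeds):
        # (336 / 14)^2 = 24 x 24 = 576 patch tokens, each 1024-dim
        visual_feats = self.vision_encoder(pixel_values)   # (B, 576, 1024)
        # Downsample to 144 tokens and project into the 1536-dim LLM space
        visual_tokens = self.projector(visual_feats)       # (B, 144, 1536)
        # Prepend visual tokens to the text embeddings and decode
        embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=embeds)
```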
## Key Results
Switch-KD demonstrates significant improvements over baseline VLM distillation methods:
### vs. Align-KD (1.5B Models)
- +4.4% average improvement across 6 benchmarks
- Uses only 1/3 the training data (1.2M vs 3.6M samples)
### Benchmark Performance (Selected)
| Benchmark | Score |
|---|---|
| MME (Perception) | 1411.5 |
| MMBench | 68.4 |
| GQA | 61.9 |
| ScienceQA | 71.6 |
| TextVQA | 57.0 |
| POPE | 87.5 |
For detailed results, see the Switch-KD paper.
## Installation

```bash
pip install transformers accelerate torch
```
## Quickstart

### Command Line Interface

```bash
# Single-round inference
python chat.py --model HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B \
    --image path/to/image.jpg \
    --question "Please describe this picture."

# Interactive multi-round chat
python chat.py --model HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B \
    --image path/to/image.jpg \
    --interactive

# With custom settings
python chat.py --model HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B \
    --image path/to/image.jpg \
    --interactive \
    --max-new-tokens 1024 \
    --torch-dtype fp16
```
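For programmatic use, a transformers-style call might look like the sketch below. This is hypothetical: the exact processor and model classes the repository exposes are not documented on this card and may differ.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical usage, assuming the repo ships transformers-compatible
# custom code loadable via trust_remote_code; class names may differ.
model_id = "HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).eval()

image = Image.open("path/to/image.jpg")
inputs = processor(text="Please describe this picture.", images=image,
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```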
## Model Architecture
### Visual Encoder (CLIP-ViT-L/14-336)
- Hidden size: 1024
- Layers: 24
- Attention heads: 16
- Image size: 336×336
- Patch size: 14×14
### Projector (LDPNetV2)
- Projects 1024-dim visual features to 1536-dim LLM space
- Reduces spatial tokens from 576 to 144
- Components: FeatureIRLayer → TokenDownLayer → PosInjectLayer
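Below is a minimal sketch of an LDPNetV2-style projector, assuming the MobileVLM-v2-style layer design (an MLP feature transform, 2×2 pooled token downsampling, and depthwise-convolution positional injection); the checkpoint's internals may differ in detail.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Sketch of an LDPNetV2-style projector: FeatureIRLayer -> TokenDownLayer
# -> PosInjectLayer. Internals assume the MobileVLM-v2-style design and
# may differ from this checkpoint.
class LDPNetV2Sketch(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=1536):
        super().__init__()
        # FeatureIRLayer: MLP mapping 1024-dim CLIP features to 1536-dim
        self.feature_ir = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # PosInjectLayer: depthwise conv re-injecting 2-D position info
        self.pos_inject = nn.Conv2d(llm_dim, llm_dim, 3, 1, 1, groups=llm_dim)

    def forward(self, x):                        # x: (B, 576, 1024)
        b, n, _ = x.shape
        h = w = int(n ** 0.5)                    # 24 x 24 patch grid
        x = self.feature_ir(x)                   # (B, 576, 1536)
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        # TokenDownLayer: 2x2 average pooling, 24x24 -> 12x12 = 144 tokens
        x = F.avg_pool2d(x, kernel_size=2)
        x = x + self.pos_inject(x)               # residual position injection
        return x.flatten(2).transpose(1, 2)      # (B, 144, 1536)
```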
### Language Model (Qwen2.5-1.5B)
- Hidden size: 1536
- Layers: 28
- Attention heads: 12 (2 key-value heads for GQA)
- Context length: 32,768 tokens
- Vocabulary: 151,936 tokens
## Training
Switch-KD trains the student with two key innovations:

- **Visual-Switch Distillation**: switches the student's visual outputs into the teacher's language pathway, enabling cross-modal knowledge transfer.
- **DBiLD Loss**: a Dynamic Bi-directional Logits Difference loss with adaptive top-K selection via the Kneedle algorithm (see the sketch below).
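The card does not spell out the loss formula. As a stand-in, the sketch below implements a bidirectional top-K KL divergence where K is picked per position by a Kneedle-style knee detector on the sorted teacher distribution; the paper's exact DBiLD formulation (its logits-difference terms, temperature, and knee criterion) may well differ.

```python
import torch
import torch.nn.functional as F

# Stand-in sketch: bidirectional top-K KL with a Kneedle-style knee pick.
# The paper's exact DBiLD terms may differ; treat this as illustrative.
def kneedle_k(sorted_probs, k_min=4, k_max=64):
    """Pick K at the 'knee' of the descending probability curve: the point
    farthest below the straight line joining the curve's endpoints."""
    n = sorted_probs.numel()
    x = torch.linspace(0, 1, n, device=sorted_probs.device)
    y = (sorted_probs - sorted_probs[-1]) / (sorted_probs[0] - sorted_probs[-1] + 1e-8)
    knee = int(torch.argmax((1 - x) - y)) + 1   # max gap below the diagonal
    return max(k_min, min(knee, k_max))

def dbild_loss(student_logits, teacher_logits, tau=2.0):
    """Logits are 1-D (vocab,) tensors for a single token position."""
    t_probs, t_idx = teacher_logits.softmax(-1).sort(descending=True)
    k = kneedle_k(t_probs)                      # adaptive K per position
    idx = t_idx[:k]
    s = (student_logits[idx] / tau).log_softmax(-1)
    t = (teacher_logits[idx] / tau).log_softmax(-1)
    fwd = F.kl_div(s, t, log_target=True, reduction="sum")  # KL(t || s)
    bwd = F.kl_div(t, s, log_target=True, reduction="sum")  # KL(s || t)
    return (fwd + bwd) * tau ** 2 / 2
```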
### Training Configuration
- Training data: 1.2M image-text pairs
- Optimizer: AdamW with cosine learning rate schedule
- Batch size: 64 per GPU
- Training epochs: Varies by configuration
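A minimal optimizer setup matching the configuration above might look like the following; the learning rate, warmup, and total-step values are placeholders, since the card does not specify them.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # placeholder standing in for the VLM

# AdamW with a cosine learning-rate schedule, as stated above.
# lr, warmup, and total steps are illustrative placeholders.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=20_000
)
```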
## Limitations
- The model is primarily trained on English datasets and may have reduced performance on other languages
- Best performance on images similar to training distribution (natural images, documents, charts)
- May struggle with very low-resolution or extremely high-resolution images
- Designed for single-image understanding (not optimized for video)
## Citation
If you use this model, please cite:
```bibtex
@article{sun2026switchkd,
  title={Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models},
  author={Sun, Haoyi and Wang, Xiaoxiao and Mao, Ning and Wang, Qian and Mu, Lifu and Zheng, Wen and Wei, Tao and Chen, Wei},
  journal={arXiv preprint arXiv:2604.14629},
  year={2026}
}
```
## Acknowledgments

This model is built upon excellent open-source work, including:

- Qwen2.5-1.5B-Instruct (language model backbone)
- CLIP-ViT-L/14-336 (visual encoder)
- The LDPNetV2 projector design from MobileVLM
## Contact

For questions about this model or the Switch-KD framework:
- Work: sunhaoyi@lixiang.com
- Personal: haoyi199815@126.com
Model Card Authors: Haoyi Sun et al.
Model Card Contact: sunhaoyi@lixiang.com