Switch-KD-Qwen2.5-CLIP-1.8B

Switch-KD-Qwen2.5-CLIP-1.8B is a compact vision-language model (VLM) trained with the Switch-KD (Visual-Switch Knowledge Distillation) framework from Li Auto's MindKD technology. The model achieves competitive performance on multimodal benchmarks while remaining efficient to deploy.

Model Details

  • Base Model: Qwen2.5-1.5B-Instruct
  • Visual Encoder: CLIP-ViT-L/14-336
  • Projector: LDPNetV2 (FeatureIRLayer + TokenDownLayer + PosInjectLayer)
  • Total Parameters: ~1.8B
  • Image Resolution: 336×336
  • Context Length: 32,768 tokens
  • Training Method: Switch-KD distillation with DBiLD Loss
  • License: Apache 2.0

Architecture

Image (336×336) → CLIP ViT-L/14 (576 tokens) → LDPNetV2 Projector (144 tokens) → Qwen2.5-1.5B

The model uses a custom architecture where:

  1. CLIP Vision Encoder extracts visual features at 336×336 resolution
  2. LDPNetV2 Projector reduces visual tokens from 576 to 144 while preserving information
  3. Qwen2.5-1.5B processes both visual and textual inputs for generation
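The token counts in the pipeline above follow directly from the patch geometry. A quick arithmetic check (the 2×2 pooling factor is an inference from the 576→144 reduction, not stated explicitly in this card):

```python
# Token-count arithmetic for the Switch-KD pipeline described above.
image_size = 336
patch_size = 14

# CLIP ViT-L/14 splits the image into a grid of patches, one token each.
grid = image_size // patch_size          # 24 patches per side
vision_tokens = grid * grid              # 576 visual tokens

# LDPNetV2 reduces 576 -> 144 tokens, consistent with 2x2 spatial
# pooling (an assumption; the exact mechanism is not spelled out here).
pooled_grid = grid // 2                  # 12 per side
llm_visual_tokens = pooled_grid ** 2     # 144 tokens fed to Qwen2.5-1.5B

print(vision_tokens, llm_visual_tokens)  # 576 144
```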

Key Results

Switch-KD demonstrates significant improvements over baseline VLM distillation methods:

vs. Align-KD (1.5B Models)

  • +4.4% average improvement across 6 benchmarks
  • Uses only 1/3 the training data (1.2M vs 3.6M samples)

Benchmark Performance (Selected)

Benchmark          Score
MME (Perception)   1411.5
MMBench            68.4
GQA                61.9
ScienceQA          71.6
TextVQA            57.0
POPE               87.5

For detailed results, see the Switch-KD paper.

Installation

pip install transformers accelerate torch

Quickstart

Command Line Interface

# Single-round inference
python chat.py --model HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B \
    --image path/to/image.jpg \
    --question "Please describe this picture."

# Interactive multi-round chat
python chat.py --model HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B \
    --image path/to/image.jpg \
    --interactive

# With custom settings
python chat.py --model HaoyiSun/Switch-KD-Qwen2.5-CLIP-1.8B \
    --image path/to/image.jpg \
    --interactive \
    --max-new-tokens 1024 \
    --torch-dtype fp16

Model Architecture

Visual Encoder (CLIP-ViT-L/14-336)

  • Hidden size: 1024
  • Layers: 24
  • Attention heads: 16
  • Image size: 336×336
  • Patch size: 14×14

Projector (LDPNetV2)

  • Projects 1024-dim visual features to 1536-dim LLM space
  • Reduces spatial tokens from 576 to 144
  • Components: FeatureIRLayer → TokenDownLayer → PosInjectLayer
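A shape-level sketch of the projector, under stated assumptions: FeatureIRLayer is modeled as a small MLP mapping 1024-dim CLIP features into the 1536-dim LLM space, TokenDownLayer as 2×2 average pooling over the 24×24 token grid, and PosInjectLayer as adding a position-dependent term (in LDPNetV2 this is a depthwise convolution; the stand-in here is simplified). All weights below are random placeholders, not the released checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

D_VIT, D_LLM = 1024, 1536   # CLIP hidden size -> Qwen2.5 hidden size
GRID = 24                   # 336 / 14 patches per side

def feature_ir_layer(x, w1, w2):
    """MLP stand-in: map 1024-dim visual features to the 1536-dim LLM space."""
    return np.maximum(x @ w1, 0.0) @ w2

def token_down_layer(x):
    """2x2 average pooling over the 24x24 token grid: 576 -> 144 tokens."""
    g = x.reshape(GRID // 2, 2, GRID // 2, 2, -1)
    return g.mean(axis=(1, 3)).reshape((GRID // 2) ** 2, -1)

def pos_inject_layer(x, pos):
    """Simplified positional injection: add a position-dependent term
    to each token (a depthwise conv PEG in the real LDPNetV2)."""
    return x + pos

# Fake CLIP output: 576 tokens of dim 1024.
vit_tokens = rng.standard_normal((GRID * GRID, D_VIT))
w1 = rng.standard_normal((D_VIT, D_LLM)) * 0.02
w2 = rng.standard_normal((D_LLM, D_LLM)) * 0.02
pos = rng.standard_normal(((GRID // 2) ** 2, D_LLM)) * 0.02

h = feature_ir_layer(vit_tokens, w1, w2)   # (576, 1536)
h = token_down_layer(h)                    # (144, 1536)
h = pos_inject_layer(h, pos)               # (144, 1536)
print(h.shape)
```

The pooling step relies on the 576 tokens being stored row-major over the 24×24 grid, so reshaping to (12, 2, 12, 2, C) and averaging over the two size-2 axes is exactly 2×2 spatial pooling.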

Language Model (Qwen2.5-1.5B)

  • Hidden size: 1536
  • Layers: 28
  • Attention heads: 12 (2 key-value heads for GQA)
  • Context length: 32,768 tokens
  • Vocabulary: 151,936 tokens
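The ~1.8B total parameter count is consistent with approximate public component sizes (these figures are not from this card): Qwen2.5-1.5B has roughly 1.54B parameters and the CLIP ViT-L/14 vision tower roughly 0.30B, plus a small projector.

```python
# Rough parameter budget (approximate public figures; the projector
# size is an assumption, as this card does not state it).
qwen_params = 1.54e9       # Qwen2.5-1.5B, including embeddings
clip_vit_params = 0.30e9   # CLIP ViT-L/14 vision tower
projector_params = 0.02e9  # LDPNetV2, small relative to the rest (assumed)

total = qwen_params + clip_vit_params + projector_params
print(f"~{total / 1e9:.2f}B")  # reported as ~1.8B in the model details
```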

Training

Switch-KD is trained using two key innovations:

  1. Visual-Switch Distillation: routes the student's visual outputs into the teacher's language pathway to enable cross-modal knowledge transfer

  2. DBiLD Loss: Dynamic Bi-directional Logits Difference loss with adaptive top-K selection via Kneedle algorithm
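This card does not give the exact DBiLD formulation; the sketch below only illustrates the general shape under stated assumptions: select an adaptive top-K of teacher tokens with a simplified Kneedle-style knee detector on the sorted teacher probabilities, then apply KL divergence in both directions over the renormalized top-K logits. Function names and the averaging scheme are illustrative, not the paper's.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def knee_top_k(probs):
    """Simplified Kneedle-style knee on descending sorted probabilities:
    the index farthest below the chord from the first to the last point."""
    p = np.sort(probs)[::-1]
    x = np.linspace(0.0, 1.0, p.size)
    chord = p[0] + (p[-1] - p[0]) * x       # straight line first -> last
    return max(int(np.argmax(chord - p)) + 1, 1)

def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def dbild_sketch(student_logits, teacher_logits):
    """Bidirectional KL over an adaptive top-K of teacher tokens
    (illustrative stand-in for DBiLD, not the paper's formula)."""
    t = softmax(teacher_logits)
    k = knee_top_k(t)
    idx = np.argsort(t)[::-1][:k]           # teacher's top-K token ids
    ts = softmax(teacher_logits[idx])       # renormalize over top-K
    ss = softmax(student_logits[idx])
    return 0.5 * (kl(ts, ss) + kl(ss, ts))

teacher = np.array([5.0, 3.0, 2.5, -1.0, -2.0, -2.5])
student = np.array([4.0, 3.5, 1.0, -0.5, -1.5, -3.0])
print(dbild_sketch(teacher, teacher))  # 0.0 when distributions match
print(dbild_sketch(student, teacher))  # positive for a mismatched student
```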

Training Configuration

  • Training data: 1.2M image-text pairs
  • Optimizer: AdamW with cosine learning rate schedule
  • Batch size: 64 per GPU
  • Training epochs: Varies by configuration

Limitations

  • The model is primarily trained on English datasets and may have reduced performance on other languages
  • Best performance on images similar to training distribution (natural images, documents, charts)
  • May struggle with very low-resolution or extremely high-resolution images
  • Designed for single-image understanding (not optimized for video)

Citation

If you use this model, please cite:

@article{sun2026switchkd,
  title={Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models},
  author={Sun, Haoyi and Wang, Xiaoxiao and Mao, Ning and Wang, Qian and Mu, Lifu and Zheng, Wen and Wei, Tao and Chen, Wei},
  journal={arXiv preprint arXiv:2604.14629},
  year={2026}
}

Acknowledgments

This model is built upon excellent open-source work:

  • Qwen2.5 - Language model backbone
  • CLIP - Visual encoder
  • xtuner - VLM training framework

Contact

For questions about this model or the Switch-KD framework, use the model card contact below.


Model Card Authors: Haoyi Sun et al.

Model Card Contact: sunhaoyi@lixiang.com
