InternVL3-8B-AWQ is much slower than InternVL3-8B

#3 by QiliangGoose

I'm using InternVL3-8B-AWQ for inference on vLLM, and it is much slower than InternVL3-8B.

The device I'm using:
RTX 4090D 24 GB
vLLM==0.9.0

Time to first token:
InternVL3-8B: 0.81 s
InternVL3-8B-AWQ: 1.40 s

Command settings:
python3 -m vllm.entrypoints.openai.api_server \
    --model models--OpenGVLab--InternVL3-8B-AWQ \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 1 \
    --max-model-len 16384 \
    --served-model-name "vlm_test" \
    --limit-mm-per-prompt image=5 \
    --quantization awq \
    --trust-remote-code
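
For context, here is a minimal sketch of how I measure time to first token against this server with a streaming request (the port, image URL, and prompt are placeholders, not my exact test script):

import time

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="vlm_test",  # must match --served-model-name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
        ],
    }],
    stream=True,
)

# Report the elapsed time when the first content token arrives.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"Time to first token: {time.perf_counter() - start:.2f}s")
        break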

Are there any special settings needed to make the AWQ model run faster?
