InternVL3-8B-AWQ is much slower than InternVL3-8B

#3 by QiliangGoose

I'm using InternVL3-8B-AWQ for inference on vLLM, and it is much slower than InternVL3-8B.

The device I'm using:
RTX 4090D 24 GB
vLLM==0.9.0

Time to first token:
InternVL3-8B: 0.81 s
InternVL3-8B-AWQ: 1.40 s

Command settings:
python3 -m vllm.entrypoints.openai.api_server \
    --model models--OpenGVLab--InternVL3-8B-AWQ \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 1 \
    --max-model-len 16384 \
    --served-model-name "vlm_test" \
    --limit-mm-per-prompt image=5 \
    --quantization awq \
    --trust-remote-code
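
For context, here is a minimal sketch of how I measure time to first token against this server with a streaming request (the port, image URL, and prompt are placeholders, not my exact test script):

import time

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="vlm_test",  # must match --served-model-name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
        ],
    }],
    stream=True,
)

# Report the elapsed time when the first content token arrives.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"Time to first token: {time.perf_counter() - start:.2f}s")
        break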

Are there any special settings needed to make the AWQ model run faster?
