vLLM load error

#2
by srinivasbilla - opened

TypeError: Invalid type of HuggingFace processor. Expected type: <class 'transformers.processing_utils.ProcessorMixin'>, but found type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>

Same here, can't get this to load on either vLLM or SGLang. The latest SGLang gives a different error:

return super().__getattribute__(key)
AttributeError: 'Glm4vMoeConfig' object has no attribute 'rope_scaling'
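
For anyone comparing setups, a quick way to print the versions the serving environment actually resolves (a minimal check, assuming both packages are installed in the same environment):

# print the transformers and vllm versions in use
python -c "import transformers, vllm; print('transformers', transformers.__version__, '| vllm', vllm.__version__)"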

You probably need to update your transformers library -- I had the same error and, after updating to 5.0.0rc0, it runs fine for me.

(glm46v) mbelleau@aibeast:/mnt/vault/llm/glm46v$ uv pip freeze | grep -e torch -e vllm -e transformers
torch==2.9.1
torchaudio==2.9.0+cu130
torchvision==0.24.1
transformers==5.0.0rc0
vllm==0.12.0

Thanks, that was indeed the issue. Even though I had it in my requirements, it didn't install for some reason. After installing transformers v5 I was able to run the model, but I had a lot of memory issues.
On my 8x L4 (24 GB each) this worked:

!vllm serve zai-org/GLM-4.6V-FP8 --served-model-name glm-4.6v --host 0.0.0.0 --port 1234 --max-model-len 32000 --tensor-parallel-size 8 --distributed-executor-backend mp --max-num-seqs 4 --enable-expert-parallel --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice --enforce-eager --mm-encoder-tp-mode data --mm-processor-cache-type shm --gpu_memory_utilization 0.8 --kv-cache-dtype fp8_e4m3 --mm-processor-cache-gb 1 --limit-mm-per-prompt '{"image":2, "video":0}'
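
Once the server is up, a quick sanity check against the OpenAI-compatible endpoint might look like this (the prompt and image URL are placeholders; the port and model name match the serve command above):

# minimal multimodal request against vLLM's OpenAI-compatible API
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.6v",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}}
      ]
    }],
    "max_tokens": 128
  }'
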
srinivasbilla changed discussion status to closed

Because vllm==0.12.0 depends on transformers>=4.56.0,<5 and your project depends on transformers==5.0.0rc0, we can conclude that vllm==0.12.0 and your project are incompatible.
And because your project depends on vllm==0.12.0, we can conclude that your project's requirements are unsatisfiable.

Try this installation:

pip install transformers==5.0.0rc1 --upgrade --no-deps
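
If the conflict comes from a locked project file (as in the uv resolver error above), one workaround is to let vLLM install with its own pins first and then override transformers outside the resolver; a sketch using the versions from this thread:

# install vLLM normally, accepting its transformers<5 pin for now
uv pip install vllm==0.12.0
# then force the transformers 5.x release candidate without touching other dependencies
uv pip install transformers==5.0.0rc1 --upgrade --no-deps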

If using the vLLM Docker image, build like so:

# ./vllm/dockerfile
FROM vllm/vllm-openai:nightly
RUN uv pip install transformers==5.0.0rc0 --upgrade --no-deps --system
RUN uv pip install huggingface-hub --upgrade --no-deps --system

docker build . -t vllm/vllm-openai:glm46v

Then run that Docker image with your normal vLLM serve arguments.
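
For example, a typical invocation of the resulting image might look like this (GPU count, cache path, and serve flags are illustrative and should match your setup):

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai:glm46v \
    --model zai-org/GLM-4.6V-FP8 \
    --served-model-name glm-4.6v \
    --tensor-parallel-size 8 \
    --max-model-len 32000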

This was useful! Thank you for sharing!

Glad it works for you. With --kv-cache-dtype fp8 you can easily fit this on a single RTX 6000 PRO with tokens to burn. 256k context works on a single GPU :) (AWQ or NVFP4 variants)

Edit: clarified that this applies to the AWQ/NVFP4 variants on a single RTX 6000 Blackwell system.

How?

  1. The RTX Pro 6000 has 96 GiB of VRAM, while the FP8 version of the model takes about 110 GiB
  2. This model is trained up to a 131072 context length, not 256k

I meant the NVFP4 or AWQ variants; will update my original comment... got two threads mixed up.

And for 256k, here you go:

    --max-model-len 262144 \
    --rope-scaling '{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":131072 }'
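
In context, those flags just get appended to the serve command; a rough sketch for a single-GPU quantized variant (the checkpoint name is a placeholder and the other flags are illustrative):

vllm serve <your-GLM-4.6V-AWQ-or-NVFP4-checkpoint> \
    --served-model-name glm-4.6v \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 262144 \
    --rope-scaling '{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":131072}'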

I meant the NVFP4 or AWQ variants; will update my original comment... got two threads mixed up.

Ah I see. However, I suspect there is significant quality degradation, because from what I've seen all of them use LLM Compressor, and LLM Compressor does not yet include a "calibrate all experts" flag for GLM-4.x, which is necessary for all experts to be quantized properly; see: https://github.com/vllm-project/llm-compressor/blob/0.9.0/examples/quantization_w4a4_fp4/README.md?plain=1#L85-L92

Quantizing MoEs

To quantize MoEs, MoE calibration is now handled automatically by the pipeline. An example quantizing Llama4 can be found under llama4_example.py. The pipeline automatically applies the appropriate MoE calibration context which:

  1. Linearizes the model to enable quantization and execution in vLLM. This is required as the native model definition does not include torch.nn.Linear layers in its MoE blocks, a requirement for LLM Compressor to run quantization.
  2. Ensures experts are quantized correctly as not all experts are activated during calibration

And for 256k, here you go:

    --max-model-len 262144 \
    --rope-scaling '{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":131072 }'

Ah yeah, I try to avoid that because it degrades performance on small contexts, and long-context performance is mediocre even without it; see https://contextarena.ai/ and https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87

This setup is my daily driver for web dev; the long context covers my project size, and the multimodal capabilities have worked great.

I notice that NVFP4 models are much more performant than AWQ models. The model cyankiwi/GLM-4.6V-AWQ-4bit seems to work fine (at least until my next GPU arrives :)
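
For anyone wanting to try that AWQ variant on a single card, a rough launch sketch reusing flags from earlier in this thread (untested here; adjust context length, parsers, and multimodal limits to your workload):

vllm serve cyankiwi/GLM-4.6V-AWQ-4bit \
    --served-model-name glm-4.6v \
    --max-model-len 131072 \
    --kv-cache-dtype fp8_e4m3 \
    --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice \
    --limit-mm-per-prompt '{"image":2, "video":0}'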
