vLLM load error
TypeError: Invalid type of HuggingFace processor. Expected type: <class 'transformers.processing_utils.ProcessorMixin'>, but found type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
Same, can't get this to load on either vLLM or SGLang. The latest SGLang gives a different error:
return super().__getattribute__(key)
AttributeError: 'Glm4vMoeConfig' object has no attribute 'rope_scaling'
You probably need to update your transformers library -- I had the same error, and after updating to 5.0.0rc0 I'm able to run it.
(glm46v) mbelleau@aibeast:/mnt/vault/llm/glm46v$ uv pip freeze | grep -e torch -e vllm -e transformers
torch==2.9.1
torchaudio==2.9.0+cu130
torchvision==0.24.1
transformers==5.0.0rc0
vllm==0.12.0
Thanks, that was indeed the issue. Even though I had it in my requirements, it didn't install for some reason. After installing transformers v5 I was able to run it, but had a lot of memory issues.
On my 8xL4 (24 GB each), this worked:
!vllm serve zai-org/GLM-4.6V-FP8 --served-model-name glm-4.6v --host 0.0.0.0 --port 1234 --max-model-len 32000 --tensor-parallel-size 8 --distributed-executor-backend mp --max-num-seqs 4 --enable-expert-parallel --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice --enforce-eager --mm-encoder-tp-mode data --mm-processor-cache-type shm --gpu_memory_utilization 0.8 --kv-cache-dtype fp8_e4m3 --mm-processor-cache-gb 1 --limit-mm-per-prompt '{"image":2, "video":0}'
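Once it's up, a minimal smoke test against the OpenAI-compatible endpoint looks roughly like this; a sketch assuming the --served-model-name and --port from the command above and that you query from the same host (the image URL is just a placeholder):
# adjust host/port if your serve command differs
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.6v",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }],
    "max_tokens": 256
  }'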
Because vllm==0.12.0 depends on transformers>=4.56.0,<5 and your project depends on transformers==5.0.0rc0, we can conclude that vllm==0.12.0 and your project are incompatible.
And because your project depends on vllm==0.12.0, we can conclude that your project's requirements are unsatisfiable.
try this installation:
pip install transformers==5.0.0rc1 --upgrade --no-deps
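After that, a quick sanity check that the override actually took effect (this is what bites when transformers "didn't install for some reason"; nothing vLLM-specific here):
python -c "import transformers; print(transformers.__version__)"   # should print 5.0.0rc1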
If using the vLLM Docker image, build like so:
# ./vllm/dockerfile
FROM vllm/vllm-openai:nightly
RUN uv pip install transformers==5.0.0rc0 --upgrade --no-deps --system
RUN uv pip install huggingface-hub --upgrade --no-deps --system
docker build . -t vllm/vllm-openai:glm46v
Then run that Docker image with the normal vLLM commands.
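For completeness, a run command would look roughly like this; a sketch following the standard vllm/vllm-openai Docker usage, with the serve flags trimmed to placeholders you should adapt (the HF cache mount just avoids re-downloading weights inside the container):
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:glm46v \
  --model zai-org/GLM-4.6V-FP8 \
  --served-model-name glm-4.6v \
  --tensor-parallel-size 8 \
  --max-model-len 32000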
This was useful! Thank you for sharing!
Glad it works for you. With --kv-cache-dtype fp8 you can easily fit this on a single RTX 6000 PRO with tokens to burn. 256k context works on a single GPU :) (AWQ or NVFP4 variants.)
Edit: added that this works with the AWQ/NVFP4 variants on a single RTX 6000 Blackwell system.
How?
- RTX Pro 6000 is 96 GiB while the FP8 version of the model takes 110 GiB
- This model is trained up to a 131072 context length, not 256k
I meant the NVFP4 or AWQ variants. Will update my original comment... got two threads mixed up.
And for 256k, here you go:
--max-model-len 262144 \
--rope-scaling '{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":131072 }'
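For context, a sketch of how those flags slot into a full serve command, assuming a single-GPU quantized variant (the AWQ checkpoint mentioned later in this thread is used as a placeholder; swap in whichever variant and memory settings you actually run):
# placeholder model: any AWQ/NVFP4 GLM-4.6V variant that fits on one GPU
vllm serve cyankiwi/GLM-4.6V-AWQ-4bit \
  --served-model-name glm-4.6v \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --rope-scaling '{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":131072}'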
Ah I see. However, I suspect there is significant quality degradation, because from what I've seen all of them use LLM Compressor, and LLM Compressor does not yet include a "calibrate all experts" flag for GLM-4.x, which is necessary for all experts to be quantized properly; see: https://github.com/vllm-project/llm-compressor/blob/0.9.0/examples/quantization_w4a4_fp4/README.md?plain=1#L85-L92
Quantizing MoEs
To quantize MoEs, MoE calibration is now handled automatically by the pipeline. An example quantizing Llama4 can be found under llama4_example.py. The pipeline automatically applies the appropriate MoE calibration context which:
- Linearizes the model to enable quantization and execution in vLLM. This is required as the native model definition does not include torch.nn.Linear layers in its MoE blocks, a requirement for LLM Compressor to run quantization.
- Ensures experts are quantized correctly as not all experts are activated during calibration
Ah yeah, I try to avoid that rope-scaling override because it degrades performance on small contexts, and the performance on long context is also meh even without it; see https://contextarena.ai/ and https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87
This setup is my daily driver for web dev; the long context handles my project size, and the multimodal capabilities have worked great.
I notice that NVFP4 models are much more performant than AWQ models. The model cyankiwi/GLM-4.6V-AWQ-4bit seems to work fine (at least until my next GPU arrives :)