Instructions for using zai-org/GLM-4.1V-9B-Thinking with libraries, inference providers, notebooks, and local apps.

Libraries

How to use zai-org/GLM-4.1V-9B-Thinking with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="zai-org/GLM-4.1V-9B-Thinking")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Run the pipeline and print the generated output
result = pipe(text=messages)
print(result)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("zai-org/GLM-4.1V-9B-Thinking")
model = AutoModelForImageTextToText.from_pretrained("zai-org/GLM-4.1V-9B-Thinking")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
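
Both snippets above build the same `messages` structure by hand. A minimal sketch of a helper that assembles it from an image URL and a question follows; `build_messages` is a hypothetical name introduced here for illustration and is not part of Transformers.

```python
def build_messages(image_url: str, question: str) -> list[dict]:
    """Return a single-turn chat in the format used by the snippets above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


# Same image and question as in the examples above
messages = build_messages(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
    "What animal is on the candy?",
)
```

The resulting list can be passed to `pipe(text=messages)` or to `processor.apply_chat_template(...)` exactly as in the examples above.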

Local Apps

vLLM

How to use zai-org/GLM-4.1V-9B-Thinking with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/GLM-4.1V-9B-Thinking"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.1V-9B-Thinking",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
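
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started in the previous step is listening on `localhost:8000`. `build_chat_request` and `send_chat_request` are hypothetical helper names, and reading the reply from `choices[0].message.content` assumes the response follows the OpenAI chat-completions format.

```python
import json
import urllib.request

MODEL_ID = "zai-org/GLM-4.1V-9B-Thinking"


def build_chat_request(image_url: str, prompt: str, model: str = MODEL_ID) -> dict:
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumed OpenAI-style response shape
    return body["choices"][0]["message"]["content"]


payload = build_chat_request(
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
    "Describe this image in one sentence.",
)
# Needs a running server, so the call is left commented out:
# print(send_chat_request(payload))
```

Passing `base_url="http://localhost:30000"` would target a server on the port used in the SGLang examples instead.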

Use Docker

docker model run hf.co/zai-org/GLM-4.1V-9B-Thinking

SGLang

How to use zai-org/GLM-4.1V-9B-Thinking with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4.1V-9B-Thinking" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.1V-9B-Thinking",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-4.1V-9B-Thinking" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.1V-9B-Thinking",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use zai-org/GLM-4.1V-9B-Thinking with Docker Model Runner:
```
docker model run hf.co/zai-org/GLM-4.1V-9B-Thinking
```

Please make MOE models

by Narutoouz - opened Jul 3, 2025

Discussion

Narutoouz

Jul 3, 2025

You guys are making the best models for local inference; I love the GLM4-32B and 9B models. I can't wait to see what a wonderful MOE you could make. It would also be great if it had both thinking and non-thinking modes, like the Qwen3 series of models.

ZHANGYUXUAN-zR

Z.ai org Jul 4, 2025

We will make it. I don't know yet when they will be open-sourced, but we will try our best, and we will share information once there is new progress.

Dampfinchen

Jul 5, 2025

I agree. I would love to see a smaller one, similar to Qwen 3 30B A3B, and a bigger one. Native multimodality and more attention heads for better instruction following over a larger context would be nice to see as well.

Doctor-Chad-PhD

Jul 5, 2025

Yes, a GLM 30B A3B would be great.

imoc

Jul 6, 2025 (edited Jul 6, 2025)

I vote for 70~80B-A8~9B; A3B is fast but practically too weak.
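
For readers unfamiliar with the naming used in this thread: in Qwen-style MoE names such as "30B A3B", the first figure is the total parameter count and the "A" figure is the number of parameters activated per token. Below is a small sketch of the arithmetic behind the sizes being discussed. `active_fraction` is a hypothetical helper, and the 70B/8B pair is simply the low end of the range proposed above.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token, given total and active sizes in billions."""
    return active_b / total_b


# 30B total, 3B active (the "30B A3B" shape mentioned in the thread)
small = active_fraction(30, 3)
# 70B total, 8B active (low end of the 70~80B-A8~9B suggestion)
large = active_fraction(70, 8)

print(f"30B-A3B: {small:.1%} of weights active per token")
print(f"70B-A8B: {large:.1%} of weights active per token")
```

The two shapes activate a similar share of their weights; the larger proposal differs mainly in the absolute number of active parameters per token.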

Narutoouz

Jul 6, 2025

> We will make it. I don't know yet when they will be open-sourced, but we will try our best, and we will share information once there is new progress.

Thank you, looking forward to it.

Narutoouz

Jul 6, 2025

> I vote for 70~80B-A8~9B; A3B is fast but practically too weak.

GLM has already surprised me with its 32B and 9B models, which match or outdo SOTA models on similar tasks. I expect their MOE model will do the same. And if it is also multimodal, like the Mistral 24B multimodal model, that will be the cherry on the cake for the open-source community!