RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
I'm having no luck running this DFlash model as the speculative-decoding drafter for cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4.
Everything loads, but the moment I send a prompt, it crashes.
Setup: current vLLM nightly (CU130 build), TP=2.
```
(EngineCore pid=19180) ERROR 05-07 18:17:48 [core.py:1138] EngineCore encountered a fatal error.
Traceback (most recent call last):
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1129, in run_engine_core
    engine_core.run_busy_loop()
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1170, in run_busy_loop
    self._process_engine_step()
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1209, in _process_engine_step
    outputs, model_executed = self.step_fn()
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 521, in step_with_batch_queue
    model_output = future.result()
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 90, in result
    return super().result()
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
    response = self.aggregate(self.get_response())
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 390, in get_response
    raise RuntimeError(
RuntimeError: Worker failed with error 'expected mat1 and mat2 to have the same dtype, but got: float != c10::Half', please check the stack trace above for the root cause
(Worker_TP0 pid=19199) INFO 05-07 18:17:48 [multiproc_executor.py:775] Parent process exited, terminating worker queues
(APIServer pid=19011) ERROR 05-07 18:17:48 [async_llm.py:704] AsyncLLM output_handler failed.
Traceback (most recent call last):
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 660, in output_handler
    outputs = await engine_core.get_output_async()
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 998, in get_output_async
    raise self._format_exception(outputs) from None
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=19011) ERROR 05-07 18:17:48 [serving.py:984] Error in chat completion stream generator.
Traceback (most recent call last):
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 487, in chat_completion_stream_generator
    async for res in result_generator:
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 579, in generate
    out = q.get_nowait() or await q.get()
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 85, in get
    raise output
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 660, in output_handler
    outputs = await engine_core.get_output_async()
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 998, in get_output_async
    raise self._format_exception(outputs) from None
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(Worker_TP0 pid=19199) ERROR 05-07 18:17:48 [multiproc_executor.py:962] Traceback (most recent call last):
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 957, in worker_busy_loop
    output = func(*args, **kwargs)
  File "/home/user/Envs/llm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 778, in sample_tokens
    return self.model_runner.sample_tokens(grammar_output)
  File "/home/user/Envs/llm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4309, in sample_tokens
    propose_draft_token_ids(sampled_token_ids)
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4270, in propose_draft_token_ids
    self._draft_token_ids = self.propose_draft_token_ids(
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4810, in propose_draft_token_ids
    draft_token_ids = self.drafter.propose(
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/v1/spec_decode/llm_base_proposer.py", line 450, in propose
    target_hidden_states = self.model.combine_hidden_states(
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_dflash.py", line 588, in combine_hidden_states
    result = self.model.fc(hidden_states)
  File "/home/user/Envs/llm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/Envs/llm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 397, in forward
    output = self.quant_method.apply(self, x, bias)
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 231, in apply
    return dispatch_unquantized_gemm()(layer, x, layer.weight, bias)
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/model_executor/layers/utils.py", line 98, in default_unquantized_gemm
    return torch.nn.functional.linear(x, weight, bias)
  File "/home/user/Envs/llm/lib/python3.12/site-packages/vllm/model_executor/parameter.py", line 126, in __torch_function__
    return super().__torch_function__(func, types, args, kwargs)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
```
Is the problem the specific cyankiwi quant, or the DFlash model?
I've heard it's already been fixed in the nightly build.
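To resolve this, cast `hidden_states` to the dtype of the drafter's `fc` weight before the matrix multiplication. The traceback shows that the input to `self.model.fc` (`mat1`) is float32 (`float`) while the weight (`mat2`) is float16 (`c10::Half`). Because the two operands have different dtypes, `torch.nn.functional.linear` refuses to multiply them: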
Source: `vllm/model_executor/models/qwen3_dflash.py`

```python
def combine_hidden_states(
    self,
    hidden_states: torch.Tensor,
) -> torch.Tensor:
    ...
    # Cast hidden_states to the fc weight's dtype so both matmul operands match
    hidden_states = hidden_states.to(self.model.fc.weight.dtype)
    result = self.model.fc(hidden_states)
    ...
```
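The standalone PyTorch sketch below reproduces the failure in isolation. It is not vLLM's code. The helper `project_in_weight_dtype` and the toy tensor sizes are hypothetical. The sketch assumes a 16-bit weight and float32 hidden states, matching what the traceback reports. Some PyTorch builds do not support float16 matmul on CPU, so the sketch falls back to bfloat16 when no GPU is present. The mismatch and the cast behave the same way in both cases.

```python
import torch
import torch.nn.functional as F


def project_in_weight_dtype(hidden_states: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: cast the input to the weight's dtype, then apply the linear layer."""
    return F.linear(hidden_states.to(weight.dtype), weight)


# Use float16 on GPU (as in the traceback); on CPU use bfloat16, because
# float16 matmul support on CPU depends on the PyTorch build.
device = "cuda" if torch.cuda.is_available() else "cpu"
weight_dtype = torch.float16 if device == "cuda" else torch.bfloat16

hidden_size, out_size = 8, 4  # hypothetical toy sizes
weight = torch.randn(out_size, hidden_size, device=device, dtype=weight_dtype)
hidden_states = torch.randn(2, hidden_size, device=device, dtype=torch.float32)

# Without the cast, the two matmul operands have different dtypes and PyTorch raises.
try:
    F.linear(hidden_states, weight)
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print(f"Without cast: {err}")

# With the cast, the projection runs, and the output has the weight's dtype.
result = project_in_weight_dtype(hidden_states, weight)
print(result.dtype, tuple(result.shape))
```

This sketch casts the activations down to the weight's dtype rather than upcasting the weight. That leaves the layer's parameters and kernel choice unchanged, at the cost of the extra precision the float32 activations carried.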
I had given up after two weeks, but I'll give this a try.
Thanks!
This appears to have been patched in the nightly build, so I'm closing the issue.