inference throughput drops by 80% with this template

#28
by froilo - opened

inference throughput drops by 80% with this template

      /llama_mtp/llama.cpp/build/bin/llama-server
      -m /models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf
      --host 0.0.0.0
      --port 8080
      -c 32768
      --cache-type-k q8_0
      --cache-type-v q8_0
      --temp 0.7
      --top-p 0.95
      --top-k 20
      --presence-penalty 0.0
      --min-p 0.00
      --spec-type draft-mtp
      --spec-draft-n-max 6
      --spec-draft-p-min 0.75
      --jinja
      --chat-template-file /models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/froggeric__Qwen-Fixed-Chat-Templates.jinja

Try these params:
--chat-template-kwargs "{\"preserve_thinking\":true}" --spec-type draft-mtp --spec-draft-n-max 2

--spec-draft-n-max 6 is not optimal for this model.
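
For reference, here is a sketch of how the original command might look with the two suggested changes merged in. Every flag and path is copied from the posts above. The `MODEL_DIR` variable and the `args` array are only conveniences added for illustration, and the script prints the command instead of running it so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: the original launch command with the suggested changes applied.
# Paths are copied from the first post; adjust them to your own setup.
MODEL_DIR=/models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF   # illustrative helper variable

args=(
  -m "$MODEL_DIR/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf"
  --host 0.0.0.0
  --port 8080
  -c 32768
  --cache-type-k q8_0
  --cache-type-v q8_0
  --temp 0.7
  --top-p 0.95
  --top-k 20
  --presence-penalty 0.0
  --min-p 0.00
  --spec-type draft-mtp
  --spec-draft-n-max 2   # was 6 in the original command
  --spec-draft-p-min 0.75
  --jinja
  --chat-template-file "$MODEL_DIR/froggeric__Qwen-Fixed-Chat-Templates.jinja"
  --chat-template-kwargs '{"preserve_thinking":true}'   # added per the suggestion
)

# Print the command for review; drop 'echo' to actually start the server.
echo /llama_mtp/llama.cpp/build/bin/llama-server "${args[@]}"
```

Single quotes around the JSON avoid the backslash escaping needed with double quotes; both forms pass the same string to the server.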

Thanks, it works.
It may even be about 8-10% faster than my prior MTP setup without the template (not enough data yet, though).

Also, I haven't tested an agentic flow yet.
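
To firm up a figure like the 8-10% above, it helps to average the tokens/s over several runs of each setup and compare the means. A minimal sketch: the `mean` and `pct_change` helpers are hypothetical names, and the sample numbers are placeholders, not measurements from this thread.

```shell
# mean: average of whitespace-separated numbers read from stdin.
mean() {
  awk '{ for (i = 1; i <= NF; i++) { sum += $i; n++ } }
       END { if (n) printf "%.2f\n", sum / n }'
}

# pct_change BASELINE NEW: relative change of NEW vs BASELINE, in percent.
pct_change() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Placeholder tokens/s values, NOT real measurements; replace with your own runs.
baseline=$(echo "50 52 48" | mean)
tuned=$(echo "54 55 56" | mean)

pct_change "$baseline" "$tuned"
```

With only a few runs per setup, a difference of this size can easily be noise, so more samples on the same prompts give a more trustworthy comparison.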

[Image: Screenshot_2026-05-19-13-25-42-067-edit_com.reddit.frontpage]

This might help in understanding the MTP params.

That's why I've been using --spec-draft-n-max 6.

Your quant had the best performance with --spec-draft-n-max between 1 and 2.

Edit: sorry, Unsloth suggested that value somewhere else.