inference throughput drops by 80% with this template

#28

by froilo - opened 3 days ago

Inference throughput drops by 80% with this template. This is my launch command:

      /llama_mtp/llama.cpp/build/bin/llama-server
      -m /models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf
      --host 0.0.0.0
      --port 8080
      -c 32768
      --cache-type-k q8_0
      --cache-type-v q8_0
      --temp 0.7
      --top-p 0.95
      --top-k 20
      --presence-penalty 0.0
      --min-p 0.00
      --spec-type draft-mtp
      --spec-draft-n-max 6
      --spec-draft-p-min 0.75
      --jinja
      --chat-template-file /models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/froggeric__Qwen-Fixed-Chat-Templates.jinja

szwedek

3 days ago

Try these params:
--chat-template-kwargs "{\"preserve_thinking\":true}" --spec-type draft-mtp --spec-draft-n-max 2

--spec-draft-n-max 6 is not optimal for this model.
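
For reference, here is a rough sketch of the original launch command with those two changes applied. It assumes every other flag from the first post stays as it was; the `cmd` array name is only for illustration, and this has not been benchmarked here.

```shell
#!/usr/bin/env bash
# Sketch: launch command from the first post with the suggested changes applied.
# Assumption: all flags not mentioned in the suggestion are left unchanged.
cmd=(
  /llama_mtp/llama.cpp/build/bin/llama-server
  -m /models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf
  --host 0.0.0.0
  --port 8080
  -c 32768
  --cache-type-k q8_0
  --cache-type-v q8_0
  --temp 0.7
  --top-p 0.95
  --top-k 20
  --presence-penalty 0.0
  --min-p 0.00
  --spec-type draft-mtp
  --spec-draft-n-max 2   # was 6 in the original command
  --spec-draft-p-min 0.75
  --jinja
  --chat-template-file /models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/froggeric__Qwen-Fixed-Chat-Templates.jinja
  --chat-template-kwargs '{"preserve_thinking":true}'   # newly added
)

# Print the shell-quoted command for review; start the server with: "${cmd[@]}"
printf '%q ' "${cmd[@]}"
echo
```

Single quotes around the JSON are equivalent to the escaped double-quoted form above and are a little easier to read.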

froilo

3 days ago

edited 3 days ago

> Try these params:
> --chat-template-kwargs "{\"preserve_thinking\":true}" --spec-type draft-mtp --spec-draft-n-max 2
>
> --spec-draft-n-max 6 is not optimal for this model.

Thanks, it works.
It may even be about 8-10% faster than my prior MTP setup without the template (not enough data yet, though).

I also haven't tested an agentic flow yet.