Thank you very much! Phenomenal on old hardware.
#18 · by LadislavDanis
On my old Dell OptiPlex SFF 3050:
Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
64 GB RAM DDR4
Tesla P4 8GB VRAM
with the command:
```shell
llama-server \
  --model /home/testbox/.cache/llama.cpp/unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q8_K_XL.gguf \
  --threads 3 \
  --batch-size 256 \
  --ctx-size 65536 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja \
  --temp 0.7 --min-p 0.01 --top-p 0.80 --top-k 20 --repeat-penalty 1.05 \
  --flash-attn on \
  -a qwen3-coder-30-a3b-p4 \
  --n-gpu-layers 48 \
  --override-tensor ".blk.([1-9]|[1-3][0-9]|4[0-6]).ffn_.*_exps.*=CPU" \
  --host 0.0.0.0 \
  --port 4000
```
I consistently achieve a beautiful ~7 t/s.

If my memory weren't running in a degraded dual-channel configuration because of the RAM size (my Dell officially supports only 32 GB), I believe the result would be even better.
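For anyone adapting the `--override-tensor` trick to their own GPU: the regex keeps the FFN expert tensors of blocks 1-46 on the CPU while everything else (including block 0 and the last block) goes to VRAM. A quick sketch of the numeric-range logic in Python, assuming the markdown-mangled pattern was `ffn_.*_exps.*` and dropping the stray leading `.` so the match anchors at `blk`:

```python
import re

# Hypothetical reconstruction of the numeric part of the
# --override-tensor pattern from the post: blocks 1-9, 10-39,
# and 40-46 match; blocks 0 and 47 do not.
PATTERN = re.compile(r"blk\.([1-9]|[1-3][0-9]|4[0-6])\.ffn_.*_exps.*")

def offloaded_to_cpu(tensor_name: str) -> bool:
    """True if this tensor name matches the override, i.e. its
    expert weights would be kept in system RAM instead of VRAM."""
    return PATTERN.search(tensor_name) is not None

print(offloaded_to_cpu("blk.0.ffn_up_exps.weight"))    # False (stays on GPU)
print(offloaded_to_cpu("blk.1.ffn_gate_exps.weight"))  # True  (offloaded)
print(offloaded_to_cpu("blk.46.ffn_down_exps.weight")) # True  (offloaded)
print(offloaded_to_cpu("blk.47.ffn_up_exps.weight"))   # False (stays on GPU)
```

To offload a different range of blocks, just adjust the alternation, e.g. `([1-9]|[12][0-9])` for blocks 1-29 on a card with a bit more VRAM than the P4's 8 GB.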