Testing smol-IQ3_KS

#7
by shewin - opened

Computed blk.60.attn_kv_b.weight as 512 x 16384 of type q8_0 and stored in buffer CUDA0
llama_init_from_model: n_ctx = 161280
llama_init_from_model: n_batch = 8192
llama_init_from_model: n_ubatch = 8192
llama_init_from_model: flash_attn = 1
llama_init_from_model: mla_attn = 3
llama_init_from_model: attn_max_b = 4096
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 50000.0
llama_init_from_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 5742.01 MiB
llama_init_from_model: KV self size = 5741.98 MiB, c^KV (q8_0): 5741.98 MiB, kv^T: not used
llama_init_from_model: CUDA_Host output buffer size = 0.62 MiB
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 1
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 2
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 3
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 4
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 5
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 6
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 7
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 8
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 9
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 10
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 11
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 12
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 13
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 14
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 15
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 16
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 17
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 18
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 19
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 20
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 21
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 22
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 23
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 24
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 25
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 26
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 27
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 28
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 29
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 30
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 31
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 32
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 33
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 34
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 35
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 36
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 37
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 38
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 39
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 40
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 41
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 42
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 43
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 44
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 45
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 46
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 47
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 48
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 49
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 50
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 51
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 52
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 53
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 54
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 55
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 56
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 57
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 58
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 59
llama_repack_up_gate_exps: repacking up/gate experts weight in layer 60
llama_init_from_model: CUDA0 compute buffer size = 8023.03 MiB
llama_init_from_model: CUDA_Host compute buffer size = 2744.09 MiB
llama_init_from_model: graph nodes = 5417
llama_init_from_model: graph splits = 122
llama_init_from_model: enabling only_active_experts scheduling
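
As a sanity check, the `KV self size` line in the log above can be reproduced with simple arithmetic. This is only a sketch under assumptions the log does not state outright: 61 layers (inferred from `blk.60` being the last block), a cached row of 512 compressed-KV values plus 64 RoPE values per token per layer, and the standard q8_0 layout of 34 bytes per 32 values. The function name is hypothetical.

```python
# Sketch: reproduce the "KV self size" figure reported in the log.
# Assumed, not stated in the log: 61 layers, 512 + 64 cached values
# per token per layer, q8_0 = 32 int8 values + one fp16 scale (34 bytes).

N_CTX = 161280            # n_ctx from the log
N_LAYERS = 61             # assumption: blk.0 .. blk.60
KV_LORA_RANK = 512        # assumption: matches the 512 in the attn_kv_b shape
ROPE_DIM = 64             # assumption
Q8_0_BLOCK_VALUES = 32    # values per q8_0 block
Q8_0_BLOCK_BYTES = 34     # bytes per q8_0 block


def mla_kv_cache_mib(n_ctx: int, n_layers: int, row_width: int) -> float:
    """Size in MiB of a q8_0 cache holding one row per token per layer."""
    bytes_per_row = row_width // Q8_0_BLOCK_VALUES * Q8_0_BLOCK_BYTES
    return n_ctx * n_layers * bytes_per_row / 2**20


size_mib = mla_kv_cache_mib(N_CTX, N_LAYERS, KV_LORA_RANK + ROPE_DIM)
print(f"{size_mib:.2f} MiB")
```

If those assumptions hold, the result lines up with the 5741.98 MiB reported for `c^KV (q8_0)`.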

main: n_kv_max = 161280, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

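|   PP |   TG |  N_KV | T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 8192 | 2048 |     0 | 12.100 |   677.04 | 109.233 |    18.75 |
| 8192 | 2048 |  8192 | 13.790 |   594.04 | 177.558 |    11.53 |
| 8192 | 2048 | 16384 | 15.557 |   526.57 | 118.881 |    17.23 |
| 8192 | 2048 | 24576 | 17.367 |   471.71 | 123.425 |    16.59 |
| 8192 | 2048 | 32768 | 19.337 |   423.64 | 133.373 |    15.36 |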
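
For anyone reading the numbers: the speed columns appear to be plain token counts divided by elapsed time (`S_PP = PP / T_PP`, `S_TG = TG / T_TG`). The sketch below checks that against the rows above and picks out the slowest TG row. The names `ROWS` and `row_is_consistent` are illustrative only, and the tolerance is an arbitrary allowance for rounding in the printed values.

```python
# Rows copied from the benchmark table:
# (PP, TG, N_KV, T_PP s, S_PP t/s, T_TG s, S_TG t/s)
ROWS = [
    (8192, 2048, 0, 12.100, 677.04, 109.233, 18.75),
    (8192, 2048, 8192, 13.790, 594.04, 177.558, 11.53),
    (8192, 2048, 16384, 15.557, 526.57, 118.881, 17.23),
    (8192, 2048, 24576, 17.367, 471.71, 123.425, 16.59),
    (8192, 2048, 32768, 19.337, 423.64, 133.373, 15.36),
]


def row_is_consistent(row, tol=0.05):
    """True if both speed columns equal tokens / seconds within tol t/s."""
    pp, tg, _n_kv, t_pp, s_pp, t_tg, s_tg = row
    return abs(pp / t_pp - s_pp) <= tol and abs(tg / t_tg - s_tg) <= tol


all_consistent = all(row_is_consistent(r) for r in ROWS)
slowest_tg = min(ROWS, key=lambda r: r[6])  # row with the lowest S_TG

print("all rows consistent:", all_consistent)
print("slowest TG at N_KV =", slowest_tg[2])
```

The N_KV = 8192 row stands out: its TG speed is lower than at every deeper context, which may be a one-off stall during that run rather than a real trend.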

Better and faster than IQ3_K.

@shewin

Fascinating. The smol-IQ3_KS uses imatrix for everything. It should definitely be faster given the smaller overall size of the active weights, so more TG throughput for sure.
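
A rough way to see why smaller active weights help TG: if token generation is memory-bandwidth bound, every generated token has to read all active weights once, so throughput is capped at roughly bandwidth divided by active bytes. The sketch below uses that simplified model. All numbers in it are hypothetical placeholders, not measurements from this thread or properties of either quant.

```python
def tg_upper_bound(bandwidth_gb_s: float, active_params_b: float,
                   bits_per_weight: float) -> float:
    """Rough tokens/s ceiling for bandwidth-bound generation.

    Simplified model: each token reads every active weight exactly once.
    Ignores KV cache reads, compute time, and transfer overheads.
    """
    active_gb = active_params_b * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / active_gb


# Hypothetical placeholders for illustration only.
BANDWIDTH_GB_S = 400.0   # assumed effective memory bandwidth
ACTIVE_PARAMS_B = 30.0   # assumed active parameters per token, in billions

larger_quant = tg_upper_bound(BANDWIDTH_GB_S, ACTIVE_PARAMS_B, 3.5)
smaller_quant = tg_upper_bound(BANDWIDTH_GB_S, ACTIVE_PARAMS_B, 3.0)

print(f"larger quant ceiling:  {larger_quant:.1f} t/s")
print(f"smaller quant ceiling: {smaller_quant:.1f} t/s")
```

Under this model the ceiling scales inversely with bits per weight, so a smaller quant of the same active parameters always gets a higher ceiling. Real numbers will land below it.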
