Am I correct in understanding that Q4_K_S quantization means 4-bit quantization with k-quantization (K) and a small (S) block size? What does a small block size mean? And does Q3 mean 3-bit quantization? Thanks!
I can’t say I fully understand it either, so I searched around. In short: if in doubt, use Q4_K_M.
https://www.reddit.com/r/LocalLLaMA/comments/1d1sc50/gguf_weight_encoding_suffixes_is_there_a_guide/
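For what it’s worth, the suffix pattern itself can be parsed mechanically. This is a hedged sketch assuming the common llama.cpp convention `Q<bits>[_<scheme>][_<mix>]`; the comment about what S/M/L mean reflects the usual explanation (a mix of quant types across tensors, not a literal block size), which is worth verifying against the thread linked above.

```python
import re

def parse_quant_suffix(name: str) -> dict:
    """Parse a llama.cpp-style quantization suffix such as "Q4_K_S".

    Assumes the pattern Q<bits>[_<scheme>][_<mix>], where <scheme> is
    e.g. K (k-quants) or 0/1 (legacy round-to-nearest variants), and
    <mix> (S/M/L) selects how aggressively higher-precision types are
    mixed in for important tensors -- not a literal block size.
    """
    m = re.fullmatch(r"Q(\d+)(?:_([A-Z0-9]+))?(?:_([SML]))?", name)
    if not m:
        raise ValueError(f"unrecognized suffix: {name}")
    bits, scheme, mix = m.groups()
    return {"bits": int(bits), "scheme": scheme, "mix": mix}

print(parse_quant_suffix("Q4_K_S"))  # {'bits': 4, 'scheme': 'K', 'mix': 'S'}
print(parse_quant_suffix("Q8_0"))    # {'bits': 8, 'scheme': '0', 'mix': None}
```

So under this reading, Q3_K_M would be 3-bit k-quants with the "medium" tensor mix.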
README.md
# Which GGUF is right for me? (Opinionated)
Good question! I am collecting human data on how quantization affects outputs. See here for more information: https://github.com/ggerganov/llama.cpp/discussions/5962
In the meantime, use the largest that fully fits in your GPU. If you can comfortably fit Q4_K_S, try using a model with more parameters.
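The "largest that fully fits in your GPU" rule can be sketched as a back-of-the-envelope check. The bits-per-weight figures below are rough approximations (k-quants store per-block scales, so effective size exceeds the nominal bit width), and the 1.2x overhead factor for KV cache and compute buffers is an assumption, not a measured value.

```python
# Rough bits-per-weight for a few common quants (approximate values).
APPROX_BITS_PER_WEIGHT = {
    "Q3_K_S": 3.5,
    "Q4_K_S": 4.6,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
}

def fits_in_vram(n_params_b: float, quant: str, vram_gb: float,
                 overhead: float = 1.2) -> bool:
    """Return True if the quantized weights (plus an assumed overhead
    factor for KV cache and buffers) fit in vram_gb gigabytes."""
    size_gb = n_params_b * APPROX_BITS_PER_WEIGHT[quant] / 8  # B params * bits / 8
    return size_gb * overhead <= vram_gb

# e.g. a 7B model at Q4_K_M on an 8 GB GPU:
print(fits_in_vram(7, "Q4_K_M", 8))   # True (~4.2 GB weights + overhead)
# a 70B model at Q4_K_M on a 24 GB GPU:
print(fits_in_vram(70, "Q4_K_M", 24)) # False (~42 GB weights)
```

If Q4_K_S of your chosen model fits with room to spare, the README's advice is to step up to a larger-parameter model instead.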
# llama.cpp feature matrix
See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix