Testing IQ3_KS
Tensor blk.77.ffn_up_exps.weight (size = 1225.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.attn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a_norm.weight (size = 8192 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_norm.weight (size = 2048 bytes) -- ignoring
model has unused tensor blk.78.attn_q_a.weight (size = 10420224 bytes) -- ignoring
model has unused tensor blk.78.attn_q_b.weight (size = 27787264 bytes) -- ignoring
model has unused tensor blk.78.attn_kv_a_mqa.weight (size = 3760128 bytes) -- ignoring
model has unused tensor blk.78.attn_output.weight (size = 83361792 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.k_norm.bias (size = 512 bytes) -- ignoring
model has unused tensor blk.78.indexer.proj.weight (size = 208896 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_k.weight (size = 835584 bytes) -- ignoring
model has unused tensor blk.78.indexer.attn_q_b.weight (size = 6946816 bytes) -- ignoring
model has unused tensor blk.78.ffn_norm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_inp.weight (size = 6291456 bytes) -- ignoring
model has unused tensor blk.78.exp_probs_b.bias (size = 1024 bytes) -- ignoring
Tensor blk.78.ffn_gate_exps.weight (size = 2018.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.ffn_gate_exps.weight (size = 2116026368 bytes) -- ignoring
Tensor blk.78.ffn_down_exps.weight (size = 2022.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.ffn_down_exps.weight (size = 2120220672 bytes) -- ignoring
Tensor blk.78.ffn_up_exps.weight (size = 2018.00 MiB) buffer type overriden to CUDA_Host
model has unused tensor blk.78.ffn_up_exps.weight (size = 2116026368 bytes) -- ignoring
model has unused tensor blk.78.ffn_gate_shexp.weight (size = 8265728 bytes) -- ignoring
model has unused tensor blk.78.ffn_down_shexp.weight (size = 8282112 bytes) -- ignoring
model has unused tensor blk.78.ffn_up_shexp.weight (size = 8265728 bytes) -- ignoring
model has unused tensor blk.78.nextn.eh_proj.weight (size = 80216064 bytes) -- ignoring
model has unused tensor blk.78.nextn.enorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.hnorm.weight (size = 24576 bytes) -- ignoring
model has unused tensor blk.78.nextn.shared_head_norm.weight (size = 24576 bytes) -- ignoring
Allocating 299.91 GiB of pinned host memory, this may take a while.
Using pinned host memory improves PP performance by a significant margin.
But if it takes too long for your model and amount of patience, kill the process and run using
GGML_CUDA_NO_PINNED=1 your_command_goes_here
done allocating 299.91 GiB in 84226.1 ms
llm_load_tensors: offloading 79 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 80/80 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 307110.47 MiB
llm_load_tensors: CUDA0 buffer size = 14498.95 MiB
...................................................................................................~ggml_backend_cuda_context: have 0 graphs
.
============ llm_prepare_mla: need to compute 79 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.1.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.2.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.3.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.4.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.5.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.6.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.7.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.8.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.9.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.10.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.11.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.12.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.13.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.14.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.15.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.16.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.17.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.18.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.19.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.20.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.21.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.22.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.23.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.24.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.25.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.26.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.27.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.28.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.29.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.30.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.31.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.32.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.33.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.34.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.35.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.36.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.37.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.38.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.39.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.40.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.41.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.42.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.43.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.44.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.45.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.46.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.47.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.48.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.49.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.50.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.51.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.52.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.53.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.54.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.55.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.56.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.57.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.58.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.59.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.60.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.61.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.62.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.63.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.64.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.65.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.66.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.67.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.68.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.69.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.70.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.71.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.72.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.73.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.74.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.75.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.76.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.77.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
Computed blk.78.attn_kv_b.weight as 512 x 28672 of type q8_0 and stored in buffer CUDA0
llama_init_from_model: n_ctx = 131072
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: mla_attn = 3
llama_init_from_model: attn_max_b = 512
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 5967.04 MiB
llama_init_from_model: KV self size = 5967.00 MiB, c^KV (q8_0): 5967.00 MiB, kv^T: not used
llama_init_from_model: CUDA_Host output buffer size = 0.59 MiB
llama_init_from_model: CUDA0 compute buffer size = 3934.02 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1120.05 MiB
llama_init_from_model: graph nodes = 30842
llama_init_from_model: graph splits = 152
llama_init_from_model: enabling only_active_experts scheduling
main: n_kv_max = 131072, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 8.638 | 474.17 | 73.191 | 13.99 |
| 4096 | 1024 | 4096 | 9.353 | 437.96 | 68.152 | 15.03 |
| 4096 | 1024 | 8192 | 9.690 | 422.72 | 70.148 | 14.60 |
| 4096 | 1024 | 12288 | 10.253 | 399.50 | 70.674 | 14.49 |
| 4096 | 1024 | 16384 | 10.882 | 376.41 | 91.145 | 11.23 |
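As a sanity check on how the speed columns relate to the timing columns, here is a small Python sketch that recomputes S_PP and S_TG from the table above as tokens divided by seconds. The formula is an assumption based on the column names (PP / T_PP and TG / T_TG), not something taken from the benchmark source, and the `throughput` helper is a made-up name for illustration.

```python
# Rows copied from the table above: (PP, TG, N_KV, T_PP seconds, T_TG seconds)
ROWS = [
    (4096, 1024, 0, 8.638, 73.191),
    (4096, 1024, 4096, 9.353, 68.152),
    (4096, 1024, 8192, 9.690, 70.148),
    (4096, 1024, 12288, 10.253, 70.674),
    (4096, 1024, 16384, 10.882, 91.145),
]


def throughput(tokens: int, seconds: float) -> float:
    """Tokens per second for one benchmark step (hypothetical helper)."""
    return tokens / seconds


for pp, tg, n_kv, t_pp, t_tg in ROWS:
    s_pp = throughput(pp, t_pp)  # prompt processing speed
    s_tg = throughput(tg, t_tg)  # token generation speed
    print(f"N_KV={n_kv:>6}  S_PP={s_pp:7.2f} t/s  S_TG={s_tg:6.2f} t/s")
```

The recomputed values land within a few hundredths of a t/s of the reported ones, which is consistent with the timings being rounded to milliseconds in the table.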
Great to see you again! Looks good! Cheers!
Used it for a day for some work, looks like a solid quant.
For some reason kimi-cli and nanocoder broke entirely. Opencode is the only client I tried that handled the switch from GLM 5 to 5.1 at the same quant. They were all working fine on 5.0, but after the switch kimi-cli can't use any tools and ends up just looping through 50+ calls. Funny enough, under kimi-cli it eventually found a way to jailbreak by running everything via Python scripts. Pretty cool stuff. Maybe kimi-cli and nanocoder use some nuanced tool-use templates, or maybe the default template needs to be adjusted.
Opencode works. For some reason, when the context gets to ~70k, ik_llama prompt caching breaks and it decides it needs to process the whole thing again. Could be an opencode issue. These messages show up in the logs, maybe related, maybe not:
render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues
I wonder if some template needs updating for 5.1.
Thanks for the report. I've only tested with opencode, which is working well in my < 65k context tests. I get those messages too, and I always bake in the original upstream default chat template. You could probably try a custom one with `llama-server --chat-template-file myCustomTemplate.jinja`, but I'm not sure what to change.
I did see a custom template is suggested with gemma-4-31b-it, and with the latest tokenizer fixes that seems to have it working for me finally. https://www.reddit.com/r/LocalLLaMA/comments/1sgl3qz/gemma_4_on_llamacpp_should_be_stable_now/ Just interesting there are two templates, one is "interleaved". Maybe I have to copy paste the original GLM-5.1 chat template from https://huggingface.co/zai-org/GLM-5.1/blob/main/chat_template.jinja and give it the gemma 4 interleaved one as an example and see if it can "fix itself" haha...
I always get an error when trying to run this quant, can someone check these details?
Error:
gguf_init_from_file_ptr: tensor 'output.weight' has invalid ggml type 141. should be in [0, 42)
gguf_init_from_file_ptr: failed to read tensor info
llama_model_load: error loading model: llama_model_loader: failed to load GGUF split from /models/GLM5.1/GLM-5.1-IQ3_KS-00001-of-00008.gguf
Model hashes (matches HF pages):
61e1798dc098a2c442285963c7863ab0be43dcfe3cbfd9cab905b93ffccea9f2 GLM-5.1-IQ3_KS-00001-of-00008.gguf
01e8cb02bc9fcbb20e77a2439eec781e544ef3d27f6348ed06ee845a2d032fe7 GLM-5.1-IQ3_KS-00002-of-00008.gguf
f01b8508b6453c7f577abe08a2e1518e375a16ce0d6376dfc8feb547d7369b89 GLM-5.1-IQ3_KS-00003-of-00008.gguf
e171ebd288174e42d34408059d750a89ba95833016d2306ef874911d57058bbf GLM-5.1-IQ3_KS-00004-of-00008.gguf
eed179a22f4f7554274b5f2480abe42d576e2ee657aa1e8c59da9843f99895ef GLM-5.1-IQ3_KS-00005-of-00008.gguf
6f423703134b02d1dca257092d0ab0ca559315f05e0e9f67c510c8654d276272 GLM-5.1-IQ3_KS-00006-of-00008.gguf
afc13e4dda8dda7501d5708b12893ea9e4bf191d9e9c3f9627e689a4cd4949fb GLM-5.1-IQ3_KS-00007-of-00008.gguf
a5c14491398633fdd452f458f510cb0a54dd400e3c453dd8dda6987937ff715f GLM-5.1-IQ3_KS-00008-of-00008.gguf
llama.cpp versions:
self-compiled b8763-ff5ef8278
ggml-org's b8763 (https://github.com/ggml-org/llama.cpp/releases/tag/b8763)
unslothai's b1-d12cc3d (https://github.com/unslothai/llama.cpp/releases/tag/b8720)
All of those versions run GLM5 UD-Q3_K_XL and would run GLM5.1-UD_Q3_K_XL (except it is slightly too big for my setup).
If someone who has this quant working could check the model hashes I would appreciate it, or any other pointer in the right direction.
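For anyone wanting to repeat the hash check locally, here is a minimal Python sketch using only `hashlib`. The function names `sha256_file` and `verify` are made up for illustration, and `EXPECTED` holds just the first split's hash from the list above; the remaining entries would be added the same way.

```python
import hashlib
from pathlib import Path

# Expected SHA-256 per split, copied from the list above (first split only).
EXPECTED = {
    "GLM-5.1-IQ3_KS-00001-of-00008.gguf":
        "61e1798dc098a2c442285963c7863ab0be43dcfe3cbfd9cab905b93ffccea9f2",
}


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large splits never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify(model_dir: Path, expected: dict) -> dict:
    """Map each file name to True when its SHA-256 matches the expected hex."""
    return {
        name: sha256_file(model_dir / name) == want.lower()
        for name, want in expected.items()
    }
```

Calling `verify(Path("/models/GLM5.1"), EXPECTED)` returns a name-to-bool mapping. Matching hashes only rule out a corrupted download; they say nothing about whether the loader supports the tensor types stored in the file.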
Are the IQ_K quants supported by llama.cpp? Thought those are mostly for ik_llama.
You are right, my mistake. I thought all i-quant support was merged in but I guess I just got used to using the non-k ones and made a wrong assumption.
Yeah check out the quickstart for how to get ik_llama.cpp going or look at their readme for instructions: https://github.com/ikawrakow/ik_llama.cpp/
yes, it is confusing as ik made many of the quant types still used in mainline, before beginning his own version to support newer quantization types (which is mostly what i work with).
there are some precompiled binaries around too if you prefer
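To see up front whether a GGUF file uses tensor types that a given build does not know, the tensor info section can be scanned without loading the model. Below is a rough Python sketch, assuming the GGUF v2/v3 little-endian layout as described in the GGUF spec (header, metadata key/values, then per-tensor name, dims, type, offset). The function names are made up, and the limit of 42 used in the example is simply the upper bound printed in the error message above, so it will differ between builds.

```python
import struct

# Byte sizes of fixed-width GGUF metadata value types, keyed by type id.
_SCALAR_SIZE = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian value described by a struct format string."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of GGUF header")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF string: uint64 length followed by that many UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _skip_value(f, vtype):
    """Advance past one metadata value without decoding it."""
    if vtype in _SCALAR_SIZE:
        f.seek(_SCALAR_SIZE[vtype], 1)
    elif vtype == _STRING:
        f.seek(_read(f, "<Q"), 1)
    elif vtype == _ARRAY:
        etype = _read(f, "<I")
        count = _read(f, "<Q")
        if etype in _SCALAR_SIZE:
            f.seek(_SCALAR_SIZE[etype] * count, 1)
        else:
            for _ in range(count):
                _skip_value(f, etype)
    else:
        raise ValueError(f"unknown metadata value type {vtype}")


def tensor_types(f):
    """Return {tensor name: ggml type id} from an open binary GGUF file."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError("GGUF v1 uses a different layout; not handled here")
    n_tensors = _read(f, "<Q")
    n_kv = _read(f, "<Q")
    for _ in range(n_kv):
        _read_string(f)                    # key, ignored
        _skip_value(f, _read(f, "<I"))     # value type id, then the value
    types = {}
    for _ in range(n_tensors):
        name = _read_string(f)
        n_dims = _read(f, "<I")
        f.seek(8 * n_dims, 1)              # skip the uint64 dimensions
        types[name] = _read(f, "<I")       # ggml type id
        f.seek(8, 1)                       # skip the uint64 data offset
    return types


def unsupported(types, type_count):
    """Tensors whose type id falls outside [0, type_count)."""
    return {name: t for name, t in types.items() if not 0 <= t < type_count}
```

Opening a split with `open(path, "rb")` and printing `unsupported(tensor_types(f), 42)` should list `output.weight` with type 141 if the file matches what the mainline loader complained about. As far as I know each split carries the tensor infos for its own tensors, so every split needs to be scanned.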
