Quantized GLM-4.7-Flash with llama.cpp and opencode.

#66
by ghostwithahat - opened

I had a lot of problems running this model on llama.cpp and using it with opencode. Most of the time, tool usage was broken.

The problems were worst with IQ4_NL quants that I made myself with convert_hf_to_gguf.py and llama-quantize: tool calls were torn apart and full of syntax errors. When I happened to make MXFP4_MOE quants instead and ran the model that way, those problems vanished.

Then I hit the next tool-calling problem. As far as I remember, GLM would revoke a tool call mid-stream, which caused errors in opencode. I therefore asked Codex to write a proxy that buffers the stream, and that finally worked.
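The buffering idea can be sketched like this. This is a minimal, hypothetical helper (not opencode's code, and not the actual proxy) that collapses an OpenAI-style SSE chat stream into one final message, assuming the usual `choices[0].delta` chunk shape:

```python
import json

def merge_sse_stream(sse_lines):
    """Collapse an OpenAI-style SSE chat stream into one final message.

    A buffering proxy can read the *entire* upstream response first,
    run it through a merge like this, and only then answer the client,
    so a tool call the model abandons mid-stream never reaches the
    client half-finished. (Hypothetical sketch, names are illustrative.)
    """
    content = []
    tool_calls = {}  # tool-call index -> accumulated call
    for line in sse_lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        delta = json.loads(line[len("data: "):])["choices"][0].get("delta", {})
        if delta.get("content"):
            content.append(delta["content"])
        for tc in delta.get("tool_calls", []):
            slot = tool_calls.setdefault(
                tc["index"], {"id": "", "name": "", "arguments": ""})
            slot["id"] = tc.get("id") or slot["id"]
            fn = tc.get("function", {})
            slot["name"] = fn.get("name") or slot["name"]
            # argument fragments arrive as partial JSON text; concatenate
            slot["arguments"] += fn.get("arguments", "")
    return {"content": "".join(content),
            "tool_calls": [tool_calls[i] for i in sorted(tool_calls)]}
```

The client then receives one complete, syntactically valid tool call instead of a stream of fragments that may get retracted.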

Maybe someone can make use of my experience. Once I had tool usage working, I tested this model as a programming assistant, but eventually I switched back to gpt-oss-20b.

I also tried opencode and openclaw with this model as the main AI. It turned out that the chat template is the key to serving opencode and openclaw well: you need to set a customized chat template so that their OpenAI-API-compatible completion requests are handled correctly, otherwise the tools and some of the system prompt won't be processed by the model. You also need to configure temperature, top-p and min-p as recommended so that tool calls come out with proper accuracy.
BTW, I didn't quantize this model myself; I just downloaded the UD-Q4_K_XL one from the unsloth repo. I'm very satisfied with this model under llama-server on a 24 GB AMD RX 7900 XTX GPU.

I used temperature, top-p and min-p as recommended, and I used the template from the repository ( https://huggingface.co/zai-org/GLM-4.7-Flash/blob/main/chat_template.jinja ). Do you have a customized working chat template?

Yes, the customized chat template was based on that same template, with just a few lines changed, as below. The previous thinking is kept on purpose, for coding reasoning. The extra message-role matching is for openclaw, which uses a different role when it submits completion requests.

[screenshot 1: modified chat template]

[screenshot 2: modified chat template]

You can see what I had changed.
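For readers who can't view the screenshots, the changes are roughly of this shape. This is an illustrative sketch only, with a made-up role name, not the actual diff:

```jinja
{#- Illustrative sketch, not the real diff (see the screenshots above). -#}
{%- for message in messages -%}
    {#- fold the non-standard role openclaw sends into one the template
        already handles; 'developer' here is a placeholder for whatever
        role it actually uses -#}
    {%- set role = 'user' if message.role == 'developer' else message.role -%}
    {#- and: keep the thinking blocks from previous assistant turns instead
        of stripping them, so earlier coding reasoning stays in context -#}
{%- endfor -%}
```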

Also, please note that you could try the llama-server model parameters I used:

```ini
no-mmap = true
direct-io = true

chat-template-file = D:\llama\templates\openclaw.jinja
jinja = true

reasoning-budget = -1
temp = 0.7
top-p = 0.95
min-p = 0.01
repeat-penalty = 1.0
flash-attn = on

kv-unified = true
parallel = 4
cache-ram = 4096

; ngram-mod should be good for coding
spec-type = ngram-mod
spec-ngram-size-n = 48
spec-ngram-size-m = 64
draft-min = 48
draft-max = 64

cache-type-k = q4_0
cache-type-v = q4_0
```
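Most of these map directly onto llama-server long command-line flags of the same name. A hypothetical invocation, showing only the flags I could verify (the model path is a placeholder; the newer entries such as direct-io, kv-unified, cache-ram, reasoning-budget and the spec-* group follow the same naming if your build supports them):

```shell
# Hypothetical invocation -- model filename is a placeholder.
llama-server \
  -m ./GLM-4.7-Flash-UD-Q4_K_XL.gguf \
  --jinja --chat-template-file ./openclaw.jinja \
  --temp 0.7 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0 \
  --no-mmap --parallel 4 --flash-attn on \
  --cache-type-k q4_0 --cache-type-v q4_0
```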
