Upload README.md with huggingface_hub
README.md ADDED
@@ -0,0 +1,256 @@
---
license: apache-2.0
base_model:
- Qwen/Qwen3.5-0.8B
- Qwen/Qwen3.5-2B
- Qwen/Qwen3.5-4B
- Qwen/Qwen3.5-9B
tags:
- gguf
- quantization
- prism
- qwen3.5
- llama-cpp
- dynamic-quantization
language:
- en
- zh
pipeline_tag: text-generation
library_name: llama.cpp
---

# Qwen3.5 PRISM Dynamic Quantization (GGUF)

**PRISM Dynamic Quantization (PRISM-DQ)** allocates bits per tensor class based on structural weight analysis; it requires no calibration data or importance matrices. Each tensor class (attention keys, FFN gates, SSM components, etc.) receives a quantization type chosen according to its measured sensitivity, while the model as a whole stays within a target bits-per-weight (BPW) budget.

This repo contains PRISM-DQ quantized GGUFs for the full **Qwen3.5** vision-language model family (0.8B, 2B, 4B, 9B), plus multimodal projection weights (mmproj) for vision capabilities.

## Benchmark Results

![PRISM-DQ Benchmark](prism_dq_benchmark.png)

### Perplexity Comparison (UltraChat, 5 chunks, 512 ctx)

| Model | Method | BPW | PPL | Size |
|:------|:-------|----:|----:|-----:|
| **Qwen3.5-0.8B** | Q3_K_M | 4.96 | 12.14 | 470 MB |
| | **PRISM-DQ** | **4.94** | **11.42** | **468 MB** |
| | Q3_K_M (imatrix) | 4.96 | 11.31 | 470 MB |
| | UD-Q3_K_XL | 5.19 | 10.94 | 492 MB |
| | IQ4_XS (imatrix) | 5.20 | 10.35 | 493 MB |
| | UD-Q4_K_XL | 5.89 | 10.07 | 559 MB |
| **Qwen3.5-2B** | Q3_K_M | 4.69 | 9.35 | 1107 MB |
| | **PRISM-DQ** | **4.68** | **9.26** | **1104 MB** |
| | Q3_K_M (imatrix) | 4.69 | 8.40 | 1107 MB |
| | UD-Q3_K_XL | 4.91 | 8.27 | 1159 MB |
| | IQ4_XS (imatrix) | 4.97 | 8.12 | 1173 MB |
| | UD-Q4_K_XL | 5.68 | 8.07 | 1340 MB |
| **Qwen3.5-4B** | Q3_K_M | 4.36 | 6.88 | 2293 MB |
| | **PRISM-DQ** | **4.31** | **6.82** | **2271 MB** |
| | Q3_K_M (imatrix) | 4.36 | 6.62 | 2293 MB |
| | UD-Q3_K_XL | 4.63 | 6.66 | 2436 MB |
| | IQ4_XS (imatrix) | 4.70 | 6.51 | 2477 MB |
| | UD-Q4_K_XL | 5.53 | 6.56 | 2912 MB |
| **Qwen3.5-9B** | Q3_K_M | 4.17 | 6.25 | 4674 MB |
| | **PRISM-DQ** | **4.15** | **6.18** | **4652 MB** |
| | Q3_K_M (imatrix) | 4.17 | 5.96 | 4674 MB |
| | UD-Q3_K_XL | 4.51 | 6.01 | 5054 MB |
| | IQ4_XS (imatrix) | 4.61 | 6.03 | 5169 MB |
| | UD-Q4_K_XL | 5.33 | 5.86 | 5966 MB |
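
The gap between PRISM-DQ and uniform Q3_K_M can be recomputed directly from the perplexity table. The following is a minimal sketch; the helper name `relative_improvement` is illustrative and not part of any tool in this repo, and the numbers are copied from the table rows.

```python
# (model, uniform Q3_K_M PPL, PRISM-DQ PPL), copied from the perplexity table.
ROWS = [
    ("Qwen3.5-0.8B", 12.14, 11.42),
    ("Qwen3.5-2B", 9.35, 9.26),
    ("Qwen3.5-4B", 6.88, 6.82),
    ("Qwen3.5-9B", 6.25, 6.18),
]


def relative_improvement(baseline_ppl: float, new_ppl: float) -> float:
    """Percentage reduction in perplexity relative to the baseline."""
    return 100.0 * (baseline_ppl - new_ppl) / baseline_ppl


for model, q3_ppl, prism_ppl in ROWS:
    pct = relative_improvement(q3_ppl, prism_ppl)
    print(f"{model}: {pct:.1f}% lower PPL than uniform Q3_K_M")
```

The 0.8B model shows the largest gap (about 6%), while the three larger models land near 1%.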

### Key Findings

- **PRISM-DQ beats uniform Q3_K_M** on all 4 models (roughly 1-6% lower PPL) at slightly lower BPW
- **Smallest file size** among the compared quantizations for each model, at competitive perplexity across the Qwen3.5 family
- **No calibration data needed**: allocation decisions are based purely on weight analysis
- When combined with importance matrices, PRISM-DQ+imatrix achieves Pareto-optimal results on 4B and 9B (this combination is not listed in the table above)
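
To make the notion of Pareto optimality concrete, the sketch below computes the non-dominated (BPW, PPL) points among the Qwen3.5-9B rows of the perplexity table. It only covers methods listed there, so it says nothing about the PRISM-DQ+imatrix combination; the function name `pareto_front` is an illustrative choice.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (method, bits per weight, perplexity)

# Qwen3.5-9B rows copied from the perplexity table.
QWEN35_9B: List[Point] = [
    ("Q3_K_M", 4.17, 6.25),
    ("PRISM-DQ", 4.15, 6.18),
    ("Q3_K_M (imatrix)", 4.17, 5.96),
    ("UD-Q3_K_XL", 4.51, 6.01),
    ("IQ4_XS (imatrix)", 4.61, 6.03),
    ("UD-Q4_K_XL", 5.33, 5.86),
]


def pareto_front(points: List[Point]) -> List[str]:
    """Return the methods that no other method beats on both BPW and PPL.

    Lower is better on both axes. A point is dominated if another point is
    no worse on both axes and strictly better on at least one.
    """
    front = []
    for name, bpw, ppl in points:
        dominated = any(
            o_bpw <= bpw and o_ppl <= ppl and (o_bpw < bpw or o_ppl < ppl)
            for _, o_bpw, o_ppl in points
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(QWEN35_9B))
```

Among the listed 9B rows, PRISM-DQ sits on the front as the lowest-BPW point.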

## Model Files

Each subfolder contains the quantized model GGUF, multimodal projection weights, and the chat template. The sizes shown here appear to be in binary units (MiB/GiB), which is why they are smaller than the decimal MB figures in the benchmark table:

```
Qwen3.5-0.8B/
  Qwen3.5-0.8B-PRISM-DQ.gguf    (446 MB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja

Qwen3.5-2B/
  Qwen3.5-2B-PRISM-DQ.gguf      (1.0 GB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja

Qwen3.5-4B/
  Qwen3.5-4B-PRISM-DQ.gguf      (2.1 GB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja

Qwen3.5-9B/
  Qwen3.5-9B-PRISM-DQ.gguf      (4.3 GB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja
```
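
The file sizes in this listing and the sizes in the benchmark table differ only by unit, assuming the table uses decimal megabytes and the listing uses binary units. A short sketch of the conversion (the helper name `mb_to_mib` is illustrative):

```python
def mb_to_mib(size_mb: float) -> float:
    """Convert decimal megabytes (10**6 bytes) to mebibytes (2**20 bytes)."""
    return size_mb * 1_000_000 / (1024 ** 2)


# PRISM-DQ sizes in MB, copied from the benchmark table.
TABLE_SIZES_MB = {"0.8B": 468, "2B": 1104, "4B": 2271, "9B": 4652}

for name, size_mb in TABLE_SIZES_MB.items():
    mib = mb_to_mib(size_mb)
    print(f"Qwen3.5-{name}: {size_mb} MB = {mib:.0f} MiB = {mib / 1024:.1f} GiB")
```

Under that assumption, 468 MB corresponds to about 446 MiB and 4652 MB to about 4.3 GiB, matching the listing.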

## Usage

### Text-only (llama.cpp)

```bash
# Download a model
huggingface-cli download Ex0bit/Qwen3.5-PRISM-Dynamic-Quant-GGUF \
  Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf --local-dir .

# Run with llama-cli
llama-cli -m Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
  -p "You are a helpful assistant." \
  --chat-template-file Qwen3.5-9B/chat_template.jinja \
  -cnv
```
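
The same `llama-cli` invocation applies to the other model sizes with only the folder and file name changed. As a convenience, the sketch below builds the argument list for any of the four sizes; the function `llama_cli_args` is a hypothetical helper, it uses only the flags shown in the text-only example, and it assumes the repo layout described under Model Files has been downloaded into `root`.

```python
import shlex
from pathlib import Path
from typing import List

SIZES = ("0.8B", "2B", "4B", "9B")


def llama_cli_args(
    size: str,
    root: Path = Path("."),
    system_prompt: str = "You are a helpful assistant.",
) -> List[str]:
    """Build the llama-cli argument list for one model size (hypothetical helper)."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    folder = root / f"Qwen3.5-{size}"
    return [
        "llama-cli",
        "-m", str(folder / f"Qwen3.5-{size}-PRISM-DQ.gguf"),
        "-p", system_prompt,
        "--chat-template-file", str(folder / "chat_template.jinja"),
        "-cnv",
    ]


# Print a copy-pasteable command line; pass the list to subprocess.run to launch it.
print(shlex.join(llama_cli_args("4B")))
```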

### Vision (multimodal)

```bash
# Download model + mmproj
huggingface-cli download Ex0bit/Qwen3.5-PRISM-Dynamic-Quant-GGUF \
  Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
  Qwen3.5-9B/mmproj-BF16.gguf --local-dir .

# Run with llama-mtmd-cli
llama-mtmd-cli -m Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
  --mmproj Qwen3.5-9B/mmproj-BF16.gguf \
  --chat-template-file Qwen3.5-9B/chat_template.jinja \
  -cnv
```

### LM Studio / Ollama

These GGUFs work with any llama.cpp-compatible runtime that supports the Qwen3.5 architecture. Simply point your application at the `.gguf` file.

## PRISM-DQ Quantization Recipes

<details>
<summary>Qwen3.5-0.8B (target 3.5 BPW)</summary>

```bash
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q3_K" \
  --tensor-type "attn_output=IQ4_XS" \
  --tensor-type "attn_q=Q3_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "ssm_alpha=Q3_K" \
  --tensor-type "ssm_beta=IQ4_XS" \
  --tensor-type "ssm_out=IQ4_XS" \
  --tensor-type "token_embd=Q3_K" \
  --tensor-type "blk\.(4)\.ssm_beta=Q4_K" \
  --tensor-type "blk\.(18)\.ssm_out=Q4_K" \
  input.gguf output.gguf Q3_K
```

</details>

<details>
<summary>Qwen3.5-2B (target 3.5 BPW)</summary>

```bash
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_output=Q4_K" \
  --tensor-type "attn_q=Q4_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "ssm_alpha=Q4_K" \
  --tensor-type "ssm_beta=Q4_K" \
  --tensor-type "ssm_out=Q3_K" \
  --tensor-type "token_embd=Q3_K" \
  input.gguf output.gguf Q3_K
```

</details>

<details>
<summary>Qwen3.5-4B (target 3.5 BPW)</summary>

```bash
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_output=Q5_K" \
  --tensor-type "attn_q=Q3_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "ssm_alpha=Q4_K" \
  --tensor-type "ssm_beta=Q4_K" \
  --tensor-type "ssm_out=Q3_K" \
  --tensor-type "token_embd=Q3_K" \
  input.gguf output.gguf Q3_K
```

</details>

<details>
<summary>Qwen3.5-9B (target 3.5 BPW)</summary>

```bash
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_output=IQ4_XS" \
  --tensor-type "attn_q=Q4_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "output=Q3_K" \
  --tensor-type "ssm_alpha=Q4_K" \
  --tensor-type "ssm_beta=Q4_K" \
  --tensor-type "ssm_out=Q3_K" \
  --tensor-type "token_embd=Q3_K" \
  input.gguf output.gguf Q3_K
```

</details>
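
Each recipe is a mapping from tensor-name pattern to quantization type, followed by the input file, output file, and base type. One possible way to keep recipes as data and generate the command is sketched below; `llama_quantize_args` and `RECIPE_4B` are illustrative names, and the mapping is copied from the Qwen3.5-4B recipe.

```python
from typing import Dict, List

# Tensor-class overrides copied from the Qwen3.5-4B recipe.
RECIPE_4B: Dict[str, str] = {
    "attn_gate": "Q3_K",
    "attn_k": "Q4_K",
    "attn_output": "Q5_K",
    "attn_q": "Q3_K",
    "attn_qkv": "Q3_K",
    "attn_v": "Q4_K",
    "ffn_down": "Q3_K",
    "ffn_gate": "Q3_K",
    "ffn_up": "Q3_K",
    "ssm_alpha": "Q4_K",
    "ssm_beta": "Q4_K",
    "ssm_out": "Q3_K",
    "token_embd": "Q3_K",
}


def llama_quantize_args(
    recipe: Dict[str, str], src: str, dst: str, base_type: str = "Q3_K"
) -> List[str]:
    """Build a llama-quantize argument list from a pattern -> type mapping."""
    args = ["llama-quantize"]
    for pattern, qtype in recipe.items():
        args += ["--tensor-type", f"{pattern}={qtype}"]
    # Positional arguments come last: input file, output file, base type.
    args += [src, dst, base_type]
    return args


print(" ".join(llama_quantize_args(RECIPE_4B, "input.gguf", "output.gguf")))
```

Because the result is a list, it can be passed to `subprocess.run` without shell quoting, which matters for patterns that contain backslashes such as the per-block overrides in the 0.8B recipe.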

## How PRISM-DQ Works

PRISM Dynamic Quantization analyzes each weight tensor using 7 structural metrics:

1. **PL-Alpha-Hill**: spectral heavy-tail index via eigenvalue analysis
2. **Spectral Dominance**: top singular value ratio (rank-1 approximation quality)
3. **OSQE**: optimal scale quantization error at multiple bit levels (2, 3, 4, 6 bit)
4. **Matrix Imbalance**: max of row/column coefficient of variation
5. **Fragility**: log-ratio of 2-bit vs 4-bit quantization error
6. **Boundary Density**: fraction of values near quantization bin boundaries
7. **Spectral Position Prior**: bidirectional spectral norm product encoding layer position
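
To give a feel for what two of these metrics measure, here is a rough pure-Python sketch of Matrix Imbalance and Fragility. It is not the PRISM-DQ implementation: it assumes the coefficient of variation is taken over row and column L2 norms, and that quantization error means the mean squared error of a symmetric round-to-nearest quantizer with an absmax scale. All function names are illustrative.

```python
import math
from statistics import mean, pstdev
from typing import List

Matrix = List[List[float]]


def coefficient_of_variation(values: List[float]) -> float:
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def matrix_imbalance(w: Matrix) -> float:
    """Max of the CV of row norms and the CV of column norms (assumed definition)."""
    row_norms = [math.sqrt(sum(x * x for x in row)) for row in w]
    col_norms = [math.sqrt(sum(x * x for x in col)) for col in zip(*w)]
    return max(coefficient_of_variation(row_norms),
               coefficient_of_variation(col_norms))


def quantization_mse(w: Matrix, bits: int) -> float:
    """MSE of symmetric round-to-nearest quantization with an absmax scale."""
    flat = [x for row in w for x in row]
    levels = 2 ** (bits - 1) - 1  # 1 level per side at 2 bits, 7 at 4 bits
    scale = max(abs(x) for x in flat) / levels
    if scale == 0.0:
        return 0.0
    return mean((x - round(x / scale) * scale) ** 2 for x in flat)


def fragility(w: Matrix, eps: float = 1e-12) -> float:
    """Log-ratio of 2-bit to 4-bit quantization error (assumed definition)."""
    return math.log((quantization_mse(w, 2) + eps) / (quantization_mse(w, 4) + eps))


W = [[0.9, -0.1, 0.3], [0.05, 0.6, -0.7], [1.0, 0.2, -0.4]]
print(f"imbalance={matrix_imbalance(W):.3f} fragility={fragility(W):.3f}")
```

Under these assumptions, a larger fragility value means the tensor loses much more accuracy when dropped from 4 to 2 bits, which suggests it should receive more bits.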

These metrics are combined into a composite sensitivity score per tensor class. A Lagrangian allocator then distributes bits across classes to minimize total quantization distortion subject to the BPW budget. A per-block refinement step can then override the type of individual tensors, as in the `blk\.(4)\.ssm_beta=Q4_K` override in the 0.8B recipe.
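
The allocation step can be illustrated with a generic Lagrangian bit-allocation sketch. This is not the PRISM-DQ allocator: it assumes a simple distortion model in which error falls as `sensitivity * 4**(-bits)`, uses nominal bits-per-weight values for four llama.cpp quantization types, and runs on made-up class sizes and sensitivity scores. Each class picks the type that minimizes distortion plus a price `lam` per bit, and `lam` is bisected until the average bits fit the budget.

```python
from typing import Dict, List, Tuple

# (type name, nominal bits per weight); treated as approximate assumptions.
CANDIDATES: List[Tuple[str, float]] = [
    ("Q3_K", 3.4375), ("IQ4_XS", 4.25), ("Q4_K", 4.5), ("Q5_K", 5.5),
]
Classes = Dict[str, Tuple[float, float]]  # name -> (parameter count, sensitivity)


def choose(classes: Classes, lam: float) -> Dict[str, Tuple[str, float]]:
    """For a given bit price lam, pick each class's cheapest type."""
    picks = {}
    for name, (n_params, sens) in classes.items():
        picks[name] = min(
            CANDIDATES,
            key=lambda c: n_params * (sens * 4.0 ** (-c[1]) + lam * c[1]),
        )
    return picks


def average_bits(classes: Classes, picks: Dict[str, Tuple[str, float]]) -> float:
    total = sum(n for n, _ in classes.values())
    return sum(classes[k][0] * picks[k][1] for k in picks) / total


def allocate(classes: Classes, budget_bpw: float, iters: int = 60) -> Dict[str, str]:
    """Bisect the Lagrange multiplier until the allocation fits the budget."""
    if budget_bpw < min(bits for _, bits in CANDIDATES):
        raise ValueError("budget is below the smallest candidate type")
    lo, hi = 0.0, 1.0
    while average_bits(classes, choose(classes, hi)) > budget_bpw:
        hi *= 2.0  # raise the bit price until the budget is met
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if average_bits(classes, choose(classes, mid)) > budget_bpw:
            lo = mid
        else:
            hi = mid
    return {name: pick[0] for name, pick in choose(classes, hi).items()}


# Made-up sizes (millions of parameters) and sensitivity scores.
EXAMPLE: Classes = {
    "attn_v": (50.0, 8.0), "attn_output": (50.0, 4.0),
    "ffn_up": (300.0, 1.0), "ffn_down": (300.0, 1.0),
}
print(allocate(EXAMPLE, budget_bpw=3.6))
```

In this toy setup the small, sensitive attention classes end up with more bits than the large FFN classes, which is the same qualitative pattern as in the recipes.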

## License

These models are released under the Apache 2.0 license, consistent with the base Qwen3.5 models.

## Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen) for the Qwen3.5 model family
- [llama.cpp](https://github.com/ggml-org/llama.cpp) for the quantization infrastructure
- Multimodal projection weights sourced from [unsloth](https://huggingface.co/unsloth) GGUF conversions