Ex0bit committed on
Commit 1fcf935 · verified · 1 Parent(s): b25cecf

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +256 -0
README.md ADDED
@@ -0,0 +1,256 @@
+ ---
+ license: apache-2.0
+ base_model:
+ - Qwen/Qwen3.5-0.8B
+ - Qwen/Qwen3.5-2B
+ - Qwen/Qwen3.5-4B
+ - Qwen/Qwen3.5-9B
+ tags:
+ - gguf
+ - quantization
+ - prism
+ - qwen3.5
+ - llama-cpp
+ - dynamic-quantization
+ language:
+ - en
+ - zh
+ pipeline_tag: text-generation
+ library_name: llama.cpp
+ ---
+
+ # Qwen3.5 PRISM Dynamic Quantization (GGUF)
+
+ **PRISM Dynamic Quantization (PRISM-DQ)** applies per-tensor-class bit allocation based on structural weight analysis, with no calibration data or importance matrices required. Each tensor class (attention keys, FFN gates, SSM components, etc.) receives a quantization type chosen according to its measured sensitivity, while the model as a whole stays within a target bits-per-weight (BPW) budget.
+
+ This repo contains PRISM-DQ quantized GGUFs for the full **Qwen3.5** vision-language model family (0.8B, 2B, 4B, 9B), plus multimodal projection weights (mmproj) for vision capabilities.
+
+ ## Benchmark Results
+
+ ![Pareto Frontier Analysis](benchmark.png)
+
+ ### Perplexity Comparison (UltraChat, 5 chunks, 512 ctx)
+
+ | Model | Method | BPW | PPL | Size |
+ |:------|:-------|----:|----:|-----:|
+ | **Qwen3.5-0.8B** | Q3_K_M | 4.96 | 12.14 | 470 MB |
+ | | **PRISM-DQ** | **4.94** | **11.42** | **468 MB** |
+ | | Q3_K_M (imatrix) | 4.96 | 11.31 | 470 MB |
+ | | UD-Q3_K_XL | 5.19 | 10.94 | 492 MB |
+ | | IQ4_XS (imatrix) | 5.20 | 10.35 | 493 MB |
+ | | UD-Q4_K_XL | 5.89 | 10.07 | 559 MB |
+ | **Qwen3.5-2B** | Q3_K_M | 4.69 | 9.35 | 1107 MB |
+ | | **PRISM-DQ** | **4.68** | **9.26** | **1104 MB** |
+ | | Q3_K_M (imatrix) | 4.69 | 8.40 | 1107 MB |
+ | | UD-Q3_K_XL | 4.91 | 8.27 | 1159 MB |
+ | | IQ4_XS (imatrix) | 4.97 | 8.12 | 1173 MB |
+ | | UD-Q4_K_XL | 5.68 | 8.07 | 1340 MB |
+ | **Qwen3.5-4B** | Q3_K_M | 4.36 | 6.88 | 2293 MB |
+ | | **PRISM-DQ** | **4.31** | **6.82** | **2271 MB** |
+ | | Q3_K_M (imatrix) | 4.36 | 6.62 | 2293 MB |
+ | | UD-Q3_K_XL | 4.63 | 6.66 | 2436 MB |
+ | | IQ4_XS (imatrix) | 4.70 | 6.51 | 2477 MB |
+ | | UD-Q4_K_XL | 5.53 | 6.56 | 2912 MB |
+ | **Qwen3.5-9B** | Q3_K_M | 4.17 | 6.25 | 4674 MB |
+ | | **PRISM-DQ** | **4.15** | **6.18** | **4652 MB** |
+ | | Q3_K_M (imatrix) | 4.17 | 5.96 | 4674 MB |
+ | | UD-Q3_K_XL | 4.51 | 6.01 | 5054 MB |
+ | | IQ4_XS (imatrix) | 4.61 | 6.03 | 5169 MB |
+ | | UD-Q4_K_XL | 5.33 | 5.86 | 5966 MB |
+
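As a reading aid for the table, the following Python sketch marks which Qwen3.5-0.8B rows are non-dominated when both BPW and PPL should be as low as possible. The `pareto_front` helper is illustrative and not part of PRISM-DQ; the numbers are copied from the 0.8B rows, and the benchmark plot itself may use different axes.

```python
# (BPW, PPL) pairs for Qwen3.5-0.8B, copied from the perplexity table.
ROWS_0_8B = {
    "Q3_K_M": (4.96, 12.14),
    "PRISM-DQ": (4.94, 11.42),
    "Q3_K_M (imatrix)": (4.96, 11.31),
    "UD-Q3_K_XL": (5.19, 10.94),
    "IQ4_XS (imatrix)": (5.20, 10.35),
    "UD-Q4_K_XL": (5.89, 10.07),
}


def pareto_front(rows):
    """Return the methods that no other method beats on both BPW and PPL.

    A point is dominated if another point is no worse on both axes and
    strictly better on at least one (lower is better for both).
    """
    front = []
    for name, (bpw, ppl) in rows.items():
        dominated = any(
            o_bpw <= bpw and o_ppl <= ppl and (o_bpw < bpw or o_ppl < ppl)
            for other, (o_bpw, o_ppl) in rows.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(ROWS_0_8B))
```

With these numbers, every 0.8B row except plain Q3_K_M is on the frontier, and PRISM-DQ sits there as the lowest-BPW point.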
+ ### Key Findings
+
+ - **PRISM-DQ beats uniform Q3_K_M** on all 4 models (roughly 1-6% lower PPL) at the same or lower BPW
+ - **Smallest file size** in each model's comparison, at competitive perplexity across the Qwen3.5 family
+ - **No calibration data needed**: allocation decisions are based purely on weight analysis
+ - When combined with importance matrices, PRISM-DQ+imatrix achieves Pareto-optimal results on 4B and 9B
+
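The 1-6% range in the first finding can be rechecked from the table. This is a small Python sketch; `relative_improvement` is a helper name introduced here, and the inputs are the Q3_K_M and PRISM-DQ perplexities listed in the perplexity comparison.

```python
# (Q3_K_M PPL, PRISM-DQ PPL) per model, copied from the perplexity table.
PPL = {
    "Qwen3.5-0.8B": (12.14, 11.42),
    "Qwen3.5-2B": (9.35, 9.26),
    "Qwen3.5-4B": (6.88, 6.82),
    "Qwen3.5-9B": (6.25, 6.18),
}


def relative_improvement(baseline, candidate):
    """Percentage by which `candidate` perplexity is lower than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


for model, (q3_k_m, prism_dq) in PPL.items():
    print(f"{model}: {relative_improvement(q3_k_m, prism_dq):.2f}% lower PPL")
```

The gap is largest on the 0.8B model (just under 6%) and close to 1% on the three larger models.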
+ ## Model Files
+
+ Each subfolder contains the quantized model GGUF plus multimodal projection weights:
+
+ ```
+ Qwen3.5-0.8B/
+   Qwen3.5-0.8B-PRISM-DQ.gguf (446 MB)
+   mmproj-BF16.gguf
+   mmproj-F16.gguf
+   mmproj-F32.gguf
+   chat_template.jinja
+
+ Qwen3.5-2B/
+   Qwen3.5-2B-PRISM-DQ.gguf (1.0 GB)
+   mmproj-BF16.gguf
+   mmproj-F16.gguf
+   mmproj-F32.gguf
+   chat_template.jinja
+
+ Qwen3.5-4B/
+   Qwen3.5-4B-PRISM-DQ.gguf (2.1 GB)
+   mmproj-BF16.gguf
+   mmproj-F16.gguf
+   mmproj-F32.gguf
+   chat_template.jinja
+
+ Qwen3.5-9B/
+   Qwen3.5-9B-PRISM-DQ.gguf (4.3 GB)
+   mmproj-BF16.gguf
+   mmproj-F16.gguf
+   mmproj-F32.gguf
+   chat_template.jinja
+ ```
+
+ ## Usage
+
+ ### Text-only (llama.cpp)
+
+ ```bash
+ # Download a model and the chat template used below
+ huggingface-cli download Ex0bit/Qwen3.5-PRISM-Dynamic-Quant-GGUF \
+   Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
+   Qwen3.5-9B/chat_template.jinja --local-dir .
+
+ # Run with llama-cli
+ llama-cli -m Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
+   -p "You are a helpful assistant." \
+   --chat-template-file Qwen3.5-9B/chat_template.jinja \
+   -cnv
+ ```
+
+ ### Vision (multimodal)
+
+ ```bash
+ # Download model + mmproj + chat template
+ huggingface-cli download Ex0bit/Qwen3.5-PRISM-Dynamic-Quant-GGUF \
+   Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
+   Qwen3.5-9B/mmproj-BF16.gguf \
+   Qwen3.5-9B/chat_template.jinja --local-dir .
+
+ # Run with llama-mtmd-cli
+ llama-mtmd-cli -m Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
+   --mmproj Qwen3.5-9B/mmproj-BF16.gguf \
+   --chat-template-file Qwen3.5-9B/chat_template.jinja \
+   -cnv
+ ```
+
+ ### LM Studio / Ollama
+
+ These GGUFs work with any llama.cpp-compatible runtime. Simply point your application at the `.gguf` file.
+
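For scripted downloads, the same files can be fetched from Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed and that the repo layout matches the Model Files listing; `repo_files` and `download` are helper names introduced here, not part of any library.

```python
REPO_ID = "Ex0bit/Qwen3.5-PRISM-Dynamic-Quant-GGUF"  # repo name from the CLI commands


def repo_files(size, mmproj_dtype=None):
    """List repo-relative paths for one model size, e.g. size="9B".

    Pass mmproj_dtype ("BF16", "F16" or "F32") to include the vision projector.
    """
    folder = f"Qwen3.5-{size}"
    files = [f"{folder}/{folder}-PRISM-DQ.gguf", f"{folder}/chat_template.jinja"]
    if mmproj_dtype is not None:
        files.append(f"{folder}/mmproj-{mmproj_dtype}.gguf")
    return files


def download(size, mmproj_dtype=None, local_dir="."):
    """Fetch the files with huggingface_hub; requires network access."""
    from huggingface_hub import hf_hub_download  # imported lazily on purpose

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in repo_files(size, mmproj_dtype)
    ]


# Only builds the path list; call download("9B", "BF16") to actually fetch.
print(repo_files("9B", "BF16"))
```

Keeping the path construction separate from the network call makes it easy to check which files will be requested before downloading several gigabytes.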
+ ## PRISM-DQ Quantization Recipes
+
+ <details>
+ <summary>Qwen3.5-0.8B (target 3.5 BPW)</summary>
+
+ ```bash
+ llama-quantize \
+   --tensor-type "attn_gate=Q3_K" \
+   --tensor-type "attn_k=Q3_K" \
+   --tensor-type "attn_output=IQ4_XS" \
+   --tensor-type "attn_q=Q3_K" \
+   --tensor-type "attn_qkv=Q3_K" \
+   --tensor-type "attn_v=Q4_K" \
+   --tensor-type "ffn_down=Q3_K" \
+   --tensor-type "ffn_gate=Q3_K" \
+   --tensor-type "ffn_up=Q3_K" \
+   --tensor-type "ssm_alpha=Q3_K" \
+   --tensor-type "ssm_beta=IQ4_XS" \
+   --tensor-type "ssm_out=IQ4_XS" \
+   --tensor-type "token_embd=Q3_K" \
+   --tensor-type "blk\.(4)\.ssm_beta=Q4_K" \
+   --tensor-type "blk\.(18)\.ssm_out=Q4_K" \
+   input.gguf output.gguf Q3_K
+ ```
+
+ </details>
+
+ <details>
+ <summary>Qwen3.5-2B (target 3.5 BPW)</summary>
+
+ ```bash
+ llama-quantize \
+   --tensor-type "attn_gate=Q3_K" \
+   --tensor-type "attn_k=Q4_K" \
+   --tensor-type "attn_output=Q4_K" \
+   --tensor-type "attn_q=Q4_K" \
+   --tensor-type "attn_qkv=Q3_K" \
+   --tensor-type "attn_v=Q4_K" \
+   --tensor-type "ffn_down=Q3_K" \
+   --tensor-type "ffn_gate=Q3_K" \
+   --tensor-type "ffn_up=Q3_K" \
+   --tensor-type "ssm_alpha=Q4_K" \
+   --tensor-type "ssm_beta=Q4_K" \
+   --tensor-type "ssm_out=Q3_K" \
+   --tensor-type "token_embd=Q3_K" \
+   input.gguf output.gguf Q3_K
+ ```
+
+ </details>
+
+ <details>
+ <summary>Qwen3.5-4B (target 3.5 BPW)</summary>
+
+ ```bash
+ llama-quantize \
+   --tensor-type "attn_gate=Q3_K" \
+   --tensor-type "attn_k=Q4_K" \
+   --tensor-type "attn_output=Q5_K" \
+   --tensor-type "attn_q=Q3_K" \
+   --tensor-type "attn_qkv=Q3_K" \
+   --tensor-type "attn_v=Q4_K" \
+   --tensor-type "ffn_down=Q3_K" \
+   --tensor-type "ffn_gate=Q3_K" \
+   --tensor-type "ffn_up=Q3_K" \
+   --tensor-type "ssm_alpha=Q4_K" \
+   --tensor-type "ssm_beta=Q4_K" \
+   --tensor-type "ssm_out=Q3_K" \
+   --tensor-type "token_embd=Q3_K" \
+   input.gguf output.gguf Q3_K
+ ```
+
+ </details>
+
+ <details>
+ <summary>Qwen3.5-9B (target 3.5 BPW)</summary>
+
+ ```bash
+ llama-quantize \
+   --tensor-type "attn_gate=Q3_K" \
+   --tensor-type "attn_k=Q4_K" \
+   --tensor-type "attn_output=IQ4_XS" \
+   --tensor-type "attn_q=Q4_K" \
+   --tensor-type "attn_qkv=Q3_K" \
+   --tensor-type "attn_v=Q4_K" \
+   --tensor-type "ffn_down=Q3_K" \
+   --tensor-type "ffn_gate=Q3_K" \
+   --tensor-type "ffn_up=Q3_K" \
+   --tensor-type "output=Q3_K" \
+   --tensor-type "ssm_alpha=Q4_K" \
+   --tensor-type "ssm_beta=Q4_K" \
+   --tensor-type "ssm_out=Q3_K" \
+   --tensor-type "token_embd=Q3_K" \
+   input.gguf output.gguf Q3_K
+ ```
+
+ </details>
+
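Since the four recipes differ only in their per-class types, the command line can be generated from a mapping. The sketch below is one way to do that in Python; `quantize_command` is a hypothetical helper, the mapping is copied from the Qwen3.5-2B recipe, and the only `llama-quantize` syntax used is what already appears in the recipes.

```python
import shlex

# Per-class quantization types, copied from the Qwen3.5-2B recipe.
RECIPE_2B = {
    "attn_gate": "Q3_K",
    "attn_k": "Q4_K",
    "attn_output": "Q4_K",
    "attn_q": "Q4_K",
    "attn_qkv": "Q3_K",
    "attn_v": "Q4_K",
    "ffn_down": "Q3_K",
    "ffn_gate": "Q3_K",
    "ffn_up": "Q3_K",
    "ssm_alpha": "Q4_K",
    "ssm_beta": "Q4_K",
    "ssm_out": "Q3_K",
    "token_embd": "Q3_K",
}


def quantize_command(recipe, src, dst, base_type="Q3_K", binary="llama-quantize"):
    """Build an argv list: one --tensor-type flag per entry, then input, output, base type."""
    argv = [binary]
    for pattern, qtype in recipe.items():
        argv += ["--tensor-type", f"{pattern}={qtype}"]
    argv += [src, dst, base_type]
    return argv


cmd = quantize_command(RECIPE_2B, "input.gguf", "output.gguf")
print(shlex.join(cmd))  # shell-quoted, ready to paste or pass to subprocess.run(cmd)
```

Passing the list form to `subprocess.run` avoids shell quoting problems with regex patterns such as the per-block overrides in the 0.8B recipe.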
+ ## How PRISM-DQ Works
+
+ PRISM Dynamic Quantization analyzes each weight tensor using 7 structural metrics:
+
+ 1. **PL-Alpha-Hill**: spectral heavy-tail index via eigenvalue analysis
+ 2. **Spectral Dominance**: top singular value ratio (rank-1 approximation quality)
+ 3. **OSQE**: optimal scale quantization error at multiple bit levels (2, 3, 4, 6 bit)
+ 4. **Matrix Imbalance**: max of row/column coefficient of variation
+ 5. **Fragility**: log-ratio of 2-bit vs 4-bit quantization error
+ 6. **Boundary Density**: fraction of values near quantization bin boundaries
+ 7. **Spectral Position Prior**: bidirectional spectral norm product encoding layer position
+
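To make two of these metrics concrete, here is a small pure-Python sketch of Matrix Imbalance and Fragility. The exact definitions used by PRISM-DQ are not given in this README, so several choices are assumptions: imbalance is computed over row and column L2 norms, and the quantizer is a plain symmetric round-to-nearest scheme with an absmax scale rather than the optimal-scale search that OSQE implies.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


def matrix_imbalance(w):
    """Max of the CVs of the row norms and of the column norms (assumed reading)."""
    row_norms = [math.sqrt(sum(x * x for x in row)) for row in w]
    col_norms = [math.sqrt(sum(x * x for x in col)) for col in zip(*w)]
    return max(coefficient_of_variation(row_norms), coefficient_of_variation(col_norms))


def quant_mse(w, bits):
    """Mean squared error of symmetric round-to-nearest quantization (absmax scale)."""
    flat = [x for row in w for x in row]
    levels = 2 ** (bits - 1) - 1  # 1 step per side at 2 bits, 7 at 4 bits
    scale = max(abs(x) for x in flat) / levels
    err = 0.0
    for x in flat:
        q = max(-levels, min(levels, round(x / scale)))
        err += (x - q * scale) ** 2
    return err / len(flat)


def fragility(w):
    """log(error at 2 bits / error at 4 bits); larger means low-bit quantization hurts more."""
    return math.log(quant_mse(w, 2) / quant_mse(w, 4))


# Toy 3x3 "tensor" with one large-magnitude row, for illustration only.
W = [[0.9, -0.1, 0.3], [0.05, 0.4, -0.7], [2.0, -1.5, 0.2]]
print(matrix_imbalance(W), fragility(W))
```

On real tensors these would be computed with vectorized array code; the pure-Python form is only meant to show what each quantity measures.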
+ These metrics are combined into a composite sensitivity score per tensor class. A Lagrangian allocator then distributes bits across classes to minimize total quantization distortion subject to the BPW budget, with per-block refinement for individual tensor overrides.
+
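The allocation step can be illustrated with a generic Lagrangian bit allocator. This is a sketch of the textbook technique, not the PRISM-DQ implementation: the class sizes and distortion values are made up, whole bit-widths stand in for GGUF quantization types, and the per-block refinement is left out.

```python
def allocate(classes, budget_bpw, iters=60):
    """Pick one bit-width per class under an average bits-per-weight budget.

    classes maps name -> (n_params, {bits: distortion}). For a multiplier lam,
    each class independently minimizes distortion + lam * n_params * bits;
    lam is bisected toward the smallest value whose choice fits the budget.
    """
    total = sum(n for n, _ in classes.values())
    min_bpw = sum(n * min(opts) for n, opts in classes.values()) / total
    if budget_bpw < min_bpw:
        raise ValueError("budget is below the smallest available bit-width")

    def choose(lam):
        return {
            name: min(opts, key=lambda b: opts[b] + lam * n * b)
            for name, (n, opts) in classes.items()
        }

    def bpw(choice):
        return sum(classes[name][0] * b for name, b in choice.items()) / total

    lo, hi = 0.0, 1.0
    while bpw(choose(hi)) > budget_bpw:  # grow hi until its choice fits
        hi *= 2.0
    for _ in range(iters):  # hi always fits, lo never needs to
        mid = (lo + hi) / 2.0
        if bpw(choose(mid)) > budget_bpw:
            lo = mid
        else:
            hi = mid
    choice = choose(hi)
    return choice, bpw(choice)


# Hypothetical classes: (parameter count, {bits: distortion}).
CLASSES = {
    "attn_v": (1_000, {3: 9.0, 4: 3.0, 5: 1.0}),
    "ffn_up": (4_000, {3: 10.0, 4: 4.0, 5: 1.5}),
    "token_embd": (3_000, {3: 2.0, 4: 1.0, 5: 0.5}),
}
choice, achieved = allocate(CLASSES, budget_bpw=3.75)
print(choice, achieved)
```

In this toy setup the small, sensitive `attn_v` class gets the most bits and the insensitive `token_embd` class the fewest, which mirrors the pattern in the recipes where `attn_v` is kept at Q4_K while the FFN tensors stay at Q3_K.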
+ ## License
+
+ This model is released under the Apache 2.0 license, consistent with the base Qwen3.5 models.
+
+ ## Acknowledgments
+
+ - [Qwen Team](https://huggingface.co/Qwen) for the Qwen3.5 model family
+ - [llama.cpp](https://github.com/ggml-org/llama.cpp) for the quantization infrastructure
+ - Multimodal projection weights sourced from [unsloth](https://huggingface.co/unsloth) GGUF conversions