Qwen3-4B-NVFP4

NVFP4 (W4A4) quantized version of Qwen/Qwen3-4B, created using llm-compressor with calibrated quantization.

Overview

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-4B |
| Parameters | 4.41B |
| Quantization | NVFP4 (W4A4) |
| Format | `nvfp4-pack-quantized` via `compressed-tensors` |
| Tool | llm-compressor |
| Disk Size | ~3.4 GB |
| VRAM | ~2.2 GB |

This model is intended as a quantized text encoder for Flux 2 Klein 4B image generation pipelines. It is architecturally identical to the Klein 4B text encoder.

Scheme: NVFP4 — 4-bit weights with 4-bit activations, FP8 block scales, FP32 global scale
Targets: All Linear layers (excluding lm_head)
Calibration: 256 samples, sequential pipeline with CPU offloading
Group Size: 16
Scale Dtype: float8_e4m3fn
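
To make the scheme above more concrete, here is a minimal pure-Python sketch of group-wise FP4 "fake quantization". It is an illustration under simplifying assumptions, not the llm-compressor implementation: the per-group scale is kept as an ordinary Python float rather than rounded to `float8_e4m3fn`, the FP32 global scale is omitted, and the helper names (`quantize_group`, `fake_quantize`) are hypothetical.

```python
# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # matches "Group Size: 16" above


def quantize_group(values):
    """Fake-quantize one group; returns (scale, dequantized values)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0.0, [0.0] * len(values)
    # Map the largest magnitude in the group onto 6.0, the largest E2M1 value.
    # Assumption: a real NVFP4 kernel would also round this scale to
    # float8_e4m3fn and fold in an FP32 per-tensor global scale.
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        out.append(nearest * scale if v >= 0 else -nearest * scale)
    return scale, out


def fake_quantize(weights, group_size=GROUP_SIZE):
    """Apply quantize_group to consecutive groups of a flat list."""
    result = []
    for start in range(0, len(weights), group_size):
        _, dequantized = quantize_group(weights[start:start + group_size])
        result.extend(dequantized)
    return result


# Storage cost per quantized weight: 4 data bits plus one 8-bit scale per group.
BITS_PER_WEIGHT = 4 + 8 / GROUP_SIZE  # 4.5

if __name__ == "__main__":
    row = [0.01 * i - 0.08 for i in range(GROUP_SIZE)]
    print(fake_quantize(row))
```

With a group size of 16 and an 8-bit scale per group, each quantized weight costs 4 + 8/16 = 4.5 bits, before counting the global scale and any layers left unquantized.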

Verified against BF16 baseline:

| Scope | Cosine Similarity |
|---|---|
| Layer level (`q_proj`) | 0.991 |
| Full model (4.41B) | 0.907 |
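
For reference, cosine similarity between two outputs can be computed as in the sketch below. The model card does not say which tensors were compared, so this assumes two flattened output vectors of equal length, one from the BF16 baseline and one from the quantized model on the same input. The function name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    baseline = [0.2, -1.3, 0.7, 2.1]    # hypothetical BF16 output
    quantized = [0.25, -1.2, 0.6, 2.0]  # hypothetical NVFP4 output
    print(cosine_similarity(baseline, quantized))
```

A value of 1.0 means the two outputs point in exactly the same direction. The metric ignores overall magnitude, so it does not detect a uniform rescaling of the output.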

Minimum: NVIDIA Blackwell (CC 12.0) for native NVFP4 inference
Fallback: Dequantizes to BF16 on older hardware, including Ada Lovelace, which has no native FP4 support
