Qwen3-4B-NVFP4

NVFP4 (W4A4) quantized version of Qwen/Qwen3-4B, created using llm-compressor with calibrated quantization.

Overview

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-4B |
| Parameters | 4.41B |
| Quantization | NVFP4 (W4A4) |
| Format | nvfp4-pack-quantized via compressed-tensors |
| Tool | llm-compressor |
| Disk Size | ~3.4 GB |
| VRAM | ~2.2 GB |
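
As a rough sanity check on the disk size, the footprint can be estimated from the bit widths alone: each quantized weight costs 4 bits plus one 8-bit scale shared by a group of 16 weights, while unquantized tensors stay in BF16 (2 bytes each). The sketch below assumes a vocabulary of 151,936 tokens, a hidden size of 2,560, and separately stored BF16 embedding and lm_head matrices. None of those three assumptions come from this card, and norms, biases, and global scales are ignored. The function name is illustrative.

```python
TOTAL_PARAMS = 4.41e9  # parameter count listed in the overview
VOCAB_SIZE = 151_936   # assumption, not stated in this card
HIDDEN_SIZE = 2_560    # assumption, not stated in this card
GROUP_SIZE = 16        # weights sharing one FP8 block scale


def estimate_bytes(total_params, vocab_size, hidden_size, group_size=GROUP_SIZE):
    """Estimate checkpoint size in bytes under the assumptions stated above."""
    # Assumed unquantized BF16 tensors: token embedding plus lm_head.
    unquantized_params = 2 * vocab_size * hidden_size
    quantized_params = total_params - unquantized_params
    # 4 bits per weight plus an 8-bit scale amortized over each group.
    bits_per_weight = 4 + 8 / group_size
    return quantized_params * bits_per_weight / 8 + unquantized_params * 2


if __name__ == "__main__":
    size = estimate_bytes(TOTAL_PARAMS, VOCAB_SIZE, HIDDEN_SIZE)
    print(f"{size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```

Under these assumptions the estimate comes to roughly 3.6 GB (about 3.35 GiB), which is in the same range as the disk size listed above.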

Intended Use

Intended as a quantized text encoder for Flux 2 Klein 4B image generation pipelines. It is architecturally identical to the Klein 4B text encoder.

Quantization Details

  • Scheme: NVFP4 — 4-bit weights with 4-bit activations, FP8 block scales, FP32 global scale
  • Targets: All Linear layers (excluding lm_head)
  • Calibration: 256 samples, sequential pipeline with CPU offloading
  • Group Size: 16
  • Scale Dtype: float8_e4m3fn
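
To make the scheme concrete, here is a minimal pure-Python sketch of NVFP4-style fake quantization (quantize, then dequantize) for a flat list of weights. It assumes the FP4 E2M1 magnitude grid 0, 0.5, 1, 1.5, 2, 3, 4, 6 and the group size of 16 listed above. For brevity it keeps each block scale as a Python float instead of rounding it to float8_e4m3fn, and it omits the FP32 global scale, so it illustrates the idea and is not llm-compressor's implementation. All function names are hypothetical.

```python
# Magnitudes representable in FP4 E2M1; the sign is handled separately.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # one scale per 16 consecutive weights


def quantize_block(block):
    """Fake-quantize one block: scale into [-6, 6], snap to the grid, rescale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value.
    # Simplification: a real NVFP4 block scale would be rounded to FP8 here.
    scale = amax / E2M1_GRID[-1]
    out = []
    for x in block:
        # Nearest grid magnitude; ties resolve to the smaller value here.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out


def fake_quantize(weights, group_size=GROUP_SIZE):
    """Apply quantize_block to consecutive groups of `group_size` weights."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(quantize_block(weights[start:start + group_size]))
    return out


if __name__ == "__main__":
    row = [((i * 37) % 101 - 50) / 100.0 for i in range(32)]
    deq = fake_quantize(row)
    print("max abs error:", max(abs(a - b) for a, b in zip(row, deq)))
```

Because the widest gap on the grid is between 4 and 6, the rounding error within a block is bounded by one sixth of that block's largest magnitude, which is why a small group size such as 16 helps: an outlier only inflates the scale of its own block.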

Parity

Verified against BF16 baseline:

| Scope | Cosine Similarity |
|---|---|
| Layer level (q_proj) | 0.991 |
| Full model (4.41B) | 0.907 |
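
The parity figures above are cosine similarities between outputs of the BF16 baseline and the quantized model. The following is a minimal sketch of that metric, assuming both outputs have been flattened to equal-length lists of floats. The helper name and the toy vectors are illustrative only; the card does not state which tensors or prompts were compared.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    baseline = [0.12, -0.40, 0.88, 0.05]   # stand-in for a BF16 output
    quantized = [0.10, -0.42, 0.85, 0.09]  # stand-in for the NVFP4 output
    print(f"{cosine_similarity(baseline, quantized):.3f}")
```

A value of 1.0 means the two outputs point in exactly the same direction. The metric ignores overall magnitude, so it is usually read alongside an absolute error measure.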

Hardware Requirements

  • Minimum: NVIDIA Blackwell (compute capability 12.0) for native NVFP4 inference
  • Fallback: Dequantizes to BF16 on older hardware, including Ada Lovelace, which has no native FP4 support