Transformers documentation
SpQR
The SpQR quantization algorithm compresses weights in 16x16 tiles using a bi-level grouped 3-bit scheme, while keeping large-magnitude outlier weights in a sparse higher-precision format to preserve accuracy.
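The core idea can be sketched with a toy example. This is an illustrative simplification, not the actual SpQR algorithm: real SpQR operates on 16x16 tiles and also quantizes the per-group scales and zero points themselves (the "bi-level" part), which is omitted here for clarity.

```python
# Toy sketch of SpQR-style grouped 3-bit quantization with sparse outliers.
# Illustrative only; real SpQR additionally quantizes the group statistics.

def quantize_group(values, bits=3, outlier_threshold=3.0):
    """Quantize one group of weights to `bits` bits with a shared scale.
    Values far from the group mean are kept in full precision as sparse
    outliers instead of being quantized."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    # Sparse outliers: index -> original full-precision value.
    outliers = {i: v for i, v in enumerate(values)
                if std > 0 and abs(v - mean) > outlier_threshold * std}
    inliers = [v for i, v in enumerate(values) if i not in outliers]
    lo, hi = min(inliers), max(inliers)
    levels = 2 ** bits - 1          # 3 bits -> codes 0..7
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [0 if i in outliers
             else max(0, min(levels, round((v - lo) / scale)))
             for i, v in enumerate(values)]
    return codes, scale, lo, outliers

def dequantize_group(codes, scale, lo, outliers):
    """Reconstruct the group; outliers override their quantized slots."""
    out = [lo + c * scale for c in codes]
    for i, v in outliers.items():
        out[i] = v
    return out
```

Because the quantization range is computed over the inliers only, one extreme weight no longer inflates the scale for the whole group; it is stored exactly in the sparse outlier structure instead.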

To quantize a model with SpQR, refer to the Vahe1994/SpQR repository.
Load a SpQR-quantized model with from_pretrained().
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model = AutoModelForCausalLM.from_pretrained(
    "elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf",
    dtype=torch.half,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf")
```