# DeepSeek-R1-Distill-Llama-8B AWQ (W4, G128)
This repository contains an AWQ-quantized version of DeepSeek-R1-Distill-Llama-8B.
## Model Details
- Base Model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- Quantization: AWQ, 4-bit weights, group size 128, zero point enabled
- Model Size: ~4 GB (≈75% reduction from ~16 GB in FP16)
- Performance: ~16% faster generation throughput, minimal perplexity increase
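The size figures above follow from simple arithmetic. A rough sketch (the 8.03B parameter count is approximate, and a real checkpoint also keeps some layers, such as embeddings, in FP16, so actual file sizes will differ slightly):

```python
# Back-of-the-envelope memory math for W4 G128 quantization with zero points.
params = 8.03e9

fp16_bytes = params * 2        # 2 bytes per FP16 weight
w4_bytes = params * 0.5        # 4 bits = 0.5 bytes per weight
# Per group of 128 weights: one FP16 scale (2 B) + one 4-bit zero point (0.5 B)
group_overhead = params / 128 * (2 + 0.5)

gb = 1e9
print(f"FP16:      {fp16_bytes / gb:.1f} GB")                    # ~16.1 GB
print(f"AWQ W4G128: {(w4_bytes + group_overhead) / gb:.1f} GB")  # ~4.2 GB
```

The per-group scale and zero-point overhead adds only about 2% on top of the raw 4-bit weights, which is why the headline figure stays close to 4 GB.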
## Benchmark Results
| Metric | Dataset / Task | FP16 (Base) | AWQ 4-bit (This Model) | Improvement/Loss |
|---|---|---|---|---|
| Generation Speed | Throughput | 55.6 tokens/sec | 64.6 tokens/sec | +16.2% 🚀 |
| Perplexity (PPL) | WikiText-2 | 13.13 | 13.84 | +0.71 (Minimal) |
| Memory Usage | VRAM Occupancy | ~16 GB | ~4 GB | -75% 📉 |
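The table's relative numbers can be sanity-checked directly: the speedup is the ratio of the two throughputs, and perplexity is the exponential of the mean per-token negative log-likelihood. A small sketch (the NLL values below are toy numbers for illustration, not the actual WikiText-2 losses):

```python
import math

# Speedup from the throughput row: (new - old) / old
speedup = (64.6 - 55.6) / 55.6 * 100
print(f"{speedup:.1f}%")  # 16.2%

# Perplexity = exp(mean NLL per token); toy NLLs for illustration only
nlls = [2.5, 2.6, 2.7, 2.55]
ppl = math.exp(sum(nlls) / len(nlls))
print(f"{ppl:.2f}")
```

Viewed this way, the +0.71 PPL delta corresponds to a very small shift in average per-token loss, which is why it is labeled minimal.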
## Load
Python example using this repo with AWQ:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "benyamini/DeepSeek-R1-Distill-Llama-8B-AWQ-w4g128"

# === Load ===
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda:0",
    trust_remote_code=True,
)

# === Inference ===
model.eval()
inputs = tok("Hello", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```
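To reproduce a throughput number like the one in the benchmark table, one simple approach is to time a `generate()` call for a fixed number of new tokens. A minimal, hypothetical helper (not part of this repository; a rigorous benchmark would also warm up the model and average several runs):

```python
import time

def tokens_per_second(generate_fn, n_new_tokens):
    """Time one generation call and return decode throughput in tokens/sec."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# Usage with the model loaded above (illustrative):
# rate = tokens_per_second(
#     lambda: model.generate(**inputs, max_new_tokens=128, do_sample=False),
#     n_new_tokens=128,
# )
```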
## Files
- `awq_model-w4g128-v2.pt`: quantized weights (v2 format)
- `config.json`, `tokenizer.*`: copied from the base model for convenience