DeepSeek-R1-Distill-Llama-8B AWQ (W4, G128)

This repository contains an AWQ-quantized version of DeepSeek-R1-Distill-Llama-8B.

Model Details

  • Base Model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  • Quantization: AWQ, 4-bit weights, group size 128, zero point enabled
  • Model Size: ~4 GB (75% reduction from ~16 GB FP16)
  • Performance: 16% faster generation, minimal quality loss
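The W4/G128/zero-point config above means each run of 128 weights shares one scale and one zero point, with weights stored as 4-bit integers. A minimal toy sketch of that asymmetric group-wise scheme (AWQ additionally rescales salient channels using activation statistics, which this example omits):

```python
import numpy as np

def quantize_group(w, n_bits=4):
    """Asymmetric (zero-point) quantization of one weight group."""
    qmax = 2**n_bits - 1                    # 15 for 4-bit
    scale = (w.max() - w.min()) / qmax      # one scale per group
    zero = np.round(-w.min() / scale)       # one zero point per group
    q = np.clip(np.round(w / scale + zero), 0, qmax)
    return q.astype(np.uint8), scale, zero

def dequantize_group(q, scale, zero):
    """Recover approximate float weights from 4-bit codes."""
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)  # one group of 128 weights
q, scale, zero = quantize_group(w)
w_hat = dequantize_group(q, scale, zero)

# Reconstruction error is bounded by roughly one quantization step
max_err = np.abs(w - w_hat).max()
print(f"max reconstruction error: {max_err:.4f} (one step is {scale:.4f})")
```

Storing 4-bit codes plus one FP16 scale and zero point per 128 weights is what drives the ~4x size reduction quoted above.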

Benchmark Results

| Metric           | Dataset / Task  | FP16 (Base)     | AWQ 4-bit (This Model) | Improvement/Loss |
|------------------|-----------------|-----------------|------------------------|------------------|
| Generation Speed | Throughput      | 55.6 tokens/sec | 64.6 tokens/sec        | +16.2% 🚀        |
| Perplexity (PPL) | WikiText-2      | 13.13           | 13.84                  | +0.71 (minimal)  |
| Memory Usage     | VRAM Occupancy  | ~16 GB          | ~4 GB                  | -75% 📉          |
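Perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better and small absolute increases (like the +0.71 above) indicate little quality loss. A minimal illustration with made-up token probabilities (the WikiText-2 numbers come from a full evaluation, not this sketch):

```python
import math

# Hypothetical probabilities the model assigned to each token of a short sequence
token_probs = [0.25, 0.10, 0.50, 0.05]

# Perplexity = exp(mean negative log-likelihood over tokens)
nll = [-math.log(p) for p in token_probs]
ppl = math.exp(sum(nll) / len(nll))
print(f"PPL = {ppl:.2f}")  # → PPL = 6.32
```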

Load

Python example loading this repo with transformers (AWQ checkpoints typically also require the autoawq package for the quantized kernels):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "benyamini/DeepSeek-R1-Distill-Llama-8B-AWQ-w4g128"

# === Load ===
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda:0",
    trust_remote_code=True,
)

# === Inference ===
model.eval()
inputs = tok("Hello", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
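DeepSeek-R1 distill models typically emit their chain of thought inside `<think> ... </think>` tags before the final answer, so downstream code often wants to separate the two. A small helper, demonstrated on a hard-coded sample string rather than live model output:

```python
import re

def split_reasoning(text):
    """Separate R1-style <think>...</think> reasoning from the final answer."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()      # no reasoning block found
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()  # everything after the closing tag
    return reasoning, answer

# Hypothetical model output for illustration
sample = "<think>2 + 2 is basic arithmetic; the sum is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(sample)
print(answer)  # → The answer is 4.
```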

Files

  • awq_model-w4g128-v2.pt (quantized weights, v2 format)
  • config.json, tokenizer.* copied from the base model for convenience