# DeepSeek-R1-Distill-Llama-8B AWQ (W4, G128)
This repository contains an AWQ-quantized version of DeepSeek-R1-Distill-Llama-8B.
## Model Details
- Base Model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- Quantization: AWQ, 4-bit weights, group size 128, zero point enabled
- Model Size: ~4 GB (≈75% reduction from ~16 GB in FP16)
- Performance: ~16% faster generation throughput, minimal perplexity increase
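The size figures above follow from simple arithmetic. A rough sketch (the 8.03B parameter count is approximate, and a real checkpoint also keeps some layers, such as embeddings, in FP16, so actual file sizes will differ slightly):

```python
# Back-of-the-envelope memory math for W4 G128 quantization with zero points.
params = 8.03e9

fp16_bytes = params * 2        # 2 bytes per FP16 weight
w4_bytes = params * 0.5        # 4 bits = 0.5 bytes per weight
# Per group of 128 weights: one FP16 scale (2 B) + one 4-bit zero point (0.5 B)
group_overhead = params / 128 * (2 + 0.5)

gb = 1e9
print(f"FP16:      {fp16_bytes / gb:.1f} GB")                    # ~16.1 GB
print(f"AWQ W4G128: {(w4_bytes + group_overhead) / gb:.1f} GB")  # ~4.2 GB
```

The per-group scale and zero-point overhead adds only about 2% on top of the raw 4-bit weights, which is why the headline figure stays close to 4 GB.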
## Benchmark Results
| Metric | Dataset / Task | FP16 (Base) | AWQ 4-bit (This Model) | Improvement/Loss |
|---|---|---|---|---|
| Generation Speed | Throughput | 55.6 tokens/sec | 64.6 tokens/sec | +16.2% 🚀 |
| Perplexity (PPL) | WikiText-2 | 13.13 | 13.84 | +0.71 (Minimal) |
| Memory Usage | VRAM Occupancy | ~16 GB | ~4 GB | -75% 📉 |
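The table's relative numbers can be sanity-checked directly: the speedup is the ratio of the two throughputs, and perplexity is the exponential of the mean per-token negative log-likelihood. A small sketch (the NLL values below are toy numbers for illustration, not the actual WikiText-2 losses):

```python
import math

# Speedup from the throughput row: (new - old) / old
speedup = (64.6 - 55.6) / 55.6 * 100
print(f"{speedup:.1f}%")  # 16.2%

# Perplexity = exp(mean NLL per token); toy NLLs for illustration only
nlls = [2.5, 2.6, 2.7, 2.55]
ppl = math.exp(sum(nlls) / len(nlls))
print(f"{ppl:.2f}")
```

Viewed this way, the +0.71 PPL delta corresponds to a very small shift in average per-token loss, which is why it is labeled minimal.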
## Load
Python example using this repo with AWQ:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "benyamini/DeepSeek-R1-Distill-Llama-8B-AWQ-w4g128"

# === Load ===
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda:0",
    trust_remote_code=True,
)

# === Inference ===
model.eval()
inputs = tok("Hello", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```
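To reproduce a throughput number like the one in the benchmark table, one simple approach is to time a `generate()` call for a fixed number of new tokens. A minimal, hypothetical helper (not part of this repository; a rigorous benchmark would also warm up the model and average several runs):

```python
import time

def tokens_per_second(generate_fn, n_new_tokens):
    """Time one generation call and return decode throughput in tokens/sec."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# Usage with the model loaded above (illustrative):
# rate = tokens_per_second(
#     lambda: model.generate(**inputs, max_new_tokens=128, do_sample=False),
#     n_new_tokens=128,
# )
```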
## Files
- `awq_model-w4g128-v2.pt`: quantized weights (v2 format)
- `config.json`, `tokenizer.*`: copied from the base model for convenience