---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1
---

# Model Overview

- **Model Architecture:** DeepSeek-R1
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0-Preview
- **Preferred Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
- **Weight quantization:** OCP MXFP4
- **Activation quantization:** OCP MXFP4
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model is a quantized version of [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), an auto-regressive language model built on an optimized transformer architecture. The MXFP4 model was quantized with [AMD-Quark](https://quark.docs.amd.com/latest/index.html).

# Model Quantization

This model was obtained by quantizing the weights and activations of [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) to MXFP4, using the AutoSmoothQuant algorithm in [AMD-Quark](https://quark.docs.amd.com/latest/index.html).

**Quantization scripts:**

```
# Dequantize the FP8 pretrained model to BFloat16, then quantize the
# resulting BFloat16 model with the following script.
cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py --model_dir $MODEL_DIR \
                          --quant_scheme w_mxfp4_a_mxfp4 \
                          --num_calib_data 128 \
                          --exclude_layers "*mlp.gate.*" "*lm_head" \
                          --multi_gpu \
                          --quant_algo autosmoothquant \
                          --model_export hf_format \
                          --output_dir amd/DeepSeek-R1-MXFP4
```
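The FP8-to-BFloat16 dequantization mentioned in the comment above is not spelled out in this card. One way to perform it (an assumption on our part, with placeholder paths, not a step prescribed by this card) is the `fp8_cast_bf16.py` helper shipped in the [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repository:

```
# Example only: dequantize the FP8 release to BFloat16 with DeepSeek's helper
# script, then point quantize_quark.py's --model_dir at the BF16 output.
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
python3 fp8_cast_bf16.py \
    --input-fp8-hf-path /path/to/DeepSeek-R1 \
    --output-bf16-hf-path /path/to/DeepSeek-R1-BF16   # use this as $MODEL_DIR
```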
# Deployment

### Use with SGLang

This model can be deployed efficiently using the [SGLang](https://docs.sglang.ai/) backend, as sketched below.
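As a minimal serving sketch: the launch flags below mirror the AIME2024 evaluation command in the Reproduction section, while the `curl` query is an illustrative assumption relying on SGLang's default port (30000) and its OpenAI-compatible completions API.

```
# Launch an SGLang server with 8-way tensor parallelism.
python3 -m sglang.launch_server \
    --model amd/DeepSeek-R1-MXFP4 \
    --tp 8 \
    --trust-remote-code

# Once the server is up, query the OpenAI-compatible completions endpoint.
curl http://localhost:30000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "amd/DeepSeek-R1-MXFP4", "prompt": "Explain MXFP4 quantization in one sentence.", "max_tokens": 128}'
```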
## Evaluation

The model was evaluated on AIME2024, GPQA Diamond, and GSM8K. Evaluation was conducted using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework with the SGLang and vLLM engines (see the Reproduction section below).

### Accuracy

| Benchmark | DeepSeek-R1 | DeepSeek-R1-MXFP4 (this model) | Recovery |
|-----------|-------------|--------------------------------|----------|
| AIME2024 | 78.00 | 76.00 | 97.44% |
| GPQA Diamond | 68.89 | 68.18 | 98.97% |
| GSM8K | 95.81 | 95.42 | 99.59% |
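In the table above, Recovery is the MXFP4 score expressed as a percentage of the baseline DeepSeek-R1 score; for example, the AIME2024 row gives 76.00 / 78.00 ≈ 97.44%.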
### Reproduction

The results were obtained using the following commands:

#### AIME2024

```
# Launch the SGLang server.
python3 -m sglang.launch_server \
    --model amd/DeepSeek-R1-MXFP4 \
    --tp 8 \
    --trust-remote-code \
    --n-share-experts-fusion 8 \
    --disable-radix-cache

# Run AIME2024 via lm-evaluation-harness against the local endpoint.
lm_eval --model local-completions \
    --model_args model=amd/DeepSeek-R1-MXFP4,base_url=http://localhost:30000/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=32000,temperature=0.6,top_p=0.95 \
    --tasks aime24 \
    --num_fewshot 0 \
    --gen_kwargs "do_sample=True,temperature=0.6,top_p=0.95,max_tokens=32000" \
    --batch_size auto \
    --log_samples \
    --output_path output_data/DeepSeek-R1-MXFP4
```

#### GSM8K

```
lm_eval \
    --model vllm \
    --model_args pretrained=amd/DeepSeek-R1-MXFP4,dtype=auto,add_bos_token=True,tensor_parallel_size=$tp_size,gpu_memory_utilization=0.8,max_model_len=38768 \
    --tasks gsm8k \
    --num_fewshot 8 \
    --batch_size auto \
    --device cuda
```

# License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.