About model.generate and KV cache

#6
by quaternior - opened

Hello, thanks for your impressive work, I really enjoy this project and model.

I have some questions about this model. This example code shows model.generate for convenience reproduction. But it seems that it doesn't use KV cache for decode-like phase. And also, for efficient deployment, I saw that there are some choices such as dInfer (for efficient inference framework) and SGLang (for efficient serving framework). Is it right?

Again, thanks for your amazing work!

inclusionAI org

Thank you for your kind words and for your interest in our project!

you are correct that the provided model.generate script does not currently support KV caching. This specific script is intended to be a minimal, easy-to-read demonstration of the LLaDA 2.1 decoding algorithm rather than a performance-optimized implementation.

For high-performance deployment and production-level serving, we highly recommend using SGLang. It is well-optimized for our model. You can deploy LLaDA 2.1 using the following command:

python3 -m sglang.launch_server \
      --model-path inclusionAI/LLaDA2.1-flash \
      --dllm-algorithm JointThreshold \
      --tp-size 4 \
      --trust-remote-code \
      --mem-fraction-static 0.8 \
      --max-running-requests 1 \
      --attention-backend flashinfer

I hope this helps!

Sign up or log in to comment