docs: add KTransformers CPU offloading inference guide

#34
by ErvinX - opened

Add KTransformers as a recommended inference option for MiMo-V2-Flash.

KTransformers enables efficient deployment on consumer-grade hardware by offloading MoE expert computations to the CPU while keeping the remaining components on the GPU. With 4× RTX 5090 GPUs and 2× AMD EPYC 9355 CPUs, it achieves a decode speed of up to 35.7 tokens/s.
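The CPU/GPU split can be illustrated with a toy MoE router (a hypothetical numpy sketch, not KTransformers' actual code): the gate scores all experts per token and picks the top-k, and the per-expert matrix multiplies below are the part KTransformers executes on CPU, while attention and the gate itself stay on GPU.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

# Toy expert MLPs, one weight matrix each. In a real MoE-offload setup these
# expert GEMMs are the large, sparsely-activated part moved to CPU memory.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))  # gate stays on GPU in practice

def moe_forward(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ gate_w
    chosen = np.argsort(logits)[-top_k:]      # ids of the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                  # softmax over the chosen experts
    # Only top_k of n_experts run per token -- this sparsity is why
    # CPU offload of experts is viable at interactive decode speeds.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, chosen))

token = rng.standard_normal(d)
out = moe_forward(token)
```

Because only `top_k` experts fire per token, the CPU does a small fraction of the total expert FLOPs each step, which is what keeps decode throughput acceptable despite the slower device.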

Benchmarks: https://ktransformers.net/benchmarks#MiMo-V2-Flash-FP8-TP4

bwshen-mi changed pull request status to merged
