docs: add KTransformers CPU offloading inference guide
#34, opened by ErvinX
Add KTransformers as a recommended inference option for MiMo-V2-Flash.
KTransformers enables efficient deployment on consumer-grade hardware by offloading the MoE expert computations to the CPU while keeping the other components on the GPU. On a system with 4× RTX 5090 GPUs and 2× AMD EPYC 9355 CPUs, it reaches a decode speed of up to 35.7 tokens/s.
Benchmarks: https://ktransformers.net/benchmarks#MiMo-V2-Flash-FP8-TP4
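
To illustrate the placement idea described above (MoE experts on CPU, everything else on GPU), here is a minimal sketch in plain Python. It is not the KTransformers API: the function `plan_placement`, the regex `EXPERT_PATTERN`, and the parameter names are hypothetical and only show how a name-based device plan could be built.

```python
import re

# Assumption: routed-expert weights contain ".experts.<index>." in their names.
# Real checkpoints may use a different naming scheme.
EXPERT_PATTERN = re.compile(r"\.experts\.\d+\.")


def plan_placement(param_names, gpu_device="cuda:0", cpu_device="cpu"):
    """Map each parameter name to a device: expert weights to CPU, the rest to GPU."""
    return {
        name: cpu_device if EXPERT_PATTERN.search(name) else gpu_device
        for name in param_names
    }


# Hypothetical parameter names for one MoE layer.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate.weight",
    "model.layers.0.mlp.experts.0.up_proj.weight",
    "model.layers.0.mlp.experts.1.down_proj.weight",
]

placement = plan_placement(names)
for name, device in placement.items():
    print(f"{device:7s} {name}")
```

In this sketch the attention projections and the router (`gate`) stay on the GPU, and only the per-expert weights are assigned to the CPU. The actual inference guide should be followed for the real KTransformers configuration.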
bwshen-mi changed pull request status to merged