Multi-GPU inference requires 6 patches to modeling_zaya.py

#2
by trohrbaugh - opened

Running with device_map="auto" across multiple GPUs fails with a sequence of cross-device errors. We documented all six bugs and their one-line fixes here: https://huggingface.co/blog/RadicalNotionAI/zaya1-74b-preview-bugs
TL;DR: the embed_tokens lookup, MoE routing indices, expert output cat, router probs multiply, KV cache conv_states, and Flash SDPA mask expansion all assume single-GPU placement. The math SDPA backend must also be forced explicitly.

We strongly recommend using vLLM for inference rather than transformers, and for training with transformers we recommend Flash attention instead of the eager/SDPA backends. Thank you for the fixes, though; we will look into improving the transformers implementation.

For sure. But as you know, the latest vLLM and now transformers v5 have some serious differences that are breaking many models. But those details should help others.

trohrbaugh changed discussion status to closed
