Multi-GPU inference requires 6 patches to modeling_zaya.py

by trohrbaugh - opened 23 days ago

Running with device_map="auto" across multiple GPUs fails with a sequence of cross-device errors. We documented all six bugs and their one-line fixes here: https://huggingface.co/blog/RadicalNotionAI/zaya1-74b-preview-bugs
TL;DR: the embed_tokens lookup, MoE routing indices, expert output cat, router probs multiply, KV cache conv_states, and Flash SDPA mask expansion all assume single-GPU placement. The math SDPA backend must also be forced explicitly.

ganeshnanduru

Zyphra org 23 days ago

We strongly recommend using vLLM for inference rather than transformers. For training with transformers, we recommend Flash Attention instead of the eager/SDPA backends. Thank you for the fixes, though; we will look into improving the transformers implementation.

trohrbaugh

23 days ago

For sure. But as you know, the latest vLLM and now transformers v5 have some serious differences that are breaking many models. Still, those details should help others.

trohrbaugh changed discussion status to closed 23 days ago
