alvarobartt
posted an update 7 days ago
Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!

TL;DR: MoEs can be misleading to reason about from active parameters alone. Each token only activates a subset of experts, but the serving setup still needs to account for the full resident memory footprint.

🧠 hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
πŸ—οΈ Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
⚡ Active params aren't the same as memory footprint, especially for sparse architectures
📦 Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
📚 KV cache can still dominate depending on context length, batch size, and concurrency
🔀 Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
🚀 Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
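
To make the points above concrete, here is a rough back-of-the-envelope sketch of the arithmetic. It is not hf-mem's implementation or API: the `MoEConfig` fields, the function names, and the toy numbers are all hypothetical, and it assumes standard multi-head or grouped-query attention for the KV cache, evenly sized experts, and base weights replicated on every device under EP.

```python
import math
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical, simplified description of an MoE model (not hf-mem's schema)."""
    base_params: int        # non-expert params: attention, embeddings, router, ...
    params_per_expert: int  # params of one routed expert in one MoE layer
    num_experts: int        # routed experts per MoE layer
    experts_per_token: int  # experts each token is routed to (top-k)
    num_moe_layers: int     # layers that contain routed experts
    num_layers: int         # layers that hold attention KV state
    num_kv_heads: int
    head_dim: int


def weight_bytes(cfg: MoEConfig, bytes_per_param: int = 2) -> dict:
    """Split weight memory into base, routed experts, resident, and per-token active."""
    base = cfg.base_params * bytes_per_param
    routed = cfg.params_per_expert * cfg.num_experts * cfg.num_moe_layers * bytes_per_param
    active = cfg.params_per_expert * cfg.experts_per_token * cfg.num_moe_layers * bytes_per_param
    return {
        "base": base,
        "routed_experts": routed,
        "resident": base + routed,         # what must be loaded to serve
        "active_per_token": base + active,  # what a single token actually touches
    }


def kv_cache_bytes(cfg: MoEConfig, context_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache for standard attention: one K and one V tensor per layer."""
    return (2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim
            * context_len * batch_size * bytes_per_elem)


def per_device_weight_bytes(cfg: MoEConfig, ep_size: int, bytes_per_param: int = 2) -> int:
    """Assumes base weights are replicated and experts are split evenly across ep_size devices."""
    experts_per_device = math.ceil(cfg.num_experts / ep_size)
    routed = cfg.params_per_expert * experts_per_device * cfg.num_moe_layers * bytes_per_param
    return cfg.base_params * bytes_per_param + routed


if __name__ == "__main__":
    # Toy numbers chosen only to keep the arithmetic easy to follow.
    cfg = MoEConfig(base_params=1_000, params_per_expert=100, num_experts=8,
                    experts_per_token=2, num_moe_layers=4, num_layers=4,
                    num_kv_heads=2, head_dim=8)
    print(weight_bytes(cfg))
    print(kv_cache_bytes(cfg, context_len=16, batch_size=2))
    print(per_device_weight_bytes(cfg, ep_size=4))
```

Comparing `resident` with `active_per_token` shows the gap between loading memory and active params, and `per_device_weight_bytes` shows how sharding experts shrinks only the routed-expert term. Real deployments also need headroom for activations and runtime overhead, which this sketch ignores.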

Check the repository at https://github.com/alvarobartt/hf-mem