
[QUESTION] Expert Parallelism with Non-Identical Experts #1342

Open
kevin3567 opened this issue Jan 1, 2025 · 0 comments
I am currently writing a distributed training procedure for an MoE whose experts have different architectures. For example, consider a simple Case A where:

  • I have 8 GPUs
  • GPU 0 runs a 5-layer MLP
  • GPUs 1-3 each run a 3-layer MLP
  • GPUs 4-7 each run a 2-layer MLP
  • All experts run in parallel during the forward and backward passes.
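
For concreteness, a minimal sketch of how Case A's per-rank experts could be instantiated, assuming a PyTorch `torchrun` launch with one process per GPU (the `mlp` helper and the hidden size of 1024 are purely illustrative, not from any library):

```python
# Minimal sketch of Case A: one distinct expert per rank.
import torch
import torch.nn as nn
import torch.distributed as dist

def mlp(depth, hidden=1024):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(hidden, hidden), nn.ReLU()]
    return nn.Sequential(*layers)

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Each rank owns exactly one expert, so expert gradients stay local:
# no all-reduce over expert weights is needed unless data parallelism is added.
if rank == 0:
    expert = mlp(5).cuda()      # GPU 0: 5-layer MLP
elif rank in (1, 2, 3):
    expert = mlp(3).cuda()      # GPUs 1-3: 3-layer MLP
else:
    expert = mlp(2).cuda()      # GPUs 4-7: 2-layer MLP

optimizer = torch.optim.Adam(expert.parameters(), lr=1e-3)
```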

Additionally, consider another "optimized" Case B where:

  • GPUs 0 and 1 together run a 5-layer MLP via tensor parallelism (a TP group spanning just these two processes)
  • GPUs 2-3 each run a 3-layer MLP
  • GPUs 4-7 each run a 2-layer MLP
  • All experts run in parallel during the forward and backward passes.
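
A rough sketch of the process groups for Case B, reusing `mlp` from the sketch above; `build_tp_mlp` is a hypothetical placeholder for whatever shards the 5-layer expert's linear layers across the two TP ranks:

```python
# Minimal sketch of Case B's process groups.
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()

# dist.new_group must be called collectively by all ranks with the same arguments.
tp_group = dist.new_group(ranks=[0, 1])

if rank in (0, 1):
    # Ranks 0 and 1 hold disjoint shards of the 5-layer expert, so the shard
    # gradients need no synchronization; only activations (and activation
    # gradients) are all-reduced inside tp_group during forward/backward.
    expert = build_tp_mlp(depth=5, tp_group=tp_group)  # hypothetical helper
elif rank in (2, 3):
    expert = mlp(3)   # whole 3-layer expert, local to this rank
else:
    expert = mlp(2)   # whole 2-layer expert, local to this rank
```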

How would I go about writing this? Specifically, I am confused about:

  • how to implement efficient gradient computation and synchronization during the backward pass in both cases
  • how to implement efficient all-to-all communication for Case B, which nests TP inside EP
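
For the dispatch/combine step, one pattern used by several MoE implementations is to wrap the all-to-all in a custom autograd function whose backward simply reruns the exchange with the split sizes swapped; uneven splits also let experts of different sizes receive different token counts. A minimal sketch, assuming torch.distributed is already initialized and that the split sizes come from the router's per-expert token counts:

```python
import torch
import torch.distributed as dist

class AllToAll(torch.autograd.Function):
    """All-to-all along dim 0 with uneven splits; backward reverses the exchange."""

    @staticmethod
    def forward(ctx, x, out_splits, in_splits, group):
        ctx.group = group
        ctx.out_splits, ctx.in_splits = out_splits, in_splits
        out = x.new_empty(sum(out_splits), *x.shape[1:])
        dist.all_to_all_single(
            out, x.contiguous(),
            output_split_sizes=out_splits,
            input_split_sizes=in_splits,
            group=group,
        )
        return out

    @staticmethod
    def backward(ctx, grad_out):
        # Gradients travel back along the reverse routes: swap the split sizes.
        grad_in = grad_out.new_empty(sum(ctx.in_splits), *grad_out.shape[1:])
        dist.all_to_all_single(
            grad_in, grad_out.contiguous(),
            output_split_sizes=ctx.in_splits,
            input_split_sizes=ctx.out_splits,
            group=ctx.group,
        )
        return grad_in, None, None, None

# Hypothetical usage per MoE layer:
#   dispatched = AllToAll.apply(tokens, recv_counts, send_counts, ep_group)
#   expert_out = expert(dispatched)
#   combined   = AllToAll.apply(expert_out, send_counts, recv_counts, ep_group)
```

For Case B specifically, tokens routed to the shared 5-layer expert could either be sent to one of the two TP ranks in the all-to-all and then broadcast within tp_group, or both TP ranks could participate in the EP all-to-all directly; which is more efficient presumably depends on token counts and the interconnect.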

Furthermore, are there any examples I can refer to? Thanks in advance.
