
[QUESTION] Expert Parallelism with Non-Identical Experts #1342

Open
kevin3567 opened this issue Jan 1, 2025 · 0 comments
I am currently writing a distributed training procedure for an MoE whose experts have different architectures. For example, consider a simple Case A where:

  • I have 8 GPUs
  • GPU 0 runs a 5-layer MLP
  • GPUs 1-3 each run a 3-layer MLP
  • GPUs 4-7 each run a 2-layer MLP
  • All experts run in parallel during the forward and backward passes.
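
For concreteness, a minimal sketch of how Case A's per-rank experts could be instantiated, assuming a PyTorch `torchrun` launch with one process per GPU (the `mlp` helper and the hidden size of 1024 are purely illustrative, not from any library):

```python
# Minimal sketch of Case A: one distinct expert per rank.
import torch
import torch.nn as nn
import torch.distributed as dist

def mlp(depth, hidden=1024):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(hidden, hidden), nn.ReLU()]
    return nn.Sequential(*layers)

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Each rank owns exactly one expert, so expert gradients stay local:
# no all-reduce over expert weights is needed unless data parallelism is added.
if rank == 0:
    expert = mlp(5).cuda()      # GPU 0: 5-layer MLP
elif rank in (1, 2, 3):
    expert = mlp(3).cuda()      # GPUs 1-3: 3-layer MLP
else:
    expert = mlp(2).cuda()      # GPUs 4-7: 2-layer MLP

optimizer = torch.optim.Adam(expert.parameters(), lr=1e-3)
```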

Additionally, consider another "optimized" Case B where:

  • GPUs 0 and 1 together run a 5-layer MLP via tensor parallelism (a TP group spanning just these two processes)
  • GPUs 2-3 each run a 3-layer MLP
  • GPUs 4-7 each run a 2-layer MLP
  • All experts run in parallel during the forward and backward passes.
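
A rough sketch of the process groups for Case B, reusing `mlp` from the sketch above; `build_tp_mlp` is a hypothetical placeholder for whatever shards the 5-layer expert's linear layers across the two TP ranks:

```python
# Minimal sketch of Case B's process groups.
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()

# dist.new_group must be called collectively by all ranks with the same arguments.
tp_group = dist.new_group(ranks=[0, 1])

if rank in (0, 1):
    # Ranks 0 and 1 hold disjoint shards of the 5-layer expert, so the shard
    # gradients need no synchronization; only activations (and activation
    # gradients) are all-reduced inside tp_group during forward/backward.
    expert = build_tp_mlp(depth=5, tp_group=tp_group)  # hypothetical helper
elif rank in (2, 3):
    expert = mlp(3)   # whole 3-layer expert, local to this rank
else:
    expert = mlp(2)   # whole 2-layer expert, local to this rank
```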

How would I go about writing this? Specifically, I am confused about:

  • how to implement efficient gradient computation and synchronization during the backward pass in both cases
  • how to implement efficient all-to-all communication for Case B, which nests TP inside EP
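
For the dispatch/combine step, one pattern used by several MoE implementations is to wrap the all-to-all in a custom autograd function whose backward simply reruns the exchange with the split sizes swapped; uneven splits also let experts of different sizes receive different token counts. A minimal sketch, assuming torch.distributed is already initialized and that the split sizes come from the router's per-expert token counts:

```python
import torch
import torch.distributed as dist

class AllToAll(torch.autograd.Function):
    """All-to-all along dim 0 with uneven splits; backward reverses the exchange."""

    @staticmethod
    def forward(ctx, x, out_splits, in_splits, group):
        ctx.group = group
        ctx.out_splits, ctx.in_splits = out_splits, in_splits
        out = x.new_empty(sum(out_splits), *x.shape[1:])
        dist.all_to_all_single(
            out, x.contiguous(),
            output_split_sizes=out_splits,
            input_split_sizes=in_splits,
            group=group,
        )
        return out

    @staticmethod
    def backward(ctx, grad_out):
        # Gradients travel back along the reverse routes: swap the split sizes.
        grad_in = grad_out.new_empty(sum(ctx.in_splits), *grad_out.shape[1:])
        dist.all_to_all_single(
            grad_in, grad_out.contiguous(),
            output_split_sizes=ctx.in_splits,
            input_split_sizes=ctx.out_splits,
            group=ctx.group,
        )
        return grad_in, None, None, None

# Hypothetical usage per MoE layer:
#   dispatched = AllToAll.apply(tokens, recv_counts, send_counts, ep_group)
#   expert_out = expert(dispatched)
#   combined   = AllToAll.apply(expert_out, send_counts, recv_counts, ep_group)
```

For Case B specifically, tokens routed to the shared 5-layer expert could either be sent to one of the two TP ranks in the all-to-all and then broadcast within tp_group, or both TP ranks could participate in the EP all-to-all directly; which is more efficient presumably depends on token counts and the interconnect.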

Furthermore, are there any examples I can refer to? Thanks in advance.
