[Bugfix] Fix DP/EP Shared Expert With Monolithic Kernels #36061
Changes from all commits: c4b139b, 37c2a9d, 566f5ab, d74e051
```diff
@@ -567,7 +567,7 @@ def make_fp8_moe_kernel(
     experts,
     shared_experts=(
         shared_experts
-        if moe_config.moe_parallel_config.use_all2all_kernels
+        if moe_config.moe_parallel_config.use_deepep_ll_kernels
         else None
     ),
     moe_parallel_config=moe_config.moe_parallel_config,
```

**Collaborator:** Should this condition be …
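The hunk above narrows the guard so the shared-experts module is forwarded to the kernel only when the DeepEP low-latency backend is active; otherwise the kernel receives `None` and skips that computation. A minimal sketch of this guard pattern (the class and function names here are illustrative stand-ins, not vLLM's actual API; only the flag name `use_deepep_ll_kernels` comes from the diff):

```python
class FakeKernel:
    """Stand-in for a monolithic MoE kernel that may overlap shared experts."""

    def __init__(self, shared_experts):
        self.shared_experts = shared_experts

    def forward(self, x):
        routed = x * 2  # stand-in for the routed-expert output
        if self.shared_experts is not None:
            # Overlapped shared-expert path: only taken when the module
            # was actually passed in at construction time.
            routed += self.shared_experts(x)
        return routed


shared = lambda x: x + 1  # stand-in shared-experts module

# Mimics the guard in the diff: forward shared_experts only for deepep_ll.
use_deepep_ll_kernels = False  # e.g. a deepep_high_throughput-style config
kernel = FakeKernel(shared if use_deepep_ll_kernels else None)

# With the guard False, the kernel never runs the shared experts:
assert kernel.forward(3) == 6   # 3 * 2, no shared-expert contribution
```

The consequence flagged in the reviews below is that whoever constructs the kernel with `None` must then compute the shared experts elsewhere.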
```diff
@@ -433,7 +433,7 @@ def make_nvfp4_moe_kernel(
     experts,
     shared_experts=(
         shared_experts
-        if moe_config.moe_parallel_config.use_all2all_kernels
+        if moe_config.moe_parallel_config.use_deepep_ll_kernels
         else None
     ),
     moe_parallel_config=moe_config.moe_parallel_config,
```

**Contributor:** Similar to the change in … The shared expert computation will be skipped for backends like … This logic needs to be reconciled to ensure shared experts are always computed. The …
**Review comment:**

This change correctly identifies that only the `deepep_low_latency` backend supports shared expert overlap within the modular kernel. However, this introduces a critical issue for other All-to-All backends (e.g., `deepep_high_throughput`, `mori`). Here's the breakdown of the issue:

1. A `FusedMoEKernel` is created, so `quant_method.mk_owns_shared_expert` becomes `True`.
2. This prevents `DefaultMoERunner` from computing the shared experts, as it delegates this task to the modular kernel.
3. For non-`deepep_ll` backends, `shared_experts` is passed as `None` to the `FusedMoEKernel`.
4. The `FusedMoEKernel` therefore also doesn't compute the shared experts.

This results in the shared expert computation being skipped entirely for these configurations, likely leading to incorrect model outputs.

To fix this, the logic that determines whether the modular kernel "owns" the shared expert computation needs to be updated. For instance, `DefaultMoERunner` should handle shared experts if `use_all2all_kernels` is true but `use_deepep_ll_kernels` is false.
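The ownership mismatch described above can be condensed into a small sketch. Everything here is hypothetical scaffolding modeled on the names used in the review (`mk_owns_shared_expert`, `use_all2all_kernels`, `use_deepep_ll_kernels`); it is not vLLM's actual implementation, only an illustration of the bug and the proposed fix:

```python
from dataclasses import dataclass


@dataclass
class MoEParallelConfig:
    """Minimal stand-in for the flags referenced in the review."""
    use_all2all_kernels: bool
    use_deepep_ll_kernels: bool


def kernel_owns_shared_experts(cfg: MoEParallelConfig) -> bool:
    # After this PR, the modular kernel receives shared_experts (and thus
    # computes them) only for the deepep_low_latency backend.
    return cfg.use_deepep_ll_kernels


def buggy_runner_owns(cfg: MoEParallelConfig) -> bool:
    # The problem the review describes: the runner defers whenever *any*
    # all2all backend is active, even though the kernel only acts for
    # deepep_ll. For deepep_high_throughput / mori, nobody computes them.
    return not cfg.use_all2all_kernels


def fixed_runner_owns(cfg: MoEParallelConfig) -> bool:
    # Proposed fix: the runner covers every case the kernel does not,
    # so ownership is the exact complement of the kernel's condition.
    return not kernel_owns_shared_experts(cfg)


# deepep_high_throughput-style config: all2all active, but not deepep_ll.
cfg = MoEParallelConfig(use_all2all_kernels=True, use_deepep_ll_kernels=False)

# Bug: neither side owns the shared experts, so they are skipped entirely.
assert not (kernel_owns_shared_experts(cfg) or buggy_runner_owns(cfg))

# Fix: exactly one side always owns them, so they are always computed.
assert kernel_owns_shared_experts(cfg) or fixed_runner_owns(cfg)
```

The key design point is that ownership should be defined once, as a single predicate and its complement, rather than as two independently maintained conditions that can drift apart.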