[Bugfix] Fix DP/EP Shared Expert With Monolithic Kernels #36061

robertgshaw2-redhat merged 4 commits into main from
Conversation
Code Review
This pull request fixes an issue with shared expert computation in monolithic kernels by passing `shared_experts` to the `FusedMoEKernel` only when the `deepep_low_latency` All-to-All backend is used, since it is the only backend that supports shared expert overlap.
While the intent is correct, the change appears to introduce a critical bug. For other All-to-All backends, the shared expert computation will be skipped entirely: `DefaultMoERunner` delegates shared expert computation to the modular kernel for all All-to-All configurations, but with this change the kernel no longer receives the shared expert module for non-`deepep_ll` backends. I've added comments with details on the issue and suggestions for a fix.
```diff
  shared_experts=(
      shared_experts
-     if moe_config.moe_parallel_config.use_all2all_kernels
+     if moe_config.moe_parallel_config.use_deepep_ll_kernels
```
This change correctly identifies that only the deepep_low_latency backend supports shared expert overlap within the modular kernel. However, this introduces a critical issue for other All-to-All backends (e.g., deepep_high_throughput, mori).
Here's the breakdown of the issue:
- For any All-to-All backend, a `FusedMoEKernel` is created, so `quant_method.mk_owns_shared_expert` becomes `True`.
- This prevents `DefaultMoERunner` from computing the shared experts, as it delegates this task to the modular kernel.
- With this change, for non-`deepep_ll` backends, `shared_experts` is passed as `None` to the `FusedMoEKernel`.
- Consequently, the `FusedMoEKernel` also doesn't compute the shared experts.
This results in the shared expert computation being skipped entirely for these configurations, likely leading to incorrect model outputs.
To fix this, the logic that determines whether the modular kernel "owns" the shared expert computation needs to be updated. For instance, DefaultMoERunner should handle shared experts if use_all2all_kernels is true but use_deepep_ll_kernels is false.
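The ownership rule the comment proposes could be sketched as follows. This is a minimal, hypothetical illustration, not vLLM's actual code: the helper names are invented for clarity, and the two boolean flags stand in for `moe_config.moe_parallel_config.use_all2all_kernels` and `use_deepep_ll_kernels`.

```python
def mk_owns_shared_experts(use_all2all_kernels: bool,
                           use_deepep_ll_kernels: bool) -> bool:
    """Hypothetical sketch: the modular kernel should own shared expert
    computation only when it can actually overlap it, i.e. only for the
    deepep_low_latency backend."""
    return use_all2all_kernels and use_deepep_ll_kernels


def runner_computes_shared_experts(use_all2all_kernels: bool,
                                   use_deepep_ll_kernels: bool) -> bool:
    """The runner must compute shared experts whenever the modular kernel
    does not own them, so the computation is never skipped."""
    return not mk_owns_shared_experts(use_all2all_kernels,
                                      use_deepep_ll_kernels)
```

Under this rule, a `deepep_high_throughput` configuration (`use_all2all_kernels=True`, `use_deepep_ll_kernels=False`) falls back to the runner instead of silently dropping the shared experts.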
```diff
  shared_experts=(
      shared_experts
-     if moe_config.moe_parallel_config.use_all2all_kernels
+     if moe_config.moe_parallel_config.use_deepep_ll_kernels
```
Similar to the change in fp8.py, this modification correctly restricts passing shared_experts to the FusedMoEKernel to only when the deepep_low_latency backend is used. However, it creates the same critical issue for other All-to-All backends.
The shared expert computation will be skipped for backends like `deepep_high_throughput` because:
- `quant_method.mk_owns_shared_expert` will be `True`, so `DefaultMoERunner` won't run the shared experts.
- The `FusedMoEKernel` will receive `shared_experts=None` and will also not run them.
This logic needs to be reconciled to ensure shared experts are always computed. The DefaultMoERunner should likely handle the shared expert computation when a modular kernel is used but does not support shared expert overlap (i.e., when use_all2all_kernels is true but use_deepep_ll_kernels is false).
Hi @robertgshaw2-redhat, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
```diff
  shared_experts=(
      shared_experts
-     if moe_config.moe_parallel_config.use_all2all_kernels
+     if moe_config.moe_parallel_config.use_deepep_ll_kernels
```
Should this condition be `prepare_finalize.supports_async` instead? That's the only case where it actually matters for the modular kernel to call `shared_experts`.
ProExpertProg left a comment
Good with this if @bnellnm is
…t#36061) Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com>
Summary