2 changes: 1 addition & 1 deletion vllm/model_executor/layers/fused_moe/oracle/fp8.py
@@ -567,7 +567,7 @@ def make_fp8_moe_kernel(
     experts,
     shared_experts=(
         shared_experts
-        if moe_config.moe_parallel_config.use_all2all_kernels
+        if moe_config.moe_parallel_config.use_deepep_ll_kernels
Contributor review comment (critical):
This change correctly identifies that only the deepep_low_latency backend supports shared-expert overlap within the modular kernel. However, it introduces a critical issue for the other all-to-all backends (e.g., deepep_high_throughput, mori).

Here is the breakdown of the issue:

1. For any all-to-all backend, a FusedMoEKernel is created, so quant_method.mk_owns_shared_expert becomes True.
2. That prevents DefaultMoERunner from computing the shared experts, since it delegates the task to the modular kernel.
3. With this change, for non-deepep_ll backends, shared_experts is passed as None to the FusedMoEKernel.
4. Consequently, the FusedMoEKernel does not compute the shared experts either.

The shared-expert computation is therefore skipped entirely for these configurations, likely producing incorrect model outputs.

To fix this, the logic that determines whether the modular kernel "owns" the shared-expert computation needs to be updated. For instance, DefaultMoERunner should handle the shared experts when use_all2all_kernels is true but use_deepep_ll_kernels is false.
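A minimal sketch of the ownership rule this comment proposes, using a stand-in config class (the dataclass and both helpers are illustrative, not the actual vLLM API):

```python
from dataclasses import dataclass


@dataclass
class MoEParallelConfig:
    # Stand-in for vLLM's parallel config; only the two relevant flags.
    use_all2all_kernels: bool
    use_deepep_ll_kernels: bool


def mk_owns_shared_expert(cfg: MoEParallelConfig) -> bool:
    # The modular kernel should only own the shared experts when the
    # backend can actually overlap them, i.e. deepep_low_latency.
    return cfg.use_deepep_ll_kernels


def runner_handles_shared_experts(cfg: MoEParallelConfig) -> bool:
    # DefaultMoERunner computes the shared experts whenever the modular
    # kernel does not own them, covering backends such as
    # deepep_high_throughput and mori.
    return not mk_owns_shared_expert(cfg)
```

Under this rule, deepep_high_throughput (use_all2all_kernels=True, use_deepep_ll_kernels=False) falls back to the runner, so the shared experts are still computed exactly once.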

Collaborator review comment:
Should this condition be prepare_finalize.supports_async? That's the only time it really matters for the MK to call shared_experts.
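A hedged sketch of that alternative: gate on the prepare/finalize object's async support rather than on the backend name (supports_async is taken from the comment above; the helper itself is illustrative, not vLLM code):

```python
def select_shared_experts(prepare_finalize, shared_experts):
    # Hand shared_experts to the modular kernel only when prepare/finalize
    # supports async dispatch, the one case where the kernel can overlap
    # the shared-expert computation with the all-to-all.
    if getattr(prepare_finalize, "supports_async", False):
        return shared_experts
    return None
```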

else None
),
moe_parallel_config=moe_config.moe_parallel_config,
2 changes: 1 addition & 1 deletion vllm/model_executor/layers/fused_moe/oracle/nvfp4.py
@@ -433,7 +433,7 @@ def make_nvfp4_moe_kernel(
     experts,
     shared_experts=(
         shared_experts
-        if moe_config.moe_parallel_config.use_all2all_kernels
+        if moe_config.moe_parallel_config.use_deepep_ll_kernels
Contributor review comment (critical):
Similar to the change in fp8.py, this modification correctly restricts passing shared_experts to the FusedMoEKernel to the deepep_low_latency backend only. However, it creates the same critical issue for the other all-to-all backends.

The shared-expert computation will be skipped for backends such as deepep_high_throughput because:

1. quant_method.mk_owns_shared_expert will be True, so DefaultMoERunner will not run the shared experts.
2. The FusedMoEKernel will receive shared_experts=None and will not run them either.

This logic needs to be reconciled so that the shared experts are always computed. DefaultMoERunner should likely handle the shared-expert computation whenever a modular kernel is used but does not support shared-expert overlap (i.e., when use_all2all_kernels is true but use_deepep_ll_kernels is false).
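The failure mode can be made concrete with a small check of the "exactly one owner" invariant (the class and helper names below are hypothetical, not vLLM code):

```python
from dataclasses import dataclass


@dataclass
class ParallelCfg:
    use_all2all_kernels: bool
    use_deepep_ll_kernels: bool


def kernel_runs_shared(cfg: ParallelCfg) -> bool:
    # After this diff, the modular kernel only receives shared_experts
    # for the deepep_low_latency backend.
    return cfg.use_deepep_ll_kernels


def runner_runs_shared(mk_owns_shared_expert: bool) -> bool:
    # DefaultMoERunner skips the shared experts whenever the modular
    # kernel claims ownership of them.
    return not mk_owns_shared_expert


def shared_expert_owners(cfg: ParallelCfg, mk_owns_shared_expert: bool) -> int:
    # Exactly one component should compute the shared experts.
    return int(kernel_runs_shared(cfg)) + int(runner_runs_shared(mk_owns_shared_expert))


# deepep_high_throughput: all2all kernels without low-latency support.
# If mk_owns_shared_expert is still derived from use_all2all_kernels (True),
# neither component computes the shared experts.
ht = ParallelCfg(use_all2all_kernels=True, use_deepep_ll_kernels=False)
```

With mk_owns_shared_expert=True the owner count for this config is 0 (the bug); deriving ownership from use_deepep_ll_kernels instead restores exactly one owner.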

else None
),
moe_parallel_config=moe_config.moe_parallel_config,