[MoE Refactor] Integrate Naive Prepare Finalize into MK #32567

bnellnm · 2026-01-25T22:12:27Z

Should there be an entry for MoEPrepareAndFinalizeNaiveEP?

bnellnm · 2026-01-25T22:11:59Z

Can you add a test/registration for MoEPrepareAndFinalizeNaiveEP?

mgoin · 2026-01-26T20:20:04Z

Are we going to deprecate dispatch_router_logits eventually or keep it around for different cases?

We will still need it for monolithic kernels (trtllm)
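
For readers comparing the two entry points, here is a minimal single-process sketch of why gathering top-k results can stand in for gathering router logits when routing is a per-token operation. It is not vLLM code: `softmax`, `route_topk`, and `gather` are hypothetical helpers, the routing is plain softmax top-k without renormalization, and `gather` only imitates an all-gather along dim 0 by concatenating rows in rank order. That a monolithic kernel needs the logits because it routes internally is an assumption here, not something stated in the thread.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def route_topk(router_logits, topk):
    """Per-token routing (illustrative only): softmax, then keep the top-k experts."""
    weights, ids = [], []
    for row in router_logits:
        probs = softmax(row)
        # Expert indices by descending probability; ties go to the lower id.
        order = sorted(range(len(probs)), key=lambda e: (-probs[e], e))[:topk]
        ids.append(order)
        weights.append([probs[e] for e in order])
    return weights, ids


def gather(per_rank):
    """Stand-in for an all-gather along dim 0: concatenate rows in rank order."""
    return [row for rank_rows in per_rank for row in rank_rows]


# Two hypothetical DP ranks holding 2 and 1 tokens, 4 experts, top-2 routing.
logits_per_rank = [
    [[0.1, 2.0, -1.0, 0.5], [1.5, 0.0, 0.3, 1.4]],
    [[-0.2, 0.1, 3.0, 0.0]],
]

# Path 1 (dispatch_router_logits style): gather logits, route afterwards.
w_after, ids_after = route_topk(gather(logits_per_rank), topk=2)

# Path 2 (dispatch style): route locally, then gather top-k weights and ids.
routed = [route_topk(rank_logits, topk=2) for rank_logits in logits_per_rank]
w_before = gather([w for w, _ in routed])
ids_before = gather([i for _, i in routed])

# Routing is row-wise, so both orders give the same result.
assert ids_after == ids_before
assert w_after == w_before
```

In this toy setup the second path moves `2 * topk` values per token instead of `num_experts`, but it only works when routing can happen before the collective.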

@@ -1131,7 +1131,7 @@ steps:
       - csrc/quantization/cutlass_w8a8/moe/
       - vllm/model_executor/layers/fused_moe/cutlass_moe.py
       - vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py
-      - vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
+      - vllm/model_executor/layers/fused_moe/flashinfer_a2a_prepare_finalize.py
       - vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
       - vllm/v1/attention/backends/flashinfer.py
       - vllm/v1/attention/backends/mla/cutlass_mla.py

@@ -1017,7 +1017,7 @@ steps:
       - csrc/quantization/cutlass_w8a8/moe/
       - vllm/model_executor/layers/fused_moe/cutlass_moe.py
       - vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py
-      - vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
+      - vllm/model_executor/layers/fused_moe/flashinfer_a2a_prepare_finalize.py
       - vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
       - vllm/v1/attention/backends/flashinfer.py
       - vllm/v1/attention/backends/mla/cutlass_mla.py

@@ -85,7 +85,7 @@ steps:
       - csrc/quantization/cutlass_w8a8/moe/
       - vllm/model_executor/layers/fused_moe/cutlass_moe.py
       - vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py
-      - vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
+      - vllm/model_executor/layers/fused_moe/flashinfer_a2a_prepare_finalize.py
       - vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
       - vllm/v1/attention/backends/flashinfer.py
       - vllm/v1/attention/backends/mla/cutlass_mla.py

@@ -197,7 +197,7 @@ def run_cutlass_moe_fp4(
             )
             kernel = mk.FusedMoEModularKernel(
-                MoEPrepareAndFinalizeNoEP(defer_input_quant=True),
+                MoEPrepareAndFinalizeNoEP(),
                 CutlassExpertsFp4(
                     make_dummy_moe_config(),
                     quant_config=quant_config,
@@ -242,7 +242,7 @@ def run_cutlass_from_graph(
             )
             kernel = mk.FusedMoEModularKernel(
-                MoEPrepareAndFinalizeNoEP(defer_input_quant=True),
+                MoEPrepareAndFinalizeNoEP(),
                 CutlassExpertsFp4(
                     make_dummy_moe_config(),
                     quant_config=quant_config,

@@ -36,8 +36,7 @@ th {
     | pplx | batched | fp8,int8 | G,A,T | Y | Y | [`PplxPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.pplx_prepare_finalize.PplxPrepareAndFinalize] |
     | deepep_high_throughput | standard | fp8 | G(128),A,T<sup>2</sup> | Y | Y | [`DeepEPLLPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ll_prepare_finalize.DeepEPLLPrepareAndFinalize] |
     | deepep_low_latency | batched | fp8 | G(128),A,T<sup>3</sup> | Y | Y | [`DeepEPHTPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ht_prepare_finalize.DeepEPHTPrepareAndFinalize] |
-    | flashinfer_all2allv | standard | nvfp4,fp8 | G,A,T | N | N | [`FlashInferAllToAllMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferAllToAllMoEPrepareAndFinalize] |
-    | flashinfer<sup>4</sup> | standard | nvfp4,fp8 | G,A,T | N | N | [`FlashInferCutlassMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferCutlassMoEPrepareAndFinalize] |
+    | flashinfer_all2allv | standard | nvfp4,fp8 | G,A,T | N | N | [`FlashInferA2APrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_a2a_prepare_finalize.FlashInferA2APrepareAndFinalize] |
     | MoEPrepareAndFinalizeNoEP<sup>5</sup> | standard | fp8,int8 | G,A,T | N | Y | [`MoEPrepareAndFinalizeNoEP`][vllm.model_executor.layers.fused_moe.prepare_finalize.MoEPrepareAndFinalizeNoEP] |
     | BatchedPrepareAndFinalize<sup>5</sup> | batched | fp8,int8 | G,A,T | N | Y | [`BatchedPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.fused_batched_moe.BatchedPrepareAndFinalize] |

@@ -22,6 +22,9 @@
     )
     from vllm.forward_context import set_forward_context
     from vllm.model_executor.layers.fused_moe import fused_topk
+    from vllm.model_executor.layers.fused_moe.all2all_utils import (
+        maybe_make_prepare_finalize,
+    )
     from vllm.model_executor.layers.fused_moe.config import (
         FusedMoEConfig,
         FusedMoEParallelConfig,
@@ -40,7 +43,6 @@
         TestMoEQuantConfig,
         expert_info,
         make_fused_experts,
-        make_prepare_finalize,
         prepare_finalize_info,
     )
     from .parallel_utils import ProcessGroupInfo
@@ -603,10 +605,12 @@ def next_power_of_2(x):
             routing_method=RoutingMethodType.DeepSeekV3,
         )
-        # make modular kernel
-        prepare_finalize = make_prepare_finalize(
-            config.prepare_finalize_type, config.all2all_backend(), moe, quant_config
+        prepare_finalize = maybe_make_prepare_finalize(
+            moe=moe,
+            quant_config=quant_config,
+            allow_new_interface=True,
         )
+        assert prepare_finalize is not None
         fused_experts = make_fused_experts(
             config.fused_experts_type,

@@ -7,9 +7,6 @@
     # Fused experts and PrepareFinalize imports
     import vllm.model_executor.layers.fused_moe.modular_kernel as mk
     from vllm.model_executor.layers.fused_moe import TritonExperts
-    from vllm.model_executor.layers.fused_moe.all2all_utils import (
-        maybe_make_prepare_finalize,
-    )
     from vllm.model_executor.layers.fused_moe.batched_deep_gemm_moe import (
         BatchedDeepGemmExperts,
     )
@@ -255,13 +252,12 @@ def expert_info(kind) -> ExpertInfo:
         )
     if has_flashinfer_cutlass_fused_moe() and current_platform.has_device_capability(100):
+        from vllm.model_executor.layers.fused_moe.flashinfer_a2a_prepare_finalize import (  # noqa: E501
+            FlashInferCutlassMoEPrepareAndFinalize,
+        )
         from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import (
             FlashInferExperts,
         )
-        from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize import (  # noqa: E501
-            FlashInferCutlassMoEPrepareAndFinalize,
-            create_flashinfer_prepare_finalize,
-        )
         register_prepare_and_finalize(
             FlashInferCutlassMoEPrepareAndFinalize,
@@ -429,24 +425,6 @@ def expert_info(kind) -> ExpertInfo:
         ]
-    def make_prepare_finalize(
-        prepare_finalize_type: mk.FusedMoEPrepareAndFinalize,
-        backend: str | None,
-        moe: FusedMoEConfig,
-        quant_config: FusedMoEQuantConfig,
-    ) -> mk.FusedMoEPrepareAndFinalize:
-        if backend != "naive" and backend is not None:
-            prepare_finalize = maybe_make_prepare_finalize(moe, quant_config)
-            assert prepare_finalize is not None
-            return prepare_finalize
-        elif prepare_finalize_type == FlashInferCutlassMoEPrepareAndFinalize:
-            return create_flashinfer_prepare_finalize(
-                use_dp=moe.moe_parallel_config.dp_size > 1
-            )
-        else:
-            return MoEPrepareAndFinalizeNoEP()
     def _slice(rank: int, num_local_experts: int, t: torch.Tensor) -> torch.Tensor:
         s = rank * num_local_experts
         e = s + num_local_experts

@@ -59,7 +59,7 @@ def naive_multicast(
             return buffer
-        def dispatch(
+        def dispatch_router_logits(
             self,
             hidden_states: torch.Tensor,
             router_logits: torch.Tensor,
@@ -84,6 +84,34 @@ def dispatch(
             return hidden_states, router_logits
+        def dispatch(
+            self,
+            hidden_states: torch.Tensor,
+            topk_weights: torch.Tensor,
+            topk_ids: torch.Tensor,
+            is_sequence_parallel: bool = False,
+            extra_tensors: list[torch.Tensor] | None = None,
+        ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+            if extra_tensors is not None:
+                raise NotImplementedError(
+                    "extra_tensors is not supported for NaiveAll2AllManager"
+                )
+            sp_size = self.tp_group.world_size if is_sequence_parallel else 1
+            dp_metadata = get_forward_context().dp_metadata
+            assert dp_metadata is not None
+            cu_tokens_across_sp_cpu = dp_metadata.cu_tokens_across_sp(sp_size)
+            hidden_states = self.naive_multicast(
+                hidden_states, cu_tokens_across_sp_cpu, is_sequence_parallel
+            )
+            topk_weights = self.naive_multicast(
+                topk_weights, cu_tokens_across_sp_cpu, is_sequence_parallel
+            )
+            topk_ids = self.naive_multicast(
+                topk_ids, cu_tokens_across_sp_cpu, is_sequence_parallel
+            )
+            return hidden_states, topk_weights, topk_ids
         def combine(
             self, hidden_states: torch.Tensor, is_sequence_parallel: bool = False
         ) -> torch.Tensor:
@@ -114,7 +142,7 @@ class AgRsAll2AllManager(All2AllManagerBase):
         def __init__(self, cpu_group):
             super().__init__(cpu_group)
-        def dispatch(
+        def dispatch_router_logits(
             self,
             hidden_states: torch.Tensor,
             router_logits: torch.Tensor,
@@ -148,6 +176,46 @@ def dispatch(
                 return (gathered_tensors[0], gathered_tensors[1], gathered_tensors[2:])
             return gathered_tensors[0], gathered_tensors[1]
+        def dispatch(
+            self,
+            hidden_states: torch.Tensor,
+            topk_weights: torch.Tensor,
+            topk_ids: torch.Tensor,
+            is_sequence_parallel: bool = False,
+            extra_tensors: list[torch.Tensor] | None = None,
+        ) -> (
+            tuple[torch.Tensor, torch.Tensor, torch.Tensor]
+            | tuple[torch.Tensor, torch.Tensor, torch.Tensor, list[torch.Tensor]]
+        ):
+            """
+            Gather hidden_states, topk_weights, and topk_ids from all dp ranks.
+            """
+            dp_metadata = get_forward_context().dp_metadata
+            assert dp_metadata is not None
+            sizes = dp_metadata.get_chunk_sizes_across_dp_rank()
+            assert sizes is not None
+            dist_group = get_ep_group() if is_sequence_parallel else get_dp_group()
+            assert sizes[dist_group.rank_in_group] == hidden_states.shape[0]
+            tensors_to_gather = [hidden_states, topk_weights, topk_ids]
+            if extra_tensors is not None:
+                tensors_to_gather.extend(extra_tensors)
+            gathered_tensors = dist_group.all_gatherv(
+                tensors_to_gather,
+                dim=0,
+                sizes=sizes,
+            )
+            hidden_states = gathered_tensors[0]
+            topk_weights = gathered_tensors[1]
+            topk_ids = gathered_tensors[2]
+            if extra_tensors is None:
+                return hidden_states, topk_weights, topk_ids
+            return hidden_states, topk_weights, topk_ids, gathered_tensors[3:]
         def combine(
             self, hidden_states: torch.Tensor, is_sequence_parallel: bool = False
         ) -> torch.Tensor:
@@ -216,7 +284,7 @@ def get_handle(self, kwargs):
                 pplx.AllToAll.internode if self.internode else pplx.AllToAll.intranode,
             )
-        def dispatch(
+        def dispatch_router_logits(
             self,
             hidden_states: torch.Tensor,
             router_logits: torch.Tensor,
@@ -225,6 +293,19 @@ def dispatch(
         ) -> tuple[torch.Tensor, torch.Tensor]:
             raise NotImplementedError
+        def dispatch(
+            self,
+            hidden_states: torch.Tensor,
+            topk_weights: torch.Tensor,
+            topk_ids: torch.Tensor,
+            is_sequence_parallel: bool = False,
+            extra_tensors: list[torch.Tensor] | None = None,
+        ) -> (
+            tuple[torch.Tensor, torch.Tensor, torch.Tensor]
+            | tuple[torch.Tensor, torch.Tensor, torch.Tensor, list[torch.Tensor]]
+        ):
+            raise NotImplementedError
         def combine(
             self, hidden_states: torch.Tensor, is_sequence_parallel: bool = False
         ) -> torch.Tensor:
@@ -264,7 +345,7 @@ def __init__(self, cpu_group):
         def get_handle(self, kwargs):
             raise NotImplementedError
-        def dispatch(
+        def dispatch_router_logits(
             self,
             hidden_states: torch.Tensor,
             router_logits: torch.Tensor,
@@ -273,6 +354,19 @@ def dispatch(
         ) -> tuple[torch.Tensor, torch.Tensor]:
             raise NotImplementedError
+        def dispatch(
+            self,
+            hidden_states: torch.Tensor,
+            topk_weights: torch.Tensor,
+            topk_ids: torch.Tensor,
+            is_sequence_parallel: bool = False,
+            extra_tensors: list[torch.Tensor] | None = None,
+        ) -> (
+            tuple[torch.Tensor, torch.Tensor, torch.Tensor]
+            | tuple[torch.Tensor, torch.Tensor, torch.Tensor, list[torch.Tensor]]
+        ):
+            raise NotImplementedError
         def combine(
             self, hidden_states: torch.Tensor, is_sequence_parallel: bool = False
         ) -> torch.Tensor:
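
As a reading aid for the new `dispatch` methods above, the following sketch imitates their data movement in a single process: each rank's rows land in a shared buffer at offsets given by cumulative token counts, and the result is a 3-tuple unless extra tensors are supplied, in which case a fourth element carries the gathered extras. Everything here is hypothetical illustration (`simulate_dispatch`, `extras_per_rank`, nested lists in place of tensors); it performs no collective communication and is not the implementation in this PR.

```python
from itertools import accumulate


def simulate_dispatch(hidden_per_rank, weights_per_rank, ids_per_rank,
                      extras_per_rank=None):
    """Single-process imitation of a variable-size gather along dim 0.

    Each *_per_rank argument is a list with one entry per rank, and each
    entry is that rank's list of rows. extras_per_rank, if given, is a list
    of such per-rank lists, one per extra tensor.
    """
    sizes = [len(rows) for rows in hidden_per_rank]
    cu_tokens = list(accumulate(sizes))  # cumulative token counts per rank
    total = cu_tokens[-1]

    def multicast(per_rank):
        buffer = [None] * total
        for rank, rows in enumerate(per_rank):
            # Every tensor must agree with hidden_states on the token count.
            assert len(rows) == sizes[rank]
            start = 0 if rank == 0 else cu_tokens[rank - 1]
            end = cu_tokens[rank]
            buffer[start:end] = rows  # this rank's slice of the buffer
        return buffer

    hidden = multicast(hidden_per_rank)
    topk_weights = multicast(weights_per_rank)
    topk_ids = multicast(ids_per_rank)
    if extras_per_rank is None:
        return hidden, topk_weights, topk_ids
    extras = [multicast(extra) for extra in extras_per_rank]
    return hidden, topk_weights, topk_ids, extras


# Two hypothetical ranks with 2 and 1 tokens, hidden size 2, top-2 routing.
hidden = [[[1.0, 1.0], [2.0, 2.0]], [[3.0, 3.0]]]
weights = [[[0.7, 0.3], [0.6, 0.4]], [[0.9, 0.1]]]
ids = [[[1, 3], [0, 3]], [[2, 1]]]

out3 = simulate_dispatch(hidden, weights, ids)

# One extra per-token tensor (for example, quantization scales).
scales = [[[0.5], [0.25]], [[0.125]]]
out4 = simulate_dispatch(hidden, weights, ids, extras_per_rank=[scales])
```

Callers that pass extra tensors need to unpack four values rather than three, which is the contract the union return type in the diff describes.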