
Conversation

@wenscarl (Contributor) commented Sep 30, 2025

Purpose

Add grouped_gemm_nt_masked from flashinfer to support nvfp4 MoE.

Depends on the silu_and_mul nvfp4 quantization fusion rework.
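
At a high level, grouped_gemm_nt_masked runs one GEMM per expert, but only over the first masked_m[e] rows of that expert's token block; the remaining rows are padding. A minimal PyTorch reference sketch of that contract (names and shapes are illustrative, mirroring the grouped_gemm_ref helper used in the unit test further down):

```python
import torch


def masked_grouped_gemm_ref(
    x: torch.Tensor,         # [num_experts, max_tokens, hidden_dim]
    w: torch.Tensor,         # [num_experts, inter_dim, hidden_dim]
    masked_m: torch.Tensor,  # [num_experts], valid row count per expert
) -> torch.Tensor:
    num_experts, max_tokens, _ = x.shape
    out = torch.zeros(
        num_experts, max_tokens, w.shape[1], dtype=x.dtype, device=x.device
    )
    for e in range(num_experts):
        m = int(masked_m[e])
        if m == 0:
            continue  # inactive expert: its rows stay zero
        # "nt" layout: weights are [N, K], so multiply by the transpose.
        out[e, :m] = x[e, :m] @ w[e].t()
    return out
```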

Test Plan

VLLM_WORKER_MULTIPROC_METHOD="spawn" \
VLLM_ALL2ALL_BACKEND="masked_gemm" \
VLLM_USE_STANDALONE_COMPILE=0 \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND="cutedsl" \
lm_eval --model vllm --model_args pretrained=/dev/shm/checkpoints/nvidia-DeepSeek-R1-0528-FP4,quantization=modelopt_fp4,data_parallel_size=8,enable_expert_parallel=False,tensor_parallel_size=1,max_model_len=2048 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

Test Result

vllm (pretrained=/dev/shm/checkpoints/nvidia-DeepSeek-R1-0528-FP4,quantization=modelopt_fp4,data_parallel_size=8,enable_expert_parallel=True,tensor_parallel_size=1,max_model_len=2048,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9591|±  |0.0055|
|     |       |strict-match    |     5|exact_match|↑  |0.9538|±  |0.0058|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify bot commented Oct 6, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @wenscarl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 6, 2025
@wenscarl wenscarl force-pushed the cutedsl_grp_gemm branch 3 times, most recently from 3d56913 to 99d4080 Compare October 7, 2025 03:55
@bnellnm (Collaborator) commented Oct 7, 2025

There should be existing utilities for a number of these functions, e.g. test_moe, dequantize_nvfp4_to_dtype, etc. Can you switch over to the existing implementations?

It would also be good to add the FlashInferCuteDSLExperts to the test_modular_kernel_combinations.py test. It should be fairly simple to register them in modular_kernel_tools/mk_objects.py. The test already supports nvfp4 so there should not be much additional work.

@varun-sundar-rabindranath (Contributor)

Thanks for working on this! I think this will also help enable gpt-oss + DeepEPLowLatency on Blackwell 🙌

@mergify mergify bot removed the needs-rebase label Oct 10, 2025
@wenscarl wenscarl marked this pull request as ready for review October 10, 2025 03:35
@wenscarl wenscarl requested a review from bnellnm October 10, 2025 03:35
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 4 to 21
import pytest
import torch
from flashinfer import fp4_quantize
from torch.nn import functional as F

from vllm.model_executor.layers.activation import SiluAndMul
from vllm.model_executor.layers.fused_moe.flashinfer_cutedsl_moe import (
    flashinfer_cutedsl_moe_masked,
    scaled_fp4_grouped_quant,
)
from vllm.utils.flashinfer import (
    flashinfer_cutedsl_grouped_gemm_nt_masked as cutedsl_gmm_masked,
)

if torch.cuda.get_device_capability() < (10, 0):
    pytest.skip(


P1: Guard optional FlashInfer/GPU dependencies in new test

The new CUTEDSL MoE test imports flashinfer and calls torch.cuda.get_device_capability() at module import time. In environments without the optional FlashInfer package or without CUDA support, these imports raise ImportError/RuntimeError before pytest has a chance to apply the skip, causing the entire test suite to fail during collection. Wrap the import with pytest.importorskip("flashinfer") and check torch.cuda.is_available() before calling get_device_capability so the module skips cleanly when the dependency or hardware is absent.
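
A minimal sketch of the suggested module-level guard, assuming the test keeps its skip logic at import time (pytest.importorskip and allow_module_level are standard pytest APIs):

```python
import pytest
import torch

# Skip the whole module cleanly if the optional FlashInfer package is missing.
flashinfer = pytest.importorskip("flashinfer")

# Check for a CUDA device before querying its compute capability, so test
# collection does not raise on CPU-only machines.
if not torch.cuda.is_available() or torch.cuda.get_device_capability() < (10, 0):
    pytest.skip(
        "Requires a CUDA device with compute capability 10.0 or newer",
        allow_module_level=True,
    )
```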


mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @wenscarl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Comment on lines 122 to 127
if envs.VLLM_FLASHINFER_MOE_BACKEND == "cutedsl":
    logger.info_once(
        "Skip quantization when using FlashInfer CUTEDSL for "
        "ModelOptNvFp4FusedMoE."
    )
    q_dtype = None
Collaborator:

Quantization can be skipped if the quant_dtype field is left as None in the quant_config.

@wenscarl (Author):

I just want to limit the scope of this temporary change to dispatch, since the whole model is still nvfp4. When fp4 dispatch is supported by DeepEP (it is actually already supported, just not in the main branch), we can remove this.

@mgoin (Member) left a comment

Looks reasonable to me overall; it seems we just need to wait for the flashinfer change to land.

@wenscarl (Author)

@mgoin flashinfer-ai/flashinfer#1927 is merged. Should unblock this PR.

mergify bot commented Nov 12, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @wenscarl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 12, 2025
@wenscarl wenscarl requested review from bnellnm and mgoin November 13, 2025 05:24
@mergify mergify bot removed the needs-rebase label Nov 13, 2025
@bnellnm (Collaborator) left a comment

Overall LGTM. Just had a couple minor comments.

mergify bot commented Nov 14, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @wenscarl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mgoin (Member) left a comment

@wenscarl When I run the test locally, I see a failure for the last case, PTAL

tests/kernels/moe/test_cutedsl_moe.py .......F                                                                                                                                                         [100%]

================================================================================================== FAILURES ==================================================================================================
_________________________________________________________________________________ test_grouped_gemm_nt_masked[16-128-512-5] __________________________________________________________________________________

bs = 16, hidden_dim = 128, inter_dim = 512, topk = 5

    @pytest.mark.parametrize(
        "bs, hidden_dim, inter_dim, topk", [(2, 128, 256, 2), (16, 128, 512, 5)]
    )
    @torch.inference_mode()
    def test_grouped_gemm_nt_masked(
        bs: int, hidden_dim: int, inter_dim: int, topk: int
    ) -> None:
        torch.manual_seed(42)
        B = bs
        D = hidden_dim
        N = inter_dim
        # CuteDSL group gemm has issue when not all experts are active.
        # i.e. masked = [2, 3, 0, 0, 1] where the 2nd and 3rd experts are inactive
        # see https://github.com/flashinfer-ai/flashinfer/issues/1856
        num_experts = bs
        hidden_states = torch.randn(B, D, dtype=torch.bfloat16, device="cuda")
        weights = torch.randn(num_experts, N, D, dtype=torch.bfloat16, device="cuda")
        router_logits = torch.randn(B, num_experts, dtype=torch.float32)
    
        hidden_states_expanded = (
            hidden_states.view(B, -1, D).repeat(1, topk, 1).reshape(-1, D)
        )
        hidden_states_3d, masked_m, topk_idx, _ = prepare_inputs(
            hidden_states_expanded, router_logits, num_experts, topk
        )
    
        a_amax = (
            hidden_states_3d.abs()
            .amax(dim=(1, 2))
            .to(torch.float32)
            .to(hidden_states.device)
        )
        b_amax = weights.abs().amax(dim=(1, 2)).to(torch.float32).to(weights.device)
        a_gs = FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX / a_amax
        b_gs = FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX / b_amax
        out_flashinfer = flashinfer_cutedsl_grouped_gemm_nt_masked(
            hidden_states_3d.to(hidden_states.device), a_gs, weights, b_gs, masked_m
        )
        # reference
        out_ref = grouped_gemm_ref(
            hidden_states_expanded=hidden_states_expanded,
            hidden_states_3d=hidden_states_3d,
            weights=weights,
            topk_idx=topk_idx,
            masked_m=masked_m,
            B=B,
            topk=topk,
            num_experts=num_experts,
        )
        # Note: just to compare the masked position due to cutedsl may write nan
        # into unmasked position.
        for i in range(num_experts):
>           torch.testing.assert_close(
                out_flashinfer.permute(2, 0, 1)[i, : masked_m[i]],
                out_ref.to(out_flashinfer.device)[i, : masked_m[i]],
                atol=1e-1,
                rtol=1e-1,
            )
E           AssertionError: Tensor-likes are not close!
E           
E           Mismatched elements: 1529 / 1536 (99.5%)
E           Greatest absolute difference: 42.5 at index (1, 212) (up to 0.1 allowed)
E           Greatest relative difference: 1.0 at index (0, 0) (up to 0.1 allowed)

tests/kernels/moe/test_cutedsl_moe.py:570: AssertionError

@mergify mergify bot added the ci/build label Nov 17, 2025
@wenscarl wenscarl requested a review from mgoin November 18, 2025 20:27
@wenscarl (Author)

> @wenscarl When I run the test locally, I see a failure for the last case, PTAL


It's because the global scaling factors contained NaN. Fixed by filling them with 1s at initialization.
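
For illustration, a minimal sketch of the kind of fix described above (names are hypothetical; the point is that scale buffers allocated with torch.empty can hold NaN for experts that are never written, so initializing them to 1 keeps inactive experts benign):

```python
import torch

num_experts = 16

# Before: uninitialized memory may contain NaN for experts that receive no
# tokens, and that NaN propagates into the grouped-GEMM output.
a_gs_uninitialized = torch.empty(num_experts, dtype=torch.float32, device="cuda")

# After: a neutral scale of 1.0 for every expert.
a_gs = torch.ones(num_experts, dtype=torch.float32, device="cuda")
```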

@vllm-bot vllm-bot merged commit 613abb5 into vllm-project:main Nov 19, 2025
52 of 54 checks passed
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Nov 19, 2025
Victor49152 pushed a commit to Victor49152/vllm that referenced this pull request Nov 20, 2025
LuminolT pushed a commit to LuminolT/vllm that referenced this pull request Nov 21, 2025
…project#25990)

Signed-off-by: Shu Wang. <[email protected]>
Signed-off-by: mgoin <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Signed-off-by: LuminolT <[email protected]>
bigPYJ1151 pushed a commit that referenced this pull request Nov 25, 2025
Signed-off-by: Shu Wang. <[email protected]>
Signed-off-by: mgoin <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Signed-off-by: jiang1.li <[email protected]>

Labels

ci/build, nvidia, ready

Projects

Status: Done


5 participants