Flashinfer_CUTLASS_MOE fuses quantization for TP #27223
pavanimajety merged 6 commits into vllm-project:main from
Conversation
Force-pushed from 1057c27 to 4c37b7d
bnellnm
left a comment
LGTM. Alternatively, you could prevent modular kernels from being created in this case and fall through to the direct call to flashinfer_cutlass_moe_fp4?
Yes. But do you prefer to have modular kernels anyway?
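For illustration, a minimal sketch of the fall-through bnellnm describes, assuming modular kernels are only built when all2all communication is involved; all names here are stand-ins, not exact vLLM APIs:

```python
# Sketch only: `maybe_make_modular_kernel` is a hypothetical helper, and
# `flashinfer_cutlass_moe_fp4` is stubbed to keep the example runnable.

def flashinfer_cutlass_moe_fp4(hidden_states):
    # Stand-in for the fused FlashInfer CUTLASS FP4 MoE entry point.
    return hidden_states

def maybe_make_modular_kernel(uses_all2all: bool):
    # Only build a modular kernel when all2all communication is involved.
    return (lambda x: x) if uses_all2all else None

def moe_forward(hidden_states, uses_all2all: bool):
    kernel = maybe_make_modular_kernel(uses_all2all)
    if kernel is None:
        # Fall-through path: no modular kernel was created, so call the
        # fused op directly.
        return flashinfer_cutlass_moe_fp4(hidden_states)
    return kernel(hidden_states)
```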
Referenced diff:

    assert self.moe_quant_config is not None
    ...
    return flashinfer_cutlass_moe_fp4(
Can you update for compressed-tensors too?
But by deleting this elif clause (and, per @mgoin's suggestion, applying this change to compressed-tensors), doesn't it force the FlashInfer CUTLASS implementation to go through the modular kernels?
I'm just trying to understand if this is the plan for all cases that use FlashInfer, regardless of distributed strategies, or whether self.flashinfer_moe_backend is FlashinferMoeBackend.TENSORRT_LLM or FlashinferMoeBackend.CUTLASS.
I don't see any reason to use the modular kernels for cases that aren't using some kind of all2all communication. In this particular case I think @wenscarl and @leejnau figured out that this was dead code because the CUTLASS case always created a modular kernel. I'm not sure if the same holds true for compressed_tensors.
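A hedged sketch of that rule of thumb; the enum mirrors FlashinferMoeBackend from the thread, and the helper name is hypothetical:

```python
from enum import Enum

class FlashinferMoeBackend(Enum):
    # Mirrors the backend enum referenced above (values are illustrative).
    TENSORRT_LLM = "trtllm"
    CUTLASS = "cutlass"

def needs_modular_kernel(backend: FlashinferMoeBackend,
                         uses_all2all: bool) -> bool:
    # Per bnellnm's point: modular kernels are justified only when some
    # all2all communication is in play, independent of which FlashInfer
    # backend is selected (hence `backend` is deliberately unused).
    return uses_all2all
```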
@bnellnm this PR is updated with an additional quant_dtype: nvfp4_skip_quantization.
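Roughly, a skip-quantization quant_dtype could be consumed like this; a sketch only, where the config class and helper names are assumptions rather than vLLM's actual definitions:

```python
from dataclasses import dataclass

@dataclass
class MoEQuantConfig:
    quant_dtype: str  # e.g. "nvfp4" or "nvfp4_skip_quantization"

def quantize_nvfp4(x):
    # Stand-in for an explicit host-side NVFP4 quantization step.
    return x

def prepare_activations(hidden_states, cfg: MoEQuantConfig):
    if cfg.quant_dtype == "nvfp4_skip_quantization":
        # Leave activations unquantized; the fused FlashInfer CUTLASS
        # kernel performs NVFP4 quantization internally.
        return hidden_states
    return quantize_nvfp4(hidden_states)
```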
> I'm just trying to understand if this is the plan for all cases that use FlashInfer
I vote for that, since FlashInfer CUTLASS MoE is at least a better option than the normal cutlass_moe, and TRTLLM MoE can sometimes even win over that.
I've no preference for this particular case since there's no communication going on. My only concern would be that the …
Force-pushed from c78959c to 4c37b7d
Referenced diff:

    FlashInferExperts(
        out_dtype=hidden_states.dtype,
        quant_config=quant_config,
        use_dp=False,
Why are we always setting use_dp=False? Doesn't flashinfer_cutlass_moe_fp4 also support DP?
flashinfer_cutlass_moe_fp4(dead code) is removed in this PR. Previously flashinfer_cutlass_moe_fp4 is meant for TP case only. If DP, the fused expert is assemble elsewhere.
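For context, a sketch of how use_dp might be derived; the config fields and function names are assumptions, not vLLM's actual parallel config:

```python
from dataclasses import dataclass

@dataclass
class ParallelSetup:
    dp_size: int  # data-parallel world size
    tp_size: int  # tensor-parallel world size

def use_dp_for_experts(setup: ParallelSetup) -> bool:
    # TP-only deployments (dp_size == 1) take the use_dp=False path shown
    # in the snippet above; DP deployments assemble the fused expert
    # elsewhere with use_dp=True.
    return setup.dp_size > 1
```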
How is TP + FlashInfer CUTLASS handled now? Could we remove this method altogether, or do we see any cases where it would be used?
https://github.com/vllm-project/vllm/pull/27223/files#diff-5bb9585da825481e1ae1534657a703f846325c7dd72cfddb0c41f878db33d78aR82 differentiates TP vs. DP, but in either case a fused MoE expert is assembled. I agree that flashinfer_cutlass_moe_fp4 should be removed, but compressed-tensors TP could still depend on it, so its removal is out of scope for this PR.
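A sketch of that differentiation: in both cases a fused MoE expert is assembled, and only the DP case adds an all2all prepare/finalize wrapper. All names below are illustrative stand-ins:

```python
def build_flashinfer_experts(quant_config, use_dp: bool):
    # Stand-in for constructing FlashInferExperts.
    return {"quant_config": quant_config, "use_dp": use_dp}

def wrap_with_all2all(experts):
    # Stand-in for attaching an all2all prepare/finalize step.
    return ("all2all", experts)

def assemble_fused_moe_expert(quant_config, use_dp: bool):
    experts = build_flashinfer_experts(quant_config, use_dp)
    if use_dp:
        # DP: activations are exchanged across ranks around the expert call.
        return wrap_with_all2all(experts)
    # TP: no cross-rank activation exchange is needed.
    return experts
```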
Referenced diff:

    elif (
        self.allow_flashinfer
        and self.flashinfer_moe_backend == FlashinferMoeBackend.CUTLASS
    ):
Will this break PP mode? When running DeepSeek with PP, it goes to the last clause, which uses cutlass_moe_fp4 and will break on SM120, while on SGLang flashinfer_cutlass works with SM120 PP.
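A sketch of the dispatch concern raised here; string tags stand in for the real kernel objects, and the SM check is illustrative:

```python
def select_fp4_moe_kernel(allow_flashinfer: bool, backend: str, sm: int) -> str:
    if allow_flashinfer and backend == "cutlass":
        # FlashInfer CUTLASS path: reported to work on SM120, including PP.
        return "flashinfer_cutlass_moe"
    # Last clause: without the elif above, a DeepSeek PP run lands here.
    if sm >= 120:
        raise RuntimeError("cutlass_moe_fp4 is not supported on SM120")
    return "cutlass_moe_fp4"
```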
Well, #27123 added this back and solved my issue.
Co-authored by @leejnau.
cc @bnellnm
Purpose
For the TP case, the NVFP4 quantization is fused with the flashinfer_cutlass_moe call. This fixes the accuracy issue for the nvidia/Deepseek-R1-0528-FP4-v2 model.
For the DP case, the fix should rely on #26135.
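As a rough before/after sketch of the TP fix, under the assumption that the accuracy issue came from quantizing activations outside the kernel; function names are stand-ins and the exact vLLM/FlashInfer signatures differ:

```python
def quantize_nvfp4(x):
    # Stand-in for host-side NVFP4 quantization; returns (quantized, scale).
    return x, 1.0

def flashinfer_cutlass_moe(x, weights, input_scale=None):
    # Stand-in for the fused FlashInfer CUTLASS MoE call.
    return x

def moe_tp_before(x, weights):
    # Old TP path: activations were quantized outside the kernel.
    xq, scale = quantize_nvfp4(x)
    return flashinfer_cutlass_moe(xq, weights, input_scale=scale)

def moe_tp_after(x, weights):
    # New TP path: pass high-precision activations straight in; NVFP4
    # quantization is fused into the FlashInfer CUTLASS MoE kernel.
    return flashinfer_cutlass_moe(x, weights)
```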
Test Plan
Test Result
Previously:
With this PR: