[NVIDIA] Explicitly disable shuffled weights for flashinfer blockscale moe fp8 kernels#21411
Conversation
Signed-off-by: kaixih <kaixih@nvidia.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request aims to restore previous behavior for FlashInfer MoE kernels by explicitly disabling a new use_shuffled_weight flag. While the change is correct in its intent, it introduces a critical backward compatibility issue for users with older versions of FlashInfer. I've provided a comment with a suggested fix to address this.
tile_tokens_dim=_get_tile_tokens_dim(x.shape[0], top_k,
                                     global_num_experts),
routing_method_type=2,  # DeepSeek-styled routing method
use_shuffled_weight=False,
This change explicitly adds the use_shuffled_weight argument, which was introduced in FlashInfer v0.2.9. This will cause a TypeError for users with older versions of FlashInfer, breaking backward compatibility.
To fix this, we should only pass the argument if the installed FlashInfer version supports it. This can be done with a version check, which requires refactoring the function call to use a kwargs dictionary.
Here is a suggested implementation to replace lines 1097-1131:
from vllm.utils.flashinfer import flashinfer_trtllm_fp8_block_scale_moe

a_q, a_sf = per_token_group_quant_fp8(x, block_shape[1])
# NOTE: scales of hidden states have to be transposed!
a_sf_t = a_sf.t().contiguous()
kwargs = dict(
    routing_logits=routing_logits,
    routing_bias=routing_bias,
    hidden_states=a_q,
    hidden_states_scale=a_sf_t,
    gemm1_weights=w13_weight,
    gemm1_weights_scale=w13_weight_scale_inv,
    gemm2_weights=w2_weight,
    gemm2_weights_scale=w2_weight_scale_inv,
    num_experts=global_num_experts,
    top_k=top_k,
    n_group=num_expert_group,
    topk_group=topk_group,
    intermediate_size=intermediate_size,
    local_expert_offset=expert_offset,
    local_num_experts=local_num_experts,
    routed_scaling_factor=routed_scaling,
    tile_tokens_dim=_get_tile_tokens_dim(x.shape[0], top_k,
                                         global_num_experts),
    routing_method_type=2,  # DeepSeek-styled routing method
)
try:
    import flashinfer
    from packaging.version import Version

    # The use_shuffled_weight argument was added in flashinfer v0.2.9
    if Version(flashinfer.__version__) >= Version("0.2.9"):
        kwargs["use_shuffled_weight"] = False
except (ImportError, AttributeError):
    # Older flashinfer version or flashinfer not installed.
    # The lazy loader will handle the ImportError later if it's missing.
    pass
return flashinfer_trtllm_fp8_block_scale_moe(**kwargs)
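An alternative to comparing version strings is feature detection: introspect the callable's signature and only pass the keyword if it is accepted. The sketch below is illustrative, not vLLM code; `supports_kwarg` is a hypothetical helper and `fake_moe` merely stands in for the real kernel entry point.

```python
import inspect


def supports_kwarg(fn, name: str) -> bool:
    """Return True if `fn` accepts keyword argument `name` (or **kwargs)."""
    try:
        params = inspect.signature(fn).parameters
    except (TypeError, ValueError):
        # Some builtins/extensions expose no signature; assume unsupported.
        return False
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def fake_moe(hidden_states, top_k, use_shuffled_weight=True):
    # Stand-in for the real kernel; returns the flag so we can observe it.
    return use_shuffled_weight


kwargs = {"hidden_states": None, "top_k": 8}
if supports_kwarg(fake_moe, "use_shuffled_weight"):
    kwargs["use_shuffled_weight"] = False  # restore pre-flag behavior

print(fake_moe(**kwargs))  # False
```

This avoids hard-coding a release number, at the cost of an `inspect` call; the version check in the suggestion above is simpler when the release that added the argument is known.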
Thanks Kaixi, should we wait for the 0.2.9 release to land this?
@mgoin Sure.
@mgoin Can we merge this PR now that flashinfer 0.2.9rc1 is in?
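One subtlety worth noting when gating on versions: under PEP 440 (as implemented by the `packaging` library), a release candidate compares *below* the final release, so a strict `>= "0.2.9"` check would skip `0.2.9rc1`. A minimal sketch, assuming `packaging` is installed; `shuffled_weight_kwargs` is a hypothetical helper, not vLLM code:

```python
from packaging.version import Version


def shuffled_weight_kwargs(installed: str) -> dict:
    """Only pass the flag on versions that accept it (added in 0.2.9)."""
    kwargs = {}
    if Version(installed) >= Version("0.2.9"):
        kwargs["use_shuffled_weight"] = False
    return kwargs


print(shuffled_weight_kwargs("0.2.8"))     # {}
print(shuffled_weight_kwargs("0.2.9rc1"))  # {} -- rc1 sorts below the final 0.2.9
print(shuffled_weight_kwargs("0.2.9"))     # {'use_shuffled_weight': False}
```

Gating on `>= Version("0.2.9rc1")` (or using feature detection on the function signature) would include the release candidate as well.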
…e moe fp8 kernels (vllm-project#21411) Signed-off-by: kaixih <kaixih@nvidia.com>
The latest FlashInfer (PR) introduces a new flag to the trtllm_fp8_block_scale_moe API, which defaults to True. This PR explicitly disables it to restore the previous behavior. I have verified performance and accuracy against top-of-tree, and we recommend using flashinfer v0.2.9.
cc. @kushanam @mgoin