[Fix] Fix FlashInfer CUTLASS MoE for unquantized models and single-GPU, bump FlashInfer to 0.6.8#38215
askliar wants to merge 5 commits into vllm-project:main from …
Conversation
…FlashInfer CUTLASS MoE implementation Signed-off-by: Andrii Skliar <askliar@nvidia.com>
…n, and requirements/cuda.txt Updated the FlashInfer version to 0.6.7 across the Dockerfile, versions.json, and requirements files to ensure compatibility with the latest features and fixes. Signed-off-by: Andrii Skliar <askliar@nvidia.com>
… in tests and implementation Removed the use_ep parameter from the select_unquantized_moe_backend function calls in both the test cases and the UnquantizedFusedMoEMethod class to streamline backend selection logic. Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Code Review
This pull request updates the FlashInfer library version to 0.6.7 across the Dockerfile, versions.json, and CUDA requirements. It also refactors the unquantized MoE backend selection logic by removing the use_ep (Expert Parallelism) parameter from the select_unquantized_moe_backend function and its associated calls and conditions, simplifying the backend selection process. Additionally, quant_scales is now initialized as an empty list instead of None in flashinfer_cutlass_moe.py. A review comment points out that a logging condition in vllm/model_executor/layers/fused_moe/oracle/unquantized.py is too broad and could lead to a misleading log message, suggesting to use the flashinfer_cutlass_available variable for accuracy.
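The `quant_scales` change called out above can be illustrated with a small stub. `cutlass_fused_moe_stub` below is a hypothetical stand-in for the real C++ binding, which iterates the scales unconditionally and therefore rejects `None`:

```python
from typing import List

def cutlass_fused_moe_stub(quant_scales: List[float]) -> int:
    # Stand-in for the C++ binding: it consumes the list unconditionally,
    # so passing None raises a TypeError, as described in this PR.
    return len(quant_scales)

# Unquantized (bf16/fp16) path: there are no scales, but the binding still
# expects a list, so the fix is to pass [] rather than None.
print(cutlass_fused_moe_stub(quant_scales=[]))  # 0

try:
    cutlass_fused_moe_stub(quant_scales=None)   # pre-fix behavior
except TypeError as err:
    print("None fails:", err)
```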
```diff
         scope="local",
     )
-    elif use_ep and (not use_dp):
+    elif (not use_dp):
```
The condition `(not use_dp)` is too broad and can lead to a misleading log message. The log suggests that "FlashInfer MoE is available for EP", but this `elif` block can be entered even when FlashInfer is not available (e.g., on unsupported hardware or if `has_flashinfer_cutlass_fused_moe()` is false).
To ensure the log message is accurate, the condition should check whether the FlashInfer CUTLASS backend is actually available. The `flashinfer_cutlass_available` variable already encapsulates all the necessary checks (hardware support, `not use_dp`, etc.).
Using `flashinfer_cutlass_available` as the condition ensures that this log is only shown when the feature is truly available but not enabled via `VLLM_USE_FLASHINFER_MOE_FP16`.
```diff
-    elif (not use_dp):
+    elif flashinfer_cutlass_available:
```
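The selection logic the review is discussing can be sketched as below. The function and variable names mirror `select_unquantized_moe_backend` and `flashinfer_cutlass_available` from `vllm/model_executor/layers/fused_moe/oracle/unquantized.py`, but the signature and inputs here are simplified assumptions, not vLLM's actual API:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("moe_oracle")

def select_unquantized_moe_backend(use_dp: bool,
                                   has_flashinfer_cutlass: bool,
                                   flashinfer_moe_fp16_enabled: bool) -> str:
    # Availability bundles every prerequisite (kernels importable, no data
    # parallelism), so logging against it can never mislead the user.
    flashinfer_cutlass_available = has_flashinfer_cutlass and not use_dp
    if flashinfer_cutlass_available and flashinfer_moe_fp16_enabled:
        return "flashinfer_cutlass"
    if flashinfer_cutlass_available:
        # Reached only when the backend truly exists but was not opted into,
        # which is exactly what the log message claims.
        logger.info("FlashInfer CUTLASS MoE is available; set "
                    "VLLM_USE_FLASHINFER_MOE_FP16=1 to enable it.")
    return "triton"

print(select_unquantized_moe_backend(False, True, True))   # flashinfer_cutlass
print(select_unquantized_moe_backend(False, True, False))  # triton, with hint logged
```

With the broad `(not use_dp)` condition instead, the hint would also fire when `has_flashinfer_cutlass` is false, advertising a backend that cannot actually be used.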
…n, and requirements/cuda.txt Updated the FlashInfer version to 0.6.8 across the Dockerfile, versions.json, and requirements files to incorporate the latest improvements and fixes. Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Updated the logging message in the select_unquantized_moe_backend function to clarify the availability of FlashInfer CUTLASS MoE when not using data parallelism. This change aims to improve user awareness regarding performance optimization options. Signed-off-by: Andrii Skliar <askliar@nvidia.com>
This pull request has merge conflicts that must be resolved before it can be merged.
This is a future-looking PR
Summary
- **Fix `quant_scales` type on unquantized path (`flashinfer_cutlass_moe.py`)**: Pass `quant_scales=[]` instead of `None` for the unquantized (bf16/fp16) MoE path. The C++ binding expects `List[Tensor]`; passing `None` causes a type error at runtime.
- **Remove spurious `use_ep` guard (`oracle/unquantized.py`)**: The `flashinfer_cutlass_available` check was gated on `use_ep=True`, preventing the CUTLASS backend from being selected on single-GPU (EP=1) deployments. `FlashInferExperts` supports EP=1 natively (`ep_size=1`, `ep_rank=0`); the `use_ep` parameter is removed from `select_unquantized_moe_backend` entirely.
- **Bump FlashInfer to 0.6.8 (`Dockerfile`, `versions.json`, `requirements/cuda.txt`)**: Update `flashinfer-python` and `flashinfer-cubin` from 0.6.6 to 0.6.8.

Test plan
- `VLLM_USE_FLASHINFER_MOE_FP16=1` (previously silently fell back to Triton)
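For reference, a boolean environment flag like `VLLM_USE_FLASHINFER_MOE_FP16` is typically consumed as sketched below; vLLM's actual parsing lives in its `envs` module and may accept a different set of truthy values, so treat this as an illustration only:

```python
import os

def flag_enabled(name: str) -> bool:
    # Common convention: treat "1"/"true" as enabled, anything else as off.
    # This is a generic sketch, not vLLM's exact parsing logic.
    return os.environ.get(name, "0").lower() in ("1", "true")

os.environ["VLLM_USE_FLASHINFER_MOE_FP16"] = "1"
print(flag_enabled("VLLM_USE_FLASHINFER_MOE_FP16"))  # True
```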