
[Fix] Fix FlashInfer CUTLASS MoE for unquantized models and single-GPU, bump FlashInfer to 0.6.8#38215

Draft
askliar wants to merge 5 commits into vllm-project:main from askliar:feature/add_nemotronh_spark_support

Conversation

@askliar
Contributor

@askliar askliar commented Mar 26, 2026

This is a future-looking PR

Summary

  • Fix quant_scales type on unquantized path (flashinfer_cutlass_moe.py): Pass quant_scales=[] instead of None for the unquantized (bf16/fp16) MoE path. The C++ binding expects List[Tensor]; passing None causes a type error at runtime.

  • Remove spurious use_ep guard (oracle/unquantized.py): The flashinfer_cutlass_available check was gated on use_ep=True, preventing the CUTLASS backend from being selected on single-GPU (EP=1) deployments. FlashInferExperts supports EP=1 natively (ep_size=1, ep_rank=0). The use_ep parameter is removed from select_unquantized_moe_backend entirely.

  • Bump FlashInfer to 0.6.7 (Dockerfile, versions.json, requirements/cuda.txt): Update flashinfer-python and flashinfer-cubin from 0.6.6 to 0.6.7.
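The first fix can be illustrated with a minimal, hypothetical sketch; the stand-in function below is not vLLM's actual binding, and the real signatures in flashinfer_cutlass_moe.py differ:

```python
# Hypothetical, simplified sketch of the quant_scales fix; names and
# signatures are illustrative, not vLLM's actual API.

def cutlass_fused_moe_binding(quant_scales):
    # Stand-in for the C++ binding, which expects List[Tensor] and
    # rejects None with a type error.
    if not isinstance(quant_scales, list):
        raise TypeError("quant_scales must be a List[Tensor], got "
                        + type(quant_scales).__name__)
    return len(quant_scales)

# Before the fix: the unquantized (bf16/fp16) path passed None.
try:
    cutlass_fused_moe_binding(None)
except TypeError as e:
    print("old path:", e)

# After the fix: an empty list satisfies the binding's type check.
print("new path: ok, scales =", cutlass_fused_moe_binding([]))
```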

Test plan

  • Verify FlashInfer CUTLASS MoE activates on single-GPU with VLLM_USE_FLASHINFER_MOE_FP16=1 (previously silently fell back to Triton)
  • Verify unquantized MoE models (e.g. MTP draft model) run without type errors on the CUTLASS path
  • Verify EP>1 deployments are unaffected
  • Verify FlashInfer 0.6.7 builds and runs correctly in Docker
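A minimal sketch of the single-GPU activation check from the test plan; the env var name comes from this PR, and the commented serve invocation is illustrative only:

```shell
# Enable the unquantized (bf16/fp16) FlashInfer CUTLASS MoE path.
# The env var name comes from this PR's test plan; the serve command
# below is an illustrative placeholder, not a verified invocation.
export VLLM_USE_FLASHINFER_MOE_FP16=1
echo "VLLM_USE_FLASHINFER_MOE_FP16=$VLLM_USE_FLASHINFER_MOE_FP16"
# vllm serve <model> --tensor-parallel-size 1
```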

Andrii Skliar added 3 commits March 26, 2026 10:30
…FlashInfer CUTLASS MoE implementation

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
…n, and requirements/cuda.txt

Updated the FlashInfer version to 0.6.7 across the Dockerfile, versions.json, and requirements files to ensure compatibility with the latest features and fixes.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
… in tests and implementation

Removed the use_ep parameter from the select_unquantized_moe_backend function calls in both the test cases and the UnquantizedFusedMoEMethod class to streamline backend selection logic.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@askliar askliar marked this pull request as draft March 26, 2026 09:44
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the FlashInfer library version to 0.6.7 across the Dockerfile, versions.json, and CUDA requirements. It also refactors the unquantized MoE backend selection logic by removing the use_ep (Expert Parallelism) parameter from the select_unquantized_moe_backend function and its associated calls and conditions, simplifying the backend selection process. Additionally, quant_scales is now initialized as an empty list instead of None in flashinfer_cutlass_moe.py. A review comment points out that a logging condition in vllm/model_executor/layers/fused_moe/oracle/unquantized.py is too broad and could lead to a misleading log message, suggesting to use the flashinfer_cutlass_available variable for accuracy.

  scope="local",
  )
- elif use_ep and (not use_dp):
+ elif (not use_dp):
Contributor


Severity: high

The condition (not use_dp) is too broad and can lead to a misleading log message. The log suggests that "FlashInfer MoE is available for EP", but this elif block can be entered even when FlashInfer is not available (e.g., on unsupported hardware or if has_flashinfer_cutlass_fused_moe() is false).

To ensure the log message is accurate, the condition should check if the FlashInfer CUTLASS backend is actually available. The flashinfer_cutlass_available variable already encapsulates all the necessary checks (hardware support, not use_dp, etc.).

Using flashinfer_cutlass_available as the condition ensures that this log is only shown when the feature is truly available but not enabled via VLLM_USE_FLASHINFER_MOE_FP16.

Suggested change
- elif (not use_dp):
+ elif flashinfer_cutlass_available:
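The reviewer's point can be illustrated with a simplified, hypothetical version of the selection logic; the real oracle in oracle/unquantized.py has more branches and different signatures:

```python
# Hypothetical, simplified backend-selection sketch illustrating the review
# comment: gate the "available but not enabled" hint on the availability
# flag, not merely on `not use_dp`.

def select_backend(use_dp, flashinfer_cutlass_available, fp16_moe_enabled):
    if flashinfer_cutlass_available and fp16_moe_enabled:
        return "flashinfer_cutlass"
    elif flashinfer_cutlass_available:  # was: elif (not use_dp)
        # Hint is now printed only when the backend is genuinely available.
        print("FlashInfer CUTLASS MoE is available; "
              "set VLLM_USE_FLASHINFER_MOE_FP16=1 to enable it.")
    return "triton"

# With the corrected guard, no misleading hint appears when the backend is
# unavailable (e.g. unsupported hardware), even though use_dp is False.
print(select_backend(use_dp=False,
                     flashinfer_cutlass_available=False,
                     fp16_moe_enabled=False))
```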

Andrii Skliar added 2 commits March 26, 2026 10:52
…n, and requirements/cuda.txt

Updated the FlashInfer version to 0.6.8 across the Dockerfile, versions.json, and requirements files to incorporate the latest improvements and fixes.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Updated the logging message in the select_unquantized_moe_backend function to clarify the availability of FlashInfer CUTLASS MoE when not using data parallelism. This change aims to improve user awareness regarding performance optimization options.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
@askliar askliar changed the title [Fix] Fix FlashInfer CUTLASS MoE for unquantized models and single-GPU, bump FlashInfer to 0.6.7 [Fix] Fix FlashInfer CUTLASS MoE for unquantized models and single-GPU, bump FlashInfer to 0.6.8 Mar 26, 2026
@mergify

mergify bot commented Mar 30, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @askliar.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 30, 2026