[Fix] Fix FlashInfer CUTLASS MoE for unquantized models and single-GPU, bump FlashInfer to 0.6.8#38215
askliar wants to merge 5 commits into vllm-project:main from …
Conversation
…FlashInfer CUTLASS MoE implementation Signed-off-by: Andrii Skliar <askliar@nvidia.com>
…n, and requirements/cuda.txt Updated the FlashInfer version to 0.6.7 across the Dockerfile, versions.json, and requirements files to ensure compatibility with the latest features and fixes. Signed-off-by: Andrii Skliar <askliar@nvidia.com>
… in tests and implementation Removed the use_ep parameter from the select_unquantized_moe_backend function calls in both the test cases and the UnquantizedFusedMoEMethod class to streamline backend selection logic. Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Code Review
This pull request updates the FlashInfer library version to 0.6.7 across the Dockerfile, versions.json, and CUDA requirements. It also refactors the unquantized MoE backend selection logic by removing the use_ep (Expert Parallelism) parameter from the select_unquantized_moe_backend function and its associated calls and conditions, simplifying the backend selection process. Additionally, quant_scales is now initialized as an empty list instead of None in flashinfer_cutlass_moe.py. A review comment points out that a logging condition in vllm/model_executor/layers/fused_moe/oracle/unquantized.py is too broad and could lead to a misleading log message, suggesting to use the flashinfer_cutlass_available variable for accuracy.
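The `quant_scales` change called out above can be illustrated with a small stub. `cutlass_fused_moe_stub` below is a hypothetical stand-in for the real C++ binding, which iterates the scales unconditionally and therefore rejects `None`:

```python
from typing import List

def cutlass_fused_moe_stub(quant_scales: List[float]) -> int:
    # Stand-in for the C++ binding: it consumes the list unconditionally,
    # so passing None raises a TypeError, as described in this PR.
    return len(quant_scales)

# Unquantized (bf16/fp16) path: there are no scales, but the binding still
# expects a list, so the fix is to pass [] rather than None.
print(cutlass_fused_moe_stub(quant_scales=[]))  # 0

try:
    cutlass_fused_moe_stub(quant_scales=None)   # pre-fix behavior
except TypeError as err:
    print("None fails:", err)
```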
```diff
         scope="local",
     )
-    elif use_ep and (not use_dp):
+    elif (not use_dp):
```
The condition `(not use_dp)` is too broad and can lead to a misleading log message. The log suggests that "FlashInfer MoE is available for EP", but this `elif` block can be entered even when FlashInfer is not available (e.g., on unsupported hardware or if `has_flashinfer_cutlass_fused_moe()` is false).
To ensure the log message is accurate, the condition should check whether the FlashInfer CUTLASS backend is actually available. The `flashinfer_cutlass_available` variable already encapsulates all the necessary checks (hardware support, `not use_dp`, etc.).
Using `flashinfer_cutlass_available` as the condition ensures that this log is only shown when the feature is truly available but not enabled via `VLLM_USE_FLASHINFER_MOE_FP16`.
```diff
-    elif (not use_dp):
+    elif flashinfer_cutlass_available:
```
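The selection logic the review is discussing can be sketched as below. The function and variable names mirror `select_unquantized_moe_backend` and `flashinfer_cutlass_available` from `vllm/model_executor/layers/fused_moe/oracle/unquantized.py`, but the signature and inputs here are simplified assumptions, not vLLM's actual API:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("moe_oracle")

def select_unquantized_moe_backend(use_dp: bool,
                                   has_flashinfer_cutlass: bool,
                                   flashinfer_moe_fp16_enabled: bool) -> str:
    # Availability bundles every prerequisite (kernels importable, no data
    # parallelism), so logging against it can never mislead the user.
    flashinfer_cutlass_available = has_flashinfer_cutlass and not use_dp
    if flashinfer_cutlass_available and flashinfer_moe_fp16_enabled:
        return "flashinfer_cutlass"
    if flashinfer_cutlass_available:
        # Reached only when the backend truly exists but was not opted into,
        # which is exactly what the log message claims.
        logger.info("FlashInfer CUTLASS MoE is available; set "
                    "VLLM_USE_FLASHINFER_MOE_FP16=1 to enable it.")
    return "triton"

print(select_unquantized_moe_backend(False, True, True))   # flashinfer_cutlass
print(select_unquantized_moe_backend(False, True, False))  # triton, with hint logged
```

With the broad `(not use_dp)` condition instead, the hint would also fire when `has_flashinfer_cutlass` is false, advertising a backend that cannot actually be used.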
…n, and requirements/cuda.txt Updated the FlashInfer version to 0.6.8 across the Dockerfile, versions.json, and requirements files to incorporate the latest improvements and fixes. Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Updated the logging message in the select_unquantized_moe_backend function to clarify the availability of FlashInfer CUTLASS MoE when not using data parallelism. This change aims to improve user awareness regarding performance optimization options. Signed-off-by: Andrii Skliar <askliar@nvidia.com>
This pull request has merge conflicts that must be resolved before it can be merged.
This is a future-looking PR
Summary
- **Fix `quant_scales` type on unquantized path (`flashinfer_cutlass_moe.py`)**: Pass `quant_scales=[]` instead of `None` for the unquantized (bf16/fp16) MoE path. The C++ binding expects `List[Tensor]`; passing `None` causes a type error at runtime.
- **Remove spurious `use_ep` guard (`oracle/unquantized.py`)**: The `flashinfer_cutlass_available` check was gated on `use_ep=True`, preventing the CUTLASS backend from being selected on single-GPU (EP=1) deployments. `FlashInferExperts` supports EP=1 natively (`ep_size=1`, `ep_rank=0`); the `use_ep` parameter is removed from `select_unquantized_moe_backend` entirely.
- **Bump FlashInfer to 0.6.8 (`Dockerfile`, `versions.json`, `requirements/cuda.txt`)**: Update `flashinfer-python` and `flashinfer-cubin` from 0.6.6 to 0.6.8.

Test plan
- `VLLM_USE_FLASHINFER_MOE_FP16=1` (previously silently fell back to Triton)
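For reference, a boolean environment flag like `VLLM_USE_FLASHINFER_MOE_FP16` is typically consumed as sketched below; vLLM's actual parsing lives in its `envs` module and may accept a different set of truthy values, so treat this as an illustration only:

```python
import os

def flag_enabled(name: str) -> bool:
    # Common convention: treat "1"/"true" as enabled, anything else as off.
    # This is a generic sketch, not vLLM's exact parsing logic.
    return os.environ.get(name, "0").lower() in ("1", "true")

os.environ["VLLM_USE_FLASHINFER_MOE_FP16"] = "1"
print(flag_enabled("VLLM_USE_FLASHINFER_MOE_FP16"))  # True
```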