[flashinfer] fix FI all2all with FI cutlass moe #28166
pavanimajety merged 1 commit into vllm-project:main
Conversation
Code Review
This pull request addresses two errors encountered when using FlashInfer's All-to-All with CUTLASS MoE. The first fix correctly changes a call from prepare_workspace to prepare_workspace_tensor, resolving an AttributeError. The second fix addresses a data type mismatch for topk_weights returned by a FlashInfer kernel.
My review focuses on the correctness of the data type conversion. I've identified a critical issue where .view(dtype=...) is used for type conversion, which reinterprets the tensor's underlying bytes instead of casting the values. This can lead to incorrect results. I've suggested using .to(dtype=...) instead to ensure a proper type cast.
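The distinction the review draws can be demonstrated with a small standalone snippet (the tensor values here are illustrative, not taken from the PR):

```python
import torch

# An int32 tensor of "weights", standing in for the kernel output in this PR.
w = torch.tensor([1, 2, 3], dtype=torch.int32)

# .view(dtype=...) reinterprets the underlying bytes without converting
# values: the int32 bit pattern 0x00000001 read as float32 is a tiny
# denormal (~1.4e-45), not 1.0.
reinterpreted = w.view(dtype=torch.float32)

# .to(dtype=...) performs a proper value cast: 1 -> 1.0, 2 -> 2.0, 3 -> 3.0.
cast = w.to(dtype=torch.float32)

print(reinterpreted)  # denormal garbage, nothing like [1., 2., 3.]
print(cast)           # tensor([1., 2., 3.])
```

This is why `.to(dtype=...)` is the right call for converting the kernel's int32 output into the float32 values downstream code expects.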
vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
Summary:
Running FI CUTLASS MoE with the FI a2av backend fails with the following error:
```
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] ) = self.prepare_finalize.prepare(
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 115, in prepare
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] flashinfer_alltoall_dispatch(
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 239, in flashinfer_alltoall_dispatch
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] all2all_manager.prepare_workspace,
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] AttributeError: 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace'. Did you mean: 'prepare_workspace_tensor'?
(EngineCore_DP5 pid=104759) ERROR 11-05 14:09:51 [core.py:843] EngineCore failed to start.
```
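The first fix is a one-name rename at the call site, following the traceback's own suggestion. A minimal sketch (the class below is a stand-in with only the relevant method; the real `FlashInferAllToAllManager` in vLLM carries much more state):

```python
# Stand-in for vLLM's FlashInferAllToAllManager, reduced to the method
# named in the traceback's "Did you mean" suggestion.
class FlashInferAllToAllManager:
    def prepare_workspace_tensor(self):
        # In the real manager this prepares the all-to-all workspace tensor.
        return object()

manager = FlashInferAllToAllManager()

# Before the fix, the dispatch path referenced `manager.prepare_workspace`,
# which does not exist and raises AttributeError at engine startup.
assert not hasattr(manager, "prepare_workspace")

# After the fix, the method that actually exists is called.
workspace = manager.prepare_workspace_tensor()
assert workspace is not None
```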
After fixing the error above, a second error occurs:
```
(EngineCore_DP5 pid=821648) File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 817, in cutlass_fused_moe
(EngineCore_DP5 pid=821648) return get_cutlass_fused_moe_module(device_arch).cutlass_fused_moe(
(EngineCore_DP5 pid=821648) File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 537, in cutlass_fused_moe
(EngineCore_DP5 pid=821648) run_moe(
(EngineCore_DP5 pid=821648) File "tvm_ffi/function.pxi", line 814, in tvm_ffi.core.Function.__call__
(EngineCore_DP5 pid=821648) File "buck-out/v2/gen/fbcode/deeplearning/tvm_ffi/tvm_ffi/cython/__core__cython-lib__/19a62205b4ea2336/buck-headers/tvm_ffi_python_helpers.h", line 323, in _ZL43__pyx_pw_7tvm_ffi_4core_8Function_3__call__P7_objectS0_S0__tvm_ffi$core
(EngineCore_DP5 pid=821648) File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 706, in FusedMoeRunner::GetFunction(...)::{lambda(...)#1}::operator()(...) const
(EngineCore_DP5 pid=821648) File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 248, in void FusedMoeRunner::runMoe(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor>>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long>>, bool, ActivationType)
(EngineCore_DP5 pid=821648) RuntimeError: Check failed: token_final_scales.value().dtype() == dl_float32 (int32 vs. float32) : Inconsistency of Tensor type: token_final_scales.value()
I1105 14:19:35.039142 822035 HealthTracker.cpp:26 req:00007fd9d4e1b100] Mark connection as healthy.
```
It seems the flashinfer moe_prepare kernel always returns an int32 tensor, so convert the dtype accordingly.
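The second fix can be sketched as follows (tensor names mirror the traceback, but shapes and the check function are illustrative stand-ins, not the real flashinfer bindings):

```python
import torch

# Stand-in for the top-k weights that flashinfer's moe_prepare hands back:
# per the RuntimeError above, they arrive as int32.
topk_weights = torch.ones(8, 4, dtype=torch.int32)

def check_token_final_scales(t: torch.Tensor) -> None:
    # Mirrors the C++ guard in runMoe:
    # "Check failed: token_final_scales.value().dtype() == dl_float32"
    if t.dtype != torch.float32:
        raise RuntimeError(f"Inconsistency of Tensor type: {t.dtype}")

# Fix: cast values with .to(...) before entering the CUTLASS MoE path.
check_token_final_scales(topk_weights.to(dtype=torch.float32))
```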
Differential Revision: D86345110
Signed-off-by: Xiaozhu <mxz297@gmail.com>
Force-pushed from 21f9420 to 4e7edaa.
After this PR, FI-cutlass NVFP4 MoE + FI-a2av works with DEP16 non-disagg on GB200, achieving a gsm8k score of 0.96.
pavanimajety
left a comment
LGTM, thank you for the fix!