[flashinfer] fix FI all2all with FI cutlass moe by mxz297 · Pull Request #28166 · vllm-project/vllm

mxz297 · 2025-11-05T22:21:28Z

Summary:
Running FI Cutlass MoE with the FI a2av backend fails with the following error:

(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843]     ) = self.prepare_finalize.prepare(
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843]   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 115, in prepare
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843]     flashinfer_alltoall_dispatch(
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843]   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 239, in flashinfer_alltoall_dispatch
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843]     all2all_manager.prepare_workspace,
(EngineCore_DP7 pid=104761) ERROR 11-05 14:09:51 [core.py:843] AttributeError: 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace'. Did you mean: 'prepare_workspace_tensor'?
(EngineCore_DP5 pid=104759) ERROR 11-05 14:09:51 [core.py:843] EngineCore failed to start.

After fixing the error above, the run hits the following error:

(EngineCore_DP5 pid=821648)   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 817, in cutlass_fused_moe
(EngineCore_DP5 pid=821648)     return get_cutlass_fused_moe_module(device_arch).cutlass_fused_moe(
(EngineCore_DP5 pid=821648)   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 537, in cutlass_fused_moe
(EngineCore_DP5 pid=821648)     run_moe(
(EngineCore_DP5 pid=821648)   File "tvm_ffi/function.pxi", line 814, in tvm_ffi.core.Function.__call__
(EngineCore_DP5 pid=821648)   File "buck-out/v2/gen/fbcode/deeplearning/tvm_ffi/tvm_ffi/cython/__core__cython-lib__/19a62205b4ea2336/buck-headers/tvm_ffi_python_helpers.h", line 323, in _ZL43__pyx_pw_7tvm_ffi_4core_8Function_3__call__P7_objectS0_S0__tvm_ffi$core
(EngineCore_DP5 pid=821648)   File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 706, in FusedMoeRunner::GetFunction(tvm::ffi::String const&)::{lambda(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long)#1}::operator()(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long) const
(EngineCore_DP5 pid=821648)   File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 248, in void FusedMoeRunner::runMoe(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor>>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long>>, bool, ActivationType)
(EngineCore_DP5 pid=821648) RuntimeError: Check failed: token_final_scales.value().dtype() == dl_float32 (int32 vs. float32) : Inconsistency of Tensor type: token_final_scales.value()
I1105 14:19:35.039142 822035 HealthTracker.cpp:26 req:00007fd9d4e1b100] Mark connection as healthy.

It seems that the flashinfer moe_prepare kernel always returns an int32 tensor, so convert the type accordingly.

Differential Revision: D86345110

gemini-code-assist

Code Review

This pull request addresses two errors encountered when using FlashInfer's All-to-All with CUTLASS MoE. The first fix correctly changes a call from prepare_workspace to prepare_workspace_tensor, resolving an AttributeError. The second fix addresses a data type mismatch for topk_weights returned by a FlashInfer kernel.

My review focuses on the correctness of the data type conversion. I've identified a critical issue where .view(dtype=...) is used for type conversion, which reinterprets the tensor's underlying bytes instead of casting the values. This can lead to incorrect results. I've suggested using .to(dtype=...) instead to ensure a proper type cast.
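The distinction the review draws, between reinterpreting bytes and casting values, can be illustrated without a GPU. The following is a minimal stdlib-only sketch, not code from this PR: the helper names are hypothetical, and `struct` stands in for the tensor operations. In PyTorch, `.view(dtype=...)` corresponds to the reinterpretation path and `.to(dtype=...)` to the value cast. Which one is appropriate depends on what the int32 buffer actually holds; the sketch only assumes, for illustration, a buffer carrying the raw bits of a float32 weight.

```python
import struct


def reinterpret_f32_as_i32(x: float) -> int:
    """Return the 32-bit pattern of a float32, read as a signed int32."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def reinterpret_i32_as_f32(bits: int) -> float:
    """Read an int32 bit pattern back as a float32 (analogous to a dtype view)."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def cast_i32_to_f32(value: int) -> float:
    """Convert the integer *value* to a float (analogous to a dtype cast)."""
    return float(value)


# Hypothetical routing weight, stored as raw float32 bits in an int32 slot.
weight = 0.5
bits = reinterpret_f32_as_i32(weight)

recovered = reinterpret_i32_as_f32(bits)  # reinterpretation restores 0.5
cast = cast_i32_to_f32(bits)              # value cast yields the bit pattern as a number

print(hex(bits), recovered, cast)
```

Under that assumption the two paths give very different numbers, so it is worth confirming what the kernel writes into the int32 tensor before choosing between them.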

vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py


mxz297 · 2025-11-05T23:03:44Z

With this PR, FI-cutlass NVFP4 MoE + FI-a2av works with DEP16 non-disagg on GB200 and achieves a gsm8k score of 0.96.

pavanimajety

LGTM, thank you for the fix!

Signed-off-by: Xiaozhu <mxz297@gmail.com>

mxz297 requested review from mgoin and pavanimajety as code owners November 5, 2025 22:21

mxz297 changed the title ~~fix FI all2all with FI cutlass moe~~ [flashinfer] fix FI all2all with FI cutlass moe Nov 5, 2025

gemini-code-assist bot reviewed Nov 5, 2025

vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py (review thread resolved)

mxz297 force-pushed the export-D86345110 branch from 21f9420 to 4e7edaa Compare November 5, 2025 23:01

pavanimajety approved these changes Nov 5, 2025

pavanimajety added the ready label Nov 5, 2025

pavanimajety enabled auto-merge (squash) November 5, 2025 23:43

pavanimajety merged commit e31946f into vllm-project:main Nov 6, 2025
53 checks passed

mxz297 deleted the export-D86345110 branch November 6, 2025 06:07

ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025

[flashinfer] fix FI all2all with FI cutlass moe (vllm-project#28166)

e078ca7

Signed-off-by: Xiaozhu <mxz297@gmail.com>

devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

[flashinfer] fix FI all2all with FI cutlass moe (vllm-project#28166)

2fb2787

Signed-off-by: Xiaozhu <mxz297@gmail.com>

This was referenced Jan 25, 2026

[BugFix] Fixed 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace' #27862

Closed

[Bug]: 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace' #27655

Closed
