Skip to content

[flashinfer] fix FI all2all with FI cutlass moe#28166

Merged
pavanimajety merged 1 commit intovllm-project:mainfrom
mxz297:export-D86345110
Nov 6, 2025
Merged

[flashinfer] fix FI all2all with FI cutlass moe#28166
pavanimajety merged 1 commit intovllm-project:mainfrom
mxz297:export-D86345110

Conversation

@mxz297
Copy link
Contributor

@mxz297 mxz297 commented Nov 5, 2025

Summary:
Running FI Cutlass moe with FI a2av backend runs into error:

�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843]     ) = self.prepare_finalize.prepare(
�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843]   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 115, in prepare
�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843]     flashinfer_alltoall_dispatch(
�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843]   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 239, in flashinfer_alltoall_dispatch
�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843]     all2all_manager.prepare_workspace,
�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843] AttributeError: 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace'. Did you mean: 'prepare_workspace_tensor'?
�[1;36m(EngineCore_DP5 pid=104759)�[0;0m ERROR 11-05 14:09:51 [core.py:843] EngineCore failed to start.

After fixing the error above, running into the following error:

�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 817, in cutlass_fused_moe
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m     return get_cutlass_fused_moe_module(device_arch).cutlass_fused_moe(
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 537, in cutlass_fused_moe
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m     run_moe(
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "tvm_ffi/function.pxi", line 814, in tvm_ffi.core.Function.__call__
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "buck-out/v2/gen/fbcode/deeplearning/tvm_ffi/tvm_ffi/cython/__core__cython-lib__/19a62205b4ea2336/buck-headers/tvm_ffi_python_helpers.h", line 323, in _ZL43__pyx_pw_7tvm_ffi_4core_8Function_3__call__P7_objectS0_S0__tvm_ffi$core
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 706, in FusedMoeRunner::GetFunction(tvm::ffi::String const&)::{lambda(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long)#1}::operator()(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long) const
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 248, in void FusedMoeRunner::runMoe(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor>>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long>>, bool, ActivationType)
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m RuntimeError: Check failed: token_final_scales.value().dtype() == dl_float32 (int32 vs. float32) : Inconsistency of Tensor type: token_final_scales.value()
I1105 14:19:35.039142 822035 HealthTracker.cpp:26 req:00007fd9d4e1b100] Mark connection as healthy.

It seems like flashinfer moe_prepare kernel always return int32 tensor, so convert the type accordingly

Differential Revision: D86345110

@mxz297 mxz297 changed the title fix FI all2all with FI cutlass moe [flashinfer] fix FI all2all with FI cutlass moe Nov 5, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses two errors encountered when using FlashInfer's All-to-All with CUTLASS MoE. The first fix correctly changes a call from prepare_workspace to prepare_workspace_tensor, resolving an AttributeError. The second fix addresses a data type mismatch for topk_weights returned by a FlashInfer kernel.

My review focuses on the correctness of the data type conversion. I've identified a critical issue where .view(dtype=...) is used for type conversion, which reinterprets the tensor's underlying bytes instead of casting the values. This can lead to incorrect results. I've suggested using .to(dtype=...) instead to ensure a proper type cast.

Summary:
Running FI Cutlass moe with FI a2av backend runs into error:

```
�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843]     ) = self.prepare_finalize.prepare(
�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843]   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 115, in prepare
�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843]     flashinfer_alltoall_dispatch(
�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843]   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 239, in flashinfer_alltoall_dispatch
�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843]     all2all_manager.prepare_workspace,
�[1;36m(EngineCore_DP7 pid=104761)�[0;0m ERROR 11-05 14:09:51 [core.py:843] AttributeError: 'FlashInferAllToAllManager' object has no attribute 'prepare_workspace'. Did you mean: 'prepare_workspace_tensor'?
�[1;36m(EngineCore_DP5 pid=104759)�[0;0m ERROR 11-05 14:09:51 [core.py:843] EngineCore failed to start.
```

After fixing the error above, running into the following error:
```
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 817, in cutlass_fused_moe
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m     return get_cutlass_fused_moe_module(device_arch).cutlass_fused_moe(
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/c9838acc51201940/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/flashinfer/fused_moe/core.py", line 537, in cutlass_fused_moe
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m     run_moe(
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "tvm_ffi/function.pxi", line 814, in tvm_ffi.core.Function.__call__
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "buck-out/v2/gen/fbcode/deeplearning/tvm_ffi/tvm_ffi/cython/__core__cython-lib__/19a62205b4ea2336/buck-headers/tvm_ffi_python_helpers.h", line 323, in _ZL43__pyx_pw_7tvm_ffi_4core_8Function_3__call__P7_objectS0_S0__tvm_ffi$core
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 706, in FusedMoeRunner::GetFunction(tvm::ffi::String const&)::{lambda(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long)vllm-project#1}::operator()(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor, void>, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, tvm::ffi::Optional<tvm::ffi::TensorView, void>, long, long, long, long, long, long, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long, void>, void>, bool, long) const
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m   File "fbcode/deeplearning/flashinfer/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu", line 248, in void FusedMoeRunner::runMoe(tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::TensorView, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::Array<tvm::ffi::Tensor>>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, tvm::ffi::Optional<tvm::ffi::TensorView>, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, bool, bool, tvm::ffi::Optional<tvm::ffi::Array<long>>, bool, ActivationType)
�[1;36m(EngineCore_DP5 pid=821648)�[0;0m RuntimeError: Check failed: token_final_scales.value().dtype() == dl_float32 (int32 vs. float32) : Inconsistency of Tensor type: token_final_scales.value()
I1105 14:19:35.039142 822035 HealthTracker.cpp:26 req:00007fd9d4e1b100] Mark connection as healthy.
```

It seems like flashinfer moe_prepare kernel always return int32 tensor, so convert the type accordingly

Differential Revision: D86345110

Signed-off-by: Xiaozhu <mxz297@gmail.com>
@mxz297
Copy link
Contributor Author

mxz297 commented Nov 5, 2025

After PR, FI-cutlass NVFP4 moe + FI-a2av works with DEP16 non-disagg on GB200 and got gsm8k score 0.96

Copy link
Collaborator

@pavanimajety pavanimajety left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you for the fix!

@pavanimajety pavanimajety added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 5, 2025
@pavanimajety pavanimajety enabled auto-merge (squash) November 5, 2025 23:43
@pavanimajety pavanimajety merged commit e31946f into vllm-project:main Nov 6, 2025
53 checks passed
@mxz297 mxz297 deleted the export-D86345110 branch November 6, 2025 06:07
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants