[Bugfix] Add autotuning guard to all unprotected FlashInfer MoE kernels #37091
haosdent wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request correctly adds autotuning guards to several FlashInfer MoE kernels to prevent state corruption during the autotuning dummy pass. The fix is applied consistently across all identified unprotected kernel paths. My main feedback is regarding the code duplication introduced by adding the same guard logic in six different files. I've suggested refactoring this logic into a centralized helper function to improve maintainability and adhere to the DRY principle. This would make the codebase more robust to future changes in the autotuning mechanism.
```python
# trtllm_fp8 monolithic kernels do not support autotuning
# so skip this kernel during dummy run for autotuning.
import vllm.utils.flashinfer as fi_utils

if fi_utils._is_fi_autotuning:
    return torch.zeros_like(hidden_states)
```
This autotuning guard logic is duplicated in 6 different files in this pull request. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, this logic could be centralized into a helper function.
For example, you could add a helper in vllm/utils/flashinfer.py:
```python
def skip_if_autotuning(output_tensor_for_shape=None):
    """If autotuning, returns (True, dummy_output). Otherwise (False, None)."""
    if _is_fi_autotuning:
        if output_tensor_for_shape is None:
            return True, None
        return True, torch.zeros_like(output_tensor_for_shape)
    return False, None
```

Then, this apply method could be simplified to:

```python
import vllm.utils.flashinfer as fi_utils

should_skip, retval = fi_utils.skip_if_autotuning(hidden_states)
if should_skip:
    return retval
```

This would make the code cleaner and easier to manage if the autotuning check logic changes in the future. This suggestion applies to all files changed in this PR.
```python
# trtllm_fp4_block_scale_moe does not support autotuning
# so skip this kernel during dummy run for autotuning.
import vllm.utils.flashinfer as fi_utils

if fi_utils._is_fi_autotuning:
    return torch.zeros_like(hidden_states)
```
Similar to other files in this PR, this autotuning guard introduces code duplication. To enhance maintainability, this logic could be refactored into a shared helper function within vllm.utils.flashinfer. Centralizing the check for _is_fi_autotuning and the creation of a dummy return value would make the codebase more robust to future changes in the autotuning mechanism.
```python
# flashinfer CuteDSL MoE does not support autotuning
# so skip this kernel during dummy run for autotuning.
import vllm.utils.flashinfer as fi_utils

if fi_utils._is_fi_autotuning:
    return
```
This autotuning guard is repeated across multiple locations. To avoid this duplication, consider creating a single helper function in vllm.utils.flashinfer. This function could handle the check and return logic. For this specific case, since the method returns None, a centralized helper could be designed to handle this gracefully (e.g., by being called with an argument indicating no return value is needed). This would centralize the logic and improve code quality.
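To make the no-return-value case concrete, here is one way such a helper could look. Note that `skip_if_autotuning` is only the reviewer's suggested name, not an existing vLLM function, and `torch.zeros_like` is replaced with a plain list placeholder so the sketch runs standalone:

```python
_is_fi_autotuning = True  # stand-in for the flag in vllm.utils.flashinfer


def skip_if_autotuning(output_tensor_for_shape=None):
    """Return (should_skip, dummy_output).

    When no shape reference is passed, the dummy output is None, which
    matches apply() methods that have no return value.
    """
    if _is_fi_autotuning:
        if output_tensor_for_shape is None:
            return True, None
        # Placeholder for torch.zeros_like(output_tensor_for_shape).
        return True, [0] * len(output_tensor_for_shape)
    return False, None


# A void apply() simply calls the helper with no argument:
should_skip, retval = skip_if_autotuning()
print(should_skip, retval)  # True None
```

One call site shape then covers both the tensor-returning and the void kernels, which is the graceful handling the comment asks for.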
```python
# flashinfer cutlass MoE does not support autotuning
# so skip this kernel during dummy run for autotuning.
import vllm.utils.flashinfer as fi_utils

if fi_utils._is_fi_autotuning:
    return
```
The logic to guard against autotuning is duplicated here and in other files. A refactoring to a common helper function in vllm.utils.flashinfer would be beneficial for long-term maintenance. This would ensure that any future modifications to the autotuning guard only need to be made in one place. Since this apply method has no return value, the helper could be designed to handle this case gracefully.
```python
# flashinfer bf16 monolithic MoE does not support autotuning
# so skip this kernel during dummy run for autotuning.
import vllm.utils.flashinfer as fi_utils

if fi_utils._is_fi_autotuning:
    return torch.zeros_like(hidden_states)
```
This change introduces the same autotuning guard logic seen in other files in this PR. To follow the DRY principle, it would be better to abstract this logic into a reusable helper function located in vllm.utils.flashinfer. This would consolidate the autotuning check and make the overall implementation cleaner and more maintainable.
```python
# flashinfer mxint4 monolithic MoE does not support autotuning
# so skip this kernel during dummy run for autotuning.
import vllm.utils.flashinfer as fi_utils

if fi_utils._is_fi_autotuning:
    return torch.zeros_like(x)
```
The addition of this autotuning guard results in duplicated code across several files. I recommend refactoring this logic into a centralized helper function in vllm.utils.flashinfer. This function would encapsulate the check for _is_fi_autotuning and the logic for returning a correctly shaped zero tensor (in this case, based on the x tensor). This would improve code maintainability.
Hi @haosdent, thanks for the fix. I got the following error when trying out the fix on the offloading use case. Could you take a look?
Also, I wonder how we know whether a kernel from FlashInfer is incompatible with auto-tuning. It seems that even for the same kernel, it only causes problems in some setups. Do we have a reliable way to know if things are working or not?
Hi @wzhao18, thanks a lot for your test! I just fixed the issue you reported. Can you try the latest version and then test again?
@haosdent I tried wrapping the trtllm nvfp4 moe with the following and it works. Would this be cleaner? This is the pattern used in trtllm_moe.py.
@wzhao18 yes, many thanks for your feedback, let me update later
This pull request has merge conflicts that must be resolved before it can be merged.
Have updated, thanks @wzhao18, the new way looks much better
Use `with autotune(False):` to disable FlashInfer autotuning for MoE kernels that are incompatible with it (upstream flashinfer#2023). This follows the existing pattern in trtllm_moe.py and avoids shape/dtype mismatches from dummy return values.

Kernels wrapped:
- TrtLlmNvFp4ExpertsMonolithic (trtllm_fp4_block_scale_moe)
- TrtLlmFp8ExpertsMonolithic (trtllm_fp8_block_scale_moe, trtllm_fp8_per_tensor_scale_moe)
- flashinfer_fused_moe_bf16 (flashinfer_trtllm_bf16_moe)
- FlashInferExperts (flashinfer_cutlass_fused_moe)
- FlashInferCuteDSLExperts (flashinfer_cutedsl_moe_masked)
- flashinfer_trtllm_mxint4_moe (trtllm_mxint4_block_scale_moe)

Signed-off-by: haosdent <haosdent@gmail.com>
Purpose
Fixes #36999 - CPU weight offloading produces garbage output when the FlashInfer autotuner is enabled on Blackwell GPUs.
Root Cause
During FlashInfer autotuning, `kernel_warmup.py:flashinfer_autotune()` sets `_is_fi_autotuning = True` and runs a model dummy pass. This triggers MoE kernel calls. Certain FlashInfer MoE kernels are incompatible with FlashInfer's autotuning mechanism (upstream FlashInfer bug flashinfer-ai/flashinfer#2023). The incompatible kernel call corrupts CUDA state, producing garbage output or crashes during subsequent inference.

The modular kernel implementations (`TrtLlmNvFp4ExpertsModular` in #32564, `TrtLlmFp8ExpertsModular` in #36307) already have the `_is_fi_autotuning` guard. However, the monolithic counterparts and other FlashInfer MoE paths were missing protection.

For single-GPU Kimi K2.5-NVFP4 (the original reporter's config), the kernel oracle selects `TrtLlmNvFp4ExpertsMonolithic` (monolithic is preferred over modular when there is no EP/EPLB), which calls `flashinfer.fused_moe.trtllm_fp4_block_scale_moe()` during autotuning without protection.

Fix

Wrap all unprotected FlashInfer MoE kernel calls with `with autotune(False):`, following the existing pattern in `trtllm_moe.py:178`. This tells FlashInfer not to autotune these specific kernel calls while still allowing them to execute normally, avoiding the shape/dtype mismatches that occurred with the previous dummy-return approach.

- `experts/trtllm_nvfp4_moe.py`: `trtllm_fp4_block_scale_moe`
- `experts/trtllm_fp8_moe.py`: `trtllm_fp8_block_scale_moe`, `trtllm_fp8_per_tensor_scale_moe`
- `flashinfer_trtllm_moe.py`: `flashinfer_trtllm_bf16_moe`
- `flashinfer_cutlass_moe.py`: `flashinfer_cutlass_fused_moe`
- `flashinfer_cutedsl_moe.py`: `flashinfer_cutedsl_moe_masked`
- `quantization/utils/flashinfer_mxint4_moe.py`: `trtllm_mxint4_block_scale_moe`

Test Plan
- nm-testing/Qwen3-Next-80B-A3B-Instruct-NVFP4 (FLASHINFER_CUTLASS backend)
- `pytest tests/kernels/moe/ -v -s`
- `pre-commit run --all-files`

Test Result
E2E Verification (NVIDIA GB10, SM121)
WITHOUT the fix (guard removed from `flashinfer_cutlass_moe.py:FlashInferExperts.apply()`):

- nm-testing/Qwen3-Next-80B-A3B-Instruct-NVFP4 (NVFP4 MoE, FLASHINFER_CUTLASS backend)
- `CUDA error: an illegal instruction was encountered`, server crashed
- Crash in `dispatchMoeGemmSelectTileShapeTmaWarpSpecialized` during FlashInfer autotuning

WITH the fix (`autotune(False)` wrapping the kernel call):

- Returned "The capital of France is Paris." for the prompt "What is the capital of France?"