[BugFix][Performance] Restore flashinfer autotuning for all scenarios#27904
mgoin merged 3 commits into vllm-project:main
Conversation
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Code Review
This pull request effectively resolves a crash that occurred when running MoE models in eager mode by ensuring tune_max_num_tokens is at least 1. The fix is correctly applied across multiple flashinfer kernel invocation sites in trtllm_moe.py and mxfp4.py. Additionally, the removal of the now-redundant flashinfer_autotune_supported function and its associated logic simplifies the codebase and re-enables autotuning for all scenarios, which is a great improvement. The test suite has been updated appropriately to validate the fix. The changes are well-targeted and correct.
| "do_finalize": True, | ||
| "output": output, | ||
| "tune_max_num_tokens": self.max_capture_size, | ||
| "tune_max_num_tokens": max(self.max_capture_size, 1), |
Why were we setting this to self.max_capture_size? Shouldn't we set this to max_num_batched_tokens at least?
Just curious. cc @pavanimajety @nvpohanh
Ohh I see, very interesting. Yes, I have the same question: why not use the max batch size, since we will want to autotune not only for cudagraphs but for prefill as well?
@nvjullin Could you review this PR and comment on this? Thanks!
It comes from PR #23608. After a quick look in flashinfer, I believe this parameter is needed because autotuning on a dummy input won't exercise the maximum number of tokens at each EP rank. I agree max_num_batched_tokens makes more sense.
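For illustration, a rough sketch of what bounding the tuner by the scheduler's token budget could look like (the helper name and parameter plumbing are hypothetical, not part of this PR):

```python
# Hypothetical sketch (not what this PR changes): derive the autotune bound
# from the scheduler's token budget so prefill-sized batches are covered too,
# while still guarding against 0 in eager mode. Names here are illustrative.
def pick_tune_max_num_tokens(max_num_batched_tokens: int, max_capture_size: int) -> int:
    # Cover the largest prefill batch and the largest captured CUDAGraph batch;
    # never return 0, which trips the flashinfer autotuner assert.
    return max(max_num_batched_tokens, max_capture_size, 1)
```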
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
    # Enable autotune when,
    # https://github.com/flashinfer-ai/flashinfer/issues/2023 is
    # resolved.
    trtllm_fp4_block_scale_routed_moe(**kwargs)
yewentao256 left a comment
LGTM, thanks for the work!
    from vllm.utils.flashinfer import autotune


    with autotune(False):
        # Enable autotune when,
-        # Enable autotune when,
+        # TODO: Enable autotune when,
Purpose
Bug:

- On main + B200: `vllm serve openai/gpt-oss-20b --enforce-eager` fails.
- On main + H100: `VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 vllm serve openai/gpt-oss-20b --enforce-eager` fails.

Both failures are asserts in the flashinfer code base.
Note that this is the same error reported in #27751
Fix:

Our calls to the flashinfer MoE kernels set `tune_max_num_tokens` to the CUDAGraph capture size. When CUDAGraphs are disabled, `max_capture_size` is set to 0 and the autotuner asserts. This PR sets `tune_max_num_tokens` to 1 when CUDAGraphs are disabled (i.e. eager mode).
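As a minimal, self-contained sketch of the behavior (illustrative only; the actual change is the one-line `max(...)` shown in the diff above):

```python
# Illustrative sketch of the fix: the value passed to the flashinfer MoE
# kernels as tune_max_num_tokens must be >= 1. With --enforce-eager the
# CUDAGraph capture size is 0, which previously tripped the autotuner assert.
def tune_max_num_tokens(max_capture_size: int) -> int:
    return max(max_capture_size, 1)

assert tune_max_num_tokens(0) == 1      # eager mode (CUDAGraphs disabled)
assert tune_max_num_tokens(512) == 512  # CUDAGraphs enabled
```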
Note:

Initially, this issue was thought to manifest only in specific scenarios, and we resorted to skipping autotuning for those cases in PRs #27762 and #26729. This PR reverts the skip logic introduced in those PRs.
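Conceptually, the revert means the warm-up dummy pass runs under the autotune context manager again. A rough sketch, assuming `autotune(True)` enables tuning (by symmetry with the `autotune(False)` call shown in the snippet above) and with `run_dummy_moe_forward` as a hypothetical placeholder for the actual warm-up call:

```python
from vllm.utils.flashinfer import autotune

# Before (PRs #27762 / #26729): some configurations ran the warm-up pass with
# autotune(False) to skip tuning. After this PR the skip gating is removed and
# the dummy pass runs with autotuning enabled again.
def warmup(run_dummy_moe_forward):
    # run_dummy_moe_forward is a hypothetical placeholder for the model's
    # dummy MoE forward pass used during warm-up.
    with autotune(True):
        run_dummy_moe_forward()
```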
Fixes #27751
Test Plan
- Manually run `vllm serve openai/gpt-oss-20b --enforce-eager` on B200.
- CI
Test Result
Tests Pass