[BugFix][Performance] Restore flashinfer autotuning for all scenarios#27904
mgoin merged 3 commits into vllm-project:main
Conversation
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Code Review
This pull request effectively resolves a crash that occurred when running MoE models in eager mode by ensuring tune_max_num_tokens is at least 1. The fix is correctly applied across multiple flashinfer kernel invocation sites in trtllm_moe.py and mxfp4.py. Additionally, the removal of the now-redundant flashinfer_autotune_supported function and its associated logic simplifies the codebase and re-enables autotuning for all scenarios, which is a great improvement. The test suite has been updated appropriately to validate the fix. The changes are well-targeted and correct.
| "do_finalize": True, | ||
| "output": output, | ||
| "tune_max_num_tokens": self.max_capture_size, | ||
| "tune_max_num_tokens": max(self.max_capture_size, 1), |
Why were we setting this to self.max_capture_size? Shouldn't we set this to max_num_batched_tokens at least?
Just curious. cc @pavanimajety @nvpohanh
Ohh I see, very interesting. Yes, I have the same question: why not use the max batch size, since we will want to autotune not only for cudagraphs but for prefill as well?
@nvjullin Could you review this PR and comment on this? Thanks!
It comes from PR #23608. After a quick look in flashinfer, I believe this parameter is needed because autotuning on a dummy input won't exercise the maximum number of tokens at each EP rank. I agree max_num_batched_tokens makes more sense.
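For illustration, a rough sketch of what bounding the tuner by the scheduler's token budget could look like (the helper name and parameter plumbing are hypothetical, not part of this PR):

```python
# Hypothetical sketch (not what this PR changes): derive the autotune bound
# from the scheduler's token budget so prefill-sized batches are covered too,
# while still guarding against 0 in eager mode. Names here are illustrative.
def pick_tune_max_num_tokens(max_num_batched_tokens: int, max_capture_size: int) -> int:
    # Cover the largest prefill batch and the largest captured CUDAGraph batch;
    # never return 0, which trips the flashinfer autotuner assert.
    return max(max_num_batched_tokens, max_capture_size, 1)
```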
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
    # Enable autotune when,
    # https://github.com/flashinfer-ai/flashinfer/issues/2023 is
    # resolved.
    trtllm_fp4_block_scale_routed_moe(**kwargs)
yewentao256 left a comment
LGTM, thanks for the work!
    from vllm.utils.flashinfer import autotune


    with autotune(False):
        # Enable autotune when,
-        # Enable autotune when,
+        # TODO: Enable autotune when,
Purpose
Bug:

- On main + B200: `vllm serve openai/gpt-oss-20b --enforce-eager` fails.
- On main + H100: `VLLM_USE_FLASHINFER_MOE_MXFP4_BF16=1 vllm serve openai/gpt-oss-20b --enforce-eager` fails.

Both failures are asserts in the flashinfer code base.
Note that this is the same error reported in #27751
Fix:

Our calls to the flashinfer MoE kernels set `tune_max_num_tokens` to the CUDAGraph capture size. When CUDAGraphs are disabled, `max_capture_size` is set to 0 and the autotuner asserts. This PR sets `tune_max_num_tokens` to 1 when CUDAGraphs are disabled (i.e. eager mode).
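As a minimal, self-contained sketch of the behavior (illustrative only; the actual change is the one-line `max(...)` shown in the diff above):

```python
# Illustrative sketch of the fix: the value passed to the flashinfer MoE
# kernels as tune_max_num_tokens must be >= 1. With --enforce-eager the
# CUDAGraph capture size is 0, which previously tripped the autotuner assert.
def tune_max_num_tokens(max_capture_size: int) -> int:
    return max(max_capture_size, 1)

assert tune_max_num_tokens(0) == 1      # eager mode (CUDAGraphs disabled)
assert tune_max_num_tokens(512) == 512  # CUDAGraphs enabled
```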
Note:

Initially, this issue was thought to manifest only in specific scenarios, and we resorted to skipping autotuning for those cases in PRs #27762 and #26729. This PR reverts the skip logic introduced in those PRs.
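Conceptually, the revert means the warm-up dummy pass runs under the autotune context manager again. A rough sketch, assuming `autotune(True)` enables tuning (by symmetry with the `autotune(False)` call shown in the snippet above) and with `run_dummy_moe_forward` as a hypothetical placeholder for the actual warm-up call:

```python
from vllm.utils.flashinfer import autotune

# Before (PRs #27762 / #26729): some configurations ran the warm-up pass with
# autotune(False) to skip tuning. After this PR the skip gating is removed and
# the dummy pass runs with autotuning enabled again.
def warmup(run_dummy_moe_forward):
    # run_dummy_moe_forward is a hypothetical placeholder for the model's
    # dummy MoE forward pass used during warm-up.
    with autotune(True):
        run_dummy_moe_forward()
```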
Fixes #27751
Test Plan
- Manually run `vllm serve openai/gpt-oss-20b --enforce-eager` on B200.
- CI
Test Result
Tests Pass