Disable flashinfer autotune temporarily due to correctness issues #41524

Merged

yewentao256 merged 2 commits into vllm-project:main from wzhao18:wzhao/disable-autotune on May 3, 2026
Conversation

wzhao18 (Contributor) commented on May 3, 2026

Purpose

We have observed correctness bugs with flashinfer autotuning; see flashinfer-ai/flashinfer#3197, where a kernel-level reproduction is available.

While a fix is pending, this PR disables flashinfer autotuning by default for the O1 and O2 optimization levels.
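
In outline, the change flips a single key in the O1 and O2 preset dictionaries in vllm/config/vllm.py. A minimal sketch (preset names mirror the review thread below; all other keys are omitted, so this is not the literal file contents):

```python
# Illustrative sketch only -- not the actual vllm/config/vllm.py;
# every unrelated preset key is omitted.
OPTIMIZATION_LEVEL_01 = {
    # Disabled for now due to correctness issues:
    # https://github.com/flashinfer-ai/flashinfer/issues/3197
    "enable_flashinfer_autotune": False,
}

OPTIMIZATION_LEVEL_02 = {
    # Same workaround as O1.
    "enable_flashinfer_autotune": False,
}
```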

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>

claude (Bot) left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

wzhao18 changed the title from "Disable flashinfer autotune for now due to correctness issues" to "Disable flashinfer autotune temporarily due to correctness issues" on May 3, 2026
gemini-code-assist (Bot) left a comment

Code Review

This pull request disables the enable_flashinfer_autotune setting in the OPTIMIZATION_LEVEL_01 and OPTIMIZATION_LEVEL_02 configurations in vllm/config/vllm.py to address known correctness issues. The feedback below suggests also disabling it for OPTIMIZATION_LEVEL_03, for consistency and to avoid incorrect results for users on that optimization level.

Comment thread on vllm/config/vllm.py:

```diff
-    "enable_flashinfer_autotune": True,
+    # Disabled for now due to correctness issues:
+    # https://github.com/flashinfer-ai/flashinfer/issues/3197
+    "enable_flashinfer_autotune": False,
```
gemini-code-assist (Bot) commented (severity: high)

The correctness issues with FlashInfer autotuning likely affect OPTIMIZATION_LEVEL_03 as well, especially since it is documented as being the same as OPTIMIZATION_LEVEL_02 (line 80). Leaving it enabled in O3 (line 256) while disabling it in O1 and O2 creates an inconsistency that could lead to incorrect results for users selecting the O3 level. Please consider disabling it for O3 as well.
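
For concreteness, a hypothetical sketch of the asymmetry the review describes, using the same preset naming as above (not the actual file contents):

```python
# Hypothetical sketch of the inconsistency flagged above: after this PR
# the flag is False in the O1/O2 presets but would remain True in O3.
OPTIMIZATION_LEVEL_03 = {
    # Documented as the same as O2, yet autotune would stay enabled here.
    "enable_flashinfer_autotune": True,
}
```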

mgoin added the bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed), and nvidia labels on May 3, 2026
mgoin (Member) left a comment

I've been seeing workarounds like this recently, LGTM

github-project-automation (Bot) moved this to Ready in NVIDIA on May 3, 2026
yewentao256 (Member) left a comment

LGTM, thanks for the work!


yewentao256 enabled auto-merge (squash) on May 3, 2026 at 14:41
yewentao256 merged commit c51df43 into vllm-project:main on May 3, 2026
55 checks passed
github-project-automation (Bot) moved this from Ready to Done in NVIDIA on May 3, 2026
joa-stdn pushed a commit to joa-stdn/vllm that referenced this pull request on May 4, 2026
…lm-project#41524)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Joachim Studnia <joachim@mistral.ai>

chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request on May 6, 2026

Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request on May 7, 2026
…lm-project#41524)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>

ikaadil pushed a commit to ikaadil/vllm that referenced this pull request on May 7, 2026
…lm-project#41524)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
eugr commented on May 8, 2026

@wzhao18 - This PR nukes performance on nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 on a dual-node DGX Spark cluster (only in the cluster, though): throughput goes from ~26 t/s before this PR to ~18 t/s after. Adding --enable-flashinfer-autotune to the vLLM parameters restores the performance.

@mgoin - FYI.
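
For anyone hitting the same regression, the workaround above amounts to opting back into autotuning explicitly. A hedged sketch of the offline-API equivalent, assuming the --enable-flashinfer-autotune CLI flag maps to an engine argument of the same name (an assumption, not verified against vLLM's argument parser):

```python
# Sketch only: explicitly re-enable autotuning despite the new default.
# Assumes `enable_flashinfer_autotune` is accepted as an engine argument;
# on the CLI the reported equivalent is --enable-flashinfer-autotune.
from vllm import LLM

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    enable_flashinfer_autotune=True,
)
```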

wzhao18 (Contributor, Author) commented on May 8, 2026

The issue has been root-caused in flashinfer and fixed by flashinfer-ai/flashinfer#3227. We can re-enable autotuning once flashinfer v0.6.11 is integrated.
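
If the re-enable ends up gated on the fixed release, a version check along these lines would express it (illustrative only; the __version__ attribute and the exact threshold are assumptions, not vLLM code):

```python
# Illustrative sketch: default autotune back on only once a flashinfer
# release carrying flashinfer-ai/flashinfer#3227 (v0.6.11+) is installed.
from packaging.version import Version

import flashinfer

FLASHINFER_AUTOTUNE_DEFAULT = Version(flashinfer.__version__) >= Version("0.6.11")
```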

licson added a commit to licson/vllm that referenced this pull request on May 11, 2026

Labels

bug (Something isn't working), nvidia, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

4 participants