Disable flashinfer autotune temporarily due to correctness issues #41524

Merged

yewentao256 merged 2 commits into vllm-project:main from wzhao18:wzhao/disable-autotune on May 3, 2026
Conversation

wzhao18 (Contributor) commented on May 3, 2026

Purpose

We have observed correctness bugs with flashinfer autotuning; see flashinfer-ai/flashinfer#3197, where a kernel-level reproduction is available.

While a fix is pending, this PR disables flashinfer autotuning by default for the O1 and O2 optimization levels.
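
In outline, the change flips a single key in the O1 and O2 preset dictionaries in vllm/config/vllm.py. A minimal sketch (preset names mirror the review thread below; all other keys are omitted, so this is not the literal file contents):

```python
# Illustrative sketch only -- not the actual vllm/config/vllm.py;
# every unrelated preset key is omitted.
OPTIMIZATION_LEVEL_01 = {
    # Disabled for now due to correctness issues:
    # https://github.com/flashinfer-ai/flashinfer/issues/3197
    "enable_flashinfer_autotune": False,
}

OPTIMIZATION_LEVEL_02 = {
    # Same workaround as O1.
    "enable_flashinfer_autotune": False,
}
```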

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>

claude (Bot) left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

wzhao18 changed the title from "Disable flashinfer autotune for now due to correctness issues" to "Disable flashinfer autotune temporarily due to correctness issues" on May 3, 2026
gemini-code-assist (Bot) left a comment

Code Review

This pull request disables the enable_flashinfer_autotune setting in the OPTIMIZATION_LEVEL_01 and OPTIMIZATION_LEVEL_02 configurations in vllm/config/vllm.py to address known correctness issues. The feedback below suggests also disabling it for OPTIMIZATION_LEVEL_03, for consistency and to avoid incorrect results for users on that optimization level.

Comment thread on vllm/config/vllm.py:

```diff
-    "enable_flashinfer_autotune": True,
+    # Disabled for now due to correctness issues:
+    # https://github.com/flashinfer-ai/flashinfer/issues/3197
+    "enable_flashinfer_autotune": False,
```
gemini-code-assist (Bot) commented (severity: high)

The correctness issues with FlashInfer autotuning likely affect OPTIMIZATION_LEVEL_03 as well, especially since it is documented as being the same as OPTIMIZATION_LEVEL_02 (line 80). Leaving it enabled in O3 (line 256) while disabling it in O1 and O2 creates an inconsistency that could lead to incorrect results for users selecting the O3 level. Please consider disabling it for O3 as well.
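
For concreteness, a hypothetical sketch of the asymmetry the review describes, using the same preset naming as above (not the actual file contents):

```python
# Hypothetical sketch of the inconsistency flagged above: after this PR
# the flag is False in the O1/O2 presets but would remain True in O3.
OPTIMIZATION_LEVEL_03 = {
    # Documented as the same as O2, yet autotune would stay enabled here.
    "enable_flashinfer_autotune": True,
}
```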

mgoin added the bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed), and nvidia labels on May 3, 2026
mgoin (Member) left a comment

I've been seeing workarounds like this recently, LGTM

github-project-automation (Bot) moved this to Ready in NVIDIA on May 3, 2026
yewentao256 (Member) left a comment

LGTM, thanks for the work!


yewentao256 enabled auto-merge (squash) on May 3, 2026 at 14:41
yewentao256 merged commit c51df43 into vllm-project:main on May 3, 2026
55 checks passed
github-project-automation (Bot) moved this from Ready to Done in NVIDIA on May 3, 2026
joa-stdn pushed a commit to joa-stdn/vllm that referenced this pull request on May 4, 2026
…lm-project#41524)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Joachim Studnia <joachim@mistral.ai>

chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request on May 6, 2026

Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request on May 7, 2026
…lm-project#41524)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>

ikaadil pushed a commit to ikaadil/vllm that referenced this pull request on May 7, 2026
…lm-project#41524)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
eugr commented on May 8, 2026

@wzhao18 - This PR nukes performance on nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 on a dual-node DGX Spark cluster (only in the cluster, though): throughput goes from ~26 t/s before this PR to ~18 t/s after. Adding --enable-flashinfer-autotune to the vLLM parameters restores the performance.

@mgoin - FYI.
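
For anyone hitting the same regression, the workaround above amounts to opting back into autotuning explicitly. A hedged sketch of the offline-API equivalent, assuming the --enable-flashinfer-autotune CLI flag maps to an engine argument of the same name (an assumption, not verified against vLLM's argument parser):

```python
# Sketch only: explicitly re-enable autotuning despite the new default.
# Assumes `enable_flashinfer_autotune` is accepted as an engine argument;
# on the CLI the reported equivalent is --enable-flashinfer-autotune.
from vllm import LLM

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    enable_flashinfer_autotune=True,
)
```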

wzhao18 (Contributor, Author) commented on May 8, 2026

The issue has been root-caused in flashinfer and fixed by flashinfer-ai/flashinfer#3227. We can re-enable autotuning once flashinfer v0.6.11 is integrated.
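
If the re-enable ends up gated on the fixed release, a version check along these lines would express it (illustrative only; the __version__ attribute and the exact threshold are assumptions, not vLLM code):

```python
# Illustrative sketch: default autotune back on only once a flashinfer
# release carrying flashinfer-ai/flashinfer#3227 (v0.6.11+) is installed.
from packaging.version import Version

import flashinfer

FLASHINFER_AUTOTUNE_DEFAULT = Version(flashinfer.__version__) >= Version("0.6.11")
```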

licson added a commit to licson/vllm that referenced this pull request on May 11, 2026

Labels

bug (Something isn't working), nvidia, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

4 participants