Disable FlashInfer sampler by default #26859

Merged

tlrmchlsmth merged 1 commit into vllm-project:main from neuralmagic:disable-flashinfer-sampler-default on Oct 15, 2025

Member

mgoin commented Oct 15, 2025 (edited by github-actions bot)

Purpose

There have been increasing reports of correctness issues or illegal memory accesses (IMA) with FlashInfer's top-p and top-k sampling kernel (see #26480 (comment)). For instance, it appears to generate the same output even when the temperature is quite high and no seed is set. Once the kernel is disabled, vLLM generates different results, as expected.

Since flashinfer-python is now a default dependency of vLLM on CUDA, many more users would be using this kernel by default. Let us disable it by default for now so that users can opt in.
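
The symptom described in the Purpose above (identical outputs at high temperature with no seed set) can be turned into a simple sanity check. The following is a minimal, self-contained sketch using only the Python standard library and toy logits; it does not call vLLM or FlashInfer, and the helper names sample_with_temperature and count_distinct_sequences are hypothetical, introduced here for illustration only.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Draw one token index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def count_distinct_sequences(logits, temperature, num_runs, seq_len, rng):
    """Sample num_runs sequences and count how many distinct ones appear."""
    sequences = set()
    for _ in range(num_runs):
        tokens = tuple(
            sample_with_temperature(logits, temperature, rng)
            for _ in range(seq_len)
        )
        sequences.add(tokens)
    return len(sequences)


if __name__ == "__main__":
    toy_logits = [1.0, 0.5, 0.2, 0.1]  # illustrative values, not from the PR
    rng = random.Random()  # unseeded, as in the reports
    distinct = count_distinct_sequences(toy_logits, 2.0, 8, 16, rng)
    # A healthy sampler at high temperature should yield more than one
    # distinct sequence; exactly one would match the reported symptom.
    print("distinct sequences:", distinct)
```

Against a real deployment, the same idea would apply to repeated requests with an identical prompt: if every response is identical at high temperature with no seed, the sampler is suspect.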

Test Plan

Test Result



          Disable FlashInfer sampler by default

9bd64cf

Signed-off-by: mgoin <mgoin64@gmail.com>

mgoin requested review from 22quinn, houseroad and njhill as code owners

October 15, 2025 00:24

mgoin added the ready label

mergify bot added the v1 label

gemini-code-assist bot reviewed


Contributor

gemini-code-assist bot left a comment

Code Review

This pull request correctly disables the FlashInfer sampler by default, requiring users to opt-in by setting VLLM_USE_FLASHINFER_SAMPLER=1. The change from envs.VLLM_USE_FLASHINFER_SAMPLER is not False to a direct boolean check simplifies the logic and makes the default behavior consistent. The corresponding log message is also appropriately changed from a warning to a debug message. The implementation is sound and improves code clarity.
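
To illustrate why the two checks named in the review behave differently, here is a minimal sketch. It assumes the flag is tri-state (None when the variable is unset, otherwise True or False), which is what an "is not False" comparison suggests; the helper names read_optional_bool_env, sampler_enabled_opt_out and sampler_enabled_opt_in are hypothetical and are not taken from the vLLM code base.

```python
import os
from typing import Mapping, Optional

ENV_NAME = "VLLM_USE_FLASHINFER_SAMPLER"


def read_optional_bool_env(
    name: str, environ: Mapping[str, str] = os.environ
) -> Optional[bool]:
    """Return None when the variable is unset, else parse "0"/"1" as a bool.

    Assumption: the value is an integer string; anything else raises ValueError.
    """
    raw = environ.get(name)
    if raw is None:
        return None
    return bool(int(raw))


def sampler_enabled_opt_out(flag: Optional[bool]) -> bool:
    """Old-style check: enabled unless the user explicitly set the flag to 0."""
    return flag is not False


def sampler_enabled_opt_in(flag: Optional[bool]) -> bool:
    """New-style check: enabled only when the user explicitly set the flag to 1."""
    return bool(flag)


if __name__ == "__main__":
    flag = read_optional_bool_env(ENV_NAME)
    print("opt-out check:", sampler_enabled_opt_out(flag))
    print("opt-in check:", sampler_enabled_opt_in(flag))
```

The two checks differ only in the unset case: with no environment variable, the opt-out form evaluates to True and the opt-in form to False, which is the change in default behavior this PR describes.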

tlrmchlsmth approved these changes


tlrmchlsmth enabled auto-merge (squash)

October 15, 2025 00:50

tlrmchlsmth merged commit e66d787 into vllm-project:main

52 of 53 checks passed

mgoin mentioned this pull request

[UX] Fallback to native implementation when flashinfer sampler failed to compile #26799

Closed

5 tasks

bbartels pushed a commit to bbartels/vllm that referenced this pull request


          Disable FlashInfer sampler by default (vllm-project#26859)

75ceedc

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: bbartels <benjamin@bartels.dev>

jeejeelee mentioned this pull request

[Kernel] Lazy import FlashInfer #26977

Merged

5 tasks

DarkLight1337 mentioned this pull request

[Bug]: Hybrid Attention models broken after switching to flashinfer 0.4 (tested on Granite 4.0 H, Qwen3-Next, Jamba-3B, Nemotron-H-8b) #26936

Open

1 task

lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request


          Disable FlashInfer sampler by default (vllm-project#26859)

a89590d

Signed-off-by: mgoin <mgoin64@gmail.com>

alhridoy pushed a commit to alhridoy/vllm that referenced this pull request


          Disable FlashInfer sampler by default (vllm-project#26859)

4a94dbd

Signed-off-by: mgoin <mgoin64@gmail.com>

0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request


          Disable FlashInfer sampler by default (vllm-project#26859)

fa36257

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request


          Disable FlashInfer sampler by default (vllm-project#26859)

a2e43af

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

noooop mentioned this pull request

[Installation]: FlashInfer Dependency issue due to pre-release apache-tvm-ffi #27476

Closed

1 task

npanpaliya pushed a commit to odh-on-pz/vllm-cpu that referenced this pull request


          Disable FlashInfer sampler by default (vllm-project/vllm#26859)

16f6b81

Signed-off-by: mgoin <mgoin64@gmail.com>

npanpaliya pushed a commit to odh-on-pz/vllm-cpu that referenced this pull request


          Cherry-pick: Disable FlashInfer sampler by default (vllm-project/vllm#26859) (red-hat-data-services#295)

9b63d25

vllm-project/vllm#26859

Signed-off-by: mgoin <mgoin64@gmail.com>

LucasWilkinson mentioned this pull request

[Bug]: illegal memory access when there are multiple concurrent request #23814

Closed

1 task

rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request


          Disable FlashInfer sampler by default (vllm-project#26859)

dbe9856

Signed-off-by: mgoin <mgoin64@gmail.com>

yzh119 commented Nov 19, 2025

Hi @mgoin, can we get a concrete reproducing example (or a link to the issues) for the correctness/IMA issues you mentioned? It would be very helpful for us in improving stability.

devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request


          Disable FlashInfer sampler by default (vllm-project#26859)

ea61438

Signed-off-by: mgoin <mgoin64@gmail.com>

njhill mentioned this pull request

[Perf] Triton-based top-p/top-k masking #32558

Closed


Reviewers

tlrmchlsmth approved these changes
22quinn (code owner): awaiting requested review
houseroad (code owner): awaiting requested review
njhill (code owner): awaiting requested review
gemini-code-assist[bot] left review comments