
[FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell#36987

Merged
hmellor merged 1 commit into vllm-project:main from CentML:vadim/revert27994
Mar 16, 2026

Conversation

@vadiklyutiy
Collaborator

@vadiklyutiy vadiklyutiy commented Mar 13, 2026

Purpose

Revert the workaround introduced in vllm-project/vllm#27994.

PR #27994 was originally merged to work around a FlashInfer bug tracked in flashinfer-ai/flashinfer#1993, which reported that the combination of block_size=16 and head_size=256 produced incorrect results on Blackwell. The workaround forced kernel_block_alignment_size=32 (instead of 16) when this combination was detected, and added an assertion in the FlashInfer attention backend to block the problematic configuration entirely.

Since then, the upstream FlashInfer issue has been fixed and vLLM's FlashInfer dependency has been updated (in #30993) to a version that includes the fix (v0.6). The workaround is no longer necessary and can be safely removed.
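As a rough illustration (not vLLM's actual code; the real logic lives in `vllm/model_executor/models/config.py`), the reverted workaround's alignment selection amounted to something like this hypothetical sketch:

```python
# Hypothetical sketch of the reverted workaround's logic. The function name
# and signature are illustrative only, not vLLM's real API.

def kernel_block_alignment_size(head_size: int, is_blackwell: bool,
                                default_alignment: int = 16) -> int:
    """Return the kernel block alignment the workaround would have chosen."""
    if is_blackwell and head_size == 256 and default_alignment == 16:
        # FlashInfer versions before v0.6 produced incorrect results for
        # block_size=16 + head_size=256 on Blackwell, so the workaround
        # forced a larger alignment.
        return 32
    return default_alignment

print(kernel_block_alignment_size(256, True))   # workaround behavior: 32
print(kernel_block_alignment_size(256, False))  # non-Blackwell: 16
```

With the fix included in FlashInfer v0.6, this special case (and the matching assertion in the FlashInfer attention backend) can be dropped, and the default alignment of 16 is used unconditionally.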

Test Plan

Tested on B200 GPUs.

Qwen3-Next-80B-A3B-Instruct (hybrid mamba model, head_size 256)

This is the same model used to validate PR #27994. Ran full GSM8K 5-shot evaluation to confirm no accuracy regression.

```bash
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4
lm_eval --model local-completions \
    --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://localhost:8000/v1/completions,num_concurrent=109 \
    --tasks gsm8k --num_fewshot 5
```

Results comparison:

| Configuration | flexible-extract | strict-match |
|---|---|---|
| Before PR #27994 | 0.2123 ± 0.0113 | 0.1933 ± 0.0109 |
| After PR #27994 | 0.8491 ± 0.0099 | 0.8120 ± 0.0108 |
| This PR (revert) | 0.8567 ± 0.0097 | 0.8127 ± 0.0107 |
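A quick sanity check on the numbers above (data taken from the table; the check itself is just arithmetic, not part of the PR): the revert's scores are within the combined standard errors of the post-workaround scores, i.e. no accuracy regression.

```python
# Accuracy comparison using the GSM8K numbers from the PR description:
# (accuracy, stderr) for "After PR #27994" vs. "This PR (revert)".
after_27994 = {"flexible": (0.8491, 0.0099), "strict": (0.8120, 0.0108)}
this_revert = {"flexible": (0.8567, 0.0097), "strict": (0.8127, 0.0107)}

for metric in after_27994:
    a_acc, a_err = after_27994[metric]
    r_acc, r_err = this_revert[metric]
    # The difference is well inside the summed uncertainties.
    assert abs(r_acc - a_acc) <= a_err + r_err
    print(f"{metric}: delta = {r_acc - a_acc:+.4f}")
```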

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
@vadiklyutiy vadiklyutiy requested review from heheda12345 and pavanimajety and removed request for mgoin and pavanimajety March 13, 2026 15:35
@mergify mergify bot added the nvidia label Mar 13, 2026
@mergify mergify bot added the v1 label Mar 13, 2026
@vadiklyutiy vadiklyutiy requested a review from mgoin March 13, 2026 15:36
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request reverts a workaround for a FlashInfer bug on Blackwell GPUs related to block_size=16 and head_size=256. The changes remove the conditional logic in vllm/model_executor/models/config.py that forced a larger block alignment size, and an assertion in vllm/v1/attention/backends/flashinfer.py that blocked this configuration. The removal of this workaround is justified by an upstream fix in the FlashInfer dependency. The code changes are consistent with this goal. I have reviewed the changes and found no issues.

@vadiklyutiy vadiklyutiy added the qwen Related to Qwen models label Mar 13, 2026
@vadiklyutiy vadiklyutiy self-assigned this Mar 13, 2026
@vadiklyutiy vadiklyutiy added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 14, 2026
@hmellor
Member

hmellor commented Mar 16, 2026

For completeness, could you link the PR that upgraded vLLM to the version with the fix?

@vadiklyutiy
Collaborator Author

For completeness, could you link the PR that upgraded vLLM to the version with the fix?

It was fixed a while ago. Version 0.6.0 already includes the fix.

@hmellor
Member

hmellor commented Mar 16, 2026

I know we already have the fix, I'd just like it to be well documented. I've updated the PR description.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 16, 2026
@hmellor hmellor merged commit 8374387 into vllm-project:main Mar 16, 2026
64 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 16, 2026
@vadiklyutiy
Collaborator Author

I know we already have the fix, I'd just like it to be well documented. I've updated the PR description.

Sorry, I hadn't understood the end goal before. Thanks for merging.

@hmellor
Member

hmellor commented Mar 16, 2026

No problem! In general I find linking related issues like this really helpful when tracing back future issues.

XLiu-2000 pushed a commit to XLiu-2000/vllm that referenced this pull request Mar 17, 2026
…well (vllm-project#36987)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: 刘旭 <xuliu40@gmail.com>
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
…well (vllm-project#36987)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…well (vllm-project#36987)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
…well (vllm-project#36987)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

Labels

nvidia · qwen (Related to Qwen models) · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

Status: Done
