[FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell #36987
hmellor merged 1 commit into vllm-project:main
Conversation
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Code Review
This pull request reverts a workaround for a FlashInfer bug on Blackwell GPUs related to block_size=16 and head_size=256. The changes remove the conditional logic in vllm/model_executor/models/config.py that forced a larger block alignment size, and an assertion in vllm/v1/attention/backends/flashinfer.py that blocked this configuration. The removal of this workaround is justified by an upstream fix in the FlashInfer dependency. The code changes are consistent with this goal. I have reviewed the changes and found no issues.
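For illustration, the removed assertion in the FlashInfer backend can be sketched roughly as follows. This is a hypothetical simplification, not the exact code from `vllm/v1/attention/backends/flashinfer.py`; the function name and parameters are assumptions for the sketch.

```python
# Hypothetical sketch of the configuration check this PR removes
# (not the exact vLLM code).
def check_flashinfer_config(block_size: int, head_size: int,
                            is_blackwell: bool) -> None:
    """Reject the FlashInfer configuration that was buggy on Blackwell.

    Before this revert, the backend blocked block_size=16 together with
    head_size=256 on Blackwell GPUs outright, since FlashInfer produced
    incorrect results for that combination (flashinfer-ai/flashinfer#1993).
    """
    assert not (is_blackwell and block_size == 16 and head_size == 256), (
        "block_size=16 with head_size=256 is not supported on Blackwell; "
        "see flashinfer-ai/flashinfer#1993"
    )
```

With the upstream fix in place, the configuration is valid again and the check is simply deleted rather than relaxed.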
For completeness, could you link the PR that upgraded vLLM to the version with the fix? |
It was fixed a while ago. Version 0.6.0 already includes the fix. |
I know we already have the fix, I'd just like it to be well documented. I've updated the PR description. |
Sorry, I hadn't understood the final goal before. Thanks for merging.
No problem! In general, I find linking related issues like this really helpful when tracing back future issues.
Purpose
Revert the workaround introduced in vllm-project/vllm#27994.
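For context, the reverted logic can be sketched as follows. This is a hypothetical simplification, not the exact code from `vllm/model_executor/models/config.py`; the function name and parameters are assumptions for the sketch, while `kernel_block_alignment_size` follows the PR description.

```python
# Hypothetical sketch of the workaround this PR reverts
# (not the exact vLLM code).
def pick_kernel_block_alignment_size(block_size: int, head_size: int,
                                     is_blackwell: bool) -> int:
    """Return the block alignment size for the FlashInfer backend.

    The workaround forced kernel_block_alignment_size=32 (instead of 16)
    when the buggy block_size=16 + head_size=256 combination was detected
    on Blackwell.
    """
    if is_blackwell and block_size == 16 and head_size == 256:
        return 32
    return block_size
```

After the revert, the special case disappears and the alignment size simply follows `block_size` for every configuration.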
PR #27994 was originally merged to work around a FlashInfer bug tracked in flashinfer-ai/flashinfer#1993, which reported that the combination of `block_size=16` and `head_size=256` produced incorrect results on Blackwell. The workaround forced `kernel_block_alignment_size=32` (instead of 16) when this combination was detected, and added an assertion in the FlashInfer attention backend to block the problematic configuration entirely. Since then, the upstream FlashInfer issue has been fixed and vLLM's FlashInfer dependency has been updated (in #30993) to a version that includes the fix (
v0.6). The workaround is no longer necessary and can be safely removed.
Test Plan
Tested on B200 GPUs.
Qwen3-Next-80B-A3B-Instruct (hybrid mamba model, head_size 256)
This is the same model used to validate PR #27994. Ran full GSM8K 5-shot evaluation to confirm no accuracy regression.
```shell
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4

lm_eval --model local-completions \
  --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://localhost:8000/v1/completions,num_concurrent=109 \
  --tasks gsm8k --num_fewshot 5
```
Results comparison: