[FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell #36987
hmellor merged 1 commit into vllm-project:main
Conversation
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Code Review
This pull request reverts a workaround for a FlashInfer bug on Blackwell GPUs related to block_size=16 and head_size=256. The changes remove the conditional logic in vllm/model_executor/models/config.py that forced a larger block alignment size, and an assertion in vllm/v1/attention/backends/flashinfer.py that blocked this configuration. The removal of this workaround is justified by an upstream fix in the FlashInfer dependency. The code changes are consistent with this goal. I have reviewed the changes and found no issues.
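For illustration, the removed assertion in the FlashInfer backend can be sketched roughly as follows. This is a hypothetical simplification, not the exact code from `vllm/v1/attention/backends/flashinfer.py`; the function name and parameters are assumptions for the sketch.

```python
# Hypothetical sketch of the configuration check this PR removes
# (not the exact vLLM code).
def check_flashinfer_config(block_size: int, head_size: int,
                            is_blackwell: bool) -> None:
    """Reject the FlashInfer configuration that was buggy on Blackwell.

    Before this revert, the backend blocked block_size=16 together with
    head_size=256 on Blackwell GPUs outright, since FlashInfer produced
    incorrect results for that combination (flashinfer-ai/flashinfer#1993).
    """
    assert not (is_blackwell and block_size == 16 and head_size == 256), (
        "block_size=16 with head_size=256 is not supported on Blackwell; "
        "see flashinfer-ai/flashinfer#1993"
    )
```

With the upstream fix in place, the configuration is valid again and the check is simply deleted rather than relaxed.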
For completeness, could you link the PR that upgraded vLLM to the version with the fix? |
It was fixed a while ago. Version 0.6.0 already includes the fix. |
I know we already have the fix, I'd just like it to be well documented. I've updated the PR description. |
Sorry, I hadn't understood the final goal before. Thanks for merging.
No problem! In general, I find linking related issues like this really helpful when tracing back future issues.
Purpose
Revert the workaround introduced in vllm-project/vllm#27994.
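For context, the reverted logic can be sketched as follows. This is a hypothetical simplification, not the exact code from `vllm/model_executor/models/config.py`; the function name and parameters are assumptions for the sketch, while `kernel_block_alignment_size` follows the PR description.

```python
# Hypothetical sketch of the workaround this PR reverts
# (not the exact vLLM code).
def pick_kernel_block_alignment_size(block_size: int, head_size: int,
                                     is_blackwell: bool) -> int:
    """Return the block alignment size for the FlashInfer backend.

    The workaround forced kernel_block_alignment_size=32 (instead of 16)
    when the buggy block_size=16 + head_size=256 combination was detected
    on Blackwell.
    """
    if is_blackwell and block_size == 16 and head_size == 256:
        return 32
    return block_size
```

After the revert, the special case disappears and the alignment size simply follows `block_size` for every configuration.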
PR #27994 was originally merged to work around a FlashInfer bug tracked in flashinfer-ai/flashinfer#1993, which reported that the combination of `block_size=16` and `head_size=256` produced incorrect results on Blackwell. The workaround forced `kernel_block_alignment_size=32` (instead of 16) when this combination was detected, and added an assertion in the FlashInfer attention backend to block the problematic configuration entirely. Since then, the upstream FlashInfer issue has been fixed and vLLM's FlashInfer dependency has been updated (in #30993) to a version that includes the fix (
v0.6). The workaround is no longer necessary and can be safely removed.
Test Plan
Tested on B200 GPUs.
Qwen3-Next-80B-A3B-Instruct (hybrid mamba model, head_size 256)
This is the same model used to validate PR #27994. Ran full GSM8K 5-shot evaluation to confirm no accuracy regression.
```shell
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4

lm_eval --model local-completions \
  --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://localhost:8000/v1/completions,num_concurrent=109 \
  --tasks gsm8k --num_fewshot 5
```
Results comparison: