[Core] Remove FlashAttention block size restriction for hybrid models #36701
Open
tdoublep wants to merge 1 commit into vllm-project:main from
Conversation
The restriction limiting FA block sizes to [16, 32, 64] for hybrid models with a float32 Mamba cache is no longer needed. PR vllm-project#35219 introduced KVBlockZeroer, which zeros freshly allocated KV cache blocks, preventing NaN propagation from stale fp32 data in reused blocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
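The zeroing behavior described above can be illustrated with a toy allocator. This is a sketch only: `KVBlockZeroer`'s real implementation lives in vLLM, and the class, field, and method names below are invented for illustration.

```python
import math

class ToyKVBlockPool:
    """Minimal model of a KV-cache block pool: freed blocks retain stale
    float32 data, and each block is zeroed when it is (re)allocated,
    loosely mimicking what a block zeroer does for fresh blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        # Simulate stale fp32 garbage (NaNs) left behind by earlier use.
        self.storage = [[float("nan")] * block_size for _ in range(num_blocks)]
        self.free_blocks = list(range(num_blocks))

    def allocate(self, zero_on_alloc: bool = True) -> int:
        block_id = self.free_blocks.pop()
        if zero_on_alloc:
            # Zero the freshly allocated block so stale NaNs cannot propagate.
            self.storage[block_id] = [0.0] * self.block_size
        return block_id

pool = ToyKVBlockPool(num_blocks=4, block_size=8)
blk = pool.allocate()
print(any(math.isnan(x) for x in pool.storage[blk]))  # → False
```

With `zero_on_alloc=False` the reused block would still hold NaNs, which is the failure mode the old block-size restriction worked around.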
tdoublep (Member, Author):
cc @NickLucche
Contributor
Code Review
This pull request removes a block size restriction in FlashAttentionBackend.get_supported_kernel_block_sizes() that was a workaround for a previously fixed bug. The change simplifies the code by removing the now-obsolete conditional logic, which improves maintainability. The provided test plan confirms that removing this restriction does not reintroduce the original issue. The changes are correct and well-justified.
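The shape of the removed conditional logic can be sketched as follows. This is illustrative only: the parameter name and the full block-size list are assumptions; only the restricted list `[16, 32, 64]` and the function name come from the PR.

```python
from typing import List

# Illustrative full list; the actual sizes supported by vLLM may differ.
ALL_KERNEL_BLOCK_SIZES: List[int] = [16, 32, 64, 128, 256]

def supported_block_sizes_before(hybrid_with_fp32_mamba_cache: bool) -> List[int]:
    # Old workaround: hybrid models with a float32 Mamba cache were
    # limited to a small set of FlashAttention kernel block sizes.
    if hybrid_with_fp32_mamba_cache:
        return [16, 32, 64]
    return ALL_KERNEL_BLOCK_SIZES

def supported_block_sizes_after(hybrid_with_fp32_mamba_cache: bool) -> List[int]:
    # After this PR: freshly allocated blocks are zeroed, so the
    # restriction is unnecessary and the conditional is dropped.
    return ALL_KERNEL_BLOCK_SIZES

print(supported_block_sizes_before(True))  # → [16, 32, 64]
print(supported_block_sizes_after(True))   # → [16, 32, 64, 128, 256]
```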
Summary
- Removes the check in `FlashAttentionBackend.get_supported_kernel_block_sizes()` that limited hybrid models with a float32 Mamba cache to block sizes `[16, 32, 64]`
- PR vllm-project#35219 introduced `KVBlockZeroer`, making the restriction unnecessary

Test plan

Verified on H100 with `nvidia/NVIDIA-Nemotron-Nano-9B-v2` (a hybrid Mamba model) using the same reproduction script from #27753.

Test script
Test output
All 10 iterations produced meaningful, coherent output:
No NaN, no zero tokens, no empty strings across all 10 iterations.
🤖 Generated with Claude Code