
[Core] Remove FlashAttention block size restriction for hybrid models #36701

Open
tdoublep wants to merge 1 commit into vllm-project:main from tdoublep:remove-hybrid-flash-attn-block-size-restriction

Conversation

@tdoublep
Member

Summary

The restriction limiting FlashAttention kernel block sizes to [16, 32, 64] for hybrid models with a float32 Mamba cache is no longer needed: PR vllm-project#35219 introduced KVBlockZeroer, which zeros freshly allocated KV cache blocks and prevents NaN propagation from stale fp32 data in reused blocks. This PR removes the now-obsolete conditional.

Test plan

Verified on H100 with nvidia/NVIDIA-Nemotron-Nano-9B-v2 (hybrid Mamba model) using the same reproduction script from #27753.

Test script

from vllm import LLM, SamplingParams
import os

# Force the FlashAttention backend so the removed block-size restriction is exercised.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

prompts = ["Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nConsider the paths of length $16$ that follow the lines from the lower left corner to the upper right corner on an $8\\times 8$ grid. Find the number of such paths that change direction exactly four times, as in the examples shown below.\n\nRemember to put your answer on its own line after \"Answer:\"."]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# A tiny block pool (num_gpu_blocks_override=10) plus prefix caching forces
# aggressive KV block reuse, the condition under which stale fp32 data
# previously produced NaNs.
llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    trust_remote_code=True,
    num_gpu_blocks_override=10,
    compilation_config={"cudagraph_capture_sizes": [1]},
    mamba_ssm_cache_dtype="float32",
    max_num_seqs=1,
    enable_prefix_caching=True,
)

# Run 10 generations and inspect each one for signs of cache corruption.
outputs = []
for i in range(10):
    result = llm.generate(prompts, sampling_params)[0]
    outputs.append(result)
    generated_text = result.outputs[0].text
    token_ids = result.outputs[0].token_ids
    print(f"--- Iteration {i+1} ---")
    print(f"Token IDs (first 20): {token_ids[:20]}")
    print(f"Generated text (first 200 chars): {generated_text[:200]!r}")
    print()

# A corrupted cache typically shows up as empty output or leading zero tokens.
all_ok = all(len(o.outputs[0].token_ids) > 0 and o.outputs[0].token_ids[0] != 0 for o in outputs)
print(f"All iterations produced meaningful output: {all_ok}")

Test output

All 10 iterations produced meaningful, coherent output:

--- Iteration 1 ---
Token IDs (first 20): [1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text (first 200 chars): '</think>\nThe problem requires finding the number of paths of length 16'

--- Iteration 2 ---
Token IDs (first 20): [1032, 1267, 20396, 22344, 1877, 1045, 17669, 1032, 1049, 1058, 21285, 1044, 21285, 1044, 16999, 1044]
Generated text (first 200 chars): ' \n\nExample paths:\n- Path 1: Right, Right, Down,'

--- Iteration 3 ---
Token IDs (first 20): [1885, 74045, 1561, 1784, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054, 1562, 1278, 4953, 3979]
Generated text (first 200 chars): '</think>\nThe number of paths of length 16 from the lower left'

--- Iteration 4 ---
Token IDs (first 20): [1885, 74045, 1561, 1784, 4127, 19263, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text (first 200 chars): '</think>\nThe problem involves finding the number of paths of length 16'

--- Iteration 5 ---
Token IDs (first 20): [4848, 1058, 3870, 15047, 1278, 4127, 1307, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049]
Generated text (first 200 chars): ' output: To solve the problem of finding the number of paths of length 1'

--- Iteration 6 ---
Token IDs (first 20): [1032, 1010, 1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032]
Generated text (first 200 chars): ' \n</think>\nThe problem requires finding the number of paths of length '

--- Iteration 7 ---
Token IDs (first 20): [1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text (first 200 chars): '</think>\nThe problem requires finding the number of paths of length 16'

--- Iteration 8 ---
Token IDs (first 20): [1032, 1010, 1885, 74045, 1561, 1784, 22344, 1307, 5592, 1032, 1049, 1054, 1562, 1278, 4953, 3979]
Generated text (first 200 chars): ' \n</think>\nThe paths of length 16 from the lower left'

--- Iteration 9 ---
Token IDs (first 20): [1032, 1267, 49250, 2077, 1561, 44053, 1044, 2878, 1681, 3219, 1046, 1362, 2534, 1317, 3081, 1278]
Generated text (first 200 chars): " \n\n<think>\nOkay, let's see. I need to find the"

--- Iteration 10 ---
Token IDs (first 20): [1060, 74045, 1561, 44053, 1044, 1878, 1362, 2534, 1317, 3081, 1278, 2782, 1307, 22344, 1408, 1420]
Generated text (first 200 chars): '<think>\nOkay, so I need to find the number of paths on an'

All iterations produced meaningful output: True

No NaN, no zero tokens, no empty strings across all 10 iterations.

🤖 Generated with Claude Code

The restriction limiting FA block sizes to [16, 32, 64] for hybrid
models with float32 Mamba cache is no longer needed. PR vllm-project#35219
introduced KVBlockZeroer which zeros freshly allocated KV cache blocks,
preventing NaN propagation from stale fp32 data in reused blocks.
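To illustrate why zeroing matters: bytes left over in a reused block can decode to NaN when reinterpreted as floats, while a zeroed block cannot. The NumPy sketch below is purely illustrative of the bit-level effect; it does not reflect vLLM's actual cache layout or the KVBlockZeroer implementation.

```python
import numpy as np

# A reused cache block may contain arbitrary stale bytes. All-ones bytes
# decode as float32 NaN (exponent all ones, nonzero mantissa).
stale_block = np.full(64, 0xFF, dtype=np.uint8).view(np.float32)
print(np.isnan(stale_block).any())  # True: stale bytes decode to NaN

# Zeroing the block on allocation guarantees every float decodes to 0.0,
# so no NaN can propagate from previously freed data.
zeroed_block = np.zeros(64, dtype=np.uint8).view(np.float32)
print(np.isnan(zeroed_block).any())  # False: zeroed block is NaN-free
```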

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@mergify mergify bot added the v1 label Mar 10, 2026
@tdoublep tdoublep marked this pull request as ready for review March 10, 2026 20:53
@tdoublep
Member Author

tdoublep commented Mar 10, 2026

cc @NickLucche

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request removes a block size restriction in FlashAttentionBackend.get_supported_kernel_block_sizes() that was a workaround for a previously fixed bug. The change simplifies the code by removing the now-obsolete conditional logic, which improves maintainability. The provided test plan confirms that removing this restriction does not reintroduce the original issue. The changes are correct and well-justified.
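For context, the change has the shape of dropping a dtype-conditional narrowing from the supported-block-size query. The sketch below uses hypothetical names and values to show that shape only; it is not vLLM's actual implementation of get_supported_kernel_block_sizes().

```python
# Illustrative sketch only: names, sizes, and structure are hypothetical,
# not vLLM's real FlashAttentionBackend code.

RESTRICTED_SIZES = [16, 32, 64]
ALL_SIZES = [16, 32, 64, 128, 256]

def get_supported_kernel_block_sizes(has_fp32_mamba_cache: bool,
                                     zeroing_available: bool = True):
    # Before KVBlockZeroer (vllm-project#35219), hybrid models with a
    # float32 Mamba cache were pinned to RESTRICTED_SIZES as a workaround
    # for NaNs from stale reused blocks. With zeroing in place, the guard
    # is obsolete and the full set is safe.
    if has_fp32_mamba_cache and not zeroing_available:
        return RESTRICTED_SIZES
    return ALL_SIZES
```

With the workaround removed, both branches collapse to returning the full list unconditionally, which is the simplification the review describes.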
