Reapply [Attention][FA3] Update FA3 to include new swizzle optimization#34043
Conversation
Code Review
This pull request updates the flash-attention dependency and adjusts the scheduler_metadata buffer size to accommodate changes in flash-attention and handle edge cases related to CUDA graph capturing. The changes appear correct and necessary. My review focuses on improving the clarity of the comments associated with these critical buffer size calculations. Enhancing these comments will improve maintainability and help prevent future confusion or bugs, especially since this area of the code has been a source of issues in the past.
# Times 4 due to:
# https://github.com/vllm-project/flash-attention/blob/3223650ccabe622a0fcae65eec706a50186a89f7/hopper/flash_api.cpp#L650-L653
# For some tests max_cudagraph_size > max_num_seqs,
# so we need to use the larger one.
The comment explaining the buffer size calculation is confusing: it compares max_cudagraph_size (a number of tokens) with max_num_seqs (a number of sequences) without explaining why the comparison is valid. A more detailed explanation would improve clarity and maintainability, especially given the history of issues with this code block.
During CUDA graph capture for uniform decode batches, the number of requests (num_reqs) is set to be equal to the number of tokens (num_tokens). This num_tokens can be up to max_cudagraph_capture_size. In some configurations (especially in tests), max_cudagraph_capture_size can be larger than max_num_seqs. Since the scheduler_metadata buffer size depends on the number of requests, we must allocate a buffer large enough for this capture-time scenario.
Suggested change:
- # Times 4 due to:
- # https://github.com/vllm-project/flash-attention/blob/3223650ccabe622a0fcae65eec706a50186a89f7/hopper/flash_api.cpp#L650-L653
- # For some tests max_cudagraph_size > max_num_seqs,
- # so we need to use the larger one.
+ # During CUDA graph capture, `num_reqs` can be up to `max_cudagraph_size`.
+ # This can be larger than `max_num_seqs`, so we use the max of the two.
+ # The size is multiplied by 4 due to flash-attention's requirement:
+ # https://github.com/vllm-project/flash-attention/blob/3223650ccabe622a0fcae65eec706a50186a89f7/hopper/flash_api.cpp#L650-L653
…raphs
- Increase scheduler_metadata buffer to 1 + round_up(batch_size, 4) * 4 to account for all num_prepare_batch_vectors slots + semaphore
- Only relax BatchDescriptor for PIECEWISE mode, not FULL mode (FA3's scheduler_metadata computation depends on exact num_reqs)
- Rename relax_for_mixed_batch_cudagraphs -> relax_for_piecewise_cudagraphs
- Update tests to reflect FULL mode using exact keys

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
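To make the sizing logic above concrete, here is a minimal, hypothetical sketch of the buffer-size computation. The helper names (`round_up`, `scheduler_metadata_size`) and parameters are illustrative, not vLLM's actual API; the formula combines the max-of-the-two request count discussed in the review with the `1 + round_up(batch_size, 4) * 4` sizing from this commit message.

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple (assumed multiple > 0)."""
    return -(-x // multiple) * multiple


def scheduler_metadata_size(max_num_seqs: int,
                            max_cudagraph_capture_size: int) -> int:
    # During CUDA graph capture of uniform decode batches, num_reqs can be
    # as large as the capture size in tokens, which in some configurations
    # (especially tests) exceeds max_num_seqs. Size for the larger of the two.
    batch_size = max(max_num_seqs, max_cudagraph_capture_size)
    # 1 semaphore slot, plus 4 prepare-batch vectors over the request count
    # rounded up to a multiple of 4 (the "times 4" requirement referenced
    # in flash-attention's flash_api.cpp).
    return 1 + round_up(batch_size, 4) * 4


# Capture size dominates: 1 + 512 * 4 = 2049
print(scheduler_metadata_size(128, 512))
# max_num_seqs dominates: 1 + 256 * 4 = 1025
print(scheduler_metadata_size(256, 64))
```

Taking the max keeps a single allocation valid both for normal serving (bounded by max_num_seqs) and for the capture-time edge case where num_reqs tracks the token count.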
Remove relax_for_piecewise_cudagraphs() method from BatchDescriptor and use NamedTuple._replace() directly for cleaner O(1) set lookups. For pure FULL mode, always relax uniform=False since keys are registered with uniform=False.

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Reapply #23465 after revert in #33841 but with correct metadata sizes