[BugFix] Potential Fix for FA3 full-cudagraph IMA #25490
Merged
WoosukKwon merged 3 commits into main on Sep 24, 2025
Conversation
Contributor
Code Review
This pull request aims to fix a potential Invalid Memory Access in FlashAttention 3 with full CUDA graphs by ensuring the max_num_splits parameter is consistent. The change refactors the logic for setting max_num_splits to a common location. However, the current implementation introduces a critical flaw: it can lead to an UnboundLocalError because max_num_splits is not defined in all code paths. My review provides a fix for this issue to ensure the variable is always initialized. Addressing this will also help achieve the PR's goal of making the parameter consistent.
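A minimal, hypothetical sketch (not the actual vLLM code) of the flaw the review describes: a variable like max_num_splits that is assigned only inside one branch raises UnboundLocalError on any code path that skips that branch, and the fix is to initialize it unconditionally. All names and values here are placeholders.

```python
def build_metadata_buggy(use_cudagraph: bool, num_reqs: int,
                         max_cudagraph_size: int) -> int:
    if use_cudagraph and num_reqs <= max_cudagraph_size:
        max_num_splits = 8  # placeholder cap for the CUDA-graph path
    # BUG: no assignment on the non-cudagraph path
    return max_num_splits  # UnboundLocalError when the branch is skipped


def build_metadata_fixed(use_cudagraph: bool, num_reqs: int,
                         max_cudagraph_size: int) -> int:
    # Fix: initialize on every path; 0 means "defer to the FA3 heuristic".
    max_num_splits = 0
    if use_cudagraph and num_reqs <= max_cudagraph_size:
        max_num_splits = 8  # same placeholder cap as above
    return max_num_splits
```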
Force-pushed from 58df19e to e1c19ca
Collaborator
@LucasWilkinson Can you please check the CI again?
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request on Sep 25, 2025. Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
yewentao256 pushed a commit that referenced this pull request on Oct 3, 2025. Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request on Oct 11, 2025. Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request on Oct 20, 2025. Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request on Nov 10, 2025. Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@WoosukKwon reported an IMA with FA3 full-CG that was fixed by doing https://github.com/vllm-project/vllm/compare/woosuk/fa3-ima?expand=1

The theory here is that `get_scheduler_metadata` was being called with a different `max_num_splits` than what was being passed to `FlashAttentionMetadata`. This is an alternative solution that doesn't lose the logic of using `max_num_splits=0` (i.e. use the heuristic) for batches larger than `max_cudagraph_size`. We do not currently have a repro, so we cannot confirm this resolves @WoosukKwon's IMA, but it should be fixed regardless; we should always make sure the arguments to `get_scheduler_metadata` and `FlashAttentionMetadata` are in line.
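A hypothetical illustration of the invariant the PR enforces: compute max_num_splits once, in one place, and pass the same value to both the scheduler-metadata call and the attention-metadata object, so FA3's split-KV workspace is sized consistently with what the kernel is told at launch. The class, function, and values below are placeholders, not the real vLLM/FA3 APIs.

```python
from dataclasses import dataclass


@dataclass
class FakeAttentionMetadata:
    # Stand-in for FlashAttentionMetadata; only the field we care about.
    max_num_splits: int


def get_fake_scheduler_metadata(max_num_splits: int) -> dict:
    # Stand-in for get_scheduler_metadata.
    return {"max_num_splits": max_num_splits}


def build(num_reqs: int, max_cudagraph_size: int, use_full_cudagraph: bool):
    # Single source of truth: 0 defers to the FA3 heuristic for batches
    # larger than the captured CUDA-graph size.
    if use_full_cudagraph and num_reqs <= max_cudagraph_size:
        max_num_splits = 8  # placeholder cap for the captured graph
    else:
        max_num_splits = 0
    sched = get_fake_scheduler_metadata(max_num_splits)
    attn = FakeAttentionMetadata(max_num_splits=max_num_splits)
    # The two must always agree; a mismatch is the suspected IMA cause.
    assert sched["max_num_splits"] == attn.max_num_splits
    return sched, attn
```

The design point is simply that the value is derived once and threaded through, so the two consumers can never drift apart even if the branching logic changes later.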