Disable 2CTA fwd non-causal on CUDA 12 to work around codegen regression #2461
Merged
…egression CUDA 12.9 has a codegen issue that causes ~18% slowdown for 2CTA forward non-causal (hdim=128: 1280 vs 1542 TFLOPS). This is fixed in CUDA 13.x. Auto-disable 2CTA when CUDA 12.9 is detected. Users on CUDA 13.x are unaffected. The manual `FA_DISABLE_2CTA=1` override continues to work regardless of CUDA version.
tridao
reviewed
Apr 15, 2026
    cuda_version = torch.version.cuda
    if cuda_version is not None:
        major, minor = cuda_version.split(".")[:2]
        return int(major) == 12 and int(minor) == 9
Member
Let's just check `int(major) == 12`
Collaborator
Author
Sure, will change. Thanks Tri
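For readers following the thread, here is a minimal sketch of what the check could look like once only the major version is compared, as requested in the review. The function name `is_cuda_12` and its signature are hypothetical and not taken from the PR; the sketch takes the version string as an argument so it runs without PyTorch installed.

```python
from typing import Optional


def is_cuda_12(cuda_version: Optional[str]) -> bool:
    """Return True when the version string reports a CUDA 12.x toolkit.

    `is_cuda_12` is a hypothetical helper name used for illustration only.
    """
    if cuda_version is None:
        # torch.version.cuda is None on builds without CUDA (e.g. CPU-only).
        return False
    # Only the major component matters after the review change,
    # so "12.4" and "12.9" are treated the same way.
    major = cuda_version.split(".")[0]
    return int(major) == 12
```

In real use the argument would be `torch.version.cuda`, i.e. `is_cuda_12(torch.version.cuda)`.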
tridao
approved these changes
Apr 15, 2026
ussoewwin pushed a commit to ussoewwin/flash-attention that referenced this pull request May 13, 2026
…ion (Dao-AILab#2461)

* Disable 2CTA forward non-causal on CUDA 12.9 to work around codegen regression

  CUDA 12.9 has a codegen issue that causes ~18% slowdown for 2CTA forward non-causal (hdim=128: 1280 vs 1542 TFLOPS). This is fixed in CUDA 13.x. Auto-disable 2CTA when CUDA 12.9 is detected. Users on CUDA 13.x are unaffected. The manual `FA_DISABLE_2CTA=1` override continues to work regardless of CUDA version.

* Disable 2CTA forward non-causal on all CUDA 12.x (not just 12.9)
Summary
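* CUDA 12.9 codegen issue causes ~18% slowdown for 2CTA forward non-causal (hdim=128: ~1280 vs ~1542 TFLOPS)
* Auto-disable 2CTA when CUDA 12 is detected via `torch.version.cuda`
* `FA_DISABLE_2CTA=1` env var continues to work regardless of CUDA version

Confirmed by @Johnsonms and @jshahOSS on B200 with CUDA 12.9.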
Test plan
* `FA_DISABLE_2CTA=1` still works as manual override
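The override behavior in the test plan can be illustrated with a small sketch of the decision logic. This is an assumption-laden illustration, not the PR's code: the function name `should_disable_2cta`, the `causal` and `env` parameters, and the exact comparison against the string `"1"` are all hypothetical. Only the `FA_DISABLE_2CTA` variable name and the CUDA 12 condition come from the PR text.

```python
import os
from typing import Mapping, Optional


def should_disable_2cta(
    cuda_version: Optional[str],
    causal: bool,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Decide whether 2CTA should be turned off for a forward pass.

    Hypothetical sketch: names and env-var parsing are assumptions.
    """
    env = os.environ if env is None else env
    # Manual override applies regardless of CUDA version or causal mode.
    if env.get("FA_DISABLE_2CTA") == "1":
        return True
    # The automatic workaround targets the non-causal forward path only.
    if causal or cuda_version is None:
        return False
    # Any CUDA 12.x toolkit triggers the workaround; 13.x is unaffected.
    return int(cuda_version.split(".")[0]) == 12
```

Passing `env` explicitly keeps the check easy to exercise without mutating the process environment; a caller would normally omit it and pass `torch.version.cuda` as the version.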