Disable 2CTA fwd non-causal on CUDA 12 to work around codegen regression by Johnsonms · Pull Request #2461 · Dao-AILab/flash-attention

Johnsonms · 2026-04-15T06:50:04Z

Summary

CUDA 12.9 has a codegen regression that causes an ~18% slowdown for 2CTA forward non-causal (hdim=128: ~1280 TFLOPS with 2CTA vs ~1542 TFLOPS with 1CTA)
Auto-disable 2CTA when CUDA 12.9 is detected via torch.version.cuda
Users on CUDA 13.x are unaffected: 2CTA stays enabled and performs as expected
The manual FA_DISABLE_2CTA=1 env var continues to work regardless of CUDA version

Confirmed by @Johnsonms and @jshahOSS on B200 with CUDA 12.9.
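
As a quick sanity check on the quoted figures, the relative gap can be recomputed directly. This is only arithmetic on the numbers in the summary, reading ~1280 TFLOPS as the 2CTA result and ~1542 TFLOPS as the 1CTA result (as the test plan suggests); the variable names are illustrative.

```python
# Approximate throughput figures quoted in the summary (hdim=128, CUDA 12.9).
tflops_2cta = 1280.0  # 2CTA forward non-causal, affected by the regression
tflops_1cta = 1542.0  # 1CTA forward non-causal

# Slowdown of 2CTA relative to 1CTA, and the equivalent 1CTA speedup.
slowdown = (tflops_1cta - tflops_2cta) / tflops_1cta
speedup = tflops_1cta / tflops_2cta

print(f"2CTA is {slowdown:.1%} slower than 1CTA")  # 17.0%
print(f"1CTA is {speedup:.3f}x faster than 2CTA")  # 1.205x
```

Both figures are approximate, so the result is in the same range as the ~18% stated above.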

Test plan

Verify on CUDA 12.9: non-causal hdim=128 should now use 1CTA (~1542 TFLOPS)
Verify on CUDA 13.x: non-causal hdim=128 should still use 2CTA (~1597 TFLOPS)
Verify FA_DISABLE_2CTA=1 still works as manual override

python benchmarks/benchmark_attn.py \
  --fwd --backend fa4 \
  --headdim 128 --batch-size 4 --seqlen 8192 \
  --causal both

…egression CUDA 12.9 has a codegen issue that causes ~18% slowdown for 2CTA forward non-causal (hdim=128: 1280 vs 1542 TFLOPS). This is fixed in CUDA 13.x. Auto-disable 2CTA when CUDA 12.9 is detected. Users on CUDA 13.x are unaffected. The manual `FA_DISABLE_2CTA=1` override continues to work regardless of CUDA version.

tridao · 2026-04-15T07:13:58Z

+        cuda_version = torch.version.cuda
+        if cuda_version is not None:
+            major, minor = cuda_version.split(".")[:2]
+            return int(major) == 12 and int(minor) == 9


Let's just check int(major) == 12

Sure, will change. Thanks Tri
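
The check agreed on in this review thread might look roughly like the sketch below. It is a minimal illustration, not the code that was merged: the function names `is_cuda_12` and `should_disable_2cta` are hypothetical, the version string is passed in as an argument so the snippet runs without PyTorch, and the assumption that the value "1" is what triggers the `FA_DISABLE_2CTA` override comes only from the `FA_DISABLE_2CTA=1` usage in the summary. It also ignores every other condition (forward vs backward, causal vs non-causal) that the real dispatch logic considers.

```python
import os
from typing import Mapping, Optional


def is_cuda_12(cuda_version: Optional[str]) -> bool:
    """Return True when the version string reports CUDA major version 12."""
    if cuda_version is None:  # e.g. a CPU-only PyTorch build
        return False
    major = cuda_version.split(".")[0]
    return int(major) == 12


def should_disable_2cta(
    cuda_version: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical helper: manual override first, then the CUDA 12 check."""
    env = os.environ if env is None else env
    # Assumption: "1" is the value that disables 2CTA, as in FA_DISABLE_2CTA=1.
    if env.get("FA_DISABLE_2CTA") == "1":
        return True
    return is_cuda_12(cuda_version)
```

In practice the caller would pass `torch.version.cuda`, which is a string such as "12.9" on CUDA builds and `None` otherwise. Checking only the major version means every 12.x toolkit takes the 1CTA path, while 13.x keeps 2CTA unless the environment variable is set.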

…ion (Dao-AILab#2461)

* Disable 2CTA forward non-causal on CUDA 12.9 to work around codegen regression

  CUDA 12.9 has a codegen issue that causes ~18% slowdown for 2CTA forward non-causal (hdim=128: 1280 vs 1542 TFLOPS). This is fixed in CUDA 13.x. Auto-disable 2CTA when CUDA 12.9 is detected. Users on CUDA 13.x are unaffected. The manual `FA_DISABLE_2CTA=1` override continues to work regardless of CUDA version.

* Disable 2CTA forward non-causal on all CUDA 12.x (not just 12.9)

tridao reviewed Apr 15, 2026


tridao approved these changes Apr 15, 2026


Disable 2CTA forward non-causal on all CUDA 12.x (not just 12.9)

c6bbf0e

Johnsonms changed the title ~~Disable 2CTA fwd non-causal on CUDA 12.9 to work around codegen regression~~ Disable 2CTA fwd non-causal on CUDA 12 to work around codegen regression Apr 15, 2026

Johnsonms merged commit 83b8b8f into main Apr 15, 2026

jayhshah mentioned this pull request May 6, 2026

[CuTe,Bwd,Sm100] don't disable 2cta due to cuda 12 in bwd #2543

Merged

Johnsonms merged 2 commits into main from Johnsonms/disable-2cta-cuda12.9
