[Cute,Sm100,Bwd] refine bwd swizzle for deterministic by jayhshah · Pull Request #2390 · Dao-AILab/flash-attention

jayhshah · 2026-03-25T02:57:21Z

This PR fixes how the varlen swizzle derives num_n_blocks from num_m_blocks for 2cta, and includes dqaccum in the per-head size so that the swizzle does not spill L2. This noticeably improves deterministic backward FLOPS: for non-varlen, the gain is limited to non-causal, with no apparent regression for causal; for varlen, performance improves across the board thanks to the fix.

Example benchmarks:

non-varlen deterministic, MHA nheads = 16

hdim  causal  batch  seqlen                PR              MAIN
---------------------------------------------------------------
 128   False      4    8192   4.58/1200/53.3%   4.75/1157/51.4%
 128   False      2   16384   9.32/1180/52.4%  10.08/1091/48.5%
 128   False      1   32768  19.69/1117/49.6%  20.48/1074/47.7%

 128    True      4    8192   2.50/1100/48.9%   2.50/1100/48.9%
 128    True      2   16384   4.60/1195/53.1%   4.60/1195/53.1%
 128    True      1   32768   9.74/1129/50.2%   9.86/1115/49.6%

varlen deterministic, MHA nheads = 16

hdim  causal  batch  seqlen                PR              MAIN
---------------------------------------------------------------
 128   False      4    8192   4.66/1180/52.5%   5.13/1071/47.6%
 128   False      2   16384   8.97/1225/54.5%  10.15/1083/48.1%
 128   False      1   32768  19.45/1131/50.3%  20.85/1054/46.9%

 128    True      4    8192   2.59/1061/47.1%    2.83/973/43.2%
 128    True      2   16384   4.72/1166/51.8%   5.06/1086/48.3%
 128    True      1   32768   9.72/1131/50.3%  10.09/1089/48.4%
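The PR does not label the three values in each cell. They appear to be time in ms / TFLOPS / utilization, since the middle value can be reproduced from the first using the usual attention FLOP-counting convention (forward = 4 * batch * seqlen^2 * nheads * hdim, halved for causal; backward = 2.5x forward). The sketch below is only a sanity check under that assumption; the function name is hypothetical and is not taken from the benchmark script used for this PR.

```python
def bwd_attention_tflops(batch, seqlen, nheads, hdim, causal, time_ms):
    """Backward-pass TFLOPS under the assumed FLOP-counting convention."""
    # Forward: two matmuls (QK^T and PV), 2 FLOPs per multiply-add.
    fwd_flops = 4 * batch * seqlen**2 * nheads * hdim
    if causal:
        fwd_flops //= 2  # roughly half of the score matrix is masked out
    bwd_flops = 2.5 * fwd_flops  # backward has five matmuls vs. two in forward
    return bwd_flops / (time_ms * 1e-3) / 1e12


# First row of the non-varlen table (hdim 128, non-causal, batch 4, seqlen 8192).
pr = bwd_attention_tflops(4, 8192, 16, 128, False, 4.58)
main = bwd_attention_tflops(4, 8192, 16, 128, False, 4.75)
print(f"PR {pr:.0f} TFLOPS, MAIN {main:.0f} TFLOPS, speedup {4.75 / 4.58:.3f}x")
```

Under the same reading, the third value is consistent with utilization relative to a peak of roughly 2250 TFLOPS (for example, 1200 / 0.533).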

tridao · 2026-03-25T06:30:41Z

                num_n_blocks = (
                    num_m_blocks
                    * params.tile_shape_mn[0]
+                    * params.cluster_shape_m


Does this affect any of the 2cta bwd code?

jayhshah · Mar 25, 2026

This change is meant to get the right head swizzle heuristic for 2cta bwd by accounting for num_m_blocks being defined with respect to the tiler divided by the cluster shape.

num_n_blocks here is only used to derive nheads_in_l2, so it doesn't affect correctness.

tridao · Mar 25, 2026

For tile_shape_mn[0], do we pass the CTA's tile shape or the cluster tile shape? I think we had this discussion and realized we have not been consistent.
In any case, if it doesn't affect correctness, it's fine with me.

jayhshah · Mar 25, 2026

Currently, for the tile scheduler args, we pass cta_tiler[:2] as tile_shape_mn and cluster_shape_mn as a separate parameter; this makes the most sense to me, since cluster shape is in principle separate from the use of 2cta MMA.
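As a rough illustration of the exchange above, the following sketch shows why the extra cluster_shape_m factor changes only the swizzle heuristic. It is not the scheduler's code: the diff excerpt is truncated, so the conversion to n-blocks (integer division by tile_shape_mn[1]) and the heads_fitting_in_l2 heuristic, including its max_blocks_in_l2 parameter, are assumptions made for this example, and both function names are hypothetical.

```python
def estimate_num_n_blocks(num_m_blocks, tile_shape_mn, cluster_shape_m=1):
    """Approximate the n-block count from the scheduler's m-block count."""
    tile_m, tile_n = tile_shape_mn  # CTA tile shape, passed separately from the cluster shape
    # num_m_blocks is counted per cluster, so scale by cluster_shape_m to
    # recover the full extent. The pre-fix expression omitted this factor.
    extent = num_m_blocks * tile_m * cluster_shape_m
    return max(extent // tile_n, 1)  # assumed conversion; elided in the diff excerpt


def heads_fitting_in_l2(num_n_blocks, max_blocks_in_l2, num_heads):
    """Hypothetical heuristic: how many heads' worth of blocks fit in L2 at once."""
    return min(max(max_blocks_in_l2 // num_n_blocks, 1), num_heads)


tile_shape_mn = (128, 128)
before = estimate_num_n_blocks(32, tile_shape_mn)                    # no cluster factor
after = estimate_num_n_blocks(32, tile_shape_mn, cluster_shape_m=2)  # 2cta case
print(before, heads_fitting_in_l2(before, 256, 16))
print(after, heads_fitting_in_l2(after, 256, 16))
```

In this sketch, omitting the cluster factor halves the estimated block count, so the heuristic would group too many heads per swizzle and overestimate what fits in L2. Only the grouping of work changes, which is consistent with the statement that correctness is unaffected.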

refine bwd swizzle when deterministic

6757ec5

tridao reviewed Mar 25, 2026


tridao approved these changes Mar 25, 2026


jayhshah merged commit 5c7711e into main Mar 25, 2026
2 of 3 checks passed

jayhshah deleted the jshah/bwd-det-swizzle branch March 25, 2026 17:45
