[cute, sm90, flex] support sm90 block sparse deterministic by geruome · Pull Request #2567 · Dao-AILab/flash-attention

geruome · 2026-05-14T16:38:52Z

Follows the sm100 flex deterministic PR: here

No spt mode added.

End-to-end test script: here

Bitwise check passes.
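For readers unfamiliar with this kind of check: a bitwise check compares the raw bytes of two runs instead of using a tolerance, because floating-point addition is not associative and a changed accumulation order can change the low bits. The sketch below is plain Python and is not the PR's test script; `accumulate` and `bitwise_equal` are hypothetical helpers, with the list `partials` standing in for per-block gradient contributions.

```python
import struct

def accumulate(partials, order):
    """Sum the partial contributions in the given visiting order."""
    total = 0.0
    for i in order:
        total += partials[i]
    return total

def bitwise_equal(a, b):
    """Compare the raw IEEE-754 bytes, not the values within a tolerance."""
    return struct.pack("<d", a) == struct.pack("<d", b)

# Hypothetical per-block contributions to one gradient element.
partials = [0.1, 0.2, 0.3]

# Same order twice: the result is reproducible bit for bit.
run_a = accumulate(partials, [0, 1, 2])
run_b = accumulate(partials, [0, 1, 2])

# Different order (as with unordered atomic adds): the low bits can differ.
run_c = accumulate(partials, [2, 1, 0])

print(bitwise_equal(run_a, run_b))  # True
print(bitwise_equal(run_a, run_c))  # False
```

A fixed accumulation order is what makes the first comparison hold; the second shows why an unordered reduction cannot pass the same check in general.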

Deterministic mode is much slower on H20, up to 8.65x end to end and 11.86x in the backward pass:

DETER vs NON-DETER fwd+bwd e2e

| case | non_e2e | deter_e2e | e2e_x | non_bwd | deter_bwd | bwd_x | dQ (max/mean) | dK (max/mean) | dV (max/mean) |
|---|---|---|---|---|---|---|---|---|---|
| mha_h20_d64_s1k | 0.546 | 1.360 | 2.49x | 0.229 | 0.257 | 1.12x | 4.883e-04/2.416e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
| gqa_h16_hkv4_d128_s1k | 0.605 | 1.436 | 2.37x | 0.324 | 0.362 | 1.12x | 4.883e-04/1.593e-09 | 3.906e-03/1.030e-08 | 1.221e-04/4.952e-10 |
| mha_h20_d64_s4k | 0.740 | 1.753 | 2.37x | 0.464 | 0.788 | 1.70x | 9.766e-04/3.634e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
| gqa_h16_hkv4_d128_s4k | 1.113 | 2.484 | 2.23x | 0.745 | 1.406 | 1.89x | 4.883e-04/2.160e-09 | 9.766e-04/4.659e-10 | 1.221e-04/6.207e-11 |
| mha_h20_d64_s16k | 2.858 | 7.188 | 2.51x | 1.890 | 5.516 | 2.92x | 9.766e-04/2.658e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
| gqa_h16_hkv4_d128_s16k | 4.412 | 18.396 | 4.17x | 3.059 | 16.295 | 5.33x | 1.953e-03/2.811e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
| mha_h20_d64_s64k | 11.387 | 85.336 | 7.49x | 7.632 | 80.667 | 10.57x | 9.766e-04/2.705e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
| gqa_h16_hkv4_d128_s64k | 17.498 | 151.373 | 8.65x | 12.232 | 145.072 | 11.86x | 9.766e-04/2.636e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
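The `e2e_x` and `bwd_x` columns appear to be the deterministic time divided by the non-deterministic time. A small sketch that recomputes them for the first and last rows, assuming both timings in a pair use the same unit (the unit is not stated in the table); `slowdown` is a hypothetical helper:

```python
def slowdown(non_deter, deter):
    """Ratio of deterministic to non-deterministic time."""
    return deter / non_deter

# (non_e2e, deter_e2e, non_bwd, deter_bwd), copied from the table above.
rows = {
    "mha_h20_d64_s1k": (0.546, 1.360, 0.229, 0.257),
    "gqa_h16_hkv4_d128_s64k": (17.498, 151.373, 12.232, 145.072),
}

for name, (non_e2e, deter_e2e, non_bwd, deter_bwd) in rows.items():
    e2e_x = slowdown(non_e2e, deter_e2e)
    bwd_x = slowdown(non_bwd, deter_bwd)
    print(f"{name}: e2e {e2e_x:.2f}x, bwd {bwd_x:.2f}x")
# mha_h20_d64_s1k: e2e 2.49x, bwd 1.12x
# gqa_h16_hkv4_d128_s64k: e2e 8.65x, bwd 11.86x
```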

The next step is to optimize the speed of deterministic mode, along the lines of the idea here.
Instead of walking the partial blocks first and then the full blocks, we can merge them into one list and launch/visit them in a specified order. This feels closer to the dense-attention pattern.
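A minimal host-side sketch of the merged-list idea, not code from this PR. It assumes each row has two disjoint index lists, one for partial blocks (which still need the mask applied) and one for full blocks; `merge_block_lists` and the `(block_idx, is_full)` tuple layout are hypothetical:

```python
from heapq import merge

def merge_block_lists(partial_idx, full_idx):
    """Merge partial and full block indices into one list ordered by block index.

    Each entry is (block_idx, is_full), so a consumer can walk the blocks in a
    single fixed order and still skip masking for the full ones.
    """
    tagged_partial = ((b, False) for b in sorted(partial_idx))
    tagged_full = ((b, True) for b in sorted(full_idx))
    return list(merge(tagged_partial, tagged_full))

# Hypothetical block indices for one row of a block-sparse mask.
partial_blocks = [0, 5, 3]
full_blocks = [1, 2, 4]

order = merge_block_lists(partial_blocks, full_blocks)
print(order)
# [(0, False), (1, True), (2, True), (3, False), (4, True), (5, False)]
```

With a single ordered list, the visiting order matches what a dense kernel would use over the same blocks, which is the property the paragraph above is after.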

wangziheng and others added 5 commits May 14, 2026 20:29

initial commit

9967447

add tests

28503a0

fix speed cal

d254682

rm tests

06ba330

pass lint

a103aca

geruome changed the title ~~[cute, sm90, flex] support sm90 flex_attention deterministic~~ [cute, sm90, flex] support sm90 block sparse deterministic Jun 9, 2026

geruome wants to merge 5 commits into Dao-AILab:main from geruome:wzh/pr_sm90_flex_deter
