[cute, sm90, flex] support sm90 block sparse deterministic #2567

Open
geruome wants to merge 5 commits into Dao-AILab:main from geruome:wzh/pr_sm90_flex_deter

geruome commented May 14, 2026

This follows the sm100 flex deterministic PR: here

No spt mode added.

End-to-end test script: here

Bitwise check: OK.
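For context on what a bitwise check guards against, here is a minimal pure-Python sketch. It is not the linked test script; `accumulate` and `bits` are hypothetical helpers. It only illustrates that floating-point accumulation in a fixed order reproduces the same bits, while a different order can give a result that is numerically close but not bit-identical.

```python
import struct
from typing import Sequence


def bits(x: float) -> bytes:
    """Return the raw IEEE-754 double encoding so comparison is bit-for-bit."""
    return struct.pack("<d", x)


def accumulate(parts: Sequence[float], order: Sequence[int]) -> float:
    """Sum `parts` in the given visiting order (stand-in for a gradient reduction)."""
    total = 0.0
    for idx in order:
        total += parts[idx]
    return total


parts = [0.1, 0.2, 0.3]  # hypothetical partial contributions

run_a = accumulate(parts, [0, 1, 2])      # fixed order, first run
run_b = accumulate(parts, [0, 1, 2])      # fixed order, second run
reordered = accumulate(parts, [2, 1, 0])  # same values, different order

print(bits(run_a) == bits(run_b))      # fixed order reproduces the same bits
print(bits(run_a) == bits(reordered))  # reordering changes the low bits here
```

A deterministic mode fixes the visiting order, so repeated runs can be compared with an exact equality check instead of a tolerance.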

Large slowdown on H20, growing with sequence length (up to 8.65x end-to-end and 11.86x backward in the s64k cases):

DETER vs NON-DETER fwd+bwd e2e

| case | non_e2e | deter_e2e | e2e_x | non_bwd | deter_bwd | bwd_x | dQ (max/mean) | dK (max/mean) | dV (max/mean) |
|---|---|---|---|---|---|---|---|---|---|
| mha_h20_d64_s1k | 0.546 | 1.360 | 2.49x | 0.229 | 0.257 | 1.12x | 4.883e-04/2.416e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
| gqa_h16_hkv4_d128_s1k | 0.605 | 1.436 | 2.37x | 0.324 | 0.362 | 1.12x | 4.883e-04/1.593e-09 | 3.906e-03/1.030e-08 | 1.221e-04/4.952e-10 |
| mha_h20_d64_s4k | 0.740 | 1.753 | 2.37x | 0.464 | 0.788 | 1.70x | 9.766e-04/3.634e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
| gqa_h16_hkv4_d128_s4k | 1.113 | 2.484 | 2.23x | 0.745 | 1.406 | 1.89x | 4.883e-04/2.160e-09 | 9.766e-04/4.659e-10 | 1.221e-04/6.207e-11 |
| mha_h20_d64_s16k | 2.858 | 7.188 | 2.51x | 1.890 | 5.516 | 2.92x | 9.766e-04/2.658e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
| gqa_h16_hkv4_d128_s16k | 4.412 | 18.396 | 4.17x | 3.059 | 16.295 | 5.33x | 1.953e-03/2.811e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
| mha_h20_d64_s64k | 11.387 | 85.336 | 7.49x | 7.632 | 80.667 | 10.57x | 9.766e-04/2.705e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
| gqa_h16_hkv4_d128_s64k | 17.498 | 151.373 | 8.65x | 12.232 | 145.072 | 11.86x | 9.766e-04/2.636e-09 | 0.000e+00/0.000e+00 | 0.000e+00/0.000e+00 |
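The `e2e_x` and `bwd_x` columns appear to be the deterministic time divided by the non-deterministic time. A small sketch that recomputes them for the first and last rows of the table (the `slowdown` helper is hypothetical, and the timings are copied from the table as-is):

```python
def slowdown(non_deter: float, deter: float) -> float:
    """Ratio of deterministic to non-deterministic time."""
    return deter / non_deter


# (non_e2e, deter_e2e, non_bwd, deter_bwd), copied from the table above
rows = {
    "mha_h20_d64_s1k": (0.546, 1.360, 0.229, 0.257),
    "gqa_h16_hkv4_d128_s64k": (17.498, 151.373, 12.232, 145.072),
}

for case, (non_e2e, deter_e2e, non_bwd, deter_bwd) in rows.items():
    e2e_x = slowdown(non_e2e, deter_e2e)
    bwd_x = slowdown(non_bwd, deter_bwd)
    print(f"{case}: e2e_x={e2e_x:.2f}x bwd_x={bwd_x:.2f}x")
```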

The next step would be to optimize the speed of deterministic mode, similar to the idea here.
Instead of walking partial blocks first and then full blocks, we can merge them into one list and launch/visit them in a specified order. This feels closer to the dense-attention pattern.
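A rough host-side sketch of the merged-list idea, in plain Python rather than the kernel's CuTe DSL. It assumes the partial and full block index lists are each sorted ascending and that the "specified order" is ascending block index; `merge_block_lists` and the `(block_idx, is_full)` tuple layout are hypothetical, not names from this PR.

```python
from typing import List, Tuple


def merge_block_lists(partial: List[int], full: List[int]) -> List[Tuple[int, bool]]:
    """Merge two ascending block-index lists into one ascending list.

    Each entry is (block_idx, is_full); is_full=False means the mask
    still has to be applied inside that block.
    """
    merged: List[Tuple[int, bool]] = []
    i = j = 0
    while i < len(partial) and j < len(full):
        if partial[i] < full[j]:
            merged.append((partial[i], False))
            i += 1
        else:
            merged.append((full[j], True))
            j += 1
    # Append whatever remains in either list.
    merged.extend((b, False) for b in partial[i:])
    merged.extend((b, True) for b in full[j:])
    return merged


# Causal-style example: blocks 0..2 are fully unmasked, block 3 is on the diagonal.
order = merge_block_lists(partial=[3], full=[0, 1, 2])
print(order)
```

With a single list, the visiting order is fixed by block index rather than by block type, which is what makes it resemble the dense-attention traversal.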

geruome changed the title from "[cute, sm90, flex] support sm90 flex_attention deterministic" to "[cute, sm90, flex] support sm90 block sparse deterministic" on Jun 9, 2026