@Aya-ZIbra

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2078

Change QtileSize to 64. I see a good improvement, above 20% for most of the shapes measured below.
For correctness, this also changes the TMEM atoms and introduces a warp sync for the row stats.
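
A small simulation can illustrate why row stats might need a warp-level exchange. This is a sketch under an assumption that the PR does not spell out: that with the smaller tile the columns of one softmax row end up split across several lanes, so each lane holds only a partial row max and row sum. The helper names (`partial_stats`, `combine`, `butterfly_reduce`) and the lane count are hypothetical and are not taken from the kernel.

```python
import math

def partial_stats(values):
    """Softmax stats for a slice of a row: (max, sum of exp(v - max))."""
    m = max(values)
    return (m, sum(math.exp(v - m) for v in values))

def combine(a, b):
    """Merge two partial (row_max, row_sum) pairs into one."""
    m = max(a[0], b[0])
    # Rescale each partial sum to the shared max before adding.
    s = a[1] * math.exp(a[0] - m) + b[1] * math.exp(b[0] - m)
    return (m, s)

def butterfly_reduce(lane_stats):
    """Simulate an XOR-shuffle reduction over a power-of-two number of lanes.

    After log2(n) exchange steps every lane holds the full-row stats.
    """
    n = len(lane_stats)
    assert n & (n - 1) == 0, "lane count must be a power of two"
    stats = list(lane_stats)
    offset = n // 2
    while offset >= 1:
        # Each lane combines its value with the value held by lane ^ offset.
        stats = [combine(stats[lane], stats[lane ^ offset]) for lane in range(n)]
        offset //= 2
    return stats

# Hypothetical row of attention scores, split evenly over 4 lanes.
row = [0.5, -1.0, 2.0, 0.0, 3.5, 1.25, -0.75, 2.5]
lanes = 4
chunk = len(row) // lanes
per_lane = [partial_stats(row[i * chunk:(i + 1) * chunk]) for i in range(lanes)]

reduced = butterfly_reduce(per_lane)
reference = partial_stats(row)
```

On a GPU, an exchange of this shape is typically expressed with a warp shuffle such as `__shfl_xor_sync`. Whether the kernel uses that exact intrinsic is not stated in this PR.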

Perf:

| (Batch, SeqLenQ, SeqLenKV, MaxLenKV, HeadQ, HeadKV, HeadD) | cutlass_blackwell_fmha_decode-gbps | Improvement with Qtile = 64 |
| --- | --- | --- |
| (16, 1, 256, 256, 8, 1, 128) | 238.2206209 | 1.31463193 |
| (16, 1, 512, 512, 8, 1, 128) | 410.8838061 | 1.315872068 |
| (16, 1, 1024, 1024, 8, 1, 128) | 660.5696208 | 1.335567769 |
| (16, 1, 2048, 2048, 8, 1, 128) | 916.5460174 | 1.310093116 |
| (16, 1, 4096, 4096, 8, 1, 128) | 1133.690174 | 1.258896694 |
| (16, 1, 8192, 8192, 8, 1, 128) | 1271.341515 | 1.229311967 |
| (32, 1, 256, 256, 8, 1, 128) | 468.9034945 | 1.295635241 |
| (32, 1, 512, 512, 8, 1, 128) | 799.2689835 | 1.280831124 |
| (32, 1, 1024, 1024, 8, 1, 128) | 1285.452285 | 1.293538886 |
| (32, 1, 2048, 2048, 8, 1, 128) | 1797.074701 | 1.269787171 |
| (32, 1, 4096, 4096, 8, 1, 128) | 2210.946865 | 1.229703361 |
| (32, 1, 8192, 8192, 8, 1, 128) | 2498.665399 | 1.212166122 |
| (64, 1, 256, 256, 8, 1, 128) | 893.9747894 | 1.302172409 |
| (64, 1, 512, 512, 8, 1, 128) | 1493.150844 | 1.274679551 |
| (64, 1, 1024, 1024, 8, 1, 128) | 2309.825211 | 1.220419935 |
| (64, 1, 2048, 2048, 8, 1, 128) | 3012.271892 | 1.159444905 |
| (64, 1, 4096, 4096, 8, 1, 128) | 3552.001019 | 1.089389445 |
| (64, 1, 8192, 8192, 8, 1, 128) | 4348.016208 | 1.131298153 |
| (128, 1, 256, 256, 8, 1, 128) | 1549.388365 | 1.233405251 |
| (128, 1, 512, 512, 8, 1, 128) | 2480.52007 | 1.210676964 |
| (128, 1, 1024, 1024, 8, 1, 128) | 3360.125922 | 1.145674899 |
| (128, 1, 2048, 2048, 8, 1, 128) | 4103.461192 | 1.093136854 |
| (128, 1, 4096, 4096, 8, 1, 128) | 4783.429328 | 1.095583284 |
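
To read the last column at a glance, the ratios can be summarized with a few lines of Python. This is a minimal sketch that only restates the numbers in the table above. It assumes the column is a speedup ratio (new over old), and the choice of a geometric mean is ours rather than something the PR reports.

```python
import math

# Ratios copied from the "Improvement with Qtile = 64" column, in table order.
speedups = [
    1.31463193, 1.315872068, 1.335567769, 1.310093116, 1.258896694, 1.229311967,
    1.295635241, 1.280831124, 1.293538886, 1.269787171, 1.229703361, 1.212166122,
    1.302172409, 1.274679551, 1.220419935, 1.159444905, 1.089389445, 1.131298153,
    1.233405251, 1.210676964, 1.145674899, 1.093136854, 1.095583284,
]

# Geometric mean is the usual way to average ratios.
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
above_20 = sum(1 for s in speedups if s > 1.20)

print(f"min {min(speedups):.3f}, max {max(speedups):.3f}, geomean {geomean:.3f}")
print(f"{above_20} of {len(speedups)} shapes improve by more than 20%")
```

The shapes that fall under 20% are the larger ones, with batch 64 or 128 and SeqLenKV of 1024 or more.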

Reviewed By: jianyuh, v0i0

Differential Revision: D85155388

@netlify

netlify bot commented Oct 31, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 41a878c
🔍 Latest deploy log https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/690406d3e7a82a0008762a48
😎 Deploy Preview https://deploy-preview-5072--pytorch-fbgemm-docs.netlify.app

@meta-codesync

meta-codesync bot commented Oct 31, 2025

@Aya-ZIbra has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85155388.

@meta-codesync

meta-codesync bot commented Oct 31, 2025

This pull request has been merged in 9ec0d72.
