Insert fences in insertRawThreadSynchronization #4810

jacobhinkle wants to merge 8 commits into main
Conversation
Whenever we insert a sync, this PR adds a simple analysis to determine whether a memory fence is required and, if so, adds a `FenceAsyncProxy`. This is not sufficient to completely address #4808 because:

1. It only affects syncs inserted in this pass, while mbarrier syncs are also inserted in circular buffering.
2. It does not affect the `wgmma::fence`, which is inserted in another part of this pass.
3. It does not predicate the fence based on the predicates of the consumers, and even if it did, we do not use `expr->predicate()` for TMA stores yet.
Review updated until commit 67285b3
!test --diff
Like #4804, this PR also causes a failure in …
I observed some slowdowns that I'm trying to understand. In particular, looking at …, on TOT we have this epilogue section:

```cpp
block_sync::sync<false>(dim3(128, 2, 1));
#pragma unroll
for(nvfuser_index_t i56 = 0; i56 < 2; ++i56) {
  fenceAsyncProxy();
  if (b21) {
    Hopper::cpAsyncBulkTensorTileS2G((Hopper::CpAsyncBulkTensorTileS2GIndex<2>{ ptr14, (Array<int, 2, 1>{__to_int32((i41 + (64 * i56))), i43}) }), (i13 + (8192 * i56)));
  }
}
block_sync::sync<false>(dim3(128, 2, 1));
cpAsyncBulkCommitGroup();
cpAsyncBulkWaitGroup<0LL>();
```

When I switch this to the following, I get a slowdown from 48.8 us to 61.8 us, i.e. a drop to 79% perf:

```cpp
block_sync::sync<false>(dim3(128, 2, 1));
if (b21) {
  fenceAsyncProxy();
}
#pragma unroll
for(nvfuser_index_t i56 = 0; i56 < 2; ++i56) {
  if (b21) {
    Hopper::cpAsyncBulkTensorTileS2G((Hopper::CpAsyncBulkTensorTileS2GIndex<2>{ ptr14, (Array<int, 2, 1>{__to_int32((i41 + (64 * i56))), i43}) }), (i13 + (8192 * i56)));
  }
}
block_sync::sync<false>(dim3(128, 2, 1));
cpAsyncBulkCommitGroup();
cpAsyncBulkWaitGroup<0LL>();
```

This looks like the ideal fence placement and predication to me, so this is surprising, and perf is not recovered by adding … What is even more interesting is to look at the ncu profiles for these two situations.

Relevant section of PTX diff:
```diff
$L__BB0_47:
- mov.u32 %r363, 1;
+ mov.u32 %r364, 1;
- mov.u32 %r364, 256;
+ mov.u32 %r365, 256;
// begin inline asm
- bar.sync %r363, %r364;
+ bar.sync %r364, %r365;
// end inline asm
+ or.pred %p98, %p4, %p64;
+ @%p98 bra $L__BB0_49;
+
+ shl.b32 %r747, %r265, 7;
+ add.s32 %r746, %r747, %r203;
// begin inline asm
fence.proxy.async;
// end inline asm
- or.pred %p98, %p4, %p64;
- @%p98 bra $L__BB0_49;
-
- shl.b32 %r750, %r262, 7;
- add.s32 %r749, %r750, %r199;
// begin inline asm
+ cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd122, {%r35, %r746}], [%r14];
+ // end inline asm
+ add.s32 %r371, %r14, 8192;
+ add.s32 %r372, %r35, 64;
+ // begin inline asm
- cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd121, {%r26, %r749}], [%r9];
+ cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd122, {%r372, %r746}], [%r371];
// end inline asm
$L__BB0_49:
// begin inline asm
- fence.proxy.async;
-
+ bar.sync %r364, %r365;
// end inline asm
- @%p98 bra $L__BB0_51;
+ // begin inline asm
+ cp.async.bulk.commit_group;
- shl.b32 %r748, %r262, 7;
- add.s32 %r747, %r748, %r199;
- add.s32 %r372, %r9, 8192;
- add.s32 %r373, %r26, 64;
+ // end inline asm
// begin inline asm
- cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd121, {%r373, %r747}], [%r372];
+ cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%rd122, {%r372, %r746}], [%r371];
// end inline asm
+
// end inline asm
+$L__BB0_50:
+ add.s32 %r749, %r749, 1;
+ setp.lt.s32 %p99, %r749, %r11;
+ @%p99 bra $L__BB0_27;
+
```
|
I believe the issue is that these changes sometimes result in thread divergence during …
Stacked on #4820
Whenever we insert a sync, this PR adds a simple analysis to determine whether a memory fence is required and, if so, adds a `FenceAsyncProxy`.

Before this PR:

After this PR:

This is not sufficient to completely address #4808 because:

1. It only affects syncs inserted in this pass, while mbarrier syncs are also inserted in circular buffering.
2. It does not affect the `wgmma::fence`, which is inserted in another part of this pass.
3. It does not predicate the fence based on the predicates of the consumers, and even if it did, we do not use `expr->predicate()` for TMA stores yet. That predication happens in the unroll pass, and an exception is made for ElectSync of TMA stores, given "Skip ElectSync when creating predicate for TMA Store in PredicateCompute" (#4332).

Fixes #4814