Skip ElectSync when creating predicate for TMA Store in PredicateCompute #4332
* TMA Store is a warp-collective, so it is issued by a single warp.
* Using ElectSync to pick a single thread is unnecessary.
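The two points above can be illustrated with a toy host-side model. This is only a sketch under stated assumptions: `kWarpSize`, `activeLanes`, and the choice of lane 0 as the elected lane are hypothetical simplifications for illustration, not nvFuser code and not the actual hardware election semantics.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy model, not nvFuser code.
constexpr int kWarpSize = 32;

// Returns the lanes of warp `warp_id` that pass the predicate.
// `selected_warp` models the warp-selection logic, which stays the same
// whether or not ElectSync is part of the predicate.
std::vector<int> activeLanes(int warp_id, int selected_warp, bool use_elect_sync) {
  assert(warp_id >= 0 && selected_warp >= 0);
  std::vector<int> lanes;
  if (warp_id != selected_warp) {
    return lanes;  // warps other than the selected one never pass
  }
  for (int lane = 0; lane < kWarpSize; ++lane) {
    // With ElectSync only one elected lane passes (modelled here as lane 0).
    // Without it, every lane of the selected warp passes.
    if (!use_elect_sync || lane == 0) {
      lanes.push_back(lane);
    }
  }
  return lanes;
}
```

In this model, dropping ElectSync for a warp-collective operation changes only how many lanes of the selected warp enter the guarded region, not which warp is chosen.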
!test
Review updated until commit ee8f34c
For future refactor, perhaps we should create a separate predicate type such as |
!test
Does it make sense to have two predicate types, or just one? If we only need one, then probably we can name it into something like |
Yes, |
This PR changes `createElectSyncPredicate` to skip adding ElectSync to `TMA Store` expressions.

Review of ElectSyncPredicate Handling

ElectSync Predicate with Expression
* ElectSyncPredicates paired with their expression: see https://github.com/NVIDIA/Fuser/blob/main/csrc/device_lower/pass/unroll.cpp#L160-L171.
* An ElectSyncPredicate with its expression is handled by `PredicateCompute::createSingleExpressionElectSync`.
* `createSingleExpressionElectSync` uses the predicate's expression to determine whether it is a TMA Store.
* `createElectSyncPredicate` skips the `ElectSync` if the expression is a TMA Store. The logic to select a warp is the same.

Expression-Less ElectSync Predicate
* ElectSyncPredicates generated in the circular buffering pass are expression-less because a single `IfThenElse` can contain multiple TMA load expressions. These predicates are handled by `PredicateCompute::createMultipleExpressionElectSync`.

Test Example
`HopperMatmulTest/MLPGemmPersistentBroadcastInputs.NumWarpGroups/2`

Nsys NvProf
* CUDA Kernel without ElectSync
* CUDA Kernel with ElectSync
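
The handling reviewed in this description (single-expression versus expression-less predicates) can be sketched as a small standalone model. The function names mirror those mentioned in the description, but the signatures, the `ExprKind` enum, the `lower` helper, and the string-based predicates are all hypothetical and live in a `sketch` namespace. This is not the nvFuser implementation, and it assumes the expression-less path always keeps ElectSync.

```cpp
#include <cassert>
#include <string>

namespace sketch {  // hypothetical model, not the nvFuser API

enum class ExprKind { TmaLoad, TmaStore };

// Builds a textual predicate. The warp-selection term is identical in both
// cases; only the ElectSync term is dropped for TMA Store.
std::string createElectSyncPredicate(bool is_tma_store) {
  const std::string select_warp = "is_selected_warp";
  return is_tma_store ? select_warp : select_warp + " && elect_sync";
}

// Predicate that carries its expression: inspect the expression kind.
std::string createSingleExpressionElectSync(ExprKind kind) {
  return createElectSyncPredicate(kind == ExprKind::TmaStore);
}

// Expression-less predicate (e.g. several TMA loads under one IfThenElse).
// Assumption: ElectSync is always kept on this path.
std::string createMultipleExpressionElectSync() {
  return createElectSyncPredicate(/*is_tma_store=*/false);
}

// Dispatch: a null pointer models an expression-less predicate.
std::string lower(const ExprKind* expr) {
  return expr != nullptr ? createSingleExpressionElectSync(*expr)
                         : createMultipleExpressionElectSync();
}

}  // namespace sketch
```

In this model, only a predicate whose expression is a TMA Store loses the `elect_sync` term. TMA loads and expression-less predicates are unaffected.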