Restrict `fence.proxy.async` to `shared::cta` (#4804)
We currently use the `kir::FenceAsyncProxy` expression type only to avoid shared memory WAR hazards. Adding this modifier to the PTX instruction means we don't need to wait for gmem writes to become visible to the epilogue threads, which saves time. Note that cutlass also only uses `fence.proxy.async` with this modifier: https://github.com/NVIDIA/cutlass/blob/9baa06dd57804ce8fb5efe9e471b3451341522c6/include/cutlass/arch/barrier.h#L717

In testing, this took our small LLM problem set from 84% to 90% of cublas (ignoring split-K). Note that we should also predicate the `fenceAsyncProxy` calls to match the predicates in the consumer expressions -- currently meaning TMA stores. This can also give us a speedup, which is measurable in some test problems, but I have not yet automated it, so I don't know what to expect in terms of overall speedups.

PTX doc: https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar
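For reviewers less familiar with the PTX instruction, here is a minimal sketch (not the actual nvFuser codegen) of the difference this PR makes. An unscoped `fence.proxy.async` orders async-proxy accesses across all state spaces, including global memory, whereas `fence.proxy.async.shared::cta` only orders accesses to the CTA's shared memory, which is all that is needed to avoid the smem WAR hazard described above:

```cuda
// Sketch only; function names are illustrative. Requires sm_90+.

// Unscoped: also waits for gmem writes to be visible via the async proxy.
__device__ __forceinline__ void fence_async_proxy_all() {
  asm volatile("fence.proxy.async;" ::: "memory");
}

// Scoped to CTA shared memory: sufficient before reusing an smem buffer
// that a TMA (async-proxy) operation will consume, and cheaper because it
// does not wait on global-memory visibility.
__device__ __forceinline__ void fence_async_proxy_shared_cta() {
  asm volatile("fence.proxy.async.shared::cta;" ::: "memory");
}
```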
!test
Review updated until commit 9a8ebe6
```diff
 std::string WgMmaFence::toString(int indent_size) const {
-  return "fence.proxy.async\n";
+  return "wgmma.fence.sync.aligned\n";
 }
```
Fixing a previously incorrect printout: `WgMmaFence::toString` should print `wgmma.fence.sync.aligned`, not `fence.proxy.async`.
I am hitting correctness issues in the pingpong test unless the `fenceAsyncProxy` call is predicated along with the TMA store. UPDATE: the predicate will be inserted by #4820
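To illustrate what "predicated along with the TMA store" means, here is a hedged sketch of the generated kernel shape (the predicate and surrounding code are illustrative, not the actual nvFuser output). The point is that the fence must execute under the same thread predicate as the consumer TMA store, rather than unconditionally on all threads:

```cuda
// Sketch only. In the consumer (epilogue) code, the TMA bulk store is
// typically issued by a single elected thread; the proxy fence that
// orders prior smem writes must sit under the same predicate.
if (threadIdx.x == 0) {
  // Order preceding generic-proxy smem writes before the async-proxy read.
  asm volatile("fence.proxy.async.shared::cta;" ::: "memory");
  // ... cp.async.bulk.tensor store issued here, under the same predicate ...
}
```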
Stacked on #4820