[FPSAN] Insert barriers in between scratch loads and stores#10055
[FPSAN] Insert barriers in between scratch loads and stores#10055pawelszczerbuk merged 3 commits into
Conversation
|
why .cta? Does this work for multicta kernels? |
| Location loc = op.getLoc(); | ||
| if (op.hasGlobalRead() || op.hasGlobalWrite()) { | ||
| PTXBuilder ptxBuilder; | ||
| auto &membar = *ptxBuilder.create("membar.cta"); |
There was a problem hiding this comment.
The ptx spec for bar.sync states that:
The barrier{.cta}.sync or barrier{.cta}.red or barrier{.cta}.arrive instruction guarantees that when the barrier completes, prior memory accesses requested by this thread are performed relative to all threads participating in the barrier. The barrier{.cta}.sync and barrier{.cta}.red instruction further guarantees that no new memory access is requested by this thread before the barrier completes.
Which to me suggests that an additional fence shouldn't be required. Is it possible that the issues are only happening in warp-specialized code where you can have the reads and writes happening in different warp partitions?
There was a problem hiding this comment.
Interesting, thanks for digging that out! Doing more experiments to see what exactly is causing the failure.
There was a problem hiding this comment.
So it seems that fence release/acquire is sufficient for the flakiness to go away. But WS does not explain the issue, because test_dot_fma was flaking, while not using WS at all. Not sure what to make of it.
There was a problem hiding this comment.
I notice in emitMmaEmulationLoops you both load and store from dTilePtr without a barrier in between. Is it possible this is causing issues? It seems safe because you should be loading and storing from the same thread, but technically this still requires a memory fence afaik.
There was a problem hiding this comment.
Let me see, good catch!
There was a problem hiding this comment.
I asked codex to find similar races, and it noted that we don't synchronise after storing to hbm in a tmem_store, as such we could have a pattern like a tmem_store into a tmem_load or the other way around that could race.
In the LLVM lowering we emit after tmem_load a
NVVM::Tcgen05WaitOp::create(rewriter, loc, NVVM::Tcgen05WaitKind::LOAD);
so here we should probably emit a barrier or at least a fence.
There was a problem hiding this comment.
Jackpot, that was it :D Thanks a lot Peter!!!
There was a problem hiding this comment.
we don't synchronise after storing to hbm in a tmem_store
wait, do you mean tma_store? tmem_store does not touch hbm right?
There was a problem hiding this comment.
I meant this line. It's in the fp sanitiser
8d9e8ee to
19cab1f
Compare
| ttg::BarrierOp::create( | ||
| rewriter, loc, ttg::AddrSpace::GlobalRead | ttg::AddrSpace::GlobalWrite); |
There was a problem hiding this comment.
nit. this GlobalRead | GlobalWrite does not do anything on nvidia so perhaps better drop it and just use Local?
There was a problem hiding this comment.
Good point, otherwise it is getting confusing.
There was a problem hiding this comment.
Sorry, no, we do need these for AMD. This is the common code path
Fpsan insufficiently synchronized global scratch accesses - we were missing a barrier in between memory accesses, which led to random failures in the tests and racy behavior.