[AMD] Migrate to MMRA-annotated fences for memory barriers#10383
Conversation
Now that the AMDGPU backend supports Memory Model Relaxation hannotations (MMRAs) on fences to indicate that only LDS / only global memory should be waited on, there is no need to manually insert waitcnt instructions, as the backend will correctly insert only the requested waits. This makes the atomic behavior we're looking for transparent to LLVM IR and avoids low-level intrinsics that require a bunch of complex bit-packing logic the backend already knows how to do. There's been a soft request from the backend folks for people to migrate to this form of fence if possible (though we at least have the benefit that we're not using inline assembly). AI disclosure: I wrote up what needed doing and had Codex do the work here (though I have checked the logic and it looks correct).
jungpark-mlir
left a comment
There was a problem hiding this comment.
Looks good to me, thanks for this improvement!
| false); | ||
| prependOp(ROCDL::SBarrierOp::create(builder, loc), false); | ||
| prependOp(createLocalMMRAFence(builder, loc, LLVM::AtomicOrdering::acquire), | ||
| false); |
There was a problem hiding this comment.
Just to check if I understand this correctly. In theory we don't need acquire fence here because sched.barrier is used, right?
Still good to have it to avoid any confusion.
There was a problem hiding this comment.
sched.barrier isn't an actual barrier - and is very unrelated to this change - and we do need the acquire fence.
So, having looked and reminded myself how LLVM defines fence, the reason this works is that s.barrier is, conceptually, an atomic operation on something we can model as inaccessible memory, and so a release fence followed by the "atomic" s.barrier that actually synchronizes the lanes, followed by the acquire fence, establishes a happens-before relationship between those two fences.
That, in turn, establishes a happens-before relationship between memory operations before the fence and those after the fence across the entire workgroup (which is the syncscope).
Now, those MMRAs prevent the classical "fences fence too much" issue where that relationship applies to global accesses, breaking software pipelining by forcing vmcnts. The MMRA metadata ensures that only (for example) operations on LDS get fenced.
|
Performance notes: I eyeballed some diffs. The LLVM IR changes appear to be as expected (though there's a sea of renumberings so it's hard to tell at some point). At the assembly level, using fences instead of manual waitcnts appears to have givin the compiler more freedom in shuffling a few LDS loads or index computations relative to each other and to, it at least one case, allowed a redundant However, when I ran the matmul tutorial benchmarks, all these changes lived below the bdenchmark noise floor, and I've seen no evidence that sched.barrier usage broke here. So there's no perf argument against landing this. |
|
Nice. Thanks @krzysz00! |
Now that the AMDGPU backend supports Memory Model Relaxation hannotations (MMRAs) on fences to indicate that only LDS / only global memory should be waited on, there is no need to manually insert waitcnt instructions, as the backend will correctly insert only the requested waits. This makes the atomic behavior we're looking for transparent to LLVM IR and avoids low-level intrinsics that require a bunch of complex bit-packing logic the backend already knows how to do.
There's been a soft request from the backend folks for people to migrate to this form of fence if possible (though we at least have the benefit that we're not using inline assembly).
AI disclosure: I wrote up what needed doing and had Codex do the work here (though I have checked the logic and it looks correct).
New contributor declaration
I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these
rules.
I have run
pre-commit run --from-ref origin/main --to-ref HEAD.Select one of the following.
/testforlittests/unittestfor C++ tests/python/testfor end-to-end testsFILL THIS IN.Select one of the following.
littests.littests I have added follow these best practices,including the "tests should be minimal" section. (Usually running Python code
and using the instructions it generates is not minimal.)