[AMD] Migrate to MMRA-annotated fences for memory barriers by krzysz00 · Pull Request #10383 · triton-lang/triton

krzysz00 · 2026-05-26T20:27:07Z

Now that the AMDGPU backend supports Memory Model Relaxation hannotations (MMRAs) on fences to indicate that only LDS / only global memory should be waited on, there is no need to manually insert waitcnt instructions, as the backend will correctly insert only the requested waits. This makes the atomic behavior we're looking for transparent to LLVM IR and avoids low-level intrinsics that require a bunch of complex bit-packing logic the backend already knows how to do.

There's been a soft request from the backend folks for people to migrate to this form of fence if possible (though we at least have the benefit that we're not using inline assembly).

AI disclosure: I wrote up what needed doing and had Codex do the work here (though I have checked the logic and it looks correct).

New contributor declaration

I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these
rules.
I have run pre-commit run --from-ref origin/main --to-ref HEAD.
Select one of the following.
- I have added tests.
  - /test for lit tests
  - /unittest for C++ tests
  - /python/test for end-to-end tests
- This PR does not need a test because FILL THIS IN.
Select one of the following.
- I have not added any lit tests.
- The lit tests I have added follow these best practices,
  including the "tests should be minimal" section. (Usually running Python code
  and using the instructions it generates is not minimal.)

Now that the AMDGPU backend supports Memory Model Relaxation hannotations (MMRAs) on fences to indicate that only LDS / only global memory should be waited on, there is no need to manually insert waitcnt instructions, as the backend will correctly insert only the requested waits. This makes the atomic behavior we're looking for transparent to LLVM IR and avoids low-level intrinsics that require a bunch of complex bit-packing logic the backend already knows how to do. There's been a soft request from the backend folks for people to migrate to this form of fence if possible (though we at least have the benefit that we're not using inline assembly). AI disclosure: I wrote up what needed doing and had Codex do the work here (though I have checked the logic and it looks correct).

jungpark-mlir

Looks good to me, thanks for this improvement!

jungpark-mlir · 2026-05-28T12:52:07Z

            false);
  prependOp(ROCDL::SBarrierOp::create(builder, loc), false);
+  prependOp(createLocalMMRAFence(builder, loc, LLVM::AtomicOrdering::acquire),
+            false);


Just to check if I understand this correctly. In theory we don't need acquire fence here because sched.barrier is used, right?
Still good to have it to avoid any confusion.

sched.barrier isn't an actual barrier - and is very unrelated to this change - and we do need the acquire fence.

So, having looked and reminded myself how LLVM defines fence, the reason this works is that s.barrier is, conceptually, an atomic operation on something we can model as inaccessible memory, and so a release fence followed by the "atomic" s.barrier that actually synchronizes the lanes, followed by the acquire fence, establishes a happens-before relationship between those two fences.

That, in turn, establishes a happens-before relationship between memory operations before the fence and those after the fence across the entire workgroup (which is the syncscope).

Now, those MMRAs prevent the classical "fences fence too much" issue where that relationship applies to global accesses, breaking software pipelining by forcing vmcnts. The MMRA metadata ensures that only (for example) operations on LDS get fenced.

krzysz00 · 2026-05-28T17:51:55Z

Performance notes:

I eyeballed some diffs. The LLVM IR changes appear to be as expected (though there's a sea of renumberings so it's hard to tell at some point).

At the assembly level, using fences instead of manual waitcnts appears to have givin the compiler more freedom in shuffling a few LDS loads or index computations relative to each other and to, it at least one case, allowed a redundant s_waitcnt lgkmcnt(0) to be removed.

However, when I ran the matmul tutorial benchmarks, all these changes lived below the bdenchmark noise floor, and I've seen no evidence that sched.barrier usage broke here.

So there's no perf argument against landing this.

antiagainst · 2026-05-29T05:25:40Z

Nice. Thanks @krzysz00!

krzysz00 requested review from antiagainst, ptillet and zhanglx13 as code owners May 26, 2026 20:27

jungpark-mlir reviewed May 28, 2026

View reviewed changes

jungpark-mlir approved these changes May 28, 2026

View reviewed changes

antiagainst approved these changes May 29, 2026

View reviewed changes

antiagainst merged commit 06a81df into triton-lang:main May 29, 2026
9 of 10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMD] Migrate to MMRA-annotated fences for memory barriers#10383

[AMD] Migrate to MMRA-annotated fences for memory barriers#10383
antiagainst merged 1 commit into
triton-lang:mainfrom
krzysz00:use-mmra-for-memory-fencing

krzysz00 commented May 26, 2026

Uh oh!

jungpark-mlir left a comment

Uh oh!

jungpark-mlir May 28, 2026

Uh oh!

krzysz00 May 28, 2026

Uh oh!

krzysz00 commented May 28, 2026

Uh oh!

antiagainst commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

krzysz00 commented May 26, 2026

New contributor declaration

Uh oh!

jungpark-mlir left a comment

Choose a reason for hiding this comment

Uh oh!

jungpark-mlir May 28, 2026

Choose a reason for hiding this comment

Uh oh!

krzysz00 May 28, 2026

Choose a reason for hiding this comment

Uh oh!

krzysz00 commented May 28, 2026

Uh oh!

antiagainst commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants