Skip to content

[AMD] Migrate to MMRA-annotated fences for memory barriers#10383

Merged
antiagainst merged 1 commit into
triton-lang:mainfrom
krzysz00:use-mmra-for-memory-fencing
May 29, 2026
Merged

[AMD] Migrate to MMRA-annotated fences for memory barriers#10383
antiagainst merged 1 commit into
triton-lang:mainfrom
krzysz00:use-mmra-for-memory-fencing

Conversation

@krzysz00
Copy link
Copy Markdown
Contributor

Now that the AMDGPU backend supports Memory Model Relaxation hannotations (MMRAs) on fences to indicate that only LDS / only global memory should be waited on, there is no need to manually insert waitcnt instructions, as the backend will correctly insert only the requested waits. This makes the atomic behavior we're looking for transparent to LLVM IR and avoids low-level intrinsics that require a bunch of complex bit-packing logic the backend already knows how to do.

There's been a soft request from the backend folks for people to migrate to this form of fence if possible (though we at least have the benefit that we're not using inline assembly).

AI disclosure: I wrote up what needed doing and had Codex do the work here (though I have checked the logic and it looks correct).

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

Now that the AMDGPU backend supports Memory Model Relaxation
hannotations (MMRAs) on fences to indicate that only LDS / only global
memory should be waited on, there is no need to manually insert
waitcnt instructions, as the backend will correctly insert only the
requested waits. This makes the atomic behavior we're looking for
transparent to LLVM IR and avoids low-level intrinsics that require a
bunch of complex bit-packing logic the backend already knows how to
do.

There's been a soft request from the backend folks for people to
migrate to this form of fence if possible (though we at least have the
benefit that we're not using inline assembly).

AI disclosure: I wrote up what needed doing and had Codex do the work
here (though I have checked the logic and it looks correct).
Copy link
Copy Markdown
Contributor

@jungpark-mlir jungpark-mlir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks for this improvement!

false);
prependOp(ROCDL::SBarrierOp::create(builder, loc), false);
prependOp(createLocalMMRAFence(builder, loc, LLVM::AtomicOrdering::acquire),
false);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to check if I understand this correctly. In theory we don't need acquire fence here because sched.barrier is used, right?
Still good to have it to avoid any confusion.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sched.barrier isn't an actual barrier - and is very unrelated to this change - and we do need the acquire fence.

So, having looked and reminded myself how LLVM defines fence, the reason this works is that s.barrier is, conceptually, an atomic operation on something we can model as inaccessible memory, and so a release fence followed by the "atomic" s.barrier that actually synchronizes the lanes, followed by the acquire fence, establishes a happens-before relationship between those two fences.

That, in turn, establishes a happens-before relationship between memory operations before the fence and those after the fence across the entire workgroup (which is the syncscope).

Now, those MMRAs prevent the classical "fences fence too much" issue where that relationship applies to global accesses, breaking software pipelining by forcing vmcnts. The MMRA metadata ensures that only (for example) operations on LDS get fenced.

@krzysz00
Copy link
Copy Markdown
Contributor Author

Performance notes:

I eyeballed some diffs. The LLVM IR changes appear to be as expected (though there's a sea of renumberings so it's hard to tell at some point).

At the assembly level, using fences instead of manual waitcnts appears to have givin the compiler more freedom in shuffling a few LDS loads or index computations relative to each other and to, it at least one case, allowed a redundant s_waitcnt lgkmcnt(0) to be removed.

However, when I ran the matmul tutorial benchmarks, all these changes lived below the bdenchmark noise floor, and I've seen no evidence that sched.barrier usage broke here.

So there's no perf argument against landing this.

@antiagainst
Copy link
Copy Markdown
Member

Nice. Thanks @krzysz00!

@antiagainst antiagainst merged commit 06a81df into triton-lang:main May 29, 2026
9 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants