[FPSAN] Insert barriers in between scratch loads and stores by pawelszczerbuk · Pull Request #10055 · triton-lang/triton

pawelszczerbuk · 2026-04-16T16:06:22Z

FPSan did not sufficiently synchronize its global scratch accesses: a barrier was missing between memory accesses, which caused racy behavior and random test failures.

lezcano · 2026-04-16T17:56:23Z

Why .cta? Does this work for multi-CTA kernels?

peterbell10 · 2026-04-16T19:29:36Z

+    Location loc = op.getLoc();
+    if (op.hasGlobalRead() || op.hasGlobalWrite()) {
+      PTXBuilder ptxBuilder;
+      auto &membar = *ptxBuilder.create("membar.cta");


The PTX spec for bar.sync states that:

The barrier{.cta}.sync or barrier{.cta}.red or barrier{.cta}.arrive instruction guarantees that when the barrier completes, prior memory accesses requested by this thread are performed relative to all threads participating in the barrier. The barrier{.cta}.sync and barrier{.cta}.red instruction further guarantees that no new memory access is requested by this thread before the barrier completes.

Which to me suggests that an additional fence shouldn't be required. Is it possible that the issues are only happening in warp-specialized code where you can have the reads and writes happening in different warp partitions?

Interesting, thanks for digging that out! Doing more experiments to see what exactly is causing the failure.

So it seems that fence release/acquire is sufficient for the flakiness to go away. But WS does not explain the issue, because test_dot_fma was flaking, while not using WS at all. Not sure what to make of it.
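
For readers unfamiliar with the release/acquire pairing mentioned above, here is a host-side C++ analogy. It is not the PTX or the FpSanitizer code. `scratch`, `ready`, `writer`, `reader`, and `roundTrip` are hypothetical names invented for this sketch, and the CPU memory model is only assumed to be a useful stand-in for the GPU one.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for one scratch slot in global memory.
static int scratch = 0;
// Flag used to publish the slot; only the fences give it ordering semantics.
static std::atomic<bool> ready{false};

// Writer: plain store to scratch, then a release fence before raising the flag.
void writer(int value) {
  scratch = value;
  std::atomic_thread_fence(std::memory_order_release);
  ready.store(true, std::memory_order_relaxed);
}

// Reader: wait for the flag, then an acquire fence before the plain load.
int reader() {
  while (!ready.load(std::memory_order_relaxed)) {
    std::this_thread::yield();
  }
  std::atomic_thread_fence(std::memory_order_acquire);
  return scratch;
}

// Runs one writer and one reader and returns what the reader observed.
int roundTrip(int value) {
  scratch = 0;
  ready.store(false);
  int seen = -1;
  std::thread consumer([&] { seen = reader(); });
  std::thread producer([&] { writer(value); });
  producer.join();
  consumer.join();
  return seen;
}
```

Without the two fences, the plain accesses to `scratch` would be a data race in the C++ model, which is roughly the kind of flakiness described here.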

I notice in emitMmaEmulationLoops you both load and store from dTilePtr without a barrier in between. Is it possible this is causing issues? It seems safe because you should be loading and storing from the same thread, but technically this still requires a memory fence afaik.

Let me see, good catch!

I asked codex to find similar races, and it noted that we don't synchronise after storing to hbm in a tmem_store, as such we could have a pattern like a tmem_store into a tmem_load or the other way around that could race.

In the LLVM lowering we emit after tmem_load a

NVVM::Tcgen05WaitOp::create(rewriter, loc, NVVM::Tcgen05WaitKind::LOAD);

so here we should probably emit a barrier or at least a fence.
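
The store-then-load pattern discussed above can be modelled on the host as follows. This is only a sketch under stated assumptions: `SimpleBarrier` and `storeThenLoadNeighbour` are hypothetical helpers, the barrier stands in for a CTA-wide barrier, and none of this is the sanitizer's actual lowering.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal reusable barrier (hypothetical helper standing in for a CTA-wide barrier).
class SimpleBarrier {
public:
  explicit SimpleBarrier(int count)
      : threshold_(count), waiting_(0), generation_(0) {}

  void arriveAndWait() {
    std::unique_lock<std::mutex> lock(mutex_);
    const int gen = generation_;
    if (++waiting_ == threshold_) {
      // Last arriver releases everyone and starts a new generation.
      waiting_ = 0;
      ++generation_;
      cv_.notify_all();
    } else {
      cv_.wait(lock, [&] { return gen != generation_; });
    }
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  int threshold_;
  int waiting_;
  int generation_;
};

// Each worker stores to its own scratch slot, waits at the barrier,
// then loads the slot written by its neighbour.
std::vector<int> storeThenLoadNeighbour(int numThreads) {
  std::vector<int> scratch(numThreads, -1);
  std::vector<int> loaded(numThreads, -1);
  SimpleBarrier barrier(numThreads);
  std::vector<std::thread> workers;
  for (int tid = 0; tid < numThreads; ++tid) {
    workers.emplace_back([&, tid] {
      scratch[tid] = tid * 10;                        // "store to scratch"
      barrier.arriveAndWait();                        // barrier between store and load
      loaded[tid] = scratch[(tid + 1) % numThreads];  // "load from scratch"
    });
  }
  for (auto &worker : workers) {
    worker.join();
  }
  return loaded;
}
```

If the `arriveAndWait()` call is removed, a worker may read its neighbour's slot before it has been written, which is the analogue of a load racing with an unsynchronised scratch store.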

Jackpot, that was it :D Thanks a lot Peter!!!

we don't synchronise after storing to hbm in a tmem_store

wait, do you mean tma_store? tmem_store does not touch hbm right?

I meant this line. It's in the fp sanitiser

triton/lib/Dialect/TritonInstrument/Transforms/FpSanitizer.cpp

Line 1827 in 3be1a23

if (!createStoreScratchMemory(rewriter, loc, info->ptr, op.getSrc(), srcTy))

lezcano · 2026-04-20T07:57:12Z

+  ttg::BarrierOp::create(
+      rewriter, loc, ttg::AddrSpace::GlobalRead | ttg::AddrSpace::GlobalWrite);


nit. this GlobalRead | GlobalWrite does not do anything on nvidia so perhaps better drop it and just use Local?

~~Good point, otherwise it is getting confusing.~~

Sorry, no, we do need these for AMD. This is the common code path

pawelszczerbuk requested a review from ThomasRaoux April 16, 2026 16:06

pawelszczerbuk requested a review from ptillet as a code owner April 16, 2026 16:06

ThomasRaoux reviewed Apr 16, 2026

Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/BarrierOpToLLVM.cpp Outdated

peterbell10 reviewed Apr 16, 2026

Comment thread lib/Dialect/TritonInstrument/IR/Utility.cpp Outdated

peterbell10 reviewed Apr 16, 2026

Add barrier between load and store in mma loop emulation

19cab1f

pawelszczerbuk force-pushed the pawel/fpsan_scratch_sync_fix branch from 8d9e8ee to 19cab1f Compare April 17, 2026 16:58

Adding barrier after other scratch loads and stores

8c8c4d1

pawelszczerbuk changed the title ~~[FPSAN] Make scratch load/stores uncached, add cta barrier to global BarrierOp~~ [FPSAN] Insert barriers in between scratch loads and stores Apr 17, 2026

lit test fix

b907887

lezcano reviewed Apr 20, 2026

pawelszczerbuk requested review from lezcano and peterbell10 April 20, 2026 16:14

ThomasRaoux approved these changes Apr 20, 2026

pawelszczerbuk merged commit 147a60d into triton-lang:main Apr 20, 2026
9 checks passed
