[FPSAN] Insert barriers in between scratch loads and stores#10055

Merged
pawelszczerbuk merged 3 commits into
triton-lang:mainfrom
pawelszczerbuk:pawel/fpsan_scratch_sync_fix
Apr 20, 2026

Contributor

@pawelszczerbuk pawelszczerbuk commented Apr 16, 2026

FPSan did not sufficiently synchronize global scratch accesses: we were missing a barrier between the scratch memory accesses, which led to racy behavior and random test failures.

Contributor

lezcano commented Apr 16, 2026

why .cta? Does this work for multicta kernels?

Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/BarrierOpToLLVM.cpp Outdated
Comment thread lib/Dialect/TritonInstrument/IR/Utility.cpp Outdated
Location loc = op.getLoc();
if (op.hasGlobalRead() || op.hasGlobalWrite()) {
PTXBuilder ptxBuilder;
auto &membar = *ptxBuilder.create("membar.cta");
Contributor

The PTX spec for bar.sync states that:

The barrier{.cta}.sync or barrier{.cta}.red or barrier{.cta}.arrive instruction guarantees that when the barrier completes, prior memory accesses requested by this thread are performed relative to all threads participating in the barrier. The barrier{.cta}.sync and barrier{.cta}.red instruction further guarantees that no new memory access is requested by this thread before the barrier completes.

Which to me suggests that an additional fence shouldn't be required. Is it possible that the issues are only happening in warp-specialized code where you can have the reads and writes happening in different warp partitions?
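
The guarantee quoted above (accesses made before the barrier are performed relative to every participant once the barrier completes) can be illustrated with a CPU-side analogy. The sketch below is not Triton or PTX code. `SimpleBarrier` and `exchangeThroughScratch` are hypothetical names introduced only for illustration, and host threads stand in for the threads of a CTA. Each thread stores into its own scratch slot, waits at the barrier, and only then loads a neighbour's slot.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical CPU stand-in for a CTA-wide barrier (an assumption for
// illustration, not the barrier that FPSan or PTX actually uses).
class SimpleBarrier {
public:
  explicit SimpleBarrier(std::size_t count)
      : threshold_(count), waiting_(0), generation_(0) {}

  // Blocks until `count` threads have arrived. Because every arrival passes
  // through the same mutex, writes made before arrive_and_wait() are visible
  // to all threads once they return from it.
  void arrive_and_wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    const std::size_t gen = generation_;
    if (++waiting_ == threshold_) {
      waiting_ = 0;
      ++generation_;
      cv_.notify_all();
    } else {
      cv_.wait(lock, [&] { return gen != generation_; });
    }
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::size_t threshold_, waiting_, generation_;
};

// Each thread stores (tid + 1) into its scratch slot, waits at the barrier,
// then loads the slot of its right neighbour. Returns the loaded values.
std::vector<int> exchangeThroughScratch(std::size_t numThreads) {
  std::vector<int> scratch(numThreads, 0);
  std::vector<int> loaded(numThreads, 0);
  SimpleBarrier barrier(numThreads);
  std::vector<std::thread> threads;
  for (std::size_t tid = 0; tid < numThreads; ++tid) {
    threads.emplace_back([&, tid] {
      scratch[tid] = static_cast<int>(tid) + 1;       // store to scratch
      barrier.arrive_and_wait();                      // all stores done first
      loaded[tid] = scratch[(tid + 1) % numThreads];  // load a neighbour's slot
    });
  }
  for (auto &t : threads)
    t.join();
  return loaded;
}
```

Without the `arrive_and_wait()` call, the load of the neighbour's slot would race with that neighbour's store, which is the general shape of the scratch race discussed in this thread.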

Contributor Author

Interesting, thanks for digging that out! Doing more experiments to see what exactly is causing the failure.

Contributor Author

So it seems that fence release/acquire is sufficient for the flakiness to go away. But WS does not explain the issue, because test_dot_fma was flaking, while not using WS at all. Not sure what to make of it.
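
For readers less familiar with release/acquire fences, here is a minimal CPU-side sketch of the pairing using the C++ standard library. It is only an analogy under stated assumptions: `publishThroughScratch`, `scratchSlot`, and `ready` are hypothetical names, and the code says nothing about what the GPU lowering emits or why the flakiness occurred. A release fence orders the scratch store before the flag write, and an acquire fence orders the flag read before the scratch load.

```cpp
#include <atomic>
#include <thread>

// Hypothetical CPU analogy of a release/acquire fence pair guarding a plain
// (non-atomic) scratch slot. Names are illustrative, not from the PR.
int publishThroughScratch(int value) {
  int scratchSlot = 0;             // plain scratch memory
  std::atomic<bool> ready{false};  // flag that hands the slot over
  int observed = -1;

  std::thread producer([&] {
    scratchSlot = value;                                  // store to scratch
    std::atomic_thread_fence(std::memory_order_release);  // store before flag
    ready.store(true, std::memory_order_relaxed);
  });

  std::thread consumer([&] {
    while (!ready.load(std::memory_order_relaxed)) {
      std::this_thread::yield();  // spin until the flag is set
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // flag before load
    observed = scratchSlot;                               // load from scratch
  });

  producer.join();
  consumer.join();
  return observed;
}
```

Removing either fence leaves the relaxed flag with no ordering effect on `scratchSlot`, so the consumer's load would be a data race in the C++ memory model.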

Contributor

@peterbell10 peterbell10 Apr 17, 2026

I notice in emitMmaEmulationLoops you both load and store from dTilePtr without a barrier in between. Is it possible this is causing issues? It seems safe because you should be loading and storing from the same thread, but technically this still requires a memory fence afaik.

Contributor Author

Let me see, good catch!

Contributor

I asked codex to find similar races, and it noted that we don't synchronise after storing to hbm in a tmem_store. As such, we could have a pattern like a tmem_store followed by a tmem_load, or the other way around, that could race.

In the LLVM lowering we emit after tmem_load a

    NVVM::Tcgen05WaitOp::create(rewriter, loc, NVVM::Tcgen05WaitKind::LOAD);

so here we should probably emit a barrier or at least a fence.

Contributor Author

Jackpot, that was it :D Thanks a lot Peter!!!

Contributor Author

we don't synchronise after storing to hbm in a tmem_store

wait, do you mean tma_store? tmem_store does not touch hbm right?

Contributor

I meant this line. It's in the fp sanitiser

if (!createStoreScratchMemory(rewriter, loc, info->ptr, op.getSrc(), srcTy))

@pawelszczerbuk pawelszczerbuk force-pushed the pawel/fpsan_scratch_sync_fix branch from 8d9e8ee to 19cab1f Compare April 17, 2026 16:58
@pawelszczerbuk pawelszczerbuk changed the title from "[FPSAN] Make scratch load/stores uncached, add cta barrier to global BarrierOp" to "[FPSAN] Insert barriers in between scratch loads and stores" Apr 17, 2026
Comment on lines +49 to +50
ttg::BarrierOp::create(
rewriter, loc, ttg::AddrSpace::GlobalRead | ttg::AddrSpace::GlobalWrite);
Contributor

nit. this GlobalRead | GlobalWrite does not do anything on nvidia so perhaps better drop it and just use Local?

Contributor Author

@pawelszczerbuk pawelszczerbuk Apr 20, 2026

Good point, otherwise it is getting confusing.

Contributor Author

Sorry, no, we do need these for AMD. This is the common code path

@pawelszczerbuk pawelszczerbuk merged commit 147a60d into triton-lang:main Apr 20, 2026
9 checks passed

4 participants