[Gluon] Add support for nv local_store_async#10357

Merged
ThomasRaoux merged 6 commits into triton-lang:main from ThomasRaoux:codex/gluon-st-async-shared
May 29, 2026

@ThomasRaoux
Collaborator

No description provided.

@lezcano
Contributor

lezcano commented May 23, 2026

will review next week, but if we are adding store it'd be nice to add the load as well for symmetry

@ThomasRaoux
Collaborator Author

will review next week, but if we are adding store it'd be nice to add the load as well for symmetry

there is no load equivalent to st.async

Contributor

@lezcano lezcano left a comment

In a follow-up PR, we could check if the op could be lowered to cp.async.bulk.shared::cluster.shared::cta which should hopefully emit fewer instructions.

Comment thread lib/Dialect/TritonNvidiaGPU/IR/Ops.cpp Outdated
Comment on lines +283 to +284
if (bitwidth < 8 || bitwidth > 64 || !llvm::isPowerOf2_32(bitwidth))
return emitOpError("requires 8-, 16-, 32-, or 64-bit element types");
Contributor

unnecessary
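
For reference, the condition in the quoted verifier lines rejects any element width outside the four listed in its error message. The thread is marked Outdated, so the merged code may differ. A minimal Python sketch mirroring that condition (the function name `is_supported_bitwidth` is hypothetical and not part of Triton):

```python
def is_supported_bitwidth(bitwidth: int) -> bool:
    """Hypothetical Python mirror of the quoted C++ check, inverted so that
    True means the verifier would NOT emit the error."""
    # Same test as llvm::isPowerOf2_32: positive with exactly one bit set.
    is_power_of_two = bitwidth > 0 and (bitwidth & (bitwidth - 1)) == 0
    return 8 <= bitwidth <= 64 and is_power_of_two


# Enumerate the widths the quoted condition would accept.
accepted = [b for b in range(1, 129) if is_supported_bitwidth(b)]
print(accepted)  # [8, 16, 32, 64]
```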

Comment thread lib/Dialect/TritonNvidiaGPU/IR/Ops.cpp Outdated
Comment on lines +277 to +278
if (failed(verifyCompletionBarrierLayout(getOperation(), getMbarrier())))
return failure();
Contributor

This only allows a 1-CTA mbarrier, but we could be feeding a tcgen05 op, which would need a 2-CTA one. Let's remove it altogether.


@pytest.mark.skipif(not is_cuda() or torch.cuda.get_device_capability()[0] < 9, reason="Requires hopper or newer")
@pytest.mark.parametrize("EXPECT_DELTA", [0, 4], ids=["match", "mismatch"])
def test_async_shared_store_expect_bytes(EXPECT_DELTA, device, run_wrapper, monkeypatch, num_ctas):
Contributor

we have a very similar test for TMA. Can you see if it's possible to merge them?

Collaborator Author

I don't see a way to cleanly merge those

Comment on lines +447 to +448
Value mbarrier =
mapSharedToCluster(storeLoc, mbarrierPtr, targetCTAId, rewriter);
Contributor

This should use the mbarrierPtr associated with its peer CTA if it's in 2CTA mode (once the verifier allows it). There is a helper to do that.

@ThomasRaoux
Collaborator Author

In a follow-up PR, we could check if the op could be lowered to cp.async.bulk.shared::cluster.shared::cta which should hopefully emit fewer instructions.

How can that be? That instruction copies data from shared to shared; the op here goes from registers to shared.

@lezcano
Contributor

lezcano commented May 26, 2026

ah, yes, sorry, never mind

@lezcano
Contributor

lezcano commented May 26, 2026

also, looked alright to me, but ping @peterbell10 to review the gluon part

@ThomasRaoux ThomasRaoux marked this pull request as ready for review May 27, 2026 02:36
Comment thread lib/Dialect/TritonNvidiaGPU/IR/Ops.cpp
Comment thread python/triton/experimental/gluon/language/_semantic.py Outdated
bar = mbarrier.allocate_mbarrier()
mbarrier.init(bar, count=1)
mbarrier.expect(bar, smem.nbytes_per_cta)
hopper.async_store(smem, values, bar)
Contributor

Do you know if there are any lifetime issues with the registers, similar to wgmma, or does the instruction completely finish reading the registers synchronously (via the usual SASS register dependency tracking)?

Collaborator Author

There are no lifetime issues for the registers in this case; it is fully handled by the scoreboard.
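
To illustrate why the quoted snippet calls `mbarrier.expect` with a byte count before `hopper.async_store`, here is a toy Python model of the barrier bookkeeping. It is a sketch under stated assumptions, not the hardware or Gluon implementation: the class `ToyMbarrier` and the helper `phase_completes` are hypothetical, and the model assumes that `expect` counts as one arrival while raising the expected byte count, and that a phase completes only when both pending arrivals and pending bytes reach zero.

```python
from dataclasses import dataclass


@dataclass
class ToyMbarrier:
    """Simplified, hypothetical model of an mbarrier with transaction bytes."""
    count: int                  # arrivals required per phase
    pending_arrivals: int = 0
    pending_tx_bytes: int = 0
    phase: int = 0

    def __post_init__(self) -> None:
        self.pending_arrivals = self.count

    def expect(self, nbytes: int) -> None:
        # Assumption: one arrival that also raises the expected byte count.
        self.pending_tx_bytes += nbytes
        self.pending_arrivals -= 1
        self._maybe_flip()

    def complete_tx(self, nbytes: int) -> None:
        # Models an asynchronous store reporting `nbytes` bytes as written.
        self.pending_tx_bytes -= nbytes
        self._maybe_flip()

    def _maybe_flip(self) -> None:
        # The phase advances only when nothing is outstanding.
        if self.pending_arrivals == 0 and self.pending_tx_bytes == 0:
            self.phase ^= 1
            self.pending_arrivals = self.count


def phase_completes(stored_bytes: int, expect_delta: int) -> bool:
    """Return True if the modeled phase completes after one async store."""
    bar = ToyMbarrier(count=1)
    bar.expect(stored_bytes + expect_delta)
    bar.complete_tx(stored_bytes)
    return bar.phase == 1
```

In this model, `phase_completes(128, 0)` returns `True` and `phase_completes(128, 4)` returns `False`, loosely mirroring the "match" and "mismatch" parametrization of `EXPECT_DELTA` in the test quoted earlier. What the real test asserts in the mismatch case is not shown in this conversation.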

@ThomasRaoux ThomasRaoux requested a review from peterbell10 May 29, 2026 13:32
@ThomasRaoux ThomasRaoux merged commit af1bca5 into triton-lang:main May 29, 2026
10 checks passed