[Gluon] Add support for nv local_store_async#10357
Conversation
|
will review next week, but if we are adding store it'd be nice to add the load as well for symmetry |
there is no load equivalent to |
lezcano
left a comment
There was a problem hiding this comment.
In a follow-up PR, we could check if the op could be lowered to cp.async.bulk.shared::cluster.shared::cta which should hopefully emit fewer instructions.
| if (bitwidth < 8 || bitwidth > 64 || !llvm::isPowerOf2_32(bitwidth)) | ||
| return emitOpError("requires 8-, 16-, 32-, or 64-bit element types"); |
| if (failed(verifyCompletionBarrierLayout(getOperation(), getMbarrier()))) | ||
| return failure(); |
There was a problem hiding this comment.
this just allows a 1-CTA mbarrier, while we could be feeding a tcgen05 op and we'd need a 2-cta one. Let's remove it altogether.
|
|
||
| @pytest.mark.skipif(not is_cuda() or torch.cuda.get_device_capability()[0] < 9, reason="Requires hopper or newer") | ||
| @pytest.mark.parametrize("EXPECT_DELTA", [0, 4], ids=["match", "mismatch"]) | ||
| def test_async_shared_store_expect_bytes(EXPECT_DELTA, device, run_wrapper, monkeypatch, num_ctas): |
There was a problem hiding this comment.
we have a very similar test for TMA. Can you see if it's possible to merge them?
There was a problem hiding this comment.
I don't see a way to cleanly merge those
| Value mbarrier = | ||
| mapSharedToCluster(storeLoc, mbarrierPtr, targetCTAId, rewriter); |
There was a problem hiding this comment.
This should use the mbarrierPtr associated to its peer CTA if it's in 2CTA mode (once the verifier allows it). There is a helper to do that.
how can that be? This op is for copying data from shared to shared, the one here is from reg to shared |
|
ah, yes, sorry, nevermind |
|
also, looked alright to me, but ping @peterbell10 to review the gluon part |
| bar = mbarrier.allocate_mbarrier() | ||
| mbarrier.init(bar, count=1) | ||
| mbarrier.expect(bar, smem.nbytes_per_cta) | ||
| hopper.async_store(smem, values, bar) |
There was a problem hiding this comment.
Do you know if there are any lifetime issues with the registers, similar to wgmma, or does the instruction completely finish reading the registers synchronously (via the usual SASS register dependency tracking)?
There was a problem hiding this comment.
there isn't lifetime issues for the register in this case, it is fully handled by the scoreboard
No description provided.