[AMD][Gluon] Support global/buffer load to shared by borontion · Pull Request #7880 · triton-lang/triton

borontion · 2025-08-15T23:39:11Z

This PR introduces the following new builtins in Gluon:

global_load_to_shared: similar to ttgl.nvidia.ampere.async_copy.async_copy_global_to_shared
async_wait: similar to ttgl.nvidia.ampere.async_copy.wait_group
load_shared_relaxed: loads from shared memory with a hint telling the compiler not to insert a fence. It should be used in pair with async_wait. This function annotates the issued local load op to prevent LLVM from emitting conservative wait counts before the local load, following the logic of annotateLocalLoadsSyncedViaAsyncWait.

Along the way, there are other small changes:

Support broadcast mask and other in buffer_load_to_shared
Expose other in create_async_copy_global_to_local for global_load_to_shared
Change buffer_load_to_shared to CDNA4-only

borontion · 2025-08-16T05:06:19Z

@antiagainst @AlexAUT there are some comments about constraints for global_load_to_shared and buffer_load_to_shared, mostly based on my understanding of BufferLoadToLocalOpConversion and AsyncCopyGlobalToLocalOpConversion. Let me know whether they are accurate.

antiagainst

Thanks for addressing the comments! LGTM now. @peterbell10 can you help to take a look too?

peterbell10 · 2025-08-17T23:17:46Z

+from . import async_copy

-__all__ = ["buffer_load_to_shared", "buffer_load", "buffer_store", "mfma", "mfma_scaled"]
+__all__ = [*__cdna3_all, "async_copy", "mfma_scaled"]


Is it true that all ops that exist in cdna3 are present and unchanged in cdna4?

peterbell10 · 2025-08-17T23:26:41Z

+        """
+        layout = _unwrap_if_constexpr(layout)
+        ret = _semantic.shared_load(self, layout)
+        ret.handle.set_attr(self.SYNCED_VIA_WAIT_ATTR_NAME, _semantic.builder.get_bool_attr(True))


This code is wrong as you haven't defined a custom type for your value, so it won't be reconstructed the same after control flow. e.g.:

smem = async_hint_shared(smem)
for i in range(n):
    # smem is a plain shared_memory_descriptor here, so not loaded with hint
    smem.load(...)

tbh though the API seems strange to me. Why not have a custom function instead:

val = async_copy.load_shared_relaxed(smem, layout)

Thanks for pointing this out. I initially wanted to keep the kernel using smem.load without changing to another API, but it looks like that is not valid. I followed your suggestion and also included 2 frontend tests for the case you mentioned.
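The failure mode discussed above can be illustrated with a toy model in plain Python. This is only a sketch, not Triton's implementation: `Handle`, `SharedDescriptor`, `rebuild_from_type`, and the attribute name `synced_via_wait` are hypothetical stand-ins, and the assumption is that a loop-carried value is rebuilt from its type alone, so any extra Python-side flag is dropped.

```python
from dataclasses import dataclass, field


@dataclass
class Handle:
    """Stand-in for an IR value; attrs models attributes set on an op."""
    attrs: dict = field(default_factory=dict)


@dataclass
class SharedDescriptor:
    """Hypothetical stand-in for a shared memory descriptor."""
    handle: Handle
    relaxed: bool = False  # Python-side flag that is NOT part of the type

    def load(self):
        op = Handle()
        if self.relaxed:
            op.attrs["synced_via_wait"] = True
        return op


def rebuild_from_type(value):
    """Assumed model of control flow: only type-level information survives."""
    return type(value)(handle=value.handle)


def load_shared_relaxed(desc):
    """Free function: the hint is attached at the call site, not on the value."""
    op = desc.load()
    op.attrs["synced_via_wait"] = True
    return op


smem = SharedDescriptor(Handle(), relaxed=True)
inside_loop = rebuild_from_type(smem)      # what the loop body would see
hinted_by_flag = inside_loop.load()        # hint silently lost
hinted_by_call = load_shared_relaxed(inside_loop)  # hint always present
```

In this model the flag-based approach only works outside control flow, while the free function does not depend on any state carried by the descriptor, which is the design the review settled on.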

peterbell10 · 2025-08-17T23:29:54Z

+    """
+    Wait for outstanding asynchronous memory operations, this includes
+    normal load like `load` and `buffer_load`, as well as all async memory
+    operations like  `global_load_to_shared` and `buffer_load_to_shared`.


What is the distinction between "asynchronous memory operations" and "async memory operations"? These seem like they should be the same thing, but this doc implies load and buffer_load are "asynchronous" but not "async"?

Just reworded this part and removed the term "async/asynchronous memory operations", which is confusing in this context. All AMD memory operations are asynchronous when executed in hardware, and there are 2 categories:

normal load (to register): load and buffer_load

direct load to shared memory: global_load_to_shared and buffer_load_to_shared

This function, async_wait, waits for all of these memory operations. We typically refer to the "direct load to shared memory" category as "asynchronous memory operations", for which we manually insert async_wait. For the "normal load" category, we leave this task to LLVM.

I also had a question along these lines. Does Triton or AMD have a definition of async? I ask because the NVIDIA nomenclature includes .async_copy.async_copy_global_to_shared while AMD calls it global_load_to_shared. Have we conflated async to mean direct-to-LDS, or are async and direct-to-LDS separate properties?

For Triton, async and direct-to-shared are the same and always appear together: this is ttg.async_copy_global_to_shared.

AMD separates the async and direct-to-shared properties: all AMD memory operations are async regardless of whether they are direct-to-shared or to-register. However, when using Triton for AMD, we treat direct-to-shared ops as async and need to insert a fence, and we assume direct-to-register ops are non-async (they are in fact still async, but handled by LLVM).

NVIDIA is the same as Triton, as far as I know. There is only cp.async, and the destination can only be shared memory.

Have we conflated async to mean direct-to-LDS, or are async and direct-to-LDS separate properties?

Triton already has 2 different concepts for them; ttg.async_copy_global_to_local has async and to_local. So async is for letting some ops finish out of order, and to_local signals direct to lds loads.

AMD memory operations are async

No, not in cuda/triton terms. All loads from HBM, global_load, buffer_load, global_load_lds and buffer_load * lds, will always complete in program/assembly order. We do not have a concept like cp.async on GFX9 which allows to have asynchronous groups of loads.
Calling them asynchronous just because they can finish out of order with ALU/LDS ops is a bit misleading since most GPU architectures will be able to do that.

So the name ttg.async_copy_global_to_local is not entirely correct on GFX9, hence why we omitted it from the direct-to-lds buffer op (amdgpu.buffer_load_to_local). We still use the AsyncToken concept to enable efficient pipelining:

when using Triton for AMD, we treat direct-to-shared ops as async and need to insert a fence, and we assume direct-to-register ops are non-async (they are in fact still async, but handled by LLVM)

Note that this is just a performance optimization because LLVM is unable to deduce the correct number of loads it needs to fence when having deeper pipelines. You can remove the fence we get from async_wait and we still get correct but slow assembly*.

*) If we disable the alias classes used to disable the conservative waits from LLVM and pickup this.
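The performance point above can be made concrete with a small arithmetic sketch. It assumes, purely for illustration, that the conservative fallback is to wait for every pending load, and that loads retire in issue order as described earlier in this comment. The function name `forced_retirements` and the pipeline depth are hypothetical.

```python
def forced_retirements(num_stages, num_outstanding):
    """Count loads that must finish before the oldest tile can be read.

    Toy model: num_stages loads are in flight, they retire in issue order,
    and the wait leaves at most num_outstanding of them pending.
    """
    return max(num_stages - num_outstanding, 0)


stages = 3  # hypothetical pipeline depth

# Conservative wait: nothing may remain in flight.
conservative = forced_retirements(stages, num_outstanding=0)

# Precise wait: only the oldest load has to be complete.
precise = forced_retirements(stages, num_outstanding=stages - 1)
```

Both choices leave the oldest tile safe to read in this model. The conservative one simply drains the whole pipeline first, which matches the "correct but slow" outcome described above.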

I think we are on the same page, and I agree that the term "asynchronous" is indeed not correct here. My understanding is that memory ops should follow the happens-before relation, which is different from "asynchronous" in CUDA's cp.async [1]:

Some PTX instructions (all variants of cp.async, cp.async.bulk, cp.reduce.async.bulk, wgmma.mma_async) perform operations that are asynchronous to the thread that executed the instruction. These asynchronous operations are ordered after prior instructions in the same thread (except in the case of wgmma.mma_async), but they are not part of the program order for that thread. Instead, they provide weaker ordering guarantees as documented in the instruction description.

This is why I am trying to avoid mentioning "asynchronous" for Gluon ops here. I guess we need to refresh the docs to avoid the confusion.

Regarding

Note that this is just a performance optimization because LLVM is unable to deduce the correct number of loads it needs to fence when having deeper pipelines. You can remove the fence we get from async_wait and we still get correct but slow assembly*.

Yeah, I am aware of it. That's why I added a separate load_shared_relaxed (for alias scope) to use in pair with async_wait. It is fine to just use global/buffer load to shared without a wait.

peterbell10

SGTM provided my understanding is correct.

peterbell10 · 2025-08-18T21:27:39Z

+    """
+    Wait for outstanding memory operations, this includes normal load like
+    `load` and `buffer_load`, as well as direct load to shared memory
+    like `global_load_to_shared` and `buffer_load_to_shared`.


So to be clear:

async_copy.global_load_to_shared(a, ...)
b = ttgl.load(...)
async_copy.async_wait(num_outstanding=1)

This code guarantees a is loaded on AMD since b is counted in num_outstanding?

This is quite different from NVIDIA despite being the same ttgir ops so this is surprising to me. Not necessarily a problem for gluon though since the APIs are clearly distinct.
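The counting behaviour in question can be sketched with a toy queue in plain Python. This is a model under stated assumptions, not the hardware or the Gluon API: it assumes memory operations retire strictly in issue order and that register loads and direct-to-shared loads share one counter, as the docstring above describes. The class `MemoryQueue` and its methods are hypothetical.

```python
from collections import deque


class MemoryQueue:
    """Toy model: memory ops retire strictly in issue order."""

    def __init__(self):
        self.outstanding = deque()
        self.completed = []

    def issue(self, name):
        self.outstanding.append(name)

    def wait(self, num_outstanding):
        # Retire the oldest ops until at most num_outstanding remain in flight.
        while len(self.outstanding) > num_outstanding:
            self.completed.append(self.outstanding.popleft())


q = MemoryQueue()
q.issue("global_load_to_shared(a)")  # direct load to shared memory
q.issue("load(b)")                   # normal load to registers
q.wait(num_outstanding=1)
```

Under these assumptions the wait retires the copy of a and leaves only the load of b in flight, because b occupies the single allowed outstanding slot.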

borontion added 9 commits August 14, 2025 22:47

support implicit broadcast

dda7cfd

expose async copy

fef6e35

add async load example

70872f5

add buffer load example

8648892

update comments

bf42f72

fix tensor shape

c7dc02c

expose other value for async copy

684dfae

add async annotation

50216d1

set random seed

775f90b

borontion changed the title ~~[WIP][AMD][Gluon] Support async buffer and global load~~ [WIP][AMD][Gluon] Support async global/buffer load to shared Aug 16, 2025

borontion added 6 commits August 15, 2025 17:11

add comments

e706668

exclude async operations to cdna4

8b04206

rename builtins

3a2a10c

update constraints load to shared

8d32602

remove 2d constraint

1197945

introduce relaxed shared memory descriptor

4293ae3

borontion marked this pull request as ready for review August 16, 2025 05:03

borontion requested review from antiagainst, peterbell10 and zhanglx13 as code owners August 16, 2025 05:03

borontion changed the title ~~[WIP][AMD][Gluon] Support async global/buffer load to shared~~ [AMD][Gluon] Support async global/buffer load to shared Aug 16, 2025

borontion added 3 commits August 15, 2025 22:08

update comments

670885b

update comments

c62c073

update tests

ca7fd35

borontion changed the title ~~[AMD][Gluon] Support async global/buffer load to shared~~ [AMD][Gluon] Support global/buffer load to shared Aug 16, 2025

update

9f8cb11

antiagainst requested changes Aug 16, 2025


borontion added 2 commits August 16, 2025 11:57

clean up

e014f6e

address comments

140c1e0

borontion added 2 commits August 16, 2025 17:36

fmt

2db0ff1

update comments

fb0a794

antiagainst requested changes Aug 17, 2025


borontion added 4 commits August 17, 2025 14:53

address comments

816b780

update description

7501534

update description

c306a94

update description

0bcc33b

antiagainst approved these changes Aug 17, 2025


peterbell10 reviewed Aug 17, 2025


borontion added 8 commits August 17, 2025 17:08

address comments

142ff86

update tests

88090e0

fix

5ebd479

update description

ac510fd

update description

e7a6ba3

update description

483ca39

update description

41887b4

update description

6e2aee8

borontion requested a review from peterbell10 August 18, 2025 04:49

Merge branch 'main' into amd-gluon-to-shared

ea3b653

peterbell10 approved these changes Aug 18, 2025


peterbell10 merged commit 48862cd into triton-lang:main Aug 18, 2025
9 checks passed

borontion deleted the amd-gluon-to-shared branch October 28, 2025 22:00
