Skip to content

[AMD][Gluon] Support global/buffer load to shared #7880

Merged
peterbell10 merged 36 commits into
triton-lang:mainfrom
borontion:amd-gluon-to-shared
Aug 18, 2025
Merged

[AMD][Gluon] Support global/buffer load to shared #7880
peterbell10 merged 36 commits into
triton-lang:mainfrom
borontion:amd-gluon-to-shared

Conversation

@borontion
Copy link
Copy Markdown
Contributor

@borontion borontion commented Aug 15, 2025

This PR introduces following new builtin in Gluon:

  • global_load_to_shared: similar to ttgl.nvidia.ampere.async_copy.async_copy_global_to_shared
  • async_wait: similar to ttgl.nvidia.ampere.async_copy.wait_group
  • load_shared_relaxed: load from shared memory with hints to compiler not insert fence. should be used in pair with async_wait. this function will annotate issued local load op to prevent LLVM emitting conservative wait counts before local load. following the logic of annotateLocalLoadsSyncedViaAsyncWait.

Along the way, there are other small changes:

  • Support broadcast mask and other in buffer_load_to_shared
  • Expose other in create_async_copy_global_to_local for global_load_to_shared
  • Change buffer_load_to_shared to CDNA4-only

@borontion borontion changed the title [WIP][AMD][Gluon] Support async buffer and global load [WIP][AMD][Gluon] Support async global/buffer load to shared Aug 16, 2025
@borontion borontion marked this pull request as ready for review August 16, 2025 05:03
@borontion borontion changed the title [WIP][AMD][Gluon] Support async global/buffer load to shared [AMD][Gluon] Support async global/buffer load to shared Aug 16, 2025
@borontion
Copy link
Copy Markdown
Contributor Author

@antiagainst @AlexAUT there some comments about constraints for global_load_to_shared and buffer_load_to_shared, mostly based on my understanding of BufferLoadToLocalOpConversion and AsyncCopyGlobalToLocalOpConversion. let me whether they are accurate.

@borontion borontion changed the title [AMD][Gluon] Support async global/buffer load to shared [AMD][Gluon] Support global/buffer load to shared Aug 16, 2025
Comment thread python/triton/experimental/gluon/language/amd/cdna4/__init__.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/__init__.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/__init__.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/__init__.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/__init__.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/__init__.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/__init__.py Outdated
Comment thread third_party/amd/lib/TritonAMDGPUToLLVM/LoadStoreOpToLLVM.cpp
Comment thread python/triton/experimental/gluon/language/amd/cdna4/__init__.py Outdated
Comment thread python/test/gluon/test_core.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/async_copy.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/async_copy.py Outdated
Comment thread python/test/gluon/test_frontend.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/__init__.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/__init__.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/async_copy.py Outdated
Comment thread python/triton/experimental/gluon/language/amd/cdna4/async_copy.py
Comment thread python/triton/experimental/gluon/language/amd/cdna4/async_copy.py Outdated
Comment thread third_party/amd/lib/TritonAMDGPUToLLVM/LoadStoreOpToLLVM.cpp
Copy link
Copy Markdown
Member

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the comments! LGTM now. @peterbell10 can you help to take a look too?

from . import async_copy

__all__ = ["buffer_load_to_shared", "buffer_load", "buffer_store", "mfma", "mfma_scaled"]
__all__ = [*__cdna3_all, "async_copy", "mfma_scaled"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true that all ops that exist in cdna3 are present and unchanged in cdna4?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

"""
layout = _unwrap_if_constexpr(layout)
ret = _semantic.shared_load(self, layout)
ret.handle.set_attr(self.SYNCED_VIA_WAIT_ATTR_NAME, _semantic.builder.get_bool_attr(True))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is wrong as you haven't defined a custom type for your value, so it won't be reconstructed the same after control flow. e.g.:

smem = async_hint_shared(smem)
for i in range(n):
    # smem is a plain shared_memory_descriptor here, so not loaded with hint
    smem.load(...)

tbh though the API seems strange to me. Why not have a custom function instead:

val = async_copy.load_shared_relaxed(smem, layout)

Copy link
Copy Markdown
Contributor Author

@borontion borontion Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for pointing out. I initially want to keep the kernel using smem.load without changing to other api. looks like not valid then. I followed you suggestion and also included 2 frontend tests for the case you mentioned.

"""
Wait for outstanding asynchronous memory operations, this includes
normal load like `load` and `buffer_load`, as well as all async memory
operations like `global_load_to_shared` and `buffer_load_to_shared`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the distinction between "asynchronous memory operations" and "async memory operations"? These seem like they should be the same thing, but this doc implies load and buffer_load are "asynchronous" but not "async"?

Copy link
Copy Markdown
Contributor Author

@borontion borontion Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just reworded this part and removed the term "async/asynchronous memory operations", which is confusing in this context. all AMD memory operations when executed in hardware are asynchronous. and there are 2 categories:

  • normal load (to register): load and buffer_load
  • direct load to shared memory: global_load_to_shared and buffer_load_to_shared

this function async_wait waits for all of these memory operations. we typically refer "direct load to shared memory" category as "asynchronous memory operations" for which we will manually insert async_wait. and the for the "normal load" category, we leave this task to llvm.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also had a question along these lines. Does Triton or AMD have a definition of async? I ask because the nvidia nomenclature includes .async_copy.async_copy_global_to_shared while AMD calls it global_load_to_shared. Have we conflated async to mean direct-to-LDS, or are async and direct-to-LDS sepparate properties?

Copy link
Copy Markdown
Contributor Author

@borontion borontion Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Triton, async and direct-to-shared are the same and always appear together: this is ttg.async_copy_global_to_shared.

AMD has separated async and direct-to-shared properties: all AMD memory operations are async regardless of whether it is direct-to-shared or to-register. however, when using Triton for AMD, we treat direct-to-shared ops as async and needs to insert fence, also assume direct-to-register as non-async (but indeed they are still async but handled by llvm)

Nvidia is the same as Triton, as far as I know. there is only cp.async, and the destination can only be shared memory.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we conflated async to mean direct-to-LDS, or are async and direct-to-LDS sepparate properties?

Triton already has 2 different concepts for them; ttg.async_copy_global_to_local has async and to_local. So async is for letting some ops finish out of order, and to_local signals direct to lds loads.

AMD memory operations are async

No, not in cuda/triton terms. All loads from HBM, global_load, buffer_load, global_load_lds and buffer_load * lds, will always complete in program/assembly order. We do not have a concept like cp.async on GFX9 which allows to have asynchronous groups of loads.
Calling them asynchronous just because they can finish out of order with ALU/LDS ops is a bit misleading since most GPU architectures will be able to do that.

So the name ttg.async_copy_global_to_local is not entirely correct on GFX9, hence why we omitted it from the direct-to-lds buffer op (amdgpu.buffer_load_to_local). We still use the AsyncToken concept to enable efficient pipelining:

when using Triton for AMD, we treat direct-to-shared ops as async and needs to insert fence, also assume direct-to-register as non-async (but indeed they are still async but handled by llvm)

Note that this is just a performance optimization because LLVM is unable to deduce the correct number of loads it needs to fence when having deeper pipelines. You can remove the fence we get from async_wait and we still get correct but slow assembly*.

*) If we disable the alias classes used to disable the conservative waits from LLVM and pickup this.

Copy link
Copy Markdown
Contributor Author

@borontion borontion Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are the same page but I agree that term "asynchronous" here indeed is not correct. My understanding is memory ops should follow the happens-before relation, which is different from "asynchronous" in cuda's cp.async [1]:

Some PTX instructions (all variants of cp.async, cp.async.bulk, cp.reduce.async.bulk, wgmma.mma_async) perform operations that are asynchronous to the thread that executed the instruction. These asynchronous operations are ordered after prior instructions in the same thread (except in the case of wgmma.mma_async), but they are not part of the program order for that thread. Instead, they provide weaker ordering guarantees as documented in the instruction description.

This is why I am trying to avoid mentioning "asynchronous" for gluon ops here. I guess we need refresh the docs here to avoid the confusion.


Regarding

Note that this is just a performance optimization because LLVM is unable to deduce the correct number of loads it needs to fence when having deeper pipelines. You can remove the fence we get from async_wait and we still get correct but slow assembly*.

Yeah I am aware of it. That's why I added a separated load_shared_relaxed (for alias scope) to use in pair with async_wait. It is fine to just use globa/buffer to shared without a wait.

@borontion borontion requested a review from peterbell10 August 18, 2025 04:49
Copy link
Copy Markdown
Contributor

@peterbell10 peterbell10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM provided my understanding is correct.

"""
Wait for outstanding memory operations, this includes normal load like
`load` and `buffer_load`, as well as direct load to shared memory
like `global_load_to_shared` and `buffer_load_to_shared`.
Copy link
Copy Markdown
Contributor

@peterbell10 peterbell10 Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to be clear:

async_copy.global_load_to_shared(a, ...)
b = ttgl.load(...)
async_copy.async_wait(num_outstanding=1)

This code guaruntees a is loaded on AMD since b is counted in num_outstanding?

This is quite different from NVIDIA despite being the same ttgir ops so this is surprising to me. Not necessarily a problem for gluon though since the APIs are clearly distinct.

Copy link
Copy Markdown
Contributor Author

@borontion borontion Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct

@peterbell10 peterbell10 merged commit 48862cd into triton-lang:main Aug 18, 2025
9 checks passed
@borontion borontion deleted the amd-gluon-to-shared branch October 28, 2025 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants