[AMD][Gluon] Support global/buffer load to shared #7880
Conversation
|
@antiagainst @AlexAUT there some comments about constraints for |
antiagainst
left a comment
There was a problem hiding this comment.
Thanks for addressing the comments! LGTM now. @peterbell10 can you help to take a look too?
| from . import async_copy | ||
|
|
||
| __all__ = ["buffer_load_to_shared", "buffer_load", "buffer_store", "mfma", "mfma_scaled"] | ||
| __all__ = [*__cdna3_all, "async_copy", "mfma_scaled"] |
There was a problem hiding this comment.
Is it true that all ops that exist in cdna3 are present and unchanged in cdna4?
| """ | ||
| layout = _unwrap_if_constexpr(layout) | ||
| ret = _semantic.shared_load(self, layout) | ||
| ret.handle.set_attr(self.SYNCED_VIA_WAIT_ATTR_NAME, _semantic.builder.get_bool_attr(True)) |
There was a problem hiding this comment.
This code is wrong as you haven't defined a custom type for your value, so it won't be reconstructed the same after control flow. e.g.:
smem = async_hint_shared(smem)
for i in range(n):
# smem is a plain shared_memory_descriptor here, so not loaded with hint
smem.load(...)tbh though the API seems strange to me. Why not have a custom function instead:
val = async_copy.load_shared_relaxed(smem, layout)There was a problem hiding this comment.
thanks for pointing out. I initially want to keep the kernel using smem.load without changing to other api. looks like not valid then. I followed you suggestion and also included 2 frontend tests for the case you mentioned.
| """ | ||
| Wait for outstanding asynchronous memory operations, this includes | ||
| normal load like `load` and `buffer_load`, as well as all async memory | ||
| operations like `global_load_to_shared` and `buffer_load_to_shared`. |
There was a problem hiding this comment.
What is the distinction between "asynchronous memory operations" and "async memory operations"? These seem like they should be the same thing, but this doc implies load and buffer_load are "asynchronous" but not "async"?
There was a problem hiding this comment.
just reworded this part and removed the term "async/asynchronous memory operations", which is confusing in this context. all AMD memory operations when executed in hardware are asynchronous. and there are 2 categories:
- normal load (to register):
loadandbuffer_load - direct load to shared memory:
global_load_to_sharedandbuffer_load_to_shared
this function async_wait waits for all of these memory operations. we typically refer "direct load to shared memory" category as "asynchronous memory operations" for which we will manually insert async_wait. and the for the "normal load" category, we leave this task to llvm.
There was a problem hiding this comment.
I also had a question along these lines. Does Triton or AMD have a definition of async? I ask because the nvidia nomenclature includes .async_copy.async_copy_global_to_shared while AMD calls it global_load_to_shared. Have we conflated async to mean direct-to-LDS, or are async and direct-to-LDS sepparate properties?
There was a problem hiding this comment.
For Triton, async and direct-to-shared are the same and always appear together: this is ttg.async_copy_global_to_shared.
AMD has separated async and direct-to-shared properties: all AMD memory operations are async regardless of whether it is direct-to-shared or to-register. however, when using Triton for AMD, we treat direct-to-shared ops as async and needs to insert fence, also assume direct-to-register as non-async (but indeed they are still async but handled by llvm)
Nvidia is the same as Triton, as far as I know. there is only cp.async, and the destination can only be shared memory.
There was a problem hiding this comment.
Have we conflated async to mean direct-to-LDS, or are async and direct-to-LDS sepparate properties?
Triton already has 2 different concepts for them; ttg.async_copy_global_to_local has async and to_local. So async is for letting some ops finish out of order, and to_local signals direct to lds loads.
AMD memory operations are async
No, not in cuda/triton terms. All loads from HBM, global_load, buffer_load, global_load_lds and buffer_load * lds, will always complete in program/assembly order. We do not have a concept like cp.async on GFX9 which allows to have asynchronous groups of loads.
Calling them asynchronous just because they can finish out of order with ALU/LDS ops is a bit misleading since most GPU architectures will be able to do that.
So the name ttg.async_copy_global_to_local is not entirely correct on GFX9, hence why we omitted it from the direct-to-lds buffer op (amdgpu.buffer_load_to_local). We still use the AsyncToken concept to enable efficient pipelining:
when using Triton for AMD, we treat direct-to-shared ops as async and needs to insert fence, also assume direct-to-register as non-async (but indeed they are still async but handled by llvm)
Note that this is just a performance optimization because LLVM is unable to deduce the correct number of loads it needs to fence when having deeper pipelines. You can remove the fence we get from async_wait and we still get correct but slow assembly*.
*) If we disable the alias classes used to disable the conservative waits from LLVM and pickup this.
There was a problem hiding this comment.
I think we are the same page but I agree that term "asynchronous" here indeed is not correct. My understanding is memory ops should follow the happens-before relation, which is different from "asynchronous" in cuda's cp.async [1]:
Some PTX instructions (all variants of cp.async, cp.async.bulk, cp.reduce.async.bulk, wgmma.mma_async) perform operations that are asynchronous to the thread that executed the instruction. These asynchronous operations are ordered after prior instructions in the same thread (except in the case of wgmma.mma_async), but they are not part of the program order for that thread. Instead, they provide weaker ordering guarantees as documented in the instruction description.
This is why I am trying to avoid mentioning "asynchronous" for gluon ops here. I guess we need refresh the docs here to avoid the confusion.
Regarding
Note that this is just a performance optimization because LLVM is unable to deduce the correct number of loads it needs to fence when having deeper pipelines. You can remove the fence we get from async_wait and we still get correct but slow assembly*.
Yeah I am aware of it. That's why I added a separated load_shared_relaxed (for alias scope) to use in pair with async_wait. It is fine to just use globa/buffer to shared without a wait.
peterbell10
left a comment
There was a problem hiding this comment.
SGTM provided my understanding is correct.
| """ | ||
| Wait for outstanding memory operations, this includes normal load like | ||
| `load` and `buffer_load`, as well as direct load to shared memory | ||
| like `global_load_to_shared` and `buffer_load_to_shared`. |
There was a problem hiding this comment.
So to be clear:
async_copy.global_load_to_shared(a, ...)
b = ttgl.load(...)
async_copy.async_wait(num_outstanding=1)This code guaruntees a is loaded on AMD since b is counted in num_outstanding?
This is quite different from NVIDIA despite being the same ttgir ops so this is surprising to me. Not necessarily a problem for gluon though since the APIs are clearly distinct.
This PR introduces following new builtin in Gluon:
global_load_to_shared: similar tottgl.nvidia.ampere.async_copy.async_copy_global_to_sharedasync_wait: similar tottgl.nvidia.ampere.async_copy.wait_groupload_shared_relaxed: load from shared memory with hints to compiler not insert fence. should be used in pair withasync_wait. this function will annotate issued local load op to prevent LLVM emitting conservative wait counts before local load. following the logic ofannotateLocalLoadsSyncedViaAsyncWait.Along the way, there are other small changes:
maskandotherinbuffer_load_to_sharedotherincreate_async_copy_global_to_localforglobal_load_to_sharedbuffer_load_to_sharedto CDNA4-only