[AMD] Add buffer support #4716

giuseros · 2024-09-12T15:48:40Z

This PR builds on top of #4638 to finally add support for buffer operations. For now we focus on buffer load/store, but we might add more in the future. What this PR does:

- Adds inferred properties for non-negativity (`tt.non_negative`) and for the size of the memory buffers passed (`tt.within_2gb`)
- Adds a series of checks to make sure we can emit the buffer load instructions (non-negativity, 32-bit offsets, etc.)
- Changes the canonicalize-pointers pass to take the `tt.within_2gb` property into account
- Adds generic infrastructure to emit masked buffer ops. For now we use it to emit masked buffer loads and stores, but we might want to add more in the future.

I am shielding this feature behind an environment variable, AMDGCN_USE_BUFFER_OPS. This way we can enable the feature gradually and check for possible performance/correctness issues.
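As a rough illustration of how the checks and the environment-variable gate listed above could combine, here is a minimal Python sketch. The function name `can_use_buffer_ops`, its parameters, and the convention that the variable is enabled with the value "1" are assumptions made for this example; the actual checks in the PR live in the AMD backend and may look quite different.

```python
import os
from typing import Mapping, Optional


def can_use_buffer_ops(
    offset_bits: int,
    offsets_non_negative: bool,
    buffer_within_2gb: bool,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical predicate: decide whether a load/store may become a buffer op."""
    env = os.environ if env is None else env
    # Assumption: the feature is opt-in and enabled by setting the variable to "1".
    if env.get("AMDGCN_USE_BUFFER_OPS", "0") != "1":
        return False
    # All conditions named in the PR description must hold:
    # 32-bit offsets, non-negative offsets, and a buffer smaller than 2 GB.
    return offset_bits == 32 and offsets_non_negative and buffer_within_2gb
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.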

ThomasRaoux

Just putting a blocker on this, as some pieces will need a bit more discussion. Lei had mentioned those to me, so it's not a surprise, but I haven't had a chance to discuss it with Phil and the rest of the team yet.

python/triton/compiler/compiler.py

python/triton/runtime/jit.py

ThomasRaoux · 2024-09-12T15:57:39Z

Do we have some data on the performance impact of this feature? Considering the cost in extra compilations and maintenance, it would be good to have this information.

ThomasRaoux · 2024-09-12T16:21:59Z

After a quick chat with @ptillet, one problem is that the specialization will apply to all backends, even the ones that can't take advantage of it.
@antiagainst @giuseros, can you first make a separate change that refactors the specialization and allows different backends to have different specializations?

giuseros · 2024-09-12T16:43:14Z

> Do we have some data on the performance impact of this feature? Considering the cost in extra compilations and maintenance it would be good to have this information

Yes, if we run on a non-power-of-two shape we get up to 36% improvement:

These are gfx11 numbers, but I saw the same on MI200 and MI300. For power-of-two shapes the performance is similar.

giuseros · 2024-09-12T16:49:19Z

> just putting a blocker on this as some pieces will need a bit more discussion

Absolutely fine. I put it up precisely so that we can have these discussions while we get on with #4638.

giuseros · 2024-09-30T10:46:32Z

I rebased against recent fixes/refactors/etc. I also found that there is a benefit when we use it on power-of-two sizes as well (this is still on my gfx11 card):

This seems to be because fewer registers are used: we don't have to keep the set of pointers around, and we only need to update the scalar base pointer (unless there is a non-uniform update within the loop, which is rare).
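The register argument above can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes 32-bit registers, 64-bit pointers held as pairs of vector registers, and a buffer access that needs one 32-bit offset per address plus a base descriptor kept in four scalar registers. The function names are hypothetical, and the real allocation is decided by the compiler, so treat this only as an illustration of the trend.

```python
from typing import Tuple


def vgprs_for_pointer_tensor(addresses_per_thread: int) -> int:
    """Vector registers needed to keep one 64-bit pointer per address."""
    # Assumption: each 64-bit pointer occupies two 32-bit vector registers.
    return 2 * addresses_per_thread


def regs_for_buffer_access(addresses_per_thread: int) -> Tuple[int, int]:
    """(vector, scalar) registers for a scalar base plus 32-bit offsets."""
    # Assumption: one 32-bit offset per address, plus a 128-bit descriptor
    # for the base that lives in four scalar registers shared by all lanes.
    return addresses_per_thread, 4


if __name__ == "__main__":
    n = 8  # hypothetical number of addresses each thread tracks
    print("pointer tensor VGPRs:", vgprs_for_pointer_tensor(n))
    print("buffer access (VGPRs, SGPRs):", regs_for_buffer_access(n))
```

Under these assumptions the vector-register cost of addressing is halved, and a uniform update inside a loop only touches the scalar base.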

However, for "bad" configurations (i.e., the ones not picked by the tuner) I sometimes see an increase in register pressure when using buffer ops. So I still think this feature needs to be shielded behind an environment variable to allow further experimentation.

In this PR I am trying to refactor the specializations that we apply to the signature of a given function in Triton. Given a kernel, some argument properties can help compilation, e.g., divisibility by 16 or an integer being equal to 1. In a previous PR, #4716, I needed additional specializations to add buffer support in the AMD backend (and to recover some performance when using unaligned masked loads). This is my attempt to redesign the specialization support so that it allows per-backend specializations. The idea is that `AttrsDescriptor` is now the class that analyzes the parameters and adds the specializations. It also has a function table to which each backend can add more specializations.

antiagainst

Great stuff, thanks a ton for this @giuseros! Overall it looks quite good to me; I just have a bunch of small issues.

third_party/amd/backend/compiler.py

third_party/amd/lib/TritonAMDGPUToLLVM/LoadStoreOpToLLVM.cpp

antiagainst · 2024-10-03T05:49:53Z

third_party/amd/lib/TritonAMDGPUToLLVM/LoadStoreOpToLLVM.cpp

+        queue.push_back(operand.getDefiningOp());
+    }
+
+    // 2. Check the that pointer is not a block argument. We cannot


This comment is duplicated from L255?

third_party/amd/lib/TritonAMDGPUToLLVM/LoadStoreOpToLLVM.cpp

giuseros · 2024-10-03T13:13:14Z

Thanks for the review @antiagainst! I addressed almost all the comments except testing. I will do that in the next commit (possibly later today or tomorrow).

giuseros requested review from antiagainst, zhanglx13 and ptillet as code owners September 12, 2024 15:48

giuseros changed the title ~~Add buffer support~~ [AMD] Add buffer support Sep 12, 2024

giuseros mentioned this pull request Sep 12, 2024

[AMD] Add buffer operation support #4277

Closed

ThomasRaoux requested changes Sep 12, 2024


giuseros force-pushed the add_buffer_support_3 branch from fbeb3a6 to b426ce4 Compare September 13, 2024 16:33

giuseros mentioned this pull request Sep 16, 2024

Refactor compiler specializations to consider backend #4734

Merged

giuseros force-pushed the add_buffer_support_3 branch from b426ce4 to 6dafa47 Compare September 30, 2024 10:25

giuseros force-pushed the add_buffer_support_3 branch 2 times, most recently from 8894638 to 0eae68c Compare October 2, 2024 16:58

Introduce support for buffer operations

de67cc1

giuseros force-pushed the add_buffer_support_3 branch from 0eae68c to de67cc1 Compare October 2, 2024 17:46

antiagainst requested changes Oct 3, 2024


Address review feedbacks

c7cb9ff
