
[AMD] Enable masked load and pointer canonicalization pass #4638

Merged: 6 commits into triton-lang:main on Sep 12, 2024

Conversation

@giuseros (Contributor) commented Sep 3, 2024

This PR does two things:

  • We use the new `llvm.masked.{load,store}` intrinsics, so the LLVM backend takes responsibility for lowering the loads/stores.
  • We enable the pointer canonicalization pass on the Triton IR. I ran extensive testing and corrected a couple of minor issues still present in the implementation.

The reason for enabling both at the same time is that I saw a minor regression with `llvm.masked.{load,store}` which seems to go away when the pointer canonicalization is applied. This combination also seems to reduce the number of VGPRs used (at least for GEMM kernels).
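For readers unfamiliar with these intrinsics, here is a minimal LLVM IR sketch (my own illustration, not code generated by this PR) of a masked vector load and store; the 4-wide f32 element type, global address space, and 16-byte alignment are arbitrary placeholder choices:

```llvm
declare <4 x float> @llvm.masked.load.v4f32.p1(ptr addrspace(1), i32 immarg, <4 x i1>, <4 x float>)
declare void @llvm.masked.store.v4f32.p1(<4 x float>, ptr addrspace(1), i32 immarg, <4 x i1>)

define void @masked_copy(ptr addrspace(1) %src, ptr addrspace(1) %dst, <4 x i1> %mask) {
  ; Lanes with a 0 mask bit are not read from memory; they take the passthru value (here zero).
  %val = call <4 x float> @llvm.masked.load.v4f32.p1(ptr addrspace(1) %src, i32 16, <4 x i1> %mask, <4 x float> zeroinitializer)
  ; Lanes with a 0 mask bit are not written to memory.
  call void @llvm.masked.store.v4f32.p1(<4 x float> %val, ptr addrspace(1) %dst, i32 16, <4 x i1> %mask)
  ret void
}
```

Because the predication is expressed through the intrinsic rather than through explicit control flow, the backend decides how to lower it for the target (e.g. per-lane selects or predicated memory operations).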

@giuseros marked this pull request as draft September 3, 2024 16:42
@zhanglx13 (Collaborator) commented:

> This combination also seems to reduce the number of VGPRs used (at least for GEMM kernels).

Can you give a concrete example of this effect?

@giuseros (Contributor, Author) commented Sep 4, 2024

> Can you give a concrete example of this effect?

On my local machine (gfx11 card), if I run:

TRITON_HIP_USE_NEW_STREAM_PIPELINE=1 AMDGCN_ENABLE_DUMP=1 TRITON_ALWAYS_COMPILE=1 python3 python/tutorials/03-matrix-multiplication.py | grep vgpr_

I get:
[screenshot of the vgpr_count output omitted]

That is a 13% reduction in the number of registers for the best config.

@giuseros (Contributor, Author) commented Sep 4, 2024

I also want to underline that this PR is blocked by a recent PR, #4369. We will need to add volatile support to masked.load in LLVM before we can merge this.

@giuseros (Contributor, Author) commented Sep 5, 2024

Upon further thought, I sank the emission of masked.load and masked.store into llLoad and llStore (respectively), so for now we will use the masked intrinsics whenever we can.

@antiagainst (Collaborator) left a comment

Cool thanks! Just two small nits.

Two review comments on third_party/amd/lib/TritonAMDGPUToLLVM/Utility.cpp (outdated, resolved).
@antiagainst marked this pull request as ready for review September 5, 2024 16:48
@antiagainst marked this pull request as draft September 5, 2024 16:50
@antiagainst (Collaborator) commented:

Could you fix the test and update the pull request message too?

@giuseros (Contributor, Author) commented Sep 5, 2024

Unfortunately, the issue with the test revealed something more problematic in the pass. I might do a separate PR for this.

@giuseros (Contributor, Author) commented Sep 9, 2024

Hi @antiagainst, this is the PR that fixes the issue: #4678

@antiagainst changed the title from "Enable MaskedLoad and pointer canonicalization pass" to "[AMD] Enable masked load and pointer canonicalization pass" Sep 9, 2024
@giuseros (Contributor, Author) commented:

Hi @antiagainst, I rebased, fixed the nits, and also addressed the fact that we now use llLoad for LDS access as well. This means I had to add support for non-vector types when emitting the masked.load. I ran a few tests and everything seems OK (including the FA test). Let's see how the CI does.
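As an aside on the non-vector case: llvm.masked.load/store only accept vector types, so a scalar element has to be handled specially. One hypothetical way to do it (I cannot tell from this thread whether the PR takes this route or emits a plain conditional load instead) is to widen the scalar to a single-lane vector:

```llvm
; Illustration only: a scalar f32 load guarded by a 1-bit predicate, expressed
; as a 1-wide masked load. Types and alignment are placeholder choices.
declare <1 x float> @llvm.masked.load.v1f32.p1(ptr addrspace(1), i32 immarg, <1 x i1>, <1 x float>)

define float @scalar_masked_load(ptr addrspace(1) %src, i1 %pred) {
  ; Build a single-lane mask from the scalar predicate.
  %mask = insertelement <1 x i1> poison, i1 %pred, i32 0
  ; Masked-off lane takes the passthru value (zero) instead of reading memory.
  %vec = call <1 x float> @llvm.masked.load.v1f32.p1(ptr addrspace(1) %src, i32 4, <1 x i1> %mask, <1 x float> zeroinitializer)
  %val = extractelement <1 x float> %vec, i32 0
  ret float %val
}
```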

@giuseros (Contributor, Author) commented:

Hi @antiagainst, could you start the tests again?

@antiagainst marked this pull request as ready for review September 12, 2024 21:23
@antiagainst merged commit c238af8 into triton-lang:main Sep 12, 2024
7 checks passed