
[AMD] Enable masked load and pointer canonicalization pass #4638

Merged: 6 commits into triton-lang:main on Sep 12, 2024

Conversation

@giuseros (Contributor) commented Sep 3, 2024

This PR does two things:

  • We use the new `llvm.masked.{load,store}` intrinsics, so the LLVM backend takes responsibility for lowering the loads/stores.
  • We enable the pointer canonicalization pass on the Triton IR. I ran extensive testing and corrected a couple of minor issues still present in the implementation.

The reason for enabling both at the same time is that I saw a minor regression with `llvm.masked.{load,store}` which seems to go away when the pointer canonicalization is applied. This combination also seems to reduce the number of VGPRs used (at least for GEMM kernels).
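For readers unfamiliar with these intrinsics, here is a minimal LLVM IR sketch (my own illustration, not code generated by this PR) of a masked vector load and store; the 4-wide f32 element type, global address space, and 16-byte alignment are arbitrary placeholder choices:

```llvm
declare <4 x float> @llvm.masked.load.v4f32.p1(ptr addrspace(1), i32 immarg, <4 x i1>, <4 x float>)
declare void @llvm.masked.store.v4f32.p1(<4 x float>, ptr addrspace(1), i32 immarg, <4 x i1>)

define void @masked_copy(ptr addrspace(1) %src, ptr addrspace(1) %dst, <4 x i1> %mask) {
  ; Lanes with a 0 mask bit are not read from memory; they take the passthru value (here zero).
  %val = call <4 x float> @llvm.masked.load.v4f32.p1(ptr addrspace(1) %src, i32 16, <4 x i1> %mask, <4 x float> zeroinitializer)
  ; Lanes with a 0 mask bit are not written to memory.
  call void @llvm.masked.store.v4f32.p1(<4 x float> %val, ptr addrspace(1) %dst, i32 16, <4 x i1> %mask)
  ret void
}
```

Because the predication is expressed through the intrinsic rather than through explicit control flow, the backend decides how to lower it for the target (e.g. per-lane selects or predicated memory operations).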

@giuseros marked this pull request as draft September 3, 2024 16:42
@zhanglx13 (Collaborator) commented:

> This combination also seems to reduce the number of VGPRs used (at least for GEMM kernels).

Can you give a concrete example of this effect?

@giuseros (Contributor, Author) commented Sep 4, 2024

> Can you give a concrete example of this effect?

On my local machine (gfx11 card), if I run:

TRITON_HIP_USE_NEW_STREAM_PIPELINE=1 AMDGCN_ENABLE_DUMP=1 TRITON_ALWAYS_COMPILE=1 python3 python/tutorials/03-matrix-multiplication.py | grep vgpr_

I get:
[screenshot of the vgpr_count output omitted]

That is a 13% reduction in the number of registers for the best config.

@giuseros (Contributor, Author) commented Sep 4, 2024

I also want to underline that this PR is blocked by a recent PR, #4369. We will need to add volatile support to masked.load in LLVM before we can merge this.

@giuseros (Contributor, Author) commented Sep 5, 2024

Upon further thought, I sank the emission of masked.load and masked.store into llLoad and llStore (respectively), so for now we will use the masked intrinsics whenever we can.

@antiagainst (Collaborator) left a comment

Cool thanks! Just two small nits.

Two review comments on third_party/amd/lib/TritonAMDGPUToLLVM/Utility.cpp (outdated, resolved).
@antiagainst marked this pull request as ready for review September 5, 2024 16:48
@antiagainst marked this pull request as draft September 5, 2024 16:50
@antiagainst (Collaborator) commented:

Could you fix the test and update the pull request message too?

@giuseros (Contributor, Author) commented Sep 5, 2024

Unfortunately, the issue with the test revealed something more problematic in the pass. I might do a separate PR for this.

@giuseros (Contributor, Author) commented Sep 9, 2024

Hi @antiagainst, this is the PR that fixes the issue: #4678

@antiagainst changed the title from "Enable MaskedLoad and pointer canonicalization pass" to "[AMD] Enable masked load and pointer canonicalization pass" Sep 9, 2024
@giuseros (Contributor, Author) commented:

Hi @antiagainst, I rebased, fixed the nits, and also addressed the fact that we now use llLoad for LDS access as well. This means I had to add support for non-vector types when emitting the masked.load. I ran a few tests and everything seems OK (including the FA test). Let's see how the CI does.
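As an aside on the non-vector case: llvm.masked.load/store only accept vector types, so a scalar element has to be handled specially. One hypothetical way to do it (I cannot tell from this thread whether the PR takes this route or emits a plain conditional load instead) is to widen the scalar to a single-lane vector:

```llvm
; Illustration only: a scalar f32 load guarded by a 1-bit predicate, expressed
; as a 1-wide masked load. Types and alignment are placeholder choices.
declare <1 x float> @llvm.masked.load.v1f32.p1(ptr addrspace(1), i32 immarg, <1 x i1>, <1 x float>)

define float @scalar_masked_load(ptr addrspace(1) %src, i1 %pred) {
  ; Build a single-lane mask from the scalar predicate.
  %mask = insertelement <1 x i1> poison, i1 %pred, i32 0
  ; Masked-off lane takes the passthru value (zero) instead of reading memory.
  %vec = call <1 x float> @llvm.masked.load.v1f32.p1(ptr addrspace(1) %src, i32 4, <1 x i1> %mask, <1 x float> zeroinitializer)
  %val = extractelement <1 x float> %vec, i32 0
  ret float %val
}
```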

@giuseros (Contributor, Author) commented:

Hi @antiagainst, could you start the tests again?

@antiagainst marked this pull request as ready for review September 12, 2024 21:23
@antiagainst merged commit c238af8 into triton-lang:main Sep 12, 2024
7 checks passed