
[AMD][Backend] Add OptimizeDescriptorEncoding pass for AMDGPU #9792

Merged

antiagainst merged 4 commits into triton-lang:main from sriakrish:amd-opt-desc-encoding on Mar 30, 2026

Conversation

@sriakrish (Contributor):

  • Introduces the pass, reusing `AssignDescriptorMemoryLayouts`
  • Improves handling of descriptor lowerings
  • Moves shared encoding assignment for descriptor loads with dot operands from LowerLoops to this pass
  • Fixes rank-reducing descriptor loads
  • Fixes descriptors passed as kernel args
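
For orientation, here is a minimal, hypothetical sketch of the general shape of such an MLIR module pass. Triton generates its real pass classes from TableGen, so every name below is illustrative and not the PR's actual code.

```cpp
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"

namespace {
// Sketch only: a hand-rolled stand-in for a TableGen-generated pass class.
struct OptimizeDescriptorEncodingSketch
    : public mlir::PassWrapper<OptimizeDescriptorEncodingSketch,
                               mlir::OperationPass<mlir::ModuleOp>> {
  MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID(OptimizeDescriptorEncodingSketch)

  llvm::StringRef getArgument() const final {
    return "sketch-optimize-descriptor-encoding"; // hypothetical flag
  }

  void runOnOperation() override {
    mlir::ModuleOp mod = getOperation();
    // Walk descriptor-producing ops, pick a shared-memory encoding for each
    // descriptor's block type, and rewrite the descriptor types accordingly.
    // (Per the description above, the real pass reuses
    // AssignDescriptorMemoryLayouts for this step.)
    (void)mod;
  }
};
} // namespace
```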

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@sriakrish (Contributor Author):

@antiagainst

```cpp
auto encoding = type.getEncoding();
assert(isPaddedEncoding(encoding) &&
       "expected padded encoding or partitioned wrapping padded");
if (auto padded = dyn_cast<PaddedSharedEncodingAttr>(encoding)) {
```
Collaborator:
how can a tensor have a padded layout?

@sriakrish (Contributor Author), Mar 21, 2026:

We introduced this function to handle rank-reducing loads with descriptors. In the lowering of TDM copy in LoadStoreOpToLLVM, we need the layout of the descriptor's block type to properly configure the TDM copy operation. This is particularly useful in rank-reducing loads, since the block type has the layout information required for TDM copy; this function was introduced to get its linear component. The padded encoding comes from the descriptor's block type.
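
To make the data flow concrete, here is a hedged sketch of how such a helper might look, assumed to live in the same translation unit as the quoted snippet. `getBlockType()` follows Triton's TensorDescType accessor, but the helper name and exact signature are illustrative, not the PR's code.

```cpp
// Hypothetical helper (illustrative only): the padded encoding is read off
// the descriptor's block type, never off the type of a live tensor value.
static mlir::Attribute getBlockPaddedEncoding(triton::TensorDescType descTy) {
  // The block type is the RankedTensorType nested inside the descriptor;
  // its encoding records the planned shared-memory layout that the TDM
  // copy lowering needs, including for rank-reducing loads.
  mlir::RankedTensorType blockTy = descTy.getBlockType();
  mlir::Attribute encoding = blockTy.getEncoding();
  assert(isPaddedEncoding(encoding) &&
         "expected padded encoding or partitioned wrapping padded");
  return encoding;
}
```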

Contributor:

A tensor should not have a shared layout. We should check this in the verifier.

Collaborator:

> We introduced this function to handle rank-reducing loads with descriptors. In the lowering of TDM copy in LoadStoreOpToLLVM, we need the layout of the descriptor's block type to properly configure the TDM copy operation. This is particularly useful in rank-reducing loads, since the block type has the layout information required for TDM copy; this function was introduced to get its linear component. The padded encoding comes from the descriptor's block type.

The descriptor can have a shared layout to represent the destination layout, but surely the tensor wouldn't. I'm confused about how this works.

Member:

AFAICT, the descriptor type just wraps a RankedTensorType inside, and we attach the planned shared memory encoding to that inner ranked tensor type; so seemingly we have a shared layout for a tensor type here. This is how it's done on the NVIDIA side too, as shown in the tests. For AMD we are just using #ttg.padded_shared instead of #ttg.nvmma_shared.

@sriakrish is OOO right now. I changed the API here to take a separate shape and encoding to make it a bit less confusing, given that LinearLayoutConversions.h is more generic.
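
For illustration, a minimal sketch of the wrapping described above; `f32Ty`, `paddedSharedAttr`, and `ctx` are stand-ins, and the exact `TensorDescType::get` builder signature is an assumption rather than the pass's actual code.

```cpp
// Sketch only: attach the planned shared-memory encoding to the inner
// RankedTensorType, then wrap it in the descriptor type.
mlir::RankedTensorType blockTy =
    mlir::RankedTensorType::get({16, 128}, f32Ty, paddedSharedAttr);
auto descTy = triton::TensorDescType::get(ctx, blockTy); // assumed builder
// No SSA value ever carries blockTy directly; it only exists nested inside
// descTy, which is why the printed IR seems to show a tensor with a shared
// layout.
```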

Member:

Right, I think we may want to move the shared encoding attribute to be directly attached to the descriptor type itself to better disambiguate. I can create another refactoring pull request if we agree that's better.

Member:

#9851 is a prototype.

@peterbell10 (Contributor), Mar 26, 2026:

I think maybe there is a misunderstanding: the RankedTensorType with a shared encoding is never actually the type of an IR value; it's just a component of the TensorDescType. I did it this way simply because RankedTensorType already has all the attributes as well as the printing and parsing logic. We can clean this up if you guys prefer, but it's purely cosmetic IMO.

Contributor:

E.g. currently the IR looks like `!tt.tensordesc<tensor<16x128xf32, #shared>>`, but instead it could be `!tt.tensordesc<16x128xf32, #shared>` and not involve RankedTensorType at all.
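
For comparison, a hedged sketch of what the accessor surface of the proposed flattened form might look like; these getters are hypothetical, not an existing Triton API.

```cpp
// Current form nests a RankedTensorType:
//   !tt.tensordesc<tensor<16x128xf32, #shared>>
// Proposed form carries the pieces directly:
//   !tt.tensordesc<16x128xf32, #shared>
// Hypothetical accessors for the proposed type (illustrative only):
void inspect(triton::TensorDescType descTy) {
  llvm::ArrayRef<int64_t> shape = descTy.getShape(); // {16, 128}
  mlir::Type elemTy = descTy.getElementType();       // f32
  mlir::Attribute enc = descTy.getEncoding();        // #shared
  (void)shape; (void)elemTy; (void)enc;
}
```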

Collaborator:

+1, sounds much better; I didn't realize we had the ranked tensor right now.

@antiagainst (Member) left a comment:

The major concern has been peeled out to #9851, which will be resolved separately. I reviewed the rest internally earlier, so I'll land this now to move forward. Any further comments we can address post-commit. :)

antiagainst merged commit f50c8df into triton-lang:main on Mar 30, 2026; 17 of 18 checks passed.
plognjen pushed a commit to plognjen/triton that referenced this pull request Apr 14, 2026
…-lang#9792)

- Introduces the pass re-using the `AssignDescriptorMemoryLayouts`
- Better handling of descriptor lowerings
- Move shared encoding assignment for descriptor loads with dot operands
from LowerLoops to this pass
- Fixes rank-reducing descriptor loads
- Fixes descriptors passed as kernel args

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>