[Backend] Add a shared layout for padding #7212

Merged

antiagainst merged 26 commits into triton-lang:main from antiagainst:padded-shared
Jun 20, 2025
Conversation

@antiagainst
Member

This commit adds a new shared memory layout for padding.
Padding cannot be represented with a linear layout, so we need
to define it in parallel with the swizzled shared layout.

Intermediate lowering steps do not actually need to concern
themselves with the exact padding; only when making the
1-D physical allocation and creating pointers for indexing
do we need to factor the padding in. This means we can
leverage the existing linear layout facilities for reasoning
about the element mapping.

Comment thread lib/Tools/LinearLayout.cpp Outdated
@antiagainst antiagainst marked this pull request as ready for review June 18, 2025 16:39
return get(context, intervals, paddings, order, ctaLayout);
}]>,
AttrBuilder<(ins "ArrayRef<int64_t>":$shape, "ArrayRef<unsigned>":$order,
"unsigned":$dotKWidth, "unsigned":$elemBitWidth,
Collaborator

can we move those builders to helper functions instead? We have that for the other shared layouts and it's super annoying (and we have been meaning to clean it up)

Collaborator

for context, it's annoying because builders are not expected to contain logic that decides on the layout

Member Author

Yeah good point. Done with a66fa0d.

Collaborator

sorry, my comment should have been clearer. What I meant is that adding a builder that takes dotKWidth and other such parameters is confusing, as it contains logic for avoiding bank conflicts based on the register layout. That makes the code very hard to read, since builders are usually expected to be simple and not contain logic related to bank conflicts or other such considerations. My suggestion was to make this an explicit function that calls into the default builder.

Member Author

@antiagainst antiagainst Jun 20, 2025

Yeah, makes sense. I don't need this builder right away. (It was added because I also enabled the pipeliner on the AMD side to emit this layout, just to try out correctness with b622870, but I reverted that to keep this pull request focused on the core changes.) So I just dropped it with 25221f4 and can do it properly later when needed.

Comment thread lib/Tools/LinearLayout.cpp Outdated
Contributor

@Jokeren Jokeren left a comment


In which cases should we still use PaddedSharedEncoding but not the swizzled layout?

@antiagainst
Member Author

In which cases should we still use PaddedSharedEncoding but not the swizzled layout?

One example is CDNA4, where we have a global-load-direct-to-LDS (i.e. shared memory) instruction. However, that instruction does not support scattered writes into LDS--the whole warp uses one m0 register to describe the base offset (see 3.7. M0 Memory Descriptor in https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-cdna4-instruction-set-architecture.pdf), and the whole warp can only write consecutive banks. So we cannot do a swizzled write the normal way. Instead we need to "reverse"-swizzle the global pointers when loading from global memory. That "reverse" swizzle introduces overhead, because we need to do warp shuffles with ds_permute to exchange global pointers, etc., which is a source of performance issues. Using a padded layout avoids that cost.

antiagainst referenced this pull request in antiagainst/triton Jun 19, 2025
Contributor

@lezcano lezcano left a comment


Amazing! Just a minor point (and see Thomas' point) but otherwise looks great!

Comment on lines +265 to +268
if (auto paddedLayout =
dyn_cast<gpu::PaddedSharedEncodingAttr>(allocType.getEncoding())) {
SmallVector<int64_t> unpaddedShape = gpu::getShapePerCTA(allocType);
numElems = paddedLayout.getPaddedSize(unpaddedShape);
Contributor

It might be better to do it inside getAllocationShapePerCTA

Member Author

I actually was trying to do that. Then I realized it's not really compatible--getAllocationShapePerCTA assumes the original ranked shape, while after factoring in padding we fundamentally only have a 1-D size. Also, getAllocationShapePerCTA is used in quite a few places that assume the original rank. So I ended up doing it this way, given that we only care about the exact physical memory when doing the allocation or the final pointer indexing.

Contributor

actually makes sense.

Contributor

@lezcano lezcano left a comment

LGTM but let's wait for Thomas' ok


Collaborator

@ThomasRaoux ThomasRaoux left a comment

LGTM

@antiagainst antiagainst merged commit 526d168 into triton-lang:main Jun 20, 2025
9 checks passed
@antiagainst antiagainst deleted the padded-shared branch June 20, 2025 22:09
antiagainst pushed a commit that referenced this pull request Jul 7, 2025
For padded layouts introduced by
#7212 we need to add the
padding to the base ptr of the resulting subview.
dshi7 pushed a commit to facebookexperimental/triton that referenced this pull request Aug 8, 2025
For padded layouts introduced by
triton-lang/triton#7212 we need to add the
padding to the base ptr of the resulting subview.
tie-pilot-qxw pushed a commit to tie-pilot-qxw/triton that referenced this pull request Aug 30, 2025