[Backend] Add a shared layout for padding #7212
antiagainst merged 26 commits into triton-lang:main from
Conversation
This reverts commit b36f6c3.
Force-pushed from 5875a0f to 961ecc4
    return get(context, intervals, paddings, order, ctaLayout);
  }]>,
  AttrBuilder<(ins "ArrayRef<int64_t>":$shape, "ArrayRef<unsigned>":$order,
      "unsigned":$dotKWidth, "unsigned":$elemBitWidth,
can we move those builders to helper functions instead? We have that for the other shared layouts, and it's super annoying (we have been meaning to clean it up)
For context: it's annoying because builders are not expected to contain logic that decides on the layout.
Sorry, my comment should have been clearer. What I meant is that adding a builder that takes dotKWidth and other such parameters is confusing, because it contains logic on how to avoid bank conflicts based on the register layout. That makes the code very hard to read: builders are usually expected to be simple, not to contain logic related to bank conflicts or other such considerations. My suggestion is to make this an explicit function that calls into the default builder.
Yeah, makes sense. I don't need this builder right away. (It was added because I also enabled the pipeliner on the AMD side to emit this layout, just to try out correctness with b622870, but I reverted that to keep this pull request focused on the core changes.) So I dropped it with 25221f4 and can add it back properly later when needed.
Jokeren left a comment:
In which cases should we still use PaddedSharedEncoding but not the swizzled layout?
One example is CDNA4: we have a global-load-direct-to-LDS (i.e. shared memory) instruction. However, that instruction does not support scattered writes when writing to LDS--the whole warp uses one
if (auto paddedLayout =
        dyn_cast<gpu::PaddedSharedEncodingAttr>(allocType.getEncoding())) {
  SmallVector<int64_t> unpaddedShape = gpu::getShapePerCTA(allocType);
  numElems = paddedLayout.getPaddedSize(unpaddedShape);
It might be better to do it inside getAllocationShapePerCTA.
I actually was trying to do that. Then I realized it's not really compatible: getAllocationShapePerCTA assumes the original ranked shape, while after factoring in padding we fundamentally only have a 1-D size. Also, getAllocationShapePerCTA is used in quite a few places that assume the original rank. So I ended up doing it this way, given that we only care about the exact physical memory when doing allocation or the final pointer indexing.
lezcano left a comment:
LGTM but let's wait for Thomas' ok
if (auto paddedLayout =
        dyn_cast<gpu::PaddedSharedEncodingAttr>(allocType.getEncoding())) {
  SmallVector<int64_t> unpaddedShape = gpu::getShapePerCTA(allocType);
  numElems = paddedLayout.getPaddedSize(unpaddedShape);
For padded layouts introduced by #7212 we need to add the padding to the base ptr of the resulting subview.
This commit adds a new shared memory layout for padding. Padding cannot be represented with a linear layout, so we need to define it at a level parallel to the swizzled shared layout. Intermediate lowering steps don't actually need to be concerned with the exact padding; only when we make the 1-D physical allocation and create pointers for indexing do we need to factor the padding in. This means we can leverage existing linear layout facilities for reasoning about the element mapping.