Layout conversion bypass for blocked to dotOperand #4538

binarman · 2024-08-19T14:24:19Z

This PR extends the shared memory bypass to blocked->dot operand conversions and adds a bypass check to DecomposeUnsupportedConversions and ReduceDataDuplication.

This PR is part of a series of PRs. The final goal is to improve the efficiency of small dot operations and bypass as many shared memory accesses as possible.

Rough list of PRs:

Basic FMA dot fixes, dot 3D support, and relaxing small dimensions for dot: [Backend] Improve dot support to target FMA #4516
Blocked->dotOp shared memory bypassing (this PR): Layout conversion bypass for blocked to dotOperand #4538
Accelerate AMD matmul and emit dot operations: [WIP] [AMD] Emit AMD specific intrinsics for dot #4594
Layout optimization, so that operand B is loaded in the proper mfma layout and does not need to go through LDS: [WIP] Optimize fma dot #4581
Vectorization optimization of dot operands/results (in case LLVM cannot do this internally)
Reduction operation hoisting out of the K loop (the reduction operation is a byproduct of the layout optimization step): Hoist reduction outside a loop #4559

antiagainst

Cool! Overall looks good; just a few small issues.

lib/Dialect/TritonGPU/Transforms/ReduceDataDuplication.cpp

antiagainst · 2024-08-20T06:03:00Z

lib/Analysis/Utility.cpp

+  int kDim = dotOperandLayout.getOpIdx() == 0 ? rank - 1 : rank - 2;
+  int nonKDim = dotOperandLayout.getOpIdx() == 0 ? rank - 2 : rank - 1;
+  auto ctaLayout = blockedLayout.getCTALayout();
+


One issue we have in the codebase is lots of mysterious layout/indexing logic; it's not easy for others reading the code to pick up the intent. The following might not be that tricky, but can we still add a comment to explain what the following checks are doing at a high level?

lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp

test/TritonGPU/reduce-data-duplication.mlir

test/Conversion/amd/decompose-unsupported-conversions.mlir

antiagainst · 2024-09-18T23:13:09Z

lib/Analysis/Utility.cpp

+  // i.e. tensor<64x32xf16, #dot_op<{opIdx=0, parent=#blocked}>> will have sizePerThread = [<depends on #blocked>, 32]
+  // and  tensor<64x32xf16, #dot_op<{opIdx=1, parent=#blocked}>> will have sizePerThread = [64, <depends on #blocked>]
+  //
+  // For example tensor<64x32xf16, #dot_op<{opIdx=0, parent=#blocked<{sizePerThread = [2, 8], threadsPerWarp = [32, 1]}>>>


This is going from dot operand to blocked layout? Isn't that the reverse of what we are doing in this function? I'm also not sure the distribution is correct. Doesn't this contradict the check at L571?

Yes, this is misleading, let me change this comment.

I meant that these dot and blocked layouts are equal. I should not have used "converted" here.
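The relationship that the quoted code comment describes (the non-K dimension follows the parent `#blocked` layout, while the K dimension is held in full by each thread) can be sketched in a few lines. This is only an illustration of the comment as written at review time, which the thread above agrees needs rewording; `dotOperandSizePerThread` is a hypothetical helper, not a Triton API, and it ignores everything except `sizePerThread`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Triton): per the quoted comment, a dot
// operand with a blocked parent keeps the parent's sizePerThread on the
// non-K dimensions and covers the whole tensor extent along K.
std::vector<unsigned> dotOperandSizePerThread(
    const std::vector<unsigned> &shape,
    const std::vector<unsigned> &parentSizePerThread, int opIdx) {
  assert(shape.size() >= 2 && shape.size() == parentSizePerThread.size());
  assert(opIdx == 0 || opIdx == 1);
  const std::size_t rank = shape.size();
  // K is the last dimension for operand A (opIdx 0) and the
  // second-to-last dimension for operand B (opIdx 1).
  const std::size_t kDim = opIdx == 0 ? rank - 1 : rank - 2;
  std::vector<unsigned> result = parentSizePerThread;
  result[kDim] = shape[kDim];  // each thread holds all K elements
  return result;
}
```

For a 64x32 tensor and an assumed parent `sizePerThread` of `[4, 4]`, this returns `[4, 32]` for `opIdx = 0` and `[64, 4]` for `opIdx = 1`, which has the same form as the two cases listed in the comment.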

antiagainst · 2024-09-18T23:20:31Z

lib/Analysis/Utility.cpp

+  int kDim = dotOperandLayout.getOpIdx() == 0 ? rank - 1 : rank - 2;
+  int nonKDim = dotOperandLayout.getOpIdx() == 0 ? rank - 2 : rank - 1;
+  auto ctaLayout = blockedLayout.getCTALayout();
+


Thanks for adding the comments, but the wording is quite confusing to me right now. What about something like:

The following logic checks that a source blocked layout B matches a destination dot operand layout with blocked layout parent P. It's considered a match if 1) each thread holds a whole copy of all elements along the K dimension for B, and 2) the distribution along all other non-K dimensions matches between B and P. This guarantees that each thread has all the data needed for the reduction without exchanging data with other threads. (And/or whatever other reasons why we want this kind of match.)
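The two conditions in the proposed wording can be sketched as a standalone predicate. This is a simplified illustration, not the code in lib/Analysis/Utility.cpp: `BlockedSketch` and `matchesDotOperand` are hypothetical names, the CTA layout that the diff also reads is ignored, and the sketch assumes that a thread whose lanes and warps are not split along K ends up holding every K element in its registers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a blocked layout: how each tensor
// dimension is split across registers, lanes, and warps.
struct BlockedSketch {
  std::vector<unsigned> sizePerThread;
  std::vector<unsigned> threadsPerWarp;
  std::vector<unsigned> warpsPerCTA;
};

// Mirrors the kDim expression in the diff: K is the last dimension for
// operand A (opIdx 0) and the second-to-last for operand B (opIdx 1).
int kDimFor(int opIdx, int rank) {
  assert(rank >= 2 && (opIdx == 0 || opIdx == 1));
  return opIdx == 0 ? rank - 1 : rank - 2;
}

// Returns true if source layout `src` (B) matches a dot operand whose
// blocked parent is `parent` (P), under the assumptions stated above.
bool matchesDotOperand(const BlockedSketch &src, const BlockedSketch &parent,
                       int opIdx) {
  const int rank = static_cast<int>(src.sizePerThread.size());
  const int kDim = kDimFor(opIdx, rank);
  // Condition 1: K is not split across lanes or warps, so a single thread
  // covers the whole K dimension.
  if (src.threadsPerWarp[kDim] != 1 || src.warpsPerCTA[kDim] != 1)
    return false;
  // Condition 2: every non-K dimension is distributed identically in B and P.
  for (int d = 0; d < rank; ++d) {
    if (d == kDim)
      continue;
    if (src.sizePerThread[d] != parent.sizePerThread[d] ||
        src.threadsPerWarp[d] != parent.threadsPerWarp[d] ||
        src.warpsPerCTA[d] != parent.warpsPerCTA[d])
      return false;
  }
  return true;
}
```

When the predicate holds, no data has to move between threads, which is the case in which the conversion could skip shared memory.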

binarman mentioned this pull request Aug 19, 2024

[WIP] Support small dots and optimization of dot operands #4400 (Draft)

antiagainst requested changes Aug 20, 2024

This was referenced Aug 22, 2024

Hoist reduction outside a loop #4559 (Draft)

[Backend] Improve dot support to target FMA #4516 (Draft)

[WIP] Optimize fma dot #4581 (Draft)

[WIP] [AMD] Emit AMD specific intrinsics for dot #4594 (Draft)

alefimov-amd force-pushed the fma_lds_bypass branch from de080f3 to 44c50c5 on September 12, 2024 14:39

binarman added 2 commits September 12, 2024 18:14

Layout conversion bypass for blocked to dotOperand

52e51df

This PR extends shared memory bypass for blocked->dot operand conversions and adds bypass check in DecomposeUnsupportedConversions and ReduceDataDuplication.

address review comments

5e50eb0

alefimov-amd force-pushed the fma_lds_bypass branch from 44c50c5 to 5e50eb0 on September 12, 2024 16:36

antiagainst requested changes Sep 18, 2024
