[Backend] Improve dot support to target FMA #4516
base: main
Conversation
Force-pushed from 68350e9 to 8e620d3
Force-pushed from 8e620d3 to 9d01eab
Force-pushed from 9d01eab to 3033970
Force-pushed from 3033970 to 6907073
This PR:
- Refactors FMA dot implementation
- Supports dot3d in FMA path
- Fixes several issues in operand offset computation
- Enables small dot operands
…compilation time and reduce the number of instructions in assembly; fix bug with wrong order field used for shared mem load size computation
Force-pushed from 35bae87 to fe8d557
First batch of comments; I still need to review SharedToDotOperandFMA.cpp more carefully.
@@ -1471,6 +1471,22 @@ inline bool isLayoutMmaV1(Attribute layout) {
  return isMmaV1;
}

inline SharedMemoryObject
Please add some documentation to this function.
@@ -129,6 +129,16 @@ void dumpHWLayout(RankedTensorType tensorType);
// Return a string representation of the layout of the tensor.
std::string getLayoutStr(RankedTensorType tensorType, bool useHWPointOfView);

template <typename T>
llvm::SmallVector<T> expandMatrixShapeWithBatch(llvm::ArrayRef<T> s) {
  llvm::SmallVector<T> expanded(3 - s.size(), 1);
Assert s.size() <= 3 and directly return if == 3?
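For reference, a minimal sketch of the function with the suggested assert and early return folded in (the append of the original dims is an assumption based on the truncated diff, not necessarily the PR's exact code):

#include <cassert>
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"

// Pad a rank-1/rank-2 shape to rank 3 by prepending batch dims of size 1.
template <typename T>
llvm::SmallVector<T> expandMatrixShapeWithBatch(llvm::ArrayRef<T> s) {
  assert(s.size() <= 3 && "expected a shape of rank at most 3");
  if (s.size() == 3)
    return llvm::SmallVector<T>(s.begin(), s.end());
  llvm::SmallVector<T> expanded(3 - s.size(), 1); // leading batch dims = 1
  expanded.append(s.begin(), s.end());            // then the original dims
  return expanded;
}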
@@ -3205,6 +3202,15 @@ std::string mlir::triton::gpu::getLayoutStr(RankedTensorType tensorType,
  return layoutStr;
}

llvm::SmallVector<unsigned>
mlir::triton::gpu::expandMatrixOrderWithBatch(llvm::ArrayRef<unsigned> o) {
  int oldRank = o.size();
Assert o.size() <= 3 and return directly if == 3?
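A similar sketch for the order variant; the body is an assumption inferred from the truncated diff (shift each existing order index past the prepended batch dim, which then becomes the slowest-varying one):

#include <cassert>
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"

llvm::SmallVector<unsigned> expandMatrixOrderWithBatch(llvm::ArrayRef<unsigned> o) {
  assert(o.size() <= 3 && "expected an order of rank at most 3");
  if (o.size() == 3)
    return llvm::SmallVector<unsigned>(o.begin(), o.end());
  int oldRank = o.size();
  // Positions past oldRank stay 0, so the new batch dim (index 0) lands in
  // the slowest-varying slot of the order array.
  llvm::SmallVector<unsigned> expanded(3, 0);
  for (int i = 0; i < oldRank; ++i)
    expanded[i] = o[i] + (3 - oldRank); // old dim i becomes dim i + (3 - oldRank)
  return expanded;
}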
@@ -15,7 +15,16 @@

def min_dot_size(target: GPUTarget):
    return lambda lhsType, rhsType: (16, 32, 16) if lhsType.is_int8() else (16, 16, 16)

def fma_supported(lhsType, rhsType):
Let's not touch the nvidia side for now?
Ok, but note that changes in common code also affect the nvidia side.
This is just a switch which enables this functionality in the frontend.
third_party/amd/backend/compiler.py (outdated)
def fma_supported(lhsType, rhsType):
    return lhsType == rhsType and (lhsType.is_fp16() or lhsType.is_fp32())

def gfx94_limits(lhsType, rhsType):
get_gfx94_limits
auto dTensorTy = cast<RankedTensorType>(D.getType());
auto dElemTy = dTensorTy.getElementType();
This is dead code?
In this PR, yes. It will be used in later parts to choose data-specific intrinsics instead of simple FMA.
I will remove it here.
unsigned idx[] = {b, m, n};
unsigned linearIdx = 0;
for (auto dim : llvm::reverse(order)) {
  linearIdx = linearIdx * retSize[dim] + idx[dim];
This is non-trivial. Can you add a comment to explain how values are stored in ret and why we compute linearIdx this way?
We have values scattered across multiple dimensions of the tensor, but in LLVM IR Triton stores them in a linear structure.
This part computes the linear index in that structure at which to put the dot result, according to its batch, M, and N coordinates.
I will put this part in a separate function.
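For readers following along, a standalone sketch of that linearization (hypothetical helper name; shape corresponds to retSize in the snippet, and order lists dimensions fastest-varying first):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"

unsigned linearizeBMN(llvm::ArrayRef<unsigned> idx,     // {b, m, n} coordinates
                      llvm::ArrayRef<unsigned> shape,   // extent of each dim
                      llvm::ArrayRef<unsigned> order) { // fastest dim first
  unsigned linearIdx = 0;
  // Fold from the slowest-varying dimension inward: each step scales the
  // partial index by the extent of the next-faster dimension, so the
  // coordinates act like digits of a mixed-radix number.
  for (unsigned dim : llvm::reverse(order))
    linearIdx = linearIdx * shape[dim] + idx[dim];
  return linearIdx;
}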
SmallVector<Value> aOff(aNumPtr);
for (int i = 0; i < aNumPtr; ++i) {
  aOff[i] = add(mul(offA0, strideA0), mul(offA1, strideA1));
/**
This is not the commonly used style. Can we follow the common style in the codebase to be consistent?
I will change it.
Do we have a code style guide for stuff like this?
Sometimes it is not obvious which option to pick; I've seen a few different comment styles in many places in our code base, and it seems to depend on the author's taste.
@@ -3282,6 +3286,9 @@ def kernel(X, stride_xm, stride_xk, Y, stride_yk, stride_yn, W, stride_wn, strid
    return
# make sure ld/st are vectorized
ptx = pgm.asm['ptx']
is_fma = K < 16 or N < 16 or M < 16
Can we skip testing these small matmuls for nvidia?
auto bTensorTy = cast<MemDescType>(B.getType());
auto bLayout = cast<SharedEncodingAttr>(bTensorTy.getEncoding());
auto bShapePerCTA = getShapePerCTA(bTensorTy);

Value loadFMAOp(Value dotOp, Value llA, BlockedEncodingAttr dLayout,
There is a lot of magic indexing and calculation in this function in general. Can you add more comments inside to make it easier for others to follow?
I don't think comments will make this function easier to understand.
Instead, I am trying to break it into smaller functions, but so far this is not easy.
When addressing comments, please make sure to add new commits rather than squashing into existing ones. Otherwise it's hard to re-review.
Force-pushed from fe8d557 to 4d70a5e
Hmm, let me rework this with merge commits; I did not notice this comment in time. Typically I rebase changes on top of the main branch when I have conflicts, because that way the history looks clean and is easier to review. But I see that you prefer merge updates, so I'll continue doing it that way.
Force-pushed from 4d70a5e to 04678cc
This PR is part of a PR series. The final goal is to improve the efficiency of small dot operations and bypass as many shared memory accesses as possible.
Rough list of PRs: