
[AMD] Add atomicRMW dpp logic#5072

Merged
antiagainst merged 2 commits into triton-lang:main from joviliast:atomic-dpp on Nov 12, 2024

Conversation

@joviliast
Contributor

@joviliast commented Nov 5, 2024

In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics. Here is the algorithm for lowering tt::atomicRmwOp(%ptr, %val, %mask):

  1. Group threads by pairs. The master thread is the one with (tid % 2 == 0);
  2. All threads send %val to the (tid - 1) thread via dppUpdateOp shl, so all masters receive the value from their secondary threads;
  3. Take the parity in the %mask value into account and build control-flow structures accordingly;
  4. Generate llvm::atomicRmwOp in the threads enabled by the %mask value;
  5. All threads send the result of the generated operation to the (tid + 1) thread via dppUpdateOp shl, so all secondary threads also receive their results.

This approach is an alternative to #5028. The DPP approach gives a ~5% performance improvement, so use it when the target architecture supports DPP.

@joviliast force-pushed the atomic-dpp branch 2 times, most recently from 0fe1b29 to 95f4cf2 on November 5, 2024
vec = std::min<unsigned>(vec,
                         llvm::isa<FloatType>(valueElemTy) &&
                                 valueElemTy.getIntOrFloatBitWidth() == 16
                             ? 2
                             : 1);
Member

Nit: can we extract this ternary expression into a local variable? It reads better that way.

// vec = 1, numElements = 1 for scalar
auto vec = getVectorSize(ptr);
int numElems = 1;
bool forceF16packing = false;
Member

Maybe rename it to useDppForPackedF16 to be clear? Also, can you add a comment explaining what this is for? Basically embed the commit message you have in the pull request, and also mention that "this way enables us to use half the active threads committing atomic requests, thus reducing contention and improving atomics performance".

valueElemTy.getIntOrFloatBitWidth() == 16
? 2
: 1);
// Force F16 packing in the case it couldn't be already packed, ISA
Member

".. in the case it's not coming in as packed, but the ISA can support packed atomic instructions."

// supports global atomics packed instructions
forceF16packing = targetInfo.getISAFamily() == AMD::ISAFamily::CDNA3 &&
vec == 1 && llvm::isa<FloatType>(valueElemTy) &&
valueElemTy.getIntOrFloatBitWidth() == 16;
Member

I think we should specifically check valueElemTy.isF16() || valueElemTy.isBF16()? Those are the only two variants with hardware packed ISA, right?

auto tid = tid_val();
mask = and_(mask,
icmp_slt(mul(tid, i32_val(elemsPerThread)), i32_val(numElems)));
if (forceF16packing)
Member

Add a comment to explain what this is for.

if (vec == 1) {
if (forceF16packing) {
Value old = i32_val(0);
int dppCtrl = 0x101; // sh left 1 lane
Member

Let's be complete here: "Shift left"

Value operand;
if (vec == 1) {
if (forceF16packing) {
Value old = i32_val(0);
Member

Maybe extract this into a utility function dppPack2xF16?

vec == 1 ? retVal
: extract_element(valueElemTy, retVal, i32_val(ii));
if (forceF16packing) {
Value old = i32_val(0);
Member

Maybe extract this into a utility function dppUnpack2xF16?

}
};

bool supportedGlobalAtomicF16PackedAndDpp(triton::AMD::ISAFamily isaFamily) {
Member

supports..

int rowMask = 0b1111; // enable all rows
int bankMask = 0b1111; // enable all banks
bool boundCtrl = false;
auto dppMovRes =
Member

Nit: auto dppMovOp = ..create<..>(); return dppMove.getResult() to style a bit better.

// accelerate atomics. Here is an algorithm of lowering
// tt::atomicRmwOp(%ptr, %val, %mask):
// 0. Group thread by pairs. Master
// thread is (tid % 2 == 0);
Member

Style nit: merge this line with the previous line.

// thread is (tid % 2 == 0);
// 1. All the threads send %val to (tid - 1) thread via dppUpdateOp shl, so
// all the masters recieve value from secondary threads;
// 2. Take into account parity in the %mask value, build CF structures
Member

s/CF/control flow/

if (vec == 1) {
if (useDppForPackedF16) {
// Move %val to left neighbour to proceed packed atomic further.
Value packedVal = undef(packF16Ty);
Member

Use zero instead of undef here given we are only inserting into half of it?

@antiagainst antiagainst marked this pull request as ready for review November 12, 2024 03:14
joviliast and others added 2 commits November 12, 2024 14:04
In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics.
Here is the algorithm for lowering `tt::atomicRmwOp(%ptr, %val, %mask)`:
0. Group threads by pairs. The master thread is the one with (tid % 2 == 0);
1. All threads send `%val` to the `(tid - 1)` thread via `dppUpdateOp shl`, so all
   masters receive the value from their secondary threads;
2. Take the parity in the `%mask` value into account and build control-flow structures accordingly;
3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value;
4. All threads send the result of the generated operation to the `(tid + 1)` thread via
   `dppUpdateOp shl`, so all secondary threads also receive their results.

This approach is an alternative to triton-lang#5028.
The DPP approach gives a ~5% performance improvement, so use it when the target arch supports DPP.

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
@scxiao
Contributor

scxiao commented Nov 12, 2024

Hi @joviliast, @antiagainst, should we try to turn on the bfloat16 type for atomic_add in this PR?

@antiagainst
Member

That's a separate concern which should not be bundled with this.

@antiagainst antiagainst merged commit bab3470 into triton-lang:main Nov 12, 2024
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 18, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 18, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 12, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 13, 2024
jataylo added a commit to ROCm/triton that referenced this pull request Dec 13, 2024
* [AMD] Emit vectorized 16-bit float LLVM atomic ops (triton-lang#4925)

In the case of 16-bit float operands for tt::AtomicRMWOp, construct
only one LLVM::AtomicRMWOp but use a vector of elements.
This approach allows generating packed intrinsics and processing 2
elements at once.
Added a lit test for the f16 vectorized case.

(cherry picked from commit 78c8054)

* [AMD] Restructure ReorderInstructions pass (triton-lang#4998)

(cherry picked from commit 86a2ac7)

* [AMD] Support warp-level reduction with DPP (triton-lang#5019)

This commit adds support for warp-level reduction
with DPP instructions, which can improve performance.

See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

(cherry picked from commit 21119e3)

* [AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

TritonAMDGPUTransforms now depends on it.

(cherry picked from commit 0b443ce)

* [AMD] Use DPP to accelerate 16-bit floats (triton-lang#5072)


* [AMD] Reland sinking the 2nd tt.load after local_load's (triton-lang#4935)

This PR adds more restrictions about when should we apply
the sched-load optimizations and un-revert
triton-lang#4823.

We will only apply the optimization when all of the following is
satisfied:
1. pureMatmulProblem, i.e. 1 `tt.dot` in the main loop
2. two `tt.load`s in the main loop
3. 2nd `tt.load` is ahead of the `tt.dot`
4. 1st user of 2nd `tt.load` is after the `tt.dot`
5. tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64

(cherry picked from commit 4f6f768)

---------

Co-authored-by: Ilya V <152324710+joviliast@users.noreply.github.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Co-authored-by: Kyle Wang <ec1wng@gmail.com>
Co-authored-by: Lixun Zhang <Lixun.Zhang@amd.com>
jataylo pushed a commit to ROCm/triton that referenced this pull request Jan 28, 2025

3 participants