[AMD] Add atomicRMW dpp logic #5072
    ? 2
    : 1);
    vec = std::min<unsigned>(vec,
    llvm::isa<FloatType>(valueElemTy) &&

Nit: can we extract this ternary expression as a local variable? Reads better that way.
    // vec = 1, numElements = 1 for scalar
    auto vec = getVectorSize(ptr);
    int numElems = 1;
    bool forceF16packing = false;

Maybe rename it to `useDppForPackedF16` to be clear? Also, can you add a comment explaining what this is for? Basically, embed the commit message you have in the pull request, and also mention that "this way enables us to use half the active threads committing atomic requests, thus reducing contention and improving atomics performance".
    valueElemTy.getIntOrFloatBitWidth() == 16
    ? 2
    : 1);
    // Force F16 packing in the case it couldn't be already packed, ISA

".. in the case it's not coming in as packed, but the ISA can support packed atomic instructions."
    // supports global atomics packed instructions
    forceF16packing = targetInfo.getISAFamily() == AMD::ISAFamily::CDNA3 &&
    vec == 1 && llvm::isa<FloatType>(valueElemTy) &&
    valueElemTy.getIntOrFloatBitWidth() == 16;

I think we should specifically check `valueElemTy.isF16() || valueElemTy.isBF16()`? Those are the only two variants with hardware packed ISA, right?
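The tightened condition suggested in this comment could look roughly like the sketch below. It is only an illustration: `ISAFamily`, `ElemKind`, and the free function `useDppForPackedF16` are hypothetical stand-ins for the real Triton/MLIR types and are not taken from the PR.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real Triton/MLIR types (illustration only).
enum class ISAFamily { Unknown, CDNA2, CDNA3 };
enum class ElemKind { F16, BF16, F32, I16 };

// Returns true when a scalar (vec == 1) 16-bit float atomic could be paired
// with its neighbour lane and issued as a packed atomic.
// Assumption: only f16 and bf16 are treated as packable, as the review
// comment suggests; other 16-bit types are rejected.
bool useDppForPackedF16(ISAFamily isa, unsigned vec, ElemKind elem) {
  const bool isPackable16BitFloat =
      elem == ElemKind::F16 || elem == ElemKind::BF16;
  return isa == ISAFamily::CDNA3 && vec == 1 && isPackable16BitFloat;
}
```

Compared with a bare bit-width check, naming the element types explicitly keeps any other 16-bit type from slipping into the packed path by accident.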
    auto tid = tid_val();
    mask = and_(mask,
    icmp_slt(mul(tid, i32_val(elemsPerThread)), i32_val(numElems)));
    if (forceF16packing)

Add a comment to explain what this is for.
    if (vec == 1) {
    if (forceF16packing) {
    Value old = i32_val(0);
    int dppCtrl = 0x101; // sh left 1 lane

Let's be complete here: "Shift left"
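For readers unfamiliar with the `0x101` control value, here is a small host-side model of what a DPP row shift left by one lane does to the data. It is a sketch under stated assumptions: lanes are grouped in rows of 16, lane `i` reads from lane `i + 1` of the same row, and a lane with no source inside its row keeps the `old` value (the behaviour expected when bound control is off). The function name `rowShiftLeft1` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed DPP row size: 16 lanes per row.
constexpr std::size_t kRowSize = 16;

// Models a "row shift left by 1" across a whole wave on the host.
// Lane i receives the value held by lane i + 1 of the same row; the last
// lane of each row has no source and keeps `old`.
std::vector<std::uint32_t> rowShiftLeft1(const std::vector<std::uint32_t> &src,
                                         std::uint32_t old) {
  assert(src.size() % kRowSize == 0 && "wave size must be a multiple of 16");
  std::vector<std::uint32_t> dst(src.size(), old);
  for (std::size_t lane = 0; lane < src.size(); ++lane) {
    const bool lastInRow = (lane % kRowSize) == kRowSize - 1;
    if (!lastInRow)
      dst[lane] = src[lane + 1];
  }
  return dst;
}
```

In the pairing scheme only even lanes consume the shifted value, and an even lane's right-hand neighbour is always in the same row, so the row-boundary case does not affect the master threads.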
    Value operand;
    if (vec == 1) {
    if (forceF16packing) {
    Value old = i32_val(0);

Maybe extract this into a utility function `dppPack2xF16`?
    vec == 1 ? retVal
    : extract_element(valueElemTy, retVal, i32_val(ii));
    if (forceF16packing) {
    Value old = i32_val(0);

Maybe extract this into a utility function `dppUnpack2xF16`?
    }
    };

    bool supportedGlobalAtomicF16PackedAndDpp(triton::AMD::ISAFamily isaFamily) {
    int rowMask = 0b1111; // enable all rows
    int bankMask = 0b1111; // enable all banks
    bool boundCtrl = false;
    auto dppMovRes =

Nit: `auto dppMovOp = ..create<..>(); return dppMovOp.getResult();` reads a bit better.
    // accelerate atomics. Here is an algorithm of lowering
    // tt::atomicRmwOp(%ptr, %val, %mask):
    // 0. Group thread by pairs. Master
    // thread is (tid % 2 == 0);

Style nit: merge this line with the previous line.
    // thread is (tid % 2 == 0);
    // 1. All the threads send %val to (tid - 1) thread via dppUpdateOp shl, so
    // all the masters receive value from secondary threads;
    // 2. Take into account parity in the %mask value, build CF structures
    if (vec == 1) {
    if (useDppForPackedF16) {
    // Move %val to left neighbour to proceed packed atomic further.
    Value packedVal = undef(packF16Ty);

Use zero instead of undef here given we are only inserting into half of it?
In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics. Here is the algorithm for lowering `tt::atomicRmwOp(%ptr, %val, %mask)`:

0. Group threads by pairs. The master thread is the one with `tid % 2 == 0`;
1. All the threads send `%val` to the `(tid - 1)` thread via `dppUpdateOp shl`, so all the masters receive the value from the secondary threads;
2. Take into account parity in the `%mask` value and build control flow structures according to it;
3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value;
4. All the threads send the result of the generated operation to the `(tid + 1)` thread via `dppUpdateOp shl`, so all the secondary threads also receive their result.

This approach is an alternative to triton-lang#5028. The DPP approach has a ~5% perf improvement, so use this one when the target arch supports DPP.

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
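The pairing scheme in this commit message can be modelled on the host to see how it reduces the number of memory requests. The sketch below is not the PR's lowering code. `float` stands in for f16, lanes are processed sequentially rather than atomically, lane `i` is assumed to target element `i` of a contiguous buffer, and a pair with only one enabled lane is assumed to issue a single scalar request. `PairedAtomicResult` and `pairedAtomicAdd` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only; all names are hypothetical.
struct PairedAtomicResult {
  std::vector<float> oldValues; // per-lane value returned by the atomic add
  std::size_t atomicsIssued;    // number of memory requests actually issued
};

PairedAtomicResult pairedAtomicAdd(std::vector<float> &mem,
                                   const std::vector<float> &val,
                                   const std::vector<bool> &mask) {
  const std::size_t n = val.size();
  assert(n % 2 == 0 && mem.size() == n && mask.size() == n);
  PairedAtomicResult res{std::vector<float>(n, 0.0f), 0};

  // Step 0: lanes are grouped in pairs; the even lane is the master.
  for (std::size_t master = 0; master < n; master += 2) {
    const std::size_t secondary = master + 1;

    // Step 1: the master receives the secondary lane's value
    // (done with a DPP shift in the PR, a plain read here).
    const float packedLo = val[master];
    const float packedHi = val[secondary];

    // Step 2: the mask bits of the pair decide what gets issued.
    const bool m0 = mask[master];
    const bool m1 = mask[secondary];
    if (!m0 && !m1)
      continue;

    // Step 3: one request covers every enabled lane of the pair.
    const float old0 = mem[master];
    const float old1 = mem[secondary];
    if (m0) mem[master] += packedLo;
    if (m1) mem[secondary] += packedHi;
    ++res.atomicsIssued;

    // Step 4: each enabled lane gets its own old value back.
    if (m0) res.oldValues[master] = old0;
    if (m1) res.oldValues[secondary] = old1;
  }
  return res;
}
```

With four lanes where three are enabled, this model issues two requests instead of three, which is the contention reduction the commit message is after.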
Hi @joviliast, @antiagainst, should we try to turn on the bfloat16 type for atomic_add in this PR?

That's a separate concern which should not be bundled with this.
In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics. Here is the algorithm for lowering `tt::atomicRmwOp(%ptr, %val, %mask)`:

0. Group threads by pairs. The master thread is the one with `tid % 2 == 0`;
1. All the threads send `%val` to the `(tid - 1)` thread via `dppUpdateOp shl`, so all the masters receive the value from the secondary threads;
2. Take into account parity in the `%mask` value and build control flow (CF) structures according to it;
3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value;
4. All the threads send the result of the generated operation to the `(tid + 1)` thread via `dppUpdateOp shl`, so all the secondary threads also receive their result.

The DPP approach has a ~5% perf improvement, so use this one when the target arch supports DPP.

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
* [AMD] Emit vectorized 16-bit float LLVM atomic ops (triton-lang#4925)

  In the case of 16-bit float operands for `tt::AtomicRMWOp`, construct only one `LLVM::AtomicRMWOp` but use a vector of elements. This approach allows generating packed intrinsics and processing 2 elements at once. Added a lit test for the f16 vectorized case.

  (cherry picked from commit 78c8054)

* [AMD] Restructure ReorderInstructions pass (triton-lang#4998)

  (cherry picked from commit 86a2ac7)

* [AMD] Support warp-level reduction with DPP (triton-lang#5019)

  This commit adds support for warp-level reduction with DPP instructions, which can improve performance. See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

  (cherry picked from commit 21119e3)

* [AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

  TritonAMDGPUTransforms now depends on it.

  (cherry picked from commit 0b443ce)

* [AMD] Support warp-level reduction with DPP (triton-lang#5019)

  This commit adds support for warp-level reduction with DPP instructions, which can improve performance. See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

  (cherry picked from commit 21119e3)

* [AMD] Use DPP to accelerate 16-bit floats (triton-lang#5072)

  In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics. Here is the algorithm for lowering `tt::atomicRmwOp(%ptr, %val, %mask)`:

  0. Group threads by pairs. The master thread is the one with `tid % 2 == 0`;
  1. All the threads send `%val` to the `(tid - 1)` thread via `dppUpdateOp shl`, so all the masters receive the value from the secondary threads;
  2. Take into account parity in the `%mask` value and build control flow (CF) structures according to it;
  3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value;
  4. All the threads send the result of the generated operation to the `(tid + 1)` thread via `dppUpdateOp shl`, so all the secondary threads also receive their result.

  The DPP approach has a ~5% perf improvement, so use this one when the target arch supports DPP.

  Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>

  (cherry picked from commit bab3470)

* [AMD] Reland sinking the 2nd tt.load after local_load's (triton-lang#4935)

  This PR adds more restrictions on when we should apply the sched-load optimizations and un-reverts triton-lang#4823. We will only apply the optimization when all of the following are satisfied:

  1. pureMatmulProblem, i.e. 1 `tt.dot` in the main loop
  2. two `tt.load`s in the main loop
  3. 2nd `tt.load` is ahead of the `tt.dot`
  4. 1st user of 2nd `tt.load` is after the `tt.dot`
  5. tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64

  (cherry picked from commit 4f6f768)

---------

Co-authored-by: Ilya V <152324710+joviliast@users.noreply.github.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Co-authored-by: Kyle Wang <ec1wng@gmail.com>
Co-authored-by: Lixun Zhang <Lixun.Zhang@amd.com>
In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics. Here is the algorithm for lowering `tt::atomicRmwOp(%ptr, %val, %mask)`:

0. Group threads by pairs. The master thread is the one with `tid % 2 == 0`;
1. All the threads send `%val` to the `(tid - 1)` thread via `dppUpdateOp shl`, so all the masters receive the value from the secondary threads;
2. Take into account parity in the `%mask` value and build control flow (CF) structures according to it;
3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value;
4. All the threads send the result of the generated operation to the `(tid + 1)` thread via `dppUpdateOp shl`, so all the secondary threads also receive their result.

The DPP approach has a ~5% perf improvement, so use this one when the target arch supports DPP.

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>

(cherry picked from commit bab3470)
(cherry picked from commit 7ff5225)
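In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics. Here is the algorithm for lowering `tt::atomicRmwOp(%ptr, %val, %mask)`:

0. Group threads by pairs. The master thread is the one with `tid % 2 == 0`;
1. All the threads send `%val` to the `(tid - 1)` thread via `dppUpdateOp shl`, so all the masters receive the value from the secondary threads;
2. Take into account parity in the `%mask` value and build CF structures according to it;
3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value;
4. All the threads send the result of the generated operation to the `(tid + 1)` thread via `dppUpdateOp shl`, so all the secondary threads also receive their result.

This approach is an alternative to #5028. The DPP approach has a ~5% perf improvement, so use this one when the target arch supports DPP.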