
[AMD] Add atomicRMW dpp logic#5072

Merged
antiagainst merged 2 commits into triton-lang:main from joviliast:atomic-dpp on Nov 12, 2024

Conversation

@joviliast
Contributor

@joviliast commented Nov 5, 2024

In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics. Here is the algorithm for lowering tt::atomicRmwOp(%ptr, %val, %mask):

  1. Group threads by pairs. The master thread is the one with (tid % 2 == 0);
  2. All threads send %val to the (tid - 1) thread via dppUpdateOp shl, so all masters receive the value from their secondary threads;
  3. Take the parity in the %mask value into account and build control-flow structures accordingly;
  4. Generate llvm::atomicRmwOp in the threads enabled by the %mask value;
  5. All threads send the result of the generated operation to the (tid + 1) thread via dppUpdateOp shl, so all secondary threads also receive their results.

This approach is an alternative to #5028. The DPP approach gives a ~5% performance improvement, so use it when the target architecture supports DPP.

@joviliast force-pushed the atomic-dpp branch 2 times, most recently from 0fe1b29 to 95f4cf2 on November 5, 2024
vec = std::min<unsigned>(vec,
                         llvm::isa<FloatType>(valueElemTy) &&
                                 valueElemTy.getIntOrFloatBitWidth() == 16
                             ? 2
                             : 1);
Member

Nit: can we extract this ternary expression into a local variable? It reads better that way.

// vec = 1, numElements = 1 for scalar
auto vec = getVectorSize(ptr);
int numElems = 1;
bool forceF16packing = false;
Member

Maybe rename it to useDppForPackedF16 to be clear? Also, can you add a comment explaining what this is for? Basically embed the commit message you have in the pull request, and also mention that "this way enables us to use half the active threads committing atomic requests, thus reducing contention and improving atomics performance".

valueElemTy.getIntOrFloatBitWidth() == 16
? 2
: 1);
// Force F16 packing in the case it couldn't be already packed, ISA
Member

".. in the case it's not coming in as packed, but the ISA can support packed atomic instructions."

// supports global atomics packed instructions
forceF16packing = targetInfo.getISAFamily() == AMD::ISAFamily::CDNA3 &&
vec == 1 && llvm::isa<FloatType>(valueElemTy) &&
valueElemTy.getIntOrFloatBitWidth() == 16;
Member

I think we should specifically check valueElemTy.isF16() || valueElemTy.isBF16()? Those are the only two variants with hardware packed ISA, right?

auto tid = tid_val();
mask = and_(mask,
icmp_slt(mul(tid, i32_val(elemsPerThread)), i32_val(numElems)));
if (forceF16packing)
Member

Add a comment to explain what this is for.

if (vec == 1) {
if (forceF16packing) {
Value old = i32_val(0);
int dppCtrl = 0x101; // sh left 1 lane
Member

Let's be complete here: "Shift left"

Value operand;
if (vec == 1) {
if (forceF16packing) {
Value old = i32_val(0);
Member

Maybe extract this into a utility function dppPack2xF16?

vec == 1 ? retVal
: extract_element(valueElemTy, retVal, i32_val(ii));
if (forceF16packing) {
Value old = i32_val(0);
Member

Maybe extract this into a utility function dppUnpack2xF16?

}
};

bool supportedGlobalAtomicF16PackedAndDpp(triton::AMD::ISAFamily isaFamily) {
Member

supports..

int rowMask = 0b1111; // enable all rows
int bankMask = 0b1111; // enable all banks
bool boundCtrl = false;
auto dppMovRes =
Member

Nit: auto dppMovOp = ..create<..>(); return dppMove.getResult() to style a bit better.

// accelerate atomics. Here is an algorithm of lowering
// tt::atomicRmwOp(%ptr, %val, %mask):
// 0. Group thread by pairs. Master
// thread is (tid % 2 == 0);
Member

Style nit: merge this line with the previous line.

// thread is (tid % 2 == 0);
// 1. All the threads send %val to (tid - 1) thread via dppUpdateOp shl, so
// all the masters recieve value from secondary threads;
// 2. Take into account parity in the %mask value, build CF structures
Member

s/CF/control flow/

if (vec == 1) {
if (useDppForPackedF16) {
// Move %val to left neighbour to proceed packed atomic further.
Value packedVal = undef(packF16Ty);
Member

Use zero instead of undef here given we are only inserting into half of it?

@antiagainst antiagainst marked this pull request as ready for review November 12, 2024 03:14
joviliast and others added 2 commits November 12, 2024 14:04
In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics.
Here is the algorithm for lowering `tt::atomicRmwOp(%ptr, %val, %mask)`:
0. Group threads by pairs. The master thread is the one with (tid % 2 == 0);
1. All threads send `%val` to the `(tid - 1)` thread via `dppUpdateOp shl`, so all
   masters receive the value from their secondary threads;
2. Take the parity in the `%mask` value into account and build control-flow structures accordingly;
3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value;
4. All threads send the result of the generated operation to the `(tid + 1)` thread via
   `dppUpdateOp shl`, so all secondary threads also receive their results.

This approach is an alternative to triton-lang#5028.
The DPP approach gives a ~5% performance improvement, so use it when the target arch supports DPP.

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>
@scxiao
Contributor

scxiao commented Nov 12, 2024

Hi @joviliast, @antiagainst, should we try to turn on the bfloat16 type for atomic_add in this PR?

@antiagainst
Member

That's a separate concern which should not be bundled with this.

@antiagainst antiagainst merged commit bab3470 into triton-lang:main Nov 12, 2024
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 18, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 18, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 12, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 13, 2024
jataylo added a commit to ROCm/triton that referenced this pull request Dec 13, 2024
* [AMD] Emit vectorized 16-bit float LLVM atomic ops (triton-lang#4925)

In the case of 16-bit float operands for tt::AtomicRMWOp, construct
only one LLVM::AtomicRMWOp but use a vector of elements.
This approach allows generating packed intrinsics and processing 2
elements at once.
Added a lit test for the f16 vectorized case.

(cherry picked from commit 78c8054)

* [AMD] Restructure ReorderInstructions pass (triton-lang#4998)

(cherry picked from commit 86a2ac7)

* [AMD] Support warp-level reduction with DPP (triton-lang#5019)

This commit adds support for warp-level reduction
with DPP instructions, which can improve performance.

See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

(cherry picked from commit 21119e3)

* [AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

TritonAMDGPUTransforms now depends on it.

(cherry picked from commit 0b443ce)

* [AMD] Use DPP to accelerate 16-bit floats (triton-lang#5072)


* [AMD] Reland sinking the 2nd tt.load after local_load's (triton-lang#4935)

This PR adds more restrictions about when should we apply
the sched-load optimizations and un-revert
triton-lang#4823.

We will only apply the optimization when all of the following is
satisfied:
1. pureMatmulProblem, i.e. 1 `tt.dot` in the main loop
2. two `tt.load`s in the main loop
3. 2nd `tt.load` is ahead of the `tt.dot`
4. 1st user of 2nd `tt.load` is after the `tt.dot`
5. tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64

(cherry picked from commit 4f6f768)

---------

Co-authored-by: Ilya V <152324710+joviliast@users.noreply.github.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>
Co-authored-by: Kyle Wang <ec1wng@gmail.com>
Co-authored-by: Lixun Zhang <Lixun.Zhang@amd.com>
jataylo pushed a commit to ROCm/triton that referenced this pull request Jan 28, 2025

3 participants