[AMD] Add basics to allow bypass LDS for dot RHS #4856
oplavsic wants to merge 6 commits into triton-lang:main
Conversation
// Limit shared memory sharing to width >= 32 elements.
LDBG("Load " << *loadOp << " has width " << width);
if (width < 32) {
StreamPipelineV2.cpp change in this PR enables pipelining in registers. This change was suggested to me by Simon.
@sjw36 I think this change is impactful enough. Should we extract it out as a separate pull request and consider the implications over the broader cases, instead of coupling it with this pull request?
Yes, it could be posted separately. @oplavsic there should be other cases that will exercise your pass right?
Force-pushed a6839de to 66512b5
int getNVIDIAComputeCapability(Operation *module);

// Convert \param op operands and results to layout \param encoding.
void convertOpEncoding(Attribute encoding, Operation *op);
I moved (and renamed) this function from Coalesce.cpp so I could use it in the BypassLDS pass, since I needed exactly this functionality.
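For readers outside the MLIR codebase, here is a toy, self-contained sketch of what `convertOpEncoding` does conceptually: it rebuilds an op's operand and result tensor types so they carry a new layout encoding, while shape and element type stay unchanged. The `TensorType`/`Op` structs below are illustrative stand-ins, not the real MLIR classes.

```cpp
#include <string>
#include <vector>

// Toy stand-ins for RankedTensorType / Operation (not the MLIR classes).
struct TensorType {
  std::vector<long> shape;
  std::string elemTy;
  std::string encoding; // layout attribute, e.g. "blocked" or "mfma"
};
struct Op {
  std::vector<TensorType> operands;
  std::vector<TensorType> results;
};

// Retarget every operand and result type of `op` to `encoding`, keeping
// shape and element type unchanged.
void convertOpEncoding(const std::string &encoding, Op &op) {
  for (TensorType &t : op.operands)
    t.encoding = encoding;
  for (TensorType &t : op.results)
    t.encoding = encoding;
}
```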
Force-pushed 9f85a9c to a30240e
antiagainst left a comment:
Thanks! A couple of comments.
Force-pushed 0f2b989 to 6240fb3
antiagainst left a comment:
Thanks! Could you merge in original/main to trigger CI?
Force-pushed 6240fb3 to b27350e
sjw36 left a comment:
Please update the algorithm for efficiency.
ModuleOp module = getOperation();
auto convertOps = collectConvertOps(module);

module.dump();
ModuleOp &mod) {
  SmallVector<triton::LoadOp> loadOpsVec;

  mod.walk([&](triton::LoadOp loadOp) {
This is expensive to do for every convert_layout. Just walk the use-def chain instead.
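The suggestion above trades a whole-module walk per convert_layout for a linear walk up the convert's operand chain. A minimal, self-contained sketch (hypothetical `Node` type standing in for MLIR ops, not the actual pass code):

```cpp
#include <string>

// Toy stand-in for an op in SSA form: each op records the defining op of
// its (single) operand, forming a use-def chain back to the producer.
struct Node {
  std::string name;    // op name, e.g. "tt.load" or "ttg.convert_layout"
  Node *def = nullptr; // defining op of this op's operand
};

// Follow the use-def chain upward from `op` until a load is found;
// returns nullptr if the chain ends without one.
Node *findDefiningLoad(Node *op) {
  for (Node *cur = op->def; cur; cur = cur->def)
    if (cur->name == "tt.load")
      return cur;
  return nullptr;
}
```

This is O(chain length) per convert_layout, instead of O(ops in module) for a `mod.walk` that scans every load each time.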
auto srcType = dyn_cast<RankedTensorType>(convertOp.getOperand().getType());
auto dstType = dyn_cast<RankedTensorType>(convertOp.getType());

if (!srcType || !dstType || srcType.getShape().size() != 2)
Why is the shape restricted to 2 dimensions?
@oplavsic can you merge in changes from
Closing this one given #5350 supersedes it.
The AMDBypassLDSForDotOperandPass implements a strategy to bypass the
Local Data Share (LDS) for one of the operands of an MFMA dot operation.
Under certain conditions, the dot layout of one of the operands allows loading
directly from HBM into VGPRs in the MFMA dot layout, without losing
vectorization of global loads and without increasing the number of global
loads due to data shared between threads.

The required conditions are:

1. The operand we want to bypass LDS for must be K-major (i.e., row-major for
   operand 0 or column-major for operand 1). This supports vectorized global
   load instructions, as MFMA instructions require each thread to hold B
   operand elements along the K dimension.
2. The operand must use the maximum kWidth for its data type; this ensures
   optimal global load vectorization (e.g., using global_load_dwordx4
   instructions).
3. Either warpsPerCTA[nDim] == 1 for an operand A bypass, or
   warpsPerCTA[mDim] == 1 for an operand B bypass. This guarantees that each
   tensor element is handled by exactly one thread, maintaining the same
   number of global loads as in the blocked layout (i.e., each element is
   loaded only once).
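The three conditions above can be sketched as a single predicate. This is an illustrative, self-contained sketch, not the pass's actual API: parameter names (`opIdx`, `kMajor`, `kWidth`, `maxKWidth`, `warpsM`, `warpsN`) are hypothetical stand-ins for the layout attributes the pass inspects.

```cpp
// opIdx: 0 = operand A, 1 = operand B (the dot RHS this PR targets).
// Returns true when, per the conditions above, the operand's global loads
// can feed VGPRs directly in the MFMA dot layout without going through LDS.
bool canBypassLDS(int opIdx, bool kMajor, int kWidth, int maxKWidth,
                  int warpsM, int warpsN) {
  // Condition 1: K-major operand, so global loads stay K-contiguous
  // and vectorizable.
  if (!kMajor)
    return false;
  // Condition 2: maximal kWidth for the element type, e.g. the width
  // served by a global_load_dwordx4.
  if (kWidth != maxKWidth)
    return false;
  // Condition 3: no warp-level replication of the operand. A is shared
  // across warps along N, B across warps along M, so that dimension of
  // the warp grid must be 1 to keep each element loaded exactly once.
  return opIdx == 0 ? warpsN == 1 : warpsM == 1;
}
```

For example, a K-major B operand with maximal kWidth qualifies only when warpsPerCTA along M is 1; adding a second warp along M would make both warps load the same B elements.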