
[Backend] Implement scaled_dot(mxfp4, fp8)#4904

Merged
lezcano merged 7 commits into triton-lang:main from lezcano:mxfp_snd
Oct 16, 2024

Conversation

Contributor

@lezcano commented Oct 14, 2024

This PR includes #4891 and #4895. I will rebase once those have landed.

It includes a number of hacks to work around bugs in DotOperandEncodingAttr. All these are marked as FIXME [Dot LL] to be easy to grep for. @Jokeren is working on a comprehensive revamp of DotOperandEncodingAttr which will get rid of all these. #4895 is the first step in this direction.

@lezcano changed the title from "mxfp snd" to "[Backend] Implement scaled_dot(mxfp4, fp8)" on Oct 14, 2024
@lezcano requested a review from ThomasRaoux on October 14, 2024 17:25
@lezcano marked this pull request as draft on October 14, 2024 17:51
@lezcano force-pushed the mxfp_snd branch 4 times, most recently from c15d411 to 104200d on October 15, 2024 14:23
@lezcano marked this pull request as ready for review on October 15, 2024 14:44
@lezcano force-pushed the mxfp_snd branch 2 times, most recently from 20a64b1 to 33fceb2 on October 15, 2024 16:44
%a = triton_gpu.convert_layout %a_ : tensor<128x32xf16, #AL> -> tensor<128x32xf16, #A_DOT>
%b_ = tt.load %b_ptr, %b_mask, %b_other : tensor<32x128x!tt.ptr<f16>, #BL>
// CHECK-NEXT: offset = 0, size = 4224
// CHECK-NEXT: offset = 0, size = 4352
Contributor Author

nb. These changes are coming from the change in lib/Analysis/Allocation.cpp

Contributor

It's OK; this path was never tested anyway. It will be tested in my next PR.

// This should be getElemOrder, but we don't have such a method
// TODO Implement getElemOrder and make sure it's consistent with
// getContigPerThread
auto inOrd = gpu::getThreadOrder(srcLayout);
Contributor

I think we assume getElemOrder == getOrder

Contributor

getThreadOrder is the same as getOrder except for AMD's AMDMfmaEncodingAttr. I haven't investigated deeply.
Pinging @zhanglx13 for expertise, maybe.

Contributor Author

See that I changed the definition of getThreadOrder in this PR.

Contributor

To be specific I was referring to:

SmallVector<unsigned> AMDMfmaEncodingAttr::getThreadOrder() const {
  auto order = ::getOrder(*this);
  if (getIsTransposed())
    std::swap(order[0], order[1]);
  return order;
}

I'm not sure if we should use getOrder or getThreadOrder for this encoding

Comment thread lib/Dialect/TritonGPU/IR/Dialect.cpp

typeConverter, loc, rewriter, loadedA, repBatch, repM, repK, aTensorTy);

// FIXME [Dot LL]
// max(repN / 2, 1) is wrong for repN = 1!
Contributor

Can you elaborate on // max(repN / 2, 1) is wrong for repN = 1!?
Why is repN = 1 wrong?

Contributor Author

We are taking this max(repN / 2, 1) here, and then in the loop inside getValuesFromDotOperandLayoutStruct we are packing 4 elements at a time. Rather than that, the correct implementation packs 2 elements inside the function for opIdx=1 and iterates repN times.

Contributor

Got it

@ThomasRaoux (Collaborator) left a comment

Looks good overall, although I didn't look in detail at the LL TODOs.
Just added a few minor comments.

Comment thread lib/Dialect/TritonGPU/IR/Dialect.cpp Outdated
Comment on lines +260 to +261
// FIXME: mma should just return getOrderForDotOperand(0, order.size(),
// kMajor=false)
Collaborator

I'm also confused by this comment.

@lezcano (Contributor Author) Oct 16, 2024

Here I just meant that the logic in mma is probably wrong, and we just want this function to return what I wrote there. The point is that, in terms of order, the mma layout is the same as DotOperandEncoding(opIdx=0).

Contributor Author

I had another go at the comment. Third's a charm

Comment on lines +271 to +272
order = getOrderForDotOperand(dotOpLayout.getOpIdx(), order.size(),
/*kMajor*/ false);
Collaborator

why is kMajor always false here?

Contributor

This is getting the warp order, not the element order, so m is the fastest-changing dimension for opIdx=0. I think the confusion may arise from the variable name kMajor.

Contributor

I don't have a suggestion for improvement though. Maybe just add some additional comments.

Contributor Author

Yep, similarly to wgmma, we want the warps to have the exterior dimension (i.e. not K) as their fastest-running dimension.

vType.getShape(), vType.getElementType(), newVEncoding);
return rewriter.create<ConvertLayoutOp>(v.getLoc(), newVType, v);
} else {
auto newVEncoding = DotOperandEncodingAttr::get(
Collaborator

nit: assert that this is a fp8 type?

Contributor Author

Done, although it's a bit redundant, as we are already asserting this at the beginning of the function and in semantics.py.

@ThomasRaoux (Collaborator) left a comment

LGTM

@lezcano lezcano merged commit 9e90089 into triton-lang:main Oct 16, 2024
@lezcano lezcano deleted the mxfp_snd branch October 16, 2024 15:21
alexsamardzic pushed a commit to alexsamardzic/triton that referenced this pull request Oct 16, 2024
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024

3 participants