@@ -87,8 +87,8 @@ void decomposeBlockedToDotLayoutConversion(ModuleOp module) {
return;
auto srcBlocked =
dyn_cast<triton::gpu::BlockedEncodingAttr>(srcType.getEncoding());
- auto dstDotOp =
- dyn_cast<triton::gpu::DotOperandEncodingAttr>(dstType.getEncoding());
+ auto dstEncoding = dstType.getEncoding();
+ auto dstDotOp = dyn_cast<triton::gpu::DotOperandEncodingAttr>(dstEncoding);
if (srcBlocked && dstDotOp) {
// FIXME [Dot LL]
// We support this one via LLs, as the LocalLoad path is buggy
@@ -99,15 +99,21 @@ void decomposeBlockedToDotLayoutConversion(ModuleOp module) {
return;
}
}

+ auto srcOrder = triton::gpu::getOrder(srcBlocked);
+ auto rank = srcOrder.size();
+ SmallVector<unsigned> sharedOrder;
+ if (rank == 3) {
+ sharedOrder = gpu::getThreadOrder(dstEncoding);
+ } else {
+ sharedOrder = srcOrder;
+ }
Comment on lines +105 to +109
Contributor

If you are going this route, you probably want to do it for all ranks; otherwise this heuristic would be incredibly counterintuitive.

Contributor Author

But wouldn't this change then alter the shared layout order for rank != 3? Alternatively, I can revert all changes except the condition in ReduceDataDuplication, making it Ampere-specific. Then we have the same behavior as before #4904.

Contributor

I just find the current behaviour for rank == 3 very weird. @Jokeren, thoughts?

Contributor

I agree the special condition is weird. Can you attach a test case for us to take a look at?

Contributor Author

@AlexAUT AlexAUT Oct 21, 2024

Running python/test/unit/language/test_core.py::test_dot3d[1-1-64-64-64-32-32-float16-float32] triggers this assert. It happens when WMMA is used for the dot: the shared -> dot layout conversion for WMMA also expects the batch dim to be the slowest dimension.
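
For reference, the "batch dim is the slowest dimension" requirement can be sketched in isolation. The block below is a standalone illustration, not Triton code: `batchSlowestOrder` is a hypothetical helper that uses `std::vector` instead of `SmallVector`, and it assumes the order vector lists dimensions fastest-varying first with dimension 0 as the batch dimension. It mirrors the rank-3 loop that this PR removes from ReduceDataDuplication.cpp.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Triton). Assumes `srcOrder` lists the
// dimensions fastest-varying first and that dimension 0 is the batch dim.
// Keeps the relative order of the non-batch dimensions and appends
// dimension 0 last, which makes the batch dimension the slowest one.
std::vector<unsigned> batchSlowestOrder(const std::vector<unsigned> &srcOrder) {
  assert(srcOrder.size() == 3 && "sketch only covers the rank-3 case");
  std::vector<unsigned> sharedOrder;
  for (unsigned dim : srcOrder)
    if (dim != 0)
      sharedOrder.push_back(dim);
  sharedOrder.push_back(0);
  return sharedOrder;
}
```

Under these assumptions an input order of {0, 2, 1} becomes {2, 1, 0}, while an order that already ends in 0 is returned unchanged.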

Attribute sharedMemorySpace =
triton::gpu::SharedMemorySpaceAttr::get(srcType.getContext());
auto tmpType = MemDescType::get(
dstType.getShape(), dstType.getElementType(),
triton::gpu::SharedEncodingAttr::get(
- module.getContext(), dstDotOp, srcType.getShape(),
- srcBlocked.getOrder(), srcBlocked.getCTALayout(),
- srcType.getElementType()),
+ module.getContext(), dstDotOp, srcType.getShape(), sharedOrder,
+ srcBlocked.getCTALayout(), srcType.getElementType()),
sharedMemorySpace);
auto tmp = builder.create<triton::gpu::LocalAllocOp>(
cvtOp.getLoc(), tmpType, cvtOp.getSrc());
19 changes: 9 additions & 10 deletions lib/Dialect/TritonGPU/Transforms/ReduceDataDuplication.cpp
@@ -36,30 +36,29 @@ class TritonGPUReduceDataDuplicationPass
auto srcType = cast<RankedTensorType>(cvtOp.getSrc().getType());
auto dstType = cast<RankedTensorType>(cvtOp.getType());
auto srcEncoding = srcType.getEncoding();
+ auto dstEncoding = dstType.getEncoding();
if (isa<triton::gpu::SharedEncodingAttr>(srcEncoding))
return;
auto dstDotOp =
- dyn_cast<triton::gpu::DotOperandEncodingAttr>(dstType.getEncoding());
+ dyn_cast<triton::gpu::DotOperandEncodingAttr>(dstEncoding);
if (!dstDotOp)
return;
if (!cvtNeedsSharedMemory(srcType, dstType))
return;
// FIXME [Dot LL]
// We support this one via LLs, as the LocalLoad path is buggy
- bool largeKWidth =
- dstDotOp.getKWidth() * dstType.getElementTypeBitWidth() > 64;
- if (largeKWidth) {
- return;
+ if (auto mma = dyn_cast<NvidiaMmaEncodingAttr>(dstDotOp.getParent())) {
+ bool largeKWidth =
+ dstDotOp.getKWidth() * dstType.getElementTypeBitWidth() > 64;
+ if (mma.isAmpere() && largeKWidth) {
+ return;
+ }
}
auto srcOrder = triton::gpu::getOrder(srcEncoding);
auto rank = srcOrder.size();
SmallVector<unsigned> sharedOrder;
if (rank == 3) {
- // add all elements except the element that is zero
- for (unsigned i = 0; i < rank; ++i)
- if (srcOrder[i] != 0)
- sharedOrder.emplace_back(srcOrder[i]);
- sharedOrder.emplace_back(0);
+ sharedOrder = gpu::getThreadOrder(dstEncoding);
} else {
sharedOrder = srcOrder;
}
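
The narrowed early-return condition in this hunk can be read as a plain predicate. The sketch below is an illustration only: `skipsDecomposition` and its parameters are hypothetical names, and the boolean `parentIsAmpereMma` stands in for the combination of the `dyn_cast<NvidiaMmaEncodingAttr>` on the parent and `mma.isAmpere()`.

```cpp
#include <cassert>

// Hypothetical stand-in for the guard in the diff; the function and
// parameter names are illustrative and not part of Triton's API.
// Returns true when the pass would return early for this conversion.
bool skipsDecomposition(bool parentIsAmpereMma, unsigned kWidth,
                        unsigned elemBitWidth) {
  assert(elemBitWidth > 0 && "element types have a nonzero bit width");
  // Same threshold as in the diff: kWidth * bit width strictly above 64.
  bool largeKWidth = kWidth * elemBitWidth > 64;
  // Before the change the early return depended on largeKWidth alone;
  // after it, the parent must also be an Ampere NVIDIA MMA encoding.
  return parentIsAmpereMma && largeKWidth;
}
```

With a 16-bit element type, a kWidth of 8 gives 128 bits and counts as large, while a kWidth of 4 gives exactly 64 bits and does not. For any parent that is not an Ampere MMA encoding, the predicate is false regardless of kWidth.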