[AMDGPU][PromoteAlloca] Support memsets to ptr allocas #80678

Merged: 3 commits, Feb 5, 2024
16 changes: 12 additions & 4 deletions llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -521,10 +521,18 @@ static Value *promoteAllocaUserToVector(
     // For memset, we don't need to know the previous value because we
     // currently only allow memsets that cover the whole alloca.
     Value *Elt = MSI->getOperand(1);
-    if (DL.getTypeStoreSize(VecEltTy) > 1) {
-      Value *EltBytes =
-          Builder.CreateVectorSplat(DL.getTypeStoreSize(VecEltTy), Elt);
-      Elt = Builder.CreateBitCast(EltBytes, VecEltTy);
+    const unsigned BytesPerElt = DL.getTypeStoreSize(VecEltTy);
+    if (BytesPerElt > 1) {
+      Value *EltBytes = Builder.CreateVectorSplat(BytesPerElt, Elt);
+
+      // If the element type of the vector is a pointer, we need to first cast
+      // to an integer, then use a PtrCast.
+      if (VecEltTy->isPointerTy()) {
Contributor: Will fail the same way for a vector of pointers.

Contributor Author: What do you mean by "vector of pointers"?

Contributor Author: This only looks at the element type of the vector type we're promoting to, which is always a primitive.

Contributor: I mean [6 x <2 x ptr>].

Contributor Author: Added the test as well; that isn't promoted either.

On a side note, PromoteAlloca still needs some love. I'm wondering if it's worth the effort to support things like nested arrays. I'd assume we'd hit the upper limit very fast with X * Y elements, but maybe not?

Contributor: It's still not trying to intelligently choose which allocas are the most profitable to promote.

Contributor Author: @mariusz-sikora-at-amd I'm not sure; no strong opinion. I was thinking of doing it by flattening arrays (e.g. [2 x [3 x float]] becomes [6 x float]). I think the tricky part is resolving the GEPs correctly; it might be a bigger refactoring than it looks at first glance.

One alternative may be to have some kind of "alloca canonicalization" pass run earlier that does the flattening for us, to better enable PromoteAlloca.

@arsenm I haven't lost track of that, but I also haven't found the time for it yet. Last time I thought about it, I considered changing the pass so it collects allocas, sorts them by profitability (number of users, plus whether there are uses in loops), then greedily promotes them one by one until it runs out of budget. Would that be good?

Contributor: Yes, that's what I was thinking for a first step.

Contributor: Also, I thought SROA already tried to flatten nested arrays.

Contributor: I was thinking about splitting a two-dimensional array into multiple one-dimensional arrays because:

  • (This may be graphics-specific.) Often only a single component / column of the array is used, and by splitting we can remove the unused columns / components.
  • Some shaders have very low VGPR usage but contain "big" (more than 32 elements) arrays like [10 x [4 x float]]. If we split it into four one-dimensional arrays, then we will be able to put everything into VGPRs.
  • The movrel instruction (which I'm aiming to generate) supports up to 32 elements.

+        Type *PtrInt = Builder.getIntNTy(BytesPerElt * 8);
+        Elt = Builder.CreateBitCast(EltBytes, PtrInt);
+        Elt = Builder.CreateIntToPtr(Elt, VecEltTy);
+      } else
+        Elt = Builder.CreateBitCast(EltBytes, VecEltTy);
     }

     return Builder.CreateVectorSplat(VectorTy->getElementCount(), Elt);
54 changes: 54 additions & 0 deletions llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll
@@ -84,4 +84,58 @@ entry:
ret void
}

define amdgpu_kernel void @memset_array_ptr_alloca(ptr %out) {
; CHECK-LABEL: @memset_array_ptr_alloca(
; CHECK-NEXT: store i64 0, ptr [[OUT:%.*]], align 8
; CHECK-NEXT: ret void
;
%alloca = alloca [6 x ptr], align 16, addrspace(5)
call void @llvm.memset.p5.i64(ptr addrspace(5) %alloca, i8 0, i64 48, i1 false)
%load = load i64, ptr addrspace(5) %alloca
store i64 %load, ptr %out
ret void
}

define amdgpu_kernel void @memset_vector_ptr_alloca(ptr %out) {
; CHECK-LABEL: @memset_vector_ptr_alloca(
; CHECK-NEXT: store i64 0, ptr [[OUT:%.*]], align 8
; CHECK-NEXT: ret void
;
%alloca = alloca <6 x ptr>, align 16, addrspace(5)
call void @llvm.memset.p5.i64(ptr addrspace(5) %alloca, i8 0, i64 48, i1 false)
%load = load i64, ptr addrspace(5) %alloca
store i64 %load, ptr %out
ret void
}

define amdgpu_kernel void @memset_array_of_array_ptr_alloca(ptr %out) {
; CHECK-LABEL: @memset_array_of_array_ptr_alloca(
; CHECK-NEXT: [[ALLOCA:%.*]] = alloca [2 x [3 x ptr]], align 16, addrspace(5)
; CHECK-NEXT: call void @llvm.memset.p5.i64(ptr addrspace(5) [[ALLOCA]], i8 0, i64 48, i1 false)
; CHECK-NEXT: [[LOAD:%.*]] = load i64, ptr addrspace(5) [[ALLOCA]], align 8
; CHECK-NEXT: store i64 [[LOAD]], ptr [[OUT:%.*]], align 8
; CHECK-NEXT: ret void
;
%alloca = alloca [2 x [3 x ptr]], align 16, addrspace(5)
call void @llvm.memset.p5.i64(ptr addrspace(5) %alloca, i8 0, i64 48, i1 false)
%load = load i64, ptr addrspace(5) %alloca
store i64 %load, ptr %out
ret void
}

define amdgpu_kernel void @memset_array_of_vec_ptr_alloca(ptr %out) {
; CHECK-LABEL: @memset_array_of_vec_ptr_alloca(
; CHECK-NEXT: [[ALLOCA:%.*]] = alloca [2 x <3 x ptr>], align 16, addrspace(5)
; CHECK-NEXT: call void @llvm.memset.p5.i64(ptr addrspace(5) [[ALLOCA]], i8 0, i64 48, i1 false)
; CHECK-NEXT: [[LOAD:%.*]] = load i64, ptr addrspace(5) [[ALLOCA]], align 8
; CHECK-NEXT: store i64 [[LOAD]], ptr [[OUT:%.*]], align 8
; CHECK-NEXT: ret void
;
%alloca = alloca [2 x <3 x ptr>], align 16, addrspace(5)
call void @llvm.memset.p5.i64(ptr addrspace(5) %alloca, i8 0, i64 48, i1 false)
%load = load i64, ptr addrspace(5) %alloca
store i64 %load, ptr %out
ret void
}

declare void @llvm.memset.p5.i64(ptr addrspace(5) nocapture writeonly, i8, i64, i1 immarg)