[AMDGPU] Add intrinsic-based optimization for rotate and funnel shift patterns #153406
```diff
@@ -253,6 +253,7 @@ class AMDGPUCodeGenPrepareImpl
   bool visitIntrinsicInst(IntrinsicInst &I);
   bool visitFMinLike(IntrinsicInst &I);
   bool visitSqrt(IntrinsicInst &I);
+  bool visitFunnelShift(IntrinsicInst &I);
   bool run();
 };
```
```diff
@@ -1913,6 +1914,9 @@ bool AMDGPUCodeGenPrepareImpl::visitIntrinsicInst(IntrinsicInst &I) {
     return visitFMinLike(I);
   case Intrinsic::sqrt:
     return visitSqrt(I);
+  case Intrinsic::fshr:
+  case Intrinsic::fshl:
+    return visitFunnelShift(I);
   default:
     return false;
   }
```
```diff
@@ -2103,6 +2107,37 @@ PreservedAnalyses AMDGPUCodeGenPreparePass::run(Function &F,
   return PA;
 }
 
+bool AMDGPUCodeGenPrepareImpl::visitFunnelShift(IntrinsicInst &I) {
+  if (!I.getType()->isIntegerTy(32))
+    return false;
+
+  // Only convert divergent operations to v_alignbit
+  if (UA.isUniform(&I))
+    return false;
+
+  Intrinsic::ID IID = I.getIntrinsicID();
+  Value *Src0 = I.getOperand(0);
+  Value *Src1 = I.getOperand(1);
+  Value *Amt = I.getOperand(2);
+
+  IRBuilder<> Builder(&I);
+  Function *AlignBitFn = Intrinsic::getOrInsertDeclaration(
+      I.getModule(), Intrinsic::amdgcn_v_alignbit);
+
+  Value *AlignBitCall = nullptr;
+  if (IID == Intrinsic::fshr)
+    AlignBitCall = Builder.CreateCall(AlignBitFn, {Src0, Src1, Amt});
+  else if (IID == Intrinsic::fshl) {
+    Value *InvAmt = Builder.CreateSub(Builder.getInt32(32), Amt);
+    AlignBitCall = Builder.CreateCall(AlignBitFn, {Src1, Src0, InvAmt});
```
Comment on lines +2130 to +2132 (Collaborator):

This is incorrect. See also the comment about a mis-compilation in the tests. A key property of funnel shifts is that bits of src0 are always more significant than bits of src1 in the output. Therefore, swapping the sources like that cannot be correct.

Looking at the tests, here's what we used to generate for a fully generic fshl: … This is basically …

The problem is that all shifts only use the LSBs to determine the shift amount, and so this lowering will give incorrect results when amt is 0 (or, to be more precise, when …).

I'm not actually sure what the best lowering here is. There are two kinds of contenders: … The second sequence should become something like: … This is worse than the sequence above (though there are probably many cases where the v_and_b32 can be optimized away thanks to known-bits analysis). But! It is somewhat natural to have a uniform or even constant shift amount, and I don't know how well isel would be able to optimize those cases if you used the first option for the IR here. (I suspect it might be able to optimize the constant case, but not the uniform case.)

With the version of LLVM IR that corresponds to this alternative, a uniform shift amount should lead to only 2x VALU (v_alignbit_b32 + v_cndmask_b32) and a bunch of SALU, which is certainly better than 4x VALU. So you may want to do a case distinction here based on whether the shift amount is uniform. However, the more important thing is that you try a couple of variants of this to find out which is really best; there may be a variation I haven't thought of. (I strongly recommend not doing an explicit case distinction based on whether amt is constant, since we have uniformity analysis here: UA catches the constant case as well, and automatic constant folding in the builder should do the rest.)
```diff
+  } else
+    return false;
```
Comment (Collaborator):

This should never happen, and our policy is to fail fast and loud rather than to try to continue somehow when something that shouldn't happen happens. (To go along with this, our policy is to use release builds with assertions enabled for a lot of testing; you should definitely make sure you do the same in case you're running ….)

In this particular case, there are two idiomatic patterns you can use: … or … In this case, I'd prefer the former due to the obvious left/right symmetry.
```diff
+
+  I.replaceAllUsesWith(AlignBitCall);
+  I.eraseFromParent();
+  return true;
+}
+
 INITIALIZE_PASS_BEGIN(AMDGPUCodeGenPrepare, DEBUG_TYPE,
                       "AMDGPU IR optimizations", false, false)
 INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
```
Comment:

The naming convention omits the v prefix in intrinsic names.

Reply:

I don't know if we've talked about this explicitly, but I think the v prefix makes sense, because the entire purpose of the intrinsic is to serve as a pre-selection of sorts of v_alignbit in certain cases.

Reply:

That makes it even worse?