-
Notifications
You must be signed in to change notification settings - Fork 15.5k
[VPlan] Don't use the legacy cost model for loop conditions #156864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
861d883
c3f2e8f
77a2769
19b984c
2b83698
ec22b08
1bf3611
2a6c490
6e23400
24fa802
515264b
50b801f
2d3999b
33b76ce
8c21ea3
759741c
60b76a2
8a4c2ad
ae498b8
a6bac7a
f41782c
7d21018
b175ee3
bfe6fc0
7054e6b
7ec4f3d
40a1450
a4a2f8c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1109,6 +1109,30 @@ InstructionCost VPInstruction::computeCost(ElementCount VF, | |
| return Ctx.TTI.getIndexedVectorInstrCostFromEnd(Instruction::ExtractElement, | ||
| VecTy, Ctx.CostKind, 0); | ||
| } | ||
| case VPInstruction::BranchOnCount: { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think BranchOnCond needs adding too.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. BranchOnCond doesn't cause a compare instruction to be generated, it uses the condition generated by another instruction. |
||
| // If TC <= VF then this is just a branch. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure what this means. Are you saying that we create a vplan for a given VF despite knowing that we will never enter the vector loop? I guess this can happen if TC is exactly equal to VF or we're using tail-folding, but not using the mask for control flow.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As mentioned in the comment below this, this transformation is happening in simplifyBranchConditionForVFAndUF and means the vector loop is executed exactly once. TC < VF can happen with tail folding, e.g. low_trip_count_fold_tail_scalarized_store in llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll. |
||
| // FIXME: Removing the branch happens in simplifyBranchConditionForVFAndUF | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So you're saying the cost of this branch is based on a prediction about what For example, if you pulled this code out of You'd also now be able to remove the
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll look into doing that.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think that's the way to go, we shouldn't duplicate that kind of reasoning here. The TODO should say that the branch should be simplified before we compute the costs. As a workaround for now catching some cases here should be fine, as this should only mean we may miss some new optimizations, but not make things worse
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Well, I do think we should be dealing with trip counts that are multiples of vscale here too since we now support them and we know that
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. After looking into this, the problem here is that we don't have access to the ScalarEvolution object here in VPInstruction, so putting code from simplifyBranchConditionForVFAndUF into a function and calling it won't work as that makes use of ScalarEvolution. Simplifying the branch before we compute the cost seems like a good solution to this.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK no problem. I wasn't sure how easy it would be, but thanks for looking into it! I can follow up with a later PR to get access to the SCEV here. |
||
| // where it checks TC <= VF * UF, but we don't know UF yet. This means in | ||
| // some cases we get a cost that's too high due to counting a cmp that | ||
| // later gets removed. | ||
| // FIXME: The compare could also be removed if TC = M * vscale, | ||
| // VF = N * vscale, and M <= N. Detecting that would require having the | ||
| // trip count as a SCEV though. | ||
| Value *TC = getParent()->getPlan()->getTripCount()->getUnderlyingValue(); | ||
| ConstantInt *TCConst = dyn_cast_if_present<ConstantInt>(TC); | ||
| if (TCConst && TCConst->getValue().ule(VF.getKnownMinValue())) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add a TODO for the case where TC=vscale x M and VF=vscale * N as well? In such cases we should also be able to prove that TC <= VF because it just requires asking if M <= N.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will do. |
||
| return 0; | ||
| // Otherwise BranchOnCount generates ICmpEQ followed by a branch. | ||
| Type *ValTy = Ctx.Types.inferScalarType(getOperand(0)); | ||
| return Ctx.TTI.getCmpSelInstrCost(Instruction::ICmp, ValTy, | ||
| CmpInst::makeCmpResultType(ValTy), | ||
| CmpInst::ICMP_EQ, Ctx.CostKind); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think you need to add the cost of the branch as well.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've deliberately avoided touching branch costs to keep the scope of this work as small as possible. |
||
| } | ||
| case Instruction::FCmp: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The change below is about more than just about the loop conditions. See I think you either need to:
Sorry about this!
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. getCostForRecipeWithOpcode already correctly handles vector compares, so I've changed this to use that function. |
||
| case Instruction::ICmp: | ||
| return getCostForRecipeWithOpcode( | ||
| getOpcode(), | ||
| vputils::onlyFirstLaneUsed(this) ? ElementCount::getFixed(1) : VF, Ctx); | ||
| case VPInstruction::ExtractPenultimateElement: | ||
| if (VF == ElementCount::getScalable(1)) | ||
| return InstructionCost::getInvalid(); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vp_depth_first_deepwill also leave the region and visit its successors, so we will also count the compare in the middle block, and almost always overcount the compares in VPLan. probably needs to check if we left the region.I think it will also always disable the check if we have loops controlled by active-lane-mask?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this to ignore blocks outside of the vector loop region.
I'm not sure what you're asking here. When we have a loop that's using an active-lane-mask, the vplan will have something like (example here taken from llvm/test/Transforms/LoopVectorize/AArch64/sve-wide-lane-mask.ll)
the legacy cost model will have
planContainsDifferentCompares would count 1 compare in the legacy cost model, no compares in the vplan, and return true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, what I was wondering was if we could exclude plans with ActiveLaneMask terminated exiting blocks from the carve-out, to preserve the original check for those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it's still not clear what you're asking here. Do you mean: planContainsDifferentCompares should return false for plans that contain ActiveLaneMask terminated exiting blocks, so that the assert in computeBestVF that calls planContainsDifferentCompares will do the check BestFactor.Width == LegacyVF.Width? If so then possibly we could, though I don't think it would make a difference as I haven't found an example where planContainsAdditionalSimplifications doesn't also return true (which it will do because the cmp in the legacy cost model doesn't correspond to anything in the vplan).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking!
It looks like there's still something related to the changes that effectively disables the assertion altogether. I tried the patch with something like the diff below which should cause lots of crashes, but it seems there aren't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like if
planContainsAdditionalSimplificationsis returning true for most loops then there's probably not much point in having the legacy/vplan cost model assert anymore? cc @fhahnThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with bailing out on some wide accesses should be fixed as of #169249.
One example that should crash with the PR updated the current
mainand the change below isllvm/test/Transforms/LoopVectorize/X86/float-induction-x86.ll, but it does not, presumably because we count the compare in the middle.block?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In every function in
llvm/test/Transforms/LoopVectorize/X86/float-induction-x86.llplanContainsAdditionalSimplifications is returning true: In the first function%lftr.wideiv = trunc i64 %iv.next to i32doesn't appear in the list of seen instructions, in the second%cond = icmp slt i64 %i.next, %n, in the third%cmp.not = icmp eq i64 %iv.next, %0. It looks like all of these are due to the loop termination condition being converted to branch-on-count.Out of the tests in llvm/tests/Transforms/LoopVectorize, only in the following does planContainsAdditionalSimplifications return false for at least one of the test functions:
Strangely, with the above change in these files planContainsAdditionalSimplifications changes to returning true. I think what's going on is that because of the increase in cost BestPlan is now the scalar plan, so planContainsAdditionalSimplifications is checking the scalar plan whereas before it was checking the vector plan, and there's something in the scalar plan that causes planContainsAdditionalSimplifications to return true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm so I think what is going on is that the original compare will almost always be removed (except if there is another user).
Previously the code would mark all exit instructions to be skipped for cost-computtion (and in turn skipped in planContainsAdditionalSimplifications). But now we don't, so the difference between IR compare and BranchOnCount will cause planContainsAdditionalSimplifications to return true.
Could we still add the IR instructions to the set to skip?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made this change. Doing this we also need to detect when VPInstruction::computeCost will return zero for BranchOnCount, as otherwise the assertion fails in llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll.