[LV] Simplify and unify resume value handling for epilogue vec.#185969
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

This patch tries to drastically simplify resume value handling for the scalar loop when vectorizing the epilogue. It uses a simpler, uniform approach for updating all resume values in the scalar loop:
1. Create ResumeForEpilogue recipes for all scalar resume phis in the main loop (the epilogue plan will have exactly the same scalar resume phis, in exactly the same order)
2. Update ::execute for ResumeForEpilogue to set the underlying value when executing. This is not super clean, but allows easy lookup of the generated IR value when we update the resume phis in the epilogue. Once we connect the 2 plans together explicitly, this can be removed.
3. Use the list of ResumeForEpilogue VPInstructions from the main loop to update the resume/bypass values from the epilogue.
This simplifies the code quite a bit, makes it more robust (should fix #179407) and also fixes a mis-compile in the existing tests (see change in llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll, where previously we would incorrectly resume using the start value when the epilogue iteration check failed).

In some cases, we get simpler code due to additional CSE; in some cases, the induction end value computations get moved from the epilogue iteration check to the vector preheader. We could try to sink the instructions as cleanup, but it is probably not worth the trouble.

Patch is 98.46 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/185969.diff

33 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 53a9bdb113b65..5299ba4f88b4a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -2423,20 +2423,6 @@ BasicBlock *InnerLoopVectorizer::createScalarPreheader(StringRef Prefix) {
Twine(Prefix) + "scalar.ph");
}
-/// Return the expanded step for \p ID using \p ExpandedSCEVs to look up SCEV
-/// expansion results.
-static Value *getExpandedStep(const InductionDescriptor &ID,
- const SCEV2ValueTy &ExpandedSCEVs) {
- const SCEV *Step = ID.getStep();
- if (auto *C = dyn_cast<SCEVConstant>(Step))
- return C->getValue();
- if (auto *U = dyn_cast<SCEVUnknown>(Step))
- return U->getValue();
- Value *V = ExpandedSCEVs.lookup(Step);
- assert(V && "SCEV must be expanded at this point");
- return V;
-}
-
/// Knowing that loop \p L executes a single vector iteration, add instructions
/// that will get simplified and thus should not have any cost to \p
/// InstsToIgnore.
@@ -7341,72 +7327,6 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
return BestFactor;
}
-// If \p EpiResumePhiR is resume VPPhi for a reduction when vectorizing the
-// epilog loop, fix the reduction's scalar PHI node by adding the incoming value
-// from the main vector loop.
-static void fixReductionScalarResumeWhenVectorizingEpilog(
- VPPhi *EpiResumePhiR, PHINode &EpiResumePhi, BasicBlock *BypassBlock) {
- using namespace VPlanPatternMatch;
- // Get the VPInstruction computing the reduction result in the middle block.
- // The first operand may not be from the middle block if it is not connected
- // to the scalar preheader. In that case, there's nothing to fix.
- VPValue *Incoming = EpiResumePhiR->getOperand(0);
- match(Incoming, VPlanPatternMatch::m_ZExtOrSExt(
- VPlanPatternMatch::m_VPValue(Incoming)));
- auto *EpiRedResult = dyn_cast<VPInstruction>(Incoming);
- if (!EpiRedResult)
- return;
-
- VPValue *BackedgeVal;
- bool IsFindIV = false;
- if (EpiRedResult->getOpcode() == VPInstruction::ComputeAnyOfResult ||
- EpiRedResult->getOpcode() == VPInstruction::ComputeReductionResult)
- BackedgeVal = EpiRedResult->getOperand(EpiRedResult->getNumOperands() - 1);
- else if (matchFindIVResult(EpiRedResult, m_VPValue(BackedgeVal), m_VPValue()))
- IsFindIV = true;
- else
- return;
-
- auto *EpiRedHeaderPhi = cast_if_present<VPReductionPHIRecipe>(
- vputils::findRecipe(BackedgeVal, IsaPred<VPReductionPHIRecipe>));
- if (!EpiRedHeaderPhi) {
- match(BackedgeVal,
- VPlanPatternMatch::m_Select(VPlanPatternMatch::m_VPValue(),
- VPlanPatternMatch::m_VPValue(BackedgeVal),
- VPlanPatternMatch::m_VPValue()));
- EpiRedHeaderPhi = cast<VPReductionPHIRecipe>(
- vputils::findRecipe(BackedgeVal, IsaPred<VPReductionPHIRecipe>));
- }
-
- Value *MainResumeValue;
- if (auto *VPI = dyn_cast<VPInstruction>(EpiRedHeaderPhi->getStartValue())) {
- assert((VPI->getOpcode() == VPInstruction::Broadcast ||
- VPI->getOpcode() == VPInstruction::ReductionStartVector) &&
- "unexpected start recipe");
- MainResumeValue = VPI->getOperand(0)->getUnderlyingValue();
- } else
- MainResumeValue = EpiRedHeaderPhi->getStartValue()->getUnderlyingValue();
- if (EpiRedResult->getOpcode() == VPInstruction::ComputeAnyOfResult) {
- [[maybe_unused]] Value *StartV =
- EpiRedResult->getOperand(0)->getLiveInIRValue();
- auto *Cmp = cast<ICmpInst>(MainResumeValue);
- assert(Cmp->getPredicate() == CmpInst::ICMP_NE &&
- "AnyOf expected to start with ICMP_NE");
- assert(Cmp->getOperand(1) == StartV &&
- "AnyOf expected to start by comparing main resume value to original "
- "start value");
- MainResumeValue = Cmp->getOperand(0);
- } else if (IsFindIV) {
- MainResumeValue = cast<SelectInst>(MainResumeValue)->getFalseValue();
- }
- PHINode *MainResumePhi = cast<PHINode>(MainResumeValue);
-
- // When fixing reductions in the epilogue loop we should already have
- // created a bc.merge.rdx Phi after the main vector body. Ensure that we carry
- // over the incoming values correctly.
- EpiResumePhi.setIncomingValueForBlock(
- BypassBlock, MainResumePhi->getIncomingValueForBlock(BypassBlock));
-}
DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
ElementCount BestVF, unsigned BestUF, VPlan &BestVPlan,
@@ -8991,35 +8911,9 @@ LoopVectorizePass::LoopVectorizePass(LoopVectorizeOptions Opts)
!EnableLoopVectorization) {}
/// Prepare \p MainPlan for vectorizing the main vector loop during epilogue
-/// vectorization. Remove ResumePhis from \p MainPlan for inductions that
-/// don't have a corresponding wide induction in \p EpiPlan.
-static void preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
- // Collect PHI nodes of widened phis in the VPlan for the epilogue. Those
- // will need their resume-values computed in the main vector loop. Others
- // can be removed from the main VPlan.
- SmallPtrSet<PHINode *, 2> EpiWidenedPhis;
- for (VPRecipeBase &R :
- EpiPlan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
- if (isa<VPCanonicalIVPHIRecipe>(&R))
- continue;
- EpiWidenedPhis.insert(
- cast<PHINode>(R.getVPSingleValue()->getUnderlyingValue()));
- }
- for (VPRecipeBase &R :
- make_early_inc_range(MainPlan.getScalarHeader()->phis())) {
- auto *VPIRInst = cast<VPIRPhi>(&R);
- if (EpiWidenedPhis.contains(&VPIRInst->getIRPhi()))
- continue;
- // There is no corresponding wide induction in the epilogue plan that would
- // need a resume value. Remove the VPIRInst wrapping the scalar header phi
- // together with the corresponding ResumePhi. The resume values for the
- // scalar loop will be created during execution of EpiPlan.
- VPRecipeBase *ResumePhi = VPIRInst->getOperand(0)->getDefiningRecipe();
- VPIRInst->eraseFromParent();
- ResumePhi->eraseFromParent();
- }
- RUN_VPLAN_PASS(VPlanTransforms::removeDeadRecipes, MainPlan);
-
+/// vectorization.
+static SmallVector<VPInstruction *>
+preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
using namespace VPlanPatternMatch;
// When vectorizing the epilogue, FindFirstIV & FindLastIV reductions can
// introduce multiple uses of undef/poison. If the reduction start value may
@@ -9076,14 +8970,28 @@ static void preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
{}, "vec.epilog.resume.val");
} else {
ResumePhi = cast<VPPhi>(&*ResumePhiIter);
- if (MainScalarPH->begin() == MainScalarPH->end())
- ResumePhi->moveBefore(*MainScalarPH, MainScalarPH->end());
- else if (&*MainScalarPH->begin() != ResumePhi)
+ ResumePhi->setName("vec.epilog.resume.val");
+ if (&*MainScalarPH->begin() != ResumePhi)
ResumePhi->moveBefore(*MainScalarPH, MainScalarPH->begin());
}
- // Add a user to to make sure the resume phi won't get removed.
- VPBuilder(MainScalarPH)
- .createNaryOp(VPInstruction::ResumeForEpilogue, ResumePhi);
+
+ // Create a ResumeForEpilogue for the canonical IV resume as the
+ // first non-phi, to keep it alive for the epilogue.
+ VPBuilder ResumeBuilder(MainScalarPH);
+ ResumeBuilder.createNaryOp(VPInstruction::ResumeForEpilogue, ResumePhi);
+
+ // Collect resume values for epilogue bypass fixup. Create
+ // ResumeForEpilogue for scalar preheader phis to keep them alive.
+ SmallVector<VPInstruction *> ResumeValues;
+ for (VPRecipeBase &R : MainPlan.getScalarHeader()->phis()) {
+ auto *VPIRInst = cast<VPIRPhi>(&R);
+ VPValue *ResumeOp = VPIRInst->getOperand(0);
+ auto *Resume =
+ ResumeBuilder.createNaryOp(VPInstruction::ResumeForEpilogue, ResumeOp);
+ ResumeValues.push_back(Resume);
+ }
+
+ return ResumeValues;
}
/// Prepare \p Plan for vectorizing the epilogue loop. That is, re-use expanded
@@ -9291,39 +9199,11 @@ static SmallVector<Instruction *> preparePlanForEpilogueVectorLoop(
return InstsToMove;
}
-// Generate bypass values from the additional bypass block. Note that when the
-// vectorized epilogue is skipped due to iteration count check, then the
-// resume value for the induction variable comes from the trip count of the
-// main vector loop, passed as the second argument.
-static Value *createInductionAdditionalBypassValues(
- PHINode *OrigPhi, const InductionDescriptor &II, IRBuilder<> &BypassBuilder,
- const SCEV2ValueTy &ExpandedSCEVs, Value *MainVectorTripCount,
- Instruction *OldInduction) {
- Value *Step = getExpandedStep(II, ExpandedSCEVs);
- // For the primary induction the additional bypass end value is known.
- // Otherwise it is computed.
- Value *EndValueFromAdditionalBypass = MainVectorTripCount;
- if (OrigPhi != OldInduction) {
- auto *BinOp = II.getInductionBinOp();
- // Fast-math-flags propagate from the original induction instruction.
- if (isa_and_nonnull<FPMathOperator>(BinOp))
- BypassBuilder.setFastMathFlags(BinOp->getFastMathFlags());
-
- // Compute the end value for the additional bypass.
- EndValueFromAdditionalBypass =
- emitTransformedIndex(BypassBuilder, MainVectorTripCount,
- II.getStartValue(), Step, II.getKind(), BinOp);
- EndValueFromAdditionalBypass->setName("ind.end");
- }
- return EndValueFromAdditionalBypass;
-}
-
-static void fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
- VPlan &BestEpiPlan,
- LoopVectorizationLegality &LVL,
- const SCEV2ValueTy &ExpandedSCEVs,
- Value *MainVectorTripCount) {
- // Fix reduction resume values from the additional bypass block.
+static void
+fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
+ VPlan &BestEpiPlan,
+ ArrayRef<VPInstruction *> ResumeValues) {
+ // Fix resume values from the additional bypass block.
BasicBlock *PH = L->getLoopPreheader();
for (auto *Pred : predecessors(PH)) {
for (PHINode &Phi : PH->phis()) {
@@ -9334,40 +9214,13 @@ static void fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
}
auto *ScalarPH = cast<VPIRBasicBlock>(BestEpiPlan.getScalarPreheader());
if (ScalarPH->hasPredecessors()) {
- // If ScalarPH has predecessors, we may need to update its reduction
- // resume values.
- for (const auto &[R, IRPhi] :
- zip(ScalarPH->phis(), ScalarPH->getIRBasicBlock()->phis())) {
- fixReductionScalarResumeWhenVectorizingEpilog(cast<VPPhi>(&R), IRPhi,
- BypassBlock);
- }
- }
-
- // Fix induction resume values from the additional bypass block.
- IRBuilder<> BypassBuilder(BypassBlock, BypassBlock->getFirstInsertionPt());
- for (const auto &[IVPhi, II] : LVL.getInductionVars()) {
- Value *V = createInductionAdditionalBypassValues(
- IVPhi, II, BypassBuilder, ExpandedSCEVs, MainVectorTripCount,
- LVL.getPrimaryInduction());
- // TODO: Directly add as extra operand to the VPResumePHI recipe.
- if (auto *Inc = dyn_cast<PHINode>(IVPhi->getIncomingValueForBlock(PH))) {
- if (Inc->getBasicBlockIndex(BypassBlock) != -1)
- Inc->setIncomingValueForBlock(BypassBlock, V);
- } else {
- // If the resume value in the scalar preheader was simplified (e.g., when
- // narrowInterleaveGroups optimized away the resume PHIs), create a new
- // PHI to merge the bypass value with the original value.
- Value *OrigVal = IVPhi->getIncomingValueForBlock(PH);
- PHINode *NewPhi =
- PHINode::Create(IVPhi->getType(), pred_size(PH), "bc.resume.val",
- PH->getFirstNonPHIIt());
- for (auto *Pred : predecessors(PH)) {
- if (Pred == BypassBlock)
- NewPhi->addIncoming(V, Pred);
- else
- NewPhi->addIncoming(OrigVal, Pred);
- }
- IVPhi->setIncomingValueForBlock(PH, NewPhi);
+ // Fix resume values for inductions and reductions from the additional
+ // bypass block using the incoming values from the main loop's resume phis.
+ for (auto [ResumeV, IRPhi] :
+ zip(ResumeValues, ScalarPH->getIRBasicBlock()->phis())) {
+ auto *MainResumePhi = cast<PHINode>(ResumeV->getUnderlyingValue());
+ IRPhi.setIncomingValueForBlock(
+ BypassBlock, MainResumePhi->getIncomingValueForBlock(BypassBlock));
}
}
}
@@ -9377,11 +9230,12 @@ static void fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
// and runtime checks of the main loop, as well as updating various phis. \p
// InstsToMove contains instructions that need to be moved to the preheader of
// the epilogue vector loop.
-static void connectEpilogueVectorLoop(
- VPlan &EpiPlan, Loop *L, EpilogueLoopVectorizationInfo &EPI,
- DominatorTree *DT, LoopVectorizationLegality &LVL,
- DenseMap<const SCEV *, Value *> &ExpandedSCEVs, GeneratedRTChecks &Checks,
- ArrayRef<Instruction *> InstsToMove) {
+static void connectEpilogueVectorLoop(VPlan &EpiPlan, Loop *L,
+ EpilogueLoopVectorizationInfo &EPI,
+ DominatorTree *DT,
+ GeneratedRTChecks &Checks,
+ ArrayRef<Instruction *> InstsToMove,
+ ArrayRef<VPInstruction *> ResumeValues) {
BasicBlock *VecEpilogueIterationCountCheck =
cast<VPIRBasicBlock>(EpiPlan.getEntry())->getIRBasicBlock();
@@ -9464,7 +9318,13 @@ static void connectEpilogueVectorLoop(
// after executing the main loop. We need to update the resume values of
// inductions and reductions during epilogue vectorization.
fixScalarResumeValuesFromBypass(VecEpilogueIterationCountCheck, L, EpiPlan,
- LVL, ExpandedSCEVs, EPI.VectorTripCount);
+ ResumeValues);
+
+ // Remove dead phis that were moved to the epilogue preheader but are unused
+ // (e.g., resume phis for inductions not widened in the epilogue vector loop).
+ for (PHINode &Phi : make_early_inc_range(VecEpiloguePreHeader->phis()))
+ if (Phi.use_empty())
+ Phi.eraseFromParent();
}
bool LoopVectorizePass::processLoop(Loop *L) {
@@ -9851,7 +9711,8 @@ bool LoopVectorizePass::processLoop(Loop *L) {
VPlan &BestEpiPlan = LVP.getPlanFor(EpilogueVF.Width);
BestEpiPlan.getMiddleBlock()->setName("vec.epilog.middle.block");
BestEpiPlan.getVectorPreheader()->setName("vec.epilog.ph");
- preparePlanForMainVectorLoop(*BestMainPlan, BestEpiPlan);
+ auto ResumeValues =
+ preparePlanForMainVectorLoop(*BestMainPlan, BestEpiPlan);
EpilogueLoopVectorizationInfo EPI(VF.Width, IC, EpilogueVF.Width, 1,
BestEpiPlan);
EpilogueVectorizerMainLoop MainILV(L, PSE, LI, DT, TTI, AC, EPI, &CM,
@@ -9868,8 +9729,8 @@ bool LoopVectorizePass::processLoop(Loop *L) {
BestEpiPlan, L, ExpandedSCEVs, EPI, CM, *PSE.getSE());
LVP.executePlan(EPI.EpilogueVF, EPI.EpilogueUF, BestEpiPlan, EpilogILV, DT,
true);
- connectEpilogueVectorLoop(BestEpiPlan, L, EPI, DT, LVL, ExpandedSCEVs,
- Checks, InstsToMove);
+ connectEpilogueVectorLoop(BestEpiPlan, L, EPI, DT, Checks, InstsToMove,
+ ResumeValues);
++LoopsEpilogueVectorized;
} else {
InnerLoopVectorizer LB(L, PSE, LI, DT, TTI, AC, VF.Width, IC, &CM, Checks,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 12aafd8e72c22..a876c02fe073b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -1321,6 +1321,8 @@ void VPInstruction::execute(VPTransformState &State) {
"scalar value but not only first lane defined");
State.set(this, GeneratedValue,
/*IsScalar*/ GeneratesPerFirstLaneOnly);
+ if (getOpcode() == VPInstruction::ResumeForEpilogue)
+ setUnderlyingValue(GeneratedValue);
}
bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll
index 4eced1640bd14..ae1f7829449da 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll
@@ -46,7 +46,6 @@ define i8 @select_icmp_var_start(ptr %a, i8 %n, i8 %start) {
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[TMP2]], [[N_VEC]]
; CHECK-NEXT: br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
; CHECK: [[VEC_EPILOG_ITER_CHECK]]:
-; CHECK-NEXT: [[IND_END:%.*]] = trunc i32 [[N_VEC]] to i8
; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i32 [[N_MOD_VF]], 8
; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]], !prof [[PROF3:![0-9]+]]
; CHECK: [[VEC_EPILOG_PH]]:
@@ -84,7 +83,7 @@ define i8 @select_icmp_var_start(ptr %a, i8 %n, i8 %start) {
; CHECK-NEXT: [[CMP_N16:%.*]] = icmp eq i32 [[TMP2]], [[N_VEC5]]
; CHECK-NEXT: br i1 [[CMP_N16]], label %[[EXIT]], label %[[VEC_EPILOG_SCALAR_PH]]
; CHECK: [[VEC_EPILOG_SCALAR_PH]]:
-; CHECK-NEXT: [[BC_RESUME_VAL17:%.*]] = phi i8 [ [[TMP16]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[IND_END]], %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[ITER_CHECK]] ]
+; CHECK-NEXT: [[BC_RESUME_VAL17:%.*]] = phi i8 [ [[TMP16]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[TMP3]], %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[ITER_CHECK]] ]
; CHECK-NEXT: [[BC_MERGE_RDX18:%.*]] = phi i8 [ [[RDX_SELECT15]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[RDX_SELECT]], %[[VEC_EPILOG_ITER_CHECK]] ], [ [[START]], %[[ITER_CHECK]] ]
; CHECK-NEXT: br label %[[LOOP:.*]]
; CHECK: [[LOOP]]:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
index 44636222a8648..efa64c661ebc0 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
@@ -446,6 +446,8 @@ define void @trip_count_based_on_ptrtoint(i64 %x) "target-cpu"="apple-m1" {
; CHECK: vector.ph:
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 16
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP12:%.*]] = mul i64 [[N_VEC]], 4
+; CHECK-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[PTR_START]], i64 [[TMP12]]
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -465,8 +467,6 @@ define void @trip_count_based_on_ptrtoint(i64 %x) "target-cpu"="apple-m1" {
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
; CHECK: vec.epilog.iter.check:
-; CHECK-NEXT: [[TMP12:%.*]] = mul i64 [[N_VEC]], 4
-; CHECK-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[PTR_START]], i64 [[TMP12]]
; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_MOD_VF]], 4
; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]], !prof [[PROF11]]
; CHECK: vec.epilog.ph:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll
index eea496303a206..7ae16f782ed72 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll
@@ -39,7 +39,6 @@ define void @test_widen_ptr_induction(ptr %ptr.start.1) {
; CHECK: middle.blo...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
🐧 Linux x64 Test Results

Failed test: lldb-api :: functionalities/data-formatter/data-formatter-stl/generic/list/TestDataFormatterGenericList.py

If these failures are unrelated to your changes (for example tests are broken or flaky at HEAD), please open an issue at https://github.com/llvm/llvm-project/issues.
🪟 Windows x64 Test Results
✅ The build succeeded and all tests passed.
This patch tries to drastically simplify resume value handling for the scalar loop when vectorizing the epilogue. It uses a simpler, uniform approach for updating all resume values in the scalar loop:

1. Create ResumeForEpilogue recipes for all scalar resume phis in the main loop (the epilogue plan will have exactly the same scalar resume phis, in exactly the same order)
2. Update ::execute for ResumeForEpilogue to set the underlying value when executing. This is not super clean, but allows easy lookup of the generated IR value when we update the resume phis in the epilogue. Once we connect the 2 plans together explicitly, this can be removed.
3. Use the list of ResumeForEpilogue VPInstructions from the main loop to update the resume/bypass values from the epilogue.

This simplifies the code quite a bit, makes it more robust (should fix llvm#179407) and also fixes a mis-compile in the existing tests (see change in llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll, where previously we would incorrectly resume using the start value when the epilogue iteration check failed).

In some cases, we get simpler code due to additional CSE; in some cases, the induction end value computations get moved from the epilogue iteration check to the vector preheader. We could try to sink the instructions as cleanup, but it is probably not worth the trouble.
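To illustrate step 3, here is a rough toy model in plain Python (not LLVM code; block names and values are purely illustrative): a "phi" is modeled as a dict mapping predecessor-block name to incoming value, and the fixup pairs each main-loop resume phi with the corresponding epilogue scalar-preheader phi (same order in both plans) and copies the incoming value for the additional bypass block across.

```python
# Toy model (plain Python, not LLVM APIs) of the uniform resume-value fixup:
# a "phi" is a dict mapping predecessor-block name -> incoming value.

def fix_scalar_resume_values_from_bypass(resume_values, epi_scalar_ph_phis,
                                         bypass_block):
    """Pair each main-loop resume phi with the corresponding epilogue
    scalar-preheader phi and copy the bypass block's incoming value."""
    for main_resume_phi, epi_phi in zip(resume_values, epi_scalar_ph_phis):
        epi_phi[bypass_block] = main_resume_phi[bypass_block]

# Illustrative values: a canonical-IV resume and a reduction resume.
main_iv_resume = {"vec.epilog.iter.check": 96}   # main vector trip count
main_rdx_resume = {"vec.epilog.iter.check": 7}   # main loop's partial result
epi_iv_phi = {"vec.epilog.middle.block": 100, "vec.epilog.iter.check": 0}
epi_rdx_phi = {"vec.epilog.middle.block": 9, "vec.epilog.iter.check": 0}

fix_scalar_resume_values_from_bypass(
    [main_iv_resume, main_rdx_resume], [epi_iv_phi, epi_rdx_phi],
    "vec.epilog.iter.check")
print(epi_iv_phi["vec.epilog.iter.check"])   # -> 96
print(epi_rdx_phi["vec.epilog.iter.check"])  # -> 7
```

The point of the uniform approach is that no per-recipe pattern matching (reductions vs. inductions vs. FindIV) is needed; the pairing by position does all the work.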
f505665 to 90ca940
static SmallVector<VPInstruction *>
preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
Is it possible to make preparePlanFor(Main|Epilogue)VectorLoop totally uniform, and return a SmallVector of VPInstruction resume values or InstsToMove in both?
I am not sure exactly, their returns are very different, but perhaps I am missing a simplification you are thinking of?
…r-resume-value-handling
artagnon left a comment:
Hm, forgot to commit some changes?
7a7e332 to 83fbf86
Ah yes, sorry for that. Should be pushed now.
Damn, this is bad! Is the scalar TC really not kept anywhere? vec.epilog.resume.val is merely VectorTC, and ResumePhi is merely a phi from that start value up to ScalarTC, no?
It's a phi with incoming value from MiddleBlock (= vector trip count) and zero incoming value from the entry block; we have the original trip count, but we need the vector trip count to resume the canonical IV from there in the epilogue vector loop.
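A rough sketch in plain Python (illustrative only, not LLVM code) of what that canonical-IV resume phi computes, assuming the main vector loop peels off whole groups of VF*UF iterations:

```python
def canonical_iv_resume(trip_count, vf_times_uf, main_loop_taken):
    """Value of the canonical-IV resume phi in the main loop's scalar
    preheader: the main loop's vector trip count if the main vector loop
    ran (incoming from the middle block), else 0 (incoming from the
    entry / iteration-count check)."""
    n_vec = trip_count - trip_count % vf_times_uf  # main vector trip count
    return n_vec if main_loop_taken else 0

# E.g. 100 iterations, VF*UF = 16: the epilogue (or scalar loop) resumes at 96.
print(canonical_iv_resume(100, 16, True))   # -> 96
print(canonical_iv_resume(100, 16, False))  # -> 0
```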
artagnon left a comment:
Thanks for all the explanations! I think that I understand this well enough to approve, but note that all this code is new to me, and I've not played with it: I'm worried that this might cause miscompiles and crashes due to some context that we didn't think of, but that's why we have version control. I think further simplifications are possible, but let's not worry about that for now. LGTM, thanks!
…r-resume-value-handling
…r-resume-value-handling
…#185969) This patch tries to drastically simplify resume value handling for the scalar loop when vectorizing the epilogue (full description above). Fixes llvm#179407.
Looks like we're seeing a miscompilation after this patch, working on a reproducer.
We also see miscompilations, in different apps, so not just one. Shall we revert this, because I don't see a lot of movement in #187323?
There's no valid reproducer so far showing an issue. We can revert, but it would be good to also provide a concrete reproducer showing the issue.
This shows miscompiles in 549.fotonik3d_r in SPEC 2017 FP. And then other apps as well. Thanks for the revert.
Would you be able to provide an IR reproducer?
I will send Alexey a message; if he doesn't get there first, I will have a look.
…epilogue vec." (#187504) Reverts llvm/llvm-project#185969 This is suspected to cause a miscompile in 549.fotonik3d_r from SPEC 2017 FP
Unfortunately this change (which I'm guessing will reland without too many alterations) is causing a lot of regressions in CMSIS-DSP (embedded library) kernels that we expect to auto-vectorize. On a Cortex-A320, of the 322 kernels we test, 93 saw a regression, so almost 30%, by an average of 2.7 percent; the top 20 sit at 4.5%. This seems a bit high to me, especially when it's affecting this type of what seems to me at least fairly idiomatic library code. When we look at an example, we see that the interleave count in the relevant loop is reduced from 8 to 2.

Cut-back stand-alone version, compiled with …

The below assembly is from the relevant loop.

before: …

after: …
@stuij is it possible this is not the right PR? AFAICT it does not change the generated assembly for the C code at all; it just changes how the resume values are handled, without impacting the vectorization decision. On current main, codegen is also narrow: https://clang.godbolt.org/z/YoKKazsbs
@sjoerdmeijer relanded as 77fb848; please let me know if you are seeing any issues, and if so please provide a reproducer :)
Leaving this reverted without reverting #182146 is a bad state too. We're affected by this miscompilation then: #182146 (comment). Maybe #182146 should be reverted as well?
@alexfh it was re-landed yesterday.
@fhahn yes! I accidentally took the wrong function from the daily regression list to look at regressions caused by this patch, sorry! I bisected it, and it was indeed 92e44b2. We've seen regressions caused by related patches and we've got a ticket for this. We did actually try to tweak the max interleave count (to 4), but it did more harm than good for the CMSIS-DSP tests. It's ongoing. Thanks for having a think on it.

However, the list and amount of regressions that I quoted earlier in this thread do seem to be caused by this change. The numbers ebb and flow with this commit being applied and reverted (and it is pointed to by our automated bisecting). Examples of actual impacted tests are: arm_fill_f16, arm_q7_to_q15, arm_mat_add_f16. Looking at …
…c." (llvm#187504) Reverts llvm#185969 This is suspected to cause a miscompile in 549.fotonik3d_r from SPEC 2017 FP
This patch tries to drastically simplify resume value handling for the scalar loop when vectorizing the epilogue.
It uses a simpler, uniform approach for updating all resume values in the scalar loop:
1. Create ResumeForEpilogue recipes for all scalar resume phis in the main loop (the epilogue plan will have exactly the same scalar resume phis, in exactly the same order)
2. Update ::execute for ResumeForEpilogue to set the underlying value when executing. This is not super clean, but allows easy lookup of the generated IR value when we update the resume phis in the epilogue. Once we connect the 2 plans together explicitly, this can be removed.
3. Use the list of ResumeForEpilogue VPInstructions from the main loop to update the resume/bypass values from the epilogue.
This simplifies the code quite a bit, makes it more robust (should fix #179407) and also fixes a mis-compile in the existing tests (see change in llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll, where previously we would incorrectly resume using the start value when the epilogue iteration check failed).
In some cases, we get simpler code, due to additional CSE, in some cases the induction end value computations get moved from the epilogue iteration check to the vector preheader. We could try to sink the instructions as cleanup, but it is probably not worth the trouble.
Fixes #179407.
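The mis-compile mentioned above can be modeled with a small Python sketch (illustrative values and parameter names, not LLVM code): a sum reduction vectorized with a main loop and an epilogue, where the buggy behavior resumed the scalar remainder from the original start value, dropping the main loop's partial result, whenever the epilogue iteration-count check failed.

```python
def vectorized_sum(xs, start, vf_main=4, vf_epi=2, buggy=False):
    """Model of main + epilogue vectorization of a sum reduction over xs.
    The main loop handles a multiple of vf_main iterations; the epilogue
    runs only if at least vf_epi iterations remain. The modeled bug: when
    the epilogue check failed, the scalar remainder resumed from `start`
    instead of the main loop's partial result."""
    n = len(xs)
    n_main = n - n % vf_main
    main_result = start + sum(xs[:n_main])      # main vector loop
    rem = n - n_main
    if rem >= vf_epi:                           # epilogue iteration check
        n_epi = n_main + rem - rem % vf_epi
        resume, i = main_result + sum(xs[n_main:n_epi]), n_epi
    else:
        # Buggy path drops the main loop's work by resuming from `start`.
        resume, i = (start if buggy else main_result), n_main
    return resume + sum(xs[i:])                 # scalar remainder loop

xs = list(range(1, 11))                         # sum(xs) == 55
print(vectorized_sum(xs, 100, 4, 3))            # -> 155 (correct)
print(vectorized_sum(xs, 100, 4, 3, buggy=True))  # -> 119 (mis-compile)
```

With 10 elements and vf_main=4, two iterations remain after the main loop; with vf_epi=3 the epilogue check fails, which is exactly the case where the old code picked the wrong resume value.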