[LV] Simplify and unify resume value handling for epilogue vec.#185969
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

This patch tries to drastically simplify resume value handling for the scalar loop when vectorizing the epilogue. It uses a simpler, uniform approach for updating all resume values in the scalar loop:
1. Create ResumeForEpilogue recipes for all scalar resume phis in the main loop (the epilogue plan will have exactly the same scalar resume phis, in exactly the same order)
2. Update ::execute for ResumeForEpilogue to set the underlying value when executing. This is not super clean, but allows easy lookup of the generated IR value when we update the resume phis in the epilogue. Once we connect the 2 plans together explicitly, this can be removed.
3. Use the list of ResumeForEpilogue VPInstructions from the main loop to update the resume/bypass values from the epilogue.
This simplifies the code quite a bit, makes it more robust (should fix #179407) and also fixes a mis-compile in the existing tests (see change in llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll, where previously we would incorrectly resume using the start value when the epilogue iteration check failed).

In some cases, we get simpler code due to additional CSE; in some cases, the induction end value computations get moved from the epilogue iteration check to the vector preheader. We could try to sink the instructions as cleanup, but it is probably not worth the trouble.

Patch is 98.46 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/185969.diff

33 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 53a9bdb113b65..5299ba4f88b4a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -2423,20 +2423,6 @@ BasicBlock *InnerLoopVectorizer::createScalarPreheader(StringRef Prefix) {
Twine(Prefix) + "scalar.ph");
}
-/// Return the expanded step for \p ID using \p ExpandedSCEVs to look up SCEV
-/// expansion results.
-static Value *getExpandedStep(const InductionDescriptor &ID,
- const SCEV2ValueTy &ExpandedSCEVs) {
- const SCEV *Step = ID.getStep();
- if (auto *C = dyn_cast<SCEVConstant>(Step))
- return C->getValue();
- if (auto *U = dyn_cast<SCEVUnknown>(Step))
- return U->getValue();
- Value *V = ExpandedSCEVs.lookup(Step);
- assert(V && "SCEV must be expanded at this point");
- return V;
-}
-
/// Knowing that loop \p L executes a single vector iteration, add instructions
/// that will get simplified and thus should not have any cost to \p
/// InstsToIgnore.
@@ -7341,72 +7327,6 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
return BestFactor;
}
-// If \p EpiResumePhiR is resume VPPhi for a reduction when vectorizing the
-// epilog loop, fix the reduction's scalar PHI node by adding the incoming value
-// from the main vector loop.
-static void fixReductionScalarResumeWhenVectorizingEpilog(
- VPPhi *EpiResumePhiR, PHINode &EpiResumePhi, BasicBlock *BypassBlock) {
- using namespace VPlanPatternMatch;
- // Get the VPInstruction computing the reduction result in the middle block.
- // The first operand may not be from the middle block if it is not connected
- // to the scalar preheader. In that case, there's nothing to fix.
- VPValue *Incoming = EpiResumePhiR->getOperand(0);
- match(Incoming, VPlanPatternMatch::m_ZExtOrSExt(
- VPlanPatternMatch::m_VPValue(Incoming)));
- auto *EpiRedResult = dyn_cast<VPInstruction>(Incoming);
- if (!EpiRedResult)
- return;
-
- VPValue *BackedgeVal;
- bool IsFindIV = false;
- if (EpiRedResult->getOpcode() == VPInstruction::ComputeAnyOfResult ||
- EpiRedResult->getOpcode() == VPInstruction::ComputeReductionResult)
- BackedgeVal = EpiRedResult->getOperand(EpiRedResult->getNumOperands() - 1);
- else if (matchFindIVResult(EpiRedResult, m_VPValue(BackedgeVal), m_VPValue()))
- IsFindIV = true;
- else
- return;
-
- auto *EpiRedHeaderPhi = cast_if_present<VPReductionPHIRecipe>(
- vputils::findRecipe(BackedgeVal, IsaPred<VPReductionPHIRecipe>));
- if (!EpiRedHeaderPhi) {
- match(BackedgeVal,
- VPlanPatternMatch::m_Select(VPlanPatternMatch::m_VPValue(),
- VPlanPatternMatch::m_VPValue(BackedgeVal),
- VPlanPatternMatch::m_VPValue()));
- EpiRedHeaderPhi = cast<VPReductionPHIRecipe>(
- vputils::findRecipe(BackedgeVal, IsaPred<VPReductionPHIRecipe>));
- }
-
- Value *MainResumeValue;
- if (auto *VPI = dyn_cast<VPInstruction>(EpiRedHeaderPhi->getStartValue())) {
- assert((VPI->getOpcode() == VPInstruction::Broadcast ||
- VPI->getOpcode() == VPInstruction::ReductionStartVector) &&
- "unexpected start recipe");
- MainResumeValue = VPI->getOperand(0)->getUnderlyingValue();
- } else
- MainResumeValue = EpiRedHeaderPhi->getStartValue()->getUnderlyingValue();
- if (EpiRedResult->getOpcode() == VPInstruction::ComputeAnyOfResult) {
- [[maybe_unused]] Value *StartV =
- EpiRedResult->getOperand(0)->getLiveInIRValue();
- auto *Cmp = cast<ICmpInst>(MainResumeValue);
- assert(Cmp->getPredicate() == CmpInst::ICMP_NE &&
- "AnyOf expected to start with ICMP_NE");
- assert(Cmp->getOperand(1) == StartV &&
- "AnyOf expected to start by comparing main resume value to original "
- "start value");
- MainResumeValue = Cmp->getOperand(0);
- } else if (IsFindIV) {
- MainResumeValue = cast<SelectInst>(MainResumeValue)->getFalseValue();
- }
- PHINode *MainResumePhi = cast<PHINode>(MainResumeValue);
-
- // When fixing reductions in the epilogue loop we should already have
- // created a bc.merge.rdx Phi after the main vector body. Ensure that we carry
- // over the incoming values correctly.
- EpiResumePhi.setIncomingValueForBlock(
- BypassBlock, MainResumePhi->getIncomingValueForBlock(BypassBlock));
-}
DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
ElementCount BestVF, unsigned BestUF, VPlan &BestVPlan,
@@ -8991,35 +8911,9 @@ LoopVectorizePass::LoopVectorizePass(LoopVectorizeOptions Opts)
!EnableLoopVectorization) {}
/// Prepare \p MainPlan for vectorizing the main vector loop during epilogue
-/// vectorization. Remove ResumePhis from \p MainPlan for inductions that
-/// don't have a corresponding wide induction in \p EpiPlan.
-static void preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
- // Collect PHI nodes of widened phis in the VPlan for the epilogue. Those
- // will need their resume-values computed in the main vector loop. Others
- // can be removed from the main VPlan.
- SmallPtrSet<PHINode *, 2> EpiWidenedPhis;
- for (VPRecipeBase &R :
- EpiPlan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
- if (isa<VPCanonicalIVPHIRecipe>(&R))
- continue;
- EpiWidenedPhis.insert(
- cast<PHINode>(R.getVPSingleValue()->getUnderlyingValue()));
- }
- for (VPRecipeBase &R :
- make_early_inc_range(MainPlan.getScalarHeader()->phis())) {
- auto *VPIRInst = cast<VPIRPhi>(&R);
- if (EpiWidenedPhis.contains(&VPIRInst->getIRPhi()))
- continue;
- // There is no corresponding wide induction in the epilogue plan that would
- // need a resume value. Remove the VPIRInst wrapping the scalar header phi
- // together with the corresponding ResumePhi. The resume values for the
- // scalar loop will be created during execution of EpiPlan.
- VPRecipeBase *ResumePhi = VPIRInst->getOperand(0)->getDefiningRecipe();
- VPIRInst->eraseFromParent();
- ResumePhi->eraseFromParent();
- }
- RUN_VPLAN_PASS(VPlanTransforms::removeDeadRecipes, MainPlan);
-
+/// vectorization.
+static SmallVector<VPInstruction *>
+preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
using namespace VPlanPatternMatch;
// When vectorizing the epilogue, FindFirstIV & FindLastIV reductions can
// introduce multiple uses of undef/poison. If the reduction start value may
@@ -9076,14 +8970,28 @@ static void preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
{}, "vec.epilog.resume.val");
} else {
ResumePhi = cast<VPPhi>(&*ResumePhiIter);
- if (MainScalarPH->begin() == MainScalarPH->end())
- ResumePhi->moveBefore(*MainScalarPH, MainScalarPH->end());
- else if (&*MainScalarPH->begin() != ResumePhi)
+ ResumePhi->setName("vec.epilog.resume.val");
+ if (&*MainScalarPH->begin() != ResumePhi)
ResumePhi->moveBefore(*MainScalarPH, MainScalarPH->begin());
}
- // Add a user to to make sure the resume phi won't get removed.
- VPBuilder(MainScalarPH)
- .createNaryOp(VPInstruction::ResumeForEpilogue, ResumePhi);
+
+ // Create a ResumeForEpilogue for the canonical IV resume as the
+ // first non-phi, to keep it alive for the epilogue.
+ VPBuilder ResumeBuilder(MainScalarPH);
+ ResumeBuilder.createNaryOp(VPInstruction::ResumeForEpilogue, ResumePhi);
+
+ // Collect resume values for epilogue bypass fixup. Create
+ // ResumeForEpilogue for scalar preheader phis to keep them alive.
+ SmallVector<VPInstruction *> ResumeValues;
+ for (VPRecipeBase &R : MainPlan.getScalarHeader()->phis()) {
+ auto *VPIRInst = cast<VPIRPhi>(&R);
+ VPValue *ResumeOp = VPIRInst->getOperand(0);
+ auto *Resume =
+ ResumeBuilder.createNaryOp(VPInstruction::ResumeForEpilogue, ResumeOp);
+ ResumeValues.push_back(Resume);
+ }
+
+ return ResumeValues;
}
/// Prepare \p Plan for vectorizing the epilogue loop. That is, re-use expanded
@@ -9291,39 +9199,11 @@ static SmallVector<Instruction *> preparePlanForEpilogueVectorLoop(
return InstsToMove;
}
-// Generate bypass values from the additional bypass block. Note that when the
-// vectorized epilogue is skipped due to iteration count check, then the
-// resume value for the induction variable comes from the trip count of the
-// main vector loop, passed as the second argument.
-static Value *createInductionAdditionalBypassValues(
- PHINode *OrigPhi, const InductionDescriptor &II, IRBuilder<> &BypassBuilder,
- const SCEV2ValueTy &ExpandedSCEVs, Value *MainVectorTripCount,
- Instruction *OldInduction) {
- Value *Step = getExpandedStep(II, ExpandedSCEVs);
- // For the primary induction the additional bypass end value is known.
- // Otherwise it is computed.
- Value *EndValueFromAdditionalBypass = MainVectorTripCount;
- if (OrigPhi != OldInduction) {
- auto *BinOp = II.getInductionBinOp();
- // Fast-math-flags propagate from the original induction instruction.
- if (isa_and_nonnull<FPMathOperator>(BinOp))
- BypassBuilder.setFastMathFlags(BinOp->getFastMathFlags());
-
- // Compute the end value for the additional bypass.
- EndValueFromAdditionalBypass =
- emitTransformedIndex(BypassBuilder, MainVectorTripCount,
- II.getStartValue(), Step, II.getKind(), BinOp);
- EndValueFromAdditionalBypass->setName("ind.end");
- }
- return EndValueFromAdditionalBypass;
-}
-
-static void fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
- VPlan &BestEpiPlan,
- LoopVectorizationLegality &LVL,
- const SCEV2ValueTy &ExpandedSCEVs,
- Value *MainVectorTripCount) {
- // Fix reduction resume values from the additional bypass block.
+static void
+fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
+ VPlan &BestEpiPlan,
+ ArrayRef<VPInstruction *> ResumeValues) {
+ // Fix resume values from the additional bypass block.
BasicBlock *PH = L->getLoopPreheader();
for (auto *Pred : predecessors(PH)) {
for (PHINode &Phi : PH->phis()) {
@@ -9334,40 +9214,13 @@ static void fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
}
auto *ScalarPH = cast<VPIRBasicBlock>(BestEpiPlan.getScalarPreheader());
if (ScalarPH->hasPredecessors()) {
- // If ScalarPH has predecessors, we may need to update its reduction
- // resume values.
- for (const auto &[R, IRPhi] :
- zip(ScalarPH->phis(), ScalarPH->getIRBasicBlock()->phis())) {
- fixReductionScalarResumeWhenVectorizingEpilog(cast<VPPhi>(&R), IRPhi,
- BypassBlock);
- }
- }
-
- // Fix induction resume values from the additional bypass block.
- IRBuilder<> BypassBuilder(BypassBlock, BypassBlock->getFirstInsertionPt());
- for (const auto &[IVPhi, II] : LVL.getInductionVars()) {
- Value *V = createInductionAdditionalBypassValues(
- IVPhi, II, BypassBuilder, ExpandedSCEVs, MainVectorTripCount,
- LVL.getPrimaryInduction());
- // TODO: Directly add as extra operand to the VPResumePHI recipe.
- if (auto *Inc = dyn_cast<PHINode>(IVPhi->getIncomingValueForBlock(PH))) {
- if (Inc->getBasicBlockIndex(BypassBlock) != -1)
- Inc->setIncomingValueForBlock(BypassBlock, V);
- } else {
- // If the resume value in the scalar preheader was simplified (e.g., when
- // narrowInterleaveGroups optimized away the resume PHIs), create a new
- // PHI to merge the bypass value with the original value.
- Value *OrigVal = IVPhi->getIncomingValueForBlock(PH);
- PHINode *NewPhi =
- PHINode::Create(IVPhi->getType(), pred_size(PH), "bc.resume.val",
- PH->getFirstNonPHIIt());
- for (auto *Pred : predecessors(PH)) {
- if (Pred == BypassBlock)
- NewPhi->addIncoming(V, Pred);
- else
- NewPhi->addIncoming(OrigVal, Pred);
- }
- IVPhi->setIncomingValueForBlock(PH, NewPhi);
+ // Fix resume values for inductions and reductions from the additional
+ // bypass block using the incoming values from the main loop's resume phis.
+ for (auto [ResumeV, IRPhi] :
+ zip(ResumeValues, ScalarPH->getIRBasicBlock()->phis())) {
+ auto *MainResumePhi = cast<PHINode>(ResumeV->getUnderlyingValue());
+ IRPhi.setIncomingValueForBlock(
+ BypassBlock, MainResumePhi->getIncomingValueForBlock(BypassBlock));
}
}
}
@@ -9377,11 +9230,12 @@ static void fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
// and runtime checks of the main loop, as well as updating various phis. \p
// InstsToMove contains instructions that need to be moved to the preheader of
// the epilogue vector loop.
-static void connectEpilogueVectorLoop(
- VPlan &EpiPlan, Loop *L, EpilogueLoopVectorizationInfo &EPI,
- DominatorTree *DT, LoopVectorizationLegality &LVL,
- DenseMap<const SCEV *, Value *> &ExpandedSCEVs, GeneratedRTChecks &Checks,
- ArrayRef<Instruction *> InstsToMove) {
+static void connectEpilogueVectorLoop(VPlan &EpiPlan, Loop *L,
+ EpilogueLoopVectorizationInfo &EPI,
+ DominatorTree *DT,
+ GeneratedRTChecks &Checks,
+ ArrayRef<Instruction *> InstsToMove,
+ ArrayRef<VPInstruction *> ResumeValues) {
BasicBlock *VecEpilogueIterationCountCheck =
cast<VPIRBasicBlock>(EpiPlan.getEntry())->getIRBasicBlock();
@@ -9464,7 +9318,13 @@ static void connectEpilogueVectorLoop(
// after executing the main loop. We need to update the resume values of
// inductions and reductions during epilogue vectorization.
fixScalarResumeValuesFromBypass(VecEpilogueIterationCountCheck, L, EpiPlan,
- LVL, ExpandedSCEVs, EPI.VectorTripCount);
+ ResumeValues);
+
+ // Remove dead phis that were moved to the epilogue preheader but are unused
+ // (e.g., resume phis for inductions not widened in the epilogue vector loop).
+ for (PHINode &Phi : make_early_inc_range(VecEpiloguePreHeader->phis()))
+ if (Phi.use_empty())
+ Phi.eraseFromParent();
}
bool LoopVectorizePass::processLoop(Loop *L) {
@@ -9851,7 +9711,8 @@ bool LoopVectorizePass::processLoop(Loop *L) {
VPlan &BestEpiPlan = LVP.getPlanFor(EpilogueVF.Width);
BestEpiPlan.getMiddleBlock()->setName("vec.epilog.middle.block");
BestEpiPlan.getVectorPreheader()->setName("vec.epilog.ph");
- preparePlanForMainVectorLoop(*BestMainPlan, BestEpiPlan);
+ auto ResumeValues =
+ preparePlanForMainVectorLoop(*BestMainPlan, BestEpiPlan);
EpilogueLoopVectorizationInfo EPI(VF.Width, IC, EpilogueVF.Width, 1,
BestEpiPlan);
EpilogueVectorizerMainLoop MainILV(L, PSE, LI, DT, TTI, AC, EPI, &CM,
@@ -9868,8 +9729,8 @@ bool LoopVectorizePass::processLoop(Loop *L) {
BestEpiPlan, L, ExpandedSCEVs, EPI, CM, *PSE.getSE());
LVP.executePlan(EPI.EpilogueVF, EPI.EpilogueUF, BestEpiPlan, EpilogILV, DT,
true);
- connectEpilogueVectorLoop(BestEpiPlan, L, EPI, DT, LVL, ExpandedSCEVs,
- Checks, InstsToMove);
+ connectEpilogueVectorLoop(BestEpiPlan, L, EPI, DT, Checks, InstsToMove,
+ ResumeValues);
++LoopsEpilogueVectorized;
} else {
InnerLoopVectorizer LB(L, PSE, LI, DT, TTI, AC, VF.Width, IC, &CM, Checks,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 12aafd8e72c22..a876c02fe073b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -1321,6 +1321,8 @@ void VPInstruction::execute(VPTransformState &State) {
"scalar value but not only first lane defined");
State.set(this, GeneratedValue,
/*IsScalar*/ GeneratesPerFirstLaneOnly);
+ if (getOpcode() == VPInstruction::ResumeForEpilogue)
+ setUnderlyingValue(GeneratedValue);
}
bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll
index 4eced1640bd14..ae1f7829449da 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll
@@ -46,7 +46,6 @@ define i8 @select_icmp_var_start(ptr %a, i8 %n, i8 %start) {
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[TMP2]], [[N_VEC]]
; CHECK-NEXT: br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
; CHECK: [[VEC_EPILOG_ITER_CHECK]]:
-; CHECK-NEXT: [[IND_END:%.*]] = trunc i32 [[N_VEC]] to i8
; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i32 [[N_MOD_VF]], 8
; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]], !prof [[PROF3:![0-9]+]]
; CHECK: [[VEC_EPILOG_PH]]:
@@ -84,7 +83,7 @@ define i8 @select_icmp_var_start(ptr %a, i8 %n, i8 %start) {
; CHECK-NEXT: [[CMP_N16:%.*]] = icmp eq i32 [[TMP2]], [[N_VEC5]]
; CHECK-NEXT: br i1 [[CMP_N16]], label %[[EXIT]], label %[[VEC_EPILOG_SCALAR_PH]]
; CHECK: [[VEC_EPILOG_SCALAR_PH]]:
-; CHECK-NEXT: [[BC_RESUME_VAL17:%.*]] = phi i8 [ [[TMP16]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[IND_END]], %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[ITER_CHECK]] ]
+; CHECK-NEXT: [[BC_RESUME_VAL17:%.*]] = phi i8 [ [[TMP16]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[TMP3]], %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[ITER_CHECK]] ]
; CHECK-NEXT: [[BC_MERGE_RDX18:%.*]] = phi i8 [ [[RDX_SELECT15]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[RDX_SELECT]], %[[VEC_EPILOG_ITER_CHECK]] ], [ [[START]], %[[ITER_CHECK]] ]
; CHECK-NEXT: br label %[[LOOP:.*]]
; CHECK: [[LOOP]]:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
index 44636222a8648..efa64c661ebc0 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
@@ -446,6 +446,8 @@ define void @trip_count_based_on_ptrtoint(i64 %x) "target-cpu"="apple-m1" {
; CHECK: vector.ph:
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 16
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP12:%.*]] = mul i64 [[N_VEC]], 4
+; CHECK-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[PTR_START]], i64 [[TMP12]]
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -465,8 +467,6 @@ define void @trip_count_based_on_ptrtoint(i64 %x) "target-cpu"="apple-m1" {
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
; CHECK: vec.epilog.iter.check:
-; CHECK-NEXT: [[TMP12:%.*]] = mul i64 [[N_VEC]], 4
-; CHECK-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[PTR_START]], i64 [[TMP12]]
; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_MOD_VF]], 4
; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]], !prof [[PROF11]]
; CHECK: vec.epilog.ph:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll
index eea496303a206..7ae16f782ed72 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll
@@ -39,7 +39,6 @@ define void @test_widen_ptr_induction(ptr %ptr.start.1) {
; CHECK: middle.blo...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
🐧 Linux x64 Test Results

Failed test: lldb-api :: functionalities/data-formatter/data-formatter-stl/generic/list/TestDataFormatterGenericList.py

If these failures are unrelated to your changes (for example tests are broken or flaky at HEAD), please open an issue at https://github.com/llvm/llvm-project/issues.
🪟 Windows x64 Test Results
✅ The build succeeded and all tests passed.
This patch tries to drastically simplify resume value handling for the scalar loop when vectorizing the epilogue. It uses a simpler, uniform approach for updating all resume values in the scalar loop:

1. Create ResumeForEpilogue recipes for all scalar resume phis in the main loop (the epilogue plan will have exactly the same scalar resume phis, in exactly the same order)
2. Update ::execute for ResumeForEpilogue to set the underlying value when executing. This is not super clean, but allows easy lookup of the generated IR value when we update the resume phis in the epilogue. Once we connect the 2 plans together explicitly, this can be removed.
3. Use the list of ResumeForEpilogue VPInstructions from the main loop to update the resume/bypass values from the epilogue.

This simplifies the code quite a bit, makes it more robust (should fix llvm#179407) and also fixes a mis-compile in the existing tests (see change in llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll, where previously we would incorrectly resume using the start value when the epilogue iteration check failed).

In some cases, we get simpler code due to additional CSE; in some cases, the induction end value computations get moved from the epilogue iteration check to the vector preheader. We could try to sink the instructions as cleanup, but it is probably not worth the trouble.
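To illustrate step 3, here is a rough toy model in plain Python (not LLVM code; block names and values are purely illustrative): a "phi" is modeled as a dict mapping predecessor-block name to incoming value, and the fixup pairs each main-loop resume phi with the corresponding epilogue scalar-preheader phi (same order in both plans) and copies the incoming value for the additional bypass block across.

```python
# Toy model (plain Python, not LLVM APIs) of the uniform resume-value fixup:
# a "phi" is a dict mapping predecessor-block name -> incoming value.

def fix_scalar_resume_values_from_bypass(resume_values, epi_scalar_ph_phis,
                                         bypass_block):
    """Pair each main-loop resume phi with the corresponding epilogue
    scalar-preheader phi and copy the bypass block's incoming value."""
    for main_resume_phi, epi_phi in zip(resume_values, epi_scalar_ph_phis):
        epi_phi[bypass_block] = main_resume_phi[bypass_block]

# Illustrative values: a canonical-IV resume and a reduction resume.
main_iv_resume = {"vec.epilog.iter.check": 96}   # main vector trip count
main_rdx_resume = {"vec.epilog.iter.check": 7}   # main loop's partial result
epi_iv_phi = {"vec.epilog.middle.block": 100, "vec.epilog.iter.check": 0}
epi_rdx_phi = {"vec.epilog.middle.block": 9, "vec.epilog.iter.check": 0}

fix_scalar_resume_values_from_bypass(
    [main_iv_resume, main_rdx_resume], [epi_iv_phi, epi_rdx_phi],
    "vec.epilog.iter.check")
print(epi_iv_phi["vec.epilog.iter.check"])   # -> 96
print(epi_rdx_phi["vec.epilog.iter.check"])  # -> 7
```

The point of the uniform approach is that no per-recipe pattern matching (reductions vs. inductions vs. FindIV) is needed; the pairing by position does all the work.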
f505665 to 90ca940
static SmallVector<VPInstruction *>
preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
Is it possible to make preparePlanFor(Main|Epilogue)VectorLoop totally uniform, and return a SmallVector of VPInstruction resume values or InstsToMove in both?
I am not sure exactly, their returns are very different, but perhaps I am missing a simplification you are thinking of?
…r-resume-value-handling
artagnon left a comment:
Hm, forgot to commit some changes?
7a7e332 to 83fbf86
Ah yes, sorry for that. Should be pushed now.
Damn, this is bad! Is the scalar TC really not kept anywhere? vec.epilog.resume.val is merely VectorTC, and ResumePhi is merely a phi from that start value up to ScalarTC, no?
It's a phi with incoming value from MiddleBlock (= vector trip count) and zero incoming value from the entry block; we have the original trip count, but we need the vector trip count to resume the canonical IV from there in the epilogue vector loop.
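A rough sketch in plain Python (illustrative only, not LLVM code) of what that canonical-IV resume phi computes, assuming the main vector loop peels off whole groups of VF*UF iterations:

```python
def canonical_iv_resume(trip_count, vf_times_uf, main_loop_taken):
    """Value of the canonical-IV resume phi in the main loop's scalar
    preheader: the main loop's vector trip count if the main vector loop
    ran (incoming from the middle block), else 0 (incoming from the
    entry / iteration-count check)."""
    n_vec = trip_count - trip_count % vf_times_uf  # main vector trip count
    return n_vec if main_loop_taken else 0

# E.g. 100 iterations, VF*UF = 16: the epilogue (or scalar loop) resumes at 96.
print(canonical_iv_resume(100, 16, True))   # -> 96
print(canonical_iv_resume(100, 16, False))  # -> 0
```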
artagnon left a comment:
Thanks for all the explanations! I think that I understand this well enough to approve, but note that all this code is new to me, and I've not played with it: I'm worried that this might cause miscompiles and crashes due to some context that we didn't think of, but that's why we have version control. I think further simplifications are possible, but let's not worry about that for now. LGTM, thanks!
…r-resume-value-handling
…r-resume-value-handling
…#185969) This patch tries to drastically simplify resume value handling for the scalar loop when vectorizing the epilogue (full description above). Fixes llvm#179407.
Looks like we're seeing a miscompilation after this patch, working on a reproducer.
We also see miscompilations, in different apps, so not just one. Shall we revert this, because I don't see a lot of movement in #187323?
There's no valid reproducer so far showing an issue. We can revert, but it would be good to also provide a concrete reproducer showing the issue.
This shows miscompiles in 549.fotonik3d_r in SPEC 2017 FP. And then other apps as well. Thanks for the revert.
Would you be able to provide an IR reproducer?
I will send Alexey a message; if he doesn't get there first, I will have a look.
…epilogue vec." (#187504) Reverts llvm/llvm-project#185969 This is suspected to cause a miscompile in 549.fotonik3d_r from SPEC 2017 FP
Unfortunately this change (which I'm guessing will reland without too many alterations) is causing a lot of regressions in CMSIS-DSP (embedded library) kernels that we expect to auto-vectorize. On a Cortex-A320, of the 322 kernels we test, 93 saw a regression, so almost 30%, by an average of 2.7 percent; the top 20 sit at 4.5%. This seems a bit high to me, especially when it's affecting this type of what seems to me at least fairly idiomatic library code. When we look at an example, we see that the interleave count in the relevant loop is reduced from 8 to 2.

Cut-back stand-alone version, compiled with …

The below assembly is from the relevant loop.

before: …

after: …
@stuij is it possible this is not the right PR? AFAICT it does not change the generated assembly for the C code at all; it just changes how the resume values are handled, without impacting the vectorization decision. On current main, codegen is also narrow: https://clang.godbolt.org/z/YoKKazsbs
@sjoerdmeijer relanded as 77fb848; please let me know if you are seeing any issues, and if so please provide a reproducer :)
Leaving this reverted without reverting #182146 is a bad state too. We're affected by this miscompilation then: #182146 (comment). Maybe #182146 should be reverted as well?
@alexfh it was re-landed yesterday.
@fhahn yes! I accidentally took the wrong function from the daily regression list to look at regressions caused by this patch, sorry! I bisected it, and it was indeed 92e44b2. We've seen regressions caused by related patches and we've got a ticket for this. We did actually try to tweak the max interleave count (to 4), but it did more harm than good for the CMSIS-DSP tests. It's ongoing. Thanks for having a think on it.

However, the list and amount of regressions that I quoted earlier in this thread do seem to be caused by this change. The numbers ebb and flow with this commit being applied and reverted (and it is pointed to by our automated bisecting). Examples of actual impacted tests are: arm_fill_f16, arm_q7_to_q15, arm_mat_add_f16. Looking at …
…c." (llvm#187504) Reverts llvm#185969 This is suspected to cause a miscompile in 549.fotonik3d_r from SPEC 2017 FP
This patch tries to drastically simplify resume value handling for the scalar loop when vectorizing the epilogue.
It uses a simpler, uniform approach for updating all resume values in the scalar loop:
1. Create ResumeForEpilogue recipes for all scalar resume phis in the main loop (the epilogue plan will have exactly the same scalar resume phis, in exactly the same order)
2. Update ::execute for ResumeForEpilogue to set the underlying value when executing. This is not super clean, but allows easy lookup of the generated IR value when we update the resume phis in the epilogue. Once we connect the 2 plans together explicitly, this can be removed.
3. Use the list of ResumeForEpilogue VPInstructions from the main loop to update the resume/bypass values from the epilogue.
This simplifies the code quite a bit, makes it more robust (should fix #179407) and also fixes a mis-compile in the existing tests (see change in llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll, where previously we would incorrectly resume using the start value when the epilogue iteration check failed).
In some cases, we get simpler code, due to additional CSE, in some cases the induction end value computations get moved from the epilogue iteration check to the vector preheader. We could try to sink the instructions as cleanup, but it is probably not worth the trouble.
Fixes #179407.
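The mis-compile mentioned above can be modeled with a small Python sketch (illustrative values and parameter names, not LLVM code): a sum reduction vectorized with a main loop and an epilogue, where the buggy behavior resumed the scalar remainder from the original start value, dropping the main loop's partial result, whenever the epilogue iteration-count check failed.

```python
def vectorized_sum(xs, start, vf_main=4, vf_epi=2, buggy=False):
    """Model of main + epilogue vectorization of a sum reduction over xs.
    The main loop handles a multiple of vf_main iterations; the epilogue
    runs only if at least vf_epi iterations remain. The modeled bug: when
    the epilogue check failed, the scalar remainder resumed from `start`
    instead of the main loop's partial result."""
    n = len(xs)
    n_main = n - n % vf_main
    main_result = start + sum(xs[:n_main])      # main vector loop
    rem = n - n_main
    if rem >= vf_epi:                           # epilogue iteration check
        n_epi = n_main + rem - rem % vf_epi
        resume, i = main_result + sum(xs[n_main:n_epi]), n_epi
    else:
        # Buggy path drops the main loop's work by resuming from `start`.
        resume, i = (start if buggy else main_result), n_main
    return resume + sum(xs[i:])                 # scalar remainder loop

xs = list(range(1, 11))                         # sum(xs) == 55
print(vectorized_sum(xs, 100, 4, 3))            # -> 155 (correct)
print(vectorized_sum(xs, 100, 4, 3, buggy=True))  # -> 119 (mis-compile)
```

With 10 elements and vf_main=4, two iterations remain after the main loop; with vf_epi=3 the epilogue check fails, which is exactly the case where the old code picked the wrong resume value.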