
[LV] Simplify and unify resume value handling for epilogue vec. (#185969)

Merged
fhahn merged 9 commits into llvm:main from fhahn:lv-simplify-epi-scalar-resume-value-handling
Mar 16, 2026

Conversation

@fhahn
Contributor

@fhahn fhahn commented Mar 11, 2026

This patch tries to drastically simplify resume value handling for the scalar loop when vectorizing the epilogue.

It uses a simpler, uniform approach for updating all resume values in the scalar loop:

  1. Create ResumeForEpilogue recipes for all scalar resume phis in the main loop (the epilogue plan will have exactly the same scalar resume phis, in exactly the same order)
  2. Update ::execute for ResumeForEpilogue to set the underlying value when executing. This is not super clean, but allows easy lookup of the generated IR value when we update the resume phis in the epilogue. Once we connect the two plans together explicitly, this can be removed.
  3. Use the list of ResumeForEpilogue VPInstructions from the main loop to update the resume/bypass values from the epilogue.

This simplifies the code quite a bit, makes it more robust (should fix #179407), and also fixes a miscompile in the existing tests (see the change in llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll, where previously we would incorrectly resume using the start value when the epilogue iteration check failed).

In some cases we get simpler code due to additional CSE; in other cases the induction end-value computations move from the epilogue iteration check to the vector preheader. We could sink those instructions as a cleanup, but it is probably not worth the trouble.

Fixes #179407.

@llvmbot
Member

llvmbot commented Mar 11, 2026

@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-backend-powerpc

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)



Patch is 98.46 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/185969.diff

33 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+52-191)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll (+2-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/f128-fmuladd-reduction.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/force-target-instruction-cost.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll (+5-7)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/interleave-with-runtime-checks.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll (+14-14)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/neon-inloop-reductions.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-dot-product-epilogue.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/transform-narrow-interleave-to-widen-memory-cost.ll (+6-8)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/transform-narrow-interleave-to-widen-memory-scalable.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/optimal-epilog-vectorization.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/conversion-cost.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/cost-model.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/epilog-vectorization-inductions.ll (+2-5)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/float-induction-x86.ll (+2-8)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll (+2-4)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/intrinsiccost.ll (+14-14)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/scatter_crash.ll (+4-10)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/transform-narrow-interleave-to-widen-memory-epilogue-vec.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/transform-narrow-interleave-to-widen-memory.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/epilog-iv-select-cmp.ll (+2-3)
  • (modified) llvm/test/Transforms/LoopVectorize/epilog-vectorization-any-of-reductions.ll (+3-4)
  • (modified) llvm/test/Transforms/LoopVectorize/epilog-vectorization-reductions.ll (+2-4)
  • (modified) llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization.ll (+3-5)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 53a9bdb113b65..5299ba4f88b4a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -2423,20 +2423,6 @@ BasicBlock *InnerLoopVectorizer::createScalarPreheader(StringRef Prefix) {
                     Twine(Prefix) + "scalar.ph");
 }
 
-/// Return the expanded step for \p ID using \p ExpandedSCEVs to look up SCEV
-/// expansion results.
-static Value *getExpandedStep(const InductionDescriptor &ID,
-                              const SCEV2ValueTy &ExpandedSCEVs) {
-  const SCEV *Step = ID.getStep();
-  if (auto *C = dyn_cast<SCEVConstant>(Step))
-    return C->getValue();
-  if (auto *U = dyn_cast<SCEVUnknown>(Step))
-    return U->getValue();
-  Value *V = ExpandedSCEVs.lookup(Step);
-  assert(V && "SCEV must be expanded at this point");
-  return V;
-}
-
 /// Knowing that loop \p L executes a single vector iteration, add instructions
 /// that will get simplified and thus should not have any cost to \p
 /// InstsToIgnore.
@@ -7341,72 +7327,6 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
   return BestFactor;
 }
 
-// If \p EpiResumePhiR is resume VPPhi for a reduction when vectorizing the
-// epilog loop, fix the reduction's scalar PHI node by adding the incoming value
-// from the main vector loop.
-static void fixReductionScalarResumeWhenVectorizingEpilog(
-    VPPhi *EpiResumePhiR, PHINode &EpiResumePhi, BasicBlock *BypassBlock) {
-  using namespace VPlanPatternMatch;
-  // Get the VPInstruction computing the reduction result in the middle block.
-  // The first operand may not be from the middle block if it is not connected
-  // to the scalar preheader. In that case, there's nothing to fix.
-  VPValue *Incoming = EpiResumePhiR->getOperand(0);
-  match(Incoming, VPlanPatternMatch::m_ZExtOrSExt(
-                      VPlanPatternMatch::m_VPValue(Incoming)));
-  auto *EpiRedResult = dyn_cast<VPInstruction>(Incoming);
-  if (!EpiRedResult)
-    return;
-
-  VPValue *BackedgeVal;
-  bool IsFindIV = false;
-  if (EpiRedResult->getOpcode() == VPInstruction::ComputeAnyOfResult ||
-      EpiRedResult->getOpcode() == VPInstruction::ComputeReductionResult)
-    BackedgeVal = EpiRedResult->getOperand(EpiRedResult->getNumOperands() - 1);
-  else if (matchFindIVResult(EpiRedResult, m_VPValue(BackedgeVal), m_VPValue()))
-    IsFindIV = true;
-  else
-    return;
-
-  auto *EpiRedHeaderPhi = cast_if_present<VPReductionPHIRecipe>(
-      vputils::findRecipe(BackedgeVal, IsaPred<VPReductionPHIRecipe>));
-  if (!EpiRedHeaderPhi) {
-    match(BackedgeVal,
-          VPlanPatternMatch::m_Select(VPlanPatternMatch::m_VPValue(),
-                                      VPlanPatternMatch::m_VPValue(BackedgeVal),
-                                      VPlanPatternMatch::m_VPValue()));
-    EpiRedHeaderPhi = cast<VPReductionPHIRecipe>(
-        vputils::findRecipe(BackedgeVal, IsaPred<VPReductionPHIRecipe>));
-  }
-
-  Value *MainResumeValue;
-  if (auto *VPI = dyn_cast<VPInstruction>(EpiRedHeaderPhi->getStartValue())) {
-    assert((VPI->getOpcode() == VPInstruction::Broadcast ||
-            VPI->getOpcode() == VPInstruction::ReductionStartVector) &&
-           "unexpected start recipe");
-    MainResumeValue = VPI->getOperand(0)->getUnderlyingValue();
-  } else
-    MainResumeValue = EpiRedHeaderPhi->getStartValue()->getUnderlyingValue();
-  if (EpiRedResult->getOpcode() == VPInstruction::ComputeAnyOfResult) {
-    [[maybe_unused]] Value *StartV =
-        EpiRedResult->getOperand(0)->getLiveInIRValue();
-    auto *Cmp = cast<ICmpInst>(MainResumeValue);
-    assert(Cmp->getPredicate() == CmpInst::ICMP_NE &&
-           "AnyOf expected to start with ICMP_NE");
-    assert(Cmp->getOperand(1) == StartV &&
-           "AnyOf expected to start by comparing main resume value to original "
-           "start value");
-    MainResumeValue = Cmp->getOperand(0);
-  } else if (IsFindIV) {
-    MainResumeValue = cast<SelectInst>(MainResumeValue)->getFalseValue();
-  }
-  PHINode *MainResumePhi = cast<PHINode>(MainResumeValue);
-
-  // When fixing reductions in the epilogue loop we should already have
-  // created a bc.merge.rdx Phi after the main vector body. Ensure that we carry
-  // over the incoming values correctly.
-  EpiResumePhi.setIncomingValueForBlock(
-      BypassBlock, MainResumePhi->getIncomingValueForBlock(BypassBlock));
-}
 
 DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
     ElementCount BestVF, unsigned BestUF, VPlan &BestVPlan,
@@ -8991,35 +8911,9 @@ LoopVectorizePass::LoopVectorizePass(LoopVectorizeOptions Opts)
                               !EnableLoopVectorization) {}
 
 /// Prepare \p MainPlan for vectorizing the main vector loop during epilogue
-/// vectorization. Remove ResumePhis from \p MainPlan for inductions that
-/// don't have a corresponding wide induction in \p EpiPlan.
-static void preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
-  // Collect PHI nodes of widened phis in the VPlan for the epilogue. Those
-  // will need their resume-values computed in the main vector loop. Others
-  // can be removed from the main VPlan.
-  SmallPtrSet<PHINode *, 2> EpiWidenedPhis;
-  for (VPRecipeBase &R :
-       EpiPlan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
-    if (isa<VPCanonicalIVPHIRecipe>(&R))
-      continue;
-    EpiWidenedPhis.insert(
-        cast<PHINode>(R.getVPSingleValue()->getUnderlyingValue()));
-  }
-  for (VPRecipeBase &R :
-       make_early_inc_range(MainPlan.getScalarHeader()->phis())) {
-    auto *VPIRInst = cast<VPIRPhi>(&R);
-    if (EpiWidenedPhis.contains(&VPIRInst->getIRPhi()))
-      continue;
-    // There is no corresponding wide induction in the epilogue plan that would
-    // need a resume value. Remove the VPIRInst wrapping the scalar header phi
-    // together with the corresponding ResumePhi. The resume values for the
-    // scalar loop will be created during execution of EpiPlan.
-    VPRecipeBase *ResumePhi = VPIRInst->getOperand(0)->getDefiningRecipe();
-    VPIRInst->eraseFromParent();
-    ResumePhi->eraseFromParent();
-  }
-  RUN_VPLAN_PASS(VPlanTransforms::removeDeadRecipes, MainPlan);
-
+/// vectorization.
+static SmallVector<VPInstruction *>
+preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
   using namespace VPlanPatternMatch;
   // When vectorizing the epilogue, FindFirstIV & FindLastIV reductions can
   // introduce multiple uses of undef/poison. If the reduction start value may
@@ -9076,14 +8970,28 @@ static void preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
         {}, "vec.epilog.resume.val");
   } else {
     ResumePhi = cast<VPPhi>(&*ResumePhiIter);
-    if (MainScalarPH->begin() == MainScalarPH->end())
-      ResumePhi->moveBefore(*MainScalarPH, MainScalarPH->end());
-    else if (&*MainScalarPH->begin() != ResumePhi)
+    ResumePhi->setName("vec.epilog.resume.val");
+    if (&*MainScalarPH->begin() != ResumePhi)
       ResumePhi->moveBefore(*MainScalarPH, MainScalarPH->begin());
   }
-  // Add a user to to make sure the resume phi won't get removed.
-  VPBuilder(MainScalarPH)
-      .createNaryOp(VPInstruction::ResumeForEpilogue, ResumePhi);
+
+  // Create a ResumeForEpilogue for the canonical IV resume as the
+  // first non-phi, to keep it alive for the epilogue.
+  VPBuilder ResumeBuilder(MainScalarPH);
+  ResumeBuilder.createNaryOp(VPInstruction::ResumeForEpilogue, ResumePhi);
+
+  // Collect resume values for epilogue bypass fixup. Create
+  // ResumeForEpilogue for scalar preheader phis to keep them alive.
+  SmallVector<VPInstruction *> ResumeValues;
+  for (VPRecipeBase &R : MainPlan.getScalarHeader()->phis()) {
+    auto *VPIRInst = cast<VPIRPhi>(&R);
+    VPValue *ResumeOp = VPIRInst->getOperand(0);
+    auto *Resume =
+        ResumeBuilder.createNaryOp(VPInstruction::ResumeForEpilogue, ResumeOp);
+    ResumeValues.push_back(Resume);
+  }
+
+  return ResumeValues;
 }
 
 /// Prepare \p Plan for vectorizing the epilogue loop. That is, re-use expanded
@@ -9291,39 +9199,11 @@ static SmallVector<Instruction *> preparePlanForEpilogueVectorLoop(
   return InstsToMove;
 }
 
-// Generate bypass values from the additional bypass block. Note that when the
-// vectorized epilogue is skipped due to iteration count check, then the
-// resume value for the induction variable comes from the trip count of the
-// main vector loop, passed as the second argument.
-static Value *createInductionAdditionalBypassValues(
-    PHINode *OrigPhi, const InductionDescriptor &II, IRBuilder<> &BypassBuilder,
-    const SCEV2ValueTy &ExpandedSCEVs, Value *MainVectorTripCount,
-    Instruction *OldInduction) {
-  Value *Step = getExpandedStep(II, ExpandedSCEVs);
-  // For the primary induction the additional bypass end value is known.
-  // Otherwise it is computed.
-  Value *EndValueFromAdditionalBypass = MainVectorTripCount;
-  if (OrigPhi != OldInduction) {
-    auto *BinOp = II.getInductionBinOp();
-    // Fast-math-flags propagate from the original induction instruction.
-    if (isa_and_nonnull<FPMathOperator>(BinOp))
-      BypassBuilder.setFastMathFlags(BinOp->getFastMathFlags());
-
-    // Compute the end value for the additional bypass.
-    EndValueFromAdditionalBypass =
-        emitTransformedIndex(BypassBuilder, MainVectorTripCount,
-                             II.getStartValue(), Step, II.getKind(), BinOp);
-    EndValueFromAdditionalBypass->setName("ind.end");
-  }
-  return EndValueFromAdditionalBypass;
-}
-
-static void fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
-                                            VPlan &BestEpiPlan,
-                                            LoopVectorizationLegality &LVL,
-                                            const SCEV2ValueTy &ExpandedSCEVs,
-                                            Value *MainVectorTripCount) {
-  // Fix reduction resume values from the additional bypass block.
+static void
+fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
+                                VPlan &BestEpiPlan,
+                                ArrayRef<VPInstruction *> ResumeValues) {
+  // Fix resume values from the additional bypass block.
   BasicBlock *PH = L->getLoopPreheader();
   for (auto *Pred : predecessors(PH)) {
     for (PHINode &Phi : PH->phis()) {
@@ -9334,40 +9214,13 @@ static void fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
   }
   auto *ScalarPH = cast<VPIRBasicBlock>(BestEpiPlan.getScalarPreheader());
   if (ScalarPH->hasPredecessors()) {
-    // If ScalarPH has predecessors, we may need to update its reduction
-    // resume values.
-    for (const auto &[R, IRPhi] :
-         zip(ScalarPH->phis(), ScalarPH->getIRBasicBlock()->phis())) {
-      fixReductionScalarResumeWhenVectorizingEpilog(cast<VPPhi>(&R), IRPhi,
-                                                    BypassBlock);
-    }
-  }
-
-  // Fix induction resume values from the additional bypass block.
-  IRBuilder<> BypassBuilder(BypassBlock, BypassBlock->getFirstInsertionPt());
-  for (const auto &[IVPhi, II] : LVL.getInductionVars()) {
-    Value *V = createInductionAdditionalBypassValues(
-        IVPhi, II, BypassBuilder, ExpandedSCEVs, MainVectorTripCount,
-        LVL.getPrimaryInduction());
-    // TODO: Directly add as extra operand to the VPResumePHI recipe.
-    if (auto *Inc = dyn_cast<PHINode>(IVPhi->getIncomingValueForBlock(PH))) {
-      if (Inc->getBasicBlockIndex(BypassBlock) != -1)
-        Inc->setIncomingValueForBlock(BypassBlock, V);
-    } else {
-      // If the resume value in the scalar preheader was simplified (e.g., when
-      // narrowInterleaveGroups optimized away the resume PHIs), create a new
-      // PHI to merge the bypass value with the original value.
-      Value *OrigVal = IVPhi->getIncomingValueForBlock(PH);
-      PHINode *NewPhi =
-          PHINode::Create(IVPhi->getType(), pred_size(PH), "bc.resume.val",
-                          PH->getFirstNonPHIIt());
-      for (auto *Pred : predecessors(PH)) {
-        if (Pred == BypassBlock)
-          NewPhi->addIncoming(V, Pred);
-        else
-          NewPhi->addIncoming(OrigVal, Pred);
-      }
-      IVPhi->setIncomingValueForBlock(PH, NewPhi);
+    // Fix resume values for inductions and reductions from the additional
+    // bypass block using the incoming values from the main loop's resume phis.
+    for (auto [ResumeV, IRPhi] :
+         zip(ResumeValues, ScalarPH->getIRBasicBlock()->phis())) {
+      auto *MainResumePhi = cast<PHINode>(ResumeV->getUnderlyingValue());
+      IRPhi.setIncomingValueForBlock(
+          BypassBlock, MainResumePhi->getIncomingValueForBlock(BypassBlock));
     }
   }
 }
@@ -9377,11 +9230,12 @@ static void fixScalarResumeValuesFromBypass(BasicBlock *BypassBlock, Loop *L,
 // and runtime checks of the main loop, as well as updating various phis. \p
 // InstsToMove contains instructions that need to be moved to the preheader of
 // the epilogue vector loop.
-static void connectEpilogueVectorLoop(
-    VPlan &EpiPlan, Loop *L, EpilogueLoopVectorizationInfo &EPI,
-    DominatorTree *DT, LoopVectorizationLegality &LVL,
-    DenseMap<const SCEV *, Value *> &ExpandedSCEVs, GeneratedRTChecks &Checks,
-    ArrayRef<Instruction *> InstsToMove) {
+static void connectEpilogueVectorLoop(VPlan &EpiPlan, Loop *L,
+                                      EpilogueLoopVectorizationInfo &EPI,
+                                      DominatorTree *DT,
+                                      GeneratedRTChecks &Checks,
+                                      ArrayRef<Instruction *> InstsToMove,
+                                      ArrayRef<VPInstruction *> ResumeValues) {
   BasicBlock *VecEpilogueIterationCountCheck =
       cast<VPIRBasicBlock>(EpiPlan.getEntry())->getIRBasicBlock();
 
@@ -9464,7 +9318,13 @@ static void connectEpilogueVectorLoop(
   // after executing the main loop. We need to update the resume values of
   // inductions and reductions during epilogue vectorization.
   fixScalarResumeValuesFromBypass(VecEpilogueIterationCountCheck, L, EpiPlan,
-                                  LVL, ExpandedSCEVs, EPI.VectorTripCount);
+                                  ResumeValues);
+
+  // Remove dead phis that were moved to the epilogue preheader but are unused
+  // (e.g., resume phis for inductions not widened in the epilogue vector loop).
+  for (PHINode &Phi : make_early_inc_range(VecEpiloguePreHeader->phis()))
+    if (Phi.use_empty())
+      Phi.eraseFromParent();
 }
 
 bool LoopVectorizePass::processLoop(Loop *L) {
@@ -9851,7 +9711,8 @@ bool LoopVectorizePass::processLoop(Loop *L) {
     VPlan &BestEpiPlan = LVP.getPlanFor(EpilogueVF.Width);
     BestEpiPlan.getMiddleBlock()->setName("vec.epilog.middle.block");
     BestEpiPlan.getVectorPreheader()->setName("vec.epilog.ph");
-    preparePlanForMainVectorLoop(*BestMainPlan, BestEpiPlan);
+    auto ResumeValues =
+        preparePlanForMainVectorLoop(*BestMainPlan, BestEpiPlan);
     EpilogueLoopVectorizationInfo EPI(VF.Width, IC, EpilogueVF.Width, 1,
                                       BestEpiPlan);
     EpilogueVectorizerMainLoop MainILV(L, PSE, LI, DT, TTI, AC, EPI, &CM,
@@ -9868,8 +9729,8 @@ bool LoopVectorizePass::processLoop(Loop *L) {
         BestEpiPlan, L, ExpandedSCEVs, EPI, CM, *PSE.getSE());
     LVP.executePlan(EPI.EpilogueVF, EPI.EpilogueUF, BestEpiPlan, EpilogILV, DT,
                     true);
-    connectEpilogueVectorLoop(BestEpiPlan, L, EPI, DT, LVL, ExpandedSCEVs,
-                              Checks, InstsToMove);
+    connectEpilogueVectorLoop(BestEpiPlan, L, EPI, DT, Checks, InstsToMove,
+                              ResumeValues);
     ++LoopsEpilogueVectorized;
   } else {
     InnerLoopVectorizer LB(L, PSE, LI, DT, TTI, AC, VF.Width, IC, &CM, Checks,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 12aafd8e72c22..a876c02fe073b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -1321,6 +1321,8 @@ void VPInstruction::execute(VPTransformState &State) {
          "scalar value but not only first lane defined");
   State.set(this, GeneratedValue,
             /*IsScalar*/ GeneratesPerFirstLaneOnly);
+  if (getOpcode() == VPInstruction::ResumeForEpilogue)
+    setUnderlyingValue(GeneratedValue);
 }
 
 bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll
index 4eced1640bd14..ae1f7829449da 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-iv-select-cmp.ll
@@ -46,7 +46,6 @@ define i8 @select_icmp_var_start(ptr %a, i8 %n, i8 %start) {
 ; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i32 [[TMP2]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]]
 ; CHECK:       [[VEC_EPILOG_ITER_CHECK]]:
-; CHECK-NEXT:    [[IND_END:%.*]] = trunc i32 [[N_VEC]] to i8
 ; CHECK-NEXT:    [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i32 [[N_MOD_VF]], 8
 ; CHECK-NEXT:    br i1 [[MIN_EPILOG_ITERS_CHECK]], label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]], !prof [[PROF3:![0-9]+]]
 ; CHECK:       [[VEC_EPILOG_PH]]:
@@ -84,7 +83,7 @@ define i8 @select_icmp_var_start(ptr %a, i8 %n, i8 %start) {
 ; CHECK-NEXT:    [[CMP_N16:%.*]] = icmp eq i32 [[TMP2]], [[N_VEC5]]
 ; CHECK-NEXT:    br i1 [[CMP_N16]], label %[[EXIT]], label %[[VEC_EPILOG_SCALAR_PH]]
 ; CHECK:       [[VEC_EPILOG_SCALAR_PH]]:
-; CHECK-NEXT:    [[BC_RESUME_VAL17:%.*]] = phi i8 [ [[TMP16]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[IND_END]], %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[ITER_CHECK]] ]
+; CHECK-NEXT:    [[BC_RESUME_VAL17:%.*]] = phi i8 [ [[TMP16]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[TMP3]], %[[VEC_EPILOG_ITER_CHECK]] ], [ 0, %[[ITER_CHECK]] ]
 ; CHECK-NEXT:    [[BC_MERGE_RDX18:%.*]] = phi i8 [ [[RDX_SELECT15]], %[[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[RDX_SELECT]], %[[VEC_EPILOG_ITER_CHECK]] ], [ [[START]], %[[ITER_CHECK]] ]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
index 44636222a8648..efa64c661ebc0 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-factors.ll
@@ -446,6 +446,8 @@ define void @trip_count_based_on_ptrtoint(i64 %x) "target-cpu"="apple-m1" {
 ; CHECK:       vector.ph:
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 16
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP12:%.*]] = mul i64 [[N_VEC]], 4
+; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[PTR_START]], i64 [[TMP12]]
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -465,8 +467,6 @@ define void @trip_count_based_on_ptrtoint(i64 %x) "target-cpu"="apple-m1" {
 ; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
 ; CHECK:       vec.epilog.iter.check:
-; CHECK-NEXT:    [[TMP12:%.*]] = mul i64 [[N_VEC]], 4
-; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[PTR_START]], i64 [[TMP12]]
 ; CHECK-NEXT:    [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_MOD_VF]], 4
 ; CHECK-NEXT:    br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]], !prof [[PROF11]]
 ; CHECK:       vec.epilog.ph:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll
index eea496303a206..7ae16f782ed72 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/epilog-vectorization-widen-inductions.ll
@@ -39,7 +39,6 @@ define void @test_widen_ptr_induction(ptr %ptr.start.1) {
 ; CHECK:       middle.blo...
[truncated]

@github-actions

github-actions bot commented Mar 11, 2026

✅ With the latest revision this PR passed the C/C++ code formatter.

@github-actions

github-actions bot commented Mar 11, 2026

🐧 Linux x64 Test Results

  • 171290 tests passed
  • 3041 tests skipped
  • 1 test failed

Failed Tests


lldb-api

lldb-api.functionalities/data-formatter/data-formatter-stl/generic/list/TestDataFormatterGenericList.py
Script:
--
/usr/bin/python3 /home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/test/API/dotest.py -u CXXFLAGS -u CFLAGS --env LLVM_LIBS_DIR=/home/gha/actions-runner/_work/llvm-project/llvm-project/build/./lib --env LLVM_INCLUDE_DIR=/home/gha/actions-runner/_work/llvm-project/llvm-project/build/include --env LLVM_TOOLS_DIR=/home/gha/actions-runner/_work/llvm-project/llvm-project/build/./bin --libcxx-include-dir /home/gha/actions-runner/_work/llvm-project/llvm-project/build/include/c++/v1 --libcxx-include-target-dir /home/gha/actions-runner/_work/llvm-project/llvm-project/build/include/x86_64-unknown-linux-gnu/c++/v1 --libcxx-library-dir /home/gha/actions-runner/_work/llvm-project/llvm-project/build/./lib/x86_64-unknown-linux-gnu --arch x86_64 --build-dir /home/gha/actions-runner/_work/llvm-project/llvm-project/build/lldb-test-build.noindex --lldb-module-cache-dir /home/gha/actions-runner/_work/llvm-project/llvm-project/build/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir /home/gha/actions-runner/_work/llvm-project/llvm-project/build/lldb-test-build.noindex/module-cache-clang/lldb-api --executable /home/gha/actions-runner/_work/llvm-project/llvm-project/build/./bin/lldb --compiler /home/gha/actions-runner/_work/llvm-project/llvm-project/build/./bin/clang --dsymutil /home/gha/actions-runner/_work/llvm-project/llvm-project/build/./bin/dsymutil --make /usr/bin/gmake --llvm-tools-dir /home/gha/actions-runner/_work/llvm-project/llvm-project/build/./bin --lldb-obj-root /home/gha/actions-runner/_work/llvm-project/llvm-project/build/tools/lldb --lldb-libs-dir /home/gha/actions-runner/_work/llvm-project/llvm-project/build/./lib --cmake-build-type Release /home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/test/API/functionalities/data-formatter/data-formatter-stl/generic/list -p TestDataFormatterGenericList.py
--
Exit Code: 1

Command Output (stdout):
--
lldb version 23.0.0git (https://github.com/llvm/llvm-project revision 11010d8b9a43493609d3e609fd431d57cc5e1224)
  clang revision 11010d8b9a43493609d3e609fd431d57cc5e1224
  llvm revision 11010d8b9a43493609d3e609fd431d57cc5e1224
Skipping the following test categories: msvcstl, dsym, pdb, gmodules, debugserver, objc

--
Command Output (stderr):
--
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_libcpp_dsym (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_libcpp_dsym) (test case does not fall in any category of interest for this run) 
PASS: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_libcpp_dwarf (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_libcpp_dwarf)
PASS: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_libcpp_dwo (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_libcpp_dwo)
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_libcpp_pdb (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_libcpp_pdb) (test case does not fall in any category of interest for this run) 
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_libstdcpp_dsym (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_libstdcpp_dsym) (test case does not fall in any category of interest for this run) 
PASS: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_libstdcpp_dwarf (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_libstdcpp_dwarf)
PASS: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_libstdcpp_dwo (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_libstdcpp_dwo)
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_libstdcpp_pdb (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_libstdcpp_pdb) (test case does not fall in any category of interest for this run) 
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_msvcstl_dsym (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_msvcstl_dsym) (test case does not fall in any category of interest for this run) 
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_msvcstl_dwarf (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_msvcstl_dwarf) (test case does not fall in any category of interest for this run) 
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_msvcstl_dwo (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_msvcstl_dwo) (test case does not fall in any category of interest for this run) 
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_ptr_and_ref_msvcstl_pdb (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_ptr_and_ref_msvcstl_pdb) (test case does not fall in any category of interest for this run) 
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_libcpp_dsym (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_libcpp_dsym) (test case does not fall in any category of interest for this run) 
PASS: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_libcpp_dwarf (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_libcpp_dwarf)
PASS: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_libcpp_dwo (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_libcpp_dwo)
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_libcpp_pdb (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_libcpp_pdb) (test case does not fall in any category of interest for this run) 
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_libstdcpp_dsym (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_libstdcpp_dsym) (test case does not fall in any category of interest for this run) 
FAIL: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_libstdcpp_dwarf (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_libstdcpp_dwarf)
FAIL: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_libstdcpp_dwo (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_libstdcpp_dwo)
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_libstdcpp_pdb (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_libstdcpp_pdb) (test case does not fall in any category of interest for this run) 
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_msvcstl_dsym (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_msvcstl_dsym) (test case does not fall in any category of interest for this run) 
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_msvcstl_dwarf (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_msvcstl_dwarf) (test case does not fall in any category of interest for this run) 
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_msvcstl_dwo (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_msvcstl_dwo) (test case does not fall in any category of interest for this run) 
UNSUPPORTED: LLDB (/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang-x86_64) :: test_with_run_command_msvcstl_pdb (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_msvcstl_pdb) (test case does not fall in any category of interest for this run) 
======================================================================
FAIL: test_with_run_command_libstdcpp_dwarf (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_libstdcpp_dwarf)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2012, in test_method
    return attrvalue(self)
           ^^^^^^^^^^^^^^^
  File "/home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/test/API/functionalities/data-formatter/data-formatter-stl/generic/list/TestDataFormatterGenericList.py", line 302, in test_with_run_command_libstdcpp
    self.do_test_with_run_command(is_libstdcpp=True)
  File "/home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/test/API/functionalities/data-formatter/data-formatter-stl/generic/list/TestDataFormatterGenericList.py", line 64, in do_test_with_run_command
    self.expect(
  File "/home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2663, in expect
    self.runCmd(
  File "/home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 1049, in runCmd
    self.assertTrue(self.res.Succeeded(), msg + output)
AssertionError: False is not true : Command 'frame variable &numbers_list._M_impl._M_node --raw' did not return successfully
Error output:
error: <user expression 0>:1:15: "_M_impl" is not a member of "(int_list) numbers_list"
   1 | &numbers_list._M_impl._M_node
     | ^

Config=x86_64-/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang
======================================================================
FAIL: test_with_run_command_libstdcpp_dwo (TestDataFormatterGenericList.GenericListDataFormatterTestCase.test_with_run_command_libstdcpp_dwo)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2012, in test_method
    return attrvalue(self)
           ^^^^^^^^^^^^^^^
  File "/home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/test/API/functionalities/data-formatter/data-formatter-stl/generic/list/TestDataFormatterGenericList.py", line 302, in test_with_run_command_libstdcpp
    self.do_test_with_run_command(is_libstdcpp=True)
  File "/home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/test/API/functionalities/data-formatter/data-formatter-stl/generic/list/TestDataFormatterGenericList.py", line 64, in do_test_with_run_command
    self.expect(
  File "/home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2663, in expect
    self.runCmd(
  File "/home/gha/actions-runner/_work/llvm-project/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 1049, in runCmd
    self.assertTrue(self.res.Succeeded(), msg + output)
AssertionError: False is not true : Command 'frame variable &numbers_list._M_impl._M_node --raw' did not return successfully
Error output:
error: <user expression 0>:1:15: "_M_impl" is not a member of "(int_list) numbers_list"
   1 | &numbers_list._M_impl._M_node
     | ^

Config=x86_64-/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/clang
----------------------------------------------------------------------
Ran 24 tests in 2.145s

FAILED (failures=2, skipped=16)

--

If these failures are unrelated to your changes (for example tests are broken or flaky at HEAD), please open an issue at https://github.com/llvm/llvm-project/issues and add the infrastructure label.

@github-actions

github-actions bot commented Mar 11, 2026

🪟 Windows x64 Test Results

  • 132306 tests passed
  • 3009 tests skipped

✅ The build succeeded and all tests passed.

This patch tries to drastically simplify resume value handling for the
scalar loop when vectorizing the epilogue.

It uses a simpler, uniform approach for updating all resume values in
the scalar loop:

1. Create ResumeForEpilogue recipes for all scalar resume phis in the
   main loop (the epilogue plan will have exactly the same scalar resume
   phis, in exactly the same order)
2. Update ::execute for ResumeForEpilogue to set the underlying value
   when executing. This is not super clean, but allows easy lookup of
   the generated IR value when we update the resume phis in the
   epilogue. Once we connect the 2 plans together explicitly, this can
   be removed.
3. Use the list of ResumeForEpilogue VPInstructions from the main loop
   to update the resume/bypass values from the epilogue.

This simplifies the code quite a bit, makes it more robust (should
fix llvm#179407)
and also fixes a miscompile in the existing tests
(see change in llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll,
where previously we would incorrectly resume using the start value when the
epilogue iteration check failed)

In some cases we get simpler code due to additional CSE; in other cases
the induction end value computations get moved from the epilogue
iteration check to the vector preheader. We could try to sink the
instructions as cleanup, but it is probably not worth the trouble.
@fhahn fhahn force-pushed the lv-simplify-epi-scalar-resume-value-handling branch from f505665 to 90ca940 Compare March 11, 2026 21:56
Comment on lines +8914 to +8915
static SmallVector<VPInstruction *>
preparePlanForMainVectorLoop(VPlan &MainPlan, VPlan &EpiPlan) {
Contributor

Is it possible to make preparePlanFor(Main|Epilogue)VectorLoop totally uniform, and return a SmallVector of VPInstruction resume values or InstsToMove in both?

Contributor Author

I am not sure exactly, their returns are very different, but perhaps I am missing a simplification you are thinking of?

Contributor

@artagnon artagnon left a comment

Hm, forgot to commit some changes?

@fhahn fhahn force-pushed the lv-simplify-epi-scalar-resume-value-handling branch from 7a7e332 to 83fbf86 Compare March 12, 2026 15:31
@fhahn
Contributor Author

fhahn commented Mar 12, 2026

Hm, forgot to commit some changes?

Ah yes, sorry for that. Should be pushed now.

Comment on lines 8960 to 8961
Contributor

Damn, this is bad! Is the scalar TC really not kept anywhere? vec.epilog.resume.val is merely VectorTC, and ResumePhi is merely a phi from that start value up to ScalarTC, no?

Contributor Author

It's a phi whose incoming value from the middle block is the vector trip count, with a zero incoming value from the entry block. We have the original trip count, but we need the vector trip count to resume the canonical IV from there in the epilogue vector loop.

Contributor

@artagnon artagnon left a comment

Thanks for all the explanations! I think that I understand this well enough to approve, but note that all this code is new to me, and I've not played with it: I'm worried that this might cause miscompiles and crashes due to some context that we didn't think of, but that's why we have version control. I think further simplifications are possible, but let's not worry about that for now. LGTM, thanks!

@fhahn fhahn enabled auto-merge (squash) March 16, 2026 21:12
@fhahn fhahn merged commit 013f254 into llvm:main Mar 16, 2026
9 of 10 checks passed
@fhahn fhahn deleted the lv-simplify-epi-scalar-resume-value-handling branch March 16, 2026 22:19
neonetizen pushed a commit to neonetizen/llvm-project that referenced this pull request Mar 17, 2026
…#185969)

This patch tries to drastically simplify resume value handling for the
scalar loop when vectorizing the epilogue.

It uses a simpler, uniform approach for updating all resume values in
the scalar loop:

1. Create ResumeForEpilogue recipes for all scalar resume phis in the
main loop (the epilogue plan will have exactly the same scalar resume
phis, in exactly the same order)
2. Update ::execute for ResumeForEpilogue to set the underlying value
when executing. This is not super clean, but allows easy lookup of the
generated IR value when we update the resume phis in the epilogue. Once
we connect the 2 plans together explicitly, this can be removed.
3. Use the list of ResumeForEpilogue VPInstructions from the main loop
to update the resume/bypass values from the epilogue.

This simplifies the code quite a bit, makes it more robust (should fix
llvm#179407) and also fixes a
miscompile in the existing tests (see the change in
llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-sub-epilogue-vec.ll,
where previously we would incorrectly resume using the start value when
the epilogue iteration check failed)

In some cases we get simpler code due to additional CSE; in other cases
the induction end value computations get moved from the epilogue
iteration check to the vector preheader. We could try to sink the
instructions as cleanup, but it is probably not worth the trouble.

Fixes llvm#179407.
@alexey-bataev
Member

Looks like we're seeing a miscompilation after this patch, working on a reproducer

@alexey-bataev
Member

#187323

@sjoerdmeijer
Collaborator

We also see miscompilations, in different apps, so not just one. Shall we revert this because I don't see a lot of movement in #187323?

@fhahn
Contributor Author

fhahn commented Mar 19, 2026

We also see miscompilations, in different apps, so not just one. Shall we revert this because I don't see a lot of movement in #187323?

There's no valid reproducer so far showing an issue. We can revert, but would be good to also provide a concrete reproducer showing the issue

@sjoerdmeijer
Collaborator

We also see miscompilations, in different apps, so not just one. Shall we revert this because I don't see a lot of movement in #187323?

There's no valid reproducer so far showing an issue. We can revert, but would be good to also provide a concrete reproducer showing the issue

This shows miscompiles in 549.fotonik3d_r in SPEC 2017 FP. And then other apps as well.

Thanks for the revert.

@fhahn
Contributor Author

fhahn commented Mar 19, 2026

We also see miscompilations, in different apps, so not just one. Shall we revert this because I don't see a lot of movement in #187323?

There's no valid reproducer so far showing an issue. We can revert, but would be good to also provide a concrete reproducer showing the issue

This shows miscompiles in 549.fotonik3d_r in SPEC 2017 FP. And then other apps as well.

Thanks for the revert.

Would you be able to provide an IR reproducer?

@sjoerdmeijer
Collaborator

Would you be able to provide an IR reproducer?

I will send Alexey a message; if he doesn't get there first, I will have a look.

fhahn added a commit that referenced this pull request Mar 19, 2026
…c." (#187504)

Reverts #185969

This is suspected to cause a miscompile in 549.fotonik3d_r from SPEC 2017 FP
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Mar 19, 2026
…epilogue vec." (#187504)

Reverts llvm/llvm-project#185969

This is suspected to cause a miscompile in 549.fotonik3d_r from SPEC 2017 FP
@stuij
Member

stuij commented Mar 23, 2026

Unfortunately this change (which I'm guessing will reland without too many alterations) is causing a lot of regressions in CMSIS-DSP (embedded library) kernels that we expect to auto-vectorize. On a Cortex-A320, 93 of the 322 kernels we test saw a regression (almost 30%), by an average of 2.7%; the top 20 sit at 4.5%.

This seems a bit high to me, especially when it affects what seems to me to be fairly idiomatic library code.

When we look at an example, we see that the interleave count in the relevant loop is reduced from 8 to 2.

Cut back stand-alone version:

#include <stdint.h>
typedef int32_t q31_t;


void arm_negate_1(
  const q31_t * pSrc,
        q31_t * pDst,
        uint32_t blockSize)
{
        uint32_t blkCnt;                               /* Loop counter */
        q31_t in;                                      /* Temporary input variable */

  /* Loop unrolling: Compute 4 outputs at a time */
  blkCnt = blockSize >> 2U;

  while (blkCnt > 0U)
  {
    /* C = -A */

    /* Negate and store result in destination buffer. */
    in = *pSrc++;
    *pDst++ = (in == INT32_MIN) ? INT32_MAX : -in;

    in = *pSrc++;
    *pDst++ = (in == INT32_MIN) ? INT32_MAX : -in;

    in = *pSrc++;
    *pDst++ = (in == INT32_MIN) ? INT32_MAX : -in;

    in = *pSrc++;
    *pDst++ = (in == INT32_MIN) ? INT32_MAX : -in;

    /* Decrement loop counter */
    blkCnt--;
  }

  /* Loop unrolling: Compute remaining outputs */
  blkCnt = blockSize % 0x4U;

  while (blkCnt > 0U)
  {
    /* C = -A */

    /* Negate and store result in destination buffer. */
    in = *pSrc++;
    *pDst++ = (in == INT32_MIN) ? INT32_MAX : -in;

    /* Decrement loop counter */
    blkCnt--;
  }
}

compiled with $CLANG --target=aarch64-none-elf -mcpu=cortex-a320 -O3

The below assembly is from the relevant loop.

before:

.LBB0_15:
	and	x12, x11, #0x3ffffff8
	movi	v0.4s, #128, lsl #24
	lsl	x10, x12, #4
	sub	w9, w11, w12
	add	x8, x0, x10
	add	x10, x1, x10
	add	x13, x1, #64
	add	x14, x0, #64
	and	x15, x11, #0x3ffffff8
	.p2align	4, , 8
.LBB0_16:                               // =>This Inner Loop Header: Depth=1
	ldp	q3, q4, [x14, #-32]
	ldp	q1, q2, [x14, #-64]
	cmeq	v17.4s, v4.4s, v0.4s
	cmeq	v18.4s, v3.4s, v0.4s
	neg	v4.4s, v4.4s
	neg	v3.4s, v3.4s
	cmeq	v19.4s, v1.4s, v0.4s
	neg	v1.4s, v1.4s
	cmeq	v20.4s, v2.4s, v0.4s
	neg	v2.4s, v2.4s
	bic	v3.16b, v3.16b, v18.16b
	bic	v4.16b, v4.16b, v17.16b
	bic	v17.4s, #128, lsl #24
	bic	v18.4s, #128, lsl #24
	ldp	q5, q6, [x14]
	bic	v1.16b, v1.16b, v19.16b
	bic	v19.4s, #128, lsl #24
	bic	v2.16b, v2.16b, v20.16b
	bic	v20.4s, #128, lsl #24
	orr	v4.16b, v17.16b, v4.16b
	orr	v3.16b, v18.16b, v3.16b
	ldp	q7, q16, [x14, #32]
	orr	v1.16b, v19.16b, v1.16b
	orr	v2.16b, v20.16b, v2.16b
	stp	q3, q4, [x13, #-32]
	cmeq	v3.4s, v5.4s, v0.4s
	neg	v4.4s, v5.4s
	stp	q1, q2, [x13, #-64]
	cmeq	v1.4s, v16.4s, v0.4s
	cmeq	v2.4s, v7.4s, v0.4s
	cmeq	v5.4s, v6.4s, v0.4s
	neg	v16.4s, v16.4s
	bic	v4.16b, v4.16b, v3.16b
	bic	v3.4s, #128, lsl #24
	neg	v6.4s, v6.4s
	neg	v7.4s, v7.4s
	orr	v3.16b, v3.16b, v4.16b
	bic	v4.16b, v6.16b, v5.16b
	bic	v6.16b, v7.16b, v2.16b
	bic	v7.16b, v16.16b, v1.16b
	bic	v1.4s, #128, lsl #24
	bic	v2.4s, #128, lsl #24
	bic	v5.4s, #128, lsl #24
	orr	v1.16b, v1.16b, v7.16b
	orr	v2.16b, v2.16b, v6.16b
	orr	v4.16b, v5.16b, v4.16b
	subs	x15, x15, #8
	add	x14, x14, #128
	stp	q2, q1, [x13, #32]
	stp	q3, q4, [x13], #128
	b.ne	.LBB0_16
// %bb.17:
	cmp	x12, x11
	b.ne	.LBB0_5
	b	.LBB0_19

after:

.LBB0_15:
	and	x12, x11, #0x3ffffffe
	movi	v0.4s, #128, lsl #24
	lsl	x10, x12, #4
	sub	w9, w11, w12
	add	x8, x0, x10
	add	x10, x1, x10
	add	x13, x1, #16
	add	x14, x0, #16
	and	x15, x11, #0x3ffffffe
	.p2align	4, , 8
.LBB0_16:                               // =>This Inner Loop Header: Depth=1
	ldp	q1, q2, [x14, #-16]
	subs	x15, x15, #2
	cmeq	v3.4s, v1.4s, v0.4s
	neg	v1.4s, v1.4s
	cmeq	v4.4s, v2.4s, v0.4s
	neg	v2.4s, v2.4s
	bic	v1.16b, v1.16b, v3.16b
	bic	v3.4s, #128, lsl #24
	bic	v2.16b, v2.16b, v4.16b
	bic	v4.4s, #128, lsl #24
	orr	v1.16b, v3.16b, v1.16b
	add	x14, x14, #32
	orr	v2.16b, v4.16b, v2.16b
	stp	q1, q2, [x13, #-16]
	add	x13, x13, #32
	b.ne	.LBB0_16
// %bb.17:
	cmp	x12, x11
	b.ne	.LBB0_5
	b	.LBB0_19

@fhahn
Contributor Author

fhahn commented Mar 23, 2026

@stuij is it possible this is not the right PR? AFAICT it does not change the generated assembly for the C code at all; it just changes how the resume values are handled, but does not impact the vectorization decision.

On current main, codegen is also narrow: https://clang.godbolt.org/z/YoKKazsbs

@fhahn
Contributor Author

fhahn commented Mar 24, 2026

@sjoerdmeijer relanded as 77fb848, please let me know if you are seeing any issues, and if so please provide a reproducer :)

@fhahn
Contributor Author

fhahn commented Mar 24, 2026

@stuij it looks like TuneA320 limits the interleave count to 2 in this case. Does increasing the max interleave count resolve the regression? (I think the regression started in 92e44b2 AFAICT)

@alexfh
Contributor

alexfh commented Mar 24, 2026

Leaving this reverted without reverting #182146 is a bad state too. We're affected by this miscompilation then: #182146 (comment)

Maybe #182146 should be reverted as well?

@fhahn
Contributor Author

fhahn commented Mar 24, 2026

@alexfh it was re-landed yesterday

@stuij
Member

stuij commented Mar 24, 2026

@fhahn yes! I accidentally took the wrong function from the daily regression list to look at regressions caused by this patch. Sorry! I bisected it, and it was indeed 92e44b2. We've seen regressions caused by related patches and we've got a ticket for this. We did actually try to tweak max interleave count (to 4) but it did more harm than good for the CMSIS-DSP tests. It's ongoing. Thanks for having a think on it.

However, the list and number of regressions that I quoted earlier in this thread do seem to be caused by this change. The numbers ebb and flow as this commit is applied and reverted (and our automated bisecting points to it). Examples of actually impacted tests are arm_fill_f16, arm_q7_to_q15, and arm_mat_add_f16.

Looking at arm_mat_add_f16.c, I can see 4 extra movs towards the start of the function to keep values alive, and 5 extra instructions in the NEON header that we now always execute, doing checks that were previously done afterwards and only when we have resume values. So in the case of our test with multiples of 16, where we don't need the epilogue, this is a function-call overhead of 9 instructions, which I guess we do start to feel on an embedded core like the A320. Does that sound sane?

albertbolt1 pushed a commit to albertbolt1/llvm-project that referenced this pull request Mar 28, 2026
…c." (llvm#187504)

Reverts llvm#185969

This is suspected to cause a miscompile in 549.fotonik3d_r from SPEC 2017 FP

Successfully merging this pull request may close these issues.

Assertion `Val && "isa<> used on a null pointer"' failed.

7 participants