[VPlan] Skip epilogue vectorization if dead after narrowing IGs. #187016

Merged
fhahn merged 6 commits into llvm:main from fhahn:lv-epi-consider-narrowed-loop
Mar 20, 2026

@fhahn fhahn commented Mar 17, 2026

When narrowing interleave groups, the main vector loop processes IC iterations instead of VF * IC. Update selectEpilogueVectorizationFactor to use the effective VF, checking if the canonical IV controlling the loop now steps by UF instead of VFxUF.

This avoids epilogue vectorization with dead epilogue vector loops and also prevents crashes in cases where we can prove both the epilogue and scalar loop are dead.

Fixes #186846
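
As a rough illustration of the arithmetic behind this change, the sketch below uses numbers read off the regression test added in this PR (500 original loop iterations, VF = 2, UF = 4). The helper `remainingIterations` is hypothetical and not part of LLVM; it only models how many iterations are left over for an epilogue once the main vector loop finishes, and it ignores details such as a required scalar epilogue.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not LLVM API): iterations left for an epilogue loop
// when the main vector loop consumes `Step` original iterations per trip.
constexpr std::uint64_t remainingIterations(std::uint64_t TripCount,
                                            std::uint64_t Step) {
  return TripCount % Step;
}

// Example values taken from the regression test; treat them as assumptions.
constexpr std::uint64_t TripCount = 500; // iv runs 0, 2, ..., 998
constexpr std::uint64_t VF = 2;          // <2 x i64> stores
constexpr std::uint64_t UF = 4;          // four stores per vector iteration

// The nominal step VF * UF = 8 suggests 4 leftover iterations.
static_assert(remainingIterations(TripCount, VF * UF) == 4,
              "nominal step leaves a remainder");
// After narrowing, the loop steps by UF = 4 and nothing is left over.
static_assert(remainingIterations(TripCount, UF) == 0,
              "narrowed step covers every iteration");
```

Under these assumptions, reasoning with the nominal VF would leave work for an epilogue that the narrowed main loop has in fact already covered, which is the dead-epilogue situation the description refers to.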


llvmbot commented Mar 17, 2026

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Florian Hahn (fhahn)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/187016.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+2-2)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+36-11)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/transform-narrow-interleave-to-widen-memory-epilogue-vec.ll (+58)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 8368349e63cee..a5dc42fc71e99 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -608,8 +608,8 @@ class LoopVectorizationPlanner {
   /// \return The most profitable vectorization factor and the cost of that VF
   /// for vectorizing the epilogue. Returns VectorizationFactor::Disabled if
   /// epilogue vectorization is not supported for the loop.
-  VectorizationFactor
-  selectEpilogueVectorizationFactor(const ElementCount MainLoopVF, unsigned IC);
+  VectorizationFactor selectEpilogueVectorizationFactor(ElementCount MainLoopVF,
+                                                        unsigned IC);
 
   /// Emit remarks for recipes with invalid costs in the available VPlans.
   void emitInvalidCostRemarks(OptimizationRemarkEmitter *ORE);
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index ac9b790c739bf..9036de66fab9a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4419,7 +4419,7 @@ bool LoopVectorizationCostModel::isEpilogueVectorizationProfitable(
 }
 
 VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
-    const ElementCount MainLoopVF, unsigned IC) {
+    ElementCount MainLoopVF, unsigned IC) {
   VectorizationFactor Result = VectorizationFactor::Disabled();
   if (!EnableEpilogueVectorization) {
     LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is disabled.\n");
@@ -4463,6 +4463,28 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
     return Result;
   }
 
+  // Check if a plan's vector loop processes fewer iterations than VF (e.g. when
+  // interleave groups have been narrowed by narrowInterleaveGroups) and return
+  // the adjusted, effective VF.
+  using namespace VPlanPatternMatch;
+  auto GetEffectiveVF = [](VPlan &Plan, ElementCount VF) -> ElementCount {
+    auto *CanIV = Plan.getVectorLoopRegion()->getCanonicalIV();
+    auto *Exiting = Plan.getVectorLoopRegion()->getExitingBasicBlock();
+    if (match(
+            &Exiting->back(),
+            m_BranchOnCount(m_Add(m_Specific(CanIV), m_Specific(&Plan.getUF())),
+                            m_VPValue())))
+      return VF.isScalable() ? ElementCount::getScalable(1)
+                             : ElementCount::getFixed(1);
+    return VF;
+  };
+
+  // Check if the main loop processes fewer than MainLoopVF elements per
+  // iteration (e.g. due to narrowing interleave groups). Adjust MainLoopVF
+  // as needed.
+  VPlan &MainPlan = getPlanFor(MainLoopVF);
+  MainLoopVF = GetEffectiveVF(MainPlan, MainLoopVF);
+
   // If MainLoopVF = vscale x 2, and vscale is expected to be 4, then we know
   // the main loop handles 8 lanes per iteration. We could still benefit from
   // vectorizing the epilogue loop with VF=4.
@@ -4472,8 +4494,7 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
   Type *TCType = Legal->getWidestInductionType();
   const SCEV *RemainingIterations = nullptr;
   unsigned MaxTripCount = 0;
-  const SCEV *TC = vputils::getSCEVExprForVPValue(
-      getPlanFor(MainLoopVF).getTripCount(), PSE);
+  const SCEV *TC = vputils::getSCEVExprForVPValue(MainPlan.getTripCount(), PSE);
   assert(!isa<SCEVCouldNotCompute>(TC) && "Trip count SCEV must be computable");
   const SCEV *KnownMinTC;
   bool ScalableTC = match(TC, m_scev_c_Mul(m_SCEV(KnownMinTC), m_SCEVVScale()));
@@ -4516,25 +4537,29 @@ VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
     if (!hasPlanWithVF(NextVF.Width))
       continue;
 
+    ElementCount EffectiveVF =
+        GetEffectiveVF(getPlanFor(NextVF.Width), NextVF.Width);
     // Skip candidate VFs with widths >= the (estimated) runtime VF (scalable
     // vectors) or > the VF of the main loop (fixed vectors).
-    if ((!NextVF.Width.isScalable() && MainLoopVF.isScalable() &&
-         ElementCount::isKnownGE(NextVF.Width, EstimatedRuntimeVF)) ||
-        (NextVF.Width.isScalable() &&
-         ElementCount::isKnownGE(NextVF.Width, MainLoopVF)) ||
-        (!NextVF.Width.isScalable() && !MainLoopVF.isScalable() &&
-         ElementCount::isKnownGT(NextVF.Width, MainLoopVF)))
+    if ((!EffectiveVF.isScalable() && MainLoopVF.isScalable() &&
+         ElementCount::isKnownGE(EffectiveVF, EstimatedRuntimeVF)) ||
+        (EffectiveVF.isScalable() &&
+         ElementCount::isKnownGE(EffectiveVF, MainLoopVF)) ||
+        (!EffectiveVF.isScalable() && !MainLoopVF.isScalable() &&
+         ElementCount::isKnownGT(EffectiveVF, MainLoopVF)))
       continue;
 
     // If NextVF is greater than the number of remaining iterations, the
-    // epilogue loop would be dead. Skip such factors.
+    // epilogue loop would be dead. Skip such factors. If the epilogue plan
+    // also has narrowed interleave groups, use the effective VF since
+    // the epilogue step will be reduced to its IC.
     // TODO: We should also consider comparing against a scalable
     // RemainingIterations when SCEV be able to evaluate non-canonical
     // vscale-based expressions.
     if (!ScalableRemIter) {
       // Handle the case where NextVF and RemainingIterations are in different
       // numerical spaces.
-      ElementCount EC = NextVF.Width;
+      ElementCount EC = GetEffectiveVF(getPlanFor(NextVF.Width), NextVF.Width);
       if (NextVF.Width.isScalable())
         EC = ElementCount::getFixed(
             estimateElementCount(NextVF.Width, CM.getVScaleForTuning()));
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/transform-narrow-interleave-to-widen-memory-epilogue-vec.ll b/llvm/test/Transforms/LoopVectorize/AArch64/transform-narrow-interleave-to-widen-memory-epilogue-vec.ll
new file mode 100644
index 0000000000000..b2efbedf69178
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/transform-narrow-interleave-to-widen-memory-epilogue-vec.ll
@@ -0,0 +1,58 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 6
+; RUN: opt -passes=loop-vectorize -mcpu=neoverse-v2 -S < %s | FileCheck %s
+
+target triple = "arm64-apple-macosx"
+
+; Test that epilogue vectorization is not selected when the main vector loop
+; covers all iterations after narrowInterleaveGroups reduces the effective
+; step from VF * UF to UF.
+define void @no_epilogue_when_narrowed_covers_all(ptr %p) {
+; CHECK-LABEL: define void @no_epilogue_when_narrowed_covers_all(
+; CHECK-SAME: ptr [[P:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[OFFSET_IDX]], 2
+; CHECK-NEXT:    [[TMP1:%.*]] = add i64 [[OFFSET_IDX]], 4
+; CHECK-NEXT:    [[TMP2:%.*]] = add i64 [[OFFSET_IDX]], 6
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i64, ptr [[P]], i64 [[OFFSET_IDX]]
+; CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[P]], i64 [[TMP0]]
+; CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i64, ptr [[P]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i64, ptr [[P]], i64 [[TMP2]]
+; CHECK-NEXT:    store <2 x i64> splat (i64 1), ptr [[TMP3]], align 8
+; CHECK-NEXT:    store <2 x i64> splat (i64 1), ptr [[TMP4]], align 8
+; CHECK-NEXT:    store <2 x i64> splat (i64 1), ptr [[TMP5]], align 8
+; CHECK-NEXT:    store <2 x i64> splat (i64 1), ptr [[TMP6]], align 8
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], 500
+; CHECK-NEXT:    br i1 [[TMP7]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %p0 = getelementptr inbounds i64, ptr %p, i64 %iv
+  %p1 = getelementptr inbounds i64, ptr %p0, i64 1
+  store i64 1, ptr %p0, align 8
+  store i64 1, ptr %p1, align 8
+  %iv.next = add nuw nsw i64 %iv, 2
+  %ec = icmp eq i64 %iv.next, 1000
+  br i1 %ec, label %exit, label %loop
+
+exit:
+  ret void
+}
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+;.

auto *Exiting = Plan.getVectorLoopRegion()->getExitingBasicBlock();
if (match(
&Exiting->back(),
m_BranchOnCount(m_Add(m_Specific(CanIV), m_Specific(&Plan.getUF())),
Contributor

Do we attempt to create epilogues when the main loop has VF=1? If so, we'd also match those cases here too. However, I think it still works because the code below will return the correct VF, right?

Contributor Author

I think it depends on the limit set by the target, which is checked before we reach this code. It will return the correct number of elements processed.
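
The behaviour discussed in this thread can be sketched with a simplified model. `ElemCount` and `PlanModel` below are stand-ins invented for illustration (they are not `llvm::ElementCount` or `VPlan`), and `StepsByUF` stands for the result of the `m_BranchOnCount(m_Add(CanIV, UF), ...)` pattern match in the diff.

```cpp
#include <cassert>

// Simplified stand-in for llvm::ElementCount (assumption, not the real class).
struct ElemCount {
  unsigned KnownMin;
  bool Scalable;
};

// Simplified stand-in for what the pattern match learns about a plan.
struct PlanModel {
  bool StepsByUF; // true if the canonical IV increment is UF, not VF * UF
};

// Mirrors the shape of the GetEffectiveVF lambda: a plan whose IV steps by
// UF processes one element per unrolled part, so its effective VF is 1
// (scalable or fixed, matching the nominal VF).
inline ElemCount effectiveVF(const PlanModel &Plan, ElemCount VF) {
  if (Plan.StepsByUF)
    return ElemCount{1, VF.Scalable};
  return VF;
}
```

In this model a plan with a nominal VF of 1 yields an effective VF of 1 whether or not the match fires, which is consistent with the reply above that the correct number of processed elements is returned either way.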

// Skip candidate VFs with widths >= the (estimated) runtime VF (scalable
// vectors) or > the VF of the main loop (fixed vectors).
if ((!NextVF.Width.isScalable() && MainLoopVF.isScalable() &&
ElementCount::isKnownGE(NextVF.Width, EstimatedRuntimeVF)) ||
Contributor

Doesn't EstimatedRuntimeVF also need updating? Otherwise it's inconsistent with MainLoopVF.

Contributor Author

Yep, EstimatedRuntimeVF should only be computed after we have adjusted MainLoopVF.
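
To make the ordering concern concrete, here is a small sketch assuming the estimate is simply the known-minimum lane count multiplied by a tuning vscale for scalable VFs. `ElemCount` and `estimateLanes` are hypothetical names, not LLVM API; the vscale x 2 with vscale = 4 example comes from the code comment in the diff.

```cpp
#include <cassert>

// Simplified stand-in for llvm::ElementCount (assumption, not the real class).
struct ElemCount {
  unsigned KnownMin;
  bool Scalable;
};

// Hypothetical helper: expected lanes for a VF given a tuning vscale.
constexpr unsigned estimateLanes(ElemCount EC, unsigned VScaleForTuning) {
  return EC.Scalable ? EC.KnownMin * VScaleForTuning : EC.KnownMin;
}

constexpr ElemCount NominalMainVF{2, true};   // vscale x 2, before adjustment
constexpr ElemCount EffectiveMainVF{1, true}; // vscale x 1, after narrowing

// Estimating before the adjustment gives 8 lanes per iteration...
static_assert(estimateLanes(NominalMainVF, 4) == 8, "stale estimate");
// ...while estimating from the adjusted VF gives 4.
static_assert(estimateLanes(EffectiveMainVF, 4) == 4, "adjusted estimate");
```

If the estimate were taken from the nominal VF while the comparison against `MainLoopVF` used the adjusted one, the two bounds would disagree, which is the inconsistency raised in this thread.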

ElementCount::isKnownGE(NextVF.Width, MainLoopVF)) ||
(!NextVF.Width.isScalable() && !MainLoopVF.isScalable() &&
ElementCount::isKnownGT(NextVF.Width, MainLoopVF)))
if ((!EffectiveVF.isScalable() && MainLoopVF.isScalable() &&
Contributor

It would be good to have a test showing the effect of this change too. It looks like this new code is also effectively disabling epilogue vectorisation since we'll discard any VF > 1.

I suspect we don't even get as far as the SkipVF code below.

Contributor Author

This should be covered by existing epilogue tests. If we don't adjust the VFs here (and below), we would disable epilogue vectorization in cases where we should still vectorize the epilogue after narrowing interleave groups in both the main and epilogue plans, because we would compare VF = 1 from the main loop against VF > 1 from the epilogue plan.

This is when vectorizing the epilogue due to interleaving the vector loop.
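
The point about comparing like with like can be sketched as follows. This is a simplified model of the candidate-skipping condition in the diff, assuming every comparison it performs reduces to comparing known-minimum lane counts; `ElemCount` and `skipCandidate` are hypothetical names, not LLVM API.

```cpp
#include <cassert>

// Simplified stand-in for llvm::ElementCount (assumption, not the real class).
struct ElemCount {
  unsigned KnownMin;
  bool Scalable;
};

// Models the three-way skip condition: a fixed candidate against a scalable
// main loop is compared with the estimated runtime VF (>=), a scalable
// candidate is compared with the main VF (>=), and fixed against fixed
// uses a strict comparison (>).
inline bool skipCandidate(ElemCount Cand, ElemCount Main,
                          unsigned EstimatedRuntimeVF) {
  if (!Cand.Scalable && Main.Scalable)
    return Cand.KnownMin >= EstimatedRuntimeVF;
  if (Cand.Scalable)
    return Cand.KnownMin >= Main.KnownMin;
  return Cand.KnownMin > Main.KnownMin;
}
```

With a narrowed main loop (effective VF 1), passing the candidate's nominal width of 2 makes `skipCandidate` return true, whereas passing the candidate's own effective VF of 1 keeps it in consideration. That matches the explanation above that both sides need to be adjusted.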

Contributor

@artagnon artagnon left a comment

Looks fine now, thanks for fixing -- I was confused about mixing NextVF and EffectiveVF. LGTM, thanks!

@fhahn fhahn merged commit 19b0c68 into llvm:main Mar 20, 2026
10 checks passed
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Mar 20, 2026
@fhahn fhahn deleted the lv-epi-consider-narrowed-loop branch March 20, 2026 13:31
ambergorzynski pushed a commit to ambergorzynski/llvm-project that referenced this pull request Mar 27, 2026
albertbolt1 pushed a commit to albertbolt1/llvm-project that referenced this pull request Mar 28, 2026
Aadarsh-Keshri pushed a commit to Aadarsh-Keshri/llvm-project that referenced this pull request Apr 1, 2026
Development

Successfully merging this pull request may close these issues.

[LV]Multiple crashes with epilogue vectorization

4 participants