[VPlan] Add the cost of spills when considering register pressure #179646

john-brawn-arm merged 8 commits into llvm:main
Conversation
Currently, when considering register pressure is enabled, we reject any VF that has higher pressure than the number of registers. However, this can result in failing to vectorize in cases where it would be beneficial, as the cost of the extra spills is less than the benefit we get from vectorizing. Deal with this by instead calculating the cost of the spills and adding it to the rest of the cost, so we can detect this kind of situation and still vectorize, while avoiding vectorizing in cases where the extra cost makes it not worth it.
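The approach described above can be sketched as a small standalone model. This is a simplified illustration, not the actual LLVM interfaces: the function name, the flat cost maps, and the per-class cost values below are all hypothetical stand-ins for `VPRegisterUsage::spillCost`, TTI register-class queries, and `getMemoryOpCost`. For each register class, every register of pressure beyond what the target provides is charged one spill plus one reload of the largest type seen in that class.

```cpp
#include <map>

// Hypothetical model of the spill-cost heuristic. Keys are register-class
// IDs; values are, respectively, the peak number of live registers, the
// number of registers the target provides, and the combined cost of one
// store-plus-load of the largest type seen in that class.
unsigned modelSpillCost(const std::map<unsigned, unsigned> &MaxLocalUsers,
                        const std::map<unsigned, unsigned> &AvailableRegs,
                        const std::map<unsigned, unsigned> &SpillReloadCost) {
  unsigned Cost = 0;
  for (const auto &[ClassID, Used] : MaxLocalUsers) {
    unsigned Avail = AvailableRegs.at(ClassID);
    if (Used > Avail)
      // One spill/reload pair per register of excess pressure.
      Cost += (Used - Avail) * SpillReloadCost.at(ClassID);
  }
  return Cost;
}
```

With pressure 10 against 8 available registers and a store-plus-load cost of 2, this charges 2 spills for a total cost of 4, matching the shape of the `LV(REG): Cost of 4 from 2 spills` debug output in the tests; the extra cost then competes with the per-lane benefit of the wider VF instead of rejecting it outright.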
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-llvm-transforms Author: John Brawn (john-brawn-arm) Changes: Currently, when considering register pressure is enabled, we reject any VF that has higher pressure than the number of registers. However, this can result in failing to vectorize in cases where it would be beneficial, as the cost of the extra spills is less than the benefit we get from vectorizing. Deal with this by instead calculating the cost of the spills and adding it to the rest of the cost, so we can detect this kind of situation and still vectorize, while avoiding vectorizing in cases where the extra cost makes it not worth it. Patch is 34.53 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/179646.diff 7 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 44d4d92d4a7e2..06e8efef20c03 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -45,6 +45,7 @@ class OptimizationRemarkEmitter;
class TargetTransformInfo;
class TargetLibraryInfo;
class VPRecipeBuilder;
+class VPRegisterUsage;
struct VFRange;
extern cl::opt<bool> EnableVPlanNativePath;
@@ -497,7 +498,7 @@ class LoopVectorizationPlanner {
///
/// TODO: Move to VPlan::cost once the use of LoopVectorizationLegality has
/// been retired.
- InstructionCost cost(VPlan &Plan, ElementCount VF) const;
+ InstructionCost cost(VPlan &Plan, ElementCount VF, VPRegisterUsage *RU) const;
/// Precompute costs for certain instructions using the legacy cost model. The
/// function is used to bring up the VPlan-based cost model to initially avoid
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index abac45b265d10..492e716fd6ad2 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4247,13 +4247,6 @@ VectorizationFactor LoopVectorizationPlanner::selectVectorizationFactor() {
if (VF.isScalar())
continue;
- /// If the register pressure needs to be considered for VF,
- /// don't consider the VF as valid if it exceeds the number
- /// of registers for the target.
- if (CM.shouldConsiderRegPressureForVF(VF) &&
- RUs[I].exceedsMaxNumRegs(TTI, ForceTargetNumVectorRegs))
- continue;
-
InstructionCost C = CM.expectedCost(VF);
// Add on other costs that are modelled in VPlan, but not in the legacy
@@ -4302,6 +4295,10 @@ VectorizationFactor LoopVectorizationPlanner::selectVectorizationFactor() {
}
}
+ // Add the cost of any spills due to excess register usage
+ if (CM.shouldConsiderRegPressureForVF(VF))
+ C += RUs[I].spillCost(CostCtx, ForceTargetNumVectorRegs);
+
VectorizationFactor Candidate(VF, C, ScalarCost.ScalarCost);
unsigned Width =
estimateElementCount(Candidate.Width, CM.getVScaleForTuning());
@@ -4687,13 +4684,16 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF,
if (hasFindLastReductionPhi(Plan))
return 1;
+ VPRegisterUsage R =
+ calculateRegisterUsageForPlan(Plan, {VF}, TTI, CM.ValuesToIgnore)[0];
+
// If we did not calculate the cost for VF (because the user selected the VF)
// then we calculate the cost of VF here.
if (LoopCost == 0) {
if (VF.isScalar())
LoopCost = CM.expectedCost(VF);
else
- LoopCost = cost(Plan, VF);
+ LoopCost = cost(Plan, VF, &R);
assert(LoopCost.isValid() && "Expected to have chosen a VF with valid cost");
// Loop body is free and there is no need for interleaving.
@@ -4701,8 +4701,6 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF,
return 1;
}
- VPRegisterUsage R =
- calculateRegisterUsageForPlan(Plan, {VF}, TTI, CM.ValuesToIgnore)[0];
// We divide by these constants so assume that we have at least one
// instruction that uses at least one register.
for (auto &Pair : R.MaxLocalUsers) {
@@ -7027,13 +7025,18 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
return Cost;
}
-InstructionCost LoopVectorizationPlanner::cost(VPlan &Plan,
- ElementCount VF) const {
+InstructionCost LoopVectorizationPlanner::cost(VPlan &Plan, ElementCount VF,
+ VPRegisterUsage *RU) const {
VPCostContext CostCtx(CM.TTI, *CM.TLI, Plan, CM, CM.CostKind, PSE, OrigLoop);
InstructionCost Cost = precomputeCosts(Plan, VF, CostCtx);
// Now compute and add the VPlan-based cost.
Cost += Plan.cost(VF, CostCtx);
+
+ // Add the cost of spills due to excess register usage
+ if (CM.shouldConsiderRegPressureForVF(VF))
+ Cost += RU->spillCost(CostCtx, ForceTargetNumVectorRegs);
+
#ifndef NDEBUG
unsigned EstimatedWidth = estimateElementCount(VF, CM.getVScaleForTuning());
LLVM_DEBUG(dbgs() << "Cost for VF " << VF << ": " << Cost
@@ -7233,9 +7236,10 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
P->vectorFactors().end());
SmallVector<VPRegisterUsage, 8> RUs;
- if (any_of(VFs, [this](ElementCount VF) {
- return CM.shouldConsiderRegPressureForVF(VF);
- }))
+ bool ConsiderRegPressure = any_of(VFs, [this](ElementCount VF) {
+ return CM.shouldConsiderRegPressureForVF(VF);
+ });
+ if (ConsiderRegPressure)
RUs = calculateRegisterUsageForPlan(*P, VFs, TTI, CM.ValuesToIgnore);
for (unsigned I = 0; I < VFs.size(); I++) {
@@ -7258,16 +7262,10 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
continue;
}
- InstructionCost Cost = cost(*P, VF);
+ InstructionCost Cost =
+ cost(*P, VF, ConsiderRegPressure ? &RUs[I] : nullptr);
VectorizationFactor CurrentFactor(VF, Cost, ScalarCost);
- if (CM.shouldConsiderRegPressureForVF(VF) &&
- RUs[I].exceedsMaxNumRegs(TTI, ForceTargetNumVectorRegs)) {
- LLVM_DEBUG(dbgs() << "LV(REG): Not considering vector loop of width "
- << VF << " because it uses too many registers\n");
- continue;
- }
-
if (isMoreProfitable(CurrentFactor, BestFactor, P->hasScalarTail()))
BestFactor = CurrentFactor;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 8fbe7d93e6f45..b8be1be79831e 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -16,6 +16,7 @@
#include "llvm/ADT/TypeSwitch.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/PatternMatch.h"
@@ -389,13 +390,33 @@ bool VPDominatorTree::properlyDominates(const VPRecipeBase *A,
return Base::properlyDominates(ParentA, ParentB);
}
-bool VPRegisterUsage::exceedsMaxNumRegs(const TargetTransformInfo &TTI,
- unsigned OverrideMaxNumRegs) const {
- return any_of(MaxLocalUsers, [&TTI, &OverrideMaxNumRegs](auto &LU) {
- return LU.second > (OverrideMaxNumRegs > 0
- ? OverrideMaxNumRegs
- : TTI.getNumberOfRegisters(LU.first));
- });
+InstructionCost VPRegisterUsage::spillCost(VPCostContext &Ctx,
+ unsigned OverrideMaxNumRegs) const {
+ InstructionCost Cost;
+ DataLayout DL = Ctx.PSE.getSE()->getDataLayout();
+ for (const auto &Pair : MaxLocalUsers) {
+ unsigned AvailableRegs = OverrideMaxNumRegs > 0
+ ? OverrideMaxNumRegs
+ : Ctx.TTI.getNumberOfRegisters(Pair.first);
+ if (Pair.second > AvailableRegs) {
+ // Assume that for each register used past what's available we get one
+ // spill and reload of the largest type seen for that register class.
+ unsigned Spills = Pair.second - AvailableRegs;
+ Type *SpillType = LargestType.at(Pair.first);
+ Align Alignment = DL.getPrefTypeAlign(SpillType);
+ InstructionCost SpillCost =
+ Ctx.TTI.getMemoryOpCost(Instruction::Load, SpillType, Alignment, 0,
+ Ctx.CostKind) +
+ Ctx.TTI.getMemoryOpCost(Instruction::Store, SpillType, Alignment, 0,
+ Ctx.CostKind);
+ InstructionCost TotalCost = SpillCost * Spills;
+ LLVM_DEBUG(dbgs() << "LV(REG): Cost of " << TotalCost << " from "
+ << Spills << " spills of "
+ << Ctx.TTI.getRegisterClassName(Pair.first) << "\n");
+ Cost += TotalCost;
+ }
+ }
+ return Cost;
}
SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
@@ -479,6 +500,15 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
SmallPtrSet<VPValue *, 8> OpenIntervals;
SmallVector<VPRegisterUsage, 8> RUs(VFs.size());
SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
+ SmallVector<SmallMapVector<unsigned, Type *, 4>, 8> LargestTypes(VFs.size());
+ auto MaxType = [](Type *CurMax, Type *T) {
+ if (!CurMax)
+ return T;
+ if (TypeSize::isKnownGT(T->getPrimitiveSizeInBits(),
+ CurMax->getPrimitiveSizeInBits()))
+ return T;
+ return CurMax;
+ };
LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
@@ -540,17 +570,19 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
match(VPV, m_ExtractLastPart(m_VPValue())))
continue;
+ Type *ScalarTy = TypeInfo.inferScalarType(VPV);
if (VFs[J].isScalar() ||
isa<VPCanonicalIVPHIRecipe, VPReplicateRecipe, VPDerivedIVRecipe,
VPEVLBasedIVPHIRecipe, VPScalarIVStepsRecipe>(VPV) ||
(isa<VPInstruction>(VPV) && vputils::onlyScalarValuesUsed(VPV)) ||
(isa<VPReductionPHIRecipe>(VPV) &&
(cast<VPReductionPHIRecipe>(VPV))->isInLoop())) {
- unsigned ClassID =
- TTI.getRegisterClassForType(false, TypeInfo.inferScalarType(VPV));
+ unsigned ClassID = TTI.getRegisterClassForType(false, ScalarTy);
// FIXME: The target might use more than one register for the type
// even in the scalar case.
RegUsage[ClassID] += 1;
+ LargestTypes[J][ClassID] =
+ MaxType(LargestTypes[J][ClassID], ScalarTy);
} else {
// The output from scaled phis and scaled reductions actually has
// fewer lanes than the VF.
@@ -562,10 +594,12 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
LLVM_DEBUG(dbgs() << "LV(REG): Scaled down VF from " << VFs[J]
<< " to " << VF << " for " << *R << "\n";);
}
-
- Type *ScalarTy = TypeInfo.inferScalarType(VPV);
unsigned ClassID = TTI.getRegisterClassForType(true, ScalarTy);
RegUsage[ClassID] += GetRegUsage(ScalarTy, VF);
+ if (VectorType::isValidElementType(ScalarTy)) {
+ Type *T = VectorType::get(ScalarTy, VF);
+ LargestTypes[J][ClassID] = MaxType(LargestTypes[J][ClassID], T);
+ }
}
}
@@ -602,9 +636,11 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
bool IsScalar = vputils::onlyScalarValuesUsed(In);
ElementCount VF = IsScalar ? ElementCount::getFixed(1) : VFs[Idx];
- unsigned ClassID = TTI.getRegisterClassForType(
- VF.isVector(), TypeInfo.inferScalarType(In));
- Invariant[ClassID] += GetRegUsage(TypeInfo.inferScalarType(In), VF);
+ Type *ScalarTy = TypeInfo.inferScalarType(In);
+ unsigned ClassID = TTI.getRegisterClassForType(VF.isVector(), ScalarTy);
+ Invariant[ClassID] += GetRegUsage(ScalarTy, VF);
+ Type *SpillTy = IsScalar ? ScalarTy : VectorType::get(ScalarTy, VF);
+ LargestTypes[Idx][ClassID] = MaxType(LargestTypes[Idx][ClassID], SpillTy);
}
LLVM_DEBUG({
@@ -623,10 +659,16 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
<< TTI.getRegisterClassName(pair.first) << ", " << pair.second
<< " registers\n";
}
+ for (const auto &pair : LargestTypes[Idx]) {
+ dbgs() << "LV(REG): RegisterClass: "
+ << TTI.getRegisterClassName(pair.first) << ", " << *pair.second
+ << " is largest type potentially spilled\n";
+ }
});
RU.LoopInvariantRegs = Invariant;
RU.MaxLocalUsers = MaxUsages[Idx];
+ RU.LargestType = LargestTypes[Idx];
RUs[Idx] = RU;
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
index dc4be4270f7f1..3affa211dd140 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
@@ -19,6 +19,7 @@ namespace llvm {
class LLVMContext;
class VPValue;
class VPBlendRecipe;
+class VPCostContext;
class VPInstruction;
class VPWidenRecipe;
class VPWidenCallRecipe;
@@ -30,6 +31,7 @@ class VPlan;
class Value;
class TargetTransformInfo;
class Type;
+class InstructionCost;
/// An analysis for type-inference for VPValues.
/// It infers the scalar type for a given VPValue by bottom-up traversing
@@ -78,12 +80,14 @@ struct VPRegisterUsage {
/// Holds the maximum number of concurrent live intervals in the loop.
/// The key is ClassID of target-provided register class.
SmallMapVector<unsigned, unsigned, 4> MaxLocalUsers;
+ /// Holds the largest type used in each register class.
+ SmallMapVector<unsigned, Type *, 4> LargestType;
- /// Check if any of the tracked live intervals exceeds the number of
- /// available registers for the target. If non-zero, OverrideMaxNumRegs
+ /// Calculate the estimated cost of any spills due to using more registers
+ /// than the number available for the target. If non-zero, OverrideMaxNumRegs
/// is used in place of the target's number of registers.
- bool exceedsMaxNumRegs(const TargetTransformInfo &TTI,
- unsigned OverrideMaxNumRegs = 0) const;
+ InstructionCost spillCost(VPCostContext &Ctx,
+ unsigned OverrideMaxNumRegs = 0) const;
};
/// Estimate the register usage for \p Plan and vectorization factors in \p VFs
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll b/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
index 8109d0683fe71..2a4d16979e0d8 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
@@ -1,16 +1,31 @@
; REQUIRES: asserts
-; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth -debug-only=loop-vectorize -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK-REGS-VP
-; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth -debug-only=loop-vectorize -disable-output -force-target-num-vector-regs=1 -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK-NOREGS-VP
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=false -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOMAX
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-REGS-VP
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-target-num-vector-regs=1 -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOREGS-VP
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-unknown-elf"
+; The use of the dotp instruction means we never have an i32 vector, so we don't
+; get any spills normally and with a reduced number of registers the number of
+; spills is small enough that it doesn't prevent use of a larger VF.
define i32 @dotp(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'dotp'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 5 (Estimated cost per lane: 0.3)
; CHECK-REGS-VP: LV: Selecting VF: vscale x 16.
;
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 8 because it uses too many registers
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 16 because it uses too many registers
-; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 14 (Estimated cost per lane: 1.8)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 13 (Estimated cost per lane: 0.8)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 16.
entry:
br label %for.body
@@ -24,8 +39,7 @@ for.body: ; preds = %for.body, %entry
%load.b = load i8, ptr %gep.b, align 1
%ext.b = zext i8 %load.b to i32
%mul = mul i32 %ext.b, %ext.a
- %sub = sub i32 0, %mul
- %add = add i32 %accum, %sub
+ %add = add i32 %accum, %mul
%iv.next = add i64 %iv, 1
%exitcond.not = icmp eq i64 %iv.next, 1024
br i1 %exitcond.not, label %for.exit, label %for.body
@@ -34,4 +48,70 @@ for.exit: ; preds = %for.body
ret i32 %add
}
+; The largest type used in the loop is small enough that we already consider all
+; VFs and maximize-bandwidth does nothing.
+define void @type_too_small(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'type_too_small'
+; CHECK: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK: Cost for VF vscale x 16: 6 (Estimated cost per lane: 0.4)
+; CHECK: LV: Selecting VF: vscale x 16.
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr i8, ptr %a, i64 %iv
+ %load.a = load i8, ptr %gep.a, align 1
+ %gep.b = getelementptr i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %add = add i8 %load.a, %load.b
+ store i8 %add, ptr %gep.a, align 1
+ %iv.next = add i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 1024
+ br i1 %exitcond, label %exit, label %loop
+
+exit:
+ ret void
+}
+
+; With reduced number of registers the spills from high pressure are enough that
+; we use the same VF as if we hadn't maximized the bandwidth.
+define void @high_pressure(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'high_pressure'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 10 (Estimated cost per lane: 1.2)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 21 (Estimated cost per lane: 1.3)
+; CHECK-REGS-VP: LV: Selecting VF: vscale x 8.
+
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 12 from 3 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 26 (Estimated cost per lane: 3.2)
+; CHECK-NOREGS-VP: LV(REG): Cost of 56 from 7 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 81 (Estimated cost per lane: 5.1)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr i32, ptr %a, i64 %iv
+ %load.a = load i32, ptr %gep.a, align 4
+ %gep.b = getelementptr i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %ext.b = zext i8 %load.b to i32
+ %add = add i32 %load.a, %ext.b
+ store i32 %add, ptr %gep.a, align 4
+ %iv.next = add i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 1024
+ br i1 %exitcond, label %exit, label %loop
+
+exit:
+ ret void
+}
+
attributes #0 = { vscale_range(1,16) "target-features"="+sve" }
diff --git a/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll b/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll
new file mode 100644
index 00...
[truncated]
@llvm/pr-subscribers-vectorizers Author: John Brawn (john-brawn-arm)
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=false -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOMAX
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-REGS-VP
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-target-num-vector-regs=1 -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOREGS-VP
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-unknown-elf"
+; The use of the dotp instruction means we never have an i32 vector, so we don't
+; get any spills normally and with a reduced number of registers the number of
+; spills is small enough that it doesn't prevent use of a larger VF.
define i32 @dotp(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'dotp'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 5 (Estimated cost per lane: 0.3)
; CHECK-REGS-VP: LV: Selecting VF: vscale x 16.
;
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 8 because it uses too many registers
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 16 because it uses too many registers
-; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 14 (Estimated cost per lane: 1.8)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 13 (Estimated cost per lane: 0.8)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 16.
entry:
br label %for.body
@@ -24,8 +39,7 @@ for.body: ; preds = %for.body, %entry
%load.b = load i8, ptr %gep.b, align 1
%ext.b = zext i8 %load.b to i32
%mul = mul i32 %ext.b, %ext.a
- %sub = sub i32 0, %mul
- %add = add i32 %accum, %sub
+ %add = add i32 %accum, %mul
%iv.next = add i64 %iv, 1
%exitcond.not = icmp eq i64 %iv.next, 1024
br i1 %exitcond.not, label %for.exit, label %for.body
@@ -34,4 +48,70 @@ for.exit: ; preds = %for.body
ret i32 %add
}
+; The largest type used in the loop is small enough that we already consider all
+; VFs and maximize-bandwidth does nothing.
+define void @type_too_small(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'type_too_small'
+; CHECK: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK: Cost for VF vscale x 16: 6 (Estimated cost per lane: 0.4)
+; CHECK: LV: Selecting VF: vscale x 16.
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr i8, ptr %a, i64 %iv
+ %load.a = load i8, ptr %gep.a, align 1
+ %gep.b = getelementptr i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %add = add i8 %load.a, %load.b
+ store i8 %add, ptr %gep.a, align 1
+ %iv.next = add i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 1024
+ br i1 %exitcond, label %exit, label %loop
+
+exit:
+ ret void
+}
+
+; With reduced number of registers the spills from high pressure are enough that
+; we use the same VF as if we hadn't maximized the bandwidth.
+define void @high_pressure(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'high_pressure'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 10 (Estimated cost per lane: 1.2)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 21 (Estimated cost per lane: 1.3)
+; CHECK-REGS-VP: LV: Selecting VF: vscale x 8.
+
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 12 from 3 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 26 (Estimated cost per lane: 3.2)
+; CHECK-NOREGS-VP: LV(REG): Cost of 56 from 7 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 81 (Estimated cost per lane: 5.1)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr i32, ptr %a, i64 %iv
+ %load.a = load i32, ptr %gep.a, align 4
+ %gep.b = getelementptr i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %ext.b = zext i8 %load.b to i32
+ %add = add i32 %load.a, %ext.b
+ store i32 %add, ptr %gep.a, align 4
+ %iv.next = add i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 1024
+ br i1 %exitcond, label %exit, label %loop
+
+exit:
+ ret void
+}
+
attributes #0 = { vscale_range(1,16) "target-features"="+sve" }
diff --git a/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll b/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll
new file mode 100644
index 00...
[truncated]
|
The motivation for doing this is that I'm looking at enabling shouldConsiderVectorizationRegPressure on Arm Cortex-M CPUs with MVE, and the current behaviour makes things significantly worse in some cases due to preventing vectorization when it's beneficial. I've been specifically looking at the code we generate for https://github.com/ARM-software/CMSIS-DSP on Cortex-M55. If I enable vectorization register pressure then with the current behaviour the change in throughput is
With this PR the change in throughput is
The remaining regressions are due to the relative costs of interleave vs gather/scatter vs scalarize being wrong in some cases, which I'll be looking at next. I've also checked llvm-test-suite on Neoverse-V2 (AWS Graviton 4), where useMaxBandwidth is enabled for scalable vectors and so register pressure calculation is used, and there's zero change in code generation.
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed.
Ping |
lukel97
left a comment
Thanks for working on this, I remember this being discussed at the time the VPlan register pressure stuff initially landed. The load+store cost per spilled register heuristic seems sensible to me
Ping |
unsigned Spills = MaxUsers - AvailableRegs;
Type *SpillType = LargestType.at(RegClass);
Align Alignment = DL.getPrefTypeAlign(SpillType);
InstructionCost SpillCost =
I think we'd probably want this code to live as part of the default implementation of a getSpillCost() TTI hook, i.e. the default could be implemented in BasicTTIImpl.h. Some targets may simply want to return Invalid here in order to prevent any spilling or filling whatsoever. Also, spilling the largest type doesn't guarantee the most pessimistic cost. I haven't looked into this in detail, but I can imagine situations where spilling <16 x i1> is more expensive than <16 x i8>, simply because in the backend <16 x i1> is not a legal type and requires promotion first.
There's the getCostOfKeepingLiveOverCall hook in TTI which is basically the cost of spilling multiple types. We could rework that to be getSpillCost() for a single type, I think the only user of it is SLP currently IIRC.
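As a toy illustration of the hook shape being discussed here (the names and structure below are hypothetical, not the real TTI interface), a base default could compute a store-plus-reload estimate while a target override returns Invalid to forbid spilling entirely; std::optional stands in for LLVM's InstructionCost invalid state:

```cpp
#include <cassert>
#include <optional>

// Hypothetical sketch only: nullopt models InstructionCost's invalid state.
using Cost = std::optional<unsigned>;

struct BaseTTI {
  virtual ~BaseTTI() = default;
  // Default: assume one store plus one reload per spilled register.
  virtual Cost getSpillCost(unsigned StoreCost, unsigned LoadCost) const {
    return StoreCost + LoadCost;
  }
};

struct NoSpillTarget : BaseTTI {
  // A target that wants to prevent any spilling or filling whatsoever.
  Cost getSpillCost(unsigned, unsigned) const override { return std::nullopt; }
};
```

A VF whose spill cost comes back Invalid would then be discarded, matching the behaviour a no-spill target wants.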
Type *SpillType = LargestType.at(RegClass);
Align Alignment = DL.getPrefTypeAlign(SpillType);
InstructionCost SpillCost =
    Ctx.TTI.getMemoryOpCost(Instruction::Load, SpillType, Alignment, 0,
This code needs to be able to handle invalid costs being returned. See AArch64TTIImpl::getMemoryOpCost for an example of what happens when calculating the cost of load/store of <vscale x 16 x i1> types.
Given that you could encounter Invalid costs here, perhaps it might be easier to see the impact of these changes if you split this up into two PRs:
- Create an NFC patch to refactor the code so that we call spillCost, but spillCost always returns Invalid if MaxUsers > AvailableRegs. That way we can see what happens when Invalid is returned and make sure it behaves sensibly. In theory, it should be NFC because we'll just ignore this VF the same as before.
- Create a follow-on patch to add a better cost model, perhaps introducing a new TTI hook as suggested above.
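The two-step staging can be modelled in a few lines (illustrative names only, not the actual LLVM API; std::optional again stands in for InstructionCost's invalid state): a spillCost that returns Invalid whenever pressure exceeds the available registers poisons the VF's total cost, reproducing the old reject-on-pressure behaviour.

```cpp
#include <cassert>
#include <optional>

// Toy model of the staged approach; nullopt == Invalid.
using Cost = std::optional<unsigned>;

// Step 1 (NFC): return Invalid whenever pressure exceeds the available
// registers, so the VF is discarded exactly as before.
Cost spillCostNFC(unsigned MaxUsers, unsigned AvailableRegs) {
  if (MaxUsers > AvailableRegs)
    return std::nullopt;
  return 0u;
}

// Invalid propagates through addition, so an Invalid spill cost makes the
// whole VF cost Invalid and that VF is skipped.
Cost addCosts(Cost A, Cost B) {
  if (!A || !B)
    return std::nullopt;
  return *A + *B;
}
```

Step 2 would then replace the Invalid return with a real estimate without touching the propagation logic.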
unsigned ClassID = TTI.getRegisterClassForType(VF.isVector(), ScalarTy);
Type *SpillTy = IsScalar ? ScalarTy : VectorType::get(ScalarTy, VF);
Invariant[ClassID] += TTI.getRegUsageForType(SpillTy);
LargestTypes[Idx][ClassID] = MaxType(LargestTypes[Idx][ClassID], SpillTy);
As mentioned in an earlier comment, it's not obvious to me that the largest type has the greatest spill/fill cost. I think you might need to either:
- Track all types and add up the spill/fill cost for each type, or
- Calculate the largest possible spill/fill cost for each type here, then use the largest cost in the spillCost routine.
Actually, from some thinking about this, I don't think either the largest size or the largest cost is the right thing here. Taking this example:
void fn(char *src, long *dst, long n) {
  for (long i = 0; i < n; i++) {
    dst[i] += src[i];
  }
}
With VF 8 the largest (and most costly to spill) type is <vscale x 8 x i64>, which with -mcpu=neoverse-v2 -mllvm -force-target-num-vector-regs=4 gives
LV(REG): Cost of 32 from 4 spills of Generic::VectorRC
32 is the cost of 4 spills of <vscale x 8 x i64>, but what gets spilled is registers not types, i.e. what we want here is the cost of 4 z-register spills. I'm not sure what the best way to handle this is. Perhaps TargetTransformInfo should have a method that gives the spill cost for a generic register class, given that we're using these generic registers classes here to count how many registers are being used.
I've changed things to calculate the cost using the register class, by adding getRegisterClassSpillCost to TargetTransformInfo.
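A minimal sketch of what costing by register class rather than by widest type looks like. The class names match the debug output quoted above, but the per-class costs are made-up placeholders (the vector-class cost is chosen so that 4 spills of Generic::VectorRC come out at 32, as in the example); this is not what any real TTI implementation returns.

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch only: spills happen to physical registers, not to IR types, so the
// per-spill cost is looked up by register class, e.g. one z-register
// spill/fill on SVE regardless of the IR type that lived in it.
unsigned getRegisterClassSpillCost(const std::string &RegClass) {
  static const std::map<std::string, unsigned> Costs = {
      {"Generic::ScalarRC", 2}, // assumed: store + reload of one GPR
      {"Generic::VectorRC", 8}, // assumed: store + reload of one vector reg
  };
  auto It = Costs.find(RegClass);
  return It == Costs.end() ? 0 : It->second;
}

unsigned classSpillCost(const std::string &RegClass, unsigned MaxUsers,
                        unsigned AvailableRegs) {
  unsigned Spills = MaxUsers > AvailableRegs ? MaxUsers - AvailableRegs : 0;
  return Spills * getRegisterClassSpillCost(RegClass);
}
```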
Ping |
Ping |
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/11/builds/36090 Here is the relevant piece of the build log for reference:
probably needs ; REQUIRES: asserts here
mve-reg-pressure-spills.ll failed in https://lab.llvm.org/buildbot/#/builders/187/builds/18019, which enables expensive checks. Currently investigating; it looks like this may be unrelated to the compiler change here, and the test has exposed an existing problem.
See #182254, it might be that |
#187360 should fix the VPlanVerifier failure. |
Hi @john-brawn-arm, the 'llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll' test fails on the aarch64 cross builder https://lab.llvm.org/buildbot/#/builders/193/builds/15053 with the following errors: It looks like this is because of these changes. Would you take care of it? |
Fix for this in #187498
…vm#179646) Currently when considering register pressure is enabled, we reject any VF that has higher pressure than the number of registers. However, this can result in failing to vectorize in cases where it's beneficial, as the cost of the extra spills is less than the benefit we get from vectorizing. Deal with this by instead calculating the cost of spills and adding that to the rest of the cost, so we can detect this kind of situation and still vectorize while avoiding vectorizing in cases where the extra cost makes it not worth it.
Currently when considering register pressure is enabled, we reject any VF that has higher pressure than the number of registers. However, this can result in failing to vectorize in cases where it's beneficial, as the cost of the extra spills is less than the benefit we get from vectorizing.
Deal with this by instead calculating the cost of spills and adding that to the rest of the cost, so we can detect this kind of situation and still vectorize while avoiding vectorizing in cases where the extra cost makes it not worth it.
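The mechanics of the change can be sketched as follows. This is a simplified model, not the actual LLVM code: the struct fields, the flat per-spill cost, and the cost-per-lane tie-break are assumptions chosen to mirror the debug output quoted earlier in the thread.

```cpp
#include <cassert>

// Simplified model of the new behaviour: instead of rejecting a VF whose
// register pressure exceeds the available registers, estimate a spill cost
// and fold it into that VF's total cost.
struct VFCandidate {
  unsigned Lanes;    // vectorization factor
  unsigned LoopCost; // cost of one vector iteration, ignoring spills
  unsigned MaxUsers; // peak register pressure at this VF
};

unsigned totalCost(const VFCandidate &VF, unsigned AvailableRegs,
                   unsigned SpillFillCost) {
  unsigned Spills =
      VF.MaxUsers > AvailableRegs ? VF.MaxUsers - AvailableRegs : 0;
  return VF.LoopCost + Spills * SpillFillCost;
}

// Cost per lane decides between candidates, mirroring the
// "Estimated cost per lane" lines in the debug output.
double costPerLane(const VFCandidate &VF, unsigned AvailableRegs,
                   unsigned SpillFillCost) {
  return double(totalCost(VF, AvailableRegs, SpillFillCost)) / VF.Lanes;
}
```

Under this model a wider VF can still win despite spilling, as long as the spill penalty is smaller than the per-lane saving from the wider vectors.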