[VPlan] Add the cost of spills when considering register pressure#179646

Merged
john-brawn-arm merged 8 commits into llvm:main from john-brawn-arm:vectorize_spill_cost
Mar 18, 2026


@john-brawn-arm
Collaborator

Currently, when considering register pressure is enabled, we reject any VF whose pressure is higher than the number of available registers. However, this can result in failing to vectorize in cases where it would be beneficial, because the cost of the extra spills is less than the benefit we get from vectorizing.

Deal with this by instead calculating the cost of the spills and adding it to the rest of the cost. This lets us detect this kind of situation and still vectorize, while avoiding vectorization in cases where the extra cost makes it not worth it.
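
A minimal sketch of the idea, written as standalone C++ rather than against LLVM's headers. `RegClassUsage`, `SpillPairCost`, and the two `std::map` parameters are hypothetical stand-ins for illustration only; in the patch the per-spill cost is queried from the target's cost model instead of being stored alongside the usage. The sketch contrasts the old yes/no check with a cost that grows with the number of registers over the limit.

```cpp
#include <cassert>
#include <map>

// Hypothetical per-register-class summary (not an LLVM type).
struct RegClassUsage {
  unsigned MaxLiveRegs;   // peak number of simultaneously live registers
  unsigned SpillPairCost; // assumed cost of one store plus one reload
};

// Old behaviour: reject the VF outright if any register class needs more
// registers than are available.
bool exceedsMaxNumRegs(const std::map<unsigned, RegClassUsage> &Usage,
                       const std::map<unsigned, unsigned> &Available) {
  for (const auto &Entry : Usage)
    if (Entry.second.MaxLiveRegs > Available.at(Entry.first))
      return true;
  return false;
}

// New behaviour: assume each register past the limit costs one spill and one
// reload, and return the total so it can be added to the loop cost.
unsigned spillCost(const std::map<unsigned, RegClassUsage> &Usage,
                   const std::map<unsigned, unsigned> &Available) {
  unsigned Cost = 0;
  for (const auto &Entry : Usage) {
    unsigned Avail = Available.at(Entry.first);
    if (Entry.second.MaxLiveRegs > Avail)
      Cost += (Entry.second.MaxLiveRegs - Avail) * Entry.second.SpillPairCost;
  }
  return Cost;
}
```

For instance, a class that needs 8 registers when only 1 is available, with an assumed store-plus-reload cost of 8, is simply rejected by the old check, whereas the sketch returns 7 * 8 = 56 to be added to the cost of that VF.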

@llvmbot
Member

llvmbot commented Feb 4, 2026

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: John Brawn (john-brawn-arm)

Changes

Currently, when considering register pressure is enabled, we reject any VF whose pressure is higher than the number of available registers. However, this can result in failing to vectorize in cases where it would be beneficial, because the cost of the extra spills is less than the benefit we get from vectorizing.

Deal with this by instead calculating the cost of the spills and adding it to the rest of the cost. This lets us detect this kind of situation and still vectorize, while avoiding vectorization in cases where the extra cost makes it not worth it.


Patch is 34.53 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/179646.diff

7 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+2-1)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+21-23)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+56-14)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.h (+8-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll (+87-7)
  • (added) llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll (+266)
  • (modified) llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll (+2-2)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 44d4d92d4a7e2..06e8efef20c03 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -45,6 +45,7 @@ class OptimizationRemarkEmitter;
 class TargetTransformInfo;
 class TargetLibraryInfo;
 class VPRecipeBuilder;
+class VPRegisterUsage;
 struct VFRange;
 
 extern cl::opt<bool> EnableVPlanNativePath;
@@ -497,7 +498,7 @@ class LoopVectorizationPlanner {
   ///
   /// TODO: Move to VPlan::cost once the use of LoopVectorizationLegality has
   /// been retired.
-  InstructionCost cost(VPlan &Plan, ElementCount VF) const;
+  InstructionCost cost(VPlan &Plan, ElementCount VF, VPRegisterUsage *RU) const;
 
   /// Precompute costs for certain instructions using the legacy cost model. The
   /// function is used to bring up the VPlan-based cost model to initially avoid
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index abac45b265d10..492e716fd6ad2 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4247,13 +4247,6 @@ VectorizationFactor LoopVectorizationPlanner::selectVectorizationFactor() {
       if (VF.isScalar())
         continue;
 
-      /// If the register pressure needs to be considered for VF,
-      /// don't consider the VF as valid if it exceeds the number
-      /// of registers for the target.
-      if (CM.shouldConsiderRegPressureForVF(VF) &&
-          RUs[I].exceedsMaxNumRegs(TTI, ForceTargetNumVectorRegs))
-        continue;
-
       InstructionCost C = CM.expectedCost(VF);
 
       // Add on other costs that are modelled in VPlan, but not in the legacy
@@ -4302,6 +4295,10 @@ VectorizationFactor LoopVectorizationPlanner::selectVectorizationFactor() {
         }
       }
 
+      // Add the cost of any spills due to excess register usage
+      if (CM.shouldConsiderRegPressureForVF(VF))
+        C += RUs[I].spillCost(CostCtx, ForceTargetNumVectorRegs);
+
       VectorizationFactor Candidate(VF, C, ScalarCost.ScalarCost);
       unsigned Width =
           estimateElementCount(Candidate.Width, CM.getVScaleForTuning());
@@ -4687,13 +4684,16 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF,
   if (hasFindLastReductionPhi(Plan))
     return 1;
 
+  VPRegisterUsage R =
+      calculateRegisterUsageForPlan(Plan, {VF}, TTI, CM.ValuesToIgnore)[0];
+
   // If we did not calculate the cost for VF (because the user selected the VF)
   // then we calculate the cost of VF here.
   if (LoopCost == 0) {
     if (VF.isScalar())
       LoopCost = CM.expectedCost(VF);
     else
-      LoopCost = cost(Plan, VF);
+      LoopCost = cost(Plan, VF, &R);
     assert(LoopCost.isValid() && "Expected to have chosen a VF with valid cost");
 
     // Loop body is free and there is no need for interleaving.
@@ -4701,8 +4701,6 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF,
       return 1;
   }
 
-  VPRegisterUsage R =
-      calculateRegisterUsageForPlan(Plan, {VF}, TTI, CM.ValuesToIgnore)[0];
   // We divide by these constants so assume that we have at least one
   // instruction that uses at least one register.
   for (auto &Pair : R.MaxLocalUsers) {
@@ -7027,13 +7025,18 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
   return Cost;
 }
 
-InstructionCost LoopVectorizationPlanner::cost(VPlan &Plan,
-                                               ElementCount VF) const {
+InstructionCost LoopVectorizationPlanner::cost(VPlan &Plan, ElementCount VF,
+                                               VPRegisterUsage *RU) const {
   VPCostContext CostCtx(CM.TTI, *CM.TLI, Plan, CM, CM.CostKind, PSE, OrigLoop);
   InstructionCost Cost = precomputeCosts(Plan, VF, CostCtx);
 
   // Now compute and add the VPlan-based cost.
   Cost += Plan.cost(VF, CostCtx);
+
+  // Add the cost of spills due to excess register usage
+  if (CM.shouldConsiderRegPressureForVF(VF))
+    Cost += RU->spillCost(CostCtx, ForceTargetNumVectorRegs);
+
 #ifndef NDEBUG
   unsigned EstimatedWidth = estimateElementCount(VF, CM.getVScaleForTuning());
   LLVM_DEBUG(dbgs() << "Cost for VF " << VF << ": " << Cost
@@ -7233,9 +7236,10 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
                                P->vectorFactors().end());
 
     SmallVector<VPRegisterUsage, 8> RUs;
-    if (any_of(VFs, [this](ElementCount VF) {
-          return CM.shouldConsiderRegPressureForVF(VF);
-        }))
+    bool ConsiderRegPressure = any_of(VFs, [this](ElementCount VF) {
+      return CM.shouldConsiderRegPressureForVF(VF);
+    });
+    if (ConsiderRegPressure)
       RUs = calculateRegisterUsageForPlan(*P, VFs, TTI, CM.ValuesToIgnore);
 
     for (unsigned I = 0; I < VFs.size(); I++) {
@@ -7258,16 +7262,10 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
         continue;
       }
 
-      InstructionCost Cost = cost(*P, VF);
+      InstructionCost Cost =
+          cost(*P, VF, ConsiderRegPressure ? &RUs[I] : nullptr);
       VectorizationFactor CurrentFactor(VF, Cost, ScalarCost);
 
-      if (CM.shouldConsiderRegPressureForVF(VF) &&
-          RUs[I].exceedsMaxNumRegs(TTI, ForceTargetNumVectorRegs)) {
-        LLVM_DEBUG(dbgs() << "LV(REG): Not considering vector loop of width "
-                          << VF << " because it uses too many registers\n");
-        continue;
-      }
-
       if (isMoreProfitable(CurrentFactor, BestFactor, P->hasScalarTail()))
         BestFactor = CurrentFactor;
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 8fbe7d93e6f45..b8be1be79831e 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -16,6 +16,7 @@
 #include "llvm/ADT/TypeSwitch.h"
 #include "llvm/Analysis/ScalarEvolution.h"
 #include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/DataLayout.h"
 #include "llvm/IR/Instruction.h"
 #include "llvm/IR/PatternMatch.h"
 
@@ -389,13 +390,33 @@ bool VPDominatorTree::properlyDominates(const VPRecipeBase *A,
   return Base::properlyDominates(ParentA, ParentB);
 }
 
-bool VPRegisterUsage::exceedsMaxNumRegs(const TargetTransformInfo &TTI,
-                                        unsigned OverrideMaxNumRegs) const {
-  return any_of(MaxLocalUsers, [&TTI, &OverrideMaxNumRegs](auto &LU) {
-    return LU.second > (OverrideMaxNumRegs > 0
-                            ? OverrideMaxNumRegs
-                            : TTI.getNumberOfRegisters(LU.first));
-  });
+InstructionCost VPRegisterUsage::spillCost(VPCostContext &Ctx,
+                                           unsigned OverrideMaxNumRegs) const {
+  InstructionCost Cost;
+  DataLayout DL = Ctx.PSE.getSE()->getDataLayout();
+  for (const auto &Pair : MaxLocalUsers) {
+    unsigned AvailableRegs = OverrideMaxNumRegs > 0
+                                 ? OverrideMaxNumRegs
+                                 : Ctx.TTI.getNumberOfRegisters(Pair.first);
+    if (Pair.second > AvailableRegs) {
+      // Assume that for each register used past what's available we get one
+      // spill and reload of the largest type seen for that register class.
+      unsigned Spills = Pair.second - AvailableRegs;
+      Type *SpillType = LargestType.at(Pair.first);
+      Align Alignment = DL.getPrefTypeAlign(SpillType);
+      InstructionCost SpillCost =
+          Ctx.TTI.getMemoryOpCost(Instruction::Load, SpillType, Alignment, 0,
+                                  Ctx.CostKind) +
+          Ctx.TTI.getMemoryOpCost(Instruction::Store, SpillType, Alignment, 0,
+                                  Ctx.CostKind);
+      InstructionCost TotalCost = SpillCost * Spills;
+      LLVM_DEBUG(dbgs() << "LV(REG): Cost of " << TotalCost << " from "
+                        << Spills << " spills of "
+                        << Ctx.TTI.getRegisterClassName(Pair.first) << "\n");
+      Cost += TotalCost;
+    }
+  }
+  return Cost;
 }
 
 SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
@@ -479,6 +500,15 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
   SmallPtrSet<VPValue *, 8> OpenIntervals;
   SmallVector<VPRegisterUsage, 8> RUs(VFs.size());
   SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
+  SmallVector<SmallMapVector<unsigned, Type *, 4>, 8> LargestTypes(VFs.size());
+  auto MaxType = [](Type *CurMax, Type *T) {
+    if (!CurMax)
+      return T;
+    if (TypeSize::isKnownGT(T->getPrimitiveSizeInBits(),
+                            CurMax->getPrimitiveSizeInBits()))
+      return T;
+    return CurMax;
+  };
 
   LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
 
@@ -540,17 +570,19 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
             match(VPV, m_ExtractLastPart(m_VPValue())))
           continue;
 
+        Type *ScalarTy = TypeInfo.inferScalarType(VPV);
         if (VFs[J].isScalar() ||
             isa<VPCanonicalIVPHIRecipe, VPReplicateRecipe, VPDerivedIVRecipe,
                 VPEVLBasedIVPHIRecipe, VPScalarIVStepsRecipe>(VPV) ||
             (isa<VPInstruction>(VPV) && vputils::onlyScalarValuesUsed(VPV)) ||
             (isa<VPReductionPHIRecipe>(VPV) &&
              (cast<VPReductionPHIRecipe>(VPV))->isInLoop())) {
-          unsigned ClassID =
-              TTI.getRegisterClassForType(false, TypeInfo.inferScalarType(VPV));
+          unsigned ClassID = TTI.getRegisterClassForType(false, ScalarTy);
           // FIXME: The target might use more than one register for the type
           // even in the scalar case.
           RegUsage[ClassID] += 1;
+          LargestTypes[J][ClassID] =
+              MaxType(LargestTypes[J][ClassID], ScalarTy);
         } else {
           // The output from scaled phis and scaled reductions actually has
           // fewer lanes than the VF.
@@ -562,10 +594,12 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
             LLVM_DEBUG(dbgs() << "LV(REG): Scaled down VF from " << VFs[J]
                               << " to " << VF << " for " << *R << "\n";);
           }
-
-          Type *ScalarTy = TypeInfo.inferScalarType(VPV);
           unsigned ClassID = TTI.getRegisterClassForType(true, ScalarTy);
           RegUsage[ClassID] += GetRegUsage(ScalarTy, VF);
+          if (VectorType::isValidElementType(ScalarTy)) {
+            Type *T = VectorType::get(ScalarTy, VF);
+            LargestTypes[J][ClassID] = MaxType(LargestTypes[J][ClassID], T);
+          }
         }
       }
 
@@ -602,9 +636,11 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
       bool IsScalar = vputils::onlyScalarValuesUsed(In);
 
       ElementCount VF = IsScalar ? ElementCount::getFixed(1) : VFs[Idx];
-      unsigned ClassID = TTI.getRegisterClassForType(
-          VF.isVector(), TypeInfo.inferScalarType(In));
-      Invariant[ClassID] += GetRegUsage(TypeInfo.inferScalarType(In), VF);
+      Type *ScalarTy = TypeInfo.inferScalarType(In);
+      unsigned ClassID = TTI.getRegisterClassForType(VF.isVector(), ScalarTy);
+      Invariant[ClassID] += GetRegUsage(ScalarTy, VF);
+      Type *SpillTy = IsScalar ? ScalarTy : VectorType::get(ScalarTy, VF);
+      LargestTypes[Idx][ClassID] = MaxType(LargestTypes[Idx][ClassID], SpillTy);
     }
 
     LLVM_DEBUG({
@@ -623,10 +659,16 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
                << TTI.getRegisterClassName(pair.first) << ", " << pair.second
                << " registers\n";
       }
+      for (const auto &pair : LargestTypes[Idx]) {
+        dbgs() << "LV(REG): RegisterClass: "
+               << TTI.getRegisterClassName(pair.first) << ", " << *pair.second
+               << " is largest type potentially spilled\n";
+      }
     });
 
     RU.LoopInvariantRegs = Invariant;
     RU.MaxLocalUsers = MaxUsages[Idx];
+    RU.LargestType = LargestTypes[Idx];
     RUs[Idx] = RU;
   }
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
index dc4be4270f7f1..3affa211dd140 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
@@ -19,6 +19,7 @@ namespace llvm {
 class LLVMContext;
 class VPValue;
 class VPBlendRecipe;
+class VPCostContext;
 class VPInstruction;
 class VPWidenRecipe;
 class VPWidenCallRecipe;
@@ -30,6 +31,7 @@ class VPlan;
 class Value;
 class TargetTransformInfo;
 class Type;
+class InstructionCost;
 
 /// An analysis for type-inference for VPValues.
 /// It infers the scalar type for a given VPValue by bottom-up traversing
@@ -78,12 +80,14 @@ struct VPRegisterUsage {
   /// Holds the maximum number of concurrent live intervals in the loop.
   /// The key is ClassID of target-provided register class.
   SmallMapVector<unsigned, unsigned, 4> MaxLocalUsers;
+  /// Holds the largest type used in each register class.
+  SmallMapVector<unsigned, Type *, 4> LargestType;
 
-  /// Check if any of the tracked live intervals exceeds the number of
-  /// available registers for the target. If non-zero, OverrideMaxNumRegs
+  /// Calculate the estimated cost of any spills due to using more registers
+  /// than the number available for the target. If non-zero, OverrideMaxNumRegs
   /// is used in place of the target's number of registers.
-  bool exceedsMaxNumRegs(const TargetTransformInfo &TTI,
-                         unsigned OverrideMaxNumRegs = 0) const;
+  InstructionCost spillCost(VPCostContext &Ctx,
+                            unsigned OverrideMaxNumRegs = 0) const;
 };
 
 /// Estimate the register usage for \p Plan and vectorization factors in \p VFs
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll b/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
index 8109d0683fe71..2a4d16979e0d8 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
@@ -1,16 +1,31 @@
 ; REQUIRES: asserts
-; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth -debug-only=loop-vectorize -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK-REGS-VP
-; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth -debug-only=loop-vectorize -disable-output -force-target-num-vector-regs=1 -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK-NOREGS-VP
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=false -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOMAX
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-REGS-VP
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-target-num-vector-regs=1 -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOREGS-VP
 
 target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
 target triple = "aarch64-none-unknown-elf"
 
+; The use of the dotp instruction means we never have an i32 vector, so we don't
+; get any spills normally and with a reduced number of registers the number of
+; spills is small enough that it doesn't prevent use of a larger VF.
 define i32 @dotp(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'dotp'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 5 (Estimated cost per lane: 0.3)
 ; CHECK-REGS-VP: LV: Selecting VF: vscale x 16.
 ;
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 8 because it uses too many registers
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 16 because it uses too many registers
-; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 14 (Estimated cost per lane: 1.8)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 13 (Estimated cost per lane: 0.8)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 16.
 entry:
   br label %for.body
 
@@ -24,8 +39,7 @@ for.body:                                         ; preds = %for.body, %entry
   %load.b = load i8, ptr %gep.b, align 1
   %ext.b = zext i8 %load.b to i32
   %mul = mul i32 %ext.b, %ext.a
-  %sub = sub i32 0, %mul
-  %add = add i32 %accum, %sub
+  %add = add i32 %accum, %mul
   %iv.next = add i64 %iv, 1
   %exitcond.not = icmp eq i64 %iv.next, 1024
   br i1 %exitcond.not, label %for.exit, label %for.body
@@ -34,4 +48,70 @@ for.exit:                        ; preds = %for.body
   ret i32 %add
 }
 
+; The largest type used in the loop is small enough that we already consider all
+; VFs and maximize-bandwidth does nothing.
+define void @type_too_small(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'type_too_small'
+; CHECK: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK: Cost for VF vscale x 16: 6 (Estimated cost per lane: 0.4)
+; CHECK: LV: Selecting VF: vscale x 16.
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr i8, ptr %a, i64 %iv
+  %load.a = load i8, ptr %gep.a, align 1
+  %gep.b = getelementptr i8, ptr %b, i64 %iv
+  %load.b = load i8, ptr %gep.b, align 1
+  %add = add i8 %load.a, %load.b
+  store i8 %add, ptr %gep.a, align 1
+  %iv.next = add i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 1024
+  br i1 %exitcond, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+; With reduced number of registers the spills from high pressure are enough that
+; we use the same VF as if we hadn't maximized the bandwidth.
+define void @high_pressure(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'high_pressure'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 10 (Estimated cost per lane: 1.2)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 21 (Estimated cost per lane: 1.3)
+; CHECK-REGS-VP: LV: Selecting VF: vscale x 8.
+
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 12 from 3 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 26 (Estimated cost per lane: 3.2)
+; CHECK-NOREGS-VP: LV(REG): Cost of 56 from 7 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 81 (Estimated cost per lane: 5.1)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr i32, ptr %a, i64 %iv
+  %load.a = load i32, ptr %gep.a, align 4
+  %gep.b = getelementptr i8, ptr %b, i64 %iv
+  %load.b = load i8, ptr %gep.b, align 1
+  %ext.b = zext i8 %load.b to i32
+  %add = add i32 %load.a, %ext.b
+  store i32 %add, ptr %gep.a, align 4
+  %iv.next = add i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 1024
+  br i1 %exitcond, label %exit, label %loop
+
+exit:
+  ret void
+}
+
 attributes #0 = { vscale_range(1,16) "target-features"="+sve" }
diff --git a/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll b/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll
new file mode 100644
index 00...
[truncated]
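
The effect on VF selection can be illustrated with the costs checked in the AArch64 test above. The sketch below is a simplification: it assumes the choice is purely the lowest cost per estimated lane, with vscale taken as 1, whereas the real `isMoreProfitable` comparison may weigh other factors. `Candidate` and `pickLanes` are hypothetical names, not part of the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical candidate: one VF with its total cost (spill cost included).
struct Candidate {
  unsigned Lanes; // estimated element count, assuming vscale = 1
  unsigned Cost;  // total loop cost reported for this VF
};

// Return the lane count of the candidate with the lowest cost per lane.
// Cost/Lanes ratios are compared by cross-multiplication to stay in integers.
unsigned pickLanes(const std::vector<Candidate> &Cands) {
  assert(!Cands.empty() && "need at least one candidate");
  Candidate Best = Cands.front();
  for (const Candidate &C : Cands) {
    unsigned long long Lhs =
        static_cast<unsigned long long>(C.Cost) * Best.Lanes;
    unsigned long long Rhs =
        static_cast<unsigned long long>(Best.Cost) * C.Lanes;
    if (Lhs < Rhs)
      Best = C;
  }
  return Best.Lanes;
}
```

With the `dotp` costs under `-force-target-num-vector-regs=1` (6, 14, 13 for vscale x 4, 8, 16) the sketch picks 16 lanes, and with the `high_pressure` costs (6, 26, 81) it stays at 4, which agrees with the `Selecting VF` lines in the test.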

@llvmbot
Member

llvmbot commented Feb 4, 2026

@llvm/pr-subscribers-vectorizers

Author: John Brawn (john-brawn-arm)

                 VPEVLBasedIVPHIRecipe, VPScalarIVStepsRecipe>(VPV) ||
             (isa<VPInstruction>(VPV) && vputils::onlyScalarValuesUsed(VPV)) ||
             (isa<VPReductionPHIRecipe>(VPV) &&
              (cast<VPReductionPHIRecipe>(VPV))->isInLoop())) {
-          unsigned ClassID =
-              TTI.getRegisterClassForType(false, TypeInfo.inferScalarType(VPV));
+          unsigned ClassID = TTI.getRegisterClassForType(false, ScalarTy);
           // FIXME: The target might use more than one register for the type
           // even in the scalar case.
           RegUsage[ClassID] += 1;
+          LargestTypes[J][ClassID] =
+              MaxType(LargestTypes[J][ClassID], ScalarTy);
         } else {
           // The output from scaled phis and scaled reductions actually has
           // fewer lanes than the VF.
@@ -562,10 +594,12 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
             LLVM_DEBUG(dbgs() << "LV(REG): Scaled down VF from " << VFs[J]
                               << " to " << VF << " for " << *R << "\n";);
           }
-
-          Type *ScalarTy = TypeInfo.inferScalarType(VPV);
           unsigned ClassID = TTI.getRegisterClassForType(true, ScalarTy);
           RegUsage[ClassID] += GetRegUsage(ScalarTy, VF);
+          if (VectorType::isValidElementType(ScalarTy)) {
+            Type *T = VectorType::get(ScalarTy, VF);
+            LargestTypes[J][ClassID] = MaxType(LargestTypes[J][ClassID], T);
+          }
         }
       }
 
@@ -602,9 +636,11 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
       bool IsScalar = vputils::onlyScalarValuesUsed(In);
 
       ElementCount VF = IsScalar ? ElementCount::getFixed(1) : VFs[Idx];
-      unsigned ClassID = TTI.getRegisterClassForType(
-          VF.isVector(), TypeInfo.inferScalarType(In));
-      Invariant[ClassID] += GetRegUsage(TypeInfo.inferScalarType(In), VF);
+      Type *ScalarTy = TypeInfo.inferScalarType(In);
+      unsigned ClassID = TTI.getRegisterClassForType(VF.isVector(), ScalarTy);
+      Invariant[ClassID] += GetRegUsage(ScalarTy, VF);
+      Type *SpillTy = IsScalar ? ScalarTy : VectorType::get(ScalarTy, VF);
+      LargestTypes[Idx][ClassID] = MaxType(LargestTypes[Idx][ClassID], SpillTy);
     }
 
     LLVM_DEBUG({
@@ -623,10 +659,16 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
                << TTI.getRegisterClassName(pair.first) << ", " << pair.second
                << " registers\n";
       }
+      for (const auto &pair : LargestTypes[Idx]) {
+        dbgs() << "LV(REG): RegisterClass: "
+               << TTI.getRegisterClassName(pair.first) << ", " << *pair.second
+               << " is largest type potentially spilled\n";
+      }
     });
 
     RU.LoopInvariantRegs = Invariant;
     RU.MaxLocalUsers = MaxUsages[Idx];
+    RU.LargestType = LargestTypes[Idx];
     RUs[Idx] = RU;
   }
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
index dc4be4270f7f1..3affa211dd140 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
@@ -19,6 +19,7 @@ namespace llvm {
 class LLVMContext;
 class VPValue;
 class VPBlendRecipe;
+class VPCostContext;
 class VPInstruction;
 class VPWidenRecipe;
 class VPWidenCallRecipe;
@@ -30,6 +31,7 @@ class VPlan;
 class Value;
 class TargetTransformInfo;
 class Type;
+class InstructionCost;
 
 /// An analysis for type-inference for VPValues.
 /// It infers the scalar type for a given VPValue by bottom-up traversing
@@ -78,12 +80,14 @@ struct VPRegisterUsage {
   /// Holds the maximum number of concurrent live intervals in the loop.
   /// The key is ClassID of target-provided register class.
   SmallMapVector<unsigned, unsigned, 4> MaxLocalUsers;
+  /// Holds the largest type used in each register class.
+  SmallMapVector<unsigned, Type *, 4> LargestType;
 
-  /// Check if any of the tracked live intervals exceeds the number of
-  /// available registers for the target. If non-zero, OverrideMaxNumRegs
+  /// Calculate the estimated cost of any spills due to using more registers
+  /// than the number available for the target. If non-zero, OverrideMaxNumRegs
   /// is used in place of the target's number of registers.
-  bool exceedsMaxNumRegs(const TargetTransformInfo &TTI,
-                         unsigned OverrideMaxNumRegs = 0) const;
+  InstructionCost spillCost(VPCostContext &Ctx,
+                            unsigned OverrideMaxNumRegs = 0) const;
 };
 
 /// Estimate the register usage for \p Plan and vectorization factors in \p VFs
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll b/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
index 8109d0683fe71..2a4d16979e0d8 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
@@ -1,16 +1,31 @@
 ; REQUIRES: asserts
-; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth -debug-only=loop-vectorize -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK-REGS-VP
-; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth -debug-only=loop-vectorize -disable-output -force-target-num-vector-regs=1 -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK-NOREGS-VP
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=false -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOMAX
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-REGS-VP
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-target-num-vector-regs=1 -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOREGS-VP
 
 target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
 target triple = "aarch64-none-unknown-elf"
 
+; The use of the dotp instruction means we never have an i32 vector, so we don't
+; get any spills normally and with a reduced number of registers the number of
+; spills is small enough that it doesn't prevent use of a larger VF.
 define i32 @dotp(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'dotp'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 5 (Estimated cost per lane: 0.3)
 ; CHECK-REGS-VP: LV: Selecting VF: vscale x 16.
 ;
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 8 because it uses too many registers
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 16 because it uses too many registers
-; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 14 (Estimated cost per lane: 1.8)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 13 (Estimated cost per lane: 0.8)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 16.
 entry:
   br label %for.body
 
@@ -24,8 +39,7 @@ for.body:                                         ; preds = %for.body, %entry
   %load.b = load i8, ptr %gep.b, align 1
   %ext.b = zext i8 %load.b to i32
   %mul = mul i32 %ext.b, %ext.a
-  %sub = sub i32 0, %mul
-  %add = add i32 %accum, %sub
+  %add = add i32 %accum, %mul
   %iv.next = add i64 %iv, 1
   %exitcond.not = icmp eq i64 %iv.next, 1024
   br i1 %exitcond.not, label %for.exit, label %for.body
@@ -34,4 +48,70 @@ for.exit:                        ; preds = %for.body
   ret i32 %add
 }
 
+; The largest type used in the loop is small enough that we already consider all
+; VFs and maximize-bandwidth does nothing.
+define void @type_too_small(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'type_too_small'
+; CHECK: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK: Cost for VF vscale x 16: 6 (Estimated cost per lane: 0.4)
+; CHECK: LV: Selecting VF: vscale x 16.
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr i8, ptr %a, i64 %iv
+  %load.a = load i8, ptr %gep.a, align 1
+  %gep.b = getelementptr i8, ptr %b, i64 %iv
+  %load.b = load i8, ptr %gep.b, align 1
+  %add = add i8 %load.a, %load.b
+  store i8 %add, ptr %gep.a, align 1
+  %iv.next = add i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 1024
+  br i1 %exitcond, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+; With reduced number of registers the spills from high pressure are enough that
+; we use the same VF as if we hadn't maximized the bandwidth.
+define void @high_pressure(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'high_pressure'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 10 (Estimated cost per lane: 1.2)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 21 (Estimated cost per lane: 1.3)
+; CHECK-REGS-VP: LV: Selecting VF: vscale x 8.
+
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 12 from 3 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 26 (Estimated cost per lane: 3.2)
+; CHECK-NOREGS-VP: LV(REG): Cost of 56 from 7 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 81 (Estimated cost per lane: 5.1)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr i32, ptr %a, i64 %iv
+  %load.a = load i32, ptr %gep.a, align 4
+  %gep.b = getelementptr i8, ptr %b, i64 %iv
+  %load.b = load i8, ptr %gep.b, align 1
+  %ext.b = zext i8 %load.b to i32
+  %add = add i32 %load.a, %ext.b
+  store i32 %add, ptr %gep.a, align 4
+  %iv.next = add i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 1024
+  br i1 %exitcond, label %exit, label %loop
+
+exit:
+  ret void
+}
+
 attributes #0 = { vscale_range(1,16) "target-features"="+sve" }
diff --git a/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll b/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll
new file mode 100644
index 00...
[truncated]

@john-brawn-arm
Collaborator Author

The motivation for doing this is that I'm looking at enabling shouldConsiderVectorizationRegPressure on Arm Cortex-M CPUs with MVE, and the current behaviour makes things significantly worse in some cases because it prevents vectorization where it's beneficial. I've been specifically looking at the code we generate for https://github.com/ARM-software/CMSIS-DSP on Cortex-M55. If I enable register pressure consideration for vectorization, then with the current behaviour the change in throughput is:

| Function | Change |
| --- | --- |
| Filtering/arm_conv_partial_q31 | 3.64% |
| Filtering/arm_conv_q31 | 1.37% |
| Filtering/arm_correlate_q31 | -1.15% |
| Filtering/arm_fir_decimate_f32 | -0.92% |
| Filtering/arm_fir_decimate_q31 | -56.92% |
| Filtering/arm_fir_f32_16taps | 187.17% |
| Filtering/arm_fir_f32_4taps | 34.33% |
| Filtering/arm_fir_f32_8taps | 350.95% |
| Filtering/arm_fir_q31_16taps | -4.88% |
| Filtering/arm_fir_q31_4taps | 0.54% |
| Filtering/arm_fir_q31_8taps | -9.04% |
| Matrix/arm_mat_vec_mult_f16 | -52.49% |
| Matrix/arm_mat_vec_mult_f32 | -48.69% |
| Quaternion/arm_quaternion2rotation_f32 | -4.66% |
| Transform/arm_cfft_f16 | -7.94% |
| Transform/arm_rfft_fast_f16 | -0.39% |
| Transform/arm_rfft_fast_f32 | -8.86% |

With this PR the change in throughput is

| Function | Change |
| --- | --- |
| Filtering/arm_fir_f32_16taps | 187.17% |
| Filtering/arm_fir_f32_4taps | 34.33% |
| Filtering/arm_fir_f32_8taps | 350.95% |
| Filtering/arm_fir_q31_16taps | -4.88% |
| Filtering/arm_fir_q31_4taps | 0.54% |
| Filtering/arm_fir_q31_8taps | -9.04% |
| Quaternion/arm_quaternion2rotation_f32 | -4.66% |

The remaining regressions are due to the relative costs of interleave vs gather/scatter vs scalarize being wrong in some cases, which I'll be looking at next.

I've also checked llvm-test-suite on Neoverse-V2 (AWS Graviton 4), where useMaxBandwidth is enabled for scalable vectors and so register pressure calculation is used, and there's zero change in code generation.

@github-actions

github-actions bot commented Feb 4, 2026

🐧 Linux x64 Test Results

  • 192119 tests passed
  • 4916 tests skipped

✅ The build succeeded and all tests passed.

@john-brawn-arm
Collaborator Author

Ping

@SamTebbs33 SamTebbs33 removed their request for review February 10, 2026 15:07
Contributor

@lukel97 lukel97 left a comment

Thanks for working on this. I remember this being discussed when the VPlan register pressure work initially landed. The heuristic of one load+store per register over the limit seems sensible to me.
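
The heuristic under discussion can be sketched outside LLVM as follows. This is a minimal standalone illustration, not the code in the patch: `ClassUsage`, `estimateSpillCost` and the plain integer costs are hypothetical stand-ins for `MaxLocalUsers`, `spillCost` and `InstructionCost`. Every register needed beyond what the class provides is charged one store plus one reload.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one entry of MaxLocalUsers, together with the
// per-class memory costs that the patch obtains from TTI.getMemoryOpCost.
struct ClassUsage {
  unsigned MaxLive;        // peak number of simultaneously live registers
  unsigned Available;      // registers the target has in this class
  std::uint64_t LoadCost;  // assumed cost of reloading one spilled value
  std::uint64_t StoreCost; // assumed cost of spilling one value
};

// Charge one store and one reload for every register over the limit.
std::uint64_t estimateSpillCost(const std::vector<ClassUsage> &Classes) {
  std::uint64_t Cost = 0;
  for (const ClassUsage &C : Classes) {
    if (C.MaxLive <= C.Available)
      continue; // no excess pressure in this class, so no spills are assumed
    std::uint64_t Spills = C.MaxLive - C.Available;
    Cost += Spills * (C.LoadCost + C.StoreCost);
  }
  return Cost;
}
```

With one available register, three live values, and load and store costs of 1 each, this yields 4 for 2 spills, which has the same shape as the `Cost of 4 from 2 spills` debug line in the AArch64 test. The unit costs here are assumed rather than taken from the AArch64 cost model.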

@john-brawn-arm
Collaborator Author

Ping

unsigned Spills = MaxUsers - AvailableRegs;
Type *SpillType = LargestType.at(RegClass);
Align Alignment = DL.getPrefTypeAlign(SpillType);
InstructionCost SpillCost =
Contributor

I think we'd probably want this code to live as part of the default implementation of a getSpillCost() TTI hook, i.e. the default could be implemented in BasicTTIImpl.h. Some targets may simply want to return Invalid here in order to prevent any spilling or filling whatsoever. Also, spilling the largest type doesn't guarantee the most pessimistic cost. I haven't looked into this in detail, but I can imagine situations where spilling <16 x i1> is more expensive than <16 x i8>, simply because in the backend <16 x i1> is not a legal type and requires promotion first.
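
A rough sketch of what such a hook could look like, with a default implementation and a target that opts out of spilling entirely. All names here (`SpillCostModel`, `NoSpillTarget`, `loadCost`, `storeCost`) are hypothetical and are not LLVM's TTI interface; `std::nullopt` stands in for an Invalid `InstructionCost`.

```cpp
#include <cstdint>
#include <optional>

// std::nullopt plays the role of an invalid cost in this sketch.
using Cost = std::optional<std::uint64_t>;

// Hypothetical hook interface; the names are illustrative only.
struct SpillCostModel {
  virtual ~SpillCostModel() = default;
  virtual std::uint64_t loadCost() const { return 1; }
  virtual std::uint64_t storeCost() const { return 1; }
  // Default: one store plus one reload per spilled register.
  virtual Cost getSpillCost(unsigned NumSpills) const {
    return NumSpills * (loadCost() + storeCost());
  }
};

// A target that never wants the vectorizer to pick a VF that spills.
struct NoSpillTarget : SpillCostModel {
  Cost getSpillCost(unsigned NumSpills) const override {
    if (NumSpills == 0)
      return 0;
    return std::nullopt; // invalid: the caller should reject this VF
  }
};
```

Keeping the default in one place means targets only override the hook when load+store is a poor model for them, for example when a type needs promotion before it can be stored.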

Contributor

There's the getCostOfKeepingLiveOverCall hook in TTI, which is basically the cost of spilling multiple types. We could rework that into getSpillCost() for a single type; IIRC its only user is currently SLP.

Type *SpillType = LargestType.at(RegClass);
Align Alignment = DL.getPrefTypeAlign(SpillType);
InstructionCost SpillCost =
Ctx.TTI.getMemoryOpCost(Instruction::Load, SpillType, Alignment, 0,
Contributor

This code needs to be able to handle invalid costs being returned. See AArch64TTIImpl::getMemoryOpCost for an example of what happens when calculating the cost of load/store of <vscale x 16 x i1> types.

Given that you could encounter Invalid costs here, it might be easier to see the impact of these changes if you split this up into two PRs:

  1. Create a NFC patch to refactor the code so that we call spillCost, but spillCost always returns Invalid if MaxUsers > AvailableRegs. That way we can see what happens when Invalid is returned and make sure it behaves sensibly. In theory, it should be NFC because we'll just ignore this VF the same as before.
  2. Create a follow-on patch to add a better cost model, perhaps introducing a new TTI hook as suggested above.
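
The first step above can be sketched as follows. This is a simplified model rather than the vectorizer's actual selection logic: `Cost`, `addCosts`, `spillCostStep1`, `Candidate` and `pickVF` are hypothetical names, and `std::nullopt` models an Invalid cost that stays invalid through addition.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

using Cost = std::optional<std::uint64_t>; // nullopt models Invalid

// Invalid is sticky: adding anything to an invalid cost stays invalid.
Cost addCosts(Cost A, Cost B) {
  if (!A || !B)
    return std::nullopt;
  return *A + *B;
}

// Step 1 of the suggestion: any register overflow makes the cost invalid.
Cost spillCostStep1(unsigned MaxLive, unsigned Available) {
  if (MaxLive > Available)
    return std::nullopt;
  return 0;
}

struct Candidate {
  unsigned VF;
  std::uint64_t LoopCost;
  unsigned MaxLive;
};

// Pick the candidate with the lowest valid cost per lane; returns 0 if none.
unsigned pickVF(const std::vector<Candidate> &Cands, unsigned Available) {
  unsigned BestVF = 0;
  std::uint64_t BestCost = 0;
  for (const Candidate &C : Cands) {
    Cost Total = addCosts(C.LoopCost, spillCostStep1(C.MaxLive, Available));
    if (!Total)
      continue; // same effect as the old exceedsMaxNumRegs rejection
    // Compare Total/VF against BestCost/BestVF without dividing.
    if (BestVF == 0 || *Total * BestVF < BestCost * C.VF) {
      BestVF = C.VF;
      BestCost = *Total;
    }
  }
  return BestVF;
}
```

Because an over-limit VF always ends up with an invalid total, the set of VFs considered is the same as with the old boolean check, which is what would make such a refactoring NFC in this model.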

unsigned ClassID = TTI.getRegisterClassForType(VF.isVector(), ScalarTy);
Type *SpillTy = IsScalar ? ScalarTy : VectorType::get(ScalarTy, VF);
Invariant[ClassID] += TTI.getRegUsageForType(SpillTy);
LargestTypes[Idx][ClassID] = MaxType(LargestTypes[Idx][ClassID], SpillTy);
Contributor

As mentioned in an earlier comment, it's not obvious to me that the largest type has the greatest spill/fill cost. I think you might need to either:

  1. Track all types and add up the spill/fill cost for each type, or
  2. Calculate the largest possible spill/fill cost for each type here, then use the largest cost in the spillCost routine.
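
The two options can give different totals, as this small sketch shows. `TypeSpill`, `sumPerType` and `maxCostTimesSpills` are hypothetical names, and how spills would be attributed to individual types in option 1 is an assumption made only for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record for one type seen in a register class: how many spills
// are attributed to it, and its assumed store+reload cost.
struct TypeSpill {
  unsigned Spills;
  std::uint64_t CostPerSpill;
};

// Option 1: track every type and add up the spill/fill cost of each.
std::uint64_t sumPerType(const std::vector<TypeSpill> &Types) {
  std::uint64_t Total = 0;
  for (const TypeSpill &T : Types)
    Total += T.Spills * T.CostPerSpill;
  return Total;
}

// Option 2: find the most expensive type and charge it for every spill.
std::uint64_t maxCostTimesSpills(const std::vector<TypeSpill> &Types) {
  std::uint64_t MaxCost = 0;
  unsigned Spills = 0;
  for (const TypeSpill &T : Types) {
    MaxCost = std::max(MaxCost, T.CostPerSpill);
    Spills += T.Spills;
  }
  return MaxCost * Spills;
}
```

Option 2 is never lower than option 1, so it is the more pessimistic of the two; neither depends on which type happens to be largest in bits.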

Collaborator Author

Actually, having thought about this some more, I don't think either largest size or largest cost is the right thing here. Take this example:

void fn(char *src, long *dst, long n) {
  for (long i = 0; i < n; i++) {
    dst[i] += src[i];
  }
}

With VF 8 the largest (and most costly to spill) type is <vscale x 8 x i64>, which with -mcpu=neoverse-v2 -mllvm -force-target-num-vector-regs=4 gives

LV(REG): Cost of 32 from 4 spills of Generic::VectorRC

32 is the cost of 4 spills of <vscale x 8 x i64>, but what gets spilled is registers, not types, i.e. what we want here is the cost of 4 z-register spills. I'm not sure what the best way to handle this is. Perhaps TargetTransformInfo should have a method that gives the spill cost for a generic register class, given that we're using these generic register classes here to count how many registers are being used.
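
The unit mismatch can be made concrete with a little arithmetic. The sketch below assumes SVE's 128-bit minimum z-register width; `regsForType` and `costPerSpill` are hypothetical helpers, not LLVM functions.

```cpp
#include <cstdint>

// Registers occupied by a vector type, using known-minimum sizes. For a
// scalable type the vscale factor applies to both sides and cancels out.
constexpr unsigned regsForType(unsigned MinElts, unsigned EltBits,
                               unsigned MinRegBits) {
  return (MinElts * EltBits + MinRegBits - 1) / MinRegBits;
}

// Per-spill cost implied by a "Cost of <Total> from <Spills> spills" line.
constexpr std::uint64_t costPerSpill(std::uint64_t Total,
                                     std::uint64_t Spills) {
  return Total / Spills;
}

// <vscale x 8 x i64> against a z-register with a 128-bit minimum width.
static_assert(regsForType(8, 64, 128) == 4, "type spans four z-registers");
// The debug line above: 32 in total for 4 spills.
static_assert(costPerSpill(32, 4) == 8, "each spill is charged 8");
```

So each of the 4 counted registers is charged as if it were a whole `<vscale x 8 x i64>` value, which itself spans 4 registers.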

Collaborator Author

I've changed things to calculate the cost using the register class, by adding getRegisterClassSpillCost to TargetTransformInfo.

@llvmbot llvmbot added the llvm:analysis Includes value tracking, cost tables and constant folding label Feb 26, 2026
@john-brawn-arm
Collaborator Author

Ping

@john-brawn-arm
Collaborator Author

Ping

Contributor

@lukel97 lukel97 left a comment

LGTM

@john-brawn-arm john-brawn-arm merged commit a083e19 into llvm:main Mar 18, 2026
10 checks passed
@john-brawn-arm john-brawn-arm deleted the vectorize_spill_cost branch March 18, 2026 15:30
@llvm-ci

llvm-ci commented Mar 18, 2026

LLVM Buildbot has detected a new failure on builder fuchsia-x86_64-linux running on fuchsia-debian-64-us-central1-b-1 while building llvm at step 4 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/11/builds/36090

Here is the relevant piece of the build log for the reference
Step 4 (annotate) failure: 'python ../llvm-zorg/zorg/buildbot/builders/annotated/fuchsia-linux.py ...' (failure)
...
  Passed           : 49744 (97.00%)
  Expectedly Failed:    24 (0.05%)
[1513/1515] Linking CXX executable unittests/Frontend/LLVMFrontendTests
[1514/1515] Running the LLVM regression tests
llvm-lit: /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/llvm/utils/lit/lit/llvm/config.py:569: note: using ld.lld: /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-0b3j3h2m/bin/ld.lld
llvm-lit: /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/llvm/utils/lit/lit/llvm/config.py:569: note: using lld-link: /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-0b3j3h2m/bin/lld-link
llvm-lit: /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/llvm/utils/lit/lit/llvm/config.py:569: note: using ld64.lld: /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-0b3j3h2m/bin/ld64.lld
llvm-lit: /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/llvm/utils/lit/lit/llvm/config.py:569: note: using wasm-ld: /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-0b3j3h2m/bin/wasm-ld
-- Testing: 64189 tests, 60 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70..
FAIL: LLVM :: Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll (51242 of 64189)
******************** TEST 'LLVM :: Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 1
/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-0b3j3h2m/bin/opt -mcpu=cortex-m55 -passes=loop-vectorize -disable-output -debug-only=loop-vectorize,vplan -vectorizer-consider-reg-pressure=false /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll 2>&1 | /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-0b3j3h2m/bin/FileCheck /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll --check-prefixes=CHECK,CHECK-NOPRESSURE
# executed command: /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-0b3j3h2m/bin/opt -mcpu=cortex-m55 -passes=loop-vectorize -disable-output -debug-only=loop-vectorize,vplan -vectorizer-consider-reg-pressure=false /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll
# note: command had no output on stdout or stderr
# error: command failed with exit status: 1
# executed command: /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-0b3j3h2m/bin/FileCheck /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll --check-prefixes=CHECK,CHECK-NOPRESSURE
# .---command stderr------------
# | /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll:10:16: error: CHECK-LABEL: expected string not found in input
# | ; CHECK-LABEL: LV: Checking a loop in 'spills_not_profitable'
# |                ^
# | <stdin>:1:1: note: scanning from here
# | opt: Unknown command line argument '-debug-only=loop-vectorize,vplan'. Try: '/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-0b3j3h2m/bin/opt --help'
# | ^
# | <stdin>:1:134: note: possible intended match here
# | opt: Unknown command line argument '-debug-only=loop-vectorize,vplan'. Try: '/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-0b3j3h2m/bin/opt --help'
# |                                                                                                                                      ^
# | 
# | Input file: <stdin>
# | Check file: /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |             1: opt: Unknown command line argument '-debug-only=loop-vectorize,vplan'. Try: '/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-0b3j3h2m/bin/opt --help' 
# | label:10'0     X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found
# | label:10'1                                                                                                                                          ?                         possible intended match
# |             2: opt: Did you mean '--debug-pass=loop-vectorize,vplan'? 
# | label:10'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# | >>>>>>
# `-----------------------------
# error: command failed with exit status: 1

Step 7 (check) failure: check (failure)


Contributor

probably needs ; REQUIRES: asserts here

Collaborator Author

Fixed in #187316

@john-brawn-arm
Collaborator Author

mve-reg-pressure-spills.ll failed in https://lab.llvm.org/buildbot/#/builders/187/builds/18019, which enables expensive checks. The failure is

LastActiveLane operand vp<%active.lane.mask> must be prefix mask (a header mask or an EVL-derived mask currently)
LLVM ERROR: Broken VPlan found, compilation aborted!

I'm currently investigating. It looks like this may be unrelated to the compiler change here, and the test has exposed an existing problem.

@lukel97
Contributor

lukel97 commented Mar 18, 2026

> mve-reg-pressure-spills.ll failed in https://lab.llvm.org/buildbot/#/builders/187/builds/18019, which enables expensive checks. The failure is
>
> LastActiveLane operand vp<%active.lane.mask> must be prefix mask (a header mask or an EVL-derived mask currently)
> LLVM ERROR: Broken VPlan found, compilation aborted!
>
> Currently investigating, looks like maybe this is unrelated to the compiler change here and the test has exposed an existing problem.

See #182254. It might be that VPlanVerifier::verifyLastActiveLaneRecipe/isKnownMonotonic needs to be updated to handle another case.

@john-brawn-arm
Collaborator Author

#187360 should fix the VPlanVerifier failure.

@vvereschaka
Contributor

Hi @john-brawn-arm ,

the 'llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll' test fails on the aarch64 cross builder https://lab.llvm.org/buildbot/#/builders/193/builds/15053 with the following errors:

# .---command stderr------------
# | C:\buildbot\as-builder-2\x-aarch64\llvm-project\llvm\test\Transforms\LoopVectorize\AArch64\maxbandwidth-regpressure.ll:87:18: error: CHECK-REGS-VP: expected string not found in input
# | ; CHECK-REGS-VP: Cost for VF vscale x 8: 10 (Estimated cost per lane: 1.2)
# |                  ^
# | <stdin>:2813:57: note: scanning from here
# | Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
# |                                                         ^
# | <stdin>:2845:1: note: possible intended match here
# | Cost for VF vscale x 8: 10 (Estimated cost per lane: 1.3)
# | ^
# | 
# | Input file: <stdin>
# | Check file: C:\buildbot\as-builder-2\x-aarch64\llvm-project\llvm\test\Transforms\LoopVectorize\AArch64\maxbandwidth-regpressure.ll
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
...

https://lab.llvm.org/buildbot/#/builders/193/builds/15053/steps/9/logs/FAIL__LLVM__maxbandwidth-regpressure_ll

It looks like this is caused by these changes.

Would you take care of it?

@john-brawn-arm
Collaborator Author

A fix for this is in #187498.
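One reading of the 1.2 versus 1.3 mismatch in the log above is a rounding tie. The sketch below assumes that the per-lane estimate is the cost divided by the lane count, printed to one decimal place, and that vscale is taken as 1, so that 10 / 8 = 1.25 sits exactly halfway between 1.2 and 1.3. None of this is confirmed by the thread, and the actual cause and fix are whatever #187498 contains. The helper name `per_lane` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def per_lane(cost: int, lanes: int, rounding: str) -> str:
    """Divide cost by lanes and round to one decimal with the given tie rule."""
    value = Decimal(cost) / Decimal(lanes)
    return str(value.quantize(Decimal("0.1"), rounding=rounding))


# 10 / 8 = 1.25 is a tie, so the printed digit depends on the tie rule.
print(per_lane(10, 8, ROUND_HALF_EVEN))  # 1.2 (what the CHECK line expects)
print(per_lane(10, 8, ROUND_HALF_UP))    # 1.3 (what the builder printed)

# 6 / 4 = 1.5 needs no rounding, so both rules agree.
print(per_lane(6, 4, ROUND_HALF_EVEN))   # 1.5
print(per_lane(6, 4, ROUND_HALF_UP))     # 1.5
```

If this reading is right, a check that matches the exact printed fraction is sensitive to how the host formats ties, while the integer cost (10) is not.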

albertbolt1 pushed a commit to albertbolt1/llvm-project that referenced this pull request Mar 28, 2026
…vm#179646)

Currently when considering register pressure is enabled, we reject any
VF that has higher pressure than the number of registers. However this
can result in failing to vectorize in cases where it's beneficial, as
the cost of the extra spills is less than the benefit we get from
vectorizing.

Deal with this by instead calculating the cost of spills and adding that
to the rest of the cost, so we can detect this kind of situation and
still vectorize while avoiding vectorizing in cases where the extra cost
makes it not worth it.
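
The difference between the two policies described in this commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not LLVM's implementation: the function `pick_vf`, the flat `cost_per_spill`, and every number in the candidate list are hypothetical and chosen purely for illustration.

```python
from typing import List, Optional, Tuple

# Each candidate is (vf, base_cost, register_pressure). All values are made up.
Candidate = Tuple[int, int, int]


def pick_vf(candidates: List[Candidate], num_regs: int,
            cost_per_spill: int, reject_on_pressure: bool) -> Optional[int]:
    """Return the VF with the lowest cost per lane under one of two policies.

    reject_on_pressure=True models the old behaviour: any VF whose pressure
    exceeds num_regs is discarded outright.
    reject_on_pressure=False models the new behaviour: excess pressure is
    converted into an extra spill cost and added to the base cost.
    """
    best_vf, best_per_lane = None, None
    for vf, base_cost, pressure in candidates:
        excess = max(0, pressure - num_regs)
        if reject_on_pressure and excess > 0:
            continue
        total = base_cost + (0 if reject_on_pressure else excess * cost_per_spill)
        per_lane = total / vf
        if best_per_lane is None or per_lane < best_per_lane:
            best_vf, best_per_lane = vf, per_lane
    return best_vf


candidates = [
    (1, 8, 4),    # scalar loop: 8.0 per lane, fits in registers
    (4, 12, 34),  # 2 registers over budget: (12 + 2*2) / 4 = 4.0 per lane
    (8, 20, 60),  # 28 registers over budget: (20 + 28*2) / 8 = 9.5 per lane
]

print(pick_vf(candidates, num_regs=32, cost_per_spill=2, reject_on_pressure=True))
print(pick_vf(candidates, num_regs=32, cost_per_spill=2, reject_on_pressure=False))
```

With these made-up numbers the rejecting policy falls back to VF 1, while the spill-cost policy picks VF 4 (a few spills are cheaper than staying scalar) and still avoids VF 8, where the spill cost outweighs the benefit.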

Labels

llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms vectorizers

7 participants