[LoopVectorizer] Only check register pressure for VFs that have been enabled via maxBandwidth by NickGuy-Arm · Pull Request #149056 · llvm/llvm-project

NickGuy-Arm · 2025-07-16T10:11:49Z

Currently if MaxBandwidth is enabled, the register pressure is checked for each VF. This changes that to only perform said check if the VF would not have otherwise been considered by the LoopVectorizer if maxBandwidth was not enabled.

Theoretically this allows for higher VFs to be considered than would otherwise be deemed "safe" (from a regpressure perspective), but more concretely this reduces the amount of work done at compile-time when maxBandwidth is enabled.

…enabled via maxBandwidth

llvmbot · 2025-07-16T10:12:20Z

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Nicholas Guy (NickGuy-Arm)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/149056.diff

4 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+27-11)
(modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+6-3)
(modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.h (+5-2)
(added) llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll (+40)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index e7bae17dd2ceb..682b2f949bdd1 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -964,9 +964,8 @@ class LoopVectorizationCostModel {
   /// user options, for the given register kind.
   bool useMaxBandwidth(TargetTransformInfo::RegisterKind RegKind);
 
-  /// \return True if maximizing vector bandwidth is enabled by the target or
-  /// user options, for the given vector factor.
-  bool useMaxBandwidth(ElementCount VF);
+  /// \return True if register pressure should be calculated for the given VF.
+  bool shouldCalculateRegPressureForVF(ElementCount VF);
 
   /// \return The size (in bits) of the smallest and widest types in the code
   /// that needs to be vectorized. We ignore values that remain scalar such as
@@ -1753,6 +1752,9 @@ class LoopVectorizationCostModel {
   /// Whether this loop should be optimized for size based on function attribute
   /// or profile information.
   bool OptForSize;
+
+  /// The highest VF possible for this loop, without using MaxBandwidth.
+  FixedScalableVFPair MaxPermissibleVFWithoutMaxBW;
 };
 } // end namespace llvm
 
@@ -3943,10 +3945,16 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   return FixedScalableVFPair::getNone();
 }
 
-bool LoopVectorizationCostModel::useMaxBandwidth(ElementCount VF) {
-  return useMaxBandwidth(VF.isScalable()
-                             ? TargetTransformInfo::RGK_ScalableVector
-                             : TargetTransformInfo::RGK_FixedWidthVector);
+bool LoopVectorizationCostModel::shouldCalculateRegPressureForVF(
+    ElementCount VF) {
+  if (!useMaxBandwidth(VF.isScalable()
+                           ? TargetTransformInfo::RGK_ScalableVector
+                           : TargetTransformInfo::RGK_FixedWidthVector))
+    return false;
+  // Only calculate register pressure for VFs enabled by MaxBandwidth.
+  return ElementCount::isKnownGT(
+      VF, VF.isScalable() ? MaxPermissibleVFWithoutMaxBW.ScalableVF
+                          : MaxPermissibleVFWithoutMaxBW.FixedVF);
 }
 
 bool LoopVectorizationCostModel::useMaxBandwidth(
@@ -4022,6 +4030,12 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
       ComputeScalableMaxVF ? TargetTransformInfo::RGK_ScalableVector
                            : TargetTransformInfo::RGK_FixedWidthVector;
   ElementCount MaxVF = MaxVectorElementCount;
+
+  if (MaxVF.isScalable())
+    MaxPermissibleVFWithoutMaxBW.ScalableVF = MaxVF;
+  else
+    MaxPermissibleVFWithoutMaxBW.FixedVF = MaxVF;
+
   if (useMaxBandwidth(RegKind)) {
     auto MaxVectorElementCountMaxBW = ElementCount::get(
         llvm::bit_floor(WidestRegister.getKnownMinValue() / SmallestType),
@@ -4375,9 +4389,10 @@ VectorizationFactor LoopVectorizationPlanner::selectVectorizationFactor() {
       if (VF.isScalar())
         continue;
 
-      /// Don't consider the VF if it exceeds the number of registers for the
-      /// target.
-      if (CM.useMaxBandwidth(VF) && RUs[I].exceedsMaxNumRegs(TTI))
+      /// If the VF was proposed due to MaxBandwidth, don't consider the VF if
+      /// it exceeds the number of registers for the target.
+      if (CM.shouldCalculateRegPressureForVF(VF) &&
+          RUs[I].exceedsMaxNumRegs(TTI, ForceTargetNumVectorRegs))
         continue;
 
       InstructionCost C = CM.expectedCost(VF);
@@ -7155,7 +7170,8 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
       InstructionCost Cost = cost(*P, VF);
       VectorizationFactor CurrentFactor(VF, Cost, ScalarCost);
 
-      if (CM.useMaxBandwidth(VF) && RUs[I].exceedsMaxNumRegs(TTI)) {
+      if (CM.shouldCalculateRegPressureForVF(VF) &&
+          RUs[I].exceedsMaxNumRegs(TTI, ForceTargetNumVectorRegs)) {
         LLVM_DEBUG(dbgs() << "LV(REG): Not considering vector loop of width "
                           << VF << " because it uses too many registers\n");
         continue;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 92db9674ef42b..3ab18b2fe6438 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -405,9 +405,12 @@ static unsigned getVFScaleFactor(VPRecipeBase *R) {
   return 1;
 }
 
-bool VPRegisterUsage::exceedsMaxNumRegs(const TargetTransformInfo &TTI) const {
-  return any_of(MaxLocalUsers, [&TTI](auto &LU) {
-    return LU.second > TTI.getNumberOfRegisters(LU.first);
+bool VPRegisterUsage::exceedsMaxNumRegs(const TargetTransformInfo &TTI,
+                                        unsigned OverrideMaxNumRegs) const {
+  return any_of(MaxLocalUsers, [&TTI, &OverrideMaxNumRegs](auto &LU) {
+    return LU.second > (OverrideMaxNumRegs > 0
+                            ? OverrideMaxNumRegs
+                            : TTI.getNumberOfRegisters(LU.first));
   });
 }
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
index 7bcf9dba8c311..3be2af2226b42 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
@@ -85,8 +85,11 @@ struct VPRegisterUsage {
   SmallMapVector<unsigned, unsigned, 4> MaxLocalUsers;
 
   /// Check if any of the tracked live intervals exceeds the number of
-  /// available registers for the target.
-  bool exceedsMaxNumRegs(const TargetTransformInfo &TTI) const;
+  /// available registers for the target. Specifying OverrideMaxNumRegs
+  /// to be non-zero will cause that number to be used in place of the
+  /// number of available registers.
+  bool exceedsMaxNumRegs(const TargetTransformInfo &TTI,
+                         unsigned OverrideMaxNumRegs = 0) const;
 };
 
 /// Estimate the register usage for \p Plan and vectorization factors in \p VFs
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll b/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
new file mode 100644
index 0000000000000..f7a6ccfc74b16
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
@@ -0,0 +1,40 @@
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth -debug-only=loop-vectorize -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK-REGS-VP
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth -debug-only=loop-vectorize -disable-output -force-target-num-vector-regs=1 -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK-NOREGS-VP
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64-none-unknown-elf"
+
+define i32 @dotp(ptr %a, ptr %b) #0 {
+; CHECK-REGS-VP-NOT: LV(REG): Not considering vector loop of width vscale x 16 because it uses too many registers
+; CHECK-REGS-VP: LV: Selecting VF: vscale x 8.
+;
+; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 16 because it uses too many registers
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+entry:
+  br label %for.body
+
+for.body:                                         ; preds = %for.body, %entry
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
+  %accum = phi i32 [ 0, %entry ], [ %add, %for.body ]
+  %gep.a = getelementptr i8, ptr %a, i64 %iv
+  %load.a = load i8, ptr %gep.a, align 1
+  %ext.a = zext i8 %load.a to i32
+  %gep.b = getelementptr i8, ptr %b, i64 %iv
+  %load.b = load i8, ptr %gep.b, align 1
+  %ext.b = zext i8 %load.b to i32
+  %mul = mul i32 %ext.b, %ext.a
+  %sub = sub i32 0, %mul
+  %add = add i32 %accum, %sub
+  %iv.next = add i64 %iv, 1
+  %exitcond.not = icmp eq i64 %iv.next, 1024
+  br i1 %exitcond.not, label %for.exit, label %for.body
+
+for.exit:                        ; preds = %for.body
+  ret i32 %add
+}
+
+!7 = distinct !{!7, !8, !9, !10}
+!8 = !{!"llvm.loop.mustprogress"}
+!9 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
+!10 = !{!"llvm.loop.vectorize.enable", i1 true}
+attributes #0 = { vscale_range(1,16) "target-features"="+sve" }

llvm/lib/Transforms/Vectorize/VPlanAnalysis.h

fhahn

Theoretically this allows for higher VFs to be considered than would otherwise be deemed "safe" (from a regpressure perspective), but more concretely this reduces the amount of work done at compile-time when maxBandwidth is enabled.

Is there any expected perf-impact from the change? If so, is there any perf data you could share?

llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll

NickGuy-Arm · 2025-07-17T12:18:51Z

Theoretically this allows for higher VFs to be considered than would otherwise be deemed "safe" (from a regpressure perspective), but more concretely this reduces the amount of work done at compile-time when maxBandwidth is enabled.

Is there any expected perf-impact from the change? If so, is there any perf data you could share?

No runtime perf-impact has been observed, but costing a loop with e.g. VF=2 should be the same regardless of whether maxBandwidth is enabled, whereas currently such a VF may be discarded with maxBandwidth enabled.

As for compile-time impact, we have observed some small improvements (up to 2.5%) on some workloads when MaxBandwidth is enabled with this change vs without.

fhahn · 2025-07-18T08:28:59Z

llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp

-  return any_of(MaxLocalUsers, [&TTI](auto &LU) {
-    return LU.second > TTI.getNumberOfRegisters(LU.first);
+bool VPRegisterUsage::exceedsMaxNumRegs(const TargetTransformInfo &TTI,
+                                        unsigned OverrideMaxNumRegs) const {


This change is independent of updated shouldCalculateRegPressureForVF? If so, might have been good to do separately.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

In llvm#149056 VF pruning was changed so that it only pruned VFs that stemmed from MaxBandwidth being enabled. However we always compute register pressure regardless of whether or not max bandwidth produced any new VFs. This skips the computation if not needed and renames the method for clarity. The diff in reg-usage.ll is due to the scalable VPlan not actually having any maxbandwidth VFs, so I've changed it to check the fixed-length VF instead, which is affected by maxbandwidth.

In #149056 VF pruning was changed so that it only pruned VFs that stemmed from MaxBandwidth being enabled. However we always compute register pressure regardless of whether or not max bandwidth is permitted for any VFs (via `MaxPermissibleVFWithoutMaxBW`). This skips the computation if not needed and renames the method for clarity. The diff in reg-usage.ll is due to the scalable VPlan not actually having any maxbandwidth VFs, so I've changed it to check the fixed-length VF instead, which is affected by maxbandwidth.

Stacked on #156923 In https://godbolt.org/z/8svWaredK, we spill a lot on RISC-V because whilst the largest element type is i8, we generate a bunch of pointer vectors for gathers and scatters. This means the VF chosen is quite high e.g. <vscale x 16 x i8>, but we end up using a bunch of <vscale x 16 x i64> m8 registers for the pointers. This was briefly fixed by #132190 where we computed register pressure in VPlan and used it to prune VFs that were likely to spill. The legacy cost model wasn't able to do this pruning because it didn't have visibility into the pointer vectors that were needed for the gathers/scatters. However VF pruning was restricted again to just the case when max bandwidth was enabled in #141736 to avoid an AArch64 regression, and restricted again in #149056 to only prune VFs that had max bandwidth enabled. On RISC-V we take advantage of register grouping for performance and choose a default of LMUL 2, which means there are 16 registers to work with – half the number as SVE, so we encounter higher register pressure more frequently. As such, we likely want to always consider pruning VFs with high register pressure and not just the VFs from max bandwidth. This adds a TTI hook to opt into this behaviour for RISC-V which fixes the motivating godbolt example above. When last checked this significantly reduces the number of spills on SPEC CPU 2017, up to 80% on 538.imagick_r.

…enabled via maxBandwidth (llvm#149056) Currently if MaxBandwidth is enabled, the register pressure is checked for each VF. This changes that to only perform said check if the VF would not have otherwise been considered by the LoopVectorizer if maxBandwidth was not enabled. Theoretically this allows for higher VFs to be considered than would otherwise be deemed "safe" (from a regpressure perspective), but more concretely this reduces the amount of work done at compile-time when maxBandwidth is enabled. (cherry-picked from 0f37b1f8909790ceda0e85789345bcf40ea0184b)

In llvm#149056 VF pruning was changed so that it only pruned VFs that stemmed from MaxBandwidth being enabled. However we always compute register pressure regardless of whether or not max bandwidth is permitted for any VFs (via `MaxPermissibleVFWithoutMaxBW`). This skips the computation if not needed and renames the method for clarity. The diff in reg-usage.ll is due to the scalable VPlan not actually having any maxbandwidth VFs, so I've changed it to check the fixed-length VF instead, which is affected by maxbandwidth. (cherry-picked from 33f8ad7b6d9c5c43525576854fc07eebdca00a9f)

…enabled via maxBandwidth (llvm#149056) Currently if MaxBandwidth is enabled, the register pressure is checked for each VF. This changes that to only perform said check if the VF would not have otherwise been considered by the LoopVectorizer if maxBandwidth was not enabled. Theoretically this allows for higher VFs to be considered than would otherwise be deemed "safe" (from a regpressure perspective), but more concretely this reduces the amount of work done at compile-time when maxBandwidth is enabled. (cherry-picked from 0f37b1f8909790ceda0e85789345bcf40ea0184b)

In llvm#149056 VF pruning was changed so that it only pruned VFs that stemmed from MaxBandwidth being enabled. However we always compute register pressure regardless of whether or not max bandwidth is permitted for any VFs (via `MaxPermissibleVFWithoutMaxBW`). This skips the computation if not needed and renames the method for clarity. The diff in reg-usage.ll is due to the scalable VPlan not actually having any maxbandwidth VFs, so I've changed it to check the fixed-length VF instead, which is affected by maxbandwidth. (cherry-picked from 33f8ad7b6d9c5c43525576854fc07eebdca00a9f)

…enabled via maxBandwidth (llvm#149056) Currently if MaxBandwidth is enabled, the register pressure is checked for each VF. This changes that to only perform said check if the VF would not have otherwise been considered by the LoopVectorizer if maxBandwidth was not enabled. Theoretically this allows for higher VFs to be considered than would otherwise be deemed "safe" (from a regpressure perspective), but more concretely this reduces the amount of work done at compile-time when maxBandwidth is enabled. (cherry-picked from 0f37b1f8909790ceda0e85789345bcf40ea0184b)

In llvm#149056 VF pruning was changed so that it only pruned VFs that stemmed from MaxBandwidth being enabled. However we always compute register pressure regardless of whether or not max bandwidth is permitted for any VFs (via `MaxPermissibleVFWithoutMaxBW`). This skips the computation if not needed and renames the method for clarity. The diff in reg-usage.ll is due to the scalable VPlan not actually having any maxbandwidth VFs, so I've changed it to check the fixed-length VF instead, which is affected by maxbandwidth. (cherry-picked from 33f8ad7b6d9c5c43525576854fc07eebdca00a9f)

[LoopVectorizer] Only check register pressure for VFs that have been …

f0be6fa

…enabled via maxBandwidth

NickGuy-Arm requested review from MacDue, SamTebbs33, fhahn, gbossu, huntergr-arm and sdesmalen-arm July 16, 2025 10:11

llvmbot added vectorizers llvm:transforms labels Jul 16, 2025

SamTebbs33 approved these changes Jul 16, 2025

View reviewed changes

llvm/lib/Transforms/Vectorize/VPlanAnalysis.h Outdated Show resolved Hide resolved

Adjust doc-comment

b6fa7ca

fhahn reviewed Jul 17, 2025

View reviewed changes

llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll Outdated Show resolved Hide resolved

NickGuy-Arm added 2 commits July 17, 2025 12:38

Remove unused metadata from test

1a341b0

Add extra check-line to test

a4f1d6c

NickGuy-Arm merged commit 20fc297 into llvm:main Jul 18, 2025
9 checks passed

fhahn reviewed Jul 18, 2025

View reviewed changes

This was referenced Jul 23, 2025

test abhinavgaba/llvm-project#2

Closed

Add dataFence plugin interface abhinavgaba/llvm-project#3

Closed

This was referenced Sep 4, 2025

[VPlan] Only compute reg pressure if considered. NFCI #156923

Merged

[VPlan] Always consider register pressure on RISC-V #156951

Merged

lukel97 mentioned this pull request Sep 5, 2025

[RISCV] Enable loop vectorizer register pressure VF pruning again #144520

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

[LoopVectorizer] Only check register pressure for VFs that have been enabled via maxBandwidth#149056

[LoopVectorizer] Only check register pressure for VFs that have been enabled via maxBandwidth#149056
NickGuy-Arm merged 4 commits intollvm:mainfrom
NickGuy-Arm:maxbw-regpressure

NickGuy-Arm commented Jul 16, 2025 •

edited

Loading

Uh oh!

llvmbot commented Jul 16, 2025 •

edited

Loading

Uh oh!

Uh oh!

fhahn left a comment

Uh oh!

Uh oh!

NickGuy-Arm commented Jul 17, 2025

Uh oh!

Uh oh!

fhahn Jul 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

Conversation

NickGuy-Arm commented Jul 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Jul 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

fhahn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

NickGuy-Arm commented Jul 17, 2025

Uh oh!

Uh oh!

fhahn Jul 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

NickGuy-Arm commented Jul 16, 2025 •

edited

Loading

llvmbot commented Jul 16, 2025 •

edited

Loading