[VPlan] Add the cost of spills when considering register pressure #179646

john-brawn-arm merged 8 commits into llvm:main
Conversation
Currently, when considering register pressure is enabled, we reject any VF that has higher pressure than the number of registers. However, this can result in failing to vectorize in cases where it would be beneficial, as the cost of the extra spills is less than the benefit we get from vectorizing. Deal with this by instead calculating the cost of the spills and adding it to the rest of the cost, so we can detect this kind of situation and still vectorize, while avoiding vectorizing in cases where the extra cost makes it not worth it.
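The approach described above can be sketched as a small standalone model. This is a simplified illustration, not the actual LLVM interfaces: the function name, the flat cost maps, and the per-class cost values below are all hypothetical stand-ins for `VPRegisterUsage::spillCost`, TTI register-class queries, and `getMemoryOpCost`. For each register class, every register of pressure beyond what the target provides is charged one spill plus one reload of the largest type seen in that class.

```cpp
#include <map>

// Hypothetical model of the spill-cost heuristic. Keys are register-class
// IDs; values are, respectively, the peak number of live registers, the
// number of registers the target provides, and the combined cost of one
// store-plus-load of the largest type seen in that class.
unsigned modelSpillCost(const std::map<unsigned, unsigned> &MaxLocalUsers,
                        const std::map<unsigned, unsigned> &AvailableRegs,
                        const std::map<unsigned, unsigned> &SpillReloadCost) {
  unsigned Cost = 0;
  for (const auto &[ClassID, Used] : MaxLocalUsers) {
    unsigned Avail = AvailableRegs.at(ClassID);
    if (Used > Avail)
      // One spill/reload pair per register of excess pressure.
      Cost += (Used - Avail) * SpillReloadCost.at(ClassID);
  }
  return Cost;
}
```

With pressure 10 against 8 available registers and a store-plus-load cost of 2, this charges 2 spills for a total cost of 4, matching the shape of the `LV(REG): Cost of 4 from 2 spills` debug output in the tests; the extra cost then competes with the per-lane benefit of the wider VF instead of rejecting it outright.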
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-llvm-transforms Author: John Brawn (john-brawn-arm) Changes: Currently, when considering register pressure is enabled, we reject any VF that has higher pressure than the number of registers. However, this can result in failing to vectorize in cases where it would be beneficial, as the cost of the extra spills is less than the benefit we get from vectorizing. Deal with this by instead calculating the cost of the spills and adding it to the rest of the cost, so we can detect this kind of situation and still vectorize, while avoiding vectorizing in cases where the extra cost makes it not worth it. Patch is 34.53 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/179646.diff 7 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 44d4d92d4a7e2..06e8efef20c03 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -45,6 +45,7 @@ class OptimizationRemarkEmitter;
class TargetTransformInfo;
class TargetLibraryInfo;
class VPRecipeBuilder;
+class VPRegisterUsage;
struct VFRange;
extern cl::opt<bool> EnableVPlanNativePath;
@@ -497,7 +498,7 @@ class LoopVectorizationPlanner {
///
/// TODO: Move to VPlan::cost once the use of LoopVectorizationLegality has
/// been retired.
- InstructionCost cost(VPlan &Plan, ElementCount VF) const;
+ InstructionCost cost(VPlan &Plan, ElementCount VF, VPRegisterUsage *RU) const;
/// Precompute costs for certain instructions using the legacy cost model. The
/// function is used to bring up the VPlan-based cost model to initially avoid
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index abac45b265d10..492e716fd6ad2 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4247,13 +4247,6 @@ VectorizationFactor LoopVectorizationPlanner::selectVectorizationFactor() {
if (VF.isScalar())
continue;
- /// If the register pressure needs to be considered for VF,
- /// don't consider the VF as valid if it exceeds the number
- /// of registers for the target.
- if (CM.shouldConsiderRegPressureForVF(VF) &&
- RUs[I].exceedsMaxNumRegs(TTI, ForceTargetNumVectorRegs))
- continue;
-
InstructionCost C = CM.expectedCost(VF);
// Add on other costs that are modelled in VPlan, but not in the legacy
@@ -4302,6 +4295,10 @@ VectorizationFactor LoopVectorizationPlanner::selectVectorizationFactor() {
}
}
+ // Add the cost of any spills due to excess register usage
+ if (CM.shouldConsiderRegPressureForVF(VF))
+ C += RUs[I].spillCost(CostCtx, ForceTargetNumVectorRegs);
+
VectorizationFactor Candidate(VF, C, ScalarCost.ScalarCost);
unsigned Width =
estimateElementCount(Candidate.Width, CM.getVScaleForTuning());
@@ -4687,13 +4684,16 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF,
if (hasFindLastReductionPhi(Plan))
return 1;
+ VPRegisterUsage R =
+ calculateRegisterUsageForPlan(Plan, {VF}, TTI, CM.ValuesToIgnore)[0];
+
// If we did not calculate the cost for VF (because the user selected the VF)
// then we calculate the cost of VF here.
if (LoopCost == 0) {
if (VF.isScalar())
LoopCost = CM.expectedCost(VF);
else
- LoopCost = cost(Plan, VF);
+ LoopCost = cost(Plan, VF, &R);
assert(LoopCost.isValid() && "Expected to have chosen a VF with valid cost");
// Loop body is free and there is no need for interleaving.
@@ -4701,8 +4701,6 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF,
return 1;
}
- VPRegisterUsage R =
- calculateRegisterUsageForPlan(Plan, {VF}, TTI, CM.ValuesToIgnore)[0];
// We divide by these constants so assume that we have at least one
// instruction that uses at least one register.
for (auto &Pair : R.MaxLocalUsers) {
@@ -7027,13 +7025,18 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
return Cost;
}
-InstructionCost LoopVectorizationPlanner::cost(VPlan &Plan,
- ElementCount VF) const {
+InstructionCost LoopVectorizationPlanner::cost(VPlan &Plan, ElementCount VF,
+ VPRegisterUsage *RU) const {
VPCostContext CostCtx(CM.TTI, *CM.TLI, Plan, CM, CM.CostKind, PSE, OrigLoop);
InstructionCost Cost = precomputeCosts(Plan, VF, CostCtx);
// Now compute and add the VPlan-based cost.
Cost += Plan.cost(VF, CostCtx);
+
+ // Add the cost of spills due to excess register usage
+ if (CM.shouldConsiderRegPressureForVF(VF))
+ Cost += RU->spillCost(CostCtx, ForceTargetNumVectorRegs);
+
#ifndef NDEBUG
unsigned EstimatedWidth = estimateElementCount(VF, CM.getVScaleForTuning());
LLVM_DEBUG(dbgs() << "Cost for VF " << VF << ": " << Cost
@@ -7233,9 +7236,10 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
P->vectorFactors().end());
SmallVector<VPRegisterUsage, 8> RUs;
- if (any_of(VFs, [this](ElementCount VF) {
- return CM.shouldConsiderRegPressureForVF(VF);
- }))
+ bool ConsiderRegPressure = any_of(VFs, [this](ElementCount VF) {
+ return CM.shouldConsiderRegPressureForVF(VF);
+ });
+ if (ConsiderRegPressure)
RUs = calculateRegisterUsageForPlan(*P, VFs, TTI, CM.ValuesToIgnore);
for (unsigned I = 0; I < VFs.size(); I++) {
@@ -7258,16 +7262,10 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
continue;
}
- InstructionCost Cost = cost(*P, VF);
+ InstructionCost Cost =
+ cost(*P, VF, ConsiderRegPressure ? &RUs[I] : nullptr);
VectorizationFactor CurrentFactor(VF, Cost, ScalarCost);
- if (CM.shouldConsiderRegPressureForVF(VF) &&
- RUs[I].exceedsMaxNumRegs(TTI, ForceTargetNumVectorRegs)) {
- LLVM_DEBUG(dbgs() << "LV(REG): Not considering vector loop of width "
- << VF << " because it uses too many registers\n");
- continue;
- }
-
if (isMoreProfitable(CurrentFactor, BestFactor, P->hasScalarTail()))
BestFactor = CurrentFactor;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 8fbe7d93e6f45..b8be1be79831e 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -16,6 +16,7 @@
#include "llvm/ADT/TypeSwitch.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/PatternMatch.h"
@@ -389,13 +390,33 @@ bool VPDominatorTree::properlyDominates(const VPRecipeBase *A,
return Base::properlyDominates(ParentA, ParentB);
}
-bool VPRegisterUsage::exceedsMaxNumRegs(const TargetTransformInfo &TTI,
- unsigned OverrideMaxNumRegs) const {
- return any_of(MaxLocalUsers, [&TTI, &OverrideMaxNumRegs](auto &LU) {
- return LU.second > (OverrideMaxNumRegs > 0
- ? OverrideMaxNumRegs
- : TTI.getNumberOfRegisters(LU.first));
- });
+InstructionCost VPRegisterUsage::spillCost(VPCostContext &Ctx,
+ unsigned OverrideMaxNumRegs) const {
+ InstructionCost Cost;
+ DataLayout DL = Ctx.PSE.getSE()->getDataLayout();
+ for (const auto &Pair : MaxLocalUsers) {
+ unsigned AvailableRegs = OverrideMaxNumRegs > 0
+ ? OverrideMaxNumRegs
+ : Ctx.TTI.getNumberOfRegisters(Pair.first);
+ if (Pair.second > AvailableRegs) {
+ // Assume that for each register used past what's available we get one
+ // spill and reload of the largest type seen for that register class.
+ unsigned Spills = Pair.second - AvailableRegs;
+ Type *SpillType = LargestType.at(Pair.first);
+ Align Alignment = DL.getPrefTypeAlign(SpillType);
+ InstructionCost SpillCost =
+ Ctx.TTI.getMemoryOpCost(Instruction::Load, SpillType, Alignment, 0,
+ Ctx.CostKind) +
+ Ctx.TTI.getMemoryOpCost(Instruction::Store, SpillType, Alignment, 0,
+ Ctx.CostKind);
+ InstructionCost TotalCost = SpillCost * Spills;
+ LLVM_DEBUG(dbgs() << "LV(REG): Cost of " << TotalCost << " from "
+ << Spills << " spills of "
+ << Ctx.TTI.getRegisterClassName(Pair.first) << "\n");
+ Cost += TotalCost;
+ }
+ }
+ return Cost;
}
SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
@@ -479,6 +500,15 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
SmallPtrSet<VPValue *, 8> OpenIntervals;
SmallVector<VPRegisterUsage, 8> RUs(VFs.size());
SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
+ SmallVector<SmallMapVector<unsigned, Type *, 4>, 8> LargestTypes(VFs.size());
+ auto MaxType = [](Type *CurMax, Type *T) {
+ if (!CurMax)
+ return T;
+ if (TypeSize::isKnownGT(T->getPrimitiveSizeInBits(),
+ CurMax->getPrimitiveSizeInBits()))
+ return T;
+ return CurMax;
+ };
LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
@@ -540,17 +570,19 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
match(VPV, m_ExtractLastPart(m_VPValue())))
continue;
+ Type *ScalarTy = TypeInfo.inferScalarType(VPV);
if (VFs[J].isScalar() ||
isa<VPCanonicalIVPHIRecipe, VPReplicateRecipe, VPDerivedIVRecipe,
VPEVLBasedIVPHIRecipe, VPScalarIVStepsRecipe>(VPV) ||
(isa<VPInstruction>(VPV) && vputils::onlyScalarValuesUsed(VPV)) ||
(isa<VPReductionPHIRecipe>(VPV) &&
(cast<VPReductionPHIRecipe>(VPV))->isInLoop())) {
- unsigned ClassID =
- TTI.getRegisterClassForType(false, TypeInfo.inferScalarType(VPV));
+ unsigned ClassID = TTI.getRegisterClassForType(false, ScalarTy);
// FIXME: The target might use more than one register for the type
// even in the scalar case.
RegUsage[ClassID] += 1;
+ LargestTypes[J][ClassID] =
+ MaxType(LargestTypes[J][ClassID], ScalarTy);
} else {
// The output from scaled phis and scaled reductions actually has
// fewer lanes than the VF.
@@ -562,10 +594,12 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
LLVM_DEBUG(dbgs() << "LV(REG): Scaled down VF from " << VFs[J]
<< " to " << VF << " for " << *R << "\n";);
}
-
- Type *ScalarTy = TypeInfo.inferScalarType(VPV);
unsigned ClassID = TTI.getRegisterClassForType(true, ScalarTy);
RegUsage[ClassID] += GetRegUsage(ScalarTy, VF);
+ if (VectorType::isValidElementType(ScalarTy)) {
+ Type *T = VectorType::get(ScalarTy, VF);
+ LargestTypes[J][ClassID] = MaxType(LargestTypes[J][ClassID], T);
+ }
}
}
@@ -602,9 +636,11 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
bool IsScalar = vputils::onlyScalarValuesUsed(In);
ElementCount VF = IsScalar ? ElementCount::getFixed(1) : VFs[Idx];
- unsigned ClassID = TTI.getRegisterClassForType(
- VF.isVector(), TypeInfo.inferScalarType(In));
- Invariant[ClassID] += GetRegUsage(TypeInfo.inferScalarType(In), VF);
+ Type *ScalarTy = TypeInfo.inferScalarType(In);
+ unsigned ClassID = TTI.getRegisterClassForType(VF.isVector(), ScalarTy);
+ Invariant[ClassID] += GetRegUsage(ScalarTy, VF);
+ Type *SpillTy = IsScalar ? ScalarTy : VectorType::get(ScalarTy, VF);
+ LargestTypes[Idx][ClassID] = MaxType(LargestTypes[Idx][ClassID], SpillTy);
}
LLVM_DEBUG({
@@ -623,10 +659,16 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
<< TTI.getRegisterClassName(pair.first) << ", " << pair.second
<< " registers\n";
}
+ for (const auto &pair : LargestTypes[Idx]) {
+ dbgs() << "LV(REG): RegisterClass: "
+ << TTI.getRegisterClassName(pair.first) << ", " << *pair.second
+ << " is largest type potentially spilled\n";
+ }
});
RU.LoopInvariantRegs = Invariant;
RU.MaxLocalUsers = MaxUsages[Idx];
+ RU.LargestType = LargestTypes[Idx];
RUs[Idx] = RU;
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
index dc4be4270f7f1..3affa211dd140 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.h
@@ -19,6 +19,7 @@ namespace llvm {
class LLVMContext;
class VPValue;
class VPBlendRecipe;
+class VPCostContext;
class VPInstruction;
class VPWidenRecipe;
class VPWidenCallRecipe;
@@ -30,6 +31,7 @@ class VPlan;
class Value;
class TargetTransformInfo;
class Type;
+class InstructionCost;
/// An analysis for type-inference for VPValues.
/// It infers the scalar type for a given VPValue by bottom-up traversing
@@ -78,12 +80,14 @@ struct VPRegisterUsage {
/// Holds the maximum number of concurrent live intervals in the loop.
/// The key is ClassID of target-provided register class.
SmallMapVector<unsigned, unsigned, 4> MaxLocalUsers;
+ /// Holds the largest type used in each register class.
+ SmallMapVector<unsigned, Type *, 4> LargestType;
- /// Check if any of the tracked live intervals exceeds the number of
- /// available registers for the target. If non-zero, OverrideMaxNumRegs
+ /// Calculate the estimated cost of any spills due to using more registers
+ /// than the number available for the target. If non-zero, OverrideMaxNumRegs
/// is used in place of the target's number of registers.
- bool exceedsMaxNumRegs(const TargetTransformInfo &TTI,
- unsigned OverrideMaxNumRegs = 0) const;
+ InstructionCost spillCost(VPCostContext &Ctx,
+ unsigned OverrideMaxNumRegs = 0) const;
};
/// Estimate the register usage for \p Plan and vectorization factors in \p VFs
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll b/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
index 8109d0683fe71..2a4d16979e0d8 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll
@@ -1,16 +1,31 @@
; REQUIRES: asserts
-; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth -debug-only=loop-vectorize -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK-REGS-VP
-; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth -debug-only=loop-vectorize -disable-output -force-target-num-vector-regs=1 -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK-NOREGS-VP
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=false -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOMAX
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-REGS-VP
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-target-num-vector-regs=1 -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOREGS-VP
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-unknown-elf"
+; The use of the dotp instruction means we never have an i32 vector, so we don't
+; get any spills normally and with a reduced number of registers the number of
+; spills is small enough that it doesn't prevent use of a larger VF.
define i32 @dotp(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'dotp'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 5 (Estimated cost per lane: 0.3)
; CHECK-REGS-VP: LV: Selecting VF: vscale x 16.
;
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 8 because it uses too many registers
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 16 because it uses too many registers
-; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 14 (Estimated cost per lane: 1.8)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 13 (Estimated cost per lane: 0.8)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 16.
entry:
br label %for.body
@@ -24,8 +39,7 @@ for.body: ; preds = %for.body, %entry
%load.b = load i8, ptr %gep.b, align 1
%ext.b = zext i8 %load.b to i32
%mul = mul i32 %ext.b, %ext.a
- %sub = sub i32 0, %mul
- %add = add i32 %accum, %sub
+ %add = add i32 %accum, %mul
%iv.next = add i64 %iv, 1
%exitcond.not = icmp eq i64 %iv.next, 1024
br i1 %exitcond.not, label %for.exit, label %for.body
@@ -34,4 +48,70 @@ for.exit: ; preds = %for.body
ret i32 %add
}
+; The largest type used in the loop is small enough that we already consider all
+; VFs and maximize-bandwidth does nothing.
+define void @type_too_small(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'type_too_small'
+; CHECK: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK: Cost for VF vscale x 16: 6 (Estimated cost per lane: 0.4)
+; CHECK: LV: Selecting VF: vscale x 16.
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr i8, ptr %a, i64 %iv
+ %load.a = load i8, ptr %gep.a, align 1
+ %gep.b = getelementptr i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %add = add i8 %load.a, %load.b
+ store i8 %add, ptr %gep.a, align 1
+ %iv.next = add i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 1024
+ br i1 %exitcond, label %exit, label %loop
+
+exit:
+ ret void
+}
+
+; With reduced number of registers the spills from high pressure are enough that
+; we use the same VF as if we hadn't maximized the bandwidth.
+define void @high_pressure(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'high_pressure'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 10 (Estimated cost per lane: 1.2)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 21 (Estimated cost per lane: 1.3)
+; CHECK-REGS-VP: LV: Selecting VF: vscale x 8.
+
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 12 from 3 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 26 (Estimated cost per lane: 3.2)
+; CHECK-NOREGS-VP: LV(REG): Cost of 56 from 7 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 81 (Estimated cost per lane: 5.1)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr i32, ptr %a, i64 %iv
+ %load.a = load i32, ptr %gep.a, align 4
+ %gep.b = getelementptr i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %ext.b = zext i8 %load.b to i32
+ %add = add i32 %load.a, %ext.b
+ store i32 %add, ptr %gep.a, align 4
+ %iv.next = add i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 1024
+ br i1 %exitcond, label %exit, label %loop
+
+exit:
+ ret void
+}
+
attributes #0 = { vscale_range(1,16) "target-features"="+sve" }
diff --git a/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll b/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll
new file mode 100644
index 00...
[truncated]
@llvm/pr-subscribers-vectorizers Author: John Brawn (john-brawn-arm)
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=false -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOMAX
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-REGS-VP
+; RUN: opt -passes=loop-vectorize -vectorizer-maximize-bandwidth=true -debug-only=loop-vectorize,vplan -disable-output -force-target-num-vector-regs=1 -force-vector-interleave=1 -enable-epilogue-vectorization=false -S < %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-NOREGS-VP
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-unknown-elf"
+; The use of the dotp instruction means we never have an i32 vector, so we don't
+; get any spills normally and with a reduced number of registers the number of
+; spills is small enough that it doesn't prevent use of a larger VF.
define i32 @dotp(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'dotp'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 5 (Estimated cost per lane: 0.3)
; CHECK-REGS-VP: LV: Selecting VF: vscale x 16.
;
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 8 because it uses too many registers
-; CHECK-NOREGS-VP: LV(REG): Not considering vector loop of width vscale x 16 because it uses too many registers
-; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 14 (Estimated cost per lane: 1.8)
+; CHECK-NOREGS-VP: LV(REG): Cost of 4 from 2 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 13 (Estimated cost per lane: 0.8)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 16.
entry:
br label %for.body
@@ -24,8 +39,7 @@ for.body: ; preds = %for.body, %entry
%load.b = load i8, ptr %gep.b, align 1
%ext.b = zext i8 %load.b to i32
%mul = mul i32 %ext.b, %ext.a
- %sub = sub i32 0, %mul
- %add = add i32 %accum, %sub
+ %add = add i32 %accum, %mul
%iv.next = add i64 %iv, 1
%exitcond.not = icmp eq i64 %iv.next, 1024
br i1 %exitcond.not, label %for.exit, label %for.body
@@ -34,4 +48,70 @@ for.exit: ; preds = %for.body
ret i32 %add
}
+; The largest type used in the loop is small enough that we already consider all
+; VFs and maximize-bandwidth does nothing.
+define void @type_too_small(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'type_too_small'
+; CHECK: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK: Cost for VF vscale x 8: 6 (Estimated cost per lane: 0.8)
+; CHECK: Cost for VF vscale x 16: 6 (Estimated cost per lane: 0.4)
+; CHECK: LV: Selecting VF: vscale x 16.
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr i8, ptr %a, i64 %iv
+ %load.a = load i8, ptr %gep.a, align 1
+ %gep.b = getelementptr i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %add = add i8 %load.a, %load.b
+ store i8 %add, ptr %gep.a, align 1
+ %iv.next = add i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 1024
+ br i1 %exitcond, label %exit, label %loop
+
+exit:
+ ret void
+}
+
+; With reduced number of registers the spills from high pressure are enough that
+; we use the same VF as if we hadn't maximized the bandwidth.
+define void @high_pressure(ptr %a, ptr %b) #0 {
+; CHECK-LABEL: LV: Checking a loop in 'high_pressure'
+;
+; CHECK-NOMAX: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOMAX: LV: Selecting VF: vscale x 4.
+;
+; CHECK-REGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-REGS-VP: Cost for VF vscale x 8: 10 (Estimated cost per lane: 1.2)
+; CHECK-REGS-VP: Cost for VF vscale x 16: 21 (Estimated cost per lane: 1.3)
+; CHECK-REGS-VP: LV: Selecting VF: vscale x 8.
+
+; CHECK-NOREGS-VP: Cost for VF vscale x 4: 6 (Estimated cost per lane: 1.5)
+; CHECK-NOREGS-VP: LV(REG): Cost of 12 from 3 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 8: 26 (Estimated cost per lane: 3.2)
+; CHECK-NOREGS-VP: LV(REG): Cost of 56 from 7 spills of Generic::VectorRC
+; CHECK-NOREGS-VP-NEXT: Cost for VF vscale x 16: 81 (Estimated cost per lane: 5.1)
+; CHECK-NOREGS-VP: LV: Selecting VF: vscale x 4.
+entry:
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr i32, ptr %a, i64 %iv
+ %load.a = load i32, ptr %gep.a, align 4
+ %gep.b = getelementptr i8, ptr %b, i64 %iv
+ %load.b = load i8, ptr %gep.b, align 1
+ %ext.b = zext i8 %load.b to i32
+ %add = add i32 %load.a, %ext.b
+ store i32 %add, ptr %gep.a, align 4
+ %iv.next = add i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 1024
+ br i1 %exitcond, label %exit, label %loop
+
+exit:
+ ret void
+}
+
attributes #0 = { vscale_range(1,16) "target-features"="+sve" }
diff --git a/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll b/llvm/test/Transforms/LoopVectorize/ARM/mve-reg-pressure-spills.ll
new file mode 100644
index 00...
[truncated]
|
The motivation for doing this is that I'm looking at enabling shouldConsiderVectorizationRegPressure on Arm Cortex-M CPUs with MVE, and the current behaviour makes things significantly worse in some cases due to preventing vectorization when it's beneficial. I've been specifically looking at the code we generate for https://github.com/ARM-software/CMSIS-DSP on Cortex-M55. If I enable vectorization register pressure then with the current behaviour the change in throughput is
With this PR the change in throughput is
The remaining regressions are due to the relative costs of interleave vs gather/scatter vs scalarize being wrong in some cases, which I'll be looking at next. I've also checked llvm-test-suite on Neoverse-V2 (AWS Graviton 4), where useMaxBandwidth is enabled for scalable vectors and so register pressure calculation is used, and there's zero change in code generation.
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed.
Ping |
lukel97
left a comment
Thanks for working on this, I remember this being discussed at the time the VPlan register pressure stuff initially landed. The load+store cost per spilled register heuristic seems sensible to me
Ping |
unsigned Spills = MaxUsers - AvailableRegs;
Type *SpillType = LargestType.at(RegClass);
Align Alignment = DL.getPrefTypeAlign(SpillType);
InstructionCost SpillCost =
I think we'd probably want this code to live as part of the default implementation of a getSpillCost() TTI hook, i.e. the default could be implemented in BasicTTIImpl.h. Some targets may simply want to return Invalid here in order to prevent any spilling or filling whatsoever. Also, spilling the largest type doesn't guarantee the most pessimistic cost. I haven't looked into this in detail, but I can imagine situations where spilling <16 x i1> is more expensive than <16 x i8>, simply because in the backend <16 x i1> is not a legal type and requires promotion first.
There's the getCostOfKeepingLiveOverCall hook in TTI which is basically the cost of spilling multiple types. We could rework that to be getSpillCost() for a single type, I think the only user of it is SLP currently IIRC.
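As a toy illustration of the hook shape being discussed here (the names and structure below are hypothetical, not the real TTI interface), a base default could compute a store-plus-reload estimate while a target override returns Invalid to forbid spilling entirely; std::optional stands in for LLVM's InstructionCost invalid state:

```cpp
#include <cassert>
#include <optional>

// Hypothetical sketch only: nullopt models InstructionCost's invalid state.
using Cost = std::optional<unsigned>;

struct BaseTTI {
  virtual ~BaseTTI() = default;
  // Default: assume one store plus one reload per spilled register.
  virtual Cost getSpillCost(unsigned StoreCost, unsigned LoadCost) const {
    return StoreCost + LoadCost;
  }
};

struct NoSpillTarget : BaseTTI {
  // A target that wants to prevent any spilling or filling whatsoever.
  Cost getSpillCost(unsigned, unsigned) const override { return std::nullopt; }
};
```

A VF whose spill cost comes back Invalid would then be discarded, matching the behaviour a no-spill target wants.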
Type *SpillType = LargestType.at(RegClass);
Align Alignment = DL.getPrefTypeAlign(SpillType);
InstructionCost SpillCost =
    Ctx.TTI.getMemoryOpCost(Instruction::Load, SpillType, Alignment, 0,
This code needs to be able to handle invalid costs being returned. See AArch64TTIImpl::getMemoryOpCost for an example of what happens when calculating the cost of load/store of <vscale x 16 x i1> types.
Given that you could encounter Invalid costs here, perhaps it might be easier to see the impact of these changes if you split this up into two PRs:
- Create an NFC patch to refactor the code so that we call spillCost, but spillCost always returns Invalid if MaxUsers > AvailableRegs. That way we can see what happens when Invalid is returned and make sure it behaves sensibly. In theory, it should be NFC because we'll just ignore this VF the same as before.
- Create a follow-on patch to add a better cost model, perhaps introducing a new TTI hook as suggested above.
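The two-step staging can be modelled in a few lines (illustrative names only, not the actual LLVM API; std::optional again stands in for InstructionCost's invalid state): a spillCost that returns Invalid whenever pressure exceeds the available registers poisons the VF's total cost, reproducing the old reject-on-pressure behaviour.

```cpp
#include <cassert>
#include <optional>

// Toy model of the staged approach; nullopt == Invalid.
using Cost = std::optional<unsigned>;

// Step 1 (NFC): return Invalid whenever pressure exceeds the available
// registers, so the VF is discarded exactly as before.
Cost spillCostNFC(unsigned MaxUsers, unsigned AvailableRegs) {
  if (MaxUsers > AvailableRegs)
    return std::nullopt;
  return 0u;
}

// Invalid propagates through addition, so an Invalid spill cost makes the
// whole VF cost Invalid and that VF is skipped.
Cost addCosts(Cost A, Cost B) {
  if (!A || !B)
    return std::nullopt;
  return *A + *B;
}
```

Step 2 would then replace the Invalid return with a real estimate without touching the propagation logic.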
unsigned ClassID = TTI.getRegisterClassForType(VF.isVector(), ScalarTy);
Type *SpillTy = IsScalar ? ScalarTy : VectorType::get(ScalarTy, VF);
Invariant[ClassID] += TTI.getRegUsageForType(SpillTy);
LargestTypes[Idx][ClassID] = MaxType(LargestTypes[Idx][ClassID], SpillTy);
As mentioned in an earlier comment, it's not obvious to me that the largest type has the greatest spill/fill cost. I think you might need to either:
- Track all types and add up the spill/fill cost for each type, or
- Calculate the largest possible spill/fill cost for each type here, then use the largest cost in the spillCost routine.
Actually, from some thinking about this, I don't think either the largest size or the largest cost is the right thing here. Taking this example:
void fn(char *src, long *dst, long n) {
  for (long i = 0; i < n; i++) {
    dst[i] += src[i];
  }
}
With VF 8 the largest (and most costly to spill) type is <vscale x 8 x i64>, which with -mcpu=neoverse-v2 -mllvm -force-target-num-vector-regs=4 gives
LV(REG): Cost of 32 from 4 spills of Generic::VectorRC
32 is the cost of 4 spills of <vscale x 8 x i64>, but what gets spilled is registers not types, i.e. what we want here is the cost of 4 z-register spills. I'm not sure what the best way to handle this is. Perhaps TargetTransformInfo should have a method that gives the spill cost for a generic register class, given that we're using these generic registers classes here to count how many registers are being used.
I've changed things to calculate the cost using the register class, by adding getRegisterClassSpillCost to TargetTransformInfo.
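A minimal sketch of what costing by register class rather than by widest type looks like. The class names match the debug output quoted above, but the per-class costs are made-up placeholders (the vector-class cost is chosen so that 4 spills of Generic::VectorRC come out at 32, as in the example); this is not what any real TTI implementation returns.

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch only: spills happen to physical registers, not to IR types, so the
// per-spill cost is looked up by register class, e.g. one z-register
// spill/fill on SVE regardless of the IR type that lived in it.
unsigned getRegisterClassSpillCost(const std::string &RegClass) {
  static const std::map<std::string, unsigned> Costs = {
      {"Generic::ScalarRC", 2}, // assumed: store + reload of one GPR
      {"Generic::VectorRC", 8}, // assumed: store + reload of one vector reg
  };
  auto It = Costs.find(RegClass);
  return It == Costs.end() ? 0 : It->second;
}

unsigned classSpillCost(const std::string &RegClass, unsigned MaxUsers,
                        unsigned AvailableRegs) {
  unsigned Spills = MaxUsers > AvailableRegs ? MaxUsers - AvailableRegs : 0;
  return Spills * getRegisterClassSpillCost(RegClass);
}
```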
Ping |
Ping |
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/11/builds/36090 Here is the relevant piece of the build log for reference:
probably needs ; REQUIRES: asserts here
mve-reg-pressure-spills.ll failed in https://lab.llvm.org/buildbot/#/builders/187/builds/18019, which enables expensive checks. Currently investigating; it looks like this may be unrelated to the compiler change here, and the test has exposed an existing problem.
See #182254, it might be that |
#187360 should fix the VPlanVerifier failure. |
Hi @john-brawn-arm, the 'llvm/test/Transforms/LoopVectorize/AArch64/maxbandwidth-regpressure.ll' test fails on the aarch64 cross builder https://lab.llvm.org/buildbot/#/builders/193/builds/15053 with the following errors: It looks like this is because of these changes. Would you take care of it? |
Fix for this in #187498
…vm#179646) Currently when considering register pressure is enabled, we reject any VF that has higher pressure than the number of registers. However, this can result in failing to vectorize in cases where it's beneficial, as the cost of the extra spills is less than the benefit we get from vectorizing. Deal with this by instead calculating the cost of spills and adding that to the rest of the cost, so we can detect this kind of situation and still vectorize while avoiding vectorizing in cases where the extra cost makes it not worth it.
Currently when considering register pressure is enabled, we reject any VF that has higher pressure than the number of registers. However, this can result in failing to vectorize in cases where it's beneficial, as the cost of the extra spills is less than the benefit we get from vectorizing.
Deal with this by instead calculating the cost of spills and adding that to the rest of the cost, so we can detect this kind of situation and still vectorize while avoiding vectorizing in cases where the extra cost makes it not worth it.
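The mechanics of the change can be sketched as follows. This is a simplified model, not the actual LLVM code: the struct fields, the flat per-spill cost, and the cost-per-lane tie-break are assumptions chosen to mirror the debug output quoted earlier in the thread.

```cpp
#include <cassert>

// Simplified model of the new behaviour: instead of rejecting a VF whose
// register pressure exceeds the available registers, estimate a spill cost
// and fold it into that VF's total cost.
struct VFCandidate {
  unsigned Lanes;    // vectorization factor
  unsigned LoopCost; // cost of one vector iteration, ignoring spills
  unsigned MaxUsers; // peak register pressure at this VF
};

unsigned totalCost(const VFCandidate &VF, unsigned AvailableRegs,
                   unsigned SpillFillCost) {
  unsigned Spills =
      VF.MaxUsers > AvailableRegs ? VF.MaxUsers - AvailableRegs : 0;
  return VF.LoopCost + Spills * SpillFillCost;
}

// Cost per lane decides between candidates, mirroring the
// "Estimated cost per lane" lines in the debug output.
double costPerLane(const VFCandidate &VF, unsigned AvailableRegs,
                   unsigned SpillFillCost) {
  return double(totalCost(VF, AvailableRegs, SpillFillCost)) / VF.Lanes;
}
```

Under this model a wider VF can still win despite spilling, as long as the spill penalty is smaller than the per-lane saving from the wider vectors.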