[NFC][VPlan] Rename VPEVLBasedIVPHIRecipe to VPCurrentIterationPHIRecipe #177114
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Shih-Po Hung (arcbbb)

Changes

This patch introduces VPCumulativeIVPHIRecipe to track the cumulative count of processed elements across loop iterations. Unlike CanonicalIV, which always increments by VF*UF, CumulativeIV can step by a variable amount (e.g., EVL) per iteration. Key changes:
This also addresses the issue in #166164 (comment), which needs a cumulative element count in createScalarIVSteps. Patch is 579.98 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/177114.diff

79 Files Affected:
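The difference between the two counters can be sketched with a toy model (an illustration only; the function and variable names here are mine, not the LLVM API):

```python
# Toy model of the two header phis: the canonical IV always steps by
# VF*UF, while the cumulative IV steps by the number of elements
# actually processed that iteration (e.g. the EVL on the final trip).
def simulate(trip_count, vf_x_uf):
    canonical_iv = 0   # counts vector iterations in units of VF*UF
    cumulative_iv = 0  # counts elements processed so far
    steps = []
    while cumulative_iv < trip_count:
        evl = min(vf_x_uf, trip_count - cumulative_iv)  # variable step
        steps.append(evl)
        cumulative_iv += evl      # may be < VF*UF on the last iteration
        canonical_iv += vf_x_uf   # fixed step, may overshoot trip_count
    return canonical_iv, cumulative_iv, steps

print(simulate(10, 4))  # (12, 10, [4, 4, 2])
```

When the step is always exactly VF*UF (no tail folding by EVL), the two counters coincide, which is why the fixed-step cumulative IV can be folded back into the canonical IV.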
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 346b8a1f9e420..3da0c6f206ec1 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4117,10 +4117,10 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF,
case VPDef::VPReplicateSC:
case VPDef::VPInstructionSC:
case VPDef::VPCanonicalIVPHISC:
+ case VPDef::VPCumulativeIVPHISC:
case VPDef::VPVectorPointerSC:
case VPDef::VPVectorEndPointerSC:
case VPDef::VPExpandSCEVSC:
- case VPDef::VPEVLBasedIVPHISC:
case VPDef::VPPredInstPHISC:
case VPDef::VPBranchOnMaskSC:
continue;
@@ -4632,8 +4632,9 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF,
!(CM.preferPredicatedLoop() && CM.useWideActiveLaneMask()))
return 1;
+ // TODO: Support interleave for loop with variable-length stepping.
if (any_of(Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis(),
- IsaPred<VPEVLBasedIVPHIRecipe>)) {
+ IsaPred<VPCumulativeIVPHIRecipe>)) {
LLVM_DEBUG(dbgs() << "LV: Preference for VP intrinsics indicated. "
"Unroll factor forced to be 1.\n");
return 1;
@@ -7443,8 +7444,8 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
// Expand BranchOnTwoConds after dissolution, when latch has direct access to
// its successors.
VPlanTransforms::expandBranchOnTwoConds(BestVPlan);
- // Canonicalize EVL loops after regions are dissolved.
- VPlanTransforms::canonicalizeEVLLoops(BestVPlan);
+ VPlanTransforms::convertToVariableLengthStep(BestVPlan,
+ CM.foldTailByMasking());
VPlanTransforms::materializeBackedgeTakenCount(BestVPlan, VectorPH);
VPlanTransforms::materializeVectorTripCount(
BestVPlan, VectorPH, CM.foldTailByMasking(),
@@ -8371,6 +8372,10 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
*Plan, CM.getMaxSafeElements());
VPlanTransforms::runPass(VPlanTransforms::optimizeEVLMasks, *Plan);
}
+ // TODO: Place this before optimization after addExplicitVectorLength
+ // is placed close to addActiveLaneMask.
+ VPlanTransforms::runPass(VPlanTransforms::removeFixedStepCumulativeIV,
+ *Plan);
assert(verifyVPlanIsValid(*Plan) && "VPlan is invalid");
VPlans.push_back(std::move(Plan));
}
@@ -8674,6 +8679,7 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlan(VFRange &Range) {
// failures.
DenseMap<VPValue *, VPValue *> IVEndValues;
VPlanTransforms::updateScalarResumePhis(*Plan, IVEndValues);
+ VPlanTransforms::removeFixedStepCumulativeIV(*Plan);
assert(verifyVPlanIsValid(*Plan) && "VPlan is invalid");
return Plan;
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 329181df443db..13e42bde49925 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -541,7 +541,6 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPRecipeValue {
static inline bool classof(const VPRecipeBase *R) {
switch (R->getVPDefID()) {
case VPRecipeBase::VPDerivedIVSC:
- case VPRecipeBase::VPEVLBasedIVPHISC:
case VPRecipeBase::VPExpandSCEVSC:
case VPRecipeBase::VPExpressionSC:
case VPRecipeBase::VPInstructionSC:
@@ -560,6 +559,7 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPRecipeValue {
case VPRecipeBase::VPBlendSC:
case VPRecipeBase::VPPredInstPHISC:
case VPRecipeBase::VPCanonicalIVPHISC:
+ case VPRecipeBase::VPCumulativeIVPHISC:
case VPRecipeBase::VPActiveLaneMaskPHISC:
case VPRecipeBase::VPFirstOrderRecurrencePHISC:
case VPRecipeBase::VPWidenPHISC:
@@ -3669,28 +3669,32 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
};
/// A recipe for generating the phi node for the current index of elements,
-/// adjusted in accordance with EVL value. It starts at the start value of the
-/// canonical induction and gets incremented by EVL in each iteration of the
-/// vector loop.
-class VPEVLBasedIVPHIRecipe : public VPHeaderPHIRecipe {
+/// may be adjusted by variable-length-stepping transform. It starts at the
+/// start value of the canonical induction and gets incremented by the number
+/// of elements processed in each iteration of the vector loop.
+/// When the step equals VFxUF, this can be replaced by
+/// VPCanonicalIVPHIRecipe.
+class VPCumulativeIVPHIRecipe : public VPHeaderPHIRecipe {
public:
- VPEVLBasedIVPHIRecipe(VPValue *StartIV, DebugLoc DL)
- : VPHeaderPHIRecipe(VPDef::VPEVLBasedIVPHISC, nullptr, StartIV, DL) {}
+ VPCumulativeIVPHIRecipe(VPValue *StartIV, DebugLoc DL)
+ : VPHeaderPHIRecipe(VPDef::VPCumulativeIVPHISC, nullptr, StartIV, DL) {}
- ~VPEVLBasedIVPHIRecipe() override = default;
+ ~VPCumulativeIVPHIRecipe() override = default;
- VPEVLBasedIVPHIRecipe *clone() override {
- llvm_unreachable("cloning not implemented yet");
+ VPCumulativeIVPHIRecipe *clone() override {
+ auto *R = new VPCumulativeIVPHIRecipe(getStartValue(), getDebugLoc());
+ R->addOperand(getBackedgeValue());
+ return R;
}
- VP_CLASSOF_IMPL(VPDef::VPEVLBasedIVPHISC)
+ VP_CLASSOF_IMPL(VPDef::VPCumulativeIVPHISC)
void execute(VPTransformState &State) override {
llvm_unreachable("cannot execute this recipe, should be replaced by a "
"scalar phi recipe");
}
- /// Return the cost of this VPEVLBasedIVPHIRecipe.
+ /// Return the cost of this VPCumulativeIVPHIRecipe.
InstructionCost computeCost(ElementCount VF,
VPCostContext &Ctx) const override {
// For now, match the behavior of the legacy cost model.
@@ -4295,6 +4299,13 @@ class LLVM_ABI_FOR_TEST VPRegionBlock : public VPBlockBase {
return const_cast<VPRegionBlock *>(this)->getCanonicalIV();
}
+ VPCumulativeIVPHIRecipe *getCumulativeIV() {
+ return cast<VPCumulativeIVPHIRecipe>(getCanonicalIV()->getNextNode());
+ }
+ const VPCumulativeIVPHIRecipe *getCumulativeIV() const {
+ return const_cast<VPRegionBlock *>(this)->getCumulativeIV();
+ }
+
/// Return the type of the canonical IV for loop regions.
Type *getCanonicalIVType() { return getCanonicalIV()->getScalarType(); }
const Type *getCanonicalIVType() const {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 994a4d8921480..25102ccec9c46 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -269,7 +269,7 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
TypeSwitch<const VPRecipeBase *, Type *>(V->getDefiningRecipe())
.Case<VPActiveLaneMaskPHIRecipe, VPCanonicalIVPHIRecipe,
VPFirstOrderRecurrencePHIRecipe, VPReductionPHIRecipe,
- VPWidenPointerInductionRecipe, VPEVLBasedIVPHIRecipe>(
+ VPWidenPointerInductionRecipe, VPCumulativeIVPHIRecipe>(
[this](const auto *R) {
// Handle header phi recipes, except VPWidenIntOrFpInduction
// which needs special handling due it being possibly truncated.
@@ -542,7 +542,7 @@ SmallVector<VPRegisterUsage, 8> llvm::calculateRegisterUsageForPlan(
if (VFs[J].isScalar() ||
isa<VPCanonicalIVPHIRecipe, VPReplicateRecipe, VPDerivedIVRecipe,
- VPEVLBasedIVPHIRecipe, VPScalarIVStepsRecipe>(VPV) ||
+ VPCumulativeIVPHIRecipe, VPScalarIVStepsRecipe>(VPV) ||
(isa<VPInstruction>(VPV) && vputils::onlyScalarValuesUsed(VPV)) ||
(isa<VPReductionPHIRecipe>(VPV) &&
(cast<VPReductionPHIRecipe>(VPV))->isInLoop())) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp b/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
index 96dd3aff80eb4..ef76c452798fc 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
@@ -484,6 +484,25 @@ static void addCanonicalIVRecipes(VPlan &Plan, VPBasicBlock *HeaderVPBB,
LatchDL);
}
+static void addCumulativeIVRecipes(VPlan &Plan, VPBasicBlock *HeaderVPBB,
+ VPBasicBlock *LatchVPBB, Type *IdxTy,
+ DebugLoc DL) {
+ auto *CanonicalIV = cast<VPCanonicalIVPHIRecipe>(&*HeaderVPBB->begin());
+ Value *StartIdx = ConstantInt::get(IdxTy, 0);
+ auto *StartV = Plan.getOrAddLiveIn(StartIdx);
+ // Add a CumulativeIV after CanonicalIV.
+ auto *CumulativeIVPHI = new VPCumulativeIVPHIRecipe(StartV, DL);
+ CumulativeIVPHI->insertAfter(CanonicalIV);
+
+ // Add the CumulativeIV increment. Initially steps by VFxUF.
+ VPBuilder Builder(LatchVPBB,
+ std::next(CanonicalIV->getBackedgeRecipe().getIterator()));
+ auto *CumulativeIVIncrement = Builder.createOverflowingOp(
+ Instruction::Add, {&Plan.getVFxUF(), CumulativeIVPHI}, {true, false}, DL,
+ "cumulative.iv.next");
+ CumulativeIVPHI->addOperand(CumulativeIVIncrement);
+}
+
/// Creates extracts for values in \p Plan defined in a loop region and used
/// outside a loop region.
static void createExtractsForLiveOuts(VPlan &Plan, VPBasicBlock *MiddleVPBB) {
@@ -567,6 +586,8 @@ static void addInitialSkeleton(VPlan &Plan, Type *InductionTy, DebugLoc IVDL,
{VectorPhiR, VectorPhiR->getOperand(0)}, VectorPhiR->getDebugLoc());
cast<VPIRPhi>(&ScalarPhiR)->addOperand(ResumePhiR);
}
+
+ addCumulativeIVRecipes(Plan, HeaderVPBB, LatchVPBB, InductionTy, IVDL);
}
/// Check \p Plan's live-in and replace them with constants, if they can be
@@ -694,7 +715,7 @@ void VPlanTransforms::createHeaderPhiRecipes(
};
for (VPRecipeBase &R : make_early_inc_range(HeaderVPBB->phis())) {
- if (isa<VPCanonicalIVPHIRecipe>(&R))
+ if (isa<VPCanonicalIVPHIRecipe, VPCumulativeIVPHIRecipe>(&R))
continue;
auto *PhiR = cast<VPPhi>(&R);
VPHeaderPHIRecipe *HeaderPhiR = CreateHeaderPhiRecipe(PhiR);
@@ -1161,7 +1182,8 @@ bool VPlanTransforms::handleMaxMinNumReductions(VPlan &Plan) {
MinMaxNumReductionsToHandle;
bool HasUnsupportedPhi = false;
for (auto &R : LoopRegion->getEntryBasicBlock()->phis()) {
- if (isa<VPCanonicalIVPHIRecipe, VPWidenIntOrFpInductionRecipe>(&R))
+ if (isa<VPCanonicalIVPHIRecipe, VPCumulativeIVPHIRecipe,
+ VPWidenIntOrFpInductionRecipe>(&R))
continue;
auto *Cur = dyn_cast<VPReductionPHIRecipe>(&R);
if (!Cur) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 11e4f930f1e85..8960733183fc6 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -74,6 +74,7 @@ bool VPRecipeBase::mayWriteToMemory() const {
case VPWidenIntrinsicSC:
return cast<VPWidenIntrinsicRecipe>(this)->mayWriteToMemory();
case VPCanonicalIVPHISC:
+ case VPCumulativeIVPHISC:
case VPBranchOnMaskSC:
case VPDerivedIVSC:
case VPFirstOrderRecurrencePHISC:
@@ -4494,9 +4495,9 @@ void VPActiveLaneMaskPHIRecipe::printRecipe(raw_ostream &O, const Twine &Indent,
#endif
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
-void VPEVLBasedIVPHIRecipe::printRecipe(raw_ostream &O, const Twine &Indent,
- VPSlotTracker &SlotTracker) const {
- O << Indent << "EXPLICIT-VECTOR-LENGTH-BASED-IV-PHI ";
+void VPCumulativeIVPHIRecipe::printRecipe(raw_ostream &O, const Twine &Indent,
+ VPSlotTracker &SlotTracker) const {
+ O << Indent << "Cumulative-IV-PHI ";
printAsOperand(O, SlotTracker);
O << " = phi ";
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index bfef277070db7..d58fa8c6eaa73 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2040,7 +2040,7 @@ static bool simplifyBranchConditionForVFAndUF(VPlan &Plan, ElementCount BestVF,
if (all_of(Header->phis(), [](VPRecipeBase &Phi) {
if (auto *R = dyn_cast<VPWidenIntOrFpInductionRecipe>(&Phi))
return R->isCanonical();
- return isa<VPCanonicalIVPHIRecipe, VPEVLBasedIVPHIRecipe,
+ return isa<VPCanonicalIVPHIRecipe, VPCumulativeIVPHIRecipe,
VPFirstOrderRecurrencePHIRecipe, VPPhi>(&Phi);
})) {
for (VPRecipeBase &HeaderR : make_early_inc_range(Header->phis())) {
@@ -3121,10 +3121,8 @@ static void fixupVFUsersForEVL(VPlan &Plan, VPValue &EVL) {
/// Converts a tail folded vector loop region to step by
/// VPInstruction::ExplicitVectorLength elements instead of VF elements each
/// iteration.
-///
-/// - Add a VPEVLBasedIVPHIRecipe and related recipes to \p Plan and
-/// replaces all uses except the canonical IV increment of
-/// VPCanonicalIVPHIRecipe with a VPEVLBasedIVPHIRecipe.
+/// This transformation:
+/// - Makes VPCumulativeIVPHIRecipe step by EVL instead of VFxUF.
/// VPCanonicalIVPHIRecipe is used only for loop iterations counting after
/// this transformation.
///
@@ -3134,6 +3132,8 @@ static void fixupVFUsersForEVL(VPlan &Plan, VPValue &EVL) {
/// previous iteration, and VPFirstOrderRecurrencePHIRecipes are replaced with
/// @llvm.vp.splice.
///
+/// - Switches the loop from up-counting to down-counting.
+///
/// The function uses the following definitions:
/// %StartV is the canonical induction start value.
///
@@ -3144,13 +3144,13 @@ static void fixupVFUsersForEVL(VPlan &Plan, VPValue &EVL) {
///
/// vector.body:
/// ...
-/// %EVLPhi = EXPLICIT-VECTOR-LENGTH-BASED-IV-PHI [ %StartV, %vector.ph ],
-/// [ %NextEVLIV, %vector.body ]
+/// %CumulativeIVPhi = Cumulative-IV-PHI [ %StartV, %vector.ph ],
+/// [ %NextIV, %vector.body ]
/// %AVL = phi [ trip-count, %vector.ph ], [ %NextAVL, %vector.body ]
/// %VPEVL = EXPLICIT-VECTOR-LENGTH %AVL
/// ...
/// %OpEVL = cast i32 %VPEVL to IVSize
-/// %NextEVLIV = add IVSize %OpEVL, %EVLPhi
+/// %NextIV = add IVSize %OpEVL, %CumulativeIVPhi
/// %NextAVL = sub IVSize nuw %AVL, %OpEVL
/// ...
///
@@ -3160,15 +3160,15 @@ static void fixupVFUsersForEVL(VPlan &Plan, VPValue &EVL) {
///
/// vector.body:
/// ...
-/// %EVLPhi = EXPLICIT-VECTOR-LENGTH-BASED-IV-PHI [ %StartV, %vector.ph ],
-/// [ %NextEVLIV, %vector.body ]
+/// %CumulativeIVPhi = Cumulative-IV-PHI [ %StartV, %vector.ph ],
+/// [ %NextIV, %vector.body ]
/// %AVL = phi [ trip-count, %vector.ph ], [ %NextAVL, %vector.body ]
/// %cmp = cmp ult %AVL, MaxSafeElements
/// %SAFE_AVL = select %cmp, %AVL, MaxSafeElements
/// %VPEVL = EXPLICIT-VECTOR-LENGTH %SAFE_AVL
/// ...
/// %OpEVL = cast i32 %VPEVL to IVSize
-/// %NextEVLIV = add IVSize %OpEVL, %EVLPhi
+/// %NextIV = add IVSize %OpEVL, %CumulativeIVPhi
/// %NextAVL = sub IVSize nuw %AVL, %OpEVL
/// ...
///
@@ -3181,11 +3181,9 @@ void VPlanTransforms::addExplicitVectorLength(
auto *CanonicalIVPHI = LoopRegion->getCanonicalIV();
auto *CanIVTy = LoopRegion->getCanonicalIVType();
- VPValue *StartV = CanonicalIVPHI->getStartValue();
- // Create the ExplicitVectorLengthPhi recipe in the main loop.
- auto *EVLPhi = new VPEVLBasedIVPHIRecipe(StartV, DebugLoc::getUnknown());
- EVLPhi->insertAfter(CanonicalIVPHI);
+ VPCumulativeIVPHIRecipe *CumulativeIVPhi = LoopRegion->getCumulativeIV();
+ VPRecipeBase *CumulativeIVInc = &CumulativeIVPhi->getBackedgeRecipe();
VPBuilder Builder(Header, Header->getFirstNonPhi());
// Create the AVL (application vector length), starting from TC -> 0 in steps
// of EVL.
@@ -3205,19 +3203,17 @@ void VPlanTransforms::addExplicitVectorLength(
auto *CanonicalIVIncrement =
cast<VPInstruction>(CanonicalIVPHI->getBackedgeValue());
- Builder.setInsertPoint(CanonicalIVIncrement);
+ Builder.setInsertPoint(CumulativeIVInc);
VPValue *OpVPEVL = VPEVL;
auto *I32Ty = Type::getInt32Ty(Plan.getContext());
- OpVPEVL = Builder.createScalarZExtOrTrunc(
- OpVPEVL, CanIVTy, I32Ty, CanonicalIVIncrement->getDebugLoc());
+ OpVPEVL = Builder.createScalarZExtOrTrunc(OpVPEVL, CanIVTy, I32Ty,
+ CumulativeIVInc->getDebugLoc());
+
+ CumulativeIVInc->setOperand(0, OpVPEVL);
- auto *NextEVLIV = Builder.createOverflowingOp(
- Instruction::Add, {OpVPEVL, EVLPhi},
- {CanonicalIVIncrement->hasNoUnsignedWrap(),
- CanonicalIVIncrement->hasNoSignedWrap()},
- CanonicalIVIncrement->getDebugLoc(), "index.evl.next");
- EVLPhi->addOperand(NextEVLIV);
+ Builder.setInsertPoint(CumulativeIVInc->getParent(),
+ std::next(CumulativeIVInc->getIterator()));
VPValue *NextAVL = Builder.createOverflowingOp(
Instruction::Sub, {AVLPhi, OpVPEVL}, {/*hasNUW=*/true, /*hasNSW=*/false},
@@ -3228,89 +3224,135 @@ void VPlanTransforms::addExplicitVectorLength(
removeDeadRecipes(Plan);
// Replace all uses of VPCanonicalIVPHIRecipe by
- // VPEVLBasedIVPHIRecipe except for the canonical IV increment.
- CanonicalIVPHI->replaceAllUsesWith(EVLPhi);
+ // VPCumulativeIVPHIRecipe except for the canonical IV increment.
+ CanonicalIVPHI->replaceAllUsesWith(CumulativeIVPhi);
CanonicalIVIncrement->setOperand(0, CanonicalIVPHI);
+
// TODO: support unroll factor > 1.
Plan.setUF(1);
+
+ // Switch the loop from up-counting to down counting.
+ // convert (branch-on-count (CanonicalInc, VTC)
+ // -> (branch-on-count (sub VTC, CanonicalIVInc), 0)
+ VPBasicBlock *LatchVPBB = LoopRegion->getExitingBasicBlock();
+ auto *LatchExitingBranch = cast<VPInstruction>(LatchVPBB->getTerminator());
+ if (match(LatchExitingBranch, m_BranchOnCond(m_True())))
+ return;
+ assert(match(LatchExitingBranch,
+ m_BranchOnCount(m_Specific(CanonicalIVIncrement),
+ m_Specific(&Plan.getVectorTripCount()))) &&
+ "Unexpected terminator");
+ Builder.setInsertPoint(LatchExitingBranch);
+ VPValue *RemainElementCount = Builder.createOverflowingOp(
+ Instruction::Sub, {&Plan.getVectorTripCount(), CanonicalIVIncrement},
+ {/*hasNUW=*/true, /*hasNSW=*/false}, DebugLoc::getCompilerGenerated(),
+ "remain.element.count");
+ auto *Zero = Plan.getOrAddLiveIn(ConstantInt::get(CanIVTy, 0));
+ LatchExitingBranch->setOperand(0, RemainElementCount);
+ LatchExitingBranch->setOperand(1, Zero);
+}
+
+void VPlanTransforms::removeFixedStepCumulativeIV(VPlan &Plan) {
+ VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
+ auto *CanonicalIV = LoopRegion->getCanonicalIV();
+ VPCumulativeIVPHIRecipe *CumulativeIVPhi = LoopRegion->getCumulativeIV();
+ VPRecipeBase *CumulativeIVInc = &CumulativeIVPhi->getBackedgeRecipe();
+ // Replace all uses with CanonicalIV if it steps by VF*UF.
+ if (match(CumulativeIVInc,
+ m_Binary<Instruction::Add>(m_Specific(&Plan.getVFxUF()),
+ m_Specific(CumulativeIVPhi)))) {
+ CumulativeIVPhi->replaceAllUsesWith(CanonicalIV);
+ CumulativeIVPhi->eraseFromParent();
+ CumulativeIVInc->eraseFromParent();
+ }
}
-void VPlanTransforms::canonicalizeEVLLoops(VPlan &Plan) {
- // Find EVL loop entries by locating VPEVLBasedIVPHIRecipe.
- // There should be only one EVL PHI in the entire plan.
- VPEVLBasedIVPHIRecipe *EVLPhi = nullptr;
+void VPlanTransforms::convertToVariableLengthStep(VPlan &Plan,
+ bool TailByMasking) {
+ VPCumulativeIVPHIRecipe *CumulativeIVPhi = nullptr;
for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
vp_depth_first_shallow(Plan.getEntry())))
for (VPRecipeBase &R : VPBB->phis())
- if (auto *PhiR = dyn_cast<VPEVLBasedIVPHIRecipe>(&R)) {
- assert(!EVLPhi && "Found multiple EVL PHIs. Only one expected");
- EVLPhi = PhiR;
+ if (auto *PhiR = dyn_cast<VPCumulativeIVPHIRecipe>(&R)) {
+ assert(!CumulativeIVPhi &&
+ "Found multiple CumulativeIV. Only one expected");
+ CumulativeI...
[truncated]
@@ -567,6 +586,8 @@ static void addInitialSkeleton(VPlan &Plan, Type *InductionTy, DebugLoc IVDL,
         {VectorPhiR, VectorPhiR->getOperand(0)}, VectorPhiR->getDebugLoc());
     cast<VPIRPhi>(&ScalarPhiR)->addOperand(ResumePhiR);
   }
+
+  addCumulativeIVRecipes(Plan, HeaderVPBB, LatchVPBB, InductionTy, IVDL);
Is there a particular reason why we always add the cumulative IV recipe for every VPlan? I would have thought we would only need to add it in the cases where we need to convert to a variably stepping loop region.
It is inspired by #166164; transforms after addExplicitVectorLength are expected to use CumulativeIV instead of CanonicalIV. For instance, the diff in createScalarIVSteps:

VPHeaderPHIRecipe *IV = LoopRegion->getCanonicalIV();
if (auto *EVLIV =
        dyn_cast<VPEVLBasedIVPHIRecipe>(std::next(IV->getIterator())))
  IV = EVLIV;

The transform can simply use getCumulativeIV() when it cares about the processed element count.
To support getCumulativeIV(), I chose to create it by default during VPlan construction and remove it later if unused.
An alternative approach would be to only create it after variable-length transforms (e.g., addExplicitVectorLength), and have getCumulativeIV() fall back to CanonicalIV when CumulativeIV doesn't exist.
Would that be preferable?
An alternative approach would be to only create it after variable-length transforms (e.g., addExplicitVectorLength), and have getCumulativeIV() fall back to CanonicalIV when CumulativeIV doesn't exist.
Would that be preferable?
Yes, I think that's what I had in mind. That way we don't change any VPlans that don't care about CumulativeIV, and we don't need removeFixedStepCumulativeIV.
What we could also do in a prior NFC is add VPRegionBlock::getCumulativeIV, returning just getCanonicalIV() for now, and go through every caller of getCanonicalIV() to check whether it should be moved to getCumulativeIV().
Then in this PR you can make getCumulativeIV() return the VPCumulativeIVPHIRecipe when it exists.
      "remain.element.count");
  auto *Zero = Plan.getOrAddLiveIn(ConstantInt::get(CanIVTy, 0));
  LatchExitingBranch->setOperand(0, RemainElementCount);
  LatchExitingBranch->setOperand(1, Zero);
How come we're now converting the branch condition earlier in the EVL specific transform? I think fault-only-first loads will want this format as well so I would have thought we would want to keep it shared in VPlanTransforms::convertToVariableLengthStep
I made it a multi-step transform because AVL {TC,-,Step} may be optional.
(branch-on-count CanonicalIVInc, VTC)
-> (branch-on-count (sub VectorTripCount, CanonicalIVInc), 0)
-> (branch-on-count (sub TripCount, (add Step, CumulativeIV)), 0)
-> (branch-on-count (sub AVL, Step), 0)
For a mask-based loop with llvm.masked.load.ff, I'm not sure if we need an AVL to generate the mask.
For example:
%mask = call <VF x i1> @llvm.get.active.lane.mask(i32 %cumulative.iv, i32 %TC)
%ret = call {<VF x ty>, <VF x i1>} @llvm.masked.load.ff (ptr %ptr, <VF x i1> %mask)
%ret.mask = extractvalue {<VF x ty>, <VF x i1>} %ret, 1
...
%count = call @llvm.cttz.elts (<VF x i1> %ret.mask, i1 true)
%cumulative.iv.next = add %count, %cumulative.iv
Hence I moved the down-counting decision into addExplicitVectorLength.
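To see why the first rewrite step preserves the exit behavior, here is a small sketch (plain Python integers; it assumes the vector trip count is a multiple of the step, as it is for the fixed-step branch-on-count being rewritten):

```python
# Up-counting exit: branch-on-count(CanonicalIVInc, VTC)
def up_counting_iters(vtc, step):
    iv, iters = 0, 0
    while True:
        iv += step            # CanonicalIVInc
        iters += 1
        if iv == vtc:         # exit when the incremented IV reaches VTC
            return iters

# Down-counting exit: branch-on-count(sub(VTC, CanonicalIVInc), 0)
def down_counting_iters(vtc, step):
    iv, iters = 0, 0
    while True:
        iv += step
        iters += 1
        remaining = vtc - iv  # remain.element.count
        if remaining == 0:    # exit when no elements remain
            return iters

print(up_counting_iters(12, 4), down_counting_iters(12, 4))  # 3 3
```

Both forms take the backedge the same number of times; the down-counting form is just one subtraction away, which is what makes the later AVL-based steps possible.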
I see what you mean about llvm.masked.load.ff, but I think all the different transforms needed for branch-on-count is kind of complicated.
If we want to do the "downward counting" transform only for EVL tail-folded loops and not for every variably stepped loop, I think it would be easier to just split it out from canonicalizeEVLLoops. I've opened #178181 for this; after that, this PR shouldn't need to worry about the exit condition.
I've landed #178181 so that the EVL exit condition transform is split out. I've also relaxed the assertion that the VPEVLBasedIVPHIRecipe should have an ExplicitVectorLength in its backedge value, so I think you can probably get away with just renaming VPEVLBasedIVPHIRecipe in it.
Thanks for #178181! That makes this an NFC now.
By the way, as a topic for bikeshedding: I think we might need to find a term other than "cumulative IV", since IIUC an induction variable by definition has to increment/decrement by a fixed amount each time.
fhahn left a comment
It would be good if you could add a bit more detail on why this is needed and what the benefit is vs. the EVL-based IV. The description mostly describes how things were moved around, but it would be good to clarify why this is needed.
I don't think this recipe should be added to all plans unless there's a clear benefit in all cases. In terms of terminology, IV refers to induction, but induction variables in LLVM terminology step by a loop-invariant amount, which the new recipe would not do in some cases.
Thanks! I've updated the description to clarify the motivation. Please take another look!
This is split out from llvm#177114. In order to make canonicalizeEVLLoops a generic "convert to variable stepping" transform, move the code that changes the exit condition to a separate transform. Run it before canonicalizeEVLLoops, before VPEVLBasedIVPHIRecipe is expanded. Also relax the assertion for VPInstruction::ExplicitVectorLength to just bail instead, since eventually VPEVLBasedIVPHIRecipe will be used by other loops that aren't EVL tail folded.
  llvm_unreachable("cloning not implemented yet");
  VPNumProcessedElementsPHIRecipe *clone() override {
    auto *R =
        new VPNumProcessedElementsPHIRecipe(getStartValue(), getDebugLoc());
Just curious, does the phi get cloned when unrolling with #151300?
…NFC (#178181)

This is split out from #177114. In order to make canonicalizeEVLLoops a generic "convert to variable stepping" transform, move the code that changes the exit condition to a separate transform, since not all variable-stepping loops will want to transform the exit condition. Run it before canonicalizeEVLLoops, before VPEVLBasedIVPHIRecipe is expanded. Also relax the assertion for VPInstruction::ExplicitVectorLength to just bail instead, since eventually VPEVLBasedIVPHIRecipe will be used by other loops that aren't EVL tail folded.
I think VPNumProcessedElements is a bit non-standard, and it's not clear to me what constitutes an element being processed. E.g. if the original scalar loop has two loads and two stores, is each load/store pair one element processed? How about something like CurrentTripCount? That way it can be defined in terms of the original scalar loop.
This can be confusing as well; one might think of the number of iterations remaining (maybe?). How about
I see what you mean; I don't think the term "processed" is used much in the loop vectorizer currently, though. How about CurrentIteration? That would match how VPlan::TripCount/VPlan::BackedgeTakenCount implicitly refer to the scalar loop. The vector equivalent would be VPlan::VectorCurrentIteration.
This is groundwork for llvm#151300, which aims to support first-faulting loads in non-tail-folded early-exit loops. Per llvm#175900, we need a variable-length stepping transform that can be shared between EVL and non-EVL loops. The idea is to have an EVL-independent counter and transform for tracking the cumulative number of processed elements. This patch renames the existing counter (VPEVLBasedIVPHIRecipe) and transform (canonicalizeEVLLoops) to be EVL-independent:
- Rename VPEVLBasedIVPHIRecipe to VPCurrentIterationRecipe to reflect its general purpose of tracking the processed element count.
- Rename canonicalizeEVLLoops to convertToVariableLengthStep.
I'm renaming the title again, apologies for the confusion.
lukel97 left a comment
LGTM, just left some nits. Can you also update the PR title to mention it's NFC?
fhahn left a comment
It would probably be good to update the title/description to clarify that this renames the EVL-based PHI recipe to a more general name. My reading of the current title implies new functionality.
1. Rename VPCurrentIterationRecipe to VPCurrentIterationPHIRecipe.
2. Rename VPCurrentIterationSC to VPCurrentIterationPHISC.
3. Rephrase "current index of elements".
4. Update the assertion string in the verifier.
@fhahn is there anything blocking this, or any changes you'd like me to make?
  // The EVL IV is always immediately after the canonical IV.
  auto *EVLPhi =
      dyn_cast_or_null<VPEVLBasedIVPHIRecipe>(std::next(CanIV->getIterator()));
  auto *EVLPhi = dyn_cast_or_null<VPCurrentIterationPHIRecipe>(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably also needs updating; whether it is EVL-based will be determined later, by checking the increment, I think.
Right after this, we check that the increment is EVL and bail out if it's not.
fhahn left a comment
I think there's also llvm/test/Transforms/LoopVectorize/vplan-force-tail-with-evl.ll, which has ; NO-VP-NOT: EXPLICIT-VECTOR-LENGTH-BASED-IV-PHI
This is groundwork for #151300, which aims to support first-faulting loads in non-tail-folded early-exit loops.

Per #175900, we need a variable-length stepping transform that can be shared between EVL and non-EVL loops. The idea is to have an EVL-independent counter and transform for tracking the cumulative number of processed elements.

This patch renames the existing counter (VPEVLBasedIVPHIRecipe) and transform (canonicalizeEVLLoops) to be EVL-independent:
- Rename VPEVLBasedIVPHIRecipe to VPCurrentIterationPHIRecipe to reflect its general purpose of tracking the processed element count.
- Rename canonicalizeEVLLoops to convertToVariableLengthStep.

This is NFC.