[LV, VP]VP intrinsics support for the Loop Vectorizer + adding new tail-folding mode using EVL. #76172
@llvm/pr-subscribers-backend-powerpc @llvm/pr-subscribers-llvm-analysis
Author: Alexey Bataev (alexey-bataev)
Changes
This patch introduces generation of VP intrinsics in the Loop Vectorizer.
Currently the Loop Vectorizer supports vector predication in a very limited capacity via tail-folding and masked load/store/gather/scatter intrinsics. However, this does not let architectures with active vector length predication support take advantage of their capabilities. Architectures with general masked predication support can likewise take advantage of predication only on memory operations. By having a way for the Loop Vectorizer to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.
Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for load/store instructions.
Another important part of this approach is how the Explicit Vector Length is computed. (We use active vector length and explicit vector length interchangeably; VP intrinsics define this vector length parameter as Explicit Vector Length (EVL).) We consider the following three ways to compute the EVL parameter for the VP intrinsics.
This patch also adds a new recipe to emit instructions for computing EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.
===Tentative Development Roadmap===
Differential Revision: https://reviews.llvm.org/D99750
Patch is 101.79 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/76172.diff
24 Files Affected:
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 735be3680aea0d..e2a127ff35be26 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -190,7 +190,10 @@ enum class TailFoldingStyle {
/// Use predicate to control both data and control flow, but modify
/// the trip count so that a runtime overflow check can be avoided
/// and such that the scalar epilogue loop can always be removed.
- DataAndControlFlowWithoutRuntimeCheck
+ DataAndControlFlowWithoutRuntimeCheck,
+ /// Use predicated EVL instructions for tail-folding.
+ /// Indicates that VP intrinsics should be used if tail-folding is enabled.
+ DataWithEVL,
};
struct TailFoldingInfo {
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 4614446b2150b7..1a9abaea811159 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -169,6 +169,10 @@ RISCVTTIImpl::getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,
return TTI::TCC_Free;
}
+bool RISCVTTIImpl::hasActiveVectorLength(unsigned, Type *DataTy, Align) const {
+ return ST->hasVInstructions();
+}
+
TargetTransformInfo::PopcntSupportKind
RISCVTTIImpl::getPopcntSupport(unsigned TyWidth) {
assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index 96ecc771863e56..d2592be75000de 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -72,6 +72,22 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
const APInt &Imm, Type *Ty,
TTI::TargetCostKind CostKind);
+ /// \name Vector Predication Information
+ /// Whether the target supports the %evl parameter of VP intrinsic efficiently
+ /// in hardware, for the given opcode and type/alignment. (see LLVM Language
+ /// Reference - "Vector Predication Intrinsics",
+ /// https://llvm.org/docs/LangRef.html#vector-predication-intrinsics and
+ /// "IR-level VP intrinsics",
+ /// https://llvm.org/docs/Proposals/VectorPredication.html#ir-level-vp-intrinsics).
+ /// \param Opcode the opcode of the instruction checked for predicated version
+ /// support.
+ /// \param DataType the type of the instruction with the \p Opcode checked for
+ /// predication support.
+ /// \param Alignment the alignment for memory access operation checked for
+ /// predicated version support.
+ bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
+ Align Alignment) const;
+
TargetTransformInfo::PopcntSupportKind getPopcntSupport(unsigned TyWidth);
bool shouldExpandReduction(const IntrinsicInst *II) const;
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index f82e161fb846d1..7b0e268877ded3 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -123,6 +123,7 @@
#include "llvm/IR/User.h"
#include "llvm/IR/Value.h"
#include "llvm/IR/ValueHandle.h"
+#include "llvm/IR/VectorBuilder.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/CommandLine.h"
@@ -247,10 +248,12 @@ static cl::opt<TailFoldingStyle> ForceTailFoldingStyle(
clEnumValN(TailFoldingStyle::DataAndControlFlow, "data-and-control",
"Create lane mask using active.lane.mask intrinsic, and use "
"it for both data and control flow"),
- clEnumValN(
- TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
- "data-and-control-without-rt-check",
- "Similar to data-and-control, but remove the runtime check")));
+ clEnumValN(TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
+ "data-and-control-without-rt-check",
+ "Similar to data-and-control, but remove the runtime check"),
+ clEnumValN(TailFoldingStyle::DataWithEVL, "data-with-evl",
+ "Use predicated EVL instructions for tail folding if the "
+ "target supports vector length predication")));
static cl::opt<bool> MaximizeBandwidth(
"vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
@@ -1106,8 +1109,7 @@ void InnerLoopVectorizer::collectPoisonGeneratingRecipes(
if (isa<VPWidenMemoryInstructionRecipe>(CurRec) ||
isa<VPInterleaveRecipe>(CurRec) ||
isa<VPScalarIVStepsRecipe>(CurRec) ||
- isa<VPCanonicalIVPHIRecipe>(CurRec) ||
- isa<VPActiveLaneMaskPHIRecipe>(CurRec))
+ isa<VPHeaderPHIRecipe>(CurRec))
continue;
// This recipe contributes to the address computation of a widen
@@ -1655,6 +1657,23 @@ class LoopVectorizationCostModel {
return foldTailByMasking() || Legal->blockNeedsPredication(BB);
}
+ /// Returns true if VP intrinsics with explicit vector length support should
+ /// be generated in the tail folded loop.
+ bool useVPIWithVPEVLVectorization() const {
+ return PreferEVL && !EnableVPlanNativePath &&
+ getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+ // FIXME: implement support for max safe dependency distance.
+ Legal->isSafeForAnyVectorWidth() &&
+ // FIXME: remove this once reductions are supported.
+ Legal->getReductionVars().empty() &&
+ // FIXME: remove this once vp_reverse is supported.
+ none_of(
+ WideningDecisions,
+ [](const std::pair<std::pair<Instruction *, ElementCount>,
+ std::pair<InstWidening, InstructionCost>>
+ &Data) { return Data.second.first == CM_Widen_Reverse; });
+ }
+
/// Returns true if the Phi is part of an inloop reduction.
bool isInLoopReduction(PHINode *Phi) const {
return InLoopReductions.contains(Phi);
@@ -1800,6 +1819,10 @@ class LoopVectorizationCostModel {
/// All blocks of loop are to be masked to fold tail of scalar iterations.
bool CanFoldTailByMasking = false;
+ /// Control whether to generate VP intrinsics with explicit-vector-length
+ /// support in vectorized code.
+ bool PreferEVL = false;
+
/// A map holding scalar costs for different vectorization factors. The
/// presence of a cost for an instruction in the mapping indicates that the
/// instruction will be scalarized when vectorizing with the associated
@@ -4883,6 +4906,39 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
// FIXME: look for a smaller MaxVF that does divide TC rather than masking.
if (Legal->prepareToFoldTailByMasking()) {
CanFoldTailByMasking = true;
+ if (getTailFoldingStyle() == TailFoldingStyle::None)
+ return MaxFactors;
+
+ if (UserIC > 1) {
+ LLVM_DEBUG(dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+ "not generate VP intrinsics since interleave count "
+ "specified is greater than 1.\n");
+ return MaxFactors;
+ }
+
+ if (MaxFactors.ScalableVF.isVector()) {
+ assert(MaxFactors.ScalableVF.isScalable() &&
+ "Expected scalable vector factor.");
+ // FIXME: use actual opcode/data type for analysis here.
+ PreferEVL = getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+ TTI.hasActiveVectorLength(0, nullptr, Align());
+#if !NDEBUG
+ if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
+ if (PreferEVL)
+ dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+ "try to generate VP Intrinsics.\n";
+ else
+ dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+ "not try to generate VP Intrinsics since the target "
+ "does not support vector length predication.\n";
+ }
+#endif // !NDEBUG
+
+ // Tail folded loop using VP intrinsics restricts the VF to be scalable.
+ if (PreferEVL)
+ MaxFactors.FixedVF = ElementCount::getFixed(1);
+ }
+
return MaxFactors;
}
@@ -5493,6 +5549,10 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
if (!isScalarEpilogueAllowed())
return 1;
+ // Do not interleave if EVL is preferred and no User IC is specified.
+ if (useVPIWithVPEVLVectorization())
+ return 1;
+
// We used the distance for the interleave count.
if (!Legal->isSafeForAnyVectorWidth())
return 1;
@@ -8622,6 +8682,8 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
VPlanTransforms::truncateToMinimalBitwidths(
*Plan, CM.getMinimalBitwidths(), PSE.getSE()->getContext());
VPlanTransforms::optimize(*Plan, *PSE.getSE());
+ if (CM.useVPIWithVPEVLVectorization())
+ VPlanTransforms::addExplicitVectorLength(*Plan);
assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");
VPlans.push_back(std::move(Plan));
}
@@ -9454,6 +9516,52 @@ void VPReplicateRecipe::execute(VPTransformState &State) {
State.ILV->scalarizeInstruction(UI, this, VPIteration(Part, Lane), State);
}
+/// Creates either vp_store or vp_scatter intrinsics calls to represent
+/// predicated store/scatter.
+static Instruction *
+lowerStoreUsingVectorIntrinsics(IRBuilderBase &Builder, Value *Addr,
+ Value *StoredVal, bool IsScatter, Value *Mask,
+ Value *EVLPart, const Align &Alignment) {
+ CallInst *Call;
+ if (IsScatter) {
+ Call = Builder.CreateIntrinsic(Type::getVoidTy(EVLPart->getContext()),
+ Intrinsic::vp_scatter,
+ {StoredVal, Addr, Mask, EVLPart});
+ } else {
+ VectorBuilder VBuilder(Builder);
+ VBuilder.setEVL(EVLPart).setMask(Mask);
+ Call = cast<CallInst>(VBuilder.createVectorInstruction(
+ Instruction::Store, Type::getVoidTy(EVLPart->getContext()),
+ {StoredVal, Addr}));
+ }
+ Call->addParamAttr(
+ 1, Attribute::getWithAlignment(Call->getContext(), Alignment));
+ return Call;
+}
+
+/// Creates either vp_load or vp_gather intrinsics calls to represent
+/// predicated load/gather.
+static Instruction *lowerLoadUsingVectorIntrinsics(IRBuilderBase &Builder,
+ VectorType *DataTy,
+ Value *Addr, bool IsGather,
+ Value *Mask, Value *EVLPart,
+ const Align &Alignment) {
+ CallInst *Call;
+ if (IsGather) {
+ Call = Builder.CreateIntrinsic(DataTy, Intrinsic::vp_gather,
+ {Addr, Mask, EVLPart}, nullptr,
+ "wide.masked.gather");
+ } else {
+ VectorBuilder VBuilder(Builder);
+ VBuilder.setEVL(EVLPart).setMask(Mask);
+ Call = cast<CallInst>(VBuilder.createVectorInstruction(
+ Instruction::Load, DataTy, Addr, "vp.op.load"));
+ }
+ Call->addParamAttr(
+ 0, Attribute::getWithAlignment(Call->getContext(), Alignment));
+ return Call;
+}
+
void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
VPValue *StoredValue = isStore() ? getStoredValue() : nullptr;
@@ -9523,6 +9631,12 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
return PartPtr;
};
+ auto MaskValue = [&](unsigned Part) -> Value * {
+ if (isMaskRequired)
+ return BlockInMaskParts[Part];
+ return nullptr;
+ };
+
// Handle Stores:
if (SI) {
State.setDebugLocFrom(SI->getDebugLoc());
@@ -9530,7 +9644,22 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
for (unsigned Part = 0; Part < State.UF; ++Part) {
Instruction *NewSI = nullptr;
Value *StoredVal = State.get(StoredValue, Part);
- if (CreateGatherScatter) {
+ if (State.EVL) {
+ Value *EVLPart = State.get(State.EVL, Part);
+ // If EVL is not nullptr, then EVL must be a valid value set during plan
+ // creation, possibly default value = whole vector register length. EVL
+ // is created only if TTI prefers predicated vectorization, thus if EVL
+ // is not nullptr it also implies preference for predicated
+ // vectorization.
+ // FIXME: Support reverse store after vp_reverse is added.
+ NewSI = lowerStoreUsingVectorIntrinsics(
+ Builder,
+ CreateGatherScatter
+ ? State.get(getAddr(), Part)
+ : CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0))),
+ StoredVal, CreateGatherScatter, MaskValue(Part), EVLPart,
+ Alignment);
+ } else if (CreateGatherScatter) {
Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
Value *VectorGep = State.get(getAddr(), Part);
NewSI = Builder.CreateMaskedScatter(StoredVal, VectorGep, Alignment,
@@ -9561,7 +9690,21 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
State.setDebugLocFrom(LI->getDebugLoc());
for (unsigned Part = 0; Part < State.UF; ++Part) {
Value *NewLI;
- if (CreateGatherScatter) {
+ if (State.EVL) {
+ Value *EVLPart = State.get(State.EVL, Part);
+ // If EVL is not nullptr, then EVL must be a valid value set during plan
+ // creation, possibly default value = whole vector register length. EVL
+ // is created only if TTI prefers predicated vectorization, thus if EVL
+ // is not nullptr it also implies preference for predicated
+ // vectorization.
+ // FIXME: Support reverse loading after vp_reverse is added.
+ NewLI = lowerLoadUsingVectorIntrinsics(
+ Builder, DataTy,
+ CreateGatherScatter
+ ? State.get(getAddr(), Part)
+ : CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0))),
+ CreateGatherScatter, MaskValue(Part), EVLPart, Alignment);
+ } else if (CreateGatherScatter) {
Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
Value *VectorGep = State.get(getAddr(), Part);
NewLI = Builder.CreateMaskedGather(DataTy, VectorGep, Alignment, MaskPart,
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 94cb7688981361..0ca668abbe60c7 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -242,6 +242,12 @@ struct VPTransformState {
ElementCount VF;
unsigned UF;
+ /// If EVL is not nullptr, then EVL must be a valid value set during plan
+ /// creation, possibly a default value = whole vector register length. EVL is
+ /// created only if TTI prefers predicated vectorization, thus if EVL is
+ /// not nullptr it also implies preference for predicated vectorization.
+ VPValue *EVL = nullptr;
+
/// Hold the indices to generate specific scalar instructions. Null indicates
/// that all instances are to be generated, using either scalar or vector
/// instructions.
@@ -1057,6 +1063,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
SLPLoad,
SLPStore,
ActiveLaneMask,
+ ExplicitVectorLength,
+ ExplicitVectorLengthIVIncrement,
CalculateTripCountMinusVF,
// Increment the canonical IV separately for each unrolled part.
CanonicalIVIncrementForPart,
@@ -1165,6 +1173,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
default:
return false;
case VPInstruction::ActiveLaneMask:
+ case VPInstruction::ExplicitVectorLength:
+ case VPInstruction::ExplicitVectorLengthIVIncrement:
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::BranchOnCount:
@@ -2180,6 +2190,39 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
#endif
};
+/// A recipe for generating the phi node for the current index of elements,
+/// adjusted in accordance with EVL value. It starts at StartIV value and gets
+/// incremented by EVL in each iteration of the vector loop.
+class VPEVLBasedIVPHIRecipe : public VPHeaderPHIRecipe {
+public:
+ VPEVLBasedIVPHIRecipe(VPValue *StartMask, DebugLoc DL)
+ : VPHeaderPHIRecipe(VPDef::VPEVLBasedIVPHISC, nullptr, StartMask, DL) {}
+
+ ~VPEVLBasedIVPHIRecipe() override = default;
+
+ VP_CLASSOF_IMPL(VPDef::VPEVLBasedIVPHISC)
+
+ static inline bool classof(const VPHeaderPHIRecipe *D) {
+ return D->getVPDefID() == VPDef::VPEVLBasedIVPHISC;
+ }
+
+ /// Generate phi for handling IV based on EVL over iterations correctly.
+ void execute(VPTransformState &State) override;
+
+ /// Returns true if the recipe only uses the first lane of operand \p Op.
+ bool onlyFirstLaneUsed(const VPValue *Op) const override {
+ assert(is_contained(operands(), Op) &&
+ "Op must be an operand of the recipe");
+ return true;
+ }
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+ /// Print the recipe.
+ void print(raw_ostream &O, const Twine &Indent,
+ VPSlotTracker &SlotTracker) const override;
+#endif
+};
+
/// A Recipe for widening the canonical induction variable of the vector loop.
class VPWidenCanonicalIVRecipe : public VPRecipeBase, public VPValue {
public:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 97a8a1803bbf5a..b8ed256d236a4b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -207,14 +207,14 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
Type *ResultTy =
TypeSwitch<const VPRecipeBase *, Type *>(V->getDefiningRecipe())
.Case<VPCanonicalIVPHIRecipe, VPFirstOrderRecurrencePHIRecipe,
- VPReductionPHIRecipe, VPWidenPointerInductionRecipe>(
- [this](const auto *R) {
- // Handle header phi recipes, except VPWienIntOrFpInduction
- // which needs special handling due it being possibly truncated.
- // TODO: consider inferring/caching type of siblings, e.g.,
- // backedge value, here and in cases below.
- return inferScalarType(R->getStartValue());
- })
+ VPReductionPHIRecipe, VPWidenPointerInductionRecipe,
+ VPEVLBasedIVPHIRecipe>([this](const auto *R) {
+ // Handle header phi recipes, except VPWienIntOrFpInduction
+ // which needs special handling due it being possibly truncated.
+ // TODO: consider inferring/caching type of siblings, e.g.,
+ // backedge value, here and in cases below.
+ return inferScalarType(R->getStartValue());
+ })
.Case<VPWidenIntOrFpInductionRecipe, VPDerivedIVRecipe>(
[](const auto *R) { return R->getScalarType(); })
.Case<VPPredInstPHIRecipe, VPWidenPHIRecipe, VPScalarIVStepsRecipe,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 02e400d590bed4..5e0344a14df5da 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -345,6 +345,44 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
Value *Zero = ConstantInt::get(ScalarTC->getType(), 0);
return Builder.CreateSelect(Cmp, Sub, Zero);
}
+ case VPInstruction::ExplicitVectorLength: {
+ // Compute EVL
+ auto GetSetVL = [=](VPTransformState &State, Value *EVL) {
+ assert(EVL->getType()->isIntegerTy() &&
+ "Requested vector length should be an integer.");
+
+ // TODO: Add support for MaxSafeDist for correct loop emission.
+ Value *VFArg = State.Builder.getInt32(State.VF.getKnownMinValue());
+
+ Value *GVL = State.Builder.CreateIntrinsic(
+ State.Builder.getInt32Ty(), Intrinsic::experimental_get_vector_length,
+ {EVL, VFArg, State.Builder.getTrue()});
+ return GVL;
+ };
+ // TODO: Restructur...
[truncated]
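The description above says that the EVL-based tail folding replaces masked memory operations with VP loads/stores and advances the induction variable by EVL on each vector iteration (see `VPEVLBasedIVPHIRecipe`). The following is a minimal scalar sketch of that loop shape, not the vectorizer's actual output. `modelGetVectorLength` and `evlAdd` are hypothetical names introduced only for illustration, and the sketch assumes the EVL is simply `min(remaining, VF)`; a real target may legally return a different value from `llvm.experimental.get.vector.length`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the EVL computation. Assumption: the target
// processes as many elements as fit, i.e. min(Remaining, VF).
static std::size_t modelGetVectorLength(std::size_t Remaining, std::size_t VF) {
  return std::min(Remaining, VF);
}

// Scalar model of an EVL tail-folded loop computing C[i] = A[i] + B[i].
// Returns the EVL chosen in each "vector" iteration so callers can inspect it.
static std::vector<std::size_t> evlAdd(const std::vector<int> &A,
                                       const std::vector<int> &B,
                                       std::vector<int> &C, std::size_t VF) {
  assert(VF > 0 && "VF must be non-zero or the loop would not advance");
  assert(B.size() >= A.size() && C.size() >= A.size());

  std::vector<std::size_t> EVLs;
  const std::size_t TripCount = A.size();
  std::size_t IV = 0; // EVL-based induction variable, starts at 0.
  while (IV < TripCount) {
    // Compute how many lanes are active in this iteration.
    const std::size_t EVL = modelGetVectorLength(TripCount - IV, VF);
    // Models vp.load / add / vp.store: only the first EVL lanes are touched,
    // so no scalar epilogue loop is needed for the tail.
    for (std::size_t Lane = 0; Lane < EVL; ++Lane)
      C[IV + Lane] = A[IV + Lane] + B[IV + Lane];
    EVLs.push_back(EVL);
    IV += EVL; // The IV is incremented by EVL, not by VF.
  }
  return EVLs;
}
```

Under these assumptions, calling `evlAdd` on 10-element inputs with `VF = 4` runs three iterations; the last one handles the two-element tail with a shorter EVL instead of a mask or a scalar remainder loop.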
State.setDebugLocFrom(LI->getDebugLoc());
for (unsigned Part = 0; Part < State.UF; ++Part) {
Value *NewLI;
- if (CreateGatherScatter) {
+ if (State.EVL) {
+ Value *EVLPart = State.get(State.EVL, Part);
+ // If EVL is not nullptr, then EVL must be a valid value set during plan
+ // creation, possibly default value = whole vector register length. EVL
+ // is created only if TTI prefers predicated vectorization, thus if EVL
+ // is not nullptr it also implies preference for predicated
+ // vectorization.
+ // FIXME: Support reverse loading after vp_reverse is added.
+ NewLI = lowerLoadUsingVectorIntrinsics(
+ Builder, DataTy,
+ CreateGatherScatter
+ ? State.get(getAddr(), Part)
+ : CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0))),
+ CreateGatherScatter, MaskValue(Part), EVLPart, Alignment);
+ } else if (CreateGatherScatter) {
Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
Value *VectorGep = State.get(getAddr(), Part);
NewLI = Builder.CreateMaskedGather(DataTy, VectorGep, Alignment, MaskPart,
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 94cb7688981361..0ca668abbe60c7 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -242,6 +242,12 @@ struct VPTransformState {
ElementCount VF;
unsigned UF;
+ /// If EVL is not nullptr, then EVL must be a valid value set during plan
+ /// creation, possibly a default value = whole vector register length. EVL is
+ /// created only if TTI prefers predicated vectorization, thus if EVL is
+ /// not nullptr it also implies preference for predicated vectorization.
+ VPValue *EVL = nullptr;
+
/// Hold the indices to generate specific scalar instructions. Null indicates
/// that all instances are to be generated, using either scalar or vector
/// instructions.
@@ -1057,6 +1063,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
SLPLoad,
SLPStore,
ActiveLaneMask,
+ ExplicitVectorLength,
+ ExplicitVectorLengthIVIncrement,
CalculateTripCountMinusVF,
// Increment the canonical IV separately for each unrolled part.
CanonicalIVIncrementForPart,
@@ -1165,6 +1173,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
default:
return false;
case VPInstruction::ActiveLaneMask:
+ case VPInstruction::ExplicitVectorLength:
+ case VPInstruction::ExplicitVectorLengthIVIncrement:
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::BranchOnCount:
@@ -2180,6 +2190,39 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
#endif
};
+/// A recipe for generating the phi node for the current index of elements,
+/// adjusted in accordance with EVL value. It starts at StartIV value and gets
+/// incremented by EVL in each iteration of the vector loop.
+class VPEVLBasedIVPHIRecipe : public VPHeaderPHIRecipe {
+public:
+ VPEVLBasedIVPHIRecipe(VPValue *StartMask, DebugLoc DL)
+ : VPHeaderPHIRecipe(VPDef::VPEVLBasedIVPHISC, nullptr, StartMask, DL) {}
+
+ ~VPEVLBasedIVPHIRecipe() override = default;
+
+ VP_CLASSOF_IMPL(VPDef::VPEVLBasedIVPHISC)
+
+ static inline bool classof(const VPHeaderPHIRecipe *D) {
+ return D->getVPDefID() == VPDef::VPEVLBasedIVPHISC;
+ }
+
+ /// Generate phi for handling IV based on EVL over iterations correctly.
+ void execute(VPTransformState &State) override;
+
+ /// Returns true if the recipe only uses the first lane of operand \p Op.
+ bool onlyFirstLaneUsed(const VPValue *Op) const override {
+ assert(is_contained(operands(), Op) &&
+ "Op must be an operand of the recipe");
+ return true;
+ }
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+ /// Print the recipe.
+ void print(raw_ostream &O, const Twine &Indent,
+ VPSlotTracker &SlotTracker) const override;
+#endif
+};
+
/// A Recipe for widening the canonical induction variable of the vector loop.
class VPWidenCanonicalIVRecipe : public VPRecipeBase, public VPValue {
public:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 97a8a1803bbf5a..b8ed256d236a4b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -207,14 +207,14 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
Type *ResultTy =
TypeSwitch<const VPRecipeBase *, Type *>(V->getDefiningRecipe())
.Case<VPCanonicalIVPHIRecipe, VPFirstOrderRecurrencePHIRecipe,
- VPReductionPHIRecipe, VPWidenPointerInductionRecipe>(
- [this](const auto *R) {
- // Handle header phi recipes, except VPWienIntOrFpInduction
- // which needs special handling due it being possibly truncated.
- // TODO: consider inferring/caching type of siblings, e.g.,
- // backedge value, here and in cases below.
- return inferScalarType(R->getStartValue());
- })
+ VPReductionPHIRecipe, VPWidenPointerInductionRecipe,
+ VPEVLBasedIVPHIRecipe>([this](const auto *R) {
+ // Handle header phi recipes, except VPWienIntOrFpInduction
+ // which needs special handling due it being possibly truncated.
+ // TODO: consider inferring/caching type of siblings, e.g.,
+ // backedge value, here and in cases below.
+ return inferScalarType(R->getStartValue());
+ })
.Case<VPWidenIntOrFpInductionRecipe, VPDerivedIVRecipe>(
[](const auto *R) { return R->getScalarType(); })
.Case<VPPredInstPHIRecipe, VPWidenPHIRecipe, VPScalarIVStepsRecipe,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 02e400d590bed4..5e0344a14df5da 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -345,6 +345,44 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
Value *Zero = ConstantInt::get(ScalarTC->getType(), 0);
return Builder.CreateSelect(Cmp, Sub, Zero);
}
+ case VPInstruction::ExplicitVectorLength: {
+ // Compute EVL
+ auto GetSetVL = [=](VPTransformState &State, Value *EVL) {
+ assert(EVL->getType()->isIntegerTy() &&
+ "Requested vector length should be an integer.");
+
+ // TODO: Add support for MaxSafeDist for correct loop emission.
+ Value *VFArg = State.Builder.getInt32(State.VF.getKnownMinValue());
+
+ Value *GVL = State.Builder.CreateIntrinsic(
+ State.Builder.getInt32Ty(), Intrinsic::experimental_get_vector_length,
+ {EVL, VFArg, State.Builder.getTrue()});
+ return GVL;
+ };
+ // TODO: Restructur...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter. |
9f0b36c
to
3071632
Compare
Ping! |
Thanks for moving to Github now that Phabricator has been taken down! I think @ayalz added some comments shortly before Phabricator was deactivated; unfortunately https://reviews.llvm.org/D99750 isn't accessible at the moment it seems (and it also doesn't seem to be available at http://108.170.204.19/D99750 which is supposed to have a static mirror). I am not sure what's the best way to pick up the recent comments here, perhaps it would be best to share the latest responses here on GH now? |
I addressed most of @ayalz's comments in this version |
Ok thanks! It would be helpful to import the recent conversations here, including what has been addressed (and how) in the current iteration and whether anything is still left open. Unfortunately it looks like for some reason D99750 isn't included in the static archive of reviews.llvm.org; I posted https://discourse.llvm.org/t/some-reviews-on-reviews-llvm-org-seem-to-be-missing-from-the-static-archive/76001 to hopefully get back access to the context in Phabricator. |
3071632
to
8bb19c6
Compare
Rebase |
8bb19c6
to
b960fdd
Compare
assert(EVL->getType()->getScalarSizeInBits() <= | ||
Phi->getType()->getScalarSizeInBits() && | ||
"EVL type must be smaller than Phi type."); | ||
EVL = Builder.CreateIntCast(EVL, Phi->getType(), /*isSigned=*/false); |
Would it be possible to use the same type for all users without needing to cast here? Without the cast, would a simple Add VPInstruction suffice (as in a5891fa)?
I tried but it does not work unfortunately. It would be good to have Cast VPRecipe to implement this without adding new Instruction.
The type of the EVL (and many of their users) is i32 (because of https://llvm.org/docs/LangRef.html#llvm-experimental-get-vector-length-intrinsic) and the cast is required
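A minimal sketch of the width mismatch discussed above, written as plain C++ rather than against the LLVM API. It assumes the simplest behaviour for the EVL computation, `min(remaining, VF)`, returned as a 32-bit value, and a 64-bit induction variable; `computeEVL` and `evlBasedIVValues` are hypothetical names used only for illustration. The explicit widening step plays the role of the unsigned `CreateIntCast` in the snippet under review.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the EVL computation: the number of lanes to
// process in this iteration, always produced as a 32-bit value.
// Assumption: the result is simply min(Remaining, VF).
static uint32_t computeEVL(uint64_t Remaining, uint32_t VF) {
  return static_cast<uint32_t>(std::min<uint64_t>(Remaining, VF));
}

// Walks the trip count with an EVL-based induction variable and records the
// IV value at the start of every vector iteration.
static std::vector<uint64_t> evlBasedIVValues(uint64_t TripCount, uint32_t VF) {
  assert(VF > 0 && "VF must be nonzero, otherwise the IV never advances");
  std::vector<uint64_t> IVs;
  uint64_t IV = 0; // 64-bit IV, wider than the 32-bit EVL
  while (IV < TripCount) {
    IVs.push_back(IV);
    uint32_t EVL = computeEVL(TripCount - IV, VF);
    // Zero-extend the 32-bit EVL to the IV type before the add; this mirrors
    // an unsigned integer cast from the EVL type to the phi type.
    uint64_t EVLWide = static_cast<uint64_t>(EVL);
    IV += EVLWide;
  }
  return IVs;
}
```

With a trip count of 10 and VF of 4, the IV takes the values 0, 4 and 8, and the last iteration processes only two elements.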
Ah yes, I remember now again! There's now a recipe for vector casts, but not yet for scalar casts. Let me check if there are other places that would benefit from such a recipe.
Looks like a general recipe for scalar casts would also be helpful in other cases (e.g. truncate of induction steps), shared a patch for discussion: #78113
#78113 landed so it should be possible now to use Add for the increment. Does that work now?
Still does not work:
opt: lib/Transforms/Vectorize/VPlan.cpp:290: llvm::Value *llvm::VPTransformState::get(llvm::VPValue *, unsigned int): Assertion `(isa(Def->getDefiningRecipe()) || isa(Def->getDefiningRecipe()) || isa(Def->getDefiningRecipe())) && "unexpected recipe found to be invariant"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-interleave.ll
d847073
to
8665929
Compare
The update regarding AVL/EVL. I missed one point here, when we discussed it before.
So, these two subjects are separate. For this reason the corresponding parameter in the LLVM IR Reference Manual (https://llvm.org/docs/LangRef.html) for VP-based intrinsics is named as |
8665929
to
a6c1689
Compare
Rebase + merged the check lines in the tests |
The review on Phabricator is now available on the static archive: https://reviews.llvm.org/D99750
Went through @ayalz's latest comments and shared the ones that still seem pending/open. There was quite a lot to go over, so I might have missed some comments.
Also added some more comments inline. In terms of further refactoring for this patch, would be good to remove the dedicated EVLIncrement opcode now that #78113 landed, if possible.
The other larger pending suggestion is related to EVL handling in the recipes; I suggest to add a TODO and address this as follow-up, unless @ayalz prefers doing the refactoring first. I am planning to split up/refactor memory recipe soon now that address computation is already moved out.
llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics.ll
Instruction::Add still does not work, crashes the compiler because this VPInstruction returns that it "does not use only first lane". |
a6c1689
to
9a7809b
Compare
Instruction::Add still does not work, crashes the compiler because this VPInstruction returns that it "does not use only first lane".
Can you push that commit somewhere so I can have a look? Looks like only-first-lane-used analysis might need some additional info.
llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics.ll
Just replace VPInstruction::ExplicitVectorLengthIVIncrement with Instruction::Add in lib/Transforms/Vectorize/VPlanTransforms.cpp, line 1273 |
Currently it won't work for PPC, since it has some specific checks in TTI. We (or PPC developers) can enable it later. |
Thanks for approving it! |
Introduce new subclasses of VPWidenMemoryRecipe for VP (vector-predicated) loads and stores to address multiple TODOs from llvm#76172 Note that the introduction of the new recipes also improves code-gen for VP gather/scatters by removing the redundant header mask. With the new approach, it is not sufficient to look at users of the widened canonical IV to find all uses of the header mask. In some cases, a widened IV is used instead of separately widening the canonical IV. To handle those cases, iterate over all recipes in the vector loop region to make sure all widened memory recipes are processed. Depends on llvm#87411.
@alexey-bataev I just noticed that this patch is merged, good work! I noticed that because of the move from phabricator to github, some of the history of this patch is now lost - I can't find a way to access the initial commits and discussions on the patch from 2021. Vineet Kumar |
Hi, sure! Sorry, forgot about that :( |
Added co-authors to the commit |
Thank you! For those interested in the older discussion on this patch, it is recorded on the internet archive at https://web.archive.org/web/20230128111909/https://reviews.llvm.org/D99750. |
Hi Vineet,
What you can do is create a PR (on top of the correct baseline) with this change and close the PR. That way it can be there on GitHub as well. It might take a bit, but I feel this would help others keep historical context. |
Introduce new subclasses of VPWidenMemoryRecipe for VP (vector-predicated) loads and stores to address multiple TODOs from #76172 Note that the introduction of the new recipes also improves code-gen for VP gather/scatters by removing the redundant header mask. With the new approach, it is not sufficient to look at users of the widened canonical IV to find all uses of the header mask. In some cases, a widened IV is used instead of separately widening the canonical IV. To handle that, first collect all VPValues representing header masks (by looking at users of both the canonical IV and widened inductions that are canonical) and then checking all users (recursively) of those header masks. Depends on #87411. PR: #87816
Posted on https://lists.riscv.org/g/sig-toolchains/message/678 notifying interested parties. |
Are there any plans on adding upstream runtime testing for EVL vectorization to guard against regressions? We really should have upstream end-to-end testing that enables the EVL vectorization path and does a stage2 build + llvm-test-suite to catch regressions (similar to how SVE-enabled bots were added when scalable vector support was added, IIRC). cc'ing some additional people who might also be able to help: @appujee @Mel-Chen @nikolaypanchenko @arcbbb @preames |
We don't have them yet, but certainly stability testing should be done as early as possible. We will work on a plan for it! |
+1. Let me know if i can help with anything here. |
I'm currently working through a project to spin up a range of RISC-V builders. Part of that will involve deciding what configurations to test given the resources we have. It sounds like this could be an interesting config to add to the list. To clarify, is the suggestion basically a build with |
Hi Alex, yes, at least one builder should enable this option. I think we need to test both configs for now, with and without this option. |
Do we also need |
Selects the tail-folding style while choosing the max vector factor and storing it in the data member rather than calculating it each time upon getTailFoldingStyle call. Part of llvm#76172 Reviewers: ayalz, fhahn Reviewed By: fhahn Pull Request: llvm#81885
…m#90184) Summary: Following from llvm#87816, add VPReductionEVLRecipe to describe vector predication reduction. Address one of TODOs from llvm#76172. Test Plan: Reviewers: Subscribers: Tasks: Tags: Differential Revision: https://phabricator.intern.facebook.com/D59822470
) Summary: Following from #87816, add VPReductionEVLRecipe to describe vector predication reduction. Address one of TODOs from #76172. Test Plan: Reviewers: Subscribers: Tasks: Tags: Differential Revision: https://phabricator.intern.facebook.com/D60251485
This patch introduces generating VP intrinsics in the Loop Vectorizer.
Currently the Loop Vectorizer supports vector predication in a very limited capacity via tail-folding and masked load/store/gather/scatter intrinsics. However, this does not let architectures with active vector length predication support take advantage of their capabilities. Architectures with general masked predication support also can only take advantage of predication on memory operations. By having a way for the Loop Vectorizer to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.
Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV (it just adds a new tail-folding mode using EVL), but instead of generating masked intrinsics for memory operations it generates VP intrinsics for load/store instructions. The patch adds a new VPlan transform to replace the wide header predicate compare with EVL and updates codegen for loads/stores to use VP store/load with EVL.
Another important part of this approach is how the Explicit Vector Length is computed (VP intrinsics define this vector length parameter as the Explicit Vector Length, EVL). We use an experimental intrinsic, get_vector_length, that can be lowered to architecture-specific instruction(s) to compute EVL. The patch also adds a new recipe to emit instructions for computing EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.
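The loop shape described above can be sketched in plain C++ as a scalar model. This is not the vectorizer's actual output: `getVectorLength`, `vpLoad`, `vpStore` and `addWithEVL` are hypothetical helpers, the vector factor is fixed at 4 for illustration, and the EVL computation is assumed to return `min(remaining, VF)`. Inactive lanes of the modeled load are zero-filled here purely so the model is well defined.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t VF = 4; // illustrative fixed vector factor
using Vec = std::array<int, VF>;

// Hypothetical model of the EVL computation (assumed: min(Remaining, VF)).
static uint32_t getVectorLength(uint64_t Remaining) {
  return static_cast<uint32_t>(std::min<uint64_t>(Remaining, VF));
}

// Model of a VP load: only the first EVL lanes touch memory.
static Vec vpLoad(const int *Ptr, uint32_t EVL) {
  Vec V{}; // inactive lanes are zero-filled in this model
  for (uint32_t L = 0; L < EVL; ++L)
    V[L] = Ptr[L];
  return V;
}

// Model of a VP store: only the first EVL lanes are written.
static void vpStore(const Vec &V, int *Ptr, uint32_t EVL) {
  for (uint32_t L = 0; L < EVL; ++L)
    Ptr[L] = V[L];
}

// C[i] = A[i] + B[i] with the tail folded into the main loop via EVL.
static std::vector<int> addWithEVL(const std::vector<int> &A,
                                   const std::vector<int> &B) {
  assert(A.size() == B.size() && "inputs must have the same length");
  std::vector<int> C(A.size());
  for (uint64_t IV = 0; IV < A.size();) {
    uint32_t EVL = getVectorLength(A.size() - IV);
    Vec VA = vpLoad(A.data() + IV, EVL);
    Vec VB = vpLoad(B.data() + IV, EVL);
    Vec VC{};
    // The arithmetic stays a full-width operation; only the memory
    // accesses are limited by EVL.
    for (uint32_t L = 0; L < VF; ++L)
      VC[L] = VA[L] + VB[L];
    vpStore(VC, C.data() + IV, EVL);
    IV += EVL; // the IV advances by EVL, not by VF
  }
  return C;
}
```

For six elements, the first iteration handles four lanes and the second handles the remaining two, so no scalar epilogue and no lane mask are needed in this model.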
Differential Revision: https://reviews.llvm.org/D99750