
[LV, VP] VP intrinsics support for the Loop Vectorizer + adding new tail-folding mode using EVL. #76172

Merged: 17 commits, Apr 4, 2024

Conversation

@alexey-bataev (Member) commented Dec 21, 2023

This patch introduces the generation of VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication only in a very limited capacity, via tail-folding and the masked load/store/gather/scatter intrinsics. This does not let architectures that support active vector length predication take full advantage of their capabilities, and architectures with general masked predication support can only apply predication to memory operations. By giving the Loop Vectorizer a way to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV (it just adds a new tail-folding mode using EVL); instead of generating masked intrinsics for memory operations, it generates VP intrinsics for load/store instructions. The patch adds a new VPlan transform that replaces the wide header predicate compare with EVL, and updates codegen for loads/stores to use VP load/store with EVL.

Another important part of this approach is how the Explicit Vector Length is computed (VP intrinsics name this vector length parameter the Explicit Vector Length, or EVL). We use an experimental intrinsic, get_vector_length, that can be lowered to architecture-specific instruction(s) to compute the EVL.

We also added a new recipe to emit the instructions for computing the EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.
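
To make the intended shape concrete, below is a hand-written LLVM IR sketch of an EVL tail-folded copy loop: the remaining element count feeds get_vector_length, and the returned EVL both bounds the VP memory operations and advances the induction variable. This is an illustration of the approach, not literal output of this patch; the function and value names (@copy, %tc, %avl, etc.) are made up for the example.

  declare i32 @llvm.experimental.get.vector.length.i64(i64, i32 immarg, i1 immarg)
  declare <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr, <vscale x 4 x i1>, i32)
  declare void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32>, ptr, <vscale x 4 x i1>, i32)

  define void @copy(ptr %dst, ptr %src, i64 %tc) {
  entry:
    ; All-true mask: with EVL carrying the predication, the mask can be a plain splat.
    %head = insertelement <vscale x 4 x i1> poison, i1 true, i64 0
    %alltrue = shufflevector <vscale x 4 x i1> %head, <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer
    br label %vector.body

  vector.body:
    %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
    %avl = sub i64 %tc, %index                     ; elements remaining
    %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 4, i1 true)
    %gep.src = getelementptr inbounds i32, ptr %src, i64 %index
    %v = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %gep.src, <vscale x 4 x i1> %alltrue, i32 %evl)
    %gep.dst = getelementptr inbounds i32, ptr %dst, i64 %index
    call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> %v, ptr %gep.dst, <vscale x 4 x i1> %alltrue, i32 %evl)
    %evl.zext = zext i32 %evl to i64
    %index.next = add i64 %index, %evl.zext        ; IV advances by EVL, not by VF
    %done = icmp uge i64 %index.next, %tc
    br i1 %done, label %exit, label %vector.body

  exit:
    ret void
  }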

Differential Revision: https://reviews.llvm.org/D99750

@llvmbot (Member) commented Dec 21, 2023

@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-backend-risc-v
@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-llvm-transforms

Author: Alexey Bataev (alexey-bataev)
Co-Authored-By: Vineet Kumar (vntkmr)
Co-Authored-By: Roger Ferrer Ibáñez (rofirrim)
Co-Authored-By: Simon Moll (simoll)

Changes

This patch introduces the generation of VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication only in a very limited capacity, via tail-folding and the masked load/store/gather/scatter intrinsics. This does not let architectures that support active vector length predication take full advantage of their capabilities, and architectures with general masked predication support can only apply predication to memory operations. By giving the Loop Vectorizer a way to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV; instead of generating masked intrinsics for memory operations, it generates VP intrinsics for load/store instructions.

Another important part of this approach is how the Explicit Vector Length is computed. (We use "active vector length" and "explicit vector length" interchangeably; VP intrinsics name this vector length parameter the Explicit Vector Length, or EVL.) We consider the following three ways to compute the EVL parameter for the VP intrinsics.

  • The simplest way is to use the VF as the EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as the one computed for the current tail-folding implementation.
  • The second way is to insert instructions that compute min(VF, trip_count - index) on each vector iteration, as sketched right after this list.
  • For architectures like RISC-V, which have a special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic, get_vector_length, that can be lowered to architecture-specific instruction(s) to compute the EVL.
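
For the second option, the EVL computation can be emitted with ordinary instructions (a minimal sketch, assuming VF = vscale x 4, an i64 trip count, and illustrative value names):

  %vf = call i64 @llvm.vscale.i64()
  %vf.x4 = shl i64 %vf, 2                           ; VF = vscale x 4 elements
  %remaining = sub i64 %trip.count, %index
  %evl.i64 = call i64 @llvm.umin.i64(i64 %vf.x4, i64 %remaining)
  %evl = trunc i64 %evl.i64 to i32                  ; the %evl operand of the VP intrinsics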

We also added a new recipe to emit the instructions for computing the EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

===Tentative Development Roadmap===

  • Use VP intrinsics for all possible vector operations. That work has two possible implementations:
    1. Introduce a new pass that transforms the emitted vector instructions into VP intrinsics if the loop was transformed to use predication for loads/stores. The advantage of this approach is that it does not require many changes in the loop vectorizer itself. The disadvantage is that it may require copying some existing functionality from the loop vectorizer into a separate pass, keeping similar code in different passes, and performing the same analysis at least twice.
    2. Extend the Loop Vectorizer using VectorBuilder and make it emit VP intrinsics automatically in the presence of an EVL value. The advantage is that it does not require a separate pass, so it may reduce compile time, and it avoids code duplication. It requires some extra work in the LoopVectorizer to add VectorBuilder support and smart emission of vector instructions/VP intrinsics. Also, to fully support the Loop Vectorizer, it will require adding a new PHI recipe to handle the EVL of the previous iteration, plus extending several existing recipes with new operands (depending on the design).
  • Switch to VP intrinsics for memory operations for both VLS and VLA vectorization.

Differential Revision: https://reviews.llvm.org/D99750


Patch is 101.79 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/76172.diff

24 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+4-1)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+4)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+16)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+151-8)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+43)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+8-8)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+66)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+98-13)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+7)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (+51)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll (+65-1)
  • (added) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-vp-intrinsics.ll (+142)
  • (added) llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics.ll (+125)
  • (added) llvm/test/Transforms/LoopVectorize/X86/vectorize-vp-intrinsics.ll (+127)
  • (added) llvm/test/Transforms/LoopVectorize/X86/vplan-vp-intrinsics.ll (+83)
  • (added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-gather-scatter.ll (+64)
  • (added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-interleave.ll (+169)
  • (added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-iv32.ll (+84)
  • (added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-masked-loadstore.ll (+81)
  • (added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-no-masking.ll (+46)
  • (added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics-reverse-load-store.ll (+64)
  • (added) llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics.ll (+97)
  • (added) llvm/test/Transforms/LoopVectorize/vplan-vp-intrinsics.ll (+36)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 735be3680aea0d..e2a127ff35be26 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -190,7 +190,10 @@ enum class TailFoldingStyle {
   /// Use predicate to control both data and control flow, but modify
   /// the trip count so that a runtime overflow check can be avoided
   /// and such that the scalar epilogue loop can always be removed.
-  DataAndControlFlowWithoutRuntimeCheck
+  DataAndControlFlowWithoutRuntimeCheck,
+  /// Use predicated EVL instructions for tail-folding.
+  /// Indicates that VP intrinsics should be used if tail-folding is enabled.
+  DataWithEVL,
 };
 
 struct TailFoldingInfo {
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 4614446b2150b7..1a9abaea811159 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -169,6 +169,10 @@ RISCVTTIImpl::getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,
   return TTI::TCC_Free;
 }
 
+bool RISCVTTIImpl::hasActiveVectorLength(unsigned, Type *DataTy, Align) const {
+  return ST->hasVInstructions();
+}
+
 TargetTransformInfo::PopcntSupportKind
 RISCVTTIImpl::getPopcntSupport(unsigned TyWidth) {
   assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index 96ecc771863e56..d2592be75000de 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -72,6 +72,22 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                                       const APInt &Imm, Type *Ty,
                                       TTI::TargetCostKind CostKind);
 
+  /// \name Vector Predication Information
+  /// Whether the target supports the %evl parameter of VP intrinsic efficiently
+  /// in hardware, for the given opcode and type/alignment. (see LLVM Language
+  /// Reference - "Vector Predication Intrinsics",
+  /// https://llvm.org/docs/LangRef.html#vector-predication-intrinsics and
+  /// "IR-level VP intrinsics",
+  /// https://llvm.org/docs/Proposals/VectorPredication.html#ir-level-vp-intrinsics).
+  /// \param Opcode the opcode of the instruction checked for predicated version
+  /// support.
+  /// \param DataType the type of the instruction with the \p Opcode checked for
+  /// prediction support.
+  /// \param Alignment the alignment for memory access operation checked for
+  /// predicated version support.
+  bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
+                             Align Alignment) const;
+
   TargetTransformInfo::PopcntSupportKind getPopcntSupport(unsigned TyWidth);
 
   bool shouldExpandReduction(const IntrinsicInst *II) const;
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index f82e161fb846d1..7b0e268877ded3 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -123,6 +123,7 @@
 #include "llvm/IR/User.h"
 #include "llvm/IR/Value.h"
 #include "llvm/IR/ValueHandle.h"
+#include "llvm/IR/VectorBuilder.h"
 #include "llvm/IR/Verifier.h"
 #include "llvm/Support/Casting.h"
 #include "llvm/Support/CommandLine.h"
@@ -247,10 +248,12 @@ static cl::opt<TailFoldingStyle> ForceTailFoldingStyle(
         clEnumValN(TailFoldingStyle::DataAndControlFlow, "data-and-control",
                    "Create lane mask using active.lane.mask intrinsic, and use "
                    "it for both data and control flow"),
-        clEnumValN(
-            TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
-            "data-and-control-without-rt-check",
-            "Similar to data-and-control, but remove the runtime check")));
+        clEnumValN(TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
+                   "data-and-control-without-rt-check",
+                   "Similar to data-and-control, but remove the runtime check"),
+        clEnumValN(TailFoldingStyle::DataWithEVL, "data-with-evl",
+                   "Use predicated EVL instructions for tail folding if the "
+                   "target supports vector length predication")));
 
 static cl::opt<bool> MaximizeBandwidth(
     "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
@@ -1106,8 +1109,7 @@ void InnerLoopVectorizer::collectPoisonGeneratingRecipes(
       if (isa<VPWidenMemoryInstructionRecipe>(CurRec) ||
           isa<VPInterleaveRecipe>(CurRec) ||
           isa<VPScalarIVStepsRecipe>(CurRec) ||
-          isa<VPCanonicalIVPHIRecipe>(CurRec) ||
-          isa<VPActiveLaneMaskPHIRecipe>(CurRec))
+          isa<VPHeaderPHIRecipe>(CurRec))
         continue;
 
       // This recipe contributes to the address computation of a widen
@@ -1655,6 +1657,23 @@ class LoopVectorizationCostModel {
     return foldTailByMasking() || Legal->blockNeedsPredication(BB);
   }
 
+  /// Returns true if VP intrinsics with explicit vector length support should
+  /// be generated in the tail folded loop.
+  bool useVPIWithVPEVLVectorization() const {
+    return PreferEVL && !EnableVPlanNativePath &&
+           getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+           // FIXME: implement support for max safe dependency distance.
+           Legal->isSafeForAnyVectorWidth() &&
+           // FIXME: remove this once reductions are supported.
+           Legal->getReductionVars().empty() &&
+           // FIXME: remove this once vp_reverse is supported.
+           none_of(
+               WideningDecisions,
+               [](const std::pair<std::pair<Instruction *, ElementCount>,
+                                  std::pair<InstWidening, InstructionCost>>
+                      &Data) { return Data.second.first == CM_Widen_Reverse; });
+  }
+
   /// Returns true if the Phi is part of an inloop reduction.
   bool isInLoopReduction(PHINode *Phi) const {
     return InLoopReductions.contains(Phi);
@@ -1800,6 +1819,10 @@ class LoopVectorizationCostModel {
   /// All blocks of loop are to be masked to fold tail of scalar iterations.
   bool CanFoldTailByMasking = false;
 
+  /// Control whether to generate VP intrinsics with explicit-vector-length
+  /// support in vectorized code.
+  bool PreferEVL = false;
+
   /// A map holding scalar costs for different vectorization factors. The
   /// presence of a cost for an instruction in the mapping indicates that the
   /// instruction will be scalarized when vectorizing with the associated
@@ -4883,6 +4906,39 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
   if (Legal->prepareToFoldTailByMasking()) {
     CanFoldTailByMasking = true;
+    if (getTailFoldingStyle() == TailFoldingStyle::None)
+      return MaxFactors;
+
+    if (UserIC > 1) {
+      LLVM_DEBUG(dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                           "not generate VP intrinsics since interleave count "
+                           "specified is greater than 1.\n");
+      return MaxFactors;
+    }
+
+    if (MaxFactors.ScalableVF.isVector()) {
+      assert(MaxFactors.ScalableVF.isScalable() &&
+             "Expected scalable vector factor.");
+      // FIXME: use actual opcode/data type for analysis here.
+      PreferEVL = getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+                  TTI.hasActiveVectorLength(0, nullptr, Align());
+#if !NDEBUG
+      if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
+        if (PreferEVL)
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "try to generate VP Intrinsics.\n";
+        else
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "not try to generate VP Intrinsics since the target "
+                    "does not support vector length predication.\n";
+      }
+#endif // !NDEBUG
+
+      // Tail folded loop using VP intrinsics restricts the VF to be scalable.
+      if (PreferEVL)
+        MaxFactors.FixedVF = ElementCount::getFixed(1);
+    }
+
     return MaxFactors;
   }
 
@@ -5493,6 +5549,10 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   if (!isScalarEpilogueAllowed())
     return 1;
 
+  // Do not interleave if EVL is preferred and no User IC is specified.
+  if (useVPIWithVPEVLVectorization())
+    return 1;
+
   // We used the distance for the interleave count.
   if (!Legal->isSafeForAnyVectorWidth())
     return 1;
@@ -8622,6 +8682,8 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
         VPlanTransforms::truncateToMinimalBitwidths(
             *Plan, CM.getMinimalBitwidths(), PSE.getSE()->getContext());
       VPlanTransforms::optimize(*Plan, *PSE.getSE());
+      if (CM.useVPIWithVPEVLVectorization())
+        VPlanTransforms::addExplicitVectorLength(*Plan);
       assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");
       VPlans.push_back(std::move(Plan));
     }
@@ -9454,6 +9516,52 @@ void VPReplicateRecipe::execute(VPTransformState &State) {
       State.ILV->scalarizeInstruction(UI, this, VPIteration(Part, Lane), State);
 }
 
+/// Creates either vp_store or vp_scatter intrinsics calls to represent
+/// predicated store/scatter.
+static Instruction *
+lowerStoreUsingVectorIntrinsics(IRBuilderBase &Builder, Value *Addr,
+                                Value *StoredVal, bool IsScatter, Value *Mask,
+                                Value *EVLPart, const Align &Alignment) {
+  CallInst *Call;
+  if (IsScatter) {
+    Call = Builder.CreateIntrinsic(Type::getVoidTy(EVLPart->getContext()),
+                                   Intrinsic::vp_scatter,
+                                   {StoredVal, Addr, Mask, EVLPart});
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Store, Type::getVoidTy(EVLPart->getContext()),
+        {StoredVal, Addr}));
+  }
+  Call->addParamAttr(
+      1, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
+/// Creates either vp_load or vp_gather intrinsics calls to represent
+/// predicated load/gather.
+static Instruction *lowerLoadUsingVectorIntrinsics(IRBuilderBase &Builder,
+                                                   VectorType *DataTy,
+                                                   Value *Addr, bool IsGather,
+                                                   Value *Mask, Value *EVLPart,
+                                                   const Align &Alignment) {
+  CallInst *Call;
+  if (IsGather) {
+    Call = Builder.CreateIntrinsic(DataTy, Intrinsic::vp_gather,
+                                   {Addr, Mask, EVLPart}, nullptr,
+                                   "wide.masked.gather");
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Load, DataTy, Addr, "vp.op.load"));
+  }
+  Call->addParamAttr(
+      0, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
 void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   VPValue *StoredValue = isStore() ? getStoredValue() : nullptr;
 
@@ -9523,6 +9631,12 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     return PartPtr;
   };
 
+  auto MaskValue = [&](unsigned Part) -> Value * {
+    if (isMaskRequired)
+      return BlockInMaskParts[Part];
+    return nullptr;
+  };
+
   // Handle Stores:
   if (SI) {
     State.setDebugLocFrom(SI->getDebugLoc());
@@ -9530,7 +9644,22 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     for (unsigned Part = 0; Part < State.UF; ++Part) {
       Instruction *NewSI = nullptr;
       Value *StoredVal = State.get(StoredValue, Part);
-      if (CreateGatherScatter) {
+      if (State.EVL) {
+        Value *EVLPart = State.get(State.EVL, Part);
+        // If EVL is not nullptr, then EVL must be a valid value set during plan
+        // creation, possibly default value = whole vector register length. EVL
+        // is created only if TTI prefers predicated vectorization, thus if EVL
+        // is not nullptr it also implies preference for predicated
+        // vectorization.
+        // FIXME: Support reverse store after vp_reverse is added.
+        NewSI = lowerStoreUsingVectorIntrinsics(
+            Builder,
+            CreateGatherScatter
+                ? State.get(getAddr(), Part)
+                : CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0))),
+            StoredVal, CreateGatherScatter, MaskValue(Part), EVLPart,
+            Alignment);
+      } else if (CreateGatherScatter) {
         Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
         Value *VectorGep = State.get(getAddr(), Part);
         NewSI = Builder.CreateMaskedScatter(StoredVal, VectorGep, Alignment,
@@ -9561,7 +9690,21 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   State.setDebugLocFrom(LI->getDebugLoc());
   for (unsigned Part = 0; Part < State.UF; ++Part) {
     Value *NewLI;
-    if (CreateGatherScatter) {
+    if (State.EVL) {
+      Value *EVLPart = State.get(State.EVL, Part);
+      // If EVL is not nullptr, then EVL must be a valid value set during plan
+      // creation, possibly default value = whole vector register length. EVL
+      // is created only if TTI prefers predicated vectorization, thus if EVL
+      // is not nullptr it also implies preference for predicated
+      // vectorization.
+      // FIXME: Support reverse loading after vp_reverse is added.
+      NewLI = lowerLoadUsingVectorIntrinsics(
+          Builder, DataTy,
+          CreateGatherScatter
+              ? State.get(getAddr(), Part)
+              : CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0))),
+          CreateGatherScatter, MaskValue(Part), EVLPart, Alignment);
+    } else if (CreateGatherScatter) {
       Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
       Value *VectorGep = State.get(getAddr(), Part);
       NewLI = Builder.CreateMaskedGather(DataTy, VectorGep, Alignment, MaskPart,
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 94cb7688981361..0ca668abbe60c7 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -242,6 +242,12 @@ struct VPTransformState {
   ElementCount VF;
   unsigned UF;
 
+  /// If EVL is not nullptr, then EVL must be a valid value set during plan
+  /// creation, possibly a default value = whole vector register length. EVL is
+  /// created only if TTI prefers predicated vectorization, thus if EVL is
+  /// not nullptr it also implies preference for predicated vectorization.
+  VPValue *EVL = nullptr;
+
   /// Hold the indices to generate specific scalar instructions. Null indicates
   /// that all instances are to be generated, using either scalar or vector
   /// instructions.
@@ -1057,6 +1063,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
     SLPLoad,
     SLPStore,
     ActiveLaneMask,
+    ExplicitVectorLength,
+    ExplicitVectorLengthIVIncrement,
     CalculateTripCountMinusVF,
     // Increment the canonical IV separately for each unrolled part.
     CanonicalIVIncrementForPart,
@@ -1165,6 +1173,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
     default:
       return false;
     case VPInstruction::ActiveLaneMask:
+    case VPInstruction::ExplicitVectorLength:
+    case VPInstruction::ExplicitVectorLengthIVIncrement:
     case VPInstruction::CalculateTripCountMinusVF:
     case VPInstruction::CanonicalIVIncrementForPart:
     case VPInstruction::BranchOnCount:
@@ -2180,6 +2190,39 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
 #endif
 };
 
+/// A recipe for generating the phi node for the current index of elements,
+/// adjusted in accordance with EVL value. It starts at StartIV value and gets
+/// incremented by EVL in each iteration of the vector loop.
+class VPEVLBasedIVPHIRecipe : public VPHeaderPHIRecipe {
+public:
+  VPEVLBasedIVPHIRecipe(VPValue *StartMask, DebugLoc DL)
+      : VPHeaderPHIRecipe(VPDef::VPEVLBasedIVPHISC, nullptr, StartMask, DL) {}
+
+  ~VPEVLBasedIVPHIRecipe() override = default;
+
+  VP_CLASSOF_IMPL(VPDef::VPEVLBasedIVPHISC)
+
+  static inline bool classof(const VPHeaderPHIRecipe *D) {
+    return D->getVPDefID() == VPDef::VPEVLBasedIVPHISC;
+  }
+
+  /// Generate phi for handling IV based on EVL over iterations correctly.
+  void execute(VPTransformState &State) override;
+
+  /// Returns true if the recipe only uses the first lane of operand \p Op.
+  bool onlyFirstLaneUsed(const VPValue *Op) const override {
+    assert(is_contained(operands(), Op) &&
+           "Op must be an operand of the recipe");
+    return true;
+  }
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,
+             VPSlotTracker &SlotTracker) const override;
+#endif
+};
+
 /// A Recipe for widening the canonical induction variable of the vector loop.
 class VPWidenCanonicalIVRecipe : public VPRecipeBase, public VPValue {
 public:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 97a8a1803bbf5a..b8ed256d236a4b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -207,14 +207,14 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
   Type *ResultTy =
       TypeSwitch<const VPRecipeBase *, Type *>(V->getDefiningRecipe())
           .Case<VPCanonicalIVPHIRecipe, VPFirstOrderRecurrencePHIRecipe,
-                VPReductionPHIRecipe, VPWidenPointerInductionRecipe>(
-              [this](const auto *R) {
-                // Handle header phi recipes, except VPWienIntOrFpInduction
-                // which needs special handling due it being possibly truncated.
-                // TODO: consider inferring/caching type of siblings, e.g.,
-                // backedge value, here and in cases below.
-                return inferScalarType(R->getStartValue());
-              })
+                VPReductionPHIRecipe, VPWidenPointerInductionRecipe,
+                VPEVLBasedIVPHIRecipe>([this](const auto *R) {
+            // Handle header phi recipes, except VPWienIntOrFpInduction
+            // which needs special handling due it being possibly truncated.
+            // TODO: consider inferring/caching type of siblings, e.g.,
+            // backedge value, here and in cases below.
+            return inferScalarType(R->getStartValue());
+          })
           .Case<VPWidenIntOrFpInductionRecipe, VPDerivedIVRecipe>(
               [](const auto *R) { return R->getScalarType(); })
           .Case<VPPredInstPHIRecipe, VPWidenPHIRecipe, VPScalarIVStepsRecipe,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 02e400d590bed4..5e0344a14df5da 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -345,6 +345,44 @@ Value *VPInstruction::generateInstruction(VPTransformState &State,
     Value *Zero = ConstantInt::get(ScalarTC->getType(), 0);
     return Builder.CreateSelect(Cmp, Sub, Zero);
   }
+  case VPInstruction::ExplicitVectorLength: {
+    // Compute EVL
+    auto GetSetVL = [=](VPTransformState &State, Value *EVL) {
+      assert(EVL->getType()->isIntegerTy() &&
+             "Requested vector length should be an integer.");
+
+      // TODO: Add support for MaxSafeDist for correct loop emission.
+      Value *VFArg = State.Builder.getInt32(State.VF.getKnownMinValue());
+
+      Value *GVL = State.Builder.CreateIntrinsic(
+          State.Builder.getInt32Ty(), Intrinsic::experimental_get_vector_length,
+          {EVL, VFArg, State.Builder.getTrue()});
+      return GVL;
+    };
+    // TODO: Restructur...
[truncated]


github-actions bot commented Dec 21, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@alexey-bataev (Member, Author) commented:

Ping!

@fhahn (Contributor) commented Dec 29, 2023

Thanks for moving to Github now that Phabricator has been taken down!

I think @ayalz added some comments shortly before Phabricator was deactivated; unfortunately, https://reviews.llvm.org/D99750 isn't accessible at the moment, it seems (and it also doesn't seem to be available at http://108.170.204.19/D99750, which is supposed to host a static mirror). I am not sure what the best way is to pick up the recent comments; perhaps it would be best to share the latest responses here on GH now?

@alexey-bataev (Member, Author) commented:

I addressed most of @ayalz's comments in this version.

@fhahn (Contributor) commented Jan 2, 2024

> I addressed most of @ayalz's comments in this version.

Ok thanks!

It would be helpful to import the recent conversations here, including what has been addressed (and how) in the current iteration, and whether anything is still left open. Unfortunately, it looks like D99750 isn't included in the static archive of reviews.llvm.org for some reason; I posted https://discourse.llvm.org/t/some-reviews-on-reviews-llvm-org-seem-to-be-missing-from-the-static-archive/76001 to hopefully get back access to the context from Phabricator.

@alexey-bataev (Member, Author) commented:

Rebase

assert(EVL->getType()->getScalarSizeInBits() <=
Phi->getType()->getScalarSizeInBits() &&
"EVL type must be smaller than Phi type.");
EVL = Builder.CreateIntCast(EVL, Phi->getType(), /*isSigned=*/false);
Contributor:

Would it be possible to use the same type for all users without needing to cast here? Without the cast, would a simple Add VPInstruction suffice (as in a5891fa)?

Member Author:

I tried, but unfortunately it does not work. It would be good to have a Cast VPRecipe to implement this without adding a new Instruction.
The type of the EVL (and of many of its users) is i32 (because of https://llvm.org/docs/LangRef.html#llvm-experimental-get-vector-length-intrinsic), so the cast is required.
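For illustration, a minimal IR sketch of the increment under discussion, assuming an i64 canonical IV; %avl, %evl, and %iv are hypothetical names. The intrinsic returns i32, so the EVL has to be widened before it can feed the IV increment:

  %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 4, i1 true)
  %evl.zext = zext i32 %evl to i64   ; the cast under discussion
  %iv.next = add i64 %iv, %evl.zext  ; increment the IV by EVL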

Contributor

Ah yes, now I remember! There's now a recipe for vector casts, but not yet for scalar casts. Let me check if there are other places that would benefit from such a recipe.

Contributor

Looks like a general recipe for scalar casts would also be helpful in other cases (e.g. truncating induction steps); shared a patch for discussion: #78113

Contributor

#78113 landed so it should be possible now to use Add for the increment. Does that work now?

Member Author

Still does not work:

  opt: lib/Transforms/Vectorize/VPlan.cpp:290: llvm::Value *llvm::VPTransformState::get(llvm::VPValue *, unsigned int): Assertion `(isa<...>(Def->getDefiningRecipe()) || isa<...>(Def->getDefiningRecipe()) || isa<...>(Def->getDefiningRecipe())) && "unexpected recipe found to be invariant"' failed.
  PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.

@alexey-bataev
Member Author

An update regarding AVL/EVL; I missed one point here when we discussed it before.

  1. AVL can be referred to as the input parameter of the llvm.experimental.get.vector.length intrinsic.
  2. EVL is the result returned by this intrinsic.

So these two are separate concepts. For this reason, the corresponding parameter for VP-based intrinsics in the LLVM IR reference manual (https://llvm.org/docs/LangRef.html) is named %evl, and we use it here as EVL.
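For illustration, a minimal sketch of that distinction in IR; %n, %iv, %avl, and %evl are hypothetical names:

  %avl = sub i64 %n, %iv   ; AVL: elements still to be processed (intrinsic input)
  %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 4, i1 true)
                           ; EVL: lanes the next iteration will actually handle (result)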

@alexey-bataev
Member Author

Rebase + merged the check lines in the tests

Contributor

@fhahn fhahn left a comment

The review on Phabricator is now available on the static archive: https://reviews.llvm.org/D99750

Went through @ayalz's latest comments and shared the ones that still seem pending/open. There was quite a lot to go over, so I might have missed some comments.

Also added some more comments inline. In terms of further refactoring for this patch, it would be good to remove the dedicated EVLIncrement opcode now that #78113 has landed, if possible.

The other larger pending suggestion is related to EVL handling in the recipes; I suggest adding a TODO and addressing this as a follow-up, unless @ayalz prefers doing the refactoring first. I am planning to split up/refactor the memory recipe soon, now that address computation has already been moved out.

assert(EVL->getType()->getScalarSizeInBits() <=
Phi->getType()->getScalarSizeInBits() &&
"EVL type must be smaller than Phi type.");
EVL = Builder.CreateIntCast(EVL, Phi->getType(), /*isSigned=*/false);
Contributor

#78113 landed so it should be possible now to use Add for the increment. Does that work now?

@alexey-bataev
Member Author

The review on Phabricator is now available on the static archive: https://reviews.llvm.org/D99750 […]

Instruction::Add still does not work; it crashes the compiler because this VPInstruction reports that it "does not use only the first lane".

Contributor

@fhahn fhahn left a comment

Instruction::Add still does not work; it crashes the compiler because this VPInstruction reports that it "does not use only the first lane".

Can you push that commit somewhere so I can have a look? Looks like the only-first-lane-used analysis might need some additional info.

@alexey-bataev
Member Author

Can you push that commit somewhere so I can have a look? […]

Just replace VPInstruction::ExplicitVectorLengthIVIncrement with Instruction::Add in lib/Transforms/Vectorize/VPlanTransforms.cpp, line 1273

@alexey-bataev
Member Author

Oh right, surprised it is already used by PPC. EVL LV won't work on PPC due to them not using scalable vectors? At least I cannot find a test that uses vscale. But would it make sense to support EVL on PPC, if it supports active-vector-length?

Currently it won't work for PPC, since it has some specific checks in TTI. We (or PPC developers) can enable it later.

@appujee
Contributor

appujee commented Apr 4, 2024

Thanks for approving it!

@alexey-bataev alexey-bataev merged commit 413a66f into llvm:main Apr 4, 2024
4 checks passed
@alexey-bataev alexey-bataev deleted the arcpatch-D99750 branch April 4, 2024 22:30
fhahn added a commit to fhahn/llvm-project that referenced this pull request Apr 5, 2024
Introduce new subclasses of VPWidenMemoryRecipe for VP
(vector-predicated) loads and stores to address multiple TODOs from
llvm#76172

Note that the introduction of the new recipes also improves code-gen for
VP gather/scatters by removing the redundant header mask. With the new
approach, it is not sufficient to look at users of the widened canonical
IV to find all uses of the header mask.

In some cases, a widened IV is used instead of separately widening the
canonical IV. To handle those cases, iterate over all recipes in the
vector loop region to make sure all widened memory recipes are
processed.

Depends on llvm#87411.
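For illustration, a minimal sketch of the VP memory intrinsics such recipes emit, assuming i32 elements, an all-true mask, and hypothetical names (%src, %dst, %evl, %alltrue); with EVL-based tail folding, the EVL rather than a header mask limits the active lanes:

  ; %alltrue is an all-true <vscale x 4 x i1> mask; the EVL bounds the lanes touched.
  %v = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %src, <vscale x 4 x i1> %alltrue, i32 %evl)
  call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> %v, ptr %dst, <vscale x 4 x i1> %alltrue, i32 %evl)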
@vntkmr

vntkmr commented Apr 8, 2024

@alexey-bataev I just noticed that this patch is merged, good work!
As you might remember, I was the original author of this patch back in 2021 while at BSC, before you took over.

I noticed that because of the move from Phabricator to GitHub, some of the history of this patch is now lost; I can't find a way to access the initial commits and discussions on the patch from 2021.
I would appreciate it if you could add an acknowledgement for the initial work on this patch.

CC: @rofirrim @simoll

Vineet Kumar

@alexey-bataev
Member Author

@alexey-bataev I just noticed that this patch is merged, good work! […]

Hi, sure! Sorry, forgot about that :(

@alexey-bataev
Member Author

@alexey-bataev I just noticed that this patch is merged, good work! […]

Added co-authors to the commit

@vntkmr

vntkmr commented Apr 8, 2024

@alexey-bataev I just noticed that this patch is merged, good work! […]

Hi, sure! Sorry, forgot about that :(

Thank you!

For those interested in the older discussion on this patch, it is recorded on the Internet Archive at https://web.archive.org/web/20230128111909/https://reviews.llvm.org/D99750.
The diff for the initial patch is available in the Phabricator archive: https://reviews.llvm.org/D99750?vs=334753&id=353243.
(The "Show Older Changes" link on the archived Phabricator page does not work anymore, unfortunately.)

@appujee
Contributor

appujee commented Apr 9, 2024

@alexey-bataev I just noticed that this patch is merged, good work! […]

Hi Vineet,
Thanks for your work!

For those interested in the initial version and older discussion on this patch, it is recorded on the Internet Archive […]

What you can do is create a PR (on top of the correct baseline) with this change and then close it. That way it will be on GitHub as well. It might take a bit, but I feel this would help others keep the historical context.

fhahn added a commit to fhahn/llvm-project that referenced this pull request Apr 17, 2024
Introduce new subclasses of VPWidenMemoryRecipe for VP (vector-predicated) loads and stores to address multiple TODOs from llvm#76172. […]
fhahn added a commit that referenced this pull request Apr 19, 2024
Introduce new subclasses of VPWidenMemoryRecipe for VP (vector-predicated) loads and stores to address multiple TODOs from #76172. […]

In some cases, a widened IV is used instead of separately widening the canonical IV. To handle that, first collect all VPValues representing header masks (by looking at users of both the canonical IV and widened inductions that are canonical), then check all users (recursively) of those header masks.

Depends on #87411.

PR: #87816
aniplcc pushed a commit to aniplcc/llvm-project that referenced this pull request Apr 21, 2024
Introduce new subclasses of VPWidenMemoryRecipe for VP (vector-predicated) loads and stores to address multiple TODOs from llvm#76172. […]

PR: llvm#87816
@appujee
Contributor

appujee commented Apr 22, 2024

Posted on https://lists.riscv.org/g/sig-toolchains/message/678, notifying interested parties.

@fhahn
Contributor

fhahn commented Apr 26, 2024

Are there any plans on adding upstream runtime testing for EVL vectorization to guard against regressions?

We really should have upstream end-to-end testing that enables the EVL vectorization path and does a stage2 build + llvm-test-suite run to catch regressions (similar to how SVE-enabled bots were added when scalable vector support was added, IIRC)

cc'ing some additional people who might also be able to help @appujee @Mel-Chen @nikolaypanchenko @arcbbb @preames

@nikolaypanchenko
Contributor

Are there any plans on adding upstream runtime testing for EVL vectorization to guard against regressions? […]

We don't have them yet, but stability testing should certainly be done as early as possible. We will work on a plan for it!

@appujee
Contributor

appujee commented Apr 26, 2024

Are there any plans on adding upstream runtime testing for EVL vectorization to guard against regressions? […]

+1. Let me know if I can help with anything here.

@asb
Contributor

asb commented Apr 27, 2024

Are there any plans on adding upstream runtime testing for EVL vectorization to guard against regressions? […]

I'm currently working through a project to spin up a range of RISC-V builders. Part of that will involve deciding what configurations to test given the resources we have. It sounds like this could be an interesting config to add to the list. To clarify, is the suggestion basically a build with -mllvm -force-tail-folding-style=data-with-evl?

@alexey-bataev
Member Author

Hi Alex, yes, at least one builder should enable this option. I think we need to test both configs for now, with and without this option.

@topperc
Collaborator

topperc commented May 8, 2024

data-with-evl

Do we also need -prefer-predicate-over-epilogue=predicate-dont-vectorize?
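(For illustration, based on how the in-tree lit tests exercise this path rather than on any agreed bot configuration: the EVL path is typically driven with both flags together, e.g. opt -passes=loop-vectorize -force-tail-folding-style=data-with-evl -prefer-predicate-over-epilogue=predicate-dont-vectorize.)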

mylai-mtk pushed a commit to mylai-mtk/llvm-project that referenced this pull request Jul 12, 2024
Select the tail-folding style while choosing the max vector factor and store it in a data member, rather than recalculating it on each getTailFoldingStyle call.

Part of llvm#76172

Reviewers: ayalz, fhahn

Reviewed By: fhahn

Pull Request: llvm#81885
Mel-Chen added a commit that referenced this pull request Jul 16, 2024

Following from #87816, add VPReductionEVLRecipe to describe vector
predication reduction.

Address one of the TODOs from #76172.
sayhaan pushed a commit to sayhaan/llvm-project that referenced this pull request Jul 16, 2024
…m#90184)

Summary:
Following from llvm#87816, add VPReductionEVLRecipe to describe vector
predication reduction.

Address one of the TODOs from llvm#76172.

yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024

Summary:
Following from #87816, add VPReductionEVLRecipe to describe vector
predication reduction.

Address one of the TODOs from #76172.
