[IR] Remove pointer arguments from loop.dependence.{war|raw}.mask#188248
Conversation
Passing pointer arguments is quite inconvenient for practical use. These intrinsics only care about the difference between the address bits of the two pointers, but by taking pointer arguments they imply they care about other details of the pointers (such as the addrspace, or other pointer bits). Taking two values rather than the difference also prevents CSE opportunities where, for some set of pointers, the difference can be determined to be the same even though the pointers are not identical. This patch reworks the arguments to: @loop.dependence.mask(i64 %pointerDiff, i64 immarg %elementSize) This is more convenient to emit and allows for more CSE for free.
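The reworked WAR-mask semantics can be sketched as a small executable model (a sketch following the updated LangRef wording, not actual LLVM code; `war_mask` is a hypothetical helper name):

```python
def war_mask(pointer_diff: int, element_size: int, vf: int) -> list[bool]:
    """Model of @llvm.loop.dependence.war.mask with the new arguments.

    pointer_diff = storePointer - loadPointer (signed).
    A lane is active if pointer_diff <= 0 (all loads happen before any
    store), or if element_size * lane < pointer_diff (this lane is loaded
    before the store to the same address).
    """
    return [pointer_diff <= 0 or element_size * lane < pointer_diff
            for lane in range(vf)]

# Diff of two elements: only the first two lanes are safe.
print(war_mask(8, 4, 4))    # [True, True, False, False]
# Load entirely before store (diff <= 0): all lanes safe.
print(war_mask(-16, 4, 4))  # [True, True, True, True]
```

This mirrors cases 1 to 3 in the LangRef examples below: `pointerDiff <= 0` gives an all-true mask, and `pointerDiff = k * elementSize` activates exactly the first `k` lanes (up to VF).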
@llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-backend-aarch64 Author: Benjamin Maxwell (MacDue). Patch is 86.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/188248.diff 12 Files Affected:
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 13883883d3981..d7188ca817d53 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -25020,17 +25020,20 @@ This is an overloaded intrinsic.
::
- declare <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <8 x i1> @llvm.loop.dependence.war.mask.v8i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <16 x i1> @llvm.loop.dependence.war.mask.v16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
+ declare <4 x i1> @llvm.loop.dependence.war.mask.v4i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
+ declare <8 x i1> @llvm.loop.dependence.war.mask.v8i1.i32(i32 %pointerDiff, i64 immarg %elementSize)
+ declare <16 x i1> @llvm.loop.dependence.war.mask.v16i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
+ declare <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
Overview:
"""""""""
-Given a vector load from %ptrA followed by a vector store to %ptrB, this
-instruction generates a mask where an active lane indicates that the
+Given ``%pointerDiff``, the signed pointer difference between a load and a
+store of same-type vectors (i.e., ``%pointerDiff = %storePointer - %loadPointer``),
+where the load occurs before the store.
+
+This instruction generates a mask where an active lane indicates that the
write-after-read sequence can be performed safely for that lane, without the
danger of a write-after-read hazard occurring.
@@ -25048,30 +25051,30 @@ Semantics:
""""""""""
``%elementSize`` is the size of the accessed elements in bytes.
-The intrinsic returns ``poison`` if the distance between ``%prtA`` and ``%ptrB``
-is smaller than ``VF * %elementsize`` and either ``%ptrA + VF * %elementSize``
-or ``%ptrB + VF * %elementSize`` wrap.
+The intrinsic returns ``poison`` if ``pointerDiff`` is smaller than
+``VF * %elementsize``.
-The element of the result mask is active when loading from %ptrA then storing to
-%ptrB is safe and doesn't result in a write-after-read hazard, meaning that:
+For each lane of the mask, the mask is active if:
-* (ptrB - ptrA) <= 0 (guarantees that all lanes are loaded before any stores), or
-* elementSize * lane < (ptrB - ptrA) (guarantees that this lane is loaded
+* pointerDiff <= 0 (guarantees that all lanes are loaded before any stores), or
+* (elementSize * lane) < pointerDiff (guarantees that this lane is loaded
before the store to the same address)
Examples:
"""""""""
.. code-block:: llvm
-
- %loop.dependence.mask = call <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 4)
- %vecA = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(ptr align 4 %ptrA, <4 x i1> %loop.dependence.mask, <4 x i32> poison)
+ %aAddr = ptrtoaddr ptr %ptrA to i64
+ %bAddr = ptrtoaddr ptr %ptrB to i64
+ %pointerDiff = sub i64 %bAddr, %aAddr
+ %loop.dependence.mask = call <4 x i1> @llvm.loop.dependence.war.mask.v4i1.i64(i64 %pointerDiff, i64 4)
+ %vecA = call <4 x i32> @llvm.masked.load.v4i32.v4i32.p0(ptr align 4 %ptrA, <4 x i1> %loop.dependence.mask, <4 x i32> poison)
[...]
- call @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecA, ptr align 4 %ptrB, <4 x i1> %loop.dependence.mask)
+ call @llvm.masked.store.v4i32.p0(<4 x i32> %vecA, ptr align 4 %ptrB, <4 x i1> %loop.dependence.mask)
; For the above example, consider the following cases:
;
- ; 1. ptrA >= ptrB
+ ; 1. ptrA >= ptrB (pointerDiff <= 0)
;
; load = <0,1,2,3> ; uint32_t load = array[i+2];
; store = <0,1,2,3> ; array[i] = store;
@@ -25079,7 +25082,7 @@ Examples:
; This results in an all-true mask, as the load always occurs before the
; store, so it does not depend on any values to be stored.
;
- ; 2. ptrB - ptrA = 2 * elementSize:
+ ; 2. pointerDiff = 2 * elementSize:
;
; load = <0,1,2,3> ; uint32_t load = array[i];
; store = <0,1,2,3> ; array[i+2] = store;
@@ -25088,7 +25091,7 @@ Examples:
; we can only read two lanes before we would read values that have yet to
; be written.
;
- ; 3. ptrB - ptrA = 4 * elementSize
+ ; 3. pointerDiff = 4 * elementSize
;
; load = <0,1,2,3> ; uint32_t load = array[i];
; store = <0,1,2,3> ; array[i+4] = store;
@@ -25107,17 +25110,20 @@ This is an overloaded intrinsic.
::
- declare <4 x i1> @llvm.loop.dependence.raw.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <8 x i1> @llvm.loop.dependence.raw.mask.v8i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <16 x i1> @llvm.loop.dependence.raw.mask.v16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <vscale x 16 x i1> @llvm.loop.dependence.raw.mask.nxv16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
+ declare <4 x i1> @llvm.loop.dependence.raw.mask.v4i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
+ declare <8 x i1> @llvm.loop.dependence.raw.mask.v8i1.i32(i32 %pointerDiff, i64 immarg %elementSize)
+ declare <16 x i1> @llvm.loop.dependence.raw.mask.v16i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
+ declare <vscale x 16 x i1> @llvm.loop.dependence.raw.mask.nxv16i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
Overview:
"""""""""
-Given a vector store to %ptrA followed by a vector load from %ptrB, this
-instruction generates a mask where an active lane indicates that the
+Given ``%pointerDiff``, the signed pointer difference between a store and a
+load of same-type vectors (i.e., ``%pointerDiff = %loadPointer - %storePointer``),
+where the store occurs before the load.
+
+This instruction generates a mask where an active lane indicates that the
read-after-write sequence can be performed safely for that lane, without a
read-after-write hazard or a store-to-load forwarding hazard being introduced.
@@ -25140,38 +25146,38 @@ Semantics:
""""""""""
``%elementSize`` is the size of the accessed elements in bytes.
-The intrinsic returns ``poison`` if the distance between ``%prtA`` and ``%ptrB``
-is smaller than ``VF * %elementsize`` and either ``%ptrA + VF * %elementSize``
-or ``%ptrB + VF * %elementSize`` wrap.
+The intrinsic returns ``poison`` if ``pointerDiff`` is smaller than
+``VF * %elementsize``.
-The element of the result mask is active when storing to %ptrA then loading from
-%ptrB is safe and doesn't result in aliasing, meaning that:
+For each lane of the mask, the mask is active if:
-* elementSize * lane < abs(ptrB - ptrA) (guarantees that the store of this lane
+* elementSize * lane < abs(pointerDiff) (guarantees that the store of this lane
occurs before loading from this address), or
-* ptrA == ptrB (doesn't introduce any new hazards that weren't in the scalar
+* pointerDiff == 0 (doesn't introduce any new hazards that weren't in the scalar
code)
Examples:
"""""""""
.. code-block:: llvm
-
- %loop.dependence.mask = call <4 x i1> @llvm.loop.dependence.raw.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 4)
- call @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecA, ptr align 4 %ptrA, <4 x i1> %loop.dependence.mask)
+ %aAddr = ptrtoaddr ptr %ptrA to i64
+ %bAddr = ptrtoaddr ptr %ptrB to i64
+ %pointerDiff = sub i64 %bAddr, %aAddr
+ %loop.dependence.mask = call <4 x i1> @llvm.loop.dependence.raw.mask.v4i1.i64(i64 %pointerDiff, i64 4)
+ call @llvm.masked.store.v4i32.v4i32.p0(<4 x i32> %vecA, ptr align 4 %ptrA, <4 x i1> %loop.dependence.mask)
[...]
- %vecB = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(ptr align 4 %ptrB, <4 x i1> %loop.dependence.mask, <4 x i32> poison)
+ %vecB = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr align 4 %ptrB, <4 x i1> %loop.dependence.mask, <4 x i32> poison)
; For the above example, consider the following cases:
;
- ; 1. ptrA == ptrB
+ ; 1. ptrA == ptrB (pointerDiff == 0)
;
; store = <0,1,2,3> ; array[i] = store;
; load = <0,1,2,3> ; uint32_t load = array[i];
;
; This results in a all-true mask. There is no conflict.
;
- ; 2. ptrB - ptrA = 2 * elementSize
+ ; 2. pointerDiff = 2 * elementSize
;
; store = <0,1,2,3> ; array[i] = store;
; load = <0,1,2,3> ; uint32_t load = array[i+2];
@@ -25179,7 +25185,7 @@ Examples:
; This results in a mask with the first two lanes active. In this case,
; only two lanes can be written without overwriting values yet to be read.
;
- ; 3. ptrB - ptrA = -2 * elementSize
+ ; 3. pointerDiff = -2 * elementSize
;
; store = <0,1,2,3> ; array[i+2] = store;
; load = <0,1,2,3> ; uint32_t load = array[i];
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index 7812a301efbd7..d63e98ef67f58 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -2190,46 +2190,38 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
// The possible expansions are...
//
// loop_dependence_war_mask:
- // diff = (ptrB - ptrA) / eltSize
- // cmp = icmp sle diff, 0
+ // cmp = icmp sle (diff / eltSize), 0
// upper_bound = select cmp, -1, diff
// mask = get_active_lane_mask 0, upper_bound
//
// loop_dependence_raw_mask:
- // diff = (abs(ptrB - ptrA)) / eltSize
- // cmp = icmp eq diff, 0
+ // cmp = icmp eq (diff / eltSize), 0
// upper_bound = select cmp, -1, diff
// mask = get_active_lane_mask 0, upper_bound
//
- auto *PtrTy = cast<PointerType>(ICA.getArgTypes()[0]);
- Type *IntPtrTy = IntegerType::getIntNTy(
- RetTy->getContext(), thisT()->getDataLayout().getPointerSizeInBits(
- PtrTy->getAddressSpace()));
+ Type *IntTy = ICA.getArgTypes()[0];
bool IsReadAfterWrite = IID == Intrinsic::loop_dependence_raw_mask;
- InstructionCost Cost =
- thisT()->getArithmeticInstrCost(Instruction::Sub, IntPtrTy, CostKind);
+ TTI::OperandValueInfo EltSizeOpInfo =
+ TTI::getOperandInfo(ICA.getArgs()[1]);
+ InstructionCost Cost = thisT()->getArithmeticInstrCost(
+ Instruction::SDiv, IntTy, CostKind, {}, EltSizeOpInfo);
+
if (IsReadAfterWrite) {
- IntrinsicCostAttributes AbsAttrs(Intrinsic::abs, IntPtrTy, {IntPtrTy},
- {});
+ IntrinsicCostAttributes AbsAttrs(Intrinsic::abs, IntTy, {IntTy}, {});
Cost += thisT()->getIntrinsicInstrCost(AbsAttrs, CostKind);
}
- TTI::OperandValueInfo EltSizeOpInfo =
- TTI::getOperandInfo(ICA.getArgs()[2]);
- Cost += thisT()->getArithmeticInstrCost(Instruction::SDiv, IntPtrTy,
- CostKind, {}, EltSizeOpInfo);
-
Type *CondTy = IntegerType::getInt1Ty(RetTy->getContext());
CmpInst::Predicate Pred =
IsReadAfterWrite ? CmpInst::ICMP_EQ : CmpInst::ICMP_SLE;
- Cost += thisT()->getCmpSelInstrCost(BinaryOperator::ICmp, CondTy,
- IntPtrTy, Pred, CostKind);
- Cost += thisT()->getCmpSelInstrCost(BinaryOperator::Select, IntPtrTy,
- CondTy, Pred, CostKind);
+ Cost += thisT()->getCmpSelInstrCost(BinaryOperator::ICmp, CondTy, IntTy,
+ Pred, CostKind);
+ Cost += thisT()->getCmpSelInstrCost(BinaryOperator::Select, IntTy, CondTy,
+ Pred, CostKind);
IntrinsicCostAttributes Attrs(Intrinsic::get_active_lane_mask, RetTy,
- {IntPtrTy, IntPtrTy}, FMF);
+ {IntTy, IntTy}, FMF);
Cost += thisT()->getIntrinsicInstrCost(Attrs, CostKind);
return Cost;
}
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 4469ff155b854..e1ca27f9fb0f4 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2511,13 +2511,13 @@ let IntrProperties = [IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<1>>] in {
def int_loop_dependence_raw_mask:
DefaultAttrsIntrinsic<[llvm_anyvector_ty],
- [llvm_ptr_ty, llvm_ptr_ty, llvm_i64_ty],
- [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<2>>]>;
+ [llvm_anyint_ty, llvm_i64_ty],
+ [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<1>>]>;
def int_loop_dependence_war_mask:
DefaultAttrsIntrinsic<[llvm_anyvector_ty],
- [llvm_ptr_ty, llvm_ptr_ty, llvm_i64_ty],
- [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<2>>]>;
+ [llvm_anyint_ty, llvm_i64_ty],
+ [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<1>>]>;
def int_get_active_lane_mask:
DefaultAttrsIntrinsic<[llvm_anyvector_ty],
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 82f8fd572bf19..0e0b2c8f1f7f3 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -535,6 +535,7 @@ namespace {
SDValue visitBRCOND(SDNode *N);
SDValue visitBR_CC(SDNode *N);
SDValue visitLOAD(SDNode *N);
+ SDValue visitLOOP_DEPENDENCE_MASK(SDNode *N);
SDValue replaceStoreChain(StoreSDNode *ST, SDValue BetterChain);
SDValue replaceStoreOfFPConstant(StoreSDNode *ST);
@@ -2095,6 +2096,9 @@ SDValue DAGCombiner::visit(SDNode *N) {
case ISD::VECREDUCE_FMIN:
case ISD::VECREDUCE_FMAXIMUM:
case ISD::VECREDUCE_FMINIMUM: return visitVECREDUCE(N);
+ case ISD::LOOP_DEPENDENCE_RAW_MASK:
+ case ISD::LOOP_DEPENDENCE_WAR_MASK:
+ return visitLOOP_DEPENDENCE_MASK(N);
#define BEGIN_REGISTER_VP_SDNODE(SDOPC, ...) case ISD::SDOPC:
#include "llvm/IR/VPIntrinsics.def"
return visitVPOp(N);
@@ -21132,6 +21136,22 @@ SDValue DAGCombiner::visitLOAD(SDNode *N) {
return SDValue();
}
+// Fold LOOP_DEPENDENCE_MASK(0, sub(B, A)) to LOOP_DEPENDENCE_MASK(A, B).
+SDValue DAGCombiner::visitLOOP_DEPENDENCE_MASK(SDNode *N) {
+ auto *Op0Const = dyn_cast<ConstantSDNode>(N->getOperand(0));
+ if (!Op0Const || !Op0Const->isZero())
+ return SDValue();
+
+ SDValue Op1 = N->getOperand(1);
+ if (Op1.getOpcode() != ISD::SUB)
+ return SDValue();
+
+ SDValue Op10 = Op1->getOperand(0);
+ SDValue Op11 = Op1->getOperand(1);
+ return DAG.getNode(N->getOpcode(), SDLoc(N), N->getValueType(0), Op11, Op10,
+ N->getOperand(2), N->getOperand(3));
+}
+
namespace {
/// Helper structure used to slice a load in smaller loads.
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 04b17b56b3d49..efa9228d49c2c 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8511,18 +8511,15 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
return;
}
case Intrinsic::loop_dependence_war_mask:
- setValue(&I,
- DAG.getNode(ISD::LOOP_DEPENDENCE_WAR_MASK, sdl,
- EVT::getEVT(I.getType()), getValue(I.getOperand(0)),
- getValue(I.getOperand(1)), getValue(I.getOperand(2)),
- DAG.getConstant(0, sdl, MVT::i64)));
- return;
case Intrinsic::loop_dependence_raw_mask:
- setValue(&I,
- DAG.getNode(ISD::LOOP_DEPENDENCE_RAW_MASK, sdl,
- EVT::getEVT(I.getType()), getValue(I.getOperand(0)),
- getValue(I.getOperand(1)), getValue(I.getOperand(2)),
- DAG.getConstant(0, sdl, MVT::i64)));
+ unsigned Opcode = Intrinsic == Intrinsic::loop_dependence_war_mask
+ ? ISD::LOOP_DEPENDENCE_WAR_MASK
+ : ISD::LOOP_DEPENDENCE_RAW_MASK;
+ SDValue PointerDiff = getValue(I.getOperand(0));
+ SDValue Zero = DAG.getConstant(0, sdl, PointerDiff.getValueType());
+ setValue(&I, DAG.getNode(Opcode, sdl, EVT::getEVT(I.getType()), Zero,
+ PointerDiff, getValue(I.getOperand(1)),
+ DAG.getConstant(0, sdl, MVT::i64)));
return;
}
}
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 38db1ac4a2fb9..374719374126e 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -5578,6 +5578,19 @@ AArch64TargetLowering::LowerLOOP_DEPENDENCE_MASK(SDValue Op,
SDLoc DL(Op);
EVT VT = Op.getValueType();
+
+ SDValue Op0 = Op.getOperand(0);
+ if (Op0.getValueSizeInBits() < 64) {
+ // Handle operands less than 64-bit (the diff must be sign extended).
+ SDValue Zero = DAG.getConstant(0, DL, MVT::i64);
+ SDValue Diff =
+ DAG.getNode(ISD::SUB, DL, Op0.getValueType(), Op.getOperand(1), Op0);
+ SDValue DiffExt = DAG.getNode(ISD::SIGN_EXTEND, DL, MVT::i64, Diff);
+ return DAG.getNode(Op.getOpcode(), DL, VT,
+ {Zero, DiffExt, Op.getOperand(2), Op.getOperand(3)});
+ } else if (Op0.getValueSizeInBits() > 64)
+ return SDValue(); // Unsupported (expand).
+
unsigned LaneOffset = Op.getConstantOperandVal(3);
unsigned NumElements = VT.getVectorMinNumElements();
uint64_t EltSizeInBytes = Op.getConstantOperandVal(2);
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index ae7144155ad72..5b27b70c7f006 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -1092,7 +1092,7 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
if (ST->hasSVE2() || ST->hasSME()) {
EVT VecVT = getTLI()->getValueType(DL, RetTy);
unsigned EltSizeInBytes =
- cast<ConstantInt>(ICA.getArgs()[2])->getZExtValue();
+ cast<ConstantInt>(ICA.getArgs()[1])->getZExtValue();
if (!is_contained({1u, 2u, 4u, 8u}, EltSizeInBytes) ||
VecVT.getVectorMinNumElements() != (16 / EltSizeInBytes))
break;
diff --git a/llvm/test/Analysis/CostModel/AArch64/loop_dependence_mask.ll b/llvm/test/Analysis/CostModel/AArch64/loop_dependence_mask.ll
index 74bd41db4a64d..e80644a5bdb90 100644
--- a/llvm/test/Analysis/CostModel/AArch64/loop_dependence_mask.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/loop_dependence_mask.ll
@@ -4,186 +4,186 @@
; RUN: opt < %s -passes="print<cost-model>" 2>&1 -disable-output -mtriple=aarch64-linux-gnu -mattr=+sme | FileCheck %s --check-prefixes=CHECK,CHECK-SME
; loop.dependence.{war,raw}.mask can be lowered to while{wr,rw} if SVE2 or SME is enabled.
-define void @loop_dependence_war_mask(ptr %a, ptr %b) {
+define void @loop_dependence_war_mask(i64 %diff) {
; CHECK-EXPANDED-LABEL: 'loop_dependence_war_mask'
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 39 for instruction: %res1 = call <16 x i1> @llvm.loop.dependence.war.mask.v16i1(ptr %a, ptr %b, i64 1)
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 23 for instruction: %res2 = call <8 x i1> @llvm.loop.dependence.war.mask.v8i1(ptr %a, ptr %b, i64 2)
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %res3 = call <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr %a, ptr %b, i64 4)
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %res4 = call <2 x i1> @llvm.loop.dependence.war.mask.v2i1(ptr %a, ptr %b, i64 8)
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %res5 = call <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr %a, ptr %b, i64 1)
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %res6 = call <vscale x 8 x i1> @llvm.loop.dependence.war.mask.nxv8i1(ptr %a, ptr %b, i64 2)
-; CHECK-EXPANDED-NEXT: C...
[truncated]
Creating this because in all the patches where I've tried to make use of these intrinsics (#182457 and #175943), the pointer arguments are a nuisance I have to work around. Note: I have not provided an autoupgrade as these intrinsics have never been emitted by LLVM/Clang in any release, so I think it's safe to change them. Happy to provide one if people think otherwise.
; CHECK-NEXT: ret
entry:
%0 = call <16 x i1> @llvm.loop.dependence.war.mask.v16i1(ptr %a, ptr %b, i64 1)
%a_int = ptrtoaddr ptr %a to i64
What is codegen like when the pointers are not available? (i.e. it takes the pointer-diff as a parameter)
The first operand is set to xzr (see added test)
fhahn
left a comment
as long as this plays nice with codegen, this seems like a good simplification.
Note: I have not provided an autoupgrade as these intrinics have never been emitted by LLVM/Clang in any release, so I think it's safe to change them. Happy to provide one if people think otherwise.
That's probably fine
I believe codegen should be pretty much the same, as I've added a simple DAG combine to fold the difference into the mask during ISEL.
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed.
sdesmalen-arm
left a comment
Just some nits, otherwise looks fine to me.
| The intrinsic returns ``poison`` if ``%pointerDiff`` is not a multiple of | ||
| ``%elementSize``. |
It has been a while since I looked into these intrinsics but it sounds like this would remove the reason for them existing. The instructions on AArch64 do not follow the basic math that you would expect from simple integers, otherwise we would recognise them from icmp+activelanemask directly. As far as I remember the reason the intrinsics exist is to properly handle the fact that they do not wrap as expected from i64 math.
(Also I think it should be possible to fold a loop.mask(A,B) with noalias A, B to all-true, which could be useful in the current pass pipeline but maybe isn't something that anyone would ever implement).
Can we specify this in a way we don't need to be overly cautious about the extreme cases? I don't think they introduce a correctness issue (in that a lane is active where a dependence exists).
I think for there to be a correctness issue:
- The difference between the pointers would have to be >= 1 and < VL (call this trueDist)
- Computing that difference in i64s would have to result in a value < 0 or > trueDist.
But as 1 to VL - 1 should fit within an i64, I don't think any correctness issue can occur.
Note: If this is not possible it also makes this intrinsic hard to expand without whilewr/whilerw (and I think the current expansions would be incorrect).
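The wrap argument above can be checked mechanically with a small model (a sketch only; `signed_i64` and `computed_diff` are hypothetical helpers, and i64 math is modelled as arithmetic modulo 2**64):

```python
def signed_i64(x: int) -> int:
    """Interpret a value modulo 2**64 as a signed i64."""
    x %= 2**64
    return x - 2**64 if x >= 2**63 else x

def computed_diff(addr_a: int, addr_b: int) -> int:
    """What an i64 `sub` of the two 64-bit addresses produces."""
    return signed_i64(addr_b - addr_a)

# Even when the addresses themselves straddle the 2**64 wrap point, a
# small true forward distance survives the i64 subtraction intact, so
# the "trueDist in [1, VL) but computed diff < 0 or > trueDist" case
# cannot arise.
VL = 64  # hypothetical vector access size in bytes
for true_dist in range(1, VL):
    for addr_a in (0, 2**63 - 3, 2**64 - 7):  # including wrap-adjacent bases
        addr_b = (addr_a + true_dist) % 2**64
        assert computed_diff(addr_a, addr_b) == true_dist
print("no hazardous wrap possible for distances in [1, VL)")
```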
So, maybe if we specify %pointerDiff as implicitly frozen (to avoid freeze(sub nsw) -> sub) and always emit the subs with nsw within the loop vectorizer, we can fold to whilewr/rw and still give a safe result in case the difference does wrap. If we can't fold the sub into the mask, we can always expand to a whilelo (but I don't think that should occur for masks emitted by the loop vectorizer). Any thoughts?
So, maybe if we specify %pointerDiff as implicitly frozen (to avoid freeze(sub nsw) -> sub), and always emit the subs with nsw within the loop vectorizer.
So when it wraps and is poison, we want to always generate an all-true mask? I think if we implicitly freeze, we could end up with a pointerDiff value that makes a mask with some trailing lanes being false, which would still be correct if I understand the above correctly.
Would the new variant keep the bit about pointerDiff needing to be a multiple of the element size? If so, then the implicitly frozen value may not be a multiple and we would have a poison mask. If that's an issue, maybe the intrinsic could return an all-true mask for a poison pointerDiff?
some trailing lanes being false,
My idea was to do something "safe" but not necessarily optimal for the extreme case.
I think I'm probably overthinking this though and it'd be simpler to do @llvm.loop.dependence.war.mask.v4i1.i64(i64 %a, i64 %b, i64 immarg %elementSize), and more precisely specify the semantics. We'd still have the CSE issues, but not having to deal with pointer types is still a little nicer.
I think the semantics should be (and I think this could be done in plain IR):
.war mask: true if any of:

- icmp ule i64 %b, %a
- icmp ult (%elementSize * lane), (%b - %a)

.raw mask: true if any of:

- icmp eq i64 %a, %b
- if icmp ugt i64 %b, %a:
  - then: icmp ult (%elementSize * lane), (%b - %a)
  - otherwise: icmp ult (%elementSize * lane), (%a - %b)
Not sure if the current expansions exactly match this or not.
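The conditions proposed above can be written out as a plain executable model (a sketch only; `war_mask_ab`/`raw_mask_ab` are hypothetical names, and unsigned i64 comparisons are modelled with non-negative Python integers reduced modulo 2**64):

```python
M = 2**64  # i64 values live in [0, 2**64)

def war_mask_ab(a: int, b: int, elt: int, vf: int) -> list[bool]:
    """Lane active if (b <=u a) or (elt * lane <u b - a)."""
    return [b % M <= a % M or (elt * lane) % M < (b - a) % M
            for lane in range(vf)]

def raw_mask_ab(a: int, b: int, elt: int, vf: int) -> list[bool]:
    """Lane active if a == b, or elt * lane <u |b - a|, where the
    absolute difference is taken in the direction given by an unsigned
    compare of b and a."""
    diff = (b - a) % M if b % M > a % M else (a - b) % M
    return [a % M == b % M or (elt * lane) % M < diff
            for lane in range(vf)]

print(war_mask_ab(100, 108, 4, 4))  # diff = 2 elements -> first two lanes
print(raw_mask_ab(100, 100, 4, 4))  # same pointer -> all true
print(raw_mask_ab(108, 100, 4, 4))  # |diff| = 8 -> first two lanes
```

This is only meant to pin down the proposed semantics; whether it matches the current expansions (or whilewr/whilerw exactly) would still need checking.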
I think I'm probably overthinking this though and it'd be simpler to do @llvm.loop.dependence.war.mask.v4i1.i64(i64 %a, i64 %b, i64 immarg %elementSize), and more precisely specify the semantics. We'd still have the CSE issues, but not having to deal with pointer types is still a little nicer.
Yeah, that also sounds reasonable, with the main benefit being to avoid the conversion back to pointers.
With that we could also simplify the wording quite a bit, I think: not mention pointers at all, as it would be defined clearly in terms of IR operations.
If we can just use unsigned conditions and they are equivalent to the instructions that would be good. That would help make them simpler.
What makes ptr arguments difficult to deal with?
What makes ptr arguments difficult to deal with?
We're using these instructions as an alternative way to handle diff checks. The SCEV expressions used for diff checks have already been converted to integers at the places we want to emit dependence masks. If we don't rework things to preserve the exact pointer types, then naively casting to a pointer type will drop information such as the address space. I don't think this is a problem for correctness, but the IR does look broken. Really, these instructions don't care about any details of the pointers other than the address, so I think it makes sense to drop that information.
Passing pointer arguments is quite inconvenient for practical use. These intrinsics only care about the difference between the address bits of two pointers, but by taking pointer arguments they imply they care about other details of the pointers (such as the addrspace, or other pointer bits).
Taking two values rather than the difference also prevents CSE opportunities where, for some set of pointers, the difference can be determined to be the same even though the pointers are not identical.
This patch reworks the arguments to:
@loop.dependence.mask(i64 %pointerDiff, i64 immarg %elementSize)This is more convenient to emit and allows for more CSE for free.