[IR] Remove pointer arguments from loop.dependence.{war|raw}.mask#188248
Conversation
Passing pointer arguments is quite inconvenient for practical use. These intrinsics only care about the difference between the address bits of the two pointers, but by taking pointer arguments they imply they care about other details of the pointers (such as the addrspace, or other pointer bits). Taking two values rather than the difference also prevents CSE opportunities where, for some set of pointers, the difference can be determined to be the same even though the pointers are not identical. This patch reworks the arguments to: @loop.dependence.mask(i64 %pointerDiff, i64 immarg %elementSize) This is more convenient to emit and allows for more CSE for free.
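The reworked WAR-mask semantics can be sketched as a small executable model (a sketch following the updated LangRef wording, not actual LLVM code; `war_mask` is a hypothetical helper name):

```python
def war_mask(pointer_diff: int, element_size: int, vf: int) -> list[bool]:
    """Model of @llvm.loop.dependence.war.mask with the new arguments.

    pointer_diff = storePointer - loadPointer (signed).
    A lane is active if pointer_diff <= 0 (all loads happen before any
    store), or if element_size * lane < pointer_diff (this lane is loaded
    before the store to the same address).
    """
    return [pointer_diff <= 0 or element_size * lane < pointer_diff
            for lane in range(vf)]

# Diff of two elements: only the first two lanes are safe.
print(war_mask(8, 4, 4))    # [True, True, False, False]
# Load entirely before store (diff <= 0): all lanes safe.
print(war_mask(-16, 4, 4))  # [True, True, True, True]
```

This mirrors cases 1 to 3 in the LangRef examples below: `pointerDiff <= 0` gives an all-true mask, and `pointerDiff = k * elementSize` activates exactly the first `k` lanes (up to VF).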
@llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-backend-aarch64 Author: Benjamin Maxwell (MacDue). Patch is 86.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/188248.diff 12 Files Affected:
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 13883883d3981..d7188ca817d53 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -25020,17 +25020,20 @@ This is an overloaded intrinsic.
::
- declare <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <8 x i1> @llvm.loop.dependence.war.mask.v8i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <16 x i1> @llvm.loop.dependence.war.mask.v16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
+ declare <4 x i1> @llvm.loop.dependence.war.mask.v4i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
+ declare <8 x i1> @llvm.loop.dependence.war.mask.v8i1.i32(i32 %pointerDiff, i64 immarg %elementSize)
+ declare <16 x i1> @llvm.loop.dependence.war.mask.v16i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
+ declare <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
Overview:
"""""""""
-Given a vector load from %ptrA followed by a vector store to %ptrB, this
-instruction generates a mask where an active lane indicates that the
+Given ``%pointerDiff``, the signed pointer difference between a load and a
+store of same-type vectors (i.e., ``%pointerDiff = %storePointer - %loadPointer``),
+where the load occurs before the store.
+
+This instruction generates a mask where an active lane indicates that the
write-after-read sequence can be performed safely for that lane, without the
danger of a write-after-read hazard occurring.
@@ -25048,30 +25051,30 @@ Semantics:
""""""""""
``%elementSize`` is the size of the accessed elements in bytes.
-The intrinsic returns ``poison`` if the distance between ``%prtA`` and ``%ptrB``
-is smaller than ``VF * %elementsize`` and either ``%ptrA + VF * %elementSize``
-or ``%ptrB + VF * %elementSize`` wrap.
+The intrinsic returns ``poison`` if ``pointerDiff`` is smaller than
+``VF * %elementsize``.
-The element of the result mask is active when loading from %ptrA then storing to
-%ptrB is safe and doesn't result in a write-after-read hazard, meaning that:
+For each lane of the mask, the mask is active if:
-* (ptrB - ptrA) <= 0 (guarantees that all lanes are loaded before any stores), or
-* elementSize * lane < (ptrB - ptrA) (guarantees that this lane is loaded
+* pointerDiff <= 0 (guarantees that all lanes are loaded before any stores), or
+* (elementSize * lane) < pointerDiff (guarantees that this lane is loaded
before the store to the same address)
Examples:
"""""""""
.. code-block:: llvm
-
- %loop.dependence.mask = call <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 4)
- %vecA = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(ptr align 4 %ptrA, <4 x i1> %loop.dependence.mask, <4 x i32> poison)
+ %aAddr = ptrtoaddr ptr %ptrA to i64
+ %bAddr = ptrtoaddr ptr %ptrB to i64
+ %pointerDiff = sub i64 %bAddr, %aAddr
+ %loop.dependence.mask = call <4 x i1> @llvm.loop.dependence.war.mask.v4i1.i64(i64 %pointerDiff, i64 4)
+ %vecA = call <4 x i32> @llvm.masked.load.v4i32.v4i32.p0(ptr align 4 %ptrA, <4 x i1> %loop.dependence.mask, <4 x i32> poison)
[...]
- call @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecA, ptr align 4 %ptrB, <4 x i1> %loop.dependence.mask)
+ call @llvm.masked.store.v4i32.p0(<4 x i32> %vecA, ptr align 4 %ptrB, <4 x i1> %loop.dependence.mask)
; For the above example, consider the following cases:
;
- ; 1. ptrA >= ptrB
+ ; 1. ptrA >= ptrB (pointerDiff <= 0)
;
; load = <0,1,2,3> ; uint32_t load = array[i+2];
; store = <0,1,2,3> ; array[i] = store;
@@ -25079,7 +25082,7 @@ Examples:
; This results in an all-true mask, as the load always occurs before the
; store, so it does not depend on any values to be stored.
;
- ; 2. ptrB - ptrA = 2 * elementSize:
+ ; 2. pointerDiff = 2 * elementSize:
;
; load = <0,1,2,3> ; uint32_t load = array[i];
; store = <0,1,2,3> ; array[i+2] = store;
@@ -25088,7 +25091,7 @@ Examples:
; we can only read two lanes before we would read values that have yet to
; be written.
;
- ; 3. ptrB - ptrA = 4 * elementSize
+ ; 3. pointerDiff = 4 * elementSize
;
; load = <0,1,2,3> ; uint32_t load = array[i];
; store = <0,1,2,3> ; array[i+4] = store;
@@ -25107,17 +25110,20 @@ This is an overloaded intrinsic.
::
- declare <4 x i1> @llvm.loop.dependence.raw.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <8 x i1> @llvm.loop.dependence.raw.mask.v8i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <16 x i1> @llvm.loop.dependence.raw.mask.v16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
- declare <vscale x 16 x i1> @llvm.loop.dependence.raw.mask.nxv16i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize)
+ declare <4 x i1> @llvm.loop.dependence.raw.mask.v4i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
+ declare <8 x i1> @llvm.loop.dependence.raw.mask.v8i1.i32(i32 %pointerDiff, i64 immarg %elementSize)
+ declare <16 x i1> @llvm.loop.dependence.raw.mask.v16i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
+ declare <vscale x 16 x i1> @llvm.loop.dependence.raw.mask.nxv16i1.i64(i64 %pointerDiff, i64 immarg %elementSize)
Overview:
"""""""""
-Given a vector store to %ptrA followed by a vector load from %ptrB, this
-instruction generates a mask where an active lane indicates that the
+Given ``%pointerDiff``, the signed pointer difference between a store and a
+load of same-type vectors (i.e., ``%pointerDiff = %loadPointer - %storePointer``),
+where the store occurs before the load.
+
+This instruction generates a mask where an active lane indicates that the
read-after-write sequence can be performed safely for that lane, without a
read-after-write hazard or a store-to-load forwarding hazard being introduced.
@@ -25140,38 +25146,38 @@ Semantics:
""""""""""
``%elementSize`` is the size of the accessed elements in bytes.
-The intrinsic returns ``poison`` if the distance between ``%prtA`` and ``%ptrB``
-is smaller than ``VF * %elementsize`` and either ``%ptrA + VF * %elementSize``
-or ``%ptrB + VF * %elementSize`` wrap.
+The intrinsic returns ``poison`` if ``pointerDiff`` is smaller than
+``VF * %elementsize``.
-The element of the result mask is active when storing to %ptrA then loading from
-%ptrB is safe and doesn't result in aliasing, meaning that:
+For each lane of the mask, the mask is active if:
-* elementSize * lane < abs(ptrB - ptrA) (guarantees that the store of this lane
+* elementSize * lane < abs(pointerDiff) (guarantees that the store of this lane
occurs before loading from this address), or
-* ptrA == ptrB (doesn't introduce any new hazards that weren't in the scalar
+* pointerDiff == 0 (doesn't introduce any new hazards that weren't in the scalar
code)
Examples:
"""""""""
.. code-block:: llvm
-
- %loop.dependence.mask = call <4 x i1> @llvm.loop.dependence.raw.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 4)
- call @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecA, ptr align 4 %ptrA, <4 x i1> %loop.dependence.mask)
+ %aAddr = ptrtoaddr ptr %ptrA to i64
+ %bAddr = ptrtoaddr ptr %ptrB to i64
+ %pointerDiff = sub i64 %bAddr, %aAddr
+ %loop.dependence.mask = call <4 x i1> @llvm.loop.dependence.raw.mask.v4i1.i64(i64 %pointerDiff, i64 4)
+ call @llvm.masked.store.v4i32.v4i32.p0(<4 x i32> %vecA, ptr align 4 %ptrA, <4 x i1> %loop.dependence.mask)
[...]
- %vecB = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(ptr align 4 %ptrB, <4 x i1> %loop.dependence.mask, <4 x i32> poison)
+ %vecB = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr align 4 %ptrB, <4 x i1> %loop.dependence.mask, <4 x i32> poison)
; For the above example, consider the following cases:
;
- ; 1. ptrA == ptrB
+ ; 1. ptrA == ptrB (pointerDiff == 0)
;
; store = <0,1,2,3> ; array[i] = store;
; load = <0,1,2,3> ; uint32_t load = array[i];
;
; This results in a all-true mask. There is no conflict.
;
- ; 2. ptrB - ptrA = 2 * elementSize
+ ; 2. pointerDiff = 2 * elementSize
;
; store = <0,1,2,3> ; array[i] = store;
; load = <0,1,2,3> ; uint32_t load = array[i+2];
@@ -25179,7 +25185,7 @@ Examples:
; This results in a mask with the first two lanes active. In this case,
; only two lanes can be written without overwriting values yet to be read.
;
- ; 3. ptrB - ptrA = -2 * elementSize
+ ; 3. pointerDiff = -2 * elementSize
;
; store = <0,1,2,3> ; array[i+2] = store;
; load = <0,1,2,3> ; uint32_t load = array[i];
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index 7812a301efbd7..d63e98ef67f58 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -2190,46 +2190,38 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
// The possible expansions are...
//
// loop_dependence_war_mask:
- // diff = (ptrB - ptrA) / eltSize
- // cmp = icmp sle diff, 0
+ // cmp = icmp sle (diff / eltSize), 0
// upper_bound = select cmp, -1, diff
// mask = get_active_lane_mask 0, upper_bound
//
// loop_dependence_raw_mask:
- // diff = (abs(ptrB - ptrA)) / eltSize
- // cmp = icmp eq diff, 0
+ // cmp = icmp eq (diff / eltSize), 0
// upper_bound = select cmp, -1, diff
// mask = get_active_lane_mask 0, upper_bound
//
- auto *PtrTy = cast<PointerType>(ICA.getArgTypes()[0]);
- Type *IntPtrTy = IntegerType::getIntNTy(
- RetTy->getContext(), thisT()->getDataLayout().getPointerSizeInBits(
- PtrTy->getAddressSpace()));
+ Type *IntTy = ICA.getArgTypes()[0];
bool IsReadAfterWrite = IID == Intrinsic::loop_dependence_raw_mask;
- InstructionCost Cost =
- thisT()->getArithmeticInstrCost(Instruction::Sub, IntPtrTy, CostKind);
+ TTI::OperandValueInfo EltSizeOpInfo =
+ TTI::getOperandInfo(ICA.getArgs()[1]);
+ InstructionCost Cost = thisT()->getArithmeticInstrCost(
+ Instruction::SDiv, IntTy, CostKind, {}, EltSizeOpInfo);
+
if (IsReadAfterWrite) {
- IntrinsicCostAttributes AbsAttrs(Intrinsic::abs, IntPtrTy, {IntPtrTy},
- {});
+ IntrinsicCostAttributes AbsAttrs(Intrinsic::abs, IntTy, {IntTy}, {});
Cost += thisT()->getIntrinsicInstrCost(AbsAttrs, CostKind);
}
- TTI::OperandValueInfo EltSizeOpInfo =
- TTI::getOperandInfo(ICA.getArgs()[2]);
- Cost += thisT()->getArithmeticInstrCost(Instruction::SDiv, IntPtrTy,
- CostKind, {}, EltSizeOpInfo);
-
Type *CondTy = IntegerType::getInt1Ty(RetTy->getContext());
CmpInst::Predicate Pred =
IsReadAfterWrite ? CmpInst::ICMP_EQ : CmpInst::ICMP_SLE;
- Cost += thisT()->getCmpSelInstrCost(BinaryOperator::ICmp, CondTy,
- IntPtrTy, Pred, CostKind);
- Cost += thisT()->getCmpSelInstrCost(BinaryOperator::Select, IntPtrTy,
- CondTy, Pred, CostKind);
+ Cost += thisT()->getCmpSelInstrCost(BinaryOperator::ICmp, CondTy, IntTy,
+ Pred, CostKind);
+ Cost += thisT()->getCmpSelInstrCost(BinaryOperator::Select, IntTy, CondTy,
+ Pred, CostKind);
IntrinsicCostAttributes Attrs(Intrinsic::get_active_lane_mask, RetTy,
- {IntPtrTy, IntPtrTy}, FMF);
+ {IntTy, IntTy}, FMF);
Cost += thisT()->getIntrinsicInstrCost(Attrs, CostKind);
return Cost;
}
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 4469ff155b854..e1ca27f9fb0f4 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2511,13 +2511,13 @@ let IntrProperties = [IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<1>>] in {
def int_loop_dependence_raw_mask:
DefaultAttrsIntrinsic<[llvm_anyvector_ty],
- [llvm_ptr_ty, llvm_ptr_ty, llvm_i64_ty],
- [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<2>>]>;
+ [llvm_anyint_ty, llvm_i64_ty],
+ [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<1>>]>;
def int_loop_dependence_war_mask:
DefaultAttrsIntrinsic<[llvm_anyvector_ty],
- [llvm_ptr_ty, llvm_ptr_ty, llvm_i64_ty],
- [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<2>>]>;
+ [llvm_anyint_ty, llvm_i64_ty],
+ [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<1>>]>;
def int_get_active_lane_mask:
DefaultAttrsIntrinsic<[llvm_anyvector_ty],
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 82f8fd572bf19..0e0b2c8f1f7f3 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -535,6 +535,7 @@ namespace {
SDValue visitBRCOND(SDNode *N);
SDValue visitBR_CC(SDNode *N);
SDValue visitLOAD(SDNode *N);
+ SDValue visitLOOP_DEPENDENCE_MASK(SDNode *N);
SDValue replaceStoreChain(StoreSDNode *ST, SDValue BetterChain);
SDValue replaceStoreOfFPConstant(StoreSDNode *ST);
@@ -2095,6 +2096,9 @@ SDValue DAGCombiner::visit(SDNode *N) {
case ISD::VECREDUCE_FMIN:
case ISD::VECREDUCE_FMAXIMUM:
case ISD::VECREDUCE_FMINIMUM: return visitVECREDUCE(N);
+ case ISD::LOOP_DEPENDENCE_RAW_MASK:
+ case ISD::LOOP_DEPENDENCE_WAR_MASK:
+ return visitLOOP_DEPENDENCE_MASK(N);
#define BEGIN_REGISTER_VP_SDNODE(SDOPC, ...) case ISD::SDOPC:
#include "llvm/IR/VPIntrinsics.def"
return visitVPOp(N);
@@ -21132,6 +21136,22 @@ SDValue DAGCombiner::visitLOAD(SDNode *N) {
return SDValue();
}
+// Fold LOOP_DEPENDENCE_MASK(0, sub(B, A)) to LOOP_DEPENDENCE_MASK(A, B).
+SDValue DAGCombiner::visitLOOP_DEPENDENCE_MASK(SDNode *N) {
+ auto *Op0Const = dyn_cast<ConstantSDNode>(N->getOperand(0));
+ if (!Op0Const || !Op0Const->isZero())
+ return SDValue();
+
+ SDValue Op1 = N->getOperand(1);
+ if (Op1.getOpcode() != ISD::SUB)
+ return SDValue();
+
+ SDValue Op10 = Op1->getOperand(0);
+ SDValue Op11 = Op1->getOperand(1);
+ return DAG.getNode(N->getOpcode(), SDLoc(N), N->getValueType(0), Op11, Op10,
+ N->getOperand(2), N->getOperand(3));
+}
+
namespace {
/// Helper structure used to slice a load in smaller loads.
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 04b17b56b3d49..efa9228d49c2c 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8511,18 +8511,15 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
return;
}
case Intrinsic::loop_dependence_war_mask:
- setValue(&I,
- DAG.getNode(ISD::LOOP_DEPENDENCE_WAR_MASK, sdl,
- EVT::getEVT(I.getType()), getValue(I.getOperand(0)),
- getValue(I.getOperand(1)), getValue(I.getOperand(2)),
- DAG.getConstant(0, sdl, MVT::i64)));
- return;
case Intrinsic::loop_dependence_raw_mask:
- setValue(&I,
- DAG.getNode(ISD::LOOP_DEPENDENCE_RAW_MASK, sdl,
- EVT::getEVT(I.getType()), getValue(I.getOperand(0)),
- getValue(I.getOperand(1)), getValue(I.getOperand(2)),
- DAG.getConstant(0, sdl, MVT::i64)));
+ unsigned Opcode = Intrinsic == Intrinsic::loop_dependence_war_mask
+ ? ISD::LOOP_DEPENDENCE_WAR_MASK
+ : ISD::LOOP_DEPENDENCE_RAW_MASK;
+ SDValue PointerDiff = getValue(I.getOperand(0));
+ SDValue Zero = DAG.getConstant(0, sdl, PointerDiff.getValueType());
+ setValue(&I, DAG.getNode(Opcode, sdl, EVT::getEVT(I.getType()), Zero,
+ PointerDiff, getValue(I.getOperand(1)),
+ DAG.getConstant(0, sdl, MVT::i64)));
return;
}
}
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 38db1ac4a2fb9..374719374126e 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -5578,6 +5578,19 @@ AArch64TargetLowering::LowerLOOP_DEPENDENCE_MASK(SDValue Op,
SDLoc DL(Op);
EVT VT = Op.getValueType();
+
+ SDValue Op0 = Op.getOperand(0);
+ if (Op0.getValueSizeInBits() < 64) {
+ // Handle operands less than 64-bit (the diff must be sign extended).
+ SDValue Zero = DAG.getConstant(0, DL, MVT::i64);
+ SDValue Diff =
+ DAG.getNode(ISD::SUB, DL, Op0.getValueType(), Op.getOperand(1), Op0);
+ SDValue DiffExt = DAG.getNode(ISD::SIGN_EXTEND, DL, MVT::i64, Diff);
+ return DAG.getNode(Op.getOpcode(), DL, VT,
+ {Zero, DiffExt, Op.getOperand(2), Op.getOperand(3)});
+ } else if (Op0.getValueSizeInBits() > 64)
+ return SDValue(); // Unsupported (expand).
+
unsigned LaneOffset = Op.getConstantOperandVal(3);
unsigned NumElements = VT.getVectorMinNumElements();
uint64_t EltSizeInBytes = Op.getConstantOperandVal(2);
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index ae7144155ad72..5b27b70c7f006 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -1092,7 +1092,7 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
if (ST->hasSVE2() || ST->hasSME()) {
EVT VecVT = getTLI()->getValueType(DL, RetTy);
unsigned EltSizeInBytes =
- cast<ConstantInt>(ICA.getArgs()[2])->getZExtValue();
+ cast<ConstantInt>(ICA.getArgs()[1])->getZExtValue();
if (!is_contained({1u, 2u, 4u, 8u}, EltSizeInBytes) ||
VecVT.getVectorMinNumElements() != (16 / EltSizeInBytes))
break;
diff --git a/llvm/test/Analysis/CostModel/AArch64/loop_dependence_mask.ll b/llvm/test/Analysis/CostModel/AArch64/loop_dependence_mask.ll
index 74bd41db4a64d..e80644a5bdb90 100644
--- a/llvm/test/Analysis/CostModel/AArch64/loop_dependence_mask.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/loop_dependence_mask.ll
@@ -4,186 +4,186 @@
; RUN: opt < %s -passes="print<cost-model>" 2>&1 -disable-output -mtriple=aarch64-linux-gnu -mattr=+sme | FileCheck %s --check-prefixes=CHECK,CHECK-SME
; loop.dependence.{war,raw}.mask can be lowered to while{wr,rw} if SVE2 or SME is enabled.
-define void @loop_dependence_war_mask(ptr %a, ptr %b) {
+define void @loop_dependence_war_mask(i64 %diff) {
; CHECK-EXPANDED-LABEL: 'loop_dependence_war_mask'
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 39 for instruction: %res1 = call <16 x i1> @llvm.loop.dependence.war.mask.v16i1(ptr %a, ptr %b, i64 1)
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 23 for instruction: %res2 = call <8 x i1> @llvm.loop.dependence.war.mask.v8i1(ptr %a, ptr %b, i64 2)
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %res3 = call <4 x i1> @llvm.loop.dependence.war.mask.v4i1(ptr %a, ptr %b, i64 4)
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %res4 = call <2 x i1> @llvm.loop.dependence.war.mask.v2i1(ptr %a, ptr %b, i64 8)
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %res5 = call <vscale x 16 x i1> @llvm.loop.dependence.war.mask.nxv16i1(ptr %a, ptr %b, i64 1)
-; CHECK-EXPANDED-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %res6 = call <vscale x 8 x i1> @llvm.loop.dependence.war.mask.nxv8i1(ptr %a, ptr %b, i64 2)
-; CHECK-EXPANDED-NEXT: C...
[truncated]
Creating this because in all the patches where I've tried to make use of these intrinsics (#182457 and #175943), the pointer arguments are a nuisance I have to work around. Note: I have not provided an autoupgrade as these intrinsics have never been emitted by LLVM/Clang in any release, so I think it's safe to change them. Happy to provide one if people think otherwise.
; CHECK-NEXT: ret
entry:
%0 = call <16 x i1> @llvm.loop.dependence.war.mask.v16i1(ptr %a, ptr %b, i64 1)
%a_int = ptrtoaddr ptr %a to i64
What is codegen like when the pointers are not available? (i.e. it takes the pointer-diff as a parameter)
The first operand is set to xzr (see added test)
fhahn
left a comment
as long as this plays nice with codegen, this seems like a good simplification.
Note: I have not provided an autoupgrade as these intrinics have never been emitted by LLVM/Clang in any release, so I think it's safe to change them. Happy to provide one if people think otherwise.
That's probably fine
I believe codegen should be pretty much the same, as I've added a simple DAG combine to fold the difference into the mask during ISEL.
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed.
sdesmalen-arm
left a comment
Just some nits, otherwise looks fine to me.
| The intrinsic returns ``poison`` if ``%pointerDiff`` is not a multiple of | ||
| ``%elementSize``. |
It has been a while since I looked into these intrinsics but it sounds like this would remove the reason for them existing. The instructions on AArch64 do not follow the basic math that you would expect from simple integers, otherwise we would recognise them from icmp+activelanemask directly. As far as I remember the reason the intrinsics exist is to properly handle the fact that they do not wrap as expected from i64 math.
(Also I think it should be possible to fold a loop.mask(A,B) with noalias A, B to all-true, which could be useful in the current pass pipeline but maybe isn't something that anyone would ever implement).
Can we specify this in a way we don't need to be overly cautious about the extreme cases? I don't think they introduce a correctness issue (in that a lane is active where a dependence exists).
I think for there to be a correctness issue:
- The difference between the pointers would have to be >= 1 and < VL (call this trueDist)
- Computing that difference in i64s would have to result in a value < 0 or > trueDist.
But as 1 to VL - 1 should fit within an i64, I don't think any correctness issue can occur.
Note: If this is not possible it also makes this intrinsic hard to expand without whilewr/whilerw (and I think the current expansions would be incorrect).
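The wrap argument above can be checked mechanically with a small model (a sketch only; `signed_i64` and `computed_diff` are hypothetical helpers, and i64 math is modelled as arithmetic modulo 2**64):

```python
def signed_i64(x: int) -> int:
    """Interpret a value modulo 2**64 as a signed i64."""
    x %= 2**64
    return x - 2**64 if x >= 2**63 else x

def computed_diff(addr_a: int, addr_b: int) -> int:
    """What an i64 `sub` of the two 64-bit addresses produces."""
    return signed_i64(addr_b - addr_a)

# Even when the addresses themselves straddle the 2**64 wrap point, a
# small true forward distance survives the i64 subtraction intact, so
# the "trueDist in [1, VL) but computed diff < 0 or > trueDist" case
# cannot arise.
VL = 64  # hypothetical vector access size in bytes
for true_dist in range(1, VL):
    for addr_a in (0, 2**63 - 3, 2**64 - 7):  # including wrap-adjacent bases
        addr_b = (addr_a + true_dist) % 2**64
        assert computed_diff(addr_a, addr_b) == true_dist
print("no hazardous wrap possible for distances in [1, VL)")
```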
So, maybe if we specify %pointerDiff as implicitly frozen (to avoid freeze(sub nsw) -> sub) and always emit the subs with nsw within the loop vectorizer, we can fold to whilewr/rw and still give a safe result in case the difference does wrap. If we can't fold the sub into the mask, we can always expand to a whilelo (but I don't think that should occur for masks emitted by the loop vectorizer). Any thoughts?
So, maybe if we specify %pointerDiff as implicitly frozen (to avoid freeze(sub nsw) -> sub), and always emit the subs with nsw within the loop vectorizer.
So when it wraps and is poison, we want to always generate an all-true mask? I think if we implicitly freeze, we could end up with a pointerDiff value that makes a mask with some trailing lanes being false, which would still be correct if I understand the above correctly.
Would the new variant keep the bit about pointerDiff needing to be a multiple of the element size? If so, then the implicitly frozen value may not be a multiple and we would have a poison mask. If that's an issue, maybe the intrinsic could return an all-true mask for a poison pointerDiff?
some trailing lanes being false,
My idea was to do something "safe" but not necessarily optimal for the extreme case.
I think I'm probably overthinking this though and it'd be simpler to do @llvm.loop.dependence.war.mask.v4i1.i64(i64 %a, i64 %b, i64 immarg %elementSize), and more precisely specify the semantics. We'd still have the CSE issues, but not having to deal with pointer types is still a little nicer.
I think the semantics should be (and I think this could be done in plain IR):
.war mask: true if any of:

- icmp ule i64 %b, %a
- icmp ult (%elementSize * lane), (%b - %a)

.raw mask: true if any of:

- icmp eq i64 %a, %b
- if icmp ugt i64 %b, %a:
  - then: icmp ult (%elementSize * lane), (%b - %a)
  - otherwise: icmp ult (%elementSize * lane), (%a - %b)
Not sure if the current expansions exactly match this or not.
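The conditions proposed above can be written out as a plain executable model (a sketch only; `war_mask_ab`/`raw_mask_ab` are hypothetical names, and unsigned i64 comparisons are modelled with non-negative Python integers reduced modulo 2**64):

```python
M = 2**64  # i64 values live in [0, 2**64)

def war_mask_ab(a: int, b: int, elt: int, vf: int) -> list[bool]:
    """Lane active if (b <=u a) or (elt * lane <u b - a)."""
    return [b % M <= a % M or (elt * lane) % M < (b - a) % M
            for lane in range(vf)]

def raw_mask_ab(a: int, b: int, elt: int, vf: int) -> list[bool]:
    """Lane active if a == b, or elt * lane <u |b - a|, where the
    absolute difference is taken in the direction given by an unsigned
    compare of b and a."""
    diff = (b - a) % M if b % M > a % M else (a - b) % M
    return [a % M == b % M or (elt * lane) % M < diff
            for lane in range(vf)]

print(war_mask_ab(100, 108, 4, 4))  # diff = 2 elements -> first two lanes
print(raw_mask_ab(100, 100, 4, 4))  # same pointer -> all true
print(raw_mask_ab(108, 100, 4, 4))  # |diff| = 8 -> first two lanes
```

This is only meant to pin down the proposed semantics; whether it matches the current expansions (or whilewr/whilerw exactly) would still need checking.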
I think I'm probably overthinking this though and it'd be simpler to do @llvm.loop.dependence.war.mask.v4i1.i64(i64 %a, i64 %b, i64 immarg %elementSize), and more precisely specify the semantics. We'd still have the CSE issues, but not having to deal with pointer types is still a little nicer.
Yeah, that also sounds reasonable, with the main benefit being to avoid the conversion back to pointers.
With that we could also simplify the wording quite a bit, I think: not mention pointers at all, as it would be defined clearly in terms of IR operations.
If we can just use unsigned conditions and they are equivalent to the instructions that would be good. That would help make them simpler.
What makes ptr arguments difficult to deal with?
What makes ptr arguments difficult to deal with?
We're using these instructions as an alternative way to handle diff checks. The SCEV expressions used for diff checks have already been converted to integers at the places we want to emit dependence masks. If we don't rework things to preserve the exact pointer types, then naively casting to a pointer type will drop information such as the address space. I don't think this is a problem for correctness, but the IR does look broken. Really, these instructions don't care about any details of the pointers other than the address, so I think it makes sense to drop that information.
Passing pointer arguments is quite inconvenient for practical use. These intrinsics only care about the difference between the address bits of two pointers, but by taking pointer arguments they imply they care about other details of the pointers (such as the addrspace, or other pointer bits).
Taking two values rather than the difference also prevents CSE opportunities where, for some set of pointers, the difference can be determined to be the same even though the pointers are not identical.
This patch reworks the arguments to:
@loop.dependence.mask(i64 %pointerDiff, i64 immarg %elementSize)This is more convenient to emit and allows for more CSE for free.