[LV] Mask off possibly aliasing vector lanes by SamTebbs33 · Pull Request #100579 · llvm/llvm-project

SamTebbs33 · 2024-07-25T14:40:35Z

When vectorising a loop that uses loads and stores, those pointers with unknown bounds could overlap if their difference is less than the vector factor. For example, if address 20 is being stored to and address 23 is being loaded from, they overlap when the vector factor is 4 or higher. Currently LoopVectorize branches to a scalar loop in these cases with a runtime check. However if we construct a mask that disables the overlapping (aliasing) lanes then the vectorised loop can be safely entered, as long as the loads and stores are masked off.

This PR modifies the LoopVectorizer and VPlan to create such a mask and branch to the vector loop if the mask has active lanes.

llvmbot · 2024-07-25T14:41:09Z

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-analysis

Author: Sam Tebbs (SamTebbs33)

Changes

When vectorising a loop that uses loads and stores, those pointers with unknown bounds could overlap if their difference is less than the vector factor. For example, if address 20 is being stored to and address 23 is being loaded from, they overlap when the vector factor is 4 or higher. Currently LoopVectorize branches to a scalar loop in these cases with a runtime check. However if we construct a mask that disables the overlapping (aliasing) lanes then the vectorised loop can be safely entered, as long as the loads and stores are masked off.

This PR modifies the LoopVectorizer and VPlan to create such a mask and always branch to the vector loop. Currently this is only done if we're tail-predicating, but more work will come in the future to do this in other cases as well.

Patch is 103.32 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/100579.diff

11 Files Affected:

(added) clang/test/CodeGen/loop-alias-mask.c (+404)
(modified) llvm/include/llvm/Analysis/LoopAccessAnalysis.h (+15-1)
(modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+21-3)
(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+112-31)
(modified) llvm/lib/Transforms/Vectorize/VPlan.h (+2)
(modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+90-1)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+21-10)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+4-1)
(added) llvm/test/Transforms/LoopVectorize/AArch64/whilewr-opt.ll (+369)
(modified) llvm/test/Transforms/LoopVectorize/runtime-check-small-clamped-bounds.ll (+11-11)
(modified) llvm/test/Transforms/LoopVectorize/runtime-checks-difference.ll (+53-49)

diff --git a/clang/test/CodeGen/loop-alias-mask.c b/clang/test/CodeGen/loop-alias-mask.c
new file mode 100644
index 0000000000000..76c3b5deddfa0
--- /dev/null
+++ b/clang/test/CodeGen/loop-alias-mask.c
@@ -0,0 +1,404 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 4
+// RUN: %clang --target=aarch64-linux-gnu -march=armv9+sme2 -emit-llvm -S -g0 -O3 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize %s -o - | FileCheck %s
+#include <stdint.h>
+
+// CHECK-LABEL: define dso_local void @alias_mask_8(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP11:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT:    br i1 [[CMP11]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK:       for.body.preheader:
+// CHECK-NEXT:    [[C14:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT:    [[B15:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT:    [[SUB_DIFF:%.*]] = sub i64 [[B15]], [[C14]]
+// CHECK-NEXT:    [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], 0
+// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 16 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 16 x i1> [[DOTSPLATINSERT]], <vscale x 16 x i1> poison, <vscale x 16 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[SUB_DIFF]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 16 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP0:%.*]] = zext <vscale x 16 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 16 x i8>
+// CHECK-NEXT:    [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv16i8(<vscale x 16 x i8> [[TMP0]])
+// CHECK-NEXT:    [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+// CHECK:       vector.body:
+// CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[TMP3:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP4]], i32 1, <vscale x 16 x i1> [[TMP3]], <vscale x 16 x i8> poison), !tbaa [[TBAA6:![0-9]+]]
+// CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD16:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP5]], i32 1, <vscale x 16 x i1> [[TMP3]], <vscale x 16 x i8> poison), !tbaa [[TBAA6]]
+// CHECK-NEXT:    [[TMP6:%.*]] = add <vscale x 16 x i8> [[WIDE_MASKED_LOAD16]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT:    tail call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP6]], ptr [[TMP7]], i32 1, <vscale x 16 x i1> [[TMP3]]), !tbaa [[TBAA6]]
+// CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP8:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT:    br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP9:![0-9]+]]
+// CHECK:       for.cond.cleanup:
+// CHECK-NEXT:    ret void
+//
+void alias_mask_8(uint8_t *restrict a, uint8_t * b, uint8_t * c, int n) {
+  #pragma clang loop vectorize(enable)
+  for (int i = 0; i < n; i++) {
+    c[i] = a[i] + b[i];
+  }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_16(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP11:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT:    br i1 [[CMP11]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK:       for.body.preheader:
+// CHECK-NEXT:    [[C14:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT:    [[B15:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT:    [[SUB_DIFF:%.*]] = sub i64 [[B15]], [[C14]]
+// CHECK-NEXT:    [[DIFF:%.*]] = sdiv i64 [[SUB_DIFF]], 2
+// CHECK-NEXT:    [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], -1
+// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 8 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 8 x i1> [[DOTSPLATINSERT]], <vscale x 8 x i1> poison, <vscale x 8 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[DIFF]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 8 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP0:%.*]] = zext <vscale x 8 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 8 x i8>
+// CHECK-NEXT:    [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv8i8(<vscale x 8 x i8> [[TMP0]])
+// CHECK-NEXT:    [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+// CHECK:       vector.body:
+// CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 8 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[TMP3:%.*]] = and <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i16, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 8 x i16> @llvm.masked.load.nxv8i16.p0(ptr [[TMP4]], i32 2, <vscale x 8 x i1> [[TMP3]], <vscale x 8 x i16> poison), !tbaa [[TBAA13:![0-9]+]]
+// CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i16, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD16:%.*]] = tail call <vscale x 8 x i16> @llvm.masked.load.nxv8i16.p0(ptr [[TMP5]], i32 2, <vscale x 8 x i1> [[TMP3]], <vscale x 8 x i16> poison), !tbaa [[TBAA13]]
+// CHECK-NEXT:    [[TMP6:%.*]] = add <vscale x 8 x i16> [[WIDE_MASKED_LOAD16]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i16, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT:    tail call void @llvm.masked.store.nxv8i16.p0(<vscale x 8 x i16> [[TMP6]], ptr [[TMP7]], i32 2, <vscale x 8 x i1> [[TMP3]]), !tbaa [[TBAA13]]
+// CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP8:%.*]] = extractelement <vscale x 8 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT:    br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP15:![0-9]+]]
+// CHECK:       for.cond.cleanup:
+// CHECK-NEXT:    ret void
+//
+void alias_mask_16(uint16_t *restrict a, uint16_t * b, uint16_t * c, int n) {
+  #pragma clang loop vectorize(enable)
+  for (int i = 0; i < n; i++) {
+    c[i] = a[i] + b[i];
+  }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_32(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP9:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT:    br i1 [[CMP9]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK:       for.body.preheader:
+// CHECK-NEXT:    [[C12:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT:    [[B13:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT:    [[SUB_DIFF:%.*]] = sub i64 [[B13]], [[C12]]
+// CHECK-NEXT:    [[DIFF:%.*]] = sdiv i64 [[SUB_DIFF]], 4
+// CHECK-NEXT:    [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], -3
+// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i1> [[DOTSPLATINSERT]], <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[DIFF]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 4 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP0:%.*]] = zext <vscale x 4 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 4 x i8>
+// CHECK-NEXT:    [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv4i8(<vscale x 4 x i8> [[TMP0]])
+// CHECK-NEXT:    [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+// CHECK:       vector.body:
+// CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[TMP3:%.*]] = and <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr [[TMP4]], i32 4, <vscale x 4 x i1> [[TMP3]], <vscale x 4 x i32> poison), !tbaa [[TBAA16:![0-9]+]]
+// CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD14:%.*]] = tail call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr [[TMP5]], i32 4, <vscale x 4 x i1> [[TMP3]], <vscale x 4 x i32> poison), !tbaa [[TBAA16]]
+// CHECK-NEXT:    [[TMP6:%.*]] = add <vscale x 4 x i32> [[WIDE_MASKED_LOAD14]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT:    tail call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> [[TMP6]], ptr [[TMP7]], i32 4, <vscale x 4 x i1> [[TMP3]]), !tbaa [[TBAA16]]
+// CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP8:%.*]] = extractelement <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT:    br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP18:![0-9]+]]
+// CHECK:       for.cond.cleanup:
+// CHECK-NEXT:    ret void
+//
+void alias_mask_32(uint32_t *restrict a, uint32_t * b, uint32_t * c, int n) {
+  #pragma clang loop vectorize(enable)
+  for (int i = 0; i < n; i++) {
+    c[i] = a[i] + b[i];
+  }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_64(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP9:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT:    br i1 [[CMP9]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK:       for.body.preheader:
+// CHECK-NEXT:    [[C12:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT:    [[B13:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT:    [[SUB_DIFF:%.*]] = sub i64 [[B13]], [[C12]]
+// CHECK-NEXT:    [[DIFF:%.*]] = sdiv i64 [[SUB_DIFF]], 8
+// CHECK-NEXT:    [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], -7
+// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i1> [[DOTSPLATINSERT]], <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 0, i64 [[DIFF]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 2 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP0:%.*]] = zext <vscale x 2 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 2 x i8>
+// CHECK-NEXT:    [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv2i8(<vscale x 2 x i8> [[TMP0]])
+// CHECK-NEXT:    [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+// CHECK:       vector.body:
+// CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[TMP3:%.*]] = and <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP4]], i32 8, <vscale x 2 x i1> [[TMP3]], <vscale x 2 x i64> poison), !tbaa [[TBAA19:![0-9]+]]
+// CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD14:%.*]] = tail call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP5]], i32 8, <vscale x 2 x i1> [[TMP3]], <vscale x 2 x i64> poison), !tbaa [[TBAA19]]
+// CHECK-NEXT:    [[TMP6:%.*]] = add <vscale x 2 x i64> [[WIDE_MASKED_LOAD14]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i64, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT:    tail call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP6]], ptr [[TMP7]], i32 8, <vscale x 2 x i1> [[TMP3]]), !tbaa [[TBAA19]]
+// CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP8:%.*]] = extractelement <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT:    br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP21:![0-9]+]]
+// CHECK:       for.cond.cleanup:
+// CHECK-NEXT:    ret void
+//
+void alias_mask_64(uint64_t *restrict a, uint64_t * b, uint64_t * c, int n) {
+  #pragma clang loop vectorize(enable)
+  for (int i = 0; i < n; i++) {
+    c[i] = a[i] + b[i];
+  }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_multiple_8(
+// CHECK-SAME: ptr nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CMP11:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT:    br i1 [[CMP11]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK:       for.body.preheader:
+// CHECK-NEXT:    [[C14:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT:    [[A15:%.*]] = ptrtoint ptr [[A]] to i64
+// CHECK-NEXT:    [[B16:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT:    [[SUB_DIFF:%.*]] = sub i64 [[A15]], [[C14]]
+// CHECK-NEXT:    [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], 0
+// CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 16 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 16 x i1> [[DOTSPLATINSERT]], <vscale x 16 x i1> poison, <vscale x 16 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[SUB_DIFF]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 16 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT:    [[SUB_DIFF18:%.*]] = sub i64 [[B16]], [[C14]]
+// CHECK-NEXT:    [[NEG_COMPARE20:%.*]] = icmp slt i64 [[SUB_DIFF18]], 0
+// CHECK-NEXT:    [[DOTSPLATINSERT21:%.*]] = insertelement <vscale x 16 x i1> poison, i1 [[NEG_COMPARE20]], i64 0
+// CHECK-NEXT:    [[DOTSPLAT22:%.*]] = shufflevector <vscale x 16 x i1> [[DOTSPLATINSERT21]], <vscale x 16 x i1> poison, <vscale x 16 x i32> zeroinitializer
+// CHECK-NEXT:    [[PTR_DIFF_LANE_MASK23:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[SUB_DIFF18]])
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ALIAS24:%.*]] = or <vscale x 16 x i1> [[PTR_DIFF_LANE_MASK23]], [[DOTSPLAT22]]
+// CHECK-NEXT:    [[TMP0:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK_ALIAS]], [[ACTIVE_LANE_MASK_ALIAS24]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP1:%.*]] = zext <vscale x 16 x i1> [[TMP0]] to <vscale x 16 x i8>
+// CHECK-NEXT:    [[TMP2:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv16i8(<vscale x 16 x i8> [[TMP1]])
+// CHECK-NEXT:    [[TMP3:%.*]] = zext i8 [[TMP2]] to i64
+// CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+// CHECK:       vector.body:
+// CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT:    [[TMP4:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], [[TMP0]]
+// CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP5]], i32 1, <vscale x 16 x i1> [[TMP4]], <vscale x 16 x i8> poison), !tbaa [[TBAA6]]
+// CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT:    [[WIDE_MASKED_LOAD25:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP6]], i32 1, <vscale x 16 x i1> [[TMP4]], <vscale x 16 x i8> poison), !tbaa [[TBAA6]]
+// CHECK-NEXT:    [[TMP7:%.*]] = add <vscale x 16 x i8> [[WIDE_MASKED_LOAD25]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT:    tail call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP7]], ptr [[TMP8]], i32 1, <vscale x 16 x i1> [[TMP4]]), !tbaa [[TBAA6]]
+// CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP3]]
+// CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT:    [[TMP9:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT:    br i1 [[TMP9]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP22:![0-9]+]]
+// CHECK:       for.cond.cleanup:
+// CHECK-NEXT:    ret void
+//
+void alias_mask_multiple_8(ui...
[truncated]

llvm#100579 emits IR that creates a mask disabling lanes that could alias within a loop iteration, based on a pair of pointers. This PR lowers that IR to a WHILEWR instruction for AArch64.

huntergr-arm · 2024-07-30T11:06:09Z

clang/test/CodeGen/loop-alias-mask.c

@@ -0,0 +1,404 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 4


We generally don't add source level tests for llvm optimizations; aiui clang's CodeGen test directory is mainly to test lowering the AST to LLVM IR, not end-to-end.

We might want a different way of testing the feature from source that we could add to a buildbot.

That's true, I've removed this file and have started work on adding a test to the llvm-test-suite to replace it. A statistic variable will help make sure the alias masking is happening.

huntergr-arm · 2024-07-30T11:09:57Z

llvm/include/llvm/Analysis/LoopAccessAnalysis.h

        NeedsFreeze(NeedsFreeze) {}
 };

+/// A pair of pointers that could overlap across a loop iteration.


If the analysis has been removed from LAA, then I suggest the struct can also be moved elsewhere. Maybe VPlan.h ?

That sounds good to me. I originally put it here so it was next to the related PointerDiffInfo

#100579 emits IR that creates a mask disabling lanes that could alias within a loop iteration, based on a pair of pointers. This PR lowers that IR to the WHILEWR instruction for AArch64.

fhahn · 2024-08-05T12:50:55Z

llvm/include/llvm/Analysis/LoopAccessAnalysis.h

 #include "llvm/Analysis/LoopAnalysisManager.h"
-#include "llvm/Analysis/ScalarEvolutionExpressions.h"
 #include "llvm/IR/DiagnosticInfo.h"
+#include "llvm/Transforms/Utils/ScalarEvolutionExpander.h"


Unrelated changes?

Yep this is left over from previous changes, thanks. Removed now.

fhahn · 2024-08-05T12:55:34Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

  bool HasNUW = Style == TailFoldingStyle::None;
-  addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), HasNUW, DL);
+
+  VPValue *AliasMask = nullptr;


Does this need to be done at initial construction? IIUC this could be framed as optimization of the original VPlan, done as a separate VPlan-to-VPlan transform?

At the moment this requires threading a bunch of information to a lot of places, would
Be good limit to a single dedicated transform

Currently the mask is only generated when tail predication is enabled, in which case it perhaps could be done with a transform since all the vector loads and stores will be masked anyway. However the goal is to generate the mask even when tail predication is disabled, to capture more cases, and so the transform would have to handle swapping the non-masked loads and stores to masked variants, which I don't think is best done with a transform since the original VPlan does a good job of that anyway.

I'd also argue that it's a good idea to get the masking as we want it to be from the beginning rather than setting up an active lane mask and building on top of it later on.

If possible to do so in a separate transform, that would preferable IMO as now the general VPlan construction has a bunch of added complexity for something that is quite specific.

A similar approach was taken for adding various options to control tail-folding (addActiveLaneMask, tryAddExplicitVectorLength). This also has the advantage of keeping the code responsible to generate the new construct at a single place, rather than spreading it across multiple places?

Having it as a separate transform also allows to generate VPlans with different strategies (e.g. with and without this new mask) and pick the best option based on cost.

fhahn · 2024-08-05T12:57:45Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  GeneratedRTChecks Checks(*PSE.getSE(), DT, LI, TTI, F->getDataLayout(),
+                           AddBranchWeights);
+
+  // VPlan needs the aliasing pointers as Values and not SCEVs, so expand them


If expanded SCEVs are needed in VPlan they should be expanded during VPlan execution using VPExpandSCEVRecioe if possible

Oh perfect, I didn't know that existed, thanks.

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

fhahn · 2024-08-05T13:00:26Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+                          EltVT->getArrayNumElements();
+          else
+            ElementSize =
+                GEP->getSourceElementType()->getScalarSizeInBits() / 8;


Not sure if using the GEP source element type here is correct, should this use the accessed type from load/store which could be different?

I've managed to figure it out now and am using the AccessSize provided by PointerDIffInfo. Thanks for pointing out the flakiness here!

fhahn · 2024-08-05T13:02:13Z

llvm/test/Transforms/LoopVectorize/runtime-checks-difference.ll

 ; CHECK-NEXT:    [[B1:%.*]] = ptrtoint ptr [[B:%.*]] to i64
 ; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N:%.*]], 4
-; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %scalar.ph, label %vector.memcheck
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_MEMCHECK:%.*]]


Unrelated changes?

Indeed, the update script did this. I've removed it now.

fhahn · 2024-08-05T13:03:31Z

llvm/test/Transforms/LoopVectorize/AArch64/whilewr-opt.ll

+  %cmp11 = icmp sgt i32 %n, 0
+  br i1 %cmp11, label %for.body.preheader, label %for.cond.cleanup
+
+for.body.preheader:


I might be missing something but the input is already vectorized?

You're right, thanks for pointing that out. I've improved the test so that it shows the loop vectoriser's work.

github-actions · 2024-08-06T16:30:37Z

✅ With the latest revision this PR passed the C/C++ code formatter.

SamTebbs33 · 2024-08-27T13:52:21Z

Ping.

huntergr-arm · 2024-09-03T16:19:37Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+      }
+    }
+  }
+  assert(ElementSize > 0 && "Couldn't get element size from pointer");


This assert fires when compiling MicroBenchmarks/ImageProcessing/Dilate/CMakeFiles/Dilate.dir/dilateKernel.c from the test suite, so that needs fixing at least.

But I think we shouldn't be trying to figure out the element size at recipe execution time; it should be determined when trying to build the recipe and the value stored as part of the class. If we can't find the size then we stop trying to build the recipe and fall back to normal RT checks. Does that sound sensible to you?

Thanks Graham, I've had a deeper look and PointerDiffInfo provides an AccessSize which corresponds to what we want, so I'm using that now.

llvm/lib/Transforms/Vectorize/VPlan.h

SamTebbs33 · 2024-10-29T15:30:28Z

I've moved the read-after-write lowering to #114028

SamTebbs33 · 2024-11-06T14:12:48Z

I've now fixed the comparison that's emitted since it should have been comparing to 0. See the Arm WHILERW instruction.

It can be unsafe to load a vector from an address and write a vector to an address if those two addresses have overlapping lanes within a vectorised loop iteration. This PR adds an intrinsic designed to create a mask with lanes disabled if they overlap between the two pointer arguments, so that only safe lanes are loaded, operated on and stored. Along with the two pointer parameters, the intrinsic also takes an immediate that represents the size in bytes of the vector element types, as well as an immediate i1 that is true if there is a write after-read-hazard or false if there is a read-after-write hazard. This will be used by llvm#100579 and replaces the existing lowering for whilewr since that isn't needed now we have the intrinsic.

MacDue · 2024-12-09T11:52:17Z

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

+  /// RTChecks is a list of pointer pairs that should be checked for aliasing,
+  /// setting HasAliasMask to true in the case that an alias mask is generated


Outdated comment? Is this DiffChecks now?

Done, thanks.

MacDue

A bunch of little comments (mostly just nitpicks from a pass over the PR) 🙂

MacDue · 2024-12-09T11:56:14Z

llvm/include/llvm/Analysis/TargetTransformInfo.h

+  /// Branch to scalar loop if checks fails at runtime.
+  ScalarFallback,
+  /// Form a mask based on elements which won't be a WAR or RAW hazard


ultra nit: One of these comments ends with a full-stop and the other does not.

Suggested change

/// Branch to scalar loop if checks fails at runtime.

ScalarFallback,

/// Form a mask based on elements which won't be a WAR or RAW hazard

/// Branch to scalar loop if checks fails at runtime.

ScalarFallback,

/// Form a mask based on elements which won't be a WAR or RAW hazard.

MacDue · 2024-12-09T12:06:08Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+  // Count the number of bits set in each lane and reduce the result to a scalar
+  case VPInstruction::PopCount: {
+    Value *Op = State.get(getOperand(0));
+    auto *VT = Op->getType();


nit: Spell out type if it's not present on the RHS.

Suggested change

auto *VT = Op->getType();

Type *VT = Op->getType();

MacDue · 2024-12-09T12:10:20Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+  Value *SinkValue = State.get(getSinkValue(), true);
+  Value *SourceValue = State.get(getSourceValue(), true);
+
+  auto *Type = SinkValue->getType();


nit:

Suggested change

auto *Type = SinkValue->getType();

Type *PtrType = SinkValue->getType();

Not needed thanks to rebase.

MacDue · 2024-12-09T12:13:28Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+                                  VPSlotTracker &SlotTracker) const {
+  O << Indent << "EMIT ";
+  getVPSingleValue()->printAsOperand(O, SlotTracker);
+  O << " = alias lane mask ";


nit: These seem more commonly printed in all caps with hyphens.

Suggested change

O << " = alias lane mask ";

O << " = ALIAS-LANE-MASK ";

MacDue · 2024-12-09T12:14:02Z

llvm/lib/Transforms/Vectorize/VPlan.cpp


  IRBuilder<> Builder(State.CFG.PrevBB->getTerminator());
  // FIXME: Model VF * UF computation completely in VPlan.
-  assert(VFxUF.getNumUsers() && "VFxUF expected to always have users");


How does removing this assert relate to these changes?

MacDue · 2024-12-09T12:21:55Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

      find_if(Plan.getCanonicalIV()->users(),
              [](VPUser *U) { return isa<VPWidenCanonicalIVRecipe>(U); });
-  assert(FoundWidenCanonicalIVUser &&
+  assert(FoundWidenCanonicalIVUser && *FoundWidenCanonicalIVUser &&


This looks a little odd. Doesn't find_if return an iterator?

Suggested change

assert(FoundWidenCanonicalIVUser && *FoundWidenCanonicalIVUser &&

auto IVUsers = Plan.getCanonicalIV()->users();

/// ...

assert(FoundWidenCanonicalIVUser != IVUsers.end() && "Must have widened canonical IV when tail folding!");

Done, thanks.

MacDue · 2024-12-09T12:22:28Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.h

+  /// RTChecks refers to the pointer pairs that need aliasing elements to be
+  /// masked off each loop iteration.


No docs for PSE?

Added, let me know if anything about it should change.

MacDue · 2024-12-09T12:23:14Z

llvm/lib/Transforms/Vectorize/VPlan.h

+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,


Would it be possible to add a test that prints the VPlan recipe with the alias mask?

MacDue · 2024-12-09T12:24:27Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+    LaneMask = Builder.createNaryOp(Instruction::BinaryOps::And,
+                                    {LaneMaskPhi, AliasMask}, DL);


Do we know this AND won't be all-false?

We don't, and there's actually a case in the test suite that hangs because the mask is all-false. I'll start looking into a solution for that.

MacDue · 2024-12-09T12:33:32Z

llvm/lib/Transforms/Vectorize/VPlan.h

+// is X and B is X + 2 with VF being 4, only the final two elements of the
+// loaded vector can be stored since they don't overlap with the stored
+// vector. %b.vec = load %b ; = [s, t, u, v]
+// [...]
+// store %a, %b.vec ; only u and v can be stored as their addresses don't
+// overlap with %a + (VF - 1)


This is specifically RAW? Of something like:

store A[x] load A[x + 2]

Perhaps I'm muddled on what "final two elements" means, but isn't the first two elements store that is valid (so it won't overwrite the elements for the load)?

Yes you're right, this should say that the first two are valid. Thanks for spotting that. I've re-worded the comment to make it more clear.

It can be unsafe to load a vector from an address and write a vector to an address if those two addresses have overlapping lanes within a vectorised loop iteration. This PR adds an intrinsic designed to create a mask with lanes disabled if they overlap between the two pointer arguments, so that only safe lanes are loaded, operated on and stored. Along with the two pointer parameters, the intrinsic also takes an immediate that represents the size in bytes of the vector element types, as well as an immediate i1 that is true if there is a write after-read-hazard or false if there is a read-after-write hazard. This will be used by llvm#100579 and replaces the existing lowering for whilewr since that isn't needed now we have the intrinsic.

SamTebbs33 · 2025-08-07T09:30:03Z

I've just pushed a new version that makes sure the popcount of the mask is greater than 0 before branching to the vector loop, otherwise it branches to the scalar loop as usual.

SamTebbs33 · 2025-08-07T09:54:19Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

The intrinsics do expand into a lot of instructions, so I'm keen to hear people's opinions on whether invalid is better than calculating the cost of them, since that will probably be very high.

I would suggest reverting to calling the default cost-model here by doing break;.

To the question whether to return either an Invalid or a high cost in the default cost-model, I would argue for a high cost because that would allow for testing the feature without requiring the special instructions.

It will now get the expanded intrinsic cost instead of returning invalid.

sdesmalen-arm · 2025-08-08T13:03:07Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

This is also supported for SME.

sdesmalen-arm · 2025-08-08T13:03:38Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

This is also supported for SME.

sdesmalen-arm · 2025-08-08T13:06:39Z

llvm/lib/Analysis/LoopAccessAnalysis.cpp

I'd suggest writing this as: bool WriteAfterRead = !Src->IsWritePtr && Sink->IsWritePtr.

That's better, thanks.

sdesmalen-arm · 2025-08-08T13:23:19Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

I would suggest reverting to calling the default cost-model here by doing break;.

To the question whether to return either an Invalid or a high cost in the default cost-model, I would argue for a high cost because that would allow for testing the feature without requiring the special instructions.

sdesmalen-arm · 2025-08-08T13:42:22Z

llvm/lib/Transforms/Utils/LoopUtils.cpp

Two things:

When I remove !VF.isScalar() && all the tests pass, suggesting that there's either missing test coverage or this code also works for a VF=1.

UseSafeEltsMask is loop invariant, so my suggestion would be to separate this out into two parts: one loop for the case that UnsafeEltsMask==false and then return MemoryRuntimeCheck afterwards, and a second part for the case where UnsafeEltsMask==true.

I think that VF.isScalar() is actually an unnecessary check since the memory check block is only used if the LV actually ends up vectorising or if epilogue vectorisation is on. In the case of epilogue vectorisation it won't use tail predication and tail predication being off turns off UseSafeEltsMask.

As per Florian's suggestion I've separated this into two functions.

sdesmalen-arm · 2025-08-08T14:13:43Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

When you add a check to TTI.useSafeEltsMask in CostModel::getRTCheckStyle, then function becomes redundant and you can call the CostModel function directly to check if CM.getRTCheckStyle() == RTCheckStyle::UseSafeEltsMask

sdesmalen-arm · 2025-08-08T14:32:05Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Why would you want to check that here rather than where it's used (i.e. tryToBuildVPlanWithVPRecipes). You can check there if the NoAlias-mask is needed before passing them to VPlanTransforms::addActiveLaneMask. That way you don't need to pass the RTChecks to all those functions, thus reducing the diff.

It also seems strange that RuntimePointerChecking::getDiffChecks returns a std::optional<ArrayRef> rather than an empty ArrayRef, when CanUseDiffCheck == false. It would be worth removing the std::optional in an NFC patch.

it does make sense to do this closer to where the checks are used, done.

sdesmalen-arm · 2025-08-08T15:11:36Z

llvm/lib/Transforms/Utils/LoopUtils.cpp

Do you need to use/check SeenCompares in the loop_dependence_war/raw_mask path as well?

That should help, done.

sdesmalen-arm · 2025-08-08T15:58:44Z

llvm/test/Transforms/LoopVectorize/AArch64/alias_mask.ll

My recommendation would be not to add too many passes like instcombine/early-cse here, because that may mask issues. It's good to know that CSE will combine the two llvm.loop.dependence.*.mask intrinsics (the loop invariant one in the loop body and the one in the check block), but it doesn't necessarily need to check that in the LV test.

Sounds sensible to me, done.

sdesmalen-arm · 2025-08-08T16:04:29Z

llvm/test/Transforms/LoopVectorize/AArch64/alias_mask.ll

Can you add some tests where there are more than a single loop dependence? (preferably mixed) That way we can verify they are combined correctly.

Done, let me know if anything needs to be added or changed with them.

fhahn · 2025-08-10T20:35:49Z

llvm/include/llvm/Analysis/TargetTransformInfo.h

Could we just check the cost of the new intrinsic using the existing hooks?

That would work if I return invalid for the cost, but @sdesmalen-arm has suggested I calculate the cost of the expanded intrinsic instead of returning invalid: https://github.com/llvm/llvm-project/pull/100579/files#r2262969228

We could still check if the cost is < some bound for which vectorization is worthwhile?

fhahn · 2025-08-10T20:36:08Z

llvm/lib/Transforms/Utils/LoopUtils.cpp

It doesn't look like there's much re-use going on here. Better to have a separate function than complicate the logic here?

I think that makes sense, done.

fhahn · 2025-08-10T20:36:39Z

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

Better to add the new recipes at the same place where the runtime checks are attached to VPlan, rather than threading through the general planning stage: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L9359C32-L9359C51

The problem is that the mask needs to be ANDed with the active lane mask and masked operations need to use that ANDed value, and that is all done as part of the plan creation.

Can't we just check if there's an active-lane-mask in the plan? Or AND the masks for all relevant predicated memory instructions?

Are the predicated instructions kept track of somewhere? If not, then finding them reliably will be worse than the current approach, IMO.

Memory accesses will be lowered to masked VPWidenLoad/StoreRecipe, so should hopefully be easy to find in the plan

I've re-worked it to add the masks when attaching the runtime check block as suggested in a commit here. Please let me know what you think about this approach.

Yep that looks like a good first step. Would be good if updating the masks could also be done there (the PopCount introduction) and we could avoid making addActiveLaneMask more complex

SamTebbs33 · 2025-09-22T09:50:49Z

Ping.

llvm/test/Transforms/LoopVectorize/AArch64/alias_mask.ll

llvm/lib/Transforms/Utils/LoopUtils.cpp

sdesmalen-arm · 2025-11-11T10:56:47Z

llvm/lib/Transforms/Utils/LoopUtils.cpp

@@ -2117,8 +2170,8 @@ Value *llvm::addDiffRuntimeChecks(
    Value *Diff =
        Expander.expandCodeFor(SE.getMinusSCEV(SinkStart, SrcStart), Ty, Loc);

-    // Check if the same compare has already been created earlier. In that case,
-    // there is no need to check it again.
+    // Check if the same compare has already been created earlier. In that


nit: unnecessary change.

sdesmalen-arm · 2025-11-11T10:58:03Z

llvm/lib/Transforms/Utils/LoopUtils.cpp

+
+  // Map to keep track of created compares, The key is the pair of operands for
+  // the compare, to allow detecting and re-using redundant compares.
+  DenseMap<std::pair<Value *, Value *>, Value *> SeenCompares;


nit: add newline after DenseMap to distinguish the comment from the code below.

No longer relevant now that this function has been removed, due to the new mask insertion code. Likewise for the other comments here.

sdesmalen-arm · 2025-11-11T11:00:12Z

llvm/lib/Transforms/Utils/LoopUtils.cpp

+    if (SeenCompares.lookup({Sink, Src}))
+      continue;


Can we not assume that {Sink, Src} tuples are always unique? (note that when I comment this out, there's no failing test)

sdesmalen-arm · 2025-11-11T11:03:18Z

llvm/lib/Transforms/Utils/LoopUtils.cpp

+    Value *SourceAsPtr = ChkBuilder.CreateCast(Instruction::IntToPtr, Src,
+                                               ChkBuilder.getPtrTy());


Suggested change

Value *SourceAsPtr = ChkBuilder.CreateCast(Instruction::IntToPtr, Src,

ChkBuilder.getPtrTy());

Value *SourceAsPtr = ChkBuilder.CreateIntToPtr(Src, ChkBuilder.getPtrTy());

(same below)

sdesmalen-arm · 2025-11-11T11:14:57Z

llvm/lib/Transforms/Utils/LoopUtils.cpp

+  }
+  assert(AliasLaneMask && "Expected an alias lane mask to have been created.");
+  auto *VecVT = VectorType::get(ChkBuilder.getInt1Ty(), VF);
+  // Extend to an i8 since i1 is too small to add with


Is the intention to compare with 0? (i.e. any lanes set) If so, then I would think an or reduction on i1's would be better.
Alternatively, if we only want to execute the vector loop if at least VF/2 elements are handled in the vectorised loop (for VF > 2), then doing a popcount makes sense. But in that case, we should be cautious about the element type used. i8 may not be enough for very large vectors, that will depend on the VF and VScale values.

At the moment the intention is to compare with zero, but the popcount instruction is generic (the new mask insertion approach uses the instruction rather than constructing the intrinsic) so I'd like to keep it as an add. I'm sure I've seen it being optimised to an or somewhere!

sdesmalen-arm · 2025-11-11T11:17:14Z

llvm/test/Transforms/LoopVectorize/AArch64/alias_mask.ll

+; CHECK-NEXT:    [[TMP1:%.*]] = call i8 @llvm.vector.reduce.add.nxv16i8(<vscale x 16 x i8> [[TMP0]])
+; CHECK-NEXT:    [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp ugt i64 [[TMP2]], 0
+; CHECK-NEXT:    br i1 [[TMP3]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]


If the popcount is higher than 0, it branches to the scalar preheader, isn't it supposed to branch to the vector preheader instead?

Fixed now, thank you. I also realised I don't have to AND or OR with the existing MemCheckCond since we want to override it.

sdesmalen-arm · 2025-11-11T11:21:56Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+    case TailFoldingStyle::Data:
+    case TailFoldingStyle::DataAndControlFlow:
+    case TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck:
+      return RTCheckStyle::UseSafeEltsMask;


Do we need to tie this to tail folding? We would be able to get more coverage if we could enable this for all loops. If there are reasons for tying this to fail folding, then I'd prefer to do that under an option that we can disable for testing purposes.

I believe it was because the loop dependence mask was inserted at the same time as the active lane mask, but with the new approach they're de-coupled, so I have removed this restriction.

sdesmalen-arm · 2025-11-11T11:30:07Z

llvm/lib/Transforms/Vectorize/VPlan.h

+// [...]
+// store %b, %a.vec ; only s and t can be stored as their addresses don't
+// overlap with %a + (VF - 1)
+class VPAliasLaneMaskRecipe : public VPSingleDefRecipe {


I'm sure there's a good reason for it, but why is this a VPRecipe as opposed to being a VPInstruction?

I'm going to try to use VPWidenIntrinsicRecipe instead but have run into an issue.

Replaced with the intrinsic recipe now.

fhahn · 2025-11-11T11:54:47Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+                                          Builder.getPtrTy());
+  Value *SinkAsPtr =
+      Builder.CreateCast(Instruction::IntToPtr, SinkValue, Builder.getPtrTy());
+  Value *AliasMask = Builder.CreateIntrinsic(


This basically lowers to a wide intrinsic, can this simply use VPWidenIntrinsicReicpe?

The loop dependence intrinsics expect pointer arguments, but the scev expansion turns the pointer arguments into integers. This is fine for the dedicated recipe, since it converts those integers to pointers for the intrinsic call, but VPWidenIntrinsicRecipe won't do that. I've tried adding inttoptr casts before the VPWidenIntrinsicRecipe, but then an assertion fails saying "VPExpandSCEVRecipes must be at the beginning of the entry block, " "after any VPIRInstructions", since there's now a cast instruction after the scev expansion. It's very annoying.

Right, but can't all needed SCEVs be expanded first, then casted? Or if a pointer is needed, avoid converting them to integers first?

Figured it out now! I think I was getting the VPBuilder usage wrong so the recipes were being inserted in the wrong place.

fhahn · 2025-11-11T11:59:37Z

llvm/include/llvm/CodeGen/BasicTTIImpl.h

      // Otherwise, fallback to default scalarization cost.
      break;
    }
+    case Intrinsic::loop_dependence_raw_mask:


I think you should be able to move this to an independent PR, and also test this with dedicated cost-model tests.

Raised that PR here and I've rebased this work on top of it.

fhahn · 2025-11-11T13:50:46Z

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

Yep that looks like a good first step. Would be good if updating the masks could also be done there (the PopCount introduction) and we could avoid making addActiveLaneMask more complex

This PR adds the cost model for the loop dependence mask intrinsics, both for cases where they must be expanded and when they can be lowered for AArch64.

When vectorising a loop that uses loads and stores, those pointers could overlap if their difference is less than the vector factor. For example, if address 20 is being stored to and address 23 is being loaded from, they overlap when the vector factor is 4 or higher. Currently LoopVectorize branches to a scalar loop in these cases with a runtime check. Howver if we construct a mask that disables the overlapping (aliasing) lanes then the vectorised loop can be safely entered, as long as the loads and stores are masked off.

sdesmalen-arm · 2025-11-20T14:44:48Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+      // Replace uses of the loop body's active lane mask phi with an AND of the
+      // phi and the alias mask.
+      for (VPRecipeBase &R : *LoopBody) {
+        auto *MaskPhi = dyn_cast<VPActiveLaneMaskPHIRecipe>(&R);


I believe the transform is currently incorrect. When there is no active lane mask, it would create an unpredicated vector loop that handles e.g. VF=16 lanes in a loop, even when the result of the alias.mask would say that only 3 lanes could be safely handled, for example. It would then increment the IV by 3 elements, but that doesn't mean only 3 lanes are handled each iteration. Without predication, it still handles 16 lanes each iteration.

I think there are two options here:

if there is no active lane mask in the loop, we could bail out to the scalar loop if the number of lanes < VF

request the use of an active lane mask in the loop for data when there is an alias mask required and the target supports the use of an active lane mask.

I wouldn't mind taking approach 1 first, so that we can already use the whilerw instructions for the alias checks in the check block, rather than a bunch of scalar instructions, and then follow this up by option 2.

I don't think we necessarily need an active-lane-mask, as long as either all recipes that need predication (memory ops, ops that are immediate UB on poison, reduction/recurrences) are already predicated (could be due to tail-folding without active-lane-mask) or we could convert them to predicated variants using the alias mask.

Also, an active-lane-mask also does not necessarily mean all required recipes are predicated and use the active-lane-mask (e.g. a transform may convert a masked memory access to an unmasked one, if it is guaranteed dereferneceable for the whole loop).

So would probably be good to check if all required recipes are masked and make sure their masks inlcude AliasMask after the transform

sdesmalen-arm · 2025-11-20T14:54:03Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    if (VecVT->getElementType()->getScalarSizeInBits() < 8) {
+      Cnt = Builder.CreateCast(
+          Instruction::ZExt, Cnt,
+          VectorType::get(Builder.getInt8Ty(), VecVT->getElementCount()));


We can't use a hard-coded i8, as this may not be sufficient to represent the number of bits from popcount.

fhahn · 2025-11-21T10:09:53Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+      // Replace uses of the loop body's active lane mask phi with an AND of the
+      // phi and the alias mask.
+      for (VPRecipeBase &R : *LoopBody) {
+        auto *MaskPhi = dyn_cast<VPActiveLaneMaskPHIRecipe>(&R);


I don't think we necessarily need an active-lane-mask, as long as either all recipes that need predication (memory ops, ops that are immediate UB on poison, reduction/recurrences) are already predicated (could be due to tail-folding without active-lane-mask) or we could convert them to predicated variants using the alias mask.

Also, an active-lane-mask also does not necessarily mean all required recipes are predicated and use the active-lane-mask (e.g. a transform may convert a masked memory access to an unmasked one, if it is guaranteed dereferneceable for the whole loop).

So would probably be good to check if all required recipes are masked and make sure their masks inlcude AliasMask after the transform

fhahn · 2025-11-21T10:12:53Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+        CM.Legal->getRuntimePointerChecking()->getDiffChecks();
+
+    // Create a mask enabling safe elements for each iteration.
+    if (CM.getRTCheckStyle(TTI) == RTCheckStyle::UseSafeEltsMask &&


would be good to outline to a separte function + document the transform

SamTebbs33 requested review from DevM-uk, NickGuy-Arm, dtemirbulatov, huntergr-arm, kmclaughlin-arm and sdesmalen-arm July 25, 2024 14:40

llvmbot added clang Clang issues not falling into any other category vectorizers llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms labels Jul 25, 2024

SamTebbs33 mentioned this pull request Jul 26, 2024

[AArch64] Lower alias mask to a whilewr #100769

Merged

SamTebbs33 requested a review from davemgreen July 30, 2024 09:47

huntergr-arm reviewed Jul 30, 2024

View reviewed changes

fhahn reviewed Aug 5, 2024

View reviewed changes

SamTebbs33 force-pushed the whilewr branch from 315699e to 9039996 Compare August 27, 2024 16:41

huntergr-arm reviewed Sep 3, 2024

View reviewed changes

SamTebbs33 commented Sep 19, 2024

View reviewed changes

llvm/lib/Transforms/Vectorize/VPlan.h Outdated Show resolved Hide resolved

SamTebbs33 force-pushed the whilewr branch 2 times, most recently from 7a87bf4 to 96b8b65 Compare October 22, 2024 07:57

llvmbot added the backend:AArch64 label Oct 23, 2024

llvmbot added the llvm:ir label Nov 26, 2024

SamTebbs33 requested a review from david-arm December 9, 2024 10:39

MacDue reviewed Dec 9, 2024

View reviewed changes

SamTebbs33 force-pushed the whilewr branch from a74eef4 to 5386786 Compare May 7, 2025 09:30

SamTebbs33 requested review from DeinAlptraum, QuietMisdreavus and daniel-grumberg as code owners May 7, 2025 09:30

DeinAlptraum removed their request for review May 7, 2025 11:23

SamTebbs33 force-pushed the users/SamTebbs33/alias-intrinsic branch from 4c4ab8a to 8822b8f Compare May 7, 2025 13:00

SamTebbs33 removed request for QuietMisdreavus and daniel-grumberg May 7, 2025 13:01

SamTebbs33 commented Aug 7, 2025

View reviewed changes

sdesmalen-arm reviewed Aug 8, 2025

View reviewed changes

fhahn reviewed Aug 10, 2025

View reviewed changes

sdesmalen-arm reviewed Nov 11, 2025

View reviewed changes

fhahn reviewed Nov 11, 2025

View reviewed changes

SamTebbs33 added 4 commits November 11, 2025 17:29

[Analysis][AArch64] Add cost model for loop.dependence.{war/raw}.mask

eab11c8

This PR adds the cost model for the loop dependence mask intrinsics, both for cases where they must be expanded and when they can be lowered for AArch64.

Address review

9f0382b

Use VPWidenIntrinsicRecipe and clean up

bd6f231

sdesmalen-arm reviewed Nov 20, 2025

View reviewed changes

fhahn reviewed Nov 21, 2025

View reviewed changes

MacDue mentioned this pull request Jan 23, 2026

[LV] Add initial support for partial alias masking #177599

Draft

MacDue mentioned this pull request Feb 20, 2026

[LV] Add support for partial alias masking with tail folding #182457

Open

		@@ -0,0 +1,404 @@
		// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 4

		/// RTChecks is a list of pointer pairs that should be checked for aliasing,
		/// setting HasAliasMask to true in the case that an alias mask is generated

	auto *Type = SinkValue->getType();
	Type *PtrType = SinkValue->getType();

-  assert(FoundWidenCanonicalIVUser && *FoundWidenCanonicalIVUser &&
+  auto IVUsers = Plan.getCanonicalIV()->users();
+  /// ...
+  assert(FoundWidenCanonicalIVUser != IVUsers.end() && "Must have widened canonical IV when tail folding!");

		/// RTChecks refers to the pointer pairs that need aliasing elements to be
		/// masked off each loop iteration.

		/// Print the recipe.
		void print(raw_ostream &O, const Twine &Indent,

		LaneMask = Builder.createNaryOp(Instruction::BinaryOps::And,
		{LaneMaskPhi, AliasMask}, DL);

		Value *SourceAsPtr = ChkBuilder.CreateCast(Instruction::IntToPtr, Src,
		ChkBuilder.getPtrTy());

	Value *SourceAsPtr = ChkBuilder.CreateCast(Instruction::IntToPtr, Src,
	ChkBuilder.getPtrTy());
	Value *SourceAsPtr = ChkBuilder.CreateIntToPtr(Src, ChkBuilder.getPtrTy());

Conversation

SamTebbs33 commented Jul 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Jul 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

huntergr-arm Jul 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Aug 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SamTebbs33 commented Aug 27, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

SamTebbs33 commented Oct 29, 2024

Uh oh!

SamTebbs33 commented Nov 6, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MacDue left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

SamTebbs33 commented Jul 25, 2024 •

edited

Loading

llvmbot commented Jul 25, 2024 •

edited

Loading

huntergr-arm Jul 30, 2024 •

edited

Loading

github-actions bot commented Aug 6, 2024 •

edited

Loading