Skip to content

Conversation

@sebpop
Copy link
Contributor

@sebpop sebpop commented Sep 3, 2025

Add detailed comments explaining why each function should/shouldn't be unroll-and-jammed based on memory access patterns and dependencies.

Fix loop bounds to ensure array accesses are within array bounds:

  • sub_sub_less: j starts from 1 (not 0) to ensure j-1 >= 0
  • sub_sub_less_3d: k starts from 1 (not 0) to ensure k-1 >= 0
  • sub_sub_outer_scalar: j starts from 1 (not 0) to ensure j-1 >= 0

Copy link
Collaborator

@sjoerdmeijer sjoerdmeijer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good fix to me.

Comment on lines 61 to 63
; No dependency conflict: A[i+1][j] from iteration (i,j) doesn't conflict with
; any A[i'][j'] from unrolled j iterations since j' values are different and
; i+1 from current doesn't overlap with i' from unrolled iterations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sum = 0;
for (int i = 0; i < N; ++i) // N > 0
  for (int j = 0; j < N; ++j) 
    sum += i * B[j];
    A[i][j] = 1;
    A[i+1][j] = sum;

unrolled j iterations

The i (outer loop) iterations are unrolled with unroll-and-jam, not the j iterations.

since j' values are different and i+1 from current doesn't overlap with i' from unrolled iteration

A[i+1][j] and A[i'][j] with i'==i+1 (next iteration that will be in the same body ofter unroll-and-jam) do point to the same memory, so there seems to be a hazard between the iterations. It is just there is no iteration from [i,j+1] to [i,N-1], and [i+1,0] to [i+1,j-1] (iterations that execute between [i,j] and [i',j] in the original loop) do access A[i+1][j] or A[i+2][j]. Maybe this is saying this, but I have difficulty understanding this explanation.

Factor-2 unroll:

sum0 = 0;
sum1 = 0;
for (int i = 0; i < N; i+=2)
  for (int j = 0; j < N; ++j) 
    sum0 += i * B[j];
    A[i][j] = 1;
    A[i + 1][j] = sum0;
    
    sum1 += (i+1) * B[j];
    A[i+1][j] = 1;
    A[i+2][j] = sum1;

Copy link
Contributor

@kasuga-fj kasuga-fj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sebpop Are you planning to proceed with this PR? I was planning to wait to land #161822 until this one is merged, since we're modifying the same test. But if it's going to take a while, I'd like to merge #161822 first.

Comment on lines 210 to 218
; sub_sub_outer_scalar should NOT be unroll-and-jammed due to a loop-carried dependency.
; Memory accesses:
; - load from A[j][k] (read from current j iteration)
; - store to A[j-1][k] (write to previous j iteration)
; The dependency: reading A[j][k] and writing A[j-1][k] creates a backward
; dependency in the j dimension. The test attempts to unroll-and-jam the j loop
; with the k loop being jammed. When this happens, iterations j, j+1, j+2, j+3
; would be unrolled and their k loops jammed together, but j+1's write to A[j][k]
; would conflict with j's read from A[j][k], violating sequential semantics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (int i = 0; i < 100; i++)
  for (int j = 1; j < 100; j++)
    for (int k = 0; k < 100; k++) {
      ... = A[j][k];
      A[j-1][k] = ...;
    }

I think it's safe to unroll-and-jam the j-loop. It's true that the store in iteration (j+1, k) will access to the same location as the load in iteration (j, k), but the execution order is preserved.

// Unroll factor is 2
for (int i = 0; i < 100; i++) {
  for (int j = 1; j < 100; j+= 2)
    for (int k = 0; k < 100; k++) {
      ... = A[j][k];
      A[j-1][k] = ...;

      ... = A[j+1][k];
      A[j][k] = ...;
    }

    ... // remainder (j=99)
}

Unroll-and-jam is not applied now since delinearization doesn't work as expected. As you can see, the address calculations (%arrayidx7 and arrayidx9) are split into two GEPs, not a single one.

…ds (NFC)

- Add detailed comments explaining why each function should/shouldn't be
  unroll-and-jammed based on memory access patterns and dependencies.
- Fix loop bounds to ensure array accesses are within array bounds:
  * sub_sub_less: j starts from 1 (not 0) to ensure j-1 >= 0
  * sub_sub_less_3d: k starts from 1 (not 0) to ensure k-1 >= 0
  * sub_sub_outer_scalar: j starts from 1 (not 0) to ensure j-1 >= 0
@llvmbot
Copy link
Member

llvmbot commented Nov 21, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Sebastian Pop (sebpop)

Changes

Add detailed comments explaining why each function should/shouldn't be unroll-and-jammed based on memory access patterns and dependencies.

Fix loop bounds to ensure array accesses are within array bounds:

  • sub_sub_less: j starts from 1 (not 0) to ensure j-1 >= 0
  • sub_sub_less_3d: k starts from 1 (not 0) to ensure k-1 >= 0
  • sub_sub_outer_scalar: j starts from 1 (not 0) to ensure j-1 >= 0

Full diff: https://github.com/llvm/llvm-project/pull/156578.diff

1 Files Affected:

  • (modified) llvm/test/Transforms/LoopUnrollAndJam/dependencies_multidims.ll (+101-17)
diff --git a/llvm/test/Transforms/LoopUnrollAndJam/dependencies_multidims.ll b/llvm/test/Transforms/LoopUnrollAndJam/dependencies_multidims.ll
index 76c8a32ce116c..c8f991cdf36fa 100644
--- a/llvm/test/Transforms/LoopUnrollAndJam/dependencies_multidims.ll
+++ b/llvm/test/Transforms/LoopUnrollAndJam/dependencies_multidims.ll
@@ -1,15 +1,17 @@
 ; RUN: opt -da-disable-delinearization-checks -passes=loop-unroll-and-jam -allow-unroll-and-jam -unroll-and-jam-count=4 < %s -S | FileCheck %s
 ; RUN: opt -da-disable-delinearization-checks -aa-pipeline=basic-aa -passes='loop-unroll-and-jam' -allow-unroll-and-jam -unroll-and-jam-count=4 < %s -S | FileCheck %s
 
-target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
-
-; XFAIL: *
-; The transformation seems to have succeeded "accidentally". It should be fixed
-; by PR #156578.
-
 ; CHECK-LABEL: sub_sub_less
 ; CHECK: %j = phi
 ; CHECK-NOT: %j.1 = phi
+;
+; sub_sub_less should NOT be unroll-and-jammed due to a loop-carried dependency.
+; Memory accesses:
+;   - A[i][j] = 1        (write to current iteration)
+;   - A[i+1][j-1] = add  (write to next i iteration, previous j iteration)
+; The dependency: A[i+1][j-1] from iteration (i,j) may conflict with A[i'][j']
+; from a later iteration when i'=i+1 and j'=j-1, creating a backward dependency
+; in the j dimension that prevents safe unroll-and-jam.
 define void @sub_sub_less(ptr noalias nocapture %A, i32 %N, ptr noalias nocapture readonly %B) {
 entry:
   %cmp = icmp sgt i32 %N, 0
@@ -20,7 +22,7 @@ for.outer:
   br label %for.inner
 
 for.inner:
-  %j = phi i32 [ %add6, %for.inner ], [ 0, %for.outer ]
+  %j = phi i32 [ %add6, %for.inner ], [ 1, %for.outer ]
   %sum = phi i32 [ %add, %for.inner ], [ 0, %for.outer ]
   %arrayidx5 = getelementptr inbounds i32, ptr %B, i32 %j
   %0 = load i32, ptr %arrayidx5, align 4
@@ -51,6 +53,31 @@ cleanup:
 ; CHECK: %j.1 = phi
 ; CHECK: %j.2 = phi
 ; CHECK: %j.3 = phi
+;
+; sub_sub_eq SHOULD be unroll-and-jammed (count=4) as it's safe.
+; Memory accesses:
+;   - A[i][j] = 1      (write to current iteration)
+;   - A[i+1][j] = add  (write to next i iteration, same j iteration)
+; No dependency conflict: When unroll-and-jamming with count=4, the i loop
+; iterations (i, i+1, i+2, i+3) are unrolled and their j loops are jammed
+; together. Unroll-and-jam factor 4:
+;
+; for (int i = 0; i < N; i += 4)
+;   for (int j = 0; j < N; ++j) {
+;     // i iteration
+;     A[i][j] = 1;     A[i+1][j] = sum_i;
+;     // i+1 iteration
+;     A[i+1][j] = 1;   A[i+2][j] = sum_i1;
+;     // i+2 iteration
+;     A[i+2][j] = 1;   A[i+3][j] = sum_i2;
+;     // i+3 iteration
+;     A[i+3][j] = 1;   A[i+4][j] = sum_i3;
+;   }
+;
+; A[i+1][j] from iteration i doesn't conflict with A[i'][j'] from unrolled
+; iterations since each unrolled i iteration accesses its own row i+1, i+2, i+3.
+; j' values are identical, but accesses happen to different rows in the same j
+; iteration before moving to the next j value.
 define void @sub_sub_eq(ptr noalias nocapture %A, i32 %N, ptr noalias nocapture readonly %B) {
 entry:
   %cmp = icmp sgt i32 %N, 0
@@ -92,6 +119,29 @@ cleanup:
 ; CHECK: %j.1 = phi
 ; CHECK: %j.2 = phi
 ; CHECK: %j.3 = phi
+;
+; sub_sub_more SHOULD be unroll-and-jammed (count=4) as it's safe.
+; Memory accesses:
+;   - A[i][j] = 1        (write to current iteration)
+;   - A[i+1][j+1] = add  (write to next i iteration, next j iteration)
+; No dependency conflict: The forward dependency pattern (j+1 in i dimension)
+; is safe. Unroll-and-jam factor 4:
+;
+; for (int i = 0; i < N; i += 4)
+;   for (int j = 0; j < N; ++j) {
+;     // i iteration
+;     A[i][j] = 1;     A[i+1][j+1] = sum_i;
+;     // i+1 iteration
+;     A[i+1][j] = 1;   A[i+2][j+1] = sum_i1;
+;     // i+2 iteration
+;     A[i+2][j] = 1;   A[i+3][j+1] = sum_i2;
+;     // i+3 iteration
+;     A[i+3][j] = 1;   A[i+4][j+1] = sum_i3;
+;   }
+;
+; A[i+1][j+1] from iteration i accesses row i+1 and column j+1, which is
+; disjoint from the accesses in the same iteration. The forward dependency
+; pattern doesn't create conflicts between unrolled i iterations.
 define void @sub_sub_more(ptr noalias nocapture %A, i32 %N, ptr noalias nocapture readonly %B) {
 entry:
   %cmp = icmp sgt i32 %N, 0
@@ -130,12 +180,21 @@ cleanup:
 ; CHECK-LABEL: sub_sub_less_3d
 ; CHECK: %k = phi
 ; CHECK-NOT: %k.1 = phi
-
+;
+; sub_sub_less_3d should NOT be unroll-and-jammed due to a loop-carried dependency.
+; Memory accesses:
+;   - A3d[i][j][k] = 0     (write to current iteration)
+;   - A3d[i+1][j][k-1] = 0 (write to next i iteration, previous k iteration)
+; The dependency: A[i+1][j][k-1] from iteration (i,j,k) may conflict with
+; A[i'][j'][k'] from a later iteration when i'=i+1 and k'=k-1, creating a
+; backward dependency in the k dimension that prevents safe unroll-and-jam.
+; This is a 3D version of the same pattern as sub_sub_less.
+;
 ; for (long i = 0; i < 100; ++i)
 ;   for (long j = 0; j < 100; ++j)
-;     for (long k = 0; k < 100; ++k) {
-;       A[i][j][k] = 0;
-;       A[i+1][j][k-1] = 0;
+;     for (long k = 1; k < 100; ++k) {
+;       A[i][j][k] = 5;
+;       A[i+1][j][k-1] = 10;
 ;     }
 
 define void @sub_sub_less_3d(ptr noalias %A) {
@@ -151,13 +210,13 @@ for.j:
   br label %for.k
 
 for.k:
-  %k = phi i32 [ 0, %for.j ], [ %inc.k, %for.k ]
+  %k = phi i32 [ 1, %for.j ], [ %inc.k, %for.k ]
   %arrayidx = getelementptr inbounds [100 x [100 x i32]], ptr %A, i32 %i, i32 %j, i32 %k
-  store i32 0, ptr %arrayidx, align 4
+  store i32 5, ptr %arrayidx, align 4
   %add.i = add nsw i32 %i, 1
   %sub.k = add nsw i32 %k, -1
   %arrayidx2 = getelementptr inbounds [100 x [100 x i32]], ptr %A, i32 %add.i, i32 %j, i32 %sub.k
-  store i32 0, ptr %arrayidx2, align 4
+  store i32 10, ptr %arrayidx2, align 4
   %inc.k = add nsw i32 %k, 1
   %cmp.k = icmp slt i32 %inc.k, 100
   br i1 %cmp.k, label %for.k, label %for.j.latch
@@ -178,8 +237,33 @@ for.end:
 
 ; CHECK-LABEL: sub_sub_outer_scalar
 ; CHECK: %k = phi
-; CHECK-NOT: %k.1 = phi
-
+; CHECK: %k.1 = phi
+; CHECK: %k.2 = phi
+; CHECK: %k.3 = phi
+;
+; sub_sub_outer_scalar SHOULD be unroll-and-jammed (count=4) as it's safe.
+; Memory accesses:
+;   - load from A[j][k]    (read from current j iteration)
+;   - store to A[j-1][k]   (write to previous j iteration)
+; The dependency: reading A[j][k] and writing A[j-1][k] creates a backward
+; dependency, but execution order is preserved. Unroll-and-jam factor 4:
+;
+; for (int i = 0; i < 100; i++)
+;   for (int j = 1; j < 100; j += 4)
+;     for (int k = 0; k < 100; k++) {
+;       // j iteration
+;       temp0 = A[j][k];     A[j-1][k] = temp0;
+;       // j+1 iteration
+;       temp1 = A[j+1][k];   A[j][k] = temp1;
+;       // j+2 iteration
+;       temp2 = A[j+2][k];   A[j+1][k] = temp2;
+;       // j+3 iteration
+;       temp3 = A[j+3][k];   A[j+2][k] = temp3;
+;     }
+;
+; All k iterations for each j iteration (including j+1, j+2, j+3) are completed
+; before moving to the next j group, so j+1's store to A[j][k] doesn't conflict
+; with j's load from A[j][k] because they happen in different k loop invocations.
 define void @sub_sub_outer_scalar(ptr %A) {
 entry:
   br label %for.i
@@ -189,7 +273,7 @@ for.i:
   br label %for.j
 
 for.j:
-  %j = phi i64 [ 0, %for.i ], [ %inc.j, %for.j.latch ]
+  %j = phi i64 [ 1, %for.i ], [ %inc.j, %for.j.latch ]
   br label %for.k
 
 for.k:

@github-actions
Copy link

github-actions bot commented Nov 21, 2025

🐧 Linux x64 Test Results

  • 186453 tests passed
  • 4868 tests skipped

@sebpop sebpop merged commit 427c182 into llvm:main Nov 21, 2025
10 checks passed
@sebpop sebpop deleted the unj-doc branch November 21, 2025 20:19
aadeshps-mcw pushed a commit to aadeshps-mcw/llvm-project that referenced this pull request Nov 26, 2025
…ds (NFC) (llvm#156578)

Add detailed comments explaining why each function should/shouldn't be
unroll-and-jammed based on memory access patterns and dependencies.

Fix loop bounds to ensure array accesses are within array bounds:
  * sub_sub_less: j starts from 1 (not 0) to ensure j-1 >= 0
  * sub_sub_less_3d: k starts from 1 (not 0) to ensure k-1 >= 0
  * sub_sub_outer_scalar: j starts from 1 (not 0) to ensure j-1 >= 0
Priyanshu3820 pushed a commit to Priyanshu3820/llvm-project that referenced this pull request Nov 26, 2025
…ds (NFC) (llvm#156578)

Add detailed comments explaining why each function should/shouldn't be
unroll-and-jammed based on memory access patterns and dependencies.

Fix loop bounds to ensure array accesses are within array bounds:
  * sub_sub_less: j starts from 1 (not 0) to ensure j-1 >= 0
  * sub_sub_less_3d: k starts from 1 (not 0) to ensure k-1 >= 0
  * sub_sub_outer_scalar: j starts from 1 (not 0) to ensure j-1 >= 0
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Dec 3, 2025
…ds (NFC) (llvm#156578)

Add detailed comments explaining why each function should/shouldn't be
unroll-and-jammed based on memory access patterns and dependencies.

Fix loop bounds to ensure array accesses are within array bounds:
  * sub_sub_less: j starts from 1 (not 0) to ensure j-1 >= 0
  * sub_sub_less_3d: k starts from 1 (not 0) to ensure k-1 >= 0
  * sub_sub_outer_scalar: j starts from 1 (not 0) to ensure j-1 >= 0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants