
Conversation

@choikwa (Contributor) commented Dec 2, 2025

The AMDGPU backend has poor code generation (a scalarized copy, though that is the best the compiler can do on CDNA for arbitrary vector IR) for extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc. For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> subvector user), dynamic indexing blows up compile time in GreedyRA.

Added a check on each GEP to see if it is used in a load.
Added test cases covering different numbers of elements in the subvector user.
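
For illustration, a minimal IR sketch (hypothetical kernel, modeled on the test cases added below) of the pattern in question: a <32 x i8> subvector load at a dynamic index from an alloca that promote-alloca would widen to <128 x i8>:

define amdgpu_kernel void @dyn_idx_subvector(ptr addrspace(1) %out, i32 %idx) {
entry:
  %alloca = alloca [128 x i8], align 4, addrspace(5)
  ; dynamic index: %idx is not a ConstantInt, so promotion would
  ; scalarize the subvector load into 32 extract/insertelement pairs
  %gep = getelementptr inbounds <32 x i8>, ptr addrspace(5) %alloca, i32 %idx
  %vec = load <32 x i8>, ptr addrspace(5) %gep, align 4
  store <32 x i8> %vec, ptr addrspace(1) %out, align 4
  ret void
}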

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds a threshold mechanism to prevent promoting allocas with dynamic indices when the number of vector elements exceeds a configurable limit. This addresses poor code generation and compile-time issues in the AMDGPU backend when extracting subvectors with dynamic indices from large vectors (e.g., <128 x i8> with <32 x i8> subvector users).

Key Changes:

  • Introduced a new command-line option DynIdxNumElmLimit (default: 8) to control the maximum number of elements for alloca promotion with dynamic indices
  • Added validation in GEP handling to check if dynamic indices are used in loads and reject promotion when element count exceeds the threshold
  • Added test cases demonstrating the behavior with different vector sizes (v32i8, v8i8) and non-load GEP usage

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File — Description
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp — Implements the dynamic-index element limit check in the GEP validation logic
llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll — Adds test cases verifying the threshold behavior for different vector sizes and GEP usage patterns


@llvmbot (Member) commented Dec 2, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Kevin Choi (choikwa)

Changes

The AMDGPU backend has poor code generation (a scalarized copy) for extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc. For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> subvector user), dynamic indexing blows up compile time in GreedyRA.

Added a check on each GEP to see if it is used in a load.
Added test cases covering different numbers of elements in the subvector user.


Full diff: https://github.com/llvm/llvm-project/pull/170327.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+22)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll (+80)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index bb95265a794a0..aba660ffb6e45 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -85,6 +85,11 @@ static cl::opt<unsigned>
                             "when sorting profitable allocas"),
                    cl::init(4));
 
+static cl::opt<unsigned> DynIdxNumElmLimit("dynamic-index-num-element-limit",
+    cl::desc("Maximum number of elements for promoting alloca with dynamic"
+      " index"),
+    cl::init(8));
+
 // Shared implementation which can do both promotion to vector and to LDS.
 class AMDGPUPromoteAllocaImpl {
 private:
@@ -919,6 +924,23 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(AllocaInst &Alloca) {
       Value *Index = GEPToVectorIndex(GEP, &Alloca, VecEltTy, *DL, NewGEPInsts);
       if (!Index)
         return RejectUser(Inst, "cannot compute vector index for GEP");
+
+      if (!isa<ConstantInt>(Index)) {
+        bool UsedInLoad = false;
+        for (auto *U : GEP->users()) {
+          if (isa<LoadInst>(U)) {
+            UsedInLoad = true;
+            break;
+          }
+        }
+        if (auto *UserVecTy = dyn_cast<FixedVectorType>(
+                GEP->getSourceElementType())) {
+          if (UsedInLoad && UserVecTy->getNumElements() > DynIdxNumElmLimit) {
+            return RejectUser(Inst,
+                              "user has too many elements for dynamic index");
+          }
+        }
+      }
 
       GEPVectorIdx[GEP] = Index;
       UsersToRemove.push_back(Inst);
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
index 76e1868b3c4b9..caab29b58c13f 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
@@ -3,6 +3,8 @@
 
 ; Check that invalid IR is not produced on a vector typed
 ; getelementptr with a scalar alloca pointer base.
+; Also check that a GEP with a dynamic index is rejected above a
+; threshold number of elements.
 
 define amdgpu_kernel void @scalar_alloca_ptr_with_vector_gep_offset() {
 ; CHECK-LABEL: define amdgpu_kernel void @scalar_alloca_ptr_with_vector_gep_offset() {
@@ -250,6 +252,84 @@ bb2:
   store i32 0, ptr addrspace(5) %extractelement
   ret void
 }
+
+define amdgpu_kernel void @GEP_dynamic_idx_v32i8(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_v32i8(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = alloca [64 x i8], align 4, addrspace(5)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds <16 x i8>, ptr addrspace(5) [[ALLOCA]], i32 [[IDX]]
+; CHECK-NEXT:    [[VEC:%.*]] = load <16 x i8>, ptr addrspace(5) [[GEP]], align 4
+; CHECK-NEXT:    store <16 x i8> [[VEC]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <16 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %vec = load <16 x i8>, ptr addrspace(5) %gep, align 4
+  store <16 x i8> %vec, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+define amdgpu_kernel void @GEP_dynamic_idx_v8i8(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_v8i8(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = freeze <64 x i8> poison
+; CHECK-NEXT:    [[TMP0:%.*]] = mul i32 [[IDX]], 8
+; CHECK-NEXT:    [[TMP1:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP0]]
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <8 x i8> poison, i8 [[TMP1]], i64 0
+; CHECK-NEXT:    [[TMP3:%.*]] = add i32 [[TMP0]], 1
+; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP3]]
+; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <8 x i8> [[TMP2]], i8 [[TMP4]], i64 1
+; CHECK-NEXT:    [[TMP6:%.*]] = add i32 [[TMP0]], 2
+; CHECK-NEXT:    [[TMP7:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP6]]
+; CHECK-NEXT:    [[TMP8:%.*]] = insertelement <8 x i8> [[TMP5]], i8 [[TMP7]], i64 2
+; CHECK-NEXT:    [[TMP9:%.*]] = add i32 [[TMP0]], 3
+; CHECK-NEXT:    [[TMP10:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP9]]
+; CHECK-NEXT:    [[TMP11:%.*]] = insertelement <8 x i8> [[TMP8]], i8 [[TMP10]], i64 3
+; CHECK-NEXT:    [[TMP12:%.*]] = add i32 [[TMP0]], 4
+; CHECK-NEXT:    [[TMP13:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP12]]
+; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <8 x i8> [[TMP11]], i8 [[TMP13]], i64 4
+; CHECK-NEXT:    [[TMP15:%.*]] = add i32 [[TMP0]], 5
+; CHECK-NEXT:    [[TMP16:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP15]]
+; CHECK-NEXT:    [[TMP17:%.*]] = insertelement <8 x i8> [[TMP14]], i8 [[TMP16]], i64 5
+; CHECK-NEXT:    [[TMP18:%.*]] = add i32 [[TMP0]], 6
+; CHECK-NEXT:    [[TMP19:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP18]]
+; CHECK-NEXT:    [[TMP20:%.*]] = insertelement <8 x i8> [[TMP17]], i8 [[TMP19]], i64 6
+; CHECK-NEXT:    [[TMP21:%.*]] = add i32 [[TMP0]], 7
+; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP21]]
+; CHECK-NEXT:    [[TMP23:%.*]] = insertelement <8 x i8> [[TMP20]], i8 [[TMP22]], i64 7
+; CHECK-NEXT:    store <8 x i8> [[TMP23]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <8 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %vec = load <8 x i8>, ptr addrspace(5) %gep, align 4
+  store <8 x i8> %vec, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+define amdgpu_kernel void @GEP_dynamic_idx_noload(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_noload(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = alloca [64 x i8], align 4, addrspace(5)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds <8 x i8>, ptr addrspace(5) [[ALLOCA]], i32 [[IDX]]
+; CHECK-NEXT:    [[GEPINT:%.*]] = ptrtoint ptr addrspace(5) [[GEP]] to i64
+; CHECK-NEXT:    store i64 [[GEPINT]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <8 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %gepint = ptrtoint ptr addrspace(5) %gep to i64
+  store i64 %gepint, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+
 ;.
 ; CHECK: [[META0]] = !{}
 ; CHECK: [[RNG1]] = !{i32 0, i32 1025}

@github-actions bot commented Dec 2, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

choikwa and others added 3 commits December 2, 2025 17:21
…bove a threshold on number of elements

The AMDGPU backend has poor code generation (a scalarized copy) for extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc.
For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> user), dynamic indexing blows up compile time in GreedyRA.

Added a check on each GEP to see if it is used in a load.
Added test cases covering different numbers of elements in the subvector user.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@perlfu (Contributor) left a comment

Is the fundamental limit here actually the number of elements in the GEP type, or rather the width of the GEP type in 32b VGPRs?
I guess alignment (or rather misalignment) drives the complexity explosion?

@ruiling (Contributor) commented Dec 3, 2025

dynamic indexing blows up compile time in GreedyRA

Have you done any further investigation into why it causes issues in GreedyRA?

Note that by adding another limit, we are also making the pass less useful for alloca promotion. Do you have the runtime performance and compile-time numbers with and without this change for your case? 8 sounds too small, maybe 16? (since the case you care about has 32 elements).

@choikwa (Contributor, Author) commented Dec 3, 2025


Yes, we had an MLIR testcase (SWDEV-559837) that would blow up compile time when promote-alloca tried to create a <128 x i8> with <16 x i8> users. After rejecting those cases, compile time dropped from ~2min to 0.5s in my sandbox. Investigation showed that a long chain of extract/insertelements with a dynamic index ended up creating 35x more LiveIntervals for GreedyRA to deal with, which then got bogged down in the interference check in the eviction phase.
I've discussed this with colleagues, and the hope is that this fix is surgical enough to avoid dropping runtime performance while targeting compile time. Internally we are tracking runtime performance, and we thought this change was too small to warrant a custom request.

Edit: This was a regression from SWDEV-525817, but seeing how that case needed promote-alloca to turn [16 x double] into <16 x double>, I don't expect hipBone to regress with this change.

@choikwa (Contributor, Author) commented Dec 3, 2025

Is the fundamental limit here actually the number of elements in the GEP type, or rather the width of the GEP type in 32b VGPRs? I guess alignment (or rather misalignment) drives the complexity explosion?

It looks like the IR count in SDag scales linearly with the number of elements (roughly 4x per extract/insert after legalization, etc.). The problem seems especially bad in GreedyRA, with its O(n^2) or O(n log n) interference checks, as seen in the compilation profile.

@ruiling (Contributor) commented Dec 8, 2025

Have you checked whether, in the original case, the GEP result pointer is aligned to 16 for the <16 x i8> access? If it is aligned, we can still bitcast the <128 x i8> to <8 x i128> and do insert/extract of i128 without expansion. For the unaligned case, aborting promotion may be a reasonable compromise.
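
To make the suggestion concrete, a minimal IR sketch (hypothetical functions, assuming the dynamic <16 x i8> access is 16-byte aligned so each index maps onto one i128 lane):

define <16 x i8> @extract_subvec(<128 x i8> %vec, i32 %idx) {
  ; reinterpret the promoted vector as 8 lanes of i128
  %cast = bitcast <128 x i8> %vec to <8 x i128>
  ; one dynamic extract instead of 16 scalar extracts
  %elt = extractelement <8 x i128> %cast, i32 %idx
  %sub = bitcast i128 %elt to <16 x i8>
  ret <16 x i8> %sub
}

define <128 x i8> @insert_subvec(<128 x i8> %vec, <16 x i8> %sub, i32 %idx) {
  %cast = bitcast <128 x i8> %vec to <8 x i128>
  %val = bitcast <16 x i8> %sub to i128
  ; one dynamic insert instead of 16 scalar inserts
  %ins = insertelement <8 x i128> %cast, i128 %val, i32 %idx
  %res = bitcast <8 x i128> %ins to <128 x i8>
  ret <128 x i8> %res
}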

@choikwa (Contributor, Author) commented Dec 8, 2025


From the log:

Scoring:   %11 = alloca [8 x <16 x i8>], align 16, addrspace(5)
  [+5]:   store <16 x i8> %543, ptr addrspace(5) %11, align 16
  [+9]:   store <16 x i8> %390, ptr addrspace(5) %11, align 16
  [+13]:    %426 = load <4 x i32>, ptr addrspace(5) %425, align 16
  [+9]:   %579 = load <4 x i32>, ptr addrspace(5) %578, align 16
  [+5]:   store <16 x i8> %546, ptr addrspace(5) %547, align 16
  [+5]:   store <16 x i8> %550, ptr addrspace(5) %551, align 16
  [+5]:   store <16 x i8> %554, ptr addrspace(5) %555, align 16
  [+5]:   store <16 x i8> %558, ptr addrspace(5) %559, align 16
  [+5]:   store <16 x i8> %562, ptr addrspace(5) %563, align 16
  [+5]:   store <16 x i8> %566, ptr addrspace(5) %567, align 16
  [+5]:   store <16 x i8> %570, ptr addrspace(5) %571, align 16
  [+9]:   store <16 x i8> %393, ptr addrspace(5) %394, align 16
  [+9]:   store <16 x i8> %397, ptr addrspace(5) %398, align 16
  [+9]:   store <16 x i8> %401, ptr addrspace(5) %402, align 16
  [+9]:   store <16 x i8> %405, ptr addrspace(5) %406, align 16
  [+9]:   store <16 x i8> %409, ptr addrspace(5) %410, align 16
  [+9]:   store <16 x i8> %413, ptr addrspace(5) %414, align 16
  [+9]:   store <16 x i8> %417, ptr addrspace(5) %418, align 16
  => Final Score:134
Scoring:   %12 = alloca [8 x <16 x i8>], align 16, addrspace(5)
  [+1]:   store <16 x i8> %476, ptr addrspace(5) %12, align 16
  [+5]:   store <16 x i8> %323, ptr addrspace(5) %12, align 16
  [+13]:    %424 = load <4 x i32>, ptr addrspace(5) %423, align 16
  [+9]:   %577 = load <4 x i32>, ptr addrspace(5) %576, align 16
  [+1]:   store <16 x i8> %483, ptr addrspace(5) %484, align 16
  [+1]:   store <16 x i8> %491, ptr addrspace(5) %492, align 16
  [+1]:   store <16 x i8> %499, ptr addrspace(5) %500, align 16
  [+1]:   store <16 x i8> %507, ptr addrspace(5) %508, align 16
  [+1]:   store <16 x i8> %515, ptr addrspace(5) %516, align 16
  [+1]:   store <16 x i8> %523, ptr addrspace(5) %524, align 16
  [+1]:   store <16 x i8> %531, ptr addrspace(5) %532, align 16
  [+5]:   store <16 x i8> %330, ptr addrspace(5) %331, align 16
  [+5]:   store <16 x i8> %338, ptr addrspace(5) %339, align 16
  [+5]:   store <16 x i8> %346, ptr addrspace(5) %347, align 16
  [+5]:   store <16 x i8> %354, ptr addrspace(5) %355, align 16
  [+5]:   store <16 x i8> %362, ptr addrspace(5) %363, align 16
  [+5]:   store <16 x i8> %370, ptr addrspace(5) %371, align 16
  [+5]:   store <16 x i8> %378, ptr addrspace(5) %379, align 16
  => Final Score:70
...
Trying to promote to vector:   %11 = alloca [8 x <16 x i8>], align 16, addrspace(5)
  Attempting promotion to: <128 x i8>
  Converting alloca to vector [8 x <16 x i8>] -> <128 x i8>
  Inserted PHI:   %promotealloca17 = phi <128 x i8> [ %promotealloca16, %711 ], [ %17, %141 ]
  Inserted PHI:   %promotealloca16 = phi <128 x i8> [ %675, %689 ], [ %promotealloca17, %250 ]
  Inserted PHI:   %promotealloca = phi <128 x i8> [ %1085, %1099 ], [ %promotealloca17, %732 ]
  Remaining vectorization budget:3072
...
Trying to promote to vector:   %11 = alloca [8 x <16 x i8>], align 16, addrspace(5)
  Attempting promotion to: <128 x i8>
  Converting alloca to vector [8 x <16 x i8>] -> <128 x i8>
  Inserted PHI:   %promotealloca19 = phi <128 x i8> [ %636, %1007 ], [ %16, %141 ]
  Remaining vectorization budget:2048

@choikwa (Contributor, Author) commented Dec 8, 2025

It would seem that adding a BITCAST per user is still beneficial over scalarized copies, and that work can precede this alloca-disabling check, which doesn't account for alignment. Will look into that first.

@choikwa (Contributor, Author) commented Dec 8, 2025

It seems promising. It looks like during legalization it will bitcast to <N x i64>. I'm wondering if the alignment requirement is actually necessary. I will follow up with a separate PR.

define amdgpu_kernel void @testa(ptr addrspace(1) %out, i32 %idx) {
entry:
  %alloca = freeze <128 x i8> poison
  %allocabc = bitcast <128 x i8> %alloca to <8 x i128>
  %vec = extractelement <8 x i128> %allocabc, i32 %idx
  %vecbc = bitcast i128 %vec to <16 x i8>
  store <16 x i8> %vecbc, ptr addrspace(1) %out, align 16
  ret void
}
...
before SDag:
*** IR Dump After Module Verifier (verify) ***
define amdgpu_kernel void @testa(ptr addrspace(1) %out, i32 %idx) {
entry:
  %testa.kernarg.segment = call nonnull align 16 dereferenceable(272) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
  %out.kernarg.offset1 = bitcast ptr addrspace(4) %testa.kernarg.segment to ptr addrspace(4), !amdgpu.uniform !0
  %out.load = load ptr addrspace(1), ptr addrspace(4) %out.kernarg.offset1, align 16, !invariant.load !0
  %idx.kernarg.offset = getelementptr inbounds i8, ptr addrspace(4) %testa.kernarg.segment, i64 8, !amdgpu.uniform !0
  %idx.load = load i32, ptr addrspace(4) %idx.kernarg.offset, align 8, !invariant.load !0
  %alloca = freeze <128 x i8> poison
  %allocabc = bitcast <128 x i8> %alloca to <8 x i128>
  %vec = extractelement <8 x i128> %allocabc, i32 %idx.load
  %vecbc = bitcast i128 %vec to <16 x i8>
  store <16 x i8> %vecbc, ptr addrspace(1) %out.load, align 16
  ret void
}
...
before ISEL:
Optimized legalized selection DAG: %bb.0 'testa:entry'
SelectionDAG has 22 nodes:
  t0: ch,glue = EntryToken
  t350: i32 = freeze undef:i32
    t346: i32 = and t350, Constant:i32<255>
    t347: i32 = shl t350, Constant:i32<8>
  t348: i32 = or t346, t347
    t341: i32 = and t348, Constant:i32<65535>
    t329: i32 = shl t348, Constant:i32<16>
  t330: i32 = or t341, t329
      t381: v4i32 = BUILD_VECTOR t330, t330, t330, t330
            t376: i64,ch = CopyFromReg t0, Register:i64 %6
          t14: i64 = AssertAlign<16> t376
        t373: v2i32,ch = load<(dereferenceable invariant load (s64) from %ir.out.kernarg.offset1, align 16, addrspace 4)> t0, t14, undef:i64
      t374: i64 = bitcast t373
    t52: ch = store<(store (s128) into %ir.out.load, addrspace 1)> t0, t381, t374, undef:i64
  t24: ch = AMDGPUISD::ENDPGM t52

@arsenm (Contributor) left a comment

This feels like a workaround for issues that should be addressed in codegen. Lots of uses is where this optimization will be most profitable.

if (auto *LI = dyn_cast<LoadInst>(U)) {
  if (auto *LoadVecTy = dyn_cast<FixedVectorType>(LI->getType())) {
    if (LoadVecTy->getNumElements() >
        PromoteAllocaDynamicIndexNumberElementLimit)

@choikwa (Contributor, Author) commented Dec 9, 2025

Created #171253

@choikwa choikwa changed the title [AMDGPU] Limit promoting allocas that have users with dynamic index above a threshold on number of elements [AMDGPU] Limit promoting allocas that have users with dynamic index Dec 18, 2025
@choikwa (Contributor, Author) commented Dec 31, 2025

It seems #171253 addressed the bulk of the issues I was facing. If unaligned subvector extraction becomes an issue, I can reopen this.

@choikwa choikwa closed this Dec 31, 2025