
Conversation

@choikwa (Contributor) commented Dec 2, 2025

The AMDGPU backend has poor code generation (a scalarized copy, though that is the best the compiler can do on CDNA for arbitrary vector IR) for extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc. For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> subvector user), dynamic indexing blows up compile time in GreedyRA.

Added a check on each GEP to see if it is used in a load.
Added test cases covering different numbers of elements in the subvector user.
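
For illustration, a minimal IR sketch (hypothetical kernel, modeled on the test cases added below) of the pattern in question: a <32 x i8> subvector load at a dynamic index from an alloca that promote-alloca would widen to <128 x i8>:

define amdgpu_kernel void @dyn_idx_subvector(ptr addrspace(1) %out, i32 %idx) {
entry:
  %alloca = alloca [128 x i8], align 4, addrspace(5)
  ; dynamic index: %idx is not a ConstantInt, so promotion would
  ; scalarize the subvector load into 32 extract/insertelement pairs
  %gep = getelementptr inbounds <32 x i8>, ptr addrspace(5) %alloca, i32 %idx
  %vec = load <32 x i8>, ptr addrspace(5) %gep, align 4
  store <32 x i8> %vec, ptr addrspace(1) %out, align 4
  ret void
}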

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds a threshold mechanism to prevent promoting allocas with dynamic indices when the number of vector elements exceeds a configurable limit. This addresses poor code generation and compile-time issues in the AMDGPU backend when extracting subvectors with dynamic indices from large vectors (e.g., <128 x i8> with <32 x i8> subvector users).

Key Changes:

  • Introduced a new command-line option DynIdxNumElmLimit (default: 8) to control the maximum number of elements for alloca promotion with dynamic indices
  • Added validation in GEP handling to check if dynamic indices are used in loads and reject promotion when element count exceeds the threshold
  • Added test cases demonstrating the behavior with different vector sizes (v32i8, v8i8) and non-load GEP usage

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File — Description
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp — Implements the dynamic-index element limit check in the GEP validation logic
llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll — Adds test cases verifying the threshold behavior for different vector sizes and GEP usage patterns


@llvmbot (Member) commented Dec 2, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Kevin Choi (choikwa)

Changes

The AMDGPU backend has poor code generation (a scalarized copy) for extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc. For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> subvector user), dynamic indexing blows up compile time in GreedyRA.

Added a check on each GEP to see if it is used in a load.
Added test cases covering different numbers of elements in the subvector user.


Full diff: https://github.com/llvm/llvm-project/pull/170327.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+22)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll (+80)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index bb95265a794a0..aba660ffb6e45 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -85,6 +85,11 @@ static cl::opt<unsigned>
                             "when sorting profitable allocas"),
                    cl::init(4));
 
+static cl::opt<unsigned> DynIdxNumElmLimit("dynamic-index-num-element-limit",
+    cl::desc("Maximum number of elements for promoting alloca with dynamic"
+      " index"),
+    cl::init(8));
+
 // Shared implementation which can do both promotion to vector and to LDS.
 class AMDGPUPromoteAllocaImpl {
 private:
@@ -919,6 +924,23 @@ bool AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(AllocaInst &Alloca) {
       Value *Index = GEPToVectorIndex(GEP, &Alloca, VecEltTy, *DL, NewGEPInsts);
       if (!Index)
         return RejectUser(Inst, "cannot compute vector index for GEP");
+
+      if (!isa<ConstantInt>(Index)) {
+        bool UsedInLoad = false;
+        for (auto *U : GEP->users()) {
+          if (isa<LoadInst>(U)) {
+            UsedInLoad = true;
+            break;
+          }
+        }
+        if (auto *UserVecTy = dyn_cast<FixedVectorType>(
+                GEP->getSourceElementType())) {
+          if (UsedInLoad && UserVecTy->getNumElements() > DynIdxNumElmLimit) {
+            return RejectUser(Inst,
+                              "user has too many elements for dynamic index");
+          }
+        }
+      }
 
       GEPVectorIdx[GEP] = Index;
       UsersToRemove.push_back(Inst);
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
index 76e1868b3c4b9..caab29b58c13f 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
@@ -3,6 +3,8 @@
 
 ; Check that invalid IR is not produced on a vector typed
 ; getelementptr with a scalar alloca pointer base.
+; Also check that a GEP with a dynamic index is rejected above a
+; threshold number of elements.
 
 define amdgpu_kernel void @scalar_alloca_ptr_with_vector_gep_offset() {
 ; CHECK-LABEL: define amdgpu_kernel void @scalar_alloca_ptr_with_vector_gep_offset() {
@@ -250,6 +252,84 @@ bb2:
   store i32 0, ptr addrspace(5) %extractelement
   ret void
 }
+
+define amdgpu_kernel void @GEP_dynamic_idx_v32i8(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_v32i8(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = alloca [64 x i8], align 4, addrspace(5)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds <16 x i8>, ptr addrspace(5) [[ALLOCA]], i32 [[IDX]]
+; CHECK-NEXT:    [[VEC:%.*]] = load <16 x i8>, ptr addrspace(5) [[GEP]], align 4
+; CHECK-NEXT:    store <16 x i8> [[VEC]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <16 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %vec = load <16 x i8>, ptr addrspace(5) %gep, align 4
+  store <16 x i8> %vec, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+define amdgpu_kernel void @GEP_dynamic_idx_v8i8(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_v8i8(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = freeze <64 x i8> poison
+; CHECK-NEXT:    [[TMP0:%.*]] = mul i32 [[IDX]], 8
+; CHECK-NEXT:    [[TMP1:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP0]]
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <8 x i8> poison, i8 [[TMP1]], i64 0
+; CHECK-NEXT:    [[TMP3:%.*]] = add i32 [[TMP0]], 1
+; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP3]]
+; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <8 x i8> [[TMP2]], i8 [[TMP4]], i64 1
+; CHECK-NEXT:    [[TMP6:%.*]] = add i32 [[TMP0]], 2
+; CHECK-NEXT:    [[TMP7:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP6]]
+; CHECK-NEXT:    [[TMP8:%.*]] = insertelement <8 x i8> [[TMP5]], i8 [[TMP7]], i64 2
+; CHECK-NEXT:    [[TMP9:%.*]] = add i32 [[TMP0]], 3
+; CHECK-NEXT:    [[TMP10:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP9]]
+; CHECK-NEXT:    [[TMP11:%.*]] = insertelement <8 x i8> [[TMP8]], i8 [[TMP10]], i64 3
+; CHECK-NEXT:    [[TMP12:%.*]] = add i32 [[TMP0]], 4
+; CHECK-NEXT:    [[TMP13:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP12]]
+; CHECK-NEXT:    [[TMP14:%.*]] = insertelement <8 x i8> [[TMP11]], i8 [[TMP13]], i64 4
+; CHECK-NEXT:    [[TMP15:%.*]] = add i32 [[TMP0]], 5
+; CHECK-NEXT:    [[TMP16:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP15]]
+; CHECK-NEXT:    [[TMP17:%.*]] = insertelement <8 x i8> [[TMP14]], i8 [[TMP16]], i64 5
+; CHECK-NEXT:    [[TMP18:%.*]] = add i32 [[TMP0]], 6
+; CHECK-NEXT:    [[TMP19:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP18]]
+; CHECK-NEXT:    [[TMP20:%.*]] = insertelement <8 x i8> [[TMP17]], i8 [[TMP19]], i64 6
+; CHECK-NEXT:    [[TMP21:%.*]] = add i32 [[TMP0]], 7
+; CHECK-NEXT:    [[TMP22:%.*]] = extractelement <64 x i8> [[ALLOCA]], i32 [[TMP21]]
+; CHECK-NEXT:    [[TMP23:%.*]] = insertelement <8 x i8> [[TMP20]], i8 [[TMP22]], i64 7
+; CHECK-NEXT:    store <8 x i8> [[TMP23]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <8 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %vec = load <8 x i8>, ptr addrspace(5) %gep, align 4
+  store <8 x i8> %vec, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+define amdgpu_kernel void @GEP_dynamic_idx_noload(ptr addrspace(1) %out, i32 %idx) {
+; CHECK-LABEL: define amdgpu_kernel void @GEP_dynamic_idx_noload(
+; CHECK-SAME: ptr addrspace(1) [[OUT:%.*]], i32 [[IDX:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ALLOCA:%.*]] = alloca [64 x i8], align 4, addrspace(5)
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds <8 x i8>, ptr addrspace(5) [[ALLOCA]], i32 [[IDX]]
+; CHECK-NEXT:    [[GEPINT:%.*]] = ptrtoint ptr addrspace(5) [[GEP]] to i64
+; CHECK-NEXT:    store i64 [[GEPINT]], ptr addrspace(1) [[OUT]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %alloca = alloca [64 x i8], align 4, addrspace(5)
+  %gep = getelementptr inbounds <8 x i8>, ptr addrspace(5) %alloca, i32 %idx
+  %gepint = ptrtoint ptr addrspace(5) %gep to i64
+  store i64 %gepint, ptr addrspace(1) %out, align 4
+  ret void
+}
+
+
 ;.
 ; CHECK: [[META0]] = !{}
 ; CHECK: [[RNG1]] = !{i32 0, i32 1025}

@github-actions bot commented Dec 2, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

choikwa and others added 3 commits December 2, 2025 17:21
…bove a threshold on number of elements

The AMDGPU backend has poor code generation (a scalarized copy) for extracting subvectors with a dynamic index, which can hurt compile time, register pressure, etc.
For vectors with a large number of elements (e.g. <128 x i8> with a <32 x i8> user), dynamic indexing blows up compile time in GreedyRA.

Added a check on each GEP to see if it is used in a load.
Added test cases covering different numbers of elements in the subvector user.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@perlfu (Contributor) left a comment

Is the fundamental limit here actually the number of elements in the GEP type, or rather the width of the GEP type in 32b VGPRs?
I guess alignment (or rather misalignment) drives the complexity explosion?

@ruiling (Contributor) commented Dec 3, 2025

dynamic indexing blows up compile time in GreedyRA

Have you done any further investigation into why it causes issues in GreedyRA?

Note that by adding another limit, we are also making the pass less useful for alloca promotion. Do you have the runtime performance and compile-time numbers with and without this change for your case? 8 sounds too small, maybe 16? (since the case you care about has 32 elements).

@choikwa (Contributor, Author) commented Dec 3, 2025


Yes, we had an MLIR testcase (SWDEV-559837) that would blow up compile time when promote-alloca tried to create a <128 x i8> with <16 x i8> users. After rejecting those cases, compile time dropped from ~2min to 0.5s in my sandbox. Investigation showed that a long chain of extract/insertelements with a dynamic index ended up creating 35x more LiveIntervals for GreedyRA to deal with, which then got bogged down in the interference check in the eviction phase.
I've discussed this with colleagues, and the hope is that this fix is surgical enough to avoid dropping runtime performance while targeting compile time. Internally we are tracking runtime performance, and we thought this change was too small to warrant a custom request.

Edit: This was a regression from SWDEV-525817, but seeing how that case needed promote-alloca to turn [16 x double] into <16 x double>, I don't expect hipBone to regress with this change.

@choikwa (Contributor, Author) commented Dec 3, 2025

Is the fundamental limit here actually the number of elements in the GEP type, or rather the width of the GEP type in 32b VGPRs? I guess alignment (or rather misalignment) drives the complexity explosion?

It looks like the IR count in SDag scales linearly with the number of elements (roughly 4x per extract/insert after legalization, etc.). The problem seems especially bad in GreedyRA, with its O(n^2) or O(n log n) interference checks, as seen in the compilation profile.

@ruiling (Contributor) commented Dec 8, 2025

Have you checked whether, in the original case, the GEP result pointer is aligned to 16 for the <16 x i8> access? If it is aligned, we can still bitcast the <128 x i8> to <8 x i128> and do insert/extract of i128 without expansion. For the unaligned case, aborting promotion may be a reasonable compromise.
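
To make the suggestion concrete, a minimal IR sketch (hypothetical functions, assuming the dynamic <16 x i8> access is 16-byte aligned so each index maps onto one i128 lane):

define <16 x i8> @extract_subvec(<128 x i8> %vec, i32 %idx) {
  ; reinterpret the promoted vector as 8 lanes of i128
  %cast = bitcast <128 x i8> %vec to <8 x i128>
  ; one dynamic extract instead of 16 scalar extracts
  %elt = extractelement <8 x i128> %cast, i32 %idx
  %sub = bitcast i128 %elt to <16 x i8>
  ret <16 x i8> %sub
}

define <128 x i8> @insert_subvec(<128 x i8> %vec, <16 x i8> %sub, i32 %idx) {
  %cast = bitcast <128 x i8> %vec to <8 x i128>
  %val = bitcast <16 x i8> %sub to i128
  ; one dynamic insert instead of 16 scalar inserts
  %ins = insertelement <8 x i128> %cast, i128 %val, i32 %idx
  %res = bitcast <8 x i128> %ins to <128 x i8>
  ret <128 x i8> %res
}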

@choikwa (Contributor, Author) commented Dec 8, 2025


From the log:

Scoring:   %11 = alloca [8 x <16 x i8>], align 16, addrspace(5)
  [+5]:   store <16 x i8> %543, ptr addrspace(5) %11, align 16
  [+9]:   store <16 x i8> %390, ptr addrspace(5) %11, align 16
  [+13]:    %426 = load <4 x i32>, ptr addrspace(5) %425, align 16
  [+9]:   %579 = load <4 x i32>, ptr addrspace(5) %578, align 16
  [+5]:   store <16 x i8> %546, ptr addrspace(5) %547, align 16
  [+5]:   store <16 x i8> %550, ptr addrspace(5) %551, align 16
  [+5]:   store <16 x i8> %554, ptr addrspace(5) %555, align 16
  [+5]:   store <16 x i8> %558, ptr addrspace(5) %559, align 16
  [+5]:   store <16 x i8> %562, ptr addrspace(5) %563, align 16
  [+5]:   store <16 x i8> %566, ptr addrspace(5) %567, align 16
  [+5]:   store <16 x i8> %570, ptr addrspace(5) %571, align 16
  [+9]:   store <16 x i8> %393, ptr addrspace(5) %394, align 16
  [+9]:   store <16 x i8> %397, ptr addrspace(5) %398, align 16
  [+9]:   store <16 x i8> %401, ptr addrspace(5) %402, align 16
  [+9]:   store <16 x i8> %405, ptr addrspace(5) %406, align 16
  [+9]:   store <16 x i8> %409, ptr addrspace(5) %410, align 16
  [+9]:   store <16 x i8> %413, ptr addrspace(5) %414, align 16
  [+9]:   store <16 x i8> %417, ptr addrspace(5) %418, align 16
  => Final Score:134
Scoring:   %12 = alloca [8 x <16 x i8>], align 16, addrspace(5)
  [+1]:   store <16 x i8> %476, ptr addrspace(5) %12, align 16
  [+5]:   store <16 x i8> %323, ptr addrspace(5) %12, align 16
  [+13]:    %424 = load <4 x i32>, ptr addrspace(5) %423, align 16
  [+9]:   %577 = load <4 x i32>, ptr addrspace(5) %576, align 16
  [+1]:   store <16 x i8> %483, ptr addrspace(5) %484, align 16
  [+1]:   store <16 x i8> %491, ptr addrspace(5) %492, align 16
  [+1]:   store <16 x i8> %499, ptr addrspace(5) %500, align 16
  [+1]:   store <16 x i8> %507, ptr addrspace(5) %508, align 16
  [+1]:   store <16 x i8> %515, ptr addrspace(5) %516, align 16
  [+1]:   store <16 x i8> %523, ptr addrspace(5) %524, align 16
  [+1]:   store <16 x i8> %531, ptr addrspace(5) %532, align 16
  [+5]:   store <16 x i8> %330, ptr addrspace(5) %331, align 16
  [+5]:   store <16 x i8> %338, ptr addrspace(5) %339, align 16
  [+5]:   store <16 x i8> %346, ptr addrspace(5) %347, align 16
  [+5]:   store <16 x i8> %354, ptr addrspace(5) %355, align 16
  [+5]:   store <16 x i8> %362, ptr addrspace(5) %363, align 16
  [+5]:   store <16 x i8> %370, ptr addrspace(5) %371, align 16
  [+5]:   store <16 x i8> %378, ptr addrspace(5) %379, align 16
  => Final Score:70
...
Trying to promote to vector:   %11 = alloca [8 x <16 x i8>], align 16, addrspace(5)
  Attempting promotion to: <128 x i8>
  Converting alloca to vector [8 x <16 x i8>] -> <128 x i8>
  Inserted PHI:   %promotealloca17 = phi <128 x i8> [ %promotealloca16, %711 ], [ %17, %141 ]
  Inserted PHI:   %promotealloca16 = phi <128 x i8> [ %675, %689 ], [ %promotealloca17, %250 ]
  Inserted PHI:   %promotealloca = phi <128 x i8> [ %1085, %1099 ], [ %promotealloca17, %732 ]
  Remaining vectorization budget:3072
...
Trying to promote to vector:   %11 = alloca [8 x <16 x i8>], align 16, addrspace(5)
  Attempting promotion to: <128 x i8>
  Converting alloca to vector [8 x <16 x i8>] -> <128 x i8>
  Inserted PHI:   %promotealloca19 = phi <128 x i8> [ %636, %1007 ], [ %16, %141 ]
  Remaining vectorization budget:2048

@choikwa (Contributor, Author) commented Dec 8, 2025

It would seem that adding a BITCAST per user is still beneficial over scalarized copies, and that work can precede this alloca-disabling check, which doesn't account for alignment. Will look into that first.

@choikwa (Contributor, Author) commented Dec 8, 2025

It seems promising. It looks like during legalization it will bitcast to <N x i64>. I'm wondering if the alignment requirement is actually necessary. I will follow up with a separate PR.

define amdgpu_kernel void @testa(ptr addrspace(1) %out, i32 %idx) {
entry:
  %alloca = freeze <128 x i8> poison
  %allocabc = bitcast <128 x i8> %alloca to <8 x i128>
  %vec = extractelement <8 x i128> %allocabc, i32 %idx
  %vecbc = bitcast i128 %vec to <16 x i8>
  store <16 x i8> %vecbc, ptr addrspace(1) %out, align 16
  ret void
}
...
before SDag:
*** IR Dump After Module Verifier (verify) ***
define amdgpu_kernel void @testa(ptr addrspace(1) %out, i32 %idx) {
entry:
  %testa.kernarg.segment = call nonnull align 16 dereferenceable(272) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
  %out.kernarg.offset1 = bitcast ptr addrspace(4) %testa.kernarg.segment to ptr addrspace(4), !amdgpu.uniform !0
  %out.load = load ptr addrspace(1), ptr addrspace(4) %out.kernarg.offset1, align 16, !invariant.load !0
  %idx.kernarg.offset = getelementptr inbounds i8, ptr addrspace(4) %testa.kernarg.segment, i64 8, !amdgpu.uniform !0
  %idx.load = load i32, ptr addrspace(4) %idx.kernarg.offset, align 8, !invariant.load !0
  %alloca = freeze <128 x i8> poison
  %allocabc = bitcast <128 x i8> %alloca to <8 x i128>
  %vec = extractelement <8 x i128> %allocabc, i32 %idx.load
  %vecbc = bitcast i128 %vec to <16 x i8>
  store <16 x i8> %vecbc, ptr addrspace(1) %out.load, align 16
  ret void
}
...
before ISEL:
Optimized legalized selection DAG: %bb.0 'testa:entry'
SelectionDAG has 22 nodes:
  t0: ch,glue = EntryToken
  t350: i32 = freeze undef:i32
    t346: i32 = and t350, Constant:i32<255>
    t347: i32 = shl t350, Constant:i32<8>
  t348: i32 = or t346, t347
    t341: i32 = and t348, Constant:i32<65535>
    t329: i32 = shl t348, Constant:i32<16>
  t330: i32 = or t341, t329
      t381: v4i32 = BUILD_VECTOR t330, t330, t330, t330
            t376: i64,ch = CopyFromReg t0, Register:i64 %6
          t14: i64 = AssertAlign<16> t376
        t373: v2i32,ch = load<(dereferenceable invariant load (s64) from %ir.out.kernarg.offset1, align 16, addrspace 4)> t0, t14, undef:i64
      t374: i64 = bitcast t373
    t52: ch = store<(store (s128) into %ir.out.load, addrspace 1)> t0, t381, t374, undef:i64
  t24: ch = AMDGPUISD::ENDPGM t52

@arsenm (Contributor) left a comment

This feels like a workaround for issues that should be addressed in codegen. Lots of uses is where this optimization will be most profitable.

if (auto *LI = dyn_cast<LoadInst>(U)) {
  if (auto *LoadVecTy = dyn_cast<FixedVectorType>(LI->getType())) {
    if (LoadVecTy->getNumElements() >
        PromoteAllocaDynamicIndexNumberElementLimit)

@choikwa (Contributor, Author) commented Dec 9, 2025

Created #171253

@choikwa choikwa changed the title [AMDGPU] Limit promoting allocas that have users with dynamic index above a threshold on number of elements [AMDGPU] Limit promoting allocas that have users with dynamic index Dec 18, 2025
@choikwa (Contributor, Author) commented Dec 31, 2025

It seems #171253 addressed the bulk of the issues I was facing. If unaligned subvector extraction becomes an issue, I can reopen this.

@choikwa choikwa closed this Dec 31, 2025