Conversation

choikwa (Contributor) commented Dec 9, 2025

Investigation revealed that the scalarized copy results in a long chain of extract/insert elements, which can explode the number of generated temporaries in the AMDGPU backend, as there is no efficient representation for extracting a subvector at a dynamic index. Using identity bitcasts reduces the number of extract/insert elements down to 1 and produces much smaller, more efficient generated code.

Credit: ruiling in #171253
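
For illustration, here is roughly what the pass emits now for a dynamic-index <4 x i16> read from a <16 x i16> promoted alloca, mirroring the updated promote-alloca-subvecs.ll check lines (value names are illustrative), instead of four extractelement/insertelement pairs:

  %bc  = bitcast <16 x i16> %promoted to <4 x i64>
  %qix = udiv i32 %index, 4
  %elt = extractelement <4 x i64> %bc, i32 %qix
  %sub = bitcast i64 %elt to <4 x i16>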

llvmbot (Member) commented Dec 9, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Kevin Choi (choikwa)

Changes

Investigation revealed that the scalarized copy results in a long chain of extract/insert elements, which can explode the number of generated temporaries in the AMDGPU backend, as there is no efficient representation for extracting a subvector at a dynamic index. Using identity bitcasts reduces the number of extract/insert elements down to 1 and produces much smaller, more efficient generated code.


Patch is 29.51 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/171253.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+30)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll (+5-12)
  • (added) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll (+497)
  • (added) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts.ll (+82)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index b79689c39ef84..b3de42ff7612a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -42,6 +42,7 @@
 #include "llvm/IR/PatternMatch.h"
 #include "llvm/InitializePasses.h"
 #include "llvm/Pass.h"
+#include "llvm/Support/MathExtras.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Transforms/Utils/SSAUpdater.h"
 
@@ -556,6 +557,35 @@ static Value *promoteAllocaUserToVector(
       auto *SubVecTy = FixedVectorType::get(VecEltTy, NumLoadedElts);
       assert(DL.getTypeStoreSize(SubVecTy) == DL.getTypeStoreSize(AccessTy));
 
+      // If idx is dynamic, then sandwich load with bitcasts.
+      // ie. <64 x i8> -> <16 x i8>  instead do
+      //     <64 x i8> -> <4 x i128> -> i128 -> <16 x i8>
+      // Extracting subvector with dynamic index has very large expansion in
+      // the amdgpu backend. Limit to pow2 for UDiv.
+      if (!isa<ConstantInt>(Index) && SubVecTy->isIntOrIntVectorTy() &&
+          llvm::isPowerOf2_32(VectorTy->getNumElements()) &&
+          llvm::isPowerOf2_32(SubVecTy->getNumElements())) {
+        IntegerType *NewElemType = Builder.getIntNTy(
+            SubVecTy->getScalarSizeInBits() * SubVecTy->getNumElements());
+        const unsigned NewNumElts =
+            VectorTy->getNumElements() * VectorTy->getScalarSizeInBits() /
+              NewElemType->getScalarSizeInBits();
+        const unsigned IndexDivisor = VectorTy->getNumElements() / NewNumElts;
+        assert(VectorTy->getScalarSizeInBits() <
+            NewElemType->getScalarSizeInBits() &&
+            "New element type should be bigger");
+        assert(IndexDivisor > 0u && "Zero index divisor");
+        FixedVectorType *BitCastType =
+            FixedVectorType::get(NewElemType, NewNumElts);
+        Value *BCVal = Builder.CreateBitCast(CurVal, BitCastType);
+        Value *NewIdx = Builder.CreateUDiv(Index,
+            ConstantInt::get(Index->getType(), IndexDivisor));
+        Value *ExtVal = Builder.CreateExtractElement(BCVal, NewIdx);
+        Value *BCOut = Builder.CreateBitCast(ExtVal, SubVecTy);
+        Inst->replaceAllUsesWith(BCOut);
+        return nullptr;
+      }
+
       Value *SubVec = PoisonValue::get(SubVecTy);
       for (unsigned K = 0; K < NumLoadedElts; ++K) {
         Value *CurIdx =
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
index 62a04f3a6f86f..5aa09ae74ec36 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
@@ -436,18 +436,11 @@ define <4 x i16> @nonconst_indexes(i1 %cond, i32 %otheridx, <4 x i16> %store) #0
 ; CHECK-NEXT:    [[TMP8:%.*]] = add i32 [[INDEX_1]], 3
 ; CHECK-NEXT:    [[TMP9:%.*]] = extractelement <4 x i16> [[STORE]], i64 3
 ; CHECK-NEXT:    [[TMP10:%.*]] = insertelement <16 x i16> [[TMP7]], i16 [[TMP9]], i32 [[TMP8]]
-; CHECK-NEXT:    [[TMP11:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[INDEX_2]]
-; CHECK-NEXT:    [[TMP12:%.*]] = insertelement <4 x i16> poison, i16 [[TMP11]], i64 0
-; CHECK-NEXT:    [[TMP13:%.*]] = add i32 [[INDEX_2]], 1
-; CHECK-NEXT:    [[TMP14:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[TMP13]]
-; CHECK-NEXT:    [[TMP15:%.*]] = insertelement <4 x i16> [[TMP12]], i16 [[TMP14]], i64 1
-; CHECK-NEXT:    [[TMP16:%.*]] = add i32 [[INDEX_2]], 2
-; CHECK-NEXT:    [[TMP17:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[TMP16]]
-; CHECK-NEXT:    [[TMP18:%.*]] = insertelement <4 x i16> [[TMP15]], i16 [[TMP17]], i64 2
-; CHECK-NEXT:    [[TMP19:%.*]] = add i32 [[INDEX_2]], 3
-; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[TMP19]]
-; CHECK-NEXT:    [[TMP21:%.*]] = insertelement <4 x i16> [[TMP18]], i16 [[TMP20]], i64 3
-; CHECK-NEXT:    ret <4 x i16> [[TMP21]]
+; CHECK-NEXT:    [[TMP11:%.*]] = bitcast <16 x i16> [[TMP10]] to <4 x i64>
+; CHECK-NEXT:    [[TMP12:%.*]] = udiv i32 [[INDEX_2]], 4
+; CHECK-NEXT:    [[TMP13:%.*]] = extractelement <4 x i64> [[TMP11]], i32 [[TMP12]]
+; CHECK-NEXT:    [[TMP14:%.*]] = bitcast i64 [[TMP13]] to <4 x i16>
+; CHECK-NEXT:    ret <4 x i16> [[TMP14]]
 ;
 entry:
   %data = alloca [16 x i16], addrspace(5)
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll
new file mode 100644
index 0000000000000..084b7a2d59b2f
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll
@@ -0,0 +1,497 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx950 < %s | FileCheck %s --check-prefixes=GFX9
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 < %s | FileCheck %s --check-prefixes=GFX11
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1200 < %s | FileCheck %s --check-prefixes=GFX12
+
+define amdgpu_kernel void @test_bitcast_llc_v128i8_v16i8(ptr addrspace(1) %out, i32 %idx) {
+; GFX9-LABEL: test_bitcast_llc_v128i8_v16i8:
+; GFX9:       ; %bb.0: ; %entry
+; GFX9-NEXT:    s_load_dwordx2 s[34:35], s[4:5], 0x0
+; GFX9-NEXT:    s_load_dword s33, s[4:5], 0x8
+; GFX9-NEXT:    s_lshl_b32 s0, s0, 8
+; GFX9-NEXT:    s_and_b32 s1, s0, 0xff
+; GFX9-NEXT:    s_or_b32 s0, s1, s0
+; GFX9-NEXT:    s_and_b32 s1, s0, 0xffff
+; GFX9-NEXT:    s_lshl_b32 s0, s0, 16
+; GFX9-NEXT:    s_or_b32 s0, s1, s0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    s_add_i32 s33, s33, s33
+; GFX9-NEXT:    s_mov_b32 s1, s0
+; GFX9-NEXT:    s_lshl_b32 s33, s33, 1
+; GFX9-NEXT:    s_mov_b32 s2, s0
+; GFX9-NEXT:    s_mov_b32 s3, s0
+; GFX9-NEXT:    s_mov_b32 s4, s0
+; GFX9-NEXT:    s_mov_b32 s5, s0
+; GFX9-NEXT:    s_mov_b32 s6, s0
+; GFX9-NEXT:    s_mov_b32 s7, s0
+; GFX9-NEXT:    s_mov_b32 s8, s0
+; GFX9-NEXT:    s_mov_b32 s9, s0
+; GFX9-NEXT:    s_mov_b32 s10, s0
+; GFX9-NEXT:    s_mov_b32 s11, s0
+; GFX9-NEXT:    s_mov_b32 s12, s0
+; GFX9-NEXT:    s_mov_b32 s13, s0
+; GFX9-NEXT:    s_mov_b32 s14, s0
+; GFX9-NEXT:    s_mov_b32 s15, s0
+; GFX9-NEXT:    s_mov_b32 s16, s0
+; GFX9-NEXT:    s_mov_b32 s17, s0
+; GFX9-NEXT:    s_mov_b32 s18, s0
+; GFX9-NEXT:    s_mov_b32 s19, s0
+; GFX9-NEXT:    s_mov_b32 s20, s0
+; GFX9-NEXT:    s_mov_b32 s21, s0
+; GFX9-NEXT:    s_mov_b32 s22, s0
+; GFX9-NEXT:    s_mov_b32 s23, s0
+; GFX9-NEXT:    s_mov_b32 s24, s0
+; GFX9-NEXT:    s_mov_b32 s25, s0
+; GFX9-NEXT:    s_mov_b32 s26, s0
+; GFX9-NEXT:    s_mov_b32 s27, s0
+; GFX9-NEXT:    s_mov_b32 s28, s0
+; GFX9-NEXT:    s_mov_b32 s29, s0
+; GFX9-NEXT:    s_mov_b32 s30, s0
+; GFX9-NEXT:    s_mov_b32 s31, s0
+; GFX9-NEXT:    s_add_i32 s36, s33, 3
+; GFX9-NEXT:    v_mov_b64_e32 v[0:1], s[0:1]
+; GFX9-NEXT:    v_mov_b64_e32 v[2:3], s[2:3]
+; GFX9-NEXT:    v_mov_b64_e32 v[4:5], s[4:5]
+; GFX9-NEXT:    v_mov_b64_e32 v[6:7], s[6:7]
+; GFX9-NEXT:    v_mov_b64_e32 v[8:9], s[8:9]
+; GFX9-NEXT:    v_mov_b64_e32 v[10:11], s[10:11]
+; GFX9-NEXT:    v_mov_b64_e32 v[12:13], s[12:13]
+; GFX9-NEXT:    v_mov_b64_e32 v[14:15], s[14:15]
+; GFX9-NEXT:    v_mov_b64_e32 v[16:17], s[16:17]
+; GFX9-NEXT:    v_mov_b64_e32 v[18:19], s[18:19]
+; GFX9-NEXT:    v_mov_b64_e32 v[20:21], s[20:21]
+; GFX9-NEXT:    v_mov_b64_e32 v[22:23], s[22:23]
+; GFX9-NEXT:    v_mov_b64_e32 v[24:25], s[24:25]
+; GFX9-NEXT:    v_mov_b64_e32 v[26:27], s[26:27]
+; GFX9-NEXT:    v_mov_b64_e32 v[28:29], s[28:29]
+; GFX9-NEXT:    v_mov_b64_e32 v[30:31], s[30:31]
+; GFX9-NEXT:    s_set_gpr_idx_on s36, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v35, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_add_i32 s0, s33, 2
+; GFX9-NEXT:    s_set_gpr_idx_on s0, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v34, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    v_mov_b32_e32 v36, 0
+; GFX9-NEXT:    s_set_gpr_idx_on s33, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v33, v1
+; GFX9-NEXT:    v_mov_b32_e32 v32, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    global_store_dwordx4 v36, v[32:35], s[34:35]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX11-LABEL: test_bitcast_llc_v128i8_v16i8:
+; GFX11:       ; %bb.0: ; %entry
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    s_load_b64 s[34:35], s[4:5], 0x0
+; GFX11-NEXT:    s_load_b32 s33, s[4:5], 0x8
+; GFX11-NEXT:    s_lshl_b32 s0, s0, 8
+; GFX11-NEXT:    v_mov_b32_e32 v35, 0
+; GFX11-NEXT:    s_and_b32 s1, s0, 0xff
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_or_b32 s0, s1, s0
+; GFX11-NEXT:    s_and_b32 s1, s0, 0xffff
+; GFX11-NEXT:    s_lshl_b32 s0, s0, 16
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_or_b32 s0, s1, s0
+; GFX11-NEXT:    s_mov_b32 s1, s0
+; GFX11-NEXT:    s_mov_b32 s2, s0
+; GFX11-NEXT:    s_mov_b32 s3, s0
+; GFX11-NEXT:    s_mov_b32 s4, s0
+; GFX11-NEXT:    s_mov_b32 s5, s0
+; GFX11-NEXT:    s_mov_b32 s6, s0
+; GFX11-NEXT:    s_mov_b32 s7, s0
+; GFX11-NEXT:    s_mov_b32 s8, s0
+; GFX11-NEXT:    s_mov_b32 s9, s0
+; GFX11-NEXT:    s_mov_b32 s10, s0
+; GFX11-NEXT:    s_mov_b32 s11, s0
+; GFX11-NEXT:    s_mov_b32 s12, s0
+; GFX11-NEXT:    s_mov_b32 s13, s0
+; GFX11-NEXT:    s_mov_b32 s14, s0
+; GFX11-NEXT:    s_mov_b32 s15, s0
+; GFX11-NEXT:    s_mov_b32 s16, s0
+; GFX11-NEXT:    s_mov_b32 s17, s0
+; GFX11-NEXT:    s_mov_b32 s18, s0
+; GFX11-NEXT:    s_mov_b32 s19, s0
+; GFX11-NEXT:    s_mov_b32 s20, s0
+; GFX11-NEXT:    s_mov_b32 s21, s0
+; GFX11-NEXT:    s_mov_b32 s22, s0
+; GFX11-NEXT:    s_mov_b32 s23, s0
+; GFX11-NEXT:    s_mov_b32 s24, s0
+; GFX11-NEXT:    s_mov_b32 s25, s0
+; GFX11-NEXT:    s_mov_b32 s26, s0
+; GFX11-NEXT:    s_mov_b32 s27, s0
+; GFX11-NEXT:    s_mov_b32 s28, s0
+; GFX11-NEXT:    s_mov_b32 s29, s0
+; GFX11-NEXT:    s_mov_b32 s30, s0
+; GFX11-NEXT:    s_mov_b32 s31, s0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_add_i32 s33, s33, s33
+; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GFX11-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s3
+; GFX11-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_mov_b32 v5, s5
+; GFX11-NEXT:    v_dual_mov_b32 v6, s6 :: v_dual_mov_b32 v7, s7
+; GFX11-NEXT:    v_dual_mov_b32 v8, s8 :: v_dual_mov_b32 v9, s9
+; GFX11-NEXT:    v_dual_mov_b32 v10, s10 :: v_dual_mov_b32 v11, s11
+; GFX11-NEXT:    v_dual_mov_b32 v12, s12 :: v_dual_mov_b32 v13, s13
+; GFX11-NEXT:    v_dual_mov_b32 v14, s14 :: v_dual_mov_b32 v15, s15
+; GFX11-NEXT:    v_dual_mov_b32 v16, s16 :: v_dual_mov_b32 v17, s17
+; GFX11-NEXT:    v_dual_mov_b32 v18, s18 :: v_dual_mov_b32 v19, s19
+; GFX11-NEXT:    v_dual_mov_b32 v20, s20 :: v_dual_mov_b32 v21, s21
+; GFX11-NEXT:    v_dual_mov_b32 v22, s22 :: v_dual_mov_b32 v23, s23
+; GFX11-NEXT:    v_dual_mov_b32 v24, s24 :: v_dual_mov_b32 v25, s25
+; GFX11-NEXT:    v_dual_mov_b32 v26, s26 :: v_dual_mov_b32 v27, s27
+; GFX11-NEXT:    v_dual_mov_b32 v28, s28 :: v_dual_mov_b32 v29, s29
+; GFX11-NEXT:    v_dual_mov_b32 v30, s30 :: v_dual_mov_b32 v31, s31
+; GFX11-NEXT:    s_lshl_b32 s0, s33, 1
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    s_add_i32 m0, s0, 3
+; GFX11-NEXT:    v_movrels_b32_e32 v34, v0
+; GFX11-NEXT:    s_add_i32 m0, s0, 2
+; GFX11-NEXT:    v_movrels_b32_e32 v33, v0
+; GFX11-NEXT:    s_mov_b32 m0, s0
+; GFX11-NEXT:    v_movrels_b32_e32 v32, v1
+; GFX11-NEXT:    v_movrels_b32_e32 v31, v0
+; GFX11-NEXT:    global_store_b128 v35, v[31:34], s[34:35]
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: test_bitcast_llc_v128i8_v16i8:
+; GFX12:       ; %bb.0: ; %entry
+; GFX12-NEXT:    s_load_b96 s[36:38], s[4:5], 0x0
+; GFX12-NEXT:    s_lshl_b32 s0, s0, 8
+; GFX12-NEXT:    v_mov_b32_e32 v35, 0
+; GFX12-NEXT:    s_and_b32 s1, s0, 0xff
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_or_b32 s0, s1, s0
+; GFX12-NEXT:    s_and_b32 s1, s0, 0xffff
+; GFX12-NEXT:    s_lshl_b32 s0, s0, 16
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_or_b32 s0, s1, s0
+; GFX12-NEXT:    s_mov_b32 s1, s0
+; GFX12-NEXT:    s_mov_b32 s2, s0
+; GFX12-NEXT:    s_mov_b32 s3, s0
+; GFX12-NEXT:    s_mov_b32 s4, s0
+; GFX12-NEXT:    s_mov_b32 s5, s0
+; GFX12-NEXT:    s_mov_b32 s6, s0
+; GFX12-NEXT:    s_mov_b32 s7, s0
+; GFX12-NEXT:    s_mov_b32 s8, s0
+; GFX12-NEXT:    s_mov_b32 s9, s0
+; GFX12-NEXT:    s_mov_b32 s10, s0
+; GFX12-NEXT:    s_mov_b32 s11, s0
+; GFX12-NEXT:    s_mov_b32 s12, s0
+; GFX12-NEXT:    s_mov_b32 s13, s0
+; GFX12-NEXT:    s_mov_b32 s14, s0
+; GFX12-NEXT:    s_mov_b32 s15, s0
+; GFX12-NEXT:    s_mov_b32 s16, s0
+; GFX12-NEXT:    s_mov_b32 s17, s0
+; GFX12-NEXT:    s_mov_b32 s18, s0
+; GFX12-NEXT:    s_mov_b32 s19, s0
+; GFX12-NEXT:    s_mov_b32 s20, s0
+; GFX12-NEXT:    s_mov_b32 s21, s0
+; GFX12-NEXT:    s_mov_b32 s22, s0
+; GFX12-NEXT:    s_mov_b32 s23, s0
+; GFX12-NEXT:    s_mov_b32 s24, s0
+; GFX12-NEXT:    s_mov_b32 s25, s0
+; GFX12-NEXT:    s_mov_b32 s26, s0
+; GFX12-NEXT:    s_mov_b32 s27, s0
+; GFX12-NEXT:    s_mov_b32 s28, s0
+; GFX12-NEXT:    s_mov_b32 s29, s0
+; GFX12-NEXT:    s_mov_b32 s30, s0
+; GFX12-NEXT:    s_mov_b32 s31, s0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_add_co_i32 s33, s38, s38
+; GFX12-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GFX12-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s3
+; GFX12-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_mov_b32 v5, s5
+; GFX12-NEXT:    v_dual_mov_b32 v6, s6 :: v_dual_mov_b32 v7, s7
+; GFX12-NEXT:    v_dual_mov_b32 v8, s8 :: v_dual_mov_b32 v9, s9
+; GFX12-NEXT:    v_dual_mov_b32 v10, s10 :: v_dual_mov_b32 v11, s11
+; GFX12-NEXT:    v_dual_mov_b32 v12, s12 :: v_dual_mov_b32 v13, s13
+; GFX12-NEXT:    v_dual_mov_b32 v14, s14 :: v_dual_mov_b32 v15, s15
+; GFX12-NEXT:    v_dual_mov_b32 v16, s16 :: v_dual_mov_b32 v17, s17
+; GFX12-NEXT:    v_dual_mov_b32 v18, s18 :: v_dual_mov_b32 v19, s19
+; GFX12-NEXT:    v_dual_mov_b32 v20, s20 :: v_dual_mov_b32 v21, s21
+; GFX12-NEXT:    v_dual_mov_b32 v22, s22 :: v_dual_mov_b32 v23, s23
+; GFX12-NEXT:    v_dual_mov_b32 v24, s24 :: v_dual_mov_b32 v25, s25
+; GFX12-NEXT:    v_dual_mov_b32 v26, s26 :: v_dual_mov_b32 v27, s27
+; GFX12-NEXT:    v_dual_mov_b32 v28, s28 :: v_dual_mov_b32 v29, s29
+; GFX12-NEXT:    v_dual_mov_b32 v30, s30 :: v_dual_mov_b32 v31, s31
+; GFX12-NEXT:    s_lshl_b32 s0, s33, 1
+; GFX12-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12-NEXT:    s_add_co_i32 m0, s0, 3
+; GFX12-NEXT:    v_movrels_b32_e32 v34, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s0, 2
+; GFX12-NEXT:    v_movrels_b32_e32 v33, v0
+; GFX12-NEXT:    s_mov_b32 m0, s0
+; GFX12-NEXT:    v_movrels_b32_e32 v32, v1
+; GFX12-NEXT:    v_movrels_b32_e32 v31, v0
+; GFX12-NEXT:    global_store_b128 v35, v[31:34], s[36:37]
+; GFX12-NEXT:    s_endpgm
+entry:
+  %alloca = freeze <128 x i8> poison
+  %allocabc = bitcast <128 x i8> %alloca to <8 x i128>
+  %vec = extractelement <8 x i128> %allocabc, i32 %idx
+  %vecbc = bitcast i128 %vec to <16 x i8>
+  store <16 x i8> %vecbc, ptr addrspace(1) %out, align 16
+  ret void
+}
+
+define amdgpu_kernel void @test_bitcast_llc_v64i16_v8i16(ptr addrspace(1) %out, i32 %idx) {
+; GFX9-LABEL: test_bitcast_llc_v64i16_v8i16:
+; GFX9:       ; %bb.0: ; %entry
+; GFX9-NEXT:    s_load_dword s2, s[4:5], 0x8
+; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
+; GFX9-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    s_add_i32 s2, s2, s2
+; GFX9-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v1, v1
+; GFX9-NEXT:    s_add_i32 s3, s2, 3
+; GFX9-NEXT:    v_mov_b32_e32 v0, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_set_gpr_idx_on s3, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v3, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_add_i32 s2, s2, 2
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v2, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX11-LABEL: test_bitcast_llc_v64i16_v8i16:
+; GFX11:       ; %bb.0: ; %entry
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    s_load_b32 s2, s[4:5], 0x8
+; GFX11-NEXT:    s_load_b64 s[0:1], s[4:5], 0x0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_add_i32 s2, s2, s2
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX11-NEXT:    s_mov_b32 m0, s2
+; GFX11-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX11-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 3
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_movrels_b32_e32 v3, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 2
+; GFX11-NEXT:    v_mov_b32_e32 v4, 0
+; GFX11-NEXT:    v_movrels_b32_e32 v2, v0
+; GFX11-NEXT:    global_store_b128 v4, v[0:3], s[0:1]
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: test_bitcast_llc_v64i16_v8i16:
+; GFX12:       ; %bb.0: ; %entry
+; GFX12-NEXT:    s_load_b96 s[0:2], s[4:5], 0x0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_add_co_i32 s2, s2, s2
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX12-NEXT:    s_mov_b32 m0, s2
+; GFX12-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX12-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s2, 3
+; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12-NEXT:    v_movrels_b32_e32 v3, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s2, 2
+; GFX12-NEXT:    v_mov_b32_e32 v4, 0
+; GFX12-NEXT:    v_movrels_b32_e32 v2, v0
+; GFX12-NEXT:    global_store_b128 v4, v[0:3], s[0:1]
+; GFX12-NEXT:    s_endpgm
+entry:
+  %alloca = freeze <64 x i16> poison
+  %allocabc = bitcast <64 x i16> %alloca to <8 x i128>
+  %vec = extractelement <8 x i128> %allocabc, i32 %idx
+  %vecbc = bitcast i128 %vec to <8 x i16>
+  store <8 x i16> %vecbc, ptr addrspace(1) %out, align 16
+  ret void
+}
+
+define amdgpu_kernel void @test_bitcast_llc_v32i32_v4i32(ptr addrspace(1) %out, i32 %idx) {
+; GFX9-LABEL: test_bitcast_llc_v32i32_v4i32:
+; GFX9:       ; %bb.0: ; %entry
+; GFX9-NEXT:    s_load_dword s2, s[4:5], 0x8
+; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
+; GFX9-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    s_add_i32 s2, s2, s2
+; GFX9-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v1, v1
+; GFX9-NEXT:    s_add_i32 s3, s2, 3
+; GFX9-NEXT:    v_mov_b32_e32 v0, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_set_gpr_idx_on s3, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v3, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_add_i32 s2, s2, 2
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v2, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX11-LABEL: test_bitcast_llc_v32i32_v4i32:
+; GFX11:       ; %bb.0: ; %entry
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    s_load_b32 s2, s[4:5], 0x8
+; GFX11-NEXT:    s_load_b64 s[0:1], s[4:5], 0x0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_add_i32 s2, s2, s2
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX11-NEXT:    s_mov_b32 m0, s2
+; GFX11-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX11-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 3
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_movrels_b32_e32 v3, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 2
+; GFX11-NEXT:    v_mov_b32_e32 v4, 0
+; GFX11-NEXT:    v_movrels_b32_e32 v2, v0
+; GFX11-NEXT:    global_store_b128 v4, v[0:3], s[0:1]
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: test_bitcast_llc_v32i32_v4i32:
+; GFX12:       ; %bb.0: ; %entry
+; GFX12-NEXT:    s_load_b96 s[0:2], s[4:5], 0x0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_add_co_i32 s2, s2, s2
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX12-NEXT:    s_mov_b32 m0, s2
+; GFX12-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX12-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s2...
[truncated]
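
As a worked example of the index arithmetic in the new code path, take the <64 x i8> -> <16 x i8> case from the code comment above (a sketch; value names are illustrative): NewElemType is i(8 * 16) = i128, NewNumElts is 64 * 8 / 128 = 4, and IndexDivisor is 64 / 4 = 16, so the emitted IR is roughly:

  %bc     = bitcast <64 x i8> %cur to <4 x i128>
  %newidx = udiv i32 %idx, 16
  %ext    = extractelement <4 x i128> %bc, i32 %newidx
  %sub    = bitcast i128 %ext to <16 x i8>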

Copilot AI left a comment

Pull request overview

This PR optimizes the AMDGPU promote-alloca pass by reducing the number of extract/insert element operations when accessing vectors with dynamic indices. When a subvector is loaded with a dynamic index, the code now inserts bitcasts to transform the operation into a single extractelement on a larger integer type, rather than generating multiple extract/insert operations that would expand inefficiently in the AMDGPU backend.

Key changes:

  • Added optimization in promoteAllocaUserToVector to detect dynamic index loads of power-of-2 integer vectors
  • Transforms scalarized copy operations using intermediate bitcasts to reduce extractelement chains
  • Updated test expectations to verify the new bitcast-based transformation

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp: Implements the bitcast optimization for dynamic index vector loads
  • llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts.ll: Tests the IR transformation with various vector sizes
  • llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll: Tests the backend code generation for different GPU architectures
  • llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll: Updates existing test to reflect new transformation behavior


github-actions bot commented Dec 9, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

choikwa (Contributor, Author) commented Dec 9, 2025

@ruiling do you know if the alignment check is strictly required here? As I understand, once alloca is promoted to vector, there is no need for alignment for extracting subvector.

ruiling (Contributor) commented Dec 9, 2025

@ruiling do you know if the alignment check is strictly required here? As I understand, once alloca is promoted to vector, there is no need for alignment for extracting subvector.

Vector element extract/insert needs to operate on a per-element basis, like in this example:

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i8, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 1

As we don't know whether %gep will be aligned to 16, you cannot bitcast %alloca to <4 x i128> and translate the <8 x i16> load into one extractelement. So you need an alignment check to lower it this way. For the unaligned case, I feel @arsenm's point is to optimize the register allocator part or find other ways to simplify the IR. I think you can only handle the aligned case properly in this change. It would be better to have some tests for the unaligned case.

choikwa (Contributor, Author) commented Dec 9, 2025

@ruiling do you know if the alignment check is strictly required here? As I understand, once alloca is promoted to vector, there is no need for alignment for extracting subvector.

Vector element extract/insert needs to operate on a per-element basis, like in this example:

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i8, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 1

As we don't know whether %gep will be aligned to 16, you cannot bitcast %alloca to <4 x i128> and translate the <8 x i16> load into one extractelement. So you need an alignment check to lower it this way. For the unaligned case, I feel @arsenm's point is to optimize the register allocator part or find other ways to simplify the IR. I think you can only handle the aligned case properly in this change. It would be better to have some tests for the unaligned case.

I believe GEPToVectorIndex is only allowing GEPs that map to the vector index. I will add an unaligned testcase to make sure.

ruiling (Contributor) commented Dec 9, 2025

Vector element extract/insert needs to operate on a per-element basis, like in this example:

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i8, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 1

As we don't know whether %gep will be aligned to 16, you cannot bitcast %alloca to <4 x i128> and translate the <8 x i16> load into one extractelement. So you need an alignment check to lower it this way. For the unaligned case, I feel @arsenm's point is to optimize the register allocator part or find other ways to simplify the IR. I think you can only handle the aligned case properly in this change. It would be better to have some tests for the unaligned case.

I believe GEPToVectorIndex is only allowing GEPs that map to the vector index. I will add an unaligned testcase to make sure.

I think GEPToVectorIndex is used to get the index into the vector type translated from the alloca. By saying unaligned, I really mean not aligned to the vector type you just bitcasted to temporarily. Like in the case below, the alloca was translated to <32 x i16>, but in order to translate the load <8 x i16> into the simplified form, you bitcast it to <4 x i128>. Unaligned here means the case where the address of %gep is not aligned to 128 bits.

 %alloca = alloca [32 x i16], align 16, addrspace(5)
 %gep = getelementptr i16, ptr addrspace(5) %alloca, i32 %idx
 %load = load <8 x i16>, ptr addrspace(5) %gep, align 2

Hope I understand everything correctly.
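
To make the unaligned hazard concrete (an illustrative sketch, not taken from the patch or the thread): with an i16-granularity dynamic index, the loaded <8 x i16> can start in the middle of a 128-bit lane of the temporary <4 x i128> view, for example:

 ; say %idx = 3: the load covers i16 elements 3..10 of the <32 x i16> alloca,
 ; which straddles lanes 0 and 1 of a <4 x i128> view, so no single
 ; extractelement of the bitcast can produce this subvector.
 %gep  = getelementptr i16, ptr addrspace(5) %alloca, i32 3
 %load = load <8 x i16>, ptr addrspace(5) %gep, align 2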

nhaehnle (Collaborator) left a comment

Agreed that the alignment test is required.

…itcasts to reduce number of extractelements as they have large expansion in the backend.

Investigation revealed that scalarized copy results in a long chain of extract/insert elements which can explode in generated temps in the AMDGPU backend as there is no efficient representation for extracting subvector with dynamic index.
Using identity bitcasts can reduce the number of extract/insert elements down to 1 and produce much smaller generated and efficient code.

github-actions bot commented Dec 15, 2025

🐧 Linux x64 Test Results

  • 187645 tests passed
  • 4984 tests skipped

✅ The build succeeded and all tests passed.


github-actions bot commented Dec 15, 2025

🪟 Windows x64 Test Results

  • 128800 tests passed
  • 2836 tests skipped

✅ The build succeeded and all tests passed.

Comment on lines 663 to 667
  // Check that NumBits and GEP's type's NumBits match for alignment
  if (auto *GEPTy =
          dyn_cast<FixedVectorType>(GEP->getSourceElementType())) {
    if (NumBits == GEPTy->getScalarSizeInBits() *
                       GEPTy->getNumElements()) {
Collaborator

Couple of bad problems there. First, you used a dyn_cast above, so GEP may be null.

Second, looking at the GEP source element type is strongly discouraged since GEPs can (and will be) canonicalized to i8.

Can't you just use the load instruction alignment?

Contributor Author

Understood. Yes, I can look at the Load's align instead.
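
For illustration, the distinction such a load-alignment check draws looks roughly like this (a sketch with made-up value names, not the exact predicate from the patch):

  ; align 16 covers the whole 128-bit subvector, so the bitcast path is sound
  %aligned = load <8 x i16>, ptr addrspace(5) %gep, align 16
  ; align 2 only guarantees element alignment, so the per-element expansion must stay
  %unaligned = load <8 x i16>, ptr addrspace(5) %gep, align 2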

nhaehnle (Collaborator) left a comment

Thanks.

ruiling (Contributor) left a comment

Thanks! Please make the title shorter. Longer explanation should go to the commit message.

@choikwa choikwa changed the title [AMDGPU] In promote-alloca, if index is dynamic, sandwich load with bitcasts to reduce number of extractelements as they have large expansion in the backend. [AMDGPU] In promote-alloca, if index is dynamic, sandwich load with bitcasts to reduce excessive codegen Dec 18, 2025
Comment on lines +652 to +653
// Extracting subvector with dynamic index has very large expansion in
// the amdgpu backend. Limit to pow2.
Contributor

You can also directly address this in the backend? I thought we already did try to turn these into friendlier vector types there?

choikwa (Contributor, Author) commented Dec 18, 2025

My previous attempts at doing this in the backend stalled, as I couldn't easily come up with a better way to expand the extract_subvector than the scalarized copy. Doing a bitcast'ed copy to the largest element type doesn't avoid consecutive copies, but as seen in github.com//pull/170327, the SDAG decomposes one large copy into several and leverages the largest legal integer type to do the copying.

uint64_t LoadAlign = cast<LoadInst>(Inst)->getAlign().value();
bool IsAlignedLoad = NumBits <= (LoadAlign * 8u);
unsigned TotalNumElts = VectorTy->getNumElements();
bool IsProperlyDivisible = TotalNumElts % NumLoadedElts == 0;
Contributor

Can you keep this in terms of TypeSize operators?

ruiling (Contributor) left a comment

Thanks!

@choikwa choikwa merged commit 5897f27 into llvm:main Dec 19, 2025
10 checks passed
valadaptive pushed a commit to valadaptive/llvm-project that referenced this pull request Dec 24, 2025
…itcasts to reduce excessive codegen (llvm#171253)

Investigation revealed that scalarized copy results in a long chain of
extract/insert elements which can explode in generated temps in the
AMDGPU backend as there is no efficient representation for extracting
subvector with dynamic index. Using identity bitcasts can reduce the
number of extract/insert elements down to 1 and produce much smaller,
efficient generated code.

Credit: ruiling
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Jan 6, 2026
…itcasts to reduce excessive codegen (llvm#171253)

Investigation revealed that scalarized copy results in a long chain of
extract/insert elements which can explode in generated temps in the
AMDGPU backend as there is no efficient representation for extracting
subvector with dynamic index. Using identity bitcasts can reduce the
number of extract/insert elements down to 1 and produce much smaller,
efficient generated code.

Credit: ruiling