@harrisonGPU
Contributor

@harrisonGPU harrisonGPU commented Nov 3, 2025

This patch adds support for the pattern:

  %index = select i1 %idx_sel, i32 0, i32 4
  %elt = getelementptr inbounds i8, ptr addrspace(5) %alloca, i32 %index

by scaling the byte offset to an element index (index >> log2(ElemSize)),
allowing the vector element to be updated with insertelement instead of using
scratch memory.
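The scaling described above can be sketched in standalone C++. This is a minimal illustration, not the code in `AMDGPUPromoteAlloca.cpp`, and `isPowerOf2`, `log2Floor`, and `byteOffsetToElementIndex` are hypothetical helpers. The shift is only an exact division when the element size is a power of two, and it only yields a valid element index when the byte offset is a multiple of the element size, so the sketch rejects any other offset instead of rounding it down:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper: true if x is a non-zero power of two.
constexpr bool isPowerOf2(uint64_t x) { return x != 0 && (x & (x - 1)) == 0; }

// Hypothetical helper: floor(log2(x)) for x > 0.
constexpr unsigned log2Floor(uint64_t x) {
  unsigned r = 0;
  while (x >>= 1)
    ++r;
  return r;
}

// Converts a byte offset into a vector alloca to an element index.
// Returns std::nullopt when the offset is not on an element boundary,
// because a bare shift would silently round such offsets down.
std::optional<uint64_t> byteOffsetToElementIndex(uint64_t byteOffset,
                                                 uint64_t elemSize) {
  assert(isPowerOf2(elemSize) && "shift-based scaling needs a power of two");
  if (byteOffset % elemSize != 0)
    return std::nullopt;
  return byteOffset >> log2Floor(elemSize);
}
```

With `<3 x float>` (4-byte elements), the two values produced by `select i1 %idx_sel, i32 0, i32 4` map to element indices 0 and 1, while a byte offset of 5 is rejected.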

@llvmbot
Member

llvmbot commented Nov 3, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

This patch adds support for the pattern:

  %elt = getelementptr inbounds i8, ptr addrspace(5) %alloca, i32 %index

by scaling the byte offset to an element index (index >> log2(ElemSize)),
allowing the vector element to be updated with insertelement instead of using
scratch memory.


Full diff: https://github.com/llvm/llvm-project/pull/166132.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+13-2)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll (+20)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index ddabd25894414..793c0237cdf38 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -456,10 +456,21 @@ static Value *GEPToVectorIndex(GetElementPtrInst *GEP, AllocaInst *Alloca,
   const auto &VarOffset = VarOffsets.front();
   APInt OffsetQuot;
   APInt::sdivrem(VarOffset.second, VecElemSize, OffsetQuot, Rem);
-  if (Rem != 0 || OffsetQuot.isZero())
-    return nullptr;
+
+  Value *Scaled = nullptr;
+  if (Rem != 0 || OffsetQuot.isZero()) {
+    unsigned ElemSizeShift = Log2_64(VecElemSize);
+    Scaled = Builder.CreateLShr(VarOffset.first, ElemSizeShift);
+    if (Instruction *NewInst = dyn_cast<Instruction>(Scaled))
+      NewInsts.push_back(NewInst);
+    OffsetQuot = APInt(BW, 1);
+    Rem = 0;
+  }
 
   Value *Offset = VarOffset.first;
+  if (Scaled)
+    Offset = Scaled;
+
   auto *OffsetType = dyn_cast<IntegerType>(Offset->getType());
   if (!OffsetType)
     return nullptr;
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
index 76e1868b3c4b9..65bddaba8dd14 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
@@ -250,6 +250,26 @@ bb2:
   store i32 0, ptr addrspace(5) %extractelement
   ret void
 }
+
+define amdgpu_kernel void @scalar_alloca_vector_gep_i8(ptr %buffer, float %data, i32 %index) {
+; CHECK-LABEL: define amdgpu_kernel void @scalar_alloca_vector_gep_i8(
+; CHECK-SAME: ptr [[BUFFER:%.*]], float [[DATA:%.*]], i32 [[INDEX:%.*]]) {
+; CHECK-NEXT:    [[ALLOCA:%.*]] = freeze <3 x float> poison
+; CHECK-NEXT:    [[VEC:%.*]] = load <3 x float>, ptr [[BUFFER]], align 16
+; CHECK-NEXT:    [[TMP1:%.*]] = lshr i32 [[INDEX]], 2
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <3 x float> [[VEC]], float [[DATA]], i32 [[TMP1]]
+; CHECK-NEXT:    store <3 x float> [[TMP2]], ptr [[BUFFER]], align 16
+; CHECK-NEXT:    ret void
+;
+  %alloca = alloca <3 x float>, align 16, addrspace(5)
+  %vec = load <3 x float>, ptr %buffer
+  store <3 x float> %vec, ptr addrspace(5) %alloca
+  %elt = getelementptr inbounds nuw i8, ptr addrspace(5) %alloca, i32 %index
+  store float %data, ptr addrspace(5) %elt, align 4
+  %updated = load <3 x float>, ptr addrspace(5) %alloca, align 16
+  store <3 x float> %updated, ptr %buffer, align 16
+  ret void
+}
 ;.
 ; CHECK: [[META0]] = !{}
 ; CHECK: [[RNG1]] = !{i32 0, i32 1025}

Contributor

@shiltian shiltian left a comment

What if the index here is somewhere in the middle? It doesn't look like we have any constraint on the index.


Value *Scaled = nullptr;
if (Rem != 0 || OffsetQuot.isZero()) {
unsigned ElemSizeShift = Log2_64(VecElemSize);
Contributor

Need to validate VecElemSize is a power of 2?

Contributor Author

Thanks, but I think it is not necessary to explicitly check whether the element size is a power of two, because it is already covered by the existing check here:

Type *VecEltTy = VectorTy->getElementType();
unsigned ElementSizeInBits = DL->getTypeSizeInBits(VecEltTy);
if (ElementSizeInBits != DL->getTypeAllocSizeInBits(VecEltTy)) {
  LLVM_DEBUG(dbgs() << " Cannot convert to vector if the allocation size "
                       "does not match the type's size\n");
  return false;
}

If the type's size does not match its allocation size (for example, when the element type is not naturally aligned), this returns false, which also rejects non-power-of-2 element sizes such as i24.
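The existing guard can be modeled in standalone C++ to show why it rejects `i24`. This is a sketch under assumptions: it relies on LLVM's alloc size being the store size rounded up to the type's ABI alignment, and the ABI alignments passed in (for example, 4 bytes for `i24`) are assumed values that in practice come from the target's data layout. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Store size: the type's bit width rounded up to whole bytes.
inline uint64_t storeSizeInBytes(uint64_t sizeInBits) {
  return (sizeInBits + 7) / 8;
}

// Alloc size in bits: the store size rounded up to the ABI alignment.
// abiAlignBytes is an input here; in LLVM it comes from the data layout.
inline uint64_t allocSizeInBits(uint64_t sizeInBits, uint64_t abiAlignBytes) {
  assert(abiAlignBytes != 0 && (abiAlignBytes & (abiAlignBytes - 1)) == 0 &&
         "ABI alignments are powers of two");
  uint64_t store = storeSizeInBytes(sizeInBits);
  uint64_t alloc = (store + abiAlignBytes - 1) / abiAlignBytes * abiAlignBytes;
  return alloc * 8;
}

// Mirrors the guard quoted above: accept an element type only if its size
// in bits equals its alloc size in bits.
inline bool sizeMatchesAllocSize(uint64_t sizeInBits, uint64_t abiAlignBytes) {
  return sizeInBits == allocSizeInBits(sizeInBits, abiAlignBytes);
}
```

Under the assumed 4-byte alignment, `i24` has a 32-bit alloc size and is rejected, while `float` or `i32` passes. Note that the guard compares sizes rather than testing for a power of two, so in this model the conclusion depends on the alignments the data layout assigns.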

%updated = load <3 x float>, ptr addrspace(5) %alloca, align 16
store <3 x float> %updated, ptr %buffer, align 16
ret void
}
Contributor

Test with non-power-of-2 element size

@harrisonGPU
Contributor Author

> What if the index here is somewhere in the middle? It doesn't look like we have any constraint on the index.

Could you please give me an example?

@shiltian
Contributor

shiltian commented Nov 4, 2025

> Could you please give me an example?

In the test case, %index indexes into %alloca, a <3 x float> that is 12 bytes with 4-byte elements. Since the GEP is over i8, what if %index is 5? That would point into the middle of the second element of %alloca.
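The concern can be made concrete with a small standalone C++ sketch (hypothetical helper names; 4-byte elements as in the `<3 x float>` test). A 4-byte `store float` at byte offset 5 covers bytes 5 through 8, which span elements 1 and 2, yet shifting the offset right by 2 yields element 1, so a single `insertelement` at that index would not model the store:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// For an access of accessBytes starting at byteOffset into a vector whose
// elements are elemSize bytes, return the first and last element it touches.
std::pair<uint64_t, uint64_t> touchedElements(uint64_t byteOffset,
                                              uint64_t accessBytes,
                                              uint64_t elemSize) {
  assert(accessBytes > 0 && elemSize > 0);
  return {byteOffset / elemSize, (byteOffset + accessBytes - 1) / elemSize};
}

// Models `lshr i32 %index, 2`, i.e. scaling for 4-byte elements with no
// alignment check.
uint64_t shiftedIndex(uint64_t byteOffset) { return byteOffset >> 2; }
```

For offset 5 the access touches elements 1 and 2 while the shift picks element 1; for an aligned offset such as 4, both agree on element 1.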

@harrisonGPU harrisonGPU force-pushed the amdgpu/promote-vector branch from c1439a3 to 6a28740 on November 10, 2025 08:22
@harrisonGPU
Contributor Author

harrisonGPU commented Nov 10, 2025

> > Could you please give me an example?
>
> In the test case, %index indexes into %alloca, a <3 x float> that is 12 bytes with 4-byte elements. Since the GEP is over i8, what if %index is 5? That would point into the middle of the second element of %alloca.

Hi @shiltian, I've looked into this issue; thank you very much for pointing it out and for the suggestion.
I now think we should only promote when the variable index is guaranteed to be aligned to the element size.
We can use computeKnownBits and countMinTrailingZeros to check that the low bits of the index are zero, which proves the index is aligned before promoting.
I've updated the lit test and commit message accordingly. What do you think?
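The known-bits idea can be sketched without LLVM. The model below is a simplified, hypothetical stand-in for `computeKnownBits` and `KnownBits::countMinTrailingZeros` that only handles constants and `select`; it shows why `select i1 %idx_sel, i32 0, i32 4` passes a 4-byte alignment check while a select between 4 and 5 (as in the `..._4_or_5_no_promote` test) does not:

```cpp
#include <cassert>
#include <cstdint>

// Simplified known-bits model for a 32-bit value: a bit set in Zero is
// known to be 0, a bit set in One is known to be 1.
struct KnownBits32 {
  uint32_t Zero;
  uint32_t One;

  static KnownBits32 constant(uint32_t c) { return {~c, c}; }

  // A select may produce either operand, so only bits known in both survive.
  static KnownBits32 select(KnownBits32 a, KnownBits32 b) {
    return {a.Zero & b.Zero, a.One & b.One};
  }

  // Number of low bits known to be zero.
  unsigned countMinTrailingZeros() const {
    unsigned n = 0;
    while (n < 32 && ((Zero >> n) & 1u))
      ++n;
    return n;
  }
};

// The index is provably a multiple of 2^log2ElemSize if at least that many
// low bits are known to be zero.
bool isKnownAligned(const KnownBits32 &k, unsigned log2ElemSize) {
  assert(log2ElemSize <= 32);
  return k.countMinTrailingZeros() >= log2ElemSize;
}
```

For the 0/4 select the two low bits are known zero, so it is aligned to 4 bytes; for the 4/5 select bit 0 is unknown, so the check fails and promotion would be skipped.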

return nullptr;

Value *Offset = VarOffset.first;
if (Rem != 0 || OffsetQuot.isZero()) {
Contributor

Does it have to be this complicated? I thought checking whether offset % size == 0 would be sufficient.

Contributor Author

Thanks, I agree with your points; I have removed it.

@github-actions

github-actions bot commented Nov 20, 2025

🐧 Linux x64 Test Results

  • 187096 tests passed
  • 4929 tests skipped

✅ The build succeeded and all tests passed.

@shiltian
Contributor

The logic seems reasonable for a plain vector, but I'm not sure about nested vectors. Can you add tests for those as well?

@harrisonGPU
Contributor Author

> The logic seems reasonable for a plain vector, but I'm not sure about nested vectors. Can you add tests for those as well?

Thanks, I have added two tests for nested vectors.

@harrisonGPU harrisonGPU force-pushed the amdgpu/promote-vector branch from 8c3f2e3 to 0e8c3fe on December 8, 2025 03:26
@harrisonGPU harrisonGPU merged commit 6ec8c43 into llvm:main Dec 8, 2025
10 checks passed
@harrisonGPU harrisonGPU deleted the amdgpu/promote-vector branch December 8, 2025 04:13
@llvm-ci
Collaborator

llvm-ci commented Dec 8, 2025

LLVM Buildbot has detected a new failure on builder openmp-offload-amdgpu-runtime-2 running on rocm-worker-hw-02 while building llvm at step 10 "Add check check-libc-amdgcn-amd-amdhsa".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/10/builds/18635

Here is the relevant piece of the build log for reference:
Step 10 (Add check check-libc-amdgcn-amd-amdhsa) failure: test (failure)
...
[2611/3298] Running hermetic test libc.test.src.math.smoke.lroundf_test.__hermetic__
[==========] Running 3 tests from 1 test suite.
[ RUN      ] LlvmLibcRoundToIntegerTest.InfinityAndNaN
[       OK ] LlvmLibcRoundToIntegerTest.InfinityAndNaN (10 us)
[ RUN      ] LlvmLibcRoundToIntegerTest.RoundNumbers
[       OK ] LlvmLibcRoundToIntegerTest.RoundNumbers (6 us)
[ RUN      ] LlvmLibcRoundToIntegerTest.SubnormalRange
[       OK ] LlvmLibcRoundToIntegerTest.SubnormalRange (789 us)
Ran 3 tests.  PASS: 3  FAIL: 0
[2612/3298] Linking CXX executable libc/test/src/stdlib/libc.test.src.stdlib.heap_sort_test.__hermetic__.__build__
FAILED: libc/test/src/stdlib/libc.test.src.stdlib.heap_sort_test.__hermetic__.__build__ 
: && /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/./bin/clang++ --target=amdgcn-amd-amdhsa -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wno-pass-failed -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG -Wl,--color-diagnostics    --target=amdgcn-amd-amdhsa -Wno-multi-gpu -mcpu=native -flto -Wl,-mllvm,-amdgpu-lower-global-ctor-dtor=0 -nostdlib -static -Wl,-mllvm,-amdhsa-code-object-version=6 libc/startup/gpu/amdgpu/CMakeFiles/libc.startup.gpu.amdgpu.crt1.dir/start.cpp.o libc/test/src/stdlib/CMakeFiles/libc.test.src.stdlib.heap_sort_test.__hermetic__.__build__.dir/heap_sort_test.cpp.o -o libc/test/src/stdlib/libc.test.src.stdlib.heap_sort_test.__hermetic__.__build__  libc/test/UnitTest/libLibcTest.hermetic.a  libc/test/UnitTest/libLibcHermeticTestSupport.hermetic.a  libc/test/src/stdlib/liblibc.test.src.stdlib.heap_sort_test.__hermetic__.libc.a && :
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace and instructions to reproduce the bug.
Stack dump:
0.	Program arguments: /home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/ld.lld --no-undefined -shared -plugin-opt=mcpu=gfx90a -plugin-opt=O3 --lto-CGO3 -plugin-opt=-function-sections=1 -plugin-opt=-data-sections=1 -L/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/amdgcn-amd-amdhsa --color-diagnostics -mllvm -amdgpu-lower-global-ctor-dtor=0 -mllvm -amdhsa-code-object-version=6 libc/startup/gpu/amdgpu/CMakeFiles/libc.startup.gpu.amdgpu.crt1.dir/start.cpp.o libc/test/src/stdlib/CMakeFiles/libc.test.src.stdlib.heap_sort_test.__hermetic__.__build__.dir/heap_sort_test.cpp.o libc/test/UnitTest/libLibcTest.hermetic.a libc/test/UnitTest/libLibcHermeticTestSupport.hermetic.a libc/test/src/stdlib/liblibc.test.src.stdlib.heap_sort_test.__hermetic__.libc.a -o libc/test/src/stdlib/libc.test.src.stdlib.heap_sort_test.__hermetic__.__build__
1.	Running pass 'Function Pass Manager' on module 'ld-temp.o'.
2.	Running pass 'AMDGPU Promote Alloca' on function '@_ZN49LlvmLibcHeapSortTest_SameElementThreeElementArray3RunEv'
 #0 0x00007bd51120e100 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/libLLVMSupport.so.22.0git+0x20e100)
 #1 0x00007bd51120adbf llvm::sys::RunSignalHandlers() (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/libLLVMSupport.so.22.0git+0x20adbf)
 #2 0x00007bd51120af12 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #3 0x00007bd510842520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007bd50c854df1 llvm::Instruction::eraseFromParent() (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/../lib/libLLVMCore.so.22.0git+0x254df1)
 #5 0x00007bd50ef80f7d (anonymous namespace)::AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(llvm::AllocaInst&)::'lambda'(llvm::Instruction*, llvm::Twine)::operator()(llvm::Instruction*, llvm::Twine) const AMDGPUPromoteAlloca.cpp:0:0
 #6 0x00007bd50ef89288 (anonymous namespace)::AMDGPUPromoteAllocaImpl::tryPromoteAllocaToVector(llvm::AllocaInst&) AMDGPUPromoteAlloca.cpp:0:0
 #7 0x00007bd50ef8bb21 (anonymous namespace)::AMDGPUPromoteAllocaImpl::run(llvm::Function&, bool) (.part.0) AMDGPUPromoteAlloca.cpp:0:0
 #8 0x00007bd50ef8dc33 (anonymous namespace)::AMDGPUPromoteAlloca::runOnFunction(llvm::Function&) AMDGPUPromoteAlloca.cpp:0:0
 #9 0x00007bd50c8a4e93 llvm::FPPassManager::runOnFunction(llvm::Function&) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/../lib/libLLVMCore.so.22.0git+0x2a4e93)
#10 0x00007bd50c8a52e9 llvm::FPPassManager::runOnModule(llvm::Module&) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/../lib/libLLVMCore.so.22.0git+0x2a52e9)
#11 0x00007bd50c8a5c37 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/../lib/libLLVMCore.so.22.0git+0x2a5c37)
#12 0x00007bd510ebeb29 codegen(llvm::lto::Config const&, llvm::TargetMachine*, std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex const&) LTOBackend.cpp:0:0
#13 0x00007bd510ec09ad llvm::lto::backend(llvm::lto::Config const&, std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, unsigned int, llvm::Module&, llvm::ModuleSummaryIndex&) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/../lib/libLLVMLTO.so.22.0git+0x569ad)
#14 0x00007bd510ead619 llvm::lto::LTO::runRegularLTO(std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/../lib/libLLVMLTO.so.22.0git+0x43619)
#15 0x00007bd510eb2df2 llvm::lto::LTO::run(std::function<llvm::Expected<std::unique_ptr<llvm::CachedFileStream, std::default_delete<llvm::CachedFileStream>>> (unsigned int, llvm::Twine const&)>, llvm::FileCache) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/../lib/libLLVMLTO.so.22.0git+0x48df2)
#16 0x00007bd5117ad8fc lld::elf::BitcodeCompiler::compile() (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/liblldELF.so.22.0git+0x1ad8fc)
#17 0x00007bd5116fca0d void lld::elf::LinkerDriver::compileBitcodeFiles<llvm::object::ELFType<(llvm::endianness)1, true>>(bool) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/liblldELF.so.22.0git+0xfca0d)
#18 0x00007bd511720e24 void lld::elf::LinkerDriver::link<llvm::object::ELFType<(llvm::endianness)1, true>>(llvm::opt::InputArgList&) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/liblldELF.so.22.0git+0x120e24)
#19 0x00007bd511728ca9 lld::elf::LinkerDriver::linkerMain(llvm::ArrayRef<char const*>) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/liblldELF.so.22.0git+0x128ca9)
#20 0x00007bd5117292cc lld::elf::link(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, bool, bool) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/liblldELF.so.22.0git+0x1292cc)
#21 0x00007bd5119e35e1 lld::unsafeLldMain(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, llvm::ArrayRef<lld::DriverDef>, bool) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/../lib/liblldCommon.so.22.0git+0xe5e1)
#22 0x00005e1ec1509d16 lld_main(int, char**, llvm::ToolContext const&) (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/ld.lld+0x2d16)
#23 0x00005e1ec150955b main (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/ld.lld+0x255b)
#24 0x00007bd510829d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#25 0x00007bd510829e40 call_init ./csu/../csu/libc-start.c:128:20
#26 0x00007bd510829e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#27 0x00005e1ec15095b5 _start (/home/botworker/builds/openmp-offload-amdgpu-runtime-2/llvm.build/bin/ld.lld+0x25b5)
clang++: error: unable to execute command: Segmentation fault (core dumped)
clang++: error: ld.lld command failed due to signal (use -v to see invocation)
[2613/3298] Linking CXX executable libc/test/src/math/smoke/libc.test.src.math.smoke.sin_test.__hermetic__.__build__
[2614/3298] Running hermetic test libc.test.src.math.smoke.fminimumf_test.__hermetic__

@llvm-ci
Collaborator

llvm-ci commented Dec 8, 2025

LLVM Buildbot has detected a new failure on builder sanitizer-x86_64-linux running on sanitizer-buildbot1 while building llvm at step 2 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/66/builds/23325

Here is the relevant piece of the build log for reference:
Step 2 (annotate) failure: 'python ../sanitizer_buildbot/sanitizers/zorg/buildbot/builders/sanitizers/buildbot_selector.py' (failure)
...
[229/235] Generating MSAN_INST_GTEST.gtest-all.cc.x86_64-with-call.o
[230/235] Generating MSAN_INST_GTEST.gtest-all.cc.x86_64.o
[231/235] Generating MSAN_INST_TEST_OBJECTS.msan_test.cpp.x86_64-with-call.o
[232/235] Generating Msan-x86_64-with-call-Test
[233/235] Generating MSAN_INST_TEST_OBJECTS.msan_test.cpp.x86_64.o
[234/235] Generating Msan-x86_64-Test
[234/235] Running compiler_rt regression tests
llvm-lit: /home/b/sanitizer-x86_64-linux/build/llvm-project/llvm/utils/lit/lit/main.py:74: note: The test suite configuration requested an individual test timeout of 0 seconds but a timeout of 900 seconds was requested on the command line. Forcing timeout to be 900 seconds.
-- Testing: 6148 tests, 64 workers --
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90
FAIL: XRay-x86_64-linux :: TestCases/Posix/basic-filtering.cpp (5807 of 6148)
******************** TEST 'XRay-x86_64-linux :: TestCases/Posix/basic-filtering.cpp' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 4
/home/b/sanitizer-x86_64-linux/build/build_default/bin/clang  --driver-mode=g++ -fxray-instrument  -m64 -nobuiltininc -I/home/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/include -idirafter /home/b/sanitizer-x86_64-linux/build/build_default/lib/clang/22/include -resource-dir=/home/b/sanitizer-x86_64-linux/build/compiler_rt_build -Wl,-rpath,/home/b/sanitizer-x86_64-linux/build/compiler_rt_build/lib/linux   -std=c++11 /home/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/xray/TestCases/Posix/basic-filtering.cpp -o /home/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/xray/X86_64LinuxConfig/TestCases/Posix/Output/basic-filtering.cpp.tmp -g
# executed command: /home/b/sanitizer-x86_64-linux/build/build_default/bin/clang --driver-mode=g++ -fxray-instrument -m64 -nobuiltininc -I/home/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/include -idirafter /home/b/sanitizer-x86_64-linux/build/build_default/lib/clang/22/include -resource-dir=/home/b/sanitizer-x86_64-linux/build/compiler_rt_build -Wl,-rpath,/home/b/sanitizer-x86_64-linux/build/compiler_rt_build/lib/linux -std=c++11 /home/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/xray/TestCases/Posix/basic-filtering.cpp -o /home/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/xray/X86_64LinuxConfig/TestCases/Posix/Output/basic-filtering.cpp.tmp -g
# note: command had no output on stdout or stderr
# RUN: at line 5
rm -f basic-filtering-*
# executed command: rm -f 'basic-filtering-*'
# note: command had no output on stdout or stderr
# RUN: at line 6
env XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic verbosity=1      xray_logfile_base=basic-filtering-      xray_naive_log_func_duration_threshold_us=1000      xray_naive_log_max_stack_depth=2"  /home/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/xray/X86_64LinuxConfig/TestCases/Posix/Output/basic-filtering.cpp.tmp 2>&1 |      FileCheck /home/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/xray/TestCases/Posix/basic-filtering.cpp
# executed command: env 'XRAY_OPTIONS=patch_premain=true xray_mode=xray-basic verbosity=1      xray_logfile_base=basic-filtering-      xray_naive_log_func_duration_threshold_us=1000      xray_naive_log_max_stack_depth=2' /home/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/xray/X86_64LinuxConfig/TestCases/Posix/Output/basic-filtering.cpp.tmp
# note: command had no output on stdout or stderr
# executed command: FileCheck /home/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/xray/TestCases/Posix/basic-filtering.cpp
# note: command had no output on stdout or stderr
# RUN: at line 11
ls basic-filtering-* | head -1 | tr -d '\n' > /home/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/xray/X86_64LinuxConfig/TestCases/Posix/Output/basic-filtering.cpp.tmp.log
# executed command: ls 'basic-filtering-*'
# note: command had no output on stdout or stderr
# executed command: head -1
# note: command had no output on stdout or stderr
# executed command: tr -d '\n'
# note: command had no output on stdout or stderr
# RUN: at line 12
/home/b/sanitizer-x86_64-linux/build/build_default/./bin/llvm-xray convert --symbolize --output-format=yaml -instr_map=/home/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/xray/X86_64LinuxConfig/TestCases/Posix/Output/basic-filtering.cpp.tmp      "/home/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/xray/X86_64LinuxConfig/TestCases/Posix/basic-filtering-basic-filtering.cpp.tmp.RqkfRk" |      FileCheck /home/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/xray/TestCases/Posix/basic-filtering.cpp --check-prefix TRACE
# executed command: /home/b/sanitizer-x86_64-linux/build/build_default/./bin/llvm-xray convert --symbolize --output-format=yaml -instr_map=/home/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/xray/X86_64LinuxConfig/TestCases/Posix/Output/basic-filtering.cpp.tmp '%{readfile:/home/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/xray/X86_64LinuxConfig/TestCases/Posix/Output/basic-filtering.cpp.tmp.log}'
# note: command had no output on stdout or stderr
# executed command: FileCheck /home/b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/xray/TestCases/Posix/basic-filtering.cpp --check-prefix TRACE
# note: command had no output on stdout or stderr
# RUN: at line 15
rm -f basic-filtering-*
# executed command: rm -f 'basic-filtering-*'
# note: command had no output on stdout or stderr
# RUN: at line 18
Step 16 (test standalone compiler-rt) failure: test standalone compiler-rt (failure)
...

jplehr added a commit that referenced this pull request Dec 8, 2025
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Dec 8, 2025
@harrisonGPU harrisonGPU restored the amdgpu/promote-vector branch December 9, 2025 02:43
honeygoyal pushed a commit to honeygoyal/llvm-project that referenced this pull request Dec 9, 2025
This patch adds support for the pattern:
```llvm
  %index = select i1 %idx_sel, i32 0, i32 4
  %elt = getelementptr inbounds i8, ptr addrspace(5) %alloca, i32 %index
```
by scaling the byte offset to an element index (index >> log2(ElemSize)),
allowing the vector element to be updated with insertelement instead of
using scratch memory.
honeygoyal pushed a commit to honeygoyal/llvm-project that referenced this pull request Dec 9, 2025
NewInsts.push_back(NewInst);

Offset = Scaled;
OffsetQuot = APInt(BW, 1);
Contributor

Sorry for looking into this so late. This is wrong. The case we want to optimize is when (VarOffset.first * VarOffset.second) % VecElemSize == 0 but VarOffset.second % VecElemSize != 0. To calculate the vector index, you need VarOffset.first * VarOffset.second / VecElemSize. Here you reset OffsetQuot to one, so you are actually dropping VarOffset.second. I can also see that this change will conflict with #170512, since the code below was mostly moved away there and the NewInsts argument was removed. I would like us to do some further refactoring based on #170512: rename the function GEPToVectorIndex to isPtrOffsetAlignedToElementSize(), have it just return the three components if the offset is properly aligned, and do the vector index calculation later.
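A standalone C++ sketch of the arithmetic described above, with hypothetical helper names and unsigned offsets for simplicity (the pass itself uses signed `APInt` arithmetic such as `APInt::sdivrem`). Here `index` and `scale` play the roles of `VarOffset.first` and `VarOffset.second`. With a GEP over `i16` (scale 2) and 4-byte elements, indices 2 and 4 address bytes 4 and 8, that is, elements 1 and 2, whereas shifting the raw index right by 2 gives 0 and 1:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// The variable part of a GEP contributes index * scale bytes, so the vector
// index must be derived from the full byte offset, not from index alone.
std::optional<uint64_t> vectorIndex(uint64_t index, uint64_t scale,
                                    uint64_t elemSize) {
  assert(elemSize > 0);
  uint64_t byteOffset = index * scale;
  if (byteOffset % elemSize != 0)
    return std::nullopt; // not on an element boundary: do not promote
  return byteOffset / elemSize;
}

// Models the behavior criticized above: index >> log2(ElemSize) with
// OffsetQuot reset to 1, which ignores the scale.
uint64_t indexDroppingScale(uint64_t index, unsigned log2ElemSize) {
  return index >> log2ElemSize;
}
```

For an `i8` GEP (scale 1) the two agree, which is why the original `i8` tests did not expose the problem.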

%vec = load <3 x float>, ptr %buffer
store <3 x float> %vec, ptr addrspace(5) %alloca
%index = select i1 %idx_sel, i32 4, i32 8
%elt = getelementptr inbounds nuw i8, ptr addrspace(5) %alloca, i32 %index
Contributor

It would be better to change this to a GEP of i16, so that we have a case where the multiplier of the variable offset part is not 1.

ret void
}

define amdgpu_kernel void @scalar_alloca_vector_gep_i8_4_or_5_no_promote(ptr %buffer, float %data, i1 %idx_sel) {
Contributor

Maybe switch to another calling convention so that the alloca is kept unchanged (not promoted to LDS)?

6 participants