Reorganize project folders by jayhawk-commits · Pull Request #6 · ROCm/rocm-libraries

jayhawk-commits · 2025-04-30T16:23:04Z

Moving ROCm projects into top-level projects and shared directories.
Renaming their directories to be consistent with package naming.

Adds the cache variable INTERNAL_MRISAS to the CMake build scripts, and allows CI to set these using secret files.

Cleanup GPU test filter Ignore .venv

* Use TEST_F with fixture tests * fixed some out of bounds cast_to_f8 * Make Settings threadsafe

* Add FP6 and BF6 GEMM tests with FP32 output * Fix incorrect test names and setting * Updated emulator test suite

* added new filters * skip some tests * run rrperf for only f8, f6, f4

* Implement fixed scale matrix multiplication * Add more data types to scale mfma and new MM and GEMM tests * Make ScaledMatrixMultiplyGenerator be a single component and update expression test

* Add unpackToFloat helpers. * Consolidate CPUMM flavours. Multi-dispatch for the win! * Add checks for mixed MFMA inputs. Must be low precision. * Add elementBlockSize and elementBlockStide to RegisterExpressionAttributes. * Add addLoadWaveTileCTF8F6F4 coordinate transform. * MMT: Add TA and TB template parameters to support mixed formats. * MMT: Adjust tolerances to accomdate larger input range. * MMT: Add mixed F8/F6/F4 tests. * Add mixed tests to runtests.sh. * MMT: Disable MatrixMultiplyABF8_16x16x128 because it's NN.

* Add mixed gemm client tests * Change TEST_F to TEST_P * Factor out helper functions * Add mixed gemm unit tests * Add scaled mixed gemm unit tests * Update range for f6f4 * Minor changes * Minor changes * Disable _NT tests for f8f6f4 gemm * Enable mixed mat-mul * Fix typo * Remove comments --------- Co-authored-by: Yiqian Liu <yiqialiu@amd.com>

* Add scaled mixed-precision matrix multiplication tests

* added a GPU convert unit test * include workgroup info in the index calculation * use command to launch kernel * added comments * added conversion test dir

Co-authored-by: Yiqian Liu <yiqialiu@amd.com> Co-authored-by: Carvalho, Joao <joao.carvalho@amd.com>

…a types

* Parameterize scale value * Add more test values * Add TODO note --------- Co-authored-by: Yiqian Liu <yiqialiu@amd.com> Co-authored-by: Joao P. L. de Carvalho <joao.carvalho@amd.com>

Co-authored-by: Carvalho, Joao <joao.carvalho@amd.com>

Tested different macrotile sizes given the constraints of time in the emulator; Prefetch was tested with smaller scope as more aggressive prefetch runs out or registers or "too large" imm offset issues; Tests with streamK=True fail with "RNorm too large".

Co-authored-by: Carvalho, Joao <joao.carvalho@amd.com>

* Added test always fails if locks in Settings_impl.hpp are commented out; * Restored CI threads to 32 instead of nproc (16 threads); * Set OMP_NUM_THREADS=8 for our tests; and * Updates Rocm build.

* QA for Fix inf/nan/clip for convert from FP32/FP16 to OCP FP8

* Update Rocm build * Remove removed abid flag Also disables tensile_benchmarks for now and removes vgpr hack as assembler bug was fix for scaled MFMAs * Run clang-format * Remove withFixedRegisters * Format * Placate pylint. * Remove m_fakedRegisters. --------- Co-authored-by: Yiqian Liu <yiqialiu@amd.com> Co-authored-by: Matthew Emmett <matthew.emmett@amd.com>

Co-authored-by: Yiqian Liu <yiqialiu@amd.com>

createObserver leverages constexpr Add CObserverRuntime and CObserverConst

…om assembly. The command-line parser has been replaced by CLI11. The GEMM client now has three modes: generate, validate, and benchmark. These are described in the GEMM.rst document. The StreamK numWGs is now a kernel argument; and the required scratch space is a launch-time expression instead of a translate-time expression. Added rocRoller::Version::Git().

Add a RandomNumber Expression to emit code to generate a random number

…being used after being freed. (#5220) ## Motivation A `heap-use-after-free` error was triggered by AddressSanitizer on test `CPU_Dump_NAN_FP32.testDump`. ## Technical Details Root Cause Analysis: The AddressSanitizer error occurred because the HIPOCProgramImpl constructor was not storing the binary data passed to it. When LoadProgram called LoadBinary and created a HIPOCProgram with the returned vector, the temporary vector would go out of scope, but COMGR still needed to access the binary data later, causing a use-after-free. - The fix ensures that the HIPOCProgramImpl object owns the binary data for its entire lifetime - Both constructors now consistently store the binary data in the `binary` member variable (std::vector) - The uint8_t constructor converts the data to char format using iterator range construction - This prevents the use-after-free that occurred when COMGR tried to access freed memory ## Test Plan Test output before change: ``` HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*" PRNG seed: 12345678 Note: Google Test filter = *CPU_Dump_NAN_FP32* [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from CPU_Dump_NAN_FP32 [ RUN ] CPU_Dump_NAN_FP32.testDump ================================================================= ==3639==ERROR: AddressSanitizer: heap-use-after-free on address 0x7e0f08c50200 at pc 0x7f5f8d7a6554 bp 0x7ffcb7c4a730 sp 0x7ffcb7c49ee8 READ of size 26088 at 0x7e0f08c50200 thread T0 #0 0x7f5f8d7a6553 in memcpy /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:117:5 #1 0x7f5f23d61d78 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 #2 0x7f5f23d61d78 in COMGR::DataObject::setData(llvm::StringRef) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:334:17 #3 0x7f5f23d61d78 in amd_comgr_set_data /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:606:24 #4 0x7f5f221dc1d3 in amd::Comgr::set_data(amd_comgr_data_s, unsigned long, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/comgrctx.hpp:252:12 #5 0x7f5f221dc1d3 in amd::device::Program::getSymbolsFromCodeObj(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, amd_comgr_symbol_type_s) const /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/devprogram.cpp:2061:14 #6 0x7f5f219e6f7c in hip::DynCO::populateDynGlobalVars() /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:216:22 #7 0x7f5f219e8e6a in hip::DynCO::getDynFunc(ihipModuleSymbol_t**, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:125:22 #8 0x7f5f21f842ba in hip::PlatformState::GetDynFunc(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_platform.cpp:884:22 #9 0x7f5f21ec2d71 in hip::hipModuleGetFunction(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_module.cpp:89:47 #10 0x7f5f2212c588 in hipModuleGetFunction /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_table_interface.cpp:1926:10 #11 0x7f5f7478d806 in miopen::HIPOCKernel::HIPOCKernel(miopen::HIPOCProgram, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::vector<unsigned long, std::allocator<unsigned long>>, std::vector<unsigned long, std::allocator<unsigned long>>) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/include/miopen/hipoc_kernel.hpp:225:25 #12 0x7f5f766febb7 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:161:18 #13 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 #14 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 #15 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 #16 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 #17 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 #18 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 #19 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2728:50 #20 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2718:6 #21 0x55e87ef04e64 in testing::TestInfo::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2874:14 #22 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3052:33 #23 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3006:6 #24 0x55e87ef0d20b in testing::internal::UnitTestImpl::RunAllTests() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:6004:47 #25 0x55e87ef1a1de in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 #26 0x55e87ef1a1de in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 #27 0x55e87ef051b5 in testing::UnitTest::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:5583:55 #28 0x55e87eee3f9b in RUN_ALL_TESTS() /data/nhanna/repos/TheRock/build/third-party/googletest/dist/include/gtest/gtest.h:2334:73 #29 0x55e87eee3f9b in main /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/main_hip.cpp:34:12 #30 0x7f5f20a587e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: 889235a2805b8308b2d0274921bbe1890e9a1986) #31 0x55e87b0bcf2d in _start (/data/nhanna/repos/TheRock/build/ml-libs/MIOpen/build/bin/miopen_gtest+0x126bf2d) 0x7e0f08c50200 is located 0 bytes inside of 26088-byte region [0x7e0f08c50200,0x7e0f08c567e8) freed by thread T0 here: #0 0x7f5f8d7b8ba2 in operator delete(void*, unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:190:3 #1 0x7f5f76b7317d in std::__new_allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:172:2 #2 0x7f5f76b7317d in std::allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:210:25 #3 0x7f5f76b7317d in std::allocator_traits<std::allocator<char>>::deallocate(std::allocator<char>&, char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:517:13 #4 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::_M_deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:390:4 #5 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::~_Vector_base() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:369:2 #6 0x7f5f76b7317d in std::vector<char, std::allocator<char>>::~vector() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:738:7 #7 0x7f5f76b7317d in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:633:5 #8 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30 #9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26 #10 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 #11 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 #12 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 #13 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 #14 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 #15 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 previously allocated by thread T0 here: #0 0x7f5f8d7b7f9d in operator new(unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:109:35 #1 0x7f5f76b720a5 in std::__new_allocator<char>::allocate(unsigned long, void const*) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:151:27 #2 0x7f5f76b720a5 in std::allocator<char>::allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:198:32 #3 0x7f5f76b720a5 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:482:20 #4 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:381:20 #5 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:398:33 #6 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:335:9 #7 0x7f5f76b720a5 in std::vector<char, std::allocator<char>>::vector(std::vector<char, std::allocator<char>> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:602:9 #8 0x7f5f76b720a5 in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:623:27 #9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30 #10 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26 #11 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 #12 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 #13 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 #14 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 #15 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 #16 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 SUMMARY: AddressSanitizer: heap-use-after-free /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) Shadow bytes around the buggy address: 0x7e0f08c4ff80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa =>0x7e0f08c50200:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50280: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50300: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50380: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50400: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50480: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==3639==ABORTING ``` ## Test Result Test output after change: ``` HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*" PRNG seed: 12345678 Note: Google Test filter = *CPU_Dump_NAN_FP32* [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from CPU_Dump_NAN_FP32 [ RUN ] CPU_Dump_NAN_FP32.testDump [ OK ] CPU_Dump_NAN_FP32.testDump (51 ms) [----------] 1 test from CPU_Dump_NAN_FP32 (51 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test suite ran. (52 ms total) [ PASSED ] 1 test. ``` ## Cline Analysis ### Test Coverage Analysis: __1. LoadProgram Code Path (std::vector constructor):__ - __Primary Test__: `rocm-libraries/projects/miopen/test/gtest/db_sync.cpp` - __Function__: `BuildKernel()` calls `handle.LoadProgram(program_file, program_args, "")` - __Coverage__: This test extensively exercises the LoadProgram → LoadBinary → HIPOCProgramImpl constructor path - __Scope__: Tests multiple GPU architectures (gfx908, gfx90a, gfx942, gfx1030) with different CU counts - __Frequency__: Runs on thousands of kernel configurations in the database sync tests __2. Solution Binary Serialization (std::vector usage):__ - __Primary Test__: `rocm-libraries/projects/miopen/test/gtest/find_2_conv.cpp` - __Function__: `miopenSaveSolution()` and `miopenLoadSolution()` with `std::vector<char> solution_binary` - __Coverage__: Tests the save/load cycle of solution binaries - __Scope__: Tests all convolution directions (Forward, BackwardData, BackwardWeights) __3. Additional Coverage:__ - __Cache Tests__: `rocm-libraries/projects/miopen/test/gtest/cache.cpp` tests compression/decompression with `std::vector<char>` - __Dropout Tests__: Uses `std::vector<unsigned char>` for reserve space (related pattern) __Test Quality Assessment:__ ✅ __Both constructors are well-tested__: - The `std::vector<char>` constructor is heavily exercised through database sync tests - The `std::vector<uint8_t>` constructor would be tested through any code paths that use uint8_t binary data ✅ __Real-world scenarios covered__: - Database synchronization (production kernel loading) - Solution serialization (runtime binary handling) - Multi-threaded execution (db_sync uses up to 32 threads) ✅ __Comprehensive architecture coverage__: - Tests run on multiple GPU architectures - Different compute unit configurations tested __Confidence Level__: Very High ### Performance Analysis: Regarding the performance impact of this fix, it's actually quite minimal and represents good engineering practice: __Memory Impact:__ - __Additional Memory Usage__: Each HIPOCProgramImpl object now stores a copy of the binary data in its `binary` member variable - __Typical Size__: GPU code objects are usually relatively small (typically a few KB to a few MB depending on kernel complexity) - __Lifetime__: The memory is only held for the lifetime of the HIPOCProgram object, which is typically short-lived during kernel loading __Performance Characteristics:__ - __One-time Copy Cost__: There's a single memory copy operation during construction (std::vector copy or iterator range construction) - __No Runtime Overhead__: Once constructed, there's no additional performance cost during kernel execution - __Memory Safety Benefit__: Eliminates potential crashes and undefined behavior, which far outweighs the small memory cost __Context in MIOpen:__ - This occurs during the kernel loading phase, not during actual ML inference/training - Kernel loading is already an expensive operation involving compilation, module creation, etc. - The additional memory copy is negligible compared to the overall kernel loading time __Trade-off Analysis:__ - __Cost__: Small increase in memory usage during kernel loading - __Benefit__: Eliminates memory safety bugs that could cause crashes or data corruption - __Net Result__: Significantly positive - reliability and correctness are much more valuable than the minimal memory overhead In practice, this fix follows the RAII (Resource Acquisition Is Initialization) principle and ensures proper ownership semantics, which is standard best practice in modern C++. The performance impact should be unnoticeable in real-world usage.

## Motivation  This PR addresses hipDNN issue ROCm#4951, which requests adding missing frontend integration test coverage for the Matmul operation. ## Technical Details  This PR includes three changes: ### 1. FrontendGraphFactory support for Matmu Added MATMUL to OperationType. Added switch‑case dispatch in FrontendGraphFactory::create(). Implemented createMatmulGraph() using: ```cpp graph.matmul(a, b, matmulAttrs); ``` with simple 2×3 and 3×4 matrix inputs for deterministic testing. ### 2. Added new integration test: IntegrationMatmul.cpp Following the structure of IntegrationConvForward.cpp, the test: - is parameterized with: - good plugin - execute‑fail plugin - no‑engines plugin - tests both auto‑assigned and manual UIDs - builds a small Matmul graph using float tensors - exercises the entire frontend execution pipeline: ``` validate() → build_operation_graph() → create_execution_plans() → check_support() → build_plans() → get_workspace_size() → execute() ``` - uses SKIP_IF_NO_DEVICES() for GPU‑dependent execution - creates variant packs using device memory from the test tensor bundle - verifies expected failures for execute‑fail and no‑engines plugins ### 3. CMake update Added IntegrationMatmul.cpp to tests/frontend/CMakeLists.txt under public_hipdnn_frontend_tests. ## Test Plan  All tests were built and executed inside the official TheRock docker environment. ## Test Result  ``` [1/2] Validating test names with --gtest_list_tests test collection Test Name Validation Report ============================================================ Total tests found: 2901 Valid test names: 2901 Invalid test names: 0 ``` ``` [1/2] Running all tests via ctest Test project /therock/output/build/ml-libs/hipDNN/build Start 1: hipdnn_data_sdk_tests 1/7 Test ROCm#1: hipdnn_data_sdk_tests ............ Passed 0.96 sec Start 2: hipdnn_backend_tests 2/7 Test ROCm#2: hipdnn_backend_tests ............. Passed 1.38 sec Start 3: hipdnn_frontend_tests 3/7 Test ROCm#3: hipdnn_frontend_tests ............ Passed 0.05 sec Start 4: hipdnn_test_sdk_tests 4/7 Test ROCm#4: hipdnn_test_sdk_tests ............ Passed 8.19 sec Start 5: hipdnn_plugin_sdk_tests 5/7 Test ROCm#5: hipdnn_plugin_sdk_tests .......... Passed 0.03 sec Start 6: public_hipdnn_backend_tests 6/7 Test ROCm#6: public_hipdnn_backend_tests ...... Passed 0.32 sec Start 7: public_hipdnn_frontend_tests 7/7 Test ROCm#7: public_hipdnn_frontend_tests ..... Passed 0.35 sec ``` ``` 100% tests passed, 0 tests failed out of 7 Label Time Summary: integration_test = 0.67 sec*proc (2 tests) unit_test = 10.61 sec*proc (5 tests) Total Test time (real) = 11.29 sec ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: jovanau <u.jovana2@gmail.com>

…being used after being freed. (ROCm#5220) ## Motivation A `heap-use-after-free` error was triggered by AddressSanitizer on test `CPU_Dump_NAN_FP32.testDump`. ## Technical Details Root Cause Analysis: The AddressSanitizer error occurred because the HIPOCProgramImpl constructor was not storing the binary data passed to it. When LoadProgram called LoadBinary and created a HIPOCProgram with the returned vector, the temporary vector would go out of scope, but COMGR still needed to access the binary data later, causing a use-after-free. - The fix ensures that the HIPOCProgramImpl object owns the binary data for its entire lifetime - Both constructors now consistently store the binary data in the `binary` member variable (std::vector) - The uint8_t constructor converts the data to char format using iterator range construction - This prevents the use-after-free that occurred when COMGR tried to access freed memory ## Test Plan Test output before change: ``` HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*" PRNG seed: 12345678 Note: Google Test filter = *CPU_Dump_NAN_FP32* [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from CPU_Dump_NAN_FP32 [ RUN ] CPU_Dump_NAN_FP32.testDump ================================================================= ==3639==ERROR: AddressSanitizer: heap-use-after-free on address 0x7e0f08c50200 at pc 0x7f5f8d7a6554 bp 0x7ffcb7c4a730 sp 0x7ffcb7c49ee8 READ of size 26088 at 0x7e0f08c50200 thread T0 #0 0x7f5f8d7a6553 in memcpy /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:117:5 ROCm#1 0x7f5f23d61d78 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 ROCm#2 0x7f5f23d61d78 in COMGR::DataObject::setData(llvm::StringRef) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:334:17 ROCm#3 0x7f5f23d61d78 in amd_comgr_set_data /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:606:24 ROCm#4 0x7f5f221dc1d3 in amd::Comgr::set_data(amd_comgr_data_s, unsigned long, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/comgrctx.hpp:252:12 ROCm#5 0x7f5f221dc1d3 in amd::device::Program::getSymbolsFromCodeObj(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, amd_comgr_symbol_type_s) const /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/devprogram.cpp:2061:14 ROCm#6 0x7f5f219e6f7c in hip::DynCO::populateDynGlobalVars() /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:216:22 ROCm#7 0x7f5f219e8e6a in hip::DynCO::getDynFunc(ihipModuleSymbol_t**, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:125:22 ROCm#8 0x7f5f21f842ba in hip::PlatformState::GetDynFunc(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_platform.cpp:884:22 ROCm#9 0x7f5f21ec2d71 in hip::hipModuleGetFunction(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_module.cpp:89:47 ROCm#10 0x7f5f2212c588 in hipModuleGetFunction /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_table_interface.cpp:1926:10 ROCm#11 0x7f5f7478d806 in miopen::HIPOCKernel::HIPOCKernel(miopen::HIPOCProgram, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::vector<unsigned long, std::allocator<unsigned long>>, std::vector<unsigned long, std::allocator<unsigned long>>) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/include/miopen/hipoc_kernel.hpp:225:25 ROCm#12 0x7f5f766febb7 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:161:18 ROCm#13 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 ROCm#14 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 ROCm#15 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 ROCm#16 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 ROCm#17 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 ROCm#18 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 ROCm#19 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2728:50 ROCm#20 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2718:6 ROCm#21 0x55e87ef04e64 in testing::TestInfo::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2874:14 ROCm#22 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3052:33 ROCm#23 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3006:6 ROCm#24 0x55e87ef0d20b in testing::internal::UnitTestImpl::RunAllTests() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:6004:47 ROCm#25 0x55e87ef1a1de in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 ROCm#26 0x55e87ef1a1de in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 ROCm#27 0x55e87ef051b5 in testing::UnitTest::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:5583:55 ROCm#28 0x55e87eee3f9b in RUN_ALL_TESTS() /data/nhanna/repos/TheRock/build/third-party/googletest/dist/include/gtest/gtest.h:2334:73 ROCm#29 0x55e87eee3f9b in main /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/main_hip.cpp:34:12 ROCm#30 0x7f5f20a587e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: 889235a2805b8308b2d0274921bbe1890e9a1986) ROCm#31 0x55e87b0bcf2d in _start (/data/nhanna/repos/TheRock/build/ml-libs/MIOpen/build/bin/miopen_gtest+0x126bf2d) 0x7e0f08c50200 is located 0 bytes inside of 26088-byte region [0x7e0f08c50200,0x7e0f08c567e8) freed by thread T0 here: #0 0x7f5f8d7b8ba2 in operator delete(void*, unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:190:3 ROCm#1 0x7f5f76b7317d in std::__new_allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:172:2 ROCm#2 0x7f5f76b7317d in std::allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:210:25 ROCm#3 0x7f5f76b7317d in std::allocator_traits<std::allocator<char>>::deallocate(std::allocator<char>&, char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:517:13 ROCm#4 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::_M_deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:390:4 ROCm#5 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::~_Vector_base() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:369:2 ROCm#6 0x7f5f76b7317d in std::vector<char, std::allocator<char>>::~vector() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:738:7 ROCm#7 0x7f5f76b7317d in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:633:5 ROCm#8 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30 ROCm#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26 ROCm#10 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 ROCm#11 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 ROCm#12 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 ROCm#13 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 ROCm#14 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 ROCm#15 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 previously allocated by thread T0 here: #0 0x7f5f8d7b7f9d in operator new(unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:109:35 ROCm#1 0x7f5f76b720a5 in std::__new_allocator<char>::allocate(unsigned long, void const*) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:151:27 ROCm#2 0x7f5f76b720a5 in std::allocator<char>::allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:198:32 ROCm#3 0x7f5f76b720a5 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:482:20 ROCm#4 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:381:20 ROCm#5 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:398:33 ROCm#6 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:335:9 ROCm#7 0x7f5f76b720a5 in std::vector<char, std::allocator<char>>::vector(std::vector<char, std::allocator<char>> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:602:9 ROCm#8 0x7f5f76b720a5 in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:623:27 ROCm#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30 ROCm#10 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26 ROCm#11 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 ROCm#12 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 ROCm#13 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 ROCm#14 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 ROCm#15 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 ROCm#16 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 SUMMARY: AddressSanitizer: heap-use-after-free /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) Shadow bytes around the buggy address: 0x7e0f08c4ff80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa =>0x7e0f08c50200:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50280: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50300: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50380: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50400: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50480: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==3639==ABORTING ``` ## Test Result Test output after change: ``` HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*" PRNG seed: 12345678 Note: Google Test filter = *CPU_Dump_NAN_FP32* [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from CPU_Dump_NAN_FP32 [ RUN ] CPU_Dump_NAN_FP32.testDump [ OK ] CPU_Dump_NAN_FP32.testDump (51 ms) [----------] 1 test from CPU_Dump_NAN_FP32 (51 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test suite ran. (52 ms total) [ PASSED ] 1 test. ``` ## Cline Analysis ### Test Coverage Analysis: __1. LoadProgram Code Path (std::vector constructor):__ - __Primary Test__: `rocm-libraries/projects/miopen/test/gtest/db_sync.cpp` - __Function__: `BuildKernel()` calls `handle.LoadProgram(program_file, program_args, "")` - __Coverage__: This test extensively exercises the LoadProgram → LoadBinary → HIPOCProgramImpl constructor path - __Scope__: Tests multiple GPU architectures (gfx908, gfx90a, gfx942, gfx1030) with different CU counts - __Frequency__: Runs on thousands of kernel configurations in the database sync tests __2. Solution Binary Serialization (std::vector usage):__ - __Primary Test__: `rocm-libraries/projects/miopen/test/gtest/find_2_conv.cpp` - __Function__: `miopenSaveSolution()` and `miopenLoadSolution()` with `std::vector<char> solution_binary` - __Coverage__: Tests the save/load cycle of solution binaries - __Scope__: Tests all convolution directions (Forward, BackwardData, BackwardWeights) __3. Additional Coverage:__ - __Cache Tests__: `rocm-libraries/projects/miopen/test/gtest/cache.cpp` tests compression/decompression with `std::vector<char>` - __Dropout Tests__: Uses `std::vector<unsigned char>` for reserve space (related pattern) __Test Quality Assessment:__ ✅ __Both constructors are well-tested__: - The `std::vector<char>` constructor is heavily exercised through database sync tests - The `std::vector<uint8_t>` constructor would be tested through any code paths that use uint8_t binary data ✅ __Real-world scenarios covered__: - Database synchronization (production kernel loading) - Solution serialization (runtime binary handling) - Multi-threaded execution (db_sync uses up to 32 threads) ✅ __Comprehensive architecture coverage__: - Tests run on multiple GPU architectures - Different compute unit configurations tested __Confidence Level__: Very High ### Performance Analysis: Regarding the performance impact of this fix, it's actually quite minimal and represents good engineering practice: __Memory Impact:__ - __Additional Memory Usage__: Each HIPOCProgramImpl object now stores a copy of the binary data in its `binary` member variable - __Typical Size__: GPU code objects are usually relatively small (typically a few KB to a few MB depending on kernel complexity) - __Lifetime__: The memory is only held for the lifetime of the HIPOCProgram object, which is typically short-lived during kernel loading __Performance Characteristics:__ - __One-time Copy Cost__: There's a single memory copy operation during construction (std::vector copy or iterator range construction) - __No Runtime Overhead__: Once constructed, there's no additional performance cost during kernel execution - __Memory Safety Benefit__: Eliminates potential crashes and undefined behavior, which far outweighs the small memory cost __Context in MIOpen:__ - This occurs during the kernel loading phase, not during actual ML inference/training - Kernel loading is already an expensive operation involving compilation, module creation, etc. - The additional memory copy is negligible compared to the overall kernel loading time __Trade-off Analysis:__ - __Cost__: Small increase in memory usage during kernel loading - __Benefit__: Eliminates memory safety bugs that could cause crashes or data corruption - __Net Result__: Significantly positive - reliability and correctness are much more valuable than the minimal memory overhead In practice, this fix follows the RAII (Resource Acquisition Is Initialization) principle and ensures proper ownership semantics, which is standard best practice in modern C++. The performance impact should be unnoticeable in real-world usage.

…ernel (ROCm#6) * Add sample subtile impl * Fix issues when disabling subtile impl * GR Offset calculation (#1) * Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com> * Enable post-loop code generation, and add some subroutines * LR offset calculation (#2) * Add tests * as is * Add permlane16_swap instruction to rocisa * Ongoing progress * Draft for partition A0/A1 * Wave partitioning * Draft ref code in tests * Handle 1x4 wavesplit param * 2x2 test passing * Draft 1x4 LR wave partitioning * Fix alginement issue * Integration testing * Update integration test * Fix swizzling pattern on GRA. Only swizzling on even LDS rows * Subtile based test * testing A * Test both A and B * Remove graonly mode * Fix 1x4 case * Move global offset for B after rest of the logic * cleanup * cleanup * Fix ref test code for 4x1 * Fix spgr alloc issue * Remove tmp test file * Remove debug prints * Add test case * Add GR load emit logic, and misc fixes (#3) * gr emit fix * Emit LR + init ACCVGPR (#4) * Emit ds_reads * Add waits for LR and GR * Init Acc VGPR to Zero * Add missing bit_length on VLShiftLeftB32 * Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness * Fix gra test ref code for 1x4 * Remove some debug prints * Add loop and ptr update code * Update scale offset * Add tests * Address review * Add scale roundtrip e2e test and constraint assertions Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency across 4 tile configs x 2 matrices. Add power-of-2 assertion for scaleBlockSize and matching scaleBlockSize assertions for A/B in shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes instead of re-deriving MIWaveGroup from tile dimensions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update fixes * Fix scale being skipped * Add flag to print layout * Fix missed merge conflicts * Fix missed merge conflicts * Refactor scale rountrip test with gpu helper fns * Fix extra spaces * Fix tests --------- Co-authored-by: brianshi <brianshi@amd.com> Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com> Co-authored-by: b-shi <bbbrianme@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

…ernel (ROCm#6) * Add sample subtile impl * Fix issues when disabling subtile impl * GR Offset calculation (ROCm#1) * Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com> * Enable post-loop code generation, and add some subroutines * LR offset calculation (ROCm#2) * Add tests * as is * Add permlane16_swap instruction to rocisa * Ongoing progress * Draft for partition A0/A1 * Wave partitioning * Draft ref code in tests * Handle 1x4 wavesplit param * 2x2 test passing * Draft 1x4 LR wave partitioning * Fix alginement issue * Integration testing * Update integration test * Fix swizzling pattern on GRA. Only swizzling on even LDS rows * Subtile based test * testing A * Test both A and B * Remove graonly mode * Fix 1x4 case * Move global offset for B after rest of the logic * cleanup * cleanup * Fix ref test code for 4x1 * Fix spgr alloc issue * Remove tmp test file * Remove debug prints * Add test case * Add GR load emit logic, and misc fixes (ROCm#3) * gr emit fix * Emit LR + init ACCVGPR (ROCm#4) * Emit ds_reads * Add waits for LR and GR * Init Acc VGPR to Zero * Add missing bit_length on VLShiftLeftB32 * Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness * Fix gra test ref code for 1x4 * Remove some debug prints * Add loop and ptr update code * Update scale offset * Add tests * Address review * Add scale roundtrip e2e test and constraint assertions Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency across 4 tile configs x 2 matrices. Add power-of-2 assertion for scaleBlockSize and matching scaleBlockSize assertions for A/B in shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes instead of re-deriving MIWaveGroup from tile dimensions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update fixes * Fix scale being skipped * Add flag to print layout * Fix missed merge conflicts * Fix missed merge conflicts * Refactor scale rountrip test with gpu helper fns * Fix extra spaces * Fix tests --------- Co-authored-by: brianshi <brianshi@amd.com> Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com> Co-authored-by: b-shi <bbbrianme@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

…ernel (ROCm#6) * Add sample subtile impl * Fix issues when disabling subtile impl * GR Offset calculation (#1) * Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com> * Enable post-loop code generation, and add some subroutines * LR offset calculation (#2) * Add tests * as is * Add permlane16_swap instruction to rocisa * Ongoing progress * Draft for partition A0/A1 * Wave partitioning * Draft ref code in tests * Handle 1x4 wavesplit param * 2x2 test passing * Draft 1x4 LR wave partitioning * Fix alginement issue * Integration testing * Update integration test * Fix swizzling pattern on GRA. Only swizzling on even LDS rows * Subtile based test * testing A * Test both A and B * Remove graonly mode * Fix 1x4 case * Move global offset for B after rest of the logic * cleanup * cleanup * Fix ref test code for 4x1 * Fix spgr alloc issue * Remove tmp test file * Remove debug prints * Add test case * Add GR load emit logic, and misc fixes (#3) * gr emit fix * Emit LR + init ACCVGPR (#4) * Emit ds_reads * Add waits for LR and GR * Init Acc VGPR to Zero * Add missing bit_length on VLShiftLeftB32 * Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness * Fix gra test ref code for 1x4 * Remove some debug prints * Add loop and ptr update code * Update scale offset * Add tests * Address review * Add scale roundtrip e2e test and constraint assertions Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency across 4 tile configs x 2 matrices. Add power-of-2 assertion for scaleBlockSize and matching scaleBlockSize assertions for A/B in shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes instead of re-deriving MIWaveGroup from tile dimensions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update fixes * Fix scale being skipped * Add flag to print layout * Fix missed merge conflicts * Fix missed merge conflicts * Refactor scale rountrip test with gpu helper fns * Fix extra spaces * Fix tests --------- Co-authored-by: brianshi <brianshi@amd.com> Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com> Co-authored-by: b-shi <bbbrianme@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# Add gfx950 MXFP4 Subtile-based kernel implementation ## Summary This PR is a follow-up to #6499 ([hipblaslt] Add support for gfx950 mxfp4) and adds the **Subtile-based kernel implementation (`UseSubtileImpl=1`)** for hipBLASLt on **gfx950**. It introduces a new tile-decomposed code generation path optimized for **MXFP4** and **BF16** GEMMs, plus the solution-selection plumbing, validation, Origami logic yamls, and unit tests needed to make it production-usable. ## Motivation PR #6499 brought MX data type support online for gfx950, but the existing TensileLite codegen path leaves significant performance on the table for MXFP4-heavy workloads. The Subtile path restructures global-read / local-read / MFMA / store scheduling at a finer granularity, which **greatly improves MXFP4 GEMM performance when using `HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT`** (added to the hipBLASLt CHANGELOG). ## What's included ### 1. New Subtile-based kernel components (Tensile) New modules under `projects/hipblaslt/tensilelite/Tensile/Components/`: * `SubtileBasedKernel.py` (~1850 LOC) — entry point and orchestration of the subtile codegen path; replaces large portions of the standard prefetch / unroll / store flow when `UseSubtileImpl=1`. * `SubtileBasedLogicalScheduler.py` (~2415 LOC) — logical scheduler that builds the subtile-grained instruction graph (GR loads, LR offsets, MFMA tiles, scale loads, stores) from kernel parameters. * `SubtileBasedInstructionScheduler.py` (~433 LOC) — converts the logical schedule to an emit order respecting wave / register / hazard constraints. * `SubtileBasedInstructionEmitter.py` (~216 LOC) — instruction emission helpers shared by the subtile components. ### 2. Kernel writer / common changes * **`KernelWriter.py`**, **`KernelWriterAssembly.py`**: integration points for the subtile path — prefetch, GR offset calculation, LR offset calculation, post-loop, MFMA macro accounting, optimized `storeD`, LDS buffer swap, MX FP4 scale emit, `SrdMXSA/B+2` handling, sgpr allocation / overflow guards, computeLoadSrd fix. * **`SolutionStructs/Solution.py`**, **`SolutionStructs/Problem.py`**: introduces the `UseSubtileImpl` parameter, MX-related reject conditions for non-Subtile paths on gfx950, and additional valid GEMM type combinations for MX inputs. * **`Common/ValidParameters.py`**, **`Common/RequiredParameters.py`**, **`Common/GlobalParameters.py`**: `UseSubtileImpl` registration and defaults. * **`Components/StreamK.py`**: subtile-aware StreamK fixup (incl. import union with the `BufferLoadB32` cache-coherence change from #6837). * **`Components/GlobalWriteBatch.py`**: optimized global write batching for the subtile path (~670 LOC of changes). * **`Components/ComputeStoreVgprs.py`**, **`Components/LSU.py`**, **`Components/WorkGroupMappingAlgos.py`**, **`AsmStoreState.py`**, **`KernelWriterModules.py`**: minor adjustments needed by the subtile pipeline. ### 3. rocisa / host / client * **`rocisa/rocisa/include/container.hpp`**: helpers needed by the new emitter. * **`tensile_host.cpp`**, **`include/Tensile/TensorDescriptor.hpp`**: small fixups for the subtile path and gfx950 build. * **`client/include/DataInitialization.hpp`**, **`client/src/DataInitialization.cpp`**, **`client/src/Reference.cpp`**, **`client/src/ReferenceValidator.cpp`**, **`client/include/TypedId.hpp`**: MX scale init and reference paths used by the new tests. * **`clients/common/include/testing_matmul.hpp`**, **`clients/common/include/norm.hpp`**, **`clients/common/include/hipblaslt_datatype2string.hpp`**, **`clients/common/src/mxDataGen.cpp`**: wiring for batched (>1) testing and MX init. ### 4. Origami / solution selection (gfx950 MXFP4) New auto-tuned logic yamls under `projects/hipblaslt/library/.../Tensile/Logic/asm_full/gfx950/gfx950/Origami/` covering the FP4 SS / HS / BS variants in three layouts: * `Origami/` (default) * `Origami/Origami_nta4/` (no-transpose-A FP4) * `Origami/Origami_ntb4/` (no-transpose-B FP4) (9 new `gfx950_Cijk_Alik_Bljk_F4{SS,HS,BS}_MXA32_MXB32_*_UserArgs.yaml` files in total.) ### 5. New tests **End-to-end gfx950 GEMM yamls** in `Tensile/Tests/common/gemm/gfx950/`: * `subtile_bf16.yaml`, `subtile_mxfp4.yaml` * `mx32f4_tn.yaml`, `mx32f8_tn.yaml` * `mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml` * `mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml` * `fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml` **StreamK + MX:** `Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml`, `sk_mx32f8_quick.yaml`. **New unit tests** (`Tensile/Tests/unit/`): * `test_SubtileBasedLogicalScheduler.py` (~1735 LOC) * `test_SubtileBasedSchedulerRef.py` (~596 LOC) * `test_gr_lr_roundtrip.py` (~571 LOC) * `test_storeD_roundtrip.py` (~2420 LOC) * `test_graTileAssignment.py` (~354 LOC) * `test_lraTileAssignment.py` (~360 LOC) * `conftest.py`, `gpu_test_helpers.py` shared fixtures (~601 LOC) **New gtest:** `tensilelite/tests/MXScalePadding_test.cpp`. ### 6. Misc / hardening * Reject conditions: gfx950 MX + non-Subtile, DepthU constraints, GroupGEMM not yet supported with StreamK + MX, AssertSummationElementMultiple=256 for subtile MXFP4, missing-mxblock check for non-MX types. * Skip rocRoller for FP4-A/FP4-B with pre-swizzled scale layout (#42). * `forceDenorm=False` in `generateMXInput` (#11). * Several rebase fixes, copyright/year header updates, and review-comment fixes to `KernelWriter` / `KernelWriterAssembly`. ### 7. CHANGELOG Greatly improved MXFP4 GEMM performance when using HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT ## How to use Set `UseSubtileImpl: 1` on a gfx950 MX-FP4 solution (see the new `subtile_mxfp4.yaml` / `mx32f4_tn.yaml` for canonical configs). The path is opt-in — non-MX and non-gfx950 kernels are unaffected. ## Backwards compatibility / risk * All new behavior is gated on `UseSubtileImpl=1` and gfx950. Existing solutions on other architectures or non-MX paths are unchanged. * `GroupGEMM + StreamK + MX` is intentionally rejected for now (TODO). * New Origami yamls only add solutions; nothing existing is modified. ## Test plan * New gtests + unit tests run automatically in CI (Tensilelite Python unit suite, `MXDataGen_test`, `MXScalePadding_test`). * New end-to-end gfx950 GEMM and StreamK yamls are added to the common test buckets. * Manual: run the gfx950 MXFP4 subtile suites (`pytest -k gfx950` after building Tensile, plus `tensilelite-client --yaml subtile_mxfp4.yaml` for sanity). ## Notes for reviewers * This branch was rebased onto current `develop` (post-#6499) by skipping the `users/nakajee/gfx950_mx_rebase_merge` history (which #6499 squash-merged) and replaying only the subtile-specific work as a single squashed commit. The actual code changes in this PR are limited to the files listed above (24 added, 56 modified; ~+170k / −2.6k including generated logic yamls). * The largest reviewable diffs are: * `Tensile/Components/SubtileBased{Kernel,LogicalScheduler,InstructionScheduler,InstructionEmitter}.py` (new files) * `Tensile/KernelWriter.py`, `Tensile/KernelWriterAssembly.py` * `Tensile/SolutionStructs/{Problem,Solution}.py` * `Tensile/Components/{GlobalWriteBatch,StreamK}.py` * `clients/common/include/testing_matmul.hpp` * `client/src/DataInitialization.cpp` * Description of all commits that were squashed for this feature branch: Subtile implementation for gfx950 MX FP4 --- 272f88d: Add sample subtile impl --- Author: brianshi <brianshi@amd.com> --- 60ecede: GR Offset calculation (#1) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- be69c1d: Enable post-loop code generation, and add some subroutines --- Author: b-shi <brianshi@amd.com> --- 646d102: LR offset calculation (#2) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- 71f4bca: Add GR load emit logic, and misc fixes (#3) --- Author: b-shi <brianshi@amd.com> --- 1fd0db9: Emit LR + init ACCVGPR (#4) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- 9d406b9: Add loop and ptr update code --- Author: b-shi <brianshi@amd.com> --- b6127bc: Update GR/LR offset calculation to fully support 2x2, 1x4, 4x1 waveConfigs (#7) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- 89ec87c: Account for valuC macro value in SK WS store code --- Author: b-shi <brianshi@amd.com> --- 6edf53d: Rebase fix --- Author: b-shi <brianshi@amd.com> --- 34e79fc: Enable fp4 (#8) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- d5a5c57: [Tensilelite] Add MX FP4 scale offset computation for subtile-based kernel (#6) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> --- 7a8a85a: Add lds buffer swap logic --- Author: b-shi <brianshi@amd.com> --- d24a8fe: Add optimized storeD code (#9) --- Author: b-shi <brianshi@amd.com> --- a45c20c: Fix MX scale tensor initialization: set forceDenorm=false in generateMXInput (#11) --- Author: T.J. Alumbaugh <T.J.Alumbaugh@amd.com> --- f945268: [Tensilelite] Enable the MX FP4 scale emit code in the subtile-based kernel (#10) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> --- cf37df4: Use fixed value for SrdMXSA/B+2 (#14) --- Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com> --- f0c8dbc: Merge subtile_mx_f4_schedule to subtile_mx branch (#16) --- Author: b-shi <brianshi@amd.com> --- 543796f: Enable DU > 256, and reduce sgpr allocation (#18) --- Author: b-shi <brianshi@amd.com> --- c65bdb0: Add missing mxblock check for non-mx data types --- Author: b-shi <brianshi@amd.com> --- d64d226: Introduce UseSubtileImpl parameter (#20) --- Author: b-shi <brianshi@amd.com> Squash commits 20-35 from subtile_mx branch --- e4780da: Enable FixSrd2 for A/B (#23) --- Author: b-shi <brianshi@amd.com> * Enable FixSrd2 for A/B * Address comments from PR --------- --- e4c64a7: Add nt libs --- Author: b-shi <brianshi@amd.com> --- cd13ec1: [Tensilelite] Pad MX scale tensor dimensions for unaligned problem sizes (#21) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> * Add scale padding * Add tests * Remove redundant pre-swizzle path * Remove code from conflict * Fix reverted mxdatagen path for tensile tests * Add diverse test cases for scale padding in MXScalePadding_test and subtile.yaml - Expanded test cases to include non-multiple-of-32, even non-multiple-of-16, and odd dimensions. --- d87938f: Split subtile.yaml into subtile_bf16.yaml and subtile_mxfp4.yaml (#22) --- Author: James Newling <james.newling@gmail.com> Replace the 'monolithic' subtile.yaml with two focused test files. All original test coverage is preserved. Two new FP4 groups added. BF16 coverage (subtile_bf16.yaml, tests are essentially unchanged): # | Description | Dest | MIs | PGR | DU | SK | Sizes --+--------------------+------+-----+-----+---------+-----+------ 0 | BF16 TN main | b | 19 | 0 | 64 | 0,3 | 11 1 | BF16 TN large DU | b | 4 | 0 | 128,192 | 0,3 | 7 2 | BSS (f32 output) | s | 6 | 0 | 64 | 0,3 | 9 3 | BF16 bias | b | 2 | 0 | 64 | 0 | 1 FP4 coverage (subtile_mxfp4.yaml): # | Description | Dest | MIs | PGR | DU | SK | Sizes | Status --+--------------------+------+-----+-----+-----+-----+-------+-------- 0 | FP4 TN main | b | 15 | 0 | 256 | 0,3 | 23 | from original 1 | FP4 TN large DU | b | 4 | 0 | 512 | 0,3 | 13 | from original 2 | F4SS (f32 output) | s | 5 | 0 | 256 | 0,3 | 13 | from original 3 | FP4 bias | b | 2 | 0 | 256 | 0 | 1 | from original 4 | FP4 PGR=2 | b | 13 | 2 | 256 | 0 | 5 | new 5 | FP4 expanded MIWT | b | 24 | 0 | 256 | 0 | 5 | new 6 | PGR=2 WG 4x1/1x4 | | 6 | 2 | 256 | 0 | 1 | known failures (commented) Run times on gfx950 (8x MI350X): File | NEV=-1 | NEV=0 -------------------+--------+------ subtile_bf16.yaml | 23s | 23s subtile_mxfp4.yaml | 37s | 40s Where NEV is number of elements to validate. I (James) have checked these numbers, and weirdly it is true that NEV=0 is a bit faster than NEV=-1 for mxfp4. --- af04f0d: Dependency based instruction scheduling (#19) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Revert to single partition * Start using dependencies * as is * start using separate EmittedModules * remove reduntant wait * Add _extractPathsFromBeforeDeps * Continue simplification * Simplifying * Add more rules * cleanup * Add fp4 test * fix test * Add tests * Remove after field on emittedmodule * Refactoring instructionSchedule * Add comments * cleanup modules vs ops * Refactoring print functions * Test cleanup * Add more tests * Replace subgroup by partition * Remove unused unroll param * Add high level notes * Simplify NLL and NGLL GR removal * Add some comments * Force instruction insertion if no slots available * Fix test after rebase * Move scale before A/B and track inflight count * Fine-grain vmcnt calculation * Separate counts for scaleA and B * Avoid using m0 update and buffer_lod on same MFMA slot to avoid scalar instruction serialization * Fix test * Add vmcnt test * Fix duplicated loads for 1x4 and 4x1 * Fix placement in reverse order * Fix regression on PGR0 * add fallback to numMFMA=1 --- 3ec902b: Add some 1x4 and 4x1 origami solutions --- Author: b-shi <brianshi@amd.com> --- c5000d3: Fix typo --- Author: b-shi <brianshi@amd.com> --- 226ed84: [hipblaslt] Refactor Srd2 calculation for useFixedSrd2 (#30) --- Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com> --- abf19d4: [Tensilelite] UseSubtileImpl: subtile-aligned edge check for store path (#29) --- Author: b-shi <brianshi@amd.com> * [Tensilelite] UseSubtileImpl: subtile-aligned edge check, OOB guard, and refactoring - Replace Size%MT edge check with subtile-aligned check: NonEdge paired store when trailing rows/cols are a multiple of the subtile block size (waveGroupM rows for M, 16 cols for N). Non-last workgroups always take NonEdge. - Add per-wave OOB guard (subtileM32ValidBlocksSgpr / subtileN16ValidBlocksSgpr) to skip stores outside valid M/N tile bounds in the NonEdge path. - Refactor duplicated OOB guard into _emitSubtileOobGuard helper; refactor M/N guard SGPR computation into _emitSubtileMGuard / _emitSubtileNGuard. - Fix orphan scalar store blockIdxM (was tt0, now (tt0*MatrixInstM)//mBlockSize). - Add quick-exit and edge/non-edge header comments to generated ASM. * Add some bias tests, combine M/N guard to single routine * Add OOB check for C loads, update storeD unit tests to check OOB, simplify quick exit checks * Address more PR comments: add M group skip, and skip to store end. simplified loadC OOB mask --------- --- 637881a: Fix unit tests & remove legacy code for subtile interleaving (#33) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Fix gr_lr_roundtrip test * Use non-interleaved version as ref code * Fix scheduler test * Removed legacy interleaved mode for LR/GR offset calculation --- e9cb889: Fix MX FP4 scale buffer allocation and initialization for batched GEMM (#25) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> * Fix bacth count issue * Add batch count tests * Fix bacth count issue * Address PR review: clarify FP4-specific byte stride and add non-aligned batched tests - Updated comments on dataBatchBytes computation to clarify FP4 packing assumption (2 elements/byte) and flag that non-FP4 block-scaling types would require updating this conversion. - Added batched test cases with non-multiple-of-32 M/N dimensions: FP4 DU=256: [48,48,2] and [33,65,2] FP4 DU=512: [63,63,2] BF16: [50,100,2] --------- --- a43247b: Update some test yamls (#31) --- Author: b-shi <brianshi@amd.com> --- e2f69c8: Add f4bs origami library with activation function support. Refactor sgpr allocation to reduce sgpr usage in post loop. Store code-path reorganization (#32) --- Author: b-shi <brianshi@amd.com> * Free swap/localwritebase sgprs before post-loop * Defer sgpr allocation to remove holds in sgpr pool. Add Origami library logic files for Cijk_Alik_Bljk_F4BS_MXA32_MXB32 (base, nta4, ntb4 variants). * Remove uneeded alignment and comment * Add more epilogue tests * Remove older origami library for f4bs * Reorder post-loop code blocks to after persistant loop Misc fixes * Fix build issues, relax longjump sgpr requirements * Fix GSU0 branch logic --------- --- 3f034bf: Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM (#35) --- Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com> * Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM - Add 6 new yaml files (F4HS, F4SS) across Origami, Origami_nta4, Origami_ntb4 - Update F4BS yaml files: AssertSummationElementMultiple 32→256 for K%256 enforcement - Add ("F4", "F4", "H", "S") to _validGEMMTypes and _HPATypes in Problem.py * Add F4HS test cases to subtile_mxfp4.yaml Add two new benchmark problem blocks for FP4→F16 (F4HS): - No-bias block: same wavetile and problem size coverage as F4SS - Bias epilogue block: BiasDataTypeList [s, h], relu/none activations * Add F4HS (FP4->Half) type support to Tensile client Add TypedGemm_F4_H_S typedef and corresponding reference CPU solver case so F4HS (FP4 input, Float16 output, Float compute) problems can be validated by the benchmark client. --------- --- d0bc8fd: Rewrite subtile-based scheduler. Fix DU>64 & enable very large MT (#36) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Initial support for DU>256 * Renaming * add option to do DU=512 in the tests * blocked K-major for scale * Change scaleSet swap logic * Update print functions * Put scales after values for avoid race conditions * Fix tests * more test * tweak printschedule display * Add PGR2 in the yaml tests * Add new scaleGROp * comment out failing tests * Revert "comment out failing tests" This reverts commit 1f5802c. * Draft new logical scheduler * Refactoring * Add more test on step1 * Add more tests on step1 * add bf16 320x320 test * reduce step1 code * Simplify step1 logic * validate some step1 test * Fix partition 2x2 test * more step1 test * 320x320 BF16 test * Add test DU512 + partition2x2 * Simplify step1 code * Add step2 tests * Fix multi-partition step2 * Add step2 du512, 2x2 partition test * Use common algo for all numPartitions * Draft for step3 tests * remove useless tests * New GR algo (draft) * [Step3] Add more test * Iteration on GR * Display ordered GR list with granularities * More test * Add some comments * Disable by default debug logs * Getting rid of step naming * Start remove AnnotatedOp (still there in group pass) * Split dependency Ops * Add todo on place_GRs pass * Valid test_annotate_deps_1x1_partition_DU256 * Test output looking better (still WIP) * single dep for LR tooo * Add remove_cross_deps pass * Fix bugs in dependency pass * insert_gr_lr_inc pass * Add group_lr_gr pass * Add emit pass * Quick port of instruction Emit code * Move emit function to separate file * Refactoring instructionEmitter * Port vgprTile tracking * Reworking second pass (WIP) * Display unrolling requirement * Unrolling check on 2nd pass * Generic validation for assign_vgpr pass * Fix unroll * Add inst schedule in standalone mode * Use lrGran for vgprTile size calculation * Fix bug in emit pass (missing depencency) * PreMFMA path + non-duplication scale load * missing globalReadLDSBufferSwap for GR_INC scales * add wairlr_sync on all LR->GR dep * add waitgr_sync op * remove_unnecessary_gr_deps * Change LR dispatch algo a bit to avoid too many waitgr_sync * Avoid duplicated loads in emitter * Fix bug on gr_emit code * GrInc pass. fix duplicated insertion for B * Fix missing LR_inc for SA/SB * preloop, NLL, NGLL * Simplify preloop * minor changes * Move unroll logic to scheduler * minor changes * Fix unroll id bug on NLL / NGLL * Disable post GRINC for now * Remove commented code * Handle 1x4, 4x1 gr read gran * Fix vmcnt computation * Use correct grCount mapping * Revert in emit logic on buffer_load for PGR0 needs * Add bf16 version in standalone test * Fix LR_Inc insertion on DU>64 * Add subIterK/Partition comment to codegen * Fix issue in GrInc placement * Remove last_mt * Fix LR MT index bug with muli-partition * Disable early LDS size check when subtileImpl is on * Add pass to remove redundant LR deps + fixed issue on dependency annotation pass * Remove more LR redundant deps * Only insert wait_lr_sync on deps * Simple algo to select partition config * Remove HC value for partitions... * Take into account all inflight GR (all tensors) * Fix tests and regressions on gr counts * Fix grCount merge calculation * Better display of dependencies * Add remove_wait_lr_sync after grouping * Add temporary non reg file * Change merge logic on GR grouping pass * Fix non necessary wait_lr_sync * Downgrade some waitlr_sync to sync + added 384x256 no reg test * non reg test 320x320 * Add larger MT * non reg test for fp4 256x256 * Moving out instructionScheduler * Remove old scheduler * Renaming scheduler * Re-work test * Add larger MT test cases * Rename non-ref test * Re-add standalone mode * Refactor DepOp * Remove dead code * Remove MFMATileSize class * Remove from_til_info * Avoid redundant tensor list creation * Remove hardcode granularities in vgrpTile allocation pass. Simplify code. * Re-enable # PGR=2 WG 4x1/1x4, K > DU tests * Remove unused GRScaleOp * DepRef renaming * Get rid of MT string representation * Remove TODO * EmmitedModule simplication * Use explicit pass dependencies * Renaming LogicalScheduler * Remove old test_InterleavingScheduler.py file * Commenting failing test for now * Remove debug logs * Disable lds padding when using UseSubtileImpl --- e8e8c09: Fix LR-GR dependency issue when DU>64 (#40) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Fix and simplify logic for remove_unnecessary_lr_deps * Add new ref tests for 128x128x(128,64) --- 4aa441a: Rebase fix --- Author: b-shi <brianshi@amd.com> --- 5ba911e: Skip rocRoller for FP4-A/FP4-B + pre-swizzled scale layout (#42) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> --- 9c74998: Rebase fix --- Author: b-shi <brianshi@amd.com> --- 842b149: Addressed review comments for KernelWriter and KernelWriterAssembly --- Author: Koji Nakajima <knakajim@amd.com> --- dce43b1: Fix computeLoadSrd issue --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- c075bbf: Fix preSolution CPU re-sync regressing subtile_mxfp4.yaml --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- ced840f: Fix computeLoadSrd issue (#43) --- Author: bnemanich <brad.nemanich@amd.com> --- bc2f6dd: Small update for gfx950 mx tests + more - enable UseSubtileImpl for all gfx950 non subtile mx tests - skip all gfx950 mxfp8 - use MXScaleFormat=1 as default - set AssertSummationElementMultiple=256 for subtile mxfp4 - fix isSwizzledSubtile in computeLoadSrd --- Author: Koji Nakajima <knakajim@amd.com> --- 5c794b7: Fix gsuasb.yaml failures --- Author: b-shi <brianshi@amd.com> --- 727f8db: tensilelite: add solution reject conditions for UseSubtileImpl=1 (#38) --- Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com> --- 8928fbb: Add more reject conditions for Subtile --- Author: Koji Nakajima <knakajim@amd.com> --- 6e63ab6: Fix kringshift test failures --- Author: b-shi <brianshi@amd.com> --- b3e9724: Update reject condtion for DepthU in subtile case. Plus, update DepthU setting for gfx950 mx test cases --- Author: Koji Nakajima <knakajim@amd.com> --- 5ab6009: Fix build errors --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 4a4edca: Update more mxfp4 tensilelite test cases --- Author: Koji Nakajima <knakajim@amd.com> --- bbbc553: Update change log --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 6476c04: Add more reject conditions for gfx950 subtile --- Author: Koji Nakajima <knakajim@amd.com> --- c5828c4: Updated gfx950 mxfp4 test cases - add StreamK setting - skip groupgemm tests for now (groupgemm does not support streamK) --- Author: Koji Nakajima <knakajim@amd.com> --- f1fc2f1: Fix hipblaslt build error of gfx950 --- Author: Koji Nakajima <knakajim@amd.com> --- 70cea1b: Updated subtile_mxfp4.yaml (add StreamK) --- Author: Koji Nakajima <knakajim@amd.com> --- c1c9b2a: Add uninit lsc,lsp, etc.. fields for subtile --- Author: b-shi <brianshi@amd.com> --- c0c1f72: Fixed merge error in testing_matmul.hpp --- Author: Koji Nakajima <knakajim@amd.com> --- 191e0cb: Add missed batch_count >1 changes --- Author: archana-ramalingam <Archana.Ramalingam@amd.com> --- 01c52f8: Addressed PR comments --- Author: Koji Nakajima <knakajim@amd.com> --- 4e89c91: Reduce mxfp4 test time --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 3dac20f: Prevent overflow for wgmxcc sgpr allocation --- Author: b-shi <brianshi@amd.com> --- 18dec79: Fix error with problem type --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 9e69ffd: Add a reject conditoin for gfx950 mx + non Subtile --- Author: Koji Nakajima <knakajim@amd.com> --- 0eed3ba: Add more valid GEMM types --- Author: Brad Nemanich <brad.nemanich@amd.com> --- 8b5514e: Fix missing b build error --- Author: archana-ramalingam <Archana.Ramalingam@amd.com> --- f981ff5: Fix 1250 tests --- Author: Brad Nemanich <brad.nemanich@amd.com> --- d1e69d9: Add more FP4 tests --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- e3a688f: Add MXScaleFormat: 1 to all gfx950 mx test yaml --- Author: Koji Nakajima <knakajim@amd.com> --- aaef3f5: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml --- Author: Koji Nakajima <knakajim@amd.com> --- 861ef8e: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml (nta4,ntb4) --- Author: Koji Nakajima <knakajim@amd.com> Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com> Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com> Co-authored-by: Brian Shi <Brian.Shi@amd.com> Co-authored-by: James Newling <James.Newling@amd.com> Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com> Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com> Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com> Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com> Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com> Co-authored-by: Brian Shi <Brian.Shi@amd.com> Co-authored-by: James Newling <James.Newling@amd.com> Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com> Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com> Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com> Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

Adds _iter_note(producer, consumer) in ScheduleCapture.py: returns " (of next iteration)" when consumer.position.loop_index == producer.position.loop_index + 1. Generalizes the prior MissingWaitFailure inline check (which hardcoded BODY_LABEL_TO_LOOP_INDEX[ML_PREV] -> [ML]) to any i -> i+1 boundary; loop_index is the canonical cross-body iteration counter so the numeric +1 test is the right discriminator. MissingWaitFailure (#2) refactored to use the helper. WaitTooLateFailure (#4), WaitInsufficientFailure (#5), and MissingBarrierFailure (#6) now also append the suffix when the producer/consumer pair crosses an iteration boundary. Suffix attaches right after the consumer's `@ idx=N` mention so the message reads: MFMA[name] @ idx=10 (of next iteration) is guaranteed by an SWaitCnt @ idx=12 ... MFMA[name] @ idx=10 (of next iteration)'s producer LRA0 @ idx=5 ... ... between the SWaitCnt and GRA @ idx=2 (of next iteration). 3 new cross-iter pinning tests + 3 same-iter regression assertions (`assert "(of next iteration)" not in msg`) so a future regression that incorrectly fires the suffix on same-iter pairs is caught.

Adds _node_with_pos(node, capture) — combines _node_label (per-category- stream [N] index, plain MFMA omits) with bare '@ idx=M' position render. Single helper for the canonical Failure node reference shape, replacing three different prior styles: - #1, #2: manual `_node_label + " " + "@ idx=N"` concatenation - #4, #5: `category[name] format_position(...)` (rendered the FULL name inside brackets, e.g. `LRA0[LRA0[0]]`, plus a cross-category list suffix that duplicates [N]'s purpose) - #6: `category format_position(...)` (no brackets) All five formatters now route through _node_with_pos. Plain MFMA stays bracket-less per _node_label's MFMA discriminator; PackMFMAs (categories PackA*/PackB*) keep [N] because CMS reschedules them. Waits in #5 stay as bare '@ idx=N' — the surrounding 'SWaitCnt' word already names the kind; rendering the SYNC category as `SYNC[N] @ idx=M` would just duplicate that. Skipped: - #10 SCCConflict: brackets carry rocisa class name (e.g. [SCSelectB32]) not [N] index; semantic conflict, separate audit. - #7 WrongInterleaving / #8 TimingTooClose: use `name` field for Pack identity (MiddlePack_a/_b/_c, CVT0_a/_b); replacing with [N] would lose the a/b/c discriminator. - #13 ConstraintViolation: slated for deletion in bead `pcz`. 3 new pinning tests verifying [N] actually appears when capture is given (one per Failure: #4 LRA0[1], #5 LRA0[1] + plain MFMA bracket-less, #6 GRA[1] in trailing reference). Test count: 564 passed (+3).

Major changes: - Voice rewrite per "feels too AI" feedback: dropped rhetorical setup, the "rest is the map" framing, em-dash-per-clause habit, and the load-bearing-design-flaw flourishes. - Section 2: reworded goals. Single suppression (cross-subiter ALU producers) called out instead of vague "legitimate pipelining". CDNA-4-only timing scoped properly. SCC + middle-pack folded into "data clobbers". Goal #6 now about actionable messages. - Section 3: diagram redrawn (no extra arrow into compare_graphs; arrow into validate_edge_wait_coverage added). Control ops defined inline. Multi-codepath handling explained. DTL+LdsBuf example corrected (GR→LR not LR→GR). - Section 4: removed §4.4, §4.5, §4.6 per feedback. §4.3 marked TODO for the codegen-vs-codegen generalization. §4.7 retained with pointer to follow-up bead. - Section 5: removed num_mfma_per_subiter assumption; added CMS=0/CMS=1 same-flag-support assumption; reworded ArchProfile fallback to describe correct behavior with pointer to fix bead. - Section 6: replaced misformatted table with two bulleted lists. - Section 7.3: explained the cross-subiter ALU producer suppression. - Section 7.4: explained NodeLike-accepts-ValidatorInstruction. Corrected the tagged_inst-fallback claim (it's required, no fallback). - Section 7.5: reframed order-dependent dispatch as known smell with pointer to investigation bead. - Section 7.6: trimmed "no subiter scoping" defense paragraph. - Section 7.7: reframed to lead with the purpose (where LDS-reuse edges come from), not the implementation. - Section 7.8: removed (raise_on_unexplained parameter cleanup tracked as bead). - Section 7.9: condensed; the SCC suppression question handed to a bead. - Section 8: TODO marker for verifying future-work list is real. - Status block updated: br4 closed; one-way edge confirmed. Beads filed in this commit's scope: - rocm-libraries-dj1g (NGL predicate brittleness, §4.7) - rocm-libraries-zkzw (ArchProfile silent CDNA-4 fallback, §5) - rocm-libraries-9lcs (CMS=0/=1 flag-set reconciliation, §5) - rocm-libraries-o0ei (order-dependent dispatch, §7.5) - rocm-libraries-6bue (raise_on_unexplained removal, §7.8) - rocm-libraries-so9m (cross-body SCC suppression investigation, §7.9) The Feedback section at the top of the doc has per-item sub-bullet answers per the user's instructions.

…, move inspection tool - Remove reviewer name from RFC body (line 543) - Remove generate_diagrams.py from repo (keep locally) - Remove dead graph_level_correctness diagram code from script - Move Bundle Inspection Tool from Detailed Design to Future Work item #6 (was marked v2/not-v1 but sitting in Detailed Design — confusing) - Add metadata sidecar to Future Work item #5 - Rewrite infrastructure table with consistent Read/Split/Compare pattern Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds Tensile/Components/TOPOLOGY_EQUIVALENCE_INVESTIGATION.md cataloging approaches for topology-equivalent graph comparison in compare_graphs. Recommendation: B (cached BLAKE2b digest on DataflowGraph for O(1) fast-path on identical graphs) + F (per-node Merkle hash for descend- on-mismatch localization in O(V)), with H folded into F's diagnostic output formatting. Approaches catalogued: A (multiset of structural edge signatures), B (cached digest + descend), C (canonical α-renaming + verbatim), D (bisimulation), E (graph isomorphism), F (topological Merkle tree), G (iceberg fingerprint), H (edge-dict by role-pair, folded into F), I (Bloom prefilter, excluded), J (sorted-tuple compare without hash), K (parallel emit-stream walk). Bias: 'graphs identical 99% of the time' — favors approaches with true O(1) fast-paths over per-call O(E) approaches. Concrete µs figures (BLAKE2b ~5-15µs single-shot, set-diff ~50-100µs/call) justify B over A/J for amortized cost. Open questions for user weigh-in (7 total): 1. loop_index in structural identity tuple? 2. Keep _canonical_render for diagnostics-only display? 3. Topology matches but rendered text differs — equal with warning? 4. Position (vmfma_index, sub_index) in digest tuple? 5. Position-only differences — new Failure type or absorb? 6. Intra-graph 7a mechanism (4 alternatives listed; user decides — c70/d0xd direction is option (a) and explicitly noted as previously rejected). 7. Digest computation timing (eager / lazy / incremental). Open Question #6 explicitly does NOT pre-recommend a mechanism; the four alternatives (writer-state consumption / local first-write-wins / container identity / deferred byte-range search) are presented as a design decision pending user input — the user previously rejected the writer-state direction so the memo surfaces the alternatives rather than defaulting to it. Test-update estimate (verified by counting): ~25-40 of 653 unit tests (~4-6%); ~250 LoC for B+F implementation; ~150 LoC for the chosen intra-graph 7a fix. Bead stays open — this is the design document for the eventual implementation. Per the bead's status, implementation defers until a real kernel triggers either 7a or 7b.

…7008) ## Motivation Several memory leaks were detected in MIOpen gtests using ASan. Some of the tests were blacklisted and others were not. This change looks to fix all of the low hanging fruit, which are the majority of the leaks found. This includes all of the critical leaks (>100MB) that were reported. Some other leaks were identified as needing a larger refactor to resolve. After fixing the w_supertensor.cpp leak, the supertensor tests hit virtual memory area limit errors. These changes are to enable running them with ASan, but only with a subset of the tests in order to not hit limits. The test coverage being lost is fairly negligible and the full tests are still run without ASan enabled. ## Technical Details Here is a summary of the files looked at and the changes made: | File | Status | Notes | |------|--------|-------| | `test/gtest/gpu_mha_forward.cpp` | Fixed | Added `miopenDestroySolution()` loop after using solutions. Fixes MHA forward solution descriptor leaks (report #11, #13). | | `test/gtest/gpu_mha_backward.cpp` | Fixed | Added `miopenDestroySolution()` loop after using solutions. Fixes MHA backward solution descriptor leaks (report #9, #12). | | `test/gtest/mha_find20.cpp` | Fixed | Added `miopenDestroySolution()` loop in both `MhaForward` and `MhaBackward` tests. Fixes MHA Find2.0 solution leak (report #16). | | `test/gtest/gtest_desc_guard.hpp` | Fixed | New shared header introduced by the refactor. Provides a single `DescGuard<DescType, CreateFn, DestroyFn>` template (with `TensorDescGuard`, `ConvDescGuard`, `DropoutDescGuard`, `RNNDescGuard` aliases), a `HandleGuard` RAII wrapper for `miopenHandle_t`, and the `DestroyInternalRnnDropoutDesc(rnnDesc)` helper used by every RNN/LSTM/GRU test to free the internal `DropoutDescriptor` that `miopenCreateRNNDescriptor` allocates and `miopenSetRNNDescriptor*` then leaks. Replaces the per-file ad-hoc guard structs from the initial implementation. | | `test/gtest/w_supertensor.cpp` | Fixed | Switched raw descriptors to the shared `RNNDescGuard` / `TensorDescGuard` from `gtest_desc_guard.hpp`. Added a class-local `DestroyDropoutDesc()` (called from `TearDown` and before `miopenSetRNNDescriptor`) to prevent the `miopenSetRNNDescriptor` overwrite leak. Reduced test parameter space under ASan to avoid OOM. Removed unused `seqLen` parameter and dead `param_dev_out`/`bias_dev_out` allocations. | | `test/gtest/lstm.hpp` | Fixed | Switched `rnnDesc` → `RNNDescGuard`, `DropoutDesc` → `DropoutDescGuard`, and `mio_handle` → `HandleGuard` (which now owns the `miopenDestroy` call), all from the shared `gtest_desc_guard.hpp`. Hoisted `dropout_state_buf` so it can be `hipFree`d at the end of the dropout path. Added `DestroyInternalRnnDropoutDesc(rnnDesc)` before `miopenSetRNNDescriptor*` and (in the non-dropout path only) at end of `Run`, which frees the internal `DropoutDescriptor` that the Set call would otherwise leak. Fixes ~615 MB dropout leaks and ~16.5 KB non-dropout descriptor leaks (report #3, #4, #6, #17-#22). | | `test/gtest/gru_test.cpp` | Fixed | In the GRU test class: switched `rnnDesc` → `RNNDescGuard`, `DropoutDesc` → `DropoutDescGuard`, `mio_handle` → `HandleGuard`, and added `DestroyInternalRnnDropoutDesc(rnnDesc)` before `miopenSetRNNDescriptor*` and (non-dropout path only) at end of `Run` plus `hipFree(dropout_state_buf)` for the dropout path. In the in-file `GRUFwdCPUVerify` / `GRUBwdDataCPUVerify` helpers: converted the raw `dropout_inputTensor` / `dropout_outputTensor` declarations to `TensorDescGuard` (mirroring the `cpu_rnn.hpp` change for the LSTM/RNN helpers). | | `test/gtest/softmax_find20.cpp` | Fixed | Changed `Finalize()` to take the `std::vector<miopenSolution_t>&` and destroy each solution via `miopenDestroySolution()` before destroying the problem. Updated all 6 `TEST(...)` callers to pass the solutions vector. Fixes Find2.0 softmax solution/kernel leaks (report #25-#27). | | `test/gtest/rnn_seq_api.hpp` | Fixed | Hoisted `dropout_state_buf` so the dropout path can `hipFree` it at the end. Added `DestroyInternalRnnDropoutDesc(&rnnDesc)` before `miopenSetRNNDescriptor*` and (non-dropout path only) at end of the run to free the internal `DropoutDescriptor` allocations leaked by `RNNDescriptor` copy-assignment. Same pattern as the LSTM/GRU fixes (report #29-#30). | | `test/cpu_rnn.hpp` | Fixed | Converted 6 raw `miopenTensorDescriptor_t dropout_input/outputTensor` declarations across the LSTM/RNN CPU verification helpers (`LSTMFwdCPUVerify`, `LSTMBwdDataCPUVerify`, `RNNFwdTrainCPUVerify`, `RNNBwdDataCPUVerify`, `GRUFwdCPUVerify`, `GRUBwdDataCPUVerify`) to the shared `TensorDescGuard` from `gtest_desc_guard.hpp`. Removed the redundant `miopenCreateTensorDescriptor` calls and updated 12 `miopen::deref(...)` sites to `.get()`. (Note: the GRU helpers in this header are stale duplicates; the live ones are inside `test/gtest/gru_test.cpp` and were updated there too.) Fixes the LSTM/GRU CPU-verify tensor descriptor leaks (report #1, #2, #14, #15, #32, #33). | | `test/gtest/rnn_vanilla_common.hpp` | Fixed | Added `DestroyInternalRnnDropoutDesc(rnnDesc)` calls before `miopenSetRNNDescriptor*` and (non-dropout path only) at end of the run. The `RNNDescGuard` / `DropoutDescGuard` usage was already in place from an earlier commit and now resolves to the shared definitions in `gtest_desc_guard.hpp`. Same pattern as LSTM/GRU/rnn_seq_api fixes (report #14, #15, #32, #33). | | `test/gtest/graphapi_gtest_common.hpp` | Skipped | File no longer exists in the codebase. The GraphApi test infrastructure has been removed; only the leak report (a stale snapshot) still references it. No fix possible against the current source tree. | | `test/gtest/graphapi_execution_plan.cpp` | Skipped | File no longer exists in the codebase (GraphApi removed). The leak it represented was largely an external hipblaslt bug anyway; the remaining test-side portion is not fixable against the current source. | | `test/gtest/na_train.cpp` / `na_inference.cpp` / `na_*_find2.cpp` | Skipped | Leaks come from the internal MIOpen solver/kernel cache living on the global singleton handle, which is never destroyed. Not easily fixable without redesigning global handle lifecycle. Should be suppressed in the ASan suppression file. | | hipblaslt / rocblaslt (external) | Skipped | `SolutionCache::addKernel` and `preloadCustomKernels` leak via `_rocblaslt_handle` constructor. Called from `miopen::Handle::CreateHipblasLtHandle`. This is an upstream bug in hipblaslt/rocblaslt, not fixable in MIOpen. Affects suites that initialize a handle (report Category 2, Category 7). | | CLR / HIP runtime (external) | Skipped | `amd::Context` and `amd::roc::Device` global initialization leaks from `rocclr/platform/context.cpp`. HIP runtime internals, not fixable in MIOpen. | | `src/hipoc/hipoc_program.cpp` | Skipped | `HIPOCProgramImpl` objects leak during kernel compilation/caching (line ~178). This is an internal MIOpen kernel cache lifecycle issue that requires deeper architectural changes to fix. Contributes small amounts to MHA and Softmax Find2.0 leaks. | | `test/gtest/conv_api.cpp` | Fixed | Already clean against current source — `miopenDestroyConvolutionDescriptor(conv_desc)` call exists (line 24) inside the test loop. ASAN run reports no leaks. The leak report was based on a stale snapshot. The hipblaslt handle init portion is tracked under the external-skipped row. | | `test/gtest/log_test.cpp` (CPU_LOG_TEST_FUSION / CPU_LOG_TEST_NEG) | Fixed | Already clean against current source — `Tensor`, `Conv`, `CreateCBAFusionPlan`, `CreateBNormFusionPlan` all have proper destructors that call the corresponding `miopenDestroy*` APIs in `log.cpp`. ASAN run on `CPU_LOG_TEST_*` (11 tests across log_test.cpp + log_test_neg.cpp) reports no leaks. The hipblaslt handle init portion is tracked separately under the external-skipped row. | | `test/gtest/fusion_test.cpp` (CPU_FusionCreateOpConvForward) | Fixed | File renamed from `fusion.cpp` to `fusion_test.cpp`. Already clean against current source — uses `TensorDescGuard`/`ConvDescGuard` for tensor/conv descriptors and calls `miopenDestroyFusionPlan(fusionPlanDesc)` on the fusion plan (line 195). ASAN run on `CPU_FusionCreateOpConvForward_FP32.*` reports no leaks. | | `test/gtest/deterministic_conv_api.cpp` | Fixed | Already clean against current source — uses `ConvDescGuard` (line 66) for the conv descriptor. ASAN run on `*CPU_DeterministicConvApi*` reports no leaks. | | `test/gtest/fusion_aux.cpp` (GPU_FusionAux) | Fixed | Already clean against current source — uses `ConvDescGuard` plus stack-allocated internal C++ objects (`miopen::TensorDescriptor`, `miopen::FusionPlanDescriptor`) which have proper destructors. The `convoOp` handle is owned by the fusion plan. ASAN run on `*GPU_FusionAux*` reports no leaks. | | `test/gtest/backend_api.cpp` (CPU_BackendApi) | Skipped | File no longer exists in the codebase. The backend API test infrastructure (part of the removed GraphApi suite) was removed; no fix possible against the current source tree. | ### High-level notes New shared infrastructure (test/gtest/gtest_desc_guard.hpp) - DescGuard<DescType, CreateFn, DestroyFn> — a single RAII template parameterized on the descriptor type and its create/destroy entry points. Aliases provide TensorDescGuard, ConvDescGuard, DropoutDescGuard, and RNNDescGuard, replacing the four near-identical guard structs that were copy-pasted across test files in the initial implementation. - HandleGuard — separate RAII wrapper for miopenHandle_t (couldn't reuse the template because miopenCreateWithStream takes an extra hipStream_t argument). Supports lazy create(stream) so callers that only need a handle in the dropout branch can default-construct one and populate it conditionally. - DestroyInternalRnnDropoutDesc(rnnDesc) — frees the internal DropoutDescriptor that miopenCreateRNNDescriptor allocates and that miopenSetRNNDescriptor* then orphans. Replaces the equivalent inline blocks that LSTM/GRU/RNN tests were each carrying. The header documents the two call-sites: before each Set* (always safe) and at end-of-run only on the non-dropout path (the dropout path aliases the user-owned descriptor, so freeing would double-free). Recurring patterns enabled by the refactor - The "leak from Set* overwriting the default-constructed internal dropout descriptor" fix collapsed from per-file code to a one-line helper call, applied uniformly across lstm.hpp, gru_test.cpp, rnn_seq_api.hpp, and rnn_vanilla_common.hpp. - mio_handle ownership in LSTM/GRU is now expressed via HandleGuard rather than a manual miopenDestroy at the end of the dropout branch — eliminates a class of forgotten-cleanup bugs. - dropout_state_buf is consistently hoisted out of the dropout if block so an end-of-run hipFree can release it; deletion of the buffer pairs visibly with its allocation. Notable non-RNN change - softmax_find20.cpp was the only Find2.0 leak fix in this commit: Finalize() now takes the solutions vector and calls miopenDestroySolution() for each before destroying the problem. Same shape applied to all 6 tests in the file. ## Test Plan Run the tests beforehand to observe the ASan leak errors and then again afterward to verify the fixes have resolved the problem. ## Test Result List from ROCM-21512: | # | Test Name | Status | Leak Status | |---:|------------------------------------------------------|------------------------------------------------|-----------------| | 1 | Smoke/GPU_RNNVanillaDropout_FP32 | PASSED (4 tests) | No leaks | | 2 | Smoke/GPU_RNNVanillaDropout_FP16 | PASSED (4 tests) | No leaks | | 3 | Full/GPU_LSTM_dropout_FP32 | PASSED (4 tests) | No leaks | | 4 | Full/GPU_LSTM_dropout_FP16 | PASSED (4 tests) | No leaks | | 5 | CPU_GraphApiExecutionPlanBuilder_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 6 | Full/GPU_LSTM_dropout_FP64 | REMOVED (PR #5750, 2026-03-26) | n/a (deleted) | | 7 | Unit/CPU_GraphApiPointwise_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 8 | Full/GPU_LstmMSRnn_FP32 | PASSED (1152 tests) | No leaks | | 9 | Smoke/GPU_Bwd_Mha_FP32 | PASSED (12 tests) | No leaks | | 10 | Full/GPU_LstmMSRnn_FP16 | PASSED (864 tests) | No leaks | | 11 | Smoke/GPU_Fwd_Mha_FP32 | PASSED (15 tests) | No leaks | | 12 | Full/GPU_Bwd_Mha_FP32 | PASSED (6 tests) | No leaks | | 13 | Full/GPU_Fwd_Mha_FP32 | PASSED (7 tests) | No leaks | | 14 | Full/GPU_RNNVanilla_FP32 | PASSED (96 tests) | No leaks | | 15 | Full/GPU_RNNVanilla_FP16 | PASSED (96 tests) | No leaks | | 16 | GPU_TestMhaFind20_FP32 | PASSED (2 tests) | No leaks | | 17 | Full/GPU_LSTM_FP32 | PASSED (32 tests) | No leaks | | 18 | Full/GPU_LSTM_FP16 | PASSED (32 tests) | No leaks | | 19 | Full/GPU_LSTM_extra_FP32 | PASSED (30 tests) | No leaks | | 20 | Full/GPU_LSTM_extra_FP16 | PASSED (30 tests) | No leaks | | 21 | Full/GPU_DeepBench_LSTM_FP16 | PASSED (22 tests) | No leaks | | 22 | Full/GPU_DeepBench_LSTM_FP32 | PASSED (22 tests) | No leaks | | 23 | CPU_LOG_TEST_FUSION_NONE | PASSED (2 tests) | No leaks | | 24 | CPU_LOG_TEST_NEG_NONE | PASSED (4 tests) | No leaks | | 25 | GPU_SoftmaxFind20_BFP16 | PASSED (2 tests) | No leaks | | 26 | GPU_SoftmaxFind20_FP16 | PASSED (2 tests) | No leaks | | 27 | GPU_SoftmaxFind20_FP32 | PASSED (2 tests) | No leaks | | 28 | CPU_ConvApi_NONE | PASSED (1 test) | No leaks | | 29 | Full/GPU_RNNSeqApi_FP16 | PASSED (16 tests) | No leaks | | 30 | Full/GPU_RNNSeqApi_FP32 | PASSED (16 tests) | No leaks | | 31 | UnitVAN/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 32 | Smoke/GPU_RNNVanilla_FP16 | PASSED (4 tests) | No leaks | | 33 | Smoke/GPU_RNNVanilla_FP32 | PASSED (4 tests) | No leaks | | 34 | CPU_FusionCreateOpConvForward_FP32 | PASSED (1 test) | No leaks | | 35 | CPU_GraphApiOperationReduction_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 36 | Unit2IV1/CPU_GraphApiOperationPointwiseBwd_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 37 | Unit2IV1/CPU_GraphApiOperationPointwiseTwoInput | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 38 | Unit2IV2/CPU_GraphApiOperationPointwiseBwd_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 39 | Unit2IV2/CPU_GraphApiOperationPointwiseTwoInput | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 40 | UnitVAB/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 41 | Smoke/GPU_RNNVanillaDropout_FP16 (duplicate of #2) | (see #2) | (see #2) | | 42 | CPU_GraphApiOperationGraphDescriptor_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 43 | UnitVA/CPU_GraphApiVariantPack_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 44 | UnitVAU/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 45 | CPU_GraphApiOperationReshape_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 46 | Smoke/CPU_DeterministicConvApi_NONE | PASSED (1 test) | No leaks | | 47 | Smoke/GPU_FusionAux_FP32 | PASSED (1 test) | No leaks | | 48 | CPU_GraphApiEngineHeur_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 49 | Unit/CPU_GraphApiReduction_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 50 | CPU_GraphApiEngineCfg_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 51 | Unit/CPU_GraphApiMatMul_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 52 | CPU_BackendApi_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 53 | UnitIV/CPU_GraphApiOperationPointwiseOneInput | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 54 | Unit3IV/CPU_GraphApiOperationPointwiseThreeInput | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 55 | UnitVA/CPU_GraphApiOperationMatmul_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 56 | UnitVA/CPU_GraphApiOperationRng_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | Outcome | Count | |------------------------------------|------:| | Passed, no leaks | 39 | | Passed, leaks detected | 0 | | Failed | 0 | | Crashed / timed out | 0 | | Removed — GraphAPI purge (#5603) | 15 | | Removed — FP64 LSTM purge (#5750) | 1 | | Duplicate (not re-run) | 1 | | **Total rows** | **56** | ## Risk Assessment Low --------- Co-authored-by: JonathanLichtnerAMD <195780826+JonathanLichtnerAMD@users.noreply.github.com>

# Add gfx950 MXFP4 Subtile-based kernel implementation ## Summary This PR is a follow-up to #6499 ([hipblaslt] Add support for gfx950 mxfp4) and adds the **Subtile-based kernel implementation (`UseSubtileImpl=1`)** for hipBLASLt on **gfx950**. It introduces a new tile-decomposed code generation path optimized for **MXFP4** and **BF16** GEMMs, plus the solution-selection plumbing, validation, Origami logic yamls, and unit tests needed to make it production-usable. ## Motivation PR #6499 brought MX data type support online for gfx950, but the existing TensileLite codegen path leaves significant performance on the table for MXFP4-heavy workloads. The Subtile path restructures global-read / local-read / MFMA / store scheduling at a finer granularity, which **greatly improves MXFP4 GEMM performance when using `HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT`** (added to the hipBLASLt CHANGELOG). ## What's included ### 1. New Subtile-based kernel components (Tensile) New modules under `projects/hipblaslt/tensilelite/Tensile/Components/`: * `SubtileBasedKernel.py` (~1850 LOC) — entry point and orchestration of the subtile codegen path; replaces large portions of the standard prefetch / unroll / store flow when `UseSubtileImpl=1`. * `SubtileBasedLogicalScheduler.py` (~2415 LOC) — logical scheduler that builds the subtile-grained instruction graph (GR loads, LR offsets, MFMA tiles, scale loads, stores) from kernel parameters. * `SubtileBasedInstructionScheduler.py` (~433 LOC) — converts the logical schedule to an emit order respecting wave / register / hazard constraints. * `SubtileBasedInstructionEmitter.py` (~216 LOC) — instruction emission helpers shared by the subtile components. ### 2. Kernel writer / common changes * **`KernelWriter.py`**, **`KernelWriterAssembly.py`**: integration points for the subtile path — prefetch, GR offset calculation, LR offset calculation, post-loop, MFMA macro accounting, optimized `storeD`, LDS buffer swap, MX FP4 scale emit, `SrdMXSA/B+2` handling, sgpr allocation / overflow guards, computeLoadSrd fix. * **`SolutionStructs/Solution.py`**, **`SolutionStructs/Problem.py`**: introduces the `UseSubtileImpl` parameter, MX-related reject conditions for non-Subtile paths on gfx950, and additional valid GEMM type combinations for MX inputs. * **`Common/ValidParameters.py`**, **`Common/RequiredParameters.py`**, **`Common/GlobalParameters.py`**: `UseSubtileImpl` registration and defaults. * **`Components/StreamK.py`**: subtile-aware StreamK fixup (incl. import union with the `BufferLoadB32` cache-coherence change from #6837). * **`Components/GlobalWriteBatch.py`**: optimized global write batching for the subtile path (~670 LOC of changes). * **`Components/ComputeStoreVgprs.py`**, **`Components/LSU.py`**, **`Components/WorkGroupMappingAlgos.py`**, **`AsmStoreState.py`**, **`KernelWriterModules.py`**: minor adjustments needed by the subtile pipeline. ### 3. rocisa / host / client * **`rocisa/rocisa/include/container.hpp`**: helpers needed by the new emitter. * **`tensile_host.cpp`**, **`include/Tensile/TensorDescriptor.hpp`**: small fixups for the subtile path and gfx950 build. * **`client/include/DataInitialization.hpp`**, **`client/src/DataInitialization.cpp`**, **`client/src/Reference.cpp`**, **`client/src/ReferenceValidator.cpp`**, **`client/include/TypedId.hpp`**: MX scale init and reference paths used by the new tests. * **`clients/common/include/testing_matmul.hpp`**, **`clients/common/include/norm.hpp`**, **`clients/common/include/hipblaslt_datatype2string.hpp`**, **`clients/common/src/mxDataGen.cpp`**: wiring for batched (>1) testing and MX init. ### 4. Origami / solution selection (gfx950 MXFP4) New auto-tuned logic yamls under `projects/hipblaslt/library/.../Tensile/Logic/asm_full/gfx950/gfx950/Origami/` covering the FP4 SS / HS / BS variants in three layouts: * `Origami/` (default) * `Origami/Origami_nta4/` (no-transpose-A FP4) * `Origami/Origami_ntb4/` (no-transpose-B FP4) (9 new `gfx950_Cijk_Alik_Bljk_F4{SS,HS,BS}_MXA32_MXB32_*_UserArgs.yaml` files in total.) ### 5. New tests **End-to-end gfx950 GEMM yamls** in `Tensile/Tests/common/gemm/gfx950/`: * `subtile_bf16.yaml`, `subtile_mxfp4.yaml` * `mx32f4_tn.yaml`, `mx32f8_tn.yaml` * `mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml` * `mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml` * `fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml` **StreamK + MX:** `Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml`, `sk_mx32f8_quick.yaml`. **New unit tests** (`Tensile/Tests/unit/`): * `test_SubtileBasedLogicalScheduler.py` (~1735 LOC) * `test_SubtileBasedSchedulerRef.py` (~596 LOC) * `test_gr_lr_roundtrip.py` (~571 LOC) * `test_storeD_roundtrip.py` (~2420 LOC) * `test_graTileAssignment.py` (~354 LOC) * `test_lraTileAssignment.py` (~360 LOC) * `conftest.py`, `gpu_test_helpers.py` shared fixtures (~601 LOC) **New gtest:** `tensilelite/tests/MXScalePadding_test.cpp`. ### 6. Misc / hardening * Reject conditions: gfx950 MX + non-Subtile, DepthU constraints, GroupGEMM not yet supported with StreamK + MX, AssertSummationElementMultiple=256 for subtile MXFP4, missing-mxblock check for non-MX types. * Skip rocRoller for FP4-A/FP4-B with pre-swizzled scale layout (#42). * `forceDenorm=False` in `generateMXInput` (#11). * Several rebase fixes, copyright/year header updates, and review-comment fixes to `KernelWriter` / `KernelWriterAssembly`. ### 7. CHANGELOG Greatly improved MXFP4 GEMM performance when using HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT ## How to use Set `UseSubtileImpl: 1` on a gfx950 MX-FP4 solution (see the new `subtile_mxfp4.yaml` / `mx32f4_tn.yaml` for canonical configs). The path is opt-in — non-MX and non-gfx950 kernels are unaffected. ## Backwards compatibility / risk * All new behavior is gated on `UseSubtileImpl=1` and gfx950. Existing solutions on other architectures or non-MX paths are unchanged. * `GroupGEMM + StreamK + MX` is intentionally rejected for now (TODO). * New Origami yamls only add solutions; nothing existing is modified. ## Test plan * New gtests + unit tests run automatically in CI (Tensilelite Python unit suite, `MXDataGen_test`, `MXScalePadding_test`). * New end-to-end gfx950 GEMM and StreamK yamls are added to the common test buckets. * Manual: run the gfx950 MXFP4 subtile suites (`pytest -k gfx950` after building Tensile, plus `tensilelite-client --yaml subtile_mxfp4.yaml` for sanity). ## Notes for reviewers * This branch was rebased onto current `develop` (post-#6499) by skipping the `users/nakajee/gfx950_mx_rebase_merge` history (which #6499 squash-merged) and replaying only the subtile-specific work as a single squashed commit. The actual code changes in this PR are limited to the files listed above (24 added, 56 modified; ~+170k / −2.6k including generated logic yamls). * The largest reviewable diffs are: * `Tensile/Components/SubtileBased{Kernel,LogicalScheduler,InstructionScheduler,InstructionEmitter}.py` (new files) * `Tensile/KernelWriter.py`, `Tensile/KernelWriterAssembly.py` * `Tensile/SolutionStructs/{Problem,Solution}.py` * `Tensile/Components/{GlobalWriteBatch,StreamK}.py` * `clients/common/include/testing_matmul.hpp` * `client/src/DataInitialization.cpp` * Description of all commits that were squashed for this feature branch: Subtile implementation for gfx950 MX FP4 --- 272f88d: Add sample subtile impl --- Author: brianshi <brianshi@amd.com> --- 60ecede: GR Offset calculation (#1) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- be69c1d: Enable post-loop code generation, and add some subroutines --- Author: b-shi <brianshi@amd.com> --- 646d102: LR offset calculation (#2) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- 71f4bca: Add GR load emit logic, and misc fixes (#3) --- Author: b-shi <brianshi@amd.com> --- 1fd0db9: Emit LR + init ACCVGPR (#4) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- 9d406b9: Add loop and ptr update code --- Author: b-shi <brianshi@amd.com> --- b6127bc: Update GR/LR offset calculation to fully support 2x2, 1x4, 4x1 waveConfigs (#7) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- 89ec87c: Account for valuC macro value in SK WS store code --- Author: b-shi <brianshi@amd.com> --- 6edf53d: Rebase fix --- Author: b-shi <brianshi@amd.com> --- 34e79fc: Enable fp4 (#8) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- d5a5c57: [Tensilelite] Add MX FP4 scale offset computation for subtile-based kernel (#6) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> --- 7a8a85a: Add lds buffer swap logic --- Author: b-shi <brianshi@amd.com> --- d24a8fe: Add optimized storeD code (#9) --- Author: b-shi <brianshi@amd.com> --- a45c20c: Fix MX scale tensor initialization: set forceDenorm=false in generateMXInput (#11) --- Author: T.J. Alumbaugh <T.J.Alumbaugh@amd.com> --- f945268: [Tensilelite] Enable the MX FP4 scale emit code in the subtile-based kernel (#10) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> --- cf37df4: Use fixed value for SrdMXSA/B+2 (#14) --- Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com> --- f0c8dbc: Merge subtile_mx_f4_schedule to subtile_mx branch (#16) --- Author: b-shi <brianshi@amd.com> --- 543796f: Enable DU > 256, and reduce sgpr allocation (#18) --- Author: b-shi <brianshi@amd.com> --- c65bdb0: Add missing mxblock check for non-mx data types --- Author: b-shi <brianshi@amd.com> --- d64d226: Introduce UseSubtileImpl parameter (#20) --- Author: b-shi <brianshi@amd.com> Squash commits 20-35 from subtile_mx branch --- e4780da: Enable FixSrd2 for A/B (#23) --- Author: b-shi <brianshi@amd.com> * Enable FixSrd2 for A/B * Address comments from PR --------- --- e4c64a7: Add nt libs --- Author: b-shi <brianshi@amd.com> --- cd13ec1: [Tensilelite] Pad MX scale tensor dimensions for unaligned problem sizes (#21) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> * Add scale padding * Add tests * Remove redundant pre-swizzle path * Remove code from conflict * Fix reverted mxdatagen path for tensile tests * Add diverse test cases for scale padding in MXScalePadding_test and subtile.yaml - Expanded test cases to include non-multiple-of-32, even non-multiple-of-16, and odd dimensions. --- d87938f: Split subtile.yaml into subtile_bf16.yaml and subtile_mxfp4.yaml (#22) --- Author: James Newling <james.newling@gmail.com> Replace the 'monolithic' subtile.yaml with two focused test files. All original test coverage is preserved. Two new FP4 groups added. BF16 coverage (subtile_bf16.yaml, tests are essentially unchanged): # | Description | Dest | MIs | PGR | DU | SK | Sizes --+--------------------+------+-----+-----+---------+-----+------ 0 | BF16 TN main | b | 19 | 0 | 64 | 0,3 | 11 1 | BF16 TN large DU | b | 4 | 0 | 128,192 | 0,3 | 7 2 | BSS (f32 output) | s | 6 | 0 | 64 | 0,3 | 9 3 | BF16 bias | b | 2 | 0 | 64 | 0 | 1 FP4 coverage (subtile_mxfp4.yaml): # | Description | Dest | MIs | PGR | DU | SK | Sizes | Status --+--------------------+------+-----+-----+-----+-----+-------+-------- 0 | FP4 TN main | b | 15 | 0 | 256 | 0,3 | 23 | from original 1 | FP4 TN large DU | b | 4 | 0 | 512 | 0,3 | 13 | from original 2 | F4SS (f32 output) | s | 5 | 0 | 256 | 0,3 | 13 | from original 3 | FP4 bias | b | 2 | 0 | 256 | 0 | 1 | from original 4 | FP4 PGR=2 | b | 13 | 2 | 256 | 0 | 5 | new 5 | FP4 expanded MIWT | b | 24 | 0 | 256 | 0 | 5 | new 6 | PGR=2 WG 4x1/1x4 | | 6 | 2 | 256 | 0 | 1 | known failures (commented) Run times on gfx950 (8x MI350X): File | NEV=-1 | NEV=0 -------------------+--------+------ subtile_bf16.yaml | 23s | 23s subtile_mxfp4.yaml | 37s | 40s Where NEV is number of elements to validate. I (James) have checked these numbers, and weirdly it is true that NEV=0 is a bit faster than NEV=-1 for mxfp4. --- af04f0d: Dependency based instruction scheduling (#19) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Revert to single partition * Start using dependencies * as is * start using separate EmittedModules * remove reduntant wait * Add _extractPathsFromBeforeDeps * Continue simplification * Simplifying * Add more rules * cleanup * Add fp4 test * fix test * Add tests * Remove after field on emittedmodule * Refactoring instructionSchedule * Add comments * cleanup modules vs ops * Refactoring print functions * Test cleanup * Add more tests * Replace subgroup by partition * Remove unused unroll param * Add high level notes * Simplify NLL and NGLL GR removal * Add some comments * Force instruction insertion if no slots available * Fix test after rebase * Move scale before A/B and track inflight count * Fine-grain vmcnt calculation * Separate counts for scaleA and B * Avoid using m0 update and buffer_lod on same MFMA slot to avoid scalar instruction serialization * Fix test * Add vmcnt test * Fix duplicated loads for 1x4 and 4x1 * Fix placement in reverse order * Fix regression on PGR0 * add fallback to numMFMA=1 --- 3ec902b: Add some 1x4 and 4x1 origami solutions --- Author: b-shi <brianshi@amd.com> --- c5000d3: Fix typo --- Author: b-shi <brianshi@amd.com> --- 226ed84: [hipblaslt] Refactor Srd2 calculation for useFixedSrd2 (#30) --- Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com> --- abf19d4: [Tensilelite] UseSubtileImpl: subtile-aligned edge check for store path (#29) --- Author: b-shi <brianshi@amd.com> * [Tensilelite] UseSubtileImpl: subtile-aligned edge check, OOB guard, and refactoring - Replace Size%MT edge check with subtile-aligned check: NonEdge paired store when trailing rows/cols are a multiple of the subtile block size (waveGroupM rows for M, 16 cols for N). Non-last workgroups always take NonEdge. - Add per-wave OOB guard (subtileM32ValidBlocksSgpr / subtileN16ValidBlocksSgpr) to skip stores outside valid M/N tile bounds in the NonEdge path. - Refactor duplicated OOB guard into _emitSubtileOobGuard helper; refactor M/N guard SGPR computation into _emitSubtileMGuard / _emitSubtileNGuard. - Fix orphan scalar store blockIdxM (was tt0, now (tt0*MatrixInstM)//mBlockSize). - Add quick-exit and edge/non-edge header comments to generated ASM. * Add some bias tests, combine M/N guard to single routine * Add OOB check for C loads, update storeD unit tests to check OOB, simplify quick exit checks * Address more PR comments: add M group skip, and skip to store end. simplified loadC OOB mask --------- --- 637881a: Fix unit tests & remove legacy code for subtile interleaving (#33) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Fix gr_lr_roundtrip test * Use non-interleaved version as ref code * Fix scheduler test * Removed legacy interleaved mode for LR/GR offset calculation --- e9cb889: Fix MX FP4 scale buffer allocation and initialization for batched GEMM (#25) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> * Fix bacth count issue * Add batch count tests * Fix bacth count issue * Address PR review: clarify FP4-specific byte stride and add non-aligned batched tests - Updated comments on dataBatchBytes computation to clarify FP4 packing assumption (2 elements/byte) and flag that non-FP4 block-scaling types would require updating this conversion. - Added batched test cases with non-multiple-of-32 M/N dimensions: FP4 DU=256: [48,48,2] and [33,65,2] FP4 DU=512: [63,63,2] BF16: [50,100,2] --------- --- a43247b: Update some test yamls (#31) --- Author: b-shi <brianshi@amd.com> --- e2f69c8: Add f4bs origami library with activation function support. Refactor sgpr allocation to reduce sgpr usage in post loop. Store code-path reorganization (#32) --- Author: b-shi <brianshi@amd.com> * Free swap/localwritebase sgprs before post-loop * Defer sgpr allocation to remove holds in sgpr pool. Add Origami library logic files for Cijk_Alik_Bljk_F4BS_MXA32_MXB32 (base, nta4, ntb4 variants). * Remove uneeded alignment and comment * Add more epilogue tests * Remove older origami library for f4bs * Reorder post-loop code blocks to after persistant loop Misc fixes * Fix build issues, relax longjump sgpr requirements * Fix GSU0 branch logic --------- --- 3f034bf: Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM (#35) --- Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com> * Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM - Add 6 new yaml files (F4HS, F4SS) across Origami, Origami_nta4, Origami_ntb4 - Update F4BS yaml files: AssertSummationElementMultiple 32→256 for K%256 enforcement - Add ("F4", "F4", "H", "S") to _validGEMMTypes and _HPATypes in Problem.py * Add F4HS test cases to subtile_mxfp4.yaml Add two new benchmark problem blocks for FP4→F16 (F4HS): - No-bias block: same wavetile and problem size coverage as F4SS - Bias epilogue block: BiasDataTypeList [s, h], relu/none activations * Add F4HS (FP4->Half) type support to Tensile client Add TypedGemm_F4_H_S typedef and corresponding reference CPU solver case so F4HS (FP4 input, Float16 output, Float compute) problems can be validated by the benchmark client. --------- --- d0bc8fd: Rewrite subtile-based scheduler. Fix DU>64 & enable very large MT (#36) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Initial support for DU>256 * Renaming * add option to do DU=512 in the tests * blocked K-major for scale * Change scaleSet swap logic * Update print functions * Put scales after values for avoid race conditions * Fix tests * more test * tweak printschedule display * Add PGR2 in the yaml tests * Add new scaleGROp * comment out failing tests * Revert "comment out failing tests" This reverts commit 1f5802c. * Draft new logical scheduler * Refactoring * Add more test on step1 * Add more tests on step1 * add bf16 320x320 test * reduce step1 code * Simplify step1 logic * validate some step1 test * Fix partition 2x2 test * more step1 test * 320x320 BF16 test * Add test DU512 + partition2x2 * Simplify step1 code * Add step2 tests * Fix multi-partition step2 * Add step2 du512, 2x2 partition test * Use common algo for all numPartitions * Draft for step3 tests * remove useless tests * New GR algo (draft) * [Step3] Add more test * Iteration on GR * Display ordered GR list with granularities * More test * Add some comments * Disable by default debug logs * Getting rid of step naming * Start remove AnnotatedOp (still there in group pass) * Split dependency Ops * Add todo on place_GRs pass * Valid test_annotate_deps_1x1_partition_DU256 * Test output looking better (still WIP) * single dep for LR tooo * Add remove_cross_deps pass * Fix bugs in dependency pass * insert_gr_lr_inc pass * Add group_lr_gr pass * Add emit pass * Quick port of instruction Emit code * Move emit function to separate file * Refactoring instructionEmitter * Port vgprTile tracking * Reworking second pass (WIP) * Display unrolling requirement * Unrolling check on 2nd pass * Generic validation for assign_vgpr pass * Fix unroll * Add inst schedule in standalone mode * Use lrGran for vgprTile size calculation * Fix bug in emit pass (missing depencency) * PreMFMA path + non-duplication scale load * missing globalReadLDSBufferSwap for GR_INC scales * add wairlr_sync on all LR->GR dep * add waitgr_sync op * remove_unnecessary_gr_deps * Change LR dispatch algo a bit to avoid too many waitgr_sync * Avoid duplicated loads in emitter * Fix bug on gr_emit code * GrInc pass. fix duplicated insertion for B * Fix missing LR_inc for SA/SB * preloop, NLL, NGLL * Simplify preloop * minor changes * Move unroll logic to scheduler * minor changes * Fix unroll id bug on NLL / NGLL * Disable post GRINC for now * Remove commented code * Handle 1x4, 4x1 gr read gran * Fix vmcnt computation * Use correct grCount mapping * Revert in emit logic on buffer_load for PGR0 needs * Add bf16 version in standalone test * Fix LR_Inc insertion on DU>64 * Add subIterK/Partition comment to codegen * Fix issue in GrInc placement * Remove last_mt * Fix LR MT index bug with muli-partition * Disable early LDS size check when subtileImpl is on * Add pass to remove redundant LR deps + fixed issue on dependency annotation pass * Remove more LR redundant deps * Only insert wait_lr_sync on deps * Simple algo to select partition config * Remove HC value for partitions... * Take into account all inflight GR (all tensors) * Fix tests and regressions on gr counts * Fix grCount merge calculation * Better display of dependencies * Add remove_wait_lr_sync after grouping * Add temporary non reg file * Change merge logic on GR grouping pass * Fix non necessary wait_lr_sync * Downgrade some waitlr_sync to sync + added 384x256 no reg test * non reg test 320x320 * Add larger MT * non reg test for fp4 256x256 * Moving out instructionScheduler * Remove old scheduler * Renaming scheduler * Re-work test * Add larger MT test cases * Rename non-ref test * Re-add standalone mode * Refactor DepOp * Remove dead code * Remove MFMATileSize class * Remove from_til_info * Avoid redundant tensor list creation * Remove hardcode granularities in vgrpTile allocation pass. Simplify code. * Re-enable # PGR=2 WG 4x1/1x4, K > DU tests * Remove unused GRScaleOp * DepRef renaming * Get rid of MT string representation * Remove TODO * EmmitedModule simplication * Use explicit pass dependencies * Renaming LogicalScheduler * Remove old test_InterleavingScheduler.py file * Commenting failing test for now * Remove debug logs * Disable lds padding when using UseSubtileImpl --- e8e8c09: Fix LR-GR dependency issue when DU>64 (#40) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Fix and simplify logic for remove_unnecessary_lr_deps * Add new ref tests for 128x128x(128,64) --- 4aa441a: Rebase fix --- Author: b-shi <brianshi@amd.com> --- 5ba911e: Skip rocRoller for FP4-A/FP4-B + pre-swizzled scale layout (#42) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> --- 9c74998: Rebase fix --- Author: b-shi <brianshi@amd.com> --- 842b149: Addressed review comments for KernelWriter and KernelWriterAssembly --- Author: Koji Nakajima <knakajim@amd.com> --- dce43b1: Fix computeLoadSrd issue --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- c075bbf: Fix preSolution CPU re-sync regressing subtile_mxfp4.yaml --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- ced840f: Fix computeLoadSrd issue (#43) --- Author: bnemanich <brad.nemanich@amd.com> --- bc2f6dd: Small update for gfx950 mx tests + more - enable UseSubtileImpl for all gfx950 non subtile mx tests - skip all gfx950 mxfp8 - use MXScaleFormat=1 as default - set AssertSummationElementMultiple=256 for subtile mxfp4 - fix isSwizzledSubtile in computeLoadSrd --- Author: Koji Nakajima <knakajim@amd.com> --- 5c794b7: Fix gsuasb.yaml failures --- Author: b-shi <brianshi@amd.com> --- 727f8db: tensilelite: add solution reject conditions for UseSubtileImpl=1 (#38) --- Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com> --- 8928fbb: Add more reject conditions for Subtile --- Author: Koji Nakajima <knakajim@amd.com> --- 6e63ab6: Fix kringshift test failures --- Author: b-shi <brianshi@amd.com> --- b3e9724: Update reject condtion for DepthU in subtile case. Plus, update DepthU setting for gfx950 mx test cases --- Author: Koji Nakajima <knakajim@amd.com> --- 5ab6009: Fix build errors --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 4a4edca: Update more mxfp4 tensilelite test cases --- Author: Koji Nakajima <knakajim@amd.com> --- bbbc553: Update change log --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 6476c04: Add more reject conditions for gfx950 subtile --- Author: Koji Nakajima <knakajim@amd.com> --- c5828c4: Updated gfx950 mxfp4 test cases - add StreamK setting - skip groupgemm tests for now (groupgemm does not support streamK) --- Author: Koji Nakajima <knakajim@amd.com> --- f1fc2f1: Fix hipblaslt build error of gfx950 --- Author: Koji Nakajima <knakajim@amd.com> --- 70cea1b: Updated subtile_mxfp4.yaml (add StreamK) --- Author: Koji Nakajima <knakajim@amd.com> --- c1c9b2a: Add uninit lsc,lsp, etc.. fields for subtile --- Author: b-shi <brianshi@amd.com> --- c0c1f72: Fixed merge error in testing_matmul.hpp --- Author: Koji Nakajima <knakajim@amd.com> --- 191e0cb: Add missed batch_count >1 changes --- Author: archana-ramalingam <Archana.Ramalingam@amd.com> --- 01c52f8: Addressed PR comments --- Author: Koji Nakajima <knakajim@amd.com> --- 4e89c91: Reduce mxfp4 test time --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 3dac20f: Prevent overflow for wgmxcc sgpr allocation --- Author: b-shi <brianshi@amd.com> --- 18dec79: Fix error with problem type --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 9e69ffd: Add a reject conditoin for gfx950 mx + non Subtile --- Author: Koji Nakajima <knakajim@amd.com> --- 0eed3ba: Add more valid GEMM types --- Author: Brad Nemanich <brad.nemanich@amd.com> --- 8b5514e: Fix missing b build error --- Author: archana-ramalingam <Archana.Ramalingam@amd.com> --- f981ff5: Fix 1250 tests --- Author: Brad Nemanich <brad.nemanich@amd.com> --- d1e69d9: Add more FP4 tests --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- e3a688f: Add MXScaleFormat: 1 to all gfx950 mx test yaml --- Author: Koji Nakajima <knakajim@amd.com> --- aaef3f5: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml --- Author: Koji Nakajima <knakajim@amd.com> --- 861ef8e: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml (nta4,ntb4) --- Author: Koji Nakajima <knakajim@amd.com> Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com> Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com> Co-authored-by: Brian Shi <Brian.Shi@amd.com> Co-authored-by: James Newling <James.Newling@amd.com> Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com> Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com> Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com> Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com> Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com> Co-authored-by: Brian Shi <Brian.Shi@amd.com> Co-authored-by: James Newling <James.Newling@amd.com> Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com> Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com> Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com> Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

…7008) ## Motivation Several memory leaks were detected in MIOpen gtests using ASan. Some of the tests were blacklisted and others were not. This change looks to fix all of the low hanging fruit, which are the majority of the leaks found. This includes all of the critical leaks (>100MB) that were reported. Some other leaks were identified as needing a larger refactor to resolve. After fixing the w_supertensor.cpp leak, the supertensor tests hit virtual memory area limit errors. These changes are to enable running them with ASan, but only with a subset of the tests in order to not hit limits. The test coverage being lost is fairly negligible and the full tests are still run without ASan enabled. ## Technical Details Here is a summary of the files looked at and the changes made: | File | Status | Notes | |------|--------|-------| | `test/gtest/gpu_mha_forward.cpp` | Fixed | Added `miopenDestroySolution()` loop after using solutions. Fixes MHA forward solution descriptor leaks (report #11, #13). | | `test/gtest/gpu_mha_backward.cpp` | Fixed | Added `miopenDestroySolution()` loop after using solutions. Fixes MHA backward solution descriptor leaks (report #9, #12). | | `test/gtest/mha_find20.cpp` | Fixed | Added `miopenDestroySolution()` loop in both `MhaForward` and `MhaBackward` tests. Fixes MHA Find2.0 solution leak (report #16). | | `test/gtest/gtest_desc_guard.hpp` | Fixed | New shared header introduced by the refactor. Provides a single `DescGuard<DescType, CreateFn, DestroyFn>` template (with `TensorDescGuard`, `ConvDescGuard`, `DropoutDescGuard`, `RNNDescGuard` aliases), a `HandleGuard` RAII wrapper for `miopenHandle_t`, and the `DestroyInternalRnnDropoutDesc(rnnDesc)` helper used by every RNN/LSTM/GRU test to free the internal `DropoutDescriptor` that `miopenCreateRNNDescriptor` allocates and `miopenSetRNNDescriptor*` then leaks. Replaces the per-file ad-hoc guard structs from the initial implementation. | | `test/gtest/w_supertensor.cpp` | Fixed | Switched raw descriptors to the shared `RNNDescGuard` / `TensorDescGuard` from `gtest_desc_guard.hpp`. Added a class-local `DestroyDropoutDesc()` (called from `TearDown` and before `miopenSetRNNDescriptor`) to prevent the `miopenSetRNNDescriptor` overwrite leak. Reduced test parameter space under ASan to avoid OOM. Removed unused `seqLen` parameter and dead `param_dev_out`/`bias_dev_out` allocations. | | `test/gtest/lstm.hpp` | Fixed | Switched `rnnDesc` → `RNNDescGuard`, `DropoutDesc` → `DropoutDescGuard`, and `mio_handle` → `HandleGuard` (which now owns the `miopenDestroy` call), all from the shared `gtest_desc_guard.hpp`. Hoisted `dropout_state_buf` so it can be `hipFree`d at the end of the dropout path. Added `DestroyInternalRnnDropoutDesc(rnnDesc)` before `miopenSetRNNDescriptor*` and (in the non-dropout path only) at end of `Run`, which frees the internal `DropoutDescriptor` that the Set call would otherwise leak. Fixes ~615 MB dropout leaks and ~16.5 KB non-dropout descriptor leaks (report #3, #4, #6, #17-#22). | | `test/gtest/gru_test.cpp` | Fixed | In the GRU test class: switched `rnnDesc` → `RNNDescGuard`, `DropoutDesc` → `DropoutDescGuard`, `mio_handle` → `HandleGuard`, and added `DestroyInternalRnnDropoutDesc(rnnDesc)` before `miopenSetRNNDescriptor*` and (non-dropout path only) at end of `Run` plus `hipFree(dropout_state_buf)` for the dropout path. In the in-file `GRUFwdCPUVerify` / `GRUBwdDataCPUVerify` helpers: converted the raw `dropout_inputTensor` / `dropout_outputTensor` declarations to `TensorDescGuard` (mirroring the `cpu_rnn.hpp` change for the LSTM/RNN helpers). | | `test/gtest/softmax_find20.cpp` | Fixed | Changed `Finalize()` to take the `std::vector<miopenSolution_t>&` and destroy each solution via `miopenDestroySolution()` before destroying the problem. Updated all 6 `TEST(...)` callers to pass the solutions vector. Fixes Find2.0 softmax solution/kernel leaks (report #25-#27). | | `test/gtest/rnn_seq_api.hpp` | Fixed | Hoisted `dropout_state_buf` so the dropout path can `hipFree` it at the end. Added `DestroyInternalRnnDropoutDesc(&rnnDesc)` before `miopenSetRNNDescriptor*` and (non-dropout path only) at end of the run to free the internal `DropoutDescriptor` allocations leaked by `RNNDescriptor` copy-assignment. Same pattern as the LSTM/GRU fixes (report #29-#30). | | `test/cpu_rnn.hpp` | Fixed | Converted 6 raw `miopenTensorDescriptor_t dropout_input/outputTensor` declarations across the LSTM/RNN CPU verification helpers (`LSTMFwdCPUVerify`, `LSTMBwdDataCPUVerify`, `RNNFwdTrainCPUVerify`, `RNNBwdDataCPUVerify`, `GRUFwdCPUVerify`, `GRUBwdDataCPUVerify`) to the shared `TensorDescGuard` from `gtest_desc_guard.hpp`. Removed the redundant `miopenCreateTensorDescriptor` calls and updated 12 `miopen::deref(...)` sites to `.get()`. (Note: the GRU helpers in this header are stale duplicates; the live ones are inside `test/gtest/gru_test.cpp` and were updated there too.) Fixes the LSTM/GRU CPU-verify tensor descriptor leaks (report #1, #2, #14, #15, #32, #33). | | `test/gtest/rnn_vanilla_common.hpp` | Fixed | Added `DestroyInternalRnnDropoutDesc(rnnDesc)` calls before `miopenSetRNNDescriptor*` and (non-dropout path only) at end of the run. The `RNNDescGuard` / `DropoutDescGuard` usage was already in place from an earlier commit and now resolves to the shared definitions in `gtest_desc_guard.hpp`. Same pattern as LSTM/GRU/rnn_seq_api fixes (report #14, #15, #32, #33). | | `test/gtest/graphapi_gtest_common.hpp` | Skipped | File no longer exists in the codebase. The GraphApi test infrastructure has been removed; only the leak report (a stale snapshot) still references it. No fix possible against the current source tree. | | `test/gtest/graphapi_execution_plan.cpp` | Skipped | File no longer exists in the codebase (GraphApi removed). The leak it represented was largely an external hipblaslt bug anyway; the remaining test-side portion is not fixable against the current source. | | `test/gtest/na_train.cpp` / `na_inference.cpp` / `na_*_find2.cpp` | Skipped | Leaks come from the internal MIOpen solver/kernel cache living on the global singleton handle, which is never destroyed. Not easily fixable without redesigning global handle lifecycle. Should be suppressed in the ASan suppression file. | | hipblaslt / rocblaslt (external) | Skipped | `SolutionCache::addKernel` and `preloadCustomKernels` leak via `_rocblaslt_handle` constructor. Called from `miopen::Handle::CreateHipblasLtHandle`. This is an upstream bug in hipblaslt/rocblaslt, not fixable in MIOpen. Affects suites that initialize a handle (report Category 2, Category 7). | | CLR / HIP runtime (external) | Skipped | `amd::Context` and `amd::roc::Device` global initialization leaks from `rocclr/platform/context.cpp`. HIP runtime internals, not fixable in MIOpen. | | `src/hipoc/hipoc_program.cpp` | Skipped | `HIPOCProgramImpl` objects leak during kernel compilation/caching (line ~178). This is an internal MIOpen kernel cache lifecycle issue that requires deeper architectural changes to fix. Contributes small amounts to MHA and Softmax Find2.0 leaks. | | `test/gtest/conv_api.cpp` | Fixed | Already clean against current source — `miopenDestroyConvolutionDescriptor(conv_desc)` call exists (line 24) inside the test loop. ASAN run reports no leaks. The leak report was based on a stale snapshot. The hipblaslt handle init portion is tracked under the external-skipped row. | | `test/gtest/log_test.cpp` (CPU_LOG_TEST_FUSION / CPU_LOG_TEST_NEG) | Fixed | Already clean against current source — `Tensor`, `Conv`, `CreateCBAFusionPlan`, `CreateBNormFusionPlan` all have proper destructors that call the corresponding `miopenDestroy*` APIs in `log.cpp`. ASAN run on `CPU_LOG_TEST_*` (11 tests across log_test.cpp + log_test_neg.cpp) reports no leaks. The hipblaslt handle init portion is tracked separately under the external-skipped row. | | `test/gtest/fusion_test.cpp` (CPU_FusionCreateOpConvForward) | Fixed | File renamed from `fusion.cpp` to `fusion_test.cpp`. Already clean against current source — uses `TensorDescGuard`/`ConvDescGuard` for tensor/conv descriptors and calls `miopenDestroyFusionPlan(fusionPlanDesc)` on the fusion plan (line 195). ASAN run on `CPU_FusionCreateOpConvForward_FP32.*` reports no leaks. | | `test/gtest/deterministic_conv_api.cpp` | Fixed | Already clean against current source — uses `ConvDescGuard` (line 66) for the conv descriptor. ASAN run on `*CPU_DeterministicConvApi*` reports no leaks. | | `test/gtest/fusion_aux.cpp` (GPU_FusionAux) | Fixed | Already clean against current source — uses `ConvDescGuard` plus stack-allocated internal C++ objects (`miopen::TensorDescriptor`, `miopen::FusionPlanDescriptor`) which have proper destructors. The `convoOp` handle is owned by the fusion plan. ASAN run on `*GPU_FusionAux*` reports no leaks. | | `test/gtest/backend_api.cpp` (CPU_BackendApi) | Skipped | File no longer exists in the codebase. The backend API test infrastructure (part of the removed GraphApi suite) was removed; no fix possible against the current source tree. | ### High-level notes New shared infrastructure (test/gtest/gtest_desc_guard.hpp) - DescGuard<DescType, CreateFn, DestroyFn> — a single RAII template parameterized on the descriptor type and its create/destroy entry points. Aliases provide TensorDescGuard, ConvDescGuard, DropoutDescGuard, and RNNDescGuard, replacing the four near-identical guard structs that were copy-pasted across test files in the initial implementation. - HandleGuard — separate RAII wrapper for miopenHandle_t (couldn't reuse the template because miopenCreateWithStream takes an extra hipStream_t argument). Supports lazy create(stream) so callers that only need a handle in the dropout branch can default-construct one and populate it conditionally. - DestroyInternalRnnDropoutDesc(rnnDesc) — frees the internal DropoutDescriptor that miopenCreateRNNDescriptor allocates and that miopenSetRNNDescriptor* then orphans. Replaces the equivalent inline blocks that LSTM/GRU/RNN tests were each carrying. The header documents the two call-sites: before each Set* (always safe) and at end-of-run only on the non-dropout path (the dropout path aliases the user-owned descriptor, so freeing would double-free). Recurring patterns enabled by the refactor - The "leak from Set* overwriting the default-constructed internal dropout descriptor" fix collapsed from per-file code to a one-line helper call, applied uniformly across lstm.hpp, gru_test.cpp, rnn_seq_api.hpp, and rnn_vanilla_common.hpp. - mio_handle ownership in LSTM/GRU is now expressed via HandleGuard rather than a manual miopenDestroy at the end of the dropout branch — eliminates a class of forgotten-cleanup bugs. - dropout_state_buf is consistently hoisted out of the dropout if block so an end-of-run hipFree can release it; deletion of the buffer pairs visibly with its allocation. Notable non-RNN change - softmax_find20.cpp was the only Find2.0 leak fix in this commit: Finalize() now takes the solutions vector and calls miopenDestroySolution() for each before destroying the problem. Same shape applied to all 6 tests in the file. ## Test Plan Run the tests beforehand to observe the ASan leak errors and then again afterward to verify the fixes have resolved the problem. ## Test Result List from ROCM-21512: | # | Test Name | Status | Leak Status | |---:|------------------------------------------------------|------------------------------------------------|-----------------| | 1 | Smoke/GPU_RNNVanillaDropout_FP32 | PASSED (4 tests) | No leaks | | 2 | Smoke/GPU_RNNVanillaDropout_FP16 | PASSED (4 tests) | No leaks | | 3 | Full/GPU_LSTM_dropout_FP32 | PASSED (4 tests) | No leaks | | 4 | Full/GPU_LSTM_dropout_FP16 | PASSED (4 tests) | No leaks | | 5 | CPU_GraphApiExecutionPlanBuilder_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 6 | Full/GPU_LSTM_dropout_FP64 | REMOVED (PR #5750, 2026-03-26) | n/a (deleted) | | 7 | Unit/CPU_GraphApiPointwise_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 8 | Full/GPU_LstmMSRnn_FP32 | PASSED (1152 tests) | No leaks | | 9 | Smoke/GPU_Bwd_Mha_FP32 | PASSED (12 tests) | No leaks | | 10 | Full/GPU_LstmMSRnn_FP16 | PASSED (864 tests) | No leaks | | 11 | Smoke/GPU_Fwd_Mha_FP32 | PASSED (15 tests) | No leaks | | 12 | Full/GPU_Bwd_Mha_FP32 | PASSED (6 tests) | No leaks | | 13 | Full/GPU_Fwd_Mha_FP32 | PASSED (7 tests) | No leaks | | 14 | Full/GPU_RNNVanilla_FP32 | PASSED (96 tests) | No leaks | | 15 | Full/GPU_RNNVanilla_FP16 | PASSED (96 tests) | No leaks | | 16 | GPU_TestMhaFind20_FP32 | PASSED (2 tests) | No leaks | | 17 | Full/GPU_LSTM_FP32 | PASSED (32 tests) | No leaks | | 18 | Full/GPU_LSTM_FP16 | PASSED (32 tests) | No leaks | | 19 | Full/GPU_LSTM_extra_FP32 | PASSED (30 tests) | No leaks | | 20 | Full/GPU_LSTM_extra_FP16 | PASSED (30 tests) | No leaks | | 21 | Full/GPU_DeepBench_LSTM_FP16 | PASSED (22 tests) | No leaks | | 22 | Full/GPU_DeepBench_LSTM_FP32 | PASSED (22 tests) | No leaks | | 23 | CPU_LOG_TEST_FUSION_NONE | PASSED (2 tests) | No leaks | | 24 | CPU_LOG_TEST_NEG_NONE | PASSED (4 tests) | No leaks | | 25 | GPU_SoftmaxFind20_BFP16 | PASSED (2 tests) | No leaks | | 26 | GPU_SoftmaxFind20_FP16 | PASSED (2 tests) | No leaks | | 27 | GPU_SoftmaxFind20_FP32 | PASSED (2 tests) | No leaks | | 28 | CPU_ConvApi_NONE | PASSED (1 test) | No leaks | | 29 | Full/GPU_RNNSeqApi_FP16 | PASSED (16 tests) | No leaks | | 30 | Full/GPU_RNNSeqApi_FP32 | PASSED (16 tests) | No leaks | | 31 | UnitVAN/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 32 | Smoke/GPU_RNNVanilla_FP16 | PASSED (4 tests) | No leaks | | 33 | Smoke/GPU_RNNVanilla_FP32 | PASSED (4 tests) | No leaks | | 34 | CPU_FusionCreateOpConvForward_FP32 | PASSED (1 test) | No leaks | | 35 | CPU_GraphApiOperationReduction_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 36 | Unit2IV1/CPU_GraphApiOperationPointwiseBwd_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 37 | Unit2IV1/CPU_GraphApiOperationPointwiseTwoInput | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 38 | Unit2IV2/CPU_GraphApiOperationPointwiseBwd_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 39 | Unit2IV2/CPU_GraphApiOperationPointwiseTwoInput | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 40 | UnitVAB/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 41 | Smoke/GPU_RNNVanillaDropout_FP16 (duplicate of #2) | (see #2) | (see #2) | | 42 | CPU_GraphApiOperationGraphDescriptor_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 43 | UnitVA/CPU_GraphApiVariantPack_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 44 | UnitVAU/CPU_GraphApiRng_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 45 | CPU_GraphApiOperationReshape_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 46 | Smoke/CPU_DeterministicConvApi_NONE | PASSED (1 test) | No leaks | | 47 | Smoke/GPU_FusionAux_FP32 | PASSED (1 test) | No leaks | | 48 | CPU_GraphApiEngineHeur_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 49 | Unit/CPU_GraphApiReduction_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 50 | CPU_GraphApiEngineCfg_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 51 | Unit/CPU_GraphApiMatMul_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 52 | CPU_BackendApi_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 53 | UnitIV/CPU_GraphApiOperationPointwiseOneInput | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 54 | Unit3IV/CPU_GraphApiOperationPointwiseThreeInput | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 55 | UnitVA/CPU_GraphApiOperationMatmul_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | 56 | UnitVA/CPU_GraphApiOperationRng_NONE | REMOVED (PR #5603, 2026-03-26) | n/a (deleted) | | Outcome | Count | |------------------------------------|------:| | Passed, no leaks | 39 | | Passed, leaks detected | 0 | | Failed | 0 | | Crashed / timed out | 0 | | Removed — GraphAPI purge (#5603) | 15 | | Removed — FP64 LSTM purge (#5750) | 1 | | Duplicate (not re-run) | 1 | | **Total rows** | **56** | ## Risk Assessment Low --------- Co-authored-by: JonathanLichtnerAMD <195780826+JonathanLichtnerAMD@users.noreply.github.com>

pjots and others added 30 commits February 25, 2025 14:15

Skip InProcess assembler for new archs

c53bcb1

Retrieve Architecture Information from the Machine Readable ISA

92c20d5

Allow use of internal MRISA files when building rocRoller.

e149cfe

Adds the cache variable INTERNAL_MRISAS to the CMake build scripts, and allows CI to set these using secret files.

Support building and testing on a machine with no GPU

69d0f86

Cleanup GPU test filter Ignore .venv

QA for OCP vs NANOO mode to Settings class

ec47d93

* Use TEST_F with fixture tests * fixed some out of bounds cast_to_f8 * Make Settings threadsafe

Add FP6 and BF6 GEMM test with FP32 output

31198bd

* Add FP6 and BF6 GEMM tests with FP32 output * Fix incorrect test names and setting * Updated emulator test suite

Added more expansive emulator test suite

31c8db5

* added new filters * skip some tests * run rrperf for only f8, f6, f4

Create a ScaledMatrixMultiply test for one tile with fixed scale values

efaeb01

* Implement fixed scale matrix multiplication * Add more data types to scale mfma and new MM and GEMM tests * Make ScaledMatrixMultiplyGenerator be a single component and update expression test

Add scaled mixed-precision matrix multiplication tests

fdbe1ef

* Add scaled mixed-precision matrix multiplication tests

Test gfx950 new convert functions: test round-trip

a715120

* added a GPU convert unit test * include workgroup info in the index calculation * use command to launch kernel * added comments * added conversion test dir

A separate suite for client mixed gemm tests

e115531

Co-authored-by: Yiqian Liu <yiqialiu@amd.com> Co-authored-by: Carvalho, Joao <joao.carvalho@amd.com>

Create unit tests for LDS transpose loads for 4, 6, 8, and 16-bit dat…

e2b30ee

…a types

QA for scaled gemm tests

6715b5a

* Parameterize scale value * Add more test values * Add TODO note --------- Co-authored-by: Yiqian Liu <yiqialiu@amd.com> Co-authored-by: Joao P. L. de Carvalho <joao.carvalho@amd.com>

Set gtest discovery_mode to pre_test

3ce9b05

Co-authored-by: Carvalho, Joao <joao.carvalho@amd.com>

Fix inf/nan/clip for convert from FP32/FP16 to OCP FP8

2a7a12d

Co-authored-by: Carvalho, Joao <joao.carvalho@amd.com>

Merge branch 'master' into 'gfx950'.

48ef3fb

Add test to verify Settings::get is threadsafe

56619ae

* Added test always fails if locks in Settings_impl.hpp are commented out; * Restored CI threads to 32 instead of nproc (16 threads); * Set OMP_NUM_THREADS=8 for our tests; and * Updates Rocm build.

QA for Fix inf/nan/clip for convert from FP32/FP16 to OCP FP8

8143006

* QA for Fix inf/nan/clip for convert from FP32/FP16 to OCP FP8

Modify thread safety test.

708f401

Remove unused headers

fd80da3

Co-authored-by: Yiqian Liu <yiqialiu@amd.com>

Merge commit 'e7610fc9' into gfx950

ae46221

Flatten template of PotentialObservers

d193ab6

createObserver leverages constexpr Add CObserverRuntime and CObserverConst

Fix user-entrypoint to be able to restart a dev container.

0388cf9

Add Pseudo random number generator

433156f

Add a RandomNumber Expression to emit code to generate a random number

Merge commit '2584b5fe' into maemmett/update-gfx950

99e8de9

ibrahimw1 mentioned this pull request Mar 10, 2026

[hipsolver] Refactor build system ibrahimw1/rocm-libraries#2

Open

1 task

Alex-Vasile added a commit that referenced this pull request Apr 29, 2026

[plan] Remove resolved known issues #1 and #6, renumber remaining

b55d84d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

NolanHannaAMD mentioned this pull request May 1, 2026

[MIOpen] [ROCM-21512] [ALMIOPEN-1814] Fix MIOpen ASan memory leaks #7008

Merged

bnemanich mentioned this pull request May 3, 2026

Add gfx950 MXFP4 and BF16 Subtile-based kernel implementation #7017

Merged

1 task

davidd-amd mentioned this pull request May 5, 2026

Fix/hipblaslt _APIAlgoIndex regression on per-arch mapping installs #7009

Merged

3 tasks

poyenc mentioned this pull request May 18, 2026

[CK] Fix RDNA3 FMHA tile-load paths #7016

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reorganize project folders#6

Reorganize project folders#6
jayhawk-commits merged 824 commits into
developfrom
folderReorg

jayhawk-commits commented Apr 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

14 participants

Conversation

jayhawk-commits commented Apr 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

14 participants