[Tensilelite][Sparse] Enable plr min for spmm#4364
Merged
Conversation
perfci run on commit 66e22ac |
Codecov Report✅ All modified and coverable lines are covered by tests. ❌ Your project status has failed because the head coverage (76.83%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #4364 +/- ##
========================================
Coverage 65.33% 65.33%
========================================
Files 1577 1577
Lines 242154 242154
Branches 33912 33912
========================================
Hits 158201 158201
Misses 69945 69945
Partials 14008 14008
*This pull request uses carry forward flags. Click here to find out more. 🚀 New features to boost your workflow:
|
aosewski
pushed a commit
that referenced
this pull request
Feb 24, 2026
## Motivation Ref: #1208. Enable plr-min optimization for spmm ## Technical Details Plr-min only support no packing case. This PR enables plr-min for spmm when `TransposeLDSMetadata` is True ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
jovanau
pushed a commit
to jovanau/rocm-libraries
that referenced
this pull request
Mar 19, 2026
## Motivation Ref: ROCm#1208. Enable plr-min optimization for spmm ## Technical Details Plr-min only support no packing case. This PR enables plr-min for spmm when `TransposeLDSMetadata` is True ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
bethune-bryant
pushed a commit
that referenced
this pull request
Apr 3, 2026
This reverts commit cb81c2e.
bethune-bryant
pushed a commit
that referenced
this pull request
Apr 3, 2026
This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.
jichangjichang
pushed a commit
that referenced
this pull request
Apr 8, 2026
This reverts commit cb81c2e.
jichangjichang
pushed a commit
that referenced
this pull request
Apr 8, 2026
This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.
jichangjichang
pushed a commit
that referenced
this pull request
Apr 13, 2026
This reverts commit cb81c2e.
jichangjichang
pushed a commit
that referenced
this pull request
Apr 13, 2026
This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.
TonyYHsieh
pushed a commit
that referenced
this pull request
Apr 13, 2026
This reverts commit cb81c2e.
TonyYHsieh
pushed a commit
that referenced
this pull request
Apr 13, 2026
This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.
jichangjichang
pushed a commit
that referenced
this pull request
Apr 13, 2026
This reverts commit cb81c2e.
jichangjichang
pushed a commit
that referenced
this pull request
Apr 13, 2026
This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.
tomchengchitang
pushed a commit
that referenced
this pull request
Apr 16, 2026
This reverts commit cb81c2e.
tomchengchitang
pushed a commit
that referenced
this pull request
Apr 16, 2026
This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.
tomchengchitang
added a commit
that referenced
this pull request
Apr 16, 2026
…te_to_1250 (#1115) * fix computeInputType issue in ReferenceValidator.cpp * [hipblaslt] fix unit tests for gfx950_mx_rebase (#4912) Fix unit tests for gfx950_mx_rebase Fix unit tests for gfx950_mx_rebase - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Alex Brown <alex.brown@amd.com> Co-authored-by: Chun-Xun Lin <Chun-Xun.Lin@amd.com> * [hipblaslt] Fix a verification fail with spmm_i8hs.yaml (#5034) Fix a fail with spmm_i8hs.yaml - use ["_DepthU"] istead of ["_DepthUA/B"] in sparse case - restore missing code for sparse tensilelite common test tensilelite common test (spmm related) passed with this change - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * initial set of testcase for MXFP4 (#4739) This PR is to add test cases for tensile-lite related with MXFP4 support The change here is to add new YAML file which will verify the MXFP4 support https://github.com/ROCm/rocm-libraries/commit/333d1a24bc5c1a1a1685ed1c22cd6979ad8c5839 <!-- Explain any relevant testing done to verify this PR. --> <!-- Briefly summarize test outcomes. --> - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com> * [Tensilelite] Add regression test for MX FP4 scale buffer determinism (#4959) Write test to catch post-generation overwrites of `MXSA`/`MXSB` buffers that desync the CPU reference from GPU data, causing intermittent MX FP4 validation failures. Added `MXScaleDeterminismTest/ScaleBufferIsDeterministic` to `MXDataGen_test.cpp` - calls `generateMXInput` twice with sentinel-initialized scale buffers (`0x00` and `0xFF`) and asserts equality - the differing sentinels prevent a silent no-write from passing. Includes non-32-aligned cases (M=204, M=213) that failed due to `initializeCPUInputs` overwriting `MXSA`/`MXSB` after `generateMXInput` populated them. - `MXScaleDeterminismTest/ScaleBufferIsDeterministic` — 4 cases - [x] All tests passing - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * UseF32XEmulation in forceLrvwTile1 for B tensor (#5143) - When `gfx950_mx_rebase` split `forceLrvwTile1` into per-tensor definitions (using `MacDataTypeA`/`MacDataTypeB` instead of `DataType`), the B tensor copy lost the `not UseF32XEmulation` guard present on `develop` - This caused TF32 kernels to get `lrvwTileB=1` instead of `VectorWidthB=2`, halving the local read block width and doubling the number of LRB instructions (8 → 16), breaking CMS schedules - One-line fix: add `and (not kernel["UseF32XEmulation"])` to the B tensor `forceLrvwTile1` definition - [x] `xfp32.yaml` passes - [x] `cms_tf32_nt.yaml` passes (extracted from `custom_mainloop_scheduling_tf32.yaml`) 🤖 Generated with [Claude Code](https://claude.com/claude-code) AIHPBLAS-1221 AIHPBLAS-1224 Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * [hipblaslt] Enable StoreSwapAddr for MXFP4, plus add GRVWMXSA/B adustment for DTL (#5117) Enable MT256x256x256 for MXFP4. Need StoreSwapAddr and GRVWMXSA/B adjustment for DTL - Enable StoreSwapAddr for MXFP4 - Added GRVWMXSA/B adjustment logic for DTL (64bit to 32bit) CI test CI test - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * [Tensilelite] Fix UserArgs struct stride mismatch in grouped GEMM (#5129) Grouped GEMM with `UseUserArgs: True` crashes with `hipErrorIllegalAddress` because the kernel's `userArgsInfo.totalSize` doesn't match `sizeof(DeviceUserArguments)`. `DeviceUserArguments` is a fixed 196-byte packed struct, but `userArgsInfo` field sizes in `Signature.py` were only set inside feature-gated blocks, leaving them at 0 when features were disabled. This made `totalSize` (stride between per-problem entries) and `extArgOffset` (epilogue field offsets) smaller than the struct, causing the kernel to read from wrong offsets for multi-problem grouped GEMMs. Fix: move size assignments outside conditional blocks to always match the struct layout. Existing test: `grouped_gemm_userargs.yaml` — validates grouped GEMM with `UseUserArgs: True` across ScaleAlphaVec, Bias+Activation, and plain configurations. - [x] Test passed - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AIHPBLAS-1220 * [hipBLASLt] Disable failed mx f8 problem sizes (#5105) Comment two problem sizes of the new mx f8 tests that currently fail. Manually tested other problem sizes to verify they pass. <!-- Briefly summarize test outcomes. --> - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * [hipblaslt] Scheduling related fixes for MXFP4 (#5169) Improve MXFP4 instruction scheduling. Plus, reduce sgpr usage for MXFP4+StreamK - Added if condition for MXFP4 in getMFMAIssueLatency - Fixed numGRIncInst calculation for MXFP4 - Use MinGRIncPerMfma=3 for MXFP4 - Disable staggerU code for MXFP4 + StreamK CI test Check CI test - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * remove explicit constructor from variable_value class * fix return statement in hipDataType_to_tensile_type and add type check for activation inputs * [Tensilelite] Shuffle mx scaling data in Tensile (#4864) Generate pre-shuffled scaling data that TensileLite can use for MX datatypes Added optional `--enable-mx-preswizzle` flag to control MX FP4 scale pre-swizzle. When enabled, the flag allows scale tensors to be rearranged into gfx950 GPU kernel memory access patterns using parameters `{swizzleTileMN=32, tileK=8, subTileK=4}` and `{tileK=8, swizzleTileMN=32}`. Pre-swizzle requirements: flag enabled, solution available, `useScaleAB` populated, and dimension alignment satisfied (rows%256==0, cols%32==0 for mxBlock=32). `MXPreSwizzleTest` validates scale permutation behavior and data buffer invariance across multiple matrix dimensions and transpose modes. Tests pass successfully - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AIHPBLAS-789 * [hipblaslt] Fix fail with kringshift.yaml (#5228) Fix fail with kringshift.yaml on gfx950_mx_rebase branch Swapped the location of graAddresses and graFinalOffsets to align with develop branch. Before this fix, s[sgprSrdB+0] was referred before initialization in kringshift case. Run kringshift.yaml kringshift.yaml passed. - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * [hipblaslt] Optimize StoreSwapAddr (#5217) optimize sgpr allocation for StoreSwapAddr - use 1 common sgpr (swapCommon) for all (A,B,MXSA,MXSB) - use s_add m0, swapCommon, LocalWriteAddr for m0 initialization Run existing dtl test dtl.yaml passed - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * [hipblaslt] Enable MXFP4 + DtlPlusLdsBuf (#5251) MXFP4 + DtlPlusLdsBuf was broken and need to enable it Added missing code for MXFP4 + DtlPlusLdsBuf Local test with MXFP4 + DtlPlusLdsBuf Local test passed - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * Fix gfx12 build error with integer cast * [hipblaslt] Fix SIA3 issues with MXFP4 (#5245) Fix scheduling issues with MXFP4 + SIA3 - fix incorrect local read count for MXSA/B in getLocalWriteMFMAStart - fix incorrect global read count for MXSA/B in getNumLocalWritePerMfma - optimize PointerLRCode scheduling for MXFP4 tensilelite common test tensilelite common test passed - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * [hipBLASLt] Fix CI failures for gfx942 (#5216) This PR fixes failed gfx942 tests in CI. For fnuz/non-fnuz f8 types, the `ComputeInputType` might appear in a concatenated form:`ComputeInputTypeAComputeInputTypeB` (e.g., `BFloat8Float8_fnuz`). This leads to type mismatches when selecting solutions. This PR adds logic to correctly interpret concatenated types by checking the type at the corresponding position. Manually run these tests: gfx942: - `./clients/hipblaslt-test --gtest_filter=*pre_checkin_matmul_real_1b_fnuz_dst_1b_fnuz_smallsize_bf8_fnuz_rf8_fnuz_rbf8_fnuz_rbf8_fnuz_rf32_r_NT_3_128_128_1_3_128_2_3_3_1_SA_SB*` gfx950: - `./clients/hipblaslt-test --gtest_filter=*matmul_f8_dst_bf16*` - `./clients/hipblaslt-test --gtest_filter=*matmul_f8_bf8_dst_bf16*` - `./clients/hipblaslt-test --gtest_filter=*matmul_f8_bf8_dst_fp32*` <!-- Briefly summarize test outcomes. --> - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * Make the usage side’s logic consistent with allocation side (tPackM) (#5273) - The error: [math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/6/pipeline-overview/log?nodeId=628](https://math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/6/pipeline-overview/log?nodeId=628) - Fix the inconsistency between LocalRead.py and KernelWriterAssembly.py - Change `setComputeInputType` to `setComputeInputTypeA` and `setComputeInputTypeB`, because `setComputeInputType` doesn't exist anymore. <!-- Explain the purpose of this PR and the goals it aims to achieve. --> - Inside the LocalRead side, it tries to reference the wrong names, which are non-set sgprs such as `sgprPackKForV2` and `sgprPackKForV3`. It shall reference `sgprPackKForMV2/sgprPackKForMV3`, so I have to make it reference the right ones. <!-- Explain the changes along with any relevant GitHub links. --> - Please use the attached file `setup_hipsparselt_build` to set the packages and env variables the same as the CI. [setup_hipsparselt_build.sh](https://github.com/user-attachments/files/25870143/setup_hipsparselt_build.sh) - `./install.sh -c` <!-- Explain any relevant testing done to verify this PR. --> - first build with errors: [build_hipsparselt_1st.log](https://github.com/user-attachments/files/25870212/build_hipsparselt_1st.log) - final build WITHOUT errors: [build_hipsparselt_final.log](https://github.com/user-attachments/files/25870348/build_hipsparselt_final.log) <!-- Briefly summarize test outcomes. --> - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: tomchengchitang <tom.tang@amd.com> * [hipblaslt] Fix fail with gfx942+dtv/dtl.yaml (#5349) Fix dtv/dtl.yaml fail on gfx942 Add int cast to avoid using float value in const field Run dtv/dtl.yaml on gfx942 dtv/dtl.yaml padded on gfx942 - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * [hipblaslt] disable mxDataGenerator for windows builds (#5298) Disable mxDataGenerator for Windows builds. Disable mxDataGenerator for Windows builds. - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com> * fix build errors of merge * Fix: add MacDataTypeA to mock kernel (#5351) Fix the unit test now that we use `MacDataTypeA` to get the datatype for a matrix instruction. Unit tests pass * [hipblaslt] Fix tox test fp8_gfx12 failed when dtva1=1 or dtvb1=1 (#5390) - The tox test in CI failed with Tensile/Tests/common/gemm/gfx12/fp8_gfx12.yaml <!-- Explain the purpose of this PR and the goals it aims to achieve. --> - When the kernels do the validation with host side, kernels with dtva1 or dtvb1 always/usually failed. <img width="604" height="177" alt="image" src="https://github.com/user-attachments/assets/2409c81e-99b3-4cd2-8d5d-6ea1214af32d" /> - The reason behind that is v_bfe_u32 v4, v[Serial], 2.0, 1 → 2.0 encodes as 0x40000000 in IEEE 754, so the GPU shifts by 0x40000000 & 0x1F = 0 bits, extracting bit 0 instead of bit 2 <!-- Explain the changes along with any relevant GitHub links. --> `tox -e py3 -- Tensile/Tests -k Tensile/Tests/common/gemm/gfx12/fp8_gfx12.yaml 2>&1 | tee fp8_gfx12.log` <!-- Explain any relevant testing done to verify this PR. --> Build succeed log: [fp8_gfx12_fixed.log](https://github.com/user-attachments/files/25933785/fp8_gfx12_fixed.log) <!-- Briefly summarize test outcomes. --> - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: tomchengchitang <Tom.Tang@amd.com> * Revert "[hipblaslt] disable mxDataGenerator for windows builds (#5298)" This reverts commit e20a0900e6c118c82a1e22ce61ae7921a28a3acf. * [hipBLASLt] Fix failed swizzle tests (#5400) Fix failed tests: `Tensile/Tests/common/gemm/swizzleA.yaml` `Tensile/Tests/common/gemm/swizzleB.yaml` For MFMA validation, need to use `F32XdlMathOp` data type instead of `DataType` depending on if `EnableF32XdlMathOp` is enabled. Run the failed tests locally to verify the fix. <!-- Briefly summarize test outcomes. --> - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * [hipblaslt] Fix tailLoop errors in GLOBAL_OFFSET_{A or B} for fp16_gfx12.yaml (#5419) - CI test `tox -e py3 -- Tensile/Tests -k Tensile/Tests -k Tensile/Tests/common/gemm/gfx12/fp16_gfx12.yaml` always failed in gfx12 machines <!-- Explain the purpose of this PR and the goals it aims to achieve. --> - The test passed when K = 256 (multiple of 32), but failed with K=255 and K=257 - So the problem occurred in TailLoop - Look at the TailLoop asm code: <img width="2796" height="290" alt="image" src="https://github.com/user-attachments/assets/ca974c71-6e65-45c1-8cd0-5ce6680454a9" /> - Every time we use GLOBAL_OFFSET_{A or B}, we have to add the `v_lshlrev_b32` back, but I found there is some mismatch between GLOBAL_OFFSET_* which is in the TailLoop. <!-- Explain the changes along with any relevant GitHub links. --> Script: `tox -e py3 -- Tensile/Tests -k Tensile/Tests/common/gemm/gfx12/fp16_gfx12.yaml`. <!-- Explain any relevant testing done to verify this PR. --> The test passed. [fp16_gfx12_debug.log](https://github.com/user-attachments/files/25963352/fp16_gfx12_debug.log) <!-- Briefly summarize test outcomes. --> - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: tomchengchitang <Tom.Tang@amd.com> * [hipblaslt] disable mxDataGenerator for windows builds (#5414) Disable mxDataGenerator for Windows builds. Disable mxDataGenerator for Windows builds. The original PR failed `Tensile/Tests/common/gemm/gfx950/mx32f4_tn.yaml` since we removed data types. restore the original implementation with OS specific build dispatch. - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * [hipblaslt] Add F4/F6/BF6 to instTypeToDataType (#5457) F4/F6/BF6 missing in instTypeToDataType Add F4/F6/BF6 to instTypeToDataType CI test check CI test result - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * [hipblaslt] remove MXFP4 TN logic file (#5487) Fix all_solutions fail. remove MXFP4 TN logic file (which has incorrect settings) hipblaslt-test with heuristic_all_solutions hipblaslt-test with heuristic_all_solutions passed - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * [hipblaslt] Use64bShadowLimitMX support (#5499) Reduce the amount of sgpr usage by setting Use64bShadowLimit for MXSA/B separately from A/B - Added a new parameter to set Use64bShadowLimit for MXSA/B separately from A/B - Added test cases for Use64bShadowLimitMX New test cases in mx32f4_tn.yaml mx32f4_tn.yaml passed - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * cleanup redundant lines of code * [hipBLASLt] Fix failed rocRoller test (#5529) Fix the rocRoller failed test in precheckin stage: > [ FAILED ] 1 test, listed below: > [ FAILED ] _/rocroller_predicate_test.unrollXYK/pre_checkin_rocroller_predicate_f8_rf8_rf32_rf32_rf32_r_TN_128_128_96_1p5_96_96_2_128_128_1_SAMX_32_UE8M0_SBMX_32_UE8M0, where GetParam() = { function: "rocroller_predicate", name: "rocroller_predicate", category: "pre_checkin", known_bug_platforms: "", alpha: 1.5, beta: 2, stride_a: 0x9ec9018, stride_b: 0x9ec9118, stride_c: 0x9ec9218, stride_d: 0x9ec9318, stride_e: 0x9ec9418, user_allocated_workspace: 134217728, M: 0x9ec9520, N: 0x9ec9620, K: 0x9ec9720, lda: 0x9ec9820, ldb: 0x9ec9920, ldc: 0x9ec9a20, ldd: 0x9ec9b20, lde: 0x9ec9c20, batch_count: 1, iters: 10, cold_iters: 2, algo: 0, solution_index: -1, requested_solution_num: 1, a_type: f8_r, b_type: f8_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r, compute_input_typeA: non-supported type, compute_input_typeB: non-supported type, scale_type: f32_r, initialization: "hpl", gpu_arch: "950", gpu_arch_exclude: "", pad: 4096, grouped_gemm: 0, threads: 0, streams: 0, devices: Change visibility of macro `HIPBLASLT_ENABLE_MXDATAGENERATOR` to make it visible (defined) to the files that use it. Manually run the failed test: `./clients/hipblaslt-test --gtest_filter=*pre_checkin_rocroller_predicate_f8_rf8_rf32_rf32_rf32_r_TN_128_128_96_1p5_96_96_2_128_128_1_SAMX_32_UE8M0_SBMX_32_UE8M0*` <!-- Briefly summarize test outcomes. --> - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * Revert "cleanup redundant lines of code" This reverts commit 7b3a5e5e34c0dcef34560566da118bd6ef655cf4. * [hipsparselt] Fix numSplitMetadata logic (#5608) - hipsparselt-test failed a lots: https://math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/26/pipeline-overview/?selected-node=624 - Fix numSplitMetadata logic <!-- Explain the purpose of this PR and the goals it aims to achieve. --> - Change the numSplitMetadata logic the same as develop branch <!-- Explain the changes along with any relevant GitHub links. --> <!-- Explain any relevant testing done to verify this PR. --> <img width="1150" height="224" alt="image" src="https://github.com/user-attachments/assets/d3fd68a1-b0d0-422b-a419-183cabb2bf50" /> Note that I only count the common failed test between my local and CI. <!-- Briefly summarize test outcomes. --> - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: tomchengchitang <tom.tang@amd.com> * [hipblaslt] const GRInc support (#5526) Reduce the amount of sgpr usage for MXFP4 - Enable constant GRInc value instead of allocating sgpr for applicable cases Run tensilelite common test Run hipblaslt-test tensilelite common test passed hipblaslt-test passed (except for known issue) - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * Fix merge * Fix tensilelite build error due to merge conflict * fix various issue in review * [hipblaslt] Reject MX + nonDTL + UnrollLoopSwapGlobalReadOrder (#5794) Fix an error with MX + nonDTL + UnrollLoopSwapGlobalReadOrder Reject MX + nonDTL + UnrollLoopSwapGlobalReadOrder Local test Local test passed (rejected) - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. * [hipsparselt] Restore to develop logic and fix mistakenly used PackKForMV (#5796) - Continuing fixing the hipsparselt-test failed test <!-- Explain the purpose of this PR and the goals it aims to achieve. --> - As the kernel shows, it misuses the PackKForMV <img width="2033" height="387" alt="image" src="https://github.com/user-attachments/assets/b45c5ab9-e293-44ba-b82a-ab2bd8cf9ba8" /> <!-- Explain the changes along with any relevant GitHub links. --> - hipsparselt-test: `TENSILE_DB=0x8000 ./build/release/hipsparselt-install/bin/hipsparselt-test --gtest_output=xml --gtest_color=yes '--gtest_filter=*quick*'` <!-- Explain any relevant testing done to verify this PR. --> <img width="874" height="605" alt="image" src="https://github.com/user-attachments/assets/2c2d2c48-7be6-4505-8ad3-557c8416a7a2" /> From 750 failed test -> 260 <!-- Briefly summarize test outcomes. --> - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: tomchengchitang <tom.tang@amd.com> * [hipsparelt] Delete spurious rIdx_ loop for hipsparselt failed tests (#5848) - Fix all hipsparselt-test failed test in rhel && gfx950 <!-- Explain the purpose of this PR and the goals it aims to achieve. --> - delete the extra `for rIdx_ loop` in LocalRead.py for the assembly <img width="1805" height="681" alt="image" src="https://github.com/user-attachments/assets/d633bd82-a291-43c2-a9c6-d99276a5335d" /> <!-- Explain the changes along with any relevant GitHub links. --> - test script: `./build/release/hipsparselt-install/bin/hipsparselt-test --gtest_output=xml --gtest_color=yes --gtest_filter=*quick* ` <!-- Explain any relevant testing done to verify this PR. --> - All passed <img width="1212" height="250" alt="image" src="https://github.com/user-attachments/assets/797091e6-2bb1-4262-9513-5220ff6b3122" /> <!-- Briefly summarize test outcomes. --> - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: tomchengchitang <tom.tang@amd.com> * Added a separate testenv for rocisa * MX: F4: scale type E5M3 * MX: support MX with non-MX combination * MX : initKernel part * MX: Tail: remove dirty MX data * MX: Tail: release MX valu vgpr after tail * MX: Edge: MX Scale shiftptr align with noraml Buffer * Shiftptr: fix Shiftptr limitation * MX: update AssertSummationElementMultiple for MX * F4: use 32x16 step 1 * F4: use 32x16: local read: add numTilePerInst loop * F4: use 32x16: local read: fix offset * F4: use 32x16: wmma tail: add numTilePerInst loop * F4: use 32x16: wmma tail * F4: use 32x16: store D * F4: use 32x16: vectorwidth * F4: use 32x16: multi wave * F4: use 32x16: local read: use MIInputPerThUnroll * F4: use 32x16: local read: add numTilePerInst loop * F4: use 32x16: local read: dstr offset * F4: use 32x16: local read: fix wave offset * MX: F4: fix tmpVGPR allocation * MXF4: support 32x16 instruction * F4: use 32x16: multi wave: NT, TT * F4: use 32x16: wmma tail * MX: fix MX buffer load length * MX: HipBlasLt: add MXE8B32 F8 yamls * MX: HipBlasLt: add MXE8B32B B8 yamls * MX: HipBlasLt: add MXE8B16 F8 yamls * MX: HipBlasLt: add MXE8B16 B8 yamls * MX: HipBlasLt: add MXE8B16 F6 yamls * MX: HipBlasLt: add MXE8B16 B6 yamls * MX: HipBlasLt: add MXE8B32 B6 yamls * MX: HipBlasLt: add MXE8B32 F6 yamls * MX: HipBlasLt: add MXE8B32 F4 yamls * MX: HipBlasLt: add MXF8B32 F4 yamls * MX: HipBlasLt: add MXE5M3B32 F4 yamls * MX: HipBlasLt: add MXE5M3B16 F4 yamls * MX: HipBlasLt: add MXF8B16 F4 yamls * MX: HipBlasLt: add MXE8B16 F4 yamls * MX: HipBlasLt: add MXE8B32/MXF8B16 sample * MX: HipBlasLt: support MXE8B32/MXF8B16 F4 * MX: HipBlasLt: support MXF8B32/MXE8B16 F4 * MX: HipBlasLt: add MXF8B32/MXE8B16 sample * MX: HipBlasLt: support MXF8B32/MXE8B16 F4 * MX: HipBlasLt: add MXE5M3B32/MXE5M3B16 sample * EfficiencyMonitor: do nothing in set_device_id if not enabled * benchmark: enable F4/F6 type without block scale * MX: hipblaslt-bench: support MXE8B32/MXE8B16 F4 * MX: hipblaslt-bench: support all F4 mx scale type * MX: hipblaslt-bench: F8/F6/F4 smoke gtest * MX: hipblaslt-bench: limit hipblaslt_e8 rand_int/hpl init value * Fix hipblaslt-test failed with HHS This patch fix the following hipblaslt-test test case (HHS): smoke_matmul_bias_relu_SAV_smoke_f16_rf16_rf16_rf16_rf32_r_relu* * f8f6f6 mix mode on hipblaslt * Added TensorLoadToLds for rocisa * WIP: F8 codegen for TDM * WIP: several bug fixes for TDM * WIP: TDM wave separated initial support * WIP: TDM uses up to 12 SGPRs now * WIP: TDM optimize lds address swap * WIP: TDM supports FP4 now * WIP: temporarily integrate regular MX and TDM A&B * Fix bugs in UseSgprForGRO. * WIP: remove g2l vgpr allocations for A & B when TDM enabled * WIP: fixed waitcnt calculation when TDM enabled * WIP: TDM supports LdsAlignPow2 is False * WIP: reject non-TN TDM kernels * Fix large LDS logic (#189) * large LDS follow gfx950 logic * fix calculation of # local read/write calculation, lessen StoreSwapAddr enable constraits --------- Co-authored-by: boringmorning <huangchen1999@gmail.com> * [hipBLASLt] cherry-pick TDM MX changes (#411) * WIP: TDM for MX buffers * WIP: fixed TDM lds swap for MXSA and MXSB * WIP: reduce VGPR usage when TDM enabled * WIP: fixed TDM MT selection for MX buffer * WIP: adjusted TDM LDS swap implementation for StoreSwapAddr * WIP: several minor fixes for TDM codegen * Fixed incompatible s_waitcnt parameter * TDM: fix MXS staggerU * TDM padding (#423) * Support TDM padding * Add padding checks for TDM * Add mxf8ss_tdm test yaml * fix LdsBlockSizePerPadB reject condition * [hipBlasLt] remove mix mode yamls and tests * fix comment in hipblaslt.h Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * remove KWS keyword * fix helper.h typo * re-order header list in include CMakeLists.txt * remove some comment * fix build break * removd DepthU 128 form mxf8ss_tdm.yaml * [hipSPARSELt] Enable Depth-U=256 in 8-bit SPMM kernels (#166) * Fix SPMM DU256 issue. Add DepthU=256 test param in yamls. * Update spmm_i8 yaml. * Refine metadata iteration variables based on commit 45ec843. * Fix vgprPerInputM calculation. * Update spmm 8-bit yamls from clr0 to clr1. * [hipSPARSELt] Fix Solution.py on gfx1250. (#231) * Fix Solution.py on gfx1250. * Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py * Upload gfx1250 logic yamls to hipSPARSELt. * Update gfx1250 logic yaml. Fix inconsistent yamls, and disable UserArgs. * [Workaround] Bypass metadata layout when build hipSPARSELt. * [hipSPARSELt] Update setComputeInputType to setComputeInputTypeA/B in tensile_host.cpp * [hipSPARSELt] Fix sparse b tail loop mask issue. * [hipSPARSELt] Patch reduced-size smoke tests * [hipSPARSELt] Fix 8-bit metadata vGPRS number and idx. Add i8is support when using bias. * [hipSPARSELt][Patch] Fix bugs found on PR#633 based on PR#682 * Revert "[Temp] disable spmm test, since spmm is not ready." This reverts commit 8b1ab52a67bda3684ce5ae0b76ba965569eaf061. * [hipSPARSELt] Add gfx1250 I8IS (Int8->Int32) Tensile library logic for hipsparselt Add 8 GridBased library logic YAML files for I8 input / I32 output sparse matrix multiplication on gfx1250, covering all 4 transpose combinations (NN, NT, TN, TT) for both Sparse-A and Sparse-B. * [hipSPARSELt] Fix KeyError 'UnrollMajorLDSMetadata' for non-sparse kernels, and test arch typo in spmm_fp16_ml1.yaml Add sparse guard before accessing UnrollMajorLDSMetadata in KernelWriter.py and KernelWriterAssembly.py. This key is only set for sparse kernels in Solution.py, but was accessed unconditionally, causing KeyError for non-sparse configs (e.g. sgemm_xf32_asm). Also fix typo in spmm_fp16_ml1.yaml: skip-gfx942m -> skip-gfx942 so the gfx1250-only test is correctly skipped on gfx942 machines. * fix: use _DepthUTc to replace DepthU * [hipSPARSELt][Tmp] Disable autoVectorWidthA/B in gfx1250 * handle float bpe * check mxBlock in prombleType before use it * typo in DSStoreB256 * modify auto assign LocalReadVectorWidthA/B * reject condition use LocalReadVectorWidthA/B * Fix NameError SMovBX is not defined * Fix invalid assembly token '**' in GSU duBpe calculation mult_MI_Dim already contains a leading '*' (e.g. "*MI_M"), so the f-string should not add another '*', which produced invalid assembly like "64**MI_M" instead of "64*MI_M". * Fix undefined MI_M/MI_N assembly symbols in GSU swizzle code The GSU code in computeLoadSrd and graIncrements emitted string literals "*MI_M" / "*MI_N" as immediate operands in assembly instructions, but these are not valid assembler symbols and caused "expected relocatable expression" errors during .s -> .o assembly. Resolve MI_M/MI_N to their actual numeric values at code-generation time by multiplying with kernel["MatrixInstM"] / kernel["MatrixInstN"] in Python, so the assembler only sees plain integer immediates. * Reject SwizzleTensor with NumLoadsCoalesced > 1 (swizzleA.yaml) When SwizzleTensorA/B is enabled with NumLoadsCoalesced > 1, the tile offset stride calculation in graTileOffsets uses an inter-wave stride (numKr * WvG * swzBlockSize) that does not correctly map the second coalesced read for edge cases where MacroTile > actual matrix dimension (e.g., MT_M=256 with M=128). This causes the second tile read to access incorrect memory locations, resulting in numerical accuracy failures. All 58 FAILED tests in swizzleA.yaml had NLCA=2 + SwizzleTensorA + MT_M > M, while identical kernels with NLCA=1 passed. This bug was introduced during gfx1250 SwizzleTensor development where the NLCA>1 edge case in the swizzle global read path was not considered. Reject this configuration until the swizzle tile offset stride is properly fixed for multi-coalesced-load edge handling. * Fix epilogue ScaleAB load condition in fp8nfp16mix_fp8nss.yaml The epilogue ScaleAB loading code compared DataType with itself (always true) instead of with MacDataType, causing it to unconditionally load ScaleA from AddressScaleA even when preloadScaleA=True (DataTypeA > MacDataTypeA). Fix: Change the inner loop condition from DataType<=DataType (nop) to DataType<=MacDataType. * Fix DTL correctness for FP16/FP8: cast float bpe values to int for assembly operands * Fixed lds padding sanity check * Fix the logic path for VCvtF32toF16 * Fix incorrect SGPR global read offset for DirectToVgpr with UseSgprForGRO In computeScalarGroImpl(), the DirectToVgpr path used the unsuffixed kernel["LocalReadVectorWidth"] (defaults to -1) instead of the tensor- specific kernel["LocalReadVectorWidth{A,B}"]. This produced negative unrollStride values, corrupting ScalarGlobalReadOffset in the generated assembly and causing incorrect GEMM results. Test: dtv_gfx90a.yaml (DirectToVgprA=1, UseSgprForGRO=1, GRVWA=1, GRVWB=1) * convert offset from float to int before passing into DSModifiers * Fixed incorrect enum values * Removed deprecated rejection condition for swizzle kernel * [hipSPARSELt] Seperate SPMM wider localRead conditions from dense. Disable reject Sparse A kernel only support PGR with EPS=1 * Fix: skip Formocast prediction when analyticalHardware is unavailable The tensilelite-client crashed with a segfault in AllSolutionsIterator::preProblem when running bf16_tn_gfx12_predict.yaml on gfx1200. The root cause is that origami does not support gfx1200, so analyticalHardware (shared_ptr<origami::hardware_t>) is never initialized in HipAMDGPU. The getHardware() helper then dereferences a null pointer. * Fix NameError in SIA2: replace undefined instPerPack with instPerPackA/instPerPackB The variable instPerPack was renamed to instPerPackA and instPerPackB during a prior refactor, but 9 references in the scheduleIterAlg == 2 code path of _makeSubIterSchedule were not updated. This caused a NameError during assembly kernel generation for gfx1200 (RDNA4) targets, which are the only architectures with library configs using SIA2. * convert tailloopInNllmaxUnit from float to int * Fix DVT tail loop validation failure on gfx12 by adding missing bpe multiply The GLOBAL_OFFSET macro was refactored to move the BPE multiplication out of the macro and into each call site. However, the globalReadGuardK function's call to GLOBAL_OFFSET for computing the max valid address offset in DVT tail loops was not updated to include the external bpe multiply. * Skip F4 related test YAML for gfx1200 * Fix swizzle tensor SRD limit alignment in computeLoadSrd MX integration refactored the IndicesSummation check into a nested if/elif chain, making the swizzle alignment path unreachable. Swizzled tensor A used raw SizeL-1 instead of alignTo(SizeL, swzStride)-1, causing SRD limit to be too small and buffer_load to return zeros at K boundaries. Fixes 58 NLCA=2 test failures. * Fix missing arguments and conditions * remove added testcases in matmul_bias_vector_dst_fp16_32 * Update test to accept VEC16_UE4M3 scale mode in SetAttribute The MX F4 support commits moved VEC16_UE4M3 from the rejection list to an accepted case (mapped to ScalingFormat::Block_16_UE4M3) for MXFP4 on gfx1250, but the test was not updated accordingly. Fix the test to expect SUCCESS and verify the set/get round-trip. Test: hipblaslt-test --gtest_filter=*pre_checkin_aux_matmul_set_get_attr_f16_r* * Revert "Patch hipsparselt build system to use local tensilelite" This reverts commit dedf0a070911e739d3d049c059ba1e6903f4cea9. * [hipSPARSELt] Enable gfx1250 * [hipSPARSELt][Tmp] Skip use_sgpr_for_gro test cases on gfx950 * remove deprecated variable: LdsAlignPow2 * TF32: remove deprecated code * Workaround for CMS to support separate LRVWA/B * [hipSPARSELt][Workaround] Correct wlrMultiple calculation for sparse WLR rejection For sparse A (Sparse==1), MIInputPerThreadA already accounts for the compressed data, but LocalReadVectorWidthA is based on the original (uncompressed) data size. Divide by an extra factor of 2 to correctly compute wlrMultiple for sparse A. Similarly for sparse B (Sparse==2), apply the same correction to wlrMultiple based on LocalReadVectorWidthB. Also clarify rejection messages to distinguish A/B tensor. * Fix: stuck in _makeSubIterSchedule() when build gfx1200 SIA2 kernel root cause: packItems is non-empty and it never check instPerPackB to pop items * Fix local read waitcnt * Revert "[Tensilelite][Sparse] Enable plr min for spmm (#4364)" This reverts commit cb81c2e3432e933ea9674c45500dcfa788ba6051. * Revert "[hipSPARSELt][Tmp] Skip use_sgpr_for_gro test cases on gfx950" This reverts commit 8f32694cd27fc4554c82ac0fe8f4e4ca553e2510. * Fix lrvw undefined caused by conflicts resolving * Fix activation args type mismatch causing segfault in debug build prob.act0/act1 (float) were pushed into ConstantVariant without casting to the compute type. When compute type is Int32 (i8->i32 GEMM), subsequent std::get_if<int32_t> on a float-holding variant returned nullptr, causing SIGSEGV on dereference. Release builds masked this UB via -O3 optimizations. Affects: ./build/release/clients/hipblaslt-test --gtest_filter=*gemm_i8_dst_i32_94x* * Enable RocRoller for hipblaslt and fix host build issue * remove redundant pack code in xfp32 kernels * pick all fix from gfx1250 to develop-gfx1250-open-source (#969) * [hipblaslt] fix ds_bpermute_b32 msb computation and s_set_vgpr_msb (#910) * Add gfx1250 HHS AuxH yamls for gtest (#911) Co-authored-by: Andy Su <andysu12@amd.com> * fix ExtOpLayerNorm test (#935) enable extop for gfx11/12 (#129) Co-authored-by: Huang, Mark <Mark.Huang@amd.com> Co-authored-by: boringmorning <huangchen1999@gmail.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com> * [hipblaslt] fix index mode discarding valid algos when pool size < batch size (#960) getAlgosFromIndex returns INVALID_VALUE when any requested index exceeds the pool size, even though valid algos are still populated in the output. Previously the caller broke immediately on INVALID_VALUE, discarding those valid results. Now process the returned algos before exiting the loop. * [Tensilelite] modify NumRecords of E and BiasSrd for gfx1250 (#957) modify NumRecords of E and BiasSrd for gfx1250 * [hipblaslt] add correct num_records to BRD of gfx1250 (#958) --------- Co-authored-by: Chang, Josh <Josh.Chang@amd.com> Co-authored-by: Su, Andy <Andy.Su@amd.com> Co-authored-by: Andy Su <andysu12@amd.com> Co-authored-by: Huang, Mark <Mark.Huang@amd.com> Co-authored-by: boringmorning <huangchen1999@gmail.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com> * Init srdB when calclating KRingShift Fix hipblaslt-test cases: *bf16_rbf16_rbf16_rbf16_rf32_r_TN_128_128_2048* * skip gfx1250 for kringshift test * Fixed local write waitcnt calculation for f6 datatype * Fallback to narrowing buffer load (GRVW=1) if no partial OOB. (#991) Fix buffer load failure in tail loop. Fallback to narrowing buffer load (GRVW=1) if partial OOB is unsupported by hardware. * Add FP8/BF8 logic YAML & test coverage (#998) * Fix v_pk_mul_f32 SGPR operand error on gfx1250 Add op_sel_hi=[1,0,1] modifier to VMulPKF32 for ScaleD to properly broadcast scalar SGPR value on gfx1250 packed math instructions. * Add F8/B8 related logic YAML Add 80 GridBased logic YAML files to support FP8 and BF8 data types on gfx1250. This includes: - Multiple precision combinations: F8, B8, F8B8, B8F8 - Various output types: FP16, FP32, BF16, F8, B8 - All matrix layouts: NN, NT, TN, TT - Epilogue support: Bias, ScaleA/B, ScaleC/D * Enable F8/B8 test cases for gfx1250 in hipblaslt-test Update gpu_arch filters in matmul_gtest.yaml and smoke_gtest.yaml to include gfx1250, enabling the following test categories: "matmul_f8_bf8_dst_fp32" "matmul_f8_bf8_dst_bf16" "matmul_f8_dst_bf16" "matmul_f8_bf8_dst_f16" "matmul_f8_bf8_dst_fp32_gfx12" "matmul_f8_bf8_dst_fp16_gfx12" "matmul_f8_bf8_dst_bf16_gfx12" "matmul_real_1b_dst_f8_SCDInt1_gfx12" "matmul_real_1b_dst_f8_SCDNotInt_gfx12" "matmul_one_real_precisions_1b_gfx12" "matmul_f8_bf8_dst_fp16_gfx12_smoke" "matmul_f8_bf8_dst_bf16_gfx12_smoke" "matmul_real_1b_dst_f8_SCDInt1_gfx12_smoke" "matmul_real_1b_dst_f8_SCDNotInt_gfx12_smoke" * Update logic YAML for F8/B8 related 1x1x1 solutions Update Tensile logic YAML files across Ailk/Alik Bjlk/Bljk matrix layout combinations for FP8/BF8 data types (B8F8, F8B8, B8, F8 variants including HS, BS, SS subtypes). * Fix UseCustomMainLoopSchedule value type in F8/B8 logic YAMLs Change UseCustomMainLoopSchedule from boolean `false` to integer `0` across 80 gfx1250 GridBased logic YAML files for consistency with the expected integer type. * Changed `_UseSgprForGRO: false` to `_UseSgprForGRO: 0` * Fix uninitialized union members in f6/bf6x16 conversion helpers * Fix incorrect arg type of copy constructor for some ds_load inst * Move F4/F6 init out of Runner into sample-specific files * Add = default to Float6x16 default constructor * Remove unused iType param from MXMFMAInstruction::typeConvert * Refactor scaleA/B type dispatch to switch for -Wswitch coverage * Remove duplicated HasWMMA_f8f6f4 capability * Fix default cnt for SWaitTensorCnt to 0. * Remove datatype examples that are not currently supported. (#1017) Change the order of enum HIPBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE8M0_EXT. * Reapply "[Tensilelite][Sparse] Enable plr min for spmm (#4364)" This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda. * [hipSPARSELt] Fix packItemsM logic for gfx1250 to be aligned with other architectures * [hipSPARSELt] Reuse single PackTemp VGPR for MIInputPerThUnroll==8 metadata packing In the MIInputPerThUnroll==8 packing path, PackTemp's lifetime ends before the second group of packing operations begins, so PackTemp can be safely reused instead of requiring PackTemp+1. This aligns the code generation with the VGPR allocation logic which only reserves 1 VGPR for PackTemp. Also simplify the gfx1250 sparse PackTemp allocation condition by removing the redundant MIInputPerThreadMetadata>1 check, since gfx1250 only has MIInputPerThreadMetadata = 4 or 8. * Fix incorrect function name of computeInputType{A, B} calls * Fix missing mxsb when rebasing * Revert modifications for emulator (#1001) Remove emulator parameter: ROCmAgentEnumeratorPath. Remove env parameters in tox.ini. Remove compile and emulator workarounds. Set default CpuThreads back to -1. * Enable HasF32XEmulation in gfx1250 * Fix XF32 LocalRead VGPR packing for gfx1250 WMMA V3 Fix incorrect results produced by XF32 emulation kernels on gfx1250. The original XF32 codegen was written against gfx950 MFMA assumptions that do not hold for gfx1250 WMMA V3: 1. LDS offsets used gfx950-specific hardcoded constants (4, 12). gfx1250 WMMA V3 needs a *2 unroll-stride formula shared with BF16/Half. Branch by ISA via calcGfx1250LdsOffset(). 2. Pack logic assumed vgprPerInput ≤ 8 (single 8-VGPR group). gfx1250 has vgprPerInput=16 (two groups), producing interleaved [HI_g0, LO_g0, HI_g1, LO_g1]. Add v_swap_b32 to rearrange into contiguous [HI_all, LO_all] expected by 3-pass WMMA. 3. WMMA src offset hardcoded "+2"/"+4" for vgprPerInput 4/8. Replace with dynamic vgprPerInputA // 2 to yield "+8" on gfx1250. * Fix XF32 Direct32XEmulation pack/WMMA scheduling data hazard in SIA3 on gfx1250 SIA3 scheduler interleaved pack and MAC instructions without respecting data dependencies in the XF32 multigroup path, causing v_swap_b32 to corrupt F32 values mid-packing, and WMMAs to consume partially-packed VGPRs. - Move v_swap_b32 rearrangement from MAC code into pack code so it stays ordered after all TF32_1/TF32_2 packing - Fix destVgpr aliasing for UseDirect32XEmulation local reads - Place all XF32 pack items before the first WMMA instead of distributing one chunk per MFMA slot * Fix XF32 tail loop K-masking on gfx1250 WMMA V3 The tail loop K-masking logic was written against gfx950 MFMA geometry where vgprPerInput ≤ 8 and BF16 inputs are packed (2 elements/VGPR). gfx1250 WMMA V3 has vgprPerInput=16 and XF32 reads unpacked FP32 (1 element/VGPR), breaking two assumptions: 1. T0 VGPR addressing: gfx950 bk maps 1:1 to T0 slots. gfx1250 Direct32X allocates T0 at half capacity (8 slots for 16 elements), so raw bk overflows into wrong tensor's registers. Fix: adjustedBk = (bk // 8) * 4 + (bk % 4). 2. K-to-VGPR mapping: gfx950 packed BF16 gives contiguous {0-7, 16-23}. gfx1250 unpacked FP32 + numVecUnroll=2 interleaving gives {0-3, 8-11, 16-19, 24-27}, zeroing wrong VGPRs for K=5-11, 21-27. Fix: vgprPerSet0Group=1, multiplyBy /= numVecUnroll, absolute K offsets per group. * Disable ForceUnrollSubIter for F32X emulation F32X emulation pack code performs destructive in-place VGPR conversion (FP32 → BF16 high/low), which is incompatible with ForceUnrollSubIter's sub-tiling that splits local reads and pack code across sub-iterations. This caused validation failures with ScheduleIterAlg=1, MIWaveTile=[4,4], and DepthU==MatrixInstK. * Generalize MIInputPerThread for gfx1250 WMMA XF32 Hardcoded MIInputPerThread==8 assertion in LocalRead.py caused AssertionError on gfx1250 (MIInputPerThread=16). Parameterize TXInterleaveLayoutIdx, dynamically generate dsReadConvTable and convArray to support any MIInputPerThread value. * Disable UseMFMAF32XEmulation on WMMA-only ISAs (gfx1250) UseMFMAF32XEmulation was unconditionally enabled for all F32X kernels, causing gfx1250 (WMMA, no MFMA) to emit invalid v_wmma_f32_4x4x4_bf16 instructions. Gate the flag behind HasMFMA so WMMA architectures fall through to the cvt+sub path instead. * Fix lrvwTile not forced to 1 for non-MFMA XF32 (gfx1250 WMMA) The blanket "(not UseF32XEmulation)" exemption skipped lrvwTile=1 forcing for all XF32 paths, but only MFMA-based XF32 (gfx950) handles lrvwTile > 1 correctly. On gfx1250 WMMA, lrvwTile=2 produced incorrect local reads. Refine the exemption: only UseMFMAF32XEmulation and CMS kernels may keep lrvwTile > 1; non-MFMA XF32 paths are now forced to lrvwTile=1. * Fix TF32EmuInterleaveTreg local read index for non-prefetch path Problem: NT/TN format XF32 kernels produce inf/nan errors when TF32EmuInterleaveTreg is enabled but doFullPackCodePrefetch is False (PLR=0). The TXInterleaveLayoutIdx() function assumes the full prefetch pack code layout, which is incompatible with the non-prefetch register layout. Fix: Add conditional branching based on doFullPackCodePrefetch in the TF32EmuInterleaveTreg handling. For the non-prefetch path (PLR=0), use a simpler index calculation that maps the first half of each group (withinGroup < 4) to T registers with a straightforward index formula: idx = (idx // 8) * 4 + withinGroup. This matches the register layout expected by the pack code when doFullPackCodePrefetch is False. * Fix TF32 emulation T-register overlap in tail loop Problem: TT and NN format kernels with DepthU=32 failed validation for tail loop. Root cause: In macroAndSetF32XEmuTregSingle(), the T registers (vgprValuA_T0_I0, vgprValuB_T0_I0) were defined using symbolic references relative to vgprValu{A/B}_X0_I0_BASE: .set vgprValuB_T0_I0, vgprValuB_X0_I0_BASE + 56 In the main loop, vgprValuB_X0_I0_BASE=34 gives T0=90 (correct). In the tail loop, vgprValuB_X0_I0_BASE is redefined to 32, giving T0=88 which overlaps with vgprValuA_T0_I0+6 (82+6=88). This causes A's TF32 processing to corrupt B's T registers (v88-v89), leading to incorrect WMMA results. Fix: Use absolute startVgprCvt values instead of symbolic BASE-relative offsets in RegSet. This ensures T register addresses remain correct regardless of BASE redefinition in the tail loop. * Enable XFP32 test coverage for gfx1250 in hipblaslt-test and tox * Enable gradient, postprocessing, and fix CVT instructions for gfx1250. (#1042) * Fix issues in cvt, enable gradient support for gfx1250. Cherry-picked from PR #160: - 35e2ef29dd (Fix issues in cvt and initial support on hhs gradient) - 80cce534da (Enable bbs gradient and postprocessing) * Fix wave32 and FP16 gradient issues on gfx1250 - Fix BF16 NaN check in writeBiasToGlobal for wave32 - Add fallback for FP16 sum unroll when dot2 is unavailable - Enable gfx1250 gtests for dgelu, bgrada, bgradb --------- Co-authored-by: George Tseng <george.tseng@amd.com> Co-authored-by: Andy Su <andysu12@amd.com> * Remove unused segmentsize (#1049) * Fix conflicts in mem.hpp and add tests for DSStoreB192 and DSLoadB192. Also fix a bug in countX. (cherry picked from commit c2840f744630c301d70435984117bcd23fde17e5) * Fix f_math.hpp/cpp and add a new test: test_functions.py (cherry picked from commit 8968a906838dbfac7b0baffa96cd5a16f4dd405a) * include fp6 bf6 header * Resolve rocisa conflicts and add test_mfma.py * [DO NOT MERGE] Revert commit 6d069916: restore emulator parameters and workarounds Re-add ROCmAgentEnumeratorPath, tox.ini env parameters, compile/emulator workarounds, and CpuThreads=0 that were removed in 6d06991. This is a temporary revert for local emulator-based testing and should not be merged upstream. Reverted commit: 6d069916703b68e27bf67b10d6c4db2a0c69b8d2 Original message: "Remove emulator parameter: ROCmAgentEnumeratorPath. Remove env parameters in tox.ini. Remove compile and emulator workarounds. Set default CpuThreads back to -1." Affected areas: - clients/CMakeLists.txt, device-library/CMakeLists.txt - efficiency_monitor.cpp - Tensile Python modules (ClientWriter, GlobalParameters, LibraryIO, Tensile) - tox.ini - client/main.cpp - 60+ YAML test config files (gemm/gfx12, sparse/gfx1250) * [DO NOT MERGE] Enable FFM config on tox.ini * [DO NOT MERGE] Add emulator parameter: ROCmAgentEnumeratorPath in YAML * [DO NOT MERGE] Allow multiple workers of tox on FFM * [DO NOT MERGE] Fixed gradient related tests * Fix unresolved merge conflict markers in AsmAddressCalculation.py The file had leftover <<<<<<< ours / >>>>>>> theirs markers from a prior merge that were accidentally committed. Keep the HasVgprMSB conditional path which is the correct resolution. Made-with: Cursor * Fix unresolved merge conflict markers in StreamK.py, ClientWriter.py, Tensile.py, MatrixInstruction.py, and mfma.hpp - StreamK.py: use _DepthU (adjusted for XF32) instead of raw kernel["DepthU"] - ClientWriter.py: keep MX scale parameter names and type fields from ours - Tensile.py: keep both --rocm-agent-enumerator and --mx-scale-format args - MatrixInstruction.py: keep hasSMFMA guard with single-char key lookup - mfma.hpp: remove commented-out conflict markers Made-with: Cursor * Fix MFMA validation key mismatch in MatrixInstruction.py Use miDataType.toChar() (single-char key) for validMFMA lookup instead of miDataTypeKey (two-char key), matching the dictionary's actual key format. This restores the original working behavior that was broken during the merge. Made-with: Cursor * Revert "[DO NOT MERGE] Fixed gradient related tests" This reverts commit 5a07e1ab9d9d211624c2f4734afe2c67d19c6756. * Revert "[DO NOT MERGE] Allow multiple workers of tox on FFM" This reverts commit 220645c56e1a7a64663d675bda349779e23353bc. * Revert "[DO NOT MERGE] Add emulator parameter: ROCmAgentEnumeratorPath in YAML" This reverts commit b3ee5ed532824f8e4244a5e15358cc3b47cf446d. * Revert "[DO NOT MERGE] Enable FFM config on tox.ini" This reverts commit de9d0c28578ecb3946f419076f564061b02756dc. * Revert "[DO NOT MERGE] Revert commit 6d069916: restore emulator" This reverts commit ab5741985cbcc20a3080f4ded1be4c3e0743f0b0. --------- Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com> Co-authored-by: Koji Nakajima <knakajim@amd.com> Co-authored-by: Hongji Chen <hongjche@amd.com> Co-authored-by: Alex Brown <alex.brown@amd.com> Co-authored-by: Chun-Xun Lin <Chun-Xun.Lin@amd.com> Co-authored-by: Koji Nakajima <75698246+nakajee@users.noreply.github.com> Co-authored-by: pdhirajkumarprasad <160474250+pdhirajkumarprasad@users.noreply.github.com> Co-authored-by: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> Co-authored-by: T.J. Alumbaugh <talumbau@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: amd-chunxlin <chunxlin@amd.com> Co-authored-by: tomchengchitang <tomtang2@amd.com> Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com> Co-authored-by: Curtis Fu <yu.fu@amd.com> Co-authored-by: yu-hsieh <Yu-cheng.Hsieh@amd.com> Co-authored-by: marhuang_amdeng <marhuang@amd.com> Co-authored-by: Stacey Lai <stacey.lai@amd.com> Co-authored-by: George Tseng <george.tseng@amd.com> Co-authored-by: Serge Lu <Serge.Lu@amd.com> Co-authored-by: Huang, Mark <Mark.Huang@amd.com> Co-authored-by: boringmorning <huangchen1999@gmail.com> Co-authored-by: Yang, Anne <Anne.Yang@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Lin, Ed <Ed.Lin@amd.com> Co-authored-by: Stacey Lai <stacelai@amd.com> Co-authored-by: mengzcai_amdeng <Meng-Zhe.Cai@amd.com> Co-authored-by: Vin Huang <vin.huang@amd.com> Co-authored-by: jichang <jimmy.chang@amd.com> Co-authored-by: Ho, Henry <Henry.Ho@amd.com> Co-authored-by: Chang, Josh <Josh.Chang@amd.com> Co-authored-by: Su, Andy <Andy.Su@amd.com> Co-authored-by: Andy Su <andysu12@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
tomchengchitang
pushed a commit
that referenced
this pull request
Apr 16, 2026
* Enable v_cvt_f16_fp8 and add v_cvt_pk_f16_fp8 This patch also rename original option: Hascvtf16_fp8 as Hascvtf16_fp8_sf32. Because the old Hascvtf16_fp8 is actually for v_cvt_scalef32_pk_f16_fp8. * true16 modifier implementation This patch includes true16 modifier implementation and related compiler option. * Optimize FP32 to FP16 with pack data by v_cvt_pk_f16_f32 * Make v_cvt_f32_f16 and v_cvt_f16_f32 support true16 syntax This change modifies the relevant code, except in mixed mode scenarios. * Support VCvtPkFP8toF32 and VCvtPkBF8toF32 with VOP3 in loopSum * Use S_MOV_B32 to reset EXEC for 32-lane wavefronts * Update test YAML for data conversion This patch update test YAML for the following instructions: v_cvt_f32_f16, v_cvt_f16_f32, v_cvt_pk_f16_f32, v_cvt_f32_bf16, v_cvt_f32_fp8, v_cvt_f32_bf8, v_cvt_pk_f32_fp8, v_cvt_pk_f32_bf8, v_cvt_pk_fp8_f32, v_cvt_pk_bf8_f32 * Minor modifications based on Copilot review recommendations * Fix v_cvt_f32_bf16 build error and reviewer's suggesstions * [hipSPARSELt] Enable 8-bit SPMM kernels (#123) * Fix 8-bit sparse kernels by: 1. Adding i32-fp32 conversion and 2. fixing tail-loop dense b cndmask offset. Additionally update 16-bit yamls * Fix 8-bit datatype packing issue. * 1. Update sparse gfx1250 testing yamls. These are simplified to reduce tox testing time, but tested on local end. 2. More yamls will be added. * Code refine, fix yamls and add metadatalayout test cases * Add and revise test yamls. * Fix i8 conversion issue when using GSU > 1. Add a missing yaml. * Follow up - Refine i8 conversion issue when using GSU > 1. Add GSU2 parameter in yamls. * remove whitespace in line 370 in GlobalWriteBatch.py * Enable edge and tail loop. (#143) Co-authored-by: George Tseng <george.tseng@amd.com> * use ShortBranchMaxLength for SCLongBranchVccnz * [hipBLASLt] extend the vgpr pool for the metatadata at OptNLL * [hipBLASLt] refactor the vgpr allocation of sparse metadata. * Don't use pack for metadata when LDSTr Metadata is enabled. * Refine sparse test for i8, f8 and b8. (#156) * fix buffer insts record for gfx1250 (#109) * fix buffer insts record for gfx1250 --------- Co-authored-by: boringmorning <huangchen1999@gmail.com> Co-authored-by: marhuang_amdeng <marhuang@amd.com> * 6-bit transpose load revisited (#157) * wip: basic tr6 ds load support * Support glvw == 16 for FP6 * Fixed wrong VGPR padding for 6-bit gl and lr * Implemented new thread mapping for ds tr6 load * wip: add more test cases for BF6 * Reject solution if GRVW exceeds corrresponding side of MT * Added F6 test cases for all transposes * Fixed wrong tranpose settings for f6/b6 tests * Fixed VGPR allocation for ds tr6 load * Fixed conflicts in auto-merging * enable extop for gfx11/12 (#129) Co-authored-by: boringmorning <huangchen1999@gmail.com> * Support edge/tail for FP6/BF6 (#161) * BF6/FP6 TN Tail when glvw=16/32 * Fix shiftLrElements in TailLoop * Add FP6/BF6 tail and edge testcases * Simplify the tail/edge checks for FP6/BF6 * Refactor BPE Utility Functions and Fix Num Record Bitfield (#159) * Refactor bpe utility functions from Python to C++ * Fix Num Record Bitfield for gfx125x * MX : init and host validation: fix * MX : codegen setting 1 * refactor calcLdsNumBytes * MX: solution 2 * reorder setGlobalReadVectorWidth * MX: solution 3 * MX: add UseGeneralizedNLCOne MXSA/B * MX: LDS size calculation * MX: codegen setting * move ValuMXSAB vgpr to beginning of VGPR pool p1 * MX: add kernel argument * MX : initKernel part * MX : vgpr set * move ValuMXSAB vgpr to beginning of VGPR pool p1 * MX: global read macro * MX : local read offset * MX: local write and global read offset * MX: global read increaments * MX: staggerU * MX: global load * MX: global read increament * MX: wait * MX: ds store * MX: localWriteSwapOffsets * MX: local read part * MX: MFMA * MX: lraDeclareAddresses * MX: SIA3 pre processing * MX: SIA3 * MX: SIA3 pack for MXSA/B * MX: SIA3 pack * fix: use _DepthUTc to replace DepthU * MX: Tail: localWriteResetOffsets * MX: Tail: vgpr alloc, remove staggerU, global read * MX: tailLoopAllocValuVgpr * MX: Tail: localRead init * MX: Tail: local read * MX: unsupported datatype exception * MX: add test yaml * F6 adjust LDS alignment * MX: F4 use F8 scale type * MX: F4: scale type E5M3 * MX: support MX with non-MX combination * MX : initKernel part * MX: Tail: remove dirty MX data * MX: Tail: release MX valu vgpr after tail * MX: Edge: MX Scale shiftptr align with noraml Buffer * Shiftptr: fix Shiftptr limitation * MX: update AssertSummationElementMultiple for MX * F4: use 32x16 step 1 * F4: use 32x16: local read: add numTilePerInst loop * F4: use 32x16: local read: fix offset * F4: use 32x16: wmma tail: add numTilePerInst loop * F4: use 32x16: wmma tail * F4: use 32x16: store D * F4: use 32x16: vectorwidth * F4: use 32x16: multi wave * F4: use 32x16: local read: use MIInputPerThUnroll * F4: use 32x16: local read: add numTilePerInst loop * F4: use 32x16: local read: dstr offset * F4: use 32x16: local read: fix wave offset * MX: F4: fix tmpVGPR allocation * MXF4: support 32x16 instruction * F4: use 32x16: multi wave: NT, TT * F4: use 32x16: wmma tail * MX: fix MX buffer load length * MX: HipBlasLt: add MXE8B32 F8 yamls * MX: HipBlasLt: add MXE8B32B B8 yamls * MX: HipBlasLt: add MXE8B16 F8 yamls * MX: HipBlasLt: add MXE8B16 B8 yamls * MX: HipBlasLt: add MXE8B16 F6 yamls * MX: HipBlasLt: add MXE8B16 B6 yamls * MX: HipBlasLt: add MXE8B32 B6 yamls * MX: HipBlasLt: add MXE8B32 F6 yamls * MX: HipBlasLt: add MXE8B32 F4 yamls * MX: HipBlasLt: add MXF8B32 F4 yamls * MX: HipBlasLt: add MXE5M3B32 F4 yamls * MX: HipBlasLt: add MXE5M3B16 F4 yamls * MX: HipBlasLt: add MXF8B16 F4 yamls * MX: HipBlasLt: add MXE8B16 F4 yamls * MX: HipBlasLt: add MXE8B32/MXF8B16 sample * MX: HipBlasLt: support MXE8B32/MXF8B16 F4 * MX: HipBlasLt: support MXF8B32/MXE8B16 F4 * MX: HipBlasLt: add MXF8B32/MXE8B16 sample * MX: HipBlasLt: support MXF8B32/MXE8B16 F4 * MX: HipBlasLt: add MXE5M3B32/MXE5M3B16 sample * EfficiencyMonitor: do nothing in set_device_id if not enabled * benchmark: enable F4/F6 type without block scale * MX: hipblaslt-bench: support MXE8B32/MXE8B16 F4 * MX: hipblaslt-bench: support all F4 mx scale type * MX: hipblaslt-bench: F8/F6/F4 smoke gtest * MX: hipblaslt-bench: limit hipblaslt_e8 rand_int/hpl init value * Fix hipblaslt-test failed with HHS This patch fix the following hipblaslt-test test case (HHS): smoke_matmul_bias_relu_SAV_smoke_f16_rf16_rf16_rf16_rf32_r_relu* * f8f6f6 mix mode on hipblaslt * Added TensorLoadToLds for rocisa * WIP: F8 codegen for TDM * WIP: several bug fixes for TDM * WIP: TDM wave separated initial support * WIP: TDM uses up to 12 SGPRs now * WIP: TDM optimize lds address swap * WIP: TDM supports FP4 now * WIP: temporarily integrate regular MX and TDM A&B * Fix bugs in UseSgprForGRO. * WIP: remove g2l vgpr allocations for A & B when TDM enabled * WIP: fixed waitcnt calculation when TDM enabled * WIP: TDM supports LdsAlignPow2 is False * WIP: reject non-TN TDM kernels * Fix large LDS logic (#189) * large LDS follow gfx950 logic * fix calculation of # local read/write calculation, lessen StoreSwapAddr enable constraits --------- Co-authored-by: boringmorning <huangchen1999@gmail.com> * [hipBLASLt] cherry-pick TDM MX changes (#411) * WIP: TDM for MX buffers * WIP: fixed TDM lds swap for MXSA and MXSB * WIP: reduce VGPR usage when TDM enabled * WIP: fixed TDM MT selection for MX buffer * WIP: adjusted TDM LDS swap implementation for StoreSwapAddr * WIP: several minor fixes for TDM codegen * Fixed incompatible s_waitcnt parameter * TDM: fix MXS staggerU * TDM padding (#423) * Support TDM padding * Add padding checks for TDM * Add mxf8ss_tdm test yaml * fix LdsBlockSizePerPadB reject condition * [hipBlasLt] remove mix mode yamls and tests * fix comment in hipblaslt.h Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * remove KWS keyword * fix helper.h typo * re-order header list in include CMakeLists.txt * remove some comment * fix build break * removd DepthU 128 form mxf8ss_tdm.yaml * [hipSPARSELt] Enable Depth-U=256 in 8-bit SPMM kernels (#166) * Fix SPMM DU256 issue. Add DepthU=256 test param in yamls. * Update spmm_i8 yaml. * Refine metadata iteration variables based on commit 45ec843. * Fix vgprPerInputM calculation. * Update spmm 8-bit yamls from clr0 to clr1. * [hipSPARSELt] Fix Solution.py on gfx1250. (#231) * Fix Solution.py on gfx1250. * Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py * Upload gfx1250 logic yamls to hipSPARSELt. * Update gfx1250 logic yaml. Fix inconsistent yamls, and disable UserArgs. * [Workaround] Bypass metadata layout when build hipSPARSELt. * [hipSPARSELt] Update setComputeInputType to setComputeInputTypeA/B in tensile_host.cpp * [hipSPARSELt] Fix sparse b tail loop mask issue. * [hipSPARSELt] Patch reduced-size smoke tests * [hipSPARSELt] Fix 8-bit metadata vGPRS number and idx. Add i8is support when using bias. * [hipSPARSELt][Patch] Fix bugs found on PR#633 based on PR#682 * Revert "[Temp] disable spmm test, since spmm is not ready." This reverts commit 8b1ab52. * [hipSPARSELt] Add gfx1250 I8IS (Int8->Int32) Tensile library logic for hipsparselt Add 8 GridBased library logic YAML files for I8 input / I32 output sparse matrix multiplication on gfx1250, covering all 4 transpose combinations (NN, NT, TN, TT) for both Sparse-A and Sparse-B. * [hipSPARSELt] Fix KeyError 'UnrollMajorLDSMetadata' for non-sparse kernels, and test arch typo in spmm_fp16_ml1.yaml Add sparse guard before accessing UnrollMajorLDSMetadata in KernelWriter.py and KernelWriterAssembly.py. This key is only set for sparse kernels in Solution.py, but was accessed unconditionally, causing KeyError for non-sparse configs (e.g. sgemm_xf32_asm). Also fix typo in spmm_fp16_ml1.yaml: skip-gfx942m -> skip-gfx942 so the gfx1250-only test is correctly skipped on gfx942 machines. * fix: use _DepthUTc to replace DepthU * handle float bpe * check mxBlock in prombleType before use it * typo in DSStoreB256 * modify auto assign LocalReadVectorWidthA/B * reject condition use LocalReadVectorWidthA/B * [hipSPARSELt][Tmp] Disable autoVectorWidthA/B in gfx1250 * Fix NameError SMovBX is not defined * Fix invalid assembly token '**' in GSU duBpe calculation mult_MI_Dim already contains a leading '*' (e.g. "*MI_M"), so the f-string should not add another '*', which produced invalid assembly like "64**MI_M" instead of "64*MI_M". * Fix undefined MI_M/MI_N assembly symbols in GSU swizzle code The GSU code in computeLoadSrd and graIncrements emitted string literals "*MI_M" / "*MI_N" as immediate operands in assembly instructions, but these are not valid assembler symbols and caused "expected relocatable expression" errors during .s -> .o assembly. Resolve MI_M/MI_N to their actual numeric values at code-generation time by multiplying with kernel["MatrixInstM"] / kernel["MatrixInstN"] in Python, so the assembler only sees plain integer immediates. * Reject SwizzleTensor with NumLoadsCoalesced > 1 (swizzleA.yaml) When SwizzleTensorA/B is enabled with NumLoadsCoalesced > 1, the tile offset stride calculation in graTileOffsets uses an inter-wave stride (numKr * WvG * swzBlockSize) that does not correctly map the second coalesced read for edge cases where MacroTile > actual matrix dimension (e.g., MT_M=256 with M=128). This causes the second tile read to access incorrect memory locations, resulting in numerical accuracy failures. All 58 FAILED tests in swizzleA.yaml had NLCA=2 + SwizzleTensorA + MT_M > M, while identical kernels with NLCA=1 passed. This bug was introduced during gfx1250 SwizzleTensor development where the NLCA>1 edge case in the swizzle global read path was not considered. Reject this configuration until the swizzle tile offset stride is properly fixed for multi-coalesced-load edge handling. * Fix epilogue ScaleAB load condition in fp8nfp16mix_fp8nss.yaml The epilogue ScaleAB loading code compared DataType with itself (always true) instead of with MacDataType, causing it to unconditionally load ScaleA from AddressScaleA even when preloadScaleA=True (DataTypeA > MacDataTypeA). Fix: Change the inner loop condition from DataType<=DataType (nop) to DataType<=MacDataType. * Fix DTL correctness for FP16/FP8: cast float bpe values to int for assembly operands * Fixed lds padding sanity check * Fix the logic path for VCvtF32toF16 * Fix incorrect SGPR global read offset for DirectToVgpr with UseSgprForGRO In computeScalarGroImpl(), the DirectToVgpr path used the unsuffixed kernel["LocalReadVectorWidth"] (defaults to -1) instead of the tensor- specific kernel["LocalReadVectorWidth{A,B}"]. This produced negative unrollStride values, corrupting ScalarGlobalReadOffset in the generated assembly and causing incorrect GEMM results. Test: dtv_gfx90a.yaml (DirectToVgprA=1, UseSgprForGRO=1, GRVWA=1, GRVWB=1) * convert offset from float to int before passing into DSModifiers * Fixed incorrect enum values * Removed deprecated rejection condition for swizzle kernel * [hipSPARSELt] Seperate SPMM wider localRead conditions from dense. Disable reject Sparse A kernel only support PGR with EPS=1 * Fix: skip Formocast prediction when analyticalHardware is unavailable The tensilelite-client crashed with a segfault in AllSolutionsIterator::preProblem when running bf16_tn_gfx12_predict.yaml on gfx1200. The root cause is that origami does not support gfx1200, so analyticalHardware (shared_ptr<origami::hardware_t>) is never initialized in HipAMDGPU. The getHardware() helper then dereferences a null pointer. * Fix NameError in SIA2: replace undefined instPerPack with instPerPackA/instPerPackB The variable instPerPack was renamed to instPerPackA and instPerPackB during a prior refactor, but 9 references in the scheduleIterAlg == 2 code path of _makeSubIterSchedule were not updated. This caused a NameError during assembly kernel generation for gfx1200 (RDNA4) targets, which are the only architectures with library configs using SIA2. * convert tailloopInNllmaxUnit from float to int * Fix DVT tail loop validation failure on gfx12 by adding missing bpe multiply The GLOBAL_OFFSET macro was refactored to move the BPE multiplication out of the macro and into each call site. However, the globalReadGuardK function's call to GLOBAL_OFFSET for computing the max valid address offset in DVT tail loops was not updated to include the external bpe multiply. * Skip F4 related test YAML for gfx1200 * Fix swizzle tensor SRD limit alignment in computeLoadSrd MX integration refactored the IndicesSummation check into a nested if/elif chain, making the swizzle alignment path unreachable. Swizzled tensor A used raw SizeL-1 instead of alignTo(SizeL, swzStride)-1, causing SRD limit to be too small and buffer_load to return zeros at K boundaries. Fixes 58 NLCA=2 test failures. * Fix missing arguments and conditions * remove added testcases in matmul_bias_vector_dst_fp16_32 * Update test to accept VEC16_UE4M3 scale mode in SetAttribute The MX F4 support commits moved VEC16_UE4M3 from the rejection list to an accepted case (mapped to ScalingFormat::Block_16_UE4M3) for MXFP4 on gfx1250, but the test was not updated accordingly. Fix the test to expect SUCCESS and verify the set/get round-trip. Test: hipblaslt-test --gtest_filter=*pre_checkin_aux_matmul_set_get_attr_f16_r* * Revert "Patch hipsparselt build system to use local tensilelite" This reverts commit dedf0a070911e739d3d049c059ba1e6903f4cea9. * [hipSPARSELt] Enable gfx1250 * [hipSPARSELt][Tmp] Skip use_sgpr_for_gro test cases on gfx950 * remove deprecated variable: LdsAlignPow2 * TF32: remove deprecated code * Workaround for CMS to support separate LRVWA/B * [hipSPARSELt][Workaround] Correct wlrMultiple calculation for sparse WLR rejection For sparse A (Sparse==1), MIInputPerThreadA already accounts for the compressed data, but LocalReadVectorWidthA is based on the original (uncompressed) data size. Divide by an extra factor of 2 to correctly compute wlrMultiple for sparse A. Similarly for sparse B (Sparse==2), apply the same correction to wlrMultiple based on LocalReadVectorWidthB. Also clarify rejection messages to distinguish A/B tensor. * Fix: stuck in _makeSubIterSchedule() when build gfx1200 SIA2 kernel root cause: packItems is non-empty and it never check instPerPackB to pop items * Fix local read waitcnt * Revert "[Tensilelite][Sparse] Enable plr min for spmm (#4364)" This reverts commit cb81c2e. * Revert "[hipSPARSELt][Tmp] Skip use_sgpr_for_gro test cases on gfx950" This reverts commit 8f32694cd27fc4554c82ac0fe8f4e4ca553e2510. * Fix lrvw undefined caused by conflicts resolving * Fix activation args type mismatch causing segfault in debug build prob.act0/act1 (float) were pushed into ConstantVariant without casting to the compute type. When compute type is Int32 (i8->i32 GEMM), subsequent std::get_if<int32_t> on a float-holding variant returned nullptr, causing SIGSEGV on dereference. Release builds masked this UB via -O3 optimizations. Affects: ./build/release/clients/hipblaslt-test --gtest_filter=*gemm_i8_dst_i32_94x* * Enable RocRoller for hipblaslt and fix host build issue * remove redundant pack code in xfp32 kernels * pick all fix from gfx1250 to develop-gfx1250-open-source (#969) * [hipblaslt] fix ds_bpermute_b32 msb computation and s_set_vgpr_msb (#910) * Add gfx1250 HHS AuxH yamls for gtest (#911) Co-authored-by: Andy Su <andysu12@amd.com> * fix ExtOpLayerNorm test (#935) enable extop for gfx11/12 (#129) Co-authored-by: Huang, Mark <Mark.Huang@amd.com> Co-authored-by: boringmorning <huangchen1999@gmail.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com> * [hipblaslt] fix index mode discarding valid algos when pool size < batch size (#960) getAlgosFromIndex returns INVALID_VALUE when any requested index exceeds the pool size, even though valid algos are still populated in the output. Previously the caller broke immediately on INVALID_VALUE, discarding those valid results. Now process the returned algos before exiting the loop. * [Tensilelite] modify NumRecords of E and BiasSrd for gfx1250 (#957) modify NumRecords of E and BiasSrd for gfx1250 * [hipblaslt] add correct num_records to BRD of gfx1250 (#958) --------- Co-authored-by: Chang, Josh <Josh.Chang@amd.com> Co-authored-by: Su, Andy <Andy.Su@amd.com> Co-authored-by: Andy Su <andysu12@amd.com> Co-authored-by: Huang, Mark <Mark.Huang@amd.com> Co-authored-by: boringmorning <huangchen1999@gmail.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com> * Init srdB when calclating KRingShift Fix hipblaslt-test cases: *bf16_rbf16_rbf16_rbf16_rf32_r_TN_128_128_2048* * skip gfx1250 for kringshift test * Fixed local write waitcnt calculation for f6 datatype * Fallback to narrowing buffer load (GRVW=1) if no partial OOB. (#991) Fix buffer load failure in tail loop. Fallback to narrowing buffer load (GRVW=1) if partial OOB is unsupported by hardware. * Add FP8/BF8 logic YAML & test coverage (#998) * Fix v_pk_mul_f32 SGPR operand error on gfx1250 Add op_sel_hi=[1,0,1] modifier to VMulPKF32 for ScaleD to properly broadcast scalar SGPR value on gfx1250 packed math instructions. * Add F8/B8 related logic YAML Add 80 GridBased logic YAML files to support FP8 and BF8 data types on gfx1250. This includes: - Multiple precision combinations: F8, B8, F8B8, B8F8 - Various output types: FP16, FP32, BF16, F8, B8 - All matrix layouts: NN, NT, TN, TT - Epilogue support: Bias, ScaleA/B, ScaleC/D * Enable F8/B8 test cases for gfx1250 in hipblaslt-test Update gpu_arch filters in matmul_gtest.yaml and smoke_gtest.yaml to include gfx1250, enabling the following test categories: "matmul_f8_bf8_dst_fp32" "matmul_f8_bf8_dst_bf16" "matmul_f8_dst_bf16" "matmul_f8_bf8_dst_f16" "matmul_f8_bf8_dst_fp32_gfx12" "matmul_f8_bf8_dst_fp16_gfx12" "matmul_f8_bf8_dst_bf16_gfx12" "matmul_real_1b_dst_f8_SCDInt1_gfx12" "matmul_real_1b_dst_f8_SCDNotInt_gfx12" "matmul_one_real_precisions_1b_gfx12" "matmul_f8_bf8_dst_fp16_gfx12_smoke" "matmul_f8_bf8_dst_bf16_gfx12_smoke" "matmul_real_1b_dst_f8_SCDInt1_gfx12_smoke" "matmul_real_1b_dst_f8_SCDNotInt_gfx12_smoke" * Update logic YAML for F8/B8 related 1x1x1 solutions Update Tensile logic YAML files across Ailk/Alik Bjlk/Bljk matrix layout combinations for FP8/BF8 data types (B8F8, F8B8, B8, F8 variants including HS, BS, SS subtypes). * Fix UseCustomMainLoopSchedule value type in F8/B8 logic YAMLs Change UseCustomMainLoopSchedule from boolean `false` to integer `0` across 80 gfx1250 GridBased logic YAML files for consistency with the expected integer type. * Changed `_UseSgprForGRO: false` to `_UseSgprForGRO: 0` * Fix uninitialized union members in f6/bf6x16 conversion helpers * Fix incorrect arg type of copy constructor for some ds_load inst * Move F4/F6 init out of Runner into sample-specific files * Add = default to Float6x16 default constructor * Remove unused iType param from MXMFMAInstruction::typeConvert * Refactor scaleA/B type dispatch to switch for -Wswitch coverage * Remove duplicated HasWMMA_f8f6f4 capability * Fix default cnt for SWaitTensorCnt to 0. * Remove datatype examples that are not currently supported. (#1017) Change the order of enum HIPBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE8M0_EXT. * Reapply "[Tensilelite][Sparse] Enable plr min for spmm (#4364)" This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda. * [hipSPARSELt] Fix packItemsM logic for gfx1250 to be aligned with other architectures * [hipSPARSELt] Reuse single PackTemp VGPR for MIInputPerThUnroll==8 metadata packing In the MIInputPerThUnroll==8 packing path, PackTemp's lifetime ends before the second group of packing operations begins, so PackTemp can be safely reused instead of requiring PackTemp+1. This aligns the code generation with the VGPR allocation logic which only reserves 1 VGPR for PackTemp. Also simplify the gfx1250 sparse PackTemp allocation condition by removing the redundant MIInputPerThreadMetadata>1 check, since gfx1250 only has MIInputPerThreadMetadata = 4 or 8. * Fix incorrect function name of computeInputType{A, B} calls * Fix missing mxsb when rebasing * Revert modifications for emulator (#1001) Remove emulator parameter: ROCmAgentEnumeratorPath. Remove env parameters in tox.ini. Remove compile and emulator workarounds. Set default CpuThreads back to -1. * Enable HasF32XEmulation in gfx1250 * Fix XF32 LocalRead VGPR packing for gfx1250 WMMA V3 Fix incorrect results produced by XF32 emulation kernels on gfx1250. The original XF32 codegen was written against gfx950 MFMA assumptions that do not hold for gfx1250 WMMA V3: 1. LDS offsets used gfx950-specific hardcoded constants (4, 12). gfx1250 WMMA V3 needs a *2 unroll-stride formula shared with BF16/Half. Branch by ISA via calcGfx1250LdsOffset(). 2. Pack logic assumed vgprPerInput ≤ 8 (single 8-VGPR group). gfx1250 has vgprPerInput=16 (two groups), producing interleaved [HI_g0, LO_g0, HI_g1, LO_g1]. Add v_swap_b32 to rearrange into contiguous [HI_all, LO_all] expected by 3-pass WMMA. 3. WMMA src offset hardcoded "+2"/"+4" for vgprPerInput 4/8. Replace with dynamic vgprPerInputA // 2 to yield "+8" on gfx1250. * Fix XF32 Direct32XEmulation pack/WMMA scheduling data hazard in SIA3 on gfx1250 SIA3 scheduler interleaved pack and MAC instructions without respecting data dependencies in the XF32 multigroup path, causing v_swap_b32 to corrupt F32 values mid-packing, and WMMAs to consume partially-packed VGPRs. - Move v_swap_b32 rearrangement from MAC code into pack code so it stays ordered after all TF32_1/TF32_2 packing - Fix destVgpr aliasing for UseDirect32XEmulation local reads - Place all XF32 pack items before the first WMMA instead of distributing one chunk per MFMA slot * Fix XF32 tail loop K-masking on gfx1250 WMMA V3 The tail loop K-masking logic was written against gfx950 MFMA geometry where vgprPerInput ≤ 8 and BF16 inputs are packed (2 elements/VGPR). gfx1250 WMMA V3 has vgprPerInput=16 and XF32 reads unpacked FP32 (1 element/VGPR), breaking two assumptions: 1. T0 VGPR addressing: gfx950 bk maps 1:1 to T0 slots. gfx1250 Direct32X allocates T0 at half capacity (8 slots for 16 elements), so raw bk overflows into wrong tensor's registers. Fix: adjustedBk = (bk // 8) * 4 + (bk % 4). 2. K-to-VGPR mapping: gfx950 packed BF16 gives contiguous {0-7, 16-23}. gfx1250 unpacked FP32 + numVecUnroll=2 interleaving gives {0-3, 8-11, 16-19, 24-27}, zeroing wrong VGPRs for K=5-11, 21-27. Fix: vgprPerSet0Group=1, multiplyBy /= numVecUnroll, absolute K offsets per group. * Disable ForceUnrollSubIter for F32X emulation F32X emulation pack code performs destructive in-place VGPR conversion (FP32 → BF16 high/low), which is incompatible with ForceUnrollSubIter's sub-tiling that splits local reads and pack code across sub-iterations. This caused validation failures with ScheduleIterAlg=1, MIWaveTile=[4,4], and DepthU==MatrixInstK. * Generalize MIInputPerThread for gfx1250 WMMA XF32 Hardcoded MIInputPerThread==8 assertion in LocalRead.py caused AssertionError on gfx1250 (MIInputPerThread=16). Parameterize TXInterleaveLayoutIdx, dynamically generate dsReadConvTable and convArray to support any MIInputPerThread value. * Disable UseMFMAF32XEmulation on WMMA-only ISAs (gfx1250) UseMFMAF32XEmulation was unconditionally enabled for all F32X kernels, causing gfx1250 (WMMA, no MFMA) to emit invalid v_wmma_f32_4x4x4_bf16 instructions. Gate the flag behind HasMFMA so WMMA architectures fall through to the cvt+sub path instead. * Fix lrvwTile not forced to 1 for non-MFMA XF32 (gfx1250 WMMA) The blanket "(not UseF32XEmulation)" exemption skipped lrvwTile=1 forcing for all XF32 paths, but only MFMA-based XF32 (gfx950) handles lrvwTile > 1 correctly. On gfx1250 WMMA, lrvwTile=2 produced incorrect local reads. Refine the exemption: only UseMFMAF32XEmulation and CMS kernels may keep lrvwTile > 1; non-MFMA XF32 paths are now forced to lrvwTile=1. * Fix TF32EmuInterleaveTreg local read index for non-prefetch path Problem: NT/TN format XF32 kernels produce inf/nan errors when TF32EmuInterleaveTreg is enabled but doFullPackCodePrefetch is False (PLR=0). The TXInterleaveLayoutIdx() function assumes the full prefetch pack code layout, which is incompatible with the non-prefetch register layout. Fix: Add conditional branching based on doFullPackCodePrefetch in the TF32EmuInterleaveTreg handling. For the non-prefetch path (PLR=0), use a simpler index calculation that maps the first half of each group (withinGroup < 4) to T registers with a straightforward index formula: idx = (idx // 8) * 4 + withinGroup. This matches the register layout expected by the pack code when doFullPackCodePrefetch is False. * Fix TF32 emulation T-register overlap in tail loop Problem: TT and NN format kernels with DepthU=32 failed validation for tail loop. Root cause: In macroAndSetF32XEmuTregSingle(), the T registers (vgprValuA_T0_I0, vgprValuB_T0_I0) were defined using symbolic references relative to vgprValu{A/B}_X0_I0_BASE: .set vgprValuB_T0_I0, vgprValuB_X0_I0_BASE + 56 In the main loop, vgprValuB_X0_I0_BASE=34 gives T0=90 (correct). In the tail loop, vgprValuB_X0_I0_BASE is redefined to 32, giving T0=88 which overlaps with vgprValuA_T0_I0+6 (82+6=88). This causes A's TF32 processing to corrupt B's T registers (v88-v89), leading to incorrect WMMA results. Fix: Use absolute startVgprCvt values instead of symbolic BASE-relative offsets in RegSet. This ensures T register addresses remain correct regardless of BASE redefinition in the tail loop. * Enable XFP32 test coverage for gfx1250 in hipblaslt-test and tox * Enable gradient, postprocessing, and fix CVT instructions for gfx1250. (#1042) * Fix issues in cvt, enable gradient support for gfx1250. Cherry-picked from PR #160: - 35e2ef29dd (Fix issues in cvt and initial support on hhs gradient) - 80cce534da (Enable bbs gradient and postprocessing) * Fix wave32 and FP16 gradient issues on gfx1250 - Fix BF16 NaN check in writeBiasToGlobal for wave32 - Add fallback for FP16 sum unroll when dot2 is unavailable - Enable gfx1250 gtests for dgelu, bgrada, bgradb --------- Co-authored-by: George Tseng <george.tseng@amd.com> Co-authored-by: Andy Su <andysu12@amd.com> * Remove unused segmentsize (#1049) * include fp6 bf6 header * Don't use hip f6x16 until it is ready * Fix: Python 3.10 doesn't support [] inside f-string expressions * clear wmma_v2 flag if wmma_v3 is detected * only set new msb value untile compiler support it * added int and uint32_t constructors to three packed floating-point types * revert the tests for backward path for gfx1250 * [tensilelite] skip 120x for gfx1250 tox tests * support mix mode in test_CustomSchedule.py * Add workaround for size mismatch * Extend size mismatch workaround to FP4 types * [hipSPARSELt] Reapply: Make extops and matrix-transform subdirectories conditional Guard add_subdirectory(extops) and add_subdirectory(matrix-transform) with HIPBLASLT_ENABLE_EXTOPS and HIPBLASLT_ENABLE_MATRIX_TRANSFORM flags respectively, allowing builds to opt out of these components. * Revert "[hipSPARSELt] Patch reduced-size smoke tests" This reverts commit 462cc81. * 1. Guard packed-type getElement overload with #if to fix HIP device compile error 2. Fix -Wswitch warnings for unhandled Float6/BFloat6/Float4 enum values 3. Typo in BadInput/BadOutput error messages for Float6x16 and BFloat6x16 4. Add WIN32 macro * Remove NB_SHARED from rocisa to fix Windows DLL import failure * Fix MX on Windows --------- Co-authored-by: Stacey Lai <stacey.lai@amd.com> Co-authored-by: Lin, Ed <Ed.Lin@amd.com> Co-authored-by: Wu, Brianna <Brianna.Wu@amd.com> Co-authored-by: George Tseng <george.tseng@amd.com> Co-authored-by: yu-hsieh <Yu-cheng.Hsieh@amd.com> Co-authored-by: Vin Huang <vin.huang@amd.com> Co-authored-by: Huang, Mark <Mark.Huang@amd.com> Co-authored-by: boringmorning <huangchen1999@gmail.com> Co-authored-by: marhuang_amdeng <marhuang@amd.com> Co-authored-by: Lu, Serge <Serge.Lu@amd.com> Co-authored-by: Yang, Anne <Anne.Yang@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Stacey Lai <stacelai@amd.com> Co-authored-by: mengzcai_amdeng <Meng-Zhe.Cai@amd.com> Co-authored-by: jichang <jimmy.chang@amd.com> Co-authored-by: Ho, Henry <Henry.Ho@amd.com> Co-authored-by: Chang, Josh <Josh.Chang@amd.com> Co-authored-by: Su, Andy <Andy.Su@amd.com> Co-authored-by: Andy Su <andysu12@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com> Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
aledudek
pushed a commit
that referenced
this pull request
May 20, 2026
This reverts commit cb81c2e.
aledudek
pushed a commit
that referenced
this pull request
May 20, 2026
This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Ref: #1208. Enable plr-min optimization for spmm
Technical Details
Plr-min only support no packing case. This PR enables plr-min for spmm when
TransposeLDSMetadatais TrueTest Plan
Test Result
Submission Checklist