Skip to content

Test dont merge#231

Closed
ammallya wants to merge 1 commit into
developfrom
ammallya-patch-3
Closed

Test dont merge#231
ammallya wants to merge 1 commit into
developfrom
ammallya-patch-3

Conversation

@ammallya
Copy link
Copy Markdown
Collaborator

No description provided.

@ammallya ammallya requested a review from a team as a code owner June 17, 2025 21:45
@ammallya ammallya closed this Mar 13, 2026
bethune-bryant pushed a commit that referenced this pull request Apr 3, 2026
* Fix Solution.py on gfx1250.

* Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py
jichangjichang pushed a commit that referenced this pull request Apr 8, 2026
* Fix Solution.py on gfx1250.

* Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py
jichangjichang pushed a commit that referenced this pull request Apr 13, 2026
* Fix Solution.py on gfx1250.

* Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py
TonyYHsieh pushed a commit that referenced this pull request Apr 13, 2026
* Fix Solution.py on gfx1250.

* Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py
jichangjichang pushed a commit that referenced this pull request Apr 13, 2026
* Fix Solution.py on gfx1250.

* Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py
tomchengchitang pushed a commit that referenced this pull request Apr 16, 2026
* Fix Solution.py on gfx1250.

* Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py
tomchengchitang added a commit that referenced this pull request Apr 16, 2026
…te_to_1250 (#1115)

* fix computeInputType issue in ReferenceValidator.cpp

* [hipblaslt] fix unit tests for gfx950_mx_rebase (#4912)

Fix unit tests for gfx950_mx_rebase

Fix unit tests for gfx950_mx_rebase

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Alex Brown <alex.brown@amd.com>
Co-authored-by: Chun-Xun Lin <Chun-Xun.Lin@amd.com>

* [hipblaslt] Fix a verification fail with spmm_i8hs.yaml (#5034)

Fix a fail with spmm_i8hs.yaml

- use ["_DepthU"] istead of ["_DepthUA/B"] in sparse case
- restore missing code for sparse

tensilelite common test

tensilelite common test (spmm related) passed with this change

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* initial set of testcase for MXFP4 (#4739)

This PR is to add test cases for tensile-lite related with MXFP4 support

The change here is to add new YAML file which will verify the MXFP4
support
https://github.com/ROCm/rocm-libraries/commit/333d1a24bc5c1a1a1685ed1c22cd6979ad8c5839

<!-- Explain any relevant testing done to verify this PR. -->

<!-- Briefly summarize test outcomes. -->

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>

* [Tensilelite] Add regression test for MX FP4 scale buffer determinism (#4959)

Write test to catch post-generation overwrites of `MXSA`/`MXSB` buffers
that desync the CPU reference from GPU data, causing intermittent MX FP4
validation failures.

Added `MXScaleDeterminismTest/ScaleBufferIsDeterministic` to
`MXDataGen_test.cpp` - calls `generateMXInput` twice with
sentinel-initialized scale buffers (`0x00` and `0xFF`) and asserts
equality - the differing sentinels prevent a silent no-write from
passing.

Includes non-32-aligned cases (M=204, M=213) that failed due to
`initializeCPUInputs` overwriting `MXSA`/`MXSB` after `generateMXInput`
populated them.

- `MXScaleDeterminismTest/ScaleBufferIsDeterministic` — 4 cases

- [x] All tests passing

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* UseF32XEmulation in forceLrvwTile1 for B tensor (#5143)

- When `gfx950_mx_rebase` split `forceLrvwTile1` into per-tensor
definitions (using `MacDataTypeA`/`MacDataTypeB` instead of `DataType`),
the B tensor copy lost the `not UseF32XEmulation` guard present on
`develop`
- This caused TF32 kernels to get `lrvwTileB=1` instead of
`VectorWidthB=2`, halving the local read block width and doubling the
number of LRB instructions (8 → 16), breaking CMS schedules
- One-line fix: add `and (not kernel["UseF32XEmulation"])` to the B
tensor `forceLrvwTile1` definition

- [x] `xfp32.yaml` passes
- [x] `cms_tf32_nt.yaml` passes (extracted from
`custom_mainloop_scheduling_tf32.yaml`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

AIHPBLAS-1221
AIHPBLAS-1224

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* [hipblaslt] Enable StoreSwapAddr for MXFP4, plus add GRVWMXSA/B adustment for DTL (#5117)

Enable MT256x256x256 for MXFP4. Need StoreSwapAddr and GRVWMXSA/B
adjustment for DTL

- Enable StoreSwapAddr for MXFP4
- Added GRVWMXSA/B adjustment logic for DTL (64bit to 32bit)

CI test

CI test

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* [Tensilelite] Fix UserArgs struct stride mismatch in grouped GEMM (#5129)

Grouped GEMM with `UseUserArgs: True` crashes with
`hipErrorIllegalAddress` because the kernel's `userArgsInfo.totalSize`
doesn't match `sizeof(DeviceUserArguments)`.

`DeviceUserArguments` is a fixed 196-byte packed struct, but
`userArgsInfo` field sizes in `Signature.py` were only set inside
feature-gated blocks, leaving them at 0 when features were disabled.

This made `totalSize` (stride between per-problem entries) and
`extArgOffset` (epilogue field offsets) smaller than the struct, causing
the kernel to read from wrong offsets for multi-problem grouped GEMMs.

Fix: move size assignments outside conditional blocks to always match
the struct layout.

Existing test: `grouped_gemm_userargs.yaml` — validates grouped GEMM
with `UseUserArgs: True` across ScaleAlphaVec, Bias+Activation, and
plain configurations.

- [x] Test passed

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AIHPBLAS-1220

* [hipBLASLt] Disable failed mx f8 problem sizes (#5105)

Comment two problem sizes of the new mx f8 tests that currently fail.

Manually tested other problem sizes to verify they pass.

<!-- Briefly summarize test outcomes. -->

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* [hipblaslt] Scheduling related fixes for MXFP4 (#5169)

Improve MXFP4 instruction scheduling. Plus, reduce sgpr usage for
MXFP4+StreamK

    - Added if condition for MXFP4 in getMFMAIssueLatency
    - Fixed numGRIncInst calculation for MXFP4
    - Use MinGRIncPerMfma=3 for MXFP4
    - Disable staggerU code for MXFP4 + StreamK

CI test

Check CI test

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* remove explicit constructor from variable_value class

* fix return statement in hipDataType_to_tensile_type and add type check for activation inputs

* [Tensilelite] Shuffle mx scaling data in Tensile (#4864)

Generate pre-shuffled scaling data that TensileLite can use for MX
datatypes

Added optional `--enable-mx-preswizzle` flag to control MX FP4 scale
pre-swizzle.

When enabled, the flag allows scale tensors to be rearranged into gfx950
GPU kernel memory access patterns using parameters `{swizzleTileMN=32,
tileK=8, subTileK=4}` and `{tileK=8, swizzleTileMN=32}`.

Pre-swizzle requirements: flag enabled, solution available, `useScaleAB`
populated, and dimension alignment satisfied (rows%256==0, cols%32==0
for mxBlock=32).

`MXPreSwizzleTest` validates scale permutation behavior and data buffer
invariance across multiple matrix dimensions and transpose modes.

Tests pass successfully

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AIHPBLAS-789

* [hipblaslt] Fix fail with kringshift.yaml (#5228)

Fix fail with kringshift.yaml on gfx950_mx_rebase branch

Swapped the location of graAddresses and graFinalOffsets to align with
develop branch.
Before this fix, s[sgprSrdB+0] was referred before initialization in
kringshift case.

Run kringshift.yaml

kringshift.yaml passed.

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* [hipblaslt] Optimize StoreSwapAddr (#5217)

optimize sgpr allocation for StoreSwapAddr

- use 1 common sgpr (swapCommon) for all (A,B,MXSA,MXSB)
- use s_add m0, swapCommon, LocalWriteAddr for m0 initialization

Run existing dtl test

dtl.yaml passed

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* [hipblaslt] Enable MXFP4 + DtlPlusLdsBuf (#5251)

MXFP4 + DtlPlusLdsBuf was broken and need to enable it

Added missing code for MXFP4 + DtlPlusLdsBuf

Local test with MXFP4 + DtlPlusLdsBuf

Local test passed

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* Fix gfx12 build error with integer cast

* [hipblaslt] Fix SIA3 issues with MXFP4 (#5245)

Fix scheduling issues with MXFP4 + SIA3

- fix incorrect local read count for MXSA/B in getLocalWriteMFMAStart
- fix incorrect global read count for MXSA/B in getNumLocalWritePerMfma
- optimize PointerLRCode scheduling for MXFP4

tensilelite common test

tensilelite common test passed

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* [hipBLASLt] Fix CI failures for gfx942 (#5216)

This PR fixes failed gfx942 tests in CI.

For fnuz/non-fnuz f8 types, the `ComputeInputType` might appear in a
concatenated form:`ComputeInputTypeAComputeInputTypeB` (e.g.,
`BFloat8Float8_fnuz`). This leads to type mismatches when selecting
solutions. This PR adds logic to correctly interpret concatenated types
by checking the type at the corresponding position.

Manually run these tests:

gfx942:
- `./clients/hipblaslt-test
--gtest_filter=*pre_checkin_matmul_real_1b_fnuz_dst_1b_fnuz_smallsize_bf8_fnuz_rf8_fnuz_rbf8_fnuz_rbf8_fnuz_rf32_r_NT_3_128_128_1_3_128_2_3_3_1_SA_SB*`

gfx950:
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_dst_bf16*`
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_bf8_dst_bf16*`
- `./clients/hipblaslt-test --gtest_filter=*matmul_f8_bf8_dst_fp32*`

<!-- Briefly summarize test outcomes. -->

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* Make the usage side’s logic consistent with allocation side (tPackM) (#5273)

- The error:
[math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/6/pipeline-overview/log?nodeId=628](https://math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/6/pipeline-overview/log?nodeId=628)
- Fix the inconsistency between LocalRead.py and KernelWriterAssembly.py
- Change `setComputeInputType` to `setComputeInputTypeA` and
`setComputeInputTypeB`, because `setComputeInputType` doesn't exist
anymore.

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

- Inside the LocalRead side, it tries to reference the wrong names,
which are non-set sgprs such as `sgprPackKForV2` and `sgprPackKForV3`.
It shall reference `sgprPackKForMV2/sgprPackKForMV3`, so I have to make
it reference the right ones.

<!-- Explain the changes along with any relevant GitHub links. -->

- Please use the attached file `setup_hipsparselt_build` to set the
packages and env variables the same as the CI.

[setup_hipsparselt_build.sh](https://github.com/user-attachments/files/25870143/setup_hipsparselt_build.sh)
- `./install.sh -c`
<!-- Explain any relevant testing done to verify this PR. -->

- first build with errors:

[build_hipsparselt_1st.log](https://github.com/user-attachments/files/25870212/build_hipsparselt_1st.log)

- final build WITHOUT errors:

[build_hipsparselt_final.log](https://github.com/user-attachments/files/25870348/build_hipsparselt_final.log)

<!-- Briefly summarize test outcomes. -->

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: tomchengchitang <tom.tang@amd.com>

* [hipblaslt] Fix fail with gfx942+dtv/dtl.yaml (#5349)

Fix dtv/dtl.yaml fail on gfx942

Add int cast to avoid using float value in const field

Run dtv/dtl.yaml on gfx942

dtv/dtl.yaml padded on gfx942

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* [hipblaslt] disable mxDataGenerator for windows builds (#5298)

Disable mxDataGenerator for Windows builds.

Disable mxDataGenerator for Windows builds.

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>

* fix build errors of merge

* Fix: add MacDataTypeA to mock kernel (#5351)

Fix the unit test now that we use `MacDataTypeA` to get the datatype for
a matrix instruction.

Unit tests

pass

* [hipblaslt] Fix tox test fp8_gfx12 failed when dtva1=1 or dtvb1=1 (#5390)

- The tox test in CI failed with
Tensile/Tests/common/gemm/gfx12/fp8_gfx12.yaml

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

- When the kernels do the validation with host side, kernels with dtva1
or dtvb1 always/usually failed.
<img width="604" height="177" alt="image"
src="https://github.com/user-attachments/assets/2409c81e-99b3-4cd2-8d5d-6ea1214af32d"
/>

- The reason behind that is v_bfe_u32 v4, v[Serial], 2.0, 1 → 2.0
encodes as 0x40000000 in IEEE 754, so the GPU shifts by 0x40000000 &
0x1F = 0 bits, extracting bit 0 instead of bit 2

<!-- Explain the changes along with any relevant GitHub links. -->

`tox -e py3 -- Tensile/Tests -k
Tensile/Tests/common/gemm/gfx12/fp8_gfx12.yaml 2>&1 | tee fp8_gfx12.log`

<!-- Explain any relevant testing done to verify this PR. -->

Build succeed log:

[fp8_gfx12_fixed.log](https://github.com/user-attachments/files/25933785/fp8_gfx12_fixed.log)

<!-- Briefly summarize test outcomes. -->

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: tomchengchitang <Tom.Tang@amd.com>

* Revert "[hipblaslt] disable mxDataGenerator for windows builds (#5298)"

This reverts commit e20a0900e6c118c82a1e22ce61ae7921a28a3acf.

* [hipBLASLt] Fix failed swizzle tests (#5400)

Fix failed tests:

`Tensile/Tests/common/gemm/swizzleA.yaml`
`Tensile/Tests/common/gemm/swizzleB.yaml`

For MFMA validation, need to use `F32XdlMathOp` data type instead of
`DataType` depending on if `EnableF32XdlMathOp` is enabled.

Run the failed tests locally to verify the fix.

<!-- Briefly summarize test outcomes. -->

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* [hipblaslt] Fix tailLoop errors in GLOBAL_OFFSET_{A or B} for fp16_gfx12.yaml (#5419)

- CI test `tox -e py3 -- Tensile/Tests -k Tensile/Tests -k
Tensile/Tests/common/gemm/gfx12/fp16_gfx12.yaml` always failed in gfx12
machines
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

- The test passed when K = 256 (multiple of 32), but failed with K=255
and K=257
- So the problem occurred in TailLoop
- Look at the TailLoop asm code:
<img width="2796" height="290" alt="image"
src="https://github.com/user-attachments/assets/ca974c71-6e65-45c1-8cd0-5ce6680454a9"
/>

- Every time we use GLOBAL_OFFSET_{A or B}, we have to add the
`v_lshlrev_b32` back, but I found there is some mismatch between
GLOBAL_OFFSET_* which is in the TailLoop.

<!-- Explain the changes along with any relevant GitHub links. -->

Script: `tox -e py3 -- Tensile/Tests -k
Tensile/Tests/common/gemm/gfx12/fp16_gfx12.yaml`.

<!-- Explain any relevant testing done to verify this PR. -->

The test passed.

[fp16_gfx12_debug.log](https://github.com/user-attachments/files/25963352/fp16_gfx12_debug.log)

<!-- Briefly summarize test outcomes. -->

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: tomchengchitang <Tom.Tang@amd.com>

* [hipblaslt] disable mxDataGenerator for windows builds (#5414)

Disable mxDataGenerator for Windows builds.

Disable mxDataGenerator for Windows builds.

The original PR failed `Tensile/Tests/common/gemm/gfx950/mx32f4_tn.yaml`
since we removed data types. restore the original implementation with OS
specific build dispatch.

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* [hipblaslt] Add F4/F6/BF6 to instTypeToDataType (#5457)

 F4/F6/BF6 missing in instTypeToDataType

Add F4/F6/BF6 to instTypeToDataType

CI test

check CI test result

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* [hipblaslt] remove MXFP4 TN logic file (#5487)

Fix all_solutions fail.

remove MXFP4 TN logic file (which has incorrect settings)

hipblaslt-test with heuristic_all_solutions

hipblaslt-test with heuristic_all_solutions passed

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* [hipblaslt] Use64bShadowLimitMX support (#5499)

Reduce the amount of sgpr usage by setting Use64bShadowLimit for MXSA/B
separately from A/B

- Added a new parameter to set Use64bShadowLimit for MXSA/B separately from A/B
- Added test cases for Use64bShadowLimitMX

New test cases in mx32f4_tn.yaml

mx32f4_tn.yaml passed

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* cleanup redundant lines of code

* [hipBLASLt] Fix failed rocRoller test (#5529)

Fix the rocRoller failed test in precheckin stage:

> [  FAILED  ] 1 test, listed below:
> [ FAILED ]
_/rocroller_predicate_test.unrollXYK/pre_checkin_rocroller_predicate_f8_rf8_rf32_rf32_rf32_r_TN_128_128_96_1p5_96_96_2_128_128_1_SAMX_32_UE8M0_SBMX_32_UE8M0,
where GetParam() = { function: "rocroller_predicate", name:
"rocroller_predicate", category: "pre_checkin", known_bug_platforms: "",
alpha: 1.5, beta: 2, stride_a: 0x9ec9018, stride_b: 0x9ec9118, stride_c:
0x9ec9218, stride_d: 0x9ec9318, stride_e: 0x9ec9418,
user_allocated_workspace: 134217728, M: 0x9ec9520, N: 0x9ec9620, K:
0x9ec9720, lda: 0x9ec9820, ldb: 0x9ec9920, ldc: 0x9ec9a20, ldd:
0x9ec9b20, lde: 0x9ec9c20, batch_count: 1, iters: 10, cold_iters: 2,
algo: 0, solution_index: -1, requested_solution_num: 1, a_type: f8_r,
b_type: f8_r, c_type: f32_r, d_type: f32_r, compute_type: f32_r,
compute_input_typeA: non-supported type, compute_input_typeB:
non-supported type, scale_type: f32_r, initialization: "hpl", gpu_arch:
"950", gpu_arch_exclude: "", pad: 4096, grouped_gemm: 0, threads: 0,
streams: 0, devices:

Change visibility of macro `HIPBLASLT_ENABLE_MXDATAGENERATOR` to make it
visible (defined) to the files that use it.

Manually run the failed test:

`./clients/hipblaslt-test
--gtest_filter=*pre_checkin_rocroller_predicate_f8_rf8_rf32_rf32_rf32_r_TN_128_128_96_1p5_96_96_2_128_128_1_SAMX_32_UE8M0_SBMX_32_UE8M0*`

<!-- Briefly summarize test outcomes. -->

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* Revert "cleanup redundant lines of code"

This reverts commit 7b3a5e5e34c0dcef34560566da118bd6ef655cf4.

* [hipsparselt] Fix numSplitMetadata logic (#5608)

- hipsparselt-test failed a lots:
https://math-ci.amd.com/job/rocm-libraries/job/precheckin/job/hipsparselt/job/PR-5202/26/pipeline-overview/?selected-node=624
- Fix numSplitMetadata logic

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

- Change the numSplitMetadata logic the same as develop branch

<!-- Explain the changes along with any relevant GitHub links. -->

<!-- Explain any relevant testing done to verify this PR. -->

<img width="1150" height="224" alt="image"
src="https://github.com/user-attachments/assets/d3fd68a1-b0d0-422b-a419-183cabb2bf50"
/>

Note that I only count the common failed test between my local and CI.

<!-- Briefly summarize test outcomes. -->

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: tomchengchitang <tom.tang@amd.com>

* [hipblaslt] const GRInc support (#5526)

Reduce the amount of sgpr usage for MXFP4

- Enable constant GRInc value instead of allocating sgpr for applicable
cases

Run tensilelite common test
Run hipblaslt-test

tensilelite common test passed
hipblaslt-test passed (except for known issue)

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* Fix merge

* Fix tensilelite build error due to merge conflict

* fix various issue in review

* [hipblaslt] Reject MX + nonDTL + UnrollLoopSwapGlobalReadOrder (#5794)

Fix an error with MX + nonDTL + UnrollLoopSwapGlobalReadOrder

Reject MX + nonDTL + UnrollLoopSwapGlobalReadOrder

Local test

Local test passed (rejected)

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* [hipsparselt] Restore to develop logic and fix mistakenly used PackKForMV (#5796)

- Continuing fixing the hipsparselt-test failed test

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

- As the kernel shows, it misuses the PackKForMV
<img width="2033" height="387" alt="image"
src="https://github.com/user-attachments/assets/b45c5ab9-e293-44ba-b82a-ab2bd8cf9ba8"
/>
<!-- Explain the changes along with any relevant GitHub links. -->

- hipsparselt-test: `TENSILE_DB=0x8000
./build/release/hipsparselt-install/bin/hipsparselt-test
--gtest_output=xml --gtest_color=yes '--gtest_filter=*quick*'`
<!-- Explain any relevant testing done to verify this PR. -->

<img width="874" height="605" alt="image"
src="https://github.com/user-attachments/assets/2c2d2c48-7be6-4505-8ad3-557c8416a7a2"
/>

From 750 failed test -> 260
<!-- Briefly summarize test outcomes. -->

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: tomchengchitang <tom.tang@amd.com>

* [hipsparelt] Delete spurious rIdx_ loop for hipsparselt failed tests (#5848)

- Fix all hipsparselt-test failed test in rhel && gfx950
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

- delete the extra `for rIdx_ loop` in LocalRead.py for the assembly
<img width="1805" height="681" alt="image"
src="https://github.com/user-attachments/assets/d633bd82-a291-43c2-a9c6-d99276a5335d"
/>

<!-- Explain the changes along with any relevant GitHub links. -->

- test script: `./build/release/hipsparselt-install/bin/hipsparselt-test
--gtest_output=xml --gtest_color=yes --gtest_filter=*quick* `
<!-- Explain any relevant testing done to verify this PR. -->

- All passed
<img width="1212" height="250" alt="image"
src="https://github.com/user-attachments/assets/797091e6-2bb1-4262-9513-5220ff6b3122"
/>

<!-- Briefly summarize test outcomes. -->

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: tomchengchitang <tom.tang@amd.com>

* Added a separate testenv for rocisa

* MX: F4: scale type E5M3

* MX: support MX with non-MX combination

* MX : initKernel part

* MX: Tail: remove dirty MX data

* MX: Tail: release MX valu vgpr after tail

* MX: Edge: MX Scale shiftptr align with noraml Buffer

* Shiftptr: fix Shiftptr limitation

* MX: update AssertSummationElementMultiple for MX

* F4: use 32x16 step 1

* F4: use 32x16: local read: add numTilePerInst loop

* F4: use 32x16: local read: fix offset

* F4: use 32x16: wmma tail: add numTilePerInst loop

* F4: use 32x16: wmma tail

* F4: use 32x16: store D

* F4: use 32x16: vectorwidth

* F4: use 32x16: multi wave

* F4: use 32x16: local read: use MIInputPerThUnroll

* F4: use 32x16: local read: add numTilePerInst loop

* F4: use 32x16: local read: dstr offset

* F4: use 32x16: local read: fix wave offset

* MX: F4: fix tmpVGPR allocation

* MXF4: support 32x16 instruction

* F4: use 32x16: multi wave: NT, TT

* F4: use 32x16: wmma tail

* MX: fix MX buffer load length

* MX: HipBlasLt: add MXE8B32 F8 yamls

* MX: HipBlasLt: add MXE8B32B B8 yamls

* MX: HipBlasLt: add MXE8B16 F8 yamls

* MX: HipBlasLt: add MXE8B16 B8 yamls

* MX: HipBlasLt: add MXE8B16 F6 yamls

* MX: HipBlasLt: add MXE8B16 B6 yamls

* MX: HipBlasLt: add MXE8B32 B6 yamls

* MX: HipBlasLt: add MXE8B32 F6 yamls

* MX: HipBlasLt: add MXE8B32 F4 yamls

* MX: HipBlasLt: add MXF8B32 F4 yamls

* MX: HipBlasLt: add MXE5M3B32 F4 yamls

* MX: HipBlasLt: add MXE5M3B16 F4 yamls

* MX: HipBlasLt: add MXF8B16 F4 yamls

* MX: HipBlasLt: add MXE8B16 F4 yamls

* MX: HipBlasLt: add MXE8B32/MXF8B16 sample

* MX: HipBlasLt: support MXE8B32/MXF8B16 F4

* MX: HipBlasLt: support MXF8B32/MXE8B16 F4

* MX: HipBlasLt: add MXF8B32/MXE8B16 sample

* MX: HipBlasLt: support MXF8B32/MXE8B16 F4

* MX: HipBlasLt: add MXE5M3B32/MXE5M3B16 sample

* EfficiencyMonitor: do nothing in set_device_id if not enabled

* benchmark: enable F4/F6 type without block scale

* MX: hipblaslt-bench: support MXE8B32/MXE8B16 F4

* MX: hipblaslt-bench: support all F4 mx scale type

* MX: hipblaslt-bench: F8/F6/F4 smoke gtest

* MX: hipblaslt-bench: limit hipblaslt_e8 rand_int/hpl init value

* Fix hipblaslt-test failed with HHS

This patch fix the following hipblaslt-test test case (HHS):
smoke_matmul_bias_relu_SAV_smoke_f16_rf16_rf16_rf16_rf32_r_relu*

* f8f6f6 mix mode on hipblaslt

* Added TensorLoadToLds for rocisa

* WIP: F8 codegen for TDM

* WIP: several bug fixes for TDM

* WIP: TDM wave separated initial support

* WIP: TDM uses up to 12 SGPRs now

* WIP: TDM optimize lds address swap

* WIP: TDM supports FP4 now

* WIP: temporarily integrate regular MX and TDM A&B

* Fix bugs in UseSgprForGRO.

* WIP: remove g2l vgpr allocations for A & B when TDM enabled

* WIP: fixed waitcnt calculation when TDM enabled

* WIP: TDM supports LdsAlignPow2 is False

* WIP: reject non-TN TDM kernels

* Fix large LDS logic (#189)

* large LDS follow gfx950 logic

* fix calculation of # local read/write calculation, lessen StoreSwapAddr enable constraits

---------

Co-authored-by: boringmorning <huangchen1999@gmail.com>

* [hipBLASLt] cherry-pick TDM MX changes (#411)

* WIP: TDM for MX buffers

* WIP: fixed TDM lds swap for MXSA and MXSB

* WIP: reduce VGPR usage when TDM enabled

* WIP: fixed TDM MT selection for MX buffer

* WIP: adjusted TDM LDS swap implementation for StoreSwapAddr

* WIP: several minor fixes for TDM codegen

* Fixed incompatible s_waitcnt parameter

* TDM: fix MXS staggerU

* TDM padding (#423)

* Support TDM padding
* Add padding checks for TDM
* Add mxf8ss_tdm test yaml

* fix LdsBlockSizePerPadB reject condition

* [hipBlasLt] remove mix mode yamls and tests

* fix comment in hipblaslt.h

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove KWS keyword

* fix helper.h typo

* re-order header list in include CMakeLists.txt

* remove some comment

* fix build break

* removd DepthU 128 form mxf8ss_tdm.yaml

* [hipSPARSELt] Enable Depth-U=256 in 8-bit SPMM kernels (#166)

* Fix SPMM DU256 issue. Add DepthU=256 test param in yamls.

* Update spmm_i8 yaml.

* Refine metadata iteration variables based on commit 45ec843.

* Fix vgprPerInputM calculation.

* Update spmm 8-bit yamls from clr0 to clr1.

* [hipSPARSELt] Fix Solution.py on gfx1250. (#231)

* Fix Solution.py on gfx1250.

* Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py

* Upload gfx1250 logic yamls to hipSPARSELt.

* Update gfx1250 logic yaml. Fix inconsistent yamls, and disable UserArgs.

* [Workaround] Bypass metadata layout when build hipSPARSELt.

* [hipSPARSELt] Update setComputeInputType to setComputeInputTypeA/B in tensile_host.cpp

* [hipSPARSELt] Fix sparse b tail loop mask issue.

* [hipSPARSELt] Patch reduced-size smoke tests

* [hipSPARSELt] Fix 8-bit metadata vGPRS number and idx. Add i8is support when using bias.

* [hipSPARSELt][Patch] Fix bugs found on PR#633 based on PR#682

* Revert "[Temp] disable spmm test, since spmm is not ready."

This reverts commit 8b1ab52a67bda3684ce5ae0b76ba965569eaf061.

* [hipSPARSELt] Add gfx1250 I8IS (Int8->Int32) Tensile library logic for hipsparselt

Add 8 GridBased library logic YAML files for I8 input / I32 output
sparse matrix multiplication on gfx1250, covering all 4 transpose
combinations (NN, NT, TN, TT) for both Sparse-A and Sparse-B.

* [hipSPARSELt] Fix KeyError 'UnrollMajorLDSMetadata' for non-sparse kernels, and test arch typo in spmm_fp16_ml1.yaml

Add sparse guard before accessing UnrollMajorLDSMetadata in
KernelWriter.py and KernelWriterAssembly.py. This key is only set
for sparse kernels in Solution.py, but was accessed unconditionally,
causing KeyError for non-sparse configs (e.g. sgemm_xf32_asm).

Also fix typo in spmm_fp16_ml1.yaml: skip-gfx942m -> skip-gfx942
so the gfx1250-only test is correctly skipped on gfx942 machines.

* fix: use _DepthUTc to replace DepthU

* [hipSPARSELt][Tmp] Disable autoVectorWidthA/B in gfx1250

* handle float bpe

* check mxBlock in prombleType before use it

* typo in DSStoreB256

* modify auto assign LocalReadVectorWidthA/B

* reject condition use LocalReadVectorWidthA/B

* Fix NameError SMovBX is not defined

* Fix invalid assembly token '**' in GSU duBpe calculation

mult_MI_Dim already contains a leading '*' (e.g. "*MI_M"), so the
f-string should not add another '*', which produced invalid assembly
like "64**MI_M" instead of "64*MI_M".

* Fix undefined MI_M/MI_N assembly symbols in GSU swizzle code

The GSU code in computeLoadSrd and graIncrements emitted string
literals "*MI_M" / "*MI_N" as immediate operands in assembly
instructions, but these are not valid assembler symbols and caused
"expected relocatable expression" errors during .s -> .o assembly.

Resolve MI_M/MI_N to their actual numeric values at code-generation
time by multiplying with kernel["MatrixInstM"] / kernel["MatrixInstN"]
in Python, so the assembler only sees plain integer immediates.

* Reject SwizzleTensor with NumLoadsCoalesced > 1 (swizzleA.yaml)

When SwizzleTensorA/B is enabled with NumLoadsCoalesced > 1, the
tile offset stride calculation in graTileOffsets uses an inter-wave
stride (numKr * WvG * swzBlockSize) that does not correctly map the
second coalesced read for edge cases where MacroTile > actual matrix
dimension (e.g., MT_M=256 with M=128). This causes the second tile
read to access incorrect memory locations, resulting in numerical
accuracy failures.

All 58 FAILED tests in swizzleA.yaml had NLCA=2 + SwizzleTensorA +
MT_M > M, while identical kernels with NLCA=1 passed. This bug was
introduced during gfx1250 SwizzleTensor development where the NLCA>1
edge case in the swizzle global read path was not considered.

Reject this configuration until the swizzle tile offset stride is
properly fixed for multi-coalesced-load edge handling.

* Fix epilogue ScaleAB load condition in fp8nfp16mix_fp8nss.yaml

The epilogue ScaleAB loading code compared DataType with itself
(always true) instead of with MacDataType, causing it to
unconditionally load ScaleA from AddressScaleA even when
preloadScaleA=True (DataTypeA > MacDataTypeA).

Fix: Change the inner loop condition from DataType<=DataType (nop)
to DataType<=MacDataType.

* Fix DTL correctness for FP16/FP8: cast float bpe values to int for assembly operands

* Fixed lds padding sanity check

* Fix the logic path for VCvtF32toF16

* Fix incorrect SGPR global read offset for DirectToVgpr with UseSgprForGRO

In computeScalarGroImpl(), the DirectToVgpr path used the unsuffixed
kernel["LocalReadVectorWidth"] (defaults to -1) instead of the tensor-
specific kernel["LocalReadVectorWidth{A,B}"]. This produced negative
unrollStride values, corrupting ScalarGlobalReadOffset in the generated
assembly and causing incorrect GEMM results.

Test: dtv_gfx90a.yaml (DirectToVgprA=1, UseSgprForGRO=1, GRVWA=1, GRVWB=1)

* convert offset from float to int before passing into DSModifiers

* Fixed incorrect enum values

* Removed deprecated rejection condition for swizzle kernel

* [hipSPARSELt] Seperate SPMM  wider localRead conditions from dense. Disable reject Sparse A kernel only support PGR with EPS=1

* Fix: skip Formocast prediction when analyticalHardware is unavailable

The tensilelite-client crashed with a segfault in
AllSolutionsIterator::preProblem when running bf16_tn_gfx12_predict.yaml
on gfx1200. The root cause is that origami does not support gfx1200, so
analyticalHardware (shared_ptr<origami::hardware_t>) is never initialized
in HipAMDGPU. The getHardware() helper then dereferences a null pointer.

* Fix NameError in SIA2: replace undefined instPerPack with instPerPackA/instPerPackB

The variable instPerPack was renamed to instPerPackA and instPerPackB
during a prior refactor, but 9 references in the scheduleIterAlg == 2
code path of _makeSubIterSchedule were not updated. This caused a
NameError during assembly kernel generation for gfx1200 (RDNA4) targets,
which are the only architectures with library configs using SIA2.

* convert tailloopInNllmaxUnit from float to int

* Fix DVT tail loop validation failure on gfx12 by adding missing bpe multiply

The GLOBAL_OFFSET macro was refactored to move the BPE multiplication out of
the macro and into each call site. However, the globalReadGuardK function's
call to GLOBAL_OFFSET for computing the max valid address offset in DVT tail
loops was not updated to include the external bpe multiply.

* Skip F4 related test YAML for gfx1200

* Fix swizzle tensor SRD limit alignment in computeLoadSrd

MX integration refactored the IndicesSummation check into a nested if/elif
chain, making the swizzle alignment path unreachable.
Swizzled tensor A used raw SizeL-1 instead of alignTo(SizeL, swzStride)-1,
causing SRD limit to be too small and buffer_load to return zeros at K
boundaries. Fixes 58 NLCA=2 test failures.

* Fix missing arguments and conditions

* remove added testcases in matmul_bias_vector_dst_fp16_32

* Update test to accept VEC16_UE4M3 scale mode in SetAttribute

The MX F4 support commits moved VEC16_UE4M3 from the rejection list to an
accepted case (mapped to ScalingFormat::Block_16_UE4M3) for MXFP4 on
gfx1250, but the test was not updated accordingly.

Fix the test to expect SUCCESS and verify the set/get round-trip.

Test: hipblaslt-test --gtest_filter=*pre_checkin_aux_matmul_set_get_attr_f16_r*

* Revert "Patch hipsparselt build system to use local tensilelite"

This reverts commit dedf0a070911e739d3d049c059ba1e6903f4cea9.

* [hipSPARSELt] Enable gfx1250

* [hipSPARSELt][Tmp] Skip use_sgpr_for_gro test cases on gfx950

* remove deprecated variable: LdsAlignPow2

* TF32: remove deprecated code

* Workaround for CMS to support separate LRVWA/B

* [hipSPARSELt][Workaround] Correct wlrMultiple calculation for sparse WLR rejection

For sparse A (Sparse==1), MIInputPerThreadA already accounts for the
compressed data, but LocalReadVectorWidthA is based on the original
(uncompressed) data size. Divide by an extra factor of 2 to correctly
compute wlrMultiple for sparse A.

Similarly for sparse B (Sparse==2), apply the same correction to
wlrMultiple based on LocalReadVectorWidthB.

Also clarify rejection messages to distinguish A/B tensor.

* Fix: stuck in _makeSubIterSchedule() when build gfx1200 SIA2 kernel

root cause: packItems is non-empty and it never check instPerPackB to pop items

* Fix local read waitcnt

* Revert "[Tensilelite][Sparse] Enable plr min for spmm (#4364)"

This reverts commit cb81c2e3432e933ea9674c45500dcfa788ba6051.

* Revert "[hipSPARSELt][Tmp] Skip use_sgpr_for_gro test cases on gfx950"

This reverts commit 8f32694cd27fc4554c82ac0fe8f4e4ca553e2510.

* Fix lrvw undefined caused by conflicts resolving

* Fix activation args type mismatch causing segfault in debug build

prob.act0/act1 (float) were pushed into ConstantVariant without casting to
the compute type. When compute type is Int32 (i8->i32 GEMM), subsequent
std::get_if<int32_t> on a float-holding variant returned nullptr, causing
SIGSEGV on dereference. Release builds masked this UB via -O3 optimizations.

Affects: ./build/release/clients/hipblaslt-test --gtest_filter=*gemm_i8_dst_i32_94x*

* Enable RocRoller for hipblaslt and fix host build issue

* remove redundant pack code in xfp32 kernels

* pick all fix from gfx1250 to develop-gfx1250-open-source (#969)

* [hipblaslt] fix ds_bpermute_b32 msb computation and s_set_vgpr_msb (#910)

* Add gfx1250 HHS AuxH yamls for gtest (#911)

Co-authored-by: Andy Su <andysu12@amd.com>

* fix ExtOpLayerNorm test (#935)

enable extop for gfx11/12 (#129)

Co-authored-by: Huang, Mark <Mark.Huang@amd.com>
Co-authored-by: boringmorning <huangchen1999@gmail.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>

* [hipblaslt] fix index mode discarding valid algos when pool size < batch size (#960)

getAlgosFromIndex returns INVALID_VALUE when any requested index
exceeds the pool size, even though valid algos are still populated
in the output. Previously the caller broke immediately on
INVALID_VALUE, discarding those valid results. Now process the
returned algos before exiting the loop.

* [Tensilelite] modify NumRecords of E and BiasSrd for gfx1250 (#957)

modify NumRecords of E and BiasSrd for gfx1250

* [hipblaslt] add correct num_records to BRD of gfx1250 (#958)

---------

Co-authored-by: Chang, Josh <Josh.Chang@amd.com>
Co-authored-by: Su, Andy <Andy.Su@amd.com>
Co-authored-by: Andy Su <andysu12@amd.com>
Co-authored-by: Huang, Mark <Mark.Huang@amd.com>
Co-authored-by: boringmorning <huangchen1999@gmail.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>

* Init srdB when calclating KRingShift
Fix hipblaslt-test cases: *bf16_rbf16_rbf16_rbf16_rf32_r_TN_128_128_2048*

* skip gfx1250 for kringshift test

* Fixed local write waitcnt calculation for f6 datatype

* Fallback to narrowing buffer load (GRVW=1) if no partial OOB. (#991)

Fix buffer load failure in tail loop.
Fallback to narrowing buffer load (GRVW=1) if partial OOB is unsupported by hardware.

* Add FP8/BF8 logic YAML & test coverage (#998)

* Fix v_pk_mul_f32 SGPR operand error on gfx1250

Add op_sel_hi=[1,0,1] modifier to VMulPKF32 for ScaleD to properly
broadcast scalar SGPR value on gfx1250 packed math instructions.

* Add F8/B8 related logic YAML

Add 80 GridBased logic YAML files to support FP8 and BF8 data types
on gfx1250. This includes:
- Multiple precision combinations: F8, B8, F8B8, B8F8
- Various output types: FP16, FP32, BF16, F8, B8
- All matrix layouts: NN, NT, TN, TT
- Epilogue support: Bias, ScaleA/B, ScaleC/D

* Enable F8/B8 test cases for gfx1250 in hipblaslt-test

Update gpu_arch filters in matmul_gtest.yaml and smoke_gtest.yaml to
include gfx1250, enabling the following test categories:
    "matmul_f8_bf8_dst_fp32"
    "matmul_f8_bf8_dst_bf16"
    "matmul_f8_dst_bf16"
    "matmul_f8_bf8_dst_f16"
    "matmul_f8_bf8_dst_fp32_gfx12"
    "matmul_f8_bf8_dst_fp16_gfx12"
    "matmul_f8_bf8_dst_bf16_gfx12"
    "matmul_real_1b_dst_f8_SCDInt1_gfx12"
    "matmul_real_1b_dst_f8_SCDNotInt_gfx12"
    "matmul_one_real_precisions_1b_gfx12"
    "matmul_f8_bf8_dst_fp16_gfx12_smoke"
    "matmul_f8_bf8_dst_bf16_gfx12_smoke"
    "matmul_real_1b_dst_f8_SCDInt1_gfx12_smoke"
    "matmul_real_1b_dst_f8_SCDNotInt_gfx12_smoke"

* Update logic YAML for F8/B8 related 1x1x1 solutions

Update Tensile logic YAML files across Ailk/Alik Bjlk/Bljk matrix
layout combinations for FP8/BF8 data types (B8F8, F8B8, B8, F8
variants including HS, BS, SS subtypes).

* Fix UseCustomMainLoopSchedule value type in F8/B8 logic YAMLs

Change UseCustomMainLoopSchedule from boolean `false` to integer `0`
across 80 gfx1250 GridBased logic YAML files for consistency with
the expected integer type.

* Changed `_UseSgprForGRO: false` to `_UseSgprForGRO: 0`

* Fix uninitialized union members in f6/bf6x16 conversion helpers

* Fix incorrect arg type of copy constructor for some ds_load inst

* Move F4/F6 init out of Runner into sample-specific files

* Add = default to Float6x16 default constructor

* Remove unused iType param from MXMFMAInstruction::typeConvert

* Refactor scaleA/B type dispatch to switch for -Wswitch coverage

* Remove duplicated HasWMMA_f8f6f4 capability

* Fix default cnt for SWaitTensorCnt to 0.

* Remove datatype examples that are not currently supported. (#1017)

Change the order of enum HIPBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE8M0_EXT.

* Reapply "[Tensilelite][Sparse] Enable plr min for spmm (#4364)"

This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.

* [hipSPARSELt] Fix packItemsM logic for gfx1250 to be aligned with other architectures

* [hipSPARSELt] Reuse single PackTemp VGPR for MIInputPerThUnroll==8 metadata packing

In the MIInputPerThUnroll==8 packing path, PackTemp's lifetime ends
before the second group of packing operations begins, so PackTemp can
be safely reused instead of requiring PackTemp+1. This aligns the code
generation with the VGPR allocation logic which only reserves 1 VGPR
for PackTemp. Also simplify the gfx1250 sparse PackTemp allocation
condition by removing the redundant MIInputPerThreadMetadata>1 check, since gfx1250 only has MIInputPerThreadMetadata = 4 or 8.

* Fix incorrect function name of computeInputType{A, B} calls

* Fix missing mxsb when rebasing

* Revert modifications for emulator (#1001)

Remove emulator parameter: ROCmAgentEnumeratorPath.

Remove env parameters in tox.ini.

Remove compile and emulator workarounds.

Set default CpuThreads back to -1.

* Enable HasF32XEmulation in gfx1250

* Fix XF32 LocalRead VGPR packing for gfx1250 WMMA V3

Fix incorrect results produced by XF32 emulation kernels on gfx1250.
The original XF32 codegen was written against gfx950 MFMA assumptions
that do not hold for gfx1250 WMMA V3:
1. LDS offsets used gfx950-specific hardcoded constants (4, 12).
   gfx1250 WMMA V3 needs a *2 unroll-stride formula shared with
   BF16/Half. Branch by ISA via calcGfx1250LdsOffset().
2. Pack logic assumed vgprPerInput ≤ 8 (single 8-VGPR group).
   gfx1250 has vgprPerInput=16 (two groups), producing interleaved
   [HI_g0, LO_g0, HI_g1, LO_g1]. Add v_swap_b32 to rearrange into
   contiguous [HI_all, LO_all] expected by 3-pass WMMA.
3. WMMA src offset hardcoded "+2"/"+4" for vgprPerInput 4/8.
   Replace with dynamic vgprPerInputA // 2 to yield "+8" on gfx1250.

* Fix XF32 Direct32XEmulation pack/WMMA scheduling data hazard in SIA3 on gfx1250

SIA3 scheduler interleaved pack and MAC instructions without respecting
data dependencies in the XF32 multigroup path, causing v_swap_b32 to
corrupt F32 values mid-packing, and WMMAs to consume partially-packed
VGPRs.

- Move v_swap_b32 rearrangement from MAC code into pack code so it
  stays ordered after all TF32_1/TF32_2 packing
- Fix destVgpr aliasing for UseDirect32XEmulation local reads
- Place all XF32 pack items before the first WMMA instead of
  distributing one chunk per MFMA slot

* Fix XF32 tail loop K-masking on gfx1250 WMMA V3

The tail loop K-masking logic was written against gfx950 MFMA geometry
where vgprPerInput ≤ 8 and BF16 inputs are packed (2 elements/VGPR).
gfx1250 WMMA V3 has vgprPerInput=16 and XF32 reads unpacked FP32
(1 element/VGPR), breaking two assumptions:

1. T0 VGPR addressing: gfx950 bk maps 1:1 to T0 slots. gfx1250
   Direct32X allocates T0 at half capacity (8 slots for 16 elements),
   so raw bk overflows into wrong tensor's registers.
   Fix: adjustedBk = (bk // 8) * 4 + (bk % 4).
2. K-to-VGPR mapping: gfx950 packed BF16 gives contiguous {0-7, 16-23}.
   gfx1250 unpacked FP32 + numVecUnroll=2 interleaving gives
   {0-3, 8-11, 16-19, 24-27}, zeroing wrong VGPRs for K=5-11, 21-27.
   Fix: vgprPerSet0Group=1, multiplyBy /= numVecUnroll, absolute K
   offsets per group.

* Disable ForceUnrollSubIter for F32X emulation

F32X emulation pack code performs destructive in-place VGPR conversion
(FP32 → BF16 high/low), which is incompatible with ForceUnrollSubIter's
sub-tiling that splits local reads and pack code across sub-iterations.
This caused validation failures with ScheduleIterAlg=1, MIWaveTile=[4,4],
and DepthU==MatrixInstK.

* Generalize MIInputPerThread for gfx1250 WMMA XF32

Hardcoded MIInputPerThread==8 assertion in LocalRead.py caused
AssertionError on gfx1250 (MIInputPerThread=16). Parameterize
TXInterleaveLayoutIdx, dynamically generate dsReadConvTable and
convArray to support any MIInputPerThread value.

* Disable UseMFMAF32XEmulation on WMMA-only ISAs (gfx1250)

UseMFMAF32XEmulation was unconditionally enabled for all F32X kernels,
causing gfx1250 (WMMA, no MFMA) to emit invalid v_wmma_f32_4x4x4_bf16
instructions. Gate the flag behind HasMFMA so WMMA architectures fall
through to the cvt+sub path instead.

* Fix lrvwTile not forced to 1 for non-MFMA XF32 (gfx1250 WMMA)

The blanket "(not UseF32XEmulation)" exemption skipped lrvwTile=1 forcing for
all XF32 paths, but only MFMA-based XF32 (gfx950) handles lrvwTile > 1
correctly. On gfx1250 WMMA, lrvwTile=2 produced incorrect local reads.

Refine the exemption: only UseMFMAF32XEmulation and CMS kernels may keep
lrvwTile > 1; non-MFMA XF32 paths are now forced to lrvwTile=1.

* Fix TF32EmuInterleaveTreg local read index for non-prefetch path

Problem: NT/TN format XF32 kernels produce inf/nan errors when
TF32EmuInterleaveTreg is enabled but doFullPackCodePrefetch is False
(PLR=0). The TXInterleaveLayoutIdx() function assumes the full prefetch
pack code layout, which is incompatible with the non-prefetch register layout.

Fix: Add conditional branching based on doFullPackCodePrefetch in the
TF32EmuInterleaveTreg handling. For the non-prefetch path (PLR=0),
use a simpler index calculation that maps the first half of each group
(withinGroup < 4) to T registers with a straightforward index formula:
idx = (idx // 8) * 4 + withinGroup. This matches the register layout
expected by the pack code when doFullPackCodePrefetch is False.

* Fix TF32 emulation T-register overlap in tail loop

Problem: TT and NN format kernels with DepthU=32 failed validation for tail loop.

Root cause: In macroAndSetF32XEmuTregSingle(), the T registers (vgprValuA_T0_I0,
vgprValuB_T0_I0) were defined using symbolic references relative to
vgprValu{A/B}_X0_I0_BASE:  .set vgprValuB_T0_I0, vgprValuB_X0_I0_BASE + 56
In the main loop, vgprValuB_X0_I0_BASE=34 gives T0=90 (correct).
In the tail loop, vgprValuB_X0_I0_BASE is redefined to 32, giving
T0=88 which overlaps with vgprValuA_T0_I0+6 (82+6=88).
This causes A's TF32 processing to corrupt B's T registers (v88-v89),
leading to incorrect WMMA results.

Fix: Use absolute startVgprCvt values instead of symbolic BASE-relative
offsets in RegSet. This ensures T register addresses remain correct
regardless of BASE redefinition in the tail loop.

* Enable XFP32 test coverage for gfx1250 in hipblaslt-test and tox

* Enable gradient, postprocessing, and fix CVT instructions for gfx1250. (#1042)

* Fix issues in cvt, enable gradient support for gfx1250.

Cherry-picked from PR #160:
- 35e2ef29dd (Fix issues in cvt and initial support on hhs gradient)
- 80cce534da (Enable bbs gradient and postprocessing)

* Fix wave32 and FP16 gradient issues on gfx1250
- Fix BF16 NaN check in writeBiasToGlobal for wave32
- Add fallback for FP16 sum unroll when dot2 is unavailable
- Enable gfx1250 gtests for dgelu, bgrada, bgradb

---------

Co-authored-by: George Tseng <george.tseng@amd.com>
Co-authored-by: Andy Su <andysu12@amd.com>

* Remove unused segmentsize (#1049)

* Fix conflicts in mem.hpp and add tests for DSStoreB192 and DSLoadB192.
Also fix a bug in countX.

(cherry picked from commit c2840f744630c301d70435984117bcd23fde17e5)

* Fix f_math.hpp/cpp and add a new test: test_functions.py

(cherry picked from commit 8968a906838dbfac7b0baffa96cd5a16f4dd405a)

* include fp6 bf6 header

* Resolve rocisa conflicts and add test_mfma.py

* [DO NOT MERGE] Revert commit 6d069916: restore emulator
 parameters and workarounds

Re-add ROCmAgentEnumeratorPath, tox.ini env parameters, compile/emulator
workarounds, and CpuThreads=0 that were removed in 6d06991. This is a
temporary revert for local emulator-based testing and should not be merged
upstream.

Reverted commit: 6d069916703b68e27bf67b10d6c4db2a0c69b8d2
Original message: "Remove emulator parameter: ROCmAgentEnumeratorPath.
Remove env parameters in tox.ini. Remove compile and emulator workarounds.
Set default CpuThreads back to -1."

Affected areas:
- clients/CMakeLists.txt, device-library/CMakeLists.txt
- efficiency_monitor.cpp
- Tensile Python modules (ClientWriter, GlobalParameters, LibraryIO, Tensile)
- tox.ini
- client/main.cpp
- 60+ YAML test config files (gemm/gfx12, sparse/gfx1250)

* [DO NOT MERGE] Enable FFM config on tox.ini

* [DO NOT MERGE] Add emulator parameter:  ROCmAgentEnumeratorPath in YAML

* [DO NOT MERGE] Allow multiple workers of tox on FFM

* [DO NOT MERGE] Fixed gradient related tests

* Fix unresolved merge conflict markers in AsmAddressCalculation.py

The file had leftover <<<<<<< ours / >>>>>>> theirs markers from
a prior merge that were accidentally committed. Keep the HasVgprMSB
conditional path which is the correct resolution.

Made-with: Cursor

* Fix unresolved merge conflict markers in StreamK.py, ClientWriter.py, Tensile.py, MatrixInstruction.py, and mfma.hpp

- StreamK.py: use _DepthU (adjusted for XF32) instead of raw kernel["DepthU"]
- ClientWriter.py: keep MX scale parameter names and type fields from ours
- Tensile.py: keep both --rocm-agent-enumerator and --mx-scale-format args
- MatrixInstruction.py: keep hasSMFMA guard with single-char key lookup
- mfma.hpp: remove commented-out conflict markers

Made-with: Cursor

* Fix MFMA validation key mismatch in MatrixInstruction.py

Use miDataType.toChar() (single-char key) for validMFMA lookup instead
of miDataTypeKey (two-char key), matching the dictionary's actual key
format. This restores the original working behavior that was broken
during the merge.

Made-with: Cursor

* Revert "[DO NOT MERGE] Fixed gradient related tests"

This reverts commit 5a07e1ab9d9d211624c2f4734afe2c67d19c6756.

* Revert "[DO NOT MERGE] Allow multiple workers of tox on FFM"

This reverts commit 220645c56e1a7a64663d675bda349779e23353bc.

* Revert "[DO NOT MERGE] Add emulator parameter:  ROCmAgentEnumeratorPath in YAML"

This reverts commit b3ee5ed532824f8e4244a5e15358cc3b47cf446d.

* Revert "[DO NOT MERGE] Enable FFM config on tox.ini"

This reverts commit de9d0c28578ecb3946f419076f564061b02756dc.

* Revert "[DO NOT MERGE] Revert commit 6d069916: restore emulator"

This reverts commit ab5741985cbcc20a3080f4ded1be4c3e0743f0b0.

---------

Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
Co-authored-by: Koji Nakajima <knakajim@amd.com>
Co-authored-by: Hongji Chen <hongjche@amd.com>
Co-authored-by: Alex Brown <alex.brown@amd.com>
Co-authored-by: Chun-Xun Lin <Chun-Xun.Lin@amd.com>
Co-authored-by: Koji Nakajima <75698246+nakajee@users.noreply.github.com>
Co-authored-by: pdhirajkumarprasad <160474250+pdhirajkumarprasad@users.noreply.github.com>
Co-authored-by: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com>
Co-authored-by: T.J. Alumbaugh <talumbau@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: amd-chunxlin <chunxlin@amd.com>
Co-authored-by: tomchengchitang <tomtang2@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Curtis Fu <yu.fu@amd.com>
Co-authored-by: yu-hsieh <Yu-cheng.Hsieh@amd.com>
Co-authored-by: marhuang_amdeng <marhuang@amd.com>
Co-authored-by: Stacey Lai <stacey.lai@amd.com>
Co-authored-by: George Tseng <george.tseng@amd.com>
Co-authored-by: Serge Lu <Serge.Lu@amd.com>
Co-authored-by: Huang, Mark <Mark.Huang@amd.com>
Co-authored-by: boringmorning <huangchen1999@gmail.com>
Co-authored-by: Yang, Anne <Anne.Yang@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Lin, Ed <Ed.Lin@amd.com>
Co-authored-by: Stacey Lai <stacelai@amd.com>
Co-authored-by: mengzcai_amdeng <Meng-Zhe.Cai@amd.com>
Co-authored-by: Vin Huang <vin.huang@amd.com>
Co-authored-by: jichang <jimmy.chang@amd.com>
Co-authored-by: Ho, Henry <Henry.Ho@amd.com>
Co-authored-by: Chang, Josh <Josh.Chang@amd.com>
Co-authored-by: Su, Andy <Andy.Su@amd.com>
Co-authored-by: Andy Su <andysu12@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
tomchengchitang pushed a commit that referenced this pull request Apr 16, 2026
* Enable v_cvt_f16_fp8 and add v_cvt_pk_f16_fp8

This patch also rename original option: Hascvtf16_fp8 as Hascvtf16_fp8_sf32.
Because the old Hascvtf16_fp8 is actually for v_cvt_scalef32_pk_f16_fp8.

* true16 modifier implementation

This patch includes true16 modifier implementation and related compiler
option.

* Optimize FP32 to FP16 with pack data by v_cvt_pk_f16_f32

* Make v_cvt_f32_f16 and v_cvt_f16_f32 support true16 syntax

This change modifies the relevant code, except in mixed mode scenarios.

* Support VCvtPkFP8toF32 and VCvtPkBF8toF32 with VOP3 in loopSum

* Use S_MOV_B32 to reset EXEC for 32-lane wavefronts

* Update test YAML for data conversion

This patch update test YAML for the following instructions:
v_cvt_f32_f16, v_cvt_f16_f32, v_cvt_pk_f16_f32, v_cvt_f32_bf16,
v_cvt_f32_fp8, v_cvt_f32_bf8, v_cvt_pk_f32_fp8, v_cvt_pk_f32_bf8,
v_cvt_pk_fp8_f32, v_cvt_pk_bf8_f32

* Minor modifications based on Copilot review recommendations

* Fix v_cvt_f32_bf16 build error and reviewer's suggesstions

* [hipSPARSELt] Enable 8-bit SPMM kernels (#123)

* Fix 8-bit sparse kernels by: 1. Adding i32-fp32 conversion and 2. fixing tail-loop dense b cndmask offset. Additionally update 16-bit yamls

* Fix 8-bit datatype packing issue.

* 1. Update sparse gfx1250 testing yamls. These are simplified to reduce tox testing time, but tested on local end. 2. More yamls will be added.

* Code refine, fix yamls and add metadatalayout test cases

* Add and revise test yamls.

* Fix i8 conversion issue when using GSU > 1. Add a missing yaml.

* Follow up - Refine i8 conversion issue when using GSU > 1. Add GSU2 parameter in yamls.

* remove whitespace in line 370 in GlobalWriteBatch.py

* Enable edge and tail loop. (#143)

Co-authored-by: George Tseng <george.tseng@amd.com>

* use ShortBranchMaxLength for SCLongBranchVccnz

* [hipBLASLt] extend the vgpr pool for the metatadata at OptNLL

* [hipBLASLt] refactor the vgpr allocation of sparse metadata.

* Don't use pack for metadata when LDSTr Metadata is enabled.

* Refine sparse test for i8, f8 and b8. (#156)

* fix buffer insts record for gfx1250 (#109)

* fix buffer insts record for gfx1250

---------

Co-authored-by: boringmorning <huangchen1999@gmail.com>
Co-authored-by: marhuang_amdeng <marhuang@amd.com>

* 6-bit transpose load revisited (#157)

* wip: basic tr6 ds load support

* Support glvw == 16 for FP6

* Fixed wrong VGPR padding for 6-bit gl and lr

* Implemented new thread mapping for ds tr6 load

* wip: add more test cases for BF6

* Reject solution if GRVW exceeds corrresponding side of MT

* Added F6 test cases for all transposes

* Fixed wrong tranpose settings for f6/b6 tests

* Fixed VGPR allocation for ds tr6 load

* Fixed conflicts in auto-merging

* enable extop for gfx11/12 (#129)

Co-authored-by: boringmorning <huangchen1999@gmail.com>

* Support edge/tail for FP6/BF6 (#161)

* BF6/FP6 TN Tail when glvw=16/32

* Fix shiftLrElements in TailLoop

* Add FP6/BF6 tail and edge testcases

* Simplify the tail/edge checks for FP6/BF6

* Refactor BPE Utility Functions and Fix Num Record Bitfield (#159)

* Refactor bpe utility functions from Python to C++

* Fix Num Record Bitfield for gfx125x

* MX : init and host validation: fix

* MX : codegen setting 1

* refactor calcLdsNumBytes

* MX: solution 2

* reorder setGlobalReadVectorWidth

* MX: solution 3

* MX: add UseGeneralizedNLCOne MXSA/B

* MX: LDS size calculation

* MX: codegen setting

* move ValuMXSAB vgpr to beginning of VGPR pool p1

* MX: add kernel argument

* MX : initKernel part

* MX : vgpr set

* move ValuMXSAB vgpr to beginning of VGPR pool p1

* MX: global read macro

* MX : local read offset

* MX: local write and global read offset

* MX: global read increaments

* MX: staggerU

* MX: global load

* MX: global read increament

* MX: wait

* MX: ds store

* MX: localWriteSwapOffsets

* MX: local read part

* MX: MFMA

* MX: lraDeclareAddresses

* MX: SIA3 pre processing

* MX: SIA3

* MX: SIA3 pack for MXSA/B

* MX: SIA3 pack

* fix: use _DepthUTc to replace DepthU

* MX: Tail: localWriteResetOffsets

* MX: Tail: vgpr alloc, remove staggerU, global read

* MX: tailLoopAllocValuVgpr

* MX: Tail: localRead init

* MX: Tail: local read

* MX: unsupported datatype exception

* MX: add test yaml

* F6 adjust LDS alignment

* MX: F4 use F8 scale type

* MX: F4: scale type E5M3

* MX: support MX with non-MX combination

* MX : initKernel part

* MX: Tail: remove dirty MX data

* MX: Tail: release MX valu vgpr after tail

* MX: Edge: MX Scale shiftptr align with noraml Buffer

* Shiftptr: fix Shiftptr limitation

* MX: update AssertSummationElementMultiple for MX

* F4: use 32x16 step 1

* F4: use 32x16: local read: add numTilePerInst loop

* F4: use 32x16: local read: fix offset

* F4: use 32x16: wmma tail: add numTilePerInst loop

* F4: use 32x16: wmma tail

* F4: use 32x16: store D

* F4: use 32x16: vectorwidth

* F4: use 32x16: multi wave

* F4: use 32x16: local read: use MIInputPerThUnroll

* F4: use 32x16: local read: add numTilePerInst loop

* F4: use 32x16: local read: dstr offset

* F4: use 32x16: local read: fix wave offset

* MX: F4: fix tmpVGPR allocation

* MXF4: support 32x16 instruction

* F4: use 32x16: multi wave: NT, TT

* F4: use 32x16: wmma tail

* MX: fix MX buffer load length

* MX: HipBlasLt: add MXE8B32 F8 yamls

* MX: HipBlasLt: add MXE8B32B B8 yamls

* MX: HipBlasLt: add MXE8B16 F8 yamls

* MX: HipBlasLt: add MXE8B16 B8 yamls

* MX: HipBlasLt: add MXE8B16 F6 yamls

* MX: HipBlasLt: add MXE8B16 B6 yamls

* MX: HipBlasLt: add MXE8B32 B6 yamls

* MX: HipBlasLt: add MXE8B32 F6 yamls

* MX: HipBlasLt: add MXE8B32 F4 yamls

* MX: HipBlasLt: add MXF8B32 F4 yamls

* MX: HipBlasLt: add MXE5M3B32 F4 yamls

* MX: HipBlasLt: add MXE5M3B16 F4 yamls

* MX: HipBlasLt: add MXF8B16 F4 yamls

* MX: HipBlasLt: add MXE8B16 F4 yamls

* MX: HipBlasLt: add MXE8B32/MXF8B16 sample

* MX: HipBlasLt: support MXE8B32/MXF8B16 F4

* MX: HipBlasLt: support MXF8B32/MXE8B16 F4

* MX: HipBlasLt: add MXF8B32/MXE8B16 sample

* MX: HipBlasLt: support MXF8B32/MXE8B16 F4

* MX: HipBlasLt: add MXE5M3B32/MXE5M3B16 sample

* EfficiencyMonitor: do nothing in set_device_id if not enabled

* benchmark: enable F4/F6 type without block scale

* MX: hipblaslt-bench: support MXE8B32/MXE8B16 F4

* MX: hipblaslt-bench: support all F4 mx scale type

* MX: hipblaslt-bench: F8/F6/F4 smoke gtest

* MX: hipblaslt-bench: limit hipblaslt_e8 rand_int/hpl init value

* Fix hipblaslt-test failed with HHS

This patch fix the following hipblaslt-test test case (HHS):
smoke_matmul_bias_relu_SAV_smoke_f16_rf16_rf16_rf16_rf32_r_relu*

* f8f6f6 mix mode on hipblaslt

* Added TensorLoadToLds for rocisa

* WIP: F8 codegen for TDM

* WIP: several bug fixes for TDM

* WIP: TDM wave separated initial support

* WIP: TDM uses up to 12 SGPRs now

* WIP: TDM optimize lds address swap

* WIP: TDM supports FP4 now

* WIP: temporarily integrate regular MX and TDM A&B

* Fix bugs in UseSgprForGRO.

* WIP: remove g2l vgpr allocations for A & B when TDM enabled

* WIP: fixed waitcnt calculation when TDM enabled

* WIP: TDM supports LdsAlignPow2 is False

* WIP: reject non-TN TDM kernels

* Fix large LDS logic (#189)

* large LDS follow gfx950 logic

* fix calculation of # local read/write calculation, lessen StoreSwapAddr enable constraits

---------

Co-authored-by: boringmorning <huangchen1999@gmail.com>

* [hipBLASLt] cherry-pick TDM MX changes (#411)

* WIP: TDM for MX buffers

* WIP: fixed TDM lds swap for MXSA and MXSB

* WIP: reduce VGPR usage when TDM enabled

* WIP: fixed TDM MT selection for MX buffer

* WIP: adjusted TDM LDS swap implementation for StoreSwapAddr

* WIP: several minor fixes for TDM codegen

* Fixed incompatible s_waitcnt parameter

* TDM: fix MXS staggerU

* TDM padding (#423)

* Support TDM padding
* Add padding checks for TDM
* Add mxf8ss_tdm test yaml

* fix LdsBlockSizePerPadB reject condition

* [hipBlasLt] remove mix mode yamls and tests

* fix comment in hipblaslt.h

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove KWS keyword

* fix helper.h typo

* re-order header list in include CMakeLists.txt

* remove some comment

* fix build break

* removd DepthU 128 form mxf8ss_tdm.yaml

* [hipSPARSELt] Enable Depth-U=256 in 8-bit SPMM kernels (#166)

* Fix SPMM DU256 issue. Add DepthU=256 test param in yamls.

* Update spmm_i8 yaml.

* Refine metadata iteration variables based on commit 45ec843.

* Fix vgprPerInputM calculation.

* Update spmm 8-bit yamls from clr0 to clr1.

* [hipSPARSELt] Fix Solution.py on gfx1250. (#231)

* Fix Solution.py on gfx1250.

* Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py

* Upload gfx1250 logic yamls to hipSPARSELt.

* Update gfx1250 logic yaml. Fix inconsistent yamls, and disable UserArgs.

* [Workaround] Bypass metadata layout when build hipSPARSELt.

* [hipSPARSELt] Update setComputeInputType to setComputeInputTypeA/B in tensile_host.cpp

* [hipSPARSELt] Fix sparse b tail loop mask issue.

* [hipSPARSELt] Patch reduced-size smoke tests

* [hipSPARSELt] Fix 8-bit metadata vGPRS number and idx. Add i8is support when using bias.

* [hipSPARSELt][Patch] Fix bugs found on PR#633 based on PR#682

* Revert "[Temp] disable spmm test, since spmm is not ready."

This reverts commit 8b1ab52.

* [hipSPARSELt] Add gfx1250 I8IS (Int8->Int32) Tensile library logic for hipsparselt

Add 8 GridBased library logic YAML files for I8 input / I32 output
sparse matrix multiplication on gfx1250, covering all 4 transpose
combinations (NN, NT, TN, TT) for both Sparse-A and Sparse-B.

* [hipSPARSELt] Fix KeyError 'UnrollMajorLDSMetadata' for non-sparse kernels, and test arch typo in spmm_fp16_ml1.yaml

Add sparse guard before accessing UnrollMajorLDSMetadata in
KernelWriter.py and KernelWriterAssembly.py. This key is only set
for sparse kernels in Solution.py, but was accessed unconditionally,
causing KeyError for non-sparse configs (e.g. sgemm_xf32_asm).

Also fix typo in spmm_fp16_ml1.yaml: skip-gfx942m -> skip-gfx942
so the gfx1250-only test is correctly skipped on gfx942 machines.

* fix: use _DepthUTc to replace DepthU

* handle float bpe

* check mxBlock in prombleType before use it

* typo in DSStoreB256

* modify auto assign LocalReadVectorWidthA/B

* reject condition use LocalReadVectorWidthA/B

* [hipSPARSELt][Tmp] Disable autoVectorWidthA/B in gfx1250

* Fix NameError SMovBX is not defined

* Fix invalid assembly token '**' in GSU duBpe calculation

mult_MI_Dim already contains a leading '*' (e.g. "*MI_M"), so the
f-string should not add another '*', which produced invalid assembly
like "64**MI_M" instead of "64*MI_M".

* Fix undefined MI_M/MI_N assembly symbols in GSU swizzle code

The GSU code in computeLoadSrd and graIncrements emitted string
literals "*MI_M" / "*MI_N" as immediate operands in assembly
instructions, but these are not valid assembler symbols and caused
"expected relocatable expression" errors during .s -> .o assembly.

Resolve MI_M/MI_N to their actual numeric values at code-generation
time by multiplying with kernel["MatrixInstM"] / kernel["MatrixInstN"]
in Python, so the assembler only sees plain integer immediates.

* Reject SwizzleTensor with NumLoadsCoalesced > 1 (swizzleA.yaml)

When SwizzleTensorA/B is enabled with NumLoadsCoalesced > 1, the
tile offset stride calculation in graTileOffsets uses an inter-wave
stride (numKr * WvG * swzBlockSize) that does not correctly map the
second coalesced read for edge cases where MacroTile > actual matrix
dimension (e.g., MT_M=256 with M=128). This causes the second tile
read to access incorrect memory locations, resulting in numerical
accuracy failures.

All 58 FAILED tests in swizzleA.yaml had NLCA=2 + SwizzleTensorA +
MT_M > M, while identical kernels with NLCA=1 passed. This bug was
introduced during gfx1250 SwizzleTensor development where the NLCA>1
edge case in the swizzle global read path was not considered.

Reject this configuration until the swizzle tile offset stride is
properly fixed for multi-coalesced-load edge handling.

* Fix epilogue ScaleAB load condition in fp8nfp16mix_fp8nss.yaml

The epilogue ScaleAB loading code compared DataType with itself
(always true) instead of with MacDataType, causing it to
unconditionally load ScaleA from AddressScaleA even when
preloadScaleA=True (DataTypeA > MacDataTypeA).

Fix: Change the inner loop condition from DataType<=DataType (nop)
to DataType<=MacDataType.

* Fix DTL correctness for FP16/FP8: cast float bpe values to int for assembly operands

* Fixed lds padding sanity check

* Fix the logic path for VCvtF32toF16

* Fix incorrect SGPR global read offset for DirectToVgpr with UseSgprForGRO

In computeScalarGroImpl(), the DirectToVgpr path used the unsuffixed
kernel["LocalReadVectorWidth"] (defaults to -1) instead of the tensor-
specific kernel["LocalReadVectorWidth{A,B}"]. This produced negative
unrollStride values, corrupting ScalarGlobalReadOffset in the generated
assembly and causing incorrect GEMM results.

Test: dtv_gfx90a.yaml (DirectToVgprA=1, UseSgprForGRO=1, GRVWA=1, GRVWB=1)

* convert offset from float to int before passing into DSModifiers

* Fixed incorrect enum values

* Removed deprecated rejection condition for swizzle kernel

* [hipSPARSELt] Seperate SPMM  wider localRead conditions from dense. Disable reject Sparse A kernel only support PGR with EPS=1

* Fix: skip Formocast prediction when analyticalHardware is unavailable

The tensilelite-client crashed with a segfault in
AllSolutionsIterator::preProblem when running bf16_tn_gfx12_predict.yaml
on gfx1200. The root cause is that origami does not support gfx1200, so
analyticalHardware (shared_ptr<origami::hardware_t>) is never initialized
in HipAMDGPU. The getHardware() helper then dereferences a null pointer.

* Fix NameError in SIA2: replace undefined instPerPack with instPerPackA/instPerPackB

The variable instPerPack was renamed to instPerPackA and instPerPackB
during a prior refactor, but 9 references in the scheduleIterAlg == 2
code path of _makeSubIterSchedule were not updated. This caused a
NameError during assembly kernel generation for gfx1200 (RDNA4) targets,
which are the only architectures with library configs using SIA2.

* convert tailloopInNllmaxUnit from float to int

* Fix DVT tail loop validation failure on gfx12 by adding missing bpe multiply

The GLOBAL_OFFSET macro was refactored to move the BPE multiplication out of
the macro and into each call site. However, the globalReadGuardK function's
call to GLOBAL_OFFSET for computing the max valid address offset in DVT tail
loops was not updated to include the external bpe multiply.

* Skip F4 related test YAML for gfx1200

* Fix swizzle tensor SRD limit alignment in computeLoadSrd

MX integration refactored the IndicesSummation check into a nested if/elif
chain, making the swizzle alignment path unreachable.
Swizzled tensor A used raw SizeL-1 instead of alignTo(SizeL, swzStride)-1,
causing SRD limit to be too small and buffer_load to return zeros at K
boundaries. Fixes 58 NLCA=2 test failures.

* Fix missing arguments and conditions

* remove added testcases in matmul_bias_vector_dst_fp16_32

* Update test to accept VEC16_UE4M3 scale mode in SetAttribute

The MX F4 support commits moved VEC16_UE4M3 from the rejection list to an
accepted case (mapped to ScalingFormat::Block_16_UE4M3) for MXFP4 on
gfx1250, but the test was not updated accordingly.

Fix the test to expect SUCCESS and verify the set/get round-trip.

Test: hipblaslt-test --gtest_filter=*pre_checkin_aux_matmul_set_get_attr_f16_r*

* Revert "Patch hipsparselt build system to use local tensilelite"

This reverts commit dedf0a070911e739d3d049c059ba1e6903f4cea9.

* [hipSPARSELt] Enable gfx1250

* [hipSPARSELt][Tmp] Skip use_sgpr_for_gro test cases on gfx950

* remove deprecated variable: LdsAlignPow2

* TF32: remove deprecated code

* Workaround for CMS to support separate LRVWA/B

* [hipSPARSELt][Workaround] Correct wlrMultiple calculation for sparse WLR rejection

For sparse A (Sparse==1), MIInputPerThreadA already accounts for the
compressed data, but LocalReadVectorWidthA is based on the original
(uncompressed) data size. Divide by an extra factor of 2 to correctly
compute wlrMultiple for sparse A.

Similarly for sparse B (Sparse==2), apply the same correction to
wlrMultiple based on LocalReadVectorWidthB.

Also clarify rejection messages to distinguish A/B tensor.

* Fix: stuck in _makeSubIterSchedule() when build gfx1200 SIA2 kernel

root cause: packItems is non-empty and it never check instPerPackB to pop items

* Fix local read waitcnt

* Revert "[Tensilelite][Sparse] Enable plr min for spmm (#4364)"

This reverts commit cb81c2e.

* Revert "[hipSPARSELt][Tmp] Skip use_sgpr_for_gro test cases on gfx950"

This reverts commit 8f32694cd27fc4554c82ac0fe8f4e4ca553e2510.

* Fix lrvw undefined caused by conflicts resolving

* Fix activation args type mismatch causing segfault in debug build

prob.act0/act1 (float) were pushed into ConstantVariant without casting to
the compute type. When compute type is Int32 (i8->i32 GEMM), subsequent
std::get_if<int32_t> on a float-holding variant returned nullptr, causing
SIGSEGV on dereference. Release builds masked this UB via -O3 optimizations.

Affects: ./build/release/clients/hipblaslt-test --gtest_filter=*gemm_i8_dst_i32_94x*

* Enable RocRoller for hipblaslt and fix host build issue

* remove redundant pack code in xfp32 kernels

* pick all fix from gfx1250 to develop-gfx1250-open-source (#969)

* [hipblaslt] fix ds_bpermute_b32 msb computation and s_set_vgpr_msb (#910)

* Add gfx1250 HHS AuxH yamls for gtest (#911)

Co-authored-by: Andy Su <andysu12@amd.com>

* fix ExtOpLayerNorm test (#935)

enable extop for gfx11/12 (#129)

Co-authored-by: Huang, Mark <Mark.Huang@amd.com>
Co-authored-by: boringmorning <huangchen1999@gmail.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>

* [hipblaslt] fix index mode discarding valid algos when pool size < batch size (#960)

getAlgosFromIndex returns INVALID_VALUE when any requested index
exceeds the pool size, even though valid algos are still populated
in the output. Previously the caller broke immediately on
INVALID_VALUE, discarding those valid results. Now process the
returned algos before exiting the loop.

* [Tensilelite] modify NumRecords of E and BiasSrd for gfx1250 (#957)

modify NumRecords of E and BiasSrd for gfx1250

* [hipblaslt] add correct num_records to BRD of gfx1250 (#958)

---------

Co-authored-by: Chang, Josh <Josh.Chang@amd.com>
Co-authored-by: Su, Andy <Andy.Su@amd.com>
Co-authored-by: Andy Su <andysu12@amd.com>
Co-authored-by: Huang, Mark <Mark.Huang@amd.com>
Co-authored-by: boringmorning <huangchen1999@gmail.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>

* Init srdB when calclating KRingShift
Fix hipblaslt-test cases: *bf16_rbf16_rbf16_rbf16_rf32_r_TN_128_128_2048*

* skip gfx1250 for kringshift test

* Fixed local write waitcnt calculation for f6 datatype

* Fallback to narrowing buffer load (GRVW=1) if no partial OOB. (#991)

Fix buffer load failure in tail loop.
Fallback to narrowing buffer load (GRVW=1) if partial OOB is unsupported by hardware.

* Add FP8/BF8 logic YAML & test coverage (#998)

* Fix v_pk_mul_f32 SGPR operand error on gfx1250

Add op_sel_hi=[1,0,1] modifier to VMulPKF32 for ScaleD to properly
broadcast scalar SGPR value on gfx1250 packed math instructions.

* Add F8/B8 related logic YAML

Add 80 GridBased logic YAML files to support FP8 and BF8 data types
on gfx1250. This includes:
- Multiple precision combinations: F8, B8, F8B8, B8F8
- Various output types: FP16, FP32, BF16, F8, B8
- All matrix layouts: NN, NT, TN, TT
- Epilogue support: Bias, ScaleA/B, ScaleC/D

* Enable F8/B8 test cases for gfx1250 in hipblaslt-test

Update gpu_arch filters in matmul_gtest.yaml and smoke_gtest.yaml to
include gfx1250, enabling the following test categories:
    "matmul_f8_bf8_dst_fp32"
    "matmul_f8_bf8_dst_bf16"
    "matmul_f8_dst_bf16"
    "matmul_f8_bf8_dst_f16"
    "matmul_f8_bf8_dst_fp32_gfx12"
    "matmul_f8_bf8_dst_fp16_gfx12"
    "matmul_f8_bf8_dst_bf16_gfx12"
    "matmul_real_1b_dst_f8_SCDInt1_gfx12"
    "matmul_real_1b_dst_f8_SCDNotInt_gfx12"
    "matmul_one_real_precisions_1b_gfx12"
    "matmul_f8_bf8_dst_fp16_gfx12_smoke"
    "matmul_f8_bf8_dst_bf16_gfx12_smoke"
    "matmul_real_1b_dst_f8_SCDInt1_gfx12_smoke"
    "matmul_real_1b_dst_f8_SCDNotInt_gfx12_smoke"

* Update logic YAML for F8/B8 related 1x1x1 solutions

Update Tensile logic YAML files across Ailk/Alik Bjlk/Bljk matrix
layout combinations for FP8/BF8 data types (B8F8, F8B8, B8, F8
variants including HS, BS, SS subtypes).

* Fix UseCustomMainLoopSchedule value type in F8/B8 logic YAMLs

Change UseCustomMainLoopSchedule from boolean `false` to integer `0`
across 80 gfx1250 GridBased logic YAML files for consistency with
the expected integer type.

* Changed `_UseSgprForGRO: false` to `_UseSgprForGRO: 0`

* Fix uninitialized union members in f6/bf6x16 conversion helpers

* Fix incorrect arg type of copy constructor for some ds_load inst

* Move F4/F6 init out of Runner into sample-specific files

* Add = default to Float6x16 default constructor

* Remove unused iType param from MXMFMAInstruction::typeConvert

* Refactor scaleA/B type dispatch to switch for -Wswitch coverage

* Remove duplicated HasWMMA_f8f6f4 capability

* Fix default cnt for SWaitTensorCnt to 0.

* Remove datatype examples that are not currently supported. (#1017)

Change the order of enum HIPBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE8M0_EXT.

* Reapply "[Tensilelite][Sparse] Enable plr min for spmm (#4364)"

This reverts commit e5febecfc7fe086563625ee3e284aa617f53beda.

* [hipSPARSELt] Fix packItemsM logic for gfx1250 to be aligned with other architectures

* [hipSPARSELt] Reuse single PackTemp VGPR for MIInputPerThUnroll==8 metadata packing

In the MIInputPerThUnroll==8 packing path, PackTemp's lifetime ends
before the second group of packing operations begins, so PackTemp can
be safely reused instead of requiring PackTemp+1. This aligns the code
generation with the VGPR allocation logic which only reserves 1 VGPR
for PackTemp. Also simplify the gfx1250 sparse PackTemp allocation
condition by removing the redundant MIInputPerThreadMetadata>1 check, since gfx1250 only has MIInputPerThreadMetadata = 4 or 8.

* Fix incorrect function name of computeInputType{A, B} calls

* Fix missing mxsb when rebasing

* Revert modifications for emulator (#1001)

Remove emulator parameter: ROCmAgentEnumeratorPath.

Remove env parameters in tox.ini.

Remove compile and emulator workarounds.

Set default CpuThreads back to -1.

* Enable HasF32XEmulation in gfx1250

* Fix XF32 LocalRead VGPR packing for gfx1250 WMMA V3

Fix incorrect results produced by XF32 emulation kernels on gfx1250.
The original XF32 codegen was written against gfx950 MFMA assumptions
that do not hold for gfx1250 WMMA V3:
1. LDS offsets used gfx950-specific hardcoded constants (4, 12).
   gfx1250 WMMA V3 needs a *2 unroll-stride formula shared with
   BF16/Half. Branch by ISA via calcGfx1250LdsOffset().
2. Pack logic assumed vgprPerInput ≤ 8 (single 8-VGPR group).
   gfx1250 has vgprPerInput=16 (two groups), producing interleaved
   [HI_g0, LO_g0, HI_g1, LO_g1]. Add v_swap_b32 to rearrange into
   contiguous [HI_all, LO_all] expected by 3-pass WMMA.
3. WMMA src offset hardcoded "+2"/"+4" for vgprPerInput 4/8.
   Replace with dynamic vgprPerInputA // 2 to yield "+8" on gfx1250.

* Fix XF32 Direct32XEmulation pack/WMMA scheduling data hazard in SIA3 on gfx1250

SIA3 scheduler interleaved pack and MAC instructions without respecting
data dependencies in the XF32 multigroup path, causing v_swap_b32 to
corrupt F32 values mid-packing, and WMMAs to consume partially-packed
VGPRs.

- Move v_swap_b32 rearrangement from MAC code into pack code so it
  stays ordered after all TF32_1/TF32_2 packing
- Fix destVgpr aliasing for UseDirect32XEmulation local reads
- Place all XF32 pack items before the first WMMA instead of
  distributing one chunk per MFMA slot

* Fix XF32 tail loop K-masking on gfx1250 WMMA V3

The tail loop K-masking logic was written against gfx950 MFMA geometry
where vgprPerInput ≤ 8 and BF16 inputs are packed (2 elements/VGPR).
gfx1250 WMMA V3 has vgprPerInput=16 and XF32 reads unpacked FP32
(1 element/VGPR), breaking two assumptions:

1. T0 VGPR addressing: gfx950 bk maps 1:1 to T0 slots. gfx1250
   Direct32X allocates T0 at half capacity (8 slots for 16 elements),
   so raw bk overflows into wrong tensor's registers.
   Fix: adjustedBk = (bk // 8) * 4 + (bk % 4).
2. K-to-VGPR mapping: gfx950 packed BF16 gives contiguous {0-7, 16-23}.
   gfx1250 unpacked FP32 + numVecUnroll=2 interleaving gives
   {0-3, 8-11, 16-19, 24-27}, zeroing wrong VGPRs for K=5-11, 21-27.
   Fix: vgprPerSet0Group=1, multiplyBy /= numVecUnroll, absolute K
   offsets per group.

* Disable ForceUnrollSubIter for F32X emulation

F32X emulation pack code performs destructive in-place VGPR conversion
(FP32 → BF16 high/low), which is incompatible with ForceUnrollSubIter's
sub-tiling that splits local reads and pack code across sub-iterations.
This caused validation failures with ScheduleIterAlg=1, MIWaveTile=[4,4],
and DepthU==MatrixInstK.

* Generalize MIInputPerThread for gfx1250 WMMA XF32

Hardcoded MIInputPerThread==8 assertion in LocalRead.py caused
AssertionError on gfx1250 (MIInputPerThread=16). Parameterize
TXInterleaveLayoutIdx, dynamically generate dsReadConvTable and
convArray to support any MIInputPerThread value.

* Disable UseMFMAF32XEmulation on WMMA-only ISAs (gfx1250)

UseMFMAF32XEmulation was unconditionally enabled for all F32X kernels,
causing gfx1250 (WMMA, no MFMA) to emit invalid v_wmma_f32_4x4x4_bf16
instructions. Gate the flag behind HasMFMA so WMMA architectures fall
through to the cvt+sub path instead.

* Fix lrvwTile not forced to 1 for non-MFMA XF32 (gfx1250 WMMA)

The blanket "(not UseF32XEmulation)" exemption skipped lrvwTile=1 forcing for
all XF32 paths, but only MFMA-based XF32 (gfx950) handles lrvwTile > 1
correctly. On gfx1250 WMMA, lrvwTile=2 produced incorrect local reads.

Refine the exemption: only UseMFMAF32XEmulation and CMS kernels may keep
lrvwTile > 1; non-MFMA XF32 paths are now forced to lrvwTile=1.

* Fix TF32EmuInterleaveTreg local read index for non-prefetch path

Problem: NT/TN format XF32 kernels produce inf/nan errors when
TF32EmuInterleaveTreg is enabled but doFullPackCodePrefetch is False
(PLR=0). The TXInterleaveLayoutIdx() function assumes the full prefetch
pack code layout, which is incompatible with the non-prefetch register layout.

Fix: Add conditional branching based on doFullPackCodePrefetch in the
TF32EmuInterleaveTreg handling. For the non-prefetch path (PLR=0),
use a simpler index calculation that maps the first half of each group
(withinGroup < 4) to T registers with a straightforward index formula:
idx = (idx // 8) * 4 + withinGroup. This matches the register layout
expected by the pack code when doFullPackCodePrefetch is False.

* Fix TF32 emulation T-register overlap in tail loop

Problem: TT and NN format kernels with DepthU=32 failed validation for tail loop.

Root cause: In macroAndSetF32XEmuTregSingle(), the T registers (vgprValuA_T0_I0,
vgprValuB_T0_I0) were defined using symbolic references relative to
vgprValu{A/B}_X0_I0_BASE:  .set vgprValuB_T0_I0, vgprValuB_X0_I0_BASE + 56
In the main loop, vgprValuB_X0_I0_BASE=34 gives T0=90 (correct).
In the tail loop, vgprValuB_X0_I0_BASE is redefined to 32, giving
T0=88 which overlaps with vgprValuA_T0_I0+6 (82+6=88).
This causes A's TF32 processing to corrupt B's T registers (v88-v89),
leading to incorrect WMMA results.

Fix: Use absolute startVgprCvt values instead of symbolic BASE-relative
offsets in RegSet. This ensures T register addresses remain correct
regardless of BASE redefinition in the tail loop.

* Enable XFP32 test coverage for gfx1250 in hipblaslt-test and tox

* Enable gradient, postprocessing, and fix CVT instructions for gfx1250. (#1042)

* Fix issues in cvt, enable gradient support for gfx1250.

Cherry-picked from PR #160:
- 35e2ef29dd (Fix issues in cvt and initial support on hhs gradient)
- 80cce534da (Enable bbs gradient and postprocessing)

* Fix wave32 and FP16 gradient issues on gfx1250
- Fix BF16 NaN check in writeBiasToGlobal for wave32
- Add fallback for FP16 sum unroll when dot2 is unavailable
- Enable gfx1250 gtests for dgelu, bgrada, bgradb

---------

Co-authored-by: George Tseng <george.tseng@amd.com>
Co-authored-by: Andy Su <andysu12@amd.com>

* Remove unused segmentsize (#1049)

* include fp6 bf6 header

* Don't use hip f6x16 until it is ready

* Fix: Python 3.10 doesn't support [] inside f-string expressions

* clear wmma_v2 flag if wmma_v3 is detected

* only set new msb value untile compiler support it

* added int and uint32_t constructors to three packed floating-point types

* revert the tests for backward path for gfx1250

* [tensilelite] skip 120x for gfx1250 tox tests

* support mix mode in test_CustomSchedule.py

* Add workaround for size mismatch

* Extend size mismatch workaround to FP4 types

* [hipSPARSELt] Reapply: Make extops and matrix-transform subdirectories conditional

Guard add_subdirectory(extops) and add_subdirectory(matrix-transform)
with HIPBLASLT_ENABLE_EXTOPS and HIPBLASLT_ENABLE_MATRIX_TRANSFORM flags
respectively, allowing builds to opt out of these components.

* Revert "[hipSPARSELt] Patch reduced-size smoke tests"

This reverts commit 462cc81.

* 1. Guard packed-type getElement overload with #if to fix HIP device compile error
2. Fix -Wswitch warnings for unhandled Float6/BFloat6/Float4 enum values
3. Typo in BadInput/BadOutput error messages for Float6x16 and BFloat6x16
4. Add WIN32 macro

* Remove NB_SHARED from rocisa to fix Windows DLL import failure

* Fix MX on Windows

---------

Co-authored-by: Stacey Lai <stacey.lai@amd.com>
Co-authored-by: Lin, Ed <Ed.Lin@amd.com>
Co-authored-by: Wu, Brianna <Brianna.Wu@amd.com>
Co-authored-by: George Tseng <george.tseng@amd.com>
Co-authored-by: yu-hsieh <Yu-cheng.Hsieh@amd.com>
Co-authored-by: Vin Huang <vin.huang@amd.com>
Co-authored-by: Huang, Mark <Mark.Huang@amd.com>
Co-authored-by: boringmorning <huangchen1999@gmail.com>
Co-authored-by: marhuang_amdeng <marhuang@amd.com>
Co-authored-by: Lu, Serge <Serge.Lu@amd.com>
Co-authored-by: Yang, Anne <Anne.Yang@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Stacey Lai <stacelai@amd.com>
Co-authored-by: mengzcai_amdeng <Meng-Zhe.Cai@amd.com>
Co-authored-by: jichang <jimmy.chang@amd.com>
Co-authored-by: Ho, Henry <Henry.Ho@amd.com>
Co-authored-by: Chang, Josh <Josh.Chang@amd.com>
Co-authored-by: Su, Andy <Andy.Su@amd.com>
Co-authored-by: Andy Su <andysu12@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
aledudek pushed a commit that referenced this pull request May 20, 2026
* Fix Solution.py on gfx1250.

* Correct typo in projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants