Skip to content

[hipblaslt] Add support for gfx950 mxfp4#6499

Merged
bnemanich merged 251 commits into
developfrom
users/nakajee/gfx950_mx_rebase_merge
May 2, 2026
Merged

[hipblaslt] Add support for gfx950 mxfp4#6499
bnemanich merged 251 commits into
developfrom
users/nakajee/gfx950_mx_rebase_merge

Conversation

@nakajee
Copy link
Copy Markdown
Contributor

@nakajee nakajee commented Apr 16, 2026

Merge gfx950 MX (Microscaling) work onto develop

Summary

This brings end-to-end Microscaling (MX) data type support —
MXFP4, MXFP6 (FP6 / BF6), and MXFP8 with per-block scale tensors
(MXSA / MXSB) — into hipBLASLt's TensileLite kernel generator and host code
for the gfx950 architecture.

Motivation

MX data types (OCP Microscaling) are required for gfx950 mixed-precision
matmul (FP4 / FP6 / FP8 with shared block scales) in TensileLite.

What's included

1. MX data type plumbing (Tensile + library)

  • Adds packed Float4x2, Float6x32, BFloat6x32 types and registers
    TypeInfo specializations for them; retains _EXT / non-_EXT variants
    for source compatibility.
  • Threads MacDataTypeA / MacDataTypeB and MX block sizes
    (MXBlockA / MXBlockB) through ProblemType, Solution, predicates,
    and the kernel writer so A and B can carry independent input types.
  • Introduces a per-call scale type to generateMXInput and the host
    mxDataGen plumbing.

2. MFMA / rocisa codegen for MX on gfx950

  • New MFMA InstTypes: INST_F4, INST_F6, INST_BF6, INST_F8_F6,
    INST_F6_F8, INST_F8_F4, INST_F4_F8, INST_F6_B6, INST_B6_F6,
    INST_F6_F4, INST_F4_F6, INST_B6_F4, INST_F4_B6, with the matching
    cbsz/blgp permutation hints.
  • MXMFMAInstruction constructor reorganized so acc2, mxsa, mxsb,
    scale types, and block are defaulted; emit path now respects HasMFMA
    capability.
  • validMFMA / validSMFMA keys switched from single-character (H, S,
    B, F8, …) to two-character (HH, SS, BB, F8F8, F8B8, F8F4,
    F4F4, F6F6, B6B6, F8F6, F6F4, …) so asymmetric A/B input types
    in mixed-MX MFMAs validate correctly.
  • New 16x16x128 / 32x32x64 instruction shapes added for FP4/FP6/FP8 mixes.

3. gfx950-specific kernel generation fixes

  • gfx950 + MX + DTL2/DTL3, MXFP4 + GSU, MXFP4 + StreamK
    (sk_mx32f4_quick, sk_mx32f8_quick), MXFP4 + TailLoop (K multiple of
    32), MXFP4 + DtlPlusLdsBuf, MXFP4 + 64b shadow limit.
  • staggerUCode=False restricted to MX on gfx950 only.
  • DirectToLds re-enabled for MXSA/MXSB and LdsPad re-enabled for
    MXFP4 + DirectToLds.
  • Workaround for duplicateFactor / MIInputPerThreadMXSA/B on gfx950.
  • NonDTLTailLoopA/B and NonDTLTailLoopMXSA/B paths added for non-TLU
    • MX; tailLoopOpt* temporarily disabled for MX (TODO to re-enable).
  • Minimum AssertSummationElementMultiple = 32 enforced for MX with
    non-TLU (host code limitation; tracked).
  • MX scaling data shuffle moved into Tensile.
  • StreamK + MX scaling correctness fixes.
  • SIA / GSU / LSU / LocalRead / LRVW / GRVW / Solution validation tightened
    around MX; Use64bShadowLimitMX, StoreSwapAddr for MXFP4, GRVWMXSA/B
    adjustment for DTL.

4. Host, client, and test framework

  • tensile_host.cpp: HIP_R_*_EXT mappings replaced with the canonical
    HIP_R_4F_E2M1 / HIP_R_6F_E2M3 / HIP_R_6F_E3M2; HIP_C_32F/HIP_C_64F
    added; mixed FP8/BF8 handling added in roc2TensileComputeInputType{A,B}.
  • DataInitialization.{hpp,cpp} extended to initialize MX A/B/MXSA/MXSB
    tensors and avoid scale-data overwrite.
  • Reference.cpp reference matmul updated for MX block scaling.
  • TypedId.hpp, ClientProblemFactory.cpp, LogReporter.hpp,
    ProgramOptions.hpp updated for MX types.
  • clients/tests/data/matmul_gtest.yaml extended.
  • MXBlock predicate added so the solution selector picks MX-aware kernels.

5. New tests

  • tensilelite/tests/MXDataGen_test.cpp — gtest regression for MX FP4 scale
    buffer determinism / data-generator zero-frequency.
  • New gfx950 GEMM test suites under
    tensilelite/Tensile/Tests/common/gemm/gfx950/:
    • mx32f4_tn.yaml, mx32f8_tn.yaml
    • mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml
    • mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml
    • fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml
  • New StreamK suites:
    tensilelite/Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml,
    sk_mx32f8_quick.yaml.
  • Covers MXFP4×MXFP4, MXFP4×FP8, FP8×MXFP4 in TN layout with activation,
    groupGEMM, and StreamK variants on gfx950.

6. Build / portability

  • Resolves HIP_R_8F_E5M3_EXT build error and amdclang-23 namespace
    errors.
  • mxDataGenerator disabled for Windows builds; Windows-only
    Float6 / BFloat6 / Float4 TypeInfo aliases.
  • Build break in tensilelite-client under FFM fixed.
  • hipsparselt build issues fixed (rebased headers / metadata vgpr index).
  • Several missing-semicolon, header, and merge-residue fixups.

7. Cleanup / refactor

  • Removed MXScale, Float6x16 / Float6x16_Storage, MXFP4 TN logic file,
    redundant VGPR allocation paths for MX, and assorted dead code.
  • LRVW / GRVW validation tightened; Solution.py cruft trimmed.
  • Origami support extended for MX type conversion.

Backwards compatibility / risk

  • Targets primarily new gfx950 MX kernels; non-MX paths on existing
    architectures should be unchanged in behavior. Validator key-format
    change (HHH, etc.) is internal to TensileLite.
  • _EXT HIP types are still accepted on the host side via aliases.
  • MX TailLoop optimizations are intentionally left disabled (TODOs in code)
    pending a follow-up.

Test plan

  • Added new tests. Make sure everything passes in CI.

Notes for reviewers

  • This branch contains many merge-resolution commits (intentionally kept to
    preserve history). The substantive engineering work is in the original
    gfx950_mx_rebase lineage; the merge commits resolve conflicts against
    recent develop changes (StreamK, gfx1250 tooling, stinkytofu MUBUF,
    sparse yaml fixes, etc.).
  • Key files for review (largest substantive diffs):
    Tensile/SolutionStructs/Solution.py,
    Tensile/KernelWriterAssembly.py, Tensile/KernelWriter.py,
    Tensile/Common/ValidParameters.py,
    Tensile/SolutionStructs/Validators/MatrixInstruction.py,
    rocisa/include/instruction/mfma.hpp,
    include/Tensile/DataTypes*.hpp,
    library/src/amd_detail/rocblaslt/src/tensile_host.cpp,
    client/include/DataInitialization.hpp,
    client/src/Reference.cpp.
  • Squash-merging is recommended given the volume of merge-resolution commits.

Submission Checklist

javier-amd and others added 30 commits January 26, 2026 10:10
## Motivation

Standardize kernel names with MX types.

## Technical Details

Removes redundant underscores, and makes some minor adjustments to make
MX kernel names consistent with the rest of the library.

## Test Plan

Checked kernel names with the mx test.
## Motivation

Fix some errors in codegen that break non-mx tests.

## Technical Details

calcLdsPad was refactored to include LRVWA/B. Arg list and call points
needed to be updated to match.
bpeGR now contains a floating point value. Offsets need to be converted
to int before written to assembly since offsets cannot be floats.

## Test Plan

Only tested with stream-k unit test so far (non-mx). That one now
passes.
## Motivation

Enable DTL for gfx950 mxfp4

## Technical Details

- Fixed inconsistent conditions for MXSA/B vreg allocation and local
write code generation
- Temporarily disabled UseGeneralizedNLCOneA/B (causes issue with
MXSA/B)

## Test Plan

Single test with tensilelite yaml

## Test Result

Local test passed

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…lling Tensile MX kernel (#4599)

## Motivation

This PR enables using mxDataGenerator when Tensile is the host and
supports calling FP4 kernels generated from Tensile.

## Technical Details

- Add a FP4 library (yaml) generated by Tensile under GridBased
category:
[YAML](https://github.com/ROCm/rocm-libraries/blob/1848898c099b127bf77f8306678150953f563648/projects/hipblaslt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/gfx950/GridBased/aquavanjaram_Cijk_Alik_Bljk___F_4_S_S___M_X_A_3_2___M_X_B_3_2___BH_Bias_S_HA_S_SAV_UserArgs.yaml)
- Remove macros to use mxDataGenerator regardless which host used. Now
**the default C++ standard is set to C++20** as it is required by
mxDataGenerator.
- Support calling Tensile FP4 solutions


## Test Plan

Use cmake preset build with rocRoller host **off** (i.e., use Tensile as
host) , gpu target set to gfx950 and `-DBUILD_TESTING:BOOL=OFF` (turn
off tensileLite test which will error out during build)

- Use `hipblaslt-test`
`./clients/hipblaslt-test --gtest_filter=*matmul_tensile_fp4*`

- Use `hipblaslt-bench`
`./clients/hipblaslt-bench --iters 0 --cold_iters 0 --transA T --transB
N --a_type f4_r --b_type f4_r --c_type f32_r --d_type f32_r -m 256 -n
256 -k 256 --alpha 2.1 --beta 0.7 --scaleA 3 --scaleB 3 --scale_type
f32_r --verify`

## Test Result


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…ion (#4702)

## Motivation

This PR updates the predicate to check the block size of mx data types.
If block size is not considered, wrong solution might be selected.

## Technical Details

- Add MX block size A and B into predicate and serialization

## Test Plan

Manually tried:

`./clients/hipblaslt-bench --iters 0 --cold_iters 0 --transA T --transB
N --a_type f8_r --b_type f8_r --c_type f32_r --d_type f32_r -m 256 -n
256 --alpha 2.1 --beta 0.7 --scaleA 3 --scaleB 3 --scale_type f32_r
--verify`

And no solution found (currently there is no solution for mx f8).

## Test Result


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Add MXFP4 data generator for Tensile

## Technical Details

Integrate Hipblaslt’s MXFP4 data generator into TensilteLite.

The goal is to decrease the amount of zero-valued data so that it better
resembles production workloads, thereby improving real‑time performance
accuracy.

After integration, zero frequency drops by 0.5%, and stays in the range
of ~12.5%-13%.

Note: Upgrades to C++20 as mxDataGenerator requires it.

## Test Plan

MXDataGen_test.cpp - verify the integration works & that the zero
frequency is below a fixed threshold.

## Test Result

TBD

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AIHPBLAS-796
…tToLds (#4683)

## Motivation

Improve MXFP4 performance with DirectoLds

## Technical Details

- Enable DirectToLds for MXSA/B
- Re-enable LdsPad for MXFP4 + DirectToLds
- Added test cases to mx32f4_tn.yaml

## Test Plan

Added test case to tensililite yaml

## Test Result

Local test passed 

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Builds didn't work if hipBLASLt wasn't already installed.

## Technical Details

Changed the include paths so MXDataGen didn't need hipBLASLT.hpp

## Test Plan

Build locally without hipBLASLt already installed.

## Test Result

Build worked.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Fix a verification fail with MXFP4 + non DTL

## Technical Details

Fixed incorrect waitcnt for prefetch local read with SubIter case
Also, fixed the location of second waitcnt for TN case (use MIWaveTileA
// 2 for SubIter=2 case)

## Test Plan

Local test

## Test Result

Local test passed

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Solving various issues blocking hipblaslt from built

## Technical Details

Solving various issues blocking hipblaslt from built

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Chun-Xun Lin <Chun-Xun.Lin@amd.com>
Comment thread projects/hipblaslt/tensilelite/client/CMakeLists.txt
Comment thread projects/hipblaslt/CMakeLists.txt
Comment thread projects/hipblaslt/CMakeLists.txt
Comment thread projects/hipblaslt/CMakeLists.txt
Comment thread projects/hipblaslt/tensilelite/CMakeLists.txt
Comment thread projects/hipblaslt/CMakeLists.txt
Copy link
Copy Markdown
Contributor

@davidd-amd davidd-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will follow up with gh issues for the outstanding comments.

@bnemanich bnemanich enabled auto-merge (squash) April 30, 2026 23:47
@bnemanich bnemanich merged commit 89b3fc4 into develop May 2, 2026
102 of 122 checks passed
@bnemanich bnemanich deleted the users/nakajee/gfx950_mx_rebase_merge branch May 2, 2026 12:30
bnemanich added a commit that referenced this pull request May 3, 2026
# Add gfx950 MXFP4 Subtile-based kernel implementation
## Summary
This PR is a follow-up to #6499 ([hipblaslt] Add support for gfx950
mxfp4)
and adds the **Subtile-based kernel implementation
(`UseSubtileImpl=1`)**
for hipBLASLt on **gfx950**. It introduces a new tile-decomposed code
generation path optimized for **MXFP4** and **BF16** GEMMs, plus the
solution-selection plumbing, validation, Origami logic yamls, and unit
tests
needed to make it production-usable.
## Motivation
PR #6499 brought MX data type support online for gfx950, but the
existing
TensileLite codegen path leaves significant performance on the table for
MXFP4-heavy workloads. The Subtile path restructures global-read /
local-read / MFMA / store scheduling at a finer granularity, which
**greatly improves MXFP4 GEMM performance when using
`HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT`** (added to the
hipBLASLt CHANGELOG).
## What's included
### 1. New Subtile-based kernel components (Tensile)
New modules under `projects/hipblaslt/tensilelite/Tensile/Components/`:
* `SubtileBasedKernel.py` (~1850 LOC) — entry point and orchestration of
  the subtile codegen path; replaces large portions of the standard
  prefetch / unroll / store flow when `UseSubtileImpl=1`.
* `SubtileBasedLogicalScheduler.py` (~2415 LOC) — logical scheduler that
  builds the subtile-grained instruction graph (GR loads, LR offsets,
  MFMA tiles, scale loads, stores) from kernel parameters.
* `SubtileBasedInstructionScheduler.py` (~433 LOC) — converts the
logical
  schedule to an emit order respecting wave / register / hazard
  constraints.
* `SubtileBasedInstructionEmitter.py` (~216 LOC) — instruction emission
  helpers shared by the subtile components.
### 2. Kernel writer / common changes
* **`KernelWriter.py`**, **`KernelWriterAssembly.py`**: integration
points
  for the subtile path — prefetch, GR offset calculation, LR offset
  calculation, post-loop, MFMA macro accounting, optimized `storeD`,
  LDS buffer swap, MX FP4 scale emit, `SrdMXSA/B+2` handling, sgpr
  allocation / overflow guards, computeLoadSrd fix.
* **`SolutionStructs/Solution.py`**, **`SolutionStructs/Problem.py`**:
  introduces the `UseSubtileImpl` parameter, MX-related reject
  conditions for non-Subtile paths on gfx950, and additional valid GEMM
  type combinations for MX inputs.
* **`Common/ValidParameters.py`**, **`Common/RequiredParameters.py`**,
  **`Common/GlobalParameters.py`**: `UseSubtileImpl` registration and
  defaults.
* **`Components/StreamK.py`**: subtile-aware StreamK fixup (incl. import
  union with the `BufferLoadB32` cache-coherence change from #6837).
* **`Components/GlobalWriteBatch.py`**: optimized global write batching
  for the subtile path (~670 LOC of changes).
* **`Components/ComputeStoreVgprs.py`**, **`Components/LSU.py`**,
  **`Components/WorkGroupMappingAlgos.py`**, **`AsmStoreState.py`**,
  **`KernelWriterModules.py`**: minor adjustments needed by the subtile
  pipeline.
### 3. rocisa / host / client
* **`rocisa/rocisa/include/container.hpp`**: helpers needed by the new
  emitter.
* **`tensile_host.cpp`**, **`include/Tensile/TensorDescriptor.hpp`**:
  small fixups for the subtile path and gfx950 build.
* **`client/include/DataInitialization.hpp`**,
**`client/src/DataInitialization.cpp`**,
**`client/src/Reference.cpp`**, **`client/src/ReferenceValidator.cpp`**,
  **`client/include/TypedId.hpp`**: MX scale init and reference paths
  used by the new tests.
* **`clients/common/include/testing_matmul.hpp`**,
  **`clients/common/include/norm.hpp`**,
  **`clients/common/include/hipblaslt_datatype2string.hpp`**,
  **`clients/common/src/mxDataGen.cpp`**: wiring for batched (>1)
  testing and MX init.
### 4. Origami / solution selection (gfx950 MXFP4)
New auto-tuned logic yamls under

`projects/hipblaslt/library/.../Tensile/Logic/asm_full/gfx950/gfx950/Origami/`
covering the FP4 SS / HS / BS variants in three layouts:
* `Origami/` (default)
* `Origami/Origami_nta4/` (no-transpose-A FP4)
* `Origami/Origami_ntb4/` (no-transpose-B FP4)
(9 new `gfx950_Cijk_Alik_Bljk_F4{SS,HS,BS}_MXA32_MXB32_*_UserArgs.yaml`
files in total.)
### 5. New tests
**End-to-end gfx950 GEMM yamls** in
`Tensile/Tests/common/gemm/gfx950/`:
* `subtile_bf16.yaml`, `subtile_mxfp4.yaml`
* `mx32f4_tn.yaml`, `mx32f8_tn.yaml`
* `mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
**StreamK + MX:** `Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml`,
`sk_mx32f8_quick.yaml`.
**New unit tests** (`Tensile/Tests/unit/`):
* `test_SubtileBasedLogicalScheduler.py` (~1735 LOC)
* `test_SubtileBasedSchedulerRef.py` (~596 LOC)
* `test_gr_lr_roundtrip.py` (~571 LOC)
* `test_storeD_roundtrip.py` (~2420 LOC)
* `test_graTileAssignment.py` (~354 LOC)
* `test_lraTileAssignment.py` (~360 LOC)
* `conftest.py`, `gpu_test_helpers.py` shared fixtures (~601 LOC)
**New gtest:** `tensilelite/tests/MXScalePadding_test.cpp`.
### 6. Misc / hardening
* Reject conditions: gfx950 MX + non-Subtile, DepthU constraints,
GroupGEMM
not yet supported with StreamK + MX, AssertSummationElementMultiple=256
  for subtile MXFP4, missing-mxblock check for non-MX types.
* Skip rocRoller for FP4-A/FP4-B with pre-swizzled scale layout (#42).
* `forceDenorm=False` in `generateMXInput` (#11).
* Several rebase fixes, copyright/year header updates, and
review-comment
  fixes to `KernelWriter` / `KernelWriterAssembly`.
### 7. CHANGELOG
Greatly improved MXFP4 GEMM performance when using
HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT

## How to use
Set `UseSubtileImpl: 1` on a gfx950 MX-FP4 solution (see the new
`subtile_mxfp4.yaml` / `mx32f4_tn.yaml` for canonical configs). The path
is
opt-in — non-MX and non-gfx950 kernels are unaffected.
## Backwards compatibility / risk
* All new behavior is gated on `UseSubtileImpl=1` and gfx950. Existing
  solutions on other architectures or non-MX paths are unchanged.
* `GroupGEMM + StreamK + MX` is intentionally rejected for now (TODO).
* New Origami yamls only add solutions; nothing existing is modified.
## Test plan
* New gtests + unit tests run automatically in CI (Tensilelite Python
  unit suite, `MXDataGen_test`, `MXScalePadding_test`).
* New end-to-end gfx950 GEMM and StreamK yamls are added to the common
  test buckets.
* Manual: run the gfx950 MXFP4 subtile suites
  (`pytest -k gfx950` after building Tensile, plus
  `tensilelite-client --yaml subtile_mxfp4.yaml` for sanity).
## Notes for reviewers
* This branch was rebased onto current `develop` (post-#6499) by
skipping
  the `users/nakajee/gfx950_mx_rebase_merge` history (which #6499
squash-merged) and replaying only the subtile-specific work as a single
  squashed commit. The actual code changes in this PR are limited to the
  files listed above (24 added, 56 modified; ~+170k / −2.6k including
  generated logic yamls).
* The largest reviewable diffs are:
*
`Tensile/Components/SubtileBased{Kernel,LogicalScheduler,InstructionScheduler,InstructionEmitter}.py`
(new files)
  * `Tensile/KernelWriter.py`, `Tensile/KernelWriterAssembly.py`
  * `Tensile/SolutionStructs/{Problem,Solution}.py`
  * `Tensile/Components/{GlobalWriteBatch,StreamK}.py`
  * `clients/common/include/testing_matmul.hpp`
  * `client/src/DataInitialization.cpp`

* Description of all commits that were squashed for this feature branch:

Subtile implementation for gfx950 MX FP4

--- 272f88d: Add sample subtile impl ---
Author: brianshi <brianshi@amd.com>

--- 60ecede: GR Offset calculation (#1) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- be69c1d: Enable post-loop code generation, and add some
subroutines ---
Author: b-shi <brianshi@amd.com>

--- 646d102: LR offset calculation (#2) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 71f4bca: Add GR load emit logic, and misc fixes (#3) ---
Author: b-shi <brianshi@amd.com>

--- 1fd0db9: Emit LR + init ACCVGPR (#4) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 9d406b9: Add loop and ptr update code ---
Author: b-shi <brianshi@amd.com>

--- b6127bc: Update GR/LR offset calculation to fully support 2x2,
1x4, 4x1 waveConfigs (#7) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 89ec87c: Account for valuC macro value in SK WS store code ---
Author: b-shi <brianshi@amd.com>

--- 6edf53d: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 34e79fc: Enable fp4 (#8) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- d5a5c57: [Tensilelite] Add MX FP4 scale offset computation for
subtile-based kernel (#6) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 7a8a85a: Add lds buffer swap logic ---
Author: b-shi <brianshi@amd.com>

--- d24a8fe: Add optimized storeD code (#9) ---
Author: b-shi <brianshi@amd.com>

--- a45c20c: Fix MX scale tensor initialization: set
forceDenorm=false in generateMXInput (#11) ---
Author: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

--- f945268: [Tensilelite] Enable the MX FP4 scale emit code in the
subtile-based kernel (#10) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- cf37df4: Use fixed value for SrdMXSA/B+2 (#14) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- f0c8dbc: Merge subtile_mx_f4_schedule to subtile_mx branch (#16)
---
Author: b-shi <brianshi@amd.com>

--- 543796f: Enable DU > 256, and reduce sgpr allocation (#18) ---
Author: b-shi <brianshi@amd.com>

--- c65bdb0: Add missing mxblock check for non-mx data types ---
Author: b-shi <brianshi@amd.com>

--- d64d226: Introduce UseSubtileImpl parameter (#20) ---
Author: b-shi <brianshi@amd.com>

Squash commits 20-35 from subtile_mx branch

--- e4780da: Enable FixSrd2 for A/B (#23) ---
Author: b-shi <brianshi@amd.com>

* Enable FixSrd2 for A/B

* Address comments from PR

---------

--- e4c64a7: Add nt libs ---
Author: b-shi <brianshi@amd.com>

--- cd13ec1: [Tensilelite] Pad MX scale tensor dimensions for
unaligned problem sizes (#21) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Add scale padding

* Add tests

* Remove redundant pre-swizzle path

* Remove code from
conflict

* Fix reverted mxdatagen path for tensile tests

* Add diverse test cases for scale padding in MXScalePadding_test and
subtile.yaml
- Expanded test cases to include non-multiple-of-32, even
non-multiple-of-16, and odd dimensions.

--- d87938f: Split subtile.yaml into subtile_bf16.yaml and
subtile_mxfp4.yaml (#22) ---
Author: James Newling <james.newling@gmail.com>

Replace the 'monolithic' subtile.yaml with two focused test files.
All original test coverage is preserved. Two new FP4 groups added.

BF16 coverage (subtile_bf16.yaml, tests are essentially unchanged):

  # | Description        | Dest | MIs | PGR | DU      | SK  | Sizes
  --+--------------------+------+-----+-----+---------+-----+------
  0 | BF16 TN main       | b    |  19 |   0 | 64      | 0,3 |  11
  1 | BF16 TN large DU   | b    |   4 |   0 | 128,192 | 0,3 |   7
  2 | BSS (f32 output)   | s    |   6 |   0 | 64      | 0,3 |   9
  3 | BF16 bias          | b    |   2 |   0 | 64      | 0   |   1

FP4 coverage (subtile_mxfp4.yaml):

  # | Description        | Dest | MIs | PGR | DU  | SK  | Sizes | Status
--+--------------------+------+-----+-----+-----+-----+-------+--------
0 | FP4 TN main | b | 15 | 0 | 256 | 0,3 | 23 | from original
1 | FP4 TN large DU | b | 4 | 0 | 512 | 0,3 | 13 | from original
2 | F4SS (f32 output) | s | 5 | 0 | 256 | 0,3 | 13 | from original
3 | FP4 bias | b | 2 | 0 | 256 | 0 | 1 | from original
  4 | FP4 PGR=2          | b    |  13 |   2 | 256 | 0   |   5   | new
  5 | FP4 expanded MIWT  | b    |  24 |   0 | 256 | 0   |   5   | new
6 | PGR=2 WG 4x1/1x4 | | 6 | 2 | 256 | 0 | 1 | known failures
(commented)

Run times on gfx950 (8x MI350X):

  File               | NEV=-1 | NEV=0
  -------------------+--------+------
  subtile_bf16.yaml  |    23s |   23s
  subtile_mxfp4.yaml |    37s |   40s

Where NEV is number of elements to validate. I (James) have checked
these numbers,
and weirdly it is true that NEV=0 is a bit faster than NEV=-1 for mxfp4.

--- af04f0d: Dependency based instruction scheduling (#19) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Revert to single partition

* Start using dependencies

* as is

* start using separate EmittedModules

* remove reduntant wait

* Add _extractPathsFromBeforeDeps

* Continue simplification

* Simplifying

* Add more rules

* cleanup

* Add fp4 test

* fix test

* Add tests

* Remove after field on emittedmodule

* Refactoring instructionSchedule

* Add comments

* cleanup modules vs ops

* Refactoring print functions

* Test cleanup

* Add more tests

* Replace subgroup by partition

* Remove unused unroll param

* Add high level notes

* Simplify NLL and NGLL GR removal

* Add some comments

* Force instruction insertion if no slots available

* Fix test after rebase

* Move scale before A/B and track inflight count

* Fine-grain vmcnt calculation

* Separate counts for scaleA and B

* Avoid using m0 update and buffer_lod on same MFMA slot to avoid scalar
instruction serialization

* Fix test

* Add vmcnt test

* Fix duplicated loads for 1x4 and 4x1

* Fix placement in reverse order

* Fix regression on PGR0

* add fallback to numMFMA=1

--- 3ec902b: Add some 1x4 and 4x1 origami solutions ---
Author: b-shi <brianshi@amd.com>

--- c5000d3: Fix typo ---
Author: b-shi <brianshi@amd.com>

--- 226ed84: [hipblaslt] Refactor Srd2 calculation for useFixedSrd2
(#30) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- abf19d4: [Tensilelite] UseSubtileImpl: subtile-aligned edge check
for store path (#29) ---
Author: b-shi <brianshi@amd.com>

* [Tensilelite] UseSubtileImpl: subtile-aligned edge check, OOB guard,
and refactoring

- Replace Size%MT edge check with subtile-aligned check: NonEdge paired
  store when trailing rows/cols are a multiple of the subtile block size
(waveGroupM rows for M, 16 cols for N). Non-last workgroups always take
NonEdge.
- Add per-wave OOB guard (subtileM32ValidBlocksSgpr /
subtileN16ValidBlocksSgpr)
  to skip stores outside valid M/N tile bounds in the NonEdge path.
- Refactor duplicated OOB guard into _emitSubtileOobGuard helper;
refactor
M/N guard SGPR computation into _emitSubtileMGuard / _emitSubtileNGuard.
- Fix orphan scalar store blockIdxM (was tt0, now
(tt0*MatrixInstM)//mBlockSize).
- Add quick-exit and edge/non-edge header comments to generated ASM.

* Add some bias tests, combine M/N guard to single routine

* Add OOB check for C loads, update storeD unit tests to check OOB,
simplify quick exit checks

* Address more PR comments: add M group skip, and skip to store end.
simplified loadC OOB mask

---------

--- 637881a: Fix unit tests & remove legacy code for subtile
interleaving (#33) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix gr_lr_roundtrip test

* Use non-interleaved version as ref code

* Fix scheduler test

* Removed legacy interleaved mode for LR/GR offset calculation

--- e9cb889: Fix MX FP4 scale buffer allocation and initialization
for batched GEMM (#25) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Fix bacth count issue

* Add batch count tests

* Fix bacth count issue

* Address PR review: clarify FP4-specific byte stride and add
non-aligned batched tests

- Updated comments on dataBatchBytes computation to clarify FP4 packing
  assumption (2 elements/byte) and flag that non-FP4 block-scaling types
  would require updating this conversion.
- Added batched test cases with non-multiple-of-32 M/N dimensions:
  FP4 DU=256: [48,48,2] and [33,65,2]
  FP4 DU=512: [63,63,2]
  BF16: [50,100,2]

---------

--- a43247b: Update some test yamls (#31) ---
Author: b-shi <brianshi@amd.com>

--- e2f69c8: Add f4bs origami library with activation function
support. Refactor sgpr allocation to reduce sgpr usage in post loop.
Store code-path reorganization (#32) ---
Author: b-shi <brianshi@amd.com>

* Free swap/localwritebase sgprs before post-loop

* Defer sgpr allocation to remove holds in sgpr pool.

Add Origami library logic files for Cijk_Alik_Bljk_F4BS_MXA32_MXB32
(base, nta4, ntb4 variants).

* Remove uneeded alignment and comment

* Add more epilogue tests

* Remove older origami library for f4bs

* Reorder post-loop code blocks to after persistant loop Misc fixes

* Fix build issues, relax longjump sgpr requirements

* Fix GSU0 branch logic

---------

--- 3f034bf: Add F4HS and F4SS Origami library logic for FP4→F16 and
FP4→F32 GEMM (#35) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

* Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM

- Add 6 new yaml files (F4HS, F4SS) across Origami, Origami_nta4,
Origami_ntb4
- Update F4BS yaml files: AssertSummationElementMultiple 32→256 for
K%256 enforcement
- Add ("F4", "F4", "H", "S") to _validGEMMTypes and _HPATypes in
Problem.py

* Add F4HS test cases to subtile_mxfp4.yaml

Add two new benchmark problem blocks for FP4→F16 (F4HS):
- No-bias block: same wavetile and problem size coverage as F4SS
- Bias epilogue block: BiasDataTypeList [s, h], relu/none activations

* Add F4HS (FP4->Half) type support to Tensile client

Add TypedGemm_F4_H_S typedef and corresponding reference CPU solver
case so F4HS (FP4 input, Float16 output, Float compute) problems
can be validated by the benchmark client.

---------

--- d0bc8fd: Rewrite subtile-based scheduler. Fix DU>64 & enable very
large MT (#36) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Initial support for DU>256

* Renaming

* add option to do DU=512 in the tests

* blocked K-major for scale

* Change scaleSet swap logic

* Update print functions

* Put scales after values for avoid race conditions

* Fix tests

* more test

* tweak printschedule display

* Add PGR2 in the yaml tests

* Add new scaleGROp

* comment out failing tests

* Revert "comment out failing tests"

This reverts commit 1f5802c.

* Draft new logical scheduler

* Refactoring

* Add more test on step1

* Add more tests on step1

* add bf16 320x320 test

* reduce step1 code

* Simplify step1 logic

* validate some step1 test

* Fix partition 2x2 test

* more step1 test

* 320x320 BF16 test

* Add test DU512 + partition2x2

* Simplify step1 code

* Add step2 tests

* Fix multi-partition step2

* Add step2 du512, 2x2 partition test

* Use common algo for all numPartitions

* Draft for step3 tests

* remove useless tests

* New GR algo (draft)

* [Step3] Add more test

* Iteration on GR

* Display ordered GR list with granularities

* More test

* Add some comments

* Disable by default debug logs

* Getting rid of step naming

* Start remove AnnotatedOp (still there in group pass)

* Split dependency Ops

* Add todo on place_GRs pass

* Valid test_annotate_deps_1x1_partition_DU256

* Test output looking better (still WIP)

* single dep for LR tooo

* Add remove_cross_deps pass

* Fix bugs in dependency pass

* insert_gr_lr_inc pass

* Add group_lr_gr pass

* Add emit pass

* Quick port of instruction Emit code

* Move emit function to separate file

* Refactoring instructionEmitter

* Port vgprTile tracking

* Reworking second pass (WIP)

* Display unrolling requirement

* Unrolling check on 2nd pass

* Generic validation for assign_vgpr pass

* Fix unroll

* Add inst schedule in standalone mode

* Use lrGran for vgprTile size calculation

* Fix bug in emit pass (missing depencency)

* PreMFMA path + non-duplication scale load

* missing globalReadLDSBufferSwap for GR_INC scales

* add wairlr_sync on all LR->GR dep

* add waitgr_sync op

* remove_unnecessary_gr_deps

* Change LR dispatch algo a bit to avoid too many waitgr_sync

* Avoid duplicated loads in emitter

* Fix bug on gr_emit code

* GrInc pass. fix duplicated insertion for B

* Fix missing LR_inc for SA/SB

* preloop, NLL, NGLL

* Simplify preloop

* minor changes

* Move unroll logic to scheduler

* minor changes

* Fix unroll id bug on NLL / NGLL

* Disable post GRINC for now

* Remove commented code

* Handle 1x4, 4x1 gr read gran

* Fix vmcnt computation

* Use correct grCount mapping

* Revert in emit logic on buffer_load for PGR0 needs

* Add bf16 version in standalone test

* Fix LR_Inc insertion on DU>64

* Add subIterK/Partition comment to codegen

* Fix issue in GrInc placement

* Remove last_mt

* Fix LR MT index bug with muli-partition

* Disable early LDS size check when subtileImpl is on

* Add pass to remove redundant LR deps + fixed issue on dependency
annotation pass

* Remove more LR redundant deps

* Only insert wait_lr_sync on deps

* Simple algo to select partition config

* Remove HC value for partitions...

* Take into account all inflight GR (all tensors)

* Fix tests and regressions on gr counts

* Fix grCount merge calculation

* Better display of dependencies

* Add remove_wait_lr_sync after grouping

* Add temporary non reg file

* Change merge logic on GR grouping pass

* Fix non necessary wait_lr_sync

* Downgrade some waitlr_sync to sync + added 384x256 no reg test

* non reg test 320x320

* Add larger MT

* non reg test for fp4 256x256

* Moving out instructionScheduler

* Remove old scheduler

* Renaming scheduler

* Re-work test

* Add larger MT test cases

* Rename non-ref test

* Re-add standalone mode

* Refactor DepOp

* Remove dead code

* Remove MFMATileSize class

* Remove from_til_info

* Avoid redundant tensor list creation

* Remove hardcode granularities in vgrpTile allocation pass. Simplify
code.

* Re-enable  # PGR=2 WG 4x1/1x4, K > DU tests

* Remove unused GRScaleOp

* DepRef renaming

* Get rid of MT string representation

* Remove TODO

* EmmitedModule simplication

* Use explicit pass dependencies

* Renaming LogicalScheduler

* Remove old test_InterleavingScheduler.py file

* Commenting failing test for now

* Remove debug logs

* Disable lds padding when using UseSubtileImpl

--- e8e8c09: Fix LR-GR dependency issue when DU>64 (#40) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix and simplify logic for remove_unnecessary_lr_deps

* Add new ref tests for 128x128x(128,64)

--- 4aa441a: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 5ba911e: Skip rocRoller for FP4-A/FP4-B + pre-swizzled scale
layout (#42) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 9c74998: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 842b149: Addressed review comments for KernelWriter and
KernelWriterAssembly ---
Author: Koji Nakajima <knakajim@amd.com>

--- dce43b1: Fix computeLoadSrd issue ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- c075bbf: Fix preSolution CPU re-sync regressing
subtile_mxfp4.yaml ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- ced840f: Fix computeLoadSrd issue (#43) ---
Author: bnemanich <brad.nemanich@amd.com>

--- bc2f6dd: Small update for gfx950 mx tests + more - enable
UseSubtileImpl for all gfx950 non subtile mx tests - skip all gfx950
mxfp8 - use MXScaleFormat=1 as default - set
AssertSummationElementMultiple=256 for subtile mxfp4 - fix
isSwizzledSubtile in computeLoadSrd ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5c794b7: Fix gsuasb.yaml failures ---
Author: b-shi <brianshi@amd.com>

--- 727f8db: tensilelite: add solution reject conditions for
UseSubtileImpl=1 (#38) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

--- 8928fbb: Add more reject conditions for Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 6e63ab6: Fix kringshift test failures ---
Author: b-shi <brianshi@amd.com>

--- b3e9724: Update reject condtion for DepthU in subtile case. Plus,
update DepthU setting for gfx950 mx test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5ab6009: Fix build errors ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 4a4edca: Update more mxfp4 tensilelite test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- bbbc553: Update change log ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 6476c04: Add more reject conditions for gfx950 subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- c5828c4: Updated gfx950 mxfp4 test cases - add StreamK setting -
skip groupgemm tests for now (groupgemm does not support streamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- f1fc2f1: Fix hipblaslt build error of gfx950 ---
Author: Koji Nakajima <knakajim@amd.com>

--- 70cea1b: Updated subtile_mxfp4.yaml (add StreamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- c1c9b2a: Add uninit lsc,lsp, etc.. fields for subtile ---
Author: b-shi <brianshi@amd.com>

--- c0c1f72: Fixed merge error in testing_matmul.hpp ---
Author: Koji Nakajima <knakajim@amd.com>

--- 191e0cb: Add missed batch_count >1 changes ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- 01c52f8: Addressed PR comments ---
Author: Koji Nakajima <knakajim@amd.com>

--- 4e89c91: Reduce mxfp4 test time ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 3dac20f: Prevent overflow for wgmxcc sgpr allocation ---
Author: b-shi <brianshi@amd.com>

--- 18dec79: Fix error with problem type ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 9e69ffd: Add a reject conditoin for gfx950 mx + non Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 0eed3ba: Add more valid GEMM types ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- 8b5514e: Fix missing b build error ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- f981ff5: Fix 1250 tests ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- d1e69d9: Add more FP4 tests ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- e3a688f: Add MXScaleFormat: 1 to all gfx950 mx test yaml ---
Author: Koji Nakajima <knakajim@amd.com>

--- aaef3f5: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
---
Author: Koji Nakajima <knakajim@amd.com>

--- 861ef8e: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
(nta4,ntb4) ---
Author: Koji Nakajima <knakajim@amd.com>

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>
minsukim-amd added a commit that referenced this pull request May 5, 2026
## Motivation

PR #6499 ("[hipblaslt] Add support for gfx950 mxfp4") changed
`DataTypeInfo::elementSize` from `size_t` → `float` to support sub-byte
FP4/FP6 packed types, but missed updating one assert in
`tensilelite/include/Tensile/TensorDescriptor.hpp:352`. The expression
`totalAllocatedElements() * info.elementSize % info.packing` then
evaluates to `float % size_t`, which the C++ compiler rejects with:

`error: invalid operands to binary expression ('float' and 'const
size_t' (aka 'const unsigned long'))`

## Technical Details

Uses the existing `multiplyElementSize` helper from `Utils.hpp` (already
included). It properly handles the float `elementSize` for both
whole-byte (≥1.0) and sub-byte (0.5, 0.75) cases, returning a `size_t`.

## Test Plan

Build hipblaslt via `install.sh` locally

## Test Result

`MessagePack.cpp` now builds clean

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
bnemanich added a commit that referenced this pull request May 6, 2026
…#7078)

## Summary
Commit 89b3fc4 ("[hipblaslt] Add support for gfx950 mxfp4" #6499)
inadvertently removed three public `_EXT` datatype constants from
`projects/hipblaslt/library/include/hipblaslt/hipblaslt-types.h`:
- `HIP_R_6F_E2M3_EXT = 31`
- `HIP_R_6F_E3M2_EXT = 32`
- `HIP_R_4F_E2M1_EXT = 33`
Only `HIP_R_8F_E5M3_EXT = 34` was left in place. The original PR
description for #6499 explicitly stated that "`_EXT` HIP types are still
accepted on the host side via aliases", so dropping these public
constants was unintentional.

## Motivation
These constants are part of the public hipBLASLt API surface and are
referenced by downstream projects. Removing them is a source-breaking
change for any consumer that includes `hipblaslt-types.h` and references
`HIP_R_6F_E2M3_EXT`, `HIP_R_6F_E3M2_EXT`, or `HIP_R_4F_E2M1_EXT`.

## Changes
- Re-add the three removed `_EXT` constants in `hipblaslt-types.h` with
their original integer values, alongside the existing
`HIP_R_8F_E5M3_EXT`.

## Backwards compatibility / risk
- Pure restoration of previously-public constants with their original
values — no behavioral change inside hipBLASLt.
- Unblocks downstream builds that broke after #6499 landed.


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
aledudek pushed a commit that referenced this pull request May 20, 2026
# Merge gfx950 MX (Microscaling) work onto develop
## Summary
This brings end-to-end Microscaling (MX) data type support —
**MXFP4**, **MXFP6 (FP6 / BF6)**, and **MXFP8** with per-block scale
tensors
(MXSA / MXSB) — into hipBLASLt's TensileLite kernel generator and host
code
for the **gfx950** architecture.

## Motivation
MX data types (OCP Microscaling) are required for gfx950 mixed-precision
matmul (FP4 / FP6 / FP8 with shared block scales) in TensileLite. 
## What's included
### 1. MX data type plumbing (Tensile + library)
- Adds packed `Float4x2`, `Float6x32`, `BFloat6x32` types and registers
`TypeInfo` specializations for them; retains `_EXT` / non-`_EXT`
variants
  for source compatibility.
- Threads `MacDataTypeA` / `MacDataTypeB` and MX block sizes
(`MXBlockA` / `MXBlockB`) through `ProblemType`, `Solution`, predicates,
  and the kernel writer so A and B can carry independent input types.
- Introduces a per-call `scale type` to `generateMXInput` and the host
  `mxDataGen` plumbing.
### 2. MFMA / rocisa codegen for MX on gfx950
- New MFMA `InstType`s: `INST_F4`, `INST_F6`, `INST_BF6`, `INST_F8_F6`,
  `INST_F6_F8`, `INST_F8_F4`, `INST_F4_F8`, `INST_F6_B6`, `INST_B6_F6`,
`INST_F6_F4`, `INST_F4_F6`, `INST_B6_F4`, `INST_F4_B6`, with the
matching
  `cbsz`/`blgp` permutation hints.
- `MXMFMAInstruction` constructor reorganized so `acc2`, `mxsa`, `mxsb`,
scale types, and `block` are defaulted; emit path now respects `HasMFMA`
  capability.
- `validMFMA` / `validSMFMA` keys switched from single-character (`H`,
`S`,
`B`, `F8`, …) to two-character (`HH`, `SS`, `BB`, `F8F8`, `F8B8`,
`F8F4`,
`F4F4`, `F6F6`, `B6B6`, `F8F6`, `F6F4`, …) so asymmetric A/B input types
  in mixed-MX MFMAs validate correctly.
- New 16x16x128 / 32x32x64 instruction shapes added for FP4/FP6/FP8
mixes.
### 3. gfx950-specific kernel generation fixes
- gfx950 + MX + DTL2/DTL3, MXFP4 + GSU, MXFP4 + StreamK
(`sk_mx32f4_quick`, `sk_mx32f8_quick`), MXFP4 + TailLoop (K multiple of
  32), MXFP4 + DtlPlusLdsBuf, MXFP4 + 64b shadow limit.
- `staggerUCode=False` restricted to MX on gfx950 only.
- DirectToLds re-enabled for MXSA/MXSB and `LdsPad` re-enabled for
  MXFP4 + DirectToLds.
- Workaround for `duplicateFactor` / `MIInputPerThreadMXSA/B` on gfx950.
- `NonDTLTailLoopA/B` and `NonDTLTailLoopMXSA/B` paths added for non-TLU
  + MX; `tailLoopOpt*` temporarily disabled for MX (TODO to re-enable).
- Minimum `AssertSummationElementMultiple = 32` enforced for MX with
  non-TLU (host code limitation; tracked).
- MX scaling data shuffle moved into Tensile.
- StreamK + MX scaling correctness fixes.
- SIA / GSU / LSU / LocalRead / LRVW / GRVW / Solution validation
tightened
around MX; `Use64bShadowLimitMX`, `StoreSwapAddr` for MXFP4, GRVWMXSA/B
  adjustment for DTL.
### 4. Host, client, and test framework
- `tensile_host.cpp`: `HIP_R_*_EXT` mappings replaced with the canonical
`HIP_R_4F_E2M1` / `HIP_R_6F_E2M3` / `HIP_R_6F_E3M2`;
`HIP_C_32F`/`HIP_C_64F`
added; mixed FP8/BF8 handling added in
`roc2TensileComputeInputType{A,B}`.
- `DataInitialization.{hpp,cpp}` extended to initialize MX A/B/MXSA/MXSB
  tensors and avoid scale-data overwrite.
- `Reference.cpp` reference matmul updated for MX block scaling.
- `TypedId.hpp`, `ClientProblemFactory.cpp`, `LogReporter.hpp`,
  `ProgramOptions.hpp` updated for MX types.
- `clients/tests/data/matmul_gtest.yaml` extended.
- `MXBlock` predicate added so the solution selector picks MX-aware
kernels.
### 5. New tests
- `tensilelite/tests/MXDataGen_test.cpp` — gtest regression for MX FP4
scale
  buffer determinism / data-generator zero-frequency.
- New gfx950 GEMM test suites under
  `tensilelite/Tensile/Tests/common/gemm/gfx950/`:
  - `mx32f4_tn.yaml`, `mx32f8_tn.yaml`
  - `mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
  - `mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
  - `fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
- New StreamK suites:
  `tensilelite/Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml`,
  `sk_mx32f8_quick.yaml`.
- Covers MXFP4×MXFP4, MXFP4×FP8, FP8×MXFP4 in TN layout with activation,
  groupGEMM, and StreamK variants on gfx950.
### 6. Build / portability
- Resolves `HIP_R_8F_E5M3_EXT` build error and amdclang-23 namespace
  errors.
- `mxDataGenerator` disabled for Windows builds; Windows-only
  `Float6` / `BFloat6` / `Float4` `TypeInfo` aliases.
- Build break in `tensilelite-client` under FFM fixed.
- hipsparselt build issues fixed (rebased headers / metadata vgpr
index).
- Several missing-semicolon, header, and merge-residue fixups.
### 7. Cleanup / refactor
- Removed `MXScale`, `Float6x16` / `Float6x16_Storage`, MXFP4 TN logic
file,
  redundant VGPR allocation paths for MX, and assorted dead code.
- LRVW / GRVW validation tightened; `Solution.py` cruft trimmed.
- Origami support extended for MX type conversion.
## Backwards compatibility / risk
- Targets primarily new gfx950 MX kernels; non-MX paths on existing
  architectures should be unchanged in behavior. Validator key-format
  change (`H` → `HH`, etc.) is internal to TensileLite.
- `_EXT` HIP types are still accepted on the host side via aliases.
- MX TailLoop optimizations are intentionally left disabled (TODOs in
code)
  pending a follow-up.
## Test plan
- Added new tests. Make sure everything passes in CI.
## Notes for reviewers
- This branch contains many merge-resolution commits (intentionally kept
to
  preserve history). The substantive engineering work is in the original
`gfx950_mx_rebase` lineage; the merge commits resolve conflicts against
  recent `develop` changes (StreamK, gfx1250 tooling, stinkytofu MUBUF,
  sparse yaml fixes, etc.).
- Key files for review (largest substantive diffs):
  `Tensile/SolutionStructs/Solution.py`,
  `Tensile/KernelWriterAssembly.py`, `Tensile/KernelWriter.py`,
  `Tensile/Common/ValidParameters.py`,
  `Tensile/SolutionStructs/Validators/MatrixInstruction.py`,
  `rocisa/include/instruction/mfma.hpp`,
  `include/Tensile/DataTypes*.hpp`,
  `library/src/amd_detail/rocblaslt/src/tensile_host.cpp`,
  `client/include/DataInitialization.hpp`,
  `client/src/Reference.cpp`.
- Squash-merging is recommended given the volume of merge-resolution
commits.
## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Signed-off-by: pdhirajkumarprasad <dhirajp@amd.com>
Co-authored-by: wenyang2 <Wen.Yang@amd.com>
Co-authored-by: Jingwei Liao <Jingwei.Liao@amd.com>
Co-authored-by: Alex Brown <alex.brown@amd.com>
Co-authored-by: amd-chunxlin <chunxlin@amd.com>
Co-authored-by: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com>
Co-authored-by: bnemanich <brad.nemanich@amd.com>
Co-authored-by: Hongji Chen <hongjche@amd.com>
Co-authored-by: Chun-Xun Lin <Chun-Xun.Lin@amd.com>
Co-authored-by: pdhirajkumarprasad <160474250+pdhirajkumarprasad@users.noreply.github.com>
Co-authored-by: T.J. Alumbaugh <talumbau@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: tomchengchitang <Tom.Tang@amd.com>
Co-authored-by: tomchengchitang <tomtang2@amd.com>
Co-authored-by: yenong-amd <92903026+yenong-amd@users.noreply.github.com>
Co-authored-by: pdhirajkumarprasad <dhirajp@amd.com>
Co-authored-by: mahmoodw <wmahmood@amd.com>
Co-authored-by: Vinayak Dev <Vinayak.Dev@amd.com>
Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Curtis Fu <yu.fu@amd.com>
aledudek pushed a commit that referenced this pull request May 20, 2026
# Add gfx950 MXFP4 Subtile-based kernel implementation
## Summary
This PR is a follow-up to #6499 ([hipblaslt] Add support for gfx950
mxfp4)
and adds the **Subtile-based kernel implementation
(`UseSubtileImpl=1`)**
for hipBLASLt on **gfx950**. It introduces a new tile-decomposed code
generation path optimized for **MXFP4** and **BF16** GEMMs, plus the
solution-selection plumbing, validation, Origami logic yamls, and unit
tests
needed to make it production-usable.
## Motivation
PR #6499 brought MX data type support online for gfx950, but the
existing
TensileLite codegen path leaves significant performance on the table for
MXFP4-heavy workloads. The Subtile path restructures global-read /
local-read / MFMA / store scheduling at a finer granularity, which
**greatly improves MXFP4 GEMM performance when using
`HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT`** (added to the
hipBLASLt CHANGELOG).
## What's included
### 1. New Subtile-based kernel components (Tensile)
New modules under `projects/hipblaslt/tensilelite/Tensile/Components/`:
* `SubtileBasedKernel.py` (~1850 LOC) — entry point and orchestration of
  the subtile codegen path; replaces large portions of the standard
  prefetch / unroll / store flow when `UseSubtileImpl=1`.
* `SubtileBasedLogicalScheduler.py` (~2415 LOC) — logical scheduler that
  builds the subtile-grained instruction graph (GR loads, LR offsets,
  MFMA tiles, scale loads, stores) from kernel parameters.
* `SubtileBasedInstructionScheduler.py` (~433 LOC) — converts the
logical
  schedule to an emit order respecting wave / register / hazard
  constraints.
* `SubtileBasedInstructionEmitter.py` (~216 LOC) — instruction emission
  helpers shared by the subtile components.
### 2. Kernel writer / common changes
* **`KernelWriter.py`**, **`KernelWriterAssembly.py`**: integration
points
  for the subtile path — prefetch, GR offset calculation, LR offset
  calculation, post-loop, MFMA macro accounting, optimized `storeD`,
  LDS buffer swap, MX FP4 scale emit, `SrdMXSA/B+2` handling, sgpr
  allocation / overflow guards, computeLoadSrd fix.
* **`SolutionStructs/Solution.py`**, **`SolutionStructs/Problem.py`**:
  introduces the `UseSubtileImpl` parameter, MX-related reject
  conditions for non-Subtile paths on gfx950, and additional valid GEMM
  type combinations for MX inputs.
* **`Common/ValidParameters.py`**, **`Common/RequiredParameters.py`**,
  **`Common/GlobalParameters.py`**: `UseSubtileImpl` registration and
  defaults.
* **`Components/StreamK.py`**: subtile-aware StreamK fixup (incl. import
  union with the `BufferLoadB32` cache-coherence change from #6837).
* **`Components/GlobalWriteBatch.py`**: optimized global write batching
  for the subtile path (~670 LOC of changes).
* **`Components/ComputeStoreVgprs.py`**, **`Components/LSU.py`**,
  **`Components/WorkGroupMappingAlgos.py`**, **`AsmStoreState.py`**,
  **`KernelWriterModules.py`**: minor adjustments needed by the subtile
  pipeline.
### 3. rocisa / host / client
* **`rocisa/rocisa/include/container.hpp`**: helpers needed by the new
  emitter.
* **`tensile_host.cpp`**, **`include/Tensile/TensorDescriptor.hpp`**:
  small fixups for the subtile path and gfx950 build.
* **`client/include/DataInitialization.hpp`**,
**`client/src/DataInitialization.cpp`**,
**`client/src/Reference.cpp`**, **`client/src/ReferenceValidator.cpp`**,
  **`client/include/TypedId.hpp`**: MX scale init and reference paths
  used by the new tests.
* **`clients/common/include/testing_matmul.hpp`**,
  **`clients/common/include/norm.hpp`**,
  **`clients/common/include/hipblaslt_datatype2string.hpp`**,
  **`clients/common/src/mxDataGen.cpp`**: wiring for batched (>1)
  testing and MX init.
### 4. Origami / solution selection (gfx950 MXFP4)
New auto-tuned logic yamls under

`projects/hipblaslt/library/.../Tensile/Logic/asm_full/gfx950/gfx950/Origami/`
covering the FP4 SS / HS / BS variants in three layouts:
* `Origami/` (default)
* `Origami/Origami_nta4/` (no-transpose-A FP4)
* `Origami/Origami_ntb4/` (no-transpose-B FP4)
(9 new `gfx950_Cijk_Alik_Bljk_F4{SS,HS,BS}_MXA32_MXB32_*_UserArgs.yaml`
files in total.)
### 5. New tests
**End-to-end gfx950 GEMM yamls** in
`Tensile/Tests/common/gemm/gfx950/`:
* `subtile_bf16.yaml`, `subtile_mxfp4.yaml`
* `mx32f4_tn.yaml`, `mx32f8_tn.yaml`
* `mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
* `fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml`
**StreamK + MX:** `Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml`,
`sk_mx32f8_quick.yaml`.
**New unit tests** (`Tensile/Tests/unit/`):
* `test_SubtileBasedLogicalScheduler.py` (~1735 LOC)
* `test_SubtileBasedSchedulerRef.py` (~596 LOC)
* `test_gr_lr_roundtrip.py` (~571 LOC)
* `test_storeD_roundtrip.py` (~2420 LOC)
* `test_graTileAssignment.py` (~354 LOC)
* `test_lraTileAssignment.py` (~360 LOC)
* `conftest.py`, `gpu_test_helpers.py` shared fixtures (~601 LOC)
**New gtest:** `tensilelite/tests/MXScalePadding_test.cpp`.
### 6. Misc / hardening
* Reject conditions: gfx950 MX + non-Subtile, DepthU constraints,
GroupGEMM
not yet supported with StreamK + MX, AssertSummationElementMultiple=256
  for subtile MXFP4, missing-mxblock check for non-MX types.
* Skip rocRoller for FP4-A/FP4-B with pre-swizzled scale layout (#42).
* `forceDenorm=False` in `generateMXInput` (#11).
* Several rebase fixes, copyright/year header updates, and
review-comment
  fixes to `KernelWriter` / `KernelWriterAssembly`.
### 7. CHANGELOG
Greatly improved MXFP4 GEMM performance when using
HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT

## How to use
Set `UseSubtileImpl: 1` on a gfx950 MX-FP4 solution (see the new
`subtile_mxfp4.yaml` / `mx32f4_tn.yaml` for canonical configs). The path
is
opt-in — non-MX and non-gfx950 kernels are unaffected.
## Backwards compatibility / risk
* All new behavior is gated on `UseSubtileImpl=1` and gfx950. Existing
  solutions on other architectures or non-MX paths are unchanged.
* `GroupGEMM + StreamK + MX` is intentionally rejected for now (TODO).
* New Origami yamls only add solutions; nothing existing is modified.
## Test plan
* New gtests + unit tests run automatically in CI (Tensilelite Python
  unit suite, `MXDataGen_test`, `MXScalePadding_test`).
* New end-to-end gfx950 GEMM and StreamK yamls are added to the common
  test buckets.
* Manual: run the gfx950 MXFP4 subtile suites
  (`pytest -k gfx950` after building Tensile, plus
  `tensilelite-client --yaml subtile_mxfp4.yaml` for sanity).
## Notes for reviewers
* This branch was rebased onto current `develop` (post-#6499) by
skipping
  the `users/nakajee/gfx950_mx_rebase_merge` history (which #6499
squash-merged) and replaying only the subtile-specific work as a single
  squashed commit. The actual code changes in this PR are limited to the
  files listed above (24 added, 56 modified; ~+170k / −2.6k including
  generated logic yamls).
* The largest reviewable diffs are:
*
`Tensile/Components/SubtileBased{Kernel,LogicalScheduler,InstructionScheduler,InstructionEmitter}.py`
(new files)
  * `Tensile/KernelWriter.py`, `Tensile/KernelWriterAssembly.py`
  * `Tensile/SolutionStructs/{Problem,Solution}.py`
  * `Tensile/Components/{GlobalWriteBatch,StreamK}.py`
  * `clients/common/include/testing_matmul.hpp`
  * `client/src/DataInitialization.cpp`

* Description of all commits that were squashed for this feature branch:

Subtile implementation for gfx950 MX FP4

--- 272f88d: Add sample subtile impl ---
Author: brianshi <brianshi@amd.com>

--- 60ecede: GR Offset calculation (#1) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- be69c1d: Enable post-loop code generation, and add some
subroutines ---
Author: b-shi <brianshi@amd.com>

--- 646d102: LR offset calculation (#2) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 71f4bca: Add GR load emit logic, and misc fixes (#3) ---
Author: b-shi <brianshi@amd.com>

--- 1fd0db9: Emit LR + init ACCVGPR (#4) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 9d406b9: Add loop and ptr update code ---
Author: b-shi <brianshi@amd.com>

--- b6127bc: Update GR/LR offset calculation to fully support 2x2,
1x4, 4x1 waveConfigs (#7) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- 89ec87c: Account for valuC macro value in SK WS store code ---
Author: b-shi <brianshi@amd.com>

--- 6edf53d: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 34e79fc: Enable fp4 (#8) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

--- d5a5c57: [Tensilelite] Add MX FP4 scale offset computation for
subtile-based kernel (#6) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 7a8a85a: Add lds buffer swap logic ---
Author: b-shi <brianshi@amd.com>

--- d24a8fe: Add optimized storeD code (#9) ---
Author: b-shi <brianshi@amd.com>

--- a45c20c: Fix MX scale tensor initialization: set
forceDenorm=false in generateMXInput (#11) ---
Author: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

--- f945268: [Tensilelite] Enable the MX FP4 scale emit code in the
subtile-based kernel (#10) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- cf37df4: Use fixed value for SrdMXSA/B+2 (#14) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- f0c8dbc: Merge subtile_mx_f4_schedule to subtile_mx branch (#16)
---
Author: b-shi <brianshi@amd.com>

--- 543796f: Enable DU > 256, and reduce sgpr allocation (#18) ---
Author: b-shi <brianshi@amd.com>

--- c65bdb0: Add missing mxblock check for non-mx data types ---
Author: b-shi <brianshi@amd.com>

--- d64d226: Introduce UseSubtileImpl parameter (#20) ---
Author: b-shi <brianshi@amd.com>

Squash commits 20-35 from subtile_mx branch

--- e4780da: Enable FixSrd2 for A/B (#23) ---
Author: b-shi <brianshi@amd.com>

* Enable FixSrd2 for A/B

* Address comments from PR

---------

--- e4c64a7: Add nt libs ---
Author: b-shi <brianshi@amd.com>

--- cd13ec1: [Tensilelite] Pad MX scale tensor dimensions for
unaligned problem sizes (#21) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Add scale padding

* Add tests

* Remove redundant pre-swizzle path

* Remove code from
conflict

* Fix reverted mxdatagen path for tensile tests

* Add diverse test cases for scale padding in MXScalePadding_test and
subtile.yaml
- Expanded test cases to include non-multiple-of-32, even
non-multiple-of-16, and odd dimensions.

--- d87938f: Split subtile.yaml into subtile_bf16.yaml and
subtile_mxfp4.yaml (#22) ---
Author: James Newling <james.newling@gmail.com>

Replace the 'monolithic' subtile.yaml with two focused test files.
All original test coverage is preserved. Two new FP4 groups added.

BF16 coverage (subtile_bf16.yaml, tests are essentially unchanged):

  # | Description        | Dest | MIs | PGR | DU      | SK  | Sizes
  --+--------------------+------+-----+-----+---------+-----+------
  0 | BF16 TN main       | b    |  19 |   0 | 64      | 0,3 |  11
  1 | BF16 TN large DU   | b    |   4 |   0 | 128,192 | 0,3 |   7
  2 | BSS (f32 output)   | s    |   6 |   0 | 64      | 0,3 |   9
  3 | BF16 bias          | b    |   2 |   0 | 64      | 0   |   1

FP4 coverage (subtile_mxfp4.yaml):

  # | Description        | Dest | MIs | PGR | DU  | SK  | Sizes | Status
--+--------------------+------+-----+-----+-----+-----+-------+--------
0 | FP4 TN main | b | 15 | 0 | 256 | 0,3 | 23 | from original
1 | FP4 TN large DU | b | 4 | 0 | 512 | 0,3 | 13 | from original
2 | F4SS (f32 output) | s | 5 | 0 | 256 | 0,3 | 13 | from original
3 | FP4 bias | b | 2 | 0 | 256 | 0 | 1 | from original
  4 | FP4 PGR=2          | b    |  13 |   2 | 256 | 0   |   5   | new
  5 | FP4 expanded MIWT  | b    |  24 |   0 | 256 | 0   |   5   | new
6 | PGR=2 WG 4x1/1x4 | | 6 | 2 | 256 | 0 | 1 | known failures
(commented)

Run times on gfx950 (8x MI350X):

  File               | NEV=-1 | NEV=0
  -------------------+--------+------
  subtile_bf16.yaml  |    23s |   23s
  subtile_mxfp4.yaml |    37s |   40s

Where NEV is number of elements to validate. I (James) have checked
these numbers,
and weirdly it is true that NEV=0 is a bit faster than NEV=-1 for mxfp4.

--- af04f0d: Dependency based instruction scheduling (#19) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Revert to single partition

* Start using dependencies

* as is

* start using separate EmittedModules

* remove reduntant wait

* Add _extractPathsFromBeforeDeps

* Continue simplification

* Simplifying

* Add more rules

* cleanup

* Add fp4 test

* fix test

* Add tests

* Remove after field on emittedmodule

* Refactoring instructionSchedule

* Add comments

* cleanup modules vs ops

* Refactoring print functions

* Test cleanup

* Add more tests

* Replace subgroup by partition

* Remove unused unroll param

* Add high level notes

* Simplify NLL and NGLL GR removal

* Add some comments

* Force instruction insertion if no slots available

* Fix test after rebase

* Move scale before A/B and track inflight count

* Fine-grain vmcnt calculation

* Separate counts for scaleA and B

* Avoid using m0 update and buffer_lod on same MFMA slot to avoid scalar
instruction serialization

* Fix test

* Add vmcnt test

* Fix duplicated loads for 1x4 and 4x1

* Fix placement in reverse order

* Fix regression on PGR0

* add fallback to numMFMA=1

--- 3ec902b: Add some 1x4 and 4x1 origami solutions ---
Author: b-shi <brianshi@amd.com>

--- c5000d3: Fix typo ---
Author: b-shi <brianshi@amd.com>

--- 226ed84: [hipblaslt] Refactor Srd2 calculation for useFixedSrd2
(#30) ---
Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com>

--- abf19d4: [Tensilelite] UseSubtileImpl: subtile-aligned edge check
for store path (#29) ---
Author: b-shi <brianshi@amd.com>

* [Tensilelite] UseSubtileImpl: subtile-aligned edge check, OOB guard,
and refactoring

- Replace Size%MT edge check with subtile-aligned check: NonEdge paired
  store when trailing rows/cols are a multiple of the subtile block size
(waveGroupM rows for M, 16 cols for N). Non-last workgroups always take
NonEdge.
- Add per-wave OOB guard (subtileM32ValidBlocksSgpr /
subtileN16ValidBlocksSgpr)
  to skip stores outside valid M/N tile bounds in the NonEdge path.
- Refactor duplicated OOB guard into _emitSubtileOobGuard helper;
refactor
M/N guard SGPR computation into _emitSubtileMGuard / _emitSubtileNGuard.
- Fix orphan scalar store blockIdxM (was tt0, now
(tt0*MatrixInstM)//mBlockSize).
- Add quick-exit and edge/non-edge header comments to generated ASM.

* Add some bias tests, combine M/N guard to single routine

* Add OOB check for C loads, update storeD unit tests to check OOB,
simplify quick exit checks

* Address more PR comments: add M group skip, and skip to store end.
simplified loadC OOB mask

---------

--- 637881a: Fix unit tests & remove legacy code for subtile
interleaving (#33) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix gr_lr_roundtrip test

* Use non-interleaved version as ref code

* Fix scheduler test

* Removed legacy interleaved mode for LR/GR offset calculation

--- e9cb889: Fix MX FP4 scale buffer allocation and initialization
for batched GEMM (#25) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

* Fix bacth count issue

* Add batch count tests

* Fix bacth count issue

* Address PR review: clarify FP4-specific byte stride and add
non-aligned batched tests

- Updated comments on dataBatchBytes computation to clarify FP4 packing
  assumption (2 elements/byte) and flag that non-FP4 block-scaling types
  would require updating this conversion.
- Added batched test cases with non-multiple-of-32 M/N dimensions:
  FP4 DU=256: [48,48,2] and [33,65,2]
  FP4 DU=512: [63,63,2]
  BF16: [50,100,2]

---------

--- a43247b: Update some test yamls (#31) ---
Author: b-shi <brianshi@amd.com>

--- e2f69c8: Add f4bs origami library with activation function
support. Refactor sgpr allocation to reduce sgpr usage in post loop.
Store code-path reorganization (#32) ---
Author: b-shi <brianshi@amd.com>

* Free swap/localwritebase sgprs before post-loop

* Defer sgpr allocation to remove holds in sgpr pool.

Add Origami library logic files for Cijk_Alik_Bljk_F4BS_MXA32_MXB32
(base, nta4, ntb4 variants).

* Remove uneeded alignment and comment

* Add more epilogue tests

* Remove older origami library for f4bs

* Reorder post-loop code blocks to after persistant loop Misc fixes

* Fix build issues, relax longjump sgpr requirements

* Fix GSU0 branch logic

---------

--- 3f034bf: Add F4HS and F4SS Origami library logic for FP4→F16 and
FP4→F32 GEMM (#35) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

* Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM

- Add 6 new yaml files (F4HS, F4SS) across Origami, Origami_nta4,
Origami_ntb4
- Update F4BS yaml files: AssertSummationElementMultiple 32→256 for
K%256 enforcement
- Add ("F4", "F4", "H", "S") to _validGEMMTypes and _HPATypes in
Problem.py

* Add F4HS test cases to subtile_mxfp4.yaml

Add two new benchmark problem blocks for FP4→F16 (F4HS):
- No-bias block: same wavetile and problem size coverage as F4SS
- Bias epilogue block: BiasDataTypeList [s, h], relu/none activations

* Add F4HS (FP4->Half) type support to Tensile client

Add TypedGemm_F4_H_S typedef and corresponding reference CPU solver
case so F4HS (FP4 input, Float16 output, Float compute) problems
can be validated by the benchmark client.

---------

--- d0bc8fd: Rewrite subtile-based scheduler. Fix DU>64 & enable very
large MT (#36) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Initial support for DU>256

* Renaming

* add option to do DU=512 in the tests

* blocked K-major for scale

* Change scaleSet swap logic

* Update print functions

* Put scales after values for avoid race conditions

* Fix tests

* more test

* tweak printschedule display

* Add PGR2 in the yaml tests

* Add new scaleGROp

* comment out failing tests

* Revert "comment out failing tests"

This reverts commit 1f5802c.

* Draft new logical scheduler

* Refactoring

* Add more test on step1

* Add more tests on step1

* add bf16 320x320 test

* reduce step1 code

* Simplify step1 logic

* validate some step1 test

* Fix partition 2x2 test

* more step1 test

* 320x320 BF16 test

* Add test DU512 + partition2x2

* Simplify step1 code

* Add step2 tests

* Fix multi-partition step2

* Add step2 du512, 2x2 partition test

* Use common algo for all numPartitions

* Draft for step3 tests

* remove useless tests

* New GR algo (draft)

* [Step3] Add more test

* Iteration on GR

* Display ordered GR list with granularities

* More test

* Add some comments

* Disable by default debug logs

* Getting rid of step naming

* Start remove AnnotatedOp (still there in group pass)

* Split dependency Ops

* Add todo on place_GRs pass

* Valid test_annotate_deps_1x1_partition_DU256

* Test output looking better (still WIP)

* single dep for LR tooo

* Add remove_cross_deps pass

* Fix bugs in dependency pass

* insert_gr_lr_inc pass

* Add group_lr_gr pass

* Add emit pass

* Quick port of instruction Emit code

* Move emit function to separate file

* Refactoring instructionEmitter

* Port vgprTile tracking

* Reworking second pass (WIP)

* Display unrolling requirement

* Unrolling check on 2nd pass

* Generic validation for assign_vgpr pass

* Fix unroll

* Add inst schedule in standalone mode

* Use lrGran for vgprTile size calculation

* Fix bug in emit pass (missing depencency)

* PreMFMA path + non-duplication scale load

* missing globalReadLDSBufferSwap for GR_INC scales

* add wairlr_sync on all LR->GR dep

* add waitgr_sync op

* remove_unnecessary_gr_deps

* Change LR dispatch algo a bit to avoid too many waitgr_sync

* Avoid duplicated loads in emitter

* Fix bug on gr_emit code

* GrInc pass. fix duplicated insertion for B

* Fix missing LR_inc for SA/SB

* preloop, NLL, NGLL

* Simplify preloop

* minor changes

* Move unroll logic to scheduler

* minor changes

* Fix unroll id bug on NLL / NGLL

* Disable post GRINC for now

* Remove commented code

* Handle 1x4, 4x1 gr read gran

* Fix vmcnt computation

* Use correct grCount mapping

* Revert in emit logic on buffer_load for PGR0 needs

* Add bf16 version in standalone test

* Fix LR_Inc insertion on DU>64

* Add subIterK/Partition comment to codegen

* Fix issue in GrInc placement

* Remove last_mt

* Fix LR MT index bug with muli-partition

* Disable early LDS size check when subtileImpl is on

* Add pass to remove redundant LR deps + fixed issue on dependency
annotation pass

* Remove more LR redundant deps

* Only insert wait_lr_sync on deps

* Simple algo to select partition config

* Remove HC value for partitions...

* Take into account all inflight GR (all tensors)

* Fix tests and regressions on gr counts

* Fix grCount merge calculation

* Better display of dependencies

* Add remove_wait_lr_sync after grouping

* Add temporary non reg file

* Change merge logic on GR grouping pass

* Fix non necessary wait_lr_sync

* Downgrade some waitlr_sync to sync + added 384x256 no reg test

* non reg test 320x320

* Add larger MT

* non reg test for fp4 256x256

* Moving out instructionScheduler

* Remove old scheduler

* Renaming scheduler

* Re-work test

* Add larger MT test cases

* Rename non-ref test

* Re-add standalone mode

* Refactor DepOp

* Remove dead code

* Remove MFMATileSize class

* Remove from_til_info

* Avoid redundant tensor list creation

* Remove hardcode granularities in vgrpTile allocation pass. Simplify
code.

* Re-enable  # PGR=2 WG 4x1/1x4, K > DU tests

* Remove unused GRScaleOp

* DepRef renaming

* Get rid of MT string representation

* Remove TODO

* EmmitedModule simplication

* Use explicit pass dependencies

* Renaming LogicalScheduler

* Remove old test_InterleavingScheduler.py file

* Commenting failing test for now

* Remove debug logs

* Disable lds padding when using UseSubtileImpl

--- e8e8c09: Fix LR-GR dependency issue when DU>64 (#40) ---
Author: sebvince <115461989+sebvince@users.noreply.github.com>

* Fix and simplify logic for remove_unnecessary_lr_deps

* Add new ref tests for 128x128x(128,64)

--- 4aa441a: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 5ba911e: Skip rocRoller for FP4-A/FP4-B + pre-swizzled scale
layout (#42) ---
Author: Archana Ramalingam
<98564406+archana-ramalingam@users.noreply.github.com>

--- 9c74998: Rebase fix ---
Author: b-shi <brianshi@amd.com>

--- 842b149: Addressed review comments for KernelWriter and
KernelWriterAssembly ---
Author: Koji Nakajima <knakajim@amd.com>

--- dce43b1: Fix computeLoadSrd issue ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- c075bbf: Fix preSolution CPU re-sync regressing
subtile_mxfp4.yaml ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- ced840f: Fix computeLoadSrd issue (#43) ---
Author: bnemanich <brad.nemanich@amd.com>

--- bc2f6dd: Small update for gfx950 mx tests + more - enable
UseSubtileImpl for all gfx950 non subtile mx tests - skip all gfx950
mxfp8 - use MXScaleFormat=1 as default - set
AssertSummationElementMultiple=256 for subtile mxfp4 - fix
isSwizzledSubtile in computeLoadSrd ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5c794b7: Fix gsuasb.yaml failures ---
Author: b-shi <brianshi@amd.com>

--- 727f8db: tensilelite: add solution reject conditions for
UseSubtileImpl=1 (#38) ---
Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

--- 8928fbb: Add more reject conditions for Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 6e63ab6: Fix kringshift test failures ---
Author: b-shi <brianshi@amd.com>

--- b3e9724: Update reject condtion for DepthU in subtile case. Plus,
update DepthU setting for gfx950 mx test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- 5ab6009: Fix build errors ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 4a4edca: Update more mxfp4 tensilelite test cases ---
Author: Koji Nakajima <knakajim@amd.com>

--- bbbc553: Update change log ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 6476c04: Add more reject conditions for gfx950 subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- c5828c4: Updated gfx950 mxfp4 test cases - add StreamK setting -
skip groupgemm tests for now (groupgemm does not support streamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- f1fc2f1: Fix hipblaslt build error of gfx950 ---
Author: Koji Nakajima <knakajim@amd.com>

--- 70cea1b: Updated subtile_mxfp4.yaml (add StreamK) ---
Author: Koji Nakajima <knakajim@amd.com>

--- c1c9b2a: Add uninit lsc,lsp, etc.. fields for subtile ---
Author: b-shi <brianshi@amd.com>

--- c0c1f72: Fixed merge error in testing_matmul.hpp ---
Author: Koji Nakajima <knakajim@amd.com>

--- 191e0cb: Add missed batch_count >1 changes ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- 01c52f8: Addressed PR comments ---
Author: Koji Nakajima <knakajim@amd.com>

--- 4e89c91: Reduce mxfp4 test time ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 3dac20f: Prevent overflow for wgmxcc sgpr allocation ---
Author: b-shi <brianshi@amd.com>

--- 18dec79: Fix error with problem type ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- 9e69ffd: Add a reject conditoin for gfx950 mx + non Subtile ---
Author: Koji Nakajima <knakajim@amd.com>

--- 0eed3ba: Add more valid GEMM types ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- 8b5514e: Fix missing b build error ---
Author: archana-ramalingam <Archana.Ramalingam@amd.com>

--- f981ff5: Fix 1250 tests ---
Author: Brad Nemanich <brad.nemanich@amd.com>

--- d1e69d9: Add more FP4 tests ---
Author: Brad Nemanich <Brad.Nemanich@amd.com>

--- e3a688f: Add MXScaleFormat: 1 to all gfx950 mx test yaml ---
Author: Koji Nakajima <knakajim@amd.com>

--- aaef3f5: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
---
Author: Koji Nakajima <knakajim@amd.com>

--- 861ef8e: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml
(nta4,ntb4) ---
Author: Koji Nakajima <knakajim@amd.com>

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com>
Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com>
Co-authored-by: Brian Shi <Brian.Shi@amd.com>
Co-authored-by: James Newling <James.Newling@amd.com>
Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com>
Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com>
Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com>
Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>
aledudek pushed a commit that referenced this pull request May 20, 2026
## Motivation

PR #6499 ("[hipblaslt] Add support for gfx950 mxfp4") changed
`DataTypeInfo::elementSize` from `size_t` → `float` to support sub-byte
FP4/FP6 packed types, but missed updating one assert in
`tensilelite/include/Tensile/TensorDescriptor.hpp:352`. The expression
`totalAllocatedElements() * info.elementSize % info.packing` then
evaluates to `float % size_t`, which the C++ compiler rejects with:

`error: invalid operands to binary expression ('float' and 'const
size_t' (aka 'const unsigned long'))`

## Technical Details

Uses the existing `multiplyElementSize` helper from `Utils.hpp` (already
included). It properly handles the float `elementSize` for both
whole-byte (≥1.0) and sub-byte (0.5, 0.75) cases, returning a `size_t`.

## Test Plan

Build hipblaslt via `install.sh` locally

## Test Result

`MessagePack.cpp` now builds clean

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
aledudek pushed a commit that referenced this pull request May 20, 2026
…#7078)

## Summary
Commit 89b3fc4 ("[hipblaslt] Add support for gfx950 mxfp4" #6499)
inadvertently removed three public `_EXT` datatype constants from
`projects/hipblaslt/library/include/hipblaslt/hipblaslt-types.h`:
- `HIP_R_6F_E2M3_EXT = 31`
- `HIP_R_6F_E3M2_EXT = 32`
- `HIP_R_4F_E2M1_EXT = 33`
Only `HIP_R_8F_E5M3_EXT = 34` was left in place. The original PR
description for #6499 explicitly stated that "`_EXT` HIP types are still
accepted on the host side via aliases", so dropping these public
constants was unintentional.

## Motivation
These constants are part of the public hipBLASLt API surface and are
referenced by downstream projects. Removing them is a source-breaking
change for any consumer that includes `hipblaslt-types.h` and references
`HIP_R_6F_E2M3_EXT`, `HIP_R_6F_E3M2_EXT`, or `HIP_R_4F_E2M1_EXT`.

## Changes
- Re-add the three removed `_EXT` constants in `hipblaslt-types.h` with
their original integer values, alongside the existing
`HIP_R_8F_E5M3_EXT`.

## Backwards compatibility / risk
- Pure restoration of previously-public constants with their original
values — no behavioral change inside hipBLASLt.
- Unblocks downstream builds that broke after #6499 landed.


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
bnemanich added a commit that referenced this pull request May 23, 2026
…lAllocatedBytes (AIHPBLAS-3570) (#7672)

## Motivation

PR #6499 (gfx950 mxfp4)
changed BaseTypeInfo::ElementSize from sizeof(T)
to sizeof(T) / Packing, so info.elementSize is now the per-element
segment size in bytes (0.5 for Float4, 0.75 for Float6/BFloat6). Two
sites in TensorDescriptor.hpp were not updated to match: elementBytes()
still hard-asserted m_dataType was none of {Float4,Float6,BFloat6} and
carried a stale comment claiming elementSize was the unsegmented
container size, and totalAllocatedBytes() divided by info.packing a
second time on top of multiplyElementSize, returning N/4 bytes for N
Float4 elements instead of the correct N/2.

Drop the obsolete assertion and the obsolete /packing, replace the
misleading comment with one that reflects post-PR-6499 semantics, and
keep totalAllocatedBytes safe by asserting the element count is a whole
number of packed containers. Add a TensorDescriptor_test.cpp covering
Float (control), Half, Float8, Float4, Float6, and BFloat6 in
tensilelite-tests, including an explicit Debug-mode regression check
that elementBytes() does not abort on the packed MX types and that
totalAllocatedBytes() is not double-divided. Non-packed types are

Addresses AIHPBLAS-3570.

## Test Plan

Tested on debug build to make sure everything passed.


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.