Add gfx950 MXFP4 and BF16 Subtile-based kernel implementation by bnemanich · Pull Request #7017 · ROCm/rocm-libraries

bnemanich · 2026-05-02T14:35:09Z

Add gfx950 MXFP4 Subtile-based kernel implementation

Summary

This PR is a follow-up to #6499 ([hipblaslt] Add support for gfx950 mxfp4)
and adds the Subtile-based kernel implementation (UseSubtileImpl=1)
for hipBLASLt on gfx950. It introduces a new tile-decomposed code
generation path optimized for MXFP4 and BF16 GEMMs, plus the
solution-selection plumbing, validation, Origami logic yamls, and unit tests
needed to make it production-usable.

Motivation

PR #6499 brought MX data type support online for gfx950, but the existing
TensileLite codegen path leaves significant performance on the table for
MXFP4-heavy workloads. The Subtile path restructures global-read /
local-read / MFMA / store scheduling at a finer granularity, which
greatly improves MXFP4 GEMM performance when using
HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT (added to the
hipBLASLt CHANGELOG).

What's included

1. New Subtile-based kernel components (Tensile)

New modules under projects/hipblaslt/tensilelite/Tensile/Components/:

SubtileBasedKernel.py (~1850 LOC) — entry point and orchestration of
the subtile codegen path; replaces large portions of the standard
prefetch / unroll / store flow when UseSubtileImpl=1.
SubtileBasedLogicalScheduler.py (~2415 LOC) — logical scheduler that
builds the subtile-grained instruction graph (GR loads, LR offsets,
MFMA tiles, scale loads, stores) from kernel parameters.
SubtileBasedInstructionScheduler.py (~433 LOC) — converts the logical
schedule to an emit order respecting wave / register / hazard
constraints.
SubtileBasedInstructionEmitter.py (~216 LOC) — instruction emission
helpers shared by the subtile components.

2. Kernel writer / common changes

KernelWriter.py, KernelWriterAssembly.py: integration points
for the subtile path — prefetch, GR offset calculation, LR offset
calculation, post-loop, MFMA macro accounting, optimized storeD,
LDS buffer swap, MX FP4 scale emit, SrdMXSA/B+2 handling, sgpr
allocation / overflow guards, computeLoadSrd fix.
SolutionStructs/Solution.py, SolutionStructs/Problem.py:
introduces the UseSubtileImpl parameter, MX-related reject
conditions for non-Subtile paths on gfx950, and additional valid GEMM
type combinations for MX inputs.
Common/ValidParameters.py, Common/RequiredParameters.py,
Common/GlobalParameters.py: UseSubtileImpl registration and
defaults.
Components/StreamK.py: subtile-aware StreamK fixup (incl. import
union with the BufferLoadB32 cache-coherence change from [hipblaslt][tensilelite] Ensure cache coherence for StreamK fixup flag handling on GFX1250 #6837).
Components/GlobalWriteBatch.py: optimized global write batching
for the subtile path (~670 LOC of changes).
Components/ComputeStoreVgprs.py, Components/LSU.py,
Components/WorkGroupMappingAlgos.py, AsmStoreState.py,
KernelWriterModules.py: minor adjustments needed by the subtile
pipeline.

3. rocisa / host / client

rocisa/rocisa/include/container.hpp: helpers needed by the new
emitter.
tensile_host.cpp, include/Tensile/TensorDescriptor.hpp:
small fixups for the subtile path and gfx950 build.
client/include/DataInitialization.hpp, client/src/DataInitialization.cpp,
client/src/Reference.cpp, client/src/ReferenceValidator.cpp,
client/include/TypedId.hpp: MX scale init and reference paths
used by the new tests.
clients/common/include/testing_matmul.hpp,
clients/common/include/norm.hpp,
clients/common/include/hipblaslt_datatype2string.hpp,
clients/common/src/mxDataGen.cpp: wiring for batched (>1)
testing and MX init.

4. Origami / solution selection (gfx950 MXFP4)

New auto-tuned logic yamls under
projects/hipblaslt/library/.../Tensile/Logic/asm_full/gfx950/gfx950/Origami/
covering the FP4 SS / HS / BS variants in three layouts:

Origami/ (default)
Origami/Origami_nta4/ (no-transpose-A FP4)
Origami/Origami_ntb4/ (no-transpose-B FP4)
(9 new gfx950_Cijk_Alik_Bljk_F4{SS,HS,BS}_MXA32_MXB32_*_UserArgs.yaml
files in total.)

5. New tests

End-to-end gfx950 GEMM yamls in
Tensile/Tests/common/gemm/gfx950/:

subtile_bf16.yaml, subtile_mxfp4.yaml
mx32f4_tn.yaml, mx32f8_tn.yaml
mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml
mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml
fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml
StreamK + MX: Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml,
sk_mx32f8_quick.yaml.
New unit tests (Tensile/Tests/unit/):
test_SubtileBasedLogicalScheduler.py (~1735 LOC)
test_SubtileBasedSchedulerRef.py (~596 LOC)
test_gr_lr_roundtrip.py (~571 LOC)
test_storeD_roundtrip.py (~2420 LOC)
test_graTileAssignment.py (~354 LOC)
test_lraTileAssignment.py (~360 LOC)
conftest.py, gpu_test_helpers.py shared fixtures (~601 LOC)
New gtest: tensilelite/tests/MXScalePadding_test.cpp.

6. Misc / hardening

Reject conditions: gfx950 MX + non-Subtile, DepthU constraints, GroupGEMM
not yet supported with StreamK + MX, AssertSummationElementMultiple=256
for subtile MXFP4, missing-mxblock check for non-MX types.
Skip rocRoller for FP4-A/FP4-B with pre-swizzled scale layout ([GitHub Actions] more reverts to good state #42).
forceDenorm=False in generateMXInput (Update ISSUE_TEMPLATE.md #11).
Several rebase fixes, copyright/year header updates, and review-comment
fixes to KernelWriter / KernelWriterAssembly.

7. CHANGELOG

Greatly improved MXFP4 GEMM performance when using HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT

How to use

Set UseSubtileImpl: 1 on a gfx950 MX-FP4 solution (see the new
subtile_mxfp4.yaml / mx32f4_tn.yaml for canonical configs). The path is
opt-in — non-MX and non-gfx950 kernels are unaffected.

Backwards compatibility / risk

All new behavior is gated on UseSubtileImpl=1 and gfx950. Existing
solutions on other architectures or non-MX paths are unchanged.
GroupGEMM + StreamK + MX is intentionally rejected for now (TODO).
New Origami yamls only add solutions; nothing existing is modified.

Test plan

New gtests + unit tests run automatically in CI (Tensilelite Python
unit suite, MXDataGen_test, MXScalePadding_test).
New end-to-end gfx950 GEMM and StreamK yamls are added to the common
test buckets.
Manual: run the gfx950 MXFP4 subtile suites
(pytest -k gfx950 after building Tensile, plus
tensilelite-client --yaml subtile_mxfp4.yaml for sanity).

Notes for reviewers

This branch was rebased onto current develop (post-[hipblaslt] Add support for gfx950 mxfp4 #6499) by skipping
the users/nakajee/gfx950_mx_rebase_merge history (which [hipblaslt] Add support for gfx950 mxfp4 #6499
squash-merged) and replaying only the subtile-specific work as a single
squashed commit. The actual code changes in this PR are limited to the
files listed above (24 added, 56 modified; ~+170k / −2.6k including
generated logic yamls).
The largest reviewable diffs are:
- Tensile/Components/SubtileBased{Kernel,LogicalScheduler,InstructionScheduler,InstructionEmitter}.py (new files)
- Tensile/KernelWriter.py, Tensile/KernelWriterAssembly.py
- Tensile/SolutionStructs/{Problem,Solution}.py
- Tensile/Components/{GlobalWriteBatch,StreamK}.py
- clients/common/include/testing_matmul.hpp
- client/src/DataInitialization.cpp
Description of all commits that were squashed for this feature branch:

Subtile implementation for gfx950 MX FP4

--- 272f88d: Add sample subtile impl ---
Author: brianshi brianshi@amd.com

--- 60ecede: GR Offset calculation (#1) ---
Author: sebvince 115461989+sebvince@users.noreply.github.com

--- be69c1d: Enable post-loop code generation, and add some subroutines ---
Author: b-shi brianshi@amd.com

--- 646d102: LR offset calculation (#2) ---
Author: sebvince 115461989+sebvince@users.noreply.github.com

--- 71f4bca: Add GR load emit logic, and misc fixes (#3) ---
Author: b-shi brianshi@amd.com

--- 1fd0db9: Emit LR + init ACCVGPR (#4) ---
Author: sebvince 115461989+sebvince@users.noreply.github.com

--- 9d406b9: Add loop and ptr update code ---
Author: b-shi brianshi@amd.com

--- b6127bc: Update GR/LR offset calculation to fully support 2x2, 1x4, 4x1 waveConfigs (#7) ---
Author: sebvince 115461989+sebvince@users.noreply.github.com

--- 89ec87c: Account for valuC macro value in SK WS store code ---
Author: b-shi brianshi@amd.com

--- 6edf53d: Rebase fix ---
Author: b-shi brianshi@amd.com

--- 34e79fc: Enable fp4 (#8) ---
Author: sebvince 115461989+sebvince@users.noreply.github.com

--- d5a5c57: [Tensilelite] Add MX FP4 scale offset computation for subtile-based kernel (#6) ---
Author: Archana Ramalingam 98564406+archana-ramalingam@users.noreply.github.com

--- 7a8a85a: Add lds buffer swap logic ---
Author: b-shi brianshi@amd.com

--- d24a8fe: Add optimized storeD code (#9) ---
Author: b-shi brianshi@amd.com

--- a45c20c: Fix MX scale tensor initialization: set forceDenorm=false in generateMXInput (#11) ---
Author: T.J. Alumbaugh T.J.Alumbaugh@amd.com

--- f945268: [Tensilelite] Enable the MX FP4 scale emit code in the subtile-based kernel (#10) ---
Author: Archana Ramalingam 98564406+archana-ramalingam@users.noreply.github.com

--- cf37df4: Use fixed value for SrdMXSA/B+2 (#14) ---
Author: Koji Nakajima 75698246+nakajee@users.noreply.github.com

--- f0c8dbc: Merge subtile_mx_f4_schedule to subtile_mx branch (#16) ---
Author: b-shi brianshi@amd.com

--- 543796f: Enable DU > 256, and reduce sgpr allocation (#18) ---
Author: b-shi brianshi@amd.com

--- c65bdb0: Add missing mxblock check for non-mx data types ---
Author: b-shi brianshi@amd.com

--- d64d226: Introduce UseSubtileImpl parameter (#20) ---
Author: b-shi brianshi@amd.com

Squash commits 20-35 from subtile_mx branch

--- e4780da: Enable FixSrd2 for A/B (#23) ---
Author: b-shi brianshi@amd.com

Enable FixSrd2 for A/B
Address comments from PR

--- e4c64a7: Add nt libs ---
Author: b-shi brianshi@amd.com

--- cd13ec1: [Tensilelite] Pad MX scale tensor dimensions for unaligned problem sizes (#21) ---
Author: Archana Ramalingam 98564406+archana-ramalingam@users.noreply.github.com

Add scale padding
Add tests
Remove redundant pre-swizzle path
Remove code from
conflict
Fix reverted mxdatagen path for tensile tests
Add diverse test cases for scale padding in MXScalePadding_test and subtile.yaml

Expanded test cases to include non-multiple-of-32, even non-multiple-of-16, and odd dimensions.

--- d87938f: Split subtile.yaml into subtile_bf16.yaml and subtile_mxfp4.yaml (#22) ---
Author: James Newling james.newling@gmail.com

Replace the 'monolithic' subtile.yaml with two focused test files.
All original test coverage is preserved. Two new FP4 groups added.

BF16 coverage (subtile_bf16.yaml, tests are essentially unchanged):

| Description | Dest | MIs | PGR | DU | SK | Sizes

--+--------------------+------+-----+-----+---------+-----+------
0 | BF16 TN main | b | 19 | 0 | 64 | 0,3 | 11
1 | BF16 TN large DU | b | 4 | 0 | 128,192 | 0,3 | 7
2 | BSS (f32 output) | s | 6 | 0 | 64 | 0,3 | 9
3 | BF16 bias | b | 2 | 0 | 64 | 0 | 1

FP4 coverage (subtile_mxfp4.yaml):

| Description | Dest | MIs | PGR | DU | SK | Sizes | Status

--+--------------------+------+-----+-----+-----+-----+-------+--------
0 | FP4 TN main | b | 15 | 0 | 256 | 0,3 | 23 | from original
1 | FP4 TN large DU | b | 4 | 0 | 512 | 0,3 | 13 | from original
2 | F4SS (f32 output) | s | 5 | 0 | 256 | 0,3 | 13 | from original
3 | FP4 bias | b | 2 | 0 | 256 | 0 | 1 | from original
4 | FP4 PGR=2 | b | 13 | 2 | 256 | 0 | 5 | new
5 | FP4 expanded MIWT | b | 24 | 0 | 256 | 0 | 5 | new
6 | PGR=2 WG 4x1/1x4 | | 6 | 2 | 256 | 0 | 1 | known failures (commented)

Run times on gfx950 (8x MI350X):

File | NEV=-1 | NEV=0
-------------------+--------+------
subtile_bf16.yaml | 23s | 23s
subtile_mxfp4.yaml | 37s | 40s

Where NEV is number of elements to validate. I (James) have checked these numbers,
and weirdly it is true that NEV=0 is a bit faster than NEV=-1 for mxfp4.

--- af04f0d: Dependency based instruction scheduling (#19) ---
Author: sebvince 115461989+sebvince@users.noreply.github.com

Revert to single partition
Start using dependencies
as is
start using separate EmittedModules
remove reduntant wait
Add _extractPathsFromBeforeDeps
Continue simplification
Simplifying
Add more rules
cleanup
Add fp4 test
fix test
Add tests
Remove after field on emittedmodule
Refactoring instructionSchedule
Add comments
cleanup modules vs ops
Refactoring print functions
Test cleanup
Add more tests
Replace subgroup by partition
Remove unused unroll param
Add high level notes
Simplify NLL and NGLL GR removal
Add some comments
Force instruction insertion if no slots available
Fix test after rebase
Move scale before A/B and track inflight count
Fine-grain vmcnt calculation
Separate counts for scaleA and B
Avoid using m0 update and buffer_lod on same MFMA slot to avoid scalar instruction serialization
Fix test
Add vmcnt test
Fix duplicated loads for 1x4 and 4x1
Fix placement in reverse order
Fix regression on PGR0
add fallback to numMFMA=1

--- 3ec902b: Add some 1x4 and 4x1 origami solutions ---
Author: b-shi brianshi@amd.com

--- c5000d3: Fix typo ---
Author: b-shi brianshi@amd.com

--- 226ed84: [hipblaslt] Refactor Srd2 calculation for useFixedSrd2 (#30) ---
Author: Koji Nakajima 75698246+nakajee@users.noreply.github.com

--- abf19d4: [Tensilelite] UseSubtileImpl: subtile-aligned edge check for store path (#29) ---
Author: b-shi brianshi@amd.com

[Tensilelite] UseSubtileImpl: subtile-aligned edge check, OOB guard, and refactoring

Replace Size%MT edge check with subtile-aligned check: NonEdge paired
store when trailing rows/cols are a multiple of the subtile block size
(waveGroupM rows for M, 16 cols for N). Non-last workgroups always take NonEdge.
Add per-wave OOB guard (subtileM32ValidBlocksSgpr / subtileN16ValidBlocksSgpr)
to skip stores outside valid M/N tile bounds in the NonEdge path.
Refactor duplicated OOB guard into _emitSubtileOobGuard helper; refactor
M/N guard SGPR computation into _emitSubtileMGuard / _emitSubtileNGuard.
Fix orphan scalar store blockIdxM (was tt0, now (tt0*MatrixInstM)//mBlockSize).
Add quick-exit and edge/non-edge header comments to generated ASM.

Add some bias tests, combine M/N guard to single routine
Add OOB check for C loads, update storeD unit tests to check OOB, simplify quick exit checks
Address more PR comments: add M group skip, and skip to store end. simplified loadC OOB mask

--- 637881a: Fix unit tests & remove legacy code for subtile interleaving (#33) ---
Author: sebvince 115461989+sebvince@users.noreply.github.com

Fix gr_lr_roundtrip test
Use non-interleaved version as ref code
Fix scheduler test
Removed legacy interleaved mode for LR/GR offset calculation

--- e9cb889: Fix MX FP4 scale buffer allocation and initialization for batched GEMM (#25) ---
Author: Archana Ramalingam 98564406+archana-ramalingam@users.noreply.github.com

Fix bacth count issue
Add batch count tests
Fix bacth count issue
Address PR review: clarify FP4-specific byte stride and add non-aligned batched tests

Updated comments on dataBatchBytes computation to clarify FP4 packing
assumption (2 elements/byte) and flag that non-FP4 block-scaling types
would require updating this conversion.
Added batched test cases with non-multiple-of-32 M/N dimensions:
FP4 DU=256: [48,48,2] and [33,65,2]
FP4 DU=512: [63,63,2]
BF16: [50,100,2]

--- a43247b: Update some test yamls (#31) ---
Author: b-shi brianshi@amd.com

--- e2f69c8: Add f4bs origami library with activation function support. Refactor sgpr allocation to reduce sgpr usage in post loop. Store code-path reorganization (#32) ---
Author: b-shi brianshi@amd.com

Free swap/localwritebase sgprs before post-loop
Defer sgpr allocation to remove holds in sgpr pool.

Add Origami library logic files for Cijk_Alik_Bljk_F4BS_MXA32_MXB32 (base, nta4, ntb4 variants).

Remove uneeded alignment and comment
Add more epilogue tests
Remove older origami library for f4bs
Reorder post-loop code blocks to after persistant loop Misc fixes
Fix build issues, relax longjump sgpr requirements
Fix GSU0 branch logic

--- 3f034bf: Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM (#35) ---
Author: Majedul Sujon 85503863+msujon-AMD@users.noreply.github.com

Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM

Add 6 new yaml files (F4HS, F4SS) across Origami, Origami_nta4, Origami_ntb4
Update F4BS yaml files: AssertSummationElementMultiple 32→256 for K%256 enforcement
Add ("F4", "F4", "H", "S") to _validGEMMTypes and _HPATypes in Problem.py

Add F4HS test cases to subtile_mxfp4.yaml

Add two new benchmark problem blocks for FP4→F16 (F4HS):

No-bias block: same wavetile and problem size coverage as F4SS
Bias epilogue block: BiasDataTypeList [s, h], relu/none activations

Add F4HS (FP4->Half) type support to Tensile client

Add TypedGemm_F4_H_S typedef and corresponding reference CPU solver
case so F4HS (FP4 input, Float16 output, Float compute) problems
can be validated by the benchmark client.

--- d0bc8fd: Rewrite subtile-based scheduler. Fix DU>64 & enable very large MT (#36) ---
Author: sebvince 115461989+sebvince@users.noreply.github.com

Initial support for DU>256
Renaming
add option to do DU=512 in the tests
blocked K-major for scale
Change scaleSet swap logic
Update print functions
Put scales after values for avoid race conditions
Fix tests
more test
tweak printschedule display
Add PGR2 in the yaml tests
Add new scaleGROp
comment out failing tests
Revert "comment out failing tests"

This reverts commit 1f5802c.

Draft new logical scheduler
Refactoring
Add more test on step1
Add more tests on step1
add bf16 320x320 test
reduce step1 code
Simplify step1 logic
validate some step1 test
Fix partition 2x2 test
more step1 test
320x320 BF16 test
Add test DU512 + partition2x2
Simplify step1 code
Add step2 tests
Fix multi-partition step2
Add step2 du512, 2x2 partition test
Use common algo for all numPartitions
Draft for step3 tests
remove useless tests
New GR algo (draft)
[Step3] Add more test
Iteration on GR
Display ordered GR list with granularities
More test
Add some comments
Disable by default debug logs
Getting rid of step naming
Start remove AnnotatedOp (still there in group pass)
Split dependency Ops
Add todo on place_GRs pass
Valid test_annotate_deps_1x1_partition_DU256
Test output looking better (still WIP)
single dep for LR tooo
Add remove_cross_deps pass
Fix bugs in dependency pass
insert_gr_lr_inc pass
Add group_lr_gr pass
Add emit pass
Quick port of instruction Emit code
Move emit function to separate file
Refactoring instructionEmitter
Port vgprTile tracking
Reworking second pass (WIP)
Display unrolling requirement
Unrolling check on 2nd pass
Generic validation for assign_vgpr pass
Fix unroll
Add inst schedule in standalone mode
Use lrGran for vgprTile size calculation
Fix bug in emit pass (missing depencency)
PreMFMA path + non-duplication scale load
missing globalReadLDSBufferSwap for GR_INC scales
add wairlr_sync on all LR->GR dep
add waitgr_sync op
remove_unnecessary_gr_deps
Change LR dispatch algo a bit to avoid too many waitgr_sync
Avoid duplicated loads in emitter
Fix bug on gr_emit code
GrInc pass. fix duplicated insertion for B
Fix missing LR_inc for SA/SB
preloop, NLL, NGLL
Simplify preloop
minor changes
Move unroll logic to scheduler
minor changes
Fix unroll id bug on NLL / NGLL
Disable post GRINC for now
Remove commented code
Handle 1x4, 4x1 gr read gran
Fix vmcnt computation
Use correct grCount mapping
Revert in emit logic on buffer_load for PGR0 needs
Add bf16 version in standalone test
Fix LR_Inc insertion on DU>64
Add subIterK/Partition comment to codegen
Fix issue in GrInc placement
Remove last_mt
Fix LR MT index bug with muli-partition
Disable early LDS size check when subtileImpl is on
Add pass to remove redundant LR deps + fixed issue on dependency annotation pass
Remove more LR redundant deps
Only insert wait_lr_sync on deps
Simple algo to select partition config
Remove HC value for partitions...
Take into account all inflight GR (all tensors)
Fix tests and regressions on gr counts
Fix grCount merge calculation
Better display of dependencies
Add remove_wait_lr_sync after grouping
Add temporary non reg file
Change merge logic on GR grouping pass
Fix non necessary wait_lr_sync
Downgrade some waitlr_sync to sync + added 384x256 no reg test
non reg test 320x320
Add larger MT
non reg test for fp4 256x256
Moving out instructionScheduler
Remove old scheduler
Renaming scheduler
Re-work test
Add larger MT test cases
Rename non-ref test
Re-add standalone mode
Refactor DepOp
Remove dead code
Remove MFMATileSize class
Remove from_til_info
Avoid redundant tensor list creation
Remove hardcode granularities in vgrpTile allocation pass. Simplify code.
Re-enable # PGR=2 WG 4x1/1x4, K > DU tests
Remove unused GRScaleOp
DepRef renaming
Get rid of MT string representation
Remove TODO
EmmitedModule simplication
Use explicit pass dependencies
Renaming LogicalScheduler
Remove old test_InterleavingScheduler.py file
Commenting failing test for now
Remove debug logs
Disable lds padding when using UseSubtileImpl

--- e8e8c09: Fix LR-GR dependency issue when DU>64 (#40) ---
Author: sebvince 115461989+sebvince@users.noreply.github.com

Fix and simplify logic for remove_unnecessary_lr_deps
Add new ref tests for 128x128x(128,64)

--- 4aa441a: Rebase fix ---
Author: b-shi brianshi@amd.com

--- 5ba911e: Skip rocRoller for FP4-A/FP4-B + pre-swizzled scale layout (#42) ---
Author: Archana Ramalingam 98564406+archana-ramalingam@users.noreply.github.com

--- 9c74998: Rebase fix ---
Author: b-shi brianshi@amd.com

--- 842b149: Addressed review comments for KernelWriter and KernelWriterAssembly ---
Author: Koji Nakajima knakajim@amd.com

--- dce43b1: Fix computeLoadSrd issue ---
Author: Brad Nemanich Brad.Nemanich@amd.com

--- c075bbf: Fix preSolution CPU re-sync regressing subtile_mxfp4.yaml ---
Author: Brad Nemanich Brad.Nemanich@amd.com

--- ced840f: Fix computeLoadSrd issue (#43) ---
Author: bnemanich brad.nemanich@amd.com

--- bc2f6dd: Small update for gfx950 mx tests + more - enable UseSubtileImpl for all gfx950 non subtile mx tests - skip all gfx950 mxfp8 - use MXScaleFormat=1 as default - set AssertSummationElementMultiple=256 for subtile mxfp4 - fix isSwizzledSubtile in computeLoadSrd ---
Author: Koji Nakajima knakajim@amd.com

--- 5c794b7: Fix gsuasb.yaml failures ---
Author: b-shi brianshi@amd.com

--- 727f8db: tensilelite: add solution reject conditions for UseSubtileImpl=1 (#38) ---
Author: Majedul Sujon 85503863+msujon-AMD@users.noreply.github.com

--- 8928fbb: Add more reject conditions for Subtile ---
Author: Koji Nakajima knakajim@amd.com

--- 6e63ab6: Fix kringshift test failures ---
Author: b-shi brianshi@amd.com

--- b3e9724: Update reject condtion for DepthU in subtile case. Plus, update DepthU setting for gfx950 mx test cases ---
Author: Koji Nakajima knakajim@amd.com

--- 5ab6009: Fix build errors ---
Author: Brad Nemanich Brad.Nemanich@amd.com

--- 4a4edca: Update more mxfp4 tensilelite test cases ---
Author: Koji Nakajima knakajim@amd.com

--- bbbc553: Update change log ---
Author: Brad Nemanich Brad.Nemanich@amd.com

--- 6476c04: Add more reject conditions for gfx950 subtile ---
Author: Koji Nakajima knakajim@amd.com

--- c5828c4: Updated gfx950 mxfp4 test cases - add StreamK setting - skip groupgemm tests for now (groupgemm does not support streamK) ---
Author: Koji Nakajima knakajim@amd.com

--- f1fc2f1: Fix hipblaslt build error of gfx950 ---
Author: Koji Nakajima knakajim@amd.com

--- 70cea1b: Updated subtile_mxfp4.yaml (add StreamK) ---
Author: Koji Nakajima knakajim@amd.com

--- c1c9b2a: Add uninit lsc,lsp, etc.. fields for subtile ---
Author: b-shi brianshi@amd.com

--- c0c1f72: Fixed merge error in testing_matmul.hpp ---
Author: Koji Nakajima knakajim@amd.com

--- 191e0cb: Add missed batch_count >1 changes ---
Author: archana-ramalingam Archana.Ramalingam@amd.com

--- 01c52f8: Addressed PR comments ---
Author: Koji Nakajima knakajim@amd.com

--- 4e89c91: Reduce mxfp4 test time ---
Author: Brad Nemanich Brad.Nemanich@amd.com

--- 3dac20f: Prevent overflow for wgmxcc sgpr allocation ---
Author: b-shi brianshi@amd.com

--- 18dec79: Fix error with problem type ---
Author: Brad Nemanich Brad.Nemanich@amd.com

--- 9e69ffd: Add a reject conditoin for gfx950 mx + non Subtile ---
Author: Koji Nakajima knakajim@amd.com

--- 0eed3ba: Add more valid GEMM types ---
Author: Brad Nemanich brad.nemanich@amd.com

--- 8b5514e: Fix missing b build error ---
Author: archana-ramalingam Archana.Ramalingam@amd.com

--- f981ff5: Fix 1250 tests ---
Author: Brad Nemanich brad.nemanich@amd.com

--- d1e69d9: Add more FP4 tests ---
Author: Brad Nemanich Brad.Nemanich@amd.com

--- e3a688f: Add MXScaleFormat: 1 to all gfx950 mx test yaml ---
Author: Koji Nakajima knakajim@amd.com

--- aaef3f5: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml ---
Author: Koji Nakajima knakajim@amd.com

--- 861ef8e: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml (nta4,ntb4) ---
Author: Koji Nakajima knakajim@amd.com

Co-authored-by: Archana Ramalingam Archana.Ramalingam@amd.com
Co-authored-by: Brad Nemanich Brad.Nemanich@amd.com
Co-authored-by: Brian Shi Brian.Shi@amd.com
Co-authored-by: James Newling James.Newling@amd.com
Co-authored-by: Koji Nakajima Koji.Nakajima@amd.com
Co-authored-by: Majedul Sujon Majed.Sujon@amd.com
Co-authored-by: Sebastien Vince Sebastien.Vince@amd.com
Co-authored-by: T.J. Alumbaugh T.J.Alumbaugh@amd.com

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…st_v2) Co-authored-by: Cursor <cursoragent@cursor.com>

codecov-commenter · 2026-05-02T18:32:11Z

Codecov Report

❌ Patch coverage is 53.33333% with 7 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...rary/src/amd_detail/rocblaslt/src/tensile_host.cpp	53.33%	2 Missing and 5 partials ⚠️

❌ Your project status has failed because the head coverage (77.83%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #7017      +/-   ##
===========================================
+ Coverage    64.72%   64.72%   +0.01%     
===========================================
  Files         2108     2108              
  Lines       328660   328675      +15     
  Branches     43136    43143       +7     
===========================================
+ Hits        212701   212733      +32     
+ Misses       98031    98016      -15     
+ Partials     17928    17926       -2

Flag	Coverage Δ		*Carryforward flag
hipBLAS	`90.65% <ø> (ø)`		Carriedforward from 9393616
hipBLASLt	`39.83% <53.33%> (+0.20%)`	⬆️
hipCUB	`82.21% <ø> (ø)`		Carriedforward from 9393616
hipDNN	`79.72% <ø> (ø)`		Carriedforward from 9393616
hipFFT	`54.95% <ø> (ø)`		Carriedforward from 9393616
hipRAND	`76.12% <ø> (ø)`		Carriedforward from 9393616
hipSOLVER	`69.24% <ø> (ø)`		Carriedforward from 9393616
hipSPARSE	`84.70% <ø> (ø)`		Carriedforward from 9393616
rocBLAS	`48.11% <ø> (ø)`		Carriedforward from 9393616
rocFFT	`48.18% <ø> (ø)`		Carriedforward from 9393616
rocPRIM	`38.99% <ø> (ø)`		Carriedforward from 9393616
rocRAND	`57.03% <ø> (ø)`		Carriedforward from 9393616
rocSOLVER	`77.83% <ø> (ø)`		Carriedforward from 9393616
rocSPARSE	`72.82% <ø> (ø)`		Carriedforward from 9393616

*This pull request uses carry forward flags. Click here to find out more.

Files with missing lines	Coverage Δ
...rary/src/amd_detail/rocblaslt/src/tensile_host.cpp	`39.98% <53.33%> (+0.68%)`	⬆️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…st_v2) Co-authored-by: Koji Nakajima <knakajim@amd.com> Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com> Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com> Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com> Co-authored-by: T.J. Alumbaugh <talumbau@users.noreply.github.com> Co-authored-by: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com>

…' into users/bnemanich/subtile_rebase_3

…R6934.

nakajee

All CI passed.
Looks good.

# Add gfx950 MXFP4 Subtile-based kernel implementation ## Summary This PR is a follow-up to #6499 ([hipblaslt] Add support for gfx950 mxfp4) and adds the **Subtile-based kernel implementation (`UseSubtileImpl=1`)** for hipBLASLt on **gfx950**. It introduces a new tile-decomposed code generation path optimized for **MXFP4** and **BF16** GEMMs, plus the solution-selection plumbing, validation, Origami logic yamls, and unit tests needed to make it production-usable. ## Motivation PR #6499 brought MX data type support online for gfx950, but the existing TensileLite codegen path leaves significant performance on the table for MXFP4-heavy workloads. The Subtile path restructures global-read / local-read / MFMA / store scheduling at a finer granularity, which **greatly improves MXFP4 GEMM performance when using `HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT`** (added to the hipBLASLt CHANGELOG). ## What's included ### 1. New Subtile-based kernel components (Tensile) New modules under `projects/hipblaslt/tensilelite/Tensile/Components/`: * `SubtileBasedKernel.py` (~1850 LOC) — entry point and orchestration of the subtile codegen path; replaces large portions of the standard prefetch / unroll / store flow when `UseSubtileImpl=1`. * `SubtileBasedLogicalScheduler.py` (~2415 LOC) — logical scheduler that builds the subtile-grained instruction graph (GR loads, LR offsets, MFMA tiles, scale loads, stores) from kernel parameters. * `SubtileBasedInstructionScheduler.py` (~433 LOC) — converts the logical schedule to an emit order respecting wave / register / hazard constraints. * `SubtileBasedInstructionEmitter.py` (~216 LOC) — instruction emission helpers shared by the subtile components. ### 2. Kernel writer / common changes * **`KernelWriter.py`**, **`KernelWriterAssembly.py`**: integration points for the subtile path — prefetch, GR offset calculation, LR offset calculation, post-loop, MFMA macro accounting, optimized `storeD`, LDS buffer swap, MX FP4 scale emit, `SrdMXSA/B+2` handling, sgpr allocation / overflow guards, computeLoadSrd fix. * **`SolutionStructs/Solution.py`**, **`SolutionStructs/Problem.py`**: introduces the `UseSubtileImpl` parameter, MX-related reject conditions for non-Subtile paths on gfx950, and additional valid GEMM type combinations for MX inputs. * **`Common/ValidParameters.py`**, **`Common/RequiredParameters.py`**, **`Common/GlobalParameters.py`**: `UseSubtileImpl` registration and defaults. * **`Components/StreamK.py`**: subtile-aware StreamK fixup (incl. import union with the `BufferLoadB32` cache-coherence change from #6837). * **`Components/GlobalWriteBatch.py`**: optimized global write batching for the subtile path (~670 LOC of changes). * **`Components/ComputeStoreVgprs.py`**, **`Components/LSU.py`**, **`Components/WorkGroupMappingAlgos.py`**, **`AsmStoreState.py`**, **`KernelWriterModules.py`**: minor adjustments needed by the subtile pipeline. ### 3. rocisa / host / client * **`rocisa/rocisa/include/container.hpp`**: helpers needed by the new emitter. * **`tensile_host.cpp`**, **`include/Tensile/TensorDescriptor.hpp`**: small fixups for the subtile path and gfx950 build. * **`client/include/DataInitialization.hpp`**, **`client/src/DataInitialization.cpp`**, **`client/src/Reference.cpp`**, **`client/src/ReferenceValidator.cpp`**, **`client/include/TypedId.hpp`**: MX scale init and reference paths used by the new tests. * **`clients/common/include/testing_matmul.hpp`**, **`clients/common/include/norm.hpp`**, **`clients/common/include/hipblaslt_datatype2string.hpp`**, **`clients/common/src/mxDataGen.cpp`**: wiring for batched (>1) testing and MX init. ### 4. Origami / solution selection (gfx950 MXFP4) New auto-tuned logic yamls under `projects/hipblaslt/library/.../Tensile/Logic/asm_full/gfx950/gfx950/Origami/` covering the FP4 SS / HS / BS variants in three layouts: * `Origami/` (default) * `Origami/Origami_nta4/` (no-transpose-A FP4) * `Origami/Origami_ntb4/` (no-transpose-B FP4) (9 new `gfx950_Cijk_Alik_Bljk_F4{SS,HS,BS}_MXA32_MXB32_*_UserArgs.yaml` files in total.) ### 5. New tests **End-to-end gfx950 GEMM yamls** in `Tensile/Tests/common/gemm/gfx950/`: * `subtile_bf16.yaml`, `subtile_mxfp4.yaml` * `mx32f4_tn.yaml`, `mx32f8_tn.yaml` * `mxfp4_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml` * `mxfp4_fp8_{fp32,bf16}_tn_act{,_groupgemm}.yaml` * `fp8_mxfp4_{fp32,bf16}_tn_act{,_groupgemm}.yaml` **StreamK + MX:** `Tensile/Tests/common/streamk/sk_mx32f4_quick.yaml`, `sk_mx32f8_quick.yaml`. **New unit tests** (`Tensile/Tests/unit/`): * `test_SubtileBasedLogicalScheduler.py` (~1735 LOC) * `test_SubtileBasedSchedulerRef.py` (~596 LOC) * `test_gr_lr_roundtrip.py` (~571 LOC) * `test_storeD_roundtrip.py` (~2420 LOC) * `test_graTileAssignment.py` (~354 LOC) * `test_lraTileAssignment.py` (~360 LOC) * `conftest.py`, `gpu_test_helpers.py` shared fixtures (~601 LOC) **New gtest:** `tensilelite/tests/MXScalePadding_test.cpp`. ### 6. Misc / hardening * Reject conditions: gfx950 MX + non-Subtile, DepthU constraints, GroupGEMM not yet supported with StreamK + MX, AssertSummationElementMultiple=256 for subtile MXFP4, missing-mxblock check for non-MX types. * Skip rocRoller for FP4-A/FP4-B with pre-swizzled scale layout (#42). * `forceDenorm=False` in `generateMXInput` (#11). * Several rebase fixes, copyright/year header updates, and review-comment fixes to `KernelWriter` / `KernelWriterAssembly`. ### 7. CHANGELOG Greatly improved MXFP4 GEMM performance when using HIPBLASLT_MATMUL_MATRIX_SCALE_BLK32_UE8M0_32_8_EXT ## How to use Set `UseSubtileImpl: 1` on a gfx950 MX-FP4 solution (see the new `subtile_mxfp4.yaml` / `mx32f4_tn.yaml` for canonical configs). The path is opt-in — non-MX and non-gfx950 kernels are unaffected. ## Backwards compatibility / risk * All new behavior is gated on `UseSubtileImpl=1` and gfx950. Existing solutions on other architectures or non-MX paths are unchanged. * `GroupGEMM + StreamK + MX` is intentionally rejected for now (TODO). * New Origami yamls only add solutions; nothing existing is modified. ## Test plan * New gtests + unit tests run automatically in CI (Tensilelite Python unit suite, `MXDataGen_test`, `MXScalePadding_test`). * New end-to-end gfx950 GEMM and StreamK yamls are added to the common test buckets. * Manual: run the gfx950 MXFP4 subtile suites (`pytest -k gfx950` after building Tensile, plus `tensilelite-client --yaml subtile_mxfp4.yaml` for sanity). ## Notes for reviewers * This branch was rebased onto current `develop` (post-#6499) by skipping the `users/nakajee/gfx950_mx_rebase_merge` history (which #6499 squash-merged) and replaying only the subtile-specific work as a single squashed commit. The actual code changes in this PR are limited to the files listed above (24 added, 56 modified; ~+170k / −2.6k including generated logic yamls). * The largest reviewable diffs are: * `Tensile/Components/SubtileBased{Kernel,LogicalScheduler,InstructionScheduler,InstructionEmitter}.py` (new files) * `Tensile/KernelWriter.py`, `Tensile/KernelWriterAssembly.py` * `Tensile/SolutionStructs/{Problem,Solution}.py` * `Tensile/Components/{GlobalWriteBatch,StreamK}.py` * `clients/common/include/testing_matmul.hpp` * `client/src/DataInitialization.cpp` * Description of all commits that were squashed for this feature branch: Subtile implementation for gfx950 MX FP4 --- 272f88d: Add sample subtile impl --- Author: brianshi <brianshi@amd.com> --- 60ecede: GR Offset calculation (#1) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- be69c1d: Enable post-loop code generation, and add some subroutines --- Author: b-shi <brianshi@amd.com> --- 646d102: LR offset calculation (#2) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- 71f4bca: Add GR load emit logic, and misc fixes (#3) --- Author: b-shi <brianshi@amd.com> --- 1fd0db9: Emit LR + init ACCVGPR (#4) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- 9d406b9: Add loop and ptr update code --- Author: b-shi <brianshi@amd.com> --- b6127bc: Update GR/LR offset calculation to fully support 2x2, 1x4, 4x1 waveConfigs (#7) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- 89ec87c: Account for valuC macro value in SK WS store code --- Author: b-shi <brianshi@amd.com> --- 6edf53d: Rebase fix --- Author: b-shi <brianshi@amd.com> --- 34e79fc: Enable fp4 (#8) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> --- d5a5c57: [Tensilelite] Add MX FP4 scale offset computation for subtile-based kernel (#6) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> --- 7a8a85a: Add lds buffer swap logic --- Author: b-shi <brianshi@amd.com> --- d24a8fe: Add optimized storeD code (#9) --- Author: b-shi <brianshi@amd.com> --- a45c20c: Fix MX scale tensor initialization: set forceDenorm=false in generateMXInput (#11) --- Author: T.J. Alumbaugh <T.J.Alumbaugh@amd.com> --- f945268: [Tensilelite] Enable the MX FP4 scale emit code in the subtile-based kernel (#10) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> --- cf37df4: Use fixed value for SrdMXSA/B+2 (#14) --- Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com> --- f0c8dbc: Merge subtile_mx_f4_schedule to subtile_mx branch (#16) --- Author: b-shi <brianshi@amd.com> --- 543796f: Enable DU > 256, and reduce sgpr allocation (#18) --- Author: b-shi <brianshi@amd.com> --- c65bdb0: Add missing mxblock check for non-mx data types --- Author: b-shi <brianshi@amd.com> --- d64d226: Introduce UseSubtileImpl parameter (#20) --- Author: b-shi <brianshi@amd.com> Squash commits 20-35 from subtile_mx branch --- e4780da: Enable FixSrd2 for A/B (#23) --- Author: b-shi <brianshi@amd.com> * Enable FixSrd2 for A/B * Address comments from PR --------- --- e4c64a7: Add nt libs --- Author: b-shi <brianshi@amd.com> --- cd13ec1: [Tensilelite] Pad MX scale tensor dimensions for unaligned problem sizes (#21) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> * Add scale padding * Add tests * Remove redundant pre-swizzle path * Remove code from conflict * Fix reverted mxdatagen path for tensile tests * Add diverse test cases for scale padding in MXScalePadding_test and subtile.yaml - Expanded test cases to include non-multiple-of-32, even non-multiple-of-16, and odd dimensions. --- d87938f: Split subtile.yaml into subtile_bf16.yaml and subtile_mxfp4.yaml (#22) --- Author: James Newling <james.newling@gmail.com> Replace the 'monolithic' subtile.yaml with two focused test files. All original test coverage is preserved. Two new FP4 groups added. BF16 coverage (subtile_bf16.yaml, tests are essentially unchanged): # | Description | Dest | MIs | PGR | DU | SK | Sizes --+--------------------+------+-----+-----+---------+-----+------ 0 | BF16 TN main | b | 19 | 0 | 64 | 0,3 | 11 1 | BF16 TN large DU | b | 4 | 0 | 128,192 | 0,3 | 7 2 | BSS (f32 output) | s | 6 | 0 | 64 | 0,3 | 9 3 | BF16 bias | b | 2 | 0 | 64 | 0 | 1 FP4 coverage (subtile_mxfp4.yaml): # | Description | Dest | MIs | PGR | DU | SK | Sizes | Status --+--------------------+------+-----+-----+-----+-----+-------+-------- 0 | FP4 TN main | b | 15 | 0 | 256 | 0,3 | 23 | from original 1 | FP4 TN large DU | b | 4 | 0 | 512 | 0,3 | 13 | from original 2 | F4SS (f32 output) | s | 5 | 0 | 256 | 0,3 | 13 | from original 3 | FP4 bias | b | 2 | 0 | 256 | 0 | 1 | from original 4 | FP4 PGR=2 | b | 13 | 2 | 256 | 0 | 5 | new 5 | FP4 expanded MIWT | b | 24 | 0 | 256 | 0 | 5 | new 6 | PGR=2 WG 4x1/1x4 | | 6 | 2 | 256 | 0 | 1 | known failures (commented) Run times on gfx950 (8x MI350X): File | NEV=-1 | NEV=0 -------------------+--------+------ subtile_bf16.yaml | 23s | 23s subtile_mxfp4.yaml | 37s | 40s Where NEV is number of elements to validate. I (James) have checked these numbers, and weirdly it is true that NEV=0 is a bit faster than NEV=-1 for mxfp4. --- af04f0d: Dependency based instruction scheduling (#19) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Revert to single partition * Start using dependencies * as is * start using separate EmittedModules * remove reduntant wait * Add _extractPathsFromBeforeDeps * Continue simplification * Simplifying * Add more rules * cleanup * Add fp4 test * fix test * Add tests * Remove after field on emittedmodule * Refactoring instructionSchedule * Add comments * cleanup modules vs ops * Refactoring print functions * Test cleanup * Add more tests * Replace subgroup by partition * Remove unused unroll param * Add high level notes * Simplify NLL and NGLL GR removal * Add some comments * Force instruction insertion if no slots available * Fix test after rebase * Move scale before A/B and track inflight count * Fine-grain vmcnt calculation * Separate counts for scaleA and B * Avoid using m0 update and buffer_lod on same MFMA slot to avoid scalar instruction serialization * Fix test * Add vmcnt test * Fix duplicated loads for 1x4 and 4x1 * Fix placement in reverse order * Fix regression on PGR0 * add fallback to numMFMA=1 --- 3ec902b: Add some 1x4 and 4x1 origami solutions --- Author: b-shi <brianshi@amd.com> --- c5000d3: Fix typo --- Author: b-shi <brianshi@amd.com> --- 226ed84: [hipblaslt] Refactor Srd2 calculation for useFixedSrd2 (#30) --- Author: Koji Nakajima <75698246+nakajee@users.noreply.github.com> --- abf19d4: [Tensilelite] UseSubtileImpl: subtile-aligned edge check for store path (#29) --- Author: b-shi <brianshi@amd.com> * [Tensilelite] UseSubtileImpl: subtile-aligned edge check, OOB guard, and refactoring - Replace Size%MT edge check with subtile-aligned check: NonEdge paired store when trailing rows/cols are a multiple of the subtile block size (waveGroupM rows for M, 16 cols for N). Non-last workgroups always take NonEdge. - Add per-wave OOB guard (subtileM32ValidBlocksSgpr / subtileN16ValidBlocksSgpr) to skip stores outside valid M/N tile bounds in the NonEdge path. - Refactor duplicated OOB guard into _emitSubtileOobGuard helper; refactor M/N guard SGPR computation into _emitSubtileMGuard / _emitSubtileNGuard. - Fix orphan scalar store blockIdxM (was tt0, now (tt0*MatrixInstM)//mBlockSize). - Add quick-exit and edge/non-edge header comments to generated ASM. * Add some bias tests, combine M/N guard to single routine * Add OOB check for C loads, update storeD unit tests to check OOB, simplify quick exit checks * Address more PR comments: add M group skip, and skip to store end. simplified loadC OOB mask --------- --- 637881a: Fix unit tests & remove legacy code for subtile interleaving (#33) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Fix gr_lr_roundtrip test * Use non-interleaved version as ref code * Fix scheduler test * Removed legacy interleaved mode for LR/GR offset calculation --- e9cb889: Fix MX FP4 scale buffer allocation and initialization for batched GEMM (#25) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> * Fix bacth count issue * Add batch count tests * Fix bacth count issue * Address PR review: clarify FP4-specific byte stride and add non-aligned batched tests - Updated comments on dataBatchBytes computation to clarify FP4 packing assumption (2 elements/byte) and flag that non-FP4 block-scaling types would require updating this conversion. - Added batched test cases with non-multiple-of-32 M/N dimensions: FP4 DU=256: [48,48,2] and [33,65,2] FP4 DU=512: [63,63,2] BF16: [50,100,2] --------- --- a43247b: Update some test yamls (#31) --- Author: b-shi <brianshi@amd.com> --- e2f69c8: Add f4bs origami library with activation function support. Refactor sgpr allocation to reduce sgpr usage in post loop. Store code-path reorganization (#32) --- Author: b-shi <brianshi@amd.com> * Free swap/localwritebase sgprs before post-loop * Defer sgpr allocation to remove holds in sgpr pool. Add Origami library logic files for Cijk_Alik_Bljk_F4BS_MXA32_MXB32 (base, nta4, ntb4 variants). * Remove uneeded alignment and comment * Add more epilogue tests * Remove older origami library for f4bs * Reorder post-loop code blocks to after persistant loop Misc fixes * Fix build issues, relax longjump sgpr requirements * Fix GSU0 branch logic --------- --- 3f034bf: Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM (#35) --- Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com> * Add F4HS and F4SS Origami library logic for FP4→F16 and FP4→F32 GEMM - Add 6 new yaml files (F4HS, F4SS) across Origami, Origami_nta4, Origami_ntb4 - Update F4BS yaml files: AssertSummationElementMultiple 32→256 for K%256 enforcement - Add ("F4", "F4", "H", "S") to _validGEMMTypes and _HPATypes in Problem.py * Add F4HS test cases to subtile_mxfp4.yaml Add two new benchmark problem blocks for FP4→F16 (F4HS): - No-bias block: same wavetile and problem size coverage as F4SS - Bias epilogue block: BiasDataTypeList [s, h], relu/none activations * Add F4HS (FP4->Half) type support to Tensile client Add TypedGemm_F4_H_S typedef and corresponding reference CPU solver case so F4HS (FP4 input, Float16 output, Float compute) problems can be validated by the benchmark client. --------- --- d0bc8fd: Rewrite subtile-based scheduler. Fix DU>64 & enable very large MT (#36) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Initial support for DU>256 * Renaming * add option to do DU=512 in the tests * blocked K-major for scale * Change scaleSet swap logic * Update print functions * Put scales after values for avoid race conditions * Fix tests * more test * tweak printschedule display * Add PGR2 in the yaml tests * Add new scaleGROp * comment out failing tests * Revert "comment out failing tests" This reverts commit 1f5802c. * Draft new logical scheduler * Refactoring * Add more test on step1 * Add more tests on step1 * add bf16 320x320 test * reduce step1 code * Simplify step1 logic * validate some step1 test * Fix partition 2x2 test * more step1 test * 320x320 BF16 test * Add test DU512 + partition2x2 * Simplify step1 code * Add step2 tests * Fix multi-partition step2 * Add step2 du512, 2x2 partition test * Use common algo for all numPartitions * Draft for step3 tests * remove useless tests * New GR algo (draft) * [Step3] Add more test * Iteration on GR * Display ordered GR list with granularities * More test * Add some comments * Disable by default debug logs * Getting rid of step naming * Start remove AnnotatedOp (still there in group pass) * Split dependency Ops * Add todo on place_GRs pass * Valid test_annotate_deps_1x1_partition_DU256 * Test output looking better (still WIP) * single dep for LR tooo * Add remove_cross_deps pass * Fix bugs in dependency pass * insert_gr_lr_inc pass * Add group_lr_gr pass * Add emit pass * Quick port of instruction Emit code * Move emit function to separate file * Refactoring instructionEmitter * Port vgprTile tracking * Reworking second pass (WIP) * Display unrolling requirement * Unrolling check on 2nd pass * Generic validation for assign_vgpr pass * Fix unroll * Add inst schedule in standalone mode * Use lrGran for vgprTile size calculation * Fix bug in emit pass (missing depencency) * PreMFMA path + non-duplication scale load * missing globalReadLDSBufferSwap for GR_INC scales * add wairlr_sync on all LR->GR dep * add waitgr_sync op * remove_unnecessary_gr_deps * Change LR dispatch algo a bit to avoid too many waitgr_sync * Avoid duplicated loads in emitter * Fix bug on gr_emit code * GrInc pass. fix duplicated insertion for B * Fix missing LR_inc for SA/SB * preloop, NLL, NGLL * Simplify preloop * minor changes * Move unroll logic to scheduler * minor changes * Fix unroll id bug on NLL / NGLL * Disable post GRINC for now * Remove commented code * Handle 1x4, 4x1 gr read gran * Fix vmcnt computation * Use correct grCount mapping * Revert in emit logic on buffer_load for PGR0 needs * Add bf16 version in standalone test * Fix LR_Inc insertion on DU>64 * Add subIterK/Partition comment to codegen * Fix issue in GrInc placement * Remove last_mt * Fix LR MT index bug with muli-partition * Disable early LDS size check when subtileImpl is on * Add pass to remove redundant LR deps + fixed issue on dependency annotation pass * Remove more LR redundant deps * Only insert wait_lr_sync on deps * Simple algo to select partition config * Remove HC value for partitions... * Take into account all inflight GR (all tensors) * Fix tests and regressions on gr counts * Fix grCount merge calculation * Better display of dependencies * Add remove_wait_lr_sync after grouping * Add temporary non reg file * Change merge logic on GR grouping pass * Fix non necessary wait_lr_sync * Downgrade some waitlr_sync to sync + added 384x256 no reg test * non reg test 320x320 * Add larger MT * non reg test for fp4 256x256 * Moving out instructionScheduler * Remove old scheduler * Renaming scheduler * Re-work test * Add larger MT test cases * Rename non-ref test * Re-add standalone mode * Refactor DepOp * Remove dead code * Remove MFMATileSize class * Remove from_til_info * Avoid redundant tensor list creation * Remove hardcode granularities in vgrpTile allocation pass. Simplify code. * Re-enable # PGR=2 WG 4x1/1x4, K > DU tests * Remove unused GRScaleOp * DepRef renaming * Get rid of MT string representation * Remove TODO * EmmitedModule simplication * Use explicit pass dependencies * Renaming LogicalScheduler * Remove old test_InterleavingScheduler.py file * Commenting failing test for now * Remove debug logs * Disable lds padding when using UseSubtileImpl --- e8e8c09: Fix LR-GR dependency issue when DU>64 (#40) --- Author: sebvince <115461989+sebvince@users.noreply.github.com> * Fix and simplify logic for remove_unnecessary_lr_deps * Add new ref tests for 128x128x(128,64) --- 4aa441a: Rebase fix --- Author: b-shi <brianshi@amd.com> --- 5ba911e: Skip rocRoller for FP4-A/FP4-B + pre-swizzled scale layout (#42) --- Author: Archana Ramalingam <98564406+archana-ramalingam@users.noreply.github.com> --- 9c74998: Rebase fix --- Author: b-shi <brianshi@amd.com> --- 842b149: Addressed review comments for KernelWriter and KernelWriterAssembly --- Author: Koji Nakajima <knakajim@amd.com> --- dce43b1: Fix computeLoadSrd issue --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- c075bbf: Fix preSolution CPU re-sync regressing subtile_mxfp4.yaml --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- ced840f: Fix computeLoadSrd issue (#43) --- Author: bnemanich <brad.nemanich@amd.com> --- bc2f6dd: Small update for gfx950 mx tests + more - enable UseSubtileImpl for all gfx950 non subtile mx tests - skip all gfx950 mxfp8 - use MXScaleFormat=1 as default - set AssertSummationElementMultiple=256 for subtile mxfp4 - fix isSwizzledSubtile in computeLoadSrd --- Author: Koji Nakajima <knakajim@amd.com> --- 5c794b7: Fix gsuasb.yaml failures --- Author: b-shi <brianshi@amd.com> --- 727f8db: tensilelite: add solution reject conditions for UseSubtileImpl=1 (#38) --- Author: Majedul Sujon <85503863+msujon-AMD@users.noreply.github.com> --- 8928fbb: Add more reject conditions for Subtile --- Author: Koji Nakajima <knakajim@amd.com> --- 6e63ab6: Fix kringshift test failures --- Author: b-shi <brianshi@amd.com> --- b3e9724: Update reject condtion for DepthU in subtile case. Plus, update DepthU setting for gfx950 mx test cases --- Author: Koji Nakajima <knakajim@amd.com> --- 5ab6009: Fix build errors --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 4a4edca: Update more mxfp4 tensilelite test cases --- Author: Koji Nakajima <knakajim@amd.com> --- bbbc553: Update change log --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 6476c04: Add more reject conditions for gfx950 subtile --- Author: Koji Nakajima <knakajim@amd.com> --- c5828c4: Updated gfx950 mxfp4 test cases - add StreamK setting - skip groupgemm tests for now (groupgemm does not support streamK) --- Author: Koji Nakajima <knakajim@amd.com> --- f1fc2f1: Fix hipblaslt build error of gfx950 --- Author: Koji Nakajima <knakajim@amd.com> --- 70cea1b: Updated subtile_mxfp4.yaml (add StreamK) --- Author: Koji Nakajima <knakajim@amd.com> --- c1c9b2a: Add uninit lsc,lsp, etc.. fields for subtile --- Author: b-shi <brianshi@amd.com> --- c0c1f72: Fixed merge error in testing_matmul.hpp --- Author: Koji Nakajima <knakajim@amd.com> --- 191e0cb: Add missed batch_count >1 changes --- Author: archana-ramalingam <Archana.Ramalingam@amd.com> --- 01c52f8: Addressed PR comments --- Author: Koji Nakajima <knakajim@amd.com> --- 4e89c91: Reduce mxfp4 test time --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 3dac20f: Prevent overflow for wgmxcc sgpr allocation --- Author: b-shi <brianshi@amd.com> --- 18dec79: Fix error with problem type --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- 9e69ffd: Add a reject conditoin for gfx950 mx + non Subtile --- Author: Koji Nakajima <knakajim@amd.com> --- 0eed3ba: Add more valid GEMM types --- Author: Brad Nemanich <brad.nemanich@amd.com> --- 8b5514e: Fix missing b build error --- Author: archana-ramalingam <Archana.Ramalingam@amd.com> --- f981ff5: Fix 1250 tests --- Author: Brad Nemanich <brad.nemanich@amd.com> --- d1e69d9: Add more FP4 tests --- Author: Brad Nemanich <Brad.Nemanich@amd.com> --- e3a688f: Add MXScaleFormat: 1 to all gfx950 mx test yaml --- Author: Koji Nakajima <knakajim@amd.com> --- aaef3f5: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml --- Author: Koji Nakajima <knakajim@amd.com> --- 861ef8e: Add DataTypeMXSA,B setting in gfx950 mxfp4 logic yaml (nta4,ntb4) --- Author: Koji Nakajima <knakajim@amd.com> Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com> Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com> Co-authored-by: Brian Shi <Brian.Shi@amd.com> Co-authored-by: James Newling <James.Newling@amd.com> Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com> Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com> Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com> Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com> Co-authored-by: Brad Nemanich <Brad.Nemanich@amd.com> Co-authored-by: Brian Shi <Brian.Shi@amd.com> Co-authored-by: James Newling <James.Newling@amd.com> Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com> Co-authored-by: Majedul Sujon <Majed.Sujon@amd.com> Co-authored-by: Sebastien Vince <Sebastien.Vince@amd.com> Co-authored-by: T.J. Alumbaugh <T.J.Alumbaugh@amd.com>

Squashed: subtile work (originally 56 commits on subtile_mx_rebase_te…

593f985

…st_v2) Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot added documentation project: hipblaslt project: hipsparselt ci:hipsparselt-fast labels May 2, 2026

assistant-librarian Bot added the organization: ROCm label May 2, 2026

nakajee added 2 commits May 2, 2026 15:47

Resolve merge confilce in CMakeLists.txt

2f65725

Fixed unit teset fail due to numpy

6859f35

b-shi and others added 3 commits May 2, 2026 21:49

Resolve merge confilce in CMakeLists.txt

f567173

Fixed unit teset fail due to numpy

9393616

bnemanich force-pushed the users/bnemanich/subtile_rebase_3 branch from 6859f35 to 9393616 Compare May 3, 2026 02:52

bnemanich marked this pull request as ready for review May 3, 2026 03:01

bnemanich requested review from a team as code owners May 3, 2026 03:01

nakajee added 3 commits May 3, 2026 05:10

Merge remote-tracking branch 'origin/users/bnemanich/subtile_rebase_3…

fc940f6

…' into users/bnemanich/subtile_rebase_3

Fixed merge issue. Removed debug code. Addressed review comments in P…

e958aa6

…R6934.

Addressed review comment in PR6934

32bc903

nakajee approved these changes May 3, 2026

View reviewed changes

bnemanich merged commit 1fb7e7d into develop May 3, 2026
38 checks passed

bnemanich deleted the users/bnemanich/subtile_rebase_3 branch May 3, 2026 17:30

amd-songpiao mentioned this pull request May 11, 2026

[hipblaslt] Fix tensilelite-tests build after MX API signature changes #7277

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add gfx950 MXFP4 and BF16 Subtile-based kernel implementation#7017

Add gfx950 MXFP4 and BF16 Subtile-based kernel implementation#7017
bnemanich merged 9 commits into
developfrom
users/bnemanich/subtile_rebase_3

bnemanich commented May 2, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented May 2, 2026 •

edited

Loading

Uh oh!

nakajee left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

bnemanich commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add gfx950 MXFP4 Subtile-based kernel implementation

Summary

Motivation

What's included

1. New Subtile-based kernel components (Tensile)

2. Kernel writer / common changes

3. rocisa / host / client

4. Origami / solution selection (gfx950 MXFP4)

5. New tests

6. Misc / hardening

7. CHANGELOG

How to use

Backwards compatibility / risk

Test plan

Notes for reviewers

| Description | Dest | MIs | PGR | DU | SK | Sizes

| Description | Dest | MIs | PGR | DU | SK | Sizes | Status

Submission Checklist

Uh oh!

codecov-commenter commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

nakajee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

bnemanich commented May 2, 2026 •

edited

Loading

codecov-commenter commented May 2, 2026 •

edited

Loading