Enable plr min by adityalj · Pull Request #1208 · ROCm/rocm-libraries

adityalj · 2025-08-14T18:17:14Z

Motivation

For High rate MFMAs in the newer architecture because of insufficient vgprs we can not unroll the loop. This leads to exposed memory operations reducing efficiency. Here we break A and B into two equal parts and divide loop into 4 subiters by reordering MFMAs such as A0B0 ,A0B1,A1B0,A1B1. This helps us to separate dependencies and hide exposed reads.

Technical Details

Rigid schedule is imposed on the instructions in order to get following desired code.

Prefetch phase : read A0,B0
MainLoop:
sub-iter0: MFMA (A0,B0) read(A1,B1)
sub-iter1: MFMA (A0,B1) write(A,B) Global Read(A,B)
sub-iter2: MFMA (A1,B0)
sub-iter3:MFMA (A1,B1) read(A0,B0)

NoLoadLoop:
sub-iter0: MFMA (A0,B0) read(A1,B1)
sub-iter1: MFMA (A0,B1) write(A,B)
sub-iter2: MFMA (A1,B0)
sub-iter3:MFMA (A1,B1) read(A0,B0)

Test Plan

Added Two .yaml files to test the changes.
Currently supported only for even wave tile sizes>2 f8 and f16 using double LDS buffer.

Test Result

For supported configurations , .s is generated with 4 sub-iters along with prefetch. 3-5% improvement in the performance

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

hcman2 · 2025-08-22T00:38:39Z

Could you paste an example of before/after comparison of asm code?

msujon-AMD · 2025-08-22T20:36:55Z

We need test cases with tf32+plr (maybe, with complex datatypes later on).

b-shi · 2025-10-14T16:33:26Z

We need test cases with tf32+plr (maybe, with complex datatypes later on).

I updated TF32 tox tests with PLR test cases.

b-shi · 2025-10-14T16:38:47Z

Could you paste an example of before/after comparison of asm code?

Sure, is there a particular MT case you want to see?

Roughly speaking this change breaks up a single iteration to A = A0 + A1 and B = B0 + B1. Then we have

----- Preloop
LRA0
LRB0
----- Main loop
MFMA A0, B0, LRA1, LRB1
MFMA A1, B0
MFMA A0, B1
MFMA A1, B1, LRA0, LRB0

Previously we had something like:

----- Preloop
----- Main loop
MFMA A, B, LRA, LRB

but the LRA, LRB are interleaved using some dependency checks between LR outputs and MFMA inputs.

hcman2 · 2025-10-20T01:00:59Z

Could you paste an example of before/after comparison of asm code?

Sure, is there a particular MT case you want to see?

I am OK about the changes. The example is to make it easier to understand what this PR do in the future.

Moreover, do you plan to extend this feature?
such as supporting DU > miK, or tile % 2 !=0?
Some of kernels should get benefit with this feature.

On the other hand, do we have any performance regression with this feature always on?

Reduce clutter of whitespace changes enablePLR for even wave tile Scaled down local read for even tile tile sizes WIP

b-shi · 2025-10-21T01:10:43Z

Could you paste an example of before/after comparison of asm code?

Sure, is there a particular MT case you want to see?

I am OK about the changes. The example is to make it easier to understand what this PR do in the future.

Moreover, do you plan to extend this feature? such as supporting DU > miK, or tile % 2 !=0? Some of kernels should get benefit with this feature.

On the other hand, do we have any performance regression with this feature always on?

Thanks @hcman2 for taking a look again! No perf regressions were observed these changes.

Yes, the case tile % 2 !=0 should be generalized and also possibly DU > miK. We will need to investigate the perf benefits for these since they would involve quite a bit of changes in the current scheduling code.

## Motivation Ref: #1208. Enable plr-min optimization for spmm ## Technical Details Plr-min only support no packing case. This PR enables plr-min for spmm when `TransposeLDSMetadata` is True ## Test Plan  ## Test Result  ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Motivation Ref: ROCm#1208. Enable plr-min optimization for spmm ## Technical Details Plr-min only support no packing case. This PR enables plr-min for spmm when `TransposeLDSMetadata` is True ## Test Plan  ## Test Result  ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions Bot added the project: hipblaslt label Aug 14, 2025

assistant-librarian Bot added the organization: ROCm label Aug 14, 2025

adityalj force-pushed the enablePLR_min branch from d83d1d2 to 126c770 Compare August 14, 2025 21:03

github-actions Bot added the project: none Does not target any component label Aug 14, 2025

adityalj force-pushed the enablePLR_min branch from 126c770 to 901d185 Compare August 14, 2025 21:05

eidenyoshida removed the project: none Does not target any component label Aug 15, 2025

adityalj force-pushed the enablePLR_min branch 2 times, most recently from 17d61e1 to dce6d69 Compare August 18, 2025 18:12

adityalj marked this pull request as ready for review August 18, 2025 20:45

adityalj requested a review from a team as a code owner August 18, 2025 20:45

adityalj force-pushed the enablePLR_min branch from d91c7c5 to cb1eae4 Compare August 19, 2025 00:51

nakajee reviewed Aug 19, 2025

View reviewed changes

hcman2 reviewed Aug 20, 2025

View reviewed changes

Comment thread projects/hipblaslt/tensilelite/Tensile/Tests/common/gemm/gfx950/f16_plr.yaml Outdated

b-shi self-requested a review August 21, 2025 13:20

b-shi reviewed Aug 21, 2025

View reviewed changes

Comment thread projects/hipblaslt/tensilelite/Tensile/Tests/common/gemm/gfx950/f16_plr.yaml Outdated

adityalj force-pushed the enablePLR_min branch 2 times, most recently from 1da0783 to 4fb5642 Compare August 21, 2025 19:03

hcman2 reviewed Aug 22, 2025

View reviewed changes

Comment thread projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py Outdated

nakajee reviewed Aug 22, 2025

View reviewed changes

Comment thread projects/hipblaslt/tensilelite/Tensile/Tests/common/gemm/gfx950/f16_plr.yaml Outdated

Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py

adityalj force-pushed the enablePLR_min branch from 4fb5642 to e768800 Compare August 22, 2025 20:44

adityalj commented Aug 22, 2025

View reviewed changes

Comment thread projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py

b-shi force-pushed the enablePLR_min branch 2 times, most recently from f7d303a to 255ebd2 Compare October 14, 2025 16:31

b-shi force-pushed the enablePLR_min branch from 255ebd2 to c429945 Compare October 14, 2025 17:23

msujon-AMD reviewed Oct 14, 2025

View reviewed changes

Comment thread projects/hipblaslt/tensilelite/Tensile/Tests/common/gemm/gfx950/plr_zero.yaml

b-shi force-pushed the enablePLR_min branch from c429945 to bf2f8b4 Compare October 14, 2025 18:07

b-shi force-pushed the enablePLR_min branch 4 times, most recently from 7dc39e6 to 5ad79e0 Compare October 16, 2025 13:47

adityalj and others added 2 commits October 20, 2025 12:55

Enable PLR when DepthU=MI_K

9fd396c

Reduce clutter of whitespace changes enablePLR for even wave tile Scaled down local read for even tile tile sizes WIP

Fix CI failures, add more tests, enable for ldsTR

e668710

b-shi force-pushed the enablePLR_min branch from 5ad79e0 to e668710 Compare October 20, 2025 18:28

msujon-AMD approved these changes Oct 21, 2025

View reviewed changes

b-shi merged commit 6c10b76 into develop Oct 21, 2025
27 of 28 checks passed

b-shi deleted the enablePLR_min branch October 21, 2025 16:49

alex391a mentioned this pull request Feb 6, 2026

[Tensilelite][Sparse] Enable plr min for spmm #4364

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable plr min#1208

Enable plr min#1208
b-shi merged 2 commits into
developfrom
enablePLR_min

adityalj commented Aug 14, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hcman2 commented Aug 22, 2025

Uh oh!

Uh oh!

Uh oh!

msujon-AMD commented Aug 22, 2025

Uh oh!

Uh oh!

b-shi commented Oct 14, 2025

Uh oh!

b-shi commented Oct 14, 2025

Uh oh!

Uh oh!

hcman2 commented Oct 20, 2025 •

edited

Loading

Uh oh!

b-shi commented Oct 21, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

adityalj commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hcman2 commented Aug 22, 2025

Uh oh!

Uh oh!

Uh oh!

msujon-AMD commented Aug 22, 2025

Uh oh!

Uh oh!

b-shi commented Oct 14, 2025

Uh oh!

b-shi commented Oct 14, 2025

Uh oh!

Uh oh!

hcman2 commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

b-shi commented Oct 21, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

adityalj commented Aug 14, 2025 •

edited

Loading

hcman2 commented Oct 20, 2025 •

edited

Loading