Skip to content

Enable plr min#1208

Merged
b-shi merged 2 commits into
developfrom
enablePLR_min
Oct 21, 2025
Merged

Enable plr min#1208
b-shi merged 2 commits into
developfrom
enablePLR_min

Conversation

@adityalj
Copy link
Copy Markdown
Contributor

@adityalj adityalj commented Aug 14, 2025

Motivation

For High rate MFMAs in the newer architecture because of insufficient vgprs we can not unroll the loop. This leads to exposed memory operations reducing efficiency. Here we break A and B into two equal parts and divide loop into 4 subiters by reordering MFMAs such as A0B0 ,A0B1,A1B0,A1B1. This helps us to separate dependencies and hide exposed reads.

Technical Details

Rigid schedule is imposed on the instructions in order to get following desired code.

Prefetch phase : read A0,B0
MainLoop:
sub-iter0: MFMA (A0,B0) read(A1,B1)
sub-iter1: MFMA (A0,B1) write(A,B) Global Read(A,B)
sub-iter2: MFMA (A1,B0)
sub-iter3:MFMA (A1,B1) read(A0,B0)

NoLoadLoop:
sub-iter0: MFMA (A0,B0) read(A1,B1)
sub-iter1: MFMA (A0,B1) write(A,B)
sub-iter2: MFMA (A1,B0)
sub-iter3:MFMA (A1,B1) read(A0,B0)

Test Plan

Added Two .yaml files to test the changes.
Currently supported only for even wave tile sizes>2 f8 and f16 using double LDS buffer.

Test Result

For supported configurations , .s is generated with 4 sub-iters along with prefetch. 3-5% improvement in the performance

Submission Checklist

@github-actions github-actions Bot added the project: none Does not target any component label Aug 14, 2025
@eidenyoshida eidenyoshida removed the project: none Does not target any component label Aug 15, 2025
@adityalj adityalj force-pushed the enablePLR_min branch 2 times, most recently from 17d61e1 to dce6d69 Compare August 18, 2025 18:12
@adityalj adityalj marked this pull request as ready for review August 18, 2025 20:45
@adityalj adityalj requested a review from a team as a code owner August 18, 2025 20:45
Comment thread projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriter.py Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriter.py Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py
Comment thread projects/hipblaslt/tensilelite/Tensile/Tests/common/gemm/gfx950/f16_plr.yaml Outdated
@b-shi b-shi self-requested a review August 21, 2025 13:20
Comment thread projects/hipblaslt/tensilelite/Tensile/Tests/common/gemm/gfx950/f16_plr.yaml Outdated
@adityalj adityalj force-pushed the enablePLR_min branch 2 times, most recently from 1da0783 to 4fb5642 Compare August 21, 2025 19:03
Comment thread projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py Outdated
@hcman2
Copy link
Copy Markdown
Contributor

hcman2 commented Aug 22, 2025

Could you paste an example of before/after comparison of asm code?

Comment thread projects/hipblaslt/tensilelite/Tensile/Tests/common/gemm/gfx950/f16_plr.yaml Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py
@msujon-AMD
Copy link
Copy Markdown
Collaborator

We need test cases with tf32+plr (maybe, with complex datatypes later on).

Comment thread projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py
@b-shi b-shi force-pushed the enablePLR_min branch 2 times, most recently from f7d303a to 255ebd2 Compare October 14, 2025 16:31
@b-shi
Copy link
Copy Markdown
Contributor

b-shi commented Oct 14, 2025

We need test cases with tf32+plr (maybe, with complex datatypes later on).

I updated TF32 tox tests with PLR test cases.

@b-shi
Copy link
Copy Markdown
Contributor

b-shi commented Oct 14, 2025

Could you paste an example of before/after comparison of asm code?

Sure, is there a particular MT case you want to see?

Roughly speaking this change breaks up a single iteration to A = A0 + A1 and B = B0 + B1. Then we have

----- Preloop
LRA0
LRB0
----- Main loop
MFMA A0, B0, LRA1, LRB1
MFMA A1, B0
MFMA A0, B1
MFMA A1, B1, LRA0, LRB0

Previously we had something like:

----- Preloop
----- Main loop
MFMA A, B, LRA, LRB 

but the LRA, LRB are interleaved using some dependency checks between LR outputs and MFMA inputs.

@b-shi b-shi force-pushed the enablePLR_min branch 4 times, most recently from 7dc39e6 to 5ad79e0 Compare October 16, 2025 13:47
@hcman2
Copy link
Copy Markdown
Contributor

hcman2 commented Oct 20, 2025

Could you paste an example of before/after comparison of asm code?

Sure, is there a particular MT case you want to see?

I am OK about the changes. The example is to make it easier to understand what this PR do in the future.

Moreover, do you plan to extend this feature?
such as supporting DU > miK, or tile % 2 !=0?
Some of kernels should get benefit with this feature.

On the other hand, do we have any performance regression with this feature always on?

adityalj and others added 2 commits October 20, 2025 12:55
Reduce clutter of whitespace changes

enablePLR for even wave tile

Scaled down local read for even tile tile sizes

WIP
@b-shi
Copy link
Copy Markdown
Contributor

b-shi commented Oct 21, 2025

Could you paste an example of before/after comparison of asm code?

Sure, is there a particular MT case you want to see?

I am OK about the changes. The example is to make it easier to understand what this PR do in the future.

Moreover, do you plan to extend this feature? such as supporting DU > miK, or tile % 2 !=0? Some of kernels should get benefit with this feature.

On the other hand, do we have any performance regression with this feature always on?

Thanks @hcman2 for taking a look again! No perf regressions were observed these changes.

Yes, the case tile % 2 !=0 should be generalized and also possibly DU > miK. We will need to investigate the perf benefits for these since they would involve quite a bit of changes in the current scheduling code.

@b-shi b-shi merged commit 6c10b76 into develop Oct 21, 2025
27 of 28 checks passed
@b-shi b-shi deleted the enablePLR_min branch October 21, 2025 16:49
alex391a added a commit that referenced this pull request Feb 24, 2026
## Motivation
Ref: #1208. Enable plr-min optimization for spmm
## Technical Details

Plr-min only support no packing case. This PR enables plr-min for spmm
when `TransposeLDSMetadata` is True
## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
aosewski pushed a commit that referenced this pull request Feb 24, 2026
## Motivation
Ref: #1208. Enable plr-min optimization for spmm
## Technical Details

Plr-min only support no packing case. This PR enables plr-min for spmm
when `TransposeLDSMetadata` is True
## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
jovanau pushed a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
## Motivation
Ref: ROCm#1208. Enable plr-min optimization for spmm
## Technical Details

Plr-min only support no packing case. This PR enables plr-min for spmm
when `TransposeLDSMetadata` is True
## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants