Enable plr min #1208
Could you paste a before/after comparison of the asm code?
We need test cases with tf32+plr (and maybe with complex datatypes later on).
I updated TF32 tox tests with PLR test cases.
Sure, is there a particular MT case you want to see? Roughly speaking, this change breaks up a single iteration to Previously we had something like: but the LRA, LRB are interleaved using some dependency checks between LR outputs and MFMA inputs.
I am OK with the changes. The example is to make it easier to understand what this PR does in the future. Moreover, do you plan to extend this feature? On the other hand, do we see any performance regression with this feature always on?
Reduce clutter of whitespace changes
enablePLR for even wave tile
Scaled down local read for even tile tile sizes
WIP
Thanks @hcman2 for taking a look again! No perf regressions were observed with these changes. Yes, the case
## Motivation
Ref: #1208. Enable plr-min optimization for spmm.

## Technical Details
Plr-min only supports the no-packing case. This PR enables plr-min for spmm when `TransposeLDSMetadata` is True.

## Test Plan

## Test Result

## Submission Checklist
- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Motivation
With high-rate MFMAs on the newer architecture, there are not enough VGPRs to unroll the loop. This leaves memory operations exposed and reduces efficiency. Here we break A and B into two equal parts and divide the loop into 4 sub-iters by reordering the MFMAs as A0B0, A0B1, A1B0, A1B1. This helps us separate the dependencies and hide the exposed reads.
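To make the reordering concrete, here is a minimal Python sketch of the four sub-iterations. It assumes the split is along the tile rows of A and the tile columns of B, so that each sub-iteration fills one quadrant of the output tile; the helper names (`split_halves`, `sub_iter_order`, `tiled_outer_product`) are hypothetical and do not come from the PR.

```python
from itertools import product


def split_halves(rows):
    """Split a list into two equal halves; an even length is required."""
    if len(rows) % 2 != 0:
        raise ValueError("tile size must be even to split into two equal parts")
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]


def dot(u, v):
    """Plain dot product over the K dimension."""
    return sum(x * y for x, y in zip(u, v))


def sub_iter_order():
    """(A half, B half) index pairs in the order A0B0, A0B1, A1B0, A1B1."""
    return list(product(range(2), repeat=2))


def tiled_outer_product(a_rows, b_cols):
    """Compute C[i][j] = dot(a_rows[i], b_cols[j]), one quadrant per sub-iter."""
    a_halves = split_halves(a_rows)
    b_halves = split_halves(b_cols)
    half_m, half_n = len(a_halves[0]), len(b_halves[0])
    c = [[0] * len(b_cols) for _ in a_rows]
    for ai, bi in sub_iter_order():
        # Each (ai, bi) pair stands in for one group of MFMAs in the schedule.
        for i, a_row in enumerate(a_halves[ai]):
            for j, b_col in enumerate(b_halves[bi]):
                c[ai * half_m + i][bi * half_n + j] = dot(a_row, b_col)
    return c
```

Because every sub-iteration touches only one half of A and one half of B, the other halves are free to be reloaded in the meantime, which is the property the schedule relies on.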
Technical Details
A rigid schedule is imposed on the instructions in order to get the following desired code:
Prefetch phase: read(A0,B0)
MainLoop:
sub-iter0: MFMA(A0,B0) read(A1,B1)
sub-iter1: MFMA(A0,B1) write(A,B) Global Read(A,B)
sub-iter2: MFMA(A1,B0)
sub-iter3: MFMA(A1,B1) read(A0,B0)
NoLoadLoop:
sub-iter0: MFMA(A0,B0) read(A1,B1)
sub-iter1: MFMA(A0,B1) write(A,B)
sub-iter2: MFMA(A1,B0)
sub-iter3: MFMA(A1,B1) read(A0,B0)
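The dependency reasoning behind this schedule can be checked with a small model. The sketch below is illustrative only and is not how the kernel generator works: it assumes a local read issued in one sub-iter completes before the next sub-iter begins, and it tags each half-tile register with the loop iteration whose data it holds. The names `MAIN_LOOP` and `check_schedule` are hypothetical.

```python
# Registers loaded by the prefetch phase before the loop starts.
PREFETCH_READS = ("A0", "B0")

# One entry per sub-iter: (MFMA operands, local reads issued alongside it).
# Each read maps a register to an iteration offset: 0 = data for the current
# iteration, 1 = data for the next iteration.
MAIN_LOOP = [
    (("A0", "B0"), {"A1": 0, "B1": 0}),
    (("A0", "B1"), {}),
    (("A1", "B0"), {}),
    (("A1", "B1"), {"A0": 1, "B0": 1}),
]


def check_schedule(prefetch, loop, iterations=3):
    """Raise ValueError if an MFMA would consume stale or clobbered data."""
    version = {name: 0 for name in prefetch}
    for k in range(iterations):
        for idx, (operands, reads) in enumerate(loop):
            for name in operands:
                if version.get(name) != k:
                    raise ValueError(
                        f"iter {k} sub-iter{idx}: {name} holds "
                        f"{version.get(name)}, expected {k}"
                    )
            for name, offset in reads.items():
                if name in operands:
                    raise ValueError(
                        f"iter {k} sub-iter{idx}: read overwrites {name} in use"
                    )
                # Modelling assumption: the read lands before the next sub-iter.
                version[name] = k + offset
    return True
```

Under this model, the schedule above passes, while moving read(A0,B0) earlier than sub-iter3 fails because A0 or B0 is still needed by a pending MFMA.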
Test Plan
Added two .yaml files to test the changes.
Currently this is supported only for even wave tile sizes >2, for f8 and f16, using a double LDS buffer.
Test Result
For supported configurations, the generated .s contains 4 sub-iters along with the prefetch. Performance improves by 3-5%.
Submission Checklist