[MIOpen] Implement a couple of improvements for bwd and wrw CK kernel selection#1426
Conversation
Prior to this change, splitk values could be the power of 2 values in the range 1 to 128, but CK has a special splitk autodeduce value of -1 that in some cases can provide better performance. This commit adds -1 to the set of splitk CK values, which improves the overall performance of CK solvers.
…grp wrw MIOpen was setting the CK workspace size to zero if some instances did not require a workspace for a particular shape, but this precluded some CK instances that might require a workspace from even being considered. And in some cases, the CK solver performance was reduced because the CK instance that needed the workspace might actually have been the fastest instance for that solver. This commit changes MIOpen to get the workspace size from CK for the grp wrw, grp bwd, and 3D grp wrw solvers. This will boost CK performance for bwd and wrw, which will be needed for bwd and wrw fusions, since we need to make CK as performant as possible (since the fusions will use only CK instances). Note that the other CK solvers can be converted to get the workspace size directly from CK in the future.
Use 0 as the default and then use the ck workspace size value if it is non-zero, otherwise we fallback to GetCKAlphaBetaWorkspace(). Note that this can be cleaned up in the future when we convert all the CK solvers to get the workspace size from CK.
This change fixes some broken unit tests, and addresses an issue with the resulting workspace size not being reported correctly.
This reverts commit 6dc0b7f.
|
CI passed but I noticed the following errors when tuning: This will need to be addressed... |
The check implied that a particular solver had to use the same workspace size for all the different kernels and parameters, but CK can have some kernels that have non-zero workspace size while others have zero workspace size for the same problem.
I think the check was invalid since it implied that the same solver could not have different workspace sizes based on the kernel and tuning parameters, but we know this is not true for CK. I've removed this code, and it seems to be fine but I will let it run through CI again. I will also add @cderb to this review just in case he sees any gotchas here. |
|
The performance of CK when compared to All solvers went from ~18% worse to 13-14% worse when run on the Fremont WRW shapes. See this excel sheet for details |
|
@JonathanLichtnerAMD Would this be significant enough to list as an optimization in the changelog? |
|
Has the CK work this depends on been merged? |
@BradPepersAMD Yes, the initial work was merged, but when we tested the tuna gold shapes we discovered 11 shapes that resulted in failed results due to accuracy of some of the new CK kernels. Bartek is fixing these, and once these are in, then I will retest with the tuna gold shapes. |
@amd-jnovotny Yes, I've added an entry for this in the changelog. |
|
Thanks. Changelog looks okay. |
12a456c
This commit changes the default split_k settings to start at index 0 and split_k 1 rather than -1 (AutoDeduce). The problem with starting at -1 meant that this was the default config, and at the end of the generic search MIOpen runs a final run on the default config *but* this might not be a valid config for CK (not applicable). This would cause the the tuning to fail and would also cause a failure in immediate mode. The solution is to restore the defaults to their original values (index 0, split_k 1) and then iterate through the split_k values from 1, ..., 128, -1 and then wrap to the next index.
|
Ran the 5K TunaGold shapes for each direction (fwd, bwd, and wrw). Overall, TunaGold wrw shapes showed a 1% improvement versus 3-4% for the Fremont wrw shapes. Bwd and fwd shapes showed no change in performance as expected. This can now be merged. |
[MIOpen] Implement a couple of improvements for bwd and wrw CK kernel selection (#1426) This PR includes a couple of changes to improve the selection of CK kernels in MIOpen. These include 1. _Add -1 to the list of splitk values to iterate over for wrw._ Prior to this change, splitk values could be the power of 2 values in the range 1 to 128, but CK has a special splitk autodeduce value of -1 that in some cases can provide better performance. This PR adds -1 to the set of splitk CK values, which improves the overall performance of CK solvers. 4. _Get the workspace size from CK for grp wrw, grp bwd, and 3D grp wrw._ MIOpen was setting the CK workspace size to zero if some instances did not require a workspace for a particular shape, but this precluded some CK instances that might require a workspace from even being considered. And in some cases, the CK solver performance was reduced because the CK instance that needed the workspace might actually have been the fastest instance for that solver. This PR changes MIOpen to get the workspace size from CK for the grp wrw, grp bwd, and 3D grp wrw solvers, and no longer excludes kernels that require a non-zero workspace. ## Test Plan Run a large assortment of fwd, bwd, and wrw shapes using *all* solvers and ensure there is an improvement in the CK-only solvers when compared to all solvers for wrw and no regressions for fwd and bwd. ## Test Result - [x] Performance test results on Fremont shapes indicate CK-solver performance has improved for wrw - [x] CI passes - [x] Run a large set of of the Tuna Gold shapes - [x] Also need to wait for CK commit 29446da1d57170a8bda47a452113ef7e44363a04 to be staged/promoted - [x] Investigate hanging kernel issue on bwd path ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Motivation
Backwards and backwards-weighted fusions will only use CK kernels as part of the fusion, but there is currently a performance hit for bwd and wrw if MIOpen restricts itself to just using CK solvers rather than all solvers. We need to reduce this hit, so that it does not cancel out the gains that can be made by doing fusions.
Technical Details
There are a couple of changes that MIOpen can make to improve our selection of CK kernels. These include
Add -1 to the list of splitk values to iterate over for wrw.
Prior to this change, splitk values could be the power of 2 values in the range 1 to 128, but CK has a special splitk autodeduce value of -1 that in some cases can provide better performance. This PR adds -1 to the set of splitk CK values, which improves the overall performance of CK solvers.
Get the workspace size from CK for grp wrw, grp bwd, and 3D grp wrw.
MIOpen was setting the CK workspace size to zero if some instances did not require a workspace for a particular shape, but this precluded some CK instances that might require a workspace from even being considered. And in some cases, the CK solver performance was reduced because the CK instance that needed the workspace might actually have been the fastest instance for that solver.
This PR changes MIOpen to get the workspace size from CK for the grp wrw, grp bwd, and 3D grp wrw solvers, and no longer excludes kernels that require a non-zero workspace.
Test Plan
Run Fremont shapes using all solvers and ensure there is an improvement in the CK-only solvers when compared to all solvers.
Test Result
Submission Checklist