Skip to content

[MIOpen] Implement a couple of improvements for bwd and wrw CK kernel selection#1426

Merged
JonathanLichtnerAMD merged 20 commits into
developfrom
users/jlichtne/optimize-ck-solvers
Oct 1, 2025
Merged

[MIOpen] Implement a couple of improvements for bwd and wrw CK kernel selection#1426
JonathanLichtnerAMD merged 20 commits into
developfrom
users/jlichtne/optimize-ck-solvers

Conversation

@JonathanLichtnerAMD
Copy link
Copy Markdown
Contributor

@JonathanLichtnerAMD JonathanLichtnerAMD commented Sep 2, 2025

Motivation

Backwards and backwards-weighted fusions will only use CK kernels as part of the fusion, but there is currently a performance hit for bwd and wrw if MIOpen restricts itself to just using CK solvers rather than all solvers. We need to reduce this hit, so that it does not cancel out the gains that can be made by doing fusions.

Technical Details

There are a couple of changes that MIOpen can make to improve our selection of CK kernels. These include

  1. Add -1 to the list of splitk values to iterate over for wrw.

    Prior to this change, splitk values could be the power of 2 values in the range 1 to 128, but CK has a special splitk autodeduce value of -1 that in some cases can provide better performance. This PR adds -1 to the set of splitk CK values, which improves the overall performance of CK solvers.

  2. Get the workspace size from CK for grp wrw, grp bwd, and 3D grp wrw.

    MIOpen was setting the CK workspace size to zero if some instances did not require a workspace for a particular shape, but this precluded some CK instances that might require a workspace from even being considered. And in some cases, the CK solver performance was reduced because the CK instance that needed the workspace might actually have been the fastest instance for that solver.

    This PR changes MIOpen to get the workspace size from CK for the grp wrw, grp bwd, and 3D grp wrw solvers, and no longer excludes kernels that require a non-zero workspace.

  • NOTE: this cannot be merged until the associated CK change is staged.

Test Plan

Run Fremont shapes using all solvers and ensure there is an improvement in the CK-only solvers when compared to all solvers.

Test Result

  • Performance test results on Fremont shapes indicate CK-solver performance has improved for wrw
  • CI passes
  • Run a large set of of the Tuna Gold shapes
  • Also need to wait for CK commit 29446da to be staged/promoted
  • Investigate hanging kernel issue on bwd path

Submission Checklist

Prior to this change, splitk values could be the power of 2 values in
the range 1 to 128, but CK has a special splitk autodeduce value of -1
that in some cases can provide better performance.  This commit adds
-1 to the set of splitk CK values, which improves the overall
performance of CK solvers.
…grp wrw

MIOpen was setting the CK workspace size to zero if some instances did
not require a workspace for a particular shape, but this precluded
some CK instances that might require a workspace from even being
considered.  And in some cases, the CK solver performance was reduced
because the CK instance that needed the workspace might actually have
been the fastest instance for that solver.

This commit changes MIOpen to get the workspace size from CK for the
grp wrw, grp bwd, and 3D grp wrw solvers.  This will boost CK
performance for bwd and wrw, which will be needed for bwd and wrw
fusions, since we need to make CK as performant as possible (since the
fusions will use only CK instances).

Note that the other CK solvers can be converted to get the workspace
size directly from CK in the future.
Use 0 as the default and then use the ck workspace size value if it is
non-zero, otherwise we fallback to GetCKAlphaBetaWorkspace().

Note that this can be cleaned up in the future when we convert all the CK
solvers to get the workspace size from CK.
This change fixes some broken unit tests, and addresses an issue with
the resulting workspace size not being reported correctly.
@JonathanLichtnerAMD
Copy link
Copy Markdown
Contributor Author

CI passed but I noticed the following errors when tuning:

MIOpen(HIP): Error [GenericSearch] #1208 (1223) Workspace size should not depend on PerformanceConfig: 0 != 110592
MIOpen(HIP): Error [GenericSearch] #1208 (1223)  Failed rc=-2

This will need to be addressed...

The check implied that a particular solver had to use the same
workspace size for all the different kernels and parameters, but CK
can have some kernels that have non-zero workspace size while others
have zero workspace size for the same problem.
@JonathanLichtnerAMD
Copy link
Copy Markdown
Contributor Author

CI passed but I noticed the following errors when tuning:

MIOpen(HIP): Error [GenericSearch] #1208 (1223) Workspace size should not depend on PerformanceConfig: 0 != 110592
MIOpen(HIP): Error [GenericSearch] #1208 (1223)  Failed rc=-2

This will need to be addressed...

I think the check was invalid since it implied that the same solver could not have different workspace sizes based on the kernel and tuning parameters, but we know this is not true for CK. I've removed this code, and it seems to be fine but I will let it run through CI again. I will also add @cderb to this review just in case he sees any gotchas here.

iq136boy
iq136boy previously approved these changes Sep 8, 2025
@JonathanLichtnerAMD
Copy link
Copy Markdown
Contributor Author

The performance of CK when compared to All solvers went from ~18% worse to 13-14% worse when run on the Fremont WRW shapes. See this excel sheet for details

@JonathanLichtnerAMD JonathanLichtnerAMD changed the title [MIOpen] Implement a couple of improvments for bwd and wrw CK kernel selection [MIOpen] Implement a couple of improvements for bwd and wrw CK kernel selection Sep 8, 2025
@amd-jnovotny
Copy link
Copy Markdown
Contributor

@JonathanLichtnerAMD Would this be significant enough to list as an optimization in the changelog?

scerzh
scerzh previously approved these changes Sep 9, 2025
@BradPepersAMD
Copy link
Copy Markdown
Contributor

Has the CK work this depends on been merged?

@JonathanLichtnerAMD
Copy link
Copy Markdown
Contributor Author

Has the CK work this depends on been merged?

@BradPepersAMD Yes, the initial work was merged, but when we tested the tuna gold shapes we discovered 11 shapes that resulted in failed results due to accuracy of some of the new CK kernels. Bartek is fixing these, and once these are in, then I will retest with the tuna gold shapes.

@JonathanLichtnerAMD JonathanLichtnerAMD requested a review from a team as a code owner September 16, 2025 01:08
@JonathanLichtnerAMD
Copy link
Copy Markdown
Contributor Author

@JonathanLichtnerAMD Would this be significant enough to list as an optimization in the changelog?

@amd-jnovotny Yes, I've added an entry for this in the changelog.

@amd-jnovotny
Copy link
Copy Markdown
Contributor

Thanks. Changelog looks okay.

@JonathanLichtnerAMD JonathanLichtnerAMD dismissed stale reviews from scerzh and iq136boy via 12a456c September 23, 2025 05:48
This commit changes the default split_k settings to start at index 0
and split_k 1 rather than -1 (AutoDeduce).  The problem with starting
at -1 meant that this was the default config, and at the end of the
generic search MIOpen runs a final run on the default config *but*
this might not be a valid config for CK (not applicable).  This would
cause the the tuning to fail and would also cause a failure in
immediate mode.

The solution is to restore the defaults to their original
values (index 0, split_k 1) and then iterate through the split_k
values from 1, ..., 128, -1 and then wrap to the next index.
@JonathanLichtnerAMD
Copy link
Copy Markdown
Contributor Author

Ran the 5K TunaGold shapes for each direction (fwd, bwd, and wrw). Overall, TunaGold wrw shapes showed a 1% improvement versus 3-4% for the Fremont wrw shapes. Bwd and fwd shapes showed no change in performance as expected.

This can now be merged.

@JonathanLichtnerAMD JonathanLichtnerAMD merged commit e290adb into develop Oct 1, 2025
33 of 47 checks passed
@JonathanLichtnerAMD JonathanLichtnerAMD deleted the users/jlichtne/optimize-ck-solvers branch October 1, 2025 23:03
assistant-librarian Bot pushed a commit to ROCm/MIOpen that referenced this pull request Oct 1, 2025
[MIOpen] Implement a couple of improvements for bwd and wrw
 CK kernel selection (#1426)

This PR includes a couple of changes to improve the selection of CK kernels in MIOpen. These include

1. _Add -1 to the list of splitk values to iterate over for wrw._

Prior to this change, splitk values could be the power of 2 values in
the range 1 to 128, but CK has a special splitk autodeduce value of -1
that in some cases can provide better performance. This PR adds -1 to
the set of splitk CK values, which improves the overall performance of
CK solvers.

4. _Get the workspace size from CK for grp wrw, grp bwd, and 3D grp
wrw._

MIOpen was setting the CK workspace size to zero if some instances did
not require a workspace for a particular shape, but this precluded some
CK instances that might require a workspace from even being considered.
And in some cases, the CK solver performance was reduced because the CK
instance that needed the workspace might actually have been the fastest
instance for that solver.  This PR changes MIOpen to get the workspace size from CK for the grp
wrw, grp bwd, and 3D grp wrw solvers, and no longer excludes kernels
that require a non-zero workspace.

## Test Plan

Run a large assortment of fwd, bwd, and wrw shapes using *all* solvers and ensure there is an
improvement in the CK-only solvers when compared to all solvers for wrw and no regressions for fwd and bwd.

## Test Result

- [x] Performance test results on Fremont shapes indicate CK-solver
performance has improved for wrw
- [x] CI passes
- [x] Run a large set of of the Tuna Gold shapes
- [x] Also need to wait for CK commit
29446da1d57170a8bda47a452113ef7e44363a04 to be staged/promoted
- [x] Investigate hanging kernel issue on bwd path

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants