[hipblaslt] CMS TF32 192x256x32 NN #3544
Merged
sebvince merged 3 commits on Jan 7, 2026
bb0cecd to
811e95d
Compare
talumbau
reviewed
Jan 6, 2026
ScheduleGlobalRead: 1
ScheduleIterAlg: 3
ScheduleLocalWrite: 1
SolutionIndex: 116
Contributor
Do you have an intuition on how the value of SolutionIndex is chosen here? It's very curious to me.
Contributor
Author
Previous SolutionIndex in the file is 115
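A minimal sketch of the convention described in this reply, assuming the new index is simply one past the largest `SolutionIndex` already in the file. The helper name `next_solution_index` and the list-of-dicts input are hypothetical and are not part of Tensile or hipBLASLt:

```python
def next_solution_index(solutions):
    """Return one past the largest SolutionIndex present (0 if there are none)."""
    used = [s["SolutionIndex"] for s in solutions if "SolutionIndex" in s]
    return max(used, default=-1) + 1


# Hypothetical stand-in for the solutions already listed in the logic file.
existing = [{"SolutionIndex": 114}, {"SolutionIndex": 115}]
print(next_solution_index(existing))
```

With 115 as the previous highest index, this yields the 116 used in the diff.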
talumbau
approved these changes
Jan 6, 2026
Contributor
talumbau
left a comment
Approving. The trace looks really great!
811e95d to
66512c4
Compare
66512c4 to
53c9a9a
Compare
rahjain-amd
pushed a commit
that referenced
this pull request
Jan 14, 2026
## Description
CMS implementation for tile 192x256x32 NN.

### Tensile
- without CMS: 646 us
- with CMS: 425 us
- **34 % speedup**

### hipblaslt-bench
- baseline (custom assembly kernel in 256x256x32TN): 556 us
- CMS: 512 us
- **7 % speedup**

## Technical Details
This schedule uses:
- `UseMFMAF32XEmulation` to reduce the number of CVT instructions
- mfmaReordering to better hide the latency introduced by the use of ds_read_b32

Having codegen able to use ds_read_b128 and do the transpose along with the CVTs for the NN case would greatly simplify the schedule.
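For reference, the quoted percentages correspond to the reduction in runtime relative to the baseline, not to the throughput ratio. A small sketch of that arithmetic using the timings reported above (the function names are illustrative only):

```python
def time_reduction_pct(baseline_us, new_us):
    """Percentage of the baseline runtime that was saved."""
    return 100.0 * (baseline_us - new_us) / baseline_us


def speedup_ratio(baseline_us, new_us):
    """How many times faster the new kernel is than the baseline."""
    return baseline_us / new_us


# Timings in microseconds, taken from the PR description.
for label, base, new in [("Tensile", 646, 425), ("hipblaslt-bench", 556, 512)]:
    print(f"{label}: {time_reduction_pct(base, new):.1f}% less time, "
          f"{speedup_ratio(base, new):.2f}x")
```

Expressed as a ratio, the Tensile result is roughly 1.5x, while the hipblaslt-bench gain over the custom assembly kernel is under 1.1x.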
assistant-librarian Bot
pushed a commit
that referenced
this pull request
Jan 27, 2026
1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
   - To select the correct LDS layout and epilogue behavior, three additional parameters are added:
     - ForceNaiveLdsLayout: disables the XOR-based LDS layout when true
     - DirectLoad: the pipeline only uses direct load; the naive layout must be forced and any padding ignored on gfx9
     - IsMxGemm: the epilogue has two additional dimensions
2. Move all LDS descriptor layout related functions to the base class, including:
   - GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
   - GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
   - GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper functions to the base class, including:
   - GetSharedMemoryNumberOfByte
   - GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
   - GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
   - GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all C epilogue related code to the base class; four implementations are provided:
   - RunEpilogueNoShuffle
   - RunEpilogue
   - RunMultiDEpilogue
   - RunMoeEpilogue
ammallya
pushed a commit
that referenced
this pull request
Feb 3, 2026
Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes. [ROCm/composable_kernel commit: 23cefda]