Skip to content

[hipblaslt] CMS TF32 192x256x32 NN #3544

Merged
sebvince merged 3 commits into
ROCm:hipblaslt_common_cms_phase2from
sebvince:192x256x32NN_TF32
Jan 7, 2026
Merged

[hipblaslt] CMS TF32 192x256x32 NN #3544
sebvince merged 3 commits into
ROCm:hipblaslt_common_cms_phase2from
sebvince:192x256x32NN_TF32

Conversation

@sebvince
Copy link
Copy Markdown
Contributor

@sebvince sebvince commented Dec 24, 2025

Description

CMS implementation for tile 192x256x32 NN.

Tensile

  • without CMS : 646 us
  • with CMS : 425 us .
  • 34 % speedup

hipblaslt-bench

  • baseline (custom assembly kernel in 256x256x32TN) : 556 us
  • CMS : 512 us
  • 7 % speedup

Technical Details

This schedule uses:

  • UseMFMAF32XEmulation to reduce the number of CVT instructions
  • mfmaReordering to better hide latency introduced but the use of ds_read_b32. Having codegen being able to use ds_read_b128 and to the transpose along CVTs for NN case would greatly simplify the schedule.

@sebvince sebvince added the gfx950 run CI on gfx950 label Dec 24, 2025
@sebvince sebvince marked this pull request as ready for review December 24, 2025 11:03
@sebvince sebvince requested a review from a team as a code owner December 24, 2025 11:03
@talumbau talumbau self-requested a review January 6, 2026 21:42
ScheduleGlobalRead: 1
ScheduleIterAlg: 3
ScheduleLocalWrite: 1
SolutionIndex: 116
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an intuition on how the value of SolutionIndex is chosen here? It's very curious to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous SolutionIndex in the file is 115

Copy link
Copy Markdown
Contributor

@talumbau talumbau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. The trace looks really great!

@sebvince sebvince enabled auto-merge (squash) January 7, 2026 13:00
@sebvince sebvince merged commit 2178f59 into ROCm:hipblaslt_common_cms_phase2 Jan 7, 2026
7 checks passed
rahjain-amd pushed a commit that referenced this pull request Jan 14, 2026
## Description

CMS implementation for tile 192x256x32 NN. 

### Tensile 
- without CMS : 646 us
- with CMS : 425  us .
- **34 % speedup**

### hipblaslt-bench
- baseline (custom assembly kernel in 256x256x32TN) : 556 us
- CMS : 512 us
- **7 % speedup**

## Technical Details

This schedule uses: 
-  `UseMFMAF32XEmulation` to reduce the number of CVT instructions
- mfmaReordering to better hide latency introduced but the use of
ds_read_b32. Having codegen being able to use ds_read_b128 and to the
transpose along CVTs for NN case would greatly simplify the schedule.
assistant-librarian Bot pushed a commit that referenced this pull request Jan 27, 2026
1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
- to select correct LDS layout and epilogue behavior , three additional parameters is added.
- ForceNaiveLdsLayout: disable XOR based LDS layout when it is true
- DirectLoad: pipeline only use directload, we need force naive layout and ignore any padding on gfx9
- IsMxGemm: epilogue has two addtional dimensions
2. Move all LDS descriptor layout related fucntion to base class, including
- GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
- GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper funtions to base class, including
- GetSharedMemoryNumberOfByte
- GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
- GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all c epilogue related code to base class, and 4 kind of implementation are provided
- RunEpilogueNoShuffle
- RunEpilogue
- RunMultiDEpilogue
- RunMoeEpilogue
ammallya pushed a commit that referenced this pull request Feb 3, 2026
1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
- to select correct LDS layout and epilogue behavior , three additional parameters is added.
- ForceNaiveLdsLayout: disable XOR based LDS layout when it is true
- DirectLoad: pipeline only use directload, we need force naive layout and ignore any padding on gfx9
- IsMxGemm: epilogue has two addtional dimensions
2. Move all LDS descriptor layout related fucntion to base class, including
- GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
- GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper funtions to base class, including
- GetSharedMemoryNumberOfByte
- GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
- GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all c epilogue related code to base class, and 4 kind of implementation are provided
- RunEpilogueNoShuffle
- RunEpilogue
- RunMultiDEpilogue
- RunMoeEpilogue

[ROCm/composable_kernel commit: 23cefda]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants