[hipblaslt] CMS TF32 192x256x32 NN by sebvince · Pull Request #3544 · ROCm/rocm-libraries

sebvince · 2025-12-24T10:48:04Z

Description

CMS implementation for tile 192x256x32 NN.

Tensile

without CMS : 646 us
with CMS : 425 us
34 % speedup

hipblaslt-bench

baseline (custom assembly kernel in 256x256x32TN) : 556 us
CMS : 512 us
7 % speedup
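The percentages quoted above appear to be reductions in runtime relative to the baseline, not throughput ratios. A minimal sketch of both calculations, using only the timings reported in this description (the function names are illustrative and not part of Tensile or hipblaslt-bench):

```python
def runtime_reduction(baseline_us: float, new_us: float) -> float:
    """Fraction of the baseline runtime that was saved."""
    return (baseline_us - new_us) / baseline_us


def speedup_factor(baseline_us: float, new_us: float) -> float:
    """How many times faster the new kernel is than the baseline."""
    return baseline_us / new_us


# Timings (in microseconds) copied from the description above.
timings = {
    "Tensile": (646.0, 425.0),
    "hipblaslt-bench": (556.0, 512.0),
}

for name, (baseline, cms) in timings.items():
    print(
        f"{name}: {runtime_reduction(baseline, cms):.1%} less time, "
        f"{speedup_factor(baseline, cms):.2f}x faster"
    )
```

Reporting the ratio alongside the percentage makes it unambiguous which definition of "speedup" is meant.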

Technical Details

This schedule uses:

UseMFMAF32XEmulation to reduce the number of CVT instructions
mfmaReordering to better hide the latency introduced by the use of ds_read_b32. If codegen could use ds_read_b128 and do the transpose along with the CVTs for the NN case, the schedule would be much simpler.

talumbau · 2026-01-06T21:45:23Z

+    ScheduleGlobalRead: 1
+    ScheduleIterAlg: 3
+    ScheduleLocalWrite: 1
+    SolutionIndex: 116


Do you have an intuition on how the value of SolutionIndex is chosen here? It's very curious to me.

Previous SolutionIndex in the file is 115
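One reading of the exchange above is that a new entry simply takes the previous highest index plus one. A minimal sketch of that rule, assuming the logic file's solutions have already been loaded into a list of dicts (the `next_solution_index` helper and the in-memory layout are hypothetical, not Tensile's actual API):

```python
def next_solution_index(solutions: list[dict]) -> int:
    """Return the index a newly appended solution would get.

    Assumes each entry carries an integer "SolutionIndex" key and that
    indices are expected to be unique within one logic file.
    """
    indices = [entry["SolutionIndex"] for entry in solutions]
    if len(set(indices)) != len(indices):
        raise ValueError("duplicate SolutionIndex values found")
    # An empty file would start at 0 under this assumption.
    return max(indices, default=-1) + 1


# Hypothetical tail of an existing logic file.
existing = [{"SolutionIndex": 114}, {"SolutionIndex": 115}]
print(next_solution_index(existing))
```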

talumbau

Approving. The trace looks really great!



1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
   - To select the correct LDS layout and epilogue behavior, three additional parameters are added:
     - ForceNaiveLdsLayout: disables the XOR-based LDS layout when it is true
     - DirectLoad: the pipeline uses only direct load, so the naive layout must be forced and any padding ignored on gfx9
     - IsMxGemm: the epilogue has two additional dimensions
2. Move all LDS descriptor layout related functions to the base class, including:
   - GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
   - GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
   - GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper functions to the base class, including:
   - GetSharedMemoryNumberOfByte
   - GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
   - GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
   - GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all C epilogue related code to the base class; four implementations are provided:
   - RunEpilogueNoShuffle
   - RunEpilogue
   - RunMultiDEpilogue
   - RunMoeEpilogue

[ROCm/composable_kernel commit: 23cefda]

sebvince added the gfx950 run CI on gfx950 label Dec 24, 2025

github-actions Bot added the project: hipblaslt label Dec 24, 2025

sebvince added the organization: ROCm label Dec 24, 2025

sebvince marked this pull request as ready for review December 24, 2025 11:03

sebvince requested a review from a team as a code owner December 24, 2025 11:03

sebvince force-pushed the 192x256x32NN_TF32 branch from bb0cecd to 811e95d Compare January 6, 2026 10:21

talumbau self-requested a review January 6, 2026 21:42

talumbau reviewed Jan 6, 2026


talumbau approved these changes Jan 6, 2026


sebvince force-pushed the 192x256x32NN_TF32 branch from 811e95d to 66512c4 Compare January 7, 2026 10:43

sebvince added 3 commits January 7, 2026 06:55

Add CMS

57849d9

Logic file

896ead1

Add test

53c9a9a

sebvince force-pushed the 192x256x32NN_TF32 branch from 66512c4 to 53c9a9a Compare January 7, 2026 13:00

sebvince enabled auto-merge (squash) January 7, 2026 13:00

sebvince merged commit 2178f59 into ROCm:hipblaslt_common_cms_phase2 Jan 7, 2026
7 checks passed
