[hipblaslt] DTL optimizations by b-shi · Pull Request #2487 · ROCm/rocm-libraries

b-shi · 2025-11-05T20:10:44Z

Motivation

Addresses several DTL limitations:

- Add support for non-power-of-2 MT when TLU=1
  - Add support for generalized NLC=1
- Add support for m0 padding with LDSTr+DTL
- Add support for permuting perpDim to help mitigate bank conflicts (BC)

Technical Details

Add support for generalized NLC=1 (GNLC)

Previously, for TLU=1 with a non-power-of-2 MT, numLoadsCoalesced > 1, so the columns/rows in the non-summation dimension were fetched across multiple loads. This PR generalizes NLC=1 by allowing the number of threads coalesced to be arbitrary.
- Uses a magic-division algorithm for the offset calculations, since the divisors are static.
- Automatically supports DTL for non-power-of-2 MT, since the same local read offsets as in the regular NLC=1 case can be used.
- Limitations: initially supported only for DTL, same-size inputs, TLDS=1, LSU=1.
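As a rough illustration of why static divisors make magic division attractive, here is a minimal Python sketch of one common multiply-and-shift formulation. It is not necessarily the variant Tensile emits; the function names (`make_magic`, `magic_div`, `thread_to_coords`) and the 32-bit input range are assumptions made for this example only.

```python
def make_magic(divisor, bits=32):
    """Precompute (magic, shift) such that, for 0 <= n < 2**bits,
    n // divisor == (n * magic) >> shift.

    Hypothetical helper for illustration; done once because the divisor is static.
    """
    assert divisor >= 1
    log2_ceil = (divisor - 1).bit_length()  # ceil(log2(divisor))
    shift = bits + log2_ceil
    magic = -(-(1 << shift) // divisor)     # ceil(2**shift / divisor)
    return magic, shift


def magic_div(n, magic, shift):
    """Divide by the precomputed divisor using one multiply and one shift."""
    return (n * magic) >> shift


def thread_to_coords(tid, threads_per_line, magic, shift):
    """Split a thread id into (line, lane) when an arbitrary, possibly
    non-power-of-2, number of threads is coalesced per column/row."""
    line = magic_div(tid, magic, shift)
    lane = tid - line * threads_per_line    # remainder without a second division
    return line, lane


magic, shift = make_magic(6)
coords = thread_to_coords(13, 6, magic, shift)  # thread 13 with 6 threads per line
```

In this sketch the magic constant can need one bit more than `bits`, which Python integers absorb; a hardware implementation would have to handle that extra bit or restrict the input range.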

Add support for m0 padding when using LDSTr+DTL

Updated the LdsBlockSizePerPad calculations for LDSTr to support padding.

Add support for permuting perpDim

In GNLC, support was added to permute the columns/rows in the parallel dimension (e.g., permute the columns of A when A is column-major).
Permutation is done by applying a stride across consecutive columns/rows. The motivation is to have consecutive columns/rows map to different buffer_loads so that m0 padding can better mitigate bank conflicts.
Permutation is done within blocks of columns/rows rather than across all columns/rows.
Formula for the permutation, given B (the block size within which permutes are done), S (stride), and I (index within the block); all divisions are integer divisions:

[Forward Mapping used in global read offset calcs]
S' = B / S
I' = S * (I % S') + I / S'

[Inverse Mapping used in local read offset calcs]
S = B / S'
I = S' * (I' % S) + I' / S
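The two mappings can be checked against each other with a small Python sketch. The function names and the requirement that S divides B evenly are assumptions made for this illustration; they are not taken from the PR's code.

```python
def permute_forward(i, block, stride):
    """Forward mapping I -> I' (the form used for global read offsets)."""
    assert block % stride == 0, "sketch assumes S divides B"
    s_prime = block // stride                    # S' = B / S
    return stride * (i % s_prime) + i // s_prime


def permute_inverse(i_perm, block, stride):
    """Inverse mapping I' -> I (the form used for local read offsets)."""
    assert block % stride == 0, "sketch assumes S divides B"
    s_prime = block // stride                    # S' = B / S
    return s_prime * (i_perm % stride) + i_perm // stride


B, S = 8, 2
forward = [permute_forward(i, B, S) for i in range(B)]
roundtrip = [permute_inverse(p, B, S) for p in forward]
print(forward)    # [0, 2, 4, 6, 1, 3, 5, 7]
print(roundtrip)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With B = 8 and S = 2, consecutive original indices land two positions apart after the permutation, and applying the inverse recovers the original order.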

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

math-ci · 2025-11-06T03:33:56Z

perfci run on commit `c3eb0ff`

math-ci run

Adds support for async execution and explicit device/stream selection and updates fusilli plugin to use new API: - Handle creation supports device id and external HIP stream parameters. - Graph execution is async on AMDGPU backend. - Fusilli plugin now uses stream if set by user.

math-ci · 2025-11-06T23:58:05Z

perfci run on commit ab9d0c48f2e00e4709a9f698d1443c5113311e73

math-ci run

math-ci · 2025-11-07T07:33:42Z

perfci run on commit c6b883cb34f0e4ae0bce40a54f6afd2c997bacdc

math-ci run

bnemanich · 2025-11-07T21:55:20Z

All 950 tests passed.

math-ci · 2025-11-07T22:04:15Z

perfci run on commit `7db3030`

math-ci run

…, physical_seqlen_k_end) (#2487) * Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline * i_nhead_ conversion type to prevent overflow --------- Co-authored-by: ltqin <letaoqin@amd.com> [ROCm/composable_kernel commit: 45904b8]

b-shi requested review from AlexBrownAMD and msujon-AMD November 5, 2025 20:10

b-shi requested a review from a team as a code owner November 5, 2025 20:10

b-shi added gfx950 run CI on gfx950 project: hipblaslt organization: ROCm labels Nov 5, 2025

b-shi force-pushed the users/brianshi/gr_stride branch from 7070d52 to 3eed0fd Compare November 5, 2025 20:15

b-shi requested a review from aazz44ss November 5, 2025 20:21

b-shi force-pushed the users/brianshi/gr_stride branch 3 times, most recently from de7eea0 to 5350a1f Compare November 6, 2025 03:32

nakajee reviewed Nov 6, 2025


b-shi force-pushed the users/brianshi/gr_stride branch 2 times, most recently from 71324a8 to 1d7ef14 Compare November 6, 2025 15:00

b-shi force-pushed the users/brianshi/gr_stride branch from 55bb2b4 to 1904e22 Compare November 6, 2025 18:45

nakajee reviewed Nov 7, 2025


Comment thread projects/hipblaslt/tensilelite/rocisa/rocisa/include/hardware_caps.hpp Outdated

Comment thread projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py

b-shi force-pushed the users/brianshi/gr_stride branch 2 times, most recently from df7a000 to c6b883c Compare November 7, 2025 02:43

b-shi added 7 commits November 7, 2025 11:45

DTL optimizations

2b6bd8b

Address comments from PR

51a7b54

Disable GNLC for batched MI

562d8db

Set MI_M as default padding for GNLC+LDSTr

6f23273

Address comments from PR part2

0ccbdbd

add reject case if dtl not doable, but GNLC enabled

186e514

Fixed ldsblocksizepad logic in tailloop LR inc code

7db3030

b-shi force-pushed the users/brianshi/gr_stride branch from 4cb2bc9 to 7db3030 Compare November 7, 2025 17:45

msujon-AMD approved these changes Nov 7, 2025


bnemanich merged commit 3acf120 into develop Nov 7, 2025
26 of 30 checks passed

bnemanich deleted the users/brianshi/gr_stride branch November 7, 2025 21:55


4 participants
