Skip to content

[hipblaslt] DTL optimizations#2487

Merged
bnemanich merged 7 commits into
developfrom
users/brianshi/gr_stride
Nov 7, 2025
Merged

[hipblaslt] DTL optimizations#2487
bnemanich merged 7 commits into
developfrom
users/brianshi/gr_stride

Conversation

@b-shi

@b-shi b-shi commented Nov 5, 2025

Copy link
Copy Markdown
Contributor

Motivation

Addresses several DTL limitations:

  • Add support for non-power of 2 MT when TLU=1
    • Add support for generalized NLC=1
  • Add support for m0 padding LDSTr+DTL
  • Add support for permuting perpDim to help mitigate BC

Technical Details

Add support for generalized NLC=1 (GNLC)

  • Previously for TLU=1 and non-power of 2 MT, numLoadsCoalesced > 1 so the columns/rows in the non-summation dimension is fetched across multiple loads. This PR generalized NLC=1 by allowing num threads coalesced to be arbitrary.
  • Uses magic div algo for offset calculations since divisors are static.
  • Automatically supports DTL for non-power of 2 MT, since we can use same local read offsets for regular NLC=1 case.
  • Limitations: Initially only supported for DTL, same size inputs, TLDS=1, LSU=1

Add support for m0 padding when using LDSTr+DTL

  • Updated LdsBlockSizePerPad calcs for LDSTr to support padding.

Add support for permuting perpDim.

  • In GNLC, support was added to permute the columns/rows in the parallel dimension (ex: permute columns of A, for A col-major)
  • Permutation is done by applying a stride across consecutive columns/rows. Motivation is to have consecutive columns/rows map to different buffer_loads so that m0 padding can better mitigate bank conflicts.
  • Permutation is done in blocks of columns/rows instead of across all columns/rows
  • Formula for permutation: given B(block size where permutes are done), S (stride), I (index)
[Forward Mapping used in global read offset calcs]
S' = B / S
I' = S * (I % S') + I / S'

[Inverse Mapping used in local read offset calcs]
S = B / S'
I = S' * (I' % S) + I' / S

Test Plan

Test Result

Submission Checklist

@b-shi b-shi requested a review from a team as a code owner November 5, 2025 20:10
@b-shi b-shi force-pushed the users/brianshi/gr_stride branch from 7070d52 to 3eed0fd Compare November 5, 2025 20:15
@b-shi b-shi requested a review from aazz44ss November 5, 2025 20:21
@b-shi b-shi force-pushed the users/brianshi/gr_stride branch 3 times, most recently from de7eea0 to 5350a1f Compare November 6, 2025 03:32
@math-ci

math-ci Bot commented Nov 6, 2025

Copy link
Copy Markdown

perfci run on commit c3eb0ff

math-ci run

Comment thread projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py
Comment thread projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriter.py Outdated
@b-shi b-shi force-pushed the users/brianshi/gr_stride branch 2 times, most recently from 71324a8 to 1d7ef14 Compare November 6, 2025 15:00
kliegeois pushed a commit to kliegeois/rocm-libraries that referenced this pull request Nov 6, 2025
Adds support for async execution and explicit device/stream selection
and updates fusilli plugin to use new API:
- Handle creation supports device id and external HIP stream parameters.
 - Graph execution is async on AMDGPU backend.
 - Fusilli plugin now uses stream if set by user.
@b-shi b-shi force-pushed the users/brianshi/gr_stride branch from 55bb2b4 to 1904e22 Compare November 6, 2025 18:45
@math-ci

math-ci Bot commented Nov 6, 2025

Copy link
Copy Markdown

perfci run on commit ab9d0c48f2e00e4709a9f698d1443c5113311e73

math-ci run

Comment thread projects/hipblaslt/tensilelite/rocisa/rocisa/include/hardware_caps.hpp Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/SolutionStructs/Solution.py
@b-shi b-shi force-pushed the users/brianshi/gr_stride branch 2 times, most recently from df7a000 to c6b883c Compare November 7, 2025 02:43
@math-ci

math-ci Bot commented Nov 7, 2025

Copy link
Copy Markdown

perfci run on commit c6b883cb34f0e4ae0bce40a54f6afd2c997bacdc

math-ci run

@b-shi b-shi force-pushed the users/brianshi/gr_stride branch from 4cb2bc9 to 7db3030 Compare November 7, 2025 17:45
@bnemanich

Copy link
Copy Markdown
Contributor

All 950 tests passed.

@bnemanich bnemanich merged commit 3acf120 into develop Nov 7, 2025
26 of 30 checks passed
@bnemanich bnemanich deleted the users/brianshi/gr_stride branch November 7, 2025 21:55
@math-ci

math-ci Bot commented Nov 7, 2025

Copy link
Copy Markdown

perfci run on commit 7db3030

math-ci run

ammallya pushed a commit that referenced this pull request Feb 3, 2026
…, physical_seqlen_k_end) (#2487)

* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline

* i_nhead_ conversion type to prevent overflow

---------

Co-authored-by: ltqin <letaoqin@amd.com>

[ROCm/composable_kernel commit: 45904b8]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants