[AMD] Add basics to allow bypass LDS for dot RHS #5350

Merged
antiagainst merged 16 commits into triton-lang:main from plognjen:bypass_lds_upstream_new
Jan 23, 2025

@plognjen plognjen commented Dec 5, 2024

This pull request supersedes #4856.

The AMDBypassLDSForDotOperandPass implements a strategy to bypass the
Local Data Share (LDS) for one of the operands in an MFMA dot operation.

Under certain conditions, one of the operands can be loaded directly from HBM
into VGPRs in the MFMA dot layout, without losing vectorization of global loads
and without increasing the number of global loads due to data shared between threads.
The required conditions are:

1. K-Major Tensor Layout:
   The operand we want to bypass LDS for must be K-major (i.e., row-major for
   operand 0 or column-major for operand 1). This supports vectorized global
   load instructions, as MFMA instructions require each thread to hold B
   operand elements along the K dimension.
2. kWidth * sizeof(dataType) == 128 (size measured in bits):
   Using the maximum kWidth for a specific data type ensures optimal global
   load vectorization (e.g., using global_load_dwordx4 instructions, which
   load 128 bits per thread).
3. Single Warp per CTA Dimension:
   Either warpsPerCTA[nDim] == 1 for operand A bypass or warpsPerCTA[mDim] ==
   1 for operand B bypass. This guarantees that each tensor element is
   handled by exactly one thread, so the number of global loads stays the same
   as in the blocked layout (i.e., each element is loaded only once).
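
The three conditions above can be read as a single predicate. The following is a minimal Python sketch of that check, not the pass's actual C++ implementation: the `DotOperand` record, its field names, and the helpers `max_k_width` and `can_bypass_lds` are all hypothetical and introduced only for illustration. It assumes the 128 in the second condition is measured in bits, which matches the 128-bit width of `global_load_dwordx4`.

```python
from dataclasses import dataclass

# NOTE: every name in this block is hypothetical and exists only for
# illustration; none of them are identifiers from the Triton code base.


@dataclass(frozen=True)
class DotOperand:
    op_idx: int           # 0 for operand A (M x K), 1 for operand B (K x N)
    is_k_major: bool      # row-major for operand 0, column-major for operand 1
    k_width: int          # contiguous K elements held by each thread
    elem_bits: int        # element size in bits, e.g. 16 for fp16
    warps_per_cta: tuple  # (warps along M, warps along N)


def max_k_width(elem_bits: int) -> int:
    """Largest kWidth that fits in one 128-bit global load (assumed unit: bits)."""
    return 128 // elem_bits


def can_bypass_lds(op: DotOperand) -> bool:
    """Return True when all three conditions from the description hold."""
    # Condition 1: the operand must be K-major.
    if not op.is_k_major:
        return False
    # Condition 2: each thread must load exactly 128 bits along K.
    if op.k_width * op.elem_bits != 128:
        return False
    # Condition 3: operand A requires a single warp along N,
    # operand B requires a single warp along M.
    m_warps, n_warps = op.warps_per_cta
    required_single = n_warps if op.op_idx == 0 else m_warps
    return required_single == 1
```

For example, under these assumptions an fp16 operand B with `k_width=8` and `warps_per_cta=(1, 4)` passes the check, while the same operand with `warps_per_cta=(4, 1)` does not, because four warps along M would each need the same B elements.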

Comment thread third_party/amd/python/triton_amd.cc Outdated
Comment thread include/triton/Tools/Sys/GetEnv.hpp Outdated
Comment thread third_party/amd/lib/TritonAMDGPUTransforms/AMDBypassLDSForDotOperand.cpp Outdated
@plognjen plognjen force-pushed the bypass_lds_upstream_new branch from 036fe75 to e8369e6 Compare January 14, 2025 15:26
@plognjen plognjen force-pushed the bypass_lds_upstream_new branch from e8369e6 to e494441 Compare January 14, 2025 22:15
@plognjen plognjen force-pushed the bypass_lds_upstream_new branch from a6bece0 to 3d36fc9 Compare January 19, 2025 20:32
@antiagainst antiagainst marked this pull request as ready for review January 22, 2025 05:51
@antiagainst antiagainst merged commit cea35da into triton-lang:main Jan 23, 2025
pawelszczerbuk added a commit to pawelszczerbuk/triton that referenced this pull request Jan 25, 2025
pawelszczerbuk added a commit to pawelszczerbuk/triton that referenced this pull request Jan 26, 2025
pawelszczerbuk added a commit that referenced this pull request Jan 26, 2025
)

Reverting, as I have to revert
[cec1db5](cec1db5)
(which this change relies on) due to a regression in internal tests.
AlexAUT pushed a commit to AlexAUT/triton that referenced this pull request Jan 29, 2025
…#5350)" (triton-lang#5708)
makslevental pushed a commit to makslevental/triton that referenced this pull request Feb 19, 2025
…#5350)" (triton-lang#5708)
plognjen pushed a commit to plognjen/triton that referenced this pull request Mar 21, 2025
plognjen pushed a commit to plognjen/triton that referenced this pull request Mar 24, 2025
jtang10 pushed a commit to ROCm/triton that referenced this pull request Jun 27, 2025
3 participants