[AMD] Support col-major for device side TDM descriptors#9730

Merged
antiagainst merged 4 commits into triton-lang:main from AlexAUT:tdmColMajor on Mar 17, 2026

Contributor

@AlexAUT AlexAUT commented Mar 16, 2026

This PR adds a verifier to ensure that at least one dimension has a stride of 1 (a HW requirement). The verifier checks that the fastest dimension is the last logical dimension, and it also allows a single stride-1 dimension at index rank-2 (the second-to-last dimension) to support col-major tensors.
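The rule above can be sketched in plain Python. This is only an illustration of the check as described, not the verifier in the PR: the function name `find_fastest_dim` is hypothetical, and a stride of `None` is an assumed stand-in for a stride that is only known at runtime.

```python
from typing import Optional, Sequence


def find_fastest_dim(strides: Sequence[Optional[int]]) -> int:
    """Return the index of the accepted stride-1 dimension.

    Hypothetical helper: `None` marks a stride that is not a
    compile-time constant. Raises ValueError for rejected layouts.
    """
    rank = len(strides)
    # Dimensions whose stride is known to be exactly 1.
    unit = [i for i, s in enumerate(strides) if s == 1]
    if not unit:
        raise ValueError("no dimension with a compile-time stride of 1")
    if rank - 1 in unit:
        # Row-major: the last logical dimension is the fastest one.
        return rank - 1
    if rank >= 2 and unit == [rank - 2]:
        # Col-major: a single stride-1 dimension at index rank-2.
        return rank - 2
    raise ValueError("stride-1 dimension must be at index rank-1 or rank-2")


row_major = find_fastest_dim([None, 1])       # last dim is fastest
col_major = find_fastest_dim([1, None])       # dim[rank-2] is fastest
batched = find_fastest_dim([None, 1, None])   # 3D, trailing dims swapped
```

A batch dimension as the fastest one, for example `[1, None, None]`, is rejected by this sketch.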

For col-major tensors we swap the trailing dimensions in the TDM descriptor, since the last TDM dimension has to be the stride-1 dimension. This also requires us to rename the out dims of the shared layout during lowering to match the order TDM uses to transfer the tensor.
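As a rough illustration of the swap, the sketch below reorders shape and stride lists so that the stride-1 dimension ends up last. The helper `swap_trailing_dims` and the example shapes are hypothetical, and the sketch does not model the actual descriptor encoding or the shared-layout renaming.

```python
def swap_trailing_dims(values, fastest_dim):
    """Swap the last two entries when the fastest dim is at index rank-2.

    Hypothetical helper for illustration only; `values` is a per-dimension
    list such as a shape, a stride list, or a block shape.
    """
    rank = len(values)
    out = list(values)
    if fastest_dim == rank - 2:
        out[-1], out[-2] = out[-2], out[-1]
    elif fastest_dim != rank - 1:
        raise ValueError("only the last two dimensions may be swapped")
    return out


# A 4 x 128 x 64 tensor whose trailing two dims are stored col-major:
# dim 1 has stride 1, dim 2 has stride 128, the batch dim has stride 128 * 64.
shape = [4, 128, 64]
strides = [8192, 1, 128]

tdm_shape = swap_trailing_dims(shape, fastest_dim=1)      # [4, 64, 128]
tdm_strides = swap_trailing_dims(strides, fastest_dim=1)  # [8192, 128, 1]
```

After the swap the stride-1 dimension is the last entry, and the batch dimension is left untouched.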

AlexAUT and others added 4 commits March 16, 2026 10:17
TDM requires contiguous data, so one of the strides has to be 1. Triton implicitly annotates kernel arguments with value 1 as constexpr, so we can check which stride is 1.

Currently the lowering doesn't support reordering, so this PR adds a strict check that the last dim is the fastest one. A follow-up PR will allow some reordering; see this [ticket](ROCm/triton-internal#1658).
This PR allows TDM load and store with column-major (order=[0,1]) tensors. For this to work we need to swap the dimensions in the TDM descriptor to ensure the stride==1 dimension from Triton is the first dim in the HW TDM descriptor. Note that the order of dimensions between Triton and our HW is reversed.
For the >2D case we only allow swapping the last two dimensions, which means batch dims are not allowed to be the fastest dim.

I adjusted the run_tensor_descriptor_load_store_test lit tests to no longer test uint dtypes: they do not affect the test case, but they make the torch handling trickier, and I was unable to make the transpose work with them.

Note that we cannot do this for gather and scatter because it would reverse the meaning of the indices (from rows to columns).
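A small plain-Python illustration of why the swap changes the meaning of gather indices. This is a toy model, not TDM gather: `gather_rows` is a hypothetical helper that selects along the first dimension.

```python
def gather_rows(matrix, indices):
    """Select entries along the first dimension (hypothetical toy gather)."""
    return [list(matrix[i]) for i in indices]


matrix = [
    [0, 1, 2],
    [10, 11, 12],
    [20, 21, 22],
]
# Swapping the two dimensions is a transpose of the logical view.
swapped = [list(col) for col in zip(*matrix)]

rows = gather_rows(matrix, [0, 2])   # picks rows 0 and 2
cols = gather_rows(swapped, [0, 2])  # same indices now pick columns 0 and 2
```

With the dimensions swapped, the same index list selects columns of the original tensor instead of rows.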
@antiagainst antiagainst marked this pull request as ready for review March 17, 2026 00:02
@antiagainst antiagainst merged commit a590e14 into triton-lang:main Mar 17, 2026
9 checks passed
%c_stride1 = arith.constant 1 : i64
// expected-error @+2 {{requires shared order [rank-2, rank-1, rank-3, rank-4, ..., 0] because dim[rank-2] has stride 1}}
// expected-error @+1 {{failed to legalize operation}}
%0 = tt.make_tensor_descriptor %arg0, [%c_shape, %c_shape], [%c_stride1, %runtime_stride] : <f16>, <tensor<64x64xf16, #shared>>
Contributor

@peterbell10 peterbell10 Mar 17, 2026

In NVIDIA TMA support we limit the descriptor to only the logical order that is handled by the hardware, and transposed loads are handled by putting a transpose (view) after the load in the program. Is there a compelling reason why AMD should be different?

Member

Good points, Peter! This is syncing out some internal changes from our initial implementation. We are taking another look at this based on the pointers. :)

Member

Done via #10078

raymondtay pushed a commit to raymondtay/triton that referenced this pull request Mar 22, 2026
jvican pushed a commit to jvican/triton that referenced this pull request Mar 27, 2026
plognjen pushed a commit to plognjen/triton that referenced this pull request Apr 14, 2026
antiagainst pushed a commit that referenced this pull request Apr 20, 2026
Partially reverts:
- #9730

It still keeps and adjusts the improved verifiers so that they error out if the kernel
provides invalid strides. This brings it in line with TMA, which also
supports only the logical HW view and transposes the view in the kernel
instead.
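
The idea of describing only the logical row-major view and transposing in the kernel can be illustrated with plain Python lists. This is a conceptual sketch, not Triton or TDM code: `load_row_major` and `transpose` are hypothetical helpers, and the flat list `buf` stands in for global memory. It relies on the fact that a col-major M x N matrix occupies the same memory as a row-major N x M matrix.

```python
def load_row_major(buf, shape, strides):
    """Read a 2D tile from a flat buffer (hypothetical stand-in for a load)."""
    rows, cols = shape
    return [
        [buf[r * strides[0] + c * strides[1]] for c in range(cols)]
        for r in range(rows)
    ]


def transpose(tile):
    """Return the transposed view of a 2D tile as a new list of lists."""
    return [list(col) for col in zip(*tile)]


M, N = 2, 3
# Store A[i][j] = 10 * i + j in column-major order: offset = i + j * M.
buf = [0] * (M * N)
for i in range(M):
    for j in range(N):
        buf[i + j * M] = 10 * i + j

# Describe the same memory as a row-major N x M tensor (last stride is 1) ...
tile = load_row_major(buf, (N, M), (M, 1))
# ... and recover the logical M x N matrix by transposing after the load.
a = transpose(tile)
```

The descriptor never sees a stride-1 dimension anywhere but last; the col-major interpretation lives entirely in the transpose applied after the load.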
bingyizh233 pushed a commit to bingyizh233/triton that referenced this pull request Apr 20, 2026
3 participants