[AMD] Support col-major for device side TDM descriptors #9730
Merged
TDM requires contiguous data, so one of the strides has to be 1. Triton implicitly annotates kernel arguments with the value 1 as constexpr, so we can check which stride is 1. Currently the lowering doesn't support reordering, so this PR adds a strict check that the last dim is the fastest one. A follow-up PR will allow some reordering; see this [ticket](ROCm/triton-internal#1658).
This PR allows TDM load and store with column-major (order=[0,1]) tensors. For this to work we need to swap the dimensions in the TDM descriptor so that the stride==1 dimension from Triton is the first dim in the HW TDM descriptor. Note that the order of dimensions is reversed between Triton and our HW. For the >2D case we only allow swapping the last two dimensions, which means batch dims are not allowed to be the fastest dim. I adjusted the run_tensor_descriptor_load_store_test lit tests to not test uint dtypes because they do not affect the test case but make the torch handling trickier, and I was unable to make the transpose work. Note that we cannot do this for gather and scatter because it would reverse the meaning of the indices (from rows to columns).
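A minimal sketch of the dimension handling described above, in plain Python rather than the actual lowering code. The helper name `to_hw_descriptor_dims` is hypothetical, and the sketch assumes that HW dim 0 is the stride-1 dimension and that the HW order is otherwise the reverse of Triton's logical order.

```python
from typing import List, Tuple


def to_hw_descriptor_dims(
    shape: List[int], strides: List[int]
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: reorder Triton-order dims into HW descriptor order.

    Assumption: the HW descriptor lists the stride-1 dim first, i.e. the
    reverse of Triton's logical (slowest-to-fastest) order.
    """
    rank = len(shape)
    if rank == 0 or rank != len(strides):
        raise ValueError("shape and strides must be non-empty and of equal rank")

    order = list(range(rank))
    if strides[-1] == 1:
        pass  # row-major: the last logical dim is already the fastest
    elif rank >= 2 and strides[-2] == 1:
        # col-major: swap only the trailing two dims
        order[-1], order[-2] = order[-2], order[-1]
    else:
        # batch dims (or no dim at all) being the fastest is rejected
        raise ValueError("stride-1 dim must be one of the last two dims")

    hw_order = order[::-1]  # Triton and HW dimension orders are reversed
    return [shape[d] for d in hw_order], [strides[d] for d in hw_order]


# Row-major 64x128: HW sees the 128-wide contiguous dim first.
print(to_hw_descriptor_dims([64, 128], [128, 1]))
# Col-major 64x128: the 64-long contiguous dim becomes HW dim 0.
print(to_hw_descriptor_dims([64, 128], [1, 64]))
```

In both calls the first returned stride is 1, which is the property the descriptor swap is meant to guarantee.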
antiagainst
approved these changes
Mar 17, 2026
peterbell10
reviewed
Mar 17, 2026
%c_stride1 = arith.constant 1 : i64
// expected-error @+2 {{requires shared order [rank-2, rank-1, rank-3, rank-4, ..., 0] because dim[rank-2] has stride 1}}
// expected-error @+1 {{failed to legalize operation}}
%0 = tt.make_tensor_descriptor %arg0, [%c_shape, %c_shape], [%c_stride1, %runtime_stride] : <f16>, <tensor<64x64xf16, #shared>>
Contributor
In NVIDIA TMA support we limit the descriptor to only the logical order that is handled by the hardware, and transposed loads are handled by putting a transpose (view) after the load in the program. Is there a compelling reason why AMD should be different?
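The alternative raised in this comment can be illustrated without any Triton or TMA APIs. The sketch below is plain Python with hypothetical helper names (`load_row_major`, `transpose`): it treats a column-major M x N buffer as a row-major N x M buffer, performs a row-major "load", and then transposes the result, which recovers the original logical matrix.

```python
def load_row_major(buf, rows, cols):
    """Stand-in for a descriptor load: read a rows x cols row-major tile."""
    return [[buf[r * cols + c] for c in range(cols)] for r in range(rows)]


def transpose(tile):
    """Stand-in for a transpose view placed after the load."""
    return [list(col) for col in zip(*tile)]


M, N = 2, 3
# Logical matrix A, stored column-major: element (i, j) lives at i + j * M.
A = [[10 * i + j for j in range(N)] for i in range(M)]
buf = [0] * (M * N)
for i in range(M):
    for j in range(N):
        buf[i + j * M] = A[i][j]

# Describe the same memory as row-major with shape [N, M] and strides [M, 1],
# then transpose after the load to get the M x N logical view back.
tile = load_row_major(buf, N, M)
assert transpose(tile) == A
```

The descriptor in this scheme only ever sees a last-dim-contiguous layout; the column-major interpretation lives entirely in the program.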
Member
Good points, Peter! This is syncing out some internal changes from our initial implementation. We are taking another look at this based on the pointers. :)
raymondtay
pushed a commit
to raymondtay/triton
that referenced
this pull request
Mar 22, 2026
…9730) This PR adds a verifier to ensure we have at least one dim with a stride of 1 (HW requirement). The verifier checks whether the fastest dimension is the last logical dimension and also allows a single stride-1 dimension at rank-2 to support col-major tensors. For col-major tensors we swap the trailing dimensions in the TDM descriptor, since the last TDM dimension has to be the stride-1 dimension. This also requires us to rename the out dims of the shared layout during the lowering to match the order TDM uses to transfer the tensor.
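The verifier rules in this commit message can be sketched roughly as follows. This is a plain-Python illustration, not the actual MLIR verifier: `verify_descriptor` and `expected_shared_order` are hypothetical names, `None` stands for a stride that is only known at runtime, and the error strings are illustrative, loosely modeled on the diagnostic in the lit test quoted earlier. The shared order is assumed to list dims from fastest to slowest.

```python
from typing import List, Optional


def expected_shared_order(rank: int, fastest_dim: int) -> List[int]:
    """Shared order (fastest dim first) matching the stride-1 dim."""
    order = list(range(rank - 1, -1, -1))  # [rank-1, rank-2, ..., 0]
    if fastest_dim == rank - 2:
        order[0], order[1] = order[1], order[0]  # [rank-2, rank-1, ..., 0]
    return order


def verify_descriptor(
    strides: List[Optional[int]], shared_order: List[int]
) -> Optional[str]:
    """Return an error message, or None if the descriptor is accepted."""
    rank = len(strides)
    unit_dims = [d for d, s in enumerate(strides) if s == 1]
    if not unit_dims:
        return "requires at least one dimension with stride 1"
    if rank - 1 in unit_dims:
        fastest = rank - 1  # row-major case
    elif unit_dims == [rank - 2]:
        fastest = rank - 2  # single stride-1 dim at rank-2: col-major case
    else:
        return "stride-1 dimension must be one of the last two dimensions"
    want = expected_shared_order(rank, fastest)
    if shared_order != want:
        return (
            f"requires shared order {want} "
            f"because dim[{fastest}] has stride 1"
        )
    return None


print(verify_descriptor([None, 1], [1, 0]))  # row-major layout
print(verify_descriptor([1, None], [1, 0]))  # col-major strides, row-major shared order
```

The second call mirrors the shape of the negative lit test: the strides say dim 0 is contiguous, so a shared order that starts with dim 1 is rejected.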
jvican
pushed a commit
to jvican/triton
that referenced
this pull request
Mar 27, 2026
plognjen
pushed a commit
to plognjen/triton
that referenced
this pull request
Apr 14, 2026
antiagainst
pushed a commit
that referenced
this pull request
Apr 20, 2026
Partially reverts #9730. It still keeps and adjusts the improved verifiers, which error out if the kernel provides invalid strides. This brings it in line with TMA, which also supports only the logical HW view and transposes the view in the kernel instead.
bingyizh233
pushed a commit
to bingyizh233/triton
that referenced
this pull request
Apr 20, 2026