[AMD] Support col-major for device side TDM descriptors by AlexAUT · Pull Request #9730 · triton-lang/triton

AlexAUT · 2026-03-16T11:57:15Z

This PR adds a verifier to ensure that at least one dimension has a stride of 1 (a hardware requirement). The verifier checks that the fastest dimension is the last logical dimension, and it also allows a single stride-1 dimension at rank-2 to support col-major tensors.
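The rule described above can be sketched in plain Python. This is only an illustration of the stated check, not the verifier from the PR: the function name `classify_strides`, the use of `None` for a runtime (non-constant) stride, and the returned labels are all assumptions made for the example.

```python
from typing import Optional, Sequence


def classify_strides(strides: Sequence[Optional[int]]) -> str:
    """Classify descriptor strides as "row_major" or "col_major".

    Hypothetical helper: an int is a stride known at compile time,
    None stands for a stride that is only known at runtime.
    Raises ValueError when the layout would be rejected.
    """
    rank = len(strides)
    if rank == 0:
        raise ValueError("descriptor needs at least one dimension")

    # Dimensions whose stride is known to be exactly 1.
    unit_dims = [i for i, s in enumerate(strides) if s == 1]
    if not unit_dims:
        raise ValueError("at least one dimension must have stride 1")

    # Usual case: the last logical dimension is the fastest one.
    if strides[-1] == 1:
        return "row_major"

    # Col-major case: the single stride-1 dimension sits at index rank-2.
    if rank >= 2 and unit_dims == [rank - 2]:
        return "col_major"

    raise ValueError("stride-1 dimension must be dim[rank-1] or dim[rank-2]")
```

With this sketch, `classify_strides([None, 1])` is treated as row-major, `classify_strides([1, None])` as col-major, and a batch dimension with stride 1 such as `[1, None, None]` is rejected.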

For col-major tensors we swap the trailing dimensions in the TDM descriptor, since the last TDM dimension has to be the stride-1 dimension. This also requires renaming the out dims of the shared layout during lowering to match the order TDM uses to transfer the tensor.

TDM requires contiguous data, so one of the strides has to be 1. Triton implicitly annotates kernel arguments with value 1 as constexpr, so we can check which stride is 1. Currently the lowering doesn't support reordering, so this PR adds a strict check that the last dim is the fastest one. A follow-up PR will allow some reordering; see this [ticket](ROCm/triton-internal#1658).

This PR allows TDM load and store with column-major (order=[0,1]) tensors. For this to work we need to swap the dimensions in the TDM descriptor so that the stride==1 dimension from Triton is the first dim in the HW TDM descriptor. Note that the order of dimensions is reversed between Triton and our HW. For the >2D case we only allow swapping the last two dimensions, which means batch dims are not allowed to be the fastest dim. I adjusted the run_tensor_descriptor_load_store_test lit tests to not test uint dtypes, because they do not affect the test case but make the torch handling more tricky, and I was unable to make the transpose work. Note that we cannot do this for gather and scatter because it would reverse the meaning of the indices (from rows to columns).
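A rough sketch of the dimension handling in the paragraph above, in plain Python. It is not the lowering code from the PR: the helper name `to_hw_order` and its `col_major` flag are hypothetical, and it only models the two steps described here (swap the trailing two dims for col-major input, then reverse the dimension order for the HW descriptor).

```python
from typing import List, Sequence, Tuple


def to_hw_order(
    shape: Sequence[int], strides: Sequence[int], col_major: bool
) -> Tuple[List[int], List[int]]:
    """Map Triton-order (shape, strides) to a reversed, HW-style order.

    Hypothetical helper for illustration only.
    """
    if len(shape) != len(strides):
        raise ValueError("shape and strides must have the same rank")
    dims = list(shape)
    strd = list(strides)

    if col_major:
        if len(dims) < 2:
            raise ValueError("col-major needs at least two dimensions")
        # Only the trailing two dims may be swapped; batch dims stay put.
        dims[-1], dims[-2] = dims[-2], dims[-1]
        strd[-1], strd[-2] = strd[-2], strd[-1]

    # The HW descriptor lists dimensions in the reverse of Triton's order,
    # so the stride-1 dimension ends up first.
    return dims[::-1], strd[::-1]
```

For a col-major 64x128 tensor with strides `[1, 64]`, the swap followed by the reversal puts the stride-1 dimension first: `to_hw_order([64, 128], [1, 64], True)` gives `([64, 128], [1, 64])`. A row-major tensor with strides `[128, 1]` skips the swap and gives `([128, 64], [1, 128])`.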

peterbell10 · 2026-03-17T01:57:17Z

+    %c_stride1 = arith.constant 1 : i64
+    // expected-error @+2 {{requires shared order [rank-2, rank-1, rank-3, rank-4, ..., 0] because dim[rank-2] has stride 1}}
+    // expected-error @+1 {{failed to legalize operation}}
+    %0 = tt.make_tensor_descriptor %arg0, [%c_shape, %c_shape], [%c_stride1, %runtime_stride] : <f16>, <tensor<64x64xf16, #shared>>


In NVIDIA TMA support we limit the descriptor to only the logical order that is handled by the hardware, and transposed loads are handled by putting a transpose (view) after the load in the program. Is there a compelling reason why AMD should be different?
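To make the alternative in this comment concrete, here is a small pure-Python model of "describe the memory in its hardware order, then transpose the view after the load". The helpers `load_row_major` and `transposed` are hypothetical stand-ins, not Triton or TMA APIs; in a real kernel the transpose would presumably be a view operation on the loaded tile.

```python
from typing import List


def load_row_major(buf: List[int], rows: int, cols: int) -> List[List[int]]:
    """Stand-in for a descriptor load that only understands row-major tiles."""
    return [[buf[r * cols + c] for c in range(cols)] for r in range(rows)]


def transposed(tile: List[List[int]]) -> List[List[int]]:
    """Stand-in for a transpose view applied after the load."""
    return [list(col) for col in zip(*tile)]


# Logical 2x3 matrix that the program wants to work with.
logical = [[1, 2, 3], [4, 5, 6]]

# The same matrix stored column-major in a flat buffer.
buf = [1, 4, 2, 5, 3, 6]

# Describe the buffer as a row-major 3x2 tensor (the order the memory
# actually has), load it, and transpose afterwards in the program.
tile = load_row_major(buf, rows=3, cols=2)
result = transposed(tile)
```

Here `result` equals `logical`, so the descriptor itself never needs to know about the col-major layout.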

Good points, Peter! This is syncing out some internal changes from our initial impl. We are taking another look at this based on the pointers. :)

Done via #10078


Partially reverts #9730. It still keeps and adjusts the improved verifiers so that they error out if the kernel provides invalid strides. This brings it in line with TMA, which also supports only the logical HW view and transposes the view in the kernel instead.


AlexAUT and others added 4 commits March 16, 2026 10:17

Fix swapTrailingDims

7313d23

Cleanup

8de5d49

antiagainst marked this pull request as ready for review March 17, 2026 00:02

antiagainst requested review from antiagainst, ptillet and zhanglx13 as code owners March 17, 2026 00:02

antiagainst approved these changes Mar 17, 2026


antiagainst merged commit a590e14 into triton-lang:main Mar 17, 2026
9 checks passed

peterbell10 reviewed Mar 17, 2026


AlexAUT mentioned this pull request Apr 20, 2026

[AMD] Revert col-major support for TDM #10078

Merged

antiagainst merged 4 commits into triton-lang:main from AlexAUT:tdmColMajor
