[AMD] Enable InThreadTranspose pass for RDNA3 / RDNA3.5 (gfx110x/115x)#10390

Merged
antiagainst merged 1 commit into triton-lang:main from ROCm:matthias.upstream-itt-rdna
May 27, 2026

@mgehre-amd
Contributor

InThreadTranspose rewrites tt.load -> ttg.local_alloc -> ttg.local_load -> dot_op so the K-contiguous WMMA/MFMA operand can be
read from LDS as wide ds_load_b128 instead of scalar ds_load_u16
pairs when the load order doesn't match the consumer's K dimension.
The pattern matcher in matchInThreadTransposePattern already accepts
AMDWmmaEncodingAttr alongside AMDMfmaEncodingAttr, but the gate in
is_in_thread_transpose_enabled only activates the pass on gfx942
(CDNA3) and gfx120x (RDNA4, enabled in #10185). Extend it to also
cover RDNA3 (gfx110x/gfx1103) and RDNA3.5 (gfx115x).
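
For illustration, the widened gate could look roughly like the sketch below. This is not the code from this PR: the helper name `itt_enabled_for_arch` and the regular expression are hypothetical, and the sketch only encodes the target families named above (gfx942, gfx110x, gfx115x, gfx120x).

```python
import re

# Hypothetical sketch of an architecture gate; not Triton's actual
# is_in_thread_transpose_enabled implementation.
# Families taken from the description: gfx942 (CDNA3), gfx110x (RDNA3),
# gfx115x (RDNA3.5), gfx120x (RDNA4).
_ITT_ARCH_RE = re.compile(r"gfx(942|110[0-9a-f]|115[0-9a-f]|120[0-9a-f])")


def itt_enabled_for_arch(arch: str) -> bool:
    """Return True if the (assumed) gate would enable the pass for `arch`."""
    return _ITT_ARCH_RE.fullmatch(arch) is not None


if __name__ == "__main__":
    for arch in ("gfx942", "gfx1103", "gfx1151", "gfx1201", "gfx90a"):
        print(arch, itt_enabled_for_arch(arch))
```

Matching on the family prefix rather than listing every chip keeps the gate short, at the cost of also accepting family members that may not have been tested.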

Added an inThreadTranspose_wmma sub-test to
test/TritonGPU/amd/in-thread-transpose.mlir (gfx1151, wave32, WMMA
encoding) that verifies the pass produces an amdg.in_thread_transpose
op and that the downstream ttg.local_load returns the K-contiguous
dot_op layout (kWidth = 16).

On AITER's flash_attn_2.varlen_fwd at the Qwen3-Omni ViT prefill
shape (B=1, S=3200, H=16, head_dim=72, fp16) on gfx1151, this lifts
the inner-loop V local_load from 512 scalar ds_load_u16(_d16_hi)
to 144 vectorized ds_load_b128 and gives a 3.8% median speedup
(3.042 -> 2.925 ms). Stacked with #10389 (D8 int
specialization), the same kernel reaches 2.376 ms (-21.9% vs main).
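
As a quick check of the percentages quoted above, the relative changes can be recomputed from the three median timings. The helper `pct_change` is a name introduced here for illustration only.

```python
def pct_change(before_ms: float, after_ms: float) -> float:
    """Relative change in percent; negative means the kernel got faster."""
    return (after_ms - before_ms) / before_ms * 100.0


# Median kernel times in milliseconds, as reported in the description.
main_ms = 3.042      # triton-lang/triton main
itt_ms = 2.925       # with InThreadTranspose enabled on gfx1151
stacked_ms = 2.376   # additionally stacked with #10389

print(f"{pct_change(main_ms, itt_ms):.1f}%")      # -3.8%
print(f"{pct_change(main_ms, stacked_ms):.1f}%")  # -21.9%
```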

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test/TritonGPU/amd/in-thread-transpose.mlir (lit)
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section.

`InThreadTranspose` rewrites `tt.load -> ttg.local_alloc ->
ttg.local_load -> dot_op` so the K-contiguous WMMA/MFMA operand can be
read from LDS as wide `ds_load_b128` instead of scalar `ds_load_u16`
pairs when the load order doesn't match the consumer's K dimension.
The pattern matcher in `matchInThreadTransposePattern` already accepts
`AMDWmmaEncodingAttr` alongside `AMDMfmaEncodingAttr`, but the gate in
`is_in_thread_transpose_enabled` only activates the pass on gfx942
(CDNA3) and gfx120x (RDNA4). Extend it to also cover RDNA3
(gfx110x/gfx1103) and RDNA3.5 (gfx115x).

On AITER's `flash_attn_2.varlen_fwd` at the Qwen3-Omni ViT prefill
shape (B=1, S=3200, H=16, head_dim=72, fp16) on gfx1151, this lifts
the inner-loop V `local_load` from 512 scalar `ds_load_u16(_d16_hi)`
to 144 vectorized `ds_load_b128` and gives a 3.8% median speedup
(3.042 -> 2.925 ms) on top of triton-lang/triton:main.
@mgehre-amd mgehre-amd force-pushed the matthias.upstream-itt-rdna branch from edb28da to 42e9629 Compare May 27, 2026 12:57
@antiagainst antiagainst marked this pull request as ready for review May 27, 2026 16:25
@antiagainst antiagainst enabled auto-merge (squash) May 27, 2026 17:55
@antiagainst antiagainst merged commit 0418ee6 into triton-lang:main May 27, 2026
10 checks passed
@antiagainst antiagainst deleted the matthias.upstream-itt-rdna branch May 27, 2026 18:10