[AMD] Enable IN_THREAD_TRANSPOSE to GFX1201 by default by skysnow2001 · Pull Request #10185 · triton-lang/triton

skysnow2001 · 2026-04-30T16:10:24Z

Enables in-thread transpose by default for gfx1201, matching the existing default for gfx942. The pass transposes elements within a thread, in registers, after the global load and before the local (shared memory) write. This preserves good global memory coalescing, keeps ds instructions wide, and avoids shared memory bank conflicts.
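As a rough illustration of the idea only (plain Python, not Triton's implementation), the sketch below models one thread's registers as a small 2D tile. The 4x4 tile shape, the `in_thread_transpose` helper, and the value encoding are hypothetical and chosen for readability.

```python
def in_thread_transpose(tile):
    """Transpose a per-thread register tile given as a list of equal-length rows."""
    return [list(column) for column in zip(*tile)]


# Hypothetical per-thread tile: 4 global loads (one per K index), each a
# vector of 4 elements that are contiguous along N in global memory.
# Values encode their (k, n) origin as 10 * k + n.
loaded = [[10 * k + n for n in range(4)] for k in range(4)]

# After the transpose, each register vector holds 4 consecutive K values for
# one fixed N, so it could be written to shared memory as one wide store.
transposed = in_thread_transpose(loaded)

for vector in transposed:
    print(vector)
```

The global loads stay contiguous along the memory-fast dimension, while the data written to shared memory is contiguous along the dimension the consumer reads.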

Flash-Attention 2 Results (bf16):

headdim	batch_size	seqlen	TFLOPS ITT=0	TFLOPS ITT=1	Δ
128	32	512	45.16	47.74	+5.7%
128	16	1024	53.34	63.69	+19.4%
128	8	2048	57.77	70.62	+22.3%
128	4	4096	61.90	76.49	+23.6%
128	2	8192	64.18	79.10	+23.3%
128	1	16384	62.15	73.48	+18.2%
128	1	32768	61.45	71.19	+15.9%

GEMM Results (bf16)

M	N	K	TFLOPS ITT=0	TFLOPS ITT=1	Δ
1024	1024	1024	60.64	66.26	+9.3%
2048	2048	2048	101.69	101.26	-0.4%
4096	4096	4096	109.69	120.89	+10.2%
8192	8192	8192	107.64	118.09	+9.7%
4096	11008	4096	107.50	117.04	+8.9%
4096	4096	11008	109.55	119.10	+8.7%
4096	14336	4096	108.94	118.11	+8.4%
4096	4096	14336	109.35	118.83	+8.7%
4096	12288	4096	108.27	117.76	+8.8%
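For readers who want to check the Δ column, here is a minimal sketch that recomputes the relative change from the two TFLOPS columns of the GEMM table. The helper name `speedup_pct` is an assumption for illustration. Because the inputs are the rounded values shown above, a recomputed figure can differ from the reported one in the last digit.

```python
def speedup_pct(baseline_tflops, new_tflops):
    """Relative throughput change of ITT=1 over ITT=0, in percent."""
    return (new_tflops / baseline_tflops - 1.0) * 100.0


# (M, N, K), TFLOPS with ITT=0, TFLOPS with ITT=1, reported delta in percent.
gemm_rows = [
    ((1024, 1024, 1024), 60.64, 66.26, 9.3),
    ((2048, 2048, 2048), 101.69, 101.26, -0.4),
    ((4096, 4096, 4096), 109.69, 120.89, 10.2),
    ((8192, 8192, 8192), 107.64, 118.09, 9.7),
    ((4096, 11008, 4096), 107.50, 117.04, 8.9),
    ((4096, 4096, 11008), 109.55, 119.10, 8.7),
    ((4096, 14336, 4096), 108.94, 118.11, 8.4),
    ((4096, 4096, 14336), 109.35, 118.83, 8.7),
    ((4096, 12288, 4096), 108.27, 117.76, 8.8),
]

recomputed = [speedup_pct(base, new) for _, base, new, _ in gemm_rows]

for (shape, _, _, reported), value in zip(gemm_rows, recomputed):
    print(shape, f"recomputed {value:+.1f}%", f"reported {reported:+.1f}%")
```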

New contributor declaration

I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these rules.
I have run pre-commit run --from-ref origin/main --to-ref HEAD.
Select one of the following.
- I have added tests.
  - /test for lit tests
  - /unittest for C++ tests
  - /python/test for end-to-end tests
- This PR does not need a test because it only flips the default value of is_in_thread_transpose_enabled for a new architecture (gfx1201).
Select one of the following.
- I have not added any lit tests.
- The lit tests I have added follow these best practices,
  including the "tests should be minimal" section. (Usually running Python code
  and using the instructions it generates is not minimal.)

antiagainst · 2026-04-30T20:54:41Z

 def is_in_thread_transpose_enabled(arch):
-    return (arch == "gfx942") if knobs.amd.use_in_thread_transpose is None else knobs.amd.use_in_thread_transpose
+    return (arch in ("gfx942",
+                     "gfx1201")) if knobs.amd.use_in_thread_transpose is None else knobs.amd.use_in_thread_transpose


We should include the other gfx12 targets too. See how is_hip_rdna4 is defined.
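One way the suggestion could look is sketched below. This is a standalone illustration, not the code that was merged: the `override` parameter is a hypothetical stand-in for `knobs.amd.use_in_thread_transpose`, and matching gfx12 targets with `startswith("gfx12")` is an assumption rather than a copy of how `is_hip_rdna4` is defined.

```python
from typing import Optional


def is_in_thread_transpose_enabled(arch: str, override: Optional[bool] = None) -> bool:
    """Decide whether in-thread transpose is on for a given architecture.

    `override` is a hypothetical stand-in for the user-facing knob: when it is
    set, it wins over the per-architecture default.
    """
    if override is not None:
        return override
    # Assumption: treat every arch whose name starts with "gfx12" as a gfx12 target.
    return arch == "gfx942" or arch.startswith("gfx12")


print(is_in_thread_transpose_enabled("gfx1200"))
print(is_in_thread_transpose_enabled("gfx942", override=False))
```

Keeping the override check first preserves the behavior of the original expression, where an explicitly set knob always takes precedence over the architecture default.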

From #10390: `InThreadTranspose` rewrites `tt.load -> ttg.local_alloc -> ttg.local_load -> dot_op` so the K-contiguous WMMA/MFMA operand can be read from LDS as wide `ds_load_b128` instead of scalar `ds_load_u16` pairs when the load order doesn't match the consumer's K dimension.

The pattern matcher in `matchInThreadTransposePattern` already accepts `AMDWmmaEncodingAttr` alongside `AMDMfmaEncodingAttr`, but the gate in `is_in_thread_transpose_enabled` only activates the pass on gfx942 (CDNA3) and gfx120x (RDNA4, enabled in #10185). Extend it to also cover RDNA3 (gfx110x/gfx1103) and RDNA3.5 (gfx115x).

Added an `inThreadTranspose_wmma` sub-test to `test/TritonGPU/amd/in-thread-transpose.mlir` (gfx1151, wave32, WMMA encoding) that verifies the pass produces an `amdg.in_thread_transpose` op and that the downstream `ttg.local_load` returns the K-contiguous `dot_op` layout (`kWidth = 16`).

On AITER's `flash_attn_2.varlen_fwd` at the Qwen3-Omni ViT prefill shape (B=1, S=3200, H=16, head_dim=72, fp16) on gfx1151, this lifts the inner-loop V `local_load` from 512 scalar `ds_load_u16(_d16_hi)` to 144 vectorized `ds_load_b128` and gives a 3.8% median speedup (3.042 -> 2.925 ms).

Enable in_thread_transpose by default for gfx1201

6f1b392

antiagainst requested changes Apr 30, 2026


skysnow2001 and others added 2 commits May 1, 2026 13:21

Enable for all GFX12

6530a81

Format a bit

b319717

antiagainst marked this pull request as ready for review May 1, 2026 19:07

antiagainst requested a review from zhanglx13 as a code owner May 1, 2026 19:07

antiagainst enabled auto-merge (squash) May 1, 2026 19:08

Merge branch 'main' into rdna4_thread_transp

c2920dd

antiagainst approved these changes May 1, 2026


antiagainst merged commit 5d69e1c into triton-lang:main May 1, 2026
8 of 9 checks passed

mgehre-amd mentioned this pull request May 27, 2026

[AMD] Enable InThreadTranspose pass for RDNA3 / RDNA3.5 (gfx110x/115x) #10390

Merged

7 tasks

antiagainst merged 4 commits into triton-lang:main from skysnow2001:rdna4_thread_transp
