[AMD] Clean up shuffleXor implementation #10065

Merged: antiagainst merged 3 commits into triton-lang:main from FrederickVu:shufflexor on Apr 18, 2026

Conversation

@FrederickVu
Contributor

We make the shuffleXor lowering more uniform by decomposing the xor mask and emitting instructions accordingly. For a `mask` in [1, 15], we use a single `row_xmask` DPP instruction on RDNA and gfx1250, and 1 or 2 DPP instructions on CDNA. For `mask >= 16`, we use a single `v_permlanex16` on RDNA and `ds_bpermute` on CDNA.
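The selection rule described above can be sketched as a small standalone C++ model. This is only an illustration under stated assumptions, not the code from this PR: `GpuFamily`, `XorLowering`, `selectXorLowering`, and `shuffleXorReference` are hypothetical names, and the sketch models only the RDNA and CDNA cases (the description groups gfx1250 with RDNA for masks in [1, 15] and does not say what gfx1250 does for `mask >= 16`). The reference function shows the semantics being lowered: lane `i` reads the value held by lane `i ^ mask`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration; none of these identifiers come from the PR.
enum class GpuFamily { RDNA, CDNA };

enum class XorLowering {
  RowXmaskDpp,  // a single row_xmask DPP instruction
  OneOrTwoDpp,  // 1 or 2 DPP instructions (the exact split is not modeled here)
  PermlaneX16,  // a single v_permlanex16
  DsBpermute    // ds_bpermute
};

// Picks a lowering strategy following the PR description:
// masks in [1, 15] stay within a 16-lane row, larger masks cross rows.
XorLowering selectXorLowering(GpuFamily family, uint32_t mask) {
  assert(mask >= 1 && "a zero mask is the identity and needs no shuffle");
  const bool withinRow = mask <= 15;
  if (family == GpuFamily::RDNA)
    return withinRow ? XorLowering::RowXmaskDpp : XorLowering::PermlaneX16;
  return withinRow ? XorLowering::OneOrTwoDpp : XorLowering::DsBpermute;
}

// Reference semantics of an xor shuffle: lane i receives the value of lane i ^ mask.
// Assumes the lane count is a power of two and mask is smaller than the lane count.
std::vector<int> shuffleXorReference(const std::vector<int> &lanes,
                                     uint32_t mask) {
  assert(mask < lanes.size() && "mask must address a lane inside the wave");
  std::vector<int> out(lanes.size());
  for (std::size_t i = 0; i < lanes.size(); ++i)
    out[i] = lanes[i ^ mask];
  return out;
}
```

Because xor is associative, shuffling by a mask with several bits set is equivalent to shuffling by each set bit in turn (for example, mask 3 equals mask 1 followed by mask 2). That property is what makes decomposing the mask a valid strategy in general; how the PR actually splits a given mask across instructions is not shown in this sketch.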

We also pull some static utility functions into an anonymous namespace and remove the ShflKind::down case from the enum as it was unimplemented.

// CHECK-LABEL: reduce_xor_max
tt.func @reduce_xor_max(%arg0: tensor<32xf32, #blocked4>) {
// CHECK: rocdl.ds_swizzle
// stride 16: CDNA fallback to bpermute
Member

Right now this is only checking gfx942. Can you also add check lines for gfx1250 given your changes?

Value hiSel = b.i32_val(buildSelectorMask(8));
return ROCDL::PermlaneX16Op::create(rewriter, loc, val.getType(), val, val,
                                    loSel, hiSel, true, false)
    .getRes();
Member

Nit: You don't need to call `getRes` explicitly for one-result ops; I believe the op converts implicitly to its only result value.
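The C++ mechanism behind this nit is an implicit conversion operator. The sketch below uses stand-in types (`MockValue`, `MockOneResultOp`, and both helper functions are hypothetical and are not the MLIR classes) to show why returning the op directly can yield the same value as calling the accessor, assuming the real op class provides such a conversion.

```cpp
#include <cassert>

// Stand-in types for illustration only; these are NOT the MLIR classes.
struct MockValue {
  int id;
};

struct MockOneResultOp {
  int resultId;

  // Explicit accessor, analogous in spirit to getRes() in the snippet above.
  MockValue getRes() const { return MockValue{resultId}; }

  // Implicit conversion to the op's single result.
  operator MockValue() const { return getRes(); }
};

// Returns the result through the explicit accessor.
MockValue explicitStyle(const MockOneResultOp &op) { return op.getRes(); }

// Returns the op itself; the conversion operator produces the result value.
MockValue implicitStyle(const MockOneResultOp &op) { return op; }
```

Both helpers return the same `MockValue`, so under that assumption the trailing accessor call is redundant.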

@antiagainst antiagainst merged commit 2796cea into triton-lang:main Apr 18, 2026
9 checks passed
bingyizh233 pushed a commit to bingyizh233/triton that referenced this pull request Apr 20, 2026
2 participants