Add D8 integer specialization tier (divisibility = 8 elements) by mgehre-amd · Pull Request #10389 · triton-lang/triton

mgehre-amd · 2026-05-27T12:19:02Z

Integer kernel args only get tt.divisibility = 16 when the runtime value
is divisible by 16. Strides that are divisible by 8 but not 16 (e.g. a
contiguous stride of 72 for an f16 tensor whose head-dim was padded from 72
to 128 elements) drop to the default divisibility=1. Downstream pointer
analysis then reports byte-divisibility 2 for the loaded tensor, Coalesce
picks sizePerThread = [1, 1], and the load lowers to scalar
buffer_load_u16 / global_load_u16 instead of *_b128.
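
The arithmetic behind this can be sketched in a few lines. This is a simplified model, not Triton's actual axis analysis: `max_vector_elems` and the constants are hypothetical names, and it assumes the widest available load is 16 bytes (`*_b128`) and that the base pointer itself is at least 16-byte aligned.

```python
from math import gcd

F16_BYTES = 2          # size of one f16 element in bytes
MAX_VECTOR_BYTES = 16  # a *_b128 load moves 128 bits = 16 bytes


def max_vector_elems(divisibility_elems: int, elem_bytes: int = F16_BYTES) -> int:
    """Widest per-thread vector (in elements) that a stride hint allows.

    divisibility_elems is the tt.divisibility value known for the stride,
    expressed in elements. Simplified model: the usable width is the part of
    the known byte alignment that fits into one 16-byte load.
    """
    known_align_bytes = divisibility_elems * elem_bytes
    usable_bytes = gcd(known_align_bytes, MAX_VECTOR_BYTES)
    return usable_bytes // elem_bytes


stride = 72
# Two tiers only (16 or nothing): 72 % 16 != 0, so the hint falls back to 1.
hint_two_tiers = 16 if stride % 16 == 0 else 1
# With an extra tier for 8: 72 % 8 == 0, so the hint is 8.
hint_with_d8 = 16 if stride % 16 == 0 else (8 if stride % 8 == 0 else 1)

print(max_vector_elems(hint_two_tiers))  # 1 element per load (16-bit scalar load)
print(max_vector_elems(hint_with_d8))    # 8 elements per load (128-bit load)
```

Under this model the same 128 f16 elements need 128 scalar loads without the hint and 128 / 8 = 16 vector loads with it, which matches the instruction counts quoted in the description.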

This PR adds a D8 specialization descriptor that maps to
tt.divisibility = 8, used when the value is divisible by 8 but not 16.
The existing D (=16) path is unchanged. The substring check in
BaseBackend.parse_attr now tests the longer D8 prefix first so
backend-specific suffixes (e.g. AMD's S for tt.pointer_range) still
compose into descriptors like D8S.
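
A minimal standalone sketch of the tiering and of why the prefix order matters. The function names `int_specialization_key` and `parse_descriptor` are hypothetical stand-ins, not the actual Triton functions, and the value stored for `tt.pointer_range` is an assumption for illustration only.

```python
def int_specialization_key(value: int) -> str:
    """Pick the descriptor tier for an integer kernel argument."""
    if value % 16 == 0:
        return "D"    # existing tier: divisible by 16
    if value % 8 == 0:
        return "D8"   # new tier: divisible by 8 but not by 16
    return ""         # no divisibility hint


def parse_descriptor(desc: str) -> dict:
    """Map a descriptor string to attributes.

    "D8" must be tested before "D": a plain `"D" in desc` check is also true
    for "D8" and "D8S", which would wrongly claim divisibility 16.
    """
    attrs = {}
    if "D8" in desc:
        attrs["tt.divisibility"] = 8
    elif "D" in desc:
        attrs["tt.divisibility"] = 16
    if "S" in desc:
        # Backend-specific suffix; the value 32 is an assumed placeholder.
        attrs["tt.pointer_range"] = 32
    return attrs


for v in (64, 72, 7):
    key = int_specialization_key(v)
    print(v, repr(key), parse_descriptor(key))
print(parse_descriptor("D8S"))
```

Testing the longer prefix first keeps existing descriptors such as `D` and `DS` parsing exactly as before while letting `D8` and `D8S` resolve to the smaller divisibility.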

On a representative attention shape (AITER's flash_attn_varlen_func for
Qwen3-Omni ViT prefill, B=1 S=3200 H=16 head_dim=72 fp16) on gfx1151 this
lifts the global loads from 128×buffer_load_u16 to 16×buffer_load_b128
and cuts vmcnt waits from 87 to 34, giving a 10.7% median speedup
(3.042 → 2.717 ms). Stacked with #10390 (RDNA InThreadTranspose
enable), the same kernel reaches 2.376 ms (-21.9% vs main).

New contributor declaration

I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these rules.
I have run pre-commit run --from-ref origin/main --to-ref HEAD.
Select one of the following.
- I have added tests.
  - /python/test/unit/runtime/test_specialize.py
  - python/test/unit/cuda/test_tensor_descriptor_cuda.py
- This PR does not need a test because FILL THIS IN.
Select one of the following.
- I have not added any lit tests.
- The lit tests I have added follow these best practices,
  including the "tests should be minimal" section.

Integer kernel args currently only get the `tt.divisibility = 16` MLIR attribute when the runtime value is divisible by 16. Strides that are divisible by 8 but not 16 (e.g. `stride = 72` for an f16 tensor whose head-dim is padded from 72 to 128 elements) drop to the default `divisibility=1`. Once that hint is missing, all downstream pointer arithmetic conservatively reports `divBytes=2` (one fp16 element) for the loaded tensor, Coalesce picks `sizePerThread = [1, 1]`, and the load lowers to scalar `buffer_load_u16` / `global_load_u16` instead of vectorized `buffer_load_b128` / `global_load_b128`.

Add a `D8` specialization descriptor that maps to `tt.divisibility = 8`.

- `python/src/specialize.cc`: tiered key selection (D / D8 / "") in both the i32/i64 and u64 fast paths of `handle_long_type`.
- `python/triton/backends/compiler.py`: `BaseBackend.parse_attr`'s substring check now tests the longer `D8` prefix first so existing backend-specific suffixes (e.g. AMD's `S` for `tt.pointer_range`) still compose cleanly into descriptors like `D8S`. The Python fallback `BaseBackend.get_int_specialization` returns the matching key.
- `python/test/unit/runtime/test_specialize.py`: extend `native_inputs_to_specialize` with D8 candidates (8, 24, 56, 72, `2**31-8`, `2**63-8`, ...) so the existing native-vs-reference parametrized cross-check covers the new tier across both `CUDABackend` and `HIPBackend`. Add a focused `test_d8_int_specialization` regression test that pins down the round trip `get_int_specialization -> parse_attr -> tt.divisibility` for the D / D8 / empty cases.

Perf impact on a representative attention shape: on the Qwen3-Omni ViT flash-attention forward (B=1, S=3200, H=16, head_dim=72, fp16, non-causal) running AITER's `flash_attn_2.varlen_fwd` on gfx1151, this lifts the global loads from 128 x `buffer_load_u16` to 16 x `buffer_load_b128` and reduces `s_waitcnt vmcnt` from 87 to 34.

End-to-end, the kernel goes from 3.227 ms / 3.081 ms (median/min) to 3.143 ms / 2.962 ms with this change in isolation. Stacked with the RDNA InThreadTranspose PR (see "New contributor declaration" below), the same kernel reaches 2.538 ms / 2.494 ms.

ThomasRaoux

This will trigger many more kernel compilations. In the past we decided against going in this direction.
This can be worked around at the kernel level by explicitly setting `tl.multiple_of` on the arguments that need the finer-grained alignment information.
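
To make the compilation concern concrete, here is a rough count of possible specialization keys. It is a sketch under stated assumptions: each integer argument is treated as independently falling into one tier, other specializations (such as the value being exactly 1) are ignored, and `count_key_combinations` is a hypothetical helper. The result is an upper bound, not a measurement.

```python
from itertools import product


def count_key_combinations(num_int_args: int, tiers: tuple) -> int:
    """Upper bound on distinct specialization keys for a kernel.

    Every combination of per-argument tiers is a distinct key, and each
    distinct key seen at runtime triggers its own compilation.
    """
    return sum(1 for _ in product(tiers, repeat=num_int_args))


TWO_TIERS = ("D", "")          # divisible by 16, or no hint
THREE_TIERS = ("D", "D8", "")  # with the additional divisible-by-8 tier

for n in (1, 2, 4, 6):
    print(n, count_key_combinations(n, TWO_TIERS), count_key_combinations(n, THREE_TIERS))
```

For a kernel with four integer arguments the bound grows from 2^4 = 16 to 3^4 = 81 variants, whereas an explicit hint inside the kernel adds no new keys.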

mgehre-amd · 2026-05-27T15:52:11Z

> This will trigger many more kernel compilations. In the past we decided against going in this direction. This can be worked around at the kernel level by explicitly setting `tl.multiple_of` on the arguments that need the finer-grained alignment information.

Thanks, will try that!

mgehre-amd marked this pull request as ready for review May 27, 2026 12:19

mgehre-amd requested a review from ptillet as a code owner May 27, 2026 12:19

mgehre-amd mentioned this pull request May 27, 2026

[AMD] Enable InThreadTranspose pass for RDNA3 / RDNA3.5 (gfx110x/115x) #10390

Merged

7 tasks

mgehre-amd marked this pull request as draft May 27, 2026 12:49

ThomasRaoux requested changes May 27, 2026

mgehre-amd closed this May 27, 2026

mgehre-amd wants to merge 1 commit into triton-lang:main from ROCm:matthias.upstream-d8
