Add D8 integer specialization tier (divisibility = 8 elements) #10389

Closed
mgehre-amd wants to merge 1 commit into triton-lang:main from ROCm:matthias.upstream-d8

@mgehre-amd
Contributor

@mgehre-amd mgehre-amd commented May 27, 2026

Integer kernel args only get tt.divisibility = 16 when the runtime value
is divisible by 16. Strides that are divisible by 8 but not 16 (e.g. a
contiguous stride of 72 for an f16 tensor whose head-dim was padded from 72
to 128 elements) drop to the default divisibility=1. Downstream pointer
analysis then reports byte-divisibility 2 for the loaded tensor, Coalesce
picks sizePerThread = [1, 1], and the load lowers to scalar
buffer_load_u16 / global_load_u16 instead of *_b128.
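
To make the arithmetic concrete, here is a rough sketch of how an element-divisibility hint bounds the load width. It is a simplified model, not Triton's actual axis-info analysis: `max_vector_bits` is a hypothetical helper, and the 128-bit cap is an assumption chosen to match the `*_b128` instructions named above.

```python
F16_BYTES = 2          # size of one f16 element in bytes
MAX_VECTOR_BITS = 128  # assumed widest load, matching *_b128


def max_vector_bits(divisibility_elems: int, elem_bytes: int = F16_BYTES) -> int:
    """Widest load (in bits) that an element-divisibility hint can justify."""
    byte_alignment = divisibility_elems * elem_bytes
    return min(byte_alignment * 8, MAX_VECTOR_BITS)


# The stride from the example: divisible by 8 but not by 16.
stride = 72
assert stride % 8 == 0 and stride % 16 != 0

print(max_vector_bits(1))  # no usable hint: 2-byte alignment, one f16 per load
print(max_vector_bits(8))  # hint of 8 elements: 16-byte alignment, full-width load
```

With no hint the model allows only a 16-bit load, while a hint of 8 elements already reaches the 128-bit cap, so for f16 data a divisibility of 16 is not required to vectorize fully.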

This PR adds a D8 specialization descriptor that maps to
tt.divisibility = 8, used when the value is divisible by 8 but not 16.
The existing D (=16) path is unchanged. The substring check in
BaseBackend.parse_attr now tests the longer D8 prefix first so
backend-specific suffixes (e.g. AMD's S for tt.pointer_range) still
compose into descriptors like D8S.
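
The ordering issue can be sketched with two standalone functions. These are illustrative stand-ins written for this description, not the actual `BaseBackend.parse_attr` code, and they ignore how a backend interprets its own suffixes such as `S`.

```python
def parse_attr_naive(desc: str) -> list:
    """Single-tier check: any 'D' in the descriptor means divisibility 16."""
    return [["tt.divisibility", 16]] if "D" in desc else []


def parse_attr_tiered(desc: str) -> list:
    """Test the longer 'D8' key first, then fall back to plain 'D'."""
    if "D8" in desc:
        return [["tt.divisibility", 8]]
    if "D" in desc:
        return [["tt.divisibility", 16]]
    return []


# Compare both checks on descriptors with and without a backend suffix.
for desc in ("", "D", "DS", "D8", "D8S"):
    print(repr(desc), parse_attr_naive(desc), parse_attr_tiered(desc))
```

Because `"D" in "D8S"` is true, the single-tier check would claim divisibility 16 for a value that is only divisible by 8. Testing `D8` first avoids that over-claim while leaving `D` and `DS` unchanged.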

On a representative attention shape (AITER's flash_attn_varlen_func for
Qwen3-Omni ViT prefill, B=1 S=3200 H=16 head_dim=72 fp16) on gfx1151 this
lifts the global loads from 128×buffer_load_u16 to 16×buffer_load_b128
and cuts vmcnt waits from 87 to 34, giving a 10.7% median speedup
(3.042 → 2.717 ms). Stacked with #10390 (RDNA InThreadTranspose
enable), the same kernel reaches 2.376 ms (-21.9% vs main).
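
As a quick sanity check on the figures above, the two instruction counts should move the same number of bytes, and the percentages should follow from the quoted timings. The snippet below only re-derives numbers already stated in this description; `speedup_pct` is a hypothetical helper.

```python
U16_BYTES = 2    # a *_u16 load moves 16 bits
B128_BYTES = 16  # a *_b128 load moves 128 bits

bytes_before = 128 * U16_BYTES  # 128 scalar loads
bytes_after = 16 * B128_BYTES   # 16 vector loads


def speedup_pct(baseline_ms: float, new_ms: float) -> float:
    """Relative time reduction versus the baseline, in percent."""
    return 100.0 * (baseline_ms - new_ms) / baseline_ms


print(bytes_before, bytes_after)            # both 256 bytes
print(round(speedup_pct(3.042, 2.717), 1))  # this PR alone
print(round(speedup_pct(3.042, 2.376), 1))  # stacked with #10390
```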

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /python/test/unit/runtime/test_specialize.py
      • python/test/unit/cuda/test_tensor_descriptor_cuda.py
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section.

Integer kernel args currently only get the `tt.divisibility = 16` MLIR
attribute when the runtime value is divisible by 16. Strides that are
divisible by 8 but not 16 (e.g. `stride = 72` for an f16 tensor whose
head-dim is padded from 72 to 128 elements) drop to the default
divisibility=1. Once that hint is missing, downstream pointer
analysis conservatively reports `divBytes=2` (one fp16 element) for the
loaded tensor, Coalesce picks `sizePerThread = [1, 1]`, and the load
lowers to scalar `buffer_load_u16` / `global_load_u16` instead of
vectorized `buffer_load_b128` / `global_load_b128`.

Add a `D8` specialization descriptor that maps to `tt.divisibility = 8`.
- `python/src/specialize.cc`: tiered key selection (D / D8 / "") in
  both the i32/i64 and u64 fast paths of `handle_long_type`.
- `python/triton/backends/compiler.py`: `BaseBackend.parse_attr`'s
  substring check now tests the longer `D8` prefix first so existing
  backend-specific suffixes (e.g. AMD's `S` for `tt.pointer_range`)
  still compose cleanly into descriptors like `D8S`. The Python
  fallback `BaseBackend.get_int_specialization` returns the matching
  key.
- `python/test/unit/runtime/test_specialize.py`: extend
  `native_inputs_to_specialize` with D8 candidates (8, 24, 56, 72,
  `2**31-8`, `2**63-8`, ...) so the existing native-vs-reference
  parametrized cross-check covers the new tier across both
  `CUDABackend` and `HIPBackend`. Add a focused
  `test_d8_int_specialization` regression test that pins down the
  round trip `get_int_specialization -> parse_attr ->
  tt.divisibility` for the D / D8 / empty cases.
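
A minimal sketch of the tiered key selection, checked against the D8 candidates listed above. `int_specialization_key` is a hypothetical pure-Python stand-in; it is not the code in `specialize.cc` or `BaseBackend.get_int_specialization`, and it omits any other conditions those paths may apply.

```python
def int_specialization_key(value: int) -> str:
    """Tiered key: 'D' for multiples of 16, 'D8' for multiples of 8, else ''."""
    if value % 16 == 0:
        return "D"
    if value % 8 == 0:
        return "D8"
    return ""


# Candidates named in the test description: divisible by 8 but not by 16.
d8_candidates = [8, 24, 56, 72, 2**31 - 8, 2**63 - 8]
assert all(int_specialization_key(v) == "D8" for v in d8_candidates)

# The existing tiers are unaffected.
assert int_specialization_key(16) == "D"
assert int_specialization_key(128) == "D"
assert int_specialization_key(7) == ""
```

Checking the 16 case before the 8 case keeps every value that previously mapped to `D` on the same key, so only values that used to fall through to the empty key can land in the new tier.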

Perf impact on a representative attention shape: on the Qwen3-Omni ViT
flash-attention forward (B=1, S=3200, H=16, head_dim=72, fp16,
non-causal) running AITER's `flash_attn_2.varlen_fwd` on gfx1151, this
lifts the global loads from 128 x `buffer_load_u16` to 16 x
`buffer_load_b128` and reduces `s_waitcnt vmcnt` from 87 to 34.
End-to-end the kernel goes from 3.227 ms / 3.081 ms (median/min) to
3.143 ms / 2.962 ms with this change in isolation. Stacked with the
RDNA InThreadTranspose PR (#10390), the same kernel reaches
2.538 ms / 2.494 ms.
@mgehre-amd mgehre-amd marked this pull request as ready for review May 27, 2026 12:19
@mgehre-amd mgehre-amd requested a review from ptillet as a code owner May 27, 2026 12:19
@mgehre-amd mgehre-amd marked this pull request as draft May 27, 2026 12:49
Collaborator

@ThomasRaoux ThomasRaoux left a comment

This will trigger many more kernel compilations. In the past we decided against going in this direction.
This can be worked around at the kernel level by explicitly setting multiple_of on the arguments that need the finer-grained alignment information.

@mgehre-amd
Contributor Author

This will trigger many more kernel compilations. In the past we decided against going in this direction. This can be worked around at the kernel level by explicitly setting multiple_of on the arguments that need the finer-grained alignment information.

Thanks, will try that!

@mgehre-amd mgehre-amd closed this May 27, 2026