Add D8 integer specialization tier (divisibility = 8 elements) #10389
Closed
mgehre-amd wants to merge 1 commit into
Integer kernel args currently only get the `tt.divisibility = 16` MLIR attribute when the runtime value is divisible by 16. Strides that are divisible by 8 but not 16 (e.g. `stride = 72` for an f16 tensor whose head-dim is padded from 72 to 128 elements) drop to the default divisibility=1. Once that hint is missing, every downstream pointer arithmetic conservatively reports `divBytes=2` (one fp16 element) for the loaded tensor, Coalesce picks `sizePerThread = [1, 1]`, and the load lowers to scalar `buffer_load_u16` / `global_load_u16` instead of vectorized `buffer_load_b128` / `global_load_b128`.

Add a `D8` specialization descriptor that maps to `tt.divisibility = 8`.

- `python/src/specialize.cc`: tiered key selection (D / D8 / "") in both the i32/i64 and u64 fast paths of `handle_long_type`.
- `python/triton/backends/compiler.py`: `BaseBackend.parse_attr`'s substring check now tests the longer `D8` prefix first so existing backend-specific suffixes (e.g. AMD's `S` for `tt.pointer_range`) still compose cleanly into descriptors like `D8S`. The Python fallback `BaseBackend.get_int_specialization` returns the matching key.
- `python/test/unit/runtime/test_specialize.py`: extend `native_inputs_to_specialize` with D8 candidates (8, 24, 56, 72, `2**31-8`, `2**63-8`, ...) so the existing native-vs-reference parametrized cross-check covers the new tier across both `CUDABackend` and `HIPBackend`. Add a focused `test_d8_int_specialization` regression test that pins down the round trip `get_int_specialization -> parse_attr -> tt.divisibility` for the D / D8 / empty cases.

Perf impact on a representative attention shape: on the Qwen3-Omni ViT flash-attention forward (B=1, S=3200, H=16, head_dim=72, fp16, non-causal) running AITER's `flash_attn_2.varlen_fwd` on gfx1151, this lifts the global loads from 128 x `buffer_load_u16` to 16 x `buffer_load_b128` and reduces `s_waitcnt vmcnt` from 87 to 34. End-to-end the kernel goes from 3.227 ms / 3.081 ms (median/min) to 3.143 ms / 2.962 ms with this change in isolation. Stacked with the RDNA InThreadTranspose PR (see "New contributor declaration" below), the same kernel reaches 2.538 ms / 2.494 ms.
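For readers who want the tiering in executable form, here is a minimal pure-Python sketch of the D / D8 / "" key selection and of the longest-prefix-first check described above. It is not the code from `specialize.cc` or `compiler.py`: the function names `int_specialization_key`, `parse_divisibility` and `max_vector_bytes` are hypothetical, the sketch models only the divisibility tiers, and the vector-width helper is a simplified assumption about how the hint bounds the access size (divisibility in elements times element size, capped at 128 bits).

```python
# Illustrative sketch only; names and structure are assumptions, not Triton's code.

def int_specialization_key(value: int) -> str:
    """Pick the divisibility tier for an integer kernel argument."""
    if value % 16 == 0:
        return "D"    # corresponds to tt.divisibility = 16
    if value % 8 == 0:
        return "D8"   # corresponds to tt.divisibility = 8
    return ""         # no hint: default divisibility of 1


def parse_divisibility(desc: str) -> int:
    """Recover the divisibility from a descriptor such as "D", "D8" or "D8S".

    The longer "D8" prefix is tested first; testing "D" first would
    misread "D8S" as divisible by 16.
    """
    if desc.startswith("D8"):
        return 8
    if desc.startswith("D"):
        return 16
    return 1


def max_vector_bytes(divisibility: int, elem_bytes: int, cap: int = 16) -> int:
    """Simplified model: largest aligned access in bytes, capped at 128 bits."""
    return min(cap, divisibility * elem_bytes)


stride = 72                                  # fp16 elements, as in the example above
key = int_specialization_key(stride)         # "D8"
div = parse_divisibility(key + "S")          # 8, even with a backend suffix appended
without_hint = max_vector_bytes(1, 2)        # 2 bytes: one fp16 element per load
with_hint = max_vector_bytes(div, 2)         # 16 bytes: one 128-bit load
```

Under these assumptions, a stride of 72 moves from a 2-byte to a 16-byte guaranteed alignment, which matches the `u16` to `b128` change reported in the description.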
ThomasRaoux (Collaborator) requested changes on May 27, 2026 and left a comment:
This will trigger many more kernel compilations. In the past we decided against going in this direction.
This can be worked around at the kernel level by explicitly setting `multiple_of` on the arguments that need the finer-grained alignment information.
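A rough way to see the compilation concern is to count distinct specialization keys for a made-up workload. The sketch below is an assumption-laden model, not a measurement: it assumes each distinct combination of per-argument divisibility tiers produces a separate compiled kernel, and the helper names `tier` and `count_variants` and the stride values are hypothetical.

```python
from itertools import product


def tier(value, tiers):
    """Return the largest tier that divides value, or 1 when none does.

    `tiers` must be sorted in descending order, e.g. (16, 8).
    """
    for t in tiers:
        if value % t == 0:
            return t
    return 1


def count_variants(calls, tiers):
    """Count distinct per-argument tier combinations across all calls."""
    return len({tuple(tier(v, tiers) for v in args) for args in calls})


# Hypothetical workload: three integer stride arguments, each of which can be
# divisible by 16 (64), by 8 only (72), or by neither (75).
values = [64, 72, 75]
calls = list(product(values, repeat=3))

variants_d_only = count_variants(calls, (16,))       # 2**3 = 8
variants_with_d8 = count_variants(calls, (16, 8))    # 3**3 = 27
```

In this model the number of variants grows from 2 to 3 per integer argument, so the total grows exponentially in the number of such arguments, whereas an in-kernel hint on a few chosen arguments leaves the specialization keys unchanged.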
mgehre-amd (Contributor, Author) replied:
Thanks, will try that!
Integer kernel args only get `tt.divisibility = 16` when the runtime value is divisible by 16. Strides that are divisible by 8 but not 16 (e.g. a contiguous stride of 72 for an f16 tensor whose head-dim was padded from 72 to 128 elements) drop to the default divisibility=1. Downstream pointer analysis then reports byte-divisibility 2 for the loaded tensor, `Coalesce` picks `sizePerThread = [1, 1]`, and the load lowers to scalar `buffer_load_u16` / `global_load_u16` instead of `*_b128`.

This PR adds a `D8` specialization descriptor that maps to `tt.divisibility = 8`, used when the value is divisible by 8 but not 16. The existing `D` (=16) path is unchanged. The substring check in `BaseBackend.parse_attr` now tests the longer `D8` prefix first so backend-specific suffixes (e.g. AMD's `S` for `tt.pointer_range`) still compose into descriptors like `D8S`.

On a representative attention shape (AITER's `flash_attn_varlen_func` for Qwen3-Omni ViT prefill, `B=1 S=3200 H=16 head_dim=72 fp16`) on gfx1151 this lifts the global loads from 128× `buffer_load_u16` to 16× `buffer_load_b128` and cuts `vmcnt` waits from 87 to 34, giving a 10.7% median speedup (3.042 → 2.717 ms). Stacked with #10390 (RDNA InThreadTranspose enable), the same kernel reaches 2.376 ms (-21.9% vs main).

New contributor declaration
- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - `/python/test/unit/runtime/test_specialize.py`, `python/test/unit/cuda/test_tensor_descriptor_cuda.py`
  - `FILL THIS IN`.
- Select one of the following.
  - `lit` tests.
  - `lit` tests I have added follow these best practices, including the "tests should be minimal" section.