[NVIDIA] Enable TMA gather4 on sm_120 and sm_121#8498
Merged
masahi merged 10 commits into triton-lang:main on Oct 24, 2025
Conversation
Contributor
Author
Intended to open this against my local fork, but I made a mistake. Sorry.
Contributor
Author
Ready for review!
ita9naiwa commented Oct 21, 2025
ThomasRaoux reviewed Oct 22, 2025
ThomasRaoux reviewed Oct 23, 2025
Comment on lines +1699 to +1705
    AsyncTMAGatherOpConversion(LLVMTypeConverter &converter,
                               PatternBenefit benefit, int computeCapability)
        : ConvertOpToLLVMPattern<triton::nvidia_gpu::AsyncTMAGatherOp>(converter,
                                                                       benefit),
          computeCapability(computeCapability) {}

      int computeCapability;
Collaborator
We can revert that too?
Contributor
Author
Thanks for catching what I missed! Fixed!
ThomasRaoux approved these changes Oct 23, 2025
masahi pushed a commit to masahi/triton that referenced this pull request on Oct 24, 2025
- Enable cp.async.bulk.tensor.2d.tile::gather4.shared on sm_120 and sm_121.
- Skip the TMA scatter4 test on sm_120, since scatter4 is unsupported by the hardware.

Note: all other TMA features except for cluster-related ones are supported on sm_120.
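The capability gate used to skip the scatter4 test can be sketched as a small helper. The name `is_sm12x()` appears in the PR's test changes; this body and the usage sketch are illustrative assumptions, not the PR's exact code:

```python
def is_sm12x(capability):
    """True for compute capability 12.x (sm_120 / sm_121).

    `capability` is a (major, minor) tuple, e.g. the value returned by
    torch.cuda.get_device_capability(). Illustrative sketch only; the
    real helper in the PR may differ.
    """
    major, _minor = capability
    return major == 12

# Usage sketch (assumes pytest and torch are available):
# @pytest.mark.skipif(is_sm12x(torch.cuda.get_device_capability()),
#                     reason="TMA scatter4 is unsupported on sm_12x hardware")
# def test_tma_scatter4(...): ...
```

Gating on the major version alone covers both sm_120 and sm_121, matching the PR's claim that gather4 works on both while scatter4 does not.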
How much does TMA speed things up vs non-TMA on sm_120, by the way?
tmoreau89 pushed a commit to tmoreau89/triton that referenced this pull request on Dec 1, 2025
- Enable cp.async.bulk.tensor.2d.tile::gather4.shared on sm_120 and sm_121.
- Skip the TMA scatter4 test on sm_120, since scatter4 is unsupported by the hardware.

Note: all other TMA features except for cluster-related ones are supported on sm_120.
janreges added a commit to janreges/vllm that referenced this pull request on Dec 21, 2025
- Add SM120 to the triton_kernels_supported condition in both backend selection functions (get_mxfp4_backend, get_mxfp4_backend_with_lora).
- Use StridedLayout for SM120 to avoid the "Must use persistent kernel" error caused by unsupported cluster TMA operations.
- Configure SM120-specific constraints: is_persistent=False, num_stages=1.

Tested on NVIDIA RTX PRO 6000 Blackwell (compute capability 12.0). Requires Triton fix: triton-lang/triton#8498
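The SM120-specific constraints described in this commit can be sketched as a capability-keyed config selector. The function name, return shape, and non-SM120 values below are hypothetical simplifications; vLLM's real get_mxfp4_backend logic is more involved:

```python
def select_mxfp4_config(capability):
    """Hypothetical sketch of the SM120 gating described above.

    `capability` is a (major, minor) tuple. Only the SM120 branch
    reflects the commit; the default branch values are illustrative.
    """
    major, minor = capability
    cc = 10 * major + minor
    if cc == 120:
        # SM120 lacks cluster TMA operations, so the persistent-kernel
        # path (which triggers "Must use persistent kernel") is avoided:
        # non-persistent, single-stage, strided layout.
        return {"layout": "StridedLayout",
                "is_persistent": False,
                "num_stages": 1}
    # Default path for other supported capabilities (illustrative values).
    return {"layout": "default",
            "is_persistent": True,
            "num_stages": 3}
```

Keying the decision on a single integer (major * 10 + minor) mirrors how compute capabilities like 12.0 are commonly compared in Triton and vLLM code.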
meta-codesync bot pushed a commit to facebookexperimental/triton that referenced this pull request on Mar 28, 2026
…n sm_120 and sm_121 (#8498)'

Summary: This is a cherry-pick of an upstream PR: triton-lang/triton#8498

Upstream commit message:
```
> [NVIDIA] Enable TMA gather4 on sm_120 and sm_121 (#8498)
> - Enable cp.async.bulk.tensor.2d.tile::gather4.shared on sm_120 and sm_121.
> - Skip TMA scatter4 test on sm_120 since it is unsupported by hardware.
> Note: All other TMA features except for cluster-related ones are supported on sm_120.
```

Conflict Resolution:
- File: python/test/unit/language/test_tensor_descriptor.py
- Action: Added the is_sm12x() skipif decorator from upstream; kept the local function signature without the 'device' param (the body uses a hardcoded 'cuda').
- Reason: The local version intentionally omits the device fixture for this test; upstream's intent was to add the sm120 skip guard.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2251271000/
Resolution Diff: https://www.internalfb.com/intern/paste/P2251271369/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 4d85824

Reviewed By: dshi7
Differential Revision: D98272343
fbshipit-source-id: 8578ef3a83f2a4120369c969a58ed6e34adb6deb