
[NVIDIA] Enable TMA gather4 on sm_120 and sm_121#8498

Merged
masahi merged 10 commits into triton-lang:main from ita9naiwa:sm120tma
Oct 24, 2025
Conversation

@ita9naiwa
Contributor

@ita9naiwa ita9naiwa commented Oct 21, 2025

  • Enable cp.async.bulk.tensor.2d.tile::gather4.shared on sm_120 and sm_121.
  • Skip TMA scatter4 test on sm_120 since it is unsupported by hardware.

Note:
All other TMA features except for cluster-related ones are supported on sm_120.
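
The support matrix described above can be sketched as a few predicates (an illustrative sketch only, not Triton's actual capability checks; the helper names and the `(major, minor)` tuple convention are assumptions):

```python
def is_sm12x(capability):
    """True for sm_120 / sm_121, given a (major, minor) capability tuple."""
    major, minor = capability
    return major == 12 and minor in (0, 1)

def supports_tma_gather4(capability):
    # Assumed here: gather4 is a Blackwell-class (sm_100+) feature,
    # which this PR extends to sm_120 / sm_121.
    major, _ = capability
    return major >= 10

def supports_tma_scatter4(capability):
    # scatter4 remains unsupported by sm_120 / sm_121 hardware.
    return supports_tma_gather4(capability) and not is_sm12x(capability)
```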

@ita9naiwa ita9naiwa requested a review from ptillet as a code owner October 21, 2025 09:26
@ita9naiwa
Contributor Author

ita9naiwa commented Oct 21, 2025

I intended to open this against my local fork, but I made a mistake. Sorry.

@ita9naiwa ita9naiwa closed this Oct 21, 2025
@ita9naiwa ita9naiwa changed the title Support tma scatter, disable gather test [NVIDIA] Enable TMA gather4 on sm_120 and sm_121 Oct 21, 2025
@ita9naiwa
Contributor Author

Ready for review!

@ita9naiwa ita9naiwa reopened this Oct 21, 2025
@masahi masahi requested review from Mogball and ThomasRaoux and removed request for ptillet October 21, 2025 20:59
Comment thread test/Conversion/tma_to_llvm.mlir Outdated
Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/LoadStoreOpToLLVM.cpp Outdated
Comment on lines +1699 to +1705
AsyncTMAGatherOpConversion(LLVMTypeConverter &converter,
PatternBenefit benefit, int computeCapability)
: ConvertOpToLLVMPattern<triton::nvidia_gpu::AsyncTMAGatherOp>(converter,
benefit),
computeCapability(computeCapability) {}

int computeCapability;
Collaborator


we can revert that too?

Contributor Author


Thanks for catching what I missed! Fixed!

@masahi masahi merged commit 4d85824 into triton-lang:main Oct 24, 2025
9 checks passed
@ita9naiwa ita9naiwa deleted the sm120tma branch October 24, 2025 05:55
masahi pushed a commit to masahi/triton that referenced this pull request Oct 24, 2025
@mobicham

How much does TMA speed things up vs non-TMA on sm_120, btw?

tmoreau89 pushed a commit to tmoreau89/triton that referenced this pull request Dec 1, 2025
janreges added a commit to janreges/vllm that referenced this pull request Dec 21, 2025
- Add SM120 to triton_kernels_supported condition in both backend
  selection functions (get_mxfp4_backend, get_mxfp4_backend_with_lora)
- Use StridedLayout for SM120 to avoid "Must use persistent kernel"
  error caused by unsupported cluster TMA operations
- Configure SM120-specific constraints: is_persistent=False, num_stages=1

Tested on NVIDIA RTX PRO 6000 Blackwell (compute capability 12.0).
Requires Triton fix: triton-lang/triton#8498
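
The SM120-specific constraints listed in that commit could be expressed roughly like this (a hypothetical sketch; the function name, config keys, and layout strings are assumptions, not vLLM's real API):

```python
def mxfp4_kernel_config(capability):
    """Pick MXFP4 kernel constraints by compute capability (illustrative)."""
    major, minor = capability
    # Default path for datacenter parts (assumed values for illustration).
    cfg = {"is_persistent": True, "num_stages": 3, "layout": "Swizzled"}
    if major == 12:
        # sm_120 / sm_121 lack cluster TMA operations, so avoid the
        # "Must use persistent kernel" path: strided layout, no
        # persistence, a single pipeline stage.
        cfg.update(is_persistent=False, num_stages=1, layout="Strided")
    return cfg
```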
meta-codesync Bot pushed a commit to facebookexperimental/triton that referenced this pull request Mar 28, 2026
…n sm_120 and sm_121 (#8498)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#8498

Upstream commit message:
```
> [NVIDIA] Enable TMA gather4 on sm_120 and sm_121 (#8498)

> - Enable cp.async.bulk.tensor.2d.tile::gather4.shared on sm_120 and
> sm_121.
> - Skip TMA scatter4 test on sm_120 since it is unsupported by hardware.

> Note:
> All other TMA features except for cluster-related ones are supported on
> sm_120.
```

Conflict Resolution:
- File: python/test/unit/language/test_tensor_descriptor.py
  Action: Added the is_sm12x() skipif decorator from upstream; kept the local function signature without the 'device' param (the body uses a hardcoded 'cuda')
  Reason: Local version intentionally omits device fixture for this test; upstream's intent was to add the sm120 skip guard
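
The skip guard mentioned in the resolution might look like the following (a sketch; `device_capability` is a stand-in for a real query such as `torch.cuda.get_device_capability()`, and the test name is hypothetical):

```python
import pytest

def device_capability():
    # Stub for illustration; a real helper would query the active GPU.
    return (12, 0)

def is_sm12x():
    major, _ = device_capability()
    return major == 12

@pytest.mark.skipif(is_sm12x(), reason="TMA scatter4 is unsupported on sm_120/121")
def test_tma_scatter4():
    ...  # would exercise the scatter4 path on supported hardware
```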

Raw Conflicts: https://www.internalfb.com/intern/paste/P2251271000/
Resolution Diff: https://www.internalfb.com/intern/paste/P2251271369/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 4d85824

Reviewed By: dshi7

Differential Revision: D98272343

fbshipit-source-id: 8578ef3a83f2a4120369c969a58ed6e34adb6deb
