[triton][beta] [Cherry-pick] '[AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)' by agron911 · Pull Request #1335 · facebookexperimental/triton

agron911 · 2026-04-24T18:09:11Z

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9087

Upstream commit message:

> [AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)

> Enables `ttg.async_copy_global_to_local` for pipelined loads by default
> on `gfx950` and `gfx1250`.

> This increases LDS consumption because we replace one register buffer
> with an additional LDS buffer. After this change, the number of LDS
> buffers is equal to `num_stages` (previously it was `num_stages - 1`).
> Therefore, some test configs need to be skipped because we run out of
> shared memory capacity on `gfx950`.
> ---------

> Co-authored-by: Lei Zhang <antiagainst@gmail.com>
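The buffer-count change described in the upstream message can be made concrete with a little arithmetic. The following is a minimal sketch, not code from the PR: the helper names, the tile shapes, and the 160 KiB LDS capacity are all assumptions chosen for illustration and should be checked against the actual kernel configuration and hardware documentation.

```python
KIB = 1024


def lds_buffer_count(num_stages: int, async_copy: bool) -> int:
    """LDS buffers used by a pipelined load, following the commit message:
    num_stages with AsyncCopy, num_stages - 1 without it."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    return num_stages if async_copy else num_stages - 1


def lds_bytes(stage_bytes: int, num_stages: int, async_copy: bool) -> int:
    """Total LDS bytes if every buffer holds one stage worth of tiles."""
    return lds_buffer_count(num_stages, async_copy) * stage_bytes


# Hypothetical fp16 GEMM tiles (2 bytes per element); not taken from the PR.
a_tile_bytes = 128 * 64 * 2   # 16 KiB
b_tile_bytes = 64 * 256 * 2   # 32 KiB
stage_bytes = a_tile_bytes + b_tile_bytes

# Assumed per-workgroup LDS limit, used only to illustrate the overflow.
ASSUMED_LDS_CAPACITY = 160 * KIB

before = lds_bytes(stage_bytes, num_stages=4, async_copy=False)
after = lds_bytes(stage_bytes, num_stages=4, async_copy=True)

print(f"before: {before // KIB} KiB, fits: {before <= ASSUMED_LDS_CAPACITY}")
print(f"after:  {after // KIB} KiB, fits: {after <= ASSUMED_LDS_CAPACITY}")
```

Under these assumed numbers, a configuration that fit with `num_stages - 1` buffers no longer fits with `num_stages` buffers, which is the kind of case the message says had to be skipped in tests.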

Do not remove the following line from this commit
Reactor Cherry-pick Revision: b54cb94

Diff Comparison: https://www.internalfb.com/intern/paste/P2283343700/

This diff was generated by running:

buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit

Differential Revision: D101982809

…hase 5

Summary:
Disable budget-aware layout conversion elimination (Phase 5, smem_budget > 0) which crashes on Blackwell with `LLVM ERROR: Invalid out-dim size`.

Root cause: `propagateSrcEncodingAndErase()` skips `scf::YieldOp` during type rewriting, leaving `scf::ForOp` results with stale encodings that corrupt LinearLayout dimensions.

Also fix `GenerateSubtiledRegion.cpp` build break (`CGAEncodingAttr::getDefault` -> `get1CTALayout`).

Differential Revision: D101982801

…bit dot precision to TF32x3 (#9080)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9080

Upstream commit message:
```
> [LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line and updated docstring to use tf32x3 instead of tf32. Did not add the upstream-introduced assert/if input_precision body code, since the local code path delegates input_precision processing to semantic.py.
  Reason: The local file was refactored to move input_precision default-setting logic from core.py.dot() to semantic.py. Adding the upstream body code here would duplicate logic and be unreachable.
- File: python/triton/language/semantic.py
  Action: Updated supports_tf32 check and default value from "tf32" to "tf32x3" in the input_precision branch of the dot() method.
  Reason: This file holds the actual default-precision logic locally; matching upstream's intent of changing the default precision from tf32 to tf32x3 requires updating it here.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283333395/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283336430/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283337118/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 63b387c

Differential Revision: D101982808
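The conflict resolution above depends on a single place choosing the default precision when the caller passes none. The sketch below illustrates that pattern only; `resolve_input_precision` and its parameters are hypothetical stand-ins and are not Triton's actual `semantic.py` code.

```python
def resolve_input_precision(input_precision, allowed, default="tf32x3"):
    """Pick the dot input precision in one place.

    Hypothetical helper: `allowed` stands in for the set of precisions the
    target backend accepts, and `default` for the value under discussion
    ("tf32" before the change, "tf32x3" after it).
    """
    if input_precision is None:
        # Use the default only if the backend supports it; otherwise fall
        # back to full IEEE fp32.
        supports_default = default in allowed
        input_precision = default if supports_default else "ieee"
    if input_precision not in allowed:
        raise ValueError(f"unsupported input_precision: {input_precision!r}")
    return input_precision


# Example: a backend that accepts all three modes, and one that does not.
print(resolve_input_precision(None, ("tf32", "tf32x3", "ieee")))
print(resolve_input_precision(None, ("ieee",)))
```

Keeping the default in one function means a change of default, or a later revert of it, touches a single value rather than every call site, which is why duplicating the upstream body in `core.py` was avoided.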

…084)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9084

Upstream commit message:
```
> [TOOLS] Add hip support for link.py (#9084)
> * Use the same link cpp scr except hipStrean/CUstream etc.
> * Add a link.h prelude for AMD/Nvidia to adapt for the difference.
> * Enable test_aot.py for AMD.
> * Also rename AMD's compile.cpp to compile.c.
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: a0e769f

Diff Comparison: https://www.internalfb.com/intern/paste/P2283337631/

---
This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Differential Revision: D101982807

…slation from mid-end to lowerings (#9082)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9082

Upstream commit message:
```
> [Backend] Move TMA index translation from mid-end to lowerings (#9082)
>
> Currently the behavior of fp4_padded is different between
> `triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if
> the data is int8, while the latter is indexed by individual fp4
> elements, which is what the TMA hardware expects.
>
> This now gets leaked into gluon, which isn't ideal. So, this PR moves
> the translation into the lowerings. Along the way, this probably fixes
> quite a few bugs as there were several places the translation was
> missing.
```

Conflict Resolution:
- File: lib/Dialect/TritonGPU/Transforms/Pipeliner/LowerLoops.cpp
  Action: Adopted upstream's approach (drop translateTMAIndices call, use loadOp.getIndices() directly) but preserved local Value() multicastTargets parameter for AsyncTMACopyGlobalToLocalOp::create.
  Reason: Local op signature includes multicastTargets as the first argument; upstream version does not.
- File: lib/Dialect/TritonNvidiaGPU/Transforms/TMALowering.cpp
  Action: Three conflicts resolved similarly. For AsyncTMACopyGlobalToLocalOp::create kept Value() multicastTargets prefix; for AsyncTMACopyLocalToGlobalOp::create and AsyncTMAReduceOp::create used upstream version directly (no multicastTargets in their signatures).
  Reason: Stale local references to undefined tmaPtr/indices needed replacement; preserved local-only multicastTargets argument where applicable.
- File: third_party/nvidia/lib/Dialect/NVWS/Transforms/LowerAref.cpp
  Action: Kept Value() multicastTargets prefix and used op.getIndices() directly.
  Reason: Same rationale as above; stale 'indices' identifier was undefined.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283337998/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283339520/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283341283/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: db1f8c3

Differential Revision: D101982803

…cope (#9088)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9088

Upstream commit message:
```
> Fix address sanitizer stack-use-after-scope (#9088)
> std::make_tuple here will copy the arguments into a tuple so it creates
> a copy of SmallVector subsliceOffsets and then passes back a tuple with
> an ArrayRef. The SmallVector object is then out of scope. Bypassing
> make_tuple means that it uses the underlying AllocationSlice's reference
> to subsliceOffsets rather than the temporary copy created by make_tuple.
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: a04108b

Diff Comparison: https://www.internalfb.com/intern/paste/P2283341704/

---
This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Differential Revision: D101982805

…ault 32-bit dot precision to TF32x3" (#9090)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9090

Upstream commit message:
```
> Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3" (#9090)
>
> Reverts triton-lang/triton#9080 as it cause some tmem allocation
> regression due to simplistic hoisting logic
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line. Reverted docstring lines from tf32x3 back to tf32 to match upstream's revert.
  Reason: Same divergence as the original cherry-pick of #9080 (assert/if input_precision body lives in semantic.py locally). Maintained that local refactor by reverting only the docstring here.
- File: python/triton/language/semantic.py
  Action: Reverted supports_tf32 check and default value from "tf32x3" back to "tf32" in the input_precision branch of the dot() method.
  Reason: Mirror revert: the prior cherry-pick of #9080 applied the tf32x3 change here (instead of upstream's core.py location); this revert undoes it in the same place.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283342039/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283342643/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283343204/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 606eebc

Differential Revision: D101982800

meta-codesync · 2026-04-24T18:09:54Z

@agron911 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101982809.

…fx950 and gfx1250 (#9087)' (facebookexperimental#1335)

Summary:
Pull Request resolved: facebookexperimental#1335

This is a cherry-pick of an upstream PR: triton-lang/triton#9087

Upstream commit message:
```
> [AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)
> Enables `ttg.async_copy_global_to_local` for pipelined loads by default
> on `gfx950` and `gfx1250`.
> This increases LDS consumption because we replace one register buffer
> with an additional LDS buffer. After this change, the number of LDS
> buffers is equal to `num_stages` (previously it was `num_stages - 1`).
> Therefore, some test configs need to be skipped because we run out of
> shared memory capacity on `gfx950`.
> ---------
> Co-authored-by: Lei Zhang <antiagainst@gmail.com>
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: b54cb94

Diff Comparison: https://www.internalfb.com/intern/paste/P2283343700/

---
This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Reviewed By: sfzhu93

Differential Revision: D101982809

meta-codesync · 2026-04-24T20:58:07Z

This pull request has been merged in 8e47abe.

agron911 added 7 commits April 23, 2026 23:41

meta-cla Bot added the CLA Signed label Apr 24, 2026

meta-codesync Bot added fb-exported meta-exported labels Apr 24, 2026

meta-codesync Bot closed this in 8e47abe Apr 24, 2026

facebook-github-tools Bot added the Merged label Apr 24, 2026

agron911 wants to merge 7 commits into facebookexperimental:main from agron911:export-D101982809
