[triton][beta] [Cherry-pick] '[AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)' #1335
Closed
agron911 wants to merge 7 commits into
…hase 5

Summary: Disable budget-aware layout conversion elimination (Phase 5, smem_budget > 0) which crashes on Blackwell with `LLVM ERROR: Invalid out-dim size`.

Root cause: `propagateSrcEncodingAndErase()` skips `scf::YieldOp` during type rewriting, leaving `scf::ForOp` results with stale encodings that corrupt LinearLayout dimensions.

Also fix `GenerateSubtiledRegion.cpp` build break (`CGAEncodingAttr::getDefault` -> `get1CTALayout`).

Differential Revision: D101982801
…bit dot precision to TF32x3 (#9080)'

Summary: This is a cherry-pick of an upstream PR: triton-lang/triton#9080

Upstream commit message:
```
> [LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line and updated docstring to use tf32x3 instead of tf32. Did not add the upstream-introduced assert/if input_precision body code, since the local code path delegates input_precision processing to semantic.py.
  Reason: The local file was refactored to move input_precision default-setting logic from core.py.dot() to semantic.py. Adding the upstream body code here would duplicate logic and be unreachable.
- File: python/triton/language/semantic.py
  Action: Updated supports_tf32 check and default value from "tf32" to "tf32x3" in the input_precision branch of the dot() method.
  Reason: This file holds the actual default-precision logic locally; matching upstream's intent of changing the default precision from tf32 to tf32x3 requires updating it here.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283333395/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283336430/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283337118/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 63b387c

Differential Revision: D101982808
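For readers unfamiliar with the precision mode this commit switches to, the following is a rough, standard-library-only sketch of the general "three TF32 products" idea: each fp32 operand is split into a TF32-sized high part plus a TF32-sized remainder, and three products are summed. It is not Triton's implementation. The helper names are hypothetical, the conversion is modeled as truncation (real hardware may round differently), and accumulation is done in Python doubles for simplicity.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (a double) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def trunc_tf32(x: float) -> float:
    """Assumed TF32 model: keep fp32 sign/exponent and the top 10 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the low 13 of fp32's 23 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def split_tf32(x: float):
    """Split x into a TF32 high part and a TF32 remainder (hypothetical helper)."""
    hi = trunc_tf32(x)
    lo = trunc_tf32(x - hi)
    return hi, lo


def mul_tf32(a: float, b: float) -> float:
    """Single product with both operands reduced to TF32."""
    return trunc_tf32(a) * trunc_tf32(b)


def mul_tf32x3(a: float, b: float) -> float:
    """Three products: the two hi*lo cross terms plus the hi*hi term."""
    a_hi, a_lo = split_tf32(a)
    b_hi, b_lo = split_tf32(b)
    return a_lo * b_hi + a_hi * b_lo + a_hi * b_hi


a = to_fp32(1.2345678)
b = to_fp32(3.1415927)
exact = a * b  # a product of two fp32 values is exact in a double
err_tf32 = abs(exact - mul_tf32(a, b))
err_tf32x3 = abs(exact - mul_tf32x3(a, b))
print(f"tf32 error:   {err_tf32:.3e}")
print(f"tf32x3 error: {err_tf32x3:.3e}")
```

Under these assumptions the three-product variant recovers most of the mantissa bits that a single TF32 product discards, at the cost of issuing three multiplies instead of one.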
…084)'

Summary: This is a cherry-pick of an upstream PR: triton-lang/triton#9084

Upstream commit message:
```
> [TOOLS] Add hip support for link.py (#9084)
> * Use the same link cpp scr except hipStrean/CUstream etc.
> * Add a link.h prelude for AMD/Nvidia to adapt for the difference.
> * Enable test_aot.py for AMD.
> * Also rename AMD's compile.cpp to compile.c.
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: a0e769f

Diff Comparison: https://www.internalfb.com/intern/paste/P2283337631/

---
This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Differential Revision: D101982807
…slation from mid-end to lowerings (#9082)'

Summary: This is a cherry-pick of an upstream PR: triton-lang/triton#9082

Upstream commit message:
```
> [Backend] Move TMA index translation from mid-end to lowerings (#9082)
>
> Currently the behavior of fp4_padded is different between
> `triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if
> the data is int8, while the latter is indexed by individual fp4
> elements, which is what the TMA hardware expects.
>
> This now gets leaked into gluon, which isn't ideal. So, this PR moves
> the translation into the lowerings. Along the way, this probably fixes
> quite a few bugs as there were several places the translation was
> missing.
```

Conflict Resolution:
- File: lib/Dialect/TritonGPU/Transforms/Pipeliner/LowerLoops.cpp
  Action: Adopted upstream's approach (drop translateTMAIndices call, use loadOp.getIndices() directly) but preserved local Value() multicastTargets parameter for AsyncTMACopyGlobalToLocalOp::create.
  Reason: Local op signature includes multicastTargets as the first argument; upstream version does not.
- File: lib/Dialect/TritonNvidiaGPU/Transforms/TMALowering.cpp
  Action: Three conflicts resolved similarly. For AsyncTMACopyGlobalToLocalOp::create kept Value() multicastTargets prefix; for AsyncTMACopyLocalToGlobalOp::create and AsyncTMAReduceOp::create used upstream version directly (no multicastTargets in their signatures).
  Reason: Stale local references to undefined tmaPtr/indices needed replacement; preserved local-only multicastTargets argument where applicable.
- File: third_party/nvidia/lib/Dialect/NVWS/Transforms/LowerAref.cpp
  Action: Kept Value() multicastTargets prefix and used op.getIndices() directly.
  Reason: Same rationale as above; stale 'indices' identifier was undefined.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283337998/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283339520/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283341283/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: db1f8c3

Differential Revision: D101982803
…cope (#9088)'

Summary: This is a cherry-pick of an upstream PR: triton-lang/triton#9088

Upstream commit message:
```
> Fix address sanitizer stack-use-after-scope (#9088)
> std::make_tuple here will copy the arguments into a tuple so it creates
> a copy of SmallVector subsliceOffsets and then passes back a tuple with
> an ArrayRef. The SmallVector object is then out of scope. Bypassing
> make_tuple means that it uses the underlying AllocationSlice's reference
> to subsliceOffsets rather than the temporary copy created by make_tuple.
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: a04108b

Diff Comparison: https://www.internalfb.com/intern/paste/P2283341704/

---
This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Differential Revision: D101982805
…ault 32-bit dot precision to TF32x3" (#9090)'

Summary: This is a cherry-pick of an upstream PR: triton-lang/triton#9090

Upstream commit message:
```
> Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3" (#9090)
>
> Reverts triton-lang/triton#9080 as it cause some tmem allocation
> regression due to simplistic hoisting logic
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line. Reverted docstring lines from tf32x3 back to tf32 to match upstream's revert.
  Reason: Same divergence as the original cherry-pick of #9080 (assert/if input_precision body lives in semantic.py locally). Maintained that local refactor by reverting only the docstring here.
- File: python/triton/language/semantic.py
  Action: Reverted supports_tf32 check and default value from "tf32x3" back to "tf32" in the input_precision branch of the dot() method.
  Reason: Mirror revert: the prior cherry-pick of #9080 applied the tf32x3 change here (instead of upstream's core.py location); this revert undoes it in the same place.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283342039/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283342643/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283343204/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 606eebc

Differential Revision: D101982800
…fx950 and gfx1250 (#9087)'

Summary: This is a cherry-pick of an upstream PR: triton-lang/triton#9087

Upstream commit message:
```
> [AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)
> Enables `ttg.async_copy_global_to_local` for pipelined loads by default
> on `gfx950` and `gfx1250`.
> This increases LDS consumption because we replace one register buffer
> with an additional LDS buffer. After this change, the number of LDS
> buffers is equal to `num_stages` (previously it was `num_stages - 1`).
> Therefore, some test configs need to be skipped because we run out of
> shared memory capacity on `gfx950`.
> ---------
> Co-authored-by: Lei Zhang <antiagainst@gmail.com>
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: b54cb94

Diff Comparison: https://www.internalfb.com/intern/paste/P2283343700/

---
This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Differential Revision: D101982809
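The LDS arithmetic in the upstream message (`num_stages` buffers instead of `num_stages - 1`) can be sketched as a back-of-the-envelope check. This is only an estimate under stated assumptions: it models a matmul that stages one A tile and one B tile per buffer, ignores padding and any other shared-memory users, and takes the LDS capacity as a parameter. The 160 KiB value and all function names below are illustrative assumptions, not values taken from this PR.

```python
ASSUMED_LDS_CAPACITY_BYTES = 160 * 1024  # assumption; check the target device


def lds_bytes_per_buffer(block_m, block_n, block_k, elem_bytes):
    """Bytes for one A tile (M x K) plus one B tile (K x N), no padding."""
    return (block_m * block_k + block_k * block_n) * elem_bytes


def num_lds_buffers(num_stages, async_copy):
    """Buffer count described in the commit message: num_stages with
    AsyncCopy enabled, num_stages - 1 without it."""
    return num_stages if async_copy else num_stages - 1


def lds_bytes(block_m, block_n, block_k, elem_bytes, num_stages, async_copy):
    """Estimated total LDS footprint of the pipelined loads."""
    per_buffer = lds_bytes_per_buffer(block_m, block_n, block_k, elem_bytes)
    return per_buffer * num_lds_buffers(num_stages, async_copy)


def fits(total_bytes, capacity_bytes=ASSUMED_LDS_CAPACITY_BYTES):
    """True if the estimate fits in the assumed LDS capacity."""
    return total_bytes <= capacity_bytes


# Hypothetical config: 256x256x64 fp16 tiles with three pipeline stages.
before = lds_bytes(256, 256, 64, 2, num_stages=3, async_copy=False)
after = lds_bytes(256, 256, 64, 2, num_stages=3, async_copy=True)
print(before, fits(before))
print(after, fits(after))
```

In this hypothetical config the estimate grows by one full buffer once AsyncCopy is on, which is the kind of case the message says must be skipped when shared memory runs out.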
Contributor
@agron911 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101982809.
agron911 added a commit to agron911/triton that referenced this pull request Apr 24, 2026
…fx950 and gfx1250 (#9087)' (facebookexperimental#1335)

Summary:
Pull Request resolved: facebookexperimental#1335

This is a cherry-pick of an upstream PR: triton-lang/triton#9087

Upstream commit message:
```
> [AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)
> Enables `ttg.async_copy_global_to_local` for pipelined loads by default
> on `gfx950` and `gfx1250`.
> This increases LDS consumption because we replace one register buffer
> with an additional LDS buffer. After this change, the number of LDS
> buffers is equal to `num_stages` (previously it was `num_stages - 1`).
> Therefore, some test configs need to be skipped because we run out of
> shared memory capacity on `gfx950`.
> ---------
> Co-authored-by: Lei Zhang <antiagainst@gmail.com>
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: b54cb94

Diff Comparison: https://www.internalfb.com/intern/paste/P2283343700/

---
This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Reviewed By: sfzhu93

Differential Revision: D101982809
Contributor
This pull request has been merged in 8e47abe.