[triton][beta] [Cherry-pick][RESOLVED] '[Backend] Move TMA index translation from mid-end to lowerings (#9082)' (#1334) #1334
Closed
agron911 wants to merge 4 commits into
@agron911 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101982803.
…hase 5 (facebookexperimental#1328)

Summary: Pull Request resolved: facebookexperimental#1328

Disable budget-aware layout conversion elimination (Phase 5, smem_budget > 0) which crashes on Blackwell with `LLVM ERROR: Invalid out-dim size`.

Root cause: `propagateSrcEncodingAndErase()` skips `scf::YieldOp` during type rewriting, leaving `scf::ForOp` results with stale encodings that corrupt LinearLayout dimensions.

Also fix `GenerateSubtiledRegion.cpp` build break (`CGAEncodingAttr::getDefault` -> `get1CTALayout`).

Reviewed By: sfzhu93
Differential Revision: D101982801
…bit dot precision to TF32x3 (#9080)' (facebookexperimental#1329)

Summary: Pull Request resolved: facebookexperimental#1329

This is a cherry-pick of an upstream PR: triton-lang/triton#9080

Upstream commit message:
```
> [LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line and updated docstring to use tf32x3 instead of tf32. Did not add the upstream-introduced assert/if input_precision body code, since the local code path delegates input_precision processing to semantic.py.
  Reason: The local file was refactored to move input_precision default-setting logic from core.py.dot() to semantic.py. Adding the upstream body code here would duplicate logic and be unreachable.
- File: python/triton/language/semantic.py
  Action: Updated supports_tf32 check and default value from "tf32" to "tf32x3" in the input_precision branch of the dot() method.
  Reason: This file holds the actual default-precision logic locally; matching upstream's intent of changing the default precision from tf32 to tf32x3 requires updating it here.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283333395/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283336430/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283337118/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 63b387c

Reviewed By: sfzhu93
Differential Revision: D101982808
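The default-precision change described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in semantic.py: the function name `resolve_input_precision` and the `allowed_input_precisions` parameter are hypothetical, and the `"ieee"` fallback is an assumption about what happens when the backend does not support the TF32 variant.

```python
def resolve_input_precision(input_precision, allowed_input_precisions):
    """Pick the precision for a 32-bit dot (hypothetical helper).

    input_precision: value passed by the caller, or None for "use the default".
    allowed_input_precisions: precisions the backend supports (assumed input).
    """
    if input_precision is None:
        # After the cherry-pick, the check and the default use "tf32x3"
        # where they previously used "tf32".
        supports_tf32 = "tf32x3" in allowed_input_precisions
        # Assumption: fall back to IEEE fp32 when the TF32 variant is unavailable.
        input_precision = "tf32x3" if supports_tf32 else "ieee"
    return input_precision
```

An explicitly passed precision is returned unchanged, so only callers that rely on the default would observe the new behavior in this sketch.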
…084)' (facebookexperimental#1331)

Summary: Pull Request resolved: facebookexperimental#1331

This is a cherry-pick of an upstream PR: triton-lang/triton#9084

Upstream commit message:
```
> [TOOLS] Add hip support for link.py (#9084)
> * Use the same link cpp src except hipStream/CUstream etc.
> * Add a link.h prelude for AMD/Nvidia to adapt for the difference.
> * Enable test_aot.py for AMD.
> * Also rename AMD's compile.cpp to compile.c.
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: a0e769f

Diff Comparison: https://www.internalfb.com/intern/paste/P2283337631/

---

This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Reviewed By: sfzhu93
Differential Revision: D101982807
…slation from mid-end to lowerings (#9082)' (facebookexperimental#1334)

Summary: Pull Request resolved: facebookexperimental#1334

This is a cherry-pick of an upstream PR: triton-lang/triton#9082

Upstream commit message:
```
> [Backend] Move TMA index translation from mid-end to lowerings (#9082)
>
> Currently the behavior of fp4_padded is different between
> `triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if
> the data is int8, while the latter is indexed by individual fp4
> elements, which is what the TMA hardware expects.
>
> This now gets leaked into gluon, which isn't ideal. So, this PR moves
> the translation into the lowerings. Along the way, this probably fixes
> quite a few bugs as there were several places the translation was
> missing.
```

Conflict Resolution:
- File: lib/Dialect/TritonGPU/Transforms/Pipeliner/LowerLoops.cpp
  Action: Adopted upstream's approach (drop translateTMAIndices call, use loadOp.getIndices() directly) but preserved local Value() multicastTargets parameter for AsyncTMACopyGlobalToLocalOp::create.
  Reason: Local op signature includes multicastTargets as the first argument; upstream version does not.
- File: lib/Dialect/TritonNvidiaGPU/Transforms/TMALowering.cpp
  Action: Three conflicts resolved similarly. For AsyncTMACopyGlobalToLocalOp::create kept Value() multicastTargets prefix; for AsyncTMACopyLocalToGlobalOp::create and AsyncTMAReduceOp::create used upstream version directly (no multicastTargets in their signatures).
  Reason: Stale local references to undefined tmaPtr/indices needed replacement; preserved local-only multicastTargets argument where applicable.
- File: third_party/nvidia/lib/Dialect/NVWS/Transforms/LowerAref.cpp
  Action: Kept Value() multicastTargets prefix and used op.getIndices() directly.
  Reason: Same rationale as above; stale 'indices' identifier was undefined.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283337998/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283339520/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283341283/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: db1f8c3

Reviewed By: sfzhu93
Differential Revision: D101982803
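The indexing mismatch described in the upstream message can be illustrated in plain Python. This sketch assumes that, for fp4_padded data, one int8-style index step corresponds to two fp4 elements, so the translation doubles the innermost index. The helper name `translate_tma_indices`, its signature, and the factor of 2 are illustrative assumptions, not code taken from the PR.

```python
def translate_tma_indices(indices, fp4_padded):
    """Convert descriptor-style indices to per-element indices (hypothetical).

    indices: index per dimension, innermost (contiguous) dimension last.
    fp4_padded: True when the tile uses the fp4_padded format.
    """
    indices = list(indices)  # copy so the caller's sequence is not mutated
    if fp4_padded and indices:
        # Assumption: descriptor ops index the data as if it were int8, while
        # the hardware counts individual fp4 elements, so only the innermost
        # index is rescaled.
        indices[-1] = indices[-1] * 2
    return indices
```

For example, `translate_tma_indices([3, 8], fp4_padded=True)` gives `[3, 16]`, while the same call with `fp4_padded=False` leaves the indices as `[3, 8]`. Doing this once in the lowering, rather than at each call site in the mid-end, is what lets callers pass the op's indices through unchanged.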
57f2df5 to c253145
agron911 added a commit to agron911/triton that referenced this pull request Apr 24, 2026
This pull request has been merged in f27af41.