[Backend] Move TMA index translation from mid-end to lowerings #9082
Merged
Currently, `fp4_padded` behaves differently for `triton::Descriptor` ops and `AsyncTMA` ops. The former are indexed as if the data were int8, while the latter are indexed by individual fp4 elements, which is what the TMA hardware expects. This discrepancy now leaks into gluon, which isn't ideal, so this PR moves the translation into the lowerings. Along the way, this probably fixes quite a few bugs, as the translation was missing in several places.
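To make the indexing difference concrete, here is a minimal Python sketch. It is not code from this PR. It assumes the translation amounts to scaling the innermost (contiguous) coordinate by 2, on the reasoning that an int8-style index unit covers one byte, which would hold two packed fp4 values. The helper name `translate_tma_indices_sketch` and the constant `FP4_ELEMS_PER_BYTE` are hypothetical and exist only for illustration.

```python
from typing import List

# Assumption: two fp4 values share one byte, so one int8-style index unit
# covers two fp4 elements. This factor is illustrative, not taken from the PR.
FP4_ELEMS_PER_BYTE = 2


def translate_tma_indices_sketch(indices: List[int], fp4_padded: bool) -> List[int]:
    """Map int8-style descriptor indices to per-fp4-element indices.

    Hypothetical helper: only the innermost (last) coordinate is rescaled,
    assuming fp4 values are packed along the contiguous dimension.
    """
    if not fp4_padded or not indices:
        # Nothing to translate; return a copy so the caller's list is untouched.
        return list(indices)
    translated = list(indices)
    translated[-1] *= FP4_ELEMS_PER_BYTE
    return translated


# Indices as a `triton::Descriptor`-style op would see them (int8 view).
descriptor_indices = [3, 64]
# Indices as the TMA hardware would expect them under the assumption above.
hardware_indices = translate_tma_indices_sketch(descriptor_indices, fp4_padded=True)
print(hardware_indices)
```

If the translation lives in a single place (the lowering), every path that produces a TMA op goes through it exactly once. Applying it in the mid-end instead means each pass that creates such an op has to remember to call it, which is the kind of omission the description alludes to.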
5d018c4 to 2e663f6
agron911 added a commit to agron911/triton that referenced this pull request Apr 24, 2026

…slation from mid-end to lowerings (#9082)' (facebookexperimental#1334)

Summary:
Pull Request resolved: facebookexperimental#1334

This is a cherry-pick of an upstream PR: triton-lang/triton#9082

Upstream commit message:
```
> [Backend] Move TMA index translation from mid-end to lowerings (#9082)
>
> Currently the behavior of fp4_padded is different between
> `triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if
> the data is int8, while the latter is indexed by individual fp4
> elements, which is what the TMA hardware expects.
>
> This now gets leaked into gluon, which isn't ideal. So, this PR moves
> the translation into the lowerings. Along the way, this probably fixes
> quite a few bugs as there were several places the translation was
> missing.
```

Conflict Resolution:
- File: lib/Dialect/TritonGPU/Transforms/Pipeliner/LowerLoops.cpp
  Action: Adopted upstream's approach (drop translateTMAIndices call, use loadOp.getIndices() directly) but preserved local Value() multicastTargets parameter for AsyncTMACopyGlobalToLocalOp::create.
  Reason: Local op signature includes multicastTargets as the first argument; upstream version does not.
- File: lib/Dialect/TritonNvidiaGPU/Transforms/TMALowering.cpp
  Action: Three conflicts resolved similarly. For AsyncTMACopyGlobalToLocalOp::create kept Value() multicastTargets prefix; for AsyncTMACopyLocalToGlobalOp::create and AsyncTMAReduceOp::create used upstream version directly (no multicastTargets in their signatures).
  Reason: Stale local references to undefined tmaPtr/indices needed replacement; preserved local-only multicastTargets argument where applicable.
- File: third_party/nvidia/lib/Dialect/NVWS/Transforms/LowerAref.cpp
  Action: Kept Value() multicastTargets prefix and used op.getIndices() directly.
  Reason: Same rationale as above; stale 'indices' identifier was undefined.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283337998/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283339520/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283341283/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: db1f8c3

Reviewed By: sfzhu93

Differential Revision: D101982803
meta-codesync Bot pushed a commit to facebookexperimental/triton that referenced this pull request Apr 24, 2026

…slation from mid-end to lowerings (#9082)' (#1334)

Summary:
Pull Request resolved: #1334

This is a cherry-pick of an upstream PR: triton-lang/triton#9082

Upstream commit message:
```
> [Backend] Move TMA index translation from mid-end to lowerings (#9082)
>
> Currently the behavior of fp4_padded is different between
> `triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if
> the data is int8, while the latter is indexed by individual fp4
> elements, which is what the TMA hardware expects.
>
> This now gets leaked into gluon, which isn't ideal. So, this PR moves
> the translation into the lowerings. Along the way, this probably fixes
> quite a few bugs as there were several places the translation was
> missing.
```

Conflict Resolution:
- File: lib/Dialect/TritonGPU/Transforms/Pipeliner/LowerLoops.cpp
  Action: Adopted upstream's approach (drop translateTMAIndices call, use loadOp.getIndices() directly) but preserved local Value() multicastTargets parameter for AsyncTMACopyGlobalToLocalOp::create.
  Reason: Local op signature includes multicastTargets as the first argument; upstream version does not.
- File: lib/Dialect/TritonNvidiaGPU/Transforms/TMALowering.cpp
  Action: Three conflicts resolved similarly. For AsyncTMACopyGlobalToLocalOp::create kept Value() multicastTargets prefix; for AsyncTMACopyLocalToGlobalOp::create and AsyncTMAReduceOp::create used upstream version directly (no multicastTargets in their signatures).
  Reason: Stale local references to undefined tmaPtr/indices needed replacement; preserved local-only multicastTargets argument where applicable.
- File: third_party/nvidia/lib/Dialect/NVWS/Transforms/LowerAref.cpp
  Action: Kept Value() multicastTargets prefix and used op.getIndices() directly.
  Reason: Same rationale as above; stale 'indices' identifier was undefined.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283337998/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283339520/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283341283/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: db1f8c3

Reviewed By: sfzhu93

Differential Revision: D101982803

fbshipit-source-id: 96cf64cd529bfb9ddc918087f37ad622f792ccb3