[Backend] Move TMA index translation from mid-end to lowerings by peterbell10 · Pull Request #9082 · triton-lang/triton

peterbell10 · 2025-12-22T20:52:07Z

Currently the behavior of `fp4_padded` is different between `triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if the data is int8, while the latter is indexed by individual fp4 elements, which is what the TMA hardware expects.

This now gets leaked into gluon, which isn't ideal. So, this PR moves the translation into the lowerings. Along the way, this probably fixes quite a few bugs as there were several places the translation was missing.
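To make the indexing difference concrete, here is a minimal sketch of the translation between the two conventions. It is not the code from this PR: the helper name `translate_fp4_padded_indices` is hypothetical, and the sketch assumes that two fp4 values share one byte and that only the innermost (last) index needs to be rescaled.

```python
from typing import List, Sequence


def translate_fp4_padded_indices(indices: Sequence[int], fp4_padded: bool) -> List[int]:
    """Convert int8-style (byte) indices into fp4-element indices.

    Hypothetical helper for illustration only. Assumes two fp4 values per
    byte and that only the innermost (last) dimension is affected.
    """
    out = list(indices)
    if fp4_padded and out:
        # One int8 position covers two fp4 elements along the last dimension.
        out[-1] *= 2
    return out
```

Under these assumptions, a descriptor-style index of `[3, 16]` corresponds to `[3, 32]` in fp4 elements, and indices for any other encoding pass through unchanged.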


Mogball

thank god

…slation from mid-end to lowerings (#9082)' (facebookexperimental#1334)

Summary: Pull Request resolved: facebookexperimental#1334

This is a cherry-pick of an upstream PR: triton-lang/triton#9082

Upstream commit message:
```
> [Backend] Move TMA index translation from mid-end to lowerings (#9082)
>
> Currently the behavior of fp4_padded is different between
> `triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if
> the data is int8, while the latter is indexed by individual fp4
> elements, which is what the TMA hardware expects.
>
> This now gets leaked into gluon, which isn't ideal. So, this PR moves
> the translation into the lowerings. Along the way, this probably fixes
> quite a few bugs as there were several places the translation was
> missing.
```

Conflict Resolution:

- File: lib/Dialect/TritonGPU/Transforms/Pipeliner/LowerLoops.cpp
  Action: Adopted upstream's approach (drop translateTMAIndices call, use loadOp.getIndices() directly) but preserved local Value() multicastTargets parameter for AsyncTMACopyGlobalToLocalOp::create.
  Reason: Local op signature includes multicastTargets as the first argument; upstream version does not.
- File: lib/Dialect/TritonNvidiaGPU/Transforms/TMALowering.cpp
  Action: Three conflicts resolved similarly. For AsyncTMACopyGlobalToLocalOp::create kept Value() multicastTargets prefix; for AsyncTMACopyLocalToGlobalOp::create and AsyncTMAReduceOp::create used upstream version directly (no multicastTargets in their signatures).
  Reason: Stale local references to undefined tmaPtr/indices needed replacement; preserved local-only multicastTargets argument where applicable.
- File: third_party/nvidia/lib/Dialect/NVWS/Transforms/LowerAref.cpp
  Action: Kept Value() multicastTargets prefix and used op.getIndices() directly.
  Reason: Same rationale as above; stale 'indices' identifier was undefined.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283337998/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283339520/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283341283/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: db1f8c3

Reviewed By: sfzhu93

Differential Revision: D101982803


peterbell10 requested a review from ptillet as a code owner December 22, 2025 20:52

peterbell10 force-pushed the pb/tma-no-translate branch from 5d018c4 to 2e663f6 Compare December 22, 2025 23:01

peterbell10 requested a review from Mogball December 22, 2025 23:42

Mogball approved these changes Dec 23, 2025


peterbell10 merged commit db1f8c3 into main Dec 23, 2025
9 checks passed

peterbell10 deleted the pb/tma-no-translate branch December 23, 2025 13:10

agron911 mentioned this pull request Apr 24, 2026

[triton][beta] [Cherry-pick][RESOLVED] '[Backend] Move TMA index translation from mid-end to lowerings (#9082)' (#1334) facebookexperimental/triton#1334

Closed
