[Backend] Move TMA index translation from mid-end to lowerings #9082

Merged: peterbell10 merged 1 commit into main from pb/tma-no-translate on Dec 23, 2025

@peterbell10
Contributor

Currently, `fp4_padded` indexing differs between `triton::Descriptor` ops and `AsyncTMA` ops. The former are indexed as if the data were int8, while the latter are indexed by individual fp4 elements, which is what the TMA hardware expects.

This discrepancy now leaks into Gluon, which isn't ideal, so this PR moves the index translation into the lowerings. Along the way, this probably fixes quite a few bugs, as the translation was missing in several places.
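
To illustrate the two indexing conventions, here is a minimal standalone sketch, not the code from this PR. It assumes that converting an int8-style index into an fp4 element index only requires doubling the innermost (last) index, since two fp4 values share one byte. `toFp4ElementIndices` and its `fp4Padded` flag are hypothetical names chosen for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration only; not Triton's actual API.
// Assumption: for fp4_padded data, descriptor-level indices count int8
// (byte) positions, and each byte holds two fp4 elements along the
// innermost dimension. The hardware-facing index is therefore taken to
// be twice the byte index in the last dimension; outer dimensions are
// left unchanged.
std::vector<std::int64_t> toFp4ElementIndices(std::vector<std::int64_t> indices,
                                              bool fp4Padded) {
  if (fp4Padded && !indices.empty())
    indices.back() *= 2; // byte offset -> fp4 element offset
  return indices;
}
```

Under these assumptions, an int8-style index of `{3, 8}` would become `{3, 16}` for an fp4_padded descriptor and stay `{3, 8}` otherwise. Doing this once in the lowering, rather than at every mid-end call site, is what would keep all paths consistent.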

@peterbell10 peterbell10 requested a review from ptillet as a code owner December 22, 2025 20:52
Collaborator

@Mogball Mogball left a comment

thank god

@peterbell10 peterbell10 merged commit db1f8c3 into main Dec 23, 2025
9 checks passed
@peterbell10 peterbell10 deleted the pb/tma-no-translate branch December 23, 2025 13:10
agron911 added a commit to agron911/triton that referenced this pull request Apr 24, 2026
…slation from mid-end to lowerings (#9082)' (facebookexperimental#1334)

Summary:
Pull Request resolved: facebookexperimental#1334

This is a cherry-pick of an upstream PR: triton-lang/triton#9082

Upstream commit message:
```
> [Backend] Move TMA index translation from mid-end to lowerings (#9082)
>
> Currently the behavior of fp4_padded is different between
> `triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if
> the data is int8, while the latter is indexed by individual fp4
> elements, which is what the TMA hardware expects.
>
> This now gets leaked into gluon, which isn't ideal. So, this PR moves
> the translation into the lowerings. Along the way, this probably fixes
> quite a few bugs as there were several places the translation was
> missing.
```

Conflict Resolution:
- File: lib/Dialect/TritonGPU/Transforms/Pipeliner/LowerLoops.cpp
  Action: Adopted upstream's approach (drop translateTMAIndices call, use loadOp.getIndices() directly) but preserved local Value() multicastTargets parameter for AsyncTMACopyGlobalToLocalOp::create.
  Reason: Local op signature includes multicastTargets as the first argument; upstream version does not.
- File: lib/Dialect/TritonNvidiaGPU/Transforms/TMALowering.cpp
  Action: Three conflicts resolved similarly. For AsyncTMACopyGlobalToLocalOp::create kept Value() multicastTargets prefix; for AsyncTMACopyLocalToGlobalOp::create and AsyncTMAReduceOp::create used upstream version directly (no multicastTargets in their signatures).
  Reason: Stale local references to undefined tmaPtr/indices needed replacement; preserved local-only multicastTargets argument where applicable.
- File: third_party/nvidia/lib/Dialect/NVWS/Transforms/LowerAref.cpp
  Action: Kept Value() multicastTargets prefix and used op.getIndices() directly.
  Reason: Same rationale as above; stale 'indices' identifier was undefined.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283337998/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283339520/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283341283/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: db1f8c3

Reviewed By: sfzhu93

Differential Revision: D101982803
meta-codesync Bot pushed a commit to facebookexperimental/triton that referenced this pull request Apr 24, 2026
…slation from mid-end to lowerings (#9082)' (#1334)

fbshipit-source-id: 96cf64cd529bfb9ddc918087f37ad622f792ccb3