[triton][beta] [Cherry-pick] '[AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)'#1335

Closed
agron911 wants to merge 7 commits into facebookexperimental:main from agron911:export-D101982809

Conversation

@agron911
Contributor

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9087

Upstream commit message:

> [AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)

> Enables `ttg.async_copy_global_to_local` for pipelined loads by default
> on `gfx950` and `gfx1250`.

> This increases LDS consumption because we replace one register buffer
> with an additional LDS buffer. After this change, the number of LDS
> buffers is equal to `num_stages` (previously it was `num_stages - 1`).
> Therefore, some test configs need to be skipped because we run out of
> shared memory capacity on `gfx950`.
> ---------

> Co-authored-by: Lei Zhang <antiagainst@gmail.com>
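
The LDS arithmetic in the upstream message can be sketched as follows. This is a minimal illustration, not code from the PR: the helper names `lds_buffers` and `lds_bytes`, the tile sizes, and the 160 KiB capacity figure are all assumptions chosen for the example.

```python
# Hypothetical helpers illustrating the buffer-count change described above.

ASSUMED_LDS_CAPACITY_BYTES = 160 * 1024  # assumption for illustration only


def lds_buffers(num_stages: int, async_copy: bool) -> int:
    """Number of LDS buffers used by a pipelined load.

    With AsyncCopy, one register buffer is replaced by an extra LDS buffer,
    so the count goes from num_stages - 1 to num_stages.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    return num_stages if async_copy else num_stages - 1


def lds_bytes(bytes_per_stage: int, num_stages: int, async_copy: bool) -> int:
    """Total LDS bytes needed for the pipelined operands."""
    return lds_buffers(num_stages, async_copy) * bytes_per_stage


# Example: two fp16 operand tiles of 256x64 elements each per stage.
bytes_per_stage = 2 * (256 * 64 * 2)

before = lds_bytes(bytes_per_stage, num_stages=3, async_copy=False)
after = lds_bytes(bytes_per_stage, num_stages=3, async_copy=True)

fits_before = before <= ASSUMED_LDS_CAPACITY_BYTES
fits_after = after <= ASSUMED_LDS_CAPACITY_BYTES
```

Under these assumed numbers, a config that fit with `num_stages - 1` buffers no longer fits with `num_stages` buffers, which is the kind of case the upstream change skips in tests.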

Do not remove the following line from this commit
Reactor Cherry-pick Revision: b54cb94

Diff Comparison: https://www.internalfb.com/intern/paste/P2283343700/

This diff was generated by running:

buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit

Differential Revision: D101982809

…hase 5

Summary:
Disable budget-aware layout conversion elimination (Phase 5, `smem_budget > 0`), which crashes on Blackwell with `LLVM ERROR: Invalid out-dim size`. Root cause: `propagateSrcEncodingAndErase()` skips `scf::YieldOp` during type rewriting, so `scf::ForOp` results keep stale encodings that corrupt LinearLayout dimensions.

Also fix `GenerateSubtiledRegion.cpp` build break (`CGAEncodingAttr::getDefault` -> `get1CTALayout`).

Differential Revision: D101982801
…bit dot precision to TF32x3 (#9080)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9080

Upstream commit message:
```
> [LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line and updated docstring to use tf32x3 instead of tf32. Did not add the upstream-introduced assert/if input_precision body code, since the local code path delegates input_precision processing to semantic.py.
  Reason: The local file was refactored to move input_precision default-setting logic from core.py.dot() to semantic.py. Adding the upstream body code here would duplicate logic and be unreachable.
- File: python/triton/language/semantic.py
  Action: Updated supports_tf32 check and default value from "tf32" to "tf32x3" in the input_precision branch of the dot() method.
  Reason: This file holds the actual default-precision logic locally; matching upstream's intent of changing the default precision from tf32 to tf32x3 requires updating it here.
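
For readers unfamiliar with the two precision modes, the numeric difference between `tf32` and `tf32x3` can be sketched in plain Python. This is an illustration of the general 3xTF32 splitting technique, not Triton's implementation: the helpers `to_f32`, `to_tf32`, `mul_tf32`, and `mul_tf32x3` are hypothetical, TF32 conversion is modeled as truncation (hardware may round instead), and products are accumulated in Python doubles rather than fp32.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x: float) -> float:
    """Model TF32 by clearing the low 13 mantissa bits of the fp32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]


def mul_tf32(a: float, b: float) -> float:
    """Single TF32 product: both inputs lose their low mantissa bits."""
    return to_tf32(a) * to_tf32(b)


def mul_tf32x3(a: float, b: float) -> float:
    """3xTF32 product: split each input into a high and a low TF32 part and
    sum three partial products (the low*low term is dropped)."""
    a_hi, b_hi = to_tf32(a), to_tf32(b)
    a_lo, b_lo = to_tf32(a - a_hi), to_tf32(b - b_hi)
    # Add the small cross terms first, then the dominant term.
    return a_lo * b_hi + a_hi * b_lo + a_hi * b_hi


a = to_f32(1.2345678)
b = to_f32(7.6543210)
exact = a * b  # exact in double precision for two fp32 inputs

err_tf32 = abs(mul_tf32(a, b) - exact)
err_tf32x3 = abs(mul_tf32x3(a, b) - exact)
```

The three-product form recovers most of the fp32 accuracy at roughly three times the tensor-core work, which is why changing the default affects both numerics and performance.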

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283333395/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283336430/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283337118/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 63b387c

Differential Revision: D101982808
…084)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9084

Upstream commit message:
```
> [TOOLS] Add hip support for link.py (#9084)

> * Use the same link cpp src except hipStream/CUstream etc.
> * Add a link.h prelude for AMD/Nvidia to adapt for the difference.
> * Enable test_aot.py for AMD.
> * Also rename AMD's compile.cpp to compile.c.
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: a0e769f

Diff Comparison: https://www.internalfb.com/intern/paste/P2283337631/
 ---

This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Differential Revision: D101982807
…slation from mid-end to lowerings (#9082)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9082

Upstream commit message:
```
> [Backend] Move TMA index translation from mid-end to lowerings (#9082)
>
> Currently the behavior of fp4_padded is different between
> `triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if
> the data is int8, while the latter is indexed by individual fp4
> elements, which is what the TMA hardware expects.
>
> This now gets leaked into gluon, which isn't ideal. So, this PR moves
> the translation into the lowerings. Along the way, this probably fixes
> quite a few bugs as there were several places the translation was
> missing.
```
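
The indexing mismatch described in the upstream message can be sketched as below. This is a simplified model under stated assumptions, not the lowering code: the function name `byte_indices_to_fp4_indices` is hypothetical, and the sketch assumes two fp4 elements per int8 byte with only the last (contiguous) dimension affected.

```python
from typing import List


def byte_indices_to_fp4_indices(indices: List[int], fp4_padded: bool) -> List[int]:
    """Translate indices expressed in int8 units into per-fp4-element units.

    Assumption: two fp4 values are packed per byte, so only the innermost
    index needs scaling; outer dimensions are unchanged.
    """
    translated = list(indices)
    if fp4_padded and translated:
        translated[-1] *= 2
    return translated


# A descriptor-style index (row 3, byte column 16) and its element-style form.
descriptor_index = [3, 16]
tma_index = byte_indices_to_fp4_indices(descriptor_index, fp4_padded=True)
unchanged = byte_indices_to_fp4_indices(descriptor_index, fp4_padded=False)
```

Keeping this translation in one place (the lowerings, per the upstream message) avoids call sites that forget to apply it.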

Conflict Resolution:
- File: lib/Dialect/TritonGPU/Transforms/Pipeliner/LowerLoops.cpp
  Action: Adopted upstream's approach (drop translateTMAIndices call, use loadOp.getIndices() directly) but preserved local Value() multicastTargets parameter for AsyncTMACopyGlobalToLocalOp::create.
  Reason: Local op signature includes multicastTargets as the first argument; upstream version does not.
- File: lib/Dialect/TritonNvidiaGPU/Transforms/TMALowering.cpp
  Action: Three conflicts resolved similarly. For AsyncTMACopyGlobalToLocalOp::create kept Value() multicastTargets prefix; for AsyncTMACopyLocalToGlobalOp::create and AsyncTMAReduceOp::create used upstream version directly (no multicastTargets in their signatures).
  Reason: Stale local references to undefined tmaPtr/indices needed replacement; preserved local-only multicastTargets argument where applicable.
- File: third_party/nvidia/lib/Dialect/NVWS/Transforms/LowerAref.cpp
  Action: Kept Value() multicastTargets prefix and used op.getIndices() directly.
  Reason: Same rationale as above; stale 'indices' identifier was undefined.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283337998/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283339520/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283341283/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: db1f8c3

Differential Revision: D101982803
…cope (#9088)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9088

Upstream commit message:
```
> Fix address sanitizer stack-use-after-scope (#9088)

> std::make_tuple here will copy the arguments into a tuple so it creates
> a copy of SmallVector subsliceOffsets and then passes back a tuple with
> an ArrayRef. The SmallVector object is then out of scope. Bypassing
> make_tuple means that it uses the underlying AllocationSlice's reference
> to subsliceOffsets rather than the temporary copy created by make_tuple.
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: a04108b

Diff Comparison: https://www.internalfb.com/intern/paste/P2283341704/
 ---

This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Differential Revision: D101982805
…ault 32-bit dot precision to TF32x3" (#9090)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9090

Upstream commit message:
```
> Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3" (#9090)
>
> Reverts triton-lang/triton#9080 as it causes a tmem allocation
> regression due to simplistic hoisting logic
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line. Reverted docstring lines from tf32x3 back to tf32 to match upstream's revert.
  Reason: Same divergence as the original cherry-pick of #9080 (assert/if input_precision body lives in semantic.py locally). Maintained that local refactor by reverting only the docstring here.
- File: python/triton/language/semantic.py
  Action: Reverted supports_tf32 check and default value from "tf32x3" back to "tf32" in the input_precision branch of the dot() method.
  Reason: Mirror revert: the prior cherry-pick of #9080 applied the tf32x3 change here (instead of upstream's core.py location); this revert undoes it in the same place.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283342039/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283342643/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283343204/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 606eebc

Differential Revision: D101982800
…fx950 and gfx1250 (#9087)'

Summary:
This is a cherry-pick of an upstream PR: triton-lang/triton#9087

Upstream commit message:
```
> [AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)

> Enables `ttg.async_copy_global_to_local` for pipelined loads by default
> on `gfx950` and `gfx1250`.

> This increases LDS consumption because we replace one register buffer
> with an additional LDS buffer. After this change, the number of LDS
> buffers is equal to `num_stages` (previously it was `num_stages - 1`).
> Therefore, some test configs need to be skipped because we run out of
> shared memory capacity on `gfx950`.
> ---------

> Co-authored-by: Lei Zhang <antiagainst@gmail.com>
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: b54cb94

Diff Comparison: https://www.internalfb.com/intern/paste/P2283343700/
 ---

This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Differential Revision: D101982809
@meta-cla meta-cla Bot added the CLA Signed label Apr 24, 2026
@meta-codesync
Contributor

meta-codesync Bot commented Apr 24, 2026

@agron911 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101982809.

agron911 added a commit to agron911/triton that referenced this pull request Apr 24, 2026
…fx950 and gfx1250 (#9087)' (facebookexperimental#1335)

Summary:
Pull Request resolved: facebookexperimental#1335

This is a cherry-pick of an upstream PR: triton-lang/triton#9087

Reviewed By: sfzhu93

Differential Revision: D101982809
@meta-codesync meta-codesync Bot closed this in 8e47abe Apr 24, 2026
@meta-codesync
Contributor

meta-codesync Bot commented Apr 24, 2026

This pull request has been merged in 8e47abe.

Labels

CLA Signed, fb-exported, Merged, meta-exported
