[AMD] Enable AsyncCopy by default on gfx950 and gfx1250#9087

Merged
antiagainst merged 15 commits into triton-lang:main from ROCm:aweinrau/amd-async-load-on
Dec 23, 2025
@AlexAUT AlexAUT commented Dec 23, 2025

Enables `ttg.async_copy_global_to_local` for pipelined loads by default on `gfx950` and `gfx1250`.

This increases LDS consumption because we replace one register buffer with an additional LDS buffer. After this change, the number of LDS buffers is equal to `num_stages` (previously it was `num_stages - 1`). Therefore, some test configs need to be skipped because we run out of shared memory capacity on `gfx950`.
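
To make the buffer arithmetic concrete, here is a rough sketch of how the LDS footprint of a pipelined GEMM could be estimated before and after this change. The tile sizes, the helper names, and the 160 KiB LDS capacity assumed for `gfx950` are illustrative assumptions, not values taken from this PR, and the real compiler may add padding or other allocations.

```python
# Back-of-the-envelope LDS estimate for a pipelined GEMM.
# All names and constants here are hypothetical, for illustration only.

ASSUMED_LDS_CAPACITY = 160 * 1024  # bytes; assumed per-workgroup limit on gfx950


def lds_buffers(num_stages: int, use_async_copy: bool) -> int:
    """Number of LDS buffers per operand, as described in the PR text."""
    return num_stages if use_async_copy else num_stages - 1


def lds_bytes(block_m: int, block_n: int, block_k: int,
              elem_bytes: int, num_buffers: int) -> int:
    """LDS bytes for the A (M x K) and B (K x N) tiles, ignoring padding."""
    a_tile = block_m * block_k * elem_bytes
    b_tile = block_k * block_n * elem_bytes
    return num_buffers * (a_tile + b_tile)


def fits(block_m: int, block_n: int, block_k: int, elem_bytes: int,
         num_stages: int, use_async_copy: bool) -> bool:
    """True if the estimated footprint stays within the assumed capacity."""
    buffers = lds_buffers(num_stages, use_async_copy)
    needed = lds_bytes(block_m, block_n, block_k, elem_bytes, buffers)
    return needed <= ASSUMED_LDS_CAPACITY


if __name__ == "__main__":
    # Hypothetical config: 256x256x64 fp16 tiles with num_stages=3.
    for async_copy in (False, True):
        n = lds_buffers(3, async_copy)
        used = lds_bytes(256, 256, 64, 2, n)
        print(f"async_copy={async_copy}: {n} buffers, {used} bytes, "
              f"fits={fits(256, 256, 64, 2, 3, async_copy)}")
```

Under these assumptions, the hypothetical config needs 128 KiB with `num_stages - 1` buffers but 192 KiB with `num_stages` buffers, which is the kind of config that would need to be skipped or run with fewer stages.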

Fixes https://github.com/ROCm/triton-internal/issues/1020

@antiagainst antiagainst marked this pull request as ready for review December 23, 2025 16:27
@antiagainst antiagainst merged commit b54cb94 into triton-lang:main Dec 23, 2025
9 checks passed
@antiagainst antiagainst deleted the aweinrau/amd-async-load-on branch December 23, 2025 16:28
xuzhao9 commented Jan 17, 2026

This PR broke and perf-regressed multiple Tritonbench tests:

meta-pytorch/tritonbench#765

@antiagainst
Member

Thanks for the report! That's sort of expected. I commented on the issue.

atalman pushed a commit that referenced this pull request Feb 12, 2026
…9445)

Reverts the use of use_async_copy in the hopes of fixing the many rocm
failures. Ideally we could land this PR instead:
#9431, but I think we should
test the revert first so we have a fallback in case LLVM upgrades cause
other issues.
agron911 added a commit to agron911/triton that referenced this pull request Apr 24, 2026
…fx950 and gfx1250 (#9087)' (facebookexperimental#1335)

Summary:
Pull Request resolved: facebookexperimental#1335

This is a cherry-pick of an upstream PR: triton-lang/triton#9087

Upstream commit message:
```
> [AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)

> Enables `ttg.async_copy_global_to_local` for pipelined loads by default
> on `gfx950` and `gfx1250`.

> This increases LDS consumption because we replace one register buffer
> with an additional LDS buffer. After this change, the number of LDS
> buffers is equal to `num_stages` (previously it was `num_stages - 1`).
> Therefore, some test configs need to be skipped because we run out of
> shared memory capacity on `gfx950`.
> ---------

> Co-authored-by: Lei Zhang <antiagainst@gmail.com>
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: b54cb94

Diff Comparison: https://www.internalfb.com/intern/paste/P2283343700/
 ---

This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Reviewed By: sfzhu93

Differential Revision: D101982809
meta-codesync Bot pushed a commit to facebookexperimental/triton that referenced this pull request Apr 24, 2026
…fx950 and gfx1250 (#9087)' (#1335)

Summary:
Pull Request resolved: #1335

This is a cherry-pick of an upstream PR: triton-lang/triton#9087

Upstream commit message:
```
> [AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)

> Enables `ttg.async_copy_global_to_local` for pipelined loads by default
> on `gfx950` and `gfx1250`.

> This increases LDS consumption because we replace one register buffer
> with an additional LDS buffer. After this change, the number of LDS
> buffers is equal to `num_stages` (previously it was `num_stages - 1`).
> Therefore, some test configs need to be skipped because we run out of
> shared memory capacity on `gfx950`.
> ---------

> Co-authored-by: Lei Zhang <antiagainst@gmail.com>
```

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: b54cb94

Diff Comparison: https://www.internalfb.com/intern/paste/P2283343700/
 ---

This diff was generated by running:
```
buck run fbcode//triton/tools/reactor:reactor -- cherrypick --num-commits 1 --no-submit
```

Reviewed By: sfzhu93

Differential Revision: D101982809

fbshipit-source-id: 5af1287a77f46b618c2cd266fcfe2eb3549a6c7a