Skip to content

[TritonGPU] Coalesce integer atomics#10059

Merged
ThomasRaoux merged 4 commits into
triton-lang:mainfrom
jack-white1:main
Apr 19, 2026
Merged

[TritonGPU] Coalesce integer atomics#10059
ThomasRaoux merged 4 commits into
triton-lang:mainfrom
jack-white1:main

Conversation

@jack-white1
Copy link
Copy Markdown
Contributor

Hello! The coalescing pass currently chooses sizePerThread for atomics using the same pointer alignment / contiguity heuristic as regular stores.

For some atomic lowerings this is too optimistic: the pass will select a 128-bit per-thread layout even when the backend lowers the atomic as a narrower operation. That creates gaps in the warp-level write pattern and can hurt throughput significantly.

On the attached gh200 atomic-add microbenchmark, this PR improves Triton from:

  • int32: 17.98 -> 72.86 Gupdates/s (~4.1x)
  • int64: 19.29 -> 39.21 Gupdates/s (~2.0x)
  • fp64: 13.42 -> 26.79 Gupdates/s (~2.0x)

These numbers match the best Gluon blocked layouts, which use size_per_thread = 1 for int32 / int64 / fp64.

The attached microbenchmark focuses on atomic add, and we have only tested on Hopper, but we think it could be worth building a richer lowering path for different compute capabilities and dtypes, as more people realise the performance benefits of associative int32 atomics.

Thanks @leijurv for input on these commits.

Raw results (click to expand)
int32 results:
    baseline triton atomic took     382.13 ms @ 17.98 Gupdates/s
    patched triton atomic took      94.32 ms  @ 72.86 Gupdates/s
    Gluon [8] atomic took           706.87 ms @  9.72 Gupdates/s
    Gluon [4] atomic took           381.84 ms @ 18.00 Gupdates/s
    Gluon [2] atomic took           197.41 ms @ 34.81 Gupdates/s
    Gluon [1] atomic took           94.88 ms  @ 72.43 Gupdates/s

int64 results:
    baseline triton atomic took     356.30 ms @ 19.29 Gupdates/s
    patched triton atomic took      175.25 ms @ 39.21 Gupdates/s
    Gluon [8] atomic took           773.59 ms @  8.88 Gupdates/s
    Gluon [4] atomic took           684.48 ms @ 10.04 Gupdates/s
    Gluon [2] atomic took           356.34 ms @ 19.28 Gupdates/s
    Gluon [1] atomic took           174.56 ms @ 39.37 Gupdates/s

fp64 results:
    baseline triton atomic took     512.20 ms @ 13.42 Gupdates/s
    patched triton atomic took      256.55 ms @ 26.79 Gupdates/s
    Gluon [8] atomic took           773.32 ms @  8.89 Gupdates/s
    Gluon [4] atomic took           683.84 ms @ 10.05 Gupdates/s
    Gluon [2] atomic took           512.01 ms @ 13.42 Gupdates/s
    Gluon [1] atomic took           253.05 ms @ 27.16 Gupdates/s
Microbenchmark code (click to expand)
import torch
import triton
import triton.language as tl
import triton.experimental.gluon as gluon
import triton.experimental.gluon.language as gl

@triton.jit
def triton_atomic_add_kernel(input, output, ELEMENTS_PER_PID: tl.constexpr, INNER_REPEATS: tl.constexpr, SEM: tl.constexpr = "relaxed"):
    program_offset = tl.program_id(axis=0) * ELEMENTS_PER_PID
    load_store_indices = program_offset + tl.arange(0, ELEMENTS_PER_PID)
    
    load_indices = input + load_store_indices
    input = tl.load(load_indices)
    
    store_indices = output + load_store_indices
    for _ in tl.static_range(INNER_REPEATS):
        tl.atomic_add(store_indices, input, sem=SEM)
        
@gluon.jit
def gluon_atomic_add_kernel(input, output, ELEMENTS_PER_PID: gl.constexpr, INNER_REPEATS: gl.constexpr, SIZE_PER_THREAD: gl.constexpr, SEM: gl.constexpr = "relaxed"):
    LAYOUT: gl.constexpr = gl.BlockedLayout(size_per_thread=[SIZE_PER_THREAD], threads_per_warp=[32], warps_per_cta=[4], order=[0])
    program_offset = gl.program_id(axis=0) * ELEMENTS_PER_PID
    load_store_indices = program_offset + gl.arange(0, ELEMENTS_PER_PID, layout=LAYOUT)
    
    load_indices = input + load_store_indices
    input = gl.load(load_indices)
    
    store_indices = output + load_store_indices
    for _ in gl.static_range(INNER_REPEATS):
        gl.atomic_add(store_indices, input, sem=SEM)
        
def make_input(total_elements, dtype):
    if dtype.is_floating_point:
        return torch.rand((total_elements,), dtype=dtype)
    else:
        return torch.randint(-1000, 1000, (total_elements,), dtype=dtype)
    
if __name__ == "__main__":
    torch.set_default_device('cuda')
    total_elements = 2**30
    ELEMENTS_PER_PID = 2**9
    grid = total_elements // ELEMENTS_PER_PID
    INNER_REPEATS = 2**6

    for dtype in [torch.int32, torch.int64, torch.float32, torch.float64,]:
        print(f'\tDTYPE: {dtype}')
        input = make_input(total_elements, dtype)
        output_triton = torch.zeros_like(input)
        
        def triton_kernel_launcher():
            triton_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, num_warps=4)
            
        ms_triton = triton.testing.do_bench(triton_kernel_launcher)
        triton_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_triton / 10e6
        print(f'Triton atomic took \t{ms_triton:.2f} ms @ {triton_gupdates_per_sec:.2f} Gupdates/s')
            
        for SIZE_PER_THREAD in [8, 4, 2, 1]:
            def gluon_kernel_launcher():
                gluon_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, SIZE_PER_THREAD, num_warps=4)
            ms_gluon = triton.testing.do_bench(gluon_kernel_launcher)
            gluon_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_gluon / 10e6
            print(f'Gluon [{SIZE_PER_THREAD}] atomic took \t{ms_gluon:.2f} ms @ {gluon_gupdates_per_sec:.2f} Gupdates/s')

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@jack-white1 jack-white1 requested a review from ptillet as a code owner April 16, 2026 21:53
Comment thread lib/Dialect/TritonGPU/Transforms/Utility.cpp Outdated
@lezcano lezcano requested a review from Jokeren April 17, 2026 09:14
Comment thread lib/Dialect/TritonGPU/Transforms/Utility.cpp Outdated
Comment thread include/triton/Dialect/TritonGPU/Transforms/Utility.h Outdated
Comment thread lib/Dialect/TritonGPU/Transforms/Utility.cpp Outdated
Comment on lines +146 to +148
int computeCapability = getNVIDIAComputeCapability(moduleOp);
if (computeCapability >= 90)
return std::nullopt;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity, what changed in Hopper?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vectorised atomics were added with Hopper, its buried in https://docs.nvidia.com/cuda/parallel-thread-execution/

"Support for vector types requires sm_90 or higher."

See also a similar-ish discussion: llvm/llvm-project#122760

@ThomasRaoux ThomasRaoux merged commit ca6bd0c into triton-lang:main Apr 19, 2026
15 of 18 checks passed
bingyizh233 pushed a commit to bingyizh233/triton that referenced this pull request Apr 20, 2026
Hello! The coalescing pass currently chooses `sizePerThread` for atomics
using the same pointer alignment / contiguity heuristic as regular
stores.

For some atomic lowerings this is too optimistic: the pass will select a
128-bit per-thread layout even when the backend lowers the atomic as a
narrower operation. That creates gaps in the warp-level write pattern
and can hurt throughput significantly.

On the attached gh200 atomic-add microbenchmark, this PR improves Triton
from:

- `int32`: 17.98 -> 72.86 Gupdates/s (~4.1x)
- `int64`: 19.29 -> 39.21 Gupdates/s (~2.0x)
- `fp64`: 13.42 -> 26.79 Gupdates/s (~2.0x)

These numbers match the best Gluon blocked layouts, which use
`size_per_thread = 1` for `int32` / `int64` / `fp64`.

The attached microbenchmark focuses on atomic add, and we have only
tested on Hopper, but we think it could be worth building a richer
lowering path for different compute capabilities and dtypes, as more
people realise the performance benefits of associative int32 atomics.

Thanks @leijurv for input on these commits.

<details>
<summary><b>Raw results</b> (click to expand)</summary>

```
int32 results:
    baseline triton atomic took     382.13 ms @ 17.98 Gupdates/s
    patched triton atomic took      94.32 ms  @ 72.86 Gupdates/s
    Gluon [8] atomic took           706.87 ms @  9.72 Gupdates/s
    Gluon [4] atomic took           381.84 ms @ 18.00 Gupdates/s
    Gluon [2] atomic took           197.41 ms @ 34.81 Gupdates/s
    Gluon [1] atomic took           94.88 ms  @ 72.43 Gupdates/s

int64 results:
    baseline triton atomic took     356.30 ms @ 19.29 Gupdates/s
    patched triton atomic took      175.25 ms @ 39.21 Gupdates/s
    Gluon [8] atomic took           773.59 ms @  8.88 Gupdates/s
    Gluon [4] atomic took           684.48 ms @ 10.04 Gupdates/s
    Gluon [2] atomic took           356.34 ms @ 19.28 Gupdates/s
    Gluon [1] atomic took           174.56 ms @ 39.37 Gupdates/s

fp64 results:
    baseline triton atomic took     512.20 ms @ 13.42 Gupdates/s
    patched triton atomic took      256.55 ms @ 26.79 Gupdates/s
    Gluon [8] atomic took           773.32 ms @  8.89 Gupdates/s
    Gluon [4] atomic took           683.84 ms @ 10.05 Gupdates/s
    Gluon [2] atomic took           512.01 ms @ 13.42 Gupdates/s
    Gluon [1] atomic took           253.05 ms @ 27.16 Gupdates/s
```

</details>

<details>
<summary><b>Microbenchmark code</b> (click to expand)</summary>

```py
import torch
import triton
import triton.language as tl
import triton.experimental.gluon as gluon
import triton.experimental.gluon.language as gl

@triton.jit
def triton_atomic_add_kernel(input, output, ELEMENTS_PER_PID: tl.constexpr, INNER_REPEATS: tl.constexpr, SEM: tl.constexpr = "relaxed"):
    program_offset = tl.program_id(axis=0) * ELEMENTS_PER_PID
    load_store_indices = program_offset + tl.arange(0, ELEMENTS_PER_PID)
    
    load_indices = input + load_store_indices
    input = tl.load(load_indices)
    
    store_indices = output + load_store_indices
    for _ in tl.static_range(INNER_REPEATS):
        tl.atomic_add(store_indices, input, sem=SEM)
        
@gluon.jit
def gluon_atomic_add_kernel(input, output, ELEMENTS_PER_PID: gl.constexpr, INNER_REPEATS: gl.constexpr, SIZE_PER_THREAD: gl.constexpr, SEM: gl.constexpr = "relaxed"):
    LAYOUT: gl.constexpr = gl.BlockedLayout(size_per_thread=[SIZE_PER_THREAD], threads_per_warp=[32], warps_per_cta=[4], order=[0])
    program_offset = gl.program_id(axis=0) * ELEMENTS_PER_PID
    load_store_indices = program_offset + gl.arange(0, ELEMENTS_PER_PID, layout=LAYOUT)
    
    load_indices = input + load_store_indices
    input = gl.load(load_indices)
    
    store_indices = output + load_store_indices
    for _ in gl.static_range(INNER_REPEATS):
        gl.atomic_add(store_indices, input, sem=SEM)
        
def make_input(total_elements, dtype):
    if dtype.is_floating_point:
        return torch.rand((total_elements,), dtype=dtype)
    else:
        return torch.randint(-1000, 1000, (total_elements,), dtype=dtype)
    
if __name__ == "__main__":
    torch.set_default_device('cuda')
    total_elements = 2**30
    ELEMENTS_PER_PID = 2**9
    grid = total_elements // ELEMENTS_PER_PID
    INNER_REPEATS = 2**6

    for dtype in [torch.int32, torch.int64, torch.float32, torch.float64,]:
        print(f'\tDTYPE: {dtype}')
        input = make_input(total_elements, dtype)
        output_triton = torch.zeros_like(input)
        
        def triton_kernel_launcher():
            triton_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, num_warps=4)
            
        ms_triton = triton.testing.do_bench(triton_kernel_launcher)
        triton_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_triton / 10e6
        print(f'Triton atomic took \t{ms_triton:.2f} ms @ {triton_gupdates_per_sec:.2f} Gupdates/s')
            
        for SIZE_PER_THREAD in [8, 4, 2, 1]:
            def gluon_kernel_launcher():
                gluon_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, SIZE_PER_THREAD, num_warps=4)
            ms_gluon = triton.testing.do_bench(gluon_kernel_launcher)
            gluon_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_gluon / 10e6
            print(f'Gluon [{SIZE_PER_THREAD}] atomic took \t{ms_gluon:.2f} ms @ {gluon_gupdates_per_sec:.2f} Gupdates/s')
```

</details>


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
- [x] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants