[TritonGPU] Coalesce integer atomics by jack-white1 · Pull Request #10059 · triton-lang/triton

jack-white1 · 2026-04-16T21:53:23Z

Hello! The coalescing pass currently chooses sizePerThread for atomics using the same pointer alignment / contiguity heuristic as regular stores.

For some atomic lowerings this is too optimistic: the pass will select a 128-bit per-thread layout even when the backend lowers the atomic as a narrower operation. That creates gaps in the warp-level write pattern and can hurt throughput significantly.

On the attached gh200 atomic-add microbenchmark, this PR improves Triton from:

int32: 17.98 -> 72.86 Gupdates/s (~4.1x)
int64: 19.29 -> 39.21 Gupdates/s (~2.0x)
fp64: 13.42 -> 26.79 Gupdates/s (~2.0x)

These numbers match the best Gluon blocked layouts, which use size_per_thread = 1 for int32 / int64 / fp64.

The attached microbenchmark focuses on atomic add, and we have only tested on Hopper, but we think it could be worth building a richer lowering path for different compute capabilities and dtypes, as more people realise the performance benefits of associative int32 atomics.

Thanks @leijurv for input on these commits.

Raw results (click to expand)

int32 results:
    baseline triton atomic took     382.13 ms @ 17.98 Gupdates/s
    patched triton atomic took      94.32 ms  @ 72.86 Gupdates/s
    Gluon [8] atomic took           706.87 ms @  9.72 Gupdates/s
    Gluon [4] atomic took           381.84 ms @ 18.00 Gupdates/s
    Gluon [2] atomic took           197.41 ms @ 34.81 Gupdates/s
    Gluon [1] atomic took           94.88 ms  @ 72.43 Gupdates/s

int64 results:
    baseline triton atomic took     356.30 ms @ 19.29 Gupdates/s
    patched triton atomic took      175.25 ms @ 39.21 Gupdates/s
    Gluon [8] atomic took           773.59 ms @  8.88 Gupdates/s
    Gluon [4] atomic took           684.48 ms @ 10.04 Gupdates/s
    Gluon [2] atomic took           356.34 ms @ 19.28 Gupdates/s
    Gluon [1] atomic took           174.56 ms @ 39.37 Gupdates/s

fp64 results:
    baseline triton atomic took     512.20 ms @ 13.42 Gupdates/s
    patched triton atomic took      256.55 ms @ 26.79 Gupdates/s
    Gluon [8] atomic took           773.32 ms @  8.89 Gupdates/s
    Gluon [4] atomic took           683.84 ms @ 10.05 Gupdates/s
    Gluon [2] atomic took           512.01 ms @ 13.42 Gupdates/s
    Gluon [1] atomic took           253.05 ms @ 27.16 Gupdates/s

Microbenchmark code (click to expand)

import torch
import triton
import triton.language as tl
import triton.experimental.gluon as gluon
import triton.experimental.gluon.language as gl

@triton.jit
def triton_atomic_add_kernel(input, output, ELEMENTS_PER_PID: tl.constexpr, INNER_REPEATS: tl.constexpr, SEM: tl.constexpr = "relaxed"):
    program_offset = tl.program_id(axis=0) * ELEMENTS_PER_PID
    load_store_indices = program_offset + tl.arange(0, ELEMENTS_PER_PID)
    
    load_indices = input + load_store_indices
    input = tl.load(load_indices)
    
    store_indices = output + load_store_indices
    for _ in tl.static_range(INNER_REPEATS):
        tl.atomic_add(store_indices, input, sem=SEM)
        
@gluon.jit
def gluon_atomic_add_kernel(input, output, ELEMENTS_PER_PID: gl.constexpr, INNER_REPEATS: gl.constexpr, SIZE_PER_THREAD: gl.constexpr, SEM: gl.constexpr = "relaxed"):
    LAYOUT: gl.constexpr = gl.BlockedLayout(size_per_thread=[SIZE_PER_THREAD], threads_per_warp=[32], warps_per_cta=[4], order=[0])
    program_offset = gl.program_id(axis=0) * ELEMENTS_PER_PID
    load_store_indices = program_offset + gl.arange(0, ELEMENTS_PER_PID, layout=LAYOUT)
    
    load_indices = input + load_store_indices
    input = gl.load(load_indices)
    
    store_indices = output + load_store_indices
    for _ in gl.static_range(INNER_REPEATS):
        gl.atomic_add(store_indices, input, sem=SEM)
        
def make_input(total_elements, dtype):
    if dtype.is_floating_point:
        return torch.rand((total_elements,), dtype=dtype)
    else:
        return torch.randint(-1000, 1000, (total_elements,), dtype=dtype)
    
if __name__ == "__main__":
    torch.set_default_device('cuda')
    total_elements = 2**30
    ELEMENTS_PER_PID = 2**9
    grid = total_elements // ELEMENTS_PER_PID
    INNER_REPEATS = 2**6

    for dtype in [torch.int32, torch.int64, torch.float32, torch.float64,]:
        print(f'\tDTYPE: {dtype}')
        input = make_input(total_elements, dtype)
        output_triton = torch.zeros_like(input)
        
        def triton_kernel_launcher():
            triton_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, num_warps=4)
            
        ms_triton = triton.testing.do_bench(triton_kernel_launcher)
        triton_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_triton / 10e6
        print(f'Triton atomic took \t{ms_triton:.2f} ms @ {triton_gupdates_per_sec:.2f} Gupdates/s')
            
        for SIZE_PER_THREAD in [8, 4, 2, 1]:
            def gluon_kernel_launcher():
                gluon_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, SIZE_PER_THREAD, num_warps=4)
            ms_gluon = triton.testing.do_bench(gluon_kernel_launcher)
            gluon_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_gluon / 10e6
            print(f'Gluon [{SIZE_PER_THREAD}] atomic took \t{ms_gluon:.2f} ms @ {gluon_gupdates_per_sec:.2f} Gupdates/s')

New contributor declaration

I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these
rules.
I have run pre-commit run --from-ref origin/main --to-ref HEAD.
Select one of the following.
- I have added tests.
  - /test for lit tests
  - /unittest for C++ tests
  - /python/test for end-to-end tests
- This PR does not need a test because FILL THIS IN.
Select one of the following.
- I have not added any lit tests.
- The lit tests I have added follow these best practices,
  including the "tests should be minimal" section. (Usually running Python code
  and using the instructions it generates is not minimal.)

…hread

lezcano · 2026-04-19T10:15:42Z

+  int computeCapability = getNVIDIAComputeCapability(moduleOp);
+  if (computeCapability >= 90)
+    return std::nullopt;


out of curiosity, what changed in Hopper?

Vectorised atomics were added with Hopper, its buried in https://docs.nvidia.com/cuda/parallel-thread-execution/

"Support for vector types requires sm_90 or higher."

See also a similar-ish discussion: llvm/llvm-project#122760

@leijurv

Hello! The coalescing pass currently chooses `sizePerThread` for atomics using the same pointer alignment / contiguity heuristic as regular stores. For some atomic lowerings this is too optimistic: the pass will select a 128-bit per-thread layout even when the backend lowers the atomic as a narrower operation. That creates gaps in the warp-level write pattern and can hurt throughput significantly. On the attached gh200 atomic-add microbenchmark, this PR improves Triton from: - `int32`: 17.98 -> 72.86 Gupdates/s (~4.1x) - `int64`: 19.29 -> 39.21 Gupdates/s (~2.0x) - `fp64`: 13.42 -> 26.79 Gupdates/s (~2.0x) These numbers match the best Gluon blocked layouts, which use `size_per_thread = 1` for `int32` / `int64` / `fp64`. The attached microbenchmark focuses on atomic add, and we have only tested on Hopper, but we think it could be worth building a richer lowering path for different compute capabilities and dtypes, as more people realise the performance benefits of associative int32 atomics. Thanks @leijurv for input on these commits. <details> <summary><b>Raw results</b> (click to expand)</summary> ``` int32 results: baseline triton atomic took 382.13 ms @ 17.98 Gupdates/s patched triton atomic took 94.32 ms @ 72.86 Gupdates/s Gluon [8] atomic took 706.87 ms @ 9.72 Gupdates/s Gluon [4] atomic took 381.84 ms @ 18.00 Gupdates/s Gluon [2] atomic took 197.41 ms @ 34.81 Gupdates/s Gluon [1] atomic took 94.88 ms @ 72.43 Gupdates/s int64 results: baseline triton atomic took 356.30 ms @ 19.29 Gupdates/s patched triton atomic took 175.25 ms @ 39.21 Gupdates/s Gluon [8] atomic took 773.59 ms @ 8.88 Gupdates/s Gluon [4] atomic took 684.48 ms @ 10.04 Gupdates/s Gluon [2] atomic took 356.34 ms @ 19.28 Gupdates/s Gluon [1] atomic took 174.56 ms @ 39.37 Gupdates/s fp64 results: baseline triton atomic took 512.20 ms @ 13.42 Gupdates/s patched triton atomic took 256.55 ms @ 26.79 Gupdates/s Gluon [8] atomic took 773.32 ms @ 8.89 Gupdates/s Gluon [4] atomic took 683.84 ms @ 10.05 Gupdates/s Gluon [2] atomic took 512.01 ms @ 13.42 Gupdates/s Gluon [1] atomic took 253.05 ms @ 27.16 Gupdates/s ``` </details> <details> <summary><b>Microbenchmark code</b> (click to expand)</summary> ```py import torch import triton import triton.language as tl import triton.experimental.gluon as gluon import triton.experimental.gluon.language as gl @triton.jit def triton_atomic_add_kernel(input, output, ELEMENTS_PER_PID: tl.constexpr, INNER_REPEATS: tl.constexpr, SEM: tl.constexpr = "relaxed"): program_offset = tl.program_id(axis=0) * ELEMENTS_PER_PID load_store_indices = program_offset + tl.arange(0, ELEMENTS_PER_PID) load_indices = input + load_store_indices input = tl.load(load_indices) store_indices = output + load_store_indices for _ in tl.static_range(INNER_REPEATS): tl.atomic_add(store_indices, input, sem=SEM) @gluon.jit def gluon_atomic_add_kernel(input, output, ELEMENTS_PER_PID: gl.constexpr, INNER_REPEATS: gl.constexpr, SIZE_PER_THREAD: gl.constexpr, SEM: gl.constexpr = "relaxed"): LAYOUT: gl.constexpr = gl.BlockedLayout(size_per_thread=[SIZE_PER_THREAD], threads_per_warp=[32], warps_per_cta=[4], order=[0]) program_offset = gl.program_id(axis=0) * ELEMENTS_PER_PID load_store_indices = program_offset + gl.arange(0, ELEMENTS_PER_PID, layout=LAYOUT) load_indices = input + load_store_indices input = gl.load(load_indices) store_indices = output + load_store_indices for _ in gl.static_range(INNER_REPEATS): gl.atomic_add(store_indices, input, sem=SEM) def make_input(total_elements, dtype): if dtype.is_floating_point: return torch.rand((total_elements,), dtype=dtype) else: return torch.randint(-1000, 1000, (total_elements,), dtype=dtype) if __name__ == "__main__": torch.set_default_device('cuda') total_elements = 2**30 ELEMENTS_PER_PID = 2**9 grid = total_elements // ELEMENTS_PER_PID INNER_REPEATS = 2**6 for dtype in [torch.int32, torch.int64, torch.float32, torch.float64,]: print(f'\tDTYPE: {dtype}') input = make_input(total_elements, dtype) output_triton = torch.zeros_like(input) def triton_kernel_launcher(): triton_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, num_warps=4) ms_triton = triton.testing.do_bench(triton_kernel_launcher) triton_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_triton / 10e6 print(f'Triton atomic took \t{ms_triton:.2f} ms @ {triton_gupdates_per_sec:.2f} Gupdates/s') for SIZE_PER_THREAD in [8, 4, 2, 1]: def gluon_kernel_launcher(): gluon_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, SIZE_PER_THREAD, num_warps=4) ms_gluon = triton.testing.do_bench(gluon_kernel_launcher) gluon_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_gluon / 10e6 print(f'Gluon [{SIZE_PER_THREAD}] atomic took \t{ms_gluon:.2f} ms @ {gluon_gupdates_per_sec:.2f} Gupdates/s') ``` </details> # New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [x] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [ ] I have not added any `lit` tests. - [x] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

[TRITONGPU] Cap coalesced write width for atomic ops

890a3c0

jack-white1 requested a review from ptillet as a code owner April 16, 2026 21:53

lezcano reviewed Apr 17, 2026

View reviewed changes

Comment thread lib/Dialect/TritonGPU/Transforms/Utility.cpp Outdated

lezcano requested a review from Jokeren April 17, 2026 09:14

[TRITONGPU][NFC] Drop pre-SM80 branch in f16 atomic write cap

1b8361e

ThomasRaoux reviewed Apr 17, 2026

View reviewed changes

Comment thread lib/Dialect/TritonGPU/Transforms/Utility.cpp Outdated

[TRITONGPU] Fold atomic write aware width cap into getNumElementsPerT…

8a2bcf7

…hread

ThomasRaoux reviewed Apr 17, 2026

View reviewed changes

Comment thread include/triton/Dialect/TritonGPU/Transforms/Utility.h Outdated

Comment thread lib/Dialect/TritonGPU/Transforms/Utility.cpp Outdated

[TRITONGPU] Remove isWrite and extract getMaxElementsPerThread helper

1339191

Jokeren approved these changes Apr 17, 2026

View reviewed changes

lezcano reviewed Apr 19, 2026

View reviewed changes

ThomasRaoux merged commit ca6bd0c into triton-lang:main Apr 19, 2026
15 of 18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TritonGPU] Coalesce integer atomics#10059

[TritonGPU] Coalesce integer atomics#10059
ThomasRaoux merged 4 commits into
triton-lang:mainfrom
jack-white1:main

jack-white1 commented Apr 16, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lezcano Apr 19, 2026

Uh oh!

jack-white1 Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jack-white1 commented Apr 16, 2026

New contributor declaration

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lezcano Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

jack-white1 Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants