[CPU] Add an OpenMP-based CPU launcher #15

Merged
ienkovich merged 4 commits into triton-lang:main from minjang:openmp-parallel-for
Jun 10, 2024
Conversation

@minjang (Collaborator) commented Jun 5, 2024

This PR implements a simple OpenMP-based CPU launcher with some debugging and tunable environment variables.

Details:

  • By default, it uses all available threads on the machine, so logical threads on the same physical core may contend with each other. This needs fine tuning.
  • It uses static scheduling; there is no reason to use dynamic scheduling for now (its overhead is high). I used the default chunk size of total iterations / number of threads, with a minimum chunk size of 10 (a sketch of this policy follows the list).
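
For concreteness, here is a minimal sketch of this chunk-size policy (the helper name is made up; this is not the PR's exact code):

```cpp
#include <algorithm>
#include <omp.h>

// Hypothetical helper illustrating the described policy: static chunks of
// ceil(total_iters / num_threads), clamped to a minimum of 10 iterations.
static int default_chunk_size(int total_iters) {
  int max_threads = omp_get_max_threads();
  // ceil(total_iters / max_threads) without floating point.
  int chunk = (total_iters + max_threads - 1) / max_threads;
  return std::max(chunk, 10); // MINIMUM_OMP_CHUNK_SIZE
}
```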

Initial performance:

Environment: I only tested on an AMD EPYC Zen4 96-core machine with CentOS 9, a development server at Meta. The test ran on a VM, not a physical machine.

I only tested the vector addition example. I added TritonCPU 1-core as well as TorchCPU as references. I disabled the turbo-boost-like feature during this test and used a smaller BLOCK_SIZE of 128 for CPU. Some observations so far:

  • For large vector sizes >= 2^24, the GB/s seems to be stable.
    • The performance of TritonCPU with maximum threads is similar to that of TorchCPU.
  • A large variance is observed for sizes from 2^18 to 2^22, for both TritonCPU and TorchCPU.
  • OpenMP's speedups for larger sizes are not impressive: ~6x.

```
% python3 python/tutorials/01-vector-add.py
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192])
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192])
The maximum difference between torch-cpu and triton-cpu is 0.0
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192], device='cuda:0')
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192], device='cuda:0')
The maximum difference between torch-gpu and triton-gpu is 0.0
vector-add-performance (CPU_BLOCK_SIZE=128, GPU_BLOCK_SIZE=1024):
           size  TritonCPU 1-core  TritonCPU    TorchCPU    TritonGPU     TorchGPU
0        4096.0          0.403202   0.062194    4.472021    10.816901    10.666666
1        8192.0          3.120957   1.619745    7.883864    21.186207    21.041095
2       16384.0          3.280629   0.958061    3.545745    41.234899    40.959998
3       32768.0          3.188169   2.692145    6.301640    81.919996    80.313725
4       65536.0          3.969092   0.291063    7.116936   158.554837   156.535030
5      131072.0          5.689773   5.975995    9.832181   284.115618   282.482757
6      262144.0          5.911389  12.657541    4.723294   484.256150   489.074621
7      524288.0          6.442556  15.187768    5.880182   783.298835   771.011790
8     1048576.0          6.401166  55.634510    1.416960  1153.126078  1149.754339
9     2097152.0         10.465727  32.779651   98.862590  1521.145131  1481.039473
10    4194304.0          9.414268  34.194252  195.153484  1767.262872  1745.687030
11    8388608.0          3.347814   4.684119    5.160477  1972.861670  1957.515852
12   16777216.0          3.677767   7.394172    7.372241  2109.103660  2077.759573
13   33554432.0          3.057506  12.829415   10.310923  2168.532851  2172.839199
14   67108864.0          3.555315  17.150573   15.640461  2226.669987  2237.160951
15  134217728.0          3.711412  16.852324   18.304376  2246.948652  2260.369439
```

TODO:

  • More experimentation with the fused softmax case.
  • More rigorous performance testing on a stable environment.

Debugging and control features

TRITON_CPU_MAX_THREADS and TRITON_CPU_OMP_DEBUG are supported.

TRITON_CPU_MAX_THREADS=40 TRITON_CPU_OMP_DEBUG=1 python3 python/tutorials/01-vector-add.py

@minjang minjang requested review from bertmaher and ienkovich June 5, 2024 03:05
@minjang minjang requested a review from ptillet as a code owner June 5, 2024 03:05
@ienkovich (Collaborator)

Thanks for the patch, looks like a good start!

Some results look suspicious, though. There is a big perf drop starting from the 8MB size, and the drop happens for all CPU options. I tried it on my machine; it shows quite stable perf growth as size increases, but it also has this drop at 8MB:

```
           size  TritonCPU 1-core   TritonCPU    TorchCPU
0        4096.0          0.473994    0.369581    1.530837
1        8192.0          0.790753    0.747536    2.880876
2       16384.0          1.479289    1.597310    4.705235
3       32768.0          2.556686    3.840678    9.008160
4       65536.0          5.013719    8.386279   12.160891
5      131072.0          7.338396   14.899770   25.372058
6      262144.0         10.075180   29.821873   51.502574
7      524288.0         11.859673   36.852700   91.065672
8     1048576.0         12.967142   61.431152  110.480621
9     2097152.0         13.879187   87.099157  172.930611
10    4194304.0         14.490595  101.303109  220.705400
11    8388608.0          6.876308   23.051736   26.772240
12   16777216.0          6.798736   27.438252   24.234607
13   33554432.0          6.649066   25.322378   26.525232
14   67108864.0          6.489616   28.934346   30.688774
15  134217728.0          6.335913   28.097894   29.345317
```

Do you have any theory why it might happen?

Comment thread third_party/cpu/backend/driver.py Outdated
include_dir = [
os.path.join(dirname, "include"),
os.path.join(llvm_root, "include"),
os.path.join(".", "include"),

Collaborator:

Does it mean Triton can now be run only from the root repo dir?

Collaborator Author:

This was to use GetEnv.hpp. I will remove it.

Comment thread third_party/cpu/backend/driver.py Outdated
"z",
]

MINIMUM_OMP_CHUNK_SIZE = 10

Collaborator:

Can you please explain why we need this one? I can easily imagine cases when a chunk size of 1 is preferred.

Collaborator Author:

My thought was that a typical iteration of the omp-parallel-for loop in TritonCPU would tend to be (1) uniform and (2) small in workload. So the smallest chunk size, like 1, might add scheduling overhead and lose cache affinity as well as potential prefetching.

Let me rethink how to control the chunk size.

Collaborator:

It's hard to come up with a universal heuristic here. For vector-add, a kernel processes a very small block, and setting the minimum chunk size to 10 looks reasonable. But for matmul kernels it would be very different: we might have dozens of quite heavy kernel invocations, and this minimum chunk size would significantly limit parallelism.

We definitely cannot find the best strategy based on the vector-add example only and will have multiple iterations here. So it's OK to start with some basics for now, but I'd like to avoid any hard-coded limits here that cannot be controlled in our perf experiments.

Collaborator Author:

Right. This was more like interim testing code. For now, let's just use the default chunk size, that is, not specify it. I removed the code that overrides the chunk size.

When we have more perf benchmarks, we can introduce some good overridable options.

Comment thread third_party/cpu/backend/driver.py Outdated
#include <omp.h>
#include <cmath>

#include "triton/Tools/Sys/GetEnv.hpp"

Collaborator:

I don't think GetEnv.hpp is part of the Triton package. This code is compiled on a user's machine, so the header would be unavailable.

Collaborator Author:

Oh right. Let me just implement simple helper functions directly here.
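
For illustration, a minimal sketch of such self-contained helpers built on std::getenv (the names are hypothetical, not the PR's exact code):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical stand-ins for the GetEnv.hpp helpers, compiled directly
// into the generated launcher so no Triton headers are needed.
static bool get_bool_env(const char *name) {
  const char *v = std::getenv(name);
  if (v == nullptr)
    return false;
  std::string s(v);
  return s == "1" || s == "true" || s == "on";
}

static int get_int_env(const char *name, int fallback) {
  const char *v = std::getenv(name);
  return v ? std::atoi(v) : fallback;
}
```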


static PyObject* launch(PyObject* self, PyObject* args) {{
static void run_omp_kernels(uint32_t gridX, uint32_t gridY, uint32_t gridZ, kernel_ptr_t kernel_ptr {', ' + arg_decls if len(arg_decls) > 0 else ''}) {{
auto all_grids = get_all_grids(gridX, gridY, gridZ);

Collaborator:

Can we try the OpenMP collapse(3) clause instead of building these grids?

Collaborator Author:

That's a good idea, but I had some trouble getting it right with the Clang OpenMP toolchain, e.g., only one CPU was used. So my suggestion is to keep the current approach for the time being. Once we confirm that macOS+Clang works well with OpenMP, I will try the collapse clause to make the code simpler.
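
For reference, here is a minimal sketch of what the collapse(3) variant could look like, assuming a kernel pointer that takes the block coordinates and grid shape (the signature and names are illustrative, not the PR's exact launcher):

```cpp
#include <cstdint>

using kernel_ptr_t = void (*)(uint32_t x, uint32_t y, uint32_t z,
                              uint32_t gridX, uint32_t gridY, uint32_t gridZ);

static void run_omp_kernels_collapsed(uint32_t gridX, uint32_t gridY,
                                      uint32_t gridZ, kernel_ptr_t kernel) {
  // collapse(3) flattens the three perfectly nested loops into one
  // iteration space, so OpenMP can distribute all gridX * gridY * gridZ
  // blocks across threads without materializing an explicit grid vector.
#pragma omp parallel for collapse(3) schedule(static)
  for (uint32_t z = 0; z < gridZ; ++z)
    for (uint32_t y = 0; y < gridY; ++y)
      for (uint32_t x = 0; x < gridX; ++x)
        kernel(x, y, z, gridX, gridY, gridZ);
}
```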

Comment thread third_party/cpu/backend/driver.py Outdated
max_threads = omp_get_max_threads();

int chunk_size = std::ceil((double)N / (double)max_threads.value());
chunk_size = std::max(chunk_size, {MINIMUM_OMP_CHUNK_SIZE});

Collaborator:

The chunk size can affect performance a lot. For our experiments, I'd remove the minimum bound and instead add a way to override the chunk size.

Comment thread third_party/cpu/backend/driver.py Outdated
chunk_size = std::max(chunk_size, {MINIMUM_OMP_CHUNK_SIZE});

if (mlir::triton::tools::getBoolEnv("TRITON_CPU_OMP_DEBUG"))
printf("N: %zu, max_threads: %d, chunk_size: %zu\\n", N, max_threads, chunk_size);

Collaborator:

You pass the std::optional<int> max_threads here instead of an int value.

Comment thread third_party/cpu/backend/driver.py Outdated
chunk_size = std::max(chunk_size, {MINIMUM_OMP_CHUNK_SIZE});

if (mlir::triton::tools::getBoolEnv("TRITON_CPU_OMP_DEBUG"))
printf("N: %zu, max_threads: %d, chunk_size: %zu\\n", N, max_threads, chunk_size);

Collaborator:

The chunk_size type doesn't match the %zu specifier.
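
A sketch of a corrected debug print addressing both comments above (the std::optional must be unwrapped, and an int chunk_size needs %d rather than %zu); this is illustrative only, with types taken from the snippet above:

```cpp
#include <cstdio>
#include <optional>

static void print_omp_debug(std::size_t N, std::optional<int> max_threads,
                            int chunk_size) {
  // %zu matches size_t N; %d matches the unwrapped int and chunk_size.
  std::printf("N: %zu, max_threads: %d, chunk_size: %d\n", N,
              max_threads.value_or(-1), chunk_size);
}
```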

Comment thread third_party/cpu/backend/driver.py Outdated
if (mlir::triton::tools::getBoolEnv("TRITON_CPU_OMP_DEBUG"))
printf("N: %zu, max_threads: %d, chunk_size: %zu\\n", N, max_threads, chunk_size);

#pragma omp parallel for schedule(static, chunk_size) num_threads(max_threads.value())

Collaborator:

I get much better results if I remove chunk_size from here. Can you try that option?

Collaborator Author:

Without chunk_size, the default behavior should be the same as in my code (apart from the minimum chunk): total iterations / number of threads. Let me update the chunk size handling (see the sketch below).
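
To illustrate the equivalence, a minimal runnable sketch assuming plain OpenMP semantics (not the PR's generated code):

```cpp
#include <cstdio>
#include <omp.h>

int main() {
  const int N = 1000;
  // With no chunk argument, schedule(static) hands each thread one
  // contiguous chunk of roughly N / num_threads iterations, which matches
  // the explicit computation apart from the minimum-chunk clamp.
#pragma omp parallel for schedule(static)
  for (int i = 0; i < N; ++i) {
    if (i % 100 == 0)
      std::printf("iter %d on thread %d\n", i, omp_get_thread_num());
  }
  return 0;
}
```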

@minjang (Collaborator Author) commented Jun 5, 2024

```
11    8388608.0          6.876308   23.051736   26.772240
12   16777216.0          6.798736   27.438252   24.234607
13   33554432.0          6.649066   25.322378   26.525232
14   67108864.0          6.489616   28.934346   30.688774
15  134217728.0          6.335913   28.097894   29.345317
```

Your numbers look much more stable than mine. Note that I was testing on our devvm. The big drop at the 8MB size is confusing us as well; our engineers will also take a look at it.

@embg (Collaborator) commented Jun 5, 2024

This big drop at 8MB size is confusing us as well.

In the past, when I have hit weird benchmark issues at 8MB, it was because of jemalloc. It caches allocations smaller than 8MB, but not larger.

Is it possible to rewrite the benchmark so that it avoids malloc / free, and uses the same memory each time?

Edit: never mind, the drop is at 8M elements, not 8M bytes, disregard my comment.

@minjang (Collaborator Author) commented Jun 5, 2024

Edit: never mind, the drop is at 8M elements, not 8M bytes, disregard my comment.

Nonetheless, the memory allocator still might be a reason. This is a good pointer.

I'm not sure whether I made a terrible mistake in do_bench with is_cpu=True in python/triton/testing.py. Originally, do_bench only considered GPUs, so I patched it for CPU. It allocates a chunk of memory to flush the L3 cache.

Let's do some more investigation.

@embg (Collaborator) commented Jun 5, 2024

It does allocate some chunk of memory to flush L3 cache.

It would be better if we could flush the L3 cache by writing to an existing allocation, so that we don't incur page faults while flushing. Something like the sketch below.
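
A minimal sketch of that idea (hypothetical helper, not part of the PR): one preallocated buffer written through on every flush, so the L3 is evicted without fresh page faults. The 256 MiB default is an assumption; it only needs to exceed the L3 capacity.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class CacheFlusher {
public:
  explicit CacheFlusher(std::size_t bytes = std::size_t(256) << 20)
      : buf_(bytes) {} // allocated (and faulted in) once, up front

  // Touch one byte per 64-byte cache line so every line of the buffer
  // passes through the cache hierarchy, evicting prior contents.
  void flush() {
    for (std::size_t i = 0; i < buf_.size(); i += 64)
      buf_[i] += 1;
  }

private:
  std::vector<std::uint8_t> buf_;
};
```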

@ienkovich (Collaborator)

Nonetheless, the memory allocator still might be a reason. This is a good pointer.

Moving the creation of the output tensor out of the add functions 'fixes' the problem, so the problem seems to be in this allocation. Maybe torch initializes tensors with zeros for big sizes, or garbage collection takes place here?

@minjang (Collaborator Author) commented Jun 5, 2024

Nonetheless, the memory allocator still might be a reason. This is a good pointer.

Moving the creation of the output tensor out of the add functions 'fixes' the problem. So the problem seems to be in this allocation. Maybe the torch initializes tensors with zeros for big sizes or a garbage collection takes place here?

Thanks for the suggestion! Let me "fix" it as well.

@minjang (Collaborator Author) commented Jun 7, 2024

Updated:

  • Removed the chunk size override. For now, it's okay to use the default chunk size policy.
  • Introduced preallocation of the output for TritonCPU in the vector-add benchmark.
  • Removed the dependency on GetEnv.hpp.

At least for now, this initial OpenMP launcher should be good.

```
% python3 python/tutorials/01-vector-add.py
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192])
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192])
The maximum difference between torch-cpu and triton-cpu is 0.0
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192], device='cuda:0')
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192], device='cuda:0')
The maximum difference between torch-gpu and triton-gpu is 0.0
vector-add-performance (CPU_BLOCK_SIZE=4096, GPU_BLOCK_SIZE=1024):
           size  TritonCPU 1+pre  TritonCPU pre  TritonCPU 1   TritonCPU    TorchCPU    TritonGPU     TorchGPU
0        4096.0         0.321897       0.312638     0.512353    0.266988    1.003532    10.893617    10.741259
1        8192.0         1.680468       0.253968     0.660272    0.389941    1.980758    21.482518    21.186207
2       16384.0         1.265458       1.102377     1.119157    0.994537    4.398677    41.795919    41.234899
3       32768.0         1.898384       0.025930     1.201788    2.000009    5.412919    82.469799    81.377483
4       65536.0         2.338790       0.047724     2.306263    0.050069    0.050090   159.584422   157.538467
5      131072.0         7.719694      16.581940     8.433994    0.173002   18.489479   287.438585   282.482757
6      262144.0         7.617477      26.243797     6.414238   20.382074   12.042284   486.653476   489.074621
7      524288.0        13.967046      26.663678     6.975471   20.112033    2.151081   783.298835   771.011790
8     1048576.0        11.244772      50.954944     6.729201   38.159927   87.629926  1159.929234  1153.126078
9     2097152.0         8.302956       1.613608     6.132167   94.959996   69.786135  1524.093109  1483.833908
10    4194304.0         5.125372      73.253841     6.545267  139.382712  189.500988  1767.262872  1751.518877
11    8388608.0         5.471637     181.309510     2.984222   13.983397   13.686297  1973.480515  1958.734738
12   16777216.0         5.714181     234.262771     2.915569   13.067338    7.429136  2109.810944  2078.102643
13   33554432.0         5.825378      92.550439     2.952943    9.851228    9.754659  2168.345960  2173.214564
14   67108864.0         5.985693     107.708580     3.317379    9.899601   12.165736  2227.063999  2237.160951
15  134217728.0         5.997662      69.945116     2.988294   11.943508   12.079766  2246.046192  2260.775658
```

@minjang (Collaborator Author) commented Jun 7, 2024

It turned out torch.add can take a preallocated output buffer. Let's simply always use a preallocated output buffer for the CPU cases.

```
% python3 python/tutorials/01-vector-add.py
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192])
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192])
The maximum difference between torch-cpu and triton-cpu is 0.0
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192], device='cuda:0')
tensor([0.5151, 1.6826, 0.9153,  ..., 0.9852, 1.2714, 1.8192], device='cuda:0')
The maximum difference between torch-gpu and triton-gpu is 0.0
vector-add-performance (CPU_BLOCK_SIZE=4096, GPU_BLOCK_SIZE=1024):
           size  TritonCPU 1   TritonCPU    TorchCPU    TritonGPU     TorchGPU
0        4096.0     0.330691    0.003187    1.640316    11.050360    10.893617
1        8192.0     0.481347    0.011473    2.034690    21.787235    21.633802
2       16384.0     1.036711    0.012717    3.091976    42.372414    42.082190
3       32768.0     2.670090    0.025314   11.095744    83.591837    82.469799
4       65536.0     3.189345    0.048478    0.040158   161.684218   160.627450
5      131072.0    10.870723    9.019075    0.094202   290.840226   287.438585
6      262144.0     8.618670    7.680253   27.162953   493.989945   499.005073
7      524288.0     5.992274   25.221713   33.713298   792.774204   780.190482
8     1048576.0     6.782821   89.671679   65.595797  1166.813039  1159.929234
9     2097152.0     7.074116   68.878399    1.539270  1533.005829  1492.280845
10    4194304.0     5.453523  140.154074  237.239931  1773.240131  1755.428547
11    8388608.0     5.936637  226.831542  267.304201  1977.201751  1963.625499
12   16777216.0     5.669703   12.547080  235.037567  2108.396850  2077.759573
13   33554432.0     5.743683  165.028759  218.124103  2170.590254  2174.716971
14   67108864.0     7.190059  129.821671  127.623851  2227.458335  2237.558870
15  134217728.0     7.682400   70.595290   77.931064  2246.697937  2260.978822
```

@ienkovich (Collaborator) left a comment

Looks good! Thanks!

@ienkovich ienkovich merged commit a8d8838 into triton-lang:main Jun 10, 2024
@minjang minjang deleted the openmp-parallel-for branch June 13, 2024 05:02
minjang added a commit to minjang/triton-cpu that referenced this pull request Jun 22, 2024
* [CPU] Add OpenMP launcher

* Address the comments

* Fix induction variable type

* Always use preallocated output buffer for CPU with torch.add