[LANGUAGE] change default 32-bit dot precision to TF32x3 by ptillet · Pull Request #9080 · triton-lang/triton

ptillet · 2025-12-22T20:17:29Z

No description provided.

…)" This reverts commit 63b387c.

) Reverts #9080 as it causes a tmem allocation regression due to simplistic hoisting logic

* [AMD][gfx1250] Support test_extract_slice_concat_op.py (#8960)

  Add test configurations for threads_per_warp = 32, which is the case for GFX1250.

* [AMD] Make sure `tt.dot` dominates predecessors when pingpong (#9027)

  Make sure we place all predecessors of the dot before the dot. This PR fixes `language/test_tensor_descriptor.py/test_tensor_descriptor_batched_gemm_2d_tma` when enabling `AsyncCopy` on `gfx950`.

* [AMD] Fix shared order selection for direct-to-LDS loads on GFX9 (#9028)

  On GFX9, direct-to-LDS loads must write coalesced to LDS. This requires that the distributed order and the shared order agree across all dimensions covered by a single warp.

  This PR ensures that the shared order computed during pipelining preserves the fastest dimension based on getOrder, and then assigns remaining dimensions following the thread order. This approach guarantees that there are no gaps when writing to LDS for each warp.

  We cannot directly use the `threadOrder` because contiguous registers may exhaust the fastest dimension. For example, consider a `4x64x4xi32` tensor with the following layout:
  ```
  reg  = [[0, 0, 1], [0, 0, 2], [0, 0, 4]]
  lane = [[0, 1, 0], [0, 2, 0], [0, 4, 0], [0, 8, 0], [0, 16, 0], [0, 32, 0]]
  warp = [[1, 0, 0], [2, 0, 0]]
  ```
  Here:
  - getOrder returns `[2, 0, 1]`
  - getLaneOrder returns `[1, 0, 2]`

  But the required order is `[2, 1, 0]`. We achieve this by taking `order[0]` (fastest dimension) and then using `laneOrder` for the remaining unassigned dimensions.

  This PR fixes `language/test_tensor_descriptor.py:test_tensor_descriptor_reshape_matmul[float16]` when enabling `AsyncCopy`.

* Use fresh_knobs_except_libraries for some tests when setting `TRITON_PTXAS_BLACKWELL_PATH` (#9011)

  When using the `TRITON_PTXAS_BLACKWELL_PATH` envvar to override the default `ptxas-blackwell`, some tests still look for the hardcoded path and binary name for `ptxas-blackwell`.
  This PR fixes those tests to use `fresh_knobs_except_libraries` so that the proper environment variable value is used.

  # New contributor declaration
  - [x] I am not making a trivial change, such as fixing a typo in a comment.
  - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
  - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
  - Select one of the following.
    - [ ] I have added tests.
      - `/test` for `lit` tests
      - `/unittest` for C++ tests
      - `/python/test` for end-to-end tests
    - [x] This PR does not need a test because `it is fixing tests`
  - Select one of the following.
    - [x] I have not added any `lit` tests.
    - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

* [Backend] Delay wgmma wait(0) to first use of the accumulator (#9021)

  Currently we place a wait(0) immediately after any pipelined wgmma loop, but by delaying this to the first use of the accumulator we can overlap some of the epilogue with the mma.

  In non-persistent bf16xmxfp4 moe I'm seeing a modest (but repeatable) 0.4% speedup.

* [Reapply][BACKEND] Add missing waits in WGMMA rhs in register pipelining (#8997)

  This reverts #8970 with an additional fix to the non-persistent matmul kernel that fixes wgmma pipelining for bf16xmxfp4.

  The performance regression was caused by the use of `acc.trans()` inside the loop, which is seen as a use of the accumulator and triggers a wait 0 to be generated. The fix is to use the same pattern as `_p_matmul` and hoist the transpose out of the loop.
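The hoisting idea in #8997 can be illustrated outside of Triton. The following is a minimal pure-Python sketch, not the actual kernel code: it only shows that keeping the accumulator in transposed form and transposing once after the loop gives the same result as transposing it in every iteration. All function names are hypothetical, and plain nested lists stand in for tiles.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def accumulate_transposing_in_loop(a_tiles, b_tiles, m, n):
    """Transposes the accumulator in every iteration (the pattern to avoid)."""
    acc = [[0] * n for _ in range(m)]
    for a, b in zip(a_tiles, b_tiles):
        acc_t = transpose(acc)  # reads the accumulator inside the loop
        acc_t = add(acc_t, matmul(transpose(b), transpose(a)))
        acc = transpose(acc_t)
    return acc


def accumulate_hoisted(a_tiles, b_tiles, m, n):
    """Keeps the accumulator transposed and transposes once after the loop."""
    acc_t = [[0] * m for _ in range(n)]
    for a, b in zip(a_tiles, b_tiles):
        acc_t = add(acc_t, matmul(transpose(b), transpose(a)))
    return transpose(acc_t)
```

Both variants compute the sum of `a @ b` over all tile pairs, because `(a @ b)^T = b^T @ a^T`. Only the hoisted variant avoids touching the accumulator in a non-MMA operation inside the loop, which is the property the commit message says matters for wgmma pipelining.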
* [Nvidia] Revert ptxas 13.1 upgrade to 12.9.86 (#9016)

* [AMD][gfx1250] Support 4 Warps Scheduling in MXFP GEMM Gluon Kernel (#9031)

  This PR
  - refactored the mxgemm kernel to more easily adapt to different scheduling
  - supported 4 warps scheduling, i.e. slicing A along K and slicing B along N and K and reordering operations.
  - supported the case where activation doesn't have block scale
  - supported using async copy for scales as an option to relieve sgpr pressure

* [WS] store asyncOp per partition (#9007)

  * In `aref-tmem-insertion` we need to store `asyncOp` per partition, so that `put/get.exitOp` in nested cases uses the correct `asyncOp`, e.g.
  ```
  put.enter @1
  for .. {
    tc5mma @1
    put.exit <tc5mma> @1
    get.enter @0
    tmem_load @0
    get.exit <none> @0
    put.enter @1
  }
  put.exit <tc5mma> @1
  ```
  If we don't store `asyncOp` per partition, the `tmem_load` will override `asyncOp` with `<none>`, and we get `put.exit <none> @1` outside the loop.
  * A minor bug fix: a loop may have `tt.warp_specialize` but still not be partitioned if it doesn't satisfy the partitioning criteria, e.g. no tma_load, no tc5mma (https://github.com/triton-lang/triton/issues/8932)

  ---------

  Co-authored-by: evghenii <egaburov@nvidia>

* [KERNELS] closure-based output mapping for peer shards (#8999)

* [Frontend] Fix desc.shape values for fp4_padded tensor descriptors (#9012)

  This also strengthens the tensor descriptor validation code around fp4_padded tensors as I spent an embarrassing amount of time trying to figure out why the descriptor creation was failing, until I realised I was running on an h100 box...

* [mxfp] fix mx8 w_scale mask on Hopper (#8974)

  The w_scale masking on Hopper and in the non-persistent kernel was only correct for mxfp4, and not for mxfp8. Masking the weight values alone is not enough: in the unlucky case where a scale we missed masking happens to be NaN (0xff), multiplying that NaN by 0 is still NaN.
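The NaN behaviour described in #8974 is easy to reproduce in plain Python. The sketch below assumes the E8M0 scale encoding used by the MX formats (an 8-bit biased exponent with bias 127, where the byte `0xFF` encodes NaN); the helper names are hypothetical and this is not the kernel's masking code.

```python
import math

E8M0_BIAS = 127   # assumed exponent bias of the E8M0 scale format
E8M0_NAN = 0xFF   # byte value that encodes NaN, as noted in the commit message


def decode_e8m0(byte):
    """Decode an E8M0 scale byte to a float: 2**(byte - bias), or NaN for 0xFF."""
    if byte == E8M0_NAN:
        return math.nan
    return 2.0 ** (byte - E8M0_BIAS)


def scale_masking_value_only(value, scale_byte, in_bounds):
    """Masks only the value; an out-of-bounds NaN scale still poisons the result."""
    masked_value = value if in_bounds else 0.0
    return masked_value * decode_e8m0(scale_byte)


def scale_masking_both(value, scale_byte, in_bounds):
    """Masks the value and replaces an out-of-bounds scale with 1.0 (byte 127)."""
    masked_value = value if in_bounds else 0.0
    masked_scale = scale_byte if in_bounds else E8M0_BIAS
    return masked_value * decode_e8m0(masked_scale)
```

With an out-of-bounds element whose scale byte happens to be `0xFF`, the first helper returns NaN because `0.0 * nan` is NaN under IEEE 754, while the second returns `0.0`.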
  - [x] This PR does not need a test because `hard to create a unit test that gets 0xff in OOB accesses`.

* [BACKEND] Remove synchronisation in 2CTA mma (#8986)

  We have now disabled 2CTA mode in Triton, and this should now work as expected in Gluon, so there is no need for the workaround.

* [Gluon][Dialect] More fixes and verifier improvements (#9018)

  Improve TensorDescriptor, SMEM lowering, tmem_copy lowering. Importantly, this fixes a miscompile in the TMEMCopyOp lowering by changing `or -> add` because the TMEM base pointer is not necessarily aligned.

  ---------

  Co-authored-by: Peter Bell <peterbell10@openai.com>

* [FRONTEND] Support scaled bmm (#9000)

  Fix the issue that the check `verify_scaled_shape` added in https://github.com/triton-lang/triton/pull/8564 does not support bmm operands. Check the last 2 dims instead of the full shape of the tensor.

  - [x] I have added tests.

* Move gluon changes from #8981 into C++ verifiers (#9033)

  My attention was drawn to these changes because the verifier is too strict, in particular
  ```python
  assert tensor_desc.layout == smem.layout
  ```
  breaks dimension-reducing loads, which have different ranks between the descriptor and destination memory.

  The C++ verifier handles this correctly, and is better tested by virtue of being used in the normal triton compilation flow. So, it's my opinion that we should use the C++ verifiers as the main source of truth.

* [AMD][Test] Reduce num stages to avoid out of resource for mi350 (#9043)

* [AMD] Fix missing else case in deduceMinCountBetweeOps (#9034)

* [CI][AMD] Bring up new AMD runners as shadow CI (#9032)

  Bringing up new runners for GFX950 with improved stability. Using this as shadow CI before full migration.

  Signed-off-by: Stanley Winata <stanley.winata@amd.com>

* [AMD] Fix null uniformSum crash in CanonicalizePointers (#8991)

  Root cause: when the Gemm op has M=1 and the loop is fully unrolled (k = 256, BLOCK_K=128), all pointer offsets become compile-time constants. In rewriteSmallTensorPtr, both offset expressions are classified as "splatTensors" (constant tensors), leaving uniforms[] and nonUniforms[] empty. This causes uniformSum to remain uninitialized (NULL).
  Co-authored-by: jianlian <jianlian@amd.com>

* [AMD] NFC: Move direct-to-lds pre-condition checks to utilities (#9041)

  NFC: just moves the pre-condition checks for direct-to-lds loads to the utility file so we can use them in the pipeliner and refactor them in a follow-up PR.

  ---------

  Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [CONSAN] Fix for false positive deadlock detection (#9046)

  ConSan was ignoring barrier_expect, instead treating every `tma.async_copy_global_to_shared` as always arriving on a barrier. Because of this, if two or more `tma.async_copy_global_to_shared` were sharing a barrier, ConSan would model this as barrier overarrival and report a deadlock if such a barrier is waited on.

  The ultimate fix for this would be to add full support for counting transferred bytes and flipping the barrier phase when the expected count is reached. This PR proposes a simpler solution, which is treating `mbarrier.expect` as an arrive. This is more correct than the behavior before the fix, but still allows for deadlocks caused by a wrong transfer size in `expect`.

* [Gluon] More Blackwell tutorials (#8982)

  * TMA scatter and gather
  * tcgen05_copy
  * tcgen05_mma_scaled

  ---------

  Co-authored-by: Peter Bell <peterbell10@openai.com>

* Revert "[CI][AMD] Bring up new AMD runners as shadow CI" (#9049)

  Runner is migrated so reverting/removing `.test` runner label.

  Reverts triton-lang/triton#9032

* [PROTON] Significantly reduce `deactivate` and `get_data` overhead for cuda graph profiling and expose a new `get_data_msgpack` api (#9030)

  - 3x faster on `deactivate`
  - 10x faster on `get_data_msgpack` vs `get_data`

* [NFC] Update README for conference materials (#9009)

  Is there a google drive with the slides? If not I've created one where folks can upload. Would folks like to upload posters as well?

* [AMD][gfx1250] Use TDM predicate in f16 gemm variants (#9054)

  Use TDM enable/disable predicate to eliminate the if statements.
* [AMD] Refactor direct-to-lds warp coalescing check for GFX9 (#9048)

  Refactors `canCoalesceWriteIntoSharedMemory` to just check if we can divide the `srcToShared` layout by an identity layout for one instruction (based on contig). This tells us if warps write coalesced to LDS, which is a hardware requirement on `GFX9`. This also makes `doesSwizzleInsideWarp` redundant.

  The lowering now also correctly rejects subslices which break the warp coalesced writes (see added lit tests).

  Note that we pass the alloc shape instead of the `memdesc` because as a follow-up we will use this function in the pipeliner, where we do not have the complete memdesc when selecting layouts.

  ---------

  Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [AMD][gfx1250] Add persistent subtiled variant for WS f16 GEMM (#9052)

  Increase supported block size by using subtiling techniques to reduce shared memory usage and eliminate register spills.

* [AMD] Fix ignored atomic_cas sem and scope (#9042)

  The sem= (memory semantics) and scope= parameters in tl.atomic_cas() are currently ignored during code generation in the AMDGCN backend. Different values for these parameters produce identical assembly, which defeats the purpose of memory ordering control and breaks the memory model API. See godbolt examples here: https://godbolt.org/z/T4dMrbqcc

  This PR adjusts it.

* [AMD] NFC: include ArrayRef via mlir/support/LLVM (#9059)

  mlir/Support/LLVM.h introduces `using llvm::ArrayRef;` to the mlir namespace. Without it, compilation with utility.h fails in some setups.

* [AMD] NFC: remove std::move(temporary) (#9060)

  In both cases here we have moved a temporary, which is not needed. Clang also complains that "moving a temporary object prevents copy elision".

* [TRITONGPU][IR] Verify reduce/scan op axis value (#9061)

* [BACKEND] Fix uses of CGAEncodingAttr::getDefault (#9040)

  `getDefault`, as implemented, only made sense for 1CTA mode. In this PR we split it into two helper functions `get1CTALayout` and `get1DLayout`.
  We also add a verifier that checks that MemDescTypes have the right number of bases in their cga_layout/block attribute.

  We also fix the CoalesceUtils, which previously did not support multicta and now do. I will exercise this in a follow-up PR.

* [WS] Correctly assign stage / cluster annotation to aref ops for a block-arg producer (#8883)

  When the "produced value" is a block argument, such as the "row-max tensor" in attention, we are not assigning stage / cluster annotations to aref put ops and `local_store`. This is a bug, but it didn't manifest for fused loop / non-persistent kernel because the second `schedule-loop` pass that runs after WS annotates them. For nested-loop persistent attention, such "fixup" didn't work for some reason, so we end up getting (benign) errors from the pipeliner like `<unknown>:0: error: 'arith.addi' op not assigned a pipeline stage`.

  `RewritePartitionDependencies` was correctly handling this case, thanks to this logic https://github.com/triton-lang/triton/blob/de8e71503fea971dfb65308147798657e18f8568/lib/Dialect/TritonGPU/Transforms/WarpSpecialization/RewritePartitionDependencies.cpp#L231-L240. Porting that to `insert-aref` fixes this issue.

* [Nvidia][cuBLAS] Block-scaled matmul baselining (mxfp8, nvfp4) (#9044)

  # Summary
  Added cuBLAS baselining support for nvfp4 & mxfp8 block-scaled matrix multiplication, and extended Tutorial 10 to include performance comparisons between Triton and cuBLAS. mxfp4 and mixed precision matmul with cuBLAS are not supported as of 13.1 (see [cuBLASLt](https://docs.nvidia.com/cuda/cublas/index.html?highlight=CUBLASLT_MATMUL_DESC_FAST_ACCUM#using-the-cublaslt-api) API docs)

  ### Minor changes
  * Changed the `10-block-scaled-matmul.py` tutorial code to introduce multiple warmup iterations before benchmarking.
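As an illustration of the warmup change mentioned above, here is a minimal, generic timing sketch. It is not the tutorial's code: the `benchmark` helper and its parameters are hypothetical, and it only shows the pattern of discarding a few untimed calls before measuring.

```python
import time


def benchmark(fn, warmup=5, iters=20, timer=time.perf_counter):
    """Return the mean time per call of fn, ignoring the first `warmup` calls."""
    # Warmup calls absorb one-time costs such as compilation and cache fills.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = timer()
        fn()
        samples.append(timer() - start)
    return sum(samples) / len(samples)
```

For GPU kernels, launches are asynchronous, so a real measurement would also need to synchronize the device (or use device-side events) around the timed region; that part is omitted here.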
  # Test setup
  | Param | Value |
  |-------|-------|
  | OS | Ubuntu 24.04.3 LTS |
  | GPU | B200 (unlocked clocks) |
  | Driver | 580.105.08 |
  | Torch | 2.10.0.dev20251205+cu130 |
  | nvidia-cuBLAS | 13.1.0.3 |

  # Results
  ### 8192 x 8192 x 8192 mxfp8 example
  ```
  $ python 10-block-scaled-matmul.py --format mxfp8 -K 8192
  ✅ (pass mxfp8 - Triton and cuBLAS)
  Problem Shape = 8192x8192x8192
  Done benchmarking
  2484.174 8852.130 ROOT
  ├─ 2201.650 4994.035 block_scaled_matmul_kernel_mxfp8 [M=8192, N=8192, K=8192]
  └─ 2849.882 3858.095 cublas [M=8192, N=8192, K=8192]
     └─ nan 3858.095 nvjet_sm100_qqhsh_128x256_128x6_2x1_2cta_v_bz_Avec32UE8M0_Bvec32UE8M0_TNT
  ```
  ### 8192 x 8192 x 8192 nvfp4 example
  ```
  $ python 10-block-scaled-matmul.py --format nvfp4 -K 8192
  ✅ (pass nvfp4 - Triton and cuBLAS)
  Problem Shape = 8192x8192x8192
  Done benchmarking
  4681.998 4696.762 ROOT
  ├─ 4004.368 2745.781 block_scaled_matmul_kernel_nvfp4 [M=8192, N=8192, K=8192]
  └─ 5635.685 1950.981 cublas [M=8192, N=8192, K=8192]
     └─ nan 1950.981 cutlass3x_sm100_bstensorop_s256x256x64gemm_block_scaled_ue4m3xf4_ue4m3xf4_f32_f16_f16_256x256x256_0_tnn_align32_o_vs16_2sm_bias_f16_relu
  ```

  - [x] I have added tests.
    - extended `python/test/unit/runtime/test_blaslt.py`

* [AMD][gfx1250] Enable matmul tests (#9065)

* [BACKEND] Add support for TMA with multicast (#9005)

  For now we just add enough machinery to be able to run it on its own. We will add the logic needed to run it in a loop with `tcgen05.mma` in the next PR.

* [AMD] Add custom FuncOpConversion (#8962)

  Add custom FuncOpConversion pattern to make sure functions don't have unused/irrelevant nvvm attributes.
* [INTERPRETER] A few typing and efficiency cleanups (#9072)

* [AMD][KERNELS] Improve matmul ogs config on RDNA (#8052)

  The default configuration for matmul_ogs resulted in significant register spilling on RDNA when the batch size exceeded 512. With this optimization, we observe a notable improvement in GPT-OSS end-to-end performance.

  gpt-oss-20B on 1x GPU

  | Device | Optimization | Throughput (reqs/sec) | Relative throughput |
  |--------|--------------|-----------------------|---------------------|
  | Navi31-48GB | Baseline | 3.06 | 100% |
  | Navi31-48GB | Config Optimization | 4.47 | 146% |
  | Navi48-32GB | Baseline | 5.63 | 100% |
  | Navi48-32GB | Config Optimization | 6.66 | 118% |

* [AMD] Add missing bitcast to fix extract_element type error (#9075)

  The subsequent `b.extract_element` requires an aggregate type (e.g. vector) as input. In the else branch (when `scale` is false), we bitcast each element of `pkVals` to `v2i32` (2-element i32 vector) to match the type requirement of `extract_element`, aligning with the type handling logic in the `scale` branch.

* [PROTON] Clean up context source after tearing down through `finalize` (#9069)

  Previously there were some non-deterministic failures of proton on the CI.

  Suppose we have the following code
  ```
  with proton.scope("a"):
      proton.finalize(session0)
      session1 = proton.start()
  ```
  session1 will end up with an imbalanced context stack if we don't clean up thread-local state in the context source.
* [PROTON][NFC] Document experimental APIs (#9056)

* [PROTON] Ignore metric kernels' timing data in the final profile (#9058)

* [SWP] Fix barrier location in loop lowering for MMA op with non-pipelined operands (#8732)

  Consider the following example IR:
  ```
  %y_16 = tt.descriptor_load %y_desc[%c0_i32, %y] {loop.cluster = 1 : i32, loop.stage = 0 : i32} : !tt.tensordesc<tensor<64x64xbf16, #shared>> -> tensor<64x64xbf16, #blocked>
  %y_17 = ttg.local_alloc %y_16 {loop.cluster = 1 : i32, loop.stage = 0 : i32} : (tensor<64x64xbf16, #blocked>) -> !ttg.memdesc<64x64xbf16, #shared, #smem>
  %acc_18 = ttng.tc_gen5_mma %x_12, %y_17, %acc_13[%acc_15], %acc, %true {loop.cluster = 1 : i32, loop.stage = 0 : i32, tt.self_latency = 1 : i32} : !ttg.memdesc<64x64xbf16, #shared, #smem>, !ttg.memdesc<64x64xbf16, #shared, #smem>, !ttg.memdesc<64x64xf32, #tmem, #ttng.tensor_memory, mutable>
  %acc_19, %acc_20 = ttng.tmem_load %acc_13[%acc_18] {loop.cluster = 0 : i32, loop.stage = 1 : i32} : !ttg.memdesc<64x64xf32, #tmem, #ttng.tensor_memory, mutable> -> tensor<64x64xf32, #blocked1>
  ```
  The loop lowering step will attempt to determine the barrier location to mark the MMA as "done" based on the earliest of the TMEM load or a non-pipelined operand. However, the current implementation leverages `schedule.isOpBefore`, which is inaccurate because it tells us which operand happens first, not which operation happens first in the body of the loop. For example, it would indicate that `tt.descriptor_load` comes before `ttng.tmem_load`. We need to update this check so it accounts for the fact that the operands may occur before the MMA, and therefore the location comparison should be for the invocation "after" the first MMA operation.

* [KERNELS] Add reduce_forward metadata, improve perf. (#9068)

* [Backend][Test] Fix inspect stages unit test (#9081)

* [LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)

* [TOOLS] Add hip support for link.py (#9084)

  * Use the same link cpp src except for hipStream/CUstream etc.
  * Add a link.h prelude for AMD/Nvidia to adapt for the difference.
  * Enable test_aot.py for AMD.
  * Also rename AMD's compile.cpp to compile.c.

* [Backend] Move TMA index translation from mid-end to lowerings (#9082)

  Currently the behavior of fp4_padded is different between `triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if the data is int8, while the latter is indexed by individual fp4 elements, which is what the TMA hardware expects. This now gets leaked into gluon, which isn't ideal.

  So, this PR moves the translation into the lowerings. Along the way, this probably fixes quite a few bugs as there were several places where the translation was missing.

* Fix address sanitizer stack-use-after-scope (#9088)

  std::make_tuple here will copy the arguments into a tuple, so it creates a copy of SmallVector subsliceOffsets and then passes back a tuple with an ArrayRef. The SmallVector object is then out of scope.

  Bypassing make_tuple means that it uses the underlying AllocationSlice's reference to subsliceOffsets rather than the temporary copy created by make_tuple.

* Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3" (#9090)

  Reverts triton-lang/triton#9080 as it causes a tmem allocation regression due to simplistic hoisting logic

* [AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)

  Enables `ttg.async_copy_global_to_local` for pipelined loads by default on `gfx950` and `gfx1250`.

  This increases LDS consumption because we replace one register buffer with an additional LDS buffer. After this change, the number of LDS buffers is equal to `num_stages` (previously it was `num_stages - 1`). Therefore, some test configs need to be skipped because we run out of shared memory capacity on `gfx950`.
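The buffer-count change described for #9087 can be turned into a quick back-of-the-envelope check. The sketch below is illustrative only: the helper names, the tile shapes, and the assumed LDS capacity are hypothetical values chosen for the example, not taken from the PR.

```python
def lds_buffers(num_stages, async_copy):
    """LDS buffers per pipelined operand: num_stages with AsyncCopy, else one fewer."""
    return num_stages if async_copy else num_stages - 1


def lds_bytes(tile_shapes, elem_bytes, num_stages, async_copy):
    """Total LDS bytes needed to buffer all operand tiles."""
    per_stage = sum(rows * cols * elem_bytes for rows, cols in tile_shapes)
    return per_stage * lds_buffers(num_stages, async_copy)


# Illustrative configuration: fp16 A tile 256x64 and B tile 64x256, 3 stages.
TILES = [(256, 64), (64, 256)]
ASSUMED_LDS_CAPACITY = 160 * 1024  # assumption for this example; check the hardware docs

before = lds_bytes(TILES, elem_bytes=2, num_stages=3, async_copy=False)
after = lds_bytes(TILES, elem_bytes=2, num_stages=3, async_copy=True)
```

Under these assumed numbers the configuration needs 128 KiB before the change and 192 KiB after it, so it would fit in the assumed capacity only without `AsyncCopy`. This is the kind of config the commit message says has to be skipped.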
  ---------

  Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [PROTON][TEST] Test proton on tensor descriptor and two-cta mode (#9070)

* [AMD] Use contiguity hint for buffer ops (#9089)

  Enables an approach similar to #8752 in the AMD backend for buffer ops. This helps to preserve vectorization based on kernel annotations when converting to buffer_load/store on the AMD backend.

* [Kernels] Fix hopper mxfp4 swizzled kernels for num_warps=4 (#9029)

  Currently the tests always use `num_warps=8` for hopper scale swizzling; this makes the required changes to get it working for 4 warps. I also update the layout tests to test 4 and 8 warps as well.

* [BACKEND] Do not pipeline loops containing asserts or prints (#9055)

  https://github.com/triton-lang/triton/pull/6180 already disabled it for AMD but it got lost during refactoring. Instead of reintroducing it as AMD-specific, I added it to the common utilities.

  Without the change the lit tests also fail to compile for the nv pipeliner because we cannot predicate those ops.

* Correct the filename of ProxFenceInsertion (#9077)

  ### Changes
  - Renamed the file `ProxFenceInsertion.cpp` to `ProxyFenceInsertion.cpp`
  - Updated all relevant references to match the new file name

  ### Reason
  - The original file name contained a typo and was inconsistent with other references in the project
  - This renaming ensures consistency and avoids confusion

* Properly expose arguments to WS partitions (#9023)

  While we explicitly expose the region relationships between the WS operations and the regions they enclose, we do not expose how the operands to the `WarpSpecializeOp` are passed to the partition regions. Doing so allows us to eliminate the remaining explicit handling of `WarpSpecializeOp` in dataflow analyses.

* [NFC] Improve robustness of ext slice rematerialization (#9019)

  When trying to extend the instructions to rematerialise to include the backward slice of the ext, we currently perform the following steps:
  1. Attempt to obtain the backward slice of the ext independently
  2. Validate that the slice can be safely merged with the current slice
  3. Merge the slices

  Instead, we can attempt to directly extend the existing slice and just roll back if that fails. This is potentially slightly less efficient, since we copy the slice and layout, but it means that the validation happens entirely within `getConvertBackwardSlice`.

  This organisation would have sidestepped the bug fixed in #7058, and the bug that caused #8292 to be reverted.

  While investigating this, I have found that the code for hoisting into conditionals may be prone to the same bug, since it also attempts to extend an existing slice. But I will keep that for a separate PR since it is more complex and I need to write a test case.

* [distributed] pass symm_mem_pool by argument; fix matmul launch_metadata (#9092)

* [AMD][gfx1250] Fix for enabling language/test_core.py (#9095)

* [AMD] Add finite/isfinited for AMD libdevice.py (#9097)

  - fp32 finite calls __ocml_isfinite_f32.
  - fp64 isfinited calls __ocml_isfinite_f64.

* Restore add_stages_inspection_hook in test (#9096)

* Fix Kernel Argument missing in LLVM Debug Infomation (#9002)

  This change fixes the missing kernel arguments in LLVM Debug Information.

  The background: when creating a GPU memory trace for a GPU kernel run, there is no `DW_TAG_formal_parameter` entry in the `DWARF` section extracted from the kernel binary.
  At the LLVM IR level, taking a simple vector add kernel as an example,
  ```py
  import torch
  import triton
  import triton.language as tl


  @triton.jit
  def vector_add_triton(x_ptr: torch.Tensor, y_ptr: torch.Tensor, out_ptr: torch.Tensor, n_elements: int, BLOCK_SIZE: tl.constexpr):
      pid = tl.program_id(axis=0)
      block_start = pid * BLOCK_SIZE
      offsets = block_start + tl.arange(0, BLOCK_SIZE)
      mask = offsets < n_elements
      x = tl.load(x_ptr + offsets, mask=mask)
      y = tl.load(y_ptr + offsets, mask=mask)
      output = x + y
      tl.store(out_ptr + offsets, output, mask=mask)
  ```
  the expected LLVM IR would be
  ```c
  !0 = distinct !DICompileUnit(language: DW_LANG_C, file: !1, producer: "triton", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug)
  !1 = !DIFile(filename: "vector_add.py", directory: "")
  !5 = distinct !DISubprogram(name: "vector_add_triton", linkageName: "vector_add_triton", scope: !1, file: !1, line: 12, type: !6, scopeLine: 12, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !11)
  !6 = !DISubroutineType(cc: DW_CC_normal, types: !7)
  !7 = !{null, !8, !8, !8, !10}
  !8 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !9, size: 64, dwarfAddressSpace: 1)
  !9 = !DIBasicType(name: "float", size: 32, encoding: DW_ATE_float)
  !10 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
  !11 = !{!12, !13, !14, !15}
  !12 = !DILocalVariable(name: "x_ptr", arg: 1, scope: !5, file: !1, line: 12, type: !8)
  !13 = !DILocalVariable(name: "y_ptr", arg: 2, scope: !5, file: !1, line: 12, type: !8)
  !14 = !DILocalVariable(name: "out_ptr", arg: 3, scope: !5, file: !1, line: 12, type: !8)
  !15 = !DILocalVariable(name: "n_elements", arg: 4, scope: !5, file: !1, line: 12, type: !10)
  !16 = !DILocation(line: 12, scope: !5)
  !17 = !DILocalVariable(name: "offsets", scope: !5, file: !18, type: !10)
  ```
  while the triton compiler output before this change is
  ```c
  !5 = distinct !DISubprogram(name: "vector_add_triton", linkageName: "vector_add_triton", scope: !1, file: !1, line: 12, type: !6, scopeLine: 12, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0)
  !6 = !DISubroutineType(cc: DW_CC_normal, types: !7)
  !7 = !{}
  ```
  The cause is that when the `LLVMDIScope` pass creates the `LLVM::DISubprogramAttr`, it ignores the kernel arguments. This fix creates an `LLVM::LocalVariableAttr` for each valid argument and passes them to the `LLVM::DISubprogramAttr` in `retainedNodes`.

  After this change, the kernel arguments can be found in the `DWARF` section
  ```c
  ...
  0x0000000c: DW_TAG_compile_unit
                DW_AT_producer ("triton")
                DW_AT_language (DW_LANG_C)
                DW_AT_name ("triton_vector_add.py")
                DW_AT_str_offsets_base (0x00000008)
                DW_AT_stmt_list (0x00000000)
                DW_AT_comp_dir ("/home/leling/tests")
                DW_AT_low_pc (0x0000000000001700)
                DW_AT_high_pc (0x0000000000001790)
                DW_AT_addr_base (0x00000008)
                DW_AT_loclists_base (0x0000000c)

  0x00000027:   DW_TAG_subprogram
                  DW_AT_low_pc (0x0000000000001700)
                  DW_AT_high_pc (0x0000000000001790)
                  DW_AT_linkage_name ("vector_add_triton")
                  DW_AT_name ("vector_add_triton")
                  DW_AT_decl_file ("/root/tests/vector_add_triton.py")
                  DW_AT_decl_line (12)
                  DW_AT_external (true)

  0x00000031:     DW_TAG_formal_parameter
                    DW_AT_name ("x_ptr")
                    DW_AT_decl_file ("/root/tests/vector_add_triton.py")
                    DW_AT_decl_line (12)
                    DW_AT_type (0x0000006f "float *")

  0x00000039:     DW_TAG_formal_parameter
                    DW_AT_name ("y_ptr")
                    DW_AT_decl_file ("/root/tests/vector_add_triton.py")
                    DW_AT_decl_line (12)
                    DW_AT_type (0x0000006f "float *")

  0x00000041:     DW_TAG_formal_parameter
                    DW_AT_name ("out_ptr")
                    DW_AT_decl_file ("/root/tests/vector_add_triton.py")
                    DW_AT_decl_line (12)
                    DW_AT_type (0x0000006f "float *")

  0x00000049:     DW_TAG_formal_parameter
                    DW_AT_name ("n_elements")
                    DW_AT_decl_file ("/root/tests/vector_add_triton.py")
                    DW_AT_decl_line (12)
                    DW_AT_type (0x00000067 "int")

  0x00000051:     DW_TAG_variable
                    DW_AT_location (indexed (0x0) loclist = 0x00000018:
                       [0x0000000000001718, 0x0000000000001720): DW_OP_consts +6, DW_OP_stack_value
                       [0x0000000000001720, 0x0000000000001750): DW_OP_regx VGPR0)
                    DW_AT_name ("offsets")
                    DW_AT_type (0x00000067 "int")
  ...
  ```

* [AMD][KERNELS] Enable reduce and expert sharding tests (#9110)

  Co-authored-by: jianlian <jianlian@amd.com>

* [AMD][Backend] Fix atomic cas for non int types (#9116)

  Users should be aware that float atomic-cas does a bit-wise compare, not a floating-point compare (e.g. NAN != NAN). This is the same behavior as on the NVIDIA backend.

* [nvws][insert_aref] Modify aref.get creation to filter results not in scheduled loop (#9114)

  This is done because otherwise the result might not have its last use (in either direction) inside the scheduled loop and we will not be able to get stageClusterEnter and/or stageClusterExit.

  - [x] I have added tests.

  Fixes https://github.com/triton-lang/triton/issues/8072

* [Backend][NFC] FuncOpToLLVM: Move handleArgPtrDatatype to Utility.h (#9120)

  Make handleArgPtrDatatype a utility function to avoid code duplication.

* Fix handling conflicting layouts when hoisting convert into conditionals (#9083)

  I noticed that hoisting into conditionals is prone to the same layout conflict issue that we have previously fixed in hoisting above ext/broadcast ops (#7058).
  Fix this by attempting to extend the existing slice and layout map when considering rematerialisation.

* Skip values with existing conversions in getConvertBackwardSlice (#8291)

  `getConvertBackwardSlice` currently includes any existing rematerialisation it finds in the returned slice. However, this shouldn't be necessary because that value does not need to be rematerialised or included in the cost calculation. Instead, once we find a valid rematerialisation, we can stop the traversal right away.

  - [x] This PR does not need a test because it is a simplification and existing tests pass.

* [AMD] ReorderInstructions: Remove sinkSecondLoad optimization (#9119)

  This optimization triggers only on a very limited number of cases and no longer has any visible impact on the perf.

  - [x] This PR does not need a test because `it removes previously added tests`.

* [AMD] Generalize CTA layout fields in WMMA layout (#8946)

  ctaLayout is a linear layout describing how warps are arranged across WMMA tiles. Previously, this information was encoded using warpsPerCTA and tilesPerWarp parameters. For instance, a configuration with 4 warps, represented as:

      warpsPerCTA = [2, 2], tilesPerWarp = [1, 1]

  would translate to:

      ctaLayout = {reg = [], warp = [[0, 1], [1, 0]]}

  By default, WMMA assumes that each warp in a CTA computes exactly one WMMA tile. In the grid below, each w* label indicates which warp computes that tile:

      w0 w1 w0 w1
      w2 w3 w2 w3
      w0 w1 w0 w1
      w2 w3 w2 w3

  To express more complex layouts, we must also account for repetitions within the mapping. For example, the configuration formerly described as:

      warpsPerCTA = [2, 2], tilesPerWarp = [2, 2]

  would translate to:

      ctaLayout = {reg = [[0, 1], [1, 0]], warps = [[0, 2], [2, 0]] }

      w0 w0 w1 w1
      w0 w0 w1 w1
      w2 w2 w3 w3
      w2 w2 w3 w3

  This parameter provides a more general way to define warp mappings than what warpsPerCTA and tilesPerWarp alone could express. For instance:

      ctaLayout = {reg = [[1, 0], [0, 1]], warps = [[0, 2], [2, 0]]}

  still represents a layout similar to:

      warpsPerCTA = [2, 2], tilesPerWarp = [2, 2]

  but with a different ordering of repetitions.

  The motivation for this broader formulation comes from the need to describe swizzled warp layouts, which help avoid LDS partition conflicts on architectures such as gfx1250.
A valid example of such swizzled configuration is: ctaLayout = {reg = [[2, 0]], warps = [[2, 1], [1, 0]]} With corresponding mapping: w0 w1 <- second tile computed by w1 w2 w3 w0 w1 <- first tile computed by w1 w2 w3 Note that ctaLayout naturally composes with layout defined on a single WMMA tile to form final WMMA layout. wmmaLayout = tileLayout * ctaLayout This simplifies both WMMA and dotOperand layouts lowering to linear layout. --------- Co-authored-by: Ognjen Plavsic <plognjen@amd.com> * [NFC][gluon] Fix GluonSimplifyControlFlow's description (#9066) The previous description for the pass was copy pasted from GluonInline pass. Modify the description to match what the pass does.  # New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [x] This PR does not need a test because `it fixes a pass description`. - Select one of the following. - [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) * [Backend][NFC] Change getClusterCTAId to return 0 for 1 CTA (#9125) Follow up based on [this comment on a previous PR](https://github.com/triton-lang/triton/pull/8790#discussion_r2549358694) to return `0` if we compile for a single CTA. This also removes some guard when lowering load ops in the AMD backend which are no longer required. * [AMD][BACKEND] Add TDM L2 prefetch support (#9086) Adds support for prefetching a TDM tile into `L2`. 
The underlying intrinsic/instruction expects an address and prefetches 256 bytes into L2 (hardware enforces 256-byte alignment); it cannot directly consume a TDM descriptor. Therefore, to prefetch tiles defined by a TDM descriptor, we need to cooperatively prefetch the TDM tile across CTAs, warps and lanes to keep the number of prefetch instructions low. The newly added Op supports returning its prefetched memory location. This is purely for testing purposes, since the Op has no observable side effects on its own. A follow-up PR will expose this Op to Gluon and add unit tests based on the returned offsets.
* [AMD] Support single block execute regions in `UpdateAsyncWaitCount` (#9126) `UpdateAsyncWaitCount` has to walk through `scf.ExecuteRegion` (with a single block in its region) to compute correct wait counts when warp pipelining.
* [Backend] Support llvm struct and array type to DITypeAttr (#9124) This change supports `DITypeAttr` conversion for `llvm.struct` and `llvm.array` in LLVM DI passes as a supplement to https://github.com/triton-lang/triton/pull/9002. In practice, we found that some kernels with tensordesc inputs, such as
```mlir
tt.func public @tensor_descriptor_load_store_nd_kernel_host_tdm(%arg0: !tt.tensordesc<tensor<8x4xsi32, #shared>>, %arg1: i32, %arg2: i32, %arg3: i64, %arg4: i64, %arg5: !tt.tensordesc<tensor<8x4xsi32, #shared>>, %arg6: i32, %arg7: i32, %arg8: i64, %arg9: i64)
```
are lowered to
```mlir
llvm.func @tensor_host_tdm_copy_kernel(%arg0: !llvm.struct<(i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32)>, %arg1: i32, %arg2: i32, %arg3: i64, %arg4: i64,
```
and then fail in the `LLVMDIScope` pass because there is no conversion from `LLVMStructType` to `DITypeAttr`. This PR aims to fix this.
* fix flood of warning related num_stage and shared_mem capacity (#9121) The intention was to warn only once using ``warnings.warn``, but because the message was built with an f-string, many distinct warning messages were generated.
Piggybacking a small change in the matmul kernel that removes an unused variable definition. # New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [x] This PR does not need a test because this is just logging change. - Select one of the following. - [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)
* [Backend] NFC: drop accidentally checked in file (#9128) This was added in https://github.com/triton-lang/triton/pull/8923.
* [Backend] Add uniform hint to ttg.warp_id (#9073) This commit adds an attribute to `omitUniformHint` for the NVIDIA backend to control the cases where we would like to emit nvvm.shfl.sync idx 0 in a targeted manner.
* Update ptxas to 12.9 for hopper and ampere (#8983)
* [Frontend] Fix setting attr on constexpr argument (#9053) Specialization contains a pair for each argument, the first of which is the argument's type (e.g. int/ptr/constexpr). For a dynamic (non-constexpr) argument, the second element is a string describing the attribute of the argument (e.g. 'D' means divisibility by 16); for a constexpr argument, the second element is the argument's compile-time value. For example:
```
[(i32, 'D'), (float, ''), (constexpr, 'CDNA'), (constexpr, 128)]
```
Only dynamic arguments have a corresponding MLIR func-op argument; constexpr arguments are already propagated/inlined and not passed as runtime arguments.
So when determining func-op arguments' attrs from specialization, we should filter-out constexpr arguments * [AMD][gfx1250] Fix Gluon AMDWMMALayout rank consistency (#9127) * Ensure NVWSDialect is loaded for NVGPUWarpSpecialization (#9131) Ensure that the dialect is loaded so we can run reproducers involving this pass using `triton-opt`. * [AMD][gfx1250] Avoid cluster load for scalar pointers (#9129) This fixes a crash which is checked in as a lit test. * [PROTON] Simplify runtime and metric correlation to reduce overhead (#9132) * [TRITON_KERNELS] some refactoring (#9134) * Revert "[TRITON_KERNELS] some refactoring" (#9140) Reverts triton-lang/triton#9134 * Support JITFunction in preload (#8794) * [AMD] Fix vector size for padded encodings with direct to lds loads (#9149) For architectures not supporting scattering (GFX9) we can only support padding intervals which are a multiple of `vectorWidth * warpSize` because we can only add padding at warp boundaries. This PR properly enforces this to bail out for such encodings. Note that we do not create such layouts right now, but will be later. * [AMD] Emit detailed error when failing to choose mfma/wmma instruction (#9143) This PR improves the error message emitted when failing to choose mfma/wmma instruction on AMD backend. Previously, the error claimed the failure was due to "unsupported element type", but in reality the issue could also be caused by: - Invalid MFMA instruction shape (e.g., unsupported or misaligned `kDim`) - Unsupported MFMA version for the target GPU The new message includes the actual parameters (`mfmaVersion`, `[m,n,k]`, element types) and guides users to check AMD architecture documentation, making debugging much easier. 
Fixes https://github.com/triton-lang/triton/issues/9141 Co-authored-by: xinheng <xinheng.dx@alibaba-inc.com> * Revert "[distributed] pass symm_mem_pool by argument; fix matmul launch_metadata" (#9155) Reverts triton-lang/triton#9092 * [hopper][WS] Support tt.split/join in data partition (#456) (#9147) Add support for `SplitOp` and `JoinOp` operations in the WSDataPartition pass. This enables data partitioning for tensor split and join operations along the M dimension during warp specialization. Key changes: - Add `SplitOp` and `JoinOp` to `getBackwardSliceToPartition` to include them in backward slice traversal - Add handling for `SplitOp` and `JoinOp` in `sliceOp` function to correctly partition these operations - Fix a bug where multi-result operations (like `SplitOp` which returns two tensors) were not having all their results sliced correctly - the `cloneAndSetResultType` lambda now iterates over all results instead of just the first one - changes to backward slice traversal and sliceOp function Also adds a lit test that covers partitioning of: - `tt.trans` (transpose) - `tt.reshape` (reshape with allow_reorder) - `tt.join` (join two tensors) - `tt.split` (split into two tensors) * Pass explicit captures to WarpSpecializePartitionsOp (#9133) Explicit captures are currently operands on `WarpSpecializeOp` but the regions that consume them are owned by the `WarpSpecializePartitionsOp` in the enclosed holder region. That is, the block arguments to the actual partitions do not get their values from an immediate parent region. This has been mostly fine, but some MLIR common code assumes that the operands of a `RegionBranchOpInterface` operation flow directly into the regions that it encloses. For example, attempting to use a `SparseBackwardDataFlowAnalysis` will trigger a crash. To address this, move the explicit captures to be operands on the operation that holds the non-default partition regions. 
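The filtering rule described in the `[Frontend] Fix setting attr on constexpr argument (#9053)` entry above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: types are modelled as strings, and `dynamic_arg_attrs` is a hypothetical helper name, not a function from the Triton code base.

```python
def dynamic_arg_attrs(specialization):
    """Return the attribute strings of the dynamic (non-constexpr) arguments.

    Position i in the result corresponds to func-op argument i, because
    constexpr arguments are inlined and never become runtime arguments.
    """
    return [attr for arg_type, attr in specialization if arg_type != "constexpr"]


# Same shape as the example in the commit message; types are plain strings here.
specialization = [("i32", "D"), ("float", ""), ("constexpr", "CDNA"), ("constexpr", 128)]

attrs = dynamic_arg_attrs(specialization)
for index, attr in enumerate(attrs):
    print(f"func-op arg {index}: attr={attr!r}")
```

Without the filter, the constexpr values (`'CDNA'`, `128`) would be paired with func-op argument positions that do not exist.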
* [PROTON][TRITONGPU] Deprecate Proton's `GlobalScratchAllocOp` and use TritonGPU's `GlobalScratchAllocOp` with a custom backend instead (#8976) * [WGMMA] Limit rs-dot splitting to only 2 splits (#9152) When pipelining wgmma with the left hand argument in registers, we split the dot along the K dimension which allows us to effectively "multi-buffer" the wgmma by having multiple sets of registers which are not used by the wgmma at the same time. Currently rs dots are split into `K // instrK` dot ops which theoretically maximizes parallelism, but in practice I see that splitting into 2 gives the best performance. This gives ~1% uplift in bf16 x mxfp4 moe. * [Gluon] Add local scatter/gather for Gluon (#8480)  It is often desirable to operate on matrix sections when using Gluon for numerical linear algebra. For example, accessing the diagonal to compute the trace or updating a single column. Currently these operations require loading the entire matrix into registers, performing a warp shuffle, selection, and store. This results in poor performance. This PR proposes adding specialized scatter and gather operations for operating on 2D tensors in shared memory. The API for these operations is similar to the scatter and gather operations in Triton except they accept two 1D tensors for indexing into the 0th and 1st dimension respectively. For example, a gather with two `tl.arange` reads the diagonal of a matrix. A gather with one `tl.arange` and a `tl.zeros` reads a column vector. These operations are implemented by passing the runtime index register values through the instructions generated by the inverted shared memory linear layout. # New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [x] I have added tests. 
- `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) * Control `driver.active` with `TRITON_DEFAULT_BACKEND` (#9144) When multiple backend drivers are active, Triton fails when `triton.runtime.driver.active` is accessed. This seems like an unnecessary limitation: multiple drivers _could_ be active at the same time. The `...driver.active` API is used over 100 times across this code base; it likely does not make sense to change that API. Instead, this change adds an environment variable, `TRITON_DEFAULT_BACKEND`, that allows users to select which active backend to use. If not present, `...driver.active` continues to work as before but, if it is, the default driver is constructed from the name of the backend given, e.g., `TRITON_DEFAULT_BACKEND=nvidia`. This allows users to run examples and tests using a different backend without modifying those files (though, of course, one can still programmatically set the active driver with `...driver.set_active()`. # New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [x] This PR does not need a test because this adds a new configuration option to hack with, not core functionality. - Select one of the following. 
- [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) * [Gluon] Tigthen + cleanup TMA op verifiers and add "illegal instruction" sanitize mode (#9112) * Cleanup and refactor TMA op verifiers * Apply `fp4_padded` restrictions to some TMA ops * Relax rank-reduce verifier to allow non-unit leading dims as long as numels remains the same * Add an "iisan" mode that generates runtime checks to guard against illegal instructions from TMA operations * [TritonGPU][AutoWS] Fix autows when mixing TMA and non-TMA loads (#9111) When mixing non-TMA loads with warp-specialized MMAs and TMA loads, we still need LowerLoops to insert the MMA `wait_barrier` based on the self latency. This is achieved with some fiddling in AssignLatencies to set the right self-latency based on whether all the MMA operands are going to be warp specialized. At the same time, this changes `lowerMMA` to also handle warp-specialized MMAs. This PR contains various other things: * error message improvements * Hopper autoWS is turned off when `numStages <= 1` to avoid various downstream issues * add more exhaustive warp spec integration tests * turn on consan for a subset of warp spec integration tests for better coverage * add a few positive consan tests * Fix flakiness in test_line_info_ir_source (#9161) * [Blackwell] Enable tcgen05 MMA for sm110 (Jetson Thor) (#9160) Building on [the work in openxla](https://github.com/openxla/xla/pull/34705), I confirmed that most MMA tests are passing on sm110 with tc05 enabled: `test_warp_specialization` all pass, `test_matmul.py` mostly ok except for some off-by-one errors, and tut10 runs fine. 
Closes https://github.com/triton-lang/triton/issues/8896 * Broadcast support in cat (#9163) Fix tt.cat for broadcasting layouts. * [Kernels] Use ex2.approx.ftz in swiglu (#9164) This seems to get most of the speedup of the other version but without compromising numerics. Since we add 1 after the exponential, any denormal values would be lost to rounding anyway. It's worth about 1-2 GBps, or around 0.1% in bf16 x mxfp4 moe. Not world-changing, but it is repeatable. * [AMD] Add rint to amd libdevice (#9166) * Fix test conflict with cuda module (#9165) PyTorch now depends on the `cuda-bindings` package which introduces a module called `cuda` (https://github.com/pytorch/pytorch/pull/167769). The presence of an `__init__.py` in the cuda test directory causes pytest to treat it as a module, and it resolves to the new aforementioned module, causing test failures when Triton tests are run with the latest PyTorch nightly installed. To fix this, delete the `__init__.py`, and rename the tests that have naming conflicts. * [Testing] Propagate err_msg to np in triton's assert_close (#9170) Pass `err_msg` arg to `np.testing.assert_allclose` for richer error message * [AMD][GLUON] Expose TDM L2 prefetch (#9148) Followup based on https://github.com/triton-lang/triton/pull/9086 which introduced the prefetch `amdgpu` op and lowering. Since L2 prefetch has no side effects, apart from segfaulting for non speculative prefetches, we optionally return the offsets to verify we prefetch the correct memory locations (`_test_prefetch_with_offsets`). `get_shape_from_tensor` is required to deduplicate the logic computing the shape/layout of the returned offsets. * [AMD][gfx1250] Skip FP4 matmul tests packed along M/N (#9176) * [AMD][gfx1250] Skip MFMA-specific tests for GFX1250 (#9177) This PR skips MFMA-layout-specific checks on GFX1250. 
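The numerics argument in the `[Kernels] Use ex2.approx.ftz in swiglu (#9164)` entry above (adding 1 after the exponential discards any denormal result anyway) can be checked with a small pure-Python sketch. It emulates fp32 with the `struct` module; `flush_to_zero` is a hypothetical stand-in for the `.ftz` behaviour, not the PTX instruction itself.

```python
import struct

FP32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal fp32 value


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush_to_zero(x: float) -> float:
    """Emulate flush-to-zero: subnormal fp32 values become 0 (sign ignored here)."""
    return 0.0 if abs(x) < FP32_MIN_NORMAL else x


# A value that is subnormal in fp32, standing in for a tiny exponential result.
denormal = to_fp32(2.0 ** -130)

with_denormal = to_fp32(1.0 + denormal)
with_ftz = to_fp32(1.0 + flush_to_zero(denormal))

print(denormal != 0.0, with_denormal == with_ftz)
```

Both sums round to exactly 1.0, so flushing the denormal to zero before the addition does not change the result.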
* Revert "[Backend] Delay wgmma wait(0) to first use of the accumulator" (#9179) Reverts triton-lang/triton#9021 * [AMD] ReorderInstructions: Remove sinkDotConversion optimization (#9139) This optimization is basically no-op because ConvertLayout operations are already replaced with local_load by ReduceDataDuplication by the time ReorderInstructions runs. * [AMD] Fix wrong vectorization width of atomic-rmw with mask (#9142) * [Gluon] Expose finer grained cluster fences (#9076) We also expose the utility `fence_cluster_init` that ensures that all the CTAs see the initialisation of the mbarriers before multiCTA ops. * Remove calls to llvm_update_compile_flags in CMake (#9167) I noticed that `llvm_update_compile_flags ` in upstream LLVM now depends on also including HandleLLVMOptions (llvm/llvm-project#174084). We could pull that in but it enables a lot of warnings globally, and seems likely to create additional churn when updating LLVM. Instead, we can just disable exceptions and rtti ourselves, and remove calls to the LLVM provided helpers. This makes the library builds easier to follow, and reduces the chance of further upstream LLVM changes breaking the build. * [BACKEND] Pick better layout for small async_cp (#9183) When picking layout for small loads we may decide to coalesce less to avoid convert_layout. When those become async_cp we can decouple the src layout to the destination one, so we can pick better layout for async_cp * Adding clone for triton_kernels.tensor.Tensor (#9178) * [BACKEND] Add tcgen05.mma + multicast support (#9071) * [PROTON] Periodic profiling dumping (#9150) * [AMD] Add TritonAMDGPUSinkLayoutConversions pass (#9168) Add a new pass to sink LayoutConversion operations instead of using ReorderInstructions pass to do that. Eventually, we want each pass to have a clear purpose and ability to run/test separately. 
* [TRITON_KERNELS] Reland refactor of tensor/layout/distributed (#9186) * [TRITON_KERNELS] removing debug print (#9187) * [BACKEND] Fix tt.scan with a broadcasted layout (#9185) We fix the case when there is broadcasting in register or lanes Fixes https://github.com/triton-lang/triton/issues/7817 Supersedes https://github.com/triton-lang/triton/pull/9123 * [Backend] Fix more tt.scan layout issues (#9189) Following https://github.com/triton-lang/triton/pull/9185, I asked codex to find other issues with regression tests. It hacked around the issue, but this was enough for me to find the real issue and fix it properly. Great team work. We should audit generally other uses of `linearize`/`delinearize` as those that use the legacy APIs will most likely be broken when used with broadcasted layouts. * fix strided layout handling when setting requires_persistent (#9198) Follow-up https://github.com/triton-lang/triton/pull/9186 now ``StridedLayout`` has "STRIDED" as the name not None so we can't rely on ``layout.name is None`` to check if the layout is strided. Because of this, we were hitting ``assert supports_persistent`` when running with strided layout on Blackwell. Ideally, a unit test should've checked this but currently unit tests are overriding ``is_persistent`` via ``constraints`` so it's not going to take the code path hitting ``assert supports_persistent``. * [BACKEND] Improve and simplify ReduceOp's lowering (#9192) We implement a LinearLayout-based `ReduceOp` lowering. This has a number of benefits: - The logic is noticeably simpler as we barely have to implement anything. 
ConvertLayout and some LL helpers do all the heavy lifting
- We get shmem swizzling for free
- We sometimes save a shmem round-trip (before we did it unconditionally)
- It is now clear that we have a `tmpLl` variable we can carefully choose (we'll do so in a future PR)
- It opens the door to returning an arbitrary layout (fusing a `convert_layout` into this op)
- It is now really simple to generalise this op to perform cross-cluster reductions, provided that `convert_layout` supports them.
- We fix some latent issues the previous implementation had when run on arbitrary linear layouts. We add a funky regression test that used to fail and now passes.
- All this while being LOC-neutral!

In future PRs we will improve the choice of `tmpLl` to avoid the last `convert_layout` in many cases, and we will pack the inputs in shmem to be able to vectorize the load/stores for full reductions with multiple inputs. This PR was the result of quite a long (but rather successful) vibe-coding session together with `gpt-5.2-codex`. I found it particularly useful to be able to emit a ConvertLayout within this lowering rather than having to call the lowering of the function manually. This simplifies the code quite a bit and I would have struggled to convince MLIR to do so myself. TODO: Benchmark
* [TEST] Update fresh_knobs fixture default behavior (#9184) Refactor the default behavior of fresh_knobs in test fixtures.
  - `fresh_knobs` (default): now preserves library paths (build, nvidia, amd knobs). Most tests need CUDA toolkit paths to compile successfully. Respects environment variables like `TRITON_PTXAS_BLACKWELL_PATH`.
  - `fresh_knobs_including_libraries` (new): resets ALL knobs, including library paths.
* Guard sync point in launch_metadata with launch_metadata_allow_sync (#9197)
* [AMD] relax padded layout heuristics to smaller block size (#9074) Leveraging wrap-around due to padding, we can still get a bank-conflict-free padded shared layout when the block size is smaller than 16KB.
Take Mx64xbf16 with K contiguous, kWidth=8, and mfma16x16 as an example (rX stands for row X): the minimal block size can be 32x64. Padding here is set to 16 elements (32 bytes) to avoid bank conflicts. We can pack r0, r4, r8, r12, r16, r20, r24, r28 to compose a contiguous tile:
```
r0[0+], r0[8+], r1[0+], r1[8+], r2[0+], r2[8+], r3[0+], r3[8+], r4[0+], r4[8+], r5[0+], r5[8+], r6[0+], r6[8+], r7[0+], r7[8+], r8[0+], r8[8+],
```
In LDS, the rows are arranged as below:
```
r0, r4, r8, r12, r16, r20, r24, r28
pad, r1, r5, r9, r13, r17, r21, r25
r29, pad, r2, r6, r10, r14, r18, r22, r26, r30, pad, r3, r7, r11, r15, r19, r23, r27, r31
```
* Revering 77f610a - Adding clone for triton_kernels.tensor.Tensor (#9208) Clone turned out not to be needed for now, reverting to limit complexity.
* [AMD] ReorderInstructions: Remove moveUpTranspose optimization (#9204) This optimization moves TransOps closer to their defs, but that doesn't actually have any impact on the generated code because the actual transpose code is generated by ConvertLayout ops.
* [AMD] skip test_gather[src_shape2-indices_shape2-0] for RDNA (#9210) The test_gather[src_shape2-indices_shape2-0] test fails on RDNA3 and RDNA4 GPUs with: triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 131072, Hardware limit: 65536. Extend the existing skip condition (which already covers CDNA2 and CDNA3) to also include RDNA3 and RDNA4 GPUs.
* [Backend] Bump to llvm/llvm-project@8f264586d752 (#8987) This picks up several ROCDL changes: llvm/llvm-project#171810 llvm/llvm-project#171449 llvm/llvm-project#169672 Co-authored-by: Lei Zhang <antiagainst@gmail.com> * Revert "[BACKEND] Improve and simplify ReduceOp's lowering" (#9214) temporarily revert triton-lang/triton#9192 as it changes the numerics * [TRITON_KERNELS] remove warning (#9213) * [LAYOUTS] Tighten LinearEncoding checks (#9215) Before, we mistakenly allowed repeated non-zero bases * Skip InsertTmemAref if function does not use WS (#9212) Most autoWS passes automatically skip code that does not use autoWS, but InsertTmemAref does not end up doing that because it needs to examine `TMEMAlloc`s that occur outside WS regions as well. This can cause assertion failures when the assumptions baked into the autoWS implementation are violated (e.g. that an alloc must only have a single use if there is no token). * [AMD][NFC] Replace llvm instrinsic calls with rocdl ops (#9222) PR replaces amdgpu intrinsic calls with ROCDL ops which were recently exposed to the ROCDL dialect in MLIR * Allow proton to record metadata for selective kernels (#9158) * [Nvidia] Enable TMA im2col mode -- AsyncTMACopyGlobalToLocalOp (#9202) ### Summary This is the first PR in a series that enables TMA im2col mode (in addition to the existing tiled mode) for NVIDIA GPUs. The overall goal of the series is to support TMA im2col mode in the Triton compiler and the Gluon DSL. The PTX ISA documentation for TMA im2col mode is available here: https://…

…bit dot precision to TF32x3 (#9080)' (facebookexperimental#1329) Summary: This is a cherry-pick of an upstream PR: triton-lang/triton#9080 Upstream commit message: ``` > [LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080) ``` Conflict Resolution: - File: python/triton/language/core.py Action: Removed conflict markers; kept the local "where the first dimension..." line and updated docstring to use tf32x3 instead of tf32. Did not add the upstream-introduced assert/if input_precision body code, since the local code path delegates input_precision processing to semantic.py. Reason: The local file was refactored to move input_precision default-setting logic from core.py.dot() to semantic.py. Adding the upstream body code here would duplicate logic and be unreachable. - File: python/triton/language/semantic.py Action: Updated supports_tf32 check and default value from "tf32" to "tf32x3" in the input_precision branch of the dot() method. Reason: This file holds the actual default-precision logic locally; matching upstream's intent of changing the default precision from tf32 to tf32x3 requires updating it here. Raw Conflicts: https://www.internalfb.com/intern/paste/P2283333395/ Resolution Diff: https://www.internalfb.com/intern/paste/P2283336430/ Diff Comparison: https://www.internalfb.com/intern/paste/P2283337118/ ***Do not remove the following line from this commit*** Reactor Cherry-pick Revision: 63b387c Reviewed By: sfzhu93 Differential Revision: D101982808
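For readers unfamiliar with what the `"tf32x3"` value of `input_precision` buys over `"tf32"`, the following pure-Python sketch emulates the commonly used 3xTF32 scheme: each fp32 operand is split into a TF32 "big" part and a TF32 "small" residual, and three products are summed. This is an illustration under assumptions, not Triton's implementation: TF32 is emulated by truncating the 13 low mantissa bits (hardware conversion may round instead), accumulation happens in fp64, and all helper names are hypothetical.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x: float) -> float:
    """Emulate TF32 by zeroing the 13 low mantissa bits of the fp32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]


def dot_tf32(a, b):
    """Single-pass dot product with both operands reduced to TF32."""
    return sum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))


def dot_tf32x3(a, b):
    """Three-pass dot product: big*big + big*small + small*big."""
    total = 0.0
    for x, y in zip(a, b):
        x_big, y_big = to_tf32(x), to_tf32(y)
        # The residual left after removing the big part, again reduced to TF32.
        x_small, y_small = to_tf32(x - x_big), to_tf32(y - y_big)
        total += x_big * y_big + x_big * y_small + x_small * y_big
    return total


a = [to_fp32(v) for v in (1 / 3, 2 / 7, 5 / 9, 7 / 11)]
b = [to_fp32(v) for v in (3 / 13, 4 / 17, 6 / 19, 8 / 23)]

reference = sum(x * y for x, y in zip(a, b))  # fp32 inputs, fp64 arithmetic
err_tf32 = abs(dot_tf32(a, b) - reference)
err_tf32x3 = abs(dot_tf32x3(a, b) - reference)

print(f"tf32 error:   {err_tf32:.3e}")
print(f"tf32x3 error: {err_tf32x3:.3e}")
```

The three-pass variant recovers most of the precision lost by the single TF32 pass at roughly three times the multiply cost, which is the trade-off behind changing the default.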


…ault 32-bit dot precision to TF32x3" (#9090)' (facebookexperimental#1330) Summary: Pull Request resolved: facebookexperimental#1330 This is a cherry-pick of an upstream PR: triton-lang/triton#9090 Upstream commit message: ``` > Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3" (#9090) > > Reverts triton-lang/triton#9080 as it cause some tmem allocation > regression due to simplistic hoisting logic ``` Conflict Resolution: - File: python/triton/language/core.py Action: Removed conflict markers; kept the local "where the first dimension..." line. Reverted docstring lines from tf32x3 back to tf32 to match upstream's revert. Reason: Same divergence as the original cherry-pick of #9080 (assert/if input_precision body lives in semantic.py locally). Maintained that local refactor by reverting only the docstring here. - File: python/triton/language/semantic.py Action: Reverted supports_tf32 check and default value from "tf32x3" back to "tf32" in the input_precision branch of the dot() method. Reason: Mirror revert: the prior cherry-pick of #9080 applied the tf32x3 change here (instead of upstream's core.py location); this revert undoes it in the same place. Raw Conflicts: https://www.internalfb.com/intern/paste/P2283342039/ Resolution Diff: https://www.internalfb.com/intern/paste/P2283342643/ Diff Comparison: https://www.internalfb.com/intern/paste/P2283343204/ ***Do not remove the following line from this commit*** Reactor Cherry-pick Revision: 606eebc Reviewed By: sfzhu93 Differential Revision: D101982800

…bit dot precision to TF32x3 (#9080)' (#1329)

Summary:
Pull Request resolved: #1329

This is a cherry-pick of an upstream PR: triton-lang/triton#9080

Upstream commit message:
```
> [LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line and updated docstring to use tf32x3 instead of tf32. Did not add the upstream-introduced assert/if input_precision body code, since the local code path delegates input_precision processing to semantic.py.
  Reason: The local file was refactored to move input_precision default-setting logic from core.py.dot() to semantic.py. Adding the upstream body code here would duplicate logic and be unreachable.
- File: python/triton/language/semantic.py
  Action: Updated supports_tf32 check and default value from "tf32" to "tf32x3" in the input_precision branch of the dot() method.
  Reason: This file holds the actual default-precision logic locally; matching upstream's intent of changing the default precision from tf32 to tf32x3 requires updating it here.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283333395/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283336430/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283337118/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 63b387c

Reviewed By: sfzhu93

Differential Revision: D101982808

fbshipit-source-id: 89337aa0f9307edb2f456fb38cc7f2c83418e991

…ault 32-bit dot precision to TF32x3" (#9090)' (#1330)

Summary:
Pull Request resolved: #1330

This is a cherry-pick of an upstream PR: triton-lang/triton#9090

Upstream commit message:
```
> Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3" (#9090)
>
> Reverts triton-lang/triton#9080 as it cause some tmem allocation
> regression due to simplistic hoisting logic
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line. Reverted docstring lines from tf32x3 back to tf32 to match upstream's revert.
  Reason: Same divergence as the original cherry-pick of #9080 (assert/if input_precision body lives in semantic.py locally). Maintained that local refactor by reverting only the docstring here.
- File: python/triton/language/semantic.py
  Action: Reverted supports_tf32 check and default value from "tf32x3" back to "tf32" in the input_precision branch of the dot() method.
  Reason: Mirror revert: the prior cherry-pick of #9080 applied the tf32x3 change here (instead of upstream's core.py location); this revert undoes it in the same place.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283342039/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283342643/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283343204/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 606eebc

Reviewed By: sfzhu93

Differential Revision: D101982800

fbshipit-source-id: fd9eeb84ccb01ad8f126269c5c88fd71d76114ba

[LANGUAGE] change default 32-bit dot precision to TF32x3

32d10a7
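For readers unfamiliar with the name in the commit title: TF32x3 usually refers to the "3xTF32" trick, where each fp32 operand is split into a TF32 "big" part plus a TF32 "small" remainder, and three TF32 products are summed to recover most of the fp32 accuracy. The sketch below emulates this in pure Python to show the numerical effect. It is an illustration only, under several assumptions: TF32 is modeled by rounding fp32 to 10 mantissa bits, accumulation happens in Python's fp64 floats, and none of this reflects how Triton lowers `tl.dot` on a GPU.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x: float) -> float:
    """Emulate TF32: keep the top 10 of fp32's 23 mantissa bits.

    Assumption: round half away from zero on the 13 dropped bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot_tf32(a, b):
    """Dot product with both operands rounded to TF32 (one product per pair)."""
    return math.fsum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))


def dot_tf32x3(a, b):
    """Dot product using three TF32 products per pair (3xTF32 emulation)."""
    terms = []
    for x, y in zip(a, b):
        x_big, y_big = to_tf32(x), to_tf32(y)
        # The remainders hold the bits that the first rounding dropped.
        x_small, y_small = to_tf32(x - x_big), to_tf32(y - y_big)
        # The small*small term is dropped; it is below fp32 resolution.
        terms.append(x_small * y_big + x_big * y_small + x_big * y_big)
    return math.fsum(terms)


if __name__ == "__main__":
    a = [to_f32(1 / 3)] * 8
    b = [to_f32(1 / 7)] * 8
    exact = math.fsum(x * y for x, y in zip(a, b))
    print("tf32   abs error:", abs(dot_tf32(a, b) - exact))
    print("tf32x3 abs error:", abs(dot_tf32x3(a, b) - exact))
```

The cost of the extra accuracy is roughly three tensor-core products per dot instead of one. In Triton itself, a kernel can pin the mode explicitly through the `input_precision` argument of `tl.dot`, so it does not depend on whichever default is in effect.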

ptillet merged commit 63b387c into main Dec 23, 2025
9 checks passed

ptillet deleted the phil/default-tf32x3 branch December 23, 2025 02:45

ThomasRaoux added a commit that referenced this pull request Dec 23, 2025

Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)"

3bd5cc1

This reverts commit 63b387c.

ThomasRaoux mentioned this pull request Dec 23, 2025

Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3" #9090

Merged

ThomasRaoux added a commit that referenced this pull request Dec 23, 2025

Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3" (#9090)

606eebc

Reverts #9080 as it causes some tmem allocation regressions due to simplistic hoisting logic.

agron911 mentioned this pull request Apr 24, 2026

[triton][beta] [Cherry-pick][RESOLVED] '[LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)' (#1329) facebookexperimental/triton#1329

Closed

[LANGUAGE] change default 32-bit dot precision to TF32x3 #9080

ptillet merged 1 commit into main from phil/default-tf32x3

ptillet commented Dec 22, 2025
