[LANGUAGE] change default 32-bit dot precision to TF32x3 #9080

Merged: ptillet merged 1 commit into main from phil/default-tf32x3 on Dec 23, 2025

@ptillet (Collaborator) commented Dec 22, 2025

No description provided.

@ptillet ptillet merged commit 63b387c into main Dec 23, 2025
9 checks passed
@ptillet ptillet deleted the phil/default-tf32x3 branch December 23, 2025 02:45
ThomasRaoux added a commit that referenced this pull request Dec 23, 2025
ThomasRaoux added a commit that referenced this pull request Dec 23, 2025

Reverts #9080, as it causes a tmem allocation
regression due to simplistic hoisting logic.
azazhu pushed a commit to triton-lang/Triton-to-tile-IR that referenced this pull request Feb 9, 2026
* [AMD][gfx1250] Support test_extract_slice_concat_op.py (#8960)

Add test configurations for threads_per_warp = 32, which is the case for
GFX1250.

* [AMD] Make sure `tt.dot` dominates predecessors when pingpong (#9027)

Make sure we place all predecessors of the dot before the dot. This PR
fixes
`language/test_tensor_descriptor.py/test_tensor_descriptor_batched_gemm_2d_tma`
when enabling `AsyncCopy` on `gfx950`.

* [AMD] Fix shared order selection for direct-to-LDS loads on GFX9 (#9028)

On GFX9, direct-to-LDS loads must write coalesced to LDS. This requires
that the distributed order and the shared order agree across all
dimensions covered by a single warp.

This PR ensures that the shared order computed during pipelining
preserves the fastest dimension based on getOrder, and then assigns
remaining dimensions following the thread order. This approach
guarantees that there are no gaps when writing to LDS for each warp.
We cannot directly use the `threadOrder` because contiguous registers
may exhaust the fastest dimension. For example, consider a `4x64x4xi32`
tensor with the following layout:

```
reg = [[0, 0, 1], [0, 0, 2], [0, 0, 4]]
lane = [[0, 1, 0], [0, 2, 0], [0, 4, 0], [0, 8, 0], [0, 16, 0], [0, 32, 0]]
warp = [[1, 0, 0], [2, 0, 0]]
```
Here:
- getOrder returns `[2, 0, 1]`
- getLaneOrder returns `[1, 0, 2]`

But the required order is `[2, 1, 0]`. We achieve this by taking
`order[0]` (fastest dimension) and then using `laneOrder` for the
remaining unassigned dimensions.
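
The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeliner's actual code: the function name `select_shared_order` is hypothetical, and the inputs are the `getOrder` and `getLaneOrder` results quoted above.

```python
def select_shared_order(order, lane_order):
    """Keep the fastest dimension of `order`, then follow `lane_order`."""
    shared_order = [order[0]]  # fastest dimension comes from getOrder
    for dim in lane_order:
        # Assign the remaining dimensions in lane order, skipping used ones.
        if dim not in shared_order:
            shared_order.append(dim)
    return shared_order


# Values from the 4x64x4xi32 example in the description.
print(select_shared_order([2, 0, 1], [1, 0, 2]))  # [2, 1, 0]
```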

This PR fixes
`language/test_tensor_descriptor.py:test_tensor_descriptor_reshape_matmul[float16]`
when enabling `AsyncCopy`

* Use fresh_knobs_except_libraries for some tests when setting `TRITON_PTXAS_BLACKWELL_PATH` (#9011)

When using the `TRITON_PTXAS_BLACKWELL_PATH` envvar to override the
default `ptxas-blackwell` some tests still look for the hardcoded path
and binary name for `ptxas-blackwell`. This PR fixes those tests to use
`fresh_knobs_except_libraries` so that the proper environment variable
value is used.




# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `it is fixing tests`

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best
    practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)

* [Backend] Delay wgmma wait(0) to first use of the accumulator (#9021)

Currently we place a wait(0) immediately after any pipelined wgmma loop,
but by delaying this to the first use of the accumulator we can overlap
some of the epilogue with the mma. In non-persistent bf16xmxfp4 moe I'm
seeing a modest (but repeatable) 0.4% speedup.

* [Reapply][BACKEND] Add missing waits in WGMMA rhs in register pipelining (#8997)

This reverts #8970 with an additional fix to the non-persistent matmul
kernel that fixes wgmma pipelining for bf16xmxfp4.

The performance regression was caused by the use of `acc.trans()` inside
the loop which is seen as a use of the accumulator and triggers a wait 0
to be generated. The fix is to use the same pattern as `_p_matmul` and
hoist the transpose out of the loop.

* [Nvidia] Revert ptxas 13.1 upgrade to 12.9.86 (#9016)

* [AMD][gfx1250] Support 4 Warps Scheduling in MXFP GEMM Gluon Kernel (#9031)

This PR:
- refactors the mxgemm kernel so it adapts more easily to different
scheduling schemes
- supports 4-warp scheduling, i.e. slicing A along K, slicing B along
N and K, and reordering operations
- supports the case where the activation doesn't have a block scale
- supports using async copy for scales as an option to relieve sgpr
pressure

* [WS] store asyncOp per partition (#9007)

* In `aref-tmem-insertion` we need to store `asyncOp` per partition, so
that `put/get.exitOp` in nested cases uses the correct `asyncOp`, e.g.

```
put.enter @1
for ..  {
    tc5mma  @1
    put.exit <tc5mma> @1
    get.enter @0
    tmem_load @0
    get.exit <none> @0
    put.enter @1
}
put.exit <tc5mma> @1
```

if we don't store `asyncOp` per partition, the `tmem_load` will override
`asyncOp` with `<none>`, and we get `put.exit <none> @1` outside the
loop.

* A minor bug fix: a loop may carry `tt.warp_specialize` yet remain
unpartitioned if it doesn't satisfy the partitioning criteria, e.g. no
tma_load, no tc5mma (https://github.com/triton-lang/triton/issues/8932)
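
To illustrate the first item, the toy Python trace below replays the nested example with a single loop iteration. It is only a sketch of the bookkeeping problem, not the pass itself: `TRACE` and `resolve_exits` are hypothetical names, and `None` stands in for `<none>`.

```python
# (op, partition) pairs for the nested example, with one loop iteration.
TRACE = [
    ("put.enter", 1),
    ("tc5mma", 1), ("put.exit", 1),
    ("get.enter", 0), ("tmem_load", 0), ("get.exit", 0),
    ("put.enter", 1),
    ("put.exit", 1),  # the exit emitted after the loop
]


def resolve_exits(trace, per_partition):
    """Return (op, partition, async_op) for every exit in the trace."""
    last_async = {}
    exits = []
    for op, partition in trace:
        key = partition if per_partition else "shared"
        if op == "tc5mma":
            last_async[key] = "tc5mma"
        elif op == "tmem_load":
            last_async[key] = None  # synchronous op records <none>
        elif op.endswith(".exit"):
            exits.append((op, partition, last_async.get(key)))
    return exits


print(resolve_exits(TRACE, per_partition=False)[-1])
print(resolve_exits(TRACE, per_partition=True)[-1])
```

With a single shared slot, the final `put.exit` in partition 1 picks up the `<none>` left by `tmem_load`; with one slot per partition it keeps `tc5mma`.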

---------

Co-authored-by: evghenii <egaburov@nvidia>

* [KERNELS] closure-based output mapping for peer shards (#8999)

* [Frontend] Fix desc.shape values for fp4_padded tensor descriptors (#9012)

This also strengthens the tensor descriptor validation code around
fp4_padded tensors as I spent an embarrassing amount of time trying to
figure out why the descriptor creation was failing, until I realised I
was running on an h100 box...

* [mxfp] fix mx8 w_scale mask on Hopper (#8974)

The w_scale masking on Hopper and in the non-persistent kernel was only
correct for mxfp4, not for mxfp8.
Masking the weight values alone does not help: in the unlucky case where
a scale we failed to mask happens to be NaN (0xff), multiplying that NaN
by 0 still yields NaN.
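
A small Python sketch of why masking the weights is not enough. It assumes the E8M0 scale encoding used by the MX formats, where the byte 0xFF denotes NaN and every other byte `e` denotes 2^(e - 127); `decode_e8m0` is a hypothetical helper, not a Triton function.

```python
import math


def decode_e8m0(bits):
    """Decode an 8-bit E8M0 scale: 0xFF is NaN, otherwise 2**(bits - 127)."""
    return math.nan if bits == 0xFF else 2.0 ** (bits - 127)


masked_weight = 0.0                # weight value after masking
stale_scale = decode_e8m0(0xFF)    # out-of-bounds scale that was not masked
valid_scale = decode_e8m0(127)     # scale of 1.0

print(masked_weight * stale_scale)  # NaN propagates through the product
print(masked_weight * valid_scale)  # a finite scale gives a clean zero
```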

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `hard to create a unit test
    that gets 0xff in OOB accesses`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best
    practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)

* [BACKEND] Remove synchronisation in 2CTA mma (#8986)

We have now disabled 2CTA mode in Triton, and this should now work as
expected in Gluon, so there is no need for the workaround.

* [Gluon][Dialect] More fixes and verifier improvements (#9018)

Improve TensorDescriptor, SMEM lowering, tmem_copy lowering.

Importantly, this fixes a miscompile in the TMEMCopyOp lowering by
changing `or -> add` because the TMEM base pointer is not necessarily
aligned.
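
The difference between the two operations is easy to see with plain integers. The sketch below uses made-up addresses (they are not real TMEM values) to show that `or` and `add` agree only when the base has no bits in common with the offset.

```python
def offset_with_or(base, offset):
    """Combine base and offset with bitwise OR (valid only if bits are disjoint)."""
    return base | offset


def offset_with_add(base, offset):
    """Combine base and offset with ordinary addition (always valid)."""
    return base + offset


offset = 0x4
aligned_base = 0x100    # hypothetical base whose low bits are clear
unaligned_base = 0x104  # hypothetical base that already has bit 2 set

print(hex(offset_with_or(aligned_base, offset)), hex(offset_with_add(aligned_base, offset)))
print(hex(offset_with_or(unaligned_base, offset)), hex(offset_with_add(unaligned_base, offset)))
```

For the aligned base both forms give the same address; for the unaligned base the `or` form silently drops the offset.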

---------

Co-authored-by: Peter Bell <peterbell10@openai.com>

* [FRONTEND] Support scaled bmm (#9000)

Fix the issue that the check `verify_scaled_shape` added in
https://github.com/triton-lang/triton/pull/8564 does not support bmm
operands.

Check the last 2 dims instead of full shape of the tensor.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best
    practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)

* Move gluon changes from #8981 into C++ verifiers (#9033)

My attention was drawn to these changes because the verifier is too
strict, in particular
```python
assert tensor_desc.layout == smem.layout
```
breaks dimension-reducing loads, which have different ranks between the
descriptor and destination memory.

The C++ verifier handles this correctly, and is better tested by virtue
of being used in the normal triton compilation flow. So, it's my opinion
that we should use the C++ verifiers as the main source of truth.

* [AMD][Test] Reduce num stages to avoid out of resource for mi350 (#9043)

* [AMD] Fix missing else case in deduceMinCountBetweeOps (#9034)

* [CI][AMD] Bring up new AMD runners as shadow CI (#9032)

Bringing up new runners for GFX950 with improved stability. Using this
as shadow CI before full migration.

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

* [AMD] Fix null uniformSum crash in CanonicalizePointers (#8991)

Root cause:
When the Gemm op has M=1 and the loop is fully unrolled (k = 256,
BLOCK_K=128), all pointer offsets become compile-time constants. In
rewriteSmallTensorPtr, both offset expressions are classified as
"splatTensors" (constant tensors), leaving uniforms[] and nonUniforms[]
empty. This causes uniformSum to remain uninitialized (NULL).

Co-authored-by: jianlian <jianlian@amd.com>

* [AMD] NFC: Move direct-to-lds pre-condition checks to utilities (#9041)

NFC just moves the pre-condition checks for direct-to-lds loads to the
utility file so we can use them in the pipeliner and refactor them in a
follow up PR.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [CONSAN] Fix for false positive deadlock detection (#9046)

ConSan was ignoring barrier_expect and instead treated every
`tma.async_copy_global_to_shared` as always arriving on a barrier.
Because of this, if two or more `tma.async_copy_global_to_shared` ops
shared a barrier, ConSan would model this as barrier overarrival and
report a deadlock if that barrier was waited on.
The ultimate fix would be to add full support for counting transferred
bytes and flipping the barrier phase once the expected count is reached.
This PR proposes a simpler solution: treating `mbarrier.expect` as an
arrive. This is more correct than the previous behavior, but it still
leaves deadlocks caused by a wrong transfer size in `expect` undetected.
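
The two models can be compared with a toy arrival counter. This is only an illustration of the reasoning above, not ConSan's implementation; the event names and `count_arrivals` are hypothetical.

```python
def count_arrivals(events, expect_is_arrive):
    """Count modeled arrivals on one barrier for a list of event names."""
    arrivals = 0
    for event in events:
        if event == "expect" and expect_is_arrive:
            arrivals += 1  # new model: the expect is the single arrive
        elif event == "tma_copy" and not expect_is_arrive:
            arrivals += 1  # old model: every TMA copy counts as an arrive
    return arrivals


# Two TMA copies share one barrier initialized with an arrival count of 1.
EVENTS = ["expect", "tma_copy", "tma_copy", "wait"]
BARRIER_COUNT = 1

old_arrivals = count_arrivals(EVENTS, expect_is_arrive=False)
new_arrivals = count_arrivals(EVENTS, expect_is_arrive=True)
print("old model over-arrives:", old_arrivals > BARRIER_COUNT)
print("new model is balanced:", new_arrivals == BARRIER_COUNT)
```

Neither model looks at the byte count passed to `expect`, which is why a wrong transfer size still goes unnoticed.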

* [Gluon] More Blackwell tutorials (#8982)

* TMA scatter and gather
* tcgen05_copy
* tcgen05_mma_scaled

---------

Co-authored-by: Peter Bell <peterbell10@openai.com>

* Revert "[CI][AMD] Bring up new AMD runners as shadow CI" (#9049)

Runner is migrated so reverting/removing `.test` runner label.

Reverts triton-lang/triton#9032

* [PROTON] Significantly reduce `deactivate` and `get_data` overhead for cuda graph profiling and expose a new `get_data_msgpack` api (#9030)

3x faster on `deactivate`
10x faster on `get_data_msgpack` vs `get_data`

* [NFC] Update README for conference materials (#9009)

Is there a google drive with the slides?
If not I've created one where folks can upload.

Would folks like to upload posters as well?

* [AMD][gfx1250] Use TDM predicate in f16 gemm variants (#9054)

Use TDM enable/disable predicate to eliminate the if statements.

* [AMD] Refactor direct-to-lds warp coalescing check for GFX9 (#9048)

Refactors `canCoalesceWriteIntoSharedMemory` to just check if we can
divide the `srcToShared` layout by an identity layout for one instruction
(based on contig). This tells us if warps write coalesced to LDS which
is a hardware requirement on `GFX9`. This also makes
`doesSwizzleInsideWarp` redundant.

The lowering now also correctly rejects subslices which break the warp
coalesced writes (see added lit tests). Note that we pass the alloc
shape instead of the `memdesc` because as a follow up we will use this
function in the pipeliner where we do not have the complete memdesc when
selecting layouts.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [AMD][gfx1250] Add persistent subtiled variant for WS f16 GEMM  (#9052)

Increase supported block size by using subtiling techniques to reduce
shared memory usage and eliminate register spills.

* [AMD] Fix ignored atomic_cas sem and scope (#9042)

The sem= (memory semantics) and scope= parameters in tl.atomic_cas() are
currently ignored during code generation in the AMDGCN backend.
Different values for these parameters produce identical assembly, which
defeats the purpose of memory ordering control and breaks the memory
model API.
See godbolt examples here: https://godbolt.org/z/T4dMrbqcc

This PR adjusts it.

* [AMD] NFC: include ArrayRef via mlir/support/LLVM (#9059)

mlir/Support/LLVM.h introduces `using llvm::ArrayRef;` to mlir
namespace. Without it compilation with utility.h fails in some setups.

* [AMD] NFC: remove std::move(temporary) (#9060)

In both cases here we have moved a temporary, which is not needed. Clang
also complains that "moving a temporary object prevents copy elision".

* [TRITONGPU][IR] Verify reduce/scan op axis value (#9061)

* [BACKEND] Fix uses of CGAEncodingAttr::getDefault (#9040)

`getDefault`, as implemented, only made sense for 1CTA mode. In this PR
we split it into two helper functions `get1CTALayout` and `get1DLayout`.

We also add a verifier that checks that MemDescTypes have the right
number of bases in their cga_layout/block attribute.

We also fix the CoalesceUtils, which previously did not support multicta
and now do. I will exercise this in a follow-up PR.

* [WS] Correctly assign stage / cluster annotation to aref ops for a block-arg producer (#8883)

When the "produced value" is a block argument, such as the "row-max
tensor" in attention, we are not assigning stage / cluster annotations
to aref put ops and `local_store`. This is a bug, but it didn't manifest
for fused loop / non-persistent kernel because the second
`schedule-loop` pass that runs after WS annotates them. For nested-loop
persistent attention, such "fixup" didn't work for some reason, so we
end up getting (benign) errors from the pipeliner like `<unknown>:0:
error: 'arith.addi' op not assigned a pipeline stage`.

`RewritePartitionDependencies` was correctly handling this case, thanks to
this logic
https://github.com/triton-lang/triton/blob/de8e71503fea971dfb65308147798657e18f8568/lib/Dialect/TritonGPU/Transforms/WarpSpecialization/RewritePartitionDependencies.cpp#L231-L240.
Porting that to `insert-aref` fixes this issue.

* [Nvidia][cuBLAS] Block-scaled matmul baselining (mxfp8, nvfp4)  (#9044)

# Summary
Added cuBLAS baselining support for nvfp4 & mxfp8 block-scaled matrix
multiplication, and extended Tutorial 10 to include performance
comparisons between Triton and cuBLAS. mxfp4 and mixed precision matmul
with cuBLAS are not supported as of 13.1 (see
[cuBLASLt](https://docs.nvidia.com/cuda/cublas/index.html?highlight=CUBLASLT_MATMUL_DESC_FAST_ACCUM#using-the-cublaslt-api)
API docs)

### Minor changes
* Changed the `10-block-scaled-matmul.py` tutorial code to introduce
multiple warmup iterations before benchmarking.

# Test setup
| Param | Value |
|-------|-------|
| OS |  Ubuntu 24.04.3 LTS |
| GPU | B200 (unlocked clocks) |
| Driver | 580.105.08 |
| Torch | 2.10.0.dev20251205+cu130 |
| nvidia-cuBLAS | 13.1.0.3 |

# Results

### 8192 x 8192 x 8192 mxfp8 example
```
$ python 10-block-scaled-matmul.py --format mxfp8 -K 8192
✅ (pass mxfp8 - Triton and cuBLAS)
Problem Shape = 8192x8192x8192
Done benchmarking
2484.174 8852.130 ROOT
├─ 2201.650 4994.035 block_scaled_matmul_kernel_mxfp8 [M=8192, N=8192, K=8192]
└─ 2849.882 3858.095 cublas [M=8192, N=8192, K=8192]
   └─ nan 3858.095 nvjet_sm100_qqhsh_128x256_128x6_2x1_2cta_v_bz_Avec32UE8M0_Bvec32UE8M0_TNT
```

### 8192 x 8192 x 8192 nvfp4 example
```
$ python 10-block-scaled-matmul.py --format nvfp4 -K 8192
✅ (pass nvfp4 - Triton and cuBLAS)
Problem Shape = 8192x8192x8192
Done benchmarking
4681.998 4696.762 ROOT
├─ 4004.368 2745.781 block_scaled_matmul_kernel_nvfp4 [M=8192, N=8192, K=8192]
└─ 5635.685 1950.981 cublas [M=8192, N=8192, K=8192]
   └─ nan 1950.981 cutlass3x_sm100_bstensorop_s256x256x64gemm_block_scaled_ue4m3xf4_ue4m3xf4_f32_f16_f16_256x256x256_0_tnn_align32_o_vs16_2sm_bias_f16_relu
```



# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] I have added tests.
    - extended `python/test/unit/runtime/test_blaslt.py`
- Select one of the following.
  - [x] I have not added any `lit` tests.

* [AMD][gfx1250] Enable matmul tests (#9065)

* [BACKEND] Add support for TMA with multicast (#9005)

For now we just add enough machinery to be able to run it on its own.
We will add the logic needed to be run in a loop with `tcgen05.mma` in
the next PR.

* [AMD] Add custom FuncOpConversion (#8962)

Add custom FuncOpConversion pattern to make sure functions
don't have unused/irrelevant nvvm attributes.

* [INTERPRETER] A few typing and efficiency cleanups (#9072)

* [AMD][KERNELS] Improve matmul ogs config on RDNA (#8052)

The default configuration for matmul_ogs resulted in significant
register spilling on RDNA when the batch size exceeded 512.

With this optimization, we observe a notable improvement in GPT-OSS
end-to-end performance.

gpt-oss-20B on 1x GPU

| Device | Optimization | Throughput (reqs/sec) | Improvement |
|--------|--------------|-----------------------|-------------|
| Navi31-48GB | Baseline | 3.06 | 100% |
| Navi31-48GB | Config Optimization | 4.47 | 146% |
| Navi48-32GB | Baseline | 5.63 | 100% |
| Navi48-32GB | Config Optimization | 6.66 | 118% |

* [AMD] Add missing bitcast to fix extract_element type error (#9075)

The subsequent `b.extract_element` requires an aggregate type (e.g.
vector) as input. In the else branch (when `scale` is false), we bitcast
each element of `pkVals` to `v2i32` (2-element i32 vector) to match the
type requirement of `extract_element`, aligning with the type handling
logic in the `scale` branch.

* [PROTON] Clean up context source after tearing down through `finalize` (#9069)

Previously there were some non-deterministic failures of proton on the
CI.
Suppose we have the following code:

```
with proton.scope("a"):
  proton.finalize(session0)

session1 = proton.start()
```

session1 will end up with an imbalanced context stack if we don't clean
up thread-local state in the context source.

* [PROTON][NFC] Document experimental APIs (#9056)

* [PROTON] Ignore metric kernels' timing data in the final profile (#9058)

* [SWP] Fix barrier location in loop lowering for MMA op with non-pipelined operands (#8732)

Consider the following example IR:

```
%y_16 = tt.descriptor_load %y_desc[%c0_i32, %y] {loop.cluster = 1 : i32, loop.stage = 0 : i32} : !tt.tensordesc<tensor<64x64xbf16, #shared>> -> tensor<64x64xbf16, #blocked>
%y_17 = ttg.local_alloc %y_16 {loop.cluster = 1 : i32, loop.stage = 0 : i32} : (tensor<64x64xbf16, #blocked>) -> !ttg.memdesc<64x64xbf16, #shared, #smem>
%acc_18 = ttng.tc_gen5_mma %x_12, %y_17, %acc_13[%acc_15], %acc, %true {loop.cluster = 1 : i32, loop.stage = 0 : i32, tt.self_latency = 1 : i32} : !ttg.memdesc<64x64xbf16, #shared, #smem>, !ttg.memdesc<64x64xbf16, #shared, #smem>, !ttg.memdesc<64x64xf32, #tmem, #ttng.tensor_memory, mutable>
%acc_19, %acc_20 = ttng.tmem_load %acc_13[%acc_18] {loop.cluster = 0 : i32, loop.stage = 1 : i32} : !ttg.memdesc<64x64xf32, #tmem, #ttng.tensor_memory, mutable> -> tensor<64x64xf32, #blocked1>
```
The loop lowering step attempts to determine the barrier location that
marks the MMA as "done" based on the earliest of the TMEM load or a
non-pipelined operand. However, the current implementation leverages
`schedule.isOpBefore`, which is inaccurate because it reports which
operand happens first, not which operation happens first in the body of
the loop. For example, it would indicate that `tt.descriptor_load` comes
before `ttng.tmem_load`.

We need to update this check so that it accounts for the fact that the
operands may occur before the MMA, and therefore the location comparison
should be made "after" the first MMA operation.

* [KERNELS] Add reduce_forward metadata, improve perf. (#9068)

* [Backend][Test] Fix inspect stages unit test (#9081)

* [LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)

* [TOOLS] Add hip support for link.py (#9084)

* Use the same link cpp src except for hipStream/CUstream etc.
* Add a link.h prelude for AMD/Nvidia to adapt to the differences.
* Enable test_aot.py for AMD.
* Also rename AMD's compile.cpp to compile.c.

* [Backend] Move TMA index translation from mid-end to lowerings (#9082)

Currently the behavior of fp4_padded is different between
`triton::Descriptor` ops and `AsyncTMA` ops. The former is indexed as if
the data is int8, while the latter is indexed by individual fp4
elements, which is what the TMA hardware expects.
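
The relationship between the two index spaces can be sketched as follows. This assumes that one int8 element of the descriptor view holds two packed fp4 values, so the translation is a multiplication by two; `int8_to_fp4_index` is a hypothetical helper and not the lowering's actual function.

```python
def int8_to_fp4_index(int8_index, fp4_per_byte=2):
    """Translate an index over int8 elements into an index over fp4 elements.

    Assumption: each int8 element packs `fp4_per_byte` fp4 values.
    """
    return int8_index * fp4_per_byte


# An offset of 16 in the int8 view corresponds to fp4 element 32.
for int8_index in (0, 16, 64):
    print(int8_index, "->", int8_to_fp4_index(int8_index))
```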

This now gets leaked into gluon, which isn't ideal. So, this PR moves
the translation into the lowerings. Along the way, this probably fixes
quite a few bugs as there were several places the translation was
missing.

* Fix address sanitizer stack-use-after-scope (#9088)

std::make_tuple here will copy the arguments into a tuple so it creates
a copy of SmallVector subsliceOffsets and then passes back a tuple with
an ArrayRef. The SmallVector object is then out of scope. Bypassing
make_tuple means that it uses the underlying AllocationSlice's reference
to subsliceOffsets rather than the temporary copy created by make_tuple.

* Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3" (#9090)

Reverts triton-lang/triton#9080, as it causes a tmem allocation
regression due to simplistic hoisting logic.

* [AMD] Enable AsyncCopy by default for gfx950 and gfx1250 (#9087)

Enables `ttg.async_copy_global_to_local` for pipelined loads by default
on `gfx950` and `gfx1250`.

This increases LDS consumption because we replace one register buffer
with an additional LDS buffer. After this change, the number of LDS
buffers is equal to `num_stages` (previously it was `num_stages - 1`).
Therefore, some test configs need to be skipped because we run out of
shared memory capacity on `gfx950`.
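
The extra LDS cost follows directly from the buffer counts above. The sketch below is plain arithmetic with a made-up per-stage tile size; `lds_buffers` and `lds_bytes` are hypothetical helpers.

```python
def lds_buffers(num_stages, async_copy):
    """Number of LDS buffers: num_stages with AsyncCopy, num_stages - 1 without."""
    return num_stages if async_copy else num_stages - 1


def lds_bytes(tile_bytes, num_stages, async_copy):
    """Total LDS bytes used by the pipelined load buffers."""
    return tile_bytes * lds_buffers(num_stages, async_copy)


tile_bytes = 32 * 1024  # hypothetical size of one pipelined tile
num_stages = 3
for async_copy in (False, True):
    print(async_copy, lds_bytes(tile_bytes, num_stages, async_copy))
```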

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [PROTON][TEST] Test proton on tensor descriptor and two-cta mode (#9070)

* [AMD] Use contiguity hint for buffer ops (#9089)

Enables an approach similar to #8752 in the AMD backend for buffer ops.
This helps to preserve vectorization based on kernel annotations when
converting to buffer_load/store on the AMD backend.

* [Kernels] Fix hopper mxfp4 swizzled kernels for num_warps=4 (#9029)

Currently the tests always use `num_warps=8` for hopper scale swizzling.
This PR makes the required changes to get it working for 4 warps. I also
update the layout tests to test both 4 and 8 warps.

* [BACKEND] Do not pipeline loops containing asserts or prints (#9055)

https://github.com/triton-lang/triton/pull/6180 already disabled it for
AMD, but it got lost during refactoring.
Instead of reintroducing it as AMD-specific, I added it to the common
utilities. Without the change, the lit tests also fail to compile for the
nv pipeliner because we cannot predicate those ops.

* Correct the filename of ProxFenceInsertion (#9077)

### Changes
- Renamed the file `ProxFenceInsertion.cpp` to `ProxyFenceInsertion.cpp`
- Updated all relevant references to match the new file name

### Reason
- The original file name contained a typo and was inconsistent with
other references in the project
- This renaming ensures consistency and avoids confusion

* Properly expose arguments to WS partitions (#9023)

While we explicitly expose the region relationships between the WS
operations and the regions they enclose, we do not expose how the
operands to the `WarpSpecializeOp` are passed to the partition regions.
Doing so allows us to eliminate the remaining explicit handling of
`WarpSpecializeOp` in dataflow analyses.

* [NFC] Improve robustness of ext slice rematerialization (#9019)

When trying to extend the instructions to rematerialise to include the
backward slice of the ext, we currently perform the following steps:
1. Attempt to obtain the backward slice of the ext independently
2. Validate that the slice can be safely merged with the current slice
3. Merge the slices

Instead, we can attempt to directly extend the existing slice and just
roll back if that fails. This is potentially slightly less efficient,
since we copy the slice and layout, but it means that the validation
happens entirely within `getConvertBackwardSlice`.
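
The "extend a copy, commit only on success" pattern described above can be sketched in Python. This is a generic illustration, not the code of `getConvertBackwardSlice`: `try_extend`, `same_layout`, and the string op and layout names are all hypothetical.

```python
def try_extend(slice_ops, layouts, new_ops, is_valid):
    """Extend the slice in place, or leave it untouched and return False."""
    # Work on copies so a failed validation needs no explicit undo step.
    candidate_ops = list(slice_ops)
    candidate_layouts = dict(layouts)
    for op, layout in new_ops:
        if not is_valid(op, layout, candidate_layouts):
            return False  # roll back: the originals were never modified
        if op not in candidate_layouts:
            candidate_ops.append(op)
        candidate_layouts[op] = layout
    # Commit the extended slice only after every new op has been validated.
    slice_ops[:] = candidate_ops
    layouts.clear()
    layouts.update(candidate_layouts)
    return True


def same_layout(op, layout, known):
    """An op already in the slice must keep the layout it was given."""
    return known.get(op, layout) == layout


ops, layouts = ["a"], {"a": "L0"}
print(try_extend(ops, layouts, [("b", "L1"), ("a", "L2")], same_layout), ops)
print(try_extend(ops, layouts, [("b", "L1")], same_layout), ops)
```

The first call fails on the conflicting layout for `a` and leaves the slice unchanged; the second call succeeds and commits `b`.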

This organisation would have sidestepped the bug fixed in #7058, and the
bug that caused #8292 to be reverted.

While investigating this, I have found that the code for hoisting into
conditionals may be prone to the same bug, since it also attempts to
extend an existing slice. But I will keep that for a separate PR since
it is more complex and I need to write a test case.

* [distributed] pass symm_mem_pool by argument; fix matmul launch_metadata (#9092)

* [AMD][gfx1250] Fix for enabling language/test_core.py (#9095)

* [AMD] Add finite/isfinited for AMD libdevice.py (#9097)

- fp32 finite calls __ocml_isfinite_f32.
- fp64 isfinited calls __ocml_isfinite_f64.

* Restore add_stages_inspection_hook in test (#9096)

* Fix Kernel Argument missing in LLVM Debug Information (#9002)

This change fixes the missing kernel arguments in LLVM Debug
Information.

The background is that, when creating a GPU memory trace for a GPU
kernel run, there is no `DW_TAG_formal_parameter` field in the `DWARF`
section extracted from the kernel binary.

When it comes to the LLVM IR, take a simple vector add case as an
example:

```py
@triton.jit
def vector_add_triton(x_ptr: torch.Tensor, y_ptr: torch.Tensor, out_ptr: torch.Tensor, n_elements: int, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(out_ptr + offsets, output, mask=mask)
```

the expected LLVM IR would be

```c
!0 = distinct !DICompileUnit(language: DW_LANG_C, file: !1, producer: "triton", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug)
!1 = !DIFile(filename: "vector_add.py", directory: "")
!5 = distinct !DISubprogram(name: "vector_add_triton", linkageName: "vector_add_triton", scope: !1, file: !1, line: 12, type: !6, scopeLine: 12, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !11)
!6 = !DISubroutineType(cc: DW_CC_normal, types: !7)
!7 = !{null, !8, !8, !8, !10}
!8 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !9, size: 64, dwarfAddressSpace: 1)
!9 = !DIBasicType(name: "float", size: 32, encoding: DW_ATE_float)
!10 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
!11 = !{!12, !13, !14, !15}
!12 = !DILocalVariable(name: "x_ptr", arg: 1, scope: !5, file: !1, line: 12, type: !8)
!13 = !DILocalVariable(name: "y_ptr", arg: 2, scope: !5, file: !1, line: 12, type: !8)
!14 = !DILocalVariable(name: "out_ptr", arg: 3, scope: !5, file: !1, line: 12, type: !8)
!15 = !DILocalVariable(name: "n_elements", arg: 4, scope: !5, file: !1, line: 12, type: !10)
!16 = !DILocation(line: 12, scope: !5)
!17 = !DILocalVariable(name: "offsets", scope: !5, file: !18, type: !10)
``` 

while the triton compiler output before this change is

```c
  !5 = distinct !DISubprogram(name: "vector_add_triton", linkageName: "vector_add_triton", scope: !1, file: !1, line: 12, type: !6, scopeLine: 12, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0)
  !6 = !DISubroutineType(cc: DW_CC_normal, types: !7)
  !7 = !{}
```

The reason for this issue is that when the `LLVMDIScope` pass creates
`LLVM::DISubprogramAttr`, it ignores the kernel arguments.
This fix creates an `LLVM::LocalVariableAttr` for each valid argument and
passes it to `LLVM::DISubprogramAttr` in `retainedNodes`.

After this change, the kernel arguments can be found in the `DWARF`
section:

```c
...
0x0000000c: DW_TAG_compile_unit
              DW_AT_producer	("triton")
              DW_AT_language	(DW_LANG_C)
              DW_AT_name	("triton_vector_add.py")
              DW_AT_str_offsets_base	(0x00000008)
              DW_AT_stmt_list	(0x00000000)
              DW_AT_comp_dir	("/home/leling/tests")
              DW_AT_low_pc	(0x0000000000001700)
              DW_AT_high_pc	(0x0000000000001790)
              DW_AT_addr_base	(0x00000008)
              DW_AT_loclists_base	(0x0000000c)

0x00000027:   DW_TAG_subprogram
                DW_AT_low_pc	(0x0000000000001700)
                DW_AT_high_pc	(0x0000000000001790)
                DW_AT_linkage_name	("vector_add_triton")
                DW_AT_name	("vector_add_triton")
                DW_AT_decl_file	("/root/tests/vector_add_triton.py")
                DW_AT_decl_line	(12)
                DW_AT_external	(true)

0x00000031:     DW_TAG_formal_parameter
                  DW_AT_name	("x_ptr")
                  DW_AT_decl_file	("/root/tests/vector_add_triton.py")
                  DW_AT_decl_line	(12)
                  DW_AT_type	(0x0000006f "float *")

0x00000039:     DW_TAG_formal_parameter
                  DW_AT_name	("y_ptr")
                  DW_AT_decl_file	("/root/tests/vector_add_triton.py")
                  DW_AT_decl_line	(12)
                  DW_AT_type	(0x0000006f "float *")

0x00000041:     DW_TAG_formal_parameter
                  DW_AT_name	("out_ptr")
                  DW_AT_decl_file	("/root/tests/vector_add_triton.py")
                  DW_AT_decl_line	(12)
                  DW_AT_type	(0x0000006f "float *")

0x00000049:     DW_TAG_formal_parameter
                  DW_AT_name	("n_elements")
                  DW_AT_decl_file	("/root/tests/vector_add_triton.py")
                  DW_AT_decl_line	(12)
                  DW_AT_type	(0x00000067 "int")

0x00000051:     DW_TAG_variable
                  DW_AT_location	(indexed (0x0) loclist = 0x00000018: 
                     [0x0000000000001718, 0x0000000000001720): DW_OP_consts +6, DW_OP_stack_value
                     [0x0000000000001720, 0x0000000000001750): DW_OP_regx VGPR0)
                  DW_AT_name	("offsets")
                  DW_AT_type	(0x00000067 "int")
...
```

* [AMD][KERNELS] Enable reduce and expert sharding tests (#9110)

Co-authored-by: jianlian <jianlian@amd.com>

* [AMD][Backend] Fix atomic cas for non int types (#9116)

Users should be aware that float atomic CAS performs a bitwise compare,
not a floating-point compare (e.g. under a floating-point compare
NaN != NaN, while identical NaN bit patterns compare equal bitwise).
This matches the behavior of the NVIDIA backend.
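
The difference between the two kinds of comparison can be illustrated on the host in plain Python. This is only a sketch of the comparison semantics, not of the backend lowering; the helper name `bits_of` is made up for the example.

```python
import math
import struct

def bits_of(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

nan = math.nan

# Floating-point compare: NaN never equals anything, including itself.
float_equal = (nan == nan)

# Bitwise compare, as a CAS on the raw 32 bits would do:
# the same NaN bit pattern compares equal to itself.
bitwise_equal = (bits_of(nan) == bits_of(nan))

# The opposite surprise: +0.0 and -0.0 are equal as floats but differ bitwise.
zero_float_equal = (0.0 == -0.0)
zero_bitwise_equal = (bits_of(0.0) == bits_of(-0.0))
```

Under a bitwise compare, a CAS whose expected value is a NaN can therefore succeed, while one that expects `0.0` fails if memory holds `-0.0`.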

* [nvws][insert_aref] Modify aref.get creation to filter results not in scheduled loop (#9114)

This is done because otherwise the result might not have its last use
(in either direction) inside the scheduled loop and we will not be able
to get stageClusterEnter and/or stageClusterExit.

<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
- [x] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

Fixes https://github.com/triton-lang/triton/issues/8072

* [Backend][NFC] FuncOpToLLVM: Move handleArgPtrDatatype to Utility.h (#9120)

Make handleArgPtrDatatype a utility function to avoid code duplication.

* Fix handling conflicting layouts when hoisting convert into conditionals (#9083)

I noticed that hoisting into conditionals is prone to the same layout
conflict issue that we have previously fixed in hoisting above
ext/broadcast ops (#7058). Fix this by attempting to extend the existing
slice and layout map when considering rematerialisation.

* Skip values with existing conversions in getConvertBackwardSlice (#8291)

`getConvertBackwardSlice` currently includes any existing
rematerialisation it finds in the returned slice. However, this
shouldn't be necessary because that value does not need to be
rematerialised or included in the cost calculation.

Instead, once we find a valid rematerialisation, we can stop the
traversal right away.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because it is a simplification and
existing tests pass.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

* [AMD] ReorderInstructions: Remove sinkSecondLoad optimization (#9119)

This optimization triggers only in a very limited number of cases and no
longer has any visible impact on performance.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because `it removes previously added
tests`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

* [AMD] Generalize CTA layout fields in WMMA layout (#8946)

ctaLayout is a linear layout describing how warps are arranged across
WMMA tiles.
Previously, this information was encoded using warpsPerCTA and
tilesPerWarp parameters.
For instance, a configuration with 4 warps, represented as:

warpsPerCTA = [2, 2], tilesPerWarp = [1, 1]

would translate to:

ctaLayout = {reg = [], warp = [[0, 1], [1, 0]]}

By default, WMMA assumes that each warp in a CTA computes exactly one
WMMA tile.
In the grid below, each w* label indicates which warp computes that
tile:

w0 w1 w0 w1
w2 w3 w2 w3
w0 w1 w0 w1
w2 w3 w2 w3

To express more complex layouts, we must also account for repetitions
within the mapping.
For example, the configuration formerly described as:

warpsPerCTA = [2, 2], tilesPerWarp  = [2, 2]

would translate to:

ctaLayout = {reg = [[0, 1], [1, 0]], warps = [[0, 2], [2, 0]] }

w0 w0 w1 w1
w0 w0 w1 w1
w2 w2 w3 w3
w2 w2 w3 w3

This parameter provides a more general way to define warp mappings than
what
warpsPerCTA and tilesPerWarp alone could express.
For instance:

ctaLayout = {reg = [[1, 0], [0, 1]], warps = [[0, 2], [2, 0]]}

still represents a layout similar to:

warpsPerCTA  = [2, 2], tilesPerWarp = [2, 2]

but with a different ordering of repetitions.

The motivation for this broader formulation comes from the need to
describe swizzled warp
layouts, which help avoid LDS partition conflicts on architectures such
as gfx1250.
A valid example of such swizzled configuration is:

ctaLayout = {reg = [[2, 0]], warps = [[2, 1], [1, 0]]}

With corresponding mapping:

w0 w1 <- second tile computed by w1
w2 w3
w0 w1 <- first tile computed by w1
w2 w3

Note that ctaLayout naturally composes with the layout defined on a
single WMMA tile to form the final WMMA layout:

wmmaLayout = tileLayout * ctaLayout

This simplifies lowering both WMMA and dotOperand layouts to linear
layouts.
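
The warp grids above can be reproduced with a small Python model. This is a sketch under two assumptions that the description implies but does not spell out: basis vectors are `[row, col]` tile coordinates, and the bases selected by the set bits of a register or warp index are combined with XOR. The function names are hypothetical and are not part of Triton.

```python
def apply_bases(bases, index):
    """XOR together the basis vectors selected by the set bits of index."""
    row, col = 0, 0
    for bit, (r, c) in enumerate(bases):
        if (index >> bit) & 1:
            row ^= r
            col ^= c
    return row, col

def warp_grid(reg_bases, warp_bases):
    """Return a 2D grid of warp ids, one entry per WMMA tile."""
    owner = {}
    for w in range(1 << len(warp_bases)):
        wr, wc = apply_bases(warp_bases, w)
        for reg in range(1 << len(reg_bases)):
            rr, rc = apply_bases(reg_bases, reg)
            owner[(wr ^ rr, wc ^ rc)] = w
    rows = max(r for r, _ in owner) + 1
    cols = max(c for _, c in owner) + 1
    return [[owner[(r, c)] for c in range(cols)] for r in range(rows)]

# tilesPerWarp = [2, 2] example from the description.
tiled = warp_grid(reg_bases=[[0, 1], [1, 0]], warp_bases=[[0, 2], [2, 0]])

# Swizzled example from the description.
swizzled = warp_grid(reg_bases=[[2, 0]], warp_bases=[[2, 1], [1, 0]])
```

In the swizzled case, warp 1 starts at tile `(2, 1)` and its register bit moves it to `(0, 1)`, which is why the top row holds the second tile computed by w1.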

---------

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>

* [NFC][gluon] Fix GluonSimplifyControlFlow's description (#9066)

The previous description for the pass was copy pasted from GluonInline
pass. Modify the description to match what the pass does.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because `it fixes a pass
description`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

* [Backend][NFC] Change getClusterCTAId to return 0 for 1 CTA (#9125)

Follow up based on [this comment on a previous
PR](https://github.com/triton-lang/triton/pull/8790#discussion_r2549358694)
to return `0` if we compile for a single CTA.
This also removes some guards in the AMD backend's lowering of load ops
that are no longer required.

* [AMD][BACKEND] Add TDM L2 prefetch support (#9086)

Adds support for prefetching a TDM tile into `L2`. The underlying
intrinsic/instruction expects an address and prefetches 256 bytes into
L2 (hardware enforces 256-byte alignment); it cannot directly consume a
TDM descriptor.
Therefore, to prefetch a tile defined by a TDM descriptor, we
cooperatively prefetch the tile across CTAs, warps and lanes to keep
the number of prefetch instructions low.

The newly added Op supports returning its prefetched memory location.
This is purely for testing purposes, since the Op has no observable side
effects on its own.

A follow-up PR will expose this Op to Gluon and add unit tests based on
the returned offsets.
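
As a rough illustration of the idea, the sketch below enumerates the 256-byte lines that cover a 2D tile and deals them out to threads. It is a host-side model only: the tile parameterization, the round-robin assignment, and both function names are assumptions made for this example, and the actual lowering may distribute the work differently.

```python
PREFETCH_BYTES = 256  # each prefetch covers one 256-byte-aligned line

def tile_lines(base, row_bytes, num_rows, row_stride_bytes):
    """Return the sorted, de-duplicated 256-byte line addresses covering a tile."""
    lines = set()
    for r in range(num_rows):
        start = base + r * row_stride_bytes
        end = start + row_bytes
        first = (start // PREFETCH_BYTES) * PREFETCH_BYTES
        for addr in range(first, end, PREFETCH_BYTES):
            lines.add(addr)
    return sorted(lines)

def assign_round_robin(lines, num_threads):
    """Give each thread every num_threads-th line so that no line is prefetched twice."""
    return {t: lines[t::num_threads] for t in range(num_threads)}

# Four 128-byte rows stored back to back share two 256-byte lines.
dense = tile_lines(base=0x1000, row_bytes=128, num_rows=4, row_stride_bytes=128)

# The same rows with a 512-byte stride need one line each.
strided = tile_lines(base=0x1000, row_bytes=128, num_rows=4, row_stride_bytes=512)

work = assign_round_robin(strided, num_threads=2)
```

De-duplicating lines before assigning them is what keeps the instruction count low when several rows fall into the same 256-byte line.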

* [AMD] Support single block execute regions in `UpdateAsyncWaitCount` (#9126)

`UpdateAsyncWaitCount` has to walk through `scf.ExecuteRegion` (with a
single block in its region) to compute correct wait counts when warp
pipelining.

* [Backend] Support llvm struct and array type to DITypeAttr (#9124)

This change supports `DITypeAttr` conversion for `llvm.struct` and
`llvm.array` in LLVM DI passes as a supplement to
https://github.com/triton-lang/triton/pull/9002.

In practice, we hit this when lowering kernels that take a tensordesc
input, such as:
```mlir
  tt.func public @tensor_descriptor_load_store_nd_kernel_host_tdm(%arg0: !tt.tensordesc<tensor<8x4xsi32, #shared>>, %arg1: i32, %arg2: i32, %arg3: i64, %arg4: i64, %arg5: !tt.tensordesc<tensor<8x4xsi32, #shared>>, %arg6: i32, %arg7: i32, %arg8: i64, %arg9: i64)
```
Such a kernel is lowered to:
```mlir
llvm.func @tensor_host_tdm_copy_kernel(%arg0: !llvm.struct<(i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32)>, %arg1: i32, %arg2: i32, %arg3: i64, %arg4: i64, 
```
The `LLVMDIScope` pass then fails because there is no conversion from
`LLVMStructType` to `DITypeAttr`.
This PR adds that conversion.

* fix flood of warning related num_stage and shared_mem capacity (#9121)

The intention was to warn only once using ``warnings.warn``, but because
the message was built with an f-string, many different messages were
generated and each of them was reported.
This PR also piggybacks a small change to the matmul kernel that removes
an unused variable definition.
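
The effect is easy to reproduce with the standard library alone. Python's default filter suppresses a repeated warning only when the message text, category and line number all match, so interpolating a changing value into the text defeats the de-duplication. The message strings below are made up for the example and are not the ones Triton emits.

```python
import warnings

def count_emitted(messages):
    """Warn once per message from a single call site; return how many got through."""
    with warnings.catch_warnings(record=True) as caught:
        # "default": show the first occurrence per (text, category, line number).
        warnings.simplefilter("default")
        for msg in messages:
            warnings.warn(msg, UserWarning)
    return len(caught)

# Constant text: only the first warning is reported.
fixed = count_emitted(["num_stages was reduced to fit shared memory"] * 5)

# Text that embeds a changing value: every warning is reported.
varying = count_emitted(
    [f"num_stages was reduced to {n} to fit shared memory" for n in range(5)]
)
```

Keeping the warning text constant (and logging the varying details elsewhere) is one way to get the intended warn-once behavior.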

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because this is just logging change.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

* [Backend] NFC: drop accidentally checked in file (#9128)

This was added in https://github.com/triton-lang/triton/pull/8923.

* [Backend] Add uniform hint to ttg.warp_id (#9073)

This commit adds an `omitUniformHint` attribute for the NVIDIA backend
to control the cases where we would like to emit `nvvm.shfl.sync` idx 0
in a targeted manner.

* Update ptxas to 12.9 for hopper and ampere (#8983)

* [Frontend] Fix setting attr on constexpr argument (#9053)

The specialization contains a pair for each argument. The first element
is the argument's type (e.g. int/ptr/constexpr). For a dynamic
(non-constexpr) argument, the second element is a string describing the
argument's attributes (e.g. 'D' means divisibility by 16); for a
constexpr argument, the second element is the argument's compile-time
value. For example:

```
[(i32, 'D'), (float, ''), (constexpr, 'CDNA'), (constexpr, 128)]
```

Only dynamic arguments correspond to arguments of the MLIR func-op;
constexpr arguments have already been propagated/inlined and are not
passed as runtime arguments. So when determining the func-op arguments'
attrs from the specialization, we should filter out constexpr arguments.
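
A minimal sketch of the filtering step is shown below. It models types as plain strings and uses a hypothetical helper name, so it only illustrates the index bookkeeping and is not the frontend's actual code.

```python
def func_arg_attrs(specialization):
    """Map func-op argument index -> attribute string, skipping constexpr entries."""
    attrs = {}
    arg_idx = 0  # index among runtime (non-constexpr) arguments only
    for ty, value in specialization:
        if ty == "constexpr":
            continue  # inlined at compile time; there is no runtime argument
        if value:
            attrs[arg_idx] = value
        arg_idx += 1
    return attrs

# Hypothetical signature with constexpr arguments interleaved with dynamic ones.
spec = [("i32", "D"), ("constexpr", "CDNA"), ("float", ""), ("constexpr", 128), ("i32", "D")]
attrs = func_arg_attrs(spec)
```

Without the filter, the last `'D'` would be attached to index 4, which does not exist on a func-op that has only three runtime arguments.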

* [AMD][gfx1250] Fix Gluon AMDWMMALayout rank consistency (#9127)

* Ensure NVWSDialect is loaded for NVGPUWarpSpecialization (#9131)

Ensure that the dialect is loaded so we can run reproducers involving
this pass using `triton-opt`.

* [AMD][gfx1250] Avoid cluster load for scalar pointers (#9129)

This fixes a crash which is checked in as a lit test.

* [PROTON] Simplify runtime and metric correlation to reduce overhead (#9132)

* [TRITON_KERNELS] some refactoring (#9134)

* Revert "[TRITON_KERNELS] some refactoring" (#9140)

Reverts triton-lang/triton#9134

* Support JITFunction in preload (#8794)

* [AMD] Fix vector size for padded encodings with direct to lds loads (#9149)

For architectures not supporting scattering (GFX9) we can only support
padding intervals which are a multiple of `vectorWidth * warpSize`
because we can only add padding at warp boundaries. This PR enforces
this by bailing out for encodings that violate it.

Note that we do not create such layouts right now, but will be later.

* [AMD] Emit detailed error when failing to choose mfma/wmma instruction (#9143)

This PR improves the error message emitted when failing to choose
mfma/wmma instruction on AMD backend.

Previously, the error claimed the failure was due to "unsupported
element type", but in reality the issue could also be caused by:
- Invalid MFMA instruction shape (e.g., unsupported or misaligned
`kDim`)
- Unsupported MFMA version for the target GPU

The new message includes the actual parameters (`mfmaVersion`,
`[m,n,k]`, element types) and guides users to check AMD architecture
documentation, making debugging much easier.

Fixes https://github.com/triton-lang/triton/issues/9141

Co-authored-by: xinheng <xinheng.dx@alibaba-inc.com>

* Revert "[distributed] pass symm_mem_pool by argument; fix matmul launch_metadata" (#9155)

Reverts triton-lang/triton#9092

* [hopper][WS] Support tt.split/join in data partition (#456) (#9147)

Add support for `SplitOp` and `JoinOp` operations in the WSDataPartition
pass. This enables data partitioning for tensor split and join
operations along the M dimension during warp specialization.

Key changes:
- Add `SplitOp` and `JoinOp` to `getBackwardSliceToPartition` to include
them in backward slice traversal
- Add handling for `SplitOp` and `JoinOp` in `sliceOp` function to
correctly partition these operations
- Fix a bug where multi-result operations (like `SplitOp` which returns
two tensors) were not having all their results sliced correctly - the
`cloneAndSetResultType` lambda now iterates over all results instead of
just the first one
- changes to backward slice traversal and sliceOp function

Also adds a lit test that covers partitioning of:
- `tt.trans` (transpose)
- `tt.reshape` (reshape with allow_reorder)
- `tt.join` (join two tensors)
- `tt.split` (split into two tensors)

* Pass explicit captures to WarpSpecializePartitionsOp (#9133)

Explicit captures are currently operands on `WarpSpecializeOp` but the
regions that consume them are owned by the `WarpSpecializePartitionsOp`
in the enclosed holder region. That is, the block arguments to the
actual partitions do not get their values from an immediate parent
region.

This has been mostly fine, but some MLIR common code assumes that the
operands of a `RegionBranchOpInterface` operation flow directly into the
regions that it encloses. For example, attempting to use a
`SparseBackwardDataFlowAnalysis` will trigger a crash.

To address this, move the explicit captures to be operands on the
operation that holds the non-default partition regions.

* [PROTON][TRITONGPU] Deprecate Proton's `GlobalScratchAllocOp` and use TritonGPU's `GlobalScratchAllocOp` with a custom backend instead (#8976)

* [WGMMA] Limit rs-dot splitting to only 2 splits (#9152)

When pipelining wgmma with the left hand argument in registers, we split
the dot along the K dimension which allows us to effectively
"multi-buffer" the wgmma by having multiple sets of registers which are
not used by the wgmma at the same time.

Currently rs dots are split into `K // instrK` dot ops which
theoretically maximizes parallelism, but in practice I see that
splitting into 2 gives the best performance. This gives ~1% uplift in
bf16 x mxfp4 moe.
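
The change in split count can be sketched as follows. The even distribution of instruction-sized chunks over the splits and the function name are assumptions for illustration; the description only states that the number of splits is now capped at 2.

```python
def split_k_ranges(k, instr_k, max_splits=None):
    """Return [start, end) ranges along K, one per split dot op."""
    num_instr = k // instr_k
    assert num_instr >= 1, "K must cover at least one instruction"
    num_splits = num_instr if max_splits is None else min(num_instr, max_splits)
    base, extra = divmod(num_instr, num_splits)
    ranges, start = [], 0
    for i in range(num_splits):
        chunks = base + (1 if i < extra else 0)
        end = start + chunks * instr_k
        ranges.append((start, end))
        start = end
    return ranges

# Previous behavior: one dot op per instruction along K.
before = split_k_ranges(k=64, instr_k=16)

# Capped at two splits, i.e. two register sets to alternate between.
after = split_k_ranges(k=64, instr_k=16, max_splits=2)
```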

* [Gluon] Add local scatter/gather for Gluon (#8480)

It is often desirable to operate on matrix sections when using Gluon for
numerical linear algebra. For example, accessing the diagonal to compute
the trace or updating a single column. Currently these operations
require loading the entire matrix into registers, performing a warp
shuffle, selection, and store. This results in poor performance.

This PR proposes adding specialized scatter and gather operations for
operating on 2D tensors in shared memory. The API for these operations
is similar to the scatter and gather operations in Triton except they
accept two 1D tensors for indexing into the 0th and 1st dimension
respectively. For example, a gather with two `tl.arange` reads the
diagonal of a matrix. A gather with one `tl.arange` and a `tl.zeros`
reads a column vector.

These operations are implemented by passing the runtime index register
values through the instructions generated by the inverted shared memory
linear layout.
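
The indexing semantics described above can be modeled on the host with nested lists. This is only a sketch of what the two index tensors mean; `local_gather` and `local_scatter` are hypothetical names here and do not show the Gluon API or its lowering.

```python
def local_gather(matrix, idx0, idx1):
    """Read matrix[idx0[i]][idx1[i]] for every i."""
    return [matrix[i][j] for i, j in zip(idx0, idx1)]

def local_scatter(matrix, idx0, idx1, values):
    """Write values[i] to matrix[idx0[i]][idx1[i]] for every i."""
    for i, j, v in zip(idx0, idx1, values):
        matrix[i][j] = v

n = 4
m = [[r * n + c for c in range(n)] for r in range(n)]
arange = list(range(n))  # plays the role of tl.arange
zeros = [0] * n          # plays the role of tl.zeros

diagonal = local_gather(m, arange, arange)  # two aranges: the diagonal
column0 = local_gather(m, arange, zeros)    # arange + zeros: first column

# Update a single column in place.
local_scatter(m, arange, zeros, [-1] * n)
```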

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

* Control `driver.active` with `TRITON_DEFAULT_BACKEND` (#9144)

When multiple backend drivers are active, Triton fails when
`triton.runtime.driver.active` is accessed. This seems like an
unnecessary limitation: multiple drivers _could_ be active at the same
time. The `...driver.active` API is used over 100 times across this code
base; it likely does not make sense to change that API. Instead, this
change adds an environment variable, `TRITON_DEFAULT_BACKEND`, that
allows users to select which active backend to use. If not present,
`...driver.active` continues to work as before but, if it is, the
default driver is constructed from the name of the backend given, e.g.,
`TRITON_DEFAULT_BACKEND=nvidia`. This allows users to run examples and
tests using a different backend without modifying those files (though,
of course, one can still programmatically set the active driver with
`...driver.set_active()`).
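
The selection rule described here can be modeled as a small pure function. The function name, the error messages, and the check that the requested backend is among the active ones are assumptions for this sketch; only the environment variable name and the overall behavior come from the description.

```python
def pick_default_backend(active_backends, env):
    """Return the backend name to use for the default driver."""
    requested = env.get("TRITON_DEFAULT_BACKEND")
    if requested is not None:
        # Assumption: an unknown or inactive backend name is an error.
        if requested not in active_backends:
            raise RuntimeError(f"backend {requested!r} is not active")
        return requested
    # Without the variable, behave as before: exactly one active driver is required.
    if len(active_backends) != 1:
        raise RuntimeError(
            f"{len(active_backends)} active drivers found, expected exactly 1"
        )
    return active_backends[0]

single = pick_default_backend(["nvidia"], {})
chosen = pick_default_backend(["nvidia", "amd"], {"TRITON_DEFAULT_BACKEND": "amd"})
```

In a real run the variable would be set in the environment, e.g. `TRITON_DEFAULT_BACKEND=nvidia`, as in the description.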

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because this adds a new configuration
option to hack with, not core functionality.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

* [Gluon] Tighten + cleanup TMA op verifiers and add "illegal instruction" sanitize mode (#9112)

* Cleanup and refactor TMA op verifiers
* Apply `fp4_padded` restrictions to some TMA ops
* Relax rank-reduce verifier to allow non-unit leading dims as long as
numels remains the same
* Add an "iisan" mode that generates runtime checks to guard against
illegal instructions from TMA operations

* [TritonGPU][AutoWS] Fix autows when mixing TMA and non-TMA loads (#9111)

When mixing non-TMA loads with warp-specialized MMAs and TMA loads, we
still need LowerLoops to insert the MMA `wait_barrier` based on the self
latency. This is achieved with some fiddling in AssignLatencies to set
the right self-latency based on whether all the MMA operands are going
to be warp specialized. At the same time, this changes `lowerMMA` to
also handle warp-specialized MMAs.

This PR contains various other things:
* error message improvements 
* Hopper autoWS is turned off when `numStages <= 1` to avoid various
downstream issues
* add more exhaustive warp spec integration tests
* turn on consan for a subset of warp spec integration tests for better
coverage
* add a few positive consan tests

* Fix flakiness in test_line_info_ir_source (#9161)

* [Blackwell] Enable tcgen05 MMA for sm110 (Jetson Thor) (#9160)

Building on [the work in
openxla](https://github.com/openxla/xla/pull/34705), I confirmed that
most MMA tests are passing on sm110 with tc05 enabled:
`test_warp_specialization` all pass, `test_matmul.py` mostly ok except
for some off-by-one errors, and tut10 runs fine.

Closes https://github.com/triton-lang/triton/issues/8896

* Broadcast support in cat (#9163)

Fix tt.cat for broadcasting layouts.

* [Kernels] Use ex2.approx.ftz in swiglu (#9164)

This seems to get most of the speedup of the other version but without
compromising numerics. Since we add 1 after the exponential, any
denormal values would be lost to rounding anyway.

It's worth about 1-2 GBps, or around 0.1% in bf16 x mxfp4 moe. Not
world-changing, but it is repeatable.
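
The rounding argument can be checked numerically. The sketch below emulates float32 rounding with `struct` and shows that adding the largest float32 denormal to 1 gives the same result as adding zero, which is what a flush-to-zero exponential would produce. It models only the arithmetic, not the `ex2.approx.ftz` instruction itself, and `to_f32` is a helper made up for the example.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Largest float32 denormal: just below the smallest normal value 2**-126.
largest_denormal = to_f32(2.0 ** -126 - 2.0 ** -149)

# 1 + exp(...) when the exponential result is denormal vs. flushed to zero.
kept = to_f32(1.0 + largest_denormal)
flushed = to_f32(1.0 + 0.0)

# For reference: anything below half a float32 ulp of 1.0 is lost when added to 1.
half_ulp_of_one = 2.0 ** -24
```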

* [AMD] Add rint to amd libdevice (#9166)

* Fix test conflict with cuda module (#9165)

PyTorch now depends on the `cuda-bindings` package which introduces a
module called `cuda` (https://github.com/pytorch/pytorch/pull/167769).
The presence of an `__init__.py` in the cuda test directory causes
pytest to treat it as a module, and it resolves to the new
aforementioned module, causing test failures when Triton tests are run
with the latest PyTorch nightly installed.

To fix this, delete the `__init__.py`, and rename the tests that have
naming conflicts.

* [Testing] Propagate err_msg to np in triton's assert_close (#9170)

Pass the `err_msg` argument through to `np.testing.assert_allclose` for
richer error messages.

* [AMD][GLUON] Expose TDM L2 prefetch (#9148)

Followup based on https://github.com/triton-lang/triton/pull/9086 which
introduced the prefetch `amdgpu` op and lowering.

Since L2 prefetch has no side effects, apart from segfaulting for non
speculative prefetches, we optionally return the offsets to verify we
prefetch the correct memory locations (`_test_prefetch_with_offsets`).

`get_shape_from_tensor` is required to deduplicate the logic computing
the shape/layout of the returned offsets.

* [AMD][gfx1250] Skip FP4 matmul tests packed along M/N (#9176)

* [AMD][gfx1250] Skip MFMA-specific tests for GFX1250 (#9177)

This PR skips MFMA-layout-specific checks on GFX1250.

* Revert "[Backend] Delay wgmma wait(0) to first use of the accumulator" (#9179)

Reverts triton-lang/triton#9021

* [AMD] ReorderInstructions: Remove sinkDotConversion optimization (#9139)

This optimization is basically a no-op because ConvertLayout operations
are already replaced with local_load by ReduceDataDuplication by the
time ReorderInstructions runs.

* [AMD] Fix wrong vectorization width of atomic-rmw with mask (#9142)

* [Gluon] Expose finer grained cluster fences (#9076)

We also expose the utility `fence_cluster_init` that ensures that all
the CTAs see the initialisation of the mbarriers before multiCTA ops.

* Remove calls to llvm_update_compile_flags in CMake (#9167)

I noticed that `llvm_update_compile_flags` in upstream LLVM now depends
on also including HandleLLVMOptions (llvm/llvm-project#174084). We could
pull that in, but it enables a lot of warnings globally and seems likely
to create additional churn when updating LLVM.

Instead, we can just disable exceptions and rtti ourselves, and remove
calls to the LLVM provided helpers. This makes the library builds easier
to follow, and reduces the chance of further upstream LLVM changes
breaking the build.

* [BACKEND] Pick better layout for small async_cp (#9183)

When picking a layout for small loads, we may decide to coalesce less to
avoid a convert_layout. When those loads become async_cp, we can decouple
the source layout from the destination one, so we can pick a better
layout for async_cp.

* Adding clone for triton_kernels.tensor.Tensor (#9178)

* [BACKEND] Add tcgen05.mma + multicast support (#9071)

* [PROTON] Periodic profiling dumping (#9150)

* [AMD] Add TritonAMDGPUSinkLayoutConversions pass (#9168)

Add a new pass to sink LayoutConversion operations instead of using
ReorderInstructions pass to do that. Eventually, we want each pass to
have a clear purpose and ability to run/test separately.

* [TRITON_KERNELS] Reland refactor of tensor/layout/distributed (#9186)

* [TRITON_KERNELS] removing debug print (#9187)

* [BACKEND] Fix tt.scan with a broadcasted layout (#9185)

We fix the case where there is broadcasting in registers or lanes.

Fixes https://github.com/triton-lang/triton/issues/7817
Supersedes https://github.com/triton-lang/triton/pull/9123

* [Backend] Fix more tt.scan layout issues (#9189)

Following https://github.com/triton-lang/triton/pull/9185, I asked codex
to find other issues with regression tests. It hacked around the issue,
but this was enough for me to find the real issue and fix it properly.
Great team work.

We should audit generally other uses of `linearize`/`delinearize` as
those that use the legacy APIs will most likely be broken when used with
broadcasted layouts.

* fix strided layout handling when setting requires_persistent (#9198)

Follow-up to https://github.com/triton-lang/triton/pull/9186:
``StridedLayout`` now has "STRIDED" as its name rather than None, so we
can no longer rely on ``layout.name is None`` to check whether the layout
is strided.
Because of this, we were hitting ``assert supports_persistent`` when
running with a strided layout on Blackwell.
Ideally, a unit test would have caught this, but unit tests currently
override ``is_persistent`` via ``constraints``, so they never take the
code path that hits ``assert supports_persistent``.

* [BACKEND] Improve and simplify ReduceOp's lowering (#9192)

We implement a LinearLayout-based `ReduceOp` lowering. This has a
number of benefits:

- The logic is noticeably simpler as we barely have to implement
anything. ConvertLayout and some LL helpers do all the heavy lifting
- We get shmem swizzling for free
- We sometimes save a shmem round-trip (before we did it
unconditionally)
- It is now clear that we have a `tmpLl` variable we can carefully
choose (we'll do so in a future PR)
- It opens the door to returning an arbitrary layout (fusing a
`convert_layout` into this op)
- It is now really simple to generalise this op to perform cross-cluster
reductions, provided that `convert_layout` supports them.
- We fix some latent issues the previous implementation had when run on
arbitrary linear layouts. We add a funky regression test that used to
fail and now passes.
- All this while being LOC-neutral!

In future PRs we will improve the choice of `tmpLl` to avoid the last
`convert_layout` in many cases, and we will pack the inputs in shmem to
be able to vectorize the loads/stores for full reductions with multiple
inputs.

This PR was the result of quite a long (but rather successful)
vibe-coding session together with `gpt-5.2-codex`. I found it
particularly useful to be able to emit a ConvertLayout within this
lowering rather than having to call the lowering of the function
manually. This simplifies the code quite a bit and I would have
struggled to convince MLIR to do so myself.

TODO: Benchmark

* [TEST] Update fresh_knobs fixture default behavior (#9184)

Refactor the default behavior of fresh_knobs in test fixtures.

`fresh_knobs` (default): Now preserves library paths (build, nvidia, amd
knobs).
- Most tests need CUDA toolkit paths to compile successfully.
- Respects environment variables like `TRITON_PTXAS_BLACKWELL_PATH`.

`fresh_knobs_including_libraries` (new): Resets ALL knobs including
library paths.

* Guard sync point in launch_metadata with launch_metadata_allow_sync (#9197)

* [AMD] relax padded layout heuristics to smaller block size (#9074)

By leveraging the wrap-around caused by padding, we can still get a
bank-conflict-free padded shared layout when the block size is smaller
than 16KB.
Take Mx64xbf16, k contiguous, kWidth=8, mfma16x16 as an example (rX
stands for row X): the minimal block size can be 32x64.
The padding here is set to 16 elements (32 bytes) to avoid bank conflicts.
We can pack r0, r4, r8, r12, r16, r20, r24, r28 to compose a contiguous tile:
```
r0[0+], r0[8+],
                r1[0+], r1[8+],
                                r2[0+], r2[8+],
                                                r3[0+], r3[8+],
r4[0+], r4[8+],
                r5[0+], r5[8+],
                                r6[0+], r6[8+],
                                                r7[0+], r7[8+],
r8[0+], r8[8+],
```
In LDS, the rows are arranged as below:
```
r0,  r4,  r8,  r12, r16, r20, r24, r28
pad, r1,  r5,  r9,  r13, r17, r21, r25
r29, pad, r2,  r6,  r10, r14, r18, r22,
r26, r30, pad, r3,  r7,  r11, r15, r19,
r23, r27, r31
```

* Reverting 77f610a - Adding clone for triton_kernels.tensor.Tensor (#9208)

Clone turned out not to be needed for now, reverting to limit
complexity.

* [AMD] ReorderInstructions: Remove moveUpTranspose optimization (#9204)

This optimization moves TransOps closer to their defs, but that doesn't
actually have any impact on the generated code because the actual
transpose code is generated by ConvertLayout ops.

* [AMD] skip test_gather[src_shape2-indices_shape2-0] for RDNA (#9210)

The test_gather[src_shape2-indices_shape2-0] test fails on RDNA3 and
RDNA4 GPUs with:
triton.runtime.errors.OutOfResources: out of resource: shared memory,
Required: 131072, Hardware limit: 65536.

Extend the existing skip condition (which already covers CDNA2 and
CDNA3) to also include RDNA3 and RDNA4 GPUs.

* [Backend] Bump to llvm/llvm-project@8f264586d752 (#8987)

This picks up several ROCDL changes:
llvm/llvm-project#171810
llvm/llvm-project#171449
llvm/llvm-project#169672

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* Revert "[BACKEND] Improve and simplify ReduceOp's lowering" (#9214)

Temporarily revert triton-lang/triton#9192 because it changes the numerics.

* [TRITON_KERNELS] remove warning (#9213)

* [LAYOUTS] Tighten LinearEncoding checks (#9215)

Before, we mistakenly allowed repeated non-zero bases

* Skip InsertTmemAref if function does not use WS (#9212)

Most autoWS passes automatically skip code that does not use autoWS, but
InsertTmemAref does not end up doing that because it needs to examine
`TMEMAlloc`s that occur outside WS regions as well. This can cause
assertion failures when the assumptions baked into the autoWS
implementation are violated (e.g. that an alloc must only have a single
use if there is no token).

* [AMD][NFC] Replace llvm intrinsic calls with rocdl ops (#9222)

This PR replaces amdgpu intrinsic calls with ROCDL ops that were recently
exposed in the ROCDL dialect in MLIR.

* Allow proton to record metadata for selective kernels (#9158)

* [Nvidia] Enable TMA im2col mode -- AsyncTMACopyGlobalToLocalOp (#9202)

### Summary 
This is the first PR in a series that enables TMA im2col mode (in
addition to the existing tiled mode) for NVIDIA GPUs. The overall goal
of the series is to support TMA im2col mode in the Triton compiler and
the Gluon DSL.

The PTX ISA documentation for TMA im2col mode is available here:
https://…
agron911 added a commit to agron911/triton that referenced this pull request Apr 24, 2026
…bit dot precision to TF32x3 (#9080)' (facebookexperimental#1329)

Summary:

This is a cherry-pick of an upstream PR: triton-lang/triton#9080

Upstream commit message:
```
> [LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line and updated docstring to use tf32x3 instead of tf32. Did not add the upstream-introduced assert/if input_precision body code, since the local code path delegates input_precision processing to semantic.py.
  Reason: The local file was refactored to move input_precision default-setting logic from core.py.dot() to semantic.py. Adding the upstream body code here would duplicate logic and be unreachable.
- File: python/triton/language/semantic.py
  Action: Updated supports_tf32 check and default value from "tf32" to "tf32x3" in the input_precision branch of the dot() method.
  Reason: This file holds the actual default-precision logic locally; matching upstream's intent of changing the default precision from tf32 to tf32x3 requires updating it here.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283333395/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283336430/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283337118/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 63b387c

Reviewed By: sfzhu93

Differential Revision: D101982808
agron911 added a commit to agron911/triton that referenced this pull request Apr 24, 2026
…bit dot precision to TF32x3 (#9080)' (facebookexperimental#1329)

Summary:
Pull Request resolved: facebookexperimental#1329

This is a cherry-pick of an upstream PR: triton-lang/triton#9080

Upstream commit message:
```
> [LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line and updated docstring to use tf32x3 instead of tf32. Did not add the upstream-introduced assert/if input_precision body code, since the local code path delegates input_precision processing to semantic.py.
  Reason: The local file was refactored to move input_precision default-setting logic from core.py.dot() to semantic.py. Adding the upstream body code here would duplicate logic and be unreachable.
- File: python/triton/language/semantic.py
  Action: Updated supports_tf32 check and default value from "tf32" to "tf32x3" in the input_precision branch of the dot() method.
  Reason: This file holds the actual default-precision logic locally; matching upstream's intent of changing the default precision from tf32 to tf32x3 requires updating it here.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283333395/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283336430/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283337118/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 63b387c

Reviewed By: sfzhu93

Differential Revision: D101982808
agron911 added a commit to agron911/triton that referenced this pull request Apr 24, 2026
…ault 32-bit dot precision to TF32x3" (#9090)' (facebookexperimental#1330)

Summary:
Pull Request resolved: facebookexperimental#1330

This is a cherry-pick of an upstream PR: triton-lang/triton#9090

Upstream commit message:
```
> Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3" (#9090)
>
> Reverts triton-lang/triton#9080 as it cause some tmem allocation
> regression due to simplistic hoisting logic
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line. Reverted docstring lines from tf32x3 back to tf32 to match upstream's revert.
  Reason: Same divergence as the original cherry-pick of #9080 (assert/if input_precision body lives in semantic.py locally). Maintained that local refactor by reverting only the docstring here.
- File: python/triton/language/semantic.py
  Action: Reverted supports_tf32 check and default value from "tf32x3" back to "tf32" in the input_precision branch of the dot() method.
  Reason: Mirror revert: the prior cherry-pick of #9080 applied the tf32x3 change here (instead of upstream's core.py location); this revert undoes it in the same place.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283342039/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283342643/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283343204/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 606eebc

Reviewed By: sfzhu93

Differential Revision: D101982800
meta-codesync Bot pushed a commit to facebookexperimental/triton that referenced this pull request Apr 24, 2026
…bit dot precision to TF32x3 (#9080)' (#1329)

Summary:
Pull Request resolved: #1329

This is a cherry-pick of an upstream PR: triton-lang/triton#9080

Upstream commit message:
```
> [LANGUAGE] change default 32-bit dot precision to TF32x3 (#9080)
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line and updated docstring to use tf32x3 instead of tf32. Did not add the upstream-introduced assert/if input_precision body code, since the local code path delegates input_precision processing to semantic.py.
  Reason: The local file was refactored to move input_precision default-setting logic from core.py.dot() to semantic.py. Adding the upstream body code here would duplicate logic and be unreachable.
- File: python/triton/language/semantic.py
  Action: Updated supports_tf32 check and default value from "tf32" to "tf32x3" in the input_precision branch of the dot() method.
  Reason: This file holds the actual default-precision logic locally; matching upstream's intent of changing the default precision from tf32 to tf32x3 requires updating it here.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283333395/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283336430/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283337118/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 63b387c

Reviewed By: sfzhu93

Differential Revision: D101982808

fbshipit-source-id: 89337aa0f9307edb2f456fb38cc7f2c83418e991
meta-codesync Bot pushed a commit to facebookexperimental/triton that referenced this pull request Apr 24, 2026
…ault 32-bit dot precision to TF32x3" (#9090)' (#1330)

Summary:
Pull Request resolved: #1330

This is a cherry-pick of an upstream PR: triton-lang/triton#9090

Upstream commit message:
```
> Revert "[LANGUAGE] change default 32-bit dot precision to TF32x3" (#9090)
>
> Reverts triton-lang/triton#9080 as it cause some tmem allocation
> regression due to simplistic hoisting logic
```

Conflict Resolution:
- File: python/triton/language/core.py
  Action: Removed conflict markers; kept the local "where the first dimension..." line. Reverted docstring lines from tf32x3 back to tf32 to match upstream's revert.
  Reason: Same divergence as the original cherry-pick of #9080 (assert/if input_precision body lives in semantic.py locally). Maintained that local refactor by reverting only the docstring here.
- File: python/triton/language/semantic.py
  Action: Reverted supports_tf32 check and default value from "tf32x3" back to "tf32" in the input_precision branch of the dot() method.
  Reason: Mirror revert: the prior cherry-pick of #9080 applied the tf32x3 change here (instead of upstream's core.py location); this revert undoes it in the same place.

Raw Conflicts: https://www.internalfb.com/intern/paste/P2283342039/
Resolution Diff: https://www.internalfb.com/intern/paste/P2283342643/
Diff Comparison: https://www.internalfb.com/intern/paste/P2283343204/

***Do not remove the following line from this commit***
Reactor Cherry-pick Revision: 606eebc

Reviewed By: sfzhu93

Differential Revision: D101982800

fbshipit-source-id: fd9eeb84ccb01ad8f126269c5c88fd71d76114ba
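
For context on the precision mode these commits switch on and off, the sketch below illustrates the general idea behind a 3xTF32 dot product: each fp32 operand is split into a "big" TF32 part and a "small" TF32 remainder, and three TF32 products are summed. It is an illustration under stated assumptions, not Triton's implementation: the helper names are made up, TF32 rounding is modeled by truncating the low 13 mantissa bits of a float32, and accumulation happens in Python floats rather than in fp32.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def trunc_tf32(x):
    """Keep the sign, exponent, and top 10 mantissa bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]


def split(x):
    """Split x into a TF32 'big' part and a TF32 'small' remainder."""
    big = trunc_tf32(x)
    small = trunc_tf32(x - big)
    return big, small


def dot_tf32(a, b):
    """Single-pass TF32 dot product: both operands lose their low bits."""
    return sum(trunc_tf32(x) * trunc_tf32(y) for x, y in zip(a, b))


def dot_tf32x3(a, b):
    """Three TF32 products per element; the small*small term is dropped."""
    total = 0.0
    for x, y in zip(a, b):
        xb, xs = split(x)
        yb, ys = split(y)
        total += xs * yb + xb * ys + xb * yb
    return total


a = [to_f32(v) for v in (0.1, 0.2, 0.3, 0.7)]
b = [to_f32(v) for v in (1.1, 1.3, 1.7, 1.9)]
exact = sum(x * y for x, y in zip(a, b))

err_tf32 = abs(dot_tf32(a, b) - exact)
err_tf32x3 = abs(dot_tf32x3(a, b) - exact)
print(f"tf32 error:   {err_tf32:.3e}")
print(f"tf32x3 error: {err_tf32x3:.3e}")
```

The three-product variant recovers most of the bits that a single TF32 pass discards, at the cost of roughly three times the tensor-core work.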