[pull] main from triton-lang:main #1

Open

pull[bot] wants to merge 3395 commits into maryamtahhan:main from triton-lang:main
Conversation

pull[bot] commented Apr 30, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

pull[bot] added the ⤵️ pull and merge-conflict (Resolve conflicts manually) labels Apr 30, 2025
plotfi and others added 28 commits April 8, 2026 09:30
This PR adds version checking. If `TRITON_PLUGIN_VERSION_CHECK` is unset,
only the release version is checked (i.e. that both the plugin and Triton
were built on, say, version "3.7.0"). If `TRITON_PLUGIN_VERSION_CHECK=on`,
both the release version and the git hash built from are checked to match.
Finally, if `TRITON_PLUGIN_VERSION_CHECK=off`, no version checking is
performed beyond the plugin API version itself.
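A minimal sketch of the three modes, with a hypothetical `check_plugin_version` helper (not the actual plugin-loader code):

```python
import os

def check_plugin_version(plugin_release, plugin_hash, triton_release, triton_hash):
    # Hypothetical helper modeling the TRITON_PLUGIN_VERSION_CHECK modes.
    mode = os.environ.get("TRITON_PLUGIN_VERSION_CHECK", "")
    if mode == "off":
        return  # only the plugin API version itself is still checked
    if plugin_release != triton_release:
        raise RuntimeError(f"release mismatch: {plugin_release} != {triton_release}")
    if mode == "on" and plugin_hash != triton_hash:
        raise RuntimeError(f"git hash mismatch: {plugin_hash} != {triton_hash}")
```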
tl.rand returns values in the half-open interval [0, 1), not the closed
interval [0, 1]. The test assertion used x <= 1, which would have
accepted 1.0 as a valid output.

Philox-based float generation masks the upper mantissa bits and sets the
exponent to produce values strictly less than 1.0. A value of exactly
1.0 indicates a bug (e.g. broken seed delivery returning all-ones from
the uint-to-float conversion).
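For illustration, the tightened bound check looks like this (a sketch over a NumPy array of kernel outputs, not the actual test code):

```python
import numpy as np

def check_rand_range(x: np.ndarray) -> None:
    # tl.rand draws from the half-open interval [0, 1); exactly 1.0 would
    # indicate a bug such as broken seed delivery returning all-ones from
    # the uint-to-float conversion.
    assert np.all(x >= 0.0) and np.all(x < 1.0)
```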


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because it's fixing a test.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

Co-authored-by: Wes Turner <westurner@users.noreply.github.com>
We do so by following the PTX docs and our LLVM lowerings:
1. In PTX, a barrier flips when both `arrivals == 0` and `tx-count == 0`.
2. In PTX, an expect implies a commit of 1.
3. In Triton, we lower `ttng.expect` as an expect on the leader CTA and as a commit of 1.

We model all these points in consan.
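A one-line model of point 1 as consan tracks it (a sketch, not the actual consan code):

```python
def barrier_flips(arrivals: int, tx_count: int) -> bool:
    # PTX mbarrier phase-flip condition: both the pending arrivals and the
    # expected transaction byte count must reach zero.
    return arrivals == 0 and tx_count == 0
```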
This commit changes the `tt.tensordesc` type to be directly
composed of shape, element type, and shared layout.
It drops the previous tensor-type wrapper around them.
The main reason is that the shared memory layout encoding
attached inside the tensor type was confusing, since the layout
describes shared memory, not the tensor.

`!tt.tensordesc<tensor<128x64xf16, #shared>>` now becomes
`!tt.tensordesc<128x64xf16, #shared>`. This is also closer to
`ttg.memdesc`.
When backward remat identifies an operand that is already in the desired
encoding, it skips that operand in the slice. However, we do not record
the desired layout for that value in `layouts`.

As a result, we cannot detect if we also encounter a conflicting
encoding for that same value. When that happens, the value is
rematerialised, resulting in an encoding mismatch with the user that
expected the original encoding.

To fix this, record the original encoding in `layouts`, without adding
it to the slice. This way, if a conflicting encoding is encountered, we
will just bail on backward remat.
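A minimal sketch of the bookkeeping, with hypothetical types standing in for the real `layouts` map:

```python
layouts: dict[str, str] = {}  # value -> desired encoding (hypothetical types)

def record_layout(value: str, encoding: str) -> bool:
    # Record the encoding even when the operand is skipped from the slice;
    # a later conflicting encoding for the same value is then detected and
    # backward remat bails out instead of producing an encoding mismatch.
    return layouts.setdefault(value, encoding) == encoding
```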
…9934)

We add proper support for `expect` by tracking the bytes being expected
on a barrier, not only the commits. We follow the PTX semantics rather
closely.

In the second commit, we add support for TMA multicast. We also fix the
fanout CTA masks that were missing in the transfer visibility functions.

We add a few tests for consan with tcgen05 / TMA ops, with / without
multicast / twoCTA mode.

---------

Co-authored-by: Codex <noreply@openai.com>
There are mainly three incremental improvements to the MoE module,
benchmarked by the `bench_mlp.py` script via `torchrun --nproc-per-node=1
python/triton_kernels/bench/bench_mlp.py`.

The main improvements come from 1) offline shuffling for MX4 weights; 2)
disabling block-shape swap to allow num_stages=5 for FC2; 3) disabling
block-shape swap and enabling epilogue subtiling to allow num_stages=5
for FC1.

Perf results measured on a single B200 (for batch_per_expt = [1, 2, 4, 8, 16, 32, 64]):

```
torchrun --nproc-per-node=1 python/triton_kernels/bench/bench_mlp.py

=========================================
logs/gpt-oss-x2/fp8x-fp8w-EP1.csv...
=========================================
batch_per_expt:     1 | MS: 70.19 | TFLOPS: 18.22 | TBPS: 5.89
batch_per_expt:     2 | MS: 89.87 | TFLOPS: 28.46 | TBPS: 6.04
batch_per_expt:     4 | MS: 103.82 | TFLOPS: 49.27 | TBPS: 6.10
batch_per_expt:     8 | MS: 105.98 | TFLOPS: 96.52 | TBPS: 6.03
batch_per_expt:    16 | MS: 108.33 | TFLOPS: 188.9 | TBPS: 5.93
batch_per_expt:    32 | MS: 112.29 | TFLOPS: 364.4 | TBPS: 5.76
batch_per_expt:    64 | MS: 126.07 | TFLOPS: 649.1 | TBPS: 5.19
=========================================
logs/gpt-oss-x2/fp8x-mx4w-EP1.csv...
=========================================
batch_per_expt:     1 | MS: 43.55 | TFLOPS: 29.36 | TBPS: 4.51
batch_per_expt:     2 | MS: 56.66 | TFLOPS: 45.14 | TBPS: 4.93
batch_per_expt:     4 | MS: 64.89 | TFLOPS: 78.83 | TBPS: 5.03
batch_per_expt:     8 | MS: 66.92 | TFLOPS: 152.9 | TBPS: 4.97
batch_per_expt:    16 | MS: 67.62 | TFLOPS: 302.6 | TBPS: 4.96
batch_per_expt:    32 | MS: 70.89 | TFLOPS: 577.3 | TBPS: 4.79
batch_per_expt:    64 | MS: 90.00 | TFLOPS: 909.4 | TBPS: 3.87
=========================================
logs/gpt-oss-x2/fp8x-mx4w-EP1-shuffled.csv...
=========================================
batch_per_expt:     1 | MS: 42.54 | TFLOPS: 30.06 | TBPS: 5.03
batch_per_expt:     2 | MS: 55.23 | TFLOPS: 46.31 | TBPS: 5.36
batch_per_expt:     4 | MS: 61.35 | TFLOPS: 83.37 | TBPS: 5.40
batch_per_expt:     8 | MS: 62.00 | TFLOPS: 165.0 | TBPS: 5.49
batch_per_expt:    16 | MS: 63.62 | TFLOPS: 321.6 | TBPS: 5.38
batch_per_expt:    32 | MS: 66.58 | TFLOPS: 614.5 | TBPS: 5.21
batch_per_expt:    64 | MS: 85.04 | TFLOPS: 962.4 | TBPS: 4.18
=========================================
logs/gpt-oss-x2/fp8x-mx4w-EP1-shuffled-fc2stages5.csv...
=========================================
batch_per_expt:     1 | MS: 39.80 | TFLOPS: 32.13 | TBPS: 5.11
batch_per_expt:     2 | MS: 56.78 | TFLOPS: 45.04 | TBPS: 5.45
batch_per_expt:     4 | MS: 60.13 | TFLOPS: 85.07 | TBPS: 5.46
batch_per_expt:     8 | MS: 60.95 | TFLOPS: 167.8 | TBPS: 5.58
batch_per_expt:    16 | MS: 62.56 | TFLOPS: 327.0 | TBPS: 5.47
batch_per_expt:    32 | MS: 65.33 | TFLOPS: 626.4 | TBPS: 5.31
batch_per_expt:    64 | MS: 82.49 | TFLOPS: 992.1 | TBPS: 4.31
=========================================
logs/gpt-oss-x2/fp8x-mx4w-EP1-shuffled-fc1stages5-fc2stages5-fc1subtile2.csv...
=========================================
batch_per_expt:     1 | MS: 40.27 | TFLOPS: 31.75 | TBPS: 5.19
batch_per_expt:     2 | MS: 52.45 | TFLOPS: 48.76 | TBPS: 5.60
batch_per_expt:     4 | MS: 58.43 | TFLOPS: 87.54 | TBPS: 5.67
batch_per_expt:     8 | MS: 60.16 | TFLOPS: 170.1 | TBPS: 5.65
batch_per_expt:    16 | MS: 60.75 | TFLOPS: 336.8 | TBPS: 5.64
batch_per_expt:    32 | MS: 63.28 | TFLOPS: 646.6 | TBPS: 5.48
batch_per_expt:    64 | MS: 78.85 | TFLOPS: 1038. | TBPS: 4.51
```



---------

Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
In general, split-k was disabled in the MLP matmuls because, when the
matmul is ragged, it is not trivial to statically choose the split-k
factor (though it would be possible to do so dynamically).

In the small-batch, non-ragged case, though, it is simple to allow
split-k. This PR does that, adds some basic heuristic tuning for such
cases, and optimizes the split-k reduce itself.

These changes exposed a bug in the smem accounting heuristics: we
weren't counting the smem needed to perform SWAP_XW. This change fixes
that.

Perf will probably get even better after integrating the shuffled mxfp4
weight layout from #9698.
Add a canonicalization pattern for IntToPtrOp that recognizes the pattern:

    int_to_ptr(addi(ptr_to_int(ptr), constant_offset))

and transforms it to:

    addptr(ptr, element_offset)

where `element_offset = constant_offset / element_size_bytes`.

This pattern appears when performing pointer arithmetic via integer
operations (e.g., adding byte offsets to pointers). By canonicalizing
to addptr, AxisInfoAnalysis can correctly track contiguity, enabling
proper vectorization for operations like async_copy_local_to_global.

The pattern only applies when:
- The offset is a compile-time constant (IntegerAttr or SplatElementsAttr)
- The byte offset is evenly divisible by the element size

Added to both the standard canonicalize and gluon-canonicalize passes.

Tests added for positive cases (f32, f16, commutative) and negative
cases (non-constant offset, indivisible offset).
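For example, the divisibility rule and offset conversion amount to the following (a sketch of the arithmetic, not the C++ pattern):

```python
def byte_to_element_offset(byte_offset: int, element_size_bytes: int):
    # The rewrite only fires when the byte offset divides evenly by the
    # element size; otherwise the int_to_ptr chain is left untouched.
    if byte_offset % element_size_bytes != 0:
        return None
    return byte_offset // element_size_bytes

assert byte_to_element_offset(8, 4) == 2   # +8 bytes on an f32 ptr -> addptr(ptr, 2)
assert byte_to_element_offset(6, 4) is None
```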
As per the PTX docs, TMAs have a very specific behaviour when executed
in a 2CTA kernel:

>.cta_group::1 : The mbarrier signal is also multicasted to the same
offset as mbar in the shared memory of the destination CTA.
.cta_group::2 : The mbarrier signal is multicasted either to all the odd
numbered CTAs or the even numbered CTAs within the corresponding
CTA-Pair. For each destination CTA specified in the ctaMask, the
mbarrier signal is sent either to the destination CTA or its peer-CTA,
based on the %cluster_ctarank parity of the CTA in whose shared memory
the mbarrier object mbar resides.

As such, we require these CTA layouts in TMA barriers.
A `nvmma_shared` layout with swizzle=0 represents a flat, contiguous
layout. This is valid for TMA but it is never the correct layout for
Hopper WGMMA and Blackwell tcgen05 MMA instructions, since operand SMEMs
with swizzle=0 are required to be in the special layout ("core matrices"
format). Being able to use swizzle=0 for MMA is useful in case other
swizzling modes cannot be applied for some reason.

In principle, if the operand in gmem is already in the right format that
MMA expects, using TMA with swizzle 0 and directly feeding the result
into MMA can work. But currently I'm not aware of a way to express that
in Triton. This PR adds a key rewrite pass that enables that. The idea
is the same as how we enable blocked-scale load via TMA and `tmem_copy`.
See the attached test case for a fully worked-out example.
* Users must prepare the operand in gmem in the correct, swizzle-0
core-matrices format. Conceptually, an operand would have a shape like
`(num_blocks_m, num_blocks_k, num_cm_m, num_cm_k, 16, 8)` for fp8.
* In a kernel, use `tt.trans` and `tt.reshape` to "undo" the
core-matrices transformation. The op sequence looks like
`desc_load<swizzle=0> -> tt.reshape / tt.trans -> local_alloc -> mma`
* The new rewrite pass bubbles up `local_alloc` so that it immediately
follows `desc_load`. It also lifts those reshape / trans into
transformations on memdesc. The op sequence now looks like
`desc_load<swizzle=0> -> local_alloc<swizzle=0> -> memdesc reshape /
trans -> mma<swizzle=0>`.
* Thanks to the LinearLayout inference machinery on memdesc reshape /
trans, the MMA now gets an operand smem with a special `#shared_linear`
layout. If the `tt.trans` and `tt.reshape` transformations the user
specified are correct, the operand linear layout is correctly identified
as representing the swizzle-0 core-matrices format and lowered to the
corresponding SMEM descriptor.
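A host-side NumPy illustration of "undoing" the core-matrices format for one block; the dimension names follow the shape above, and the exact transpose order is an assumption:

```python
import numpy as np

num_cm_m, num_cm_k = 4, 8  # core matrices per block (example values)
# One block as loaded by desc_load<swizzle=0>, in core-matrices order:
cm = np.arange(num_cm_m * num_cm_k * 16 * 8).reshape(num_cm_m, num_cm_k, 16, 8)
# tt.trans / tt.reshape analogue recovering the logical (M, K) tile:
tile = cm.transpose(0, 2, 1, 3).reshape(num_cm_m * 16, num_cm_k * 8)
```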
There is no reason why we should fuse these two. tcgen05_mma has a fused
barrier because that way the codegen is better, but this is not the case
for TMEMCopy. Also we rarely would want to do a tmem_copy and then wait
on it without a tcgen05_mma, so out it goes.
Simplify use of shuffled Blackwell MX value weights.

- Convert directly to BlackwellMX4ValueShuffledLayout; don't require
first going through BlackwellValueLayout.
- Use block sizes from BlackwellMX4ValueShuffledLayout as opt-flag
constraints. This removes the complicated code needed to infer the block
sizes before making the layout, and picks a better default of block_n =
256, block_k = 128, which generally works well and is the inferred one
except in cases where N < 256. It also makes the layout simpler to just
use, instead of also needing to override disable_mx4_block_swap = True
when shuffled weights are used.
- Add more test coverage.

Perf is unchanged when running `torchrun --nproc-per-node=1
python/triton_kernels/bench/bench_mlp.py`.
These now obey the usual algebraic properties, namely the trigonometric
angle sum identities, reflection formulae, and the `sin(x)^2 + cos(x)^2
== 1` norm condition. Unit tests are included to verify these.

This is accomplished by taking the coefficients of `(-3/5 + 4/5 i)^x` in
the quadratic extension of the payload domain by a formal symbol `i`
satisfying `i^2 == -1`.
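Since |-3/5 + 4/5 i| = 1, every power of it stays on the unit circle, which is exactly the norm condition; a quick illustrative check in plain Python (not the kernel code):

```python
import math

c = complex(-3 / 5, 4 / 5)  # (3/5)**2 + (4/5)**2 == 1, so |c| == 1
for x in range(8):
    z = c ** x  # cos-like part: z.real, sin-like part: z.imag
    assert math.isclose(z.real ** 2 + z.imag ** 2, 1.0)
    # the angle-sum identities follow from c**(x + y) == c**x * c**y
```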
…10010)

SM121 (GB10 DGX Spark) supports the same mma.sync block-scaled
instructions as SM120 (RTX 5090) but was excluded from the native
lowering path by exact compute capability checks.

Without this fix, dot_scaled on SM121 falls through to
DecomposeScaledBlocked, which upcasts to bf16 (~10 TFLOPS vs ~270 TFLOPS
with native mma.sync block-scaled FP4).

Tested on GB10 with both MXFP4 (scale_vec::2X, ue8m0) and NVFP4
(scale_vec::4X, ue4m3).
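The relaxed gate amounts to treating both capabilities alike (a sketch; the actual check lives in the lowering code):

```python
def has_native_block_scaled_mma(capability: int) -> bool:
    # SM121 (GB10) supports the same mma.sync block-scaled instructions
    # as SM120 (RTX 5090), so both take the native lowering path.
    return capability in (120, 121)
```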


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because current test paths cover the flow, though there are no GB10s in CI to verify; AFAIK it does work for me.

- Select one of the following.
  - [x] I have not added any `lit` tests.
Fpsan emulates tmem with global scratch. Each tmem buffer is being
mapped to a scratch allocation. There was an issue in how the mapping
was created for WS partitions, with the same tmem buffer being mapped to
distinct scratch allocations. We track the "canonical" tmem allocation
now, to make sure that tmem aliases in different regions, even though
they are represented by different SSA values, are getting mapped to the
original allocation correctly.

---------

Co-authored-by: root <root@codex-gb200-0.brix.pawelszczerbuk.svc.cluster.local>
Co-authored-by: Codex <noreply@openai.com>

---------

Co-authored-by: Codex <noreply@openai.com>
roman-openai and others added 30 commits May 11, 2026 12:03
Fix Hopper MX scale layout conversion for zero-sized scale tensors.

`torch.nn.functional.pad` rejects some valid zero-numel outputs. This
shows up in Hopper scale swizzle after RHS transpose: for example, `(0,
64)` becomes `(64, 0)`, and row padding still leaves a zero-width
output.

The fix mirrors Blackwell scale layout: skip `pad` when there are no
elements, update `M`/`K` to the padded extents, and let the existing
zero-element reshape produce the swizzled storage shape.

Also relax only `StridedLayoutTransformation.swizzle_data` for
zero-numel input. Hopper scale roundtrip can produce a valid empty
canonical tensor such as `(2, 0)` with `stride(-1) == 2`; there are no
elements for the packed-stride invariant to constrain.

Validation:
- H100: `PYTHONPATH=$PWD/python/triton_kernels python -m pytest -q
python/triton_kernels/tests/test_tensor_details/test_layout_hopper.py::test_mxfp4_scale_zero_sized_roundtrip`
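A sketch of the pad-skip, assuming a 2D scale tensor and a hypothetical helper name:

```python
import torch
import torch.nn.functional as F

def pad_scale_rows(x: torch.Tensor, pad_rows: int) -> torch.Tensor:
    # F.pad rejects some valid zero-numel outputs (e.g. row-padding a
    # (64, 0) tensor), so skip it and build the padded-extent empty
    # tensor directly; the zero-element reshape downstream still works.
    if x.numel() == 0:
        return x.new_zeros(x.shape[0] + pad_rows, x.shape[1])
    return F.pad(x, (0, 0, 0, pad_rows))
```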
The GSan runtime is compiled with --cuda-gpu-arch=sm_80, which requires
PTX 7.0 or newer. Most LLVM builds infer a compatible PTX version
automatically, but some do not and will yield the following error
message:

```
PTX version 4.2 does not support target 'sm_80'. Minimum required PTX version is 7.0.
```
Here's another ease-of-use commit that I'm separating out from an
unrelated change. This makes it easy to test Triton's plugin system with
a single command: `make test-plugins`.
Determining whether an op is used outside of the slice being
rematerialised is currently conservative. For instance, if an op is used
by an `scf.yield`, we automatically treat that as a non-slice user, even
if the value that the yield flows into is actually part of the slice or
completely unused.

To address this, trace through region control flow when deciding whether
an op is single use.
TDM supports strides up to 48 bits, but we silently truncated them to
32 bits. This happened in the lowering and in the creation of host-side
descriptors, so this PR adjusts both to preserve 48-bit strides and adds
range checks for host-side descriptors.
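The host-side range check is essentially the following (a sketch):

```python
def check_tdm_stride(stride: int) -> None:
    # TDM strides are 48-bit; reject values that would previously have
    # been silently truncated to 32 bits.
    if not 0 <= stride < (1 << 48):
        raise ValueError(f"TDM stride {stride} exceeds the 48-bit range")
```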
This pass is causing more difficulties than it helps.
The conversion patterns in `TritonToTritonGPU` (`TritonFuncOpPattern`,
`TritonCallOpPattern`, `TritonReturnOpPattern`) did not handle `tt.func`
ops returning tensors: `TritonFuncOpPattern` reused the original
`FunctionType` verbatim, and the `triton::FuncOp` legality check
inspected only argument types, so tensor result types came out without
any layout encoding. There was already a fork of MLIR upstream's
signature-conversion pattern in
`Dialect/Triton/Transforms/FunctionTypeConversion.h` (added because
upstream is unaware of `tt.call`/`tt.return`) which handles inputs,
results, one-to-many conversions, and arg-attribute remapping — so this
PR drops those three custom patterns and reuses this one. The legality
predicate is also extended to require encodings on result tensor types.

Adds a regression test in `test/Conversion/triton_to_tritongpu.mlir`
covering a `tt.func` returning a tensor and a `tt.call` site picking up
the encoded result type.

Making this PR on behalf of @saagarjha (he is waiting on #8913)

- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] I have added tests.
- [x] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

---------

Co-authored-by: Saagar Jha <saagar@saagarjha.com>
… pipeline support (#9929)

### Summary

- **Flat (unrolled) pipeline support**: Extend `ConvertWarpPipeline` to
handle sequences of `scf.execute_region` ops outside `scf.for` (produced
by `WarpPipeliner::createFlatPipeline`). Emit pre-barrier, phase shift,
cluster barriers, and reconverge around them.
- **Eliminate redundant barriers between back-to-back pipelines**: When
two warp-pipelined regions are adjacent with no intervening operations,
the post-loop reconverge + prelude barrier + phase shift cancel out. The
phase from the first pipeline carries over naturally.
- **Cross-pipeline LDS dependency analysis**: Before eliminating
boundary barriers, verify that no uncovered LDS hazard exists at the
merge point. Concatenates cluster infos from both pipelines and runs
`analyzePipelineDependencies` on the merged sequence. Skips the
optimization when a dependency is found.
- **Adjacent-stage dependency check**: Add a distance-1 check to
`analyzePipelineDependencies`. The existing loop only checked pairs at
distance 2+, so consecutive clusters sharing an LDS allocation never got
a LOCAL barrier — causing `ModuleMembarAnalysis` to insert a redundant
`ttg.barrier local` inside the pipeline.
- **Refactors**: Extract `analyzePipelineDependencies`,
`emitPipelinePrelude`/`Postlude`, and `emitClusterBarrier` helpers.

### Test plan

- `back_to_back_cross_dep_kept`: shared-buffer RAW at boundary →
barriers kept
- `back_to_back_no_dep_elimination`: loop B has no LDS → barriers
eliminated
- `back_to_back_dep_covered_elimination`: 3-stage loop A with internal
barrier covering the cross-pipeline dep → barriers eliminated
- `back_to_back_for_then_flat`: pipelined loop + flat pipeline →
barriers eliminated
- `adjacent_stage_lds_dep`: 3-stage pipeline verifying LOCAL barrier
between adjacent stages with RAW dependency
- `flat_pipeline_existing_barrier`: pre-existing `async_wait` wrapped
with `sched_barrier`
- Existing 2-stage and 3-stage pipeline tests updated
This PR adds a MoE Gluon kernel for the gfx1250 platform.
This reverts commit 4790c35. We decided
to introduce initializations with NaNs for shared and tensor memory
instead when running in consan mode.

---------

Co-authored-by: Codex <noreply@openai.com>
This PR adds an optional `warp_used_hint` attribute to
`AsyncTDMCopyGlobalToLocalOp` that enables partial TDM copies: only the
selected subset of warps performs useful TDM loads, while the rest get
`pred=0` in their descriptor (virtually a hardware no-op: the
instruction is still issued but no data is moved).
The attribute is an `i32` bitmask: bit `n` selects warp `n`. The hint is
a performance hint only; it does not change the logical copy or the data
written to shared memory. For example, with `num_warps=8`,
`warp_used_hint = 0b00001111` means warps 0-3 perform the copy and warps
4-7 are predicated off. The verifier requires the active warps to follow
a regular axis-aligned bit pattern so lowering can derive a power-of-two
active warp count and reuse the existing LinearLayout/free-variable
machinery for offset and predicate generation.

During lowering, the tensor descriptor is first represented as a base
TDM descriptor containing tensor metadata (base pointer, shape, strides,
padding). The final per-instruction hardware descriptor is completed
later when lowering each `async_tdm_copy`, where op-local fields such as
`pred`, LDS address, barrier, destination layout/partitioning, and
`tile_dim*` are known. `warp_used_hint` is ignored earlier by the base
descriptor and used only when completing those per-instruction hardware
descriptor fields, especially `tile_dim*` and `pred`.

For a hint with `K = popcount(warp_used_hint)` active warps,
`fillTDMDescriptor` re-encodes per-warp tile dimensions as `block / K`
so the selected warps still cover the same user-visible block in one TDM
instruction. This is useful when `num_warps` exceeds what is needed for
the copy.
The PR includes an example, verifier tests, lowering lit tests, and
Python coverage.
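A sketch of how the bitmask is interpreted (plain Python, not the verifier itself):

```python
def active_warps(warp_used_hint: int, num_warps: int) -> list[int]:
    # Bit n of the i32 mask selects warp n.
    return [w for w in range(num_warps) if (warp_used_hint >> w) & 1]

hint = 0b00001111                      # warps 0-3 copy; warps 4-7 get pred=0
assert active_warps(hint, 8) == [0, 1, 2, 3]
K = bin(hint).count("1")               # popcount -> 4 active warps
block_m = 128
per_warp_tile = block_m // K           # fillTDMDescriptor's block / K re-encoding
```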
True16 was disabled on gfx11 a while back due to correctness regressions
in LLVM. Those have since been resolved upstream, so this PR drops the
-real-true16 target-feature override in make_amdgcn / make_hsaco and
lets the backend emit native 16-bit ops (e.g. v_add_u16 v0.l, v1.h,
v2.h). Benchmarks across GEMM and FlashAttention show this is roughly
perf-neutral with a slight positive lean and no major regressions.

Benchmark results

GEMM (bf16, TFLOPs)

M | N | K | no_true16 | true16 | Δ%
-- | -- | -- | -- | -- | --
1024 | 1024 | 1024 | 44.13 | 43.85 | −0.65
2048 | 2048 | 2048 | 65.58 | 66.37 | +1.20
4096 | 4096 | 4096 | 78.80 | 79.88 | +1.36
8192 | 8192 | 8192 | 74.98 | 77.71 | +3.65
4096 | 11008 | 4096 | 74.62 | 75.73 | +1.49
4096 | 4096 | 11008 | 74.76 | 75.52 | +1.02
4096 | 14336 | 4096 | 74.99 | 76.88 | +2.53
4096 | 4096 | 14336 | 74.55 | 76.77 | +2.97
4096 | 12288 | 4096 | 76.61 | 77.50 | +1.16
                  
FlashAttention-2 fwd (causal=True, TFLOPs)

headdim | batch | seqlen | no_true16 | true16 | Δ%
-- | -- | -- | -- | -- | --
64 | 32 | 512 | 33.34 | 34.14 | +2.41
64 | 16 | 1024 | 42.94 | 43.47 | +1.24
64 | 8 | 2048 | 49.84 | 48.75 | −2.19
64 | 4 | 4096 | 53.32 | 52.62 | −1.31
64 | 2 | 8192 | 53.46 | 54.63 | +2.19
64 | 1 | 16384 | 50.09 | 52.35 | +4.51
64 | 1 | 32768 | 43.80 | 44.52 | +1.65
128 | 32 | 512 | 30.81 | 31.18 | +1.20
128 | 16 | 1024 | 43.24 | 44.80 | +3.61
128 | 8 | 2048 | 52.88 | 53.39 | +0.96
128 | 4 | 4096 | 58.80 | 58.87 | +0.13
128 | 2 | 8192 | 61.43 | 64.05 | +4.26
128 | 1 | 16384 | 59.82 | 61.31 | +2.49
128 | 1 | 32768 | 55.20 | 55.94 | +1.33

Co-authored-by: Saeid Rostami <srostami@amd.com>
… K=8 (#10234)

Enable `tl.dot` with TF32 precision on tiles with **N=8** and **K=8**
(e.g. `wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32`) via the
standard `tt.dot` → `AccelerateMatmul` path on sm90+.

Related to
#10060 (comment)
I am trying Triton for Finite Elements, and it does wonders! 
The matrices used in those computations are usually quite small. With
some management, it is possible to pack several operations into MMA
cores, but the tile sizes implemented were too big.

I ran the lit tests, and they are passing, so I guess the resulting IR
is the same. Added tests for the new functionality.

--- 

## Changes

### `lib/Analysis/Utility.cpp`
In `supportMMA` (version 3), relaxed the N-dimension divisibility check
from `% 16` to `% 8`:
```cpp
- retShapePerCTA[rank - 1] % 16 == 0
+ retShapePerCTA[rank - 1] % 8 == 0
```
The WGMMAv3 op verifier already required only `N % 8 == 0`, and
`mmaVersionToInstrShape` already listed `n=8` as valid. This was the
sole gatekeeper preventing N=8 tiles from using WGMMA, causing a silent
fallback to MMAv2.

### `third_party/nvidia/backend/compiler.py`
In `min_dot_size`, added an explicit case for 32-bit types (TF32/FP32):
```python
+ elif lhs_bitwidth == 32:
+     return (1, 1, 8)
```
The TF32 hardware instruction has K=8. The previous fallthrough to the
`else` branch returned `K >= 16`, blocking compilation of K=8 TF32
kernels.

### `python/test/unit/language/test_core.py`
Added `test_dot_wgmma_tf32_n8k8` parametrized over `M ∈ {64, 128}`,
verifying both numerical correctness and that the emitted PTX contains
`wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32`.


# New contributor declaration
- [X] I am not making a trivial change, such as fixing a typo in a
comment.

- [X] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)
`python/test/unit/tools/test_triton_to_gluon.py` fails on AMD RDNA3 (gfx11xx)
and RDNA4 (gfx120x) hardware because `TranslatorTarget` only enumerates
the gfx targets the translator currently supports. Every test in the file
errors at setup with:

    ValueError: 'gfx1201' is not a valid TranslatorTarget

(observed as 9 failures on a gfx1201/gfx1100 box).

This PR adds a module-level `pytestmark` that skips the file when
`is_hip_rdna()` is true, so CI / local runs on RDNA hardware are clean.
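The added skip is essentially the following (the `is_hip_rdna` import path is an assumption):

```python
import pytest
from triton._internal_testing import is_hip_rdna  # assumed location of the helper

# Module-level mark: skip every test in this file on RDNA hardware.
pytestmark = pytest.mark.skipif(
    is_hip_rdna(), reason="TranslatorTarget does not enumerate RDNA gfx targets"
)
```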
Since we now generate arith.negf, it needs to be supported by fpsan.
We also check that when reinterpreting a pipelining buffer, the initial
dimensions are the same.

Addresses
#10243 (comment)
Implement InterpreterBuilder.create_dot_scaled.

deduceScaleFactor relied on mlir::Values. Refactor to be independent of
mlir so it can be called when using the interpreter.

Mark several tests that use tl.dot_scaled to run during the interpreter
tests.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because existing tests have been
marked to run with the interpreter which covers the new functionality.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)
#9704)

Does not need `LD_PRELOAD`. Initializes rocprofiler-sdk on Proton
profiler import, which does a lightweight one-time initialization that
replaces the `hsa_queue_create_fn` pointer in the HSA table and creates
& registers our 2 SDK contexts. This must happen before any GPU
operation creates an HSA queue, because the SDK can only intercept
queues created after the replacement. Queues that already exist are
plain HSA queues and invisible to the profiler.

Context 1: `codeObjectContext`, used for kernel registration names.
Context 2: `profilingContext`, which does the heavy work and is started
only between `doStart()` and `doStop()`.

Between Proton `start()` and `end()`, the SDK WriteInterceptor brackets
the intervening dispatches with barrier packets.

Rewrote `_select_backend()` to avoid calling `get_current_target()`,
which triggers HIP runtime init before `force_configure` can run.

Tested locally by building rocm-systems from main with the
`TRITON_ROCPROFILER_SDK..` env vars.

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>
…10317)

We "sort" the mask index so that rows with similar number of active
input rows are grouped together. Here, "number of active input rows" are
defined as `mask[:, :, 0].sum(dim=1)`, assuming the reduction dimension
is 1 and the mask is broadcast over dim=2.

Then, for each block, we call the main logic with hard-coded loop bound,
determined by the max number of active input rows.

Benchmark on GB300:

```
B=8192,  M=4, N=8192, dim=1, dtype=float16, mask=broadcast_n 7983 --> 8271 GB/s (+3.6%)
B=16384, M=4, N=8192, dim=1, dtype=float16, mask=broadcast_n 8501 --> 8947 GB/s (+5.2%)
B=65536, M=4, N=8192, dim=1, dtype=float16, mask=broadcast_n 8998 --> 9650 GB/s (+7.2%)
```
…les (#10310)

For TDM, negative offsets must be treated as fully OOB, as partial OOB
handling on the left is not supported.
Previously, we adjusted the tensor shape `tensor_shape = tensor_shape -
offset`. For negative offsets this increased the tensor dimension,
leading to OOB reads by the HW.

To fix this, we now clamp the tensor shape to zero for negative offsets,
which treats the entire tile as OOB (zero-fill).
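The shape adjustment before and after the fix, as a sketch:

```python
def effective_tensor_shape(tensor_shape: int, offset: int) -> int:
    # Old behavior: tensor_shape - offset, which *grows* the extent for
    # negative offsets and lets the HW read OOB. New behavior: clamp to
    # zero so the whole tile is treated as OOB and zero-filled.
    return 0 if offset < 0 else tensor_shape - offset
```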
This PR fixes a bug where the wide-store epilogue path builds a derived
linear layout by swapping N-dimension basis bits. For some valid shapes,
the N dimension is too small for the target basis bit to exist, so
layout construction indexed past the end of the basis vector list and
the compiler would crash. This change makes the store-layout helper bail
out in that case, allowing the existing fallback path to handle the
store instead of crashing.

This issue can be reproduced on gfx950 with the following small program:
```python
import triton
import triton.language as tl


@triton.jit
def repro(A, B, C):
    # 128x16 @ 16x8 dot whose bf16 wide-store epilogue used to crash.
    m = tl.arange(0, 128)
    k = tl.arange(0, 16)
    n = tl.arange(0, 8)
    a = tl.load(A + m[:, None] * 16 + k[None, :])
    b = tl.load(B + k[:, None] * 8 + n[None, :])
    acc = tl.dot(a, b, out_dtype=tl.float32)
    out = acc.to(tl.bfloat16)
    tl.store(C + m[:, None] * 8 + n[None, :], out)
```
MN-packed fp4 operands force a `DotScaledOp` to survive until codegen,
which breaks compilation on pre-Blackwell architectures.