[pull] main from triton-lang:main#1
pull[bot] wants to merge 3395 commits into
This PR adds version checking. If `TRITON_PLUGIN_VERSION_CHECK` is unset, only the release version is checked (i.e., that both the plugin and Triton were built on the same version, say "3.7.0"). If `TRITON_PLUGIN_VERSION_CHECK=on`, both the release version and the git hash they were built on must match. Finally, if `TRITON_PLUGIN_VERSION_CHECK=off`, nothing is checked beyond the plugin API version itself.
tl.rand returns values in the half-open interval [0, 1), not the closed interval [0, 1]. The test assertion used `x <= 1`, which would have accepted 1.0 as a valid output. Philox-based float generation masks the upper mantissa bits and sets the exponent to produce values strictly less than 1.0. A value of exactly 1.0 indicates a bug (e.g. broken seed delivery returning all-ones from the uint-to-float conversion).

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] This PR does not need a test because it's fixing a test.
- [x] I have not added any `lit` tests.

Co-authored-by: Wes Turner <westurner@users.noreply.github.com>
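The mantissa-masking trick the description refers to can be sketched in plain Python. This is an illustrative reconstruction of the general technique, not Triton's actual implementation:

```python
import struct

def uniform_from_bits(u32: int) -> float:
    # Keep the top 23 random bits as the mantissa and force the biased
    # exponent to 127, giving a float in [1, 2); subtracting 1.0 then
    # lands strictly inside the half-open interval [0, 1).
    mantissa = (u32 >> 9) & 0x7FFFFF       # 23 mantissa bits
    bits = (127 << 23) | mantissa          # sign = 0, exponent = 127
    return struct.unpack("<f", struct.pack("<I", bits))[0] - 1.0
```

Even the all-ones input maps below 1.0 under this construction, which is why an observed 1.0 really does indicate broken bit delivery rather than an unlucky sample.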
We do so by following the PTX docs and our LLVM lowerings:
1. In PTX, a barrier flips when both `arrivals == 0` and `tx-count == 0`.
2. In PTX, an expect implies a commit of 1.
3. In Triton, we lower `ttng.expect` as an expect on the leader CTA and as a commit of 1.

We model all these points in consan.
(Coauthored with Thomas Raoux)
This commit changes the `tt.tensordesc` type to be directly composed of shape, element type, and shared layout, dropping the previous tensor type wrapper around them. The main reason is that the shared memory layout encoding attached inside the tensor type was confusing, since the layout describes shared memory, not the tensor. `!tt.tensordesc<tensor<128x64xf16, #shared>>` now becomes `!tt.tensordesc<128x64xf16, #shared>`. This is also closer to `ttg.memdesc`.
#9960) Updated triton-ascend administrator.
When backward remat identifies an operand that is already in the desired encoding, it skips that operand in the slice. However, we do not record the desired layout for that value in `layouts`. As a result, we cannot detect if we also encounter a conflicting encoding for that same value. When that happens, the value is rematerialised, resulting in an encoding mismatch with the user that expected the original encoding. To fix this, record the original encoding in `layouts`, without adding it to the slice. This way, if a conflicting encoding is encountered, we will just bail on backward remat.
…9934) We add proper support for `expect` by tracking the bytes being expected on a barrier, not only the commits. We follow the PTX semantics rather closely. In the second commit, we add support for TMA multicast. We also fix the fanout CTA masks that were missing in the transfer visibility functions. We add a few tests for consan meets tcgen05 / tma ops with / without multicast / twoCTA mode.

---------

Co-authored-by: Codex <noreply@openai.com>
There are mainly three incremental improvements over the MoE module, benchmarked by the `bench_mlp.py` script via `torchrun --nproc-per-node=1 python/triton_kernels/bench/bench_mlp.py`. The main improvements come from 1) offline shuffling for MX4 weights; 2) disabling block shape swap to increase num-stages to 5 for FC2; 3) disabling block shape swap and enabling epilogue subtiling to increase num-stages to 5 for FC1.

Perf results measured on a single B200 (for batch_per_expt = [1, 2, 4, 8, 16, 32, 64]):

```
torchrun --nproc-per-node=1 python/triton_kernels/bench/bench_mlp.py
=========================================
logs/gpt-oss-x2/fp8x-fp8w-EP1.csv...
=========================================
batch_per_expt: 1  | MS: 70.19  | TFLOPS: 18.22 | TBPS: 5.89
batch_per_expt: 2  | MS: 89.87  | TFLOPS: 28.46 | TBPS: 6.04
batch_per_expt: 4  | MS: 103.82 | TFLOPS: 49.27 | TBPS: 6.10
batch_per_expt: 8  | MS: 105.98 | TFLOPS: 96.52 | TBPS: 6.03
batch_per_expt: 16 | MS: 108.33 | TFLOPS: 188.9 | TBPS: 5.93
batch_per_expt: 32 | MS: 112.29 | TFLOPS: 364.4 | TBPS: 5.76
batch_per_expt: 64 | MS: 126.07 | TFLOPS: 649.1 | TBPS: 5.19
=========================================
logs/gpt-oss-x2/fp8x-mx4w-EP1.csv...
=========================================
batch_per_expt: 1  | MS: 43.55 | TFLOPS: 29.36 | TBPS: 4.51
batch_per_expt: 2  | MS: 56.66 | TFLOPS: 45.14 | TBPS: 4.93
batch_per_expt: 4  | MS: 64.89 | TFLOPS: 78.83 | TBPS: 5.03
batch_per_expt: 8  | MS: 66.92 | TFLOPS: 152.9 | TBPS: 4.97
batch_per_expt: 16 | MS: 67.62 | TFLOPS: 302.6 | TBPS: 4.96
batch_per_expt: 32 | MS: 70.89 | TFLOPS: 577.3 | TBPS: 4.79
batch_per_expt: 64 | MS: 90.00 | TFLOPS: 909.4 | TBPS: 3.87
=========================================
logs/gpt-oss-x2/fp8x-mx4w-EP1-shuffled.csv...
=========================================
batch_per_expt: 1  | MS: 42.54 | TFLOPS: 30.06 | TBPS: 5.03
batch_per_expt: 2  | MS: 55.23 | TFLOPS: 46.31 | TBPS: 5.36
batch_per_expt: 4  | MS: 61.35 | TFLOPS: 83.37 | TBPS: 5.40
batch_per_expt: 8  | MS: 62.00 | TFLOPS: 165.0 | TBPS: 5.49
batch_per_expt: 16 | MS: 63.62 | TFLOPS: 321.6 | TBPS: 5.38
batch_per_expt: 32 | MS: 66.58 | TFLOPS: 614.5 | TBPS: 5.21
batch_per_expt: 64 | MS: 85.04 | TFLOPS: 962.4 | TBPS: 4.18
=========================================
logs/gpt-oss-x2/fp8x-mx4w-EP1-shuffled-fc2stages5.csv...
=========================================
batch_per_expt: 1  | MS: 39.80 | TFLOPS: 32.13 | TBPS: 5.11
batch_per_expt: 2  | MS: 56.78 | TFLOPS: 45.04 | TBPS: 5.45
batch_per_expt: 4  | MS: 60.13 | TFLOPS: 85.07 | TBPS: 5.46
batch_per_expt: 8  | MS: 60.95 | TFLOPS: 167.8 | TBPS: 5.58
batch_per_expt: 16 | MS: 62.56 | TFLOPS: 327.0 | TBPS: 5.47
batch_per_expt: 32 | MS: 65.33 | TFLOPS: 626.4 | TBPS: 5.31
batch_per_expt: 64 | MS: 82.49 | TFLOPS: 992.1 | TBPS: 4.31
=========================================
logs/gpt-oss-x2/fp8x-mx4w-EP1-shuffled-fc1stages5-fc2stages5-fc1subtile2.csv...
=========================================
batch_per_expt: 1  | MS: 40.27 | TFLOPS: 31.75 | TBPS: 5.19
batch_per_expt: 2  | MS: 52.45 | TFLOPS: 48.76 | TBPS: 5.60
batch_per_expt: 4  | MS: 58.43 | TFLOPS: 87.54 | TBPS: 5.67
batch_per_expt: 8  | MS: 60.16 | TFLOPS: 170.1 | TBPS: 5.65
batch_per_expt: 16 | MS: 60.75 | TFLOPS: 336.8 | TBPS: 5.64
batch_per_expt: 32 | MS: 63.28 | TFLOPS: 646.6 | TBPS: 5.48
batch_per_expt: 64 | MS: 78.85 | TFLOPS: 1038. | TBPS: 4.51
```

---------

Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
In general, split-k was disabled in the MLP matmuls, since when the matmul is ragged it is not trivial to statically choose the split-k factor (though it would be possible to do so dynamically). In the small-batch, non-ragged case, however, it is simple to allow split-k. This PR does that, adds some basic heuristic tuning for such cases, and optimizes the split-k reduce itself. These changes exposed a bug in the smem accounting heuristics, where we weren't counting the smem needed to perform SWAP_XW; this change fixes that. Perf will probably get even better after integrating the shuffled mxfp4 weight layout from #9698 as well.
…d add a verifier (#9984)
Add canonicalization pattern for IntToPtrOp that recognizes the pattern:
int_to_ptr(addi(ptr_to_int(ptr), constant_offset))
and transforms it to:
addptr(ptr, element_offset)
where element_offset = constant_offset / element_size_bytes.
This pattern appears when performing pointer arithmetic via integer
operations (e.g., adding byte offsets to pointers). By canonicalizing
to addptr, AxisInfoAnalysis can correctly track contiguity, enabling
proper vectorization for operations like async_copy_local_to_global.
The pattern only applies when:
- The offset is a compile-time constant (IntegerAttr or
SplatElementsAttr)
- The byte offset is evenly divisible by the element size
Added to both standard canonicalize and gluon-canonicalize passes.
Tests added for positive cases (f32, f16, commutative) and negative
cases (non-constant offset, indivisible offset).
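The byte-to-element conversion and its divisibility guard can be illustrated with a small Python helper. The names here are hypothetical; the actual pattern is written in C++ against MLIR:

```python
def byte_to_element_offset(byte_offset: int, element_size_bytes: int):
    # Mirror of the rewrite's applicability check: only fold the
    # int_to_ptr(addi(ptr_to_int(ptr), c)) chain into addptr when the
    # constant byte offset is a whole number of elements.
    if element_size_bytes <= 0 or byte_offset % element_size_bytes != 0:
        return None  # bail out; the canonicalization does not apply
    return byte_offset // element_size_bytes
```

For example, a 16-byte offset on f32 (4-byte elements) folds to an element offset of 4, while a 6-byte offset on f32 is left as the original integer arithmetic.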
As per the PTX docs, TMAs have a very specific behaviour when executed in a 2CTA kernel:

> `.cta_group::1`: The mbarrier signal is also multicasted to the same offset as mbar in the shared memory of the destination CTA.
>
> `.cta_group::2`: The mbarrier signal is multicasted either to all the odd numbered CTAs or the even numbered CTAs within the corresponding CTA-Pair. For each destination CTA specified in the ctaMask, the mbarrier signal is sent either to the destination CTA or its peer-CTA, based on the `%cluster_ctarank` parity of the CTA in whose shared memory the mbarrier object mbar resides.

As such, we require these CTA layouts in TMA barriers.
A `nvmma_shared` layout with swizzle=0 represents a flat, contiguous
layout. This is valid for TMA but it is never the correct layout for
Hopper WGMMA and Blackwell tcgen05 MMA instructions, since operand SMEMs
with swizzle=0 are required to be in the special layout ("core matrices"
format). Being able to use swizzle=0 for MMA is useful in case other
swizzling modes cannot be applied for some reason.
In principle, if the operand in gmem is already in the right format that
MMA expects, using TMA with swizzle 0 and directly feeding the result
into MMA can work. But currently I'm not aware of a way to express that
in Triton. This PR adds a key rewrite pass that enables that. The idea
is the same as how we enable blocked-scale load via TMA and `tmem_copy`.
See the attached test case for a fully worked-out example.
* Users must prepare the operand in gmem in the correct, swizzle-0
core-matrices format. Conceptually, an operand would have a shape like
`(num_blocks_m, num_blocks_k, num_cm_m, num_cm_k, 16, 8)` for fp8.
* In a kernel, use `tt.trans` and `tt.reshape` to "undo" the
core-matrices transformation. The op sequence looks like
`desc_load<swizzle=0> -> tt.reshape / tt.trans -> local_alloc -> mma`
* The new rewrite pass bubbles up `local_alloc` so that it immediately
follows `desc_load`. It also lifts those reshape / trans into
transformations on memdesc. The op sequence now looks like
`desc_load<swizzle=0> -> local_alloc<swizzle=0> -> memdesc reshape /
trans -> mma<swizzle=0>`.
* Thanks to the LinearLayout inference machinery on memdesc reshape /
trans, the MMA now gets an operand smem with a special `#shared_linear`
layout. If the `tt.trans` and `tt.reshape` transformations the user
specified are correct, the operand linear layout is correctly identified
as representing the swizzle-0 core-matrices format and lowered to the
corresponding SMEM descriptor.
There is no reason why we should fuse these two. tcgen05_mma has a fused barrier because that way the codegen is better, but this is not the case for TMEMCopy. Also we rarely would want to do a tmem_copy and then wait on it without a tcgen05_mma, so out it goes.
Simplify use of shuffled Blackwell MX value weights:
- Convert directly to BlackwellMX4ValueShuffledLayout; don't require first going through BlackwelllValueLayout.
- Use block sizes from BlackwellMX4ValueShuffledLayout as opt-flag constraints. This removes the complicated code needed to infer the block sizes before making the layout. Pick a better default of block_n = 256, block_k = 128, which generally works well and is the inferred one except in cases where N < 256. This also makes it simpler to just use, instead of also needing to override disable_mx4_block_swap = True when shuffled weights are used.
- Add more test coverage.

Same perf from running `torchrun --nproc-per-node=1 python/triton_kernels/bench/bench_mlp.py`.
These now obey the usual algebraic properties, namely the trigonometric angle sum identities, reflection formulae, and the `sin(x)^2 + cos(x)^2 == 1` norm condition. Unit tests are included to verify these. This is accomplished by taking the coefficients of `(-3/5 + 4/5 i)^x` in the quadratic extension of the payload domain by a formal symbol `i` satisfying `i^2 == -1`.
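The unit-modulus construction can be checked numerically with ordinary complex arithmetic. This is a sketch of the underlying algebra only, not the actual payload-domain code: since `(-3/5 + 4/5 i)` has modulus exactly 1, its powers stay on the unit circle, so the two coefficients automatically satisfy the norm condition and the angle-sum identities:

```python
def payload_sincos(x: float):
    # (-3/5 + 4/5 i) lies on the unit circle, so its x-th power does too:
    # the real/imaginary parts behave like cos/sin, obeying the
    # angle-sum identities and cos(x)^2 + sin(x)^2 == 1.
    z = complex(-3 / 5, 4 / 5) ** x
    return z.real, z.imag  # (cos-like, sin-like)
```

Because `z**(a+b) == z**a * z**b` for the principal power, the angle-sum identities hold up to floating-point error.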
…10010) SM121 (GB10 DGX Spark) supports the same mma.sync block-scaled instructions as SM120 (RTX 5090) but was excluded from the native lowering path by exact compute capability checks. Without this fix, dot_scaled on SM121 falls through to DecomposeScaledBlocked, which upcasts to bf16 (~10 TFLOPS vs ~270 TFLOPS with native mma.sync block-scaled FP4). Tested on GB10 with both MXFP4 (scale_vec::2X, ue8m0) and NVFP4 (scale_vec::4X, ue4m3).

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] This PR does not need a test because current test paths cover the flow; there are no GB10s in CI to verify, but it works on my GB10 AFAIK.
- [x] I have not added any `lit` tests.
Fpsan emulates tmem with global scratch: each tmem buffer is mapped to a scratch allocation. There was an issue in how the mapping was created for WS partitions, with the same tmem buffer being mapped to distinct scratch allocations. We now track the "canonical" tmem allocation, to make sure that tmem aliases in different regions, even though they are represented by different SSA values, are mapped to the original allocation correctly.

---------

Co-authored-by: root <root@codex-gb200-0.brix.pawelszczerbuk.svc.cluster.local>
Co-authored-by: Codex <noreply@openai.com>
Fix Hopper MX scale layout conversion for zero-sized scale tensors. `torch.nn.functional.pad` rejects some valid zero-numel outputs. This shows up in Hopper scale swizzle after RHS transpose: for example, `(0, 64)` becomes `(64, 0)`, and row padding still leaves a zero-width output. The fix mirrors Blackwell scale layout: skip `pad` when there are no elements, update `M`/`K` to the padded extents, and let the existing zero-element reshape produce the swizzled storage shape. Also relax only `StridedLayoutTransformation.swizzle_data` for zero-numel input. Hopper scale roundtrip can produce a valid empty canonical tensor such as `(2, 0)` with `stride(-1) == 2`; there are no elements for the packed-stride invariant to constrain. Validation: - H100: `PYTHONPATH=$PWD/python/triton_kernels python -m pytest -q python/triton_kernels/tests/test_tensor_details/test_layout_hopper.py::test_mxfp4_scale_zero_sized_roundtrip`
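The pad-skip fix can be mimicked with NumPy. This is illustrative only, with hypothetical helper names; the real code operates on torch tensors inside the scale-swizzle path:

```python
import numpy as np

def pad_rows(x: np.ndarray, multiple: int) -> np.ndarray:
    # Round the row count up to a multiple of `multiple`. When the tensor
    # has no elements, skip the pad call entirely (torch's pad can reject
    # some zero-numel outputs) and build the padded-extent array directly.
    pad = (-x.shape[0]) % multiple
    if x.size == 0:
        return np.zeros((x.shape[0] + pad, x.shape[1]), dtype=x.dtype)
    return np.pad(x, ((0, pad), (0, 0)))
```

A zero-width input like `(64, 0)` then still produces the padded extents (here `(128, 0)` for a row multiple of 128) without ever calling the padding routine.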
The GSan runtime is compiled with --cuda-gpu-arch=sm_80, which requires PTX 7.0 or newer. Most LLVM builds infer a compatible PTX version automatically, but some do not and will yield the following error message: ``` PTX version 4.2 does not support target 'sm_80'. Minimum required PTX version is 7.0. ```
Here's another ease-of-use commit that I'm separating out from an unrelated change. This makes it easy to test Triton's plugin system with a single command: `make test-plugins`.
Determining whether an op is used outside of the slice being rematerialised is currently conservative. For instance, if an op is used by an `scf.yield`, we automatically treat that as a non-slice user, even if the value that the yield flows into is actually part of the slice or completely unused. To address this, trace through region control flow when deciding whether an op is single use.
TDM supports strides up to 48 bits, but we silently truncated them to 32 bits. This happened both in the lowering and in the creation of host-side descriptors, so this PR adjusts both to preserve all 48 bits and adds range checks for host-side descriptors.
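The added host-side range check amounts to something like the following sketch (hypothetical helper name; the real check lives in the descriptor-creation code):

```python
TDM_STRIDE_BITS = 48

def check_tdm_stride(stride: int) -> int:
    # Reject strides wider than the 48-bit TDM field instead of silently
    # truncating them to 32 bits, as the old descriptor code did.
    if not 0 <= stride < (1 << TDM_STRIDE_BITS):
        raise ValueError(f"stride {stride:#x} exceeds the 48-bit TDM limit")
    return stride
```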
This pass causes more difficulties than it helps.
The conversion patterns in `TritonToTritonGPU` (`TritonFuncOpPattern`, `TritonCallOpPattern`, `TritonReturnOpPattern`) did not handle `tt.func` ops returning tensors: `TritonFuncOpPattern` reused the original `FunctionType` verbatim, and the `triton::FuncOp` legality check inspected only argument types, so tensor result types came out without any layout encoding. There was already a fork of MLIR upstream's signature-conversion pattern in `Dialect/Triton/Transforms/FunctionTypeConversion.h` (added because upstream is unaware of `tt.call`/`tt.return`) which handles inputs, results, one-to-many conversions, and arg-attribute remapping — so this PR drops those three custom patterns and reuses this one. The legality predicate is also extended to require encodings on result tensor types. Adds a regression test in `test/Conversion/triton_to_tritongpu.mlir` covering a `tt.func` returning a tensor and a `tt.call` site picking up the encoded result type. Making this PR on behalf of @saagarjha (he is waiting on #8913) - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - [x] I have added tests. - [x] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) --------- Co-authored-by: Saagar Jha <saagar@saagarjha.com>
… pipeline support (#9929) ### Summary - **Flat (unrolled) pipeline support**: Extend `ConvertWarpPipeline` to handle sequences of `scf.execute_region` ops outside `scf.for` (produced by `WarpPipeliner::createFlatPipeline`). Emit pre-barrier, phase shift, cluster barriers, and reconverge around them. - **Eliminate redundant barriers between back-to-back pipelines**: When two warp-pipelined regions are adjacent with no intervening operations, the post-loop reconverge + prelude barrier + phase shift cancel out. The phase from the first pipeline carries over naturally. - **Cross-pipeline LDS dependency analysis**: Before eliminating boundary barriers, verify that no uncovered LDS hazard exists at the merge point. Concatenates cluster infos from both pipelines and runs `analyzePipelineDependencies` on the merged sequence. Skips the optimization when a dependency is found. - **Adjacent-stage dependency check**: Add a distance-1 check to `analyzePipelineDependencies`. The existing loop only checked pairs at distance 2+, so consecutive clusters sharing an LDS allocation never got a LOCAL barrier — causing `ModuleMembarAnalysis` to insert a redundant `ttg.barrier local` inside the pipeline. - **Refactors**: Extract `analyzePipelineDependencies`, `emitPipelinePrelude`/`Postlude`, and `emitClusterBarrier` helpers. ### Test plan - `back_to_back_cross_dep_kept`: shared-buffer RAW at boundary → barriers kept - `back_to_back_no_dep_elimination`: loop B has no LDS → barriers eliminated - `back_to_back_dep_covered_elimination`: 3-stage loop A with internal barrier covering the cross-pipeline dep → barriers eliminated - `back_to_back_for_then_flat`: pipelined loop + flat pipeline → barriers eliminated - `adjacent_stage_lds_dep`: 3-stage pipeline verifying LOCAL barrier between adjacent stages with RAW dependency - `flat_pipeline_existing_barrier`: pre-existing `async_wait` wrapped with `sched_barrier` - Existing 2-stage and 3-stage pipeline tests updated
This PR adds an MoE Gluon kernel for the gfx1250 platform.
…cumentation website (#10300)
This PR adds an optional `warp_used_hint` attribute to `AsyncTDMCopyGlobalToLocalOp` that enables partial TDM copies: only the selected subset of warps performs useful TDM loads, while the rest get `pred=0` in their descriptor (virtually a hardware no-op; the instruction is still issued but no data moves). The attribute is an `i32` bitmask: bit `n` selects warp `n`. The hint is a performance hint only; it does not change the logical copy or the data written to shared memory. For example, with `num_warps=8`, `warp_used_hint = 0b00001111` means warps 0-3 perform the copy and warps 4-7 are predicated off.

The verifier requires the active warps to follow a regular, axis-aligned bit pattern so lowering can derive a power-of-two active warp count and reuse the existing LinearLayout/free-variable machinery for offset and predicate generation.

During lowering, the tensor descriptor is first represented as a base TDM descriptor containing tensor metadata (base pointer, shape, strides, padding). The final per-instruction hardware descriptor is completed later, when lowering each `async_tdm_copy`, where op-local fields such as `pred`, the LDS address, barrier, destination layout/partitioning, and `tile_dim*` are known. `warp_used_hint` is ignored by the base descriptor and used only when completing those per-instruction hardware descriptor fields, especially `tile_dim*` and `pred`. For a hint with `K = popcount(warp_used_hint)` active warps, `fillTDMDescriptor` re-encodes per-warp tile dimensions as `block / K`, so the selected warps still cover the same user-visible block in one TDM instruction. This is useful when `num_warps` exceeds what is needed for the copy.

The PR includes an example, verifier tests, lowering lit tests, and Python coverage.
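The `K = popcount(warp_used_hint)` derivation and the power-of-two requirement can be sketched as follows (hypothetical helper; the real verifier is C++):

```python
def active_warp_count(warp_used_hint: int, num_warps: int) -> int:
    # K active warps must form a power-of-two set so per-warp tile
    # dimensions can be re-encoded as block / K in the hardware descriptor.
    assert 0 < warp_used_hint < (1 << num_warps), "hint must select in-range warps"
    k = bin(warp_used_hint).count("1")
    assert k & (k - 1) == 0, "active warp count must be a power of two"
    return k
```

For instance, with `num_warps=8` and `warp_used_hint=0b00001111`, the helper yields `K=4`: warps 0-3 copy and the per-warp tile is `block / 4`.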
True16 was disabled on gfx11 a while back due to correctness regressions
in LLVM. Those have since been resolved upstream, so this PR drops the
-real-true16 target-feature override in make_amdgcn / make_hsaco and
lets
the backend emit native 16-bit ops (e.g. v_add_u16 v0.l, v1.h, v2.h).
Benchmarks across GEMM and FlashAttention show this is roughly
perf-neutral with a slight positive lean and no major regressions.
Benchmark results
GEMM (bf16, TFLOPs)
M | N | K | no_true16 | true16 | Δ%
-- | -- | -- | -- | -- | --
1024 | 1024 | 1024 | 44.13 | 43.85 | −0.65
2048 | 2048 | 2048 | 65.58 | 66.37 | +1.20
4096 | 4096 | 4096 | 78.80 | 79.88 | +1.36
8192 | 8192 | 8192 | 74.98 | 77.71 | +3.65
4096 | 11008 | 4096 | 74.62 | 75.73 | +1.49
4096 | 4096 | 11008 | 74.76 | 75.52 | +1.02
4096 | 14336 | 4096 | 74.99 | 76.88 | +2.53
4096 | 4096 | 14336 | 74.55 | 76.77 | +2.97
4096 | 12288 | 4096 | 76.61 | 77.50 | +1.16
FlashAttention-2 fwd (causal=True, TFLOPs)
headdim | batch | seqlen | no_true16 | true16 | Δ%
-- | -- | -- | -- | -- | --
64 | 32 | 512 | 33.34 | 34.14 | +2.41
64 | 16 | 1024 | 42.94 | 43.47 | +1.24
64 | 8 | 2048 | 49.84 | 48.75 | −2.19
64 | 4 | 4096 | 53.32 | 52.62 | −1.31
64 | 2 | 8192 | 53.46 | 54.63 | +2.19
64 | 1 | 16384 | 50.09 | 52.35 | +4.51
64 | 1 | 32768 | 43.80 | 44.52 | +1.65
128 | 32 | 512 | 30.81 | 31.18 | +1.20
128 | 16 | 1024 | 43.24 | 44.80 | +3.61
128 | 8 | 2048 | 52.88 | 53.39 | +0.96
128 | 4 | 4096 | 58.80 | 58.87 | +0.13
128 | 2 | 8192 | 61.43 | 64.05 | +4.26
128 | 1 | 16384 | 59.82 | 61.31 | +2.49
128 | 1 | 32768 | 55.20 | 55.94 | +1.33
Co-authored-by: Saeid Rostami <srostami@amd.com>
… K=8 (#10234) Enable `tl.dot` with TF32 precision on tiles with **N=8** and **K=8** (e.g. `wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32`) via the standard `tt.dot` → `AccelerateMatmul` path on sm90+. Related to #10060 (comment).

I am trying Triton for finite elements, and it does wonders! The matrices used in those computations are usually quite small. With some management, it is possible to pack several operations into MMA cores, but the implemented tile sizes were too big. I ran the lit tests and they are passing, so I guess the resulting IR is the same. Adds tests for the new functionality.

---

## Changes

### `lib/Analysis/Utility.cpp`

In `supportMMA` (version 3), relaxed the N-dimension divisibility check from `% 16` to `% 8`:

```cpp
- retShapePerCTA[rank - 1] % 16 == 0
+ retShapePerCTA[rank - 1] % 8 == 0
```

The WGMMAv3 op verifier already required only `N % 8 == 0`, and `mmaVersionToInstrShape` already listed `n=8` as valid. This was the sole gatekeeper preventing N=8 tiles from using WGMMA, causing a silent fallback to MMAv2.

### `third_party/nvidia/backend/compiler.py`

In `min_dot_size`, added an explicit case for 32-bit types (TF32/FP32):

```python
+ elif lhs_bitwidth == 32:
+     return (1, 1, 8)
```

The TF32 hardware instruction has K=8. The previous fallthrough to the `else` branch returned `K >= 16`, blocking compilation of K=8 TF32 kernels.

### `python/test/unit/language/test_core.py`

Added `test_dot_wgmma_tf32_n8k8`, parametrized over `M ∈ {64, 128}`, verifying both numerical correctness and that the emitted PTX contains `wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32`.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] I have added tests.
  - `/test` for `lit` tests
  - `/unittest` for C++ tests
  - `/python/test` for end-to-end tests
- [x] I have not added any `lit` tests.
`python/test/unit/tools/test_triton_to_gluon.py` fails on AMD RDNA3 (gfx11xx) and RDNA4 (gfx120x) hardware because `TranslatorTarget` only enumerates the gfx targets the translator currently supports. Every test in the file errors at setup with:

```
ValueError: 'gfx1201' is not a valid TranslatorTarget
```

(observed as 9 failures on a gfx1201/gfx1100 box). This PR adds a module-level `pytestmark` that skips the file when `is_hip_rdna()` is true, so CI / local runs on RDNA hardware are clean.
Since we now generate arith.negf, it needs to be supported by fpsan.
We also check that when reinterpreting a pipelining buffer, the initial dimensions are the same. Addresses #10243 (comment)
Implement `InterpreterBuilder.create_dot_scaled`. `deduceScaleFactor` relied on `mlir::Value`s; refactor it to be independent of MLIR so it can be called when using the interpreter. Mark several tests that use `tl.dot_scaled` to run during the interpreter tests.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] This PR does not need a test because existing tests have been marked to run with the interpreter, which covers the new functionality.
- [x] I have not added any `lit` tests.
#9704) Does not need `LD_PRELOAD`. Initializes rocprof-sdk on the Triton profiler import, a lightweight, one-time-cost initialization that replaces the `hsa_queue_create_fn` pointer in the HSA table and creates & registers our 2 SDK contexts. This must happen before any GPU operation creates an HSA queue, because the SDK can only intercept queues created after the replacement. Queues that already exist are plain HSA queues and invisible to the profiler.

- Context 1: `codeObjectContext`, used for kernel registration names.
- Context 2: `profilingContext`, which does the heavy work and is started only between `doStart()` and `doStop()`. Between proton `start()` and `end()`, the SDK WriteInterceptor intercepts the dispatches in between with barrier packets.

Rewrote `_select_backend()` to avoid calling `get_current_target()`, which triggers HIP runtime init before `force_configure` can run. Tested locally by building rocm-systems from main with the `TRITON_ROCPROFILER_SDK..` envs.

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>
…10317) We "sort" the mask index so that rows with similar number of active input rows are grouped together. Here, "number of active input rows" are defined as `mask[:, :, 0].sum(dim=1)`, assuming the reduction dimension is 1 and the mask is broadcast over dim=2. Then, for each block, we call the main logic with hard-coded loop bound, determined by the max number of active input rows. Benchmark on GB300: ``` B=8192, M=4, N=8192, dim=1, dtype=float16, mask=broadcast_n 7983 --> 8271 GB/s (+3.6%) B=16384, M=4, N=8192, dim=1, dtype=float16, mask=broadcast_n 8501 --> 8947 GB/s (+5.2%) B=65536, M=4, N=8192, dim=1, dtype=float16, mask=broadcast_n 8998 --> 9650 GB/s (+7.2%) ```
…les (#10310) For TDM, negative offsets must be treated as fully OOB, as partial OOB handling on the left is not supported. Previously, we adjusted the tensor shape `tensor_shape = tensor_shape - offset`. For negative offsets this increased the tensor dimension, leading to OOB reads by the HW. To fix this, we now clamp the tensor shape to zero for negative offsets, which treats the entire tile as OOB (zero-fill).
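The clamping fix amounts to the following scalar sketch (illustrative only; the real adjustment happens per dimension in the descriptor lowering):

```python
def adjusted_tensor_dim(tensor_dim: int, offset: int) -> int:
    # Old behaviour: tensor_dim - offset. For a negative offset this
    # *grows* the dimension, so the hardware read out of bounds.
    # New behaviour: for negative offsets, clamp the shape to zero so the
    # whole tile is treated as OOB and zero-filled by the hardware.
    return max(tensor_dim - offset, 0) if offset >= 0 else 0
```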
This PR fixes a bug where the wide-store epilogue path builds a derived
linear layout by swapping N-dimension basis bits. For some valid shapes,
the N dimension is too small for the target basis bit to exist, so
layout construction indexed past the end of the basis vector list and
the compiler would crash. This change makes the store-layout helper bail
out in that case, allowing the existing fallback path to handle the
store instead of crashing.
This issue can be reproduced on gfx950 with the following small program:
```python
@triton.jit
def repro(A, B, C):
    m = tl.arange(0, 128)
    k = tl.arange(0, 16)
    n = tl.arange(0, 8)
    a = tl.load(A + m[:, None] * 16 + k[None, :])
    b = tl.load(B + k[:, None] * 8 + n[None, :])
    acc = tl.dot(a, b, out_dtype=tl.float32)
    out = acc.to(tl.bfloat16)
    tl.store(C + m[:, None] * 8 + n[None, :], out)
```
MN-packed fp4 operands force a `DotScaledOp` to survive until codegen, which breaks compilation on pre-Blackwell architectures.
… used in atomic operations (#10314)
See Commits and Changes for more details.
Created by pull[bot] (v2.0.0-alpha.4)