[ BACKEND ] Enable tl.dot with TF32 precision on tiles with N=8 and K=8 #10234

Merged: ThomasRaoux merged 1 commit into triton-lang:main from mwichro:wgmma.m64n8k8 on May 13, 2026

Conversation

@mwichro (Contributor) commented May 5, 2026

Enable tl.dot with TF32 precision on tiles with N=8 and K=8 (e.g. wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32) via the standard tt.dot → AccelerateMatmul path on sm90+.

Related to #10060 (comment)
I am trying Triton for Finite Elements, and it does wonders!
The matrices used in those computations are usually quite small. With some management it is possible to pack several operations onto the MMA cores, but the smallest tile sizes currently implemented were still too big.

I ran the lit tests and they are passing, so I guess the resulting IR is the same. Added a test for the new functionality.


Changes

lib/Analysis/Utility.cpp

In supportMMA (version 3), relaxed the N-dimension divisibility check from % 16 to % 8:

- retShapePerCTA[rank - 1] % 16 == 0
+ retShapePerCTA[rank - 1] % 8 == 0

The WGMMAv3 op verifier already required only N % 8 == 0, and mmaVersionToInstrShape already listed n=8 as valid. This was the sole gatekeeper preventing N=8 tiles from using WGMMA, causing a silent fallback to MMAv2.
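In Python terms, the relaxed gate amounts to the following (a minimal sketch with hypothetical names, mirroring the two-line diff above):

    # The last dim of the per-CTA result tile (N) now only needs to be a multiple of 8,
    # matching what the WGMMA op verifier and mmaVersionToInstrShape already accept.
    def n_dim_ok_for_wgmma(ret_shape_per_cta: list[int]) -> bool:
        return ret_shape_per_cta[-1] % 8 == 0  # previously required % 16 == 0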

third_party/nvidia/backend/compiler.py

In min_dot_size, added an explicit case for 32-bit types (TF32/FP32):

+ elif lhs_bitwidth == 32:
+     return (1, 1, 8)

The TF32 hardware instruction has K=8. Previously, 32-bit inputs fell through to the else branch, whose minimum K is 16, blocking compilation of K=8 TF32 kernels.
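For context, a sketch of how the branch chain in min_dot_size reads after this change. The 8-bit, 64-bit, and 32-bit returns are copied from the fragments quoted in this PR; the first condition and the else branch are inferred from the surrounding discussion, so treat this as a paraphrase rather than the exact file contents:

    from typing import Tuple
    from triton.backends.compiler import GPUTarget  # import path assumed, as in upstream Triton

    def min_dot_size(target: GPUTarget):

        def check_dot_compatibility(lhs_type, rhs_type) -> Tuple[int, int, int]:
            lhs_bitwidth = lhs_type.scalar.primitive_bitwidth
            rhs_bitwidth = rhs_type.scalar.primitive_bitwidth
            assert lhs_bitwidth == rhs_bitwidth, "lhs and rhs bitwidth must be the same"
            if lhs_bitwidth == 8:         # fp8/int8: native MMA tile has K=32
                return (1, 1, 32)
            elif lhs_bitwidth == 64:      # fp64: K=4
                return (1, 1, 4)
            elif lhs_bitwidth == 32:      # tf32/fp32 (new in this PR): K=8
                return (1, 1, 8)
            else:                         # 16-bit types: K=16
                return (1, 1, 16)

        return check_dot_compatibility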

python/test/unit/language/test_core.py

Added test_dot_wgmma_tf32_n8k8 parametrized over M ∈ {64, 128}, verifying both numerical correctness and that the emitted PTX contains wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32.
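A minimal standalone sketch of what such a check looks like (hypothetical kernel and helper names, not the exact test added in this PR; assumes an sm90+ GPU and that the compiled kernel handle exposes the generated PTX via its asm dict, as in the existing tests):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _dot_tf32_n8k8(a_ptr, b_ptr, c_ptr,
                       M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
        offs_m = tl.arange(0, M)
        offs_n = tl.arange(0, N)
        offs_k = tl.arange(0, K)
        a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
        b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
        c = tl.dot(a, b, input_precision="tf32", out_dtype=tl.float32)
        tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)

    def check_wgmma_n8k8(M=64, N=8, K=8, num_warps=4):
        a = torch.randn((M, K), device="cuda", dtype=torch.float32)
        b = torch.randn((K, N), device="cuda", dtype=torch.float32)
        c = torch.empty((M, N), device="cuda", dtype=torch.float32)
        # Launching a @triton.jit kernel returns the compiled kernel handle.
        h = _dot_tf32_n8k8[(1, )](a, b, c, M=M, N=N, K=K, num_warps=num_warps)
        # TF32 truncates the mantissa, so compare with a loose tolerance.
        torch.testing.assert_close(c, a @ b, atol=1e-2, rtol=1e-2)
        assert "wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32" in h.asm["ptx"]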

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@mwichro mwichro requested review from Jokeren and ptillet as code owners May 5, 2026 19:44
Comment thread third_party/nvidia/backend/compiler.py

    return (1, 1, 32)
elif lhs_bitwidth == 64:
    return (1, 1, 4)
elif lhs_bitwidth == 32:
Contributor:

This is more general than sm90+. Likely you need more code updates to support a smaller dot shape across archs. I'll defer to @ThomasRaoux to determine if this is the right direction.

Collaborator:

Yes I think it would be good to update Blackwell as well the same way.

Contributor Author:

I don't have access to Blackwell GPUs to test it directly, but I've updated MMAv5.

Contributor:

What about sm80 with TF32, are you able to use (1,1,8) to pass all tests?

Contributor Author (@mwichro, May 5, 2026):

A100 is sm80, where MMAv3 is not available, so this should have no effect anyway.

I have an A100, so this one I can test:

python -m pytest python/test/unit/language/test_core.py -k "dot" -q --no-header
736 passed, 1882 skipped, 6728 deselected in 286.06s (0:04:46)

Looks fine to me.

Wait, A100 has some other MMAv2 instructions that could also be used; let me check that.
Checked, it's fine.

Contributor Author:

Ah, btw, on review, Claude is suggesting:

So min_dot_size is reinventing the same formula with a hardcoded if/elif chain, ignoring the target parameter it receives. The fix is to use the formula directly:

def min_dot_size(target: GPUTarget):

    def check_dot_compatibility(lhs_type, rhs_type) -> Tuple[int, int, int]:
        lhs_bitwidth = lhs_type.scalar.primitive_bitwidth
        rhs_bitwidth = rhs_type.scalar.primitive_bitwidth
        assert lhs_bitwidth == rhs_bitwidth, "lhs and rhs bitwidth must be the same"
        # For small M/N we can still use tensor cores with padding.
        # The minimum K is determined by the native MMA tile: 256 / bitwidth.
        return (1, 1, 256 // lhs_bitwidth)

    return check_dot_compatibility

This is identical in behaviour to the current code for all supported types, eliminates the special-casing we added for 32-bit, and directly mirrors the hardware formula used in mmaVersionToInstrShape. It also correctly handles any future type (e.g., fp8 at 8-bit is already covered).
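For what it's worth, the arithmetic checks out against the K values quoted in this thread (the 16-bit case is the else-branch minimum described in the PR summary):

    # 8-bit -> K=32, 16-bit -> K=16, 32-bit -> K=8, 64-bit -> K=4
    for bitwidth, min_k in [(8, 32), (16, 16), (32, 8), (64, 4)]:
        assert 256 // bitwidth == min_k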

Collaborator:

I think I like it better the way it is right now

Contributor:

python -m pytest python/test/unit/language/test_core.py -k "dot" -q --no-header
736 passed, 1882 skipped, 6728 deselected in 286.06s (0:04:46)

We didn't have very small dot tests previously, and your test skips sm80.

@mwichro mwichro changed the title [ BACKEND ] Enable tl.dot with TF32 precision on tiles with **N=8** and **K=8** via the standard tt.dotAccelerateMatmul path on sm90+. [ BACKEND ] Enable tl.dot with TF32 precision on tiles with N=8 and K=8 May 5, 2026
Comment thread python/test/unit/language/test_core.py Outdated


@pytest.mark.parametrize("M, num_warps", [(64, 4), (128, 8)])
def test_dot_wgmma_tf32_n8k8(M, num_warps, device):
Collaborator:

can you add this to the current test_dot instead? There is already get_test_small_dots_cases

Contributor Author (@mwichro, May 6, 2026):

I just tried that: it turns out K has to be constexpr for this to properly emit the wgmma instruction.
So the less invasive way is probably to just add another test.

Comment thread python/test/unit/language/test_core.py Outdated
if not is_cuda():
    pytest.skip("WGMMA is NVIDIA-only")
capability = torch.cuda.get_device_capability()
if capability[0] < 9:
Contributor:

I still think there's something missing. What if you run this test on Ampere by removing this constraint? My point is, the current code may not work on sm80 when the tile size is as small as 8x8x8. Before, we at least emitted an error on the frontend.

Collaborator:

why? mma.sync's K dim is 8 for tf32, so why wouldn't it work?

Collaborator:

that being said, we should definitely run the test on all targets and not do it for sm_90 only. See my comment here: #10234 (comment)

Contributor:

why? mma.sync's K dim is 8 for tf32, so why wouldn't it work?

Yeah, in theory it should work, but I'd like to confirm by enabling tests.

Collaborator:

I agree

Contributor Author:

Done, sm80 is covered.

Comment thread lib/Analysis/Utility.cpp
Comment thread python/test/unit/language/test_core.py Outdated
for M, nw in [(64, 4), (128, 8)]
for dtype, K, prec, sm80, sm90 in _dot_n8_cases],
)
def test_dot_n8(M, num_warps, dtype, K, input_precision, sm80_ptx, sm90_ptx, device):
Collaborator:

why can't we add cases to the existing test_dot?

Contributor Author:

K must be constexpr, otherwise Triton is not able to emit wgmma for K=8 (I tried). I think assuming that K is constexpr is quite reasonable for such a small MMA.

Collaborator:

what K are we talking about here? the block size is always constexpr

Contributor Author:

Line 3520 in python/test/unit/language/test_core.py:

    if (K > 16 or N > 16 or M > 16) and (M * N // (num_warps * 32) >= 4):
        # XXX: skip small sizes because they are not vectorized

With runtime strides, Triton can only prove contiguity along the inner dim when it is at least 16 elements wide, so K >= 16 is required to vectorize loads to v4.

Contributor:

Are you talking about failures in any of the assert operations? Feel free to update the assert conditions

Collaborator:

how is vectorization related to this?

Contributor Author:

I misunderstood how this test works. I updated the tests, and I moved the MMA Mx8x8 tests inside test_dot.

Comment thread python/test/unit/language/test_core.py Outdated

is_tcgen5 = (capability[0] == 10) and (num_warps % 4) == 0 and (M % 64) == 0 and (N % 8) == 0

n_pat = '8' if N == 8 else r'\d+'
Collaborator:

let's remove this. I don't think it helps coverage much, and it makes the test even more complex.

Comment thread python/test/unit/language/test_core.py Outdated
Comment on lines +3297 to +3298
(16, 'ieee', 'float16', 'float32'),
(32, 'ieee', 'float16', 'float32')]]
Collaborator:

are those cases changing at all?

Comment thread python/test/unit/language/test_core.py Outdated
Comment on lines +3290 to +3298
def get_test_dot_n8_cases():
    if not is_cuda():
        return []
    return [(M, 8, K, nw, False, False, 'none', prec, in_dtype, out_dtype, 1, None)
            for M, nw in [(64, 4), (128, 8)]
            for K, prec, in_dtype, out_dtype in [(8, 'tf32', 'float32', 'float32'),
                                                 (16, 'tf32', 'float32', 'float32'),
                                                 (16, 'ieee', 'float16', 'float32'),
                                                 (32, 'ieee', 'float16', 'float32')]]
Collaborator:

feels like you could just add a couple of well-chosen tests in get_test_small_dots_cases to cover the right corner cases

Commit by @mwichro: Enable tl.dot with TF32 precision on tiles with N=8 and K=8 (e.g. wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32) via the standard tt.dot → AccelerateMatmul path on sm90+.
@mwichro (Contributor Author) commented May 11, 2026

Thanks for your patience and approval!

The test failures do not look related to the changes:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

They seem to originate from torch...

@mwichro (Contributor Author) commented May 13, 2026

Test passed

@ThomasRaoux ThomasRaoux merged commit 3de9d04 into triton-lang:main May 13, 2026
23 of 27 checks passed