[kernels] change heuristic of smem calculation to be more accurate by wendazhou · Pull Request #10386 · triton-lang/triton

wendazhou · 2026-05-27T06:06:49Z

Account more accurately for the shared memory (smem) usage of the persistent tf32 matmul kernel. With this change we correctly enable 4 stages for the tf32 persistent matmul when B is provided in the correct layout.

For an M=N=K=4096 matmul this leads to an improvement of about 13% on GB200, from 560 TFLOP/s to 630 TFLOP/s.

Validated that both NNN and NNT matmuls still run. Note that NNN does not run with 4 stages, so some level of accounting / capping is necessary.
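
For readers unfamiliar with this kind of heuristic, here is a minimal sketch of how a shared-memory budget can cap the number of stages. It is not the code changed in this PR: the function names, tile sizes, budget, and `extra_bytes` term are all hypothetical, chosen only to show how overcounting smem can drop a kernel from 4 stages to 3.

```python
# Illustrative only: every name and number here is an assumption,
# not taken from the Triton kernels touched by this PR.

TF32_BYTES = 4  # tf32 operands are stored as 32-bit values


def stage_bytes(block_m: int, block_n: int, block_k: int, elem_bytes: int) -> int:
    """Bytes of smem for one pipeline stage: one A tile plus one B tile."""
    a_tile = block_m * block_k * elem_bytes
    b_tile = block_k * block_n * elem_bytes
    return a_tile + b_tile


def pick_num_stages(per_stage: int, smem_budget: int,
                    extra_bytes: int = 0, max_stages: int = 4) -> int:
    """Largest stage count (capped at max_stages) that fits in the budget."""
    available = smem_budget - extra_bytes
    if available < per_stage:
        return 0  # not even one stage fits
    return min(max_stages, available // per_stage)


# Hypothetical configuration: 128x256x32 tiles, 200 KiB smem budget.
per_stage = stage_bytes(128, 256, 32, TF32_BYTES)  # 16 KiB + 32 KiB = 48 KiB
budget = 200 * 1024

# Accurate accounting: nothing extra reserved, so 4 stages fit.
accurate = pick_num_stages(per_stage, budget)

# Conservative accounting: an assumed 16 KiB overcount leaves room for only 3.
conservative = pick_num_stages(per_stage, budget, extra_bytes=16 * 1024)

print(accurate, conservative)
```

The cap via `max_stages` and the budget check together play the role of the accounting / capping mentioned above: a layout that genuinely needs more smem gets fewer stages instead of failing to launch.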

New contributor declaration

I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these rules.
I have run pre-commit run --from-ref origin/main --to-ref HEAD.
Select one of the following.
- I have added tests.
  - /test for lit tests
  - /unittest for C++ tests
  - /python/test for end-to-end tests
- This PR does not need a test because it is a straightforward heuristics change.
Select one of the following.
- I have not added any lit tests.
- The lit tests I have added follow these best practices,
  including the "tests should be minimal" section. (Usually running Python code
  and using the instructions it generates is not minimal.)

Change heuristic of smem calculation to be more accurate

7274cbc

This makes it so that we correctly enable 4 stages for the tf32 persistent matmul when B is provided in the correct layout. For an M=N=K=4096 matmul this leads to an improvement of about 13% on GB200, from 560 TFLOP/s to 630 TFLOP/s.

wendazhou requested a review from ptillet as a code owner May 27, 2026 06:06

ThomasRaoux approved these changes May 27, 2026

lezcano approved these changes May 27, 2026

lezcano merged commit 3e233a6 into main May 27, 2026
10 checks passed

lezcano deleted the wenda/4-stage-tf32-matmul branch May 27, 2026 13:09
