[kernels] change heuristic of smem calculation to be more accurate #10386

Merged
lezcano merged 1 commit into main from wenda/4-stage-tf32-matmul on May 27, 2026
@wendazhou
Collaborator

Account more accurately for the smem usage of the persistent tf32 matmul kernel. With this change, we correctly enable 4 stages for the tf32 persistent matmul when B is provided in the correct layout.

For an M=N=K=4096 matmul, this leads to an improvement of about 13% on GB200, from 560 TFLOP/s to 630 TFLOP/s.

Validated that both NNN and NNT matmuls still run. Note that NNN does not run with 4 stages, so some level of accounting / capping is necessary.
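
For readers unfamiliar with this kind of accounting, here is a minimal sketch of a stage-count heuristic. It is not the code changed in this PR: the function name `max_pipeline_stages`, the tile sizes, the 227 KiB budget, the `fixed_bytes` epilogue buffer, and the `extra_bytes_per_stage` overhead are all hypothetical values chosen for illustration. The idea is to add up the bytes each pipeline stage needs, subtract any fixed allocations from the budget, and cap the stage count at what fits.

```python
def max_pipeline_stages(
    block_m,
    block_n,
    block_k,
    elem_bytes,
    *,
    fixed_bytes=0,            # hypothetical: buffers allocated once (e.g. an epilogue tile)
    extra_bytes_per_stage=0,  # hypothetical: per-stage overhead beyond the A and B tiles
    smem_budget=227 * 1024,   # assumed per-block shared memory budget, in bytes
    max_stages=4,             # upper bound on stages the kernel is willing to use
):
    """Return how many pipeline stages fit in shared memory (0 if none fit)."""
    a_tile = block_m * block_k * elem_bytes  # one stage of the A operand
    b_tile = block_k * block_n * elem_bytes  # one stage of the B operand
    per_stage = a_tile + b_tile + extra_bytes_per_stage
    available = smem_budget - fixed_bytes
    if available < per_stage:
        return 0
    return min(max_stages, available // per_stage)


# tf32 operands occupy 4 bytes each; tile sizes here are illustrative only.
exact = max_pipeline_stages(128, 128, 32, 4, fixed_bytes=64 * 1024)
padded = max_pipeline_stages(
    128, 128, 32, 4, fixed_bytes=64 * 1024, extra_bytes_per_stage=16 * 1024
)
print(exact, padded)
```

With these made-up numbers, a configuration that needs no per-stage overhead fits 4 stages, while one that needs an extra 16 KiB per stage is capped at 3. This mirrors the situation described above, where one variant can run with 4 stages and another cannot, so the estimate has to be both accurate and capped.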

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because it is a straightforward heuristics change.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

This makes it so that we correctly enable 4-stage for tf32 persistent matmul
when B is provided in the correct layout.

For M=N=K=4096 matmul this leads to an improvement of about 13% on GB200,
from 560 TFLOP/s to 630 TFLOP/s.
@wendazhou wendazhou requested a review from ptillet as a code owner May 27, 2026 06:06
@lezcano lezcano merged commit 3e233a6 into main May 27, 2026
10 checks passed
@lezcano lezcano deleted the wenda/4-stage-tf32-matmul branch May 27, 2026 13:09