[KERNELS] Tuning for small batch mxfp4 matmuls with splitk by aeng-openai · Pull Request #9980 · triton-lang/triton

aeng-openai · 2026-04-09T17:18:42Z

In general, split-k was disabled in the MLP matmuls because, when the input is ragged, it is not trivial to choose the split-k factor statically (though it would be possible to choose it dynamically).

In the small-batch, non-ragged case, though, it is simple to allow split-k. This PR enables it there, adds some basic heuristic tuning for those cases, and optimizes the split-k reduce itself.
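For readers unfamiliar with the technique, here is a minimal plain-Python sketch of what split-k means. It is not the Triton kernel from this PR; the function name `matmul_split_k` and the `split_k` parameter are hypothetical, and the sketch assumes K divides evenly by the split factor.

```python
# Illustrative split-k matmul in plain Python (not Triton code).
# matmul_split_k and split_k are hypothetical names for this sketch.

def matmul_split_k(x, w, split_k):
    """Compute x @ w by splitting the K dimension into split_k slices."""
    m, k, n = len(x), len(w), len(w[0])
    assert k % split_k == 0, "this sketch assumes K divides evenly"
    k_per_split = k // split_k

    # Phase 1: each K slice produces an independent (M, N) partial result.
    # On a GPU these slices would be separate programs running in parallel.
    partials = []
    for s in range(split_k):
        k_lo, k_hi = s * k_per_split, (s + 1) * k_per_split
        partial = [[sum(x[i][kk] * w[kk][j] for kk in range(k_lo, k_hi))
                    for j in range(n)] for i in range(m)]
        partials.append(partial)

    # Phase 2: the split-k reduce sums the partial results elementwise.
    return [[sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(m)]


# Small batch (M=2) with a comparatively long K (K=8).
x = [[float(i + kk) for kk in range(8)] for i in range(2)]
w = [[float(kk * 3 + j) for j in range(3)] for kk in range(8)]
reference = matmul_split_k(x, w, split_k=1)
result = matmul_split_k(x, w, split_k=4)
```

With a small M there are few output tiles to spread across the GPU, so slicing K creates extra parallel work at the cost of the second reduce pass, which is why the reduce itself is worth tuning.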

These changes exposed a bug in the smem accounting heuristics: we weren't counting the smem needed to perform SWAP_XW. This PR fixes that.
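As a rough illustration of why a missing term in the smem accounting matters, the sketch below computes how many pipeline stages fit in a budget with and without an extra fixed cost. Everything here is an assumption for illustration: the function `max_stages`, the byte counts, and the size attributed to SWAP_XW are hypothetical and are not taken from Triton's heuristics.

```python
# Hypothetical shared-memory budget sketch; none of these names or
# numbers come from Triton's actual heuristics.

def max_stages(smem_budget, bytes_per_stage, extra_bytes=0):
    """Largest pipeline depth whose buffers plus fixed extras fit in smem."""
    return max((smem_budget - extra_bytes) // bytes_per_stage, 0)


budget = 200 * 1024       # assumed shared-memory budget in bytes
per_stage = 40 * 1024     # assumed bytes for one stage of operand tiles
swap_extra = 16 * 1024    # assumed extra smem attributed to SWAP_XW

stages_without_extra = max_stages(budget, per_stage)
stages_with_extra = max_stages(budget, per_stage, swap_extra)
```

Under these made-up numbers the uncorrected estimate allows one more stage than actually fits, which is the kind of over-commitment an accounting bug like this can cause.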

Perf will probably improve further after integrating the shuffled mxfp4 weight layout from #9698.

- get up to 5 stages for small batch matmuls with mxfp4 weights
- tune the split k reduce as well
- add a benchmark script for the reduce

perf will probably get even better after integrating the shuffled mxfp4 weight layout as well

This reverts commit f06e5941c93e5e345cb93ea6d14e82b69557556c.


aeng-openai added 3 commits April 9, 2026 10:07

smaller block_n if it allows no split k

43ff4c6

Revert "smaller block_n if it allows no split k"

093bb17

This reverts commit f06e5941c93e5e345cb93ea6d14e82b69557556c.

aeng-openai requested a review from ptillet as a code owner April 9, 2026 17:18

aeng-openai changed the title ~~Tuning for small batch mxfp4 matmuls with splitk~~ [KERNELS] Tuning for small batch mxfp4 matmuls with splitk Apr 9, 2026

ThomasRaoux approved these changes Apr 9, 2026


aeng-openai added 5 commits April 9, 2026 11:12

overrideable opt flags for reduce in case bad heuristics

ffd91a2

formatting

e94d5cf

Merge branch 'main' into aeng/small-batch-mxfp4-again

00d73a1

fix heuristic

99eb170

fix heuristic

3374c4c

aeng-openai merged commit 617cff0 into triton-lang:main Apr 9, 2026
17 of 18 checks passed

aeng-openai merged 8 commits into triton-lang:main from aeng-openai:aeng/small-batch-mxfp4-again


2 participants
