
[KERNELS] simplify mx shuffled weights defaults#9986

Merged
aeng-openai merged 5 commits into triton-lang:main from aeng-openai:aeng/mx-shuffle
Apr 10, 2026

Conversation

@aeng-openai
Collaborator

@aeng-openai aeng-openai commented Apr 10, 2026

Simplify use of shuffled Blackwell MX value weights

  • convert directly to BlackwellMX4ValueShuffledLayout; don't require first going through BlackwellValueLayout
  • use the block sizes from BlackwellMX4ValueShuffledLayout as opt-flag constraints. This removes the complicated code previously needed to infer the block sizes before constructing the layout. It also picks a better default of block_n = 256, block_k = 128, which generally works well and matches the inferred sizes except when N < 256, and it makes shuffled weights simpler to use: there is no longer a need to also override disable_mx4_block_swap = True.
  • add more test coverage

Performance is unchanged when running `torchrun --nproc-per-node=1 python/triton_kernels/bench/bench_mlp.py`.
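The default block-size rule described above (block_n = 256, block_k = 128, falling back to an inferred block_n only when N < 256) can be sketched as a small standalone function. This is an illustrative sketch, not the PR's actual code; the function name `pick_block_sizes` and the power-of-two fallback rule for small N are assumptions for illustration.

```python
def pick_block_sizes(n: int, default_block_n: int = 256, block_k: int = 128):
    """Return (block_n, block_k) defaults for the shuffled-weight layout.

    Hypothetical sketch: use the fixed defaults block_n = 256, block_k = 128,
    except when the weight's N dimension is smaller than 256, in which case
    clamp block_n to the smallest power of two covering N (one plausible
    inference rule; the real constraint comes from the layout itself).
    """
    if n >= default_block_n:
        block_n = default_block_n
    else:
        # Next power of two >= n, so the block still tiles the tensor.
        block_n = 1 << max(0, (n - 1).bit_length())
    return block_n, block_k
```

For example, a weight with N = 4096 gets the (256, 128) default, while N = 128 yields (128, 128); callers no longer need to pre-compute these sizes before constructing the layout.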

@aeng-openai aeng-openai marked this pull request as ready for review April 10, 2026 18:55
@aeng-openai aeng-openai requested a review from ptillet as a code owner April 10, 2026 18:55
Collaborator

@ThomasRaoux ThomasRaoux left a comment


Nice!

@aeng-openai aeng-openai merged commit 028e5da into triton-lang:main Apr 10, 2026
9 checks passed
aeng-openai added a commit that referenced this pull request Apr 11, 2026
plognjen pushed a commit to plognjen/triton that referenced this pull request Apr 14, 2026
