[Triton] Add Fused GEMM A8W8 + Split + Concat Triton Kernel #1553
Merged
Conversation
k50112113 previously approved these changes on Dec 13, 2025
k50112113 approved these changes on Jan 8, 2026
zhuyuhua-v pushed a commit that referenced this pull request on Jan 14, 2026
* add weight preshuffling for triton fp8 blockscale gemm
* add config interface
* add x_scale shuffle
* import
* add default config for gfx942
* fix get_config return
* fix
* Added tuned configs for gemm a8w8 blockscale preshuffled
* Fixed tuned configs keys
* resolve comments
* resolve comments
* Created a fused_kv_proj_cat kernel
* Created tests for the fused_kv_proj_cat kernel
* Renamed kernel
* Renamed R block size
* Ran black formatter
* UT comments
* move test file
* fix
* fix get_arch
* Implemented preshuffled GEMM + split + cat
* Ran black formatter
* Moved gemm to new folders
* Fixed merge
* Added transpose_scale parameter
* Added tests for fused reduce rms quant with transpose_scale
* Use ck from main
* Updated imports to follow dir structure

Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
valarLip pushed a commit that referenced this pull request on Mar 18, 2026
Motivation
This PR adds a new fused Triton kernel. The kernel performs an FP8 GEMM with blockscale quantization, splits the GEMM result into two 3D tensors, and then concatenates a third input tensor to the first of the two splits.
Technical Details
Equivalent to the following sequence:
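The equivalent sequence itself is not included in this capture of the PR. Based on the Motivation above (blockscale GEMM, split, concat), an unfused PyTorch sketch of the behavior might look like the following; the function name, argument layout, concatenation order, and 2D shapes are illustrative assumptions, and plain float tensors stand in for the FP8 inputs with their block scales pre-broadcast to elementwise:

```python
import torch


def gemm_split_cat_reference(x, w, x_scale, w_scale, extra, split_sizes):
    """Hypothetical unfused reference for the fused kernel.

    x:        (M, K) activations (float stand-in for FP8)
    w:        (K, N) weights (float stand-in for FP8)
    x_scale:  blockscale for x, pre-broadcast to (M, K)
    w_scale:  blockscale for w, pre-broadcast to (K, N)
    extra:    (M, E) tensor concatenated onto the first split
    split_sizes: sizes of the two splits along the last dim
    """
    # 1) Dequantized GEMM: apply the block scales, then matmul.
    y = (x * x_scale) @ (w * w_scale)
    # 2) Split the GEMM output into two tensors along the last dim.
    first, second = torch.split(y, split_sizes, dim=-1)
    # 3) Concatenate the extra tensor to the first split
    #    (order of the concat is an assumption).
    fused_out = torch.cat([extra, first], dim=-1)
    return fused_out, second


# Small shape check of the sketch.
x = torch.randn(4, 8)
w = torch.randn(8, 6)
out, second = gemm_split_cat_reference(
    x, w, torch.ones(4, 8), torch.ones(8, 6), torch.randn(4, 2), (4, 2)
)
print(out.shape, second.shape)  # torch.Size([4, 6]) torch.Size([4, 2])
```

In the actual kernel these three steps are fused into a single launch, avoiding the intermediate GEMM output round-trip through global memory.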
Test Plan
Test Result
Passes all unit tests comparing Triton output with PyTorch output.
Submission Checklist