[Triton] Add Fused GEMM A8W8 + Split + Concat Triton Kernel#1553

Merged
k50112113 merged 29 commits intomainfrom
farlukas/fused_gemm_a8w8_blockscale_split_cat
Jan 9, 2026
Conversation

@farlukas
Contributor

@farlukas farlukas commented Dec 3, 2025

Motivation

This PR adds a new Triton fusion kernel. The fused kernel performs an FP8 GEMM using blockscale quantization, then splits the GEMM result into two 3D tensors and concatenates a third tensor onto the first split.
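For readers unfamiliar with blockscale quantization, the idea can be sketched outside Triton. The following is a minimal NumPy illustration (int8 standing in for FP8, and a hypothetical block size of 4); the real kernel does the dequantize-and-accumulate inside the fused Triton GEMM rather than materializing dequantized tensors:

```python
import numpy as np

BLK = 4  # hypothetical block size along K for illustration

def quantize_blockscale(t, blk=BLK):
    # One scale per block of `blk` elements along the last dim.
    r, c = t.shape
    blocks = t.reshape(r, c // blk, blk)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 127.0
    q = np.round(blocks / scales).astype(np.int8)
    return q.reshape(r, c), scales.squeeze(-1)

def gemm_blockscale(xq, xs, wq, ws, blk=BLK):
    # Dequantize each block with its own scale, then matmul.
    m, k = xq.shape
    x = (xq.reshape(m, k // blk, blk) * xs[..., None]).reshape(m, k)
    n = wq.shape[0]
    w = (wq.reshape(n, k // blk, blk) * ws[..., None]).reshape(n, k)
    return x @ w.T

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16)).astype(np.float32)
w = rng.standard_normal((12, 16)).astype(np.float32)
xq, xs = quantize_blockscale(x)
wq, ws = quantize_blockscale(w)
out = gemm_blockscale(xq, xs, wq, ws)
# Quantized GEMM should stay close to the full-precision matmul.
assert np.abs(out - x @ w.T).max() < 1.0
```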

Technical Details

Equivalent to the following sequence:

c = (x @ w).view(-1, y.shape[1], S1 + S2)
c1, c2 = c.split([S1, S2], dim=-1)
c1 = torch.cat([c1, y], dim=-1)
return c1, c2
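The sequence above can be checked at the shape level. A small NumPy sketch (with hypothetical sizes; `H` plays the role of `y.shape[1]`):

```python
import numpy as np

# Hypothetical sizes for illustration only.
M, K, S1, S2, H = 6, 8, 3, 5, 2
rng = np.random.default_rng(0)
x = rng.standard_normal((M * H, K)).astype(np.float32)
w = rng.standard_normal((K, S1 + S2)).astype(np.float32)
y = rng.standard_normal((M, H, 4)).astype(np.float32)  # tensor appended to the first split

c = (x @ w).reshape(-1, H, S1 + S2)    # view the 2D GEMM output as 3D
c1, c2 = c[..., :S1], c[..., S1:]      # split along the last dim
c1 = np.concatenate([c1, y], axis=-1)  # concatenate y onto the first split

assert c2.shape == (M, H, S2)
assert c1.shape == (M, H, S1 + 4)
```

The fused kernel produces `c1` and `c2` directly from the GEMM epilogue, avoiding the intermediate `c` and the extra memory traffic of separate split and concat ops.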

Test Plan

pytest op_tests/triton_tests/test_fused_gemm_a8w8_blockscale_split_cat.py

Test Result

Passes all unit tests comparing Triton output with PyTorch output.

Submission Checklist

@farlukas farlukas force-pushed the farlukas/fused_gemm_a8w8_blockscale_split_cat branch from 39f32e8 to 0939360 Compare December 4, 2025 16:10
@k50112113 k50112113 marked this pull request as ready for review December 12, 2025 21:40
@k50112113 k50112113 requested a review from a team December 12, 2025 21:40
@k50112113 k50112113 force-pushed the farlukas/fused_gemm_a8w8_blockscale_split_cat branch from 3669008 to 7e815c1 Compare December 12, 2025 21:46
k50112113
k50112113 previously approved these changes Dec 13, 2025
@farlukas farlukas force-pushed the farlukas/fused_gemm_a8w8_blockscale_split_cat branch from 2242ba1 to f36a34a Compare January 6, 2026 19:58
@farlukas farlukas force-pushed the farlukas/fused_gemm_a8w8_blockscale_split_cat branch 2 times, most recently from 7b04721 to 5b63d75 Compare January 7, 2026 21:18
@farlukas farlukas force-pushed the farlukas/fused_gemm_a8w8_blockscale_split_cat branch from 5b63d75 to 7c39488 Compare January 7, 2026 21:22
@k50112113 k50112113 requested review from azaidy and k50112113 January 8, 2026 22:22
@k50112113 k50112113 merged commit 35e3f68 into main Jan 9, 2026
19 checks passed
@k50112113 k50112113 deleted the farlukas/fused_gemm_a8w8_blockscale_split_cat branch January 9, 2026 19:00
zhuyuhua-v pushed a commit that referenced this pull request Jan 14, 2026
* add weight preshuffling for triton fp8 blockscale gemm

* add config interface

* add x_scale shuffle

* import

* add default config for gfx942

* fix get_config return

* fix

* Added tuned configs for gemm a8w8 blockscale preshuffled

* Fixed tuned configs keys

* resolve comments

* resolve comments

* Created a fused_kv_proj_cat kernel

* Created tests for the fused_kv_proj_cat kernel

* Renamed kernel

* Renamed R block size

* Ran black formatter

* UT comments

* move test file

* fix

* fix get_arch

* Implemented preshuffled GEMM + split + cat

* Ran black formatter

* Moved gemm to new folders

* Fixed merge

* Added transpose_scale parameter

* Added tests for fused reduce rms quant with transpose_scale

* Use ck from main

* Updated imports to follow dir structure

---------

Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
valarLip pushed a commit that referenced this pull request Mar 18, 2026