[Triton] Add Fused GEMM A8W8 + Split + Concat Triton Kernel#1553

Merged
k50112113 merged 29 commits intomainfrom
farlukas/fused_gemm_a8w8_blockscale_split_cat
Jan 9, 2026
Conversation

@farlukas
Contributor

@farlukas farlukas commented Dec 3, 2025

Motivation

This PR adds a new Triton fusion kernel. The fused kernel performs an FP8 GEMM using blockscale quantization, then splits the GEMM result into two 3D tensors and concatenates a third tensor onto the first split.
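For readers unfamiliar with blockscale quantization, the idea can be sketched outside Triton. The following is a minimal NumPy illustration (int8 standing in for FP8, and a hypothetical block size of 4); the real kernel does the dequantize-and-accumulate inside the fused Triton GEMM rather than materializing dequantized tensors:

```python
import numpy as np

BLK = 4  # hypothetical block size along K for illustration

def quantize_blockscale(t, blk=BLK):
    # One scale per block of `blk` elements along the last dim.
    r, c = t.shape
    blocks = t.reshape(r, c // blk, blk)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 127.0
    q = np.round(blocks / scales).astype(np.int8)
    return q.reshape(r, c), scales.squeeze(-1)

def gemm_blockscale(xq, xs, wq, ws, blk=BLK):
    # Dequantize each block with its own scale, then matmul.
    m, k = xq.shape
    x = (xq.reshape(m, k // blk, blk) * xs[..., None]).reshape(m, k)
    n = wq.shape[0]
    w = (wq.reshape(n, k // blk, blk) * ws[..., None]).reshape(n, k)
    return x @ w.T

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16)).astype(np.float32)
w = rng.standard_normal((12, 16)).astype(np.float32)
xq, xs = quantize_blockscale(x)
wq, ws = quantize_blockscale(w)
out = gemm_blockscale(xq, xs, wq, ws)
# Quantized GEMM should stay close to the full-precision matmul.
assert np.abs(out - x @ w.T).max() < 1.0
```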

Technical Details

Equivalent to the following sequence:

c = (x @ w).view(-1, y.shape[1], S1 + S2)
c1, c2 = c.split([S1, S2], dim=-1)
c1 = torch.cat([c1, y], dim=-1)
return c1, c2
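The sequence above can be checked at the shape level. A small NumPy sketch (with hypothetical sizes; `H` plays the role of `y.shape[1]`):

```python
import numpy as np

# Hypothetical sizes for illustration only.
M, K, S1, S2, H = 6, 8, 3, 5, 2
rng = np.random.default_rng(0)
x = rng.standard_normal((M * H, K)).astype(np.float32)
w = rng.standard_normal((K, S1 + S2)).astype(np.float32)
y = rng.standard_normal((M, H, 4)).astype(np.float32)  # tensor appended to the first split

c = (x @ w).reshape(-1, H, S1 + S2)    # view the 2D GEMM output as 3D
c1, c2 = c[..., :S1], c[..., S1:]      # split along the last dim
c1 = np.concatenate([c1, y], axis=-1)  # concatenate y onto the first split

assert c2.shape == (M, H, S2)
assert c1.shape == (M, H, S1 + 4)
```

The fused kernel produces `c1` and `c2` directly from the GEMM epilogue, avoiding the intermediate `c` and the extra memory traffic of separate split and concat ops.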

Test Plan

pytest op_tests/triton_tests/test_fused_gemm_a8w8_blockscale_split_cat.py

Test Result

Passes all unit tests comparing Triton output with PyTorch output.

Submission Checklist

@farlukas farlukas force-pushed the farlukas/fused_gemm_a8w8_blockscale_split_cat branch from 39f32e8 to 0939360 Compare December 4, 2025 16:10
@k50112113 k50112113 marked this pull request as ready for review December 12, 2025 21:40
@k50112113 k50112113 requested a review from a team December 12, 2025 21:40
@k50112113 k50112113 force-pushed the farlukas/fused_gemm_a8w8_blockscale_split_cat branch from 3669008 to 7e815c1 Compare December 12, 2025 21:46
k50112113
k50112113 previously approved these changes Dec 13, 2025
@farlukas farlukas force-pushed the farlukas/fused_gemm_a8w8_blockscale_split_cat branch from 2242ba1 to f36a34a Compare January 6, 2026 19:58
@farlukas farlukas force-pushed the farlukas/fused_gemm_a8w8_blockscale_split_cat branch 2 times, most recently from 7b04721 to 5b63d75 Compare January 7, 2026 21:18
@farlukas farlukas force-pushed the farlukas/fused_gemm_a8w8_blockscale_split_cat branch from 5b63d75 to 7c39488 Compare January 7, 2026 21:22
@k50112113 k50112113 requested review from azaidy and k50112113 January 8, 2026 22:22
@k50112113 k50112113 merged commit 35e3f68 into main Jan 9, 2026
19 checks passed
@k50112113 k50112113 deleted the farlukas/fused_gemm_a8w8_blockscale_split_cat branch January 9, 2026 19:00
zhuyuhua-v pushed a commit that referenced this pull request Jan 14, 2026
* add weight preshuffling for triton fp8 blockscale gemm

* add config interface

* add x_scale shuffle

* import

* add default config for gfx942

* fix get_config return

* fix

* Added tuned configs for gemm a8w8 blockscale preshuffled

* Fixed tuned configs keys

* resolve comments

* resolve comments

* Created a fused_kv_proj_cat kernel

* Created tests for the fused_kv_proj_cat kernel

* Renamed kernel

* Renamed R block size

* Ran black formatter

* UT comments

* move test file

* fix

* fix get_arch

* Implemented preshuffled GEMM + split + cat

* Ran black formatter

* Moved gemm to new folders

* Fixed merge

* Added transpose_scale parameter

* Added tests for fused reduce rms quant with transpose_scale

* Use ck from main

* Updated imports to follow dir structure

---------

Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
valarLip pushed a commit that referenced this pull request Mar 18, 2026