[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler by aska-0096 · Pull Request #1913 · ROCm/composable_kernel

aska-0096 · 2025-02-25T02:36:18Z

Proposed changes

This is the first version of the optimized FP8 blockscale GEMM; an enhanced version will be delivered in the next few days.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

carlushuang

LGTM

ghostplant · 2025-03-01T05:37:17Z

@aska-0096 How do I profile and use this GEMM format (FP8 BlockScale GEMM)?

aska-0096 · 2025-03-03T02:44:43Z

Try

mkdir build; cd build;
sh ../script/cmake-ck-dev.sh ../ gfx942;
make -j ckProfiler
./bin/ckProfiler gemm_ab_scale

ghostplant · 2025-03-07T01:37:32Z

@aska-0096 Why is it reverted?

aska-0096 · 2025-03-07T01:54:49Z

It caused some unexpected memory consumption issues in CI. I am bringing it back in #1950.

ghostplant · 2025-03-22T12:52:08Z

@aska-0096 Does this API require shuffling the weights beforehand, as aiter.ck_moe() does?

aska-0096 · 2025-03-24T02:36:09Z

No, the current kernel doesn't need shuffled weights. But we will have a version that uses a weight-shuffle layout.

ghostplant · 2025-03-24T06:27:04Z

Nice, I like that it works on the original format. Can you share the correct command to evaluate this case?
I used this command, which doesn't work:

$ /opt/rocm/bin/ckProfiler gemm_ab_scale 1 1 0 1 2 1 1 32 4096 4096   4096 4096 32
this data_type & layout is not implemented

aska-0096 · 2025-03-24T06:33:22Z

Try
/opt/rocm/bin/ckProfiler gemm_ab_scale 7 1 1 0 2 0 1 32 4096 4096 -1 -1 -1 20 50 512
for (A[32, 4096] * AScale[32, 32]) x (B[4096, 4096] * BScale[32, 32]) = C[32, 4096].
The performance estimate is the execution time averaged over 50 timed runs after 20 warm-up runs, with a 512 MiB rotating buffer enabled to remove the impact of the cache.
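
To make the scale shapes in that command concrete, here is a minimal plain-Python sketch of what a block-scaled GEMM computes. It is not CK code. The block sizes (1 x 128 for A, 128 x 128 for B) are an assumption inferred from the shapes above (4096 / 32 = 128 along K and N, one scale row per row of A), and the K x N indexing of B is also an assumption; the names `scale_shape` and `blockscale_gemm` are made up for illustration.

```python
def scale_shape(rows, cols, block_rows, block_cols):
    """Number of scale entries needed to cover a rows x cols matrix."""
    # Ceiling division, so a partial block at the edge still gets a scale.
    return (-(-rows // block_rows), -(-cols // block_cols))


def blockscale_gemm(a, a_scale, b, b_scale, sbm=1, sbn=128, sbk=128):
    """Reference C = (A * AScale) @ (B * BScale) using lists of lists.

    a is M x K and b is K x N (assumed layout).
    a_scale[m // sbm][k // sbk] scales a[m][k].
    b_scale[k // sbk][n // sbn] scales b[k][n].
    """
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    c = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            acc = 0.0
            for k in range(k_dim):
                # Dequantize both operands with their block's scale.
                a_val = a[m][k] * a_scale[m // sbm][k // sbk]
                b_val = b[k][n] * b_scale[k // sbk][n // sbn]
                acc += a_val * b_val
            c[m][n] = acc
    return c
```

Under these assumed block sizes, `scale_shape(32, 4096, 1, 128)` and `scale_shape(4096, 4096, 128, 128)` both give `(32, 32)`, matching AScale and BScale in the command, and the result has M x N = 32 x 4096 entries.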

ghostplant · 2025-03-24T08:20:30Z

Thank you, do you know its current performance against FP8 Rowwise-scale GEMM (i.e. on MI300)? Do both outperform w16a16?

aska-0096 · 2025-03-24T08:26:50Z

For compute-bound cases, blockscale is not as good as FP8 rowwise GEMM because of algorithm and tile-size limitations.
For memory-bound cases, blockscale should be on par with FP8 rowwise GEMM at the standalone-kernel level.
I think both of them outperform w16a16 GEMM.
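
One rough way to tell which regime a problem size falls in is its arithmetic intensity (FLOPs per byte moved). The sketch below is a back-of-the-envelope estimate, not something from CK: it assumes 1-byte FP8 inputs, a 2-byte output, a single pass over each tensor, and it ignores the scale tensors. The function name is hypothetical.

```python
def gemm_arithmetic_intensity(m, n, k, in_bytes=1, out_bytes=2):
    """FLOPs per byte of memory traffic for C[m, n] = A[m, k] @ B[k, n].

    Counts one read of A and B and one write of C. Scale tensors and
    cache reuse are ignored; out_bytes=2 assumes a 16-bit output.
    """
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    bytes_moved = in_bytes * (m * k + k * n) + out_bytes * m * n
    return flops / bytes_moved


small_m = gemm_arithmetic_intensity(32, 4096, 4096)    # about 62.5 FLOPs/byte
square = gemm_arithmetic_intensity(4096, 4096, 4096)   # 2048 FLOPs/byte
```

The M=32 case from the profiler command has a far lower intensity than the square case, which is why small-M problems tend toward the memory-bound side, where the two FP8 variants are expected to behave similarly.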

ghostplant · 2025-03-24T13:26:18Z

I am using the latest develop branch, but I have no idea why the suggested command still doesn't work after a fresh build:

/mnt/composable_kernel/build/bin$ ./ckProfiler gemm_ab_scale 7 1 1 0 2 0 1 32 4096 4096 -1 -1 -1 20 50 512
cannot find operation: gemm_ab_scale

aska-0096 · 2025-03-24T13:33:54Z

It seems the operator was not even included in ckProfiler. Could you check whether ./ckProfiler gemm_ab_scale works?
Another thing to check: the operator is only enabled on MI300 GPUs, so you need to set gfx942 when you build CK.

Meanwhile, let me check if the develop branch and command work on my machine.

ghostplant · 2025-03-25T03:13:57Z

It works after completing a full make -j. Does it support bmm, which may involve another blockIdx.z to resolve the batch dimension?

aska-0096 · 2025-03-27T03:54:21Z

Hi, did you enable cache flushing on the rocBLAS side? Otherwise, the comparison is unfair.

Currently, we don't have bmm support, but it's not hard to add that support; it depends on the priority.
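
For readers unfamiliar with the term, bmm is semantically one independent GEMM per batch index, which is why a GPU kernel could map the batch dimension to an extra grid dimension such as blockIdx.z. The following is a hypothetical plain-Python illustration of that semantics only; it is not CK's API, and `matmul` and `batched_matmul` are names invented here.

```python
def matmul(a, b):
    """Plain GEMM on lists of lists: a is M x K, b is K x N."""
    k_dim, n_dim = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][n] for k in range(k_dim)) for n in range(n_dim)]
        for row in a
    ]


def batched_matmul(a_batch, b_batch, gemm=matmul):
    """Apply `gemm` independently to each (A, B) pair in the batch."""
    if len(a_batch) != len(b_batch):
        raise ValueError("batch sizes of A and B must match")
    # Each iteration is independent, like one blockIdx.z slice on a GPU.
    return [gemm(a, b) for a, b in zip(a_batch, b_batch)]
```

Because the per-batch products share no data, any single-GEMM routine can be passed in as `gemm` without changing the batching logic.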

mtgu0705 and others added 16 commits December 20, 2024 15:42

Added two kernel for M=32 problem

f294808

Comment the first one

e5bc56a

Enable multiply_multiply for Scale_Block_M = 1 for deepseek

1fcd332

Modify the a_thread offset since the A data load is different from B.

f728087

edit fp8 ab scale for Scale_Block_M=1

988478d

edit GemmSpec to MNKPadding

d58d55e

enable blockwise pipelie v1 and v2. v1 is work for small K.

9dac971

add instance for gemm_ab_scale

363b674

fix cmakelist of ckProfiler

7ae141f

Merge branch 'develop' of https://github.com/ROCm/composable_kernel i…

3df24f0

…nto f8blockscale_opt

optimize blockscale gemm. todo: reduce vgpr usage

3d4ad53

fix a correctness bug

b9a97f4

sanity checked

dd6d879

revert ckprofiler cmake changes

00c5f0f

Merge branch 'develop' of https://github.com/ROCm/composable_kernel i…

da2f9e0

…nto f8blockscale_opt

clang format

2367a4f

aska-0096 requested review from a team, afagaj, andriy-ca, aosewski, bartekxk, carlushuang, geyyer, illsilin, poyenc and qianfengz as code owners February 25, 2025 02:36

aska-0096 added 2 commits February 25, 2025 02:41

revert unnecessary changes.

4d56921

remove commented codes.

41fab2d

carlushuang approved these changes Feb 25, 2025

carlushuang merged commit 020148d into develop Feb 25, 2025

asleepzzz added a commit that referenced this pull request Mar 3, 2025

Revert "[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProf…

755ee4e

…iler (#1913)" This reverts commit 020148d.

asleepzzz mentioned this pull request Mar 3, 2025

Revert "[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProf… #1933

Merged

illsilin pushed a commit that referenced this pull request Mar 3, 2025

Revert "[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProf…

ef16010

…iler (#1913)" (#1933) This reverts commit 020148d.
