[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler #1913

Merged
carlushuang merged 18 commits into develop from f8blockscale_opt
Feb 25, 2025

Conversation

@aska-0096
Contributor

@aska-0096 aska-0096 commented Feb 25, 2025

Proposed changes

This is the first version of the optimized FP8 blockscale GEMM; an enhanced version will follow in the coming days.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

Contributor

@carlushuang carlushuang left a comment

LGTM

@carlushuang carlushuang merged commit 020148d into develop Feb 25, 2025
@ghostplant

@aska-0096 How do I profile and use this GEMM format (FP8 BlockScale GEMM)?

@aska-0096
Contributor Author

Try

mkdir build; cd build;
sh ../script/cmake-ck-dev.sh ../ gfx942;
make -j ckProfiler
./bin/ckProfiler gemm_ab_scale

asleepzzz added a commit that referenced this pull request Mar 3, 2025
illsilin pushed a commit that referenced this pull request Mar 3, 2025
@ghostplant

@aska-0096 Why is it reverted?

@aska-0096
Contributor Author

There were some unexpected memory consumption issues in CI. I am bringing it back in #1950.

@ghostplant

@aska-0096 Does this API require shuffling the weights beforehand, as aiter.ck_moe() does?

@aska-0096
Contributor Author

No, the current kernel doesn't need shuffled weights, but we will have a version that uses the weight-shuffle layout.

@ghostplant

ghostplant commented Mar 24, 2025

Nice, I like working from the original format. Can you share the correct command to evaluate this case?
I used this command, which doesn't work:

$ /opt/rocm/bin/ckProfiler gemm_ab_scale 1 1 0 1 2 1 1 32 4096 4096   4096 4096 32
this data_type & layout is not implemented

@aska-0096
Contributor Author

Try
/opt/rocm/bin/ckProfiler gemm_ab_scale 7 1 1 0 2 0 1 32 4096 4096 -1 -1 -1 20 50 512
for (A[32, 4096] * AScale[32, 32]) x (B[4096, 4096] * BScale[32, 32]) = C[32, 4096].
Performance is estimated as the execution time averaged over 50 repeats after 20 warm-up runs, with a 512 MiB rotating buffer enabled to remove the impact of the cache.
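
To make the shapes in this reply concrete, here is a minimal reference sketch of a block-scaled GEMM in plain Python. It is not CK's implementation. The scale granularity (one A scale per row per 128-wide K block, one B scale per 128 x 128 tile) is an assumption inferred from the quoted shapes, where 4096 / 32 = 128. The index order of the B scale and the names `blockscale_gemm` and `scale_shapes` are hypothetical, chosen only for illustration.

```python
# Reference (unoptimized) block-scaled GEMM on plain Python lists.
# Assumed layout: A is [M][K] with one scale per row per K-block,
# B is [K][N] with one scale per (K-block, N-block) tile.

def blockscale_gemm(a, a_scale, b, b_scale, block_n, block_k):
    m, k, n = len(a), len(b), len(b[0])
    assert len(a[0]) == k and k % block_k == 0 and n % block_n == 0
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for kb in range(k // block_k):
                # Partial dot product over one K-block of raw (unscaled) values.
                partial = sum(a[i][kk] * b[kk][j]
                              for kk in range(kb * block_k, (kb + 1) * block_k))
                # Apply both scales once per K-block.
                acc += partial * a_scale[i][kb] * b_scale[kb][j // block_n]
            c[i][j] = acc
    return c


def scale_shapes(m, n, k, block_n=128, block_k=128):
    """Scale shapes for a 1 x block_k A scale and a block_k x block_n B scale."""
    return (m, k // block_k), (k // block_k, n // block_n)


if __name__ == "__main__":
    # M=32, N=4096, K=4096 gives two 32 x 32 scale tensors, as quoted above.
    print(scale_shapes(32, 4096, 4096))
```

The scales are applied once per K block rather than once per element, which is why the accumulation is split into per-block partial sums.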

@ghostplant

ghostplant commented Mar 24, 2025

Thank you. Do you know its current performance against FP8 rowwise-scale GEMM (e.g. on MI300)? Do both outperform w16a16?

@aska-0096
Contributor Author

For the compute-bound case, blockscale is not as good as FP8 rowwise GEMM because of algorithm and tile-size limitations.
For the memory-bound case, blockscale should be on par with FP8 rowwise GEMM at the standalone-kernel level.
I think both of them outperform w16a16 GEMM.
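
As a rough way to tell the two regimes apart, the sketch below estimates arithmetic intensity (FLOPs per byte moved) for a GEMM shape. It assumes 1-byte FP8 inputs and a 2-byte output, counts each tensor as read or written exactly once, and ignores the scale tensors; the function name is hypothetical. Whether a given intensity is memory bound or compute bound depends on the GPU's ratio of peak throughput to memory bandwidth, which this thread does not state.

```python
def gemm_arithmetic_intensity(m, n, k, in_bytes=1, out_bytes=2):
    """FLOPs per byte for C[m, n] = A[m, k] x B[k, n], each tensor touched once."""
    flops = 2 * m * n * k  # one multiply and one add per multiply-accumulate
    bytes_moved = in_bytes * (m * k + k * n) + out_bytes * m * n
    return flops / bytes_moved


if __name__ == "__main__":
    # The small-M shape from the profiler command versus a square shape.
    small_m = gemm_arithmetic_intensity(32, 4096, 4096)
    large_m = gemm_arithmetic_intensity(4096, 4096, 4096)
    print(f"M=32:   {small_m:.1f} FLOP/byte")
    print(f"M=4096: {large_m:.1f} FLOP/byte")
```

Under these assumptions the M=32 shape has a far lower intensity than the square shape, so it is the more likely of the two to be limited by memory bandwidth.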

@ghostplant

ghostplant commented Mar 24, 2025

I am using the latest develop branch, but I have no idea why the suggested command still doesn't work after a fresh build:

/mnt/composable_kernel/build/bin$ ./ckProfiler gemm_ab_scale 7 1 1 0 2 0 1 32 4096 4096 -1 -1 -1 20 50 512
cannot find operation: gemm_ab_scale

@aska-0096
Contributor Author

It seems the operator was not even included in ckProfiler. Could you check whether ./ckProfiler gemm_ab_scale works?
Another thing to check: the operator is only enabled on MI300 GPUs, so you need to set gfx942 when you build CK.

Meanwhile, let me check if the develop branch and command work on my machine.

@ghostplant

ghostplant commented Mar 25, 2025

It works after completing a full make -j. Does it support bmm (batched matrix multiplication), which might use blockIdx.z to index the batch dimension?

@aska-0096
Contributor Author

Hi, did you enable cache flushing on the rocBLAS side? Otherwise, the comparison is unfair.

Currently, we don't have bmm support, but it's not hard to add that support; it depends on the priority.
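
For readers unfamiliar with bmm, the sketch below shows the idea in plain Python: the same 2-D GEMM is applied independently to each batch entry. It is only an illustration with hypothetical function names, not CK's API or a proposed design. On a GPU the outer loop would typically be replaced by a grid dimension such as blockIdx.z, as the question suggests.

```python
def matmul(a, b):
    """Plain 2-D matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batched_matmul(a_batch, b_batch):
    """Apply the same 2-D GEMM to every batch entry.

    The loop index here plays the role a grid dimension (for example
    blockIdx.z) would play in a GPU kernel.
    """
    assert len(a_batch) == len(b_batch)
    return [matmul(a, b) for a, b in zip(a_batch, b_batch)]


if __name__ == "__main__":
    # Two batch entries, each a [1, 2] x [2, 1] product.
    print(batched_matmul([[[1, 2]], [[3, 4]]], [[[1], [1]], [[2], [0]]]))
```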
