Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture #7278
Merged: zhyncs merged 20 commits into sgl-project:main from ayrnb:feat/sm90_fp8_blockwise_grouped_gemm_r1 on Jul 3, 2025
1af8a4b feat/support sm90_fp8_blockwise_scaled_group_mm (ayrnb)
bbe9198 fix bug (ayrnb)
5fadc62 Fix sm_count for H20 GPU. (HydraQYH)
4ea113b fix/code refine (ayrnb)
aeeca14 [benchmark]: single kernel perf of cutlass/deepgemm/triton group gemm (TianQiLin666666)
f9e55be Use ATen to get device & properties. (HydraQYH)
465e8f5 Fix benchmark. (HydraQYH)
7587bba Optimize MMAConfig2 by using Pingpong schedule and new TiledMMA shape… (HydraQYH)
488ff0f [benchmark] add more shapes (TianQiLin666666)
117ecb8 add ep cases (TianQiLin666666)
47e7404 Use unique config for sm90 fp8 blockwise scaling grouped gemm. (HydraQYH)
54af9e9 Refine code. Support unitest. (HydraQYH)
33fb8b9 add Qwen3 benchmark cases (TianQiLin666666)
8b80ac7 fix bench shapes (TianQiLin666666)
3d81974 Remove duplicated check. (HydraQYH)
5cfea52 Merge branch 'main' into feat/sm90_fp8_blockwise_grouped_gemm_r1 (FlamingoPg)
28f6545 update init (ayrnb)
8d737dd pre-commit (ayrnb)
559fc54 format (ayrnb)
95f2860 Merge branch 'main' into feat/sm90_fp8_blockwise_grouped_gemm_r1 (ayrnb)
This kernel's sm_count configuration does not perform well on H800.
@yuan-luo What is the SM count on H800? This looks like it may be caused by load imbalance, and we have some engineers working on optimizing load balancing. Since I don't have access to an H800 machine, could you please provide an ncu report?
@yuan-luo We use the SM count as the number of CTAs. I checked and found that H800 has 144 SM Cores, the same as H100. In the second and fourth configurations, there are only (256/128)*(512/128)*256 = 2048 output tiles, and 2048 / 144 ≈ 14.22, so the last wave may keep only about 20% of the SM Cores busy. I guess this may be the cause of the performance issue, and I hope you can provide an ncu report to confirm.
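To make the arithmetic above easier to check, here is a minimal sketch of the tail-wave estimate. It is not code from this PR: `wave_stats` and its parameter names are hypothetical, the reading of 256 x 512 as the per-group output shape with 256 groups is an assumption taken from the formula in the comment, and it assumes each CTA handles one 128 x 128 output tile per wave.

```python
import math


def wave_stats(m, n, num_groups, tile_m=128, tile_n=128, num_ctas=144):
    """Estimate how evenly output tiles spread over a fixed number of CTAs.

    Hypothetical helper for illustration only. Assumes every CTA processes
    one output tile per wave, so a partial last wave leaves CTAs idle.
    """
    # Output tiles per group, times the number of groups.
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n) * num_groups
    # Full waves use every CTA; the remainder forms the tail wave.
    full_waves, tail_tiles = divmod(tiles, num_ctas)
    # Fraction of CTAs that have work in the tail wave (1.0 if no tail).
    tail_utilization = tail_tiles / num_ctas if tail_tiles else 1.0
    return tiles, full_waves, tail_tiles, tail_utilization


# Numbers from the comment: (256/128) * (512/128) * 256 tiles on 144 CTAs.
tiles, full_waves, tail_tiles, tail_util = wave_stats(256, 512, 256)
print(f"tiles={tiles}, full_waves={full_waves}, "
      f"tail_tiles={tail_tiles}, tail_utilization={tail_util:.1%}")
```

With these inputs the tail wave holds 32 of 144 possible tiles, which matches the roughly 20% figure quoted above. Changing `num_ctas` shows how sensitive the tail is to the CTA count chosen for a given GPU.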
@HydraQYH I'm working on a related PR for an FP8 MoE kernel on Hopper. I'll post an update and link it to this PR later.