
Conversation

@qingquansong qingquansong commented Mar 15, 2025

Motivation

PR adapted and improved from #3191.
Rewrote the macro and extended it to support any power-of-2 number of experts and expert groups, all topk_group / topk use cases, and fp16/bf16/fp32 dtypes.
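
For context, here is a minimal eager-mode sketch of the grouped gate selection this kernel fuses. It is illustrative only: the helper name is hypothetical, the group score is taken as the max expert score within each group (DeepSeek variants differ, e.g. sum of top-2 plus a correction bias), and weight normalization is omitted.

```python
import torch


def grouped_topk_reference(
    scores: torch.Tensor,   # [num_tokens, num_experts] router scores
    num_expert_group: int,  # e.g. 8 for DeepSeek V3
    topk_group: int,        # e.g. 4
    topk: int,              # e.g. 8
):
    """Illustrative sketch: top-k experts restricted to the top-k expert groups."""
    num_tokens, num_experts = scores.shape
    experts_per_group = num_experts // num_expert_group

    # Score each group by its best expert (one common variant; the real reference may differ).
    group_scores = scores.view(num_tokens, num_expert_group, experts_per_group).max(dim=-1).values

    # Keep only the top `topk_group` groups per token.
    group_idx = torch.topk(group_scores, k=topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)

    # Expand the group mask to expert granularity and drop experts in unselected groups.
    expert_mask = (
        group_mask.unsqueeze(-1)
        .expand(num_tokens, num_expert_group, experts_per_group)
        .reshape(num_tokens, num_experts)
    )
    masked_scores = scores.masked_fill(expert_mask == 0, float("-inf"))

    # Final top-k experts among the selected groups.
    topk_weights, topk_ids = torch.topk(masked_scores, k=topk, dim=-1)
    return topk_weights, topk_ids
```

For the DeepSeek V3 configuration below this would be called with num_expert_group=8, topk_group=4, topk=8.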

TODO:

  • Add TRT-LLM reference? (currently cannot find the original reference)
  • End-to-end performance run

NOTE:

  • The current CUDA kernel matches the eager-mode results for all supported cases, but running the original torch implementation under torch.compile (static or dynamic) produces results that mismatch eager mode. (Not a problem for this PR, but worth investigating in the future.)
  • Currently only supports a number of experts that is a power of 2.
  • Currently forces # experts / # expert groups <= 32 because the AlignedArray size is fixed in the expression (MAX_VPT=32); later we can make this dynamically equal to params.VPT, which could improve speed for smaller cases. See the constraint sketch after this list.
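
A rough host-side check of the constraints listed above (function name and error messages are mine, not the PR's actual code):

```python
def check_moe_fused_gate_supported(num_experts: int, num_expert_group: int, max_vpt: int = 32) -> None:
    """Illustrative sketch of the current kernel constraints."""
    # The number of experts must be a power of 2.
    if num_experts <= 0 or (num_experts & (num_experts - 1)) != 0:
        raise ValueError(f"num_experts={num_experts} must be a power of 2")
    if num_experts % num_expert_group != 0:
        raise ValueError("num_experts must be divisible by num_expert_group")
    # Experts per group is bounded by the fixed AlignedArray size (MAX_VPT=32).
    if num_experts // num_expert_group > max_vpt:
        raise ValueError(
            f"experts per group = {num_experts // num_expert_group} exceeds MAX_VPT={max_vpt}"
        )
```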

Test

Unit tests (included in this PR).
Speed test for the DeepSeek V3 case: 256 experts, 8 expert groups, first select the top-4 expert groups, then the top-8 experts from those groups.

| seq_length | Original (eager) | Original (compile static) | Original (compile dynamic) | SGL Kernel (cutlass dtype/array) | SGL Kernel (cutlass fp16) | SGL Kernel (cutlass fp32) | SGL Kernel (bfloat16 native dtype/array) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5k  | 251.12  | 199.94  | 213.01  | 26.40 | 26.56 | 36.06  | 31.26  |
| 10k | 445.47  | 381.00  | 399.78  | 37.82 | 37.22 | 52.93  | 44.67  |
| 15k | 589.89  | 510.21  | 534.14  | 43.39 | 44.74 | 74.46  | 52.54  |
| 20k | 772.83  | 678.59  | 707.28  | 54.11 | 54.50 | 87.46  | 64.77  |
| 25k | 957.54  | 848.14  | 879.70  | 59.84 | 59.30 | 100.05 | 70.98  |
| 30k | 1100.74 | 979.17  | 1015.07 | 68.10 | 68.70 | 118.94 | 81.60  |
| 35k | 1284.54 | 1152.86 | 1190.45 | 78.50 | 78.94 | 137.25 | 93.86  |
| 40k | 1432.72 | 1283.01 | 1325.00 | 84.38 | 86.05 | 152.22 | 100.64 |
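
A minimal sketch of how numbers like the ones above could be collected, assuming the op is exposed to Python roughly as `moe_fused_gate(scores, bias, num_expert_group, topk_group, topk)` and that the reported values are per-call latencies in microseconds (both are assumptions; the actual binding and units may differ):

```python
import torch
# Hypothetical import path; the actual module/op name in sgl-kernel may differ.
from sgl_kernel import moe_fused_gate


def bench_us(fn, warmup: int = 10, iters: int = 100) -> float:
    """Average latency of `fn` in microseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000.0 / iters  # elapsed_time is in ms


# DeepSeek V3 gating configuration from the table above.
num_experts, num_expert_group, topk_group, topk = 256, 8, 4, 8
for seq_len in (5_000, 10_000, 15_000, 20_000):
    scores = torch.rand(seq_len, num_experts, dtype=torch.bfloat16, device="cuda")
    bias = torch.rand(num_experts, dtype=torch.bfloat16, device="cuda")
    us = bench_us(lambda: moe_fused_gate(scores, bias, num_expert_group, topk_group, topk))
    print(f"seq_len={seq_len}: {us:.2f} us")
```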

Modifications

Checklist

@qingquansong qingquansong marked this pull request as draft March 15, 2025 04:20
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch 6 times, most recently from d6dc100 to fa12f60 Compare March 16, 2025 06:03
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch from fa12f60 to e5ba381 Compare March 16, 2025 07:00
@qingquansong qingquansong changed the title Add deepseek moe fused gate kernel [WIP] Add deepseek moe fused gate kernel Mar 16, 2025
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch 2 times, most recently from b647b14 to ec3a1f3 Compare March 16, 2025 07:24
@qingquansong qingquansong changed the title [WIP] Add deepseek moe fused gate kernel [WIP] Add deepseek fused moe group gate selection kernel Mar 16, 2025
@qingquansong qingquansong changed the title [WIP] Add deepseek fused moe group gate selection kernel [WIP] Add deepseek style fused moe group gate selection kernel Mar 16, 2025
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch from 206cf4b to 5127d70 Compare March 16, 2025 08:12

yiakwy-xpu-ml-framework-team commented Mar 16, 2025

@qingquansong Great job! Would you like to consider using AlignedArray and the native datatype? Of course I can create an amd/ck/fused_moe_gate.cu counterpart, but it would be great if I could reuse the code.

I think the fusion algorithm is great and we can do some deeper engineering work later.

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Mar 16, 2025

Hi @qingquansong, TRT-LLM moved "tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.h" (present up to v0.16.0) to "tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h" in v0.17.0 and hid the implementation in a static library.

Could you confirm that moe_fused_gate_impl was implemented by our team without referencing any prior implementation before #3191? (copyright concern)

If there is no such issue, that is great! Do we have ncu profiling to share?

@qingquansong qingquansong Mar 17, 2025

Hey @yiakwy-xpu-ml-framework-team, thanks! I can't seem to find it in https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu; only our previous moeSoftmax kernel was adapted from there. Maybe this is a new implementation, or it just referred to some similar code. @BBuf, do you happen to know where to find it in an older TRT-LLM version so we can add it as a reference? Thanks, both!

@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch from 5127d70 to 9e61c59 Compare March 17, 2025 01:07

qingquansong commented Mar 17, 2025

> @qingquansong Great job! Would you like to consider using AlignedArray and the native datatype? Of course I can create an amd/ck/fused_moe_gate.cu counterpart, but it would be great if I could reuse the code.
>
> I think the fusion algorithm is great and we can do some deeper engineering work later.

Definitely. I pushed a version with a USE_ROCM switch between the native type and the cutlass one; let me know if my understanding is correct. The speed results are in the description for reference. Thank you!

@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch 5 times, most recently from 1ad3b5c to e433ba5 Compare March 17, 2025 04:51
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch 4 times, most recently from 50e18a9 to fc5b464 Compare March 17, 2025 21:41
@qingquansong qingquansong changed the title [WIP] Add deepseek style fused moe group gate selection kernel Add deepseek style fused moe group gate selection kernel Mar 17, 2025
@qingquansong qingquansong marked this pull request as ready for review March 17, 2025 22:32
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch from b66591f to f81a27f Compare March 18, 2025 03:02