
Conversation

@qingquansong qingquansong commented Mar 15, 2025

Motivation

PR adapted and improved from #3191.
Rewrote the macro and extended it to support any power-of-2 number of experts and expert groups, all topk_group / topk use cases, and fp16/bf16/fp32 dtypes.
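
For context, here is a minimal eager-mode sketch of the grouped gate selection this kernel fuses. It is illustrative only: the helper name is hypothetical, the group score is taken as the max expert score within each group (DeepSeek variants differ, e.g. sum of top-2 plus a correction bias), and weight normalization is omitted.

```python
import torch


def grouped_topk_reference(
    scores: torch.Tensor,   # [num_tokens, num_experts] router scores
    num_expert_group: int,  # e.g. 8 for DeepSeek V3
    topk_group: int,        # e.g. 4
    topk: int,              # e.g. 8
):
    """Illustrative sketch: top-k experts restricted to the top-k expert groups."""
    num_tokens, num_experts = scores.shape
    experts_per_group = num_experts // num_expert_group

    # Score each group by its best expert (one common variant; the real reference may differ).
    group_scores = scores.view(num_tokens, num_expert_group, experts_per_group).max(dim=-1).values

    # Keep only the top `topk_group` groups per token.
    group_idx = torch.topk(group_scores, k=topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)

    # Expand the group mask to expert granularity and drop experts in unselected groups.
    expert_mask = (
        group_mask.unsqueeze(-1)
        .expand(num_tokens, num_expert_group, experts_per_group)
        .reshape(num_tokens, num_experts)
    )
    masked_scores = scores.masked_fill(expert_mask == 0, float("-inf"))

    # Final top-k experts among the selected groups.
    topk_weights, topk_ids = torch.topk(masked_scores, k=topk, dim=-1)
    return topk_weights, topk_ids
```

For the DeepSeek V3 configuration below this would be called with num_expert_group=8, topk_group=4, topk=8.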

TODO:

  • Add TRT-LLM reference? (currently cannot find the original reference)
  • End-to-end performance run

NOTE:

  • The current CUDA kernel matches the eager-mode results for all supported cases, but running the original torch implementation under torch.compile (static or dynamic) produces results that mismatch eager mode. (Not a problem for this PR, but worth investigating in the future.)
  • Currently only supports a number of experts that is a power of 2.
  • Currently forces # experts / # expert groups <= 32 because the AlignedArray size is fixed in the expression (MAX_VPT=32); later we can make this dynamically equal to params.VPT, which could improve speed for smaller cases. See the constraint sketch after this list.
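
A rough host-side check of the constraints listed above (function name and error messages are mine, not the PR's actual code):

```python
def check_moe_fused_gate_supported(num_experts: int, num_expert_group: int, max_vpt: int = 32) -> None:
    """Illustrative sketch of the current kernel constraints."""
    # The number of experts must be a power of 2.
    if num_experts <= 0 or (num_experts & (num_experts - 1)) != 0:
        raise ValueError(f"num_experts={num_experts} must be a power of 2")
    if num_experts % num_expert_group != 0:
        raise ValueError("num_experts must be divisible by num_expert_group")
    # Experts per group is bounded by the fixed AlignedArray size (MAX_VPT=32).
    if num_experts // num_expert_group > max_vpt:
        raise ValueError(
            f"experts per group = {num_experts // num_expert_group} exceeds MAX_VPT={max_vpt}"
        )
```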

Test

Unit tests (included in this PR).
Speed test for the DeepSeek V3 case: 256 experts, 8 expert groups, first select the top-4 expert groups, then the top-8 experts from those groups.

| seq_length | Original (eager) | Original (compile static) | Original (compile dynamic) | SGL Kernel (cutlass dtype/array) | SGL Kernel (cutlass fp16) | SGL Kernel (cutlass fp32) | SGL Kernel (bfloat16 native dtype/array) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5k  | 251.12  | 199.94  | 213.01  | 26.40 | 26.56 | 36.06  | 31.26  |
| 10k | 445.47  | 381.00  | 399.78  | 37.82 | 37.22 | 52.93  | 44.67  |
| 15k | 589.89  | 510.21  | 534.14  | 43.39 | 44.74 | 74.46  | 52.54  |
| 20k | 772.83  | 678.59  | 707.28  | 54.11 | 54.50 | 87.46  | 64.77  |
| 25k | 957.54  | 848.14  | 879.70  | 59.84 | 59.30 | 100.05 | 70.98  |
| 30k | 1100.74 | 979.17  | 1015.07 | 68.10 | 68.70 | 118.94 | 81.60  |
| 35k | 1284.54 | 1152.86 | 1190.45 | 78.50 | 78.94 | 137.25 | 93.86  |
| 40k | 1432.72 | 1283.01 | 1325.00 | 84.38 | 86.05 | 152.22 | 100.64 |
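
A minimal sketch of how numbers like the ones above could be collected, assuming the op is exposed to Python roughly as `moe_fused_gate(scores, bias, num_expert_group, topk_group, topk)` and that the reported values are per-call latencies in microseconds (both are assumptions; the actual binding and units may differ):

```python
import torch
# Hypothetical import path; the actual module/op name in sgl-kernel may differ.
from sgl_kernel import moe_fused_gate


def bench_us(fn, warmup: int = 10, iters: int = 100) -> float:
    """Average latency of `fn` in microseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000.0 / iters  # elapsed_time is in ms


# DeepSeek V3 gating configuration from the table above.
num_experts, num_expert_group, topk_group, topk = 256, 8, 4, 8
for seq_len in (5_000, 10_000, 15_000, 20_000):
    scores = torch.rand(seq_len, num_experts, dtype=torch.bfloat16, device="cuda")
    bias = torch.rand(num_experts, dtype=torch.bfloat16, device="cuda")
    us = bench_us(lambda: moe_fused_gate(scores, bias, num_expert_group, topk_group, topk))
    print(f"seq_len={seq_len}: {us:.2f} us")
```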

Modifications

Checklist

@qingquansong qingquansong marked this pull request as draft March 15, 2025 04:20
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch 6 times, most recently from d6dc100 to fa12f60 Compare March 16, 2025 06:03
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch from fa12f60 to e5ba381 Compare March 16, 2025 07:00
@qingquansong qingquansong changed the title Add deepseek moe fused gate kernel [WIP] Add deepseek moe fused gate kernel Mar 16, 2025
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch 2 times, most recently from b647b14 to ec3a1f3 Compare March 16, 2025 07:24
@qingquansong qingquansong changed the title [WIP] Add deepseek moe fused gate kernel [WIP] Add deepseek fused moe group gate selection kernel Mar 16, 2025
@qingquansong qingquansong changed the title [WIP] Add deepseek fused moe group gate selection kernel [WIP] Add deepseek style fused moe group gate selection kernel Mar 16, 2025
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch from 206cf4b to 5127d70 Compare March 16, 2025 08:12

yiakwy-xpu-ml-framework-team commented Mar 16, 2025

@qingquansong Great job! Would you like to consider using AlignedArray and the native datatype? Of course I can create an amd/ck/fused_moe_gate.cu counterpart, but it would be great if I could reuse the code.

I think the fusion algorithm is great and we can do some deeper engineering work later.

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Mar 16, 2025

Hi @qingquansong, TRT-LLM moved "tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.h" (present up to v0.16.0) to "tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h" in v0.17.0 and hid the implementation in a static library.

Could you confirm that moe_fused_gate_impl was implemented by our team without referencing any prior implementation before #3191? (copyright concern)

If there is no such issue, that is great! Do we have ncu profiling to share?

@qingquansong qingquansong Mar 17, 2025

Hey @yiakwy-xpu-ml-framework-team, thanks! I can't seem to find it in https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu; only our previous moeSoftmax kernel was adapted from there. Maybe this is a new implementation, or it just referred to some similar code. @BBuf, do you happen to know where to find it in an older TRT-LLM version so we can add it as a reference? Thanks, both!

@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch from 5127d70 to 9e61c59 Compare March 17, 2025 01:07

qingquansong commented Mar 17, 2025

> @qingquansong Great job! Would you like to consider using AlignedArray and the native datatype? Of course I can create an amd/ck/fused_moe_gate.cu counterpart, but it would be great if I could reuse the code.
>
> I think the fusion algorithm is great and we can do some deeper engineering work later.

Definitely. I pushed a version with a USE_ROCM switch between the native type and the cutlass one; let me know if my understanding is correct. The speed results are in the description for reference. Thank you!

@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch 5 times, most recently from 1ad3b5c to e433ba5 Compare March 17, 2025 04:51
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch 4 times, most recently from 50e18a9 to fc5b464 Compare March 17, 2025 21:41
@qingquansong qingquansong changed the title [WIP] Add deepseek style fused moe group gate selection kernel Add deepseek style fused moe group gate selection kernel Mar 17, 2025
@qingquansong qingquansong marked this pull request as ready for review March 17, 2025 22:32
@qingquansong qingquansong force-pushed the qsong/deepseek_fused_gate branch from b66591f to f81a27f Compare March 18, 2025 03:02