Add deepseek style fused moe group gate selection kernel #4445
Conversation
@qingquansong Great job! Would you like to consider using AlignedArray and native datatypes (see the sketch below)? Of course I can create an amd/ck/fused_moe_gate.cu counterpart, but it would be great if I could reuse the code. I think the fusion algorithm is great, and we can do some deeper engineering work later.
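For concreteness, here is a minimal CUDA sketch of the kind of aligned-array wrapper being suggested; `AlignedArray`, `VPT`, and `load_row_chunk` are illustrative names, not the PR's actual code.

```cuda
#include <cuda_runtime.h>

// Illustrative aligned vector wrapper: N elements of T packed so the compiler
// can emit a single vectorized load/store instead of N scalar ones.
template <typename T, int N>
struct alignas(sizeof(T) * N) AlignedArray {
  T data[N];
  __device__ T& operator[](int i) { return data[i]; }
  __device__ const T& operator[](int i) const { return data[i]; }
};

// Example use: each thread pulls VPT contiguous gate logits for its expert
// group with one aligned access (assumes `row` is suitably aligned).
template <typename T, int VPT>
__device__ void load_row_chunk(const T* row, int thread_idx,
                               AlignedArray<T, VPT>& out) {
  out = *reinterpret_cast<const AlignedArray<T, VPT>*>(row + thread_idx * VPT);
}
```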
Hi @qingquansong, TRT-LLM moved "tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.h" (present through v0.16.0) to "tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h" in v0.17.0 and hid the implementation behind a static library.
Could you tell me whether moe_fused_gate_impl was written by our team without referencing any prior implementation (#3191)? (This is a copyright concern.)
If there is no such issue, that is great! Do we have NCU profiling to share?
Hey @yiakwy-xpu-ml-framework-team, thanks! I can't seem to find it in https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu; only our previous moeSoftmax kernel is adapted from there, so maybe this is a new implementation, or it just referred to some similar code. @BBuf, do you happen to find it somewhere in the old TRT version so we can add it as a reference? Thanks both!
Definitely, pushed one version with a …
The PR is picked up at #4530. cc @zhyncs @yiakwy-xpu-ml-framework-team @BBuf @hebiao064 @zcnrex @HandH1998
Motivation
This PR is adapted and improved from #3191.
- Rewrote the macro and extended it to support all power-of-2 `# expert` & `# expert group` values, as well as all `# topk_group` & `# topk` use cases, with dtype support for `fp16`/`bf16`/`fp32`.

TODO:

NOTE:

- `# experts` is a power of 2.
- `# experts / # expert group <= 32`, since we fix the size of `AlignedArray` in the expression (`MAX_VPT=32`; later making this dynamically equal to the `params.VPT` size can improve the speed for smaller cases). A sketch of these constraints follows the list.
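To make the constraints above concrete, here is a hedged host-side sketch; `MAX_VPT` matches the note, while `check_gate_config` and its parameter names are hypothetical, not the PR's API.

```cuda
#include <cassert>

constexpr int MAX_VPT = 32;  // fixed AlignedArray size, per the note above

inline bool is_power_of_two(int x) { return x > 0 && (x & (x - 1)) == 0; }

// num_experts must be a power of 2, and each expert group may hold at most
// MAX_VPT experts so a single thread can keep one whole group in registers.
inline void check_gate_config(int num_experts, int num_expert_group) {
  assert(is_power_of_two(num_experts));
  assert(num_experts % num_expert_group == 0);
  assert(num_experts / num_expert_group <= MAX_VPT);
}
```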
Test

- Unit tests (in PR)
- Speed test for the DeepSeek V3 case: 256 experts, 8 expert groups; first select the top-4 expert groups, then the top-8 final experts (reference sketch below).
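For readers unfamiliar with the selection pattern, here is a plain C++ reference of the two-stage gating (not the fused kernel itself). The group score below is the max logit per group, which is one common choice; the PR's exact scoring may differ, and `group_gate_select` is a hypothetical name.

```cuda
#include <algorithm>
#include <numeric>
#include <vector>

// Two-stage selection for one token: score each expert group, keep the top
// `topk_group` groups, then pick the global top `topk` experts among them.
std::vector<int> group_gate_select(const std::vector<float>& logits,
                                   int num_groups, int topk_group, int topk) {
  const int group_size = static_cast<int>(logits.size()) / num_groups;

  // Stage 1: rank groups by their best expert logit, keep the top topk_group.
  std::vector<int> group_ids(num_groups);
  std::iota(group_ids.begin(), group_ids.end(), 0);
  auto group_score = [&](int g) {
    return *std::max_element(logits.begin() + g * group_size,
                             logits.begin() + (g + 1) * group_size);
  };
  std::partial_sort(
      group_ids.begin(), group_ids.begin() + topk_group, group_ids.end(),
      [&](int a, int b) { return group_score(a) > group_score(b); });

  // Stage 2: gather experts from the selected groups, take the global top-k.
  std::vector<int> candidates;
  for (int i = 0; i < topk_group; ++i)
    for (int e = 0; e < group_size; ++e)
      candidates.push_back(group_ids[i] * group_size + e);
  std::partial_sort(candidates.begin(), candidates.begin() + topk,
                    candidates.end(),
                    [&](int a, int b) { return logits[a] > logits[b]; });
  candidates.resize(topk);
  return candidates;  // indices of the final selected experts
}
```

For the DeepSeek V3 case above, this would be called with `num_groups = 8`, `topk_group = 4`, and `topk = 8` over 256 logits.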
Modifications
Checklist