[Kernel] Add fused grouped_topk kernel for MoE #23274
simon-mo merged 10 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a fused grouped_topk kernel for MoE, which shows a nice performance improvement based on the benchmarks. The implementation looks solid and the integration across the codebase is well-handled. I've found a critical issue related to const correctness in the new CUDA kernel that should be addressed. Otherwise, great work!
Force-pushed from 48cae44 to 8d77bdd
yewentao256
left a comment
Thanks for the work!
Could you also test the accuracy for R1?
lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=deepseek-ai/DeepSeek-R1-0528,num_concurrent=256" --tasks gsm8k
Thanks for your review!
In my other PR #23123, I have added the routed_scaling_factor to grouped_topk, which is coming from config.routed_scaling_factor. This PR is dependent on that PR.
If that PR (#23123) can get merged first, I will change this hardcoded value to the routed_scaling_factor passed in.
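To illustrate the dependency being discussed, here is a minimal pure-Python sketch of how a config-provided `routed_scaling_factor` could replace the hardcoded value; the `config` dict and helper name are hypothetical, not vLLM's actual code.

```python
# Hypothetical sketch: scale the selected expert weights by the model's
# routed_scaling_factor instead of a hardcoded constant.

def apply_routed_scaling(topk_weights, routed_scaling_factor=1.0):
    """Multiply each selected expert weight by routed_scaling_factor."""
    return [w * routed_scaling_factor for w in topk_weights]

# DeepSeek-style configs carry this value; the dict below stands in for
# the real model config object.
config = {"routed_scaling_factor": 2.0}
weights = [0.5, 0.25, 0.125]
print(apply_routed_scaling(weights, config["routed_scaling_factor"]))
# [1.0, 0.5, 0.25]
```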
What decides this - is it a heuristic or what is implemented?
Thanks for your review!
This is based on what is implemented, see grouped_topk_kernels.cu:
TORCH_CHECK(n_group <= 32,
"n_group should be smaller than or equal to 32 for now");
TORCH_CHECK(topk <= 32, "topk should be smaller than or equal to 32 for now");
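For context on what those limits constrain, here is a pure-Python sketch of the grouped top-k routing the kernel computes (mirroring the DeepSeek-style routing this kernel targets, not the CUDA implementation itself); selecting groups by their per-group maximum score is an assumption drawn from that routing scheme.

```python
# Sketch of grouped top-k routing: split expert scores into n_group groups,
# keep the topk_group best groups, then take the global top-k experts from
# the surviving groups.

def grouped_topk(scores, n_group, topk_group, topk):
    assert n_group <= 32 and topk <= 32  # the kernel limits quoted above
    n = len(scores)
    group_size = n // n_group
    groups = [scores[g * group_size:(g + 1) * group_size] for g in range(n_group)]
    # 1) rank groups by their best expert score, keep topk_group groups
    best = sorted(range(n_group), key=lambda g: max(groups[g]), reverse=True)
    kept = set(best[:topk_group])
    # 2) mask out experts in dropped groups, then take the global top-k
    masked = [s if i // group_size in kept else float("-inf")
              for i, s in enumerate(scores)]
    return sorted(range(n), key=lambda i: masked[i], reverse=True)[:topk]

scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.6]
print(grouped_topk(scores, n_group=4, topk_group=2, topk=3))
# [1, 3, 2] -- experts 6/7 are masked out despite high scores,
# because their group was not among the topk_group best groups.
```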
Force-pushed from 0333a29 to 10895ce
@yewentao256 Thanks for the review! I have run the accuracy testing. There's no accuracy drop:
Baseline:
This PR:
mgoin
left a comment
Hey @xyang16, I think everything checks out, thanks for working on this!
One thing we must fix before landing: I think this kernel does not work on AMD and broke that build: https://buildkite.com/vllm/ci/builds/28032/steps/canvas?jid=0198d277-447a-409d-857d-047cd38164d2#0198d277-447a-409d-857d-047cd38164d2/30-2862
#12 85.24 FAILED: [code=1] CMakeFiles/_moe_C.dir/csrc/moe/grouped_topk_kernels.hip.o
#12 85.24 /opt/rocm/lib/llvm/bin/clang++ -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_moe_C -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_PROF_API=1 -DUSE_RPC -DUSE_TENSORPIPE -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -D__HIP_ROCclr__=1 -D_moe_C_EXPORTS -I/app/vllm/build/temp.linux-x86_64-cpython-312/csrc -isystem /usr/include/python3.12 -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /opt/rocm-6.4.1/include/hiprand -isystem /opt/rocm-6.4.1/include/rocrand -Wno-unused-result -O2 -g -DNDEBUG --offload-arch=gfx90a --offload-arch=gfx942 -fPIC -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DHIP_ENABLE_WARP_SYNC_BUILTINS=1 -DUSE_ROCM -DENABLE_FP8 -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF_OPERATORS__ -Werror=unused-variable -fno-gpu-rdc -D_GLIBCXX_USE_CXX11_ABI=1 -DTORCH_HIP_VERSION=604 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLASLT_VEC_EXT -DHIP_ENABLE_WARP_SYNC_BUILTINS -MD -MT CMakeFiles/_moe_C.dir/csrc/moe/grouped_topk_kernels.hip.o -MF CMakeFiles/_moe_C.dir/csrc/moe/grouped_topk_kernels.hip.o.d -o CMakeFiles/_moe_C.dir/csrc/moe/grouped_topk_kernels.hip.o -x hip -c /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/moe/grouped_topk_kernels.hip
#12 85.24 /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/moe/grouped_topk_kernels.hip:27:10: fatal error: 'cooperative_groups/reduce.h' file not found
#12 85.24 27 | #include <cooperative_groups/reduce.h>
#12 85.24 | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#12 85.24 1 error generated when compiling for gfx90a.
CMakeLists.txt
Move this kernel to the CUDA section below
Good catch! I have moved this. Thanks.
nit/future work: maybe we should default routed_scaling_factor to None so we can skip the multiply if it isn't set
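The nit above could look like the following sketch (names are illustrative, not vLLM's actual signature): defaulting to `None` lets the hot path skip the multiply entirely when no scaling factor is configured.

```python
# Sketch of the reviewer's suggestion: treat None as "no scaling configured"
# and avoid the extra elementwise multiply in that case.

def scale_weights(topk_weights, routed_scaling_factor=None):
    if routed_scaling_factor is None:
        return topk_weights  # skip the multiply when unset
    return [w * routed_scaling_factor for w in topk_weights]

print(scale_weights([0.5, 0.25]))       # unchanged: [0.5, 0.25]
print(scale_weights([0.5, 0.25], 2.0))  # scaled:    [1.0, 0.5]
```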
This should include current_platform.is_cuda()?
I have included current_platform.is_cuda(). Thanks.
When is enable_fused set? I don't see any changes other than the test. Do we need this?
Initially I wanted to make grouped_topk and fused_grouped_topk separate, but ROCm only calls a single grouped_topk (https://github.com/vllm-project/vllm/blob/v0.10.1.1/vllm/model_executor/layers/fused_moe/layer.py#L55-L61), so I put fused_grouped_topk inside grouped_topk to keep the change minimal.
Thus I need a flag to distinguish grouped_topk from fused_grouped_topk in the test.
If it is just for testing, I would personally just use an environment variable. We have quite a few TEST env vars
I have added VLLM_USE_FUSED_MOE_GROUPED_TOPK env. Thanks.
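A gate like that could be sketched as below; the exact parsing in vLLM's envs module may differ (the `== "1"` truthiness check is an assumption for illustration).

```python
# Sketch: dispatch to the fused kernel only when the test env var is set.
import os

def use_fused_grouped_topk() -> bool:
    return os.environ.get("VLLM_USE_FUSED_MOE_GROUPED_TOPK", "0") == "1"

os.environ["VLLM_USE_FUSED_MOE_GROUPED_TOPK"] = "1"
print(use_fused_grouped_topk())  # True
```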
csrc/moe/moe_ops.h
Move to the #ifndef USE_ROCM section
csrc/moe/torch_bindings.cpp
Move to the #ifndef USE_ROCM section
yewentao256
left a comment
Nice work! Some additional thoughts
@yewentao256 I don't think it is worthwhile to review/change the implementation copied from trtllm. Usually when we copy vendor code, we tend to leave it as-is so that if we need to resync changes it is easy. From that perspective, I basically ignored that section of code.
Signed-off-by: Xin Yang <xyangx@amazon.com>
@yewentao256 I have added the description in the PR. Thanks.
FYI the CI failed, but I think it's unrelated to this PR.
yewentao256
left a comment
LGTM, thanks for the work!
mgoin
left a comment
LGTM, thanks for the iterations here. Will carefully look over CI to see if all failures are known.
@mgoin @yewentao256 Thanks a lot! Could you please also review #23123? This is to read routed_scaling_factor from the model config.
Signed-off-by: Xin Yang <xyangx@amazon.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: tc-mb <caitianchi@modelbest.cn>
Purpose
This PR adds a fused grouped_topk kernel for MoE.

- `grouped_topk_kernels.cu`: fused grouped topk kernel
- Added a `renormalize` arg, instead of always renormalizing
- Added a `routed_scaling_factor` parameter to both `grouped_topk` and `fused_grouped_topk`, which is needed for the grouped topk compute

Test Plan
Added unit tests.
Test Result
Unit tests passed.
Accuracy Testing
Baseline:
This PR:
Benchmarking
Serve on a p5e.48xlarge (8 H200) instance:
Benchmark cmd:
Results show a ~9.7% improvement in output token throughput.
Baseline:
This PR: