[CPU] Refactor CPU fused MOE #30531
Conversation
Code Review
This pull request refactors the CPU fused MOE implementation, introducing significant performance optimizations through a new tile-based scheduling approach and enabling torch.compile. The changes are extensive, covering C++ kernels, Python layers, build configurations, and CI. I've found a critical correctness issue in the C++ kernel's weighted sum logic that occurs when topk_num is 1, leading to incorrect outputs. I have provided a detailed comment with a suggested fix for this issue. The rest of the changes look solid and well-implemented.
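To illustrate the class of bug described above (this is not the actual C++ kernel), here is a minimal Python sketch of a top-k weighted sum where a `topk_num == 1` shortcut silently drops the routing weight; the function names and shapes are hypothetical:

```python
def combine_buggy(expert_outs: list[list[float]],
                  topk_weights: list[float]) -> list[float]:
    # Hypothetical shortcut: skip the weighted sum when topk_num == 1.
    # This silently assumes the single routing weight is 1.0, which is
    # wrong whenever the top-k weights are not renormalized to sum to 1.
    if len(expert_outs) == 1:
        return list(expert_outs[0])  # topk_weights[0] is dropped!
    return [sum(w * out[i] for w, out in zip(topk_weights, expert_outs))
            for i in range(len(expert_outs[0]))]


def combine_fixed(expert_outs: list[list[float]],
                  topk_weights: list[float]) -> list[float]:
    # Always apply the routing weight, even for topk_num == 1.
    return [sum(w * out[i] for w, out in zip(topk_weights, expert_outs))
            for i in range(len(expert_outs[0]))]
```

With a single expert output `[1.0, 2.0]` and routing weight `0.5`, the shortcut returns `[1.0, 2.0]` while the correct result is `[0.5, 1.0]`.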
💡 Codex Review
Here are some automated review suggestions for this pull request.
@@ -1,7 +1,7 @@
 cmake>=3.26.1
 ninja
 packaging>=24.2
-setuptools>=77.0.3,<81.0.0
+setuptools==77.0.3  # this version can reuse CMake build dir
This change seems unrelated to the PR description?
It is a small change. I'd like to include it here as well 😂
@@ -50,6 +50,7 @@ function cpu_tests() {
 docker exec cpu-test-"$NUMA_NODE" bash -c "
   set -e
   pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
+  pytest -x -v -s tests/kernels/moe/test_cpu_fused_moe.py
Given you enabled this through ISA::Vec, which is CPU agnostic, would it be a good idea to enable this for all run-cpu-tests, i.e.:
- .buildkite/scripts/hardware_ci/run-cpu-test-arm.sh
- .buildkite/scripts/hardware_ci/run-cpu-test-ppc64le.sh
- .buildkite/scripts/hardware_ci/run-cpu-test-s390x.sh
Not exactly. This kernel is only available on AVX512 because of some new vec ops. Other platforms need a bit of further change and verification.
@@ -296,6 +307,19 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
       "pack_factor, str isa_hint) -> ()");
   ops.impl("cpu_gemm_wna16", torch::kCPU, &cpu_gemm_wna16);
 #endif
+
+  // fused moe
+#if defined(__AVX512F__)
If isa::vec is the intended cross-arch abstraction (which is currently the case in cpu attention), it’d be nice if this wasn’t hard-wired to AVX512. If we only want to expose this on x86 for now, could we push that policy into Python based on CpuArchEnum instead?
     self,
     layer: torch.nn.Module,
 ) -> tuple[bool, str]:
+    if not hasattr(torch.ops._C, "prepack_moe_weight"):
I think it's better to enable/disable CPU grouped gemm based on CpuArchEnum, rather than the existence of prepack_moe_weight which is currently determined by an AVX512 ifdef.
IMO, prepack_moe_weight should exist for all architectures given it relies on isa::vec.
I'm under the impression that isa::vec is meant to be architecture agnostic (since this is the case in attention).
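The suggestion above could look roughly like this Python sketch; `CpuArchEnum` here is a stand-in for vLLM's actual enum, and the policy function name is hypothetical:

```python
from enum import Enum


class CpuArchEnum(Enum):
    # Stand-in for vllm.platforms.CpuArchEnum; member names are assumptions.
    X86 = "x86"
    ARM = "arm"
    POWERPC = "powerpc"
    S390X = "s390x"


def use_cpu_grouped_gemm(arch: CpuArchEnum) -> tuple[bool, str]:
    # Gate the fused-MoE grouped GEMM on an explicit architecture policy
    # in Python, instead of probing for an op whose existence is decided
    # by an AVX512 ifdef in the C++ build.
    if arch is not CpuArchEnum.X86:
        return False, f"CPU fused MoE not yet verified on {arch.value}"
    return True, ""
```

This keeps the op registered for every architecture while the "x86 only for now" decision lives in one visible place.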
@@ -352,6 +352,10 @@ struct FP32Vec16 : public Vec<FP32Vec16> {
   explicit FP32Vec16(bool, void* ptr)
       : reg((__m512)_mm512_stream_load_si512(ptr)) {}
+
+  // strided load
+  explicit FP32Vec16(const float* ptr, INT32Vec16 idx)
in my opinion, we should aim to keep the vectorizer APIs consistent across CPUs supported in vLLM - similar to what we do in PyTorch vectorizer classes.
I don't think vLLM vectorizers are currently consistent for different CPUs, but we should aim not to make them diverge further.
I.e. if we are to introduce a new API to any vectorizer, IMO we should make sure that it's supported by the other platforms (even if that support is through a slow/scalar impl), otherwise we'll end with a lot of ifdefs and unmaintainable code in implementations using vLLM's vec abstractions.
For this particular case, one should be able to use explicit FP32Vec16(const float* ptr, INT32Vec16 idx) on any CPU vLLM supports. I don't think this is the case, but please correct me if I'm wrong.
Please let me know if there's a good reason as to why this (generally speaking) is not currently the case and/or shouldn't be the case in the future.
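As a concrete reading of "support through a slow/scalar impl", the lane semantics of the gather constructor could be modeled like this (a hypothetical Python reference, not vLLM code):

```python
def gather_f32x16(buf: list[float], idx: list[int]) -> list[float]:
    # Scalar reference semantics for
    #   explicit FP32Vec16(const float* ptr, INT32Vec16 idx):
    # lane j of the resulting vector is ptr[idx[j]].
    # A platform without a hardware gather instruction could implement
    # the constructor with exactly this per-lane loop.
    assert len(idx) == 16, "FP32Vec16 has 16 lanes"
    return [buf[j] for j in idx]
```

Any backend that matches this reference, however slowly, keeps kernels written against the vec abstraction portable.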
For these fundamental operations, I think throwing an error explicitly is more helpful for finding missing features.
I guess this depends on what we want the vectorizer classes in vLLM to be.
In PyTorch, the vectorizer class is an architecture-agnostic interface for a bunch of ops with architecture-specific implementations (e.g. Arm uses SVE, x86 uses AVX, etc.). The PyTorch vectorizer does not allow you to define/expose a new op without providing a reference impl (usually a scalar impl) available for all vectorizer implementations.
In my opinion the PyTorch vectorizer approach is a lot more elegant than what we currently have in vLLM.
We can ignore this comment for the purposes of this PR, but should discuss/agree later on what we want vLLM vectorizers to be :)
Yes, in fact the vec_op needs a refactor and cleanup, as many operations are no longer used.
@@ -365,6 +365,7 @@ if (AVX512_FOUND AND NOT AVX512_DISABLED)
   set(VLLM_EXT_SRC
     "csrc/cpu/shm.cpp"
     "csrc/cpu/cpu_wna16.cpp"
+    "csrc/cpu/cpu_fused_moe.cpp"
See my comments above about isa::vec being platform agnostic.
@@ -0,0 +1,172 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
Let's try to unify this with the CPU MoE test in https://github.com/vllm-project/vllm/blob/main/tests/kernels/moe/test_moe.py#L1062
The numeric difference from dtype conversion is too large in torch_experts.
sorry, I don't fully get you. I'm just proposing to have both CPU MoE tests in the same file.
Oh, I thought you suggested using torch_experts as the reference impl like the test cases in test_moe.py.
I don't think they need to be put together, as there is nothing to reuse.
I'm just trying to say that it's a good idea to have 1 test file to test all CPU FusedMoE impls
@@ -15,6 +15,16 @@
 namespace cpu_utils {
 enum class ISA { AMX, VEC };
+
+inline ISA get_isa(const std::string& isa) {
We have code that does the same thing in attention:
- https://github.com/vllm-project/vllm/blob/main/csrc/cpu/cpu_attn.cpp#L82
- https://github.com/vllm-project/vllm/blob/main/csrc/cpu/cpu_attn_impl.hpp#L17
Given that you're (re)introducing this get_isa function and ISA enum at this high level in cpu/utils.hpp, can we try to unify it with what we already have in attention?
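For reference, the shared helper under discussion maps an isa_hint string to the enum; a Python model of that behavior (the C++ signature appears in the diff above, but the mapping details here are assumptions) might be:

```python
from enum import Enum


class ISA(Enum):
    # Mirrors cpu_utils::ISA in the diff above; string values are assumptions.
    AMX = "amx"
    VEC = "vec"


def get_isa(isa_hint: str) -> ISA:
    # Map the isa_hint string passed from the Python layer to the enum,
    # failing loudly on unknown values so missing ISAs are easy to spot.
    try:
        return ISA(isa_hint.lower())
    except ValueError:
        raise ValueError(f"Unsupported ISA hint: {isa_hint!r}") from None
```

Keeping one such helper (rather than one per kernel family) is exactly the unification the comment asks for.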
Thanks for working on this again - it's a great step forward for CPU MoE!
fadara01 left a comment
btw, should we trim the SGLang CPU MoE kernel path? Is there any reason as to why it needs to co-exist with the new fused MoE impl? https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py#L247
Yes, it will eventually be deprecated. But some workloads still depend on it, so we should leave some time for the transition.
This PR is causing the Apple Silicon test to fail on main: https://github.com/vllm-project/vllm/actions/runs/20328342556/job/58398002846
Hi @DarkLight1337, I've noticed and am fixing it.
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Purpose
Refactor CPU fused MoE by optimizing the tile schedule and enabling torch.compile.
Part of #29580
main:
this:
Test Plan
CI tests
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.