
[CPU] Refactor CPU fused MOE #30531

Merged
jikunshang merged 5 commits into vllm-project:main from bigPYJ1151:grouped_gemm
Dec 18, 2025

Conversation

Member

@bigPYJ1151 bigPYJ1151 commented Dec 12, 2025

Purpose

Refactor the CPU fused MoE by optimizing the tile schedule and enabling torch.compile.

Part of #29580

main:

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  274.87    
Total input tokens:                      16384     
Total generated tokens:                  16384     
Request throughput (req/s):              0.06      
Output token throughput (tok/s):         59.61     
Peak output token throughput (tok/s):    88.00     
Peak concurrent requests:                15.00     
Total token throughput (tok/s):          119.21    
---------------Time to First Token----------------
Mean TTFT (ms):                          3477.60   
Median TTFT (ms):                        3591.83   
P99 TTFT (ms):                           4792.09   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          130.91    
Median TPOT (ms):                        130.83    
P99 TPOT (ms):                           133.00    
---------------Inter-token Latency----------------
Mean ITL (ms):                           130.91    
Median ITL (ms):                         130.06    
P99 ITL (ms):                            145.10    
==================================================

this:

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  156.29    
Total input tokens:                      16384     
Total generated tokens:                  16384     
Request throughput (req/s):              0.10      
Output token throughput (tok/s):         104.83    
Peak output token throughput (tok/s):    144.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          209.67    
---------------Time to First Token----------------
Mean TTFT (ms):                          1381.60   
Median TTFT (ms):                        1432.94   
P99 TTFT (ms):                           1965.64   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          75.00     
Median TPOT (ms):                        75.35     
P99 TPOT (ms):                           76.97     
---------------Inter-token Latency----------------
Mean ITL (ms):                           75.00     
Median ITL (ms):                         74.75     
P99 ITL (ms):                            85.00     
==================================================

Test Plan

CI tests

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the CPU fused MOE implementation, introducing significant performance optimizations through a new tile-based scheduling approach and enabling torch.compile. The changes are extensive, covering C++ kernels, Python layers, build configurations, and CI. I've found a critical correctness issue in the C++ kernel's weighted sum logic that occurs when topk_num is 1, leading to incorrect outputs. I have provided a detailed comment with a suggested fix for this issue. The rest of the changes look solid and well-implemented.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@bigPYJ1151 bigPYJ1151 force-pushed the grouped_gemm branch 4 times, most recently from 3064dd5 to 517e558, December 12, 2025 05:29
@bigPYJ1151 bigPYJ1151 added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 12, 2025
@@ -1,7 +1,7 @@
cmake>=3.26.1
ninja
packaging>=24.2
setuptools>=77.0.3,<81.0.0
setuptools==77.0.3 # this version can reuse CMake build dir
Contributor

The changes are unrelated to the description?

Member Author

It is a small change, so I'd like to include it in this one 😂

@fadara01 fadara01 left a comment

Great work! Thank you :)

I added some initial comments; sorry for the nits.

@@ -50,6 +50,7 @@ function cpu_tests() {
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
pytest -x -v -s tests/kernels/moe/test_cpu_fused_moe.py
@fadara01 fadara01 Dec 12, 2025

Given you enabled this through ISA::Vec, which is CPU-agnostic, would it be a good idea to enable this for all run-cpu-tests, i.e.:

  • .buildkite/scripts/hardware_ci/run-cpu-test-arm.sh
  • .buildkite/scripts/hardware_ci/run-cpu-test-ppc64le.sh
  • .buildkite/scripts/hardware_ci/run-cpu-test-s390x.sh

Member Author

Not exactly. This kernel is only available on AVX512 because of some new vec_op usage. Other platforms need a bit of further change and verification.

@@ -296,6 +307,19 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
"pack_factor, str isa_hint) -> ()");
ops.impl("cpu_gemm_wna16", torch::kCPU, &cpu_gemm_wna16);
#endif

// fused moe
#if defined(__AVX512F__)
@fadara01 fadara01 Dec 12, 2025

If isa::vec is the intended cross-arch abstraction (which is currently the case in cpu attention), it’d be nice if this wasn’t hard-wired to AVX512. If we only want to expose this on x86 for now, could we push that policy into Python based on CpuArchEnum instead?

self,
layer: torch.nn.Module,
) -> tuple[bool, str]:
if not hasattr(torch.ops._C, "prepack_moe_weight"):
@fadara01 fadara01 Dec 14, 2025

I think it's better to enable/disable CPU grouped gemm based on CpuArchEnum, rather than the existence of prepack_moe_weight which is currently determined by an AVX512 ifdef.

IMO, prepack_moe_weight should exist for all architectures given it relies on isa::vec.

I'm under the impression that isa::vec is meant to be Architecture agnostic (since this is the case in attention)

@mgoin mgoin self-assigned this Dec 14, 2025
@@ -352,6 +352,10 @@ struct FP32Vec16 : public Vec<FP32Vec16> {
explicit FP32Vec16(bool, void* ptr)
: reg((__m512)_mm512_stream_load_si512(ptr)) {}

// strided load
explicit FP32Vec16(const float* ptr, INT32Vec16 idx)
Contributor

In my opinion, we should aim to keep the vectorizer APIs consistent across the CPUs supported in vLLM, similar to what we do in PyTorch's vectorizer classes.

I don't think vLLM's vectorizers are currently consistent across CPUs, but we should aim not to let them diverge further.

I.e. if we introduce a new API to any vectorizer, IMO we should make sure it's supported by the other platforms (even if that support is through a slow/scalar impl); otherwise we'll end up with a lot of ifdefs and unmaintainable code in implementations using vLLM's vec abstractions.

For this particular case, one should be able to use explicit FP32Vec16(const float* ptr, INT32Vec16 idx) on any CPU vLLM supports. I don't think this is the case, but please correct me if I'm wrong.

Please let me know if there's a good reason as to why this (generally speaking) is not currently the case and/or shouldn't be the case in the future.

Member Author

For these fundamental operations, I think throwing an error explicitly is more helpful for finding missing features.

@fadara01 fadara01 Dec 15, 2025

I guess this depends on what we want the vectorizer classes in vLLM to be.
In PyTorch, the vectorizer class is an architecture-agnostic interface for a bunch of ops with architecture-specific implementations (e.g. Arm uses SVE, x86 uses AVX, etc.). The PyTorch vectorizer does not allow you to define/expose a new op without providing a reference impl (usually a scalar impl) available for all vectorizer implementations.

In my opinion the PyTorch vectorizer approach is a lot more elegant than what we currently have in vLLM.

We can ignore this comment for the purposes of this PR, but should discuss/agree later on what we want vLLM vectorizers to be :)

Member Author

Yes, in fact the vec_op code needs a refactor and cleanup, as many operations are no longer used.

@@ -365,6 +365,7 @@ if (AVX512_FOUND AND NOT AVX512_DISABLED)
set(VLLM_EXT_SRC
"csrc/cpu/shm.cpp"
"csrc/cpu/cpu_wna16.cpp"
"csrc/cpu/cpu_fused_moe.cpp"
@fadara01 fadara01 Dec 14, 2025

See my comments above about isa::vec being platform agnostic.

@@ -0,0 +1,172 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
@fadara01 fadara01 Dec 14, 2025

Member Author

The numeric difference from dtype conversion is too large in torch_experts.

Contributor

Sorry, I don't fully follow. I'm just proposing to have both CPU MoE tests in the same file.

Member Author

Oh, I thought you were suggesting to use torch_experts as the reference impl, like the test cases in test_moe.py.

I don't think they need to be put together, as there is nothing to reuse.

@fadara01 fadara01 Dec 15, 2025

I'm just trying to say that it's a good idea to have one test file that tests all CPU FusedMoE impls.

@@ -15,6 +15,16 @@
namespace cpu_utils {
enum class ISA { AMX, VEC };

inline ISA get_isa(const std::string& isa) {
@fadara01 fadara01 Dec 15, 2025

We have code that does the same thing in attention.

Given that you're (re)introducing this get_isa function and ISA enum at this high level in cpu/utils.hpp, can we try to unify it with what we already have in attention?

Contributor

fadara01 commented Dec 15, 2025

Thanks for working on this again - it's a great step forward for CPU MoE!
I know I left quite a few comments. If it helps, I'm happy to collaborate directly on this (e.g. push commits to your branch or pair on some of the refactors) to get it over the line. Just let me know what you'd prefer and we can iterate from there.

@bigPYJ1151 bigPYJ1151 removed the ready ONLY add when PR is ready to merge/full CI is needed label Dec 15, 2025
Signed-off-by: jiang1.li <jiang1.li@intel.com>
@fadara01 fadara01 left a comment

btw, should we trim the SGLang CPU MoE kernel path? Is there any reason as to why it needs to co-exist with the new fused MoE impl? https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py#L247

@bigPYJ1151
Member Author

btw, should we trim the SGLang CPU MoE kernel path? Is there any reason as to why it needs to co-exist with the new fused MoE impl? https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py#L247

Yes, it will eventually be deprecated, but some workloads still depend on it, so we should leave some time for the transition.

Signed-off-by: jiang1.li <jiang1.li@intel.com>
@bigPYJ1151 bigPYJ1151 added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 18, 2025
@jikunshang jikunshang merged commit e3ab93c into vllm-project:main Dec 18, 2025
59 checks passed
@DarkLight1337
Member

This PR is causing Apple Silicon test to fail on main: https://github.com/vllm-project/vllm/actions/runs/20328342556/job/58398002846

@bigPYJ1151
Member Author

This PR is causing Apple Silicon test to fail on main: https://github.com/vllm-project/vllm/actions/runs/20328342556/job/58398002846

Hi @DarkLight1337, I've noticed it and am working on a fix.

@bigPYJ1151 bigPYJ1151 mentioned this pull request Dec 18, 2025
5 tasks
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: jiang1.li <jiang1.li@intel.com>

Labels

ci/build · cpu (Related to CPU backends) · ready (ONLY add when PR is ready to merge/full CI is needed)
