
[CPU] Refactor CPU fused MOE #30531

Merged
jikunshang merged 5 commits into vllm-project:main from bigPYJ1151:grouped_gemm
Dec 18, 2025

Conversation

Member

@bigPYJ1151 bigPYJ1151 commented Dec 12, 2025

Purpose

Refactor the CPU fused MoE by optimizing the tile schedule and enabling torch.compile.

Part of #29580

main:

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  274.87    
Total input tokens:                      16384     
Total generated tokens:                  16384     
Request throughput (req/s):              0.06      
Output token throughput (tok/s):         59.61     
Peak output token throughput (tok/s):    88.00     
Peak concurrent requests:                15.00     
Total token throughput (tok/s):          119.21    
---------------Time to First Token----------------
Mean TTFT (ms):                          3477.60   
Median TTFT (ms):                        3591.83   
P99 TTFT (ms):                           4792.09   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          130.91    
Median TPOT (ms):                        130.83    
P99 TPOT (ms):                           133.00    
---------------Inter-token Latency----------------
Mean ITL (ms):                           130.91    
Median ITL (ms):                         130.06    
P99 ITL (ms):                            145.10    
==================================================

this:

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  156.29    
Total input tokens:                      16384     
Total generated tokens:                  16384     
Request throughput (req/s):              0.10      
Output token throughput (tok/s):         104.83    
Peak output token throughput (tok/s):    144.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          209.67    
---------------Time to First Token----------------
Mean TTFT (ms):                          1381.60   
Median TTFT (ms):                        1432.94   
P99 TTFT (ms):                           1965.64   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          75.00     
Median TPOT (ms):                        75.35     
P99 TPOT (ms):                           76.97     
---------------Inter-token Latency----------------
Mean ITL (ms):                           75.00     
Median ITL (ms):                         74.75     
P99 ITL (ms):                            85.00     
==================================================

Test Plan

CI tests

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the CPU fused MOE implementation, introducing significant performance optimizations through a new tile-based scheduling approach and enabling torch.compile. The changes are extensive, covering C++ kernels, Python layers, build configurations, and CI. I've found a critical correctness issue in the C++ kernel's weighted sum logic that occurs when topk_num is 1, leading to incorrect outputs. I have provided a detailed comment with a suggested fix for this issue. The rest of the changes look solid and well-implemented.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@bigPYJ1151 bigPYJ1151 force-pushed the grouped_gemm branch 4 times, most recently from 3064dd5 to 517e558, December 12, 2025 05:29
@bigPYJ1151 bigPYJ1151 added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 12, 2025
@@ -1,7 +1,7 @@
cmake>=3.26.1
ninja
packaging>=24.2
setuptools>=77.0.3,<81.0.0
setuptools==77.0.3 # this version can reuse CMake build dir
Contributor

The changes are unrelated to the description?

Member Author

It is a small change, so I'd like to include it in this one 😂

@fadara01 fadara01 left a comment

Great work! Thank you :)

I added some initial comments; sorry for the nits.

@@ -50,6 +50,7 @@ function cpu_tests() {
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
pytest -x -v -s tests/kernels/moe/test_cpu_fused_moe.py
@fadara01 fadara01 Dec 12, 2025

Given you enabled this through ISA::Vec, which is CPU-agnostic, would it be a good idea to enable this for all run-cpu-tests, i.e.:

  • .buildkite/scripts/hardware_ci/run-cpu-test-arm.sh
  • .buildkite/scripts/hardware_ci/run-cpu-test-ppc64le.sh
  • .buildkite/scripts/hardware_ci/run-cpu-test-s390x.sh

Member Author

Not exactly. This kernel is only available on AVX512 because of some new vec_op usage. Other platforms need a bit of further change and verification.

@@ -296,6 +307,19 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
"pack_factor, str isa_hint) -> ()");
ops.impl("cpu_gemm_wna16", torch::kCPU, &cpu_gemm_wna16);
#endif

// fused moe
#if defined(__AVX512F__)
@fadara01 fadara01 Dec 12, 2025

If isa::vec is the intended cross-arch abstraction (which is currently the case in cpu attention), it’d be nice if this wasn’t hard-wired to AVX512. If we only want to expose this on x86 for now, could we push that policy into Python based on CpuArchEnum instead?

self,
layer: torch.nn.Module,
) -> tuple[bool, str]:
if not hasattr(torch.ops._C, "prepack_moe_weight"):
@fadara01 fadara01 Dec 14, 2025

I think it's better to enable/disable CPU grouped gemm based on CpuArchEnum, rather than the existence of prepack_moe_weight which is currently determined by an AVX512 ifdef.

IMO, prepack_moe_weight should exist for all architectures given it relies on isa::vec.

I'm under the impression that isa::vec is meant to be Architecture agnostic (since this is the case in attention)

@mgoin mgoin self-assigned this Dec 14, 2025
@@ -352,6 +352,10 @@ struct FP32Vec16 : public Vec<FP32Vec16> {
explicit FP32Vec16(bool, void* ptr)
: reg((__m512)_mm512_stream_load_si512(ptr)) {}

// strided load
explicit FP32Vec16(const float* ptr, INT32Vec16 idx)
Contributor

In my opinion, we should aim to keep the vectorizer APIs consistent across the CPUs supported in vLLM, similar to what we do in PyTorch's vectorizer classes.

I don't think vLLM's vectorizers are currently consistent across CPUs, but we should aim not to let them diverge further.

I.e. if we introduce a new API to any vectorizer, IMO we should make sure it's supported by the other platforms (even if that support is through a slow/scalar impl); otherwise we'll end up with a lot of ifdefs and unmaintainable code in implementations using vLLM's vec abstractions.

For this particular case, one should be able to use explicit FP32Vec16(const float* ptr, INT32Vec16 idx) on any CPU vLLM supports. I don't think this is the case, but please correct me if I'm wrong.

Please let me know if there's a good reason as to why this (generally speaking) is not currently the case and/or shouldn't be the case in the future.

Member Author

For these fundamental operations, I think throwing an error explicitly is more helpful for finding missing features.

@fadara01 fadara01 Dec 15, 2025

I guess this depends on what we want the vectorizer classes in vLLM to be.
In PyTorch, the vectorizer class is an architecture-agnostic interface for a bunch of ops with architecture-specific implementations (e.g. Arm uses SVE, x86 uses AVX, etc.). The PyTorch vectorizer does not allow you to define/expose a new op without providing a reference impl (usually a scalar impl) available for all vectorizer implementations.

In my opinion the PyTorch vectorizer approach is a lot more elegant than what we currently have in vLLM.

We can ignore this comment for the purposes of this PR, but should discuss/agree later on what we want vLLM vectorizers to be :)

Member Author

Yes, in fact the vec_op code needs a refactor and cleanup, as many operations are no longer used.

@@ -365,6 +365,7 @@ if (AVX512_FOUND AND NOT AVX512_DISABLED)
set(VLLM_EXT_SRC
"csrc/cpu/shm.cpp"
"csrc/cpu/cpu_wna16.cpp"
"csrc/cpu/cpu_fused_moe.cpp"
@fadara01 fadara01 Dec 14, 2025

See my comments above about isa::vec being platform agnostic.

@@ -0,0 +1,172 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
@fadara01 fadara01 Dec 14, 2025

Member Author

The numeric difference from dtype conversion is too large in torch_experts.

Contributor

Sorry, I don't fully follow. I'm just proposing to have both CPU MoE tests in the same file.

Member Author

Oh, I thought you were suggesting to use torch_experts as the reference impl, like the test cases in test_moe.py.

I don't think they need to be put together, as there is nothing to reuse.

@fadara01 fadara01 Dec 15, 2025

I'm just trying to say that it's a good idea to have one test file that tests all CPU FusedMoE impls.

@@ -15,6 +15,16 @@
namespace cpu_utils {
enum class ISA { AMX, VEC };

inline ISA get_isa(const std::string& isa) {
@fadara01 fadara01 Dec 15, 2025

We have code that does the same thing in attention.

Given that you're (re)introducing this get_isa function and ISA enum at this high level in cpu/utils.hpp, can we try to unify it with what we already have in attention?

Contributor

fadara01 commented Dec 15, 2025

Thanks for working on this again - it's a great step forward for CPU MoE!
I know I left quite a few comments. If it helps, I'm happy to collaborate directly on this (e.g. push commits to your branch or pair on some of the refactors) to get it over the line. Just let me know what you'd prefer and we can iterate from there.

@bigPYJ1151 bigPYJ1151 removed the ready ONLY add when PR is ready to merge/full CI is needed label Dec 15, 2025
Signed-off-by: jiang1.li <jiang1.li@intel.com>
@fadara01 fadara01 left a comment

btw, should we trim the SGLang CPU MoE kernel path? Is there any reason as to why it needs to co-exist with the new fused MoE impl? https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py#L247

@bigPYJ1151
Member Author

btw, should we trim the SGLang CPU MoE kernel path? Is there any reason as to why it needs to co-exist with the new fused MoE impl? https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py#L247

Yes, it will eventually be deprecated, but some workloads still depend on it, so we should leave some time for the transition.

Signed-off-by: jiang1.li <jiang1.li@intel.com>
@bigPYJ1151 bigPYJ1151 added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 18, 2025
@jikunshang jikunshang merged commit e3ab93c into vllm-project:main Dec 18, 2025
59 checks passed
@DarkLight1337
Member

This PR is causing Apple Silicon test to fail on main: https://github.com/vllm-project/vllm/actions/runs/20328342556/job/58398002846

@bigPYJ1151
Member Author

This PR is causing Apple Silicon test to fail on main: https://github.com/vllm-project/vllm/actions/runs/20328342556/job/58398002846

Hi @DarkLight1337, I've noticed it and am working on a fix.

@bigPYJ1151 bigPYJ1151 mentioned this pull request Dec 18, 2025
5 tasks
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: jiang1.li <jiang1.li@intel.com>

Labels

ci/build · cpu (Related to CPU backends) · ready (ONLY add when PR is ready to merge/full CI is needed)
