[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up weights #29273
Conversation
@mgoin this is the last piece needed to enable gpt-oss in BF16 on Arm CPUs, could you please have a look?
Code Review
This pull request correctly fixes a bug in the CPU implementation of swigluoai_and_mul by using an implementation that supports interleaved gate-up weights. The change replaces a local function with a shared SwigluOAIAndMul class from the activation layers, which improves code reuse. However, I've identified a performance issue where this new class is instantiated inside a loop, which should be addressed.
```diff
@@ -284,7 +273,7 @@ def __call__(
     gate_up = layer.gate_up_linear[i](tokens_for_this_expert)
     if activation == "swigluoai":
-        gate_up = swigluoai_and_mul(gate_up)
+        gate_up = SwigluOAIAndMul().forward_native(gate_up)
```
good idea, done!
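The hoisting the review asked for can be sketched as follows. This is a hedged illustration, not vLLM's exact code: `SwigluOAIAndMulSketch`, its `alpha`/`limit` defaults, and the loop variables are assumptions made for the example.

```python
import torch

# Illustrative stand-in for vLLM's SwigluOAIAndMul; the alpha/limit
# defaults and the method body are assumptions, not the library's code.
class SwigluOAIAndMulSketch:
    def __init__(self, alpha: float = 1.702, limit: float = 7.0):
        self.alpha, self.limit = alpha, limit

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Interleaved layout: even columns are gate, odd columns are up.
        gate, up = x[..., ::2], x[..., 1::2]
        gate = gate.clamp(max=self.limit)
        up = up.clamp(min=-self.limit, max=self.limit)
        return (up + 1) * (gate * torch.sigmoid(gate * self.alpha))

act = SwigluOAIAndMulSketch()  # constructed once, outside the expert loop
per_expert_gate_up = [torch.randn(4, 8) for _ in range(3)]
outputs = [act.forward_native(t) for t in per_expert_gate_up]
assert all(o.shape == (4, 4) for o in outputs)  # 2N features -> N
```

Constructing the activation once avoids repeating the (cheap but needless) object allocation on every expert iteration.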
💡 Codex Review
Here are some automated review suggestions for this pull request.
SwigluOAI activation pairs gate/up values incorrectly on CPU
On the CPU fallback path, SwigluOAIAndMul().forward_native is now fed the gate/up output directly from gate_up_linear, but forward_native expects the gate and up features to be interleaved element‑wise (uses x[..., ::2]/x[..., 1::2]). The CPU weights are loaded as two contiguous halves (gate followed by up) with no interleaving (vllm/model_executor/layers/fused_moe/layer.py lines 984‑1010), and in the CPUFusedMOE path those weights are used without any reordering (unquantized_fused_moe_method.py lines 238‑267). With the new call the gate tensor mixes gate and up values (e.g., gate0 with gate1 instead of gate0 with up0), producing incorrect activations whenever swigluoai is used on CPU when the SGL kernel is unavailable. This regression will distort expert outputs for models such as GPT‑OSS running on CPU.
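The pairing mismatch described here can be shown with a toy, pure-Python sketch (no vLLM code involved; the string labels are illustrative). It demonstrates the mechanism only: if the last dimension holds contiguous `[gate | up]` halves, an interleaved-style split mixes the two.

```python
# Pretend the last dimension holds two contiguous halves: [gate | up].
contiguous = ["g0", "g1", "g2", "g3", "u0", "u1", "u2", "u3"]

# An interleaved-expecting split (x[..., ::2] / x[..., 1::2]) then yields:
gate = contiguous[::2]   # ['g0', 'g2', 'u0', 'u2'] -- gate and up mixed
up = contiguous[1::2]    # ['g1', 'g3', 'u1', 'u3']
assert gate[0] == "g0" and up[0] == "g1"  # g0 paired with g1, not u0

# The correct split for a contiguous-halves layout:
half = len(contiguous) // 2
gate_ok, up_ok = contiguous[:half], contiguous[half:]
assert gate_ok[0] == "g0" and up_ok[0] == "u0"
```

The same mismatch applies in reverse: a halves-style `chunk` applied to an interleaved tensor mispairs elements just as badly, which is the bug this PR fixes on CPU.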
> The CPU weights are loaded as two contiguous halves (gate followed by up) with no interleaving (vllm/model_executor/layers/fused_moe/layer.py

I wish ... if that was the case we wouldn't be here.
…ights

Current impl of `swigluoai_and_mul` for CPU assumes that gate-up weights have been de-interleaved at load time, which is not the case. The new impl we dispatch to is the same one used for the BF16 path on GPU and handles interleaved gate-up.

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
da1644d to e343b14
CI is green!
BF16 loading/de-interleaving of gpt-oss's gate-up needs to be addressed in general for both the GPU and CPU paths. Any future PR addressing this will consequently have to update all SwigluOAI impls introduced in #22951, including SwigluOAI.forward_native, since it's used as the reference impl for GPUs. Since this PR uses the same SwigluOAI impl as the GPU BF16 path, any changes to loading in that path will work (correctly) out of the box on the CPU backend.

cc @mgoin
[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up weights
Purpose
Current impl of `swigluoai_and_mul` for CPU assumes that gate-up weights have been de-interleaved at load time, which is not the case, and as a result gpt-oss generates garbage. The new impl we dispatch to in this PR is the same one used for the BF16 path on GPU and handles interleaved gate-up.
See comments on #27024 for full context.
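The difference between the two conventions can be sketched as follows. This is a hedged illustration: neither function is vLLM's exact code, and the `ALPHA`/`LIMIT` constants are assumptions made for the example.

```python
import torch

ALPHA, LIMIT = 1.702, 7.0  # assumed constants, for illustration only

def swiglu_oai_halves(x: torch.Tensor) -> torch.Tensor:
    # Old CPU assumption: gate and up live in two contiguous halves.
    gate, up = x.chunk(2, dim=-1)
    gate = gate.clamp(max=LIMIT)
    up = up.clamp(min=-LIMIT, max=LIMIT)
    return (up + 1) * (gate * torch.sigmoid(gate * ALPHA))

def swiglu_oai_interleaved(x: torch.Tensor) -> torch.Tensor:
    # Convention this PR dispatches to: gate/up interleaved element-wise.
    gate, up = x[..., ::2], x[..., 1::2]
    gate = gate.clamp(max=LIMIT)
    up = up.clamp(min=-LIMIT, max=LIMIT)
    return (up + 1) * (gate * torch.sigmoid(gate * ALPHA))

x = torch.arange(16, dtype=torch.float32).reshape(2, 8)
# The two conventions pair different elements, so the results diverge.
assert not torch.allclose(swiglu_oai_halves(x), swiglu_oai_interleaved(x))
```

Applying the halves variant to interleaved weights (or vice versa) pairs the wrong gate/up elements, which is exactly why the CPU path produced garbage before this fix.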
Test Plan
Ran gpt-oss-20b end to end on a few prompts.
Test Result
Generations with this fix are decent and very similar to what we get with HF transformers.
Without this fix, generations are garbage.