Support sycl impl relu2_no_mul for NVIDIA-Nemotron-3-Nano-30B-A3B-bf16 (#232)
Conversation
Signed-off-by: Qiao, Zhefeng <zhefeng.qiao@intel.com>
Pull request overview
Adds a SYCL/XPU implementation of the relu2_no_mul activation and wires it into the fused MoE path to support NVIDIA-Nemotron-3-Nano-30B-A3B-bf16 more efficiently.
Changes:
- Register a new XPU custom op `relu2_no_mul` and implement its SYCL kernel.
- Extend `xpu_fused_moe` to route `activation="relu2_no_mul"` and adjust GEMM2's `K` accordingly.
- Add unit-test coverage for the standalone activation op.
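For orientation, the new op follows the out-of-place calling convention of the other activation bindings here; a minimal usage sketch (the tensor shape is illustrative):

```python
import torch

x = torch.randn(8, 1024, dtype=torch.bfloat16, device="xpu")
out = torch.empty_like(x)          # relu2_no_mul keeps the input width
torch.ops._C.relu2_no_mul(out, x)  # binding registered by this PR
```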
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm_xpu_kernels/fused_moe_interface.py | Adds relu2_no_mul activation handling and scales GEMM2 K by 2 for this activation. |
| csrc/activation.cpp | Implements relu2_no_mul SYCL elementwise kernel and dispatch. |
| csrc/ops.h | Declares relu2_no_mul C++ entrypoint. |
| csrc/torch_bindings.cpp | Registers relu2_no_mul in the Torch extension library for XPU. |
| tests/ops/activation_op.py | Adds Relu2NoMul CustomOp wrapper + PyTorch reference implementation. |
| tests/test_activation.py | Extends activation tests to include relu2_no_mul. |
| tests/register_ops.py | Adds a Python test wrapper that calls torch.ops._C.relu2_no_mul. |
Comments suppressed due to low confidence (1)
vllm_xpu_kernels/fused_moe_interface.py:252
`act_output` is allocated unconditionally with shape `(num_moe_inputs, inter_size)`, but the `relu2_no_mul` branch immediately replaces it with `torch.empty_like(gemm1_output)`. This results in an extra large allocation per call for that activation; consider allocating `act_output` inside each activation branch (or computing the needed output shape first) to avoid the wasted allocation.
```python
inter_size_scale = 1
# act
act_output = torch.empty((num_moe_inputs, inter_size),
                         dtype=gemm1_output.dtype,
                         device=gemm1_output.device)
```
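One possible restructuring along the lines Copilot suggests; a sketch only, assuming `gemm1_output` has shape `(num_moe_inputs, 2 * inter_size)` and that `relu2_no_mul` keeps the full width while the `*_and_mul` ops halve it:

```python
# Sketch: pick the activation output shape first, then allocate once.
if activation == "relu2_no_mul":
    act_shape = gemm1_output.shape            # no gating mul, full width kept
    inter_size_scale = 2
else:
    act_shape = (num_moe_inputs, inter_size)  # gated activations halve the width
    inter_size_scale = 1
act_output = torch.empty(act_shape,
                         dtype=gemm1_output.dtype,
                         device=gemm1_output.device)
```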
```python
K=inter_size * inter_size_scale,
num_experts=num_experts,
```
`K` is derived from `inter_size * inter_size_scale`, but `cutlass_grouped_gemm_interface` does not validate tensor shapes and will trust the provided `K`. For `relu2_no_mul`, this makes correctness/safety depend on `w2` actually having the matching K-dimension (`2 * inter_size` for non-int4 layouts). Please add an explicit shape assertion for `w2` (and possibly `w13`) for this activation to prevent out-of-bounds reads if a caller passes incompatible weights.
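A minimal sketch of such a guard, assuming `w2` is laid out with its K-dimension last (the actual layout in this repo may differ):

```python
# Hypothetical shape guard before launching GEMM2; the w2 layout is an assumption.
expected_k = inter_size * inter_size_scale  # 2 * inter_size for relu2_no_mul
assert w2.shape[-1] == expected_k, (
    f"w2 K-dim mismatch: got {w2.shape[-1]}, expected {expected_k} "
    f"for activation={activation!r}")
```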
| elif activation == "relu2_no_mul": | ||
| act_output = torch.empty_like(gemm1_output) | ||
| torch.ops._C.relu2_no_mul(act_output, gemm1_output) | ||
| inter_size_scale = 2 |
The new `relu2_no_mul` activation path in `xpu_fused_moe` isn't covered by the existing fused-MoE test suite (the current `tests/fused_moe/test_fused_moe.py` cases all use `activation="silu"`). Please add a fused-MoE unit test that runs `xpu_fused_moe(..., activation="relu2_no_mul")` with appropriately-shaped `w2` and compares against a reference implementation, so regressions in the K-scaling/activation behavior are caught.
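For the reference side of such a test, a minimal sketch, assuming `relu2_no_mul` computes element-wise squared ReLU (matching how vLLM defines `relu2`; the authoritative semantics are in `csrc/activation.cpp`):

```python
import torch
import torch.nn.functional as F

def relu2_no_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # Assumed semantics: squared ReLU applied element-wise, with no
    # gated multiplication (hence "no_mul") halving the width.
    return torch.square(F.relu(x))
```

A fused-MoE test would then call `xpu_fused_moe(..., activation="relu2_no_mul")` with `w2` shaped for `K = 2 * inter_size` and compare against a per-expert loop that applies this reference between the two matmuls.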
@jikunshang, please help to review.
| elif activation == "swigluoai" or ("SWIGLUOAI" in str(activation)): | ||
| torch.ops._C.swigluoai_and_mul(act_output, gemm1_output, 1.702, 7.0) | ||
| elif activation == "relu2_no_mul": | ||
| act_output = torch.empty_like(gemm1_output) |
`act_output`'s shape for `relu2_no_mul` is different from that in the `XXX_and_mul` ops, so it cannot reuse the definition at line 250.
We'd better fix it at L250?
```cpp
                   double alpha = 1.702,
                   double limit = 7.0);

void relu2_no_mul(torch::Tensor& out, torch::Tensor& input);
```
minor: the vllm repo doesn't have this CUDA kernel yet. I'd prefer to put this in `torch.ops._xpu_C`, as this is an XPU-specific kernel (though it will not be used on the vllm side yet).
Keeping it here is fine.
Force-pushed from 81203dc to bc0d9d3.
Signed-off-by: Qiao, Zhefeng <zhefeng.qiao@intel.com>
Force-pushed from bc0d9d3 to 396a18b.
Please rebase & fix conflicts.
Signed-off-by: Zhefeng, Qiao <zhefeng.qiao@intel.com>
vllm-project#232)

* [OneDNN] add mxfp8, mxfp4 onednn gemm (#20)
  * add mxfp4 onednn gemm
  * add ut for mx
  * fix
  * format with pre-commit
  * thanks copilot
* format
* refine onednn gemm ut
* skip scales check (#256)
* Support sycl impl relu2_no_mul for NVIDIA-Nemotron-3-Nano-30B-A3B-bf16 (#232)
* Update test_fp8_gemm_onednn.py

Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Signed-off-by: Qiao, Zhefeng <zhefeng.qiao@intel.com>
Co-authored-by: root <root@emr813693.jf.intel.com>
Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com>
Co-authored-by: Zhefeng, Qiao <zhefeng.qiao@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Purpose
Support the SYCL kernel `relu2_no_mul`; this is an enhancement PR following #200.
Compared to the torch implementation, it shows a ~55% improvement at the op level (~44us vs ~100us) and about a ~0.8% improvement per step.
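A hedged sketch of how the op-level number could be measured, assuming a PyTorch build with XPU support; the shapes the author benchmarked are not stated, so these are placeholders:

```python
import time
import torch

x = torch.randn(4096, 2048, dtype=torch.bfloat16, device="xpu")
out = torch.empty_like(x)
for _ in range(10):  # warm-up
    torch.ops._C.relu2_no_mul(out, x)
torch.xpu.synchronize()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    torch.ops._C.relu2_no_mul(out, x)
torch.xpu.synchronize()
print(f"{(time.perf_counter() - start) / iters * 1e6:.1f} us per call")
```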
Test Plan
```bash
python -m pytest tests/test_activation.py -v
```
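The new case presumably follows the suite's existing parametrized pattern; a self-contained sketch of the core check (shapes, tolerances, and the squared-ReLU reference are assumptions):

```python
import torch
import torch.nn.functional as F

def test_relu2_no_mul():
    torch.manual_seed(0)
    x = torch.randn(128, 2048, dtype=torch.bfloat16, device="xpu")
    out = torch.empty_like(x)
    torch.ops._C.relu2_no_mul(out, x)
    # Assumed reference: squared ReLU, computed in fp32 for accuracy.
    ref = torch.square(F.relu(x.float())).to(x.dtype)
    torch.testing.assert_close(out, ref, atol=1e-2, rtol=1e-2)
```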
Test Result
pass