[XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle #37784
bigPYJ1151 merged 2 commits into vllm-project:main
Conversation
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
@zyongye @mgoin @robertgshaw2-redhat PTAL, thanks!
Code Review
This pull request refactors the XPU MXFP4 support for Mixture-of-Experts layers to integrate it into the MoE oracle system. It removes the specialized XpuMxfp4MoEMethod and introduces a new XPUExpertsMXFp4 class that the oracle can select. While this is a good refactoring for modularity, I've identified a potential performance regression: the previous implementation used XPU-specific custom operators for routing, whereas the new implementation appears to fall back to a generic PyTorch-based router. My review includes a comment highlighting this issue and asking for clarification.
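For context, the oracle pattern described above dispatches an experts implementation based on platform and quantization configuration. The sketch below is purely illustrative (MoEConfig, select_experts_impl, and the string return values are hypothetical names, not vLLM's actual oracle API); it only shows the selection shape the refactor enables:

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # Hypothetical config fields; vLLM's real oracle inspects
    # richer platform/quantization state than this.
    platform: str       # e.g. "cuda" or "xpu"
    quant_method: str   # e.g. "mxfp4"


def select_experts_impl(config: MoEConfig) -> str:
    """Return the name of the experts class an oracle-style
    dispatcher would pick for this configuration (illustrative)."""
    if config.platform == "xpu" and config.quant_method == "mxfp4":
        # After this PR, the oracle can select the new XPU class
        # instead of routing through a specialized MoE method.
        return "XPUExpertsMXFp4"
    return "GenericExperts"
```

With this shape, adding a new backend means registering another branch (or entry) in the oracle rather than a parallel MoE method class.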
vllm/model_executor/layers/quantization/mxfp4.py (416-506)
By removing XpuMxfp4MoEMethod, the routing logic for XPU MXFP4 MoE layers is now handled by the generic Router class. The previous implementation used XPU-specific custom ops (torch.ops._moe_C.fused_grouped_topk and torch.ops._moe_C.topk_softmax) for routing, which are likely more performant on XPU hardware.
The new implementation uses a pure PyTorch-based router, which might cause a performance regression, especially for models that use grouped top-k routing. Was this change in routing implementation intentional? If the custom routing ops are still desired, the logic from XpuMxfp4MoEMethod.apply_monolithic might need to be preserved, perhaps by creating a monolithic XPU expert class (FusedMoEExpertsMonolithic) that can be selected by the oracle and can perform the specialized routing.
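To make the performance concern concrete, here is a minimal sketch of what a pure-PyTorch grouped top-k router does per token: score expert groups, keep the best groups, then take the top-k experts within them. The function name and signature are hypothetical (this is not vLLM's Router or the fused_grouped_topk kernel); custom ops fuse these steps into one kernel, which is why falling back to this composition of eager PyTorch ops can regress latency:

```python
import torch


def grouped_topk_sketch(
    scores: torch.Tensor,  # [num_tokens, num_experts] router logits
    topk: int,
    num_groups: int,
    topk_groups: int,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Illustrative pure-PyTorch grouped top-k routing (hypothetical helper,
    not vLLM's implementation). Each step below is a separate kernel launch,
    unlike a fused custom op such as torch.ops._moe_C.fused_grouped_topk."""
    num_tokens, num_experts = scores.shape
    probs = torch.softmax(scores, dim=-1)
    # Score each group by its best expert, then keep the top groups.
    group_scores = probs.view(num_tokens, num_groups, -1).max(dim=-1).values
    group_idx = torch.topk(group_scores, k=topk_groups, dim=-1).indices
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, group_idx, 1.0)
    # Mask out experts belonging to unselected groups.
    expert_mask = (
        group_mask.unsqueeze(-1)
        .expand(num_tokens, num_groups, num_experts // num_groups)
        .reshape(num_tokens, num_experts)
    )
    masked = probs.masked_fill(expert_mask == 0, float("-inf"))
    # Finally, top-k experts among the surviving groups.
    topk_weights, topk_ids = torch.topk(masked, k=topk, dim=-1)
    return topk_weights, topk_ids
```

A fused kernel performs the group scoring, masking, and final top-k in a single pass over the logits, so preserving the custom-op path (for example via a monolithic XPU expert class) may matter for grouped top-k models.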
Purpose
Follow-up of #37128: move the XPU MXFP4 support into the oracle as well.
Test Plan
Test Result