[XPU] enable fp8 online streaming quantization #30944
jikunshang merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request enables fp8 online streaming quantization for XPU. It achieves this by refactoring the XPU quantization methods for Linear and FusedMoE layers to inherit from new base classes that implement the streaming logic. A key addition is the CopyNumelCounter utility, which robustly tracks the number of loaded weight elements to trigger the online quantization process at the correct time. The changes are well-structured, improve code reuse, and correctly implement the new feature for the XPU backend.
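The `CopyNumelCounter` utility mentioned above tracks how many weight elements have been copied so quantization can fire once the tensor is fully loaded. A minimal self-contained sketch of the idea (the class name comes from the PR, but this API and implementation are assumptions, not the actual vLLM code):

```python
class CopyNumelCounter:
    """Tracks how many weight elements have been copied into a
    destination tensor, so a callback (e.g. online fp8 quantization)
    can fire exactly once the tensor is fully loaded.

    Hypothetical sketch -- the real vLLM utility may differ.
    """

    def __init__(self, total_numel, on_complete):
        self.total_numel = total_numel
        self.copied = 0
        self.on_complete = on_complete
        self.done = False

    def add(self, numel):
        self.copied += numel
        if not self.done and self.copied >= self.total_numel:
            self.done = True
            self.on_complete()


# Usage: quantize only after every shard of the weight has streamed in.
events = []
counter = CopyNumelCounter(total_numel=8,
                           on_complete=lambda: events.append("quantized"))
counter.add(4)  # first shard loaded, nothing happens yet
counter.add(4)  # tensor complete -> callback fires once
counter.add(4)  # extra copies do not re-trigger
```

This is what makes streaming quantization robust: sharded loaders copy a weight in several pieces, and the counter guarantees the quantization callback runs exactly once, after the last piece.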
💡 Codex Review
Here are some automated review suggestions for this pull request.
        prefix=prefix,
        ignored_layers=self.ignored_layers,
        fused_mapping=self.packed_modules_mapping,
    ):
        return UnquantizedLinearMethod()
Return MoE-compatible method when skipping XPU layers
On XPU, when a FusedMoE layer is listed in ignored_layers, get_xpu_quant_method now returns UnquantizedLinearMethod (lines 310–314). FusedMoE initialization asserts that its quant_method is a FusedMoEMethodBase (layer.py lines 582–592), so this new skip path raises during model construction instead of leaving the layer unquantized. The skip logic should return the unquantized MoE method to avoid the assertion failure.
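The fix the reviewer suggests is to dispatch on the layer type in the skip path. The class names `UnquantizedLinearMethod`, `UnquantizedFusedMoEMethod`, `FusedMoEMethodBase`, `FusedMoE`, and `LinearBase` exist in vLLM, but the stubs and simplified dispatch below are an illustrative assumption, not the actual patch:

```python
# Stand-in stubs for the vLLM classes involved; real signatures differ.
class QuantizeMethodBase: ...
class UnquantizedLinearMethod(QuantizeMethodBase): ...
class FusedMoEMethodBase(QuantizeMethodBase): ...
class UnquantizedFusedMoEMethod(FusedMoEMethodBase): ...

class LinearBase: ...
class FusedMoE: ...


def get_xpu_quant_method(layer, is_ignored):
    """When a layer is in ignored_layers, return the unquantized method
    matching the layer type. FusedMoE asserts its quant_method is a
    FusedMoEMethodBase, so returning UnquantizedLinearMethod for a
    FusedMoE layer would fail during model construction."""
    if is_ignored:
        if isinstance(layer, FusedMoE):
            return UnquantizedFusedMoEMethod()
        return UnquantizedLinearMethod()
    # Otherwise fall through to the fp8 streaming-quantization method
    # (not sketched here).
    raise NotImplementedError
```

The key point is that the skip path must preserve the type contract each layer kind expects of its `quant_method`.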
Force-pushed from e4a759f to 09ea853
@jikunshang please take a review.
        ignored_layers=self.ignored_layers,
        fused_mapping=self.packed_modules_mapping,
    ):
        return UnquantizedLinearMethod()
should be UnquantizedFusedMoEMethod?
@@ -1058,6 +1069,8 @@ def maybe_make_prepare_finalize(
     self,
     routing_tables: tuple[torch.Tensor, torch.Tensor, torch.Tensor] | None = None,
 ) -> mk.FusedMoEPrepareAndFinalize | None:
+    if current_platform.is_xpu():
merge into L1073 if condition?
Signed-off-by: Yan Ma <yan.ma@intel.com>
Force-pushed from 2199d73 to 436433c
jikunshang left a comment:
LGTM. Thanks for fixing!
Purpose
This PR enables fp8 online streaming quantization on the XPU path for the remaining Linear and MoE layers.
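At a high level, online streaming quantization converts each weight tensor to fp8 as it finishes loading, instead of first materializing the whole model in full precision. A simplified per-tensor sketch in pure Python (simulating the fp8-e4m3 range of ±448; the real XPU path uses torch fp8 dtypes and kernels, and this sketch omits mantissa rounding):

```python
FP8_E4M3_MAX = 448.0  # largest representable magnitude in fp8 e4m3


def quantize_fp8_per_tensor(weights):
    """Per-tensor symmetric quantization: scale so the largest
    magnitude maps onto the fp8 range, then clamp. Simplified sketch --
    real fp8 also rounds mantissas, which this omits."""
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale)) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate full-precision values from fp8 + scale."""
    return [x * scale for x in q]


# Each weight tensor is quantized immediately after its last shard
# streams in, so only fp8 values plus one scale per tensor are kept.
w = [0.5, -2.0, 1.25, 4.0]
q, s = quantize_fp8_per_tensor(w)
w_hat = dequantize(q, s)
```

The "streaming" aspect is that this happens tensor by tensor during weight loading, which keeps peak memory close to the fp8 footprint rather than the bf16/fp16 one.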
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.