[XPU] enable fp8 online streaming quantization #30944
jikunshang merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request enables fp8 online streaming quantization for XPU. It achieves this by refactoring the XPU quantization methods for Linear and FusedMoE layers to inherit from new base classes that implement the streaming logic. A key addition is the CopyNumelCounter utility, which robustly tracks the number of loaded weight elements to trigger the online quantization process at the correct time. The changes are well-structured, improve code reuse, and correctly implement the new feature for the XPU backend.
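The `CopyNumelCounter` utility mentioned above tracks how many weight elements have been copied so quantization can fire once the tensor is fully loaded. A minimal self-contained sketch of the idea (the class name comes from the PR, but this API and implementation are assumptions, not the actual vLLM code):

```python
class CopyNumelCounter:
    """Tracks how many weight elements have been copied into a
    destination tensor, so a callback (e.g. online fp8 quantization)
    can fire exactly once the tensor is fully loaded.

    Hypothetical sketch -- the real vLLM utility may differ.
    """

    def __init__(self, total_numel, on_complete):
        self.total_numel = total_numel
        self.copied = 0
        self.on_complete = on_complete
        self.done = False

    def add(self, numel):
        self.copied += numel
        if not self.done and self.copied >= self.total_numel:
            self.done = True
            self.on_complete()


# Usage: quantize only after every shard of the weight has streamed in.
events = []
counter = CopyNumelCounter(total_numel=8,
                           on_complete=lambda: events.append("quantized"))
counter.add(4)  # first shard loaded, nothing happens yet
counter.add(4)  # tensor complete -> callback fires once
counter.add(4)  # extra copies do not re-trigger
```

This is what makes streaming quantization robust: sharded loaders copy a weight in several pieces, and the counter guarantees the quantization callback runs exactly once, after the last piece.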
💡 Codex Review
Here are some automated review suggestions for this pull request.
        prefix=prefix,
        ignored_layers=self.ignored_layers,
        fused_mapping=self.packed_modules_mapping,
    ):
        return UnquantizedLinearMethod()
Return MoE-compatible method when skipping XPU layers
On XPU, when a FusedMoE layer is listed in ignored_layers, get_xpu_quant_method now returns UnquantizedLinearMethod (lines 310–314). FusedMoE initialization asserts that its quant_method is a FusedMoEMethodBase (layer.py lines 582–592), so this new skip path raises during model construction instead of leaving the layer unquantized. The skip logic should return the unquantized MoE method to avoid the assertion failure.
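The fix the reviewer suggests is to dispatch on the layer type in the skip path. The class names `UnquantizedLinearMethod`, `UnquantizedFusedMoEMethod`, `FusedMoEMethodBase`, `FusedMoE`, and `LinearBase` exist in vLLM, but the stubs and simplified dispatch below are an illustrative assumption, not the actual patch:

```python
# Stand-in stubs for the vLLM classes involved; real signatures differ.
class QuantizeMethodBase: ...
class UnquantizedLinearMethod(QuantizeMethodBase): ...
class FusedMoEMethodBase(QuantizeMethodBase): ...
class UnquantizedFusedMoEMethod(FusedMoEMethodBase): ...

class LinearBase: ...
class FusedMoE: ...


def get_xpu_quant_method(layer, is_ignored):
    """When a layer is in ignored_layers, return the unquantized method
    matching the layer type. FusedMoE asserts its quant_method is a
    FusedMoEMethodBase, so returning UnquantizedLinearMethod for a
    FusedMoE layer would fail during model construction."""
    if is_ignored:
        if isinstance(layer, FusedMoE):
            return UnquantizedFusedMoEMethod()
        return UnquantizedLinearMethod()
    # Otherwise fall through to the fp8 streaming-quantization method
    # (not sketched here).
    raise NotImplementedError
```

The key point is that the skip path must preserve the type contract each layer kind expects of its `quant_method`.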
Force-pushed from e4a759f to 09ea853
@jikunshang please take a review.
        ignored_layers=self.ignored_layers,
        fused_mapping=self.packed_modules_mapping,
    ):
        return UnquantizedLinearMethod()
should be UnquantizedFusedMoEMethod?
@@ -1058,6 +1069,8 @@ def maybe_make_prepare_finalize(
     self,
     routing_tables: tuple[torch.Tensor, torch.Tensor, torch.Tensor] | None = None,
 ) -> mk.FusedMoEPrepareAndFinalize | None:
+    if current_platform.is_xpu():
merge into L1073 if condition?
Signed-off-by: Yan Ma <yan.ma@intel.com>
Force-pushed from 2199d73 to 436433c
jikunshang left a comment:
LGTM. Thanks for fixing!
Purpose
This PR enables fp8 online streaming quantization on the XPU path for the remaining Linear and MoE layers.
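At a high level, online streaming quantization converts each weight tensor to fp8 as it finishes loading, instead of first materializing the whole model in full precision. A simplified per-tensor sketch in pure Python (simulating the fp8-e4m3 range of ±448; the real XPU path uses torch fp8 dtypes and kernels, and this sketch omits mantissa rounding):

```python
FP8_E4M3_MAX = 448.0  # largest representable magnitude in fp8 e4m3


def quantize_fp8_per_tensor(weights):
    """Per-tensor symmetric quantization: scale so the largest
    magnitude maps onto the fp8 range, then clamp. Simplified sketch --
    real fp8 also rounds mantissas, which this omits."""
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale)) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate full-precision values from fp8 + scale."""
    return [x * scale for x in q]


# Each weight tensor is quantized immediately after its last shard
# streams in, so only fp8 values plus one scale per tensor are kept.
w = [0.5, -2.0, 1.25, 4.0]
q, s = quantize_fp8_per_tensor(w)
w_hat = dequantize(q, s)
```

The "streaming" aspect is that this happens tensor by tensor during weight loading, which keeps peak memory close to the fp8 footprint rather than the bf16/fp16 one.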
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.