
[XPU] enable fp8 online streaming quantization #30944

Merged
jikunshang merged 3 commits into vllm-project:main from yma11:fp8-quantization
Dec 20, 2025

Conversation

Contributor

@yma11 yma11 commented Dec 18, 2025

Purpose

This PR enables FP8 online streaming quantization on the XPU path for the remaining linear and MoE layers.
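For readers unfamiliar with the technique, online (on-the-fly) FP8 quantization converts full-precision weights to an FP8 representation at load time, using a per-tensor scale derived from the largest absolute value. The sketch below is a minimal illustration of that idea with NumPy, simulating the e4m3 dynamic range; it is not the PR's actual implementation, and the function name is hypothetical.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3


def quantize_fp8_per_tensor(weight: np.ndarray) -> tuple[np.ndarray, float]:
    """Simulate per-tensor FP8 quantization (illustrative sketch only).

    Computes a scale so the weight's max magnitude maps to the FP8 e4m3
    limit, then rescales and clips. The original values can be
    approximately recovered as q * scale.
    """
    amax = float(np.abs(weight).max())
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = np.clip(weight / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale


w = np.array([[0.5, -2.0], [1.5, 3.0]], dtype=np.float32)
q, s = quantize_fp8_per_tensor(w)
dequantized = q * s  # approximately recovers w
```

A real implementation would additionally round to an actual FP8 storage dtype; the sketch only models the scaling step, which is the part that happens "online" as shards stream in.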

Test Plan

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen3-30B-A3B --enforce-eager --dtype=float16 --max-model-len=4096 --quantization=fp8 -tp=4

Test Result

Processed prompts: 100%|██████████████████| 4/4 [00:00<00:00,  4.21it/s, est. speed input: 23.18 toks/s, output: 67.43 toks/s]
--------------------------------------------------
Prompt: 'Hello, my name is'
Generated text: ' Dr. S. I am a professional artist, and I have been working in'
--------------------------------------------------
Prompt: 'The president of the United States is'
Generated text: ' the head of state and head of government of the United States of America, and'
--------------------------------------------------
Prompt: 'The capital of France is'
Generated text: ' Paris. What is the capital of Germany?\n\nThe capital of Germany is Berlin.'
--------------------------------------------------
Prompt: 'The future of AI is'
Generated text: ' not just about the technology itself, but also about the people who build and use'
--------------------------------------------------
(Worker_TP1 pid=6494) INFO 12-18 07:26:13 [multiproc_executor.py:709] Parent process exited, terminating worker
(Worker_TP2 pid=6495) INFO 12-18 07:26:13 [multiproc_executor.py:709] Parent process exited, terminating worker
(Worker_TP0 pid=6493) INFO 12-18 07:26:13 [multiproc_executor.py:709] Parent process exited, terminating worker
(Worker_TP3 pid=6496) INFO 12-18 07:26:13 [multiproc_executor.py:709] Parent process exited, terminating worker


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables fp8 online streaming quantization for XPU. It achieves this by refactoring the XPU quantization methods for Linear and FusedMoE layers to inherit from new base classes that implement the streaming logic. A key addition is the CopyNumelCounter utility, which robustly tracks the number of loaded weight elements to trigger the online quantization process at the correct time. The changes are well-structured, improve code reuse, and correctly implement the new feature for the XPU backend.
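The review above highlights a counter that tracks how many weight elements have been copied so far, firing the online quantization step once a parameter is fully populated. The snippet below is a toy illustration of that pattern, not the PR's actual `CopyNumelCounter` implementation (the real class and its API may differ).

```python
class CopyNumelCounter:
    """Illustrative sketch: count copied weight elements so that
    post-load processing (e.g. online quantization) can trigger
    exactly once, when the parameter is fully populated."""

    def __init__(self, expected_numel: int) -> None:
        self.expected_numel = expected_numel
        self.loaded_numel = 0

    def add(self, numel: int) -> bool:
        """Record a copied shard; return True once the tensor is complete."""
        self.loaded_numel += numel
        return self.loaded_numel >= self.expected_numel


# A tensor of 8 elements loaded in two shards of 4:
counter = CopyNumelCounter(expected_numel=8)
done_after_first = counter.add(4)   # half loaded, not ready yet
done_after_second = counter.add(4)  # fully loaded, quantization may fire
```

Counting elements rather than shards makes the trigger robust to sharded or fused weight loading, where a parameter is filled by several partial copies.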


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +310 to +314
prefix=prefix,
ignored_layers=self.ignored_layers,
fused_mapping=self.packed_modules_mapping,
):
return UnquantizedLinearMethod()


P1: Return MoE-compatible method when skipping XPU layers

On XPU, when a FusedMoE layer is listed in ignored_layers, get_xpu_quant_method now returns UnquantizedLinearMethod (lines 310–314). FusedMoE initialization asserts that its quant_method is a FusedMoEMethodBase (layer.py lines 582–592), so this new skip path raises during model construction instead of leaving the layer unquantized. The skip logic should return the unquantized MoE method to avoid the assertion failure.
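The fix the review asks for is to dispatch on the layer type in the skip path. The sketch below shows the shape of that dispatch with hypothetical stand-in classes; the names mirror the review comment, but the real vLLM classes and constructor signatures differ.

```python
# Stand-in classes for illustration only; the real vLLM types live in
# vllm.model_executor.layers and take constructor arguments.
class UnquantizedLinearMethod: ...
class UnquantizedFusedMoEMethod: ...
class LinearBase: ...
class FusedMoE: ...


def get_skip_method(layer):
    """Return an unquantized method matching the layer type.

    FusedMoE asserts that its quant_method is MoE-compatible, so the
    skip path must not hand it a linear method.
    """
    if isinstance(layer, FusedMoE):
        return UnquantizedFusedMoEMethod()
    return UnquantizedLinearMethod()


moe_method = get_skip_method(FusedMoE())
linear_method = get_skip_method(LinearBase())
```

Without the type check, an ignored MoE layer would receive `UnquantizedLinearMethod` and fail FusedMoE's construction-time assertion, which is exactly the failure mode the review describes.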


@yma11 yma11 marked this pull request as ready for review December 19, 2025 01:46
Contributor Author

yma11 commented Dec 19, 2025

@jikunshang please take a review.

ignored_layers=self.ignored_layers,
fused_mapping=self.packed_modules_mapping,
):
return UnquantizedLinearMethod()
Collaborator


Should this be UnquantizedFusedMoEMethod?

Contributor Author


Oh, yes. Updated.

@@ -1058,6 +1069,8 @@ def maybe_make_prepare_finalize(
self,
routing_tables: tuple[torch.Tensor, torch.Tensor, torch.Tensor] | None = None,
) -> mk.FusedMoEPrepareAndFinalize | None:
if current_platform.is_xpu():
Collaborator


Could this be merged into the if condition at L1073?

Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: Yan Ma <yan.ma@intel.com>
Collaborator

@jikunshang jikunshang left a comment


LGTM. Thanks for fixing!

@jikunshang jikunshang enabled auto-merge (squash) December 20, 2025 11:49
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 20, 2025
@jikunshang jikunshang merged commit 560ae96 into vllm-project:main Dec 20, 2025
54 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed
