fix fp8 online quantization streaming with tp > 1 by vkuzo · Pull Request #30900 · vllm-project/vllm

vkuzo · 2025-12-17T19:41:50Z

Summary:

When we added online fp8 quantization with streaming weight post-processing in #29196, we introduced a bug where the TP>1 case was not always handled correctly. Specifically:

* #29196 (online fp8 quant with streaming weight post-processing) assumed that weight_loader copies loaded_weight to param directly
* this is not always true, as weight_loader can run arbitrary logic on both param and loaded_weight before eventually calling copy_. An example is here:

vllm/vllm/model_executor/parameter.py

Line 195 in e3a0f21

param_data = param_data.narrow(self.output_dim, shard_offset, shard_size)
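
To make the mismatch concrete, here is a minimal arithmetic sketch of how counting loaded_weight elements could diverge from what copy_ actually writes. The shapes, the merged gate/up layout, and the helper name calls_until_full are assumptions for illustration only; they are not taken from the PR.

```python
def calls_until_full(tp_size, hidden=4, intermediate=8):
    """Return (naive_calls, actual_calls) needed to consider a param fully loaded.

    Hypothetical setup: one per-rank parameter holds a 1/tp_size slice of a
    "gate" tensor and a 1/tp_size slice of an "up" tensor, and each checkpoint
    tensor has shape [intermediate, hidden].
    """
    ckpt_numel = intermediate * hidden           # elements in one full checkpoint tensor
    shard_numel = ckpt_numel // tp_size          # elements copy_ really writes per load
    param_numel = 2 * shard_numel                # gate shard + up shard on this rank

    def count_calls(numel_per_load):
        loaded, calls = 0, 0
        while loaded < param_numel:
            loaded += numel_per_load
            calls += 1
        return calls

    naive = count_calls(ckpt_numel)    # assumes all of loaded_weight is copied
    actual = count_calls(shard_numel)  # counts only what was narrowed and copied
    return naive, actual


print(calls_until_full(tp_size=1))  # (2, 2): both counts agree
print(calls_until_full(tp_size=2))  # (1, 2): naive count says "done" one load early
```

Under these assumed shapes, the naive count reaches the parameter size after the first load when tp_size is 2, so post-processing would start before the second shard arrives.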

The fix is to track exactly how many elements were updated with copy_. This PR implements that tracking with TorchDispatchMode.
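
As a rough sketch of the idea (not the PR's actual code), a TorchDispatchMode subclass can intercept every aten op and add up the size of each copy_ destination. The class name CopyNumelCounter comes from the review comment on this PR; the body, the attribute name copied_numel, and the toy tensors below are assumptions.

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode


class CopyNumelCounter(TorchDispatchMode):
    """Counts elements written by copy_ while the mode is active (illustrative)."""

    def __init__(self):
        super().__init__()
        self.copied_numel = 0

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        out = func(*args, **kwargs)  # run the real op first
        if func is torch.ops.aten.copy_.default:
            # args[0] is the destination of copy_, possibly a narrowed view
            self.copied_numel += args[0].numel()
        return out


# Toy stand-in for a weight loader that narrows the parameter before copying.
param = torch.zeros(8, 4)
loaded_shard = torch.ones(4, 4)

counter = CopyNumelCounter()
with counter:
    param.narrow(0, 4, 4).copy_(loaded_shard)

print(counter.copied_numel, param.numel())
```

Here the counter reports only the 16 elements of the narrowed view rather than the 32 elements of the full parameter, which is the quantity needed to decide when a parameter has been completely loaded.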

Test Plan:

# tp 1 still works
CUDA_VISIBLE_DEVICES=6,7 with-proxy python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen1.5-MoE-A2.7B --enforce-eager --quantization fp8 -tp=1
# tp > 1 was broken before, now works
CUDA_VISIBLE_DEVICES=6,7 with-proxy python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen1.5-MoE-A2.7B --enforce-eager --quantization fp8 -tp=2

chatgpt-codex-connector · 2025-12-17T19:41:59Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

gemini-code-assist

Code Review

This pull request addresses a bug in the online FP8 quantization streaming logic when using tensor parallelism (TP > 1). The issue stemmed from an incorrect assumption about the number of elements loaded by the weight_loader. The fix introduces a CopyNumelCounter class using TorchDispatchMode to accurately track the number of elements modified by copy_ operations, even when weights are transformed (e.g., with narrow) before being copied. This is a clever and robust solution. The changes in Fp8LinearMethod and Fp8OnlineMoEMethod correctly apply this new counter. The implementation is clean and effectively resolves the described problem. I have no further comments.

kylesayrs

Totally correct, thanks for the fix

mergify · 2025-12-17T20:40:12Z

Hi @vkuzo, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

vkuzo requested review from mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth and yewentao256 as code owners December 17, 2025 19:41

vkuzo force-pushed the 20251217_fp8_streaming_fix branch from 2c3d7d0 to 23cc7ff Compare December 17, 2025 19:43

gemini-code-assist bot reviewed Dec 17, 2025

vkuzo force-pushed the 20251217_fp8_streaming_fix branch from 23cc7ff to 076d42f Compare December 17, 2025 19:44

robertgshaw2-redhat approved these changes Dec 17, 2025

robertgshaw2-redhat enabled auto-merge (squash) December 17, 2025 19:48

kylesayrs approved these changes Dec 17, 2025

github-actions bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Dec 17, 2025

yma11 mentioned this pull request Dec 18, 2025

[XPU] enable fp8 online streaming quantization #30944

Merged

5 tasks

auto-merge was automatically disabled December 18, 2025 12:18
Head branch was pushed to by a user without write access

vkuzo force-pushed the 20251217_fp8_streaming_fix branch from 076d42f to 37f60d8 Compare December 18, 2025 12:18

vkuzo force-pushed the 20251217_fp8_streaming_fix branch from 37f60d8 to 5152043 Compare December 18, 2025 13:41

robertgshaw2-redhat merged commit f4ee2c3 into vllm-project:main Dec 18, 2025
53 checks passed

yma11 mentioned this pull request Dec 22, 2025

[Bug]: accuracy issue on MoE online fp8 quantization #30830

Closed

1 task

yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025

fix fp8 online quantization streaming with tp > 1 (vllm-project#30900)

168b07b

Signed-off-by: vasiliy <vasiliy@fb.com>

Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025

fix fp8 online quantization streaming with tp > 1 (vllm-project#30900)

493f85a

Signed-off-by: vasiliy <vasiliy@fb.com> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>

vkuzo mentioned this pull request Jan 12, 2026

[QeRL] Layerwise Reloading #32133

Merged

dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026

fix fp8 online quantization streaming with tp > 1 (vllm-project#30900)

ebef119

Signed-off-by: vasiliy <vasiliy@fb.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>

ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

fix fp8 online quantization streaming with tp > 1 (vllm-project#30900)

9a46eab

Signed-off-by: vasiliy <vasiliy@fb.com>
