
fix fp8 online quantization streaming with tp > 1 #30900

Merged
robertgshaw2-redhat merged 1 commit into vllm-project:main from vkuzo:20251217_fp8_streaming_fix
Dec 18, 2025

Conversation


@vkuzo vkuzo commented Dec 17, 2025

Summary:

Fix for #30830.

When we added online fp8 quant with streaming weight post-processing in #29196, a bug was introduced where the TP>1 case was not always handled correctly. Specifically:

* #29196 assumed that `weight_loader` copies `loaded_weight` to `param`
* this is not true, as `weight_loader` can call arbitrary logic on both `param` and `loaded_weight` before eventually calling `copy_`. An example is here: https://github.com/vllm-project/vllm/blob/e3a0f21e6ce78268865cafcdc3dc58c7a80dbc57/vllm/model_executor/parameter.py#L195

The fix is to track exactly how many elements were updated with `copy_`. This PR implements that fix with `TorchDispatchMode`.
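The counting idea can be sketched as follows. This is an illustrative sketch of the technique only, not vLLM's actual `CopyNumelCounter` implementation; all names and shapes below are hypothetical.

```python
# Illustrative sketch: count the elements actually written by in-place
# aten.copy_ calls using a TorchDispatchMode. Not vLLM's real code.
import torch
from torch.utils._python_dispatch import TorchDispatchMode


class CopyNumelCounter(TorchDispatchMode):
    """Accumulates the number of elements written by copy_ while active."""

    def __init__(self):
        super().__init__()
        self.copied_numel = 0

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.ops.aten.copy_.default:
            # args[0] is the destination tensor; its numel is what was
            # actually written, even if it is a narrowed view.
            self.copied_numel += args[0].numel()
        return func(*args, **kwargs)


# Even when the destination is narrowed before copy_, only the elements
# actually written are counted.
param = torch.zeros(8, 4)   # 32 elements total
shard = torch.ones(4, 4)    # 16 elements
counter = CopyNumelCounter()
with counter:
    param.narrow(0, 0, 4).copy_(shard)
# counter.copied_numel is now 16, not 32
```

This makes the count robust to loaders that slice or reshape `param` before writing, which is the failure mode described above.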

Test Plan:

```bash
# tp 1 still works
CUDA_VISIBLE_DEVICES=6,7 with-proxy python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen1.5-MoE-A2.7B --enforce-eager --quantization fp8 -tp=1
# tp > 1 was broken before, now works
CUDA_VISIBLE_DEVICES=6,7 with-proxy python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen1.5-MoE-A2.7B --enforce-eager --quantization fp8 -tp=2
```

Reviewers:

Subscribers:

Tasks:

Tags:

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@vkuzo vkuzo force-pushed the 20251217_fp8_streaming_fix branch from 2c3d7d0 to 23cc7ff Compare December 17, 2025 19:43

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug in the online FP8 quantization streaming logic when using tensor parallelism (TP > 1). The issue stemmed from an incorrect assumption about the number of elements loaded by the weight_loader. The fix introduces a CopyNumelCounter class using TorchDispatchMode to accurately track the number of elements modified by copy_ operations, even when weights are transformed (e.g., with narrow) before being copied. This is a clever and robust solution. The changes in Fp8LinearMethod and Fp8OnlineMoEMethod correctly apply this new counter. The implementation is clean and effectively resolves the described problem. I have no further comments.

@vkuzo vkuzo force-pushed the 20251217_fp8_streaming_fix branch from 23cc7ff to 076d42f Compare December 17, 2025 19:44

@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) December 17, 2025 19:48

@kylesayrs kylesayrs left a comment


Totally correct, thanks for the fix


@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 17, 2025
@mergify

mergify bot commented Dec 17, 2025

Hi @vkuzo, the pre-commit checks have failed. Please run:

```bash
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
```bash
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint
```

auto-merge was automatically disabled December 18, 2025 12:18

Head branch was pushed to by a user without write access

@vkuzo vkuzo force-pushed the 20251217_fp8_streaming_fix branch from 076d42f to 37f60d8 Compare December 18, 2025 12:18
Summary:

When we added online fp8 quant with streaming weight post-processing in
vllm-project#29196, a bug was introduced
where TP>1 case was not always handled correctly. Specifically:

* vllm-project#29196 assumed that
  `weight_loader` copies `loaded_weight` to `param`
* this is not true, as `weight_loader` can call arbitrary logic on both
  `param` and `loaded_weight` before eventually calling `copy_`. An
  example is here:
  https://github.com/vllm-project/vllm/blob/e3a0f21e6ce78268865cafcdc3dc58c7a80dbc57/vllm/model_executor/parameter.py#L195
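The second bullet can be illustrated with a minimal, self-contained sketch. The loader below is hypothetical, not vLLM's actual `weight_loader`; it only shows why "elements loaded" and `loaded_weight.numel()` can differ under TP.

```python
# Illustrative sketch: a TP-aware weight loader narrows the checkpoint
# tensor to this rank's shard before copy_, so the number of elements
# actually written differs from loaded_weight.numel(). Hypothetical names.
import torch


def sharded_weight_loader(param, loaded_weight, tp_rank=0, tp_size=2):
    # Each rank keeps only its row-shard of the full checkpoint tensor.
    shard_size = loaded_weight.shape[0] // tp_size
    shard = loaded_weight.narrow(0, tp_rank * shard_size, shard_size)
    param.copy_(shard)


full_weight = torch.randn(8, 4)  # checkpoint tensor: 32 elements
param = torch.empty(4, 4)        # this rank's shard: 16 elements
sharded_weight_loader(param, full_weight)

# Assuming the loader wrote loaded_weight.numel() elements over-counts
# by a factor of tp_size: only param.numel() elements were copied.
```

Counting elements at the `copy_` call site, rather than assuming `loaded_weight` is copied wholesale, sidesteps this mismatch.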

A fix is to track exactly how many elements were updated with
`copy_`. This PR implements the fix with `TorchDispatchMode`.

Test Plan:

```bash
# tp 1 still works
CUDA_VISIBLE_DEVICES=6,7 with-proxy python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen1.5-MoE-A2.7B --enforce-eager --quantization fp8 -tp=1
# tp > 1 was broken before, now works
CUDA_VISIBLE_DEVICES=6,7 with-proxy python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen1.5-MoE-A2.7B --enforce-eager --quantization fp8 -tp=2
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: vasiliy <vasiliy@fb.com>
@vkuzo vkuzo force-pushed the 20251217_fp8_streaming_fix branch from 37f60d8 to 5152043 Compare December 18, 2025 13:41
@robertgshaw2-redhat robertgshaw2-redhat merged commit f4ee2c3 into vllm-project:main Dec 18, 2025
53 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
Signed-off-by: vasiliy <vasiliy@fb.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: vasiliy <vasiliy@fb.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed

3 participants