fix fp8 online quantization streaming with tp > 1 #30900
robertgshaw2-redhat merged 1 commit into vllm-project:main
Code Review
This pull request addresses a bug in the online FP8 quantization streaming logic when using tensor parallelism (TP > 1). The issue stemmed from an incorrect assumption about the number of elements loaded by the `weight_loader`. The fix introduces a `CopyNumelCounter` class using `TorchDispatchMode` to accurately track the number of elements modified by `copy_` operations, even when weights are transformed (e.g., with `narrow`) before being copied. This is a clever and robust solution. The changes in `Fp8LinearMethod` and `Fp8OnlineMoEMethod` correctly apply this new counter. The implementation is clean and effectively resolves the described problem. I have no further comments.
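To make the counting idea concrete, here is a minimal sketch of a dispatch-mode counter. It is not the code from this PR: the class name `CopyCounterSketch` and the helper `toy_weight_loader` are hypothetical, and the toy loader only imitates the kind of `narrow`-then-`copy_` logic the review mentions.

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode


class CopyCounterSketch(TorchDispatchMode):
    """Hypothetical counter: sums the elements written by copy_ while active."""

    def __init__(self):
        super().__init__()
        self.copied_numel = 0

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        out = func(*args, **kwargs)
        # args[0] is the destination of copy_; it may be a view of the parameter.
        if func == torch.ops.aten.copy_.default:
            self.copied_numel += args[0].numel()
        return out


def toy_weight_loader(param, loaded_weight, tp_rank):
    # Hypothetical loader: keep only this rank's rows of the full tensor.
    shard_rows = param.shape[0]
    shard = loaded_weight.narrow(0, tp_rank * shard_rows, shard_rows)
    param.copy_(shard)


param = torch.zeros(2, 8)         # this rank's shard, assuming tp=2
loaded_weight = torch.ones(4, 8)  # full, unsharded checkpoint tensor

counter = CopyCounterSketch()
with counter:
    toy_weight_loader(param, loaded_weight, tp_rank=1)

# The checkpoint tensor has 32 elements, but only 16 were actually copied.
print(loaded_weight.numel(), counter.copied_numel)
```

Counting at the `copy_` call rather than at the loader's inputs means the count stays correct no matter how the loader slices either tensor beforehand.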
kylesayrs
left a comment
Totally correct, thanks for the fix
Hi @vkuzo, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Head branch was pushed to by a user without write access
Summary:

When we added online fp8 quant with streaming weight post-processing in vllm-project#29196, a bug was introduced where the TP>1 case was not always handled correctly. Specifically:

* vllm-project#29196 assumed that `weight_loader` copies `loaded_weight` to `param`
* this is not true, as `weight_loader` can call arbitrary logic on both `param` and `loaded_weight` before eventually calling `copy_`. An example is here: https://github.com/vllm-project/vllm/blob/e3a0f21e6ce78268865cafcdc3dc58c7a80dbc57/vllm/model_executor/parameter.py#L195

A fix is to track exactly how many elements were updated with `copy_`. The PR implements this fix with `TorchDispatchMode`.

Test Plan:

```bash
# tp 1 still works
CUDA_VISIBLE_DEVICES=6,7 with-proxy python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen1.5-MoE-A2.7B --enforce-eager --quantization fp8 -tp=1

# tp > 1 was broken before, now works
CUDA_VISIBLE_DEVICES=6,7 with-proxy python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen1.5-MoE-A2.7B --enforce-eager --quantization fp8 -tp=2
```

Signed-off-by: vasiliy <vasiliy@fb.com>
Signed-off-by: vasiliy <vasiliy@fb.com>
Signed-off-by: vasiliy <vasiliy@fb.com> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
Signed-off-by: vasiliy <vasiliy@fb.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Signed-off-by: vasiliy <vasiliy@fb.com>
Summary:

Fix for #30830

When we added online fp8 quant with streaming weight post-processing in #29196, a bug was introduced where the TP>1 case was not always handled correctly. Specifically:

* #29196 assumed that `weight_loader` copies `loaded_weight` to `param` directly
* this is not true, as `weight_loader` can call arbitrary logic on both `param` and `loaded_weight` before eventually calling `copy_`. An example is here: vllm/vllm/model_executor/parameter.py, line 195 in e3a0f21

A fix is to track exactly how many elements were updated with `copy_`. The PR implements this fix with `TorchDispatchMode`.

Test Plan:

    # tp 1 still works
    CUDA_VISIBLE_DEVICES=6,7 with-proxy python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen1.5-MoE-A2.7B --enforce-eager --quantization fp8 -tp=1

    # tp > 1 was broken before, now works
    CUDA_VISIBLE_DEVICES=6,7 with-proxy python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen1.5-MoE-A2.7B --enforce-eager --quantization fp8 -tp=2
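As a rough illustration of why such a miscount could matter only with TP>1, the sketch below works through the arithmetic for a made-up case. The shapes, the two-tensor fused parameter, and the function `loads_until_complete` are all assumptions for illustration, not details taken from vLLM or from this PR.

```python
from math import prod

FULL_SHAPE = (4096, 1024)  # hypothetical shape of each checkpoint tensor
NUM_FUSED = 2              # hypothetical: two checkpoint tensors fill one param


def loads_until_complete(tp_size, count_copied):
    """Return how many loads pass before the running count reaches param size."""
    rows, cols = FULL_SHAPE
    shard_numel = (rows // tp_size) * cols   # elements each load really copies
    param_numel = NUM_FUSED * shard_numel    # size of this rank's fused param
    seen = 0
    for n_loads in range(1, NUM_FUSED + 1):
        # Either count what was copied, or naively count the whole loaded tensor.
        seen += shard_numel if count_copied else prod(FULL_SHAPE)
        if seen >= param_numel:
            return n_loads
    return None


for tp_size in (1, 2):
    naive = loads_until_complete(tp_size, count_copied=False)
    exact = loads_until_complete(tp_size, count_copied=True)
    print(f"tp={tp_size}: naive says done after {naive} load(s), exact after {exact}")
```

Under these assumptions the two counts agree at tp=1, while at tp=2 the naive count reports the parameter as complete after the first of two loads, which would trigger post-processing too early.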