fp8 online quant: split out Fp8OnlineLinearMethod #32189

mgoin merged 1 commit into vllm-project:main
Conversation
```python
# if we have loaded all of the elements, call
# process_weights_after_loading
target_loaded_numel = layer.weight.numel()
if layer._loaded_numel == target_loaded_numel:
```
this is outdated, deleting
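For context, the deleted check belonged to a streaming-weight-load design: count elements as shards arrive, and run `process_weights_after_loading` once the full weight is present. A minimal, hypothetical sketch of that pattern (only `_loaded_numel` and the comment come from the quoted hunk; the class and method bodies are stand-ins, not vLLM code):

```python
class StreamingLinear:
    """Toy stand-in for a linear layer whose weight arrives in shards."""

    def __init__(self, rows, cols):
        self.shape = (rows, cols)
        self._loaded_numel = 0   # elements received so far
        self.processed = False   # set by process_weights_after_loading

    def weight_numel(self):
        return self.shape[0] * self.shape[1]

    def load_shard(self, shard_numel):
        self._loaded_numel += shard_numel
        # if we have loaded all of the elements, call
        # process_weights_after_loading (mirrors the deleted check)
        if self._loaded_numel == self.weight_numel():
            self.process_weights_after_loading()

    def process_weights_after_loading(self):
        # real code would quantize / repack the full weight here
        self.processed = True


layer = StreamingLinear(4, 8)
layer.load_shard(16)
assert not layer.processed   # only half the elements have arrived
layer.load_shard(16)
assert layer.processed       # full weight loaded, post-processing fired
```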
```python
assert input_scale is not None
input_scale = input_scale.max()
weight = weight.t()
weight = layer.weight
```
all of these changes are just moving online quant code to the new child class
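The hunk above is the heart of the online path: collapse the input scale with `.max()` and quantize the weight on the fly. As a rough illustration of per-tensor FP8 quantization (a sketch assuming the e4m3 finite max of 448.0; plain Python floats stand in for fp8 storage, and rounding onto the e4m3 grid is omitted):

```python
FP8_E4M3_MAX = 448.0  # largest finite value in the fp8 e4m3 format

def per_tensor_fp8_quant(weight):
    """Quantize a flat list of weights with a single per-tensor scale.

    Returns (qweight, scale) so that q * scale ~= w for each element.
    Plain floats stand in for an actual fp8 buffer in this sketch;
    real code would also round each q onto the e4m3 grid.
    """
    amax = max(abs(w) for w in weight)
    scale = amax / FP8_E4M3_MAX
    qweight = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale))
               for w in weight]
    return qweight, scale


weights = [0.25, -1.5, 3.0, -0.125]
q, s = per_tensor_fp8_quant(weights)
# dequantized values recover the originals (no grid rounding here)
assert all(abs(qi * s - wi) < 1e-9 for qi, wi in zip(q, weights))
# the per-tensor amax maps onto the fp8 max value
assert abs(max(abs(qi) for qi in q) - FP8_E4M3_MAX) < 1e-6
```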
```python
if self.quant_config.is_checkpoint_fp8_serialized:
    weight = create_fp8_weight_parameter(
        output_size_per_partition, input_size_per_partition, weight_loader
```
all of these changes are just moving online quant code to the new child class
Code Review
This pull request is a well-executed refactoring that splits the online quantization logic for FP8 linear layers into a new Fp8OnlineLinearMethod class. This change significantly improves code clarity and maintainability by separating the concerns of online and offline quantization, following the pattern established for MoE layers. The implementation is clean, and the logic has been moved correctly. The tests have also been updated to cover both dense and MoE models for online quantization, which is a great improvement. Overall, this is a solid contribution that enhances the codebase.
This looks good. Can you make a new directory called
Hi @vkuzo, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
kylesayrs left a comment:
Agree with @robertgshaw2-redhat, it'd be great to break these out into separate files/directories
Looks great in concept, thanks for looking at this
Force-pushed from a59a81f to d209af2 (Compare)
Hi @vkuzo, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed from d209af2 to b9d2f36 (Compare)
mgoin left a comment:
LGTM as just structural changes, thanks!
CI is failing with an OOM on Qwen 1.5B on a 24 GB NVIDIA L4 machine; my best guess is that we actually need #31914 to land for memory usage to be sane. I'm going to remove Qwen 1.5B from this PR (since the fp8.py changes here do not touch MoEs) to unblock, and we can revisit online quant + MoE in CI once the memory issue is fixed.
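As a rough back-of-envelope for why online quantization can tip over a 24 GB card (all numbers below are illustrative assumptions, not measurements from this PR): the checkpoint weights are first materialized at load-time precision before the fp8 copy is produced.

```python
GIB = 1024 ** 3

# Illustrative assumption: the MoE model in this thread has on the
# order of 14e9 total parameters (only ~2.7e9 active per token, but
# all expert weights must be resident). Online quant loads them in
# bf16 (2 bytes/param) before writing the fp8 (1 byte/param) copy.
total_params = 14e9
bf16_gib = 2 * total_params / GIB
fp8_gib = 1 * total_params / GIB

print(f"bf16 weights during load: {bf16_gib:.1f} GiB")  # exceeds 24 GiB
print(f"fp8 weights after quant:  {fp8_gib:.1f} GiB")
```

Even before activations, KV cache, and CUDA graph buffers, the bf16 load footprint alone exceeds a 24 GiB L4, which is consistent with streaming weight load (#31914) being the prerequisite fix.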
Force-pushed from 4806453 to 7df7c9a (Compare)
@mgoin thanks, CI is green after the latest fix
Force-pushed from 7df7c9a to 2fe028f (Compare)
rebasing manually since automatic rebase failed on permissions
Summary: Enables using float8 blockwise scaling with `fp8.py` online quantization. For now, the UI part of this PR is a placeholder pending the discussions in vllm-project#32412. The bulk of the PR is just wiring up kernels that already exist to fp8.py + online quant + blockwise scaling.

This will need to be rebased after the following PRs land:

* vllm-project#32189
* vllm-project#31914

Test Plan: TODO

Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>
Force-pushed from 2fe028f to 2debbac (Compare)
Summary: Split out `Fp8OnlineLinearMethod` from `Fp8LinearMethod` to more clearly separate online quant from offline quant logic, following a similar PR recently landed for `Fp8OnlineMoEMethod`.

Test Plan:

```
# run online quant test (dense + moe smoke tests)
with-proxy pytest tests/quantization/test_fp8.py -s -x -k online_quantization
# run entire fp8.py test suite
with-proxy pytest tests/quantization/test_fp8.py -s -x
```

Signed-off-by: vasiliy <vasiliy@fb.com>
Force-pushed from 2debbac to 0ebce01 (Compare)
rebased on top of #27814 which just landed
* Revert "offload weights to cpu before fp8 online quant (vllm-project#225)" (reverts commit fc5a0a6)
* fp8 online quant: split out Fp8OnlineLinearMethod (vllm-project#32189)
* cherry-pick: fix memory for online fp8 quantization with streaming weight load
* fix fp8

Signed-off-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: Vasiliy Kuznetsov <vkuzo@users.noreply.github.com>
Summary:

Split out `Fp8OnlineLinearMethod` from `Fp8LinearMethod` to more clearly separate online quant from offline quant logic, following a similar PR recently landed for `Fp8OnlineMoEMethod`. In the same PR, beef up testing for online quant in integration tests a bit so we can depend on these tests for future online quant functionality. Specifically, extend the online fp8 quant test to also include a small MoE model, and extend it to run inference with a couple of tokens.
Note

Cleanly separates online FP8 linear quantization from the offline (serialized) flow and strengthens smoke test coverage.

* New `Fp8OnlineLinearMethod` that patches weight loading to quantize fp16/bf16 weights on-the-fly via `ops.scaled_fp8_quant`, with optional Marlin preparation
* `Fp8Config.get_quant_method` chooses `Fp8OnlineLinearMethod` vs `Fp8LinearMethod` based on `is_checkpoint_fp8_serialized`; simplifies `Fp8LinearMethod` to the serialized-FP8 path
* Keeps the MoE methods (`Fp8OnlineMoEMethod` / `Fp8MoEMethod`) unchanged and integrates them with the selection logic
* `test_online_quantization` runs on `facebook/opt-125m` and `Qwen/Qwen1.5-MoE-A2.7B`, parameterizes KV cache (auto/fp8) and Marlin/ROCm flags, and performs a short greedy generation

Written by Cursor Bugbot for commit d209af298f5d6601a9328c21c252c4fc11de1911.
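The selection logic described in the summary can be sketched as follows (class and flag names mirror the PR; the bodies are hypothetical stand-ins, not vLLM's actual implementation, and the real `get_quant_method` takes layer arguments omitted here):

```python
class Fp8LinearMethod:
    """Serialized-FP8 path: weights are already fp8 in the checkpoint."""

    def __init__(self, quant_config):
        self.quant_config = quant_config


class Fp8OnlineLinearMethod(Fp8LinearMethod):
    """Online path: quantize fp16/bf16 weights to fp8 during loading."""


class Fp8Config:
    def __init__(self, is_checkpoint_fp8_serialized):
        self.is_checkpoint_fp8_serialized = is_checkpoint_fp8_serialized

    def get_quant_method(self):
        # dispatch on checkpoint format: offline (serialized) weights go
        # to the base class, everything else gets online quantization
        if self.is_checkpoint_fp8_serialized:
            return Fp8LinearMethod(self)
        return Fp8OnlineLinearMethod(self)


assert isinstance(Fp8Config(False).get_quant_method(), Fp8OnlineLinearMethod)
assert type(Fp8Config(True).get_quant_method()) is Fp8LinearMethod
```

Making the online method a child class keeps the shared apply/forward machinery in one place while isolating the load-time quantization hooks, the same split already landed for the MoE methods.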