[not ready for review] extend fp8 online quant with blockwise scaling #32485
vkuzo wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request extends FP8 online quantization to support blockwise scaling. The changes primarily involve wiring up existing kernels for blockwise operations within the fp8.py quantization logic.
My review focuses on the implementation in vllm/model_executor/layers/quantization/fp8.py. The changes look mostly correct and consistent. I've identified a few areas for improvement:
- A critical assertion that was removed should be restored to prevent potential misconfigurations.
- There are a couple of local imports that should be moved to the top of the file for better code style and maintainability.
- There is some repeated logic for checking the quantization type, which could be refactored into a helper property to improve code clarity and reduce duplication.
Overall, this is a good step towards enabling more flexible FP8 quantization schemes.
```diff
@@ -1089,7 +1156,21 @@ def __init__(self, quant_config: Fp8Config, layer: torch.nn.Module):
         super().__init__(quant_config, layer)
         assert not quant_config.is_checkpoint_fp8_serialized
         assert quant_config.activation_scheme == "dynamic"
-        assert quant_config.weight_block_size is None
```
The assertion `assert quant_config.weight_block_size is None` was removed. `Fp8OnlineMoEMethod` is used for online quantization (`is_checkpoint_fp8_serialized=False`), where `weight_block_size` in `Fp8Config` is expected to be `None`. This assertion is a crucial sanity check to prevent misconfiguration. Please restore it:

```python
assert self.quant_config.weight_block_size is None
```
```python
if (
    self.block_quant
    or self.quant_config.online_quant_scaling_type
    is OnlineQuantScalingType.BLOCKWISE
):
```
The condition `self.block_quant or self.quant_config.online_quant_scaling_type is OnlineQuantScalingType.BLOCKWISE` is repeated in several places within this class (e.g., in `process_weights_after_loading` and `apply`). To improve maintainability and reduce code duplication, consider creating a helper property within the `Fp8LinearMethod` class to encapsulate this logic. For example:

```python
@property
def _is_blockwise_quant(self):
    return (self.block_quant or
            self.quant_config.online_quant_scaling_type is
            OnlineQuantScalingType.BLOCKWISE)
```

You could then use `if self._is_blockwise_quant:` in this and other locations.
```python
    is OnlineQuantScalingType.BLOCKWISE
):
    # blockwise
    from vllm.utils.deep_gemm import per_block_cast_to_fp8
```

```python
    is OnlineQuantScalingType.BLOCKWISE
):
    # Blockwise quantization
    from vllm.utils.deep_gemm import per_block_cast_to_fp8
```
Cursor Bugbot has reviewed your changes and found 4 potential issues.
```python
if self.block_quant:
    assert self.weight_block_size is not None

if (
```
Marlin path accesses non-existent weight scale attribute
High Severity
The Marlin code path in apply checks only self.block_quant to decide between layer.weight_scale_inv and layer.weight_scale, but block_quant is False when doing online quantization (it only reflects checkpoint-provided weight_block_size). Since online_quant_scaling_type is hardcoded to BLOCKWISE, process_weights_after_loading creates weight_scale_inv, but the Marlin path tries to access weight_scale which doesn't exist, causing an AttributeError on GPUs without FP8 hardware support.
```diff
@@ -150,7 +152,8 @@ def test_load_fp16_model(
     monkeypatch.setenv("VLLM_TEST_FORCE_FP8_MARLIN", "1")

     with vllm_runner(
-        "facebook/opt-125m",
+        # "facebook/opt-125m",
+        "Qwen/Qwen1.5-MoE-A2.7B",
```
Test validation disabled with debug model change
Medium Severity
The test model was changed from facebook/opt-125m to Qwen/Qwen1.5-MoE-A2.7B, but the check_model validation function (which references opt-125m-specific layer paths like model.model.decoder.layers[0].fc1) was commented out instead of updated. The test now only runs inference without validating quantization was applied correctly. Additionally, test parameterization was reduced, decreasing coverage.
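A hypothetical, model-agnostic way to restore that validation (a sketch only — the real `check_model` would iterate `model.named_parameters()` and inspect `torch` dtypes rather than this mocked list of name/dtype pairs):

```python
def count_fp8_weights(named_dtypes):
    """Count parameters whose dtype indicates FP8 quantization was applied.

    `named_dtypes`: iterable of (parameter_name, dtype_name) pairs, standing
    in for iterating the loaded model's parameters in the real test.
    """
    return sum(1 for _, dtype in named_dtypes if dtype == "float8_e4m3fn")

# A quantized MoE model should expose at least one FP8 weight, while
# layernorms and other non-quantized parameters stay in high precision.
params = [
    ("model.layers.0.mlp.experts.0.gate_proj.weight", "float8_e4m3fn"),
    ("model.layers.0.input_layernorm.weight", "bfloat16"),
]
assert count_fp8_weights(params) >= 1
```

A check of this shape would survive future model swaps instead of hardcoding opt-125m layer paths.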
```python
    self.block_quant
    or self.quant_config.online_quant_scaling_type
    is OnlineQuantScalingType.BLOCKWISE
):
```
Blockwise condition breaks serialized per-tensor FP8 checkpoints
High Severity
The conditions checking online_quant_scaling_type is BLOCKWISE in the non-Marlin apply paths don't account for serialized checkpoints. For pre-quantized per-tensor FP8 checkpoints, create_weights creates weight_scale (not weight_scale_inv), but the hardcoded BLOCKWISE setting causes both the batch-invariant path and the main w8a8_block_fp8_linear.apply path to access layer.weight_scale_inv, which doesn't exist. This causes AttributeError when loading any serialized per-tensor FP8 checkpoint on FP8-capable hardware.
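A sketch of one way to scope the condition (hypothetical; it assumes the config exposes `is_checkpoint_fp8_serialized` as seen in the assertions earlier in this review):

```python
def use_blockwise_path(block_quant: bool,
                       online_scaling_is_blockwise: bool,
                       is_checkpoint_fp8_serialized: bool) -> bool:
    # Checkpoint-provided block scales always take the blockwise path.
    # The online BLOCKWISE setting should only apply when we are actually
    # quantizing online, i.e. not loading a serialized FP8 checkpoint.
    return block_quant or (
        online_scaling_is_blockwise and not is_checkpoint_fp8_serialized
    )

# Serialized per-tensor FP8 checkpoint: must NOT take the blockwise path,
# since only `weight_scale` (not `weight_scale_inv`) exists on the layer.
assert use_blockwise_path(False, True, True) is False
```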
```python
    block_quant=self.block_quant,
    tp_size=layer.moe_parallel_config.tp_size,
    with_lora_support=self.moe.is_lora_enabled,
)
```
Parent validation rejects online blockwise before child overrides
Medium Severity
Fp8OnlineMoEMethod.__init__ calls super().__init__() before setting block_quant=True for online blockwise quantization. The parent class validates using its initial block_quant=False and activation_scheme="dynamic", computing dynamic_per_token=True. On SM90/SM100 GPUs with FlashInfer enabled, the parent selects a FlashInfer backend and raises NotImplementedError about "dynamic per token activation quantization" before the child can override block_quant=True. This causes online blockwise MoE quantization to fail on H100/Blackwell GPUs with a misleading error.
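The ordering problem can be illustrated with a minimal sketch (class and attribute names are illustrative, not the actual vLLM classes):

```python
class BaseMoEMethod:
    def __init__(self):
        # The parent selects a backend during __init__ based on block_quant,
        # mirroring the validation described above.
        self.block_quant = getattr(self, "block_quant", False)
        self.backend = "blockwise" if self.block_quant else "dynamic_per_token"

class OnlineMoEMethodBuggy(BaseMoEMethod):
    def __init__(self):
        super().__init__()        # parent validates with block_quant=False
        self.block_quant = True   # too late: backend already selected

class OnlineMoEMethodFixed(BaseMoEMethod):
    def __init__(self):
        self.block_quant = True   # set before parent validation runs
        super().__init__()
```

Setting the attribute before delegating to the parent (or passing it explicitly through the parent constructor) lets the parent's backend selection see the correct configuration.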
Force-pushed 5d141ae to 3b8c1f4
Hi @vkuzo, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed 3b8c1f4 to 7e08f35
Summary:

Enables using float8 blockwise scaling with `fp8.py` online quantization.

For now, the UI part of this PR is a placeholder pending the discussions in vllm-project#32412. The bulk of the PR is just wiring up kernels that already exist to fp8.py + online quant + blockwise scaling.

This will need to be rebased after the following PRs land:
* vllm-project#32189
* vllm-project#31914

Test Plan: TODO

Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>
Force-pushed 7e08f35 to 419634b
This pull request has merge conflicts that must be resolved before it can be merged.
Summary:

Enables using float8 blockwise scaling with `fp8.py` online quantization.

For now, the UI part of this PR is a placeholder pending the discussions in #32412. The bulk of the PR is just wiring up kernels that already exist to fp8.py + online quant + blockwise scaling.

This will need to be rebased after the following PRs land:
* #32189
* #31914
Test Plan:
TODO