[QeRL] Compose online quantization with quantized reloading#38032
vllm-bot merged 8 commits into vllm-project:main
Conversation
Code Review
This pull request refactors the online FP8 quantization loading mechanism by extracting initialization logic into a new initialize_online_processing function and adjusting how process_weights_after_loading is handled within the model loading pipeline. The LayerReloadingInfo dataclass is updated to make kernel_tensors optional, and related code is adjusted to handle this change. A critical issue was identified where torch.empty_like with a meta device tensor could lead to a NotImplementedError during weight initialization, requiring explicit device specification. Additionally, a commented-out line in base_loader.py should be removed for clarity.
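The meta-device pitfall called out above can be illustrated with a small standalone sketch (plain PyTorch, outside vLLM): `torch.empty_like` on a meta tensor inherits the meta device unless a device is passed explicitly, so the result has no storage and cannot back real weights.

```python
import torch

# A meta tensor carries shape and dtype but has no storage.
meta_w = torch.empty((4, 8), device="meta")

# Without an explicit device, empty_like inherits the meta device, so
# the result still has no storage; using it as a real weight later
# fails (e.g. with NotImplementedError on storage-touching ops).
inherited = torch.empty_like(meta_w)
assert inherited.device.type == "meta"

# Passing device explicitly materializes a real tensor with the same
# shape and dtype ("cpu" here stands in for the actual target device).
real = torch.empty_like(meta_w, device="cpu")
assert real.device.type == "cpu" and real.shape == meta_w.shape
```

This is why the review asks for explicit device specification during weight initialization.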
/gemini review
Code Review
This pull request refactors the online quantization loading mechanism by introducing initialize_online_processing and finalize_layerwise_process functions. These changes streamline how weights are loaded and processed for online quantization, particularly for FP8 and MXFP8 methods, and include updates to handle meta devices and weight loaders. A new test test_online_quantize_reload has been added to cover these functionalities. However, there are two critical issues identified: an incorrect method call in finalize_layerwise_process where layer.process_weights_after_loading is called with an incorrect argument and on the wrong object, and the removal of a device type check in CopyCounter and an assertion in get_numel_loaded, which could lead to incorrect tracking of loaded elements and premature weight processing.
Force-pushed 1564b60 to 15db502
```python
    tensor_parallel_size=tp_size,
    enable_expert_parallel=(tp_size > 1 and "DeepSeek" in base_model),
    enable_prefix_caching=False,
) as llm:
```
can we test behavior after the model is loaded and before the first reload happens
One of the tested models, inference-optimization/DeepSeek-V3-debug-empty, is an empty debug model, so it won't produce sane results.
Rather than add a kluge, I'd rather rely on the dedicated online quantization tests, i.e. tests/quantization/test_fp8.py::test_online_quantization. Do you think this is enough coverage?
Could we just skip this for DeepSeek-V3-debug-empty?
I do think we should test for sane output after online quant and before the first reloading call; test_fp8.py does not test anything related to layerwise reloading.
```diff
  # kernel format (device)
- kernel_tensors: LayerTensors = field(default_factory=lambda: ({}, {}))
+ kernel_tensors: LayerTensors | None = None
```
nit: can we make the comment explain when this is none vs specified
vkuzo left a comment
thank you! my comments are optional nits, lg if tests pass
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Force-pushed 4dd2c5b to acf0959
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
```python
logger.info("vLLM model structure:\n%s", format_model_inspection(model))
```

```python
def _has_online_quant(model: nn.Module):
```
This kluge will be removed by neuralmagic#153
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
mgoin left a comment
Changes look nice to me, and I'm fairly certain this can't affect the non-online path, so I feel good about the spurious failures.
```python
    vllm_config=vllm_config, model_config=model_config, prefix=prefix
)

with set_default_torch_dtype(model_config.dtype), target_device:
```
I feel this may not be safe, and I guess it causes the 2 failing cases in Language Models Tests (Extra Standard).
see #38426
@jikunshang I agree, although I'm not 100% sure why. I'm still trying to figure out why, and how to have both.
Revert "[QeRL] Compose online quantization with quantized reloading (vllm-project#38032)": This reverts commit 648edcf.
Purpose
Lifecycle
Online quantization follows a similar lifecycle to reloading:
1. `record_metadata_for_reloading`
2. `restore_layer_on_meta`
3. `initialize_online_processing`
4. `online_process_loader`, which buffers weights until all layer weights are ready
5. `_layerwise_process`
6. `finalize_layerwise_reload`
Changes
- Add `initialize_online_processing` for use by online quantization methods
- Call `initialize_online_processing` on the layer to add layerwise wrappers
- With `online_process_loader`, we can ensure that weight loaders are never wrapped twice during online quantized reloading
- Call `finalize_layerwise_process` in `BaseModelLoader.load_weights`. In the future, this function could fully replace `process_weights_after_loading`
Edge case handling
- `e_score_correction_bias`: ideally, these parameters would be initialized on the meta device just like the weights. However, `quant_method` has no control over these parameters, so we end up loading them twice, leading to a small performance loss
Testing
- `test_online_quantization` and `test_reload_weights`
- `test_online_quantize_reload`
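The buffering behavior described for `online_process_loader` in the lifecycle above can be sketched as follows. This is a minimal illustration of the idea, not vLLM's actual interface: the wrapper function, its signature, and the tag attribute are all assumptions.

```python
from typing import Any, Callable


def make_buffering_loader(
    expected_names: set[str],
    process_layer: Callable[[dict[str, Any]], None],
) -> Callable[[str, Any], None]:
    """Wrap weight loading so layer processing runs only when complete."""
    buffered: dict[str, Any] = {}

    def loader(name: str, weight: Any) -> None:
        buffered[name] = weight
        # Run the layerwise processing step (e.g. FP8 quantization) only
        # once every expected weight for this layer has been buffered.
        if expected_names <= buffered.keys():
            process_layer(dict(buffered))

    # Tag the wrapper so reloading can detect it and avoid wrapping twice,
    # mirroring the "never wrapped twice" guarantee described above.
    loader._is_online_process_loader = True  # type: ignore[attr-defined]
    return loader
```

Checking for the tag before wrapping is one simple way to keep reloading idempotent with respect to loader wrapping.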