fp8.py online quant: reuse layerwise reloading infra, take 3 #34332
vkuzo wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request refactors the online quantization for FP8 to reuse the generic layerwise reloading infrastructure. This simplifies the code in fp8.py by removing custom patched_weight_loader implementations and centralizes the loading logic. The changes to the layerwise loading infrastructure to support both initial loading and reloading are well-implemented.
However, I've found a high-severity issue where Fp8OnlineLinearMethod seems to have been only partially refactored. The process_weights_after_loading method for this class was not updated to reflect the new loading mechanism, unlike its MoE counterpart, which will lead to incorrect behavior on weight reloading.
```diff
 weight = ModelWeightParameter(
     data=torch.empty(
         output_size_per_partition,
         input_size_per_partition,
         # materialized just-in-time in `patched_weight_loader`
         device="meta",
         dtype=params_dtype,
     ),
     input_dim=1,
     output_dim=0,
-    weight_loader=patched_weight_loader,
+    weight_loader=weight_loader,
 )
```
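The diff above relies on the meta-device pattern: the parameter is created with a shape but no real storage, and storage is allocated just-in-time when the weight loader first runs. The following is a minimal pure-Python sketch of that pattern; the names `MetaParam` and `weight_loader` are illustrative and not vLLM's actual API.

```python
# Sketch of deferred materialization: the parameter records only its shape
# at construction time (analogous to device="meta"), and the weight loader
# allocates real storage on first load before copying checkpoint data in.

class MetaParam:
    """Placeholder parameter: shape known, storage not yet allocated."""

    def __init__(self, shape):
        self.shape = shape
        self.data = None  # no real storage yet

    @property
    def is_materialized(self):
        return self.data is not None


def weight_loader(param, loaded_weight):
    """Materialize the parameter lazily, then copy the loaded weight in."""
    if not param.is_materialized:
        # allocate real storage just-in-time, matching the recorded shape
        rows, cols = param.shape
        param.data = [[0.0] * cols for _ in range(rows)]
    # copy loaded values into the (now real) storage
    for i, row in enumerate(loaded_weight):
        param.data[i][:] = row


w = MetaParam((2, 3))
weight_loader(w, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(w.is_materialized)  # True
print(w.data[1][2])       # 6.0
```

Because the loader is the single point where storage appears, the same code path can serve both the initial load and a later reload, which is what lets the generic `weight_loader` replace the custom `patched_weight_loader`.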
While replacing patched_weight_loader with the generic weight_loader is a good step towards reusing the layerwise loading infrastructure, the corresponding Fp8OnlineLinearMethod.process_weights_after_loading method has not been updated. It still contains logic for deferred initialization and a re-entry guard (_already_called_process_weights_after_loading), both of which are now either handled by or incompatible with the new layerwise loader. This will cause issues, for example during weight reloading. This method should be refactored along the lines of Fp8OnlineMoEMethod.process_weights_after_loading to ensure correctness and consistency.
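The hazard the review describes can be sketched in a few lines. The class and field names below are illustrative (only `_already_called_process_weights_after_loading` comes from the review above, shortened here), and the scale formula is a stand-in, not vLLM's actual quantization code:

```python
# Why a one-shot re-entry guard conflicts with weight reloading: with the
# guard, the second call (after a reload) silently skips requantization,
# leaving a stale scale; an idempotent version recomputes from the current
# weights on every call.

class GuardedMethod:
    def __init__(self):
        self._already_called = False
        self.scale = None

    def process_weights_after_loading(self, weight_max):
        if self._already_called:
            return  # reloaded weights never get a fresh scale
        self._already_called = True
        self.scale = weight_max / 448.0  # 448 = fp8 e4m3 max, for illustration


class IdempotentMethod:
    def __init__(self):
        self.scale = None

    def process_weights_after_loading(self, weight_max):
        # safe to call after every (re)load: state derives only from inputs
        self.scale = weight_max / 448.0


g, i = GuardedMethod(), IdempotentMethod()
for m in (g, i):
    m.process_weights_after_loading(448.0)  # initial load
    m.process_weights_after_loading(224.0)  # reload with smaller weights
print(g.scale)  # 1.0  (stale: guard skipped the second call)
print(i.scale)  # 0.5  (recomputed for the reloaded weights)
```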
Summary: Copy of vllm-project#34184
Test Plan: TODO
Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>
Force-pushed from 2e77014 to 39c805d
abandoning for now, since additional complexity is needed here to handle tied weights: in the current code, if B is tied to A, materializing B overrides the already-loaded weight of A.
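The tied-weight hazard above has a simple shape. A pure-Python sketch, with a hypothetical `params` dict standing in for the model's parameter storage:

```python
# If parameter B is tied to A (shared storage), naively "materializing" B
# by writing into it in place clobbers A's already-loaded values.

params = {}
params["A"] = [1.0, 2.0, 3.0]   # A already loaded from the checkpoint
params["B"] = params["A"]       # B tied to A: same underlying storage

# naive in-place materialization of B overwrites the shared storage
params["B"][:] = [0.0, 0.0, 0.0]
print(params["A"])  # [0.0, 0.0, 0.0] -- A's loaded weights are lost
```

Handling this correctly requires the loader to detect aliased parameters and either skip re-materialization or preserve the tie, which is the extra complexity the comment refers to.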
closing in favor of #33814
Summary:
Copy of #34184
TODO write me up
Test Plan: TODO