fp8.py online quant: reuse layerwise reloading infra, take 3 #34332

Closed

vkuzo wants to merge 1 commit into vllm-project:main from vkuzo:20260211_layerwise_reuse_v3

Conversation

@vkuzo (Contributor) commented Feb 11, 2026

Summary:

Copy of #34184

TODO write me up

Test Plan: TODO

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the online quantization for FP8 to reuse the generic layerwise reloading infrastructure. This simplifies the code in fp8.py by removing custom patched_weight_loader implementations and centralizes the loading logic. The changes to the layerwise loading infrastructure to support both initial loading and reloading are well-implemented.

However, I've found a high-severity issue where Fp8OnlineLinearMethod seems to have been only partially refactored. The process_weights_after_loading method for this class was not updated to reflect the new loading mechanism, unlike its MoE counterpart, which will lead to incorrect behavior on weight reloading.
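For context, the "meta device + just-in-time materialization" pattern this refactor centralizes can be sketched as follows. This is a minimal hypothetical illustration, not vLLM's actual API: the helper name `materialize_and_load` and its signature are invented. The idea is that parameters are created on the `meta` device (shape and dtype only, no storage) and only get real storage when their checkpoint shard is loaded.

```python
import torch

def materialize_and_load(param: torch.nn.Parameter, loaded: torch.Tensor,
                         device: str = "cpu") -> torch.nn.Parameter:
    """Hypothetical sketch: give a meta-device parameter real storage on
    first load, then copy the checkpoint shard into it."""
    if param.device.type == "meta":
        # Allocate real storage just in time.
        param = torch.nn.Parameter(
            torch.empty(param.shape, dtype=param.dtype, device=device),
            requires_grad=False,
        )
    param.data.copy_(loaded.to(param.dtype))
    return param

# Parameter starts on the meta device: shape/dtype known, no memory allocated.
weight = torch.nn.Parameter(
    torch.empty(4, 8, device="meta", dtype=torch.float16),
    requires_grad=False,
)
shard = torch.randn(4, 8)
weight = materialize_and_load(weight, shard)
print(weight.device.type)  # cpu
```

A reload simply calls the same loader again; since the parameter is no longer on `meta`, the copy overwrites the existing storage in place.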

Comment on lines 547 to 557
```diff
 weight = ModelWeightParameter(
     data=torch.empty(
         output_size_per_partition,
         input_size_per_partition,
         # materialized just-in-time in `patched_weight_loader`
         device="meta",
         dtype=params_dtype,
     ),
     input_dim=1,
     output_dim=0,
-    weight_loader=patched_weight_loader,
+    weight_loader=weight_loader,
 )
```

high

While replacing patched_weight_loader with the generic weight_loader is a good step toward reusing the layerwise loading infrastructure, the corresponding Fp8OnlineLinearMethod.process_weights_after_loading method has not been updated. It still contains logic for deferred initialization and a re-entry guard (_already_called_process_weights_after_loading), both of which are now either handled by, or incompatible with, the new layerwise loader. This will cause issues, for example on weight reloading. This method should be refactored along the lines of Fp8OnlineMoEMethod.process_weights_after_loading to ensure correctness and consistency.
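To illustrate why a one-shot re-entry guard is incompatible with reloading, here is a hypothetical toy comparison (class names are invented, not the actual Fp8OnlineLinearMethod code): the guarded variant silently skips post-processing on the second load, while the reload-safe variant reprocesses the freshly loaded weights every time.

```python
class GuardedMethod:
    """Hypothetical: mimics a one-shot re-entry guard."""
    def __init__(self) -> None:
        self._already_called = False
        self.processed_count = 0

    def process_weights_after_loading(self) -> None:
        if self._already_called:
            return  # a reload is silently skipped
        self._already_called = True
        self.processed_count += 1

class ReloadSafeMethod:
    """Hypothetical: reprocesses on every load, so reloads take effect."""
    def __init__(self) -> None:
        self.processed_count = 0

    def process_weights_after_loading(self) -> None:
        # Idempotent per load: e.g. requantize from the freshly loaded
        # weights each time.
        self.processed_count += 1

g, r = GuardedMethod(), ReloadSafeMethod()
for _ in range(2):  # initial load + one reload
    g.process_weights_after_loading()
    r.process_weights_after_loading()
print(g.processed_count, r.processed_count)  # 1 2
```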

Summary:

Copy of vllm-project#34184

Test Plan: TODO

Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>
@vkuzo vkuzo force-pushed the 20260211_layerwise_reuse_v3 branch from 2e77014 to 39c805d Compare February 11, 2026 19:01
@vkuzo (Author) commented Feb 11, 2026

Abandoning for now, since additional complexity is needed here to handle tied weights: in the current code, if B is tied to A, materializing B overrides the already-loaded weight of A.
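The tied-weights hazard can be shown with a minimal PyTorch sketch (illustrative only, independent of the vLLM loader): when two parameters share storage, materializing or loading the second one writes through the shared storage and clobbers the first one's already-loaded values.

```python
import torch

# A and B are tied: they share the same underlying storage.
a = torch.nn.Parameter(torch.zeros(4), requires_grad=False)
b = a

a.data.copy_(torch.ones(4))           # A is loaded first
b.data.copy_(torch.full((4,), 2.0))   # materializing/loading B later...

print(a[0].item())  # 2.0 -- A's loaded value was overwritten via B
```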

@vkuzo (Author) commented Mar 3, 2026

closing in favor of #33814

@vkuzo vkuzo closed this Mar 3, 2026