fp8.py online quant: reuse layerwise reloading infra, take 3 #34332

Closed

vkuzo wants to merge 1 commit into vllm-project:main from vkuzo:20260211_layerwise_reuse_v3

Conversation

@vkuzo (Contributor) commented Feb 11, 2026

Summary:

Copy of #34184

TODO write me up

Test Plan: TODO

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the online quantization for FP8 to reuse the generic layerwise reloading infrastructure. This simplifies the code in fp8.py by removing custom patched_weight_loader implementations and centralizes the loading logic. The changes to the layerwise loading infrastructure to support both initial loading and reloading are well-implemented.

However, I've found a high-severity issue where Fp8OnlineLinearMethod seems to have been only partially refactored. The process_weights_after_loading method for this class was not updated to reflect the new loading mechanism, unlike its MoE counterpart, which will lead to incorrect behavior on weight reloading.
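For context, the "meta device + just-in-time materialization" pattern this refactor centralizes can be sketched as follows. This is a minimal hypothetical illustration, not vLLM's actual API: the helper name `materialize_and_load` and its signature are invented. The idea is that parameters are created on the `meta` device (shape and dtype only, no storage) and only get real storage when their checkpoint shard is loaded.

```python
import torch

def materialize_and_load(param: torch.nn.Parameter, loaded: torch.Tensor,
                         device: str = "cpu") -> torch.nn.Parameter:
    """Hypothetical sketch: give a meta-device parameter real storage on
    first load, then copy the checkpoint shard into it."""
    if param.device.type == "meta":
        # Allocate real storage just in time.
        param = torch.nn.Parameter(
            torch.empty(param.shape, dtype=param.dtype, device=device),
            requires_grad=False,
        )
    param.data.copy_(loaded.to(param.dtype))
    return param

# Parameter starts on the meta device: shape/dtype known, no memory allocated.
weight = torch.nn.Parameter(
    torch.empty(4, 8, device="meta", dtype=torch.float16),
    requires_grad=False,
)
shard = torch.randn(4, 8)
weight = materialize_and_load(weight, shard)
print(weight.device.type)  # cpu
```

A reload simply calls the same loader again; since the parameter is no longer on `meta`, the copy overwrites the existing storage in place.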

Comment on lines 547 to 557
```diff
 weight = ModelWeightParameter(
     data=torch.empty(
         output_size_per_partition,
         input_size_per_partition,
         # materialized just-in-time in `patched_weight_loader`
         device="meta",
         dtype=params_dtype,
     ),
     input_dim=1,
     output_dim=0,
-    weight_loader=patched_weight_loader,
+    weight_loader=weight_loader,
 )
```

high

While replacing patched_weight_loader with the generic weight_loader is a good step toward reusing the layerwise loading infrastructure, the corresponding Fp8OnlineLinearMethod.process_weights_after_loading method has not been updated. It still contains logic for deferred initialization and a re-entry guard (_already_called_process_weights_after_loading), both of which are now either handled by, or incompatible with, the new layerwise loader. This will cause issues, for example on weight reloading. This method should be refactored along the lines of Fp8OnlineMoEMethod.process_weights_after_loading to ensure correctness and consistency.
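To illustrate why a one-shot re-entry guard is incompatible with reloading, here is a hypothetical toy comparison (class names are invented, not the actual Fp8OnlineLinearMethod code): the guarded variant silently skips post-processing on the second load, while the reload-safe variant reprocesses the freshly loaded weights every time.

```python
class GuardedMethod:
    """Hypothetical: mimics a one-shot re-entry guard."""
    def __init__(self) -> None:
        self._already_called = False
        self.processed_count = 0

    def process_weights_after_loading(self) -> None:
        if self._already_called:
            return  # a reload is silently skipped
        self._already_called = True
        self.processed_count += 1

class ReloadSafeMethod:
    """Hypothetical: reprocesses on every load, so reloads take effect."""
    def __init__(self) -> None:
        self.processed_count = 0

    def process_weights_after_loading(self) -> None:
        # Idempotent per load: e.g. requantize from the freshly loaded
        # weights each time.
        self.processed_count += 1

g, r = GuardedMethod(), ReloadSafeMethod()
for _ in range(2):  # initial load + one reload
    g.process_weights_after_loading()
    r.process_weights_after_loading()
print(g.processed_count, r.processed_count)  # 1 2
```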

Summary:

Copy of vllm-project#34184

Test Plan: TODO

Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>
@vkuzo vkuzo force-pushed the 20260211_layerwise_reuse_v3 branch from 2e77014 to 39c805d Compare February 11, 2026 19:01
@vkuzo (Author) commented Feb 11, 2026

Abandoning for now, since additional complexity is needed here to handle tied weights: in the current code, if B is tied to A, materializing B overrides the already-loaded weight of A.
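The tied-weights hazard can be shown with a minimal PyTorch sketch (illustrative only, independent of the vLLM loader): when two parameters share storage, materializing or loading the second one writes through the shared storage and clobbers the first one's already-loaded values.

```python
import torch

# A and B are tied: they share the same underlying storage.
a = torch.nn.Parameter(torch.zeros(4), requires_grad=False)
b = a

a.data.copy_(torch.ones(4))           # A is loaded first
b.data.copy_(torch.full((4,), 2.0))   # materializing/loading B later...

print(a[0].item())  # 2.0 -- A's loaded value was overwritten via B
```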

@vkuzo (Author) commented Mar 3, 2026

closing in favor of #33814

@vkuzo vkuzo closed this Mar 3, 2026