
refactor fp8.py online quant weight loading to use layerwise reload utils#33814

Closed
vkuzo wants to merge 1 commit intovllm-project:mainfrom
vkuzo:20260204_fp8_online_use_layerwise
Conversation

Contributor

@vkuzo vkuzo commented Feb 4, 2026

Summary:

Moves fp8.py's online quantization to be more consistent with the QERL abstractions introduced in #32133. The main benefit is the removal of custom logic in fp8.py in favor of a more generalized and composable path. Peak memory usage is unchanged by this PR. At a high level, the new fp8.py streaming weight loading works as follows:

  1. A layer's create_weights method can create weights on the meta device to opt in to saving memory with streaming weight loading + quantization
  2. In a ModelLoader's load_model function, a new API can be called to turn on streaming weight loading:
```python
if use_layerwise_loading:
    # wrap the weight loaders
    initialize_layerwise_reload(model, is_reload=False, ...)
    # load weights; `process_weights_after_loading` is called just-in-time
    # as weights are loaded
    self.load_weights(model, model_config)
    # call `process_weights_after_loading` for any layer where JIT
    # processing did not happen
    finalize_layerwise_reload(model, is_reload=False, ...)
else:
    # simple weight loading without layerwise processing
    self.load_weights(model, model_config)
    process_weights_after_loading(model, model_config, target_device)
```
  3. The abstractions introduced in QERL (#32133) are modified to introduce a simpler initial-loading path which skips kernel format handling and CUDA-graph-related tensor movement entirely, since neither is relevant to the initial model load. High level differences between the reloading and initial-loading paths:
  • reloading path (#32133) high level flow
    • capture every layer's tensors (in kernel format) to kernel_tensors
    • move every layer to meta device
    • wrap every layer's weight loaders
    • layerwise load the weights
      • for each weight, whenever all shards are there, then
        • materialize from meta to GPU (different tensor)
        • call _layerwise_process (converts to kernel format)
        • copy data into kernel_tensors
        • delete the newly created GPU tensors
    • call finalize to take care of any stragglers
  • loading path (this PR) high level flow
    • wrap every layer's weight loaders
    • layerwise load the weights
      • for each weight, whenever all shards are there, then
        • materialize any weights that are still on the meta device. This way, layers can opt in to saving memory with streaming weight loading by initializing their weights on the meta device.
        • call _layerwise_process (converts to kernel format)
    • call finalize to take care of any stragglers
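The just-in-time step shared by both flows above can be sketched as a torch-free toy. All names below (`Param`, `wrap_weight_loader`, `materialize`, `layerwise_process`) are illustrative stand-ins, not the actual vLLM APIs: a wrapped weight loader counts loaded elements, and once every shard of a weight has arrived it materializes the weight (a no-op unless the layer opted in via the meta device) and runs the layerwise processing step.

```python
# Toy, torch-free sketch of shard accounting + JIT processing.
# Names are illustrative, not the real vLLM implementation.

class Param:
    def __init__(self, name, numel):
        self.name = name
        self.numel = numel   # total elements expected across all shards
        self.on_meta = True  # starts un-materialized ("meta" device)
        self.data = None

events = []

def materialize(p):
    # only materialize params that opted in by starting on "meta"
    if p.on_meta:
        p.data = [0.0] * p.numel
        p.on_meta = False
        events.append(f"materialize:{p.name}")

def layerwise_process(p):
    # stand-in for process_weights_after_loading (convert to kernel format)
    events.append(f"process:{p.name}")

def wrap_weight_loader(p, loaded_so_far):
    def loader(shard_numel):
        loaded_so_far[p.name] += shard_numel
        if loaded_so_far[p.name] >= p.numel:
            materialize(p)        # no-op if already materialized
            layerwise_process(p)
    return loader

w = Param("w13_weight", 8)
counts = {w.name: 0}
load = wrap_weight_loader(w, counts)
load(4)        # first shard: nothing happens yet
load(4)        # all shards present: materialize + process just-in-time
print(events)  # -> ['materialize:w13_weight', 'process:w13_weight']
```

Because `materialize` checks `on_meta` first, repeated calls are harmless, which mirrors how a layer that was never placed on the meta device is simply skipped by the materialization step.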

Test Plan:

```bash
# unit and integration tests, including peak memory after weight loading
pytest tests/quantization/test_fp8.py -s

# dense
VLLM_LOGGING_LEVEL=DEBUG python3 examples/basic/offline_inference/generate.py --model facebook/opt-125m --enforce-eager --dtype=bfloat16 --max_model_len=2048 --quantization=fp8

# MoE
VLLM_LOGGING_LEVEL=DEBUG python3 examples/basic/offline_inference/generate.py --model Qwen/Qwen3-30B-A3B --enforce-eager --dtype=bfloat16 --block-size=64 --max_model_len=2048 --gpu-memory-utilization=0.8 --trust-remote-code --quantization=fp8

# MoE with TP on
CUDA_VISIBLE_DEVICES=0,1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/basic/offline_inference/generate.py --model Qwen/Qwen3-30B-A3B --enforce-eager --dtype=bfloat16 --block-size=64 --max_model_len=2048 --gpu-memory-utilization=0.8 --trust-remote-code --quantization=fp8 -tp 2
```


@gemini-code-assist bot left a comment

Code Review

This pull request introduces an experimental implementation for layer-wise reloading for FP8 online quantization. The changes are clearly a work-in-progress, with several temporary code blocks, hardcoded flags, and debug statements. My review focuses on identifying these temporary elements and suggesting their removal or proper implementation for the final version. Key areas of feedback include removing dead code under if False blocks, replacing hardcoded feature flags with configuration options, and removing debug print/log statements. These changes are crucial for making the code production-ready.

@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch 4 times, most recently from 50273e5 to 2b64e12 Compare February 6, 2026 19:22
@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch 5 times, most recently from ddfd954 to 1a1b04a Compare February 12, 2026 11:50
@vkuzo vkuzo changed the title [wip] explore using layerwise reloading utils for fp8 online quant refactor fp8.py online quant weight loading to use layerwise reload utils Feb 12, 2026
@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from 1a1b04a to 1fe187b Compare February 12, 2026 12:11
@kylesayrs left a comment

I think these changes look great! I think the documentation makes it clear how and where the two flows differ. Just small nits/cleanups from my side

setattr(layer, name, materialize_meta_tensor(tensor))


def materialize_layer_tensors_with_device_meta(layer: torch.nn.Module) -> None:
Contributor

Do you want to have this implementation just replace materialize_layer? It should be safe to do so, as the assumption of materialize_layer is that it should only be relevant for meta tensors.

"""
if is_reload:
# Materialize layer tensors onto device
materialize_layer(layer)
Contributor

I think that, with your if device == "meta" guard, it should be safe to call this function in all cases, right? That would ensure that the entire layer is materialized at this point, including any scales, etc.

Contributor Author

I think it's better to be explicit, materialization is not relevant to the initial load path in this section of the code, so simpler to just skip it.

num_loaded, ret = get_numel_loaded(original_loader, bound_args)

else:
if info.load_numel == 0:
Contributor

I think, theoretically, you don't need this check, as the if device == "meta" check guards against double materialization anyways, right?

Contributor Author

I think it's easier to understand this way
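The guard being discussed in this thread can be sketched in isolation as a torch-free toy (illustrative names, not vLLM code): because materialization checks the device first, calling it twice is a no-op, which is why the reviewer argues the extra load_numel check is theoretically redundant, while the author prefers keeping it explicit for readability.

```python
# Toy model of the `device == "meta"` guard (not actual vLLM code).
# Tensors are modeled as plain dicts with a "device" field.

def materialize_if_meta(tensor):
    if tensor["device"] == "meta":
        tensor["device"] = "cuda"
        tensor["data"] = [0.0] * tensor["numel"]
        return True   # materialized on this call
    return False      # already materialized: guarded no-op

t = {"device": "meta", "numel": 4, "data": None}
assert materialize_if_meta(t) is True    # first call materializes
assert materialize_if_meta(t) is False   # second call is a no-op
```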

@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from 1fe187b to 1f57fa3 Compare February 20, 2026 17:15
@vkuzo vkuzo marked this pull request as ready for review February 20, 2026 17:16

mergify bot commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vkuzo.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 24, 2026
@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from 1f57fa3 to cfef389 Compare February 24, 2026 18:02
@mergify mergify bot removed the needs-rebase label Feb 24, 2026

mergify bot commented Feb 24, 2026

Hi @vkuzo, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from cfef389 to 1a4c59d Compare February 24, 2026 18:18
)
layer.register_parameter("w13_bias", w13_bias)
set_weight_attrs(w13_bias, orig_extra_weight_attrs)
set_weight_attrs(w13_bias, extra_weight_attrs)
Contributor Author

Need to verify GPT-OSS 120B still works, as this changes the code added by #34906 and there is no CI coverage.

Contributor Author

following up on this, GPT-OSS bf16 is not expected to work with fp8.py online quant because:

  1. fp8.py online quant (and future online quant backends in vllm) require weight_loaders, because we use weight_loaders to inject the streaming weight loading functionality
  2. gpt_oss.py model definition for the bf16 weights case does not use weight loaders:
    param.copy_(narrow_weight)

I'm not exactly sure how #34906 worked given (1) and (2) above. Going to skip this for now, as gpt-oss + online quant seems low priority because the official weights are in mxfp4; we can follow up if needed.

for posterity, the easiest way to test this is using the 20b model from unsloth which goes through the same path as the 120b:

VLLM_ENABLE_V1_MULTIPROCESSING=0 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/basic/chat.py --model unsloth/gpt-oss-20b-BF16 --enforce-eager --dtype=bfloat16 --quantization=fp8
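The incompatibility in (1) and (2) can be shown with a tiny torch-free sketch (illustrative names only, not vLLM or gpt_oss code): streaming loading works by wrapping each parameter's weight_loader, so a model that writes parameter data directly, as the gpt_oss bf16 path does with param.copy_, never triggers the wrapper.

```python
# Toy illustration of why weight_loader injection cannot observe direct
# parameter writes. Names are illustrative, not actual vLLM/gpt_oss code.

calls = []

def default_weight_loader(param, w):
    param["data"] = w

def streaming_wrapper(loader):
    def wrapped(param, w):
        calls.append("wrapper saw load")  # hook point for JIT quantization
        loader(param, w)
    return wrapped

param = {"data": None, "weight_loader": default_weight_loader}
# streaming infra replaces the loader with a wrapped version
param["weight_loader"] = streaming_wrapper(param["weight_loader"])

# model that uses weight loaders: the hook fires
param["weight_loader"](param, [1, 2, 3])

# gpt_oss-style bf16 path: writes data directly, bypassing the hook
# (analogous to param.copy_(narrow_weight))
param["data"] = [4, 5, 6]

assert calls == ["wrapper saw load"]  # only the first load was observed
```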

@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 24, 2026
@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from 1a4c59d to 0d77ac9 Compare February 27, 2026 20:57
# Note: this is currently broken for gpt-oss because it
# does not use weight loaders at all in the bf16 weights
# path
device="meta",
Contributor Author

gpt-oss bf16 is broken whether biases are initialized on GPU or on meta; going with meta to be consistent with the layerwise loading infra.

If we want gpt-oss to work with fp8.py, we should refactor gpt_oss.py to use weight loaders.

Contributor Author

vkuzo commented Mar 6, 2026

rebased on latest main


mergify bot commented Mar 6, 2026

Hi @vkuzo, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch 2 times, most recently from cbd5460 to de1ed65 Compare March 9, 2026 15:21
Contributor Author

vkuzo commented Mar 9, 2026

rebased on latest main

Summary:

WIP, for now just getting a POC to see what is needed for the real
version.

Test Plan:

```bash
# example with facebook/opt-125m
VLLM_LOGGING_LEVEL=DEBUG VLLM_ENABLE_V1_MULTIPROCESSING=0 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --enforce-eager --dtype=bfloat16 --max_model_len=2048 --quantization=fp8

# before:
DEBUG 02-04 18:49:50 [model_executor/model_loader/base_loader.py:66] Peak GPU memory after loading weights: 0.18 GiB

# after:
DEBUG 02-04 18:49:08 [model_executor/model_loader/base_loader.py:83] Peak GPU memory after loading weights: 0.25 GiB
```

Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>

mergify bot commented Mar 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vkuzo.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 24, 2026
@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from de1ed65 to 4341ba1 Compare March 24, 2026 14:57
Contributor Author

vkuzo commented Mar 24, 2026

rebased and re-ran the test plan on 2xB200

Contributor Author

vkuzo commented Mar 25, 2026

After a conversation with @kylesayrs, abandoning in favor of #38032, which we expect will have an easier time passing PR review.

@vkuzo vkuzo closed this Mar 25, 2026

Labels

quantization, ready (ONLY add when PR is ready to merge/full CI is needed)

4 participants