
refactor fp8.py online quant weight loading to use layerwise reload utils#33814

Closed
vkuzo wants to merge 1 commit intovllm-project:mainfrom
vkuzo:20260204_fp8_online_use_layerwise
Conversation

Contributor

@vkuzo vkuzo commented Feb 4, 2026

Summary:

Moves fp8.py's online quantization to be more consistent with the QERL abstractions introduced in #32133. The main benefit is the removal of custom logic in fp8.py in favor of a more generalized and composable path. Peak memory usage is unchanged by this PR. At a high level, the new fp8.py streaming weight loading works as follows:

  1. A layer's create_weights method can create weights on the meta device to opt in to saving memory with streaming weight loading + quantization
  2. In a ModelLoader's load_model function, a new API can be called to turn on streaming weight loading:
```python
if use_layerwise_loading:
    # wrap the weight loaders
    initialize_layerwise_reload(model, is_reload=False, ...)
    # load weights; `process_weights_after_loading` is called just-in-time
    # as weights are loaded
    self.load_weights(model, model_config)
    # call `process_weights_after_loading` for any layer where JIT
    # processing did not happen
    finalize_layerwise_reload(model, is_reload=False, ...)
else:
    # simple weight loading without layerwise processing
    self.load_weights(model, model_config)
    process_weights_after_loading(model, model_config, target_device)
```
  3. The abstractions introduced in QERL (#32133) are modified to introduce a simpler initial-loading path which skips kernel format handling and CUDA-graph-related tensor movement entirely, since neither is relevant to the initial model load. High level differences between the reloading and initial-loading paths:
  • reloading path (#32133) high level flow
    • capture every layer's tensors (in kernel format) to kernel_tensors
    • move every layer to meta device
    • wrap every layer's weight loaders
    • layerwise load the weights
      • for each weight, whenever all shards are there, then
        • materialize from meta to GPU (different tensor)
        • call _layerwise_process (converts to kernel format)
        • copy data into kernel_tensors
        • delete the newly created GPU tensors
    • call finalize to take care of any stragglers
  • loading path (this PR) high level flow
    • wrap every layer's weight loaders
    • layerwise load the weights
      • for each weight, whenever all shards are there, then
        • materialize any weights that are still on the meta device. This way, layers can opt in to saving memory with streaming weight loading by initializing their weights on the meta device.
        • call _layerwise_process (converts to kernel format)
    • call finalize to take care of any stragglers
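The just-in-time step shared by both flows above can be sketched as a torch-free toy. All names below (`Param`, `wrap_weight_loader`, `materialize`, `layerwise_process`) are illustrative stand-ins, not the actual vLLM APIs: a wrapped weight loader counts loaded elements, and once every shard of a weight has arrived it materializes the weight (a no-op unless the layer opted in via the meta device) and runs the layerwise processing step.

```python
# Toy, torch-free sketch of shard accounting + JIT processing.
# Names are illustrative, not the real vLLM implementation.

class Param:
    def __init__(self, name, numel):
        self.name = name
        self.numel = numel   # total elements expected across all shards
        self.on_meta = True  # starts un-materialized ("meta" device)
        self.data = None

events = []

def materialize(p):
    # only materialize params that opted in by starting on "meta"
    if p.on_meta:
        p.data = [0.0] * p.numel
        p.on_meta = False
        events.append(f"materialize:{p.name}")

def layerwise_process(p):
    # stand-in for process_weights_after_loading (convert to kernel format)
    events.append(f"process:{p.name}")

def wrap_weight_loader(p, loaded_so_far):
    def loader(shard_numel):
        loaded_so_far[p.name] += shard_numel
        if loaded_so_far[p.name] >= p.numel:
            materialize(p)        # no-op if already materialized
            layerwise_process(p)
    return loader

w = Param("w13_weight", 8)
counts = {w.name: 0}
load = wrap_weight_loader(w, counts)
load(4)        # first shard: nothing happens yet
load(4)        # all shards present: materialize + process just-in-time
print(events)  # -> ['materialize:w13_weight', 'process:w13_weight']
```

Because `materialize` checks `on_meta` first, repeated calls are harmless, which mirrors how a layer that was never placed on the meta device is simply skipped by the materialization step.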

Test Plan:

```bash
# unit and integration tests, including peak memory after weight loading
pytest tests/quantization/test_fp8.py -s

# dense
VLLM_LOGGING_LEVEL=DEBUG python3 examples/basic/offline_inference/generate.py --model facebook/opt-125m --enforce-eager --dtype=bfloat16 --max_model_len=2048 --quantization=fp8

# MoE
VLLM_LOGGING_LEVEL=DEBUG python3 examples/basic/offline_inference/generate.py --model Qwen/Qwen3-30B-A3B --enforce-eager --dtype=bfloat16 --block-size=64 --max_model_len=2048 --gpu-memory-utilization=0.8 --trust-remote-code --quantization=fp8

# MoE with TP on
CUDA_VISIBLE_DEVICES=0,1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/basic/offline_inference/generate.py --model Qwen/Qwen3-30B-A3B --enforce-eager --dtype=bfloat16 --block-size=64 --max_model_len=2048 --gpu-memory-utilization=0.8 --trust-remote-code --quantization=fp8 -tp 2
```


@gemini-code-assist bot left a comment

Code Review

This pull request introduces an experimental implementation for layer-wise reloading for FP8 online quantization. The changes are clearly a work-in-progress, with several temporary code blocks, hardcoded flags, and debug statements. My review focuses on identifying these temporary elements and suggesting their removal or proper implementation for the final version. Key areas of feedback include removing dead code under if False blocks, replacing hardcoded feature flags with configuration options, and removing debug print/log statements. These changes are crucial for making the code production-ready.

@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch 4 times, most recently from 50273e5 to 2b64e12 Compare February 6, 2026 19:22
@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch 5 times, most recently from ddfd954 to 1a1b04a Compare February 12, 2026 11:50
@vkuzo vkuzo changed the title [wip] explore using layerwise reloading utils for fp8 online quant refactor fp8.py online quant weight loading to use layerwise reload utils Feb 12, 2026
@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from 1a1b04a to 1fe187b Compare February 12, 2026 12:11
@kylesayrs left a comment

I think these changes look great! I think the documentation makes it clear how and where the two flows differ. Just small nits/cleanups from my side

setattr(layer, name, materialize_meta_tensor(tensor))


def materialize_layer_tensors_with_device_meta(layer: torch.nn.Module) -> None:
Contributor

Do you want to have this implementation just replace materialize_layer? It should be safe to do so, as the assumption of materialize_layer is that it should only be relevant for meta tensors.

"""
if is_reload:
# Materialize layer tensors onto device
materialize_layer(layer)
Contributor

I think that, with your if device == "meta" guard, it should be safe to call this function in all cases, right? That would ensure that the entire layer is materialized at this point, including any scales, etc.

Contributor Author

I think it's better to be explicit, materialization is not relevant to the initial load path in this section of the code, so simpler to just skip it.

num_loaded, ret = get_numel_loaded(original_loader, bound_args)

else:
if info.load_numel == 0:
Contributor

I think, theoretically, you don't need this check, as the if device == "meta" check guards against double materialization anyways, right?

Contributor Author

I think it's easier to understand this way
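The guard being discussed in this thread can be sketched in isolation as a torch-free toy (illustrative names, not vLLM code): because materialization checks the device first, calling it twice is a no-op, which is why the reviewer argues the extra load_numel check is theoretically redundant, while the author prefers keeping it explicit for readability.

```python
# Toy model of the `device == "meta"` guard (not actual vLLM code).
# Tensors are modeled as plain dicts with a "device" field.

def materialize_if_meta(tensor):
    if tensor["device"] == "meta":
        tensor["device"] = "cuda"
        tensor["data"] = [0.0] * tensor["numel"]
        return True   # materialized on this call
    return False      # already materialized: guarded no-op

t = {"device": "meta", "numel": 4, "data": None}
assert materialize_if_meta(t) is True    # first call materializes
assert materialize_if_meta(t) is False   # second call is a no-op
```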

@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from 1fe187b to 1f57fa3 Compare February 20, 2026 17:15
@vkuzo vkuzo marked this pull request as ready for review February 20, 2026 17:16

mergify bot commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vkuzo.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 24, 2026
@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from 1f57fa3 to cfef389 Compare February 24, 2026 18:02
@mergify mergify bot removed the needs-rebase label Feb 24, 2026

mergify bot commented Feb 24, 2026

Hi @vkuzo, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from cfef389 to 1a4c59d Compare February 24, 2026 18:18
)
layer.register_parameter("w13_bias", w13_bias)
set_weight_attrs(w13_bias, orig_extra_weight_attrs)
set_weight_attrs(w13_bias, extra_weight_attrs)
Contributor Author

Need to verify GPT-OSS 120B still works, as this changes the code added by #34906 and there is no CI coverage.

Contributor Author

following up on this, GPT-OSS bf16 is not expected to work with fp8.py online quant because:

  1. fp8.py online quant (and future online quant backends in vllm) require weight_loaders, because we use weight_loaders to inject the streaming weight loading functionality
  2. gpt_oss.py model definition for the bf16 weights case does not use weight loaders:
    param.copy_(narrow_weight)

I'm not exactly sure how #34906 worked given (1) and (2) above. Going to skip this for now, as gpt-oss + online quant seems low priority because the official weights are in mxfp4; we can follow up if needed.

for posterity, the easiest way to test this is using the 20b model from unsloth which goes through the same path as the 120b:

VLLM_ENABLE_V1_MULTIPROCESSING=0 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/basic/chat.py --model unsloth/gpt-oss-20b-BF16 --enforce-eager --dtype=bfloat16 --quantization=fp8
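The incompatibility in (1) and (2) can be shown with a tiny torch-free sketch (illustrative names only, not vLLM or gpt_oss code): streaming loading works by wrapping each parameter's weight_loader, so a model that writes parameter data directly, as the gpt_oss bf16 path does with param.copy_, never triggers the wrapper.

```python
# Toy illustration of why weight_loader injection cannot observe direct
# parameter writes. Names are illustrative, not actual vLLM/gpt_oss code.

calls = []

def default_weight_loader(param, w):
    param["data"] = w

def streaming_wrapper(loader):
    def wrapped(param, w):
        calls.append("wrapper saw load")  # hook point for JIT quantization
        loader(param, w)
    return wrapped

param = {"data": None, "weight_loader": default_weight_loader}
# streaming infra replaces the loader with a wrapped version
param["weight_loader"] = streaming_wrapper(param["weight_loader"])

# model that uses weight loaders: the hook fires
param["weight_loader"](param, [1, 2, 3])

# gpt_oss-style bf16 path: writes data directly, bypassing the hook
# (analogous to param.copy_(narrow_weight))
param["data"] = [4, 5, 6]

assert calls == ["wrapper saw load"]  # only the first load was observed
```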

@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 24, 2026
@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from 1a4c59d to 0d77ac9 Compare February 27, 2026 20:57
# Note: this is currently broken for gpt-oss because it
# does not use weight loaders at all in the bf16 weights
# path
device="meta",
Contributor Author

gpt-oss bf16 is broken whether biases are initialized on GPU or on meta; going with meta to be consistent with the layerwise loading infra.

If we want gpt-oss to work with fp8.py, we should refactor gpt_oss.py to use weight loaders.

Contributor Author

vkuzo commented Mar 6, 2026

rebased on latest main


mergify bot commented Mar 6, 2026

Hi @vkuzo, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch 2 times, most recently from cbd5460 to de1ed65 Compare March 9, 2026 15:21
Contributor Author

vkuzo commented Mar 9, 2026

rebased on latest main

Summary:

WIP, for now just getting a POC to see what is needed for the real
version.

Test Plan:

```bash
# example with facebook/opt-125m
VLLM_LOGGING_LEVEL=DEBUG VLLM_ENABLE_V1_MULTIPROCESSING=0 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --enforce-eager --dtype=bfloat16 --max_model_len=2048 --quantization=fp8

# before:
DEBUG 02-04 18:49:50 [model_executor/model_loader/base_loader.py:66] Peak GPU memory after loading weights: 0.18 GiB

# after:
DEBUG 02-04 18:49:08 [model_executor/model_loader/base_loader.py:83] Peak GPU memory after loading weights: 0.25 GiB
```

Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>

mergify bot commented Mar 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vkuzo.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 24, 2026
@vkuzo vkuzo force-pushed the 20260204_fp8_online_use_layerwise branch from de1ed65 to 4341ba1 Compare March 24, 2026 14:57
Contributor Author

vkuzo commented Mar 24, 2026

rebased and re-ran the test plan on 2xB200

Contributor Author

vkuzo commented Mar 25, 2026

After a conversation with @kylesayrs, abandoning in favor of #38032, which we expect will have an easier time passing PR review.

@vkuzo vkuzo closed this Mar 25, 2026

Labels

quantization, ready (ONLY add when PR is ready to merge/full CI is needed)

4 participants