
[wip] layerwise loading for fp8.py, take 2 #34020

Closed
vkuzo wants to merge 1 commit into vllm-project:main from vkuzo:20260206_layerwise_v2

Conversation

@vkuzo
Contributor

@vkuzo vkuzo commented Feb 6, 2026

Summary:

Refactor fp8.py's streaming weight loading to be more similar to QERL (#32133).

Because the logic we need during the initial load is significantly simpler than QERL's (we don't care about kernel format, CUDA graphs, etc.), I ended up rewriting the logic in a similar style to QERL with only minimal reuse. The alternative would be to branch in the high-level reloading functions (like #33814); after chatting with @kylesayrs, we decided to rewrite instead of reuse.
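As a rough illustration of the streaming pattern this PR implements (a hedged sketch; the function names below are hypothetical stand-ins, not the actual vLLM API), each layer is quantized as soon as its weights arrive, so the full high-precision checkpoint never resides in memory at once:

```python
# Hypothetical sketch of layerwise streaming load. `load_weights` and
# `quantize_layer` stand in for the real vLLM loader/quantization hooks.
def layerwise_initial_load(layers, load_weights, quantize_layer):
    for layer in layers:
        weights = load_weights(layer)   # stream only this layer's weights
        quantize_layer(layer, weights)  # e.g. bf16 -> fp8, in place
        del weights                     # drop the high-precision copy early
```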

Readiness of current PR:

  • dense - works
  • MoE single node - works
  • MoE with TP and EP multi node - needs testing
  • load_format dummy - broken, needs a fix
  • API design and polish - 80% (some cleanups still to do)

Test Plan:

// dense model single node
VLLM_LOGGING_LEVEL=DEBUG VLLM_ENABLE_V1_MULTIPROCESSING=0 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --enforce-eager --dtype=bfloat16 --max_model_len=2048 --quantization=fp8

// moe model single node
VLLM_LOGGING_LEVEL=DEBUG VLLM_ENABLE_V1_MULTIPROCESSING=0 python3 examples/offline_inference/basic/generate.py --model ibm-granite/granite-3.0-1b-a400m-base --enforce-eager --dtype=bfloat16 --max_model_len=2048 --quantization=fp8

// test cases (currently load_format dummy fails, need to investigate + fix)
pytest tests/quantization/test_fp8.py -s -k

TODO write me


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the layer-wise loading mechanism for online quantization, centralizing the logic from fp8.py into a more general framework within vllm/model_executor/model_loader/reload/. This is a positive architectural change that improves modularity and maintainability.

However, the pull request is clearly a work-in-progress, as indicated by numerous TODO comments for documentation, configuration, and code placement. A significant limitation is the NotImplementedError for handling attention layers (Attention and MLAAttention), which will need to be addressed. All the identified TODOs should be resolved before this PR is considered for merging.

Comment on lines 562 to +563
# materialized just-in-time in `patched_weight_loader`
# TODO(before review): say where exactly this will be materialized
Contributor


high

This comment and TODO are outdated since patched_weight_loader has been removed in this refactoring. The materialization now happens in make_online_initial_load_process_loader within vllm/model_executor/model_loader/reload/layerwise.py. Please remove these lines.

Comment on lines +62 to +63
# TODO(before review): set this from config
is_online_quant = True
Contributor


high

The is_online_quant flag is currently hardcoded. As the TODO suggests, this should be driven by a configuration parameter to allow for dynamic control over the quantization path.
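As one possible shape for that change (a hedged sketch; the config class and field names here are hypothetical stand-ins, not vLLM's actual types), the flag could be derived from the quantization config instead of being hardcoded:

```python
from dataclasses import dataclass


@dataclass
class QuantConfig:
    """Hypothetical stand-in for the real quantization config object."""
    method: str = "fp8"
    checkpoint_is_quantized: bool = False


def is_online_quant(cfg: QuantConfig) -> bool:
    # Online quantization: we quantize at load time because the
    # checkpoint itself is stored in high precision.
    return cfg.method == "fp8" and not cfg.checkpoint_is_quantized
```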

Comment on lines +116 to +118
"""
TODO write me
"""
Contributor


high

The docstring for initialize_layerwise_initial_load is a placeholder. Please add a comprehensive docstring that explains the function's purpose, arguments, and behavior to improve code clarity and maintainability.

Comment on lines +126 to +127
# TODO better place for this?
layer._load_device = target_device
Contributor


high

The TODO comment indicates uncertainty about the placement of layer._load_device. Attaching attributes directly to the layer can sometimes be fragile. Consider if this could be passed more explicitly through the loader context or another mechanism for a cleaner design.

Comment on lines +195 to +196
# TODO(before review): move fp8's one to a common place and use that
class CopyCounter(TorchDispatchMode):
Contributor


high

As the TODO suggests, the CopyCounter class is a general utility. Moving it to a common utility file would improve code organization and reusability.
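A minimal version of such a utility (a sketch built on PyTorch's `TorchDispatchMode`; the actual fp8.py implementation may differ in what it counts) could look like:

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode


class CopyCounter(TorchDispatchMode):
    """Counts aten.copy_ calls issued while the mode is active."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        if func is torch.ops.aten.copy_.default:
            self.count += 1
        # Redispatch below this mode so the op still executes normally.
        return func(*args, **(kwargs or {}))
```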

Comment on lines +330 to +333
def finalize_layerwise_initial_load(model: torch.nn.Module, model_config: ModelConfig):
"""
TODO
"""
Contributor


high

The docstring for finalize_layerwise_initial_load is a placeholder. Please provide a complete docstring explaining its functionality, especially how it handles partially loaded layers and special cases like attention layers.

Comment on lines +406 to +409
def _layerwise_initial_load_process(layer: torch.nn.Module, info: LayerReloadingInfo):
"""
TODO write me
"""
Contributor


high

The docstring for _layerwise_initial_load_process is a placeholder. Please add a proper docstring to explain what this function does.

bound_args.arguments["param"] = current_param

# Cache loaded weights, track loading progress
info.loaded_weights.append((param_name, bound_args))
Contributor Author


need to delete this

@vkuzo vkuzo force-pushed the 20260206_layerwise_v2 branch 3 times, most recently from d618c48 to 9af16a1 on February 9, 2026 13:22
Summary:

TODO write me

Test Plan:

TODO write me

Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com>
@vkuzo vkuzo force-pushed the 20260206_layerwise_v2 branch from 9af16a1 to 7679f4d on February 9, 2026 14:37
# Allocate 2 scales for w1 and w3 respectively.
# They will be combined to a single scale later.
num_experts = layer.num_experts
with layer.w13_weight.device:
Contributor


Does it not make more sense to just call process_after_weight_loading within the with target_device context?

Contributor Author


sure
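For reference (a minimal sketch, not vLLM code): PyTorch device objects act as context managers that set the default device for factory calls, which is what makes wrapping the whole `process_weights_after_loading` call in a single `with target_device:` block viable, instead of scattering per-tensor `with layer.w13_weight.device:` blocks:

```python
import torch

# Inside `with torch.device(dev):`, factory functions such as torch.ones
# allocate on `dev` by default, so per-call `device=` arguments become
# unnecessary. "cpu" is used here just to keep the sketch portable.
with torch.device("cpu"):
    scale = torch.ones(8, dtype=torch.float32)

assert scale.device.type == "cpu"
```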

@@ -0,0 +1,213 @@
# SPDX-License-Identifier: Apache-2.0
Contributor


I don't really understand the reasoning behind duplicating this code. It seems like this logic is essentially the same, and could be slightly modified to skip the kernel_tensors handling by passing an is_reloading flag?

If is_reloading == True, skip all the logic related to replacing kernel tensors, i.e. the following lines:

  1. https://github.com/vllm-project/vllm/pull/33814/changes#diff-7fbdc7b012d399cf0aabe6611f2a2e79d5047d3f2e19a11e35867f024c1cdcdfL96
  2. https://github.com/vllm-project/vllm/pull/33814/changes#diff-7fbdc7b012d399cf0aabe6611f2a2e79d5047d3f2e19a11e35867f024c1cdcdfL217
  3. https://github.com/vllm-project/vllm/pull/33814/changes#diff-7fbdc7b012d399cf0aabe6611f2a2e79d5047d3f2e19a11e35867f024c1cdcdfL235-L243

Contributor Author


logger = init_logger(__name__)

# Global dict storing information used for layerwise loading
INITIAL_LOAD_LAYERWISE_INFO: WeakKeyDictionary[torch.nn.Module, LayerReloadingInfo] = (
Contributor


It really doesn't make sense to me to duplicate this variable. Why can't this be reused?
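For context on the pattern itself (a generic sketch, not the vLLM code under review): keying per-layer loading state on the module via a `WeakKeyDictionary` means entries are dropped automatically once a layer is garbage collected, with no explicit cleanup pass:

```python
import gc
from weakref import WeakKeyDictionary


class Layer:
    """Stand-in for torch.nn.Module, just to keep the sketch self-contained."""


# Global dict mapping layer -> loading info, held only by weak references.
layerwise_info: WeakKeyDictionary = WeakKeyDictionary()

layer = Layer()
layerwise_info[layer] = {"loaded_weights": []}
assert layer in layerwise_info

del layer      # once the layer itself is gone...
gc.collect()
assert len(layerwise_info) == 0  # ...its entry disappears too
```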

process_weights_after_loading(model, model_config, target_device)
is_online_quant = _is_online_quant(vllm_config, model_config)
if not is_online_quant:
# Regular path, `process_weights_after_loading` is called
Contributor


probably don't need this many comments


# Log peak GPU memory after loading weights. This is needed
# to have test coverage on peak memory for online quantization.
if current_platform.is_cuda():
Contributor


This should be unindented by one?

num_experts = layer.num_experts
with layer.w13_weight.device:
w13_weight_scale = torch.nn.Parameter(
torch.ones(num_experts, dtype=torch.float32), requires_grad=False
Contributor


Do these need to be initialized with ones? Why not empty?

Contributor Author


this is code movement from one place to another, so I'm keeping it as-is to minimize risk, since it's not technically related to this PR

num_experts = layer.num_experts
with layer.w13_weight.device:
w13_weight_scale = torch.nn.Parameter(
torch.ones(num_experts, dtype=torch.float32), requires_grad=False
Contributor

@kylesayrs kylesayrs Feb 9, 2026


I think scale dtype should theoretically be the model dtype, not necessarily float32, but it's been a while since I looked at this.

Contributor Author


this is code movement from one place to another, so I'm keeping it as-is to minimize risk, since it's not technically related to this PR

@vkuzo
Contributor Author

vkuzo commented Mar 3, 2026

closing in favor of #33814

@vkuzo vkuzo closed this Mar 3, 2026
