refactor fp8.py online quant weight loading to use layerwise reload utils #33814

vkuzo · 2026-02-27T20:58:28Z

gpt-oss bf16 is broken whether biases are initialized on GPU or on meta, so going with meta to be consistent with the layerwise loading infra

if we want gpt-oss to work with fp8.py, we should refactor gpt_oss.py to use weight loaders

vkuzo · 2026-02-24T18:22:38Z

need to verify GPT-OSS 120B still works as this changes the code added by #34906 and there is no CI coverage

following up on this, GPT-OSS bf16 is not expected to work with fp8.py online quant because:

1. fp8.py online quant (and future online quant backends in vllm) require weight_loaders, because we use weight_loaders to inject the streaming weight loading functionality

2. the gpt_oss.py model definition for the bf16 weights case does not use weight loaders:

vllm/vllm/model_executor/models/gpt_oss.py

Line 1009 in 234a65b

param.copy_(narrow_weight)

I'm not exactly sure how #34906 worked given 1 and 2 ^. Going to skip this for now as gpt-oss + online quant seems low pri because the official weights are in mxfp4, and we can follow-up if needed.
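To illustrate why a direct copy is a problem for loader-based hooks, here is a toy sketch in plain Python. It is not vLLM code: `ToyParam`, `inject_hook`, and the two `load_with_*` helpers are hypothetical names made up for this example, and lists stand in for tensors. The only point is that a callback attached to a parameter's weight loader never runs when the model code writes into the parameter directly.

```python
from typing import Callable, List, Optional

Loader = Callable[["ToyParam", List[float]], None]


class ToyParam:
    """Hypothetical stand-in for a model parameter (not a vLLM class)."""

    def __init__(self, name: str, size: int) -> None:
        self.name = name
        self.data: List[float] = [0.0] * size
        self.weight_loader: Optional[Loader] = None


def default_loader(param: ToyParam, loaded: List[float]) -> None:
    """Copy checkpoint values into the parameter."""
    param.data[:] = loaded


def inject_hook(param: ToyParam, on_loaded: Callable[[ToyParam], None]) -> None:
    """Wrap the parameter's loader so `on_loaded` fires after each load."""
    inner = param.weight_loader or default_loader

    def wrapped(p: ToyParam, loaded: List[float]) -> None:
        inner(p, loaded)
        on_loaded(p)

    param.weight_loader = wrapped


def load_with_loader(param: ToyParam, loaded: List[float]) -> None:
    """Model code that goes through the loader, so the hook is honored."""
    loader = param.weight_loader or default_loader
    loader(param, loaded)


def load_with_direct_copy(param: ToyParam, loaded: List[float]) -> None:
    """Model code that copies directly, analogous to `param.copy_(...)`."""
    param.data[:] = loaded


seen: List[str] = []
a, b = ToyParam("a", 2), ToyParam("b", 2)
for p in (a, b):
    inject_hook(p, lambda q: seen.append(q.name))

load_with_loader(a, [1.0, 2.0])       # hook fires
load_with_direct_copy(b, [3.0, 4.0])  # hook is silently bypassed
print(seen)
```

Both parameters end up with the right values, but only `a` is reported to the hook, so any bookkeeping that depends on the hook would never learn that `b` was loaded.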

for posterity, the easiest way to test this is using the 20b model from unsloth which goes through the same path as the 120b:

VLLM_ENABLE_V1_MULTIPROCESSING=0 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/basic/chat.py --model unsloth/gpt-oss-20b-BF16 --enforce-eager --dtype=bfloat16 --quantization=fp8

@@ -9,6 +9,10 @@
     from vllm.config import ModelConfig, VllmConfig
     from vllm.config.load import LoadConfig
     from vllm.logger import init_logger
+    from vllm.model_executor.model_loader.reload import (
+        finalize_layerwise_reload,
+        initialize_layerwise_reload,
+    )
     from vllm.model_executor.model_loader.utils import (
         initialize_model,
         process_weights_after_loading,
@@ -58,8 +62,27 @@ def load_model(
                 log_model_inspection(model)
                 logger.debug("Loading weights on %s ...", load_device)
-                # Quantization does not happen in `load_weights` but after it
-                self.load_weights(model, model_config)
+                use_layerwise_loading = _get_use_layerwise_loading(model, self)
+                if use_layerwise_loading:
+                    # set up layer loading
+                    initialize_layerwise_reload(
+                        model, is_reload=False, target_device=load_device
+                    )
+                    # load weights, quantization via each layer's
+                    # `process_weights_after_loading` will happen for each layer
+                    # as soon as all of that layer's weights are loaded
+                    self.load_weights(model, model_config)
+                    # finalize layer reloading
+                    finalize_layerwise_reload(model, model_config, is_reload=False)
+                else:
+                    # Load weights to model format
+                    self.load_weights(model, model_config)
+                    # For layers with quantization, convert to kernel format
+                    with target_device:
+                        process_weights_after_loading(model, model_config, target_device)
                 # Log peak GPU memory after loading weights. This is needed
                 # to have test coverage on peak memory for online quantization.
@@ -71,11 +94,24 @@ def load_model(
                         scope="local",
                     )
-                process_weights_after_loading(model, model_config, target_device)
             return model.eval()
+    def _get_use_layerwise_loading(
+        model: torch.nn.Module,
+        model_loader: BaseModelLoader,
+    ) -> bool:
+        from vllm.model_executor.model_loader.dummy_loader import DummyModelLoader
+        from vllm.model_executor.model_loader.utils import (
+            model_has_any_online_quant_with_device_meta,
+        )
+        has_online_quant = model_has_any_online_quant_with_device_meta(model)
+        is_dummy_loader = isinstance(model_loader, DummyModelLoader)
+        return has_online_quant and not is_dummy_loader
     def log_model_inspection(model: nn.Module) -> None:
         """Log model structure if VLLM_LOG_MODEL_INSPECTION=1."""
         if not envs.VLLM_LOG_MODEL_INSPECTION:
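
The two branches in `load_model` above differ in when per-layer processing happens: the layerwise branch converts a layer as soon as all of its weights have arrived, while the fallback loads everything first and processes afterwards. The toy sketch below models only that ordering, under loud assumptions: it is plain Python, lists stand in for tensors, `toy_quantize` is a placeholder rather than real fp8 conversion, and none of the names correspond to the internals of `initialize_layerwise_reload` or `finalize_layerwise_reload`, which are not shown in this diff. "Peak" here counts only high-precision values that are still waiting to be converted.

```python
from typing import Dict, Iterable, List, Set, Tuple

Weight = Tuple[str, str, List[float]]  # (layer name, tensor name, values)
Quantized = Dict[str, Dict[str, List[int]]]


def toy_quantize(values: List[float]) -> List[int]:
    """Placeholder for a real conversion: just rounds to integers."""
    return [round(v) for v in values]


def count_values(pending: Dict[str, Dict[str, List[float]]]) -> int:
    """Number of unconverted high-precision values currently held."""
    return sum(len(v) for tensors in pending.values() for v in tensors.values())


def load_layerwise(
    stream: Iterable[Weight], expected: Dict[str, Set[str]]
) -> Tuple[Quantized, int]:
    """Convert each layer as soon as all of its expected tensors arrived."""
    pending: Dict[str, Dict[str, List[float]]] = {}
    done: Quantized = {}
    peak = 0
    for layer, name, values in stream:
        pending.setdefault(layer, {})[name] = values
        peak = max(peak, count_values(pending))
        if set(pending[layer]) == expected[layer]:
            tensors = pending.pop(layer)  # release the unconverted copy
            done[layer] = {n: toy_quantize(v) for n, v in tensors.items()}
    return done, peak


def load_then_process(stream: Iterable[Weight]) -> Tuple[Quantized, int]:
    """Load every tensor first, then convert all layers at the end."""
    pending: Dict[str, Dict[str, List[float]]] = {}
    for layer, name, values in stream:
        pending.setdefault(layer, {})[name] = values
    peak = count_values(pending)
    done = {
        layer: {n: toy_quantize(v) for n, v in tensors.items()}
        for layer, tensors in pending.items()
    }
    return done, peak


expected = {f"layer{i}": {"weight", "bias"} for i in range(3)}
stream = [
    (f"layer{i}", name, [float(i)] * 4)
    for i in range(3)
    for name in ("weight", "bias")
]

layerwise_done, layerwise_peak = load_layerwise(stream, expected)
eager_done, eager_peak = load_then_process(stream)
print(layerwise_peak, eager_peak)
```

In this toy setup both strategies produce the same converted weights, but the layerwise path never holds more than one layer's worth of unconverted values at a time. Real memory behavior depends on details this sketch ignores, such as allocator behavior and weights that arrive out of layer order.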