Revert "[QeRL] Compose online quantization with quantized reloading" (#38032) #38446

Draft

zhewenl wants to merge 1 commit into vllm-project:main from zhewenl:auto-revert/pr-38032
Conversation

@zhewenl
Collaborator

@zhewenl zhewenl commented Mar 29, 2026

Revert of #38032

This reverts #38032 (merge commit 648edcf).

Reason: 2 new CI failures in build #58604 traced to this PR:

  • Fusion and Compile Unit Tests (2xB200) — 4 tests failed (test_fp8_kv_scale_compile)
  • Quantization — 6 tests failed (test_online_quantization, test_online_quant_peak_mem, test_online_quant_load_format_dummy)

Root cause: reload/layerwise.py:_layerwise_process() calls process_weights_after_loading(), which invokes ops.scaled_fp8_quant() (the _C::dynamic_scaled_fp8_quant custom op) on CPU tensors, raising NotImplementedError: Could not run '_C::dynamic_scaled_fp8_quant' with arguments from the 'CPU' backend.
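As a minimal sketch of the failure mode (a hypothetical helper, not vLLM's actual code): custom CUDA ops such as _C::dynamic_scaled_fp8_quant register no CPU kernel, so dispatching them on a CPU tensor raises NotImplementedError. A reload path that may see CPU-resident weights has to guard on the device first.

```python
import torch

def quantize_fp8_dynamic(weight: torch.Tensor):
    """Illustrative device guard for a CUDA-only quantization op.

    Hypothetical helper (not vLLM's API): mimics the dispatch error that
    _C::dynamic_scaled_fp8_quant produces when handed a CPU tensor.
    """
    if weight.device.type != "cuda":
        raise NotImplementedError(
            "Could not run '_C::dynamic_scaled_fp8_quant' with arguments "
            f"from the '{weight.device.type.upper()}' backend."
        )
    # On CUDA, per-tensor dynamic scaling computes scale = max(|w|) / fp8_max.
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = weight.abs().max().float() / finfo.max
    return (weight / scale).to(torch.float8_e4m3fn), scale
```

A loader that defers process_weights_after_loading() until weights reach the accelerator avoids this path entirely.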

Note: Merge conflict in base_loader.py was auto-resolved (trivial formatting conflict from #38426). The resolution preserves current formatting while undoing only #38032 changes — please review carefully.


Auto-generated by CI failure analyzer

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the online quantization and weight reloading mechanism by introducing just-in-time weight materialization through a patched weight loader. This replaces the previous layer-wise processing approach for FP8 and MoE quantization methods. Additionally, the dummy weight initialization logic is updated to skip parameters on meta devices when online quantization is enabled. A critical issue was identified regarding the removal of the @torch.no_grad() decorator from the initialize_single_dummy_weight function, which is necessary for safe in-place weight modifications.

The inline comment targets this hunk, where the decorator was removed:

```diff
-@torch.no_grad()
 initialize_single_dummy_weight(param, low, high, seed)
```
The @torch.no_grad() decorator should not be removed from initialize_single_dummy_weight. This function performs in-place modifications on model parameters (param.uniform_). It's crucial to wrap such operations in torch.no_grad() to prevent unintended side effects with the autograd engine and to adhere to best practices for weight manipulation.
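For illustration, a self-contained sketch of the pattern the reviewer is asking for (the signature is assumed from the snippet above, not the exact vLLM definition):

```python
import torch

@torch.no_grad()
def initialize_single_dummy_weight(param, low=-1e-3, high=1e-3, seed=0):
    # In-place uniform_ on a leaf Parameter with requires_grad=True raises
    # "a leaf Variable that requires grad is being used in an in-place
    # operation" outside of no_grad; the decorator makes the init safe.
    torch.manual_seed(seed)
    param.uniform_(low, high)
```

Without the decorator, calling `uniform_` directly on a fresh `torch.nn.Parameter` raises a RuntimeError; with it, the same in-place fill succeeds and the parameter keeps `requires_grad=True`.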

@jikunshang
Collaborator

Latest fix: #38442

@mergify

mergify bot commented Mar 29, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zhewenl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 29, 2026
