Revert "[QeRL] Compose online quantization with quantized reloading" (#38032) #38446

Draft

zhewenl wants to merge 1 commit into vllm-project:main from zhewenl:auto-revert/pr-38032
Conversation

@zhewenl
Collaborator

@zhewenl zhewenl commented Mar 29, 2026

Revert of #38032

This reverts #38032 (merge commit 648edcf).

Reason: 2 new CI failures in build #58604 traced to this PR:

  • Fusion and Compile Unit Tests (2xB200) — 4 tests failed (test_fp8_kv_scale_compile)
  • Quantization — 6 tests failed (test_online_quantization, test_online_quant_peak_mem, test_online_quant_load_format_dummy)

Root cause: reload/layerwise.py:_layerwise_process() calls process_weights_after_loading(), which invokes ops.scaled_fp8_quant() (the _C::dynamic_scaled_fp8_quant custom op) on CPU tensors, raising NotImplementedError: Could not run '_C::dynamic_scaled_fp8_quant' with arguments from the 'CPU' backend.
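As a minimal sketch of the failure mode (a hypothetical helper, not vLLM's actual code): custom CUDA ops such as _C::dynamic_scaled_fp8_quant register no CPU kernel, so dispatching them on a CPU tensor raises NotImplementedError. A reload path that may see CPU-resident weights has to guard on the device first.

```python
import torch

def quantize_fp8_dynamic(weight: torch.Tensor):
    """Illustrative device guard for a CUDA-only quantization op.

    Hypothetical helper (not vLLM's API): mimics the dispatch error that
    _C::dynamic_scaled_fp8_quant produces when handed a CPU tensor.
    """
    if weight.device.type != "cuda":
        raise NotImplementedError(
            "Could not run '_C::dynamic_scaled_fp8_quant' with arguments "
            f"from the '{weight.device.type.upper()}' backend."
        )
    # On CUDA, per-tensor dynamic scaling computes scale = max(|w|) / fp8_max.
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = weight.abs().max().float() / finfo.max
    return (weight / scale).to(torch.float8_e4m3fn), scale
```

A loader that defers process_weights_after_loading() until weights reach the accelerator avoids this path entirely.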

Note: Merge conflict in base_loader.py was auto-resolved (trivial formatting conflict from #38426). The resolution preserves current formatting while undoing only #38032 changes — please review carefully.


Auto-generated by CI failure analyzer

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the online quantization and weight reloading mechanism by introducing just-in-time weight materialization through a patched weight loader. This replaces the previous layer-wise processing approach for FP8 and MoE quantization methods. Additionally, the dummy weight initialization logic is updated to skip parameters on meta devices when online quantization is enabled. A critical issue was identified regarding the removal of the @torch.no_grad() decorator from the initialize_single_dummy_weight function, which is necessary for safe in-place weight modifications.

The inline comment targets this hunk, where the decorator was removed:

```diff
-@torch.no_grad()
 initialize_single_dummy_weight(param, low, high, seed)
```
The @torch.no_grad() decorator should not be removed from initialize_single_dummy_weight. This function performs in-place modifications on model parameters (param.uniform_). It's crucial to wrap such operations in torch.no_grad() to prevent unintended side effects with the autograd engine and to adhere to best practices for weight manipulation.
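For illustration, a self-contained sketch of the pattern the reviewer is asking for (the signature is assumed from the snippet above, not the exact vLLM definition):

```python
import torch

@torch.no_grad()
def initialize_single_dummy_weight(param, low=-1e-3, high=1e-3, seed=0):
    # In-place uniform_ on a leaf Parameter with requires_grad=True raises
    # "a leaf Variable that requires grad is being used in an in-place
    # operation" outside of no_grad; the decorator makes the init safe.
    torch.manual_seed(seed)
    param.uniform_(low, high)
```

Without the decorator, calling `uniform_` directly on a fresh `torch.nn.Parameter` raises a RuntimeError; with it, the same in-place fill succeeds and the parameter keeps `requires_grad=True`.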

@jikunshang
Collaborator

Latest fix: #38442

@mergify

mergify bot commented Mar 29, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zhewenl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 29, 2026
