
[QeRL] Fix online quantized reloading #38442

Merged
mgoin merged 7 commits into vllm-project:main from neuralmagic:kylesayrs/fix-online-quant-reload
Mar 29, 2026

Conversation

@kylesayrs
Contributor

Background

#38032 added support for online quantized reloading. However, the change to load and process weights within the load device context broke some models (for an unknown reason), prompting #38426, which fixed that issue but broke the load-device-context behavior required for QeRL. This PR fixes both and enables the quantized reloading tests, which were previously skipped due to insufficient runner hardware.

Purpose

  • Fix online quantized reloading
  • Enable reloading tests on CI for better hardening

Changes

  • Capture the load device in record_metadata_for_reloading (which is called under the load device context)
    • This is the device which will be used to rematerialize the tensors later
    • Note: this assumes that all tensors should be restored to the load device (torch.get_default_device()). This has been a fine assumption up until now, but will break if vLLM ever instantiates a model parameter which is not on the load device. If this happens, we should modify this to capture parameter devices on a more granular level.
  • Fix surfaced bug where fp8 online moe scales were instantiated on the wrong device
    • This was not an issue before, since we previously relied on loading weights under the load device context
  • Enable quantized reloading tests
    • Add slow_test marker to skip non-critical tests
    • Add -m '(not slow_test)' to all tests which run on the model_executor folder for consistency
  • Miscellaneous
    • Add docstrings/comments
    • LayerReloadingInfo must now be constructed with required arguments
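
The device-capture change above can be sketched in simplified form. The names record_metadata_for_reloading, LayerReloadingInfo, and restore_device come from this PR; everything else (the string device stand-in for torch.get_default_device(), the exact field set) is illustrative, not vLLM's actual implementation:

```python
from dataclasses import dataclass

# Stand-in for torch.get_default_device(); under the load device context
# this would be the load device (e.g. "cuda:0").
_current_default_device = "cpu"


@dataclass
class LayerReloadingInfo:
    # Construction now requires explicit arguments: the layer's identity
    # and the device the tensors should be rematerialized on.
    layer_name: str
    restore_device: str


def record_metadata_for_reloading(layer_name: str) -> LayerReloadingInfo:
    # Called while the load device context is active, so the default
    # device *is* the load device; capture it here so reloading no longer
    # needs a global device context manager.
    return LayerReloadingInfo(
        layer_name=layer_name,
        restore_device=_current_default_device,
    )


info = record_metadata_for_reloading("model.layers.0.mlp")
print(info.restore_device)  # captured at record time, not at reload time
```

As the PR notes, this assumes every parameter lives on the load device; a per-parameter capture would be needed if that assumption ever breaks.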

Testing

  • Tested quantized reloading
  • Tested online quantized reloading
  • Skipping tests marked slow_test is safe, since this marker is only used by the tests mentioned in this PR
    • grep -r 'slow_test' tests/model_executor/


@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@kylesayrs kylesayrs changed the title [QeRL] Fix [QeRL] Fix online quantized reloading Mar 28, 2026

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the weight reloading mechanism to be device-aware by incorporating a restore_device field into the LayerReloadingInfo structure. This ensures that layer materialization occurs on the intended device, allowing for the removal of global device context managers during weight reloading in the GPU model runner. Additionally, the PR updates CI workflows and test suites to skip or categorize long-running tests as slow_test, improves device placement for FP8 quantization scales, and adds an expected failure for a known DeepSeek-V3 bug. I have no feedback to provide as there are no review comments.

@jikunshang
Collaborator

@claude review

@haosdent
Contributor

Related to #38456

@mgoin mgoin added the bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed), and quantization labels Mar 29, 2026
@kylesayrs
Contributor Author

kylesayrs commented Mar 29, 2026

Looks like the failure is some sort of issue with the newly enabled tests; looking at it now.

EDIT: It seems this is expected, given how much memory is reserved for MLA activations even with a 1B MLA model. I was able to replicate locally and fixed it by reducing the max model len and seq len to shrink the amount of reserved memory.
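
The memory effect described here can be illustrated with back-of-the-envelope arithmetic. The function name and all sizes below are hypothetical and do not reflect the actual test model; the point is only that reserved activation memory scales linearly with the maximum sequence length:

```python
def reserved_activation_bytes(max_model_len: int, num_layers: int = 4,
                              hidden_size: int = 2048,
                              dtype_bytes: int = 2) -> int:
    # Toy model: one fp16 activation buffer per layer, sized by the
    # maximum sequence length the engine must be able to handle.
    return num_layers * max_model_len * hidden_size * dtype_bytes


before = reserved_activation_bytes(max_model_len=32768)
after = reserved_activation_bytes(max_model_len=1024)
print(before // after)  # prints 32: a 32x shorter max length reserves 32x less
```

Even for a small model, the reservation is driven by max_model_len rather than parameter count, which is why shrinking it fixed the CI out-of-memory failure.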

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs force-pushed the kylesayrs/fix-online-quant-reload branch from c23002e to d152326 Compare March 29, 2026 15:42
@kylesayrs
Contributor Author

kylesayrs commented Mar 29, 2026

Eagle test has failed twice now, but passes locally.

EDIT: looks like this test is noisy on main as well.

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@mgoin mgoin merged commit d28d86e into vllm-project:main Mar 29, 2026
70 checks passed
Elm8116 pushed a commit to Elm8116/vllm that referenced this pull request Mar 30, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Elham Harirpoush <elham.harirpoush@arm.com>
benenzhu pushed a commit to benenzhu/vllm that referenced this pull request Mar 31, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com>
neweyes pushed a commit to neweyes/vllm that referenced this pull request Mar 31, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: neweyes <328719365@qq.com>
@AndreasKaratzas
Collaborator

@kylesayrs Why did you add this skip in tests? Skipping non-critical tests is not a fix, so I assume there is a different reason.

EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>

Labels

bug (Something isn't working), ci/build, quantization, ready (ONLY add when PR is ready to merge/full CI is needed), v1


5 participants