
[QeRL] Fix online quantized reloading #38442

Merged
mgoin merged 7 commits into vllm-project:main from neuralmagic:kylesayrs/fix-online-quant-reload
Mar 29, 2026

Conversation

@kylesayrs
Contributor

Background

#38032 added support for online quantized reloading. However, the change to load and process weights within the load device context broke some models (for an unknown reason), prompting #38426, which fixed that issue but broke the load-device-context behavior required for QeRL. This PR fixes both and enables the quantized reloading tests, which were previously skipped due to insufficient runner hardware.

Purpose

  • Fix online quantized reloading
  • Enable reloading tests on CI for better hardening

Changes

  • Capture the load device in record_metadata_for_reloading (which is called under the load device context)
    • This is the device which will be used to rematerialize the tensors later
    • Note: this assumes that all tensors should be restored to the load device (torch.get_default_device()). This has been a fine assumption up until now, but will break if vLLM ever instantiates a model parameter which is not on the load device. If this happens, we should modify this to capture parameter devices on a more granular level.
  • Fix surfaced bug where fp8 online moe scales were instantiated on the wrong device
    • This was not an issue before, since we previously relied on loading weights under the load device context
  • Enable quantized reloading tests
    • Add slow_test marker to skip non-critical tests
    • Add -m '(not slow_test)' to all tests which run on the model_executor folder for consistency
  • Miscellaneous
    • Add docstrings/comments
    • LayerReloadingInfo must now be constructed with required arguments
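
The device-capture change above can be sketched in simplified form. The names record_metadata_for_reloading, LayerReloadingInfo, and restore_device come from this PR; everything else (the string device stand-in for torch.get_default_device(), the exact field set) is illustrative, not vLLM's actual implementation:

```python
from dataclasses import dataclass

# Stand-in for torch.get_default_device(); under the load device context
# this would be the load device (e.g. "cuda:0").
_current_default_device = "cpu"


@dataclass
class LayerReloadingInfo:
    # Construction now requires explicit arguments: the layer's identity
    # and the device the tensors should be rematerialized on.
    layer_name: str
    restore_device: str


def record_metadata_for_reloading(layer_name: str) -> LayerReloadingInfo:
    # Called while the load device context is active, so the default
    # device *is* the load device; capture it here so reloading no longer
    # needs a global device context manager.
    return LayerReloadingInfo(
        layer_name=layer_name,
        restore_device=_current_default_device,
    )


info = record_metadata_for_reloading("model.layers.0.mlp")
print(info.restore_device)  # captured at record time, not at reload time
```

As the PR notes, this assumes every parameter lives on the load device; a per-parameter capture would be needed if that assumption ever breaks.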

Testing

  • Tested quantized reloading
  • Tested online quantized reloading
  • Skipping tests marked slow_test is safe, since this marker is only used by the tests mentioned in this PR
    • grep -r 'slow_test' tests/model_executor/


@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@kylesayrs kylesayrs changed the title [QeRL] Fix [QeRL] Fix online quantized reloading Mar 28, 2026

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the weight reloading mechanism to be device-aware by incorporating a restore_device field into the LayerReloadingInfo structure. This ensures that layer materialization occurs on the intended device, allowing for the removal of global device context managers during weight reloading in the GPU model runner. Additionally, the PR updates CI workflows and test suites to skip or categorize long-running tests as slow_test, improves device placement for FP8 quantization scales, and adds an expected failure for a known DeepSeek-V3 bug. I have no feedback to provide as there are no review comments.

@jikunshang
Collaborator

@claude review

@haosdent
Contributor

Related to #38456

@mgoin mgoin added the bug (Something isn't working), ready (ONLY add when PR is ready to merge/full CI is needed), and quantization labels Mar 29, 2026
@kylesayrs
Contributor Author

kylesayrs commented Mar 29, 2026

Looks like the failure is some sort of issue with the newly enabled tests; looking at it now.

EDIT: It seems this is expected, given how much memory is reserved for MLA activations even with a 1B MLA model. I was able to replicate locally and fixed it by reducing the max model len and seq len to shrink the amount of reserved memory.
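
The memory effect described here can be illustrated with back-of-the-envelope arithmetic. The function name and all sizes below are hypothetical and do not reflect the actual test model; the point is only that reserved activation memory scales linearly with the maximum sequence length:

```python
def reserved_activation_bytes(max_model_len: int, num_layers: int = 4,
                              hidden_size: int = 2048,
                              dtype_bytes: int = 2) -> int:
    # Toy model: one fp16 activation buffer per layer, sized by the
    # maximum sequence length the engine must be able to handle.
    return num_layers * max_model_len * hidden_size * dtype_bytes


before = reserved_activation_bytes(max_model_len=32768)
after = reserved_activation_bytes(max_model_len=1024)
print(before // after)  # prints 32: a 32x shorter max length reserves 32x less
```

Even for a small model, the reservation is driven by max_model_len rather than parameter count, which is why shrinking it fixed the CI out-of-memory failure.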

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs force-pushed the kylesayrs/fix-online-quant-reload branch from c23002e to d152326 Compare March 29, 2026 15:42
@kylesayrs
Contributor Author

kylesayrs commented Mar 29, 2026

Eagle test has failed twice now, but passes locally.

EDIT: looks like this test is noisy on main as well.

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@mgoin mgoin merged commit d28d86e into vllm-project:main Mar 29, 2026
70 checks passed
Elm8116 pushed a commit to Elm8116/vllm that referenced this pull request Mar 30, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Elham Harirpoush <elham.harirpoush@arm.com>
benenzhu pushed a commit to benenzhu/vllm that referenced this pull request Mar 31, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com>
neweyes pushed a commit to neweyes/vllm that referenced this pull request Mar 31, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: neweyes <328719365@qq.com>
@AndreasKaratzas
Collaborator

@kylesayrs Why did you add this skip in tests? Skipping non-critical tests is not a fix, so I assume there is a different reason.

EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>

Labels

bug (Something isn't working), ci/build, quantization, ready (ONLY add when PR is ready to merge/full CI is needed), v1


5 participants