[QeRL] Fix online quantized reloading #38442
Conversation
Code Review
This pull request refactors the weight reloading mechanism to be device-aware by incorporating a restore_device field into the LayerReloadingInfo structure. This ensures that layer materialization occurs on the intended device, allowing for the removal of global device context managers during weight reloading in the GPU model runner. Additionally, the PR updates CI workflows and test suites to skip or categorize long-running tests as slow_test, improves device placement for FP8 quantization scales, and adds an expected failure for a known DeepSeek-V3 bug. I have no feedback to provide as there are no review comments.
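In outline, the device-aware approach described above can be sketched as follows. This is a minimal illustration only: the fields of `LayerReloadingInfo` and the `restore` helper here are assumptions for the sketch, not vLLM's actual API.

```python
from dataclasses import dataclass

import torch


@dataclass
class LayerReloadingInfo:
    # Hypothetical shape of the reloading info: the device captured at
    # record time is stored, so reloading no longer needs a global
    # device context manager.
    restore_device: torch.device

    def restore(self, layer: torch.nn.Module) -> None:
        # Materialize the layer's (meta) parameters directly on the
        # recorded device.
        layer.to_empty(device=self.restore_device)


layer = torch.nn.Linear(4, 4, device="meta")
info = LayerReloadingInfo(restore_device=torch.device("cpu"))
info.restore(layer)
print(layer.weight.device)  # cpu
```

Because the target device travels with the per-layer metadata, the reload path no longer depends on whatever device context happens to be active when reloading runs.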
@claude review

Related to #38456
Looks like the failure is some sort of issue with the newly enabled tests, looking at it now. EDIT: It seems like this is expected given how much memory is reserved for MLA activations, even with a 1b MLA model. I was able to replicate locally and fixed it by reducing the max model len and seq len to reduce the amount of reserved memory.
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Force-pushed c23002e to d152326
Eagle test has failed twice now, but passes locally. EDIT: looks like this test is noisy on main as well.
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs Why did you add this skip in tests? Skipping non-critical tests is not a fix, so I assume there is a different reason.
Background
#38032 added support for online quantized reloading. However, the change to load and process weights within the load device context broke some models (for an unknown reason), which prompted #38426. That fix resolved the issue but broke the load device context behavior required for QeRL. This PR fixes both and enables the quantized reloading tests, which were previously skipped due to insufficient runner hardware.
Purpose
Changes
- `record_metadata_for_reloading` (which is called under the load device context) now captures the restore device as `torch.get_default_device()`. This has been a fine assumption up until now, but will break if vLLM ever instantiates a model parameter which is not on the load device. If this happens, we should modify this to capture parameter devices on a more granular level.
- Add a `slow_test` marker to skip non-critical tests
- Add `-m '(not slow_test)'` to all tests which run on the `model_executor` folder, for consistency
- `LayerReloadingInfo` must now be constructed with required arguments

Testing
- Adding the `slow_test` marker is safe, since this tag is only used by the tests mentioned in this PR

```
grep -r 'slow_test' tests/model_executor/
```
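As a sketch, the `slow_test` marker pattern looks like this in pytest. Only the marker name and the `-m '(not slow_test)'` deselect expression come from this PR; the test function below is hypothetical.

```python
import pytest


@pytest.mark.slow_test
def test_quantized_reloading_end_to_end():
    # Long-running reload test; CI deselects it via -m '(not slow_test)'.
    assert True
```

Running `pytest -m '(not slow_test)' tests/model_executor/` skips any test carrying the marker; registering the marker (e.g. in `pyproject.toml` under `[tool.pytest.ini_options] markers`) avoids the unknown-mark warning.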