
[QeRL] Layerwise Reloading#32133

Merged
mgoin merged 39 commits into vllm-project:main from neuralmagic:kylesayrs/layerwise-reload
Jan 30, 2026

Conversation

@kylesayrs
Contributor

@kylesayrs kylesayrs commented Jan 12, 2026

Purpose

  • Support weight reloading of quantized models
    • Improves upon [Quantization] FP8 Weight Reloading for Quantized RL Rollout #28480
      • Changes are more maintainable
        • FP8 can now do arbitrary kernel processing (padding, renaming, deepgemm, etc.)
        • FP8 does not need to reattach weight loaders after processing
        • Can remove complications around renaming weight_scale_inv
      • Support reloading with FP8 marlin
    • Improves upon Support RL online quantization with torchao #23014
      • Reduce memory requirements by 3x for FP8
      • Expand support to all quant configs, not just torchao
    • Support reloading of (theoretically) any quantization scheme
      • INT4 has been tested
      • FP8_BLOCK has been tested
      • FP8_DYNAMIC has been tested
      • NVFP4_A16 has been tested

Nomenclature

Details
  • “Checkpoint format” refers to the format in which weights are loaded from disk or provided by a user.
  • “Model format” refers to the state of the model after init but before weights are processed with process_weights_after_loading. The mapping between “checkpoint format” and “model format” is implemented by model.load_weights.
  • “Kernel format” refers to the state of the model after process_weights_after_loading.
  • In the case that the checkpoint format is unquantized but the kernel format is quantized, we call this “online quantization”: unquantized weights are quantized by vLLM during/after loading.
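As a toy illustration of the three formats (all names and the int8 scheme here are hypothetical stand-ins, not vLLM's actual implementation), a checkpoint-format dict is renamed into model format, then "online quantized" into kernel format:

```python
# Toy sketch of the three weight formats (illustration only).
# "Checkpoint format": weights as stored on disk.
# "Model format": weights after load_weights-style renaming.
# "Kernel format": weights after process_weights_after_loading,
# here a symmetric int8 quantization with a per-tensor scale.

def load_weights(checkpoint: dict) -> dict:
    # checkpoint format -> model format: rename keys to the model's layout
    return {name.replace("transformer.", ""): w for name, w in checkpoint.items()}

def process_weights_after_loading(model_weights: dict) -> dict:
    # model format -> kernel format: "online quantization" to int8 + scale
    kernel = {}
    for name, w in model_weights.items():
        scale = max(abs(x) for x in w) / 127 or 1.0
        q = [round(x / scale) for x in w]
        kernel[name] = (q, scale)
    return kernel

ckpt = {"transformer.layer0.weight": [0.5, -1.0, 0.25]}
kernel = process_weights_after_loading(load_weights(ckpt))
```

Reloading from high-precision weights means re-running exactly this pipeline, which is why the model-format state must be restorable after kernel processing.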

Design

  1. Weight metadata (shape, attributes) is captured before kernel processing
  2. Kernel processing occurs
  3. User triggers reloading
  4. Weight metadata is used to restore the layer to original load-ready model format
  5. Loaded weights are cached
  6. Once all loaded weights are ready for a given layer
    a. the layer is materialized on device
    b. the weights are loaded into the layer
    c. the layer is processed into kernel format
    d. the new tensors are copied into the original tensor storage
  7. Any remaining layers which did not load all weights are cleaned up
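Steps 4–7 can be sketched in miniature as follows (a pure-Python sketch with hypothetical names; the real implementation operates on torch meta tensors and the layer's quantization method, and copies results into the original tensor storage so CUDA graphs stay valid):

```python
# Sketch of the per-layer reload loop (illustrative only).

def quantize(w):
    # stand-in for step 6c: processing into kernel format
    return [round(x * 127) for x in w]

class Layer:
    def __init__(self, expected_weights):
        self.expected = set(expected_weights)   # recorded metadata (step 1)
        self.loaded = {}                        # cache of loaded weights (step 5)
        self.kernel = {}                        # original tensor "storage" (step 6d)

    def load_weight(self, name, weight):
        self.loaded[name] = weight              # step 5: cache the weight
        if set(self.loaded) == self.expected:   # step 6: all weights ready
            for n, w in self.loaded.items():    # 6a-6d: materialize, load,
                self.kernel[n] = quantize(w)    # process, copy into storage
            self.loaded.clear()
            return True
        return False

layer = Layer({"weight", "bias"})
layer.load_weight("weight", [0.5])
done = layer.load_weight("bias", [0.1])
```

Processing layer by layer, rather than all at once, is what keeps the peak memory overhead bounded to a single layer's worth of unquantized weights.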

Changes

  • Implement 4 public functions
    • record_metadata_for_reloading is called at the end of initialize_model
    • initialize_layerwise_reload is called before weights are loaded a second time
    • finalize_layerwise_reload is called after weights are loaded a second time
    • support_quantized_model_reload_from_hp_weights is a decorator which calls initialize_layerwise_reload and finalize_layerwise_reload, preserving backwards compatibility for torchao users
  • Track layer information (restore metadata, kernel tensors, loading progress) via the global dictionary LAYER_RELOADING_INFO
    • Before and after weight reloading, this dictionary only holds meta tensors. This dictionary is a weak key dictionary, so those meta tensors are garbage collected if the model is garbage collected.
  • Expand capabilities of llm.collective_rpc("reload_weights")
    • Can now pass weights_iterator directly
    • Alternatively, can pass weights_path to point to another checkpoint on disk
    • checkpoint_format controls whether the weights are provided in model format (default) or kernel format (quantized, renamed, and sharded)
  • Misc
    • Improve generate_prompt_perplexity by allowing users to pass a mask
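The weak-key bookkeeping and the decorator's wrapping behavior can be illustrated with a small stdlib sketch (all bodies here are hypothetical stubs standing in for the real vLLM functions of the same names):

```python
import gc
import weakref

# Stand-in for the global tracking dictionary: because keys are held
# weakly, a model's reload info is dropped once the model is collected.
LAYER_RELOADING_INFO = weakref.WeakKeyDictionary()

class Model:  # hypothetical stand-in for an nn.Module
    pass

calls = []

def record_metadata_for_reloading(model):
    LAYER_RELOADING_INFO[model] = {"restore_metadata": {}}

def initialize_layerwise_reload(model):
    calls.append("init")

def finalize_layerwise_reload(model):
    calls.append("finalize")

def support_quantized_model_reload_from_hp_weights(load_fn):
    # Decorator sketch: wraps a load_weights-style function with the
    # initialize/finalize hooks described above.
    def wrapper(model, weights):
        initialize_layerwise_reload(model)
        try:
            return load_fn(model, weights)
        finally:
            finalize_layerwise_reload(model)
    return wrapper

@support_quantized_model_reload_from_hp_weights
def load_weights(model, weights):
    calls.append("load")

model = Model()
record_metadata_for_reloading(model)
load_weights(model, [])
del model
gc.collect()
remaining = len(LAYER_RELOADING_INFO)  # entry went away with the model
```

The weak-key choice means the tracking dictionary never pins a model in memory: dropping the last reference to the model is enough to free both the model and its recorded reload metadata.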

Testing

  • Add unit tests for materializing and restoring layers
  • Add model tests for full precision, int4, and fp8 weight updating
  • Add model test for MoE weight updating
  • Modify test_online_quant_config_dict_json to actually test online quantization via the model.load_weights pathway

Future Work

@mergify mergify bot added the v1 label Jan 12, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the model reloading logic, replacing the previous torchao-specific online quantization mechanism with a more general, layer-wise approach. The new implementation in reload_utils.py leverages meta tensors to manage weight reloading and processing, aiming for better compatibility, especially with CUDA graphs. While the direction is good, the new reload_utils.py file contains several critical bugs that will prevent it from working correctly, including issues with dictionary manipulation, incorrect variable references, and improper attribute access on torch.nn.Module instances. I've provided specific comments and suggestions to fix these issues.

@kylesayrs kylesayrs changed the title [WIP] [Model Reloading] Layerwise Reload and Processing [WIP] [QeRL] Layerwise Reloading Jan 12, 2026
@kylesayrs kylesayrs force-pushed the kylesayrs/layerwise-reload branch from cabd2e3 to 8652b3f Compare January 13, 2026 17:19
@kylesayrs kylesayrs changed the title [WIP] [QeRL] Layerwise Reloading [QeRL] Layerwise Reloading Jan 20, 2026
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 20, 2026
@kylesayrs kylesayrs force-pushed the kylesayrs/layerwise-reload branch from 72d7c4c to 14605fd Compare January 21, 2026 00:13
@kylesayrs kylesayrs marked this pull request as ready for review January 21, 2026 02:29

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

@kylesayrs kylesayrs force-pushed the kylesayrs/layerwise-reload branch 2 times, most recently from f62af57 to 1ddaff2 Compare January 22, 2026 00:12
@mergify

mergify bot commented Jan 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kylesayrs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 22, 2026
@kylesayrs kylesayrs force-pushed the kylesayrs/layerwise-reload branch from 95452a2 to 0128190 Compare January 22, 2026 23:11
@mergify mergify bot removed the needs-rebase label Jan 22, 2026
@kylesayrs kylesayrs force-pushed the kylesayrs/layerwise-reload branch 2 times, most recently from 2755952 to 07aa6cc Compare January 23, 2026 05:33
Comment on lines +4053 to +4054
if weights_path is not None:
self.model_config.model = weights_path
Member

@mgoin mgoin Jan 23, 2026


Why is this needed? Also, do we need to do validation on this path? It seems like a late stage to change this shared config.

Contributor Author

@kylesayrs kylesayrs Jan 23, 2026


The ability to pass weights_path is a nice-to-have feature, makes testing easier and allows users to pass directory paths, which is a real use case.

Do you think we need to change anything more than self.model_config.model?

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Member

@mgoin mgoin left a comment


Awesome work, LGTM!

@mgoin mgoin merged commit f857a03 into vllm-project:main Jan 30, 2026
58 checks passed
vkuzo added a commit to vkuzo/vllm that referenced this pull request Jan 30, 2026
Summary:

vllm-project#32133 missed a rebase
on vllm-project#32064,
fixing the attention path import

Test Plan:

```bash
# before this PR, the test runner failed because the old attention
# import path no longer exists
pytest tests/quantization/test_fp8.py -s -x
```


Signed-off-by: vasiliy <vasiliy@fb.com>
@vkuzo vkuzo mentioned this pull request Jan 30, 2026
5 tasks
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Pai <416932041@qq.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
guyueh1 pushed a commit to guyueh1/vllm that referenced this pull request Feb 20, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1


4 participants