
[QeRL] Layerwise Reloading#32133

Merged
mgoin merged 39 commits into vllm-project:main from neuralmagic:kylesayrs/layerwise-reload
Jan 30, 2026

Conversation

@kylesayrs
Contributor

@kylesayrs kylesayrs commented Jan 12, 2026

Purpose

  • Support weight reloading of quantized models
    • Improves upon [Quantization] FP8 Weight Reloading for Quantized RL Rollout #28480
      • Changes are more maintainable
        • FP8 can now do arbitrary kernel processing (padding, renaming, deepgemm, etc.)
        • FP8 does not need to reattach weight loaders after processing
        • Can remove complications around renaming weight_scale_inv
      • Support reloading with FP8 marlin
    • Improves upon Support RL online quantization with torchao #23014
      • Reduce memory requirements by 3x for FP8
      • Expand support to all quant configs, not just torchao
    • Support reloading of (theoretically) any quantization scheme
      • INT4 has been tested
      • FP8_BLOCK has been tested
      • FP8_DYNAMIC has been tested
      • NVFP4_A16 has been tested

Nomenclature

Details
  • “Checkpoint format” refers to the format in which weights are loaded from disk or provided by a user.
  • “Model format” refers to the state of the model after init but before weights are processed with process_weights_after_loading. The mapping between “checkpoint format” and “model format” is implemented by model.load_weights.
  • “Kernel format” refers to the state of the model after process_weights_after_loading.
  • In the case that the checkpoint format is unquantized but the kernel format is quantized, we call this “online quantization”: unquantized weights are quantized by vLLM during/after loading.
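As a toy illustration of the three formats (all names and the int8 scheme here are hypothetical stand-ins, not vLLM's actual implementation), a checkpoint-format dict is renamed into model format, then "online quantized" into kernel format:

```python
# Toy sketch of the three weight formats (illustration only).
# "Checkpoint format": weights as stored on disk.
# "Model format": weights after load_weights-style renaming.
# "Kernel format": weights after process_weights_after_loading,
# here a symmetric int8 quantization with a per-tensor scale.

def load_weights(checkpoint: dict) -> dict:
    # checkpoint format -> model format: rename keys to the model's layout
    return {name.replace("transformer.", ""): w for name, w in checkpoint.items()}

def process_weights_after_loading(model_weights: dict) -> dict:
    # model format -> kernel format: "online quantization" to int8 + scale
    kernel = {}
    for name, w in model_weights.items():
        scale = max(abs(x) for x in w) / 127 or 1.0
        q = [round(x / scale) for x in w]
        kernel[name] = (q, scale)
    return kernel

ckpt = {"transformer.layer0.weight": [0.5, -1.0, 0.25]}
kernel = process_weights_after_loading(load_weights(ckpt))
```

Reloading from high-precision weights means re-running exactly this pipeline, which is why the model-format state must be restorable after kernel processing.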

Design

  1. Weight metadata (shape, attributes) is captured before kernel processing
  2. Kernel processing occurs
  3. User triggers reloading
  4. Weight metadata is used to restore the layer to original load-ready model format
  5. Loaded weights are cached
  6. Once all loaded weights are ready for a given layer
    a. the layer is materialized on device
    b. the weights are loaded into the layer
    c. the layer is processed into kernel format
    d. the new tensors are copied into the original tensor storage
  7. Any remaining layers which did not load all weights are cleaned up
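Steps 4–7 can be sketched in miniature as follows (a pure-Python sketch with hypothetical names; the real implementation operates on torch meta tensors and the layer's quantization method, and copies results into the original tensor storage so CUDA graphs stay valid):

```python
# Sketch of the per-layer reload loop (illustrative only).

def quantize(w):
    # stand-in for step 6c: processing into kernel format
    return [round(x * 127) for x in w]

class Layer:
    def __init__(self, expected_weights):
        self.expected = set(expected_weights)   # recorded metadata (step 1)
        self.loaded = {}                        # cache of loaded weights (step 5)
        self.kernel = {}                        # original tensor "storage" (step 6d)

    def load_weight(self, name, weight):
        self.loaded[name] = weight              # step 5: cache the weight
        if set(self.loaded) == self.expected:   # step 6: all weights ready
            for n, w in self.loaded.items():    # 6a-6d: materialize, load,
                self.kernel[n] = quantize(w)    # process, copy into storage
            self.loaded.clear()
            return True
        return False

layer = Layer({"weight", "bias"})
layer.load_weight("weight", [0.5])
done = layer.load_weight("bias", [0.1])
```

Processing layer by layer, rather than all at once, is what keeps the peak memory overhead bounded to a single layer's worth of unquantized weights.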

Changes

  • Implement 4 public functions
    • record_metadata_for_reloading is called at the end of initialize_model
    • initialize_layerwise_reload is called before weights are loaded a second time
    • finalize_layerwise_reload is called after weights are loaded a second time
    • support_quantized_model_reload_from_hp_weights is a decorator which calls initialize_layerwise_reload and finalize_layerwise_reload, preserving backwards compatibility for torchao users
  • Track layer information (restore metadata, kernel tensors, loading progress) via the global dictionary LAYER_RELOADING_INFO
    • Before and after weight reloading, this dictionary only holds meta tensors. This dictionary is a weak key dictionary, so those meta tensors are garbage collected if the model is garbage collected.
  • Expand capabilities of llm.collective_rpc("reload_weights")
    • Can now pass weights_iterator directly
    • Alternatively, can pass weights_path to point to another checkpoint on disk
    • checkpoint_format controls whether the weights are provided in model format (default) or kernel format (quantized, renamed, and sharded)
  • Misc
    • Improve generate_prompt_perplexity by allowing users to pass a mask
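The weak-key bookkeeping and the decorator's wrapping behavior can be illustrated with a small stdlib sketch (all bodies here are hypothetical stubs standing in for the real vLLM functions of the same names):

```python
import gc
import weakref

# Stand-in for the global tracking dictionary: because keys are held
# weakly, a model's reload info is dropped once the model is collected.
LAYER_RELOADING_INFO = weakref.WeakKeyDictionary()

class Model:  # hypothetical stand-in for an nn.Module
    pass

calls = []

def record_metadata_for_reloading(model):
    LAYER_RELOADING_INFO[model] = {"restore_metadata": {}}

def initialize_layerwise_reload(model):
    calls.append("init")

def finalize_layerwise_reload(model):
    calls.append("finalize")

def support_quantized_model_reload_from_hp_weights(load_fn):
    # Decorator sketch: wraps a load_weights-style function with the
    # initialize/finalize hooks described above.
    def wrapper(model, weights):
        initialize_layerwise_reload(model)
        try:
            return load_fn(model, weights)
        finally:
            finalize_layerwise_reload(model)
    return wrapper

@support_quantized_model_reload_from_hp_weights
def load_weights(model, weights):
    calls.append("load")

model = Model()
record_metadata_for_reloading(model)
load_weights(model, [])
del model
gc.collect()
remaining = len(LAYER_RELOADING_INFO)  # entry went away with the model
```

The weak-key choice means the tracking dictionary never pins a model in memory: dropping the last reference to the model is enough to free both the model and its recorded reload metadata.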

Testing

  • Add unit tests for materializing and restoring layers
  • Add model tests for full precision, int4, and fp8 weight updating
  • Add model test for MoE weight updating
  • Modify test_online_quant_config_dict_json to actually test online quantization via the model.load_weights pathway

Future Work

@mergify mergify bot added the v1 label Jan 12, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the model reloading logic, replacing the previous torchao-specific online quantization mechanism with a more general, layer-wise approach. The new implementation in reload_utils.py leverages meta tensors to manage weight reloading and processing, aiming for better compatibility, especially with CUDA graphs. While the direction is good, the new reload_utils.py file contains several critical bugs that will prevent it from working correctly, including issues with dictionary manipulation, incorrect variable references, and improper attribute access on torch.nn.Module instances. I've provided specific comments and suggestions to fix these issues.

@kylesayrs kylesayrs changed the title [WIP] [Model Reloading] Layerwise Reload and Processing [WIP] [QeRL] Layerwise Reloading Jan 12, 2026
@kylesayrs kylesayrs force-pushed the kylesayrs/layerwise-reload branch from cabd2e3 to 8652b3f Compare January 13, 2026 17:19
@kylesayrs kylesayrs changed the title [WIP] [QeRL] Layerwise Reloading [QeRL] Layerwise Reloading Jan 20, 2026
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 20, 2026
@kylesayrs kylesayrs force-pushed the kylesayrs/layerwise-reload branch from 72d7c4c to 14605fd Compare January 21, 2026 00:13
@kylesayrs kylesayrs marked this pull request as ready for review January 21, 2026 02:29

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

@kylesayrs kylesayrs force-pushed the kylesayrs/layerwise-reload branch 2 times, most recently from f62af57 to 1ddaff2 Compare January 22, 2026 00:12
@mergify

mergify bot commented Jan 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kylesayrs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 22, 2026
@kylesayrs kylesayrs force-pushed the kylesayrs/layerwise-reload branch from 95452a2 to 0128190 Compare January 22, 2026 23:11
@mergify mergify bot removed the needs-rebase label Jan 22, 2026
@kylesayrs kylesayrs force-pushed the kylesayrs/layerwise-reload branch 2 times, most recently from 2755952 to 07aa6cc Compare January 23, 2026 05:33
Comment on lines +4053 to +4054
if weights_path is not None:
self.model_config.model = weights_path
Member

@mgoin mgoin Jan 23, 2026


Why is this needed? Also, do we need to do validation on this path? It seems like a late stage to change this shared config.

Contributor Author

@kylesayrs kylesayrs Jan 23, 2026


The ability to pass weights_path is a nice-to-have feature, makes testing easier and allows users to pass directory paths, which is a real use case.

Do you think we need to change anything more than self.model_config.model?

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Member

@mgoin mgoin left a comment


Awesome work, LGTM!

@mgoin mgoin merged commit f857a03 into vllm-project:main Jan 30, 2026
58 checks passed
vkuzo added a commit to vkuzo/vllm that referenced this pull request Jan 30, 2026
Summary:

vllm-project#32133 missed a rebase
on vllm-project#32064,
fixing the attention path import

Test Plan:

```bash
# before this PR, the test runner failed because the old attention
# import path no longer exists
pytest tests/quantization/test_fp8.py -s -x
```


Signed-off-by: vasiliy <vasiliy@fb.com>
@vkuzo vkuzo mentioned this pull request Jan 30, 2026
5 tasks
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Pai <416932041@qq.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
guyueh1 pushed a commit to guyueh1/vllm that referenced this pull request Feb 20, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1


4 participants